Abstract
Voice leakage is raising growing concern among users, because the voices handled by intelligent services carry sensitive and private information. Existing studies demonstrate that learning-based solutions applied to built-in sensor measurements can recover voices. However, because of privacy concerns, large-scale paired samples of voices and sensor measurements for model training are not publicly available, so mounting such an attack requires significant data collection effort. In this paper, we propose VoiceListener, a training-free and universal eavesdropping attack on built-in speakers, which eliminates the data collection effort and adapts to various voices, platforms, and domains. In particular, VoiceListener develops an aliasing-corrected super-resolution mechanism, consisting of aliasing-based pitch estimation and aliasing-corrected voice recovery, to convert undersampled narrow-band sensor measurements into wide-band voices. Extensive experiments demonstrate that VoiceListener accurately recovers voices from undersampled sensor measurements and is robust to different voices, platforms, and domains, realizing a universal eavesdropping attack.
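The aliasing that the abstract refers to can be illustrated with a small sketch. This is not VoiceListener's algorithm; it only shows the standard frequency-folding relation that any undersampled measurement obeys. The 500 Hz sensor sampling rate and the 220 Hz pitch are assumed values chosen for illustration, and both function names are hypothetical.

```python
def aliased_frequency(f_hz, fs_hz):
    """Frequency at which a tone of f_hz appears after sampling at fs_hz."""
    folded = f_hz % fs_hz
    # Anything above the Nyquist frequency (fs/2) is mirrored back below it.
    return fs_hz - folded if folded > fs_hz / 2 else folded


def candidate_sources(alias_hz, fs_hz, max_hz):
    """All tone frequencies in [0, max_hz] that fold onto alias_hz."""
    candidates = set()
    k = 0
    while k * fs_hz - alias_hz <= max_hz:
        for f in (k * fs_hz - alias_hz, k * fs_hz + alias_hz):
            if 0 <= f <= max_hz:
                candidates.add(f)
        k += 1
    return sorted(candidates)


# Assumed values: a 220 Hz pitch observed by a sensor sampling at 500 Hz.
FS_HZ = 500
PITCH_HZ = 220
harmonics = [h * PITCH_HZ for h in range(1, 5)]        # 220, 440, 660, 880
observed = [aliased_frequency(f, FS_HZ) for f in harmonics]
print(observed)                                        # [220, 60, 160, 120]
print(candidate_sources(60, FS_HZ, 1000))              # [60, 440, 560, 940]
```

The second call shows why correction is needed: a single observed peak is consistent with several source frequencies, so additional structure, such as the harmonic spacing of voiced speech, is required to decide which one was actually produced.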
Index Terms
- VoiceListener: A Training-free and Universal Eavesdropping Attack on Built-in Speakers of Mobile Devices