research-article

VoiceListener: A Training-free and Universal Eavesdropping Attack on Built-in Speakers of Mobile Devices

Authors:

Lei Wang
Zhejiang University, School of Cyber Science and Technology, China and ZJU-Hangzhou Global Scientific and Technological Innovation Center, Hangzhou, China
ORCID: 0000-0003-0975-2252

Meng Chen
Zhejiang University, School of Cyber Science and Technology, Hangzhou, China
ORCID: 0000-0002-4775-5107

Li Lu
Zhejiang University, School of Cyber Science and Technology, Hangzhou, China
ORCID: 0000-0001-5230-3749

Zhongjie Ba
Zhejiang University, School of Cyber Science and Technology, Hangzhou, China
ORCID: 0000-0003-0921-8869

Feng Lin
Zhejiang University, School of Cyber Science and Technology, Hangzhou, China
ORCID: 0000-0001-5240-5200

Kui Ren
Zhejiang University, School of Cyber Science and Technology, Hangzhou, China
ORCID: 0000-0003-3441-6277

Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, Volume 7, Issue 1, Article No.: 32, pp 1–22, https://doi.org/10.1145/3580789

Published: 28 March 2023

Abstract

Voice leakage increasingly concerns users, because the voices processed by intelligent services carry sensitive and private information. Existing studies demonstrate the feasibility of applying learning-based solutions to built-in sensor measurements to recover voices. However, due to privacy concerns, large-scale paired voice and sensor-measurement samples for model training are not publicly available, so such an attack requires significant data collection effort. In this paper, we propose VoiceListener, a training-free and universal eavesdropping attack on built-in speakers, which eliminates the data collection effort and adapts to various voices, platforms, and domains. In particular, VoiceListener develops an aliasing-corrected super-resolution mechanism, comprising aliasing-based pitch estimation and aliasing-corrected voice recovery, to convert undersampled narrow-band sensor measurements into wide-band voices. Extensive experiments demonstrate that VoiceListener accurately recovers voices from undersampled sensor measurements and is robust to different voices, platforms, and domains, realizing a universal eavesdropping attack.
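To make the term "undersampled" concrete, the following minimal sketch shows how the harmonics of a voiced sound fold to lower apparent frequencies when sampled below the Nyquist rate. It illustrates only the textbook frequency-folding rule, not the authors' pitch estimation or recovery method; the function names, the 220 Hz pitch, and the 500 Hz sensor rate are assumptions chosen for illustration.

```python
def alias_frequency(f_hz: float, fs_hz: float) -> float:
    """Apparent frequency of a real tone at f_hz after sampling at fs_hz."""
    # Sampling cannot distinguish f from f modulo the sampling rate.
    folded = f_hz % fs_hz
    # For real-valued signals the spectrum mirrors around fs/2.
    return min(folded, fs_hz - folded)


def harmonic_aliases(pitch_hz: float, fs_hz: float, n_harmonics: int) -> list[float]:
    """Alias frequencies of the first n_harmonics multiples of pitch_hz."""
    return [alias_frequency(k * pitch_hz, fs_hz) for k in range(1, n_harmonics + 1)]


# Hypothetical values: a 220 Hz pitch observed by a sensor sampling at 500 Hz.
# Harmonics at 220, 440, 660, 880 Hz appear at 220, 60, 160, 120 Hz.
print(harmonic_aliases(220.0, 500.0, 4))  # [220.0, 60.0, 160.0, 120.0]
```

Because every harmonic is an integer multiple of the pitch, the positions of the folded components are determined by the pitch and the sampling rate, which is the kind of structure an aliasing-aware analysis can exploit.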

Index Terms

1. Human-centered computing
  1. Ubiquitous and mobile computing
2. Security and privacy
  1. Security in hardware
    1. Embedded systems security

Published in

Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies Volume 7, Issue 1
March 2023
1243 pages
EISSN: 2474-9567
DOI: 10.1145/3589760

Copyright © 2023 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Eavesdropping
aliasing correction
speakers
super resolution
Qualifiers
- research-article
- Research
- Refereed
Article Metrics
- Total Citations: 6
- Total Downloads: 417
- Downloads (last 12 months): 287
- Downloads (last 6 weeks): 19