ABSTRACT
Many speech signals are compressed with MP3 to reduce the data rate. Many synthetic speech detection methods use the spectrogram of the speech signal, which usually requires the speech signal to be fully decompressed. We show that the design of MP3 compression allows one to approximate the spectrogram of MP3 compressed speech efficiently without fully decoding the compressed speech. We call the spectrograms obtained using our proposed approach Efficient Spectrograms (E-Specs). E-Spec can reduce the complexity of spectrogram computation by ~77.60 percentage points (p.p.) and save ~37.87 p.p. of MP3 decoding time. E-Spec bypasses the reconstruction artifacts introduced by the MP3 synthesis filterbank, which makes it useful in speech forensics tasks. We tested E-Spec in synthetic speech detection, where a detector is asked to determine whether a speech signal is synthesized or recorded from a human. We examined 4 different neural network architectures to evaluate the performance of E-Spec compared to speech features extracted from the fully decoded speech signal. E-Spec achieved the best synthetic speech detection performance for 3 architectures; it also achieved the best overall detection performance across architectures. The computation of E-Spec is an approximation to the Short-Time Fourier Transform (STFT). E-Spec can be extended to other audio compression methods.
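The abstract does not spell out how E-Spec is computed, so the following is only a minimal sketch of the underlying intuition, not the authors' method. It compares an ordinary STFT magnitude spectrogram with the frame-by-frame magnitude of MDCT coefficients, the kind of transform-domain values an MP3 stream stores after its polyphase filterbank. The function names, frame sizes, windows, and the test tone are all assumptions chosen for illustration.

```python
import numpy as np


def stft_spectrogram(x, frame_len=512, hop=256):
    """Magnitude spectrogram from a Hann-windowed STFT (frames x bins)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))


def mdct_spectrogram(x, n_bins=256):
    """Magnitude of MDCT coefficients per frame (frames x n_bins).

    Illustrative only: frames of 2 * n_bins samples, 50% overlap,
    sine window. A real MP3 stream uses its own block sizes.
    """
    frame_len = 2 * n_bins
    n = np.arange(frame_len)
    k = np.arange(n_bins)
    window = np.sin(np.pi * (n + 0.5) / frame_len)
    # MDCT basis: cos[pi/N * (n + 1/2 + N/2) * (k + 1/2)], with N = n_bins
    basis = np.cos(np.pi / n_bins * np.outer(n + 0.5 + n_bins / 2, k + 0.5))
    n_frames = 1 + (len(x) - frame_len) // n_bins
    frames = np.stack([x[i * n_bins: i * n_bins + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(frames @ basis)


# Hypothetical test signal: one second of a 1 kHz tone sampled at 16 kHz.
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 1000 * t)

stft_mag = stft_spectrogram(x)   # shape: (frames, 257)
mdct_mag = mdct_spectrogram(x)   # shape: (frames, 256)

# Both representations concentrate energy near the same frequency bin.
stft_peak = int(np.argmax(stft_mag.mean(axis=0)))
mdct_peak = int(np.argmax(mdct_mag.mean(axis=0)))
```

In this toy setting both transforms have the same bin spacing (fs / 512), so the tone shows up at nearly the same bin index in each. That is the sense in which coefficients already present in a compressed stream can stand in for an STFT spectrogram without a full decode.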
Index Terms
- Extracting Efficient Spectrograms From MP3 Compressed Speech Signals for Synthetic Speech Detection