Abstract
Depressive disorder is a mental illness with a high incidence rate, often triggered by environmental stress or social pressure. Depression affects mood and behavior, which leads to problems in domains such as education, family, and the workplace. Suicide attempts also occur in severe cases. However, depression is treatable once it has been diagnosed by a psychiatrist. In Thailand, many people who are aware of mental disorders do not seek help from psychiatric hospitals because of long waiting times and high fees. Therefore, we aim to create an application that lets users perform a self-assessment from their voice signal data. In our experiment, the positive class is voice data obtained from depressive patients during therapy sessions in a psychiatric hospital. The negative class is voice data of non-depressive people obtained from interview sessions with university students. Each audio file is rendered into a spectrograph, a visual representation of the power spectrum over time. The power spectrum is represented by Mel frequency cepstral coefficients (MFCCs), which are extracted from the human voice as it changes over time using the fast Fourier transform (FFT) and discrete cosine transform (DCT) algorithms. Since some research has claimed that DCT causes some spectral features to be lost, we empirically compare spectrograph sets produced with and without DCT. Moreover, some studies have stated that a larger window provides more detail of speech activity in the power spectrum, which affects the performance of depression detection, so we use the Blackman-Harris and Blackman window functions to create different sets of spectrographs and test that idea on a Thai speech dataset. Deep learning models based on the deep residual network (ResNet) are explored to assess their potential for classification. Networks of different depths, namely ResNet-34, ResNet-50, and ResNet-101, are examined.
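The feature pipeline described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the function names (`window`, `mel_filterbank`, `dct_ii`, `spectrograph`) are hypothetical, and the frame length, hop size, number of Mel bands, number of cepstral coefficients, and 16 kHz sampling rate are placeholder values that the abstract does not specify. The sketch also assumes that the "non-DCT" variant corresponds to the log Mel power spectrum before the DCT step.

```python
import numpy as np

def window(name, n):
    """Symmetric Blackman or 4-term Blackman-Harris window of length n."""
    k = 2.0 * np.pi * np.arange(n) / (n - 1)
    if name == "blackman":
        return 0.42 - 0.5 * np.cos(k) + 0.08 * np.cos(2 * k)
    if name == "blackmanharris":
        return (0.35875 - 0.48829 * np.cos(k)
                + 0.14128 * np.cos(2 * k) - 0.01168 * np.cos(3 * k))
    raise ValueError(f"unknown window: {name}")

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters evenly spaced on the Mel scale, shape (n_mels, n_fft // 2 + 1)."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2.0), n_mels + 2))
    freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    fb = np.zeros((n_mels, freqs.size))
    for i in range(n_mels):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def dct_ii(x, n_out):
    """Orthonormal DCT-II along the first axis, keeping the first n_out coefficients."""
    n = x.shape[0]
    k = np.arange(n_out)[:, None]
    m = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n)) * np.sqrt(2.0 / n)
    basis[0] /= np.sqrt(2.0)
    return basis @ x

def spectrograph(signal, sr, win="blackman", frame=400, hop=160,
                 n_mels=40, use_dct=False, n_mfcc=13):
    """Return a (features, frames) array: log Mel power, or MFCCs if use_dct is True."""
    # All default parameter values here are assumptions for illustration only.
    w = window(win, frame)
    n_frames = 1 + (len(signal) - frame) // hop
    frames = np.stack([signal[i * hop:i * hop + frame] * w for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2       # FFT power spectrum per frame
    log_mel = np.log(mel_filterbank(n_mels, frame, sr) @ power.T + 1e-10)
    return dct_ii(log_mel, n_mfcc) if use_dct else log_mel

# Usage on a synthetic one-second tone (a stand-in for a real voice recording).
sr = 16000
tone = np.sin(2 * np.pi * 220 * np.arange(sr) / sr)
no_dct = spectrograph(tone, sr, win="blackman", use_dct=False)
with_dct = spectrograph(tone, sr, win="blackmanharris", use_dct=True)
```

Each returned array can be rendered as an image and fed to an image classifier. Switching `win` and `use_dct` gives the four spectrograph variants that the comparison above implies.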
The experimental results show that ResNet-50 models trained on both types of spectrograph achieve an F1-score higher than 70%, the best performance among the approaches examined. The model trained on spectrographs extracted with the Blackman window function and without DCT provides the best sensitivity, at 74.45%. To the best of our knowledge, our approach gives the highest F1-score when compared with state-of-the-art methods.
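For readers less familiar with the two metrics reported above, the following minimal sketch shows how sensitivity and F1-score are derived from binary predictions, with the depressive class treated as positive (label 1). The function names and the toy label lists are hypothetical and are not data from the study.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives, false negatives, and true negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def sensitivity(tp, fn):
    """Fraction of actual positives that the model detects (also called recall)."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, written in terms of counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Toy example: 1 = depressive (positive class), 0 = non-depressive.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
sens = sensitivity(tp, fn)
f1 = f1_score(tp, fp, fn)
```

Sensitivity matters in a screening setting because a false negative means a depressive speaker goes undetected, while F1-score balances that against false alarms.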
Data availability statement
Raw data that support the findings of this study are available from the first author, upon a reasonable request.
About this article
Cite this article
Suparatpinyo, S., Soonthornphisaj, N. Smart voice recognition based on deep learning for depression diagnosis. Artif Life Robotics 28, 332–342 (2023). https://doi.org/10.1007/s10015-023-00852-4
DOI: https://doi.org/10.1007/s10015-023-00852-4