Elsevier

Speech Communication

Volume 101, July 2018, Pages 11-25

Speaker models for monitoring Parkinson’s disease progression considering different communication channels and acoustic conditions

https://doi.org/10.1016/j.specom.2018.05.007

Highlights

  • The paper introduces the use of speaker models (GMM-UBM and i-vectors) to evaluate the progression of Parkinson’s disease (PD) from speech. This is one of the first papers addressing the task of individual speaker models to assess Parkinson’s disease progression based on speech recordings captured in different recording sessions.

  • The suitability of the proposed approach for monitoring Parkinson’s patients from speech is evaluated considering recordings captured through different communication channels: Skype, Google Hangouts, landlines, and mobile phones.

  • Two different scenarios are considered to test the proposed approach: (i) longitudinal recordings captured from 2012 to 2016, and (ii) recordings captured in the homes of the patients during 4 months (one day per month, every two hours during 8 h).

  • The use of the two recording sets mentioned above makes the experiments reported in this paper highly original and novel; thus, we consider that this work is a significant contribution to the development of computer-aided tools to monitor people suffering from Parkinson’s disease.

Abstract

Symptoms of Parkinson’s disease vary from patient to patient. Additionally, the progression of those symptoms also differs among patients. Most of the studies on the analysis of speech of people with Parkinson’s disease do not consider such individual variation. This paper presents a methodology for the automatic and individual monitoring of speech disorders developed by PD patients. The neurological state and dysarthria level of the patients are evaluated. The proposed system is based on individual speaker models which are created for each patient. Two different models are evaluated: the classical GMM–UBM and the i–vector approach. These two methods are compared with respect to a baseline obtained with a traditional Support Vector Regressor. Different speech aspects (phonation, articulation, and prosody) are considered to model recordings of spontaneous speech and a read text. A multi-aspect coefficient is proposed with the aim of incorporating information from all of these speech aspects into a single measure. Two different scenarios are considered to assess a set of seven PD patients: (1) the longitudinal test set, which consists of speech recordings captured in five recording sessions distributed from 2012 to 2016, and (2) the at-home test set, which consists of speech recordings captured in the homes of the same seven patients during 4 months (one day per month, four times per day). The UBM is trained with the recordings of 100 speakers (50 with Parkinson’s disease and 50 healthy speakers) captured under controlled acoustic conditions with a professional audio setting. With the aim of evaluating the suitability of the proposed approaches and the possibility of extending this kind of system to remotely assess the speech of the patients, a total of five different communication channels (sound-proof booth, Skype®, Hangouts®, mobile phone, and land-line) are considered to train and test the system.
Due to the reduced number of recording sessions in the longitudinal test set, the experiments that involve this set are evaluated with Pearson’s correlation. The experiments with the at-home test set are evaluated with Spearman’s correlation. The results estimating the dysarthria level of the patients in the at-home test set indicate a correlation of 0.55 with a modified version of the Frenchay Dysarthria Assessment scale when the GMM-UBM model is applied to the Skype® recordings. The results in the longitudinal test set indicate a correlation of 0.77 using a model based on i-vectors with recordings captured in the sound-proof booth. The evaluation of the neurological state of the patients in the longitudinal test set shows correlations of up to 0.55 with the Movement Disorder Society - Unified Parkinson’s Disease Rating Scale, also using models based on i-vectors created with Skype® recordings. These results suggest that the i–vector approach is suitable when the acoustic conditions among recording sessions differ (longitudinal test set). The GMM-UBM approach seems to be more suitable when the acoustic conditions do not change much among recording sessions (at-home test set). In particular, the best results were obtained with the Skype® calls, which may be explained by the several preprocessing stages that this codec applies to the audio signals. In general, the results suggest that the proposed approaches are suitable for tele-monitoring the dysarthria level and the neurological state of PD patients.
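The two evaluation measures named in the abstract can be sketched in a few lines. The following is a minimal illustration using only the Python standard library, not the evaluation code of the paper; the function names and the example values are assumptions made here for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's correlation: Pearson's correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical clinical scores and model outputs for one patient over
# five sessions (illustrative values, not data from the paper).
clinical = [20, 24, 23, 29, 31]
predicted = [0.8, 1.1, 1.3, 1.9, 2.6]
rho = spearman(clinical, predicted)
r = pearson(clinical, predicted)
```

Spearman's coefficient depends only on the ordering of the values, so it rewards a model that ranks sessions correctly even when its output is not on the same scale as the clinical score.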

Introduction

People suffering from PD are characterized by the progressive loss of dopaminergic neurons in the midbrain (Hornykiewicz, 1998). PD symptoms include tremor, slow movement, lack of coordination, and speech impairments (Ho, Iansek, Marigliani, Bradshaw, Gates, 1999, Darley, Aronson, Brown, 1969). Currently, neurologists rely on medical history and on physical and neurological examinations to assess the patients. This procedure has two main limitations: (i) it is not objective (the evaluation depends on the doctor’s criterion and expertise), and (ii) due to the motor disability of PD patients, visiting a hospital for medical screenings and/or assessments is expensive and difficult (Theodoros et al., 2006). Besides such difficulties, the symptoms progress differently among patients, thus it is important to monitor their symptoms individually (per patient) and over long periods of time. Such monitoring is not feasible if the patient is required to visit the doctor for every screening. The most suitable methods to perform continuous monitoring of the symptoms are based on computer-aided tools. These methods have captured the attention of the research community because they are objective, easy to use, and reproducible. Speech signals are one of the most suitable ways to capture information about the neurological state of PD patients (Tsanas, Little, McSharry, Ramig, 2010, Skodda, Grönheit, Mancinelli, Schlegel, 2013, Orozco-Arroyave, Hönig, et al., 2016). Studies reported in the state-of-the-art about assessing the neurological state of PD patients from speech signals always consider situations where the acoustic conditions are relatively controlled, i.e., quiet rooms, good/expensive microphones, and direct connection to the recording device.
Additionally, the state-of-the-art is mainly based on classical methods to model speech signals, i.e., measurements are extracted from the speech signal and regression methods are used to assess the neurological state of the patient. This paper presents a methodology for the individual monitoring of speech impairments developed by PD patients during the disease progression. The proposed approach improves on the state-of-the-art in several aspects: (i) the method is based on individual models, which are based on Gaussian Mixture Models – Universal Background Models (GMM–UBM), thus the system performance is adapted to the speech of each patient, (ii) different communication channels are considered including land-lines, mobile phones, Internet-based systems (Skype® and Hangouts®), and traditional recordings performed during a medical appointment. The proposed approach is also tested on two kinds of recordings: (i) signals captured during several recording sessions distributed from 2012 to 2016, and (ii) signals captured in 16 sessions performed in the houses of several patients during 4 months (one day per month, every two hours during 8 h). The use of these two recording sets makes the experiments reported in this paper highly original and novel, thus we consider that this work is a significant contribution to the development of computer-aided tools to monitor the progression of PD.

There is no standard test to diagnose PD. Doctors rely on the clinical history and physical examinations to assess patients. There are several tests to evaluate the disease severity. One of the most widely used is the Movement Disorder Society - Unified Parkinson’s Disease Rating Scale (MDS-UPDRS). This scale is divided into four sections: Section 1 comprises non-motor experiences (13 items), Section 2 includes motor activities of daily living (13 items), Section 3 evaluates motor capabilities (33 items), and Section 4 considers motor complications (6 items) (Goetz et al., 2008). Although the scale has a total of 65 items, speech is only considered in one of them.

There are several scales and clinical methods to evaluate dysarthric speech. One of them is the Frenchay Dysarthria Assessment–2 (FDA–2) (Enderby and Palmer, 2008). The original version of the FDA–2 considers several factors that are affected in people suffering from dysarthria, such as reflexes, respiration, lip movement, palate movement, laryngeal capability, tongue posture/movement, intelligibility, and others. The FDA–2 requires the patient to visit the examiner, which is not possible in most cases when people suffering from PD are considered. Bearing this in mind, it was necessary to develop a modified version of the FDA (m–FDA), which can be administered based on previously recorded speech signals, thus the patient is not required to visit the clinician to be evaluated (Cernak et al., 2017). The m–FDA considers several aspects of speech: respiration, lip movement, palate/velum movement, larynx, tongue, monotonicity, and intelligibility. Speech impairments are evaluated in a total of 13 items and each of them ranges from 0 (normal or completely healthy) to 4 (very impaired), thus the total score of the scale ranges from 0 to 52.
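The score range stated above follows directly from the item structure (13 items, each rated 0 to 4). A minimal sketch of that arithmetic follows; the function name and the example rating are hypothetical, and the text above does not say how individual items map to the listed speech aspects, so the items are treated as an unlabeled list.

```python
M_FDA_ITEMS = 13     # number of items, as stated in the text
MAX_ITEM_SCORE = 4   # 0 = normal or completely healthy, 4 = very impaired


def m_fda_total(item_scores):
    """Sum the 13 item ratings after checking that each lies in 0..4."""
    if len(item_scores) != M_FDA_ITEMS:
        raise ValueError(f"expected {M_FDA_ITEMS} items, got {len(item_scores)}")
    for score in item_scores:
        if not 0 <= score <= MAX_ITEM_SCORE:
            raise ValueError(f"item score {score} outside 0..{MAX_ITEM_SCORE}")
    return sum(item_scores)


# Hypothetical rating of one recording (not data from the paper).
example_rating = [1, 0, 2, 1, 3, 2, 0, 1, 2, 4, 1, 0, 2]
total = m_fda_total(example_rating)  # 19 out of a maximum of 13 * 4 = 52
```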

In recent years the research community has been interested in developing methods to assess the neurological state of PD patients from speech. One of the reasons to look for such an aim is to reduce treatment and monitoring costs and another reason is to develop objective tools/systems that help clinicians in the assessment and screening of the patients. In Asgari and Shafran (2010) the authors proposed a methodology to assess the UPDRS-III score from speech recordings of 82 subjects. The participants were asked to perform three speech tasks including the sustained phonation of the vowel /a/, the rapid repetition of the syllables (/pa/-/ta/-/ka/), and the reading of three standard texts. The set of features extracted from the speech recordings include pitch, spectral entropy, 13 cepstral coefficients, the number and duration of voiced and unvoiced frames, jitter, shimmer, Harmonic to Noise Ratio (HNR), and the ratio of energy in the first and second harmonics. The set of features was computed separately for each speech task. The UPDRS scores were obtained using two Support Vector Regressor (SVR)-based approaches: (1) ϵ-SVR and (2) ν-SVR. Additionally, different kernels were used to train the SVRs including polynomial, radial basis function, and sigmoid functions. The authors reported that it is possible to estimate the UPDRS-III with a Mean Absolute Error (MAE) of 5.66 using an ε-SVR with a cubic polynomial kernel. Later in Bayestehtashk et al. (2015) the authors compared three regression techniques to assess the UPDRS scores including ridge regression, Least Absolute Shrinkage and Selection Operator (LASSO) regression, and linear SVR. Speech recordings of 168 patients were collected in a single recording session. Besides the features described in Asgari and Shafran (2010), the authors added information extracted with the openSMILE toolkit (Eyben et al., 2010). 
The authors reported that the neurological state of the patients can be assessed with a MAE of 5.5 considering only PD patients in the training process; however, due to the lack of longitudinal data, it is not clear whether the proposed approach is suitable to track the neurological state of each patient. Furthermore, the results are presented only in terms of the MAE, which only makes sense when there is a baseline to compare the performance of the models. Besides, in the INTERSPEECH 2015 Computational Paralinguistics Challenge (ComParE 2015) our team participated in the organization of the Parkinson’s Condition sub-challenge, where the task of neurological state evaluation of PD patients from speech was addressed (Schuller et al., 2015). Recordings of the 50 patients (25 male, 25 female) included in the PC-GITA database (Orozco-Arroyave et al., 2014) were considered to form the train and development subsets. The test set included a total of 11 patients recorded in non-controlled noise conditions, i.e., not using a sound-proof booth and a professional audio setting. A total of 42 speech tasks were considered. The neurological state of the patients was assessed by an expert neurologist according to the motor section of the MDS-UPDRS (MDS-UPDRS-III). The winners of the challenge reported a Spearman’s correlation coefficient of 0.65 between the real MDS-UPDRS-III scores and the estimated values. The authors developed a model based on Deep Rectifier Neural Networks and Gaussian Processes Regression (Grósz et al., 2015). Although the results obtained by the winners are moderate (0.50 ≤ r ≤ 0.70), a comparison with a dysarthria scale is missing in order to determine whether the introduced methods are suitable to detect speech impairments developed by PD patients. Recently, in Orozco-Arroyave et al. (2016b) our team presented a methodology to estimate the neurological state of PD patients from speech signals.
Recordings of Spanish, German, and Czech PD patients were considered to estimate their neurological state according to the UPDRS-III score. The regression process was performed using a linear ϵ-SVR. Four different speech tasks were considered. The authors applied the articulation model introduced in Orozco-Arroyave (2016). The model consists of extracting the energy in the transitions from unvoiced to voiced (onset) and from voiced to unvoiced (offset) segments considering different frequency bands distributed according to the Bark and the Mel scales. Additionally, speech intelligibility was objectively evaluated using the Google Inc.® automatic speech recognition system. According to the authors the neurological state of the patients, in terms of the MDS-UPDRS-III score, can be estimated with a Spearman’s correlation of up to 0.74 when several speech tasks are modeled considering the fusion of articulation and intelligibility measures.
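The earlier remark that a Mean Absolute Error is only interpretable against a baseline can be illustrated with a small sketch. It compares the MAE of some model against a trivial predictor that always outputs the mean score of the training set. All numbers are hypothetical and the choice of a mean predictor as baseline is an assumption made here; none of the cited studies is claimed to have used it.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical clinical scores (not data from any cited study).
train_scores = [12, 25, 31, 40, 18, 27]
test_scores = [20, 35, 15, 28]
model_predictions = [24, 30, 19, 26]

# Trivial baseline: always predict the mean score seen during training.
baseline_value = sum(train_scores) / len(train_scores)
baseline_predictions = [baseline_value] * len(test_scores)

model_mae = mean_absolute_error(test_scores, model_predictions)
baseline_mae = mean_absolute_error(test_scores, baseline_predictions)
```

With these values the model reaches an MAE of 3.75 against 7.0 for the mean predictor. Reported alone, 3.75 would say little; next to the baseline it shows that the model roughly halves the error of a predictor that ignores the speech signal.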

Note that most of the studies in the literature are focused on assessing the neurological state of groups of PD patients. Assessments are performed considering only one recording session, thus the disease progression is not evaluated/modeled. The next subsection presents the most recent contributions of the research community to perform longitudinal evaluations, i.e., longitudinal monitoring, of patients suffering from PD considering several recording sessions.

There are several studies about automatic monitoring of PD symptoms from speech considering different recording sessions distributed over a period of time. In Tsanas et al. (2010) the authors considered recordings of sustained vowels to estimate the disease progression. The signals were modeled using several acoustic measures including jitter, shimmer, Noise to Harmonic Ratio (NHR), HNR, Relative Amplitude Perturbation, Period Perturbation Quotient, Amplitude Perturbation Quotient, Recurrence Period Density Entropy, Detrended Fluctuation Analysis, and Pitch Period Entropy. The UPDRS-III scores were assessed using three linear regression techniques: Least Squares (LS), Iteratively Re-weighted Least Squares, and LASSO. The Classification And Regression Trees (CART) approach was also applied. The speech of 42 PD patients (28 male, 14 female) was recorded once per week during six months. Expert neurologists evaluated the patients three times during the study, thus the weekly UPDRS scores were obtained by the authors using a piecewise linear interpolation. The performance of the regression techniques was evaluated using the MAE. The authors reported that CART was the best approach, with a MAE of 7.5 points in the evaluation of the total value of the UPDRS scale. The scores of the motor section in the UPDRS (UPDRS-III) were estimated with a MAE of 6 points. This study was one of the first reporting results of PD severity assessment from speech. However, the authors did not ensure speaker independence because their experiments mixed recordings of the test and train sets, thus the reported results are highly optimistic and biased. The progression of speech impairments in a longitudinal study is presented in Skodda et al. (2013). The speech of 80 PD patients (48 male, 32 female) was recorded from 2002 to 2012 in two recording sessions. The time between the first and second session ranged from 12 to 88 months.
A control group of 60 healthy persons (30 male, 30 female) was also considered. The participants were asked to read a text and to produce a sustained phonation of the vowel /a/. In both sessions the patients were assessed by expert neurologists according to the UPDRS-III. The audio signals were perceptually evaluated by two of the authors (Skodda and Grönheit). Four aspects of speech were considered in the perceptual evaluation: voice, articulation, prosody, and fluency. These aspects were used by the authors to describe motor speech disorders suffered by PD patients. Additionally, an acoustic analysis was performed to describe these speech aspects. Voice was modeled with a set of features including jitter, shimmer, NHR, and average of the pitch. For articulation, the Vowel Articulation Index (VAI) and the percentage of pauses within polysyllabic words were considered. Prosody was analyzed with the estimation of the standard deviation of the pitch. Fluency was evaluated considering the Net Speech Rate (NSR) and the pause ratio. To assess the progression of speech and voice impairments the authors compared the extracted features in the first and the second session using paired and unpaired t-tests. The authors found significant differences for shimmer, NHR, NSR, pause ratio, and VAI when features extracted from the first session are compared with respect to the same features extracted from the second session. Although longitudinal data is considered to assess the progression of speech impairments due to PD, only two recording sessions are considered. Furthermore, the authors used a statistical test to detect changes in speech, thus it is not clear whether the method is suitable to monitor speech disorders of patients with PD. A study for the monitoring of PD progression is also presented in Gómez-Vilda et al. (2015). The authors recorded a total of four male patients every week during one month in four recording sessions.
Speech recordings of 100 healthy speakers (50 male, 50 female) were also considered. Sustained phonations of the vowel /a/ were modeled using different features to describe tremor, perturbation of the vocal folds, and biomechanical phonation impairment. Features from the 50 male healthy controls (HC) were used as baseline to describe the normal state of the speech. During the recording sessions the patients continued their pharmacological treatment and received speech therapy. Each patient was evaluated according to the H&Y scale. The suitability of the features used to describe phonation impairments was evaluated by a weighted sum of the extracted features as a function of a sigmoid that ranges from 0 to 5. According to the authors, the most relevant features are jitter, vocal fold body mass, body stiffness, adduction defect, physiological and neurological tremor amplitude, flutter amplitude, and global tremor. Similarly, in Gómez-Vilda et al. (2015), the authors proposed the Log Likelihood Improvement Ratio (LLIR) as a metric to compare speech recordings of eight male PD patients captured in four recording sessions. The patients followed pharmacological treatment and received speech therapy. The aim of the study was to detect changes in the voice before and after the treatment using the same feature set described in Gómez-Vilda et al. (2015). The authors reported that LLIR is a good metric to detect changes in phonation when the patient is under treatment. Although the authors detected changes in phonation measures, it is not clear whether the same approach is suitable to detect changes in the general neurological state of PD patients. Additionally, the patients are assessed only during one month, which is a very short period of time to detect changes in the neurological state of the patients due to the disease progression. One of the main constraints of addressing longitudinal studies with PD patients is to have continuous contact with them. 
Thanks to the strong relation of our Lab with the Parkinson’s Foundation in Medellín (goo.gl/ihwjLy) we have had continuous contact with Parkinson’s patients and they have been actively collaborating in our research activities. In Arias-Vergara et al. (2016) our team addressed several experiments with the GMM-UBM approach to model speech impairments developed by seven PD patients. The speech of these patients was captured in several recording sessions between 2012 and 2015. The results of that study motivated us to continue addressing research in individual speaker model methods to monitor symptoms of PD patients. Recently, in García et al. (2017a) we introduced the use of the i-vector approach to assess the neurological state of a group of 50 PD patients. Similarly, in García et al. (2017b) speech impairments of PD patients speaking three different languages (Spanish, German, and Czech) were evaluated considering the i-vector approach. The results indicate that this method is suitable to be applied in different languages. Although the results were promising, those studies were focused on evaluating correlations between a given clinical scale (MDS-UPDRS-III or m-FDA) and the result of a model. In this paper we decided to continue working on this topic but applying the GMM-UBM and i-vector approaches for the individual monitoring of the progression of speech impairments developed by PD patients.

The analysis of PD from voice signals recorded in different acoustic conditions has not been extensively addressed in the literature. In Tsanas et al. (2012), speech recordings of 52 PD patients are transmitted over a simulated mobile telephone network. The authors aimed to estimate the UPDRS scores considering features extracted from sustained phonations of the vowel /a/. Although the aim was very interesting and revolutionary at that time, the results reported in the study were biased because the authors mixed recordings of train and test speakers into the same set, thus the main question regarding the suitability of voice analysis for PD detection remained unanswered. Additionally, besides the necessity of ensuring speaker independence, experiments with continuous speech signals are required in order to extend the application of those approaches to real-world scenarios. Recently, in Vásquez-Correa et al. (2017), researchers from our Lab evaluated the effects of background noise, different distortion levels, and telephone codecs in the automatic classification of PD vs. HC speakers. The results indicated that background noise has the strongest effect on the classification accuracy. The effect of telephone channels was not critical, except for the mobile channel, where the low bit-rate codecs caused an important reduction in the classification accuracy.
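The speaker-independence requirement discussed above means that all recordings of a given speaker must fall on the same side of the train/test partition. A minimal sketch of such a split is given below; the function name, the speaker identifiers, and the file names are hypothetical, and the sketch is not the protocol of any cited study.

```python
import random


def speaker_independent_split(recordings, test_fraction=0.25, seed=0):
    """Split (speaker_id, recording) pairs so that no speaker is in both sets."""
    speakers = sorted({speaker for speaker, _ in recordings})
    rng = random.Random(seed)
    rng.shuffle(speakers)
    # Partition by speaker first, then assign each recording to its speaker's side.
    n_test = max(1, round(len(speakers) * test_fraction))
    test_speakers = set(speakers[:n_test])
    train = [r for r in recordings if r[0] not in test_speakers]
    test = [r for r in recordings if r[0] in test_speakers]
    return train, test


# Hypothetical corpus: 8 speakers with 3 sustained-vowel recordings each.
recordings = [
    (f"spk{s:02d}", f"vowel_a_{k}.wav") for s in range(8) for k in range(3)
]
train, test = speaker_independent_split(recordings)
```

Splitting at the level of individual recordings instead would let the model see the same voice during training and testing, which is the source of the optimistic bias described above.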

This paper considers speech signals of people suffering from PD recorded during several sessions from 2012 to 2016, i.e., longitudinal study. As a group of speakers is recorded several times, those recordings are suitable to develop a system to model individual changes in the speech of PD patients. Acoustic conditions of those recordings were different between sessions, thus this corpus represents a real-world scenario to study the neurological state of PD patients from speech in real acoustic conditions. Two approaches are explored here, one is based on GMM–UBMs and the other one is based on i–vectors. Both methods are trained considering different aspects of speech: phonation, articulation, and prosody. Additionally, in order to assess the suitability of the approaches in different acoustic and communication conditions, five different communication channels are considered: sound proof booth, Skype®, Google Hangouts®, land-line, and mobile phone. Besides those channels, the proposed approach is tested upon recordings captured in the house of the patients (the same group that is considered in the longitudinal experiments). Those patients were recorded in 16 sessions during four months, i.e., one day per month, every two hours during eight hours per day. As in the case of the longitudinal recordings, the acoustic conditions were not controlled, thus this set represents a real-world scenario for the study of the neurological state of PD patients. To the best of our knowledge this is the first study introducing and testing individual speaker models to monitor PD progression considering speech signals captured with different communication channels/codecs, and at-home recordings.
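To make the idea of an individual speaker model derived from a UBM more concrete, the sketch below adapts the component means of a small Gaussian mixture to the data of one speaker. This preview does not describe how the speaker models are obtained, so the sketch assumes mean-only MAP adaptation with a relevance factor, a common choice in the speaker recognition literature. It uses scalar features and a two-component mixture for brevity, whereas real systems work with multi-dimensional feature vectors; all names and values are illustrative.

```python
from math import exp, pi, sqrt


def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian at x."""
    return exp(-((x - mean) ** 2) / (2 * var)) / sqrt(2 * pi * var)


def map_adapt_means(ubm, frames, relevance=16.0):
    """Mean-only MAP adaptation of a 1-D GMM (assumed scheme, see lead-in).

    ubm: list of (weight, mean, variance) tuples describing the UBM.
    frames: scalar features extracted from one speaker.
    Returns the adapted means; weights and variances stay those of the UBM.
    """
    counts = [0.0] * len(ubm)
    sums = [0.0] * len(ubm)
    for x in frames:
        likes = [w * gaussian_pdf(x, m, v) for w, m, v in ubm]
        total = sum(likes)
        if total == 0.0:
            continue  # frame too far from every component to contribute
        for k, like in enumerate(likes):
            post = like / total  # posterior probability of component k
            counts[k] += post
            sums[k] += post * x
    adapted = []
    for k, (_, mean, _) in enumerate(ubm):
        if counts[k] == 0.0:
            adapted.append(mean)  # no data for this component: keep UBM mean
            continue
        alpha = counts[k] / (counts[k] + relevance)
        data_mean = sums[k] / counts[k]
        adapted.append(alpha * data_mean + (1 - alpha) * mean)
    return adapted


# Hypothetical two-component UBM and a handful of frames from one speaker.
ubm = [(0.5, 0.0, 1.0), (0.5, 5.0, 1.0)]
speaker_frames = [0.8, 1.1, 0.9, 5.6, 5.4, 6.0]
speaker_means = map_adapt_means(ubm, speaker_frames)
```

The relevance factor controls how far the speaker model may move away from the UBM: with few frames the adapted means stay close to the background model, and with more data they approach the speaker's own statistics.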

Section snippets

Datasets

Three datasets are considered in this study, one is used to train the models and the other two sets are considered to test. All of the participants followed two speech tasks: (1) a monologue and (2) the reading of a standard text. For the monologue, the speakers were asked to talk about different topics such as hobbies, daily living activities, family, and others. The reading task included a phonetically balanced text which contains 36 words. The average duration of the monologues and the

Experiments with the at-home test set

Table 5 shows the results obtained when the SVR is used to estimate the m–FDA scores for the at-home test set. Each row corresponds to the Spearman’s correlation coefficient calculated between the estimated scores and the real m–FDA. It can be observed that none of the results were satisfactory. The highest correlations were obtained only for patient P1 when the articulation features were considered to train the SVR. This can be likely explained because typically, SVRs are used to estimate

Conclusions

This study presented a methodology to monitor the progression of speech impairments in PD patients using speaker models. Different speech aspects (phonation, articulation, and prosody) were considered to model different speech deficits exhibited by the patients. With the aim of evaluating the suitability of the methods to perform remote monitoring of speech impairments developed by patients with PD, the speech recordings were re-transmitted through different communication channels (sound-proof

Acknowledgments

This project was funded by CODI at Universidad de Antioquia (grants # PRV16-2-01 and 2015-7683). The work has received also funding from the European Unions Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie Grant Agreement no. 766287. Tomás Arias-Vergara is under grants of Convocatoria Doctorado Nacional-785 financed by COLCIENCIAS. The authors would like to thank all of the patients and collaborators from Fundalianza Parkinson Colombia. Without their support and

References (39)

  • N. Dehak et al. Discriminative and Generative Approaches for Long- and Short-term Speaker Characteristics Modeling: Application to Speaker Verification (2010)
  • P.M. Enderby et al. FDA-2: Frenchay Dysarthria Assessment: Examiner’s Manual (2008)
  • F. Eyben et al. Opensmile: the Munich versatile and fast open-source audio feature extractor. Proceedings of the 18th International Conference on Multimedia (2010)
  • N. García et al. Evaluation of the neurological state of people with Parkinson’s disease using i-vectors. Proceedings of the 18th Annual Conference of the International Speech Communication Association (INTERSPEECH) (2017)
  • N. García et al. Language independent assessment of motor impairments of patients with Parkinson’s disease using i-vectors. Lect. Notes Comput. Sci. (2017)
  • J.I. Godino-Llorente et al. Dimensionality reduction of a pathological voice quality assessment system based on Gaussian mixture models and short-term cepstral parameters. IEEE Trans. Biomed. Eng. (2006)
  • C.G. Goetz. Movement disorder society-sponsored revision of the unified Parkinson’s disease rating scale (MDS-UPDRS): scale presentation and clinimetric testing results. Mov. Disord. (2008)
  • P. Gómez-Vilda et al. Monitoring Parkinson’s disease from phonation improvement by Log Likelihood Ratios. Bioinspired Intelligence (IWOBI), Fourth International Work Conference on (2015)
  • P. Gómez-Vilda et al. Parkinson’s disease monitoring from phonation biomechanics. Lect. Notes Comput. Sci. (2015)