Computer Speech & Language

Volume 56, July 2019, Pages 80-94

An automated assessment framework for atypical prosody and stereotyped idiosyncratic phrases related to autism spectrum disorder

https://doi.org/10.1016/j.csl.2018.11.002

Abstract

Autism Spectrum Disorder (ASD), a neurodevelopmental disability, has become one of the high-incidence disorders among children. Studies indicate that early diagnosis and intervention help to achieve positive longitudinal outcomes. In this paper, we focus on the speech and language abnormalities of young children with ASD and present an automated assessment framework for quantifying atypical prosody and stereotyped idiosyncratic phrases related to ASD. For detecting atypical prosody from speech, we propose both a hand-crafted feature based method and an end-to-end deep learning framework. First, we use the OpenSMILE toolkit to extract utterance-level high-dimensional acoustic features followed by a support vector machine (SVM) backend as the conventional baseline. Second, we propose several end-to-end deep neural network setups and configurations to model the atypical prosody label directly from the constant Q transform spectrogram of speech. Third, we apply cross-validation on the training data to perform segment selection and enhance the subject-level classification performance. Fourth, we fuse the deep learning based methods with the conventional baseline at the score level to further enhance the overall system performance. For detecting the stereotyped idiosyncratic usage of words or phrases from speech transcripts, we adopt language model, dependency treebank and Term Frequency–Inverse Document Frequency (TF–IDF) features in addition to Linguistic Inquiry and Word Count (LIWC) features, followed by a standard SVM backend. We collect a database of spontaneous Mandarin speech recorded during Autism Diagnostic Observation Schedule (ADOS) Module 2 and Module 3 sessions. The Module 2 part consists of 118 children while the Module 3 part includes 71 children. Experimental results on this database show that our proposed methods can effectively predict the atypical prosody and stereotyped idiosyncratic phrases codes for young children at risk of ASD. On the two-category classification task, the unweighted accuracies of the aforementioned two tasks are 88.1% and 77.8%, respectively.

Introduction

Autism Spectrum Disorder (ASD) refers to a group of symptoms related to social impairments and communication difficulties. It has become one of the high-incidence disorders among children. A recent analysis from the Centers for Disease Control and Prevention estimates that 1 in 68 children in the United States has ASD (Christensen et al., 2016). Early behavioral and educational interventions have proven very successful in many clinical studies, which attaches great significance to recognizing common ASD behavior patterns and making diagnoses at an early stage.

In paralinguistics, prosody relates to several communicative functions such as intonation, tone, pitch, stress, and rhythm. Prosody can reflect many important elements of language, including emphasis, contrast, and the affective state of the speaker (McCann and Peppé, 2003), all of which carry critical information in human communication. Atypical prosody is therefore one of the common symptoms related to ASD. Specifically, children with ASD may speak in a flat, robot-like, or sing-song voice (Fusaroli et al., 2016).

In this work, we not only focus on the speech signal but also study language patterns for ASD detection. Verbally fluent children with ASD may exhibit various kinds of language and communication abnormalities, such as stereotyped, repetitive, and idiosyncratic usage of words or phrases. Children with stereotyped idiosyncratic usage of words or phrases often use inflexible and rigid words and expressions during conversation, and the words or phrases they utter may be inappropriate for the context. Moreover, children with ASD may coin new, unusual words during conversation.

Both of the aforementioned speech and language cues are important for clinicians when performing a diagnosis. The Autism Diagnostic Observation Schedule (ADOS) is a standard screening test that helps clinicians observe children’s language and behavior patterns relevant to the diagnosis of autism. It consists of a series of structured and semi-structured tasks assessing social interaction, communication, play, and imaginative use of materials (Lord et al., 2000). There are four different modules, assigned mainly according to the subject’s age and linguistic capability, and speech and language abnormalities are coded in all four modules. The ADOS screening provides codes that quantify each item on an integer scale from “0” to “2” based on the severity of the corresponding abnormality category (Gotham et al., 2009). Taking atypical prosody as an example, “0” denotes no abnormal prosody; “1” stands for some changes in pitch/tone, somewhat flat or exaggerated intonation, slightly abnormal volume, or a slightly slow/fast/jerky rhythm; and “2” implies marked and consistent abnormalities in the aforementioned aspects (Lord et al., 2000).

In the ADOS screening, therapists need to identify multiple behavior codes related to speech and language, including atypical prosody, stereotyped idiosyncratic phrases, etc. As with many research and treatment methods in psychology, this kind of evaluation or diagnosis requires experienced experts or clinicians with intensive specialized training. Another issue is subjective inconsistency between clinicians, which can sometimes make the results ambiguous. Researchers have proposed strategies that use speech and language processing techniques to support clinicians with quantitative analysis of the prosody (Bone, Black, Lee, Williams, Levitt, Lee, Narayanan, 2012, Chaspari, Provost, Katsamanis, Narayanan, 2012, Bone, Black, Ramakrishna, Grossman, Narayanan, 2015, Bone, Chaspari, Narayanan, 2017) and language patterns (Kumar et al., 2016) of children with ASD. Furthermore, since pattern recognition and machine learning methods have demonstrated promising results in modeling behavior symptoms and relating them to expert experience (Narayanan, Georgiou, 2013, Xiao, Imel, Georgiou, Atkins, Narayanan, 2015), automated screening and evaluation tools based on objective measurements extracted directly from recordings have been proposed (Gong, Gong, Levy-Lambert, Green, Hogan, Guttag, 2016, Xiao, Can, Gibson, Imel, Atkins, Georgiou, Narayanan, 2016). These automated coding tools show great potential to scale and to assist clinicians in analyzing the trend of a specific symptom during long-term monitoring or assessment.

In this paper, we focus on the speech and language abnormalities and present an automated assessment framework to determine the existence and severity level of atypical prosody and stereotyped idiosyncratic phrases for young children under the ADOS Module 2 and 3 setup.

On the speech side, we model the atypical prosody abnormality using both a traditional strategy and a deep learning framework. We demonstrate that the end-to-end techniques can achieve performance comparable to the baseline system even on a small-scale dataset. Since we directly model the ASD-related atypical prosody code from the spectrograms in an end-to-end manner, no prior domain knowledge is required for feature engineering. Fusing the two systems further improves the overall performance at the segment level in terms of unweighted average recall (UAR). This result shows that the end-to-end framework has great potential in the field of behavior signal processing (BSP) (Black et al., 2013). Moreover, among all the speech segments in an ADOS conversation session, not every segment reflects atypical prosody; we therefore adopt a cross-validation strategy on the training set to perform segment selection and improve accuracy at both the segment and subject levels.
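To make the end-to-end branch concrete, the sketch below shows one way to turn a speech segment into a constant Q transform spectrogram and score it with a small convolutional network. The librosa parameters, network layout, and file name are illustrative assumptions rather than the exact configuration used in the paper.

```python
# Illustrative sketch only: CQT spectrogram input to a small CNN classifier.
# Layer sizes and hyper-parameters are assumptions, not the paper's exact setup.
import librosa
import numpy as np
import torch
import torch.nn as nn

def cqt_spectrogram(wav_path, sr=16000):
    """Load one speech segment and compute a log-magnitude constant Q transform."""
    y, sr = librosa.load(wav_path, sr=sr)
    C = librosa.cqt(y, sr=sr, hop_length=256, n_bins=84, bins_per_octave=12)
    return librosa.amplitude_to_db(np.abs(C), ref=np.max)   # shape: (84, frames)

class ProsodyCNN(nn.Module):
    """Small 2-D CNN mapping a CQT spectrogram to the 0/1/2 prosody code."""
    def __init__(self, n_classes=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),          # pool over time, so segment length can vary
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):                     # x: (batch, 1, freq_bins, frames)
        h = self.features(x).flatten(1)
        return self.classifier(h)

# Usage: score one segment (hypothetical file name).
spec = cqt_spectrogram("segment_001.wav")
x = torch.from_numpy(spec).float()[None, None]   # add batch and channel dimensions
logits = ProsodyCNN()(x)                          # per-class scores for this segment
```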

On the language side, besides the baseline features of Kumar et al. (2016), namely an n-gram language model and categorical word counts from the Linguistic Inquiry and Word Count software (LIWC) with a maximum entropy classifier, we also propose two methods, dependency treebank features and Term Frequency–Inverse Document Frequency (TF–IDF), to extract features that are more related to the stereotyped idiosyncratic usage of words or phrases. We concatenate all four feature sets and adopt a standard SVM classifier as the backend.
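As a rough illustration of this language-side pipeline, the sketch below builds TF–IDF features from word-segmented transcripts, concatenates them with a placeholder block standing in for the other feature types (language model, LIWC, dependency features), and trains a linear SVM. The toy transcripts, labels, and feature dimensions are assumptions for illustration, not data from the study.

```python
# Minimal sketch of the language-side backend: TF-IDF features plus other feature
# blocks are concatenated and fed to a linear SVM. Transcripts are assumed to be
# already word-segmented Mandarin text (tokens separated by spaces).
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

transcripts = ["我 想 玩 小汽车 小汽车", "今天 我们 一起 看 书"]   # toy examples
codes = [1, 0]                                                    # 0/1/2 ADOS codes

# TF-IDF over word unigrams and bigrams (whitespace tokens).
tfidf = TfidfVectorizer(token_pattern=r"\S+", ngram_range=(1, 2))
X_tfidf = tfidf.fit_transform(transcripts)

# Placeholder for the other feature blocks (n-gram LM, LIWC category counts,
# dependency-treebank statistics); here simply zeros of a fixed width.
X_other = csr_matrix(np.zeros((len(transcripts), 4)))

X = hstack([X_tfidf, X_other])               # concatenate all feature blocks
clf = LinearSVC(C=1.0).fit(X, codes)         # linear-kernel SVM backend
```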

Furthermore, we also investigate the cutoff boundary of the 0/1/2 codes by merging codes 1 and 2 into a new code to form a binary classification task. Experimental results show that our trained models are more confident at distinguishing between normal and abnormal cases than at estimating the detailed severity level of abnormal behaviors. In this study, our goal is not merely to recognize the three-category code and use it as described in the ADOS manual (summing all codes and comparing with the cutoff threshold). The two-category code itself can serve as a quantitative measure of atypical prosody or stereotyped idiosyncratic phrases, and the proposed method can be used for coarse screening. Beyond that, the recognized two-category code can also be fused with other automatically calculated codes from related tests, e.g., response to name (Liu et al., 2017), response to non-social sound stimuli, joint attention, etc.

The remainder of the paper is organized as follows. Section 2 describes our database. The proposed methods are explained in Section 3 and Section 4, respectively. Experimental results and discussions are presented in Section 5, while conclusions and future work are provided in Section 6.

Section snippets

Database description

We perform experiments on data collected in our behavior observation and analysis lab at the Third Affiliated Hospital of Sun Yat-sen University, as demonstrated in Fig. 1. Our audio database is collected in a real ADOS Module 2 and Module 3 screening environment. As shown in Fig. 1, our multimodal behavior signal capture system is equipped with multiple HD cameras and a Kinect sensor to capture vision data during the child-psychologist interactions. As for the audio data,

Methods for atypical prosody detection

The baseline system is implemented using the OpenSMILE feature extractor followed by a Support Vector Machine (SVM) classifier. Our end-to-end deep learning framework uses spectrograms as the input and performs supervised learning with deep neural networks. Finally, we perform score-level fusion by averaging the prediction scores of the aforementioned systems. Section 3.2 and Section 3.3 introduce these two methods in detail.
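A minimal sketch of this baseline and of averaging-based score fusion is given below. It assumes the open-source `opensmile` Python wrapper with the ComParE functionals and scikit-learn's SVM, which may differ from the exact toolchain used in the paper.

```python
# Sketch of the conventional baseline (OpenSMILE functionals + SVM) and of
# score-level fusion by averaging. Feature set and scaling choices here are
# illustrative assumptions, not necessarily the paper's exact configuration.
import numpy as np
import opensmile
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,       # utterance-level functionals
    feature_level=opensmile.FeatureLevel.Functionals,
)

def utterance_features(wav_paths):
    """One fixed-length acoustic feature vector per utterance."""
    return np.vstack([smile.process_file(p).to_numpy()[0] for p in wav_paths])

# Baseline backend: standardized functionals into an SVM that outputs class scores.
baseline = make_pipeline(StandardScaler(), SVC(kernel="linear", probability=True))

def fuse_scores(svm_scores, dnn_scores):
    """Score-level fusion: average per-class scores of the two systems, then argmax."""
    fused = 0.5 * (svm_scores + dnn_scores)
    return fused.argmax(axis=1)
```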

Methods for stereotyped idiosyncratic phrases detection

Generally, the stereotyped idiosyncratic phrases detection task can be considered a supervised text classification problem. Given that the scale of our transcript database is quite small, we adopt several feature extractors that match the definition of 'stereotyped/idiosyncratic usage of words or phrases' and the domain knowledge of expert clinicians. After the features are extracted, we use the LibSVM toolkit (Chang and Lin, 2011) with a linear kernel to perform leave one
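Assuming a leave-one-subject-out protocol (one fold per child) with a linear-kernel SVM, a minimal evaluation sketch looks like the following; it uses scikit-learn in place of the LibSVM binary and reports unweighted average recall.

```python
# Sketch of leave-one-subject-out cross-validation for the text classifier.
# LinearSVC stands in for the LibSVM toolkit; grouping by child is an assumption.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.svm import LinearSVC
from sklearn.metrics import recall_score

def loso_uar(X, y, subject_ids):
    """Leave-one-subject-out CV; returns unweighted average recall (UAR)."""
    y = np.asarray(y)
    preds = np.empty_like(y)
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=subject_ids):
        clf = LinearSVC(C=1.0).fit(X[train_idx], y[train_idx])
        preds[test_idx] = clf.predict(X[test_idx])
    return recall_score(y, preds, average="macro")   # macro-averaged recall == UAR
```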

Results for the atypical prosody detection task

In this section, we compare the classification results of the OpenSMILE+SVM baseline and our proposed end-to-end approaches. Besides the 0/1/2 three-category classification, we also perform binary classification by merging codes 1 and 2 into a single class to enhance practical usability. Moreover, we show the results with segment selection and score-level fusion. The details of the database and evaluation protocol are presented in Section 2 and Table 4.
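As a small illustration of the binary setup, the sketch below merges codes 1 and 2 into a single "atypical" class and computes the unweighted average recall (macro-averaged recall); the labels shown are hypothetical, not results from the paper.

```python
# Sketch of the binary setup: codes 1 and 2 are merged into one "abnormal" class
# before scoring; UAR is computed as macro-averaged recall.
import numpy as np
from sklearn.metrics import recall_score

def to_binary(codes):
    """Map the 0/1/2 ADOS codes to 0 (typical) vs. 1 (atypical)."""
    return (np.asarray(codes) > 0).astype(int)

# Toy example with hypothetical labels and predictions.
y_true = to_binary([0, 1, 2, 0, 2])
y_pred = to_binary([0, 1, 1, 1, 2])
uar = recall_score(y_true, y_pred, average="macro")
print(f"UAR = {uar:.3f}")
```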

Conclusions and future works

In this paper, we present an automated assessment framework for quantifying atypical prosody and stereotyped idiosyncratic phrases related to ASD. We collected an audio database during ADOS screening sessions; the Module 2 part consists of 118 children while the Module 3 part includes 71 children. For detecting atypical prosody, the proposed end-to-end deep learning methods achieve superior performance at the segment level, but not at the person level. The cross validation based segment

Acknowledgments

This research was funded in part by the National Natural Science Foundation of China (61773413, 81873801, 81601533), the Natural Science Foundation of Guangzhou City (201707010363), the Guangdong Science and Technology Program for Industrial Development (20160914), the Six Talent Peaks Project in Jiangsu Province (JY-074), and the National Key Research and Development Program (2016YFC0103905).

References (46)

  • M.P. Black et al.

    Toward automating a human behavioral coding system for married couples interactions using speech acoustic features

    Speech Commun.

    (2013)
  • M.J. Alam et al.

    Combining amplitude and phase-based features for speaker verification with short duration utterances

    Proceedings of Interspeech

    (2015)
  • D. Bone et al.

    Spontaneous-speech acoustic-prosodic features of children with autism and the interacting psychologist

    Proceedings of Interspeech

    (2012)
  • D. Bone et al.

    Acoustic-prosodic correlates of ‘awkward’ prosody in story retellings from adolescents with autism

    Proceedings of Interspeech

    (2015)
  • D. Bone et al.

    Chapter 15: Behavioral signal processing and autism: Learning from multimodal behavioral signals

    Autism Imaging and Devices

    (2017)
  • D. Cai et al.

    End-to-end deep learning framework for speech paralinguistics detection based on perception aware spectrum

    Proceedings of Interspeech

    (2017)
  • W. Cai et al.

    Countermeasures for automatic speaker verification replay spoofing attack: On data augmentation, feature representation, classification and fusion

    Proceedings of Interspeech

    (2017)
  • W. Cai et al.

    Insights into end-to-end learning scheme for language identification

    Proceedings of ICASSP

    (2018)
  • C.C. Chang et al.

    LIBSVM: A Library for Support Vector Machines

    (2011)
  • T. Chaspari et al.

    An acoustic analysis of shared enjoyment in ECA interactions of children with autism

    Proceedings of ICASSP

    (2012)
  • W. Che et al.

    LTP: A Chinese language technology platform

    J. Chin. Inform. Process.

    (2010)
  • D.L. Christensen et al.

    Prevalence and characteristics of autism spectrum disorder among 4-year-old children in the autism and developmental disabilities monitoring network

    J. Develop. Behav. Pediat.

    (2016)
  • H. Dubey et al.

    A speaker diarization system for studying peer-led team learning groups

    Proceedings of Interspeech

    (2016)
  • F. Eyben

    openSMILE: The Munich versatile and fast open-source audio feature extractor

    Proceedings of ACM International Conference on Multimedia

    (2010)
  • R. Fusaroli et al.

    Is voice a marker for autism spectrum disorder? A systematic review and meta-analysis

    Autism Res.

    (2016)
  • J.J. Gong et al.

    Towards an automated screening tool for developmental speech and language impairments

    Proceedings of Interspeech

    (2016)
  • K. Gotham et al.

    Standardizing ADOS scores for a measure of severity in autism spectrum disorders

    J. Aut. Develop. Disorders

    (2009)
  • R. Grzadzinski et al.

    Parent-reported and clinician-observed autism spectrum disorder (ASD) symptoms in children with attention deficit/hyperactivity disorder (ADHD): implications for practice under DSM-5

    Molecular Aut.

    (2016)
  • Harutyunyan, H., Khachatrian, H., 2016. Combining CNN and RNN for spoken language identification. In:...
  • R.M. Hegde et al.

    Significance of the modified group delay feature in speech recognition

    IEEE Trans. Audio, Speech, Lang. Process.

    (2007)
  • R.M. Hegde et al.

    Application of the modified group delay function to speaker identification and discrimination

    Proceedings of ICASSP

    (2004)
  • G. Heigold et al.

    End-to-end text-dependent speaker verification

    Proceedings of ICASSP

    (2016)
  • M. Kumar et al.

    Objective language feature analysis in children with neurodevelopmental disorders during autism assessment

    Proceedings of Interspeech

    (2016)

☆ This paper has been recommended for acceptance by Prof. R. K. Moore.
