Computer Speech & Language

Volume 56, July 2019, Pages 80-94

An automated assessment framework for atypical prosody and stereotyped idiosyncratic phrases related to autism spectrum disorder

https://doi.org/10.1016/j.csl.2018.11.002

Abstract

Autism Spectrum Disorder (ASD), a neurodevelopmental disability, has become one of the high-incidence disorders among children. Studies indicate that early diagnosis and intervention help to achieve positive longitudinal outcomes. In this paper, we focus on the speech and language abnormalities of young children with ASD and present an automated assessment framework for quantifying atypical prosody and stereotyped idiosyncratic phrases related to ASD. For detecting atypical prosody from speech, we propose both a hand-crafted feature based method and an end-to-end deep learning framework. First, we use the OpenSMILE toolkit to extract utterance-level high-dimensional acoustic features followed by a support vector machine (SVM) backend as the conventional baseline. Second, we propose several end-to-end deep neural network setups and configurations to model the atypical prosody label directly from the constant Q transform spectrogram of speech. Third, we apply cross-validation on the training data to perform segment selection and enhance the subject-level classification performance. Fourth, we fuse the deep learning based methods with the conventional baseline at the score level to further enhance the overall system performance. For detecting the stereotyped idiosyncratic usage of words or phrases from speech transcripts, we adopt language model, dependency treebank and Term Frequency–Inverse Document Frequency (TF–IDF) features in addition to Linguistic Inquiry and Word Count (LIWC) features, followed by a standard SVM backend. We collect a database of spontaneous Mandarin speech recorded during Autism Diagnostic Observation Schedule (ADOS) Module 2 and Module 3 sessions. The Module 2 part consists of 118 children while the Module 3 part includes 71 children. Experimental results on this database show that our proposed methods can effectively predict the atypical prosody and stereotyped idiosyncratic phrases codes for young children at risk of ASD. On the two-category classification task, the unweighted accuracies of the aforementioned two tasks are 88.1% and 77.8%, respectively.

Introduction

Autism Spectrum Disorder (ASD) refers to a group of symptoms related to social impairments and communication difficulties. It has become one of the high-incidence disorders among children. A recent analysis from the Centers for Disease Control and Prevention estimates that 1 in 68 children in the United States has ASD (Christensen et al., 2016). Early behavioral and educational interventions have proven very successful in many clinical studies, which attaches great significance to recognizing common ASD behavior patterns and making diagnoses at an early stage.

In paralinguistics, prosody relates to several communicative functions such as intonation, tone, pitch, stress, and rhythm. Prosody can reflect many important elements of language, including emphasis, contrast, and the affective state of the speaker (McCann and Peppé, 2003), all of which carry critical information in human communication. Atypical prosody is therefore one of the common symptoms related to ASD. Specifically, children with ASD may speak in a flat, robot-like, or sing-song voice (Fusaroli et al., 2016).

In this work, we not only focus on the speech signal but also study language patterns for ASD detection. Verbally fluent children with ASD may exhibit various kinds of language and communication abnormalities, such as stereotyped, repetitive, and idiosyncratic usage of words or phrases. Children with stereotyped idiosyncratic usage of words or phrases often use inflexible and rigid words and expressions during conversation, and the words or phrases they utter may be inappropriate for the context. Moreover, children with ASD may coin new, unusual words during conversation.

Both of the aforementioned speech and language cues are important for clinicians when performing a diagnosis. The Autism Diagnostic Observation Schedule (ADOS) is a standard screening test that helps clinicians observe children’s language and behavior patterns relevant to the diagnosis of autism. It consists of a series of structured and semi-structured tasks assessing social interaction, communication, play, and imaginative use of materials (Lord et al., 2000). There are four different modules, assigned mainly according to the subject’s age and linguistic capability, and speech and language abnormalities are coded in all four modules. The ADOS screening provides codes that quantify each item on an integer scale from “0” to “2” based on the severity of the corresponding abnormality category (Gotham et al., 2009). Taking atypical prosody as an example, “0” denotes no abnormal prosody; “1” stands for some changes in pitch/tone, somewhat flat or exaggerated intonation, slightly abnormal volume, or a slightly slow/fast/jerky rhythm; and “2” implies marked and consistent abnormalities in the aforementioned aspects (Lord et al., 2000).

In the ADOS screening, therapists need to identify multiple behavior codes related to speech and language, including atypical prosody, stereotyped idiosyncratic phrases, etc. As with many research and treatment methods in psychology, this kind of evaluation or diagnosis requires experienced experts or clinicians with intensive specialized training. Another issue is subjective inconsistency between clinicians, which can sometimes make the results ambiguous. Researchers have proposed strategies that use speech and language processing techniques to support clinicians with quantitative analysis of the prosody (Bone, Black, Lee, Williams, Levitt, Lee, Narayanan, 2012, Chaspari, Provost, Katsamanis, Narayanan, 2012, Bone, Black, Ramakrishna, Grossman, Narayanan, 2015, Bone, Chaspari, Narayanan, 2017) and language patterns (Kumar et al., 2016) of children with ASD. Furthermore, since pattern recognition and machine learning methods have demonstrated promising results in modeling behavior symptoms and relating them to expert experience (Narayanan, Georgiou, 2013, Xiao, Imel, Georgiou, Atkins, Narayanan, 2015), automated screening and evaluation tools based on objective measurements extracted directly from recordings have been proposed (Gong, Gong, Levy-Lambert, Green, Hogan, Guttag, 2016, Xiao, Can, Gibson, Imel, Atkins, Georgiou, Narayanan, 2016). These automated coding tools show great potential to scale and to assist clinicians in analyzing the trend of a specific symptom during long-term monitoring or assessment.

In this paper, we focus on the speech and language abnormalities and present an automated assessment framework to determine the existence and severity level of atypical prosody and stereotyped idiosyncratic phrases for young children under the ADOS Module 2 and 3 setup.

On the speech side, we model the atypical prosody abnormality using both a traditional strategy and a deep learning framework. We demonstrate that the end-to-end techniques can achieve performance comparable to the baseline system even on a small-scale dataset. Since we directly model the ASD-related atypical prosody code from the spectrograms in an end-to-end manner, no prior domain knowledge is required for feature engineering. Fusing the two systems further improves the overall performance at the segment level in terms of unweighted average recall (UAR). This result shows that the end-to-end framework has great potential in the field of behavior signal processing (BSP) (Black et al., 2013). Moreover, among all the speech segments in an ADOS conversation session, not every segment reflects atypical prosody; we therefore adopt a cross-validation strategy on the training set to perform segment selection and improve accuracy at both the segment and subject levels.
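To make the end-to-end branch concrete, the sketch below shows one way to turn a speech segment into a constant Q transform spectrogram and score it with a small convolutional network. The librosa parameters, network layout, and file name are illustrative assumptions rather than the exact configuration used in the paper.

```python
# Illustrative sketch only: CQT spectrogram input to a small CNN classifier.
# Layer sizes and hyper-parameters are assumptions, not the paper's exact setup.
import librosa
import numpy as np
import torch
import torch.nn as nn

def cqt_spectrogram(wav_path, sr=16000):
    """Load one speech segment and compute a log-magnitude constant Q transform."""
    y, sr = librosa.load(wav_path, sr=sr)
    C = librosa.cqt(y, sr=sr, hop_length=256, n_bins=84, bins_per_octave=12)
    return librosa.amplitude_to_db(np.abs(C), ref=np.max)   # shape: (84, frames)

class ProsodyCNN(nn.Module):
    """Small 2-D CNN mapping a CQT spectrogram to the 0/1/2 prosody code."""
    def __init__(self, n_classes=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),          # pool over time, so segment length can vary
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):                     # x: (batch, 1, freq_bins, frames)
        h = self.features(x).flatten(1)
        return self.classifier(h)

# Usage: score one segment (hypothetical file name).
spec = cqt_spectrogram("segment_001.wav")
x = torch.from_numpy(spec).float()[None, None]   # add batch and channel dimensions
logits = ProsodyCNN()(x)                          # per-class scores for this segment
```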

On the language side, besides the baseline features of Kumar et al. (2016), namely an n-gram language model and categorical word counts from the Linguistic Inquiry and Word Count software (LIWC) with a maximum entropy classifier, we also propose two methods, dependency treebank features and Term Frequency–Inverse Document Frequency (TF–IDF), to extract features that are more related to the stereotyped idiosyncratic usage of words or phrases. We concatenate all four feature sets and adopt a standard SVM classifier as the backend.
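As a rough illustration of this language-side pipeline, the sketch below builds TF–IDF features from word-segmented transcripts, concatenates them with a placeholder block standing in for the other feature types (language model, LIWC, dependency features), and trains a linear SVM. The toy transcripts, labels, and feature dimensions are assumptions for illustration, not data from the study.

```python
# Minimal sketch of the language-side backend: TF-IDF features plus other feature
# blocks are concatenated and fed to a linear SVM. Transcripts are assumed to be
# already word-segmented Mandarin text (tokens separated by spaces).
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

transcripts = ["我 想 玩 小汽车 小汽车", "今天 我们 一起 看 书"]   # toy examples
codes = [1, 0]                                                    # 0/1/2 ADOS codes

# TF-IDF over word unigrams and bigrams (whitespace tokens).
tfidf = TfidfVectorizer(token_pattern=r"\S+", ngram_range=(1, 2))
X_tfidf = tfidf.fit_transform(transcripts)

# Placeholder for the other feature blocks (n-gram LM, LIWC category counts,
# dependency-treebank statistics); here simply zeros of a fixed width.
X_other = csr_matrix(np.zeros((len(transcripts), 4)))

X = hstack([X_tfidf, X_other])               # concatenate all feature blocks
clf = LinearSVC(C=1.0).fit(X, codes)         # linear-kernel SVM backend
```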

Furthermore, we also investigate the cutoff boundary of the 0/1/2 codes by merging codes 1 and 2 into a new code to form a binary classification task. Experimental results show that our trained models are more confident at distinguishing between normal and abnormal cases than at estimating the detailed severity level of abnormal behaviors. In this study, our goal is not merely to recognize the three-category code and use it as described in the ADOS manual (summing all codes and comparing with the cutoff threshold). The two-category code itself can serve as a quantitative measure of atypical prosody or stereotyped idiosyncratic phrases, and the proposed method can be used for coarse screening. Beyond that, the recognized two-category code can also be fused with other automatically calculated codes from related tests, e.g., response to name (Liu et al., 2017), response to non-social sound stimuli, joint attention, etc.

The remainder of the paper is organized as follows. Section 2 describes our database. The proposed methods are explained in Section 3 and Section 4, respectively. Experimental results and discussions are presented in Section 5, while conclusions and future work are provided in Section 6.

Section snippets

Database description

We perform experiments on data collected in our behavior observation and analysis lab at the Third Affiliated Hospital of Sun Yat-sen University, as demonstrated in Fig. 1. Our audio database is collected in a real ADOS Module 2 and Module 3 screening environment. As shown in Fig. 1, our multimodal behavior signal capture system is equipped with multiple HD cameras and a Kinect sensor to capture vision data during the child-psychologist interactions. As for the audio data,

Methods for atypical prosody detection

The baseline system is implemented using the OpenSMILE feature extractor followed by a Support Vector Machine (SVM) classifier. Our end-to-end deep learning framework uses spectrograms as the input and performs supervised learning with deep neural networks. Finally, we perform score-level fusion by averaging the prediction scores of the aforementioned systems. Section 3.2 and Section 3.3 introduce these two methods in detail.
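A minimal sketch of this baseline and of averaging-based score fusion is given below. It assumes the open-source `opensmile` Python wrapper with the ComParE functionals and scikit-learn's SVM, which may differ from the exact toolchain used in the paper.

```python
# Sketch of the conventional baseline (OpenSMILE functionals + SVM) and of
# score-level fusion by averaging. Feature set and scaling choices here are
# illustrative assumptions, not necessarily the paper's exact configuration.
import numpy as np
import opensmile
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,       # utterance-level functionals
    feature_level=opensmile.FeatureLevel.Functionals,
)

def utterance_features(wav_paths):
    """One fixed-length acoustic feature vector per utterance."""
    return np.vstack([smile.process_file(p).to_numpy()[0] for p in wav_paths])

# Baseline backend: standardized functionals into an SVM that outputs class scores.
baseline = make_pipeline(StandardScaler(), SVC(kernel="linear", probability=True))

def fuse_scores(svm_scores, dnn_scores):
    """Score-level fusion: average per-class scores of the two systems, then argmax."""
    fused = 0.5 * (svm_scores + dnn_scores)
    return fused.argmax(axis=1)
```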

Methods for stereotyped idiosyncratic phrases detection

Generally, the stereotyped idiosyncratic phrases detection task can be considered a supervised text classification problem. Given that the scale of our transcript database is quite small, we adopt several feature extractors that match the definition of 'stereotyped/idiosyncratic usage of words or phrases' and the domain knowledge of expert clinicians. After the features are extracted, we use the LibSVM toolkit (Chang and Lin, 2011) with a linear kernel to perform leave one
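Assuming a leave-one-subject-out protocol (one fold per child) with a linear-kernel SVM, a minimal evaluation sketch looks like the following; it uses scikit-learn in place of the LibSVM binary and reports unweighted average recall.

```python
# Sketch of leave-one-subject-out cross-validation for the text classifier.
# LinearSVC stands in for the LibSVM toolkit; grouping by child is an assumption.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.svm import LinearSVC
from sklearn.metrics import recall_score

def loso_uar(X, y, subject_ids):
    """Leave-one-subject-out CV; returns unweighted average recall (UAR)."""
    y = np.asarray(y)
    preds = np.empty_like(y)
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=subject_ids):
        clf = LinearSVC(C=1.0).fit(X[train_idx], y[train_idx])
        preds[test_idx] = clf.predict(X[test_idx])
    return recall_score(y, preds, average="macro")   # macro-averaged recall == UAR
```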

Results for the atypical prosody detection task

In this section, we compare the classification results of the OpenSMILE+SVM baseline and our proposed end-to-end approaches. Besides the 0/1/2 three-category classification, we also perform binary classification by merging codes 1 and 2 into a single class to enhance practical usability. Moreover, we show the results with segment selection and score-level fusion. The details of the database and evaluation protocol are presented in Section 2 and Table 4.
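As a small illustration of the binary setup, the sketch below merges codes 1 and 2 into a single "atypical" class and computes the unweighted average recall (macro-averaged recall); the labels shown are hypothetical, not results from the paper.

```python
# Sketch of the binary setup: codes 1 and 2 are merged into one "abnormal" class
# before scoring; UAR is computed as macro-averaged recall.
import numpy as np
from sklearn.metrics import recall_score

def to_binary(codes):
    """Map the 0/1/2 ADOS codes to 0 (typical) vs. 1 (atypical)."""
    return (np.asarray(codes) > 0).astype(int)

# Toy example with hypothetical labels and predictions.
y_true = to_binary([0, 1, 2, 0, 2])
y_pred = to_binary([0, 1, 1, 1, 2])
uar = recall_score(y_true, y_pred, average="macro")
print(f"UAR = {uar:.3f}")
```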

Conclusions and future works

In this paper, we present an automated assessment framework for quantifying atypical prosody and stereotyped idiosyncratic phrases related to ASD. We collected an audio database during ADOS screening sessions; the Module 2 part consists of 118 children while the Module 3 part includes 71 children. For detecting atypical prosody, the proposed end-to-end deep learning methods achieve superior performance at the segment level, but not at the person level. The cross validation based segment

Acknowledgments

This research was funded in part by the National Natural Science Foundation of China (61773413, 81873801, 81601533), the Natural Science Foundation of Guangzhou City (201707010363), the Guangdong Science and Technology Program for Industrial Development (20160914), the Six Talent Peaks Project in Jiangsu Province (JY-074), and the National Key Research and Development Program (2016YFC0103905).

References (46)

  • M.P. Black et al.

    Toward automating a human behavioral coding system for married couples interactions using speech acoustic features

    Speech Commun.

    (2013)
  • M.J. Alam et al.

    Combining amplitude and phase-based features for speaker verification with short duration utterances

    Proceedings of Interspeech

    (2015)
  • D. Bone et al.

    Spontaneous-speech acoustic-prosodic features of children with autism and the interacting psychologist

    Proceedings of Interspeech

    (2012)
  • D. Bone et al.

    Acoustic-prosodic correlates of ‘awkward’ prosody in story retellings from adolescents with autism

    Proceedings of Interspeech

    (2015)
  • D. Bone et al.

    Chapter 15: Behavioral signal processing and autism: Learning from multimodal behavioral signals

    Autism Imaging and Devices

    (2017)
  • D. Cai et al.

    End-to-end deep learning framework for speech paralinguistics detection based on perception aware spectrum

    Proceedings of Interspeech

    (2017)
  • W. Cai et al.

    Countermeasures for automatic speaker verification replay spoofing attack: On data augmentation, feature representation, classification and fusion

    Proceedings of Interspeech

    (2017)
  • W. Cai et al.

    Insights into end-to-end learning scheme for language identification

    Proceedings of ICASSP

    (2018)
  • C.C. Chang et al.

    LIBSVM: A Library for Support Vector Machines

    (2011)
  • T. Chaspari et al.

    An acoustic analysis of shared enjoyment in ECA interactions of children with autism

    Proceedings of ICASSP

    (2012)
  • W. Che et al.

    LTP: A Chinese language technology platform

    J. Chin. Inform. Process.

    (2010)
  • D.L. Christensen et al.

    Prevalence and characteristics of autism spectrum disorder among 4-year-old children in the autism and developmental disabilities monitoring network

    J. Develop. Behav. Pediat.

    (2016)
  • H. Dubey et al.

    A speaker diarization system for studying peer-led team learning groups

    Proceedings of Interspeech

    (2016)
  • F. Eyben

    openSMILE: The Munich versatile and fast open-source audio feature extractor

    Proceedings of ACM International Conference on Multimedia

    (2010)
  • R. Fusaroli et al.

    Is voice a marker for autism spectrum disorder? A systematic review and meta-analysis

    Autism Res.

    (2016)
  • J.J. Gong et al.

    Towards an automated screening tool for developmental speech and language impairments

    Proceedings of Interspeech

    (2016)
  • K. Gotham et al.

    Standardizing ADOS scores for a measure of severity in autism spectrum disorders

    J. Aut. Develop. Disorders

    (2009)
  • R. Grzadzinski et al.

    Parent-reported and clinician-observed autism spectrum disorder (ASD) symptoms in children with attention deficit/hyperactivity disorder (ADHD): implications for practice under DSM-5

    Molecular Aut.

    (2016)
  • Harutyunyan, H., Khachatrian, H., 2016. Combining CNN and RNN for spoken language identification. In:...
  • R.M. Hegde et al.

    Significance of the modified group delay feature in speech recognition

    IEEE Trans. Audio, Speech, Lang. Process.

    (2007)
  • R.M. Hegde et al.

    Application of the modified group delay function to speaker identification and discrimination

    Proceedings of ICASSP

    (2004)
  • G. Heigold et al.

    End-to-end text-dependent speaker verification

    Proceedings of ICASSP

    (2016)
  • M. Kumar et al.

    Objective language feature analysis in children with neurodevelopmental disorders during autism assessment

    Proceedings of Interspeech

    (2016)

☆ This paper has been recommended for acceptance by Prof. R. K. Moore.
