Development of Japanese infant speech database from longitudinal recordings

doi:10.1016/j.specom.2009.01.009

Speech Communication

Volume 51, Issue 6, June 2009, Pages 510-520

Abstract

Developmental research on speech production requires both a cross-sectional and a longitudinal speech database. Previous longitudinal speech databases are limited in terms of recording period or number of utterances. An infant speech database was developed from 5 years of recordings containing a large number of daily life utterances of five Japanese infants and their parents. The resulting database contains 269,467 utterances with various types of information including a transcription, an F0 value, and a phoneme label. This database can be used in future research on the development of speech production.

Introduction

Infant speech development can be studied using several approaches. One involves analyzing infant utterances acoustically to reveal the developmental changes that occur with age. For this acoustic analysis, the infant utterances can be collected by using either a cross-sectional or a longitudinal approach.

With the cross-sectional approach (e.g., Eguchi and Hirsh, 1969; Keating and Buhr, 1978; Kent and Murray, 1982; Robb and Saxman, 1985), utterances are collected from many infants at different ages. An advantage of this approach is that utterances can be obtained quickly. A disadvantage is that individual differences may introduce artifacts, because the utterances of infants at a particular age do not necessarily share the same acoustic characteristics; they may be affected by the developmental state of each infant’s speech organs and by each infant’s speaking ability. Consequently, results obtained with a cross-sectional approach might not accurately reflect the course of speech development (Bennett, 1983; Kent, 1976).

In contrast, with the longitudinal approach (e.g., Bennett, 1983; Fairbanks, 1942; Hollien et al., 1994; McRoberts and Best, 1997; Robb et al., 1989; Shepard and Lane, 1968), utterances are collected from the same infant at different ages. This approach usually employs only a few infants and sometimes only one (e.g., McRoberts and Best, 1997; Reissland, 1998). It is more robust against artifacts arising from individual differences than the cross-sectional approach, because it traces the developmental change in each individual infant. However, collecting utterances longitudinally takes a very long time, sometimes several years, so the approach is more costly in terms of time. Because of this difficulty, fewer studies have employed the longitudinal approach than the cross-sectional approach, although both approaches are important to research on speech development.

One way to overcome the time cost of the longitudinal approach is to develop an infant speech database and share it among researchers. The Child Language Data Exchange System (CHILDES) (MacWhinney, 2000) is a project based on this idea. CHILDES is a large-scale set of databases of infant speech in many languages, containing utterance files, their transcriptions, and other useful information about infant speech. Some of the databases in CHILDES also contain video files of infant utterances. CHILDES is being developed across different languages by the many researchers who contribute their databases.

There are four Japanese databases in CHILDES, namely the Noji, Ishii, Hamasaki, and Miyata databases.

The Noji database consists of data collected frequently from one infant between the ages of 0 and 7 years. However, this database only contains utterance transcriptions and does not provide utterance files, because it is based on Noji’s diary.

The other three databases provide utterance files and/or video files, although their collection periods are much shorter than that of Noji’s database. The Ishii database is based on one infant. Data were largely collected bimonthly. The collection periods were from 8 to 23 months and from 41 to 44 months. The Hamasaki database is also based on one infant. Data were collected two or three times per month between the ages of 26 and 43 months. The Miyata database is based on three infants. The data were collected from about 15 months to about 36 months. The Miyata database also provides video files.

The four databases described above are useful for analyzing the development of, for example, an infant’s vocabulary and syntactic rules. However, they might not prove suitable for a longitudinal acoustic analysis of infant utterances, because the collection periods are short, the number of utterances is not very large, and utterance files are not provided. These three factors make it difficult to observe longitudinal acoustic changes in infant utterances from birth to childhood.

Against this background, we first digitally recorded the utterances of Japanese infants and their parents over about 5 years. We then developed our infant speech database from these longitudinal recordings by extracting the utterances and annotating them with various pieces of information.

We placed particular emphasis on information obtained from acoustic analysis when we developed our infant speech database. One example is the start and end times of each utterance, which make it possible to analyze developmental changes in utterance duration, pause duration, utterance overlap, speaking rate, and utterance rhythm. Another example is the fundamental frequency (F0), which makes it possible to analyze developmental changes in pitch accent and intonation pattern. A database containing this information would lead to a better understanding of the developmental acoustic changes in the time and frequency domains that reflect the development of articulation skill and proficiency in language processing.
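
As a minimal sketch of how utterance start and end times could support such timing analyses, the Python example below computes utterance durations, pauses, and overlaps. The `Utterance` record, its field names, the speaker labels, and the sample times are hypothetical illustrations; they do not reflect the actual format of the database's time record file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    """Hypothetical record holding one utterance's timing information."""
    speaker: str   # assumed label, e.g. "infant" or "mother"
    start: float   # start time in seconds
    end: float     # end time in seconds

    @property
    def duration(self) -> float:
        # Utterance duration is simply end time minus start time.
        return self.end - self.start


def pauses_and_overlaps(utterances):
    """Return (pauses, overlaps) in seconds between neighbouring utterances.

    Simplification: after sorting by start time, each utterance is compared
    only with the one immediately before it.
    """
    ordered = sorted(utterances, key=lambda u: u.start)
    pauses, overlaps = [], []
    for prev, cur in zip(ordered, ordered[1:]):
        gap = cur.start - prev.end
        if gap >= 0:
            # The next utterance starts after the previous one ends: a pause.
            pauses.append(gap)
        else:
            # The next utterance starts before the previous one ends: overlap
            # lasts until whichever of the two utterances ends first.
            overlaps.append(min(prev.end, cur.end) - cur.start)
    return pauses, overlaps


# Made-up example times, not taken from the database.
utterances = [
    Utterance("mother", 0.0, 1.5),
    Utterance("infant", 1.0, 2.0),
    Utterance("mother", 3.0, 4.0),
]
durations = [u.duration for u in utterances]
pauses, overlaps = pauses_and_overlaps(utterances)
```

Tracking these quantities per recording month would be one way to follow the developmental changes in timing described above; speaking rate would additionally require a count of units such as morae per utterance, which this sketch does not model.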

Participants

Five infants [A(kk), B(sk), C(sa), D(ma), and E(mk)] and their parents participated voluntarily in the recording. They were all Japanese. All the infants were born and raised in Tokyo or in Kanagawa prefecture, which adjoins Tokyo. Infant gender, birth month, height, and weight are shown in Table 1. They had no symptoms of disorder with respect to speech perception or speech production. Infants B and E are siblings. Infants C and D are also siblings. Infant A has a brother who is 10 years older

Database development

An infant speech database was developed from the longitudinal recordings. The database consists of a session file, an utterance file, a transcription file, a property tag file, a time record file, a comment file, a fundamental frequency file, a voiced/unvoiced label file, and a phoneme label file. These files were stored in directories, which were classified by month and infant. HTML files were developed as links to these files. Fig. 1 is a schematic diagram of the database structure. The

Research applications

Our infant speech database has contributed to several pieces of developmental research. For example, Ishizuka et al. (2007) revealed developmental changes in the spectral peaks of vowels in an infant’s utterances. They found that the spectral peaks gradually diverge to form a set of Japanese vowels by 24 months of age. Amano et al. (2006) analyzed the developmental change of F0 in infants’ and parents’ utterances. They found that the infants’ F0 decreases almost constantly along with age in

Database release

Our infant speech database with its search software and a custom-made audio player is released by The Speech Resources Consortium (http://research.nii.ac.jp/src/eng/index.html) at a price of 85,500 yen. A waveform editor tuned to the database will be available from Arcadia Corporation (http://www.arcadia.co.jp/). This waveform editor is not included in the database price. The waveform editor automatically overlays the F0 data in the database onto a spectrogram. It also automatically overlays the

Conclusion

An infant speech database was developed from 5 years of recordings of utterances of five Japanese infants and their parents. This database contains a large number of utterances and their transcriptions, F0 values, and phoneme labels. It makes it possible to trace the speech development of a particular infant from birth until 5 years of age. It also makes it possible to trace the parents’ utterances addressed to an infant during this same period. Therefore, this database

References (22)

N. Reissland
The pitch of ‘real’ and ‘rhetorical’ questions directed by a father to his daughter: a longitudinal case study
Infant Behav. Develop.
(1998)
S. Amano et al.
Fundamental frequency of infants’ and parents’ utterances in longitudinal recordings
J. Acoust. Soc. Amer.
(2006)
S. Bennett
A 3-year longitudinal study of school-aged children’s fundamental frequencies
J. Speech Hear. Res.
(1983)
S. Eguchi et al.
Development of speech sounds in children
Acta Otolaryngol. (Suppl.)
(1969)
G. Fairbanks
An acoustical study of the pitch of infant hunger wails
Child Develop.
(1942)
H. Hollien et al.
Longitudinal research on adolescent voice change in males
J. Acoust. Soc. Amer.
(1994)
T. Inui et al.
A study of word and sentence acquisition: quantitative analysis of longitudinal data
Cognit. Stud.
(2003)
I. Ishizuka et al.
Longitudinal developmental changes in spectral peaks of vowels produced by Japanese infants
J. Acoust. Soc. Amer.
(2007)
S. Kajikawa et al.
Development of speech communication between mother and child
Pafoumansu Kyouiku
(2004)
S. Kajikawa et al.
Speech overlap in Japanese mother–child conversations
J. Child Language
(2004)
P. Keating et al.
Fundamental frequency in the speech of infants and children
J. Acoust. Soc. Amer.
(1978)

Cited by (7)

Relationship between oxytocin and maternal approach behaviors to infants’ vocalizations
2020, Comprehensive Psychoneuroendocrinology
Citation Excerpt :
To evaluate the degree of participants’ intent to approach or avoid the voice stimuli, we also asked them to rate their desire to pick up the baby (pick up) or leave the baby alone (ignore) using a VAS. Experimental voice stimuli were collected from the NTT infant voice database [33], which included voice samples recorded from 3- to 12-month-old infants. Two authors and one experiment cooperator independently labeled each voice stimulus as “crying,” “babbling,” or “laughing.”
Infants communicate their emotions to caregivers mainly through vocalizations. Research has shown that maternal oxytocin levels relate to adaptive parenting; however, little empirical research exists regarding the effects of endogenous oxytocin levels on maternal responses to infant vocalizations. Thus, in this study, we examined the relationship between mothers’ salivary oxytocin levels, subjective feelings, and behavioral response to infants’ emotional vocalizations. Additionally, we examined the relationship between psychological traits and maternal behavioral responses to infant vocalizations. In this study, 39 mothers were asked to stand on a balance board while listening to infant vocalization stimuli, to measure movements of their center of pressure, an index of approach-avoidance behavior. Sixty infant vocalizations (laughter, crying, and neutral) were presented for 6 s each. Afterwards, participants were asked to rate their subjective responses to each stimulus (not aroused – aroused, displeased – pleased, not urgent – urgent, and healthy – sick). Maternal oxytocin levels were negatively correlated with anterior movement of the center of pressure in response to infants’ crying and babbling vocalizations, though no relationship was found between maternal approach-avoidance behavior toward infant laughter and oxytocin levels. This study indicated that maternal approach behavior toward infant vocalizations varies as a function of maternal endogenous oxytocin and the type of infant vocalization.
Acquisition of vowel articulation in childhood investigated by acoustic-to-articulatory inversion
2017, Infant Behavior and Development
Citation Excerpt :
Especially, we analyzed longitudinal changes in combinations of multiple articulatory organs to show how flexible coordination of multiple articulatory organs develops. We used the NTT Japanese infant speech database (Amano et al., 2006, 2009; Ishizuka et al., 2007) for this study. This database contains the utterances of five normally developing children and their parents, recorded with 16-bit quantization at a sampling rate of 16 kHz.
While the acoustical features of speech sounds in children have been extensively studied, limited information is available as to their articulation during speech production. Instead of directly measuring articulatory movements, this study used an acoustic-to-articulatory inversion model with scalable vocal tract size to estimate developmental changes in articulatory state during vowel production. Using a pseudo-inverse Jacobian matrix of a model mapping seven articulatory parameters to acoustic ones, the formant frequencies of each vowel produced by three Japanese children over time at ages between 6 and 60 months were transformed into articulatory parameters. We conducted the discriminant analysis to reveal differences in articulatory states for production of each vowel. The analysis suggested that development of vowel production went through gradual functionalization of articulatory parameters. At 6–9 months, the coordination of position of tongue body and lip aperture forms three vowels: front, back, and central. At 10–17 months, recruitments of jaw and tongue apex enable differentiation of these three vowels into five. At 18 months and older, recruitment of tongue shape produces more distinct vowels specific to Japanese. These results suggest that the jaw and tongue apex contributed to speech production by young children regardless of kinds of vowel. Moreover, initial articulatory states for each vowel could be distinguished by the manner of coordination between lip and tongue, and these initial states are differentiated and refined into articulations adjusted to the native language over the course of development.
Learnability of prosodic boundaries: Is infant-directed speech easier?
2016, Journal of the Acoustical Society of America
Development of a serial order in speech constrained by articulatory coordination
2013, PLoS ONE
Evaluation of healthcare institutions for long-term preservation of electronic health records
2011, Communications in Computer and Information Science
Infant Speech Database for Longitudinal Analysis of Spoken Language Development
2010, DiSS-LPSS Joint Workshop 2010 - The 5th Workshop on Disfluency in Spontaneous Speech and the 2nd International Symposium on Linguistic Patterns in Spontaneous Speech

^☆: Parts of this research were presented at the 7th International Conference on Spoken Language Processing, Denver, CO, September 16–20, 2002.

¹: Now at NTT Advanced Technology Corporation.
