Elsevier

Speech Communication

Volume 51, Issue 6, June 2009, Pages 510-520

Development of Japanese infant speech database from longitudinal recordings

https://doi.org/10.1016/j.specom.2009.01.009

Abstract

Developmental research on speech production requires both a cross-sectional and a longitudinal speech database. Previous longitudinal speech databases are limited in terms of recording period or number of utterances. An infant speech database was developed from 5 years of recordings containing a large number of daily life utterances of five Japanese infants and their parents. The resulting database contains 269,467 utterances with various types of information including a transcription, an F0 value, and a phoneme label. This database can be used in future research on the development of speech production.

Introduction

Infant speech development can be studied using several approaches. One involves analyzing infant utterances acoustically to reveal the developmental changes that occur with age. For this acoustic analysis, the infant utterances can be collected by using either a cross-sectional or a longitudinal approach.

With the cross-sectional approach (e.g., Eguchi and Hirsh, 1969, Keating and Buhr, 1978, Kent and Murray, 1982, Robb and Saxman, 1985), infant utterances are collected from many infants at different ages. An advantage of this approach is that utterances can be collected quickly. Its disadvantage is that individual differences may introduce artifacts, because the utterances of infants of the same age do not necessarily share the same acoustic characteristics; they may be affected by the developmental state of each infant’s speech organs and by each infant’s speaking ability. Results obtained with a cross-sectional approach therefore might not accurately reflect the course of speech development (Bennett, 1983, Kent, 1976).

In contrast, with the longitudinal approach, utterances are collected from the same infant at different ages (e.g., Bennett, 1983, Fairbanks, 1942, Hollien et al., 1994, McRoberts and Best, 1997, Robb et al., 1989, Shepard and Lane, 1968). The longitudinal approach usually employs only a few infants and sometimes only one (e.g., McRoberts and Best, 1997, Reissland, 1998). It is more robust against artifacts arising from individual differences than the cross-sectional approach, because it traces the developmental change within each infant. However, it takes a very long time to obtain infant utterances with the longitudinal approach, sometimes several years, so it is far more time-consuming than the cross-sectional approach. Because of this difficulty, fewer studies have employed the longitudinal approach than the cross-sectional approach. Nevertheless, both approaches are important for research on speech development.
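
The difference between the two approaches can be made concrete with a toy calculation. The Python sketch below assumes three hypothetical infants whose F0 falls at the same rate but starts from different individual baselines. All names and numbers are invented for illustration and are not taken from any study cited here.

```python
# Toy model (all values invented): each infant has its own baseline F0,
# and every infant's F0 falls by the same amount per year.
DECLINE_HZ_PER_YEAR = 20.0
BASELINE_HZ = {"A": 400.0, "B": 440.0, "C": 380.0}
AGES = (1, 2, 3)  # ages in years at which measurements are taken


def f0(infant: str, age_years: int) -> float:
    """F0 of a hypothetical infant at a given age under the toy model."""
    return BASELINE_HZ[infant] - DECLINE_HZ_PER_YEAR * age_years


def yearly_changes(values):
    """Differences between consecutive yearly measurements."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


# Cross-sectional: a different infant is measured at each age.
cross_sectional = yearly_changes(
    [f0(name, age) for name, age in zip("ABC", AGES)]
)

# Longitudinal: the same infant is measured at every age.
longitudinal = {
    name: yearly_changes([f0(name, age) for age in AGES])
    for name in BASELINE_HZ
}

print(cross_sectional)  # [20.0, -80.0]
print(longitudinal)     # every infant: [-20.0, -20.0]
```

In this toy case the cross-sectional differences (+20 Hz, then -80 Hz) reflect the baseline differences between infants rather than the shared 20 Hz decline, whereas every within-infant difference recovers it. A real cross-sectional study averages over many infants per age group, which reduces this effect but does not remove it.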

One way to overcome this problem with the longitudinal approach is to develop an infant speech database and share it among researchers. The Child Language Data Exchange System (CHILDES) (MacWhinney, 2000) is a project based on this idea. CHILDES is a large-scale collection of infant speech databases in many languages, containing utterance files, their transcriptions, and other useful information about infant speech. Some of the databases in CHILDES also contain video files of infant utterances. CHILDES continues to expand across languages as researchers contribute their databases.

There are four Japanese databases in CHILDES, namely the Noji, Ishii, Hamasaki, and Miyata databases.

The Noji database consists of data collected frequently from one infant between the ages of 0 and 7 years. However, this database only contains utterance transcriptions and does not provide utterance files, because it is based on Noji’s diary.

The other three databases provide utterance files and/or video files, although their collection periods are much shorter than that of Noji’s database. The Ishii database is based on one infant. Data were largely collected bimonthly. The collection periods were from 8 to 23 months and from 41 to 44 months. The Hamasaki database is also based on one infant. Data were collected two or three times per month between the ages of 26 and 43 months. The Miyata database is based on three infants. The data were collected from about 15 months to about 36 months. The Miyata database also provides video files.

The four databases described above are useful for analyzing the development of, for example, an infant’s vocabulary and syntactic rules. However, they might not prove suitable for a longitudinal acoustic analysis of infant utterances, because in each of them the collection period is short, the number of utterances is not very large, or utterance files are not provided. These three factors make it difficult to observe longitudinal acoustic changes in infant utterances from birth to childhood.

Against this background, we first digitally recorded the utterances of Japanese infants and their parents over about 5 years. We then developed our infant speech database from these longitudinal recordings by extracting utterances and annotating them with various types of information.

In developing our infant speech database, we placed particular emphasis on information obtained from acoustic analysis. One example is the start and end times of an utterance, which make it possible to analyze developmental changes in utterance duration, pause duration, utterance overlap, speaking rate, and utterance rhythm. Another example is the fundamental frequency (F0), with which it is possible to analyze developmental changes in pitch accent and intonation patterns. A database containing this information would lead to a better understanding of the developmental acoustic changes in the time and frequency domains that reflect the development of articulatory skill and proficiency in language processing.
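
As a rough illustration of how start and end times support such analyses, the following Python sketch derives utterance durations, pause durations, and overlap durations from a list of time-stamped utterances. The `Utterance` class, its field names, the speaker labels, and the example times are assumptions made for illustration; they do not describe the actual format of the time records in the database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    """Hypothetical record: one utterance with its time stamps in seconds."""
    speaker: str   # assumed label, e.g. "infant" or "mother"
    start: float   # start time within the session
    end: float     # end time within the session

    @property
    def duration(self) -> float:
        return self.end - self.start


def pauses_and_overlaps(utterances):
    """Return (pauses, overlaps) in seconds for a single session.

    Utterances are visited in order of start time. A pause is the silent gap
    between the latest end time seen so far and the next start time; an
    overlap is the stretch during which the next utterance runs concurrently
    with earlier speech.
    """
    ordered = sorted(utterances, key=lambda u: u.start)
    pauses, overlaps = [], []
    if not ordered:
        return pauses, overlaps
    latest_end = ordered[0].end
    for cur in ordered[1:]:
        gap = cur.start - latest_end
        if gap >= 0:
            pauses.append(gap)
        else:
            overlaps.append(min(latest_end, cur.end) - cur.start)
        latest_end = max(latest_end, cur.end)
    return pauses, overlaps


# Invented example session: the second and third utterances overlap.
session = [
    Utterance("mother", 0.0, 1.5),
    Utterance("infant", 2.0, 2.75),
    Utterance("mother", 2.5, 4.0),
]

pauses, overlaps = pauses_and_overlaps(session)
print([u.duration for u in session])  # [1.5, 0.75, 1.5]
print(pauses, overlaps)               # [0.5] [0.25]
```

Speaking rate could be derived in the same way by dividing a count of linguistic units, such as morae taken from the transcription, by each utterance’s duration; that step is omitted here because it depends on how the transcription is segmented.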

Section snippets

Participants

Five infants [A(kk), B(sk), C(sa), D(ma), and E(mk)] and their parents participated voluntarily in the recording. They were all Japanese. All the infants were born and raised in Tokyo or in Kanagawa prefecture, which adjoins Tokyo. Infant gender, birth month, height, and weight are shown in Table 1. They had no symptoms of disorder with respect to speech perception or speech production. Infants B and E are siblings. Infants C and D are also siblings. Infant A has a brother who is 10 years older

Database development

An infant speech database was developed from the longitudinal recordings. The database consists of a session file, an utterance file, a transcription file, a property tag file, a time record file, a comment file, a fundamental frequency file, a voiced/unvoiced label file, and a phoneme label file. These files were stored in directories, which were classified by month and infant. HTML files were developed as links to these files. Fig. 1 is a schematic diagram of the database structure. The

Research applications

Our infant speech database has contributed to several pieces of developmental research. For example, Ishizuka et al. (2007) revealed developmental changes in the spectral peaks of vowels in an infant’s utterances. They found that the spectral peaks gradually diverge to form a set of Japanese vowels by 24 months of age. Amano et al. (2006) analyzed the developmental change of F0 in infants’ and parents’ utterances. They found that the infants’ F0 decreases almost constantly along with age in

Database release

Our infant speech database with its search software and a custom-made audio player is released by The Speech Resources Consortium (http://research.nii.ac.jp/src/eng/index.html) at a price of 85,500 yen. A waveform editor tuned to the database will be available from Arcadia Corporation (http://www.arcadia.co.jp/). This waveform editor is not included in the database price. The waveform editor automatically overlays the F0 data in the database onto a spectrogram. It also automatically overlays the

Conclusion

An infant speech database was developed from 5 years of recordings of utterances of five Japanese infants and their parents. This database contains a large number of utterances and their transcriptions, F0 values, and phoneme labels. It makes it possible to trace the speech development of a particular infant from birth until 5 years of age. It also makes it possible to trace parents’ utterances addressed to an infant during this same period. Therefore, this database

References (22)

  • N. Reissland, The pitch of ‘real’ and ‘rhetorical’ questions directed by a father to his daughter: a longitudinal case study, Infant Behav. Develop. (1998)
  • S. Amano et al., Fundamental frequency of infants’ and parents’ utterances in longitudinal recordings, J. Acoust. Soc. Amer. (2006)
  • S. Bennett, A 3-year longitudinal study of school-aged children’s fundamental frequencies, J. Speech Hear. Res. (1983)
  • S. Eguchi et al., Development of speech sounds in children, Acta Otolaryngol. (Suppl.) (1969)
  • G. Fairbanks, An acoustical study of the pitch of infant hunger wails, Child Develop. (1942)
  • H. Hollien et al., Longitudinal research on adolescent voice change in males, J. Acoust. Soc. Amer. (1994)
  • T. Inui et al., A study of word and sentence acquisition: quantitative analysis of longitudinal data, Cognit. Stud. (2003)
  • I. Ishizuka et al., Longitudinal developmental changes in spectral peaks of vowels produced by Japanese infants, J. Acoust. Soc. Amer. (2007)
  • S. Kajikawa et al., Development of speech communication between mother and child, Pafoumansu Kyouiku (2004)
  • S. Kajikawa et al., Speech overlap in Japanese mother–child conversations, J. Child Language (2004)
  • P. Keating et al., Fundamental frequency in the speech of infants and children, J. Acoust. Soc. Amer. (1978)

Cited by (7)

    • Relationship between oxytocin and maternal approach behaviors to infants’ vocalizations

      2020, Comprehensive Psychoneuroendocrinology
      Citation Excerpt:

      To evaluate the degree of participants’ intent to approach or avoid the voice stimuli, we also asked them to rate their desire to pick up the baby (pick up) or leave the baby alone (ignore) using a VAS. Experimental voice stimuli were collected from the NTT infant voice database [33], which included voice samples recorded from 3- to 12-month-old infants. Two authors and one experiment cooperator independently labeled each voice stimulus as “crying,” “babbling,” or “laughing.”

    • Acquisition of vowel articulation in childhood investigated by acoustic-to-articulatory inversion

      2017, Infant Behavior and Development
      Citation Excerpt:

      Especially, we analyzed longitudinal changes in combinations of multiple articulatory organs to show how flexible coordination of multiple articulatory organs develops. We used the NTT Japanese infant speech database (Amano et al., 2006, 2009; Ishizuka et al., 2007) for this study. This database contains the utterances of five normally developing children and their parents, recorded with 16-bit quantization at a sampling rate of 16 kHz.

    • Learnability of prosodic boundaries: Is infant-directed speech easier?

      2016, Journal of the Acoustical Society of America
    • Infant Speech Database for Longitudinal Analysis of Spoken Language Development

      2010, DiSS-LPSS Joint Workshop 2010 - The 5th Workshop on Disfluency in Spontaneous Speech and the 2nd International Symposium on Linguistic Patterns in Spontaneous Speech

    Parts of this research were presented at the 7th International Conference on Spoken Language Processing, Denver, CO, September 16–20, 2002.

    1 Now at NTT Advanced Technology Corporation.
