Speech Communication

Volume 50, Issue 10, October 2008, Pages 847-862

Recovering capitalization and punctuation marks for automatic speech recognition: Case study for Portuguese broadcast news

https://doi.org/10.1016/j.specom.2008.05.008

Abstract

This paper presents a study on recovering punctuation marks and capitalization information from European Portuguese broadcast news speech transcriptions. Different approaches were tested for capitalization, both generative and discriminative, using: finite state transducers automatically built from language models; and maximum entropy models. Several resources were used, including lexica, written newspaper corpora, and speech transcriptions. Finite state transducers produced the best results for written newspaper corpora, but the maximum entropy approach also proved to be a good choice, suitable for the capitalization of speech transcriptions and allowing straightforward on-the-fly capitalization. Evaluation results are presented both for written newspaper corpora and for broadcast news speech transcriptions. The frequency of each punctuation mark in broadcast news speech transcriptions was analyzed for three different languages: English, Spanish, and Portuguese. The punctuation task was performed using a maximum entropy modeling approach, which combines different types of information, both lexical and acoustic. The contribution of each feature was analyzed individually, and separate results are given for each focus condition, making it possible to analyze the performance differences between planned and spontaneous speech. All results were evaluated on speech transcriptions of a Portuguese broadcast news corpus. The benefits of enriching speech recognition with punctuation and capitalization are shown in an example illustrating the effects of the described experiments on spoken texts.

Introduction

Enormous quantities of digital audio and video data are produced daily by TV stations, radio, and other media organizations. Automatic speech recognition (ASR) systems can now be applied to such sources of information in order to enrich them with additional information for applications such as indexing, cataloging, subtitling, translation, and multimedia content production. Automatic speech recognition output consists of raw text, often in lower-case format and without any punctuation. Even if this output is useful for many applications, such as indexing and cataloging, other tasks, such as subtitling and multimedia content production, benefit from correct punctuation and capitalization. In general, enriching the speech output aims to improve legibility, enhancing the information for future human and machine processing. Apart from the insertion of punctuation marks and capitalization, enriching speech recognition covers other activities, such as the detection and filtering of disfluencies, which are not addressed in this paper.

Depending on the application, punctuation and capitalization tasks may be required to work online. For example, on-the-fly subtitling for oral presentations or TV shows demands a very small delay between speech production and the corresponding transcription. In these systems, both the computational delay and the number of words to the right of the current word required to make a decision are important aspects to take into consideration. One of the goals behind this work is to build a module for integration into an on-the-fly subtitling system, and a number of choices were made with this purpose in mind; for example, all subsequent experiments avoid a right context longer than two words when making a decision.
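As an illustration of this constraint, the following minimal sketch bounds the look-ahead of a streaming annotator to two words; the `classify` callable and its signature are hypothetical placeholders rather than the module described in this paper.

```python
from collections import deque

def stream_annotate(words, classify, right_context=2):
    """Annotate a word stream while bounding look-ahead: the decision
    for each word is delayed only until `right_context` following
    words are available. `classify(word, right_words)` is a
    hypothetical stand-in for a punctuation/capitalization module."""
    buffer = deque()
    for word in words:
        buffer.append(word)
        # Decide on the oldest buffered word once enough context exists.
        if len(buffer) > right_context:
            yield classify(buffer.popleft(), list(buffer))
    # End of stream: flush with whatever right context remains.
    while buffer:
        yield classify(buffer.popleft(), list(buffer))
```

Each word thus incurs a fixed delay of at most `right_context` words plus the classifier's computation time, which is the quantity that matters for live subtitling.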

This paper describes a set of experiments concerning punctuation and capitalization recovery for spoken texts, providing the first joint evaluation results of these two tasks on Portuguese broadcast news. The remainder of this section describes related work, both on capitalization and on punctuation. Section 2 describes the performance measures used for evaluation. Section 3 describes the main corpus and other resources. Section 4 is centered on the capitalization task, presenting the multiple methodologies employed and the results achieved. Section 5 focuses on the punctuation task, describing how the corpus was processed, the feature set used by the maximum entropy approach, and the results concerning punctuation insertion. Section 6 presents a concrete example, showing the benefits of punctuation and capitalization over spoken texts. Sections 7 and 8 present some concluding remarks and address future work.

The capitalization task, also known as truecasing (Lita et al., 2003), consists of rewriting each word of an input text with its proper case information, given its context. Different practical applications benefit from automatic capitalization as a preprocessing step: many computer applications, such as word processors and e-mail clients, perform automatic capitalization along with spelling correction and grammar checking; and when dealing with speech recognition output, automatic capitalization provides relevant information for automatic content extraction, named entity recognition, and machine translation.

Capitalization can be viewed as a lexical ambiguity resolution problem, where each word has different graphical forms. Yarowsky (1994) presents a statistical procedure for lexical ambiguity resolution, based on decision lists, that achieved good results when applied to accent restoration in Spanish and French. The capitalization and accent restoration problems can be treated using the same methods, given that a different accentuation can be regarded as a different word form. Mikheev (1999, 2002) also presents an approach to the disambiguation of capitalized common words, but only in positions where capitalization is expected, such as the first word of a sentence or after a period.
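As a rough sketch of the decision-list idea, the snippet below applies the single most reliable matching rule; the rule format, scores, and the Portuguese example are invented for illustration and do not reproduce Yarowsky's training procedure.

```python
def decision_list_classify(context_features, rules, default):
    """Yarowsky-style decision list: apply the label of the
    highest-scoring rule whose feature matches the context;
    otherwise back off to a default form."""
    for feature, label, _score in sorted(rules, key=lambda r: -r[2]):
        if feature in context_features:
            return label
    return default

# Hypothetical rules for choosing between "banco" and "Banco":
rules = [("next=central", "Banco", 4.2), ("prev=um", "banco", 3.1)]
print(decision_list_classify({"next=central", "prev=o"}, rules, "banco"))
# -> Banco
```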

The capitalization problem may also be seen as a sequence tagging problem (Chelba and Acero, 2004; Lita et al., 2003; Kim and Woodland, 2004), where each lower-case word is associated with a tag that describes its capitalization form. Chelba and Acero (2004) study the impact of using increasing amounts of training data, as well as a small amount of adaptation data. Their work uses a maximum entropy Markov model (MEMM) based approach, which makes it possible to combine different features. A large written newspaper corpus is used for training, and the test data consists of broadcast news data. Lita et al. (2003) build a trigram language model (LM) with (word, tag) pairs, estimated from a corpus with case information, and then use dynamic programming to disambiguate over all possible tag assignments for a sentence. A preparatory study on the capitalization of Portuguese broadcast news was performed by Batista et al. (2007b).
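The dynamic-programming search over tag assignments can be sketched as standard Viterbi decoding. In the snippet below, `log_prob` is a hypothetical scoring function conditioned only on the previous tag, i.e. a bigram simplification of the (word, tag) trigram LM of Lita et al. (2003).

```python
TAGS = ["lower", "first_cap", "all_upper"]  # graphical forms of a word

def viterbi_capitalize(words, log_prob):
    """Find the best capitalization-tag sequence for a sentence by
    dynamic programming. `log_prob(prev_tag, tag, word)` is a
    hypothetical scorer standing in for a case-sensitive LM."""
    # For each tag of the current word: (best score, best path so far).
    best = {t: (log_prob(None, t, words[0]), [t]) for t in TAGS}
    for word in words[1:]:
        new_best = {}
        for t in TAGS:
            score, path = max(
                ((s + log_prob(p, t, word), hist) for p, (s, hist) in best.items()),
                key=lambda x: x[0],
            )
            new_best[t] = (score, path + [t])
        best = new_best
    return max(best.values(), key=lambda x: x[0])[1]
```

A real system would estimate `log_prob` from a corpus with case information and keep the full trigram history.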

Other related work includes a bilingual capitalization model for capitalizing machine translation (MT) output using conditional random fields (CRFs), reported by Wang et al. (2006). This work exploits case information from both the source and target sentences of the MT system, producing better performance than a baseline capitalizer that uses a trigram language model.

Spoken language is similar to written text in many respects, but differs in many others, mostly due to the way these two forms of communication are produced. Current ASR systems focus on minimizing the WER (word error rate), making no attempt to detect the structural information that is available in written texts. Spoken language is also typically less organized than textual material, making it a challenge to bridge the gap between spoken and written material. The insertion of punctuation marks into spoken texts is a way of bringing such texts closer to written ones, even if a given punctuation mark may assume a slightly different behavior in speech. For example, a sentence in spontaneous speech does not always correspond to a sentence in written text.

A large number of punctuation marks can be considered for spoken texts, including: comma; period or full stop; exclamation mark; question mark; colon; semicolon; and quotation marks. However, most of these marks rarely occur and are quite difficult to insert or evaluate. Hence, most of the available studies focus either on the full stop or on the comma, which have higher corpus frequencies. Previous work on other punctuation marks, such as the question mark and the exclamation mark, has not shown promising results (Christensen et al., 2001).

The comma is the most frequent punctuation mark, but it is also the most problematic, because it serves many different purposes. It can be used to: introduce a word, phrase, or construction; separate long independent constructions; separate words within a sentence; separate elements in a series; separate thousands, millions, etc. in a number; and also prevent misreading. Beeferman et al. (1998) describe a lightweight method for automatically inserting intra-sentence punctuation marks into text. This method relies on a trigram LM built solely from lexical information and uses the Viterbi algorithm for classification. The paper focuses on the comma and presents a qualitative evaluation based on user satisfaction, concluding that the system performance is qualitatively higher than the sentence accuracy rate would indicate.

When dealing with conversational speech, the notion of utterance (Jurafsky and Martin, 2000) or sentence-like unit (SU) (Strassel, 2004) is often used instead of “sentence”. An SU may correspond to a grammatical sentence, or it can be semantically complete but smaller than a sentence. Detecting an SU consists of finding its boundaries, which roughly corresponds to the task of detecting the period or full stop in conversational speech. SU boundary detection has gained increasing attention in recent years, and it has been part of the NIST rich transcription evaluations. It provides a basis for further natural language processing, and its impact on subsequent tasks has recently been analyzed in many speech processing studies (Harper et al., 2005; Mrozinsk et al., 2006).

The work conducted by Kim and Woodland (2001) and Christensen et al. (2001) uses a general HMM framework that allows the combination of lexical and prosodic cues for recovering punctuation marks. A similar approach was also used by Gotoh and Renals (2000) and Shriberg et al. (2000) for detecting sentence boundaries. Another approach, based on a maximum entropy model, was developed by Huang and Zweig (2002) to recover punctuation in the Switchboard corpus using textual cues. Different modeling approaches combining prosodic and textual features have also been investigated recently by other authors, such as Liu et al. (2006) for sentence boundary detection and Batista et al. (2007a) for punctuation recovery on Portuguese broadcast news.
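Since a maximum entropy classifier is equivalent to (multinomial) logistic regression, such a punctuation model can be sketched with off-the-shelf tools. The snippet below uses scikit-learn as a stand-in; the feature names and toy training events are invented for illustration and do not reproduce any of the cited feature sets.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training events, one per word boundary, mixing lexical features
# (word identities) with an acoustic one (pause duration in seconds).
events = [
    {"word": "obrigado", "next_word": "boa", "pause_dur": 0.61},
    {"word": "governo", "next_word": "de", "pause_dur": 0.02},
]
labels = ["full_stop", "none"]

# DictVectorizer one-hot encodes string features and passes numeric
# ones through; LogisticRegression plays the maximum entropy model.
model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit(events, labels)

print(model.predict([{"word": "sim", "next_word": "a", "pause_dur": 0.55}]))
```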


Performance measures

The following well-known performance measures are used for the punctuation and capitalization tasks: precision, recall, and slot error rate (SER) (Makhoul et al., 1999), defined in Eqs. (1)-(3). For the punctuation task, a slot corresponds to the occurrence of a punctuation mark in the corpus. For the capitalization task, a slot corresponds to any word not written in lower-case form.

$$\text{Precision} = \frac{C}{H} = \frac{C}{C+S+I} \qquad (1)$$

$$\text{Recall} = \frac{C}{R} = \frac{C}{C+S+D} \qquad (2)$$

$$\text{SER} = \frac{\text{total slot errors}}{R} = \frac{I+D+S}{C+S+D} \qquad (3)$$

In the equations, C is the number of correct slots, S the number of substitutions, I the number of insertions, and D the number of deletions; H = C+S+I is the total number of slots in the hypothesis, and R = C+S+D is the total number of slots in the reference.
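These definitions translate directly into code. The following sketch computes the three measures from slot counts; the function name and the example counts are illustrative only.

```python
def slot_scores(C, S, I, D):
    """Precision, recall, and slot error rate from slot counts,
    following Makhoul et al. (1999): C correct, S substitutions,
    I insertions, D deletions; H = C+S+I hypothesis slots and
    R = C+S+D reference slots."""
    precision = C / (C + S + I)
    recall = C / (C + S + D)
    ser = (I + D + S) / (C + S + D)
    return precision, recall, ser

# Example: 80 correct, 10 substituted, 5 inserted, 15 deleted slots.
print(slot_scores(80, 10, 5, 15))  # ≈ (0.842, 0.762, 0.286)
```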

Information sources

Both the capitalization and punctuation tasks described here share the same spoken corpus; however, for the capitalization task, other information sources were also used, including a written newspaper corpus and two small lexica containing case information. The following subsections provide more details about each of the data sources.

Capitalization task

The present study explores three ways of writing a word: lower-case, all-upper, and first-capitalized, not covering mixed-case words such as “McDonald’s” and “SuSE”.
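A minimal sketch of this three-way classification of a token's graphical form, assuming simple Unicode case tests (mixed-case words, which the study does not cover, default to lower-case here):

```python
def graphical_form(word):
    """Classify a token into the three forms used in this study;
    mixed-case words such as "McDonald's" fall outside these classes
    and are mapped to "lower" in this simplified sketch."""
    if word.isupper():
        return "all_upper"
    if word[:1].isupper() and word[1:].islower():
        return "first_cap"
    return "lower"

print(graphical_form("NATO"), graphical_form("Lisboa"), graphical_form("casa"))
# -> all_upper first_cap lower
```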

The experiments were conducted both on written newspaper corpora and on spoken transcriptions, making it possible to analyze the impact of the different methodologies on these two types of data. The written newspaper corpus, lexica, and spoken transcriptions were combined in order to provide richer training sets and reduce the problem

Punctuation task

In order to better understand the usage of each punctuation mark, their occurrences were counted in written newspaper corpora, using RecPUB and published statistics from the WSJ. Results are shown in Table 10, revealing that the comma is the most common punctuation mark in Portuguese written corpora. The full stop frequency is lower for Portuguese, suggesting that written Portuguese tends to use longer sentences than English.
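Such frequency counts can be obtained with a few lines of code. The sketch below assumes a corpus already tokenized with punctuation marks as separate tokens; the token list is an invented example.

```python
from collections import Counter

PUNCTUATION = {".", ",", "?", "!", ":", ";"}

def punctuation_frequencies(tokens):
    """Relative frequency of each punctuation mark in a tokenized corpus."""
    counts = Counter(t for t in tokens if t in PUNCTUATION)
    total = sum(counts.values())
    return {mark: counts[mark] / total for mark in sorted(counts)}

tokens = "ontem , o governo anunciou , finalmente , a medida .".split()
print(punctuation_frequencies(tokens))  # {',': 0.75, '.': 0.25}
```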

An equivalent study was also performed in Europarl (Koehn,

Concrete example

Fig. 11 shows an example of text extracted from an automatic speech transcription, where the first word of each sentence is marked in bold for easier identification of sentence beginnings. The text at the top is split into sentences according to the segmentation proposed by the APP/ASR module. The text in the middle was automatically enriched with full stops and capitalization information, and the segmentation was performed according to the full stop predictions. The first word of each

Concluding remarks

This paper addresses two tasks that contribute to the enrichment of the output of an ASR system.

Concerning the capitalization task, three different methods were described and results were presented, both for manual transcriptions of speech and for written newspaper corpora. The experiments show that the speech recognition corpus used is too small to cover much of the vocabulary. Another conclusion is that manually built lexica can help enhance the results when the training dataset is

Future work

For the capitalization task, only three ways of writing a word were explored: lower-case, all-upper, and first-capitalized, not covering mixed-case words such as McDonald’s and SuSE. These words are now being addressed by a small lexicon, but no evaluation has been performed so far to assess the performance improvement.

The training and test corpora used in these experiments consisted of manually corrected and annotated speech transcriptions. A strategy must be defined in order to perform the

Acknowledgements

This paper has been partially funded by the FCT Projects LECTRA (POSC/PLP/58697/2004) and DIGA (POSI/PLP/41319/2001), and by the PRIME National Project TECNOVOZ number 03/165. INESC-ID Lisboa had support from the POSI program of the “Quadro Comunitário de Apoio III”.

References (36)

  • Kim, J.-H., Woodland, P.C., 2004. Automatic capitalisation generation for speech input. Comput. Speech Lang.
  • Liu, Y., et al., 2006. Enriching speech recognition with automatic detection of sentence boundaries and disfluencies. IEEE Trans. Audio, Speech Lang. Process.
  • Shriberg, E., et al., 2000. Prosody-based automatic segmentation of speech into sentences and topics. Speech Commun.
  • Amaral, R., Meinedo, H., Caseiro, D., Trancoso, I., Neto, J.P., 2007. A prototype system for selective dissemination of...
  • Batista, F., Caseiro, D., Mamede, N.J., Trancoso, I., 2007a. Recovering punctuation marks for automatic speech...
  • Batista, F., Mamede, N.J., Caseiro, D., Trancoso, I., 2007b. A lightweight on-the-fly capitalization system for...
  • Beeferman, D., et al., 1998. Cyberpunc: a lightweight punctuation annotation system for speech. In: Proc. IEEE ICASSP.
  • Berger, A.L., et al., 1996. A maximum entropy approach to natural language processing. Comput. Linguist.
  • Chelba, C., Acero, A., 2004. Adaptation of maximum entropy capitalizer: little data can help a lot. In:...
  • Christensen, H., Gotoh, Y., Renals, S., 2001. Punctuation annotation using statistical prosody models. In: Proc. ISCA...
  • Collins, M., Singer, Y., 1999. Unsupervised models for named entity classification. In: Proc. Joint SIGDAT Conf. on...
  • Daumé III, H., 2004. Notes on CG and LM-BFGS optimization of logistic regression,...
  • Gotoh, Y., Renals, S., 2000. Sentence boundary detection in broadcast speech transcripts. In: Proc. ISCA Workshop:...
  • Harper, M., Dorr, B., Hale, J., Roark, B., Shafran, I., Lease, M., Liu, Y., Snover, M., Yung, L., Krasnyanskaya, A.,...
  • Huang, J., Zweig, G., 2002. Maximum entropy model for punctuation annotation from speech. In: Proc. ICSLP, pp....
  • Jurafsky, D., Martin, J.H., 2000. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition.
  • Kim, J., Woodland, P.C., 2001. The use of prosody in a combined system for punctuation generation and speech...
  • Koehn, P., 2005. Europarl: a parallel corpus for statistical machine translation. In: MT Summit...