Recovering capitalization and punctuation marks for automatic speech recognition: Case study for Portuguese broadcast news
Introduction
Enormous quantities of digital audio and video data are produced daily by television and radio stations and other media organizations. Automatic speech recognition (ASR) systems can now be applied to such sources of information in order to enrich them with additional information for applications such as indexing, cataloging, subtitling, translation, and multimedia content production. ASR output consists of raw text, often in lower-case format and without any punctuation. While such output is useful for applications like indexing and cataloging, other tasks, such as subtitling and multimedia content production, benefit from correct punctuation and capitalization. In general, enriching the speech recognition output aims to improve legibility, enhancing the information for subsequent human and machine processing. Apart from the insertion of punctuation marks and capitalization, enriching speech recognition output covers other activities, such as the detection and filtering of disfluencies, which are not addressed in this paper.
Depending on the application, punctuation and capitalization modules may be required to work online. For example, on-the-fly subtitling of oral presentations or TV shows demands a very small delay between speech production and the corresponding transcription. In such systems, both the computational delay and the number of words to the right of the current word that are required to make a decision are important aspects to take into consideration. One of the goals behind this work is to build a module for integration into an on-the-fly subtitling system, and a number of choices were made with this purpose in mind; for example, all subsequent experiments avoid a right context longer than two words when making a decision.
This paper describes a set of experiments on punctuation and capitalization recovery for spoken texts, providing the first joint evaluation of these two tasks on Portuguese broadcast news. The remainder of this section describes related work on both capitalization and punctuation. Section 2 describes the performance measures used for evaluation. Section 3 describes the main corpus and other resources. Section 4 is centered on the capitalization task, presenting the methodologies employed and the results achieved. Section 5 focuses on the punctuation task, describing how the corpus was processed, the feature set used by the maximum entropy approach, and the results of punctuation insertion. Section 6 presents a concrete example, showing the benefits of punctuation and capitalization for spoken texts. Sections 7 and 8 present some concluding remarks and address future work.
The capitalization task, also known as truecasing (Lita et al., 2003), consists of rewriting each word of an input text with its proper case information, given its context. Several practical applications benefit from automatic capitalization as a preprocessing step: many computer programs, such as word processors and e-mail clients, perform automatic capitalization along with spelling and grammar checking; for speech recognition output, automatic capitalization provides relevant information for automatic content extraction, named entity recognition, and machine translation.
Capitalization can be viewed as a lexical ambiguity resolution problem, where each word has several graphical forms. Yarowsky (1994) presents a statistical procedure for lexical ambiguity resolution, based on decision lists, that achieved good results when applied to accent restoration in Spanish and French. The capitalization and accent restoration problems can be treated with the same methods, given that a different accentuation can be regarded as a different word form. Mikheev (1999, 2002) also presents an approach to the disambiguation of capitalized common words, but only in positions where capitalization is expected, such as the first word of a sentence or after a period.
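To make the decision-list idea concrete, the following Python sketch chooses a graphical form for each word by firing the highest-scoring matching rule; the rules and scores are invented for illustration and merely stand in for Yarowsky's features and corpus-estimated log-likelihood ratios:

```python
# A toy Yarowsky-style decision list for capitalization, treated as
# lexical ambiguity resolution. Each rule is (predicate, form, score);
# rules are tried in decreasing score order and the first match wins.
# Rules and scores here are invented for illustration only.

RULES = [
    (lambda w, prev: prev == "mr.", "first-cap", 5.1),
    (lambda w, prev: w == "nato", "all-upper", 4.8),
    (lambda w, prev: prev in (None, ".", "?", "!"), "first-cap", 3.2),
]

def apply_form(word, form):
    return {"lower": word, "first-cap": word.capitalize(),
            "all-upper": word.upper()}[form]

def capitalize(tokens):
    out, prev = [], None
    for w in tokens:
        form = "lower"  # default decision when no rule fires
        for pred, f, _score in sorted(RULES, key=lambda r: -r[2]):
            if pred(w, prev):
                form = f
                break
        out.append(apply_form(w, form))
        prev = w
    return out

print(capitalize("mr. smith joined nato yesterday .".split()))
# -> ['Mr.', 'Smith', 'joined', 'NATO', 'yesterday', '.']
```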
The capitalization problem may also be seen as a sequence tagging problem (Chelba and Acero, 2004; Lita et al., 2003; Kim and Woodland, 2004), where each lower-case word is associated with a tag that describes its capitalization form. Chelba and Acero (2004) study the impact of using increasing amounts of training data, as well as of a small amount of adaptation data. Their work uses a maximum entropy Markov model (MEMM) based approach, which allows different features to be combined. A large written newspaper corpus is used for training, and the test data consists of broadcast news. Lita et al. (2003) build a trigram language model (LM) over (word, tag) pairs, estimated from a corpus with case information, and then use dynamic programming to disambiguate over all possible tag assignments for a sentence. A preparatory study on the capitalization of Portuguese broadcast news was performed by Batista et al. (2007b).
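A minimal sketch of this tagging view is given below, assuming toy bigram counts over cased tokens: each word is expanded into its case variants and the best sequence is recovered by dynamic programming. Lita et al. (2003) estimate a trigram LM from a large cased corpus; a bigram model is used here purely for brevity:

```python
import math

# Toy bigram counts over cased tokens; a real system would estimate
# trigram counts from a large corpus with case information.
BIGRAMS = {("<s>", "A"): 40, ("<s>", "a"): 5,
           ("A", "casa"): 10, ("a", "casa"): 30,
           ("casa", "Branca"): 25, ("casa", "branca"): 8}

def logp(prev, cur):
    # Add-one smoothed bigram log-probability over the toy counts,
    # with a fixed vocabulary-size constant of 100.
    total = sum(c for (p, _), c in BIGRAMS.items() if p == prev)
    return math.log((BIGRAMS.get((prev, cur), 0) + 1) / (total + 100))

def variants(word):
    return {word.lower(), word.capitalize(), word.upper()}

def truecase(words):
    # Viterbi-style dynamic programming over the case variants of each
    # word: paths maps each variant to its best (log-score, history).
    paths = {"<s>": (0.0, [])}
    for w in words:
        new = {}
        for v in variants(w):
            new[v] = max((score + logp(prev, v), hist + [v])
                         for prev, (score, hist) in paths.items())
        paths = new
    return max(paths.values())[1]

print(truecase("a casa branca".split()))  # -> ['A', 'casa', 'Branca']
```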
Other related work includes a bilingual capitalization model for capitalizing machine translation (MT) output using conditional random fields (CRFs), reported by Wang et al. (2006). That work exploits case information from both the source and target sentences of the MT system, yielding better performance than a baseline capitalizer based on a trigram language model.
Spoken language is similar to written text in many aspects, but differs in many others, mostly because of the way the two are produced. Current ASR systems focus on minimizing the word error rate (WER), making no attempt to recover the structural information that is available in written texts. Spoken language is also typically less organized than textual material, making it a challenge to bridge the gap between spoken and written material. The insertion of punctuation marks into spoken texts is a way of bringing them closer to written text, even if a given punctuation mark may behave slightly differently in speech. For example, a sentence in spontaneous speech does not always correspond to a sentence in written text.
A large number of punctuation marks can be considered for spoken texts, including the comma, period or full stop, exclamation mark, question mark, colon, semicolon, and quotation marks. However, most of these marks occur rarely and are quite difficult to insert or evaluate. Hence, most of the available studies focus either on the full stop or on the comma, which have higher corpus frequencies. Previous work on other punctuation marks, such as the question mark and the exclamation mark, has not shown promising results (Christensen et al., 2001).
The comma is the most frequent punctuation mark, but it is also the most problematic because it serves many different purposes. It can be used to introduce a word, phrase, or construction; separate long independent constructions; separate words within a sentence; separate elements in a series; separate thousands, millions, etc. in a number; and prevent misreading. Beeferman et al. (1998) describe a lightweight method for automatically inserting intra-sentence punctuation marks into text. The method relies on a trigram LM built solely from lexical information and uses the Viterbi algorithm for classification. The paper focuses on the comma and presents a qualitative evaluation based on user satisfaction, concluding that the system's performance is qualitatively higher than the sentence accuracy rate would indicate.
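As a rough illustration of LM-driven punctuation insertion, the sketch below treats the comma as a hidden event and makes a local decision at each word boundary from invented bigram probabilities; Beeferman et al. (1998) instead decode the entire sentence with a trigram LM and the Viterbi algorithm:

```python
# "Hidden event" sketch of intra-sentence punctuation insertion: the
# comma is treated as an ordinary token of an n-gram LM, and at each
# word boundary the odds of the comma event against its absence decide
# whether to insert one. The probabilities below are invented.

P = {("verde", "<comma>"): 0.6, ("verde", "porém"): 0.1,
     ("casa", "<comma>"): 0.05, ("casa", "é"): 0.4}

def insert_commas(words, threshold=2.0):
    out = []
    for w, nxt in zip(words, words[1:] + ["</s>"]):
        out.append(w)
        p_comma = P.get((w, "<comma>"), 1e-4)
        p_none = P.get((w, nxt), 1e-4)
        if p_comma / p_none > threshold:  # odds in favor of a comma
            out.append(",")
    return " ".join(out)

print(insert_commas("o jardim é verde porém a casa é branca".split()))
# -> "o jardim é verde , porém a casa é branca"
```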
When dealing with conversational speech, the notion of utterance (Jurafsky and Martin, 2000) or sentence-like unit (SU) (Strassel, 2004) is often used instead of "sentence". An SU may correspond to a grammatical sentence, or it can be semantically complete but smaller than a sentence. Detecting an SU consists of finding its boundaries, which roughly corresponds to the task of detecting the period or full stop in conversational speech. SU boundary detection has gained increasing attention in recent years and has been part of the NIST rich transcription evaluations. It provides a basis for further natural language processing, and its impact on subsequent tasks has recently been analyzed in many speech processing studies (Harper et al., 2005; Mrozinski et al., 2006).
The work of Kim and Woodland (2001) and Christensen et al. (2001) uses a general HMM framework that allows the combination of lexical and prosodic cues for recovering punctuation marks. A similar approach was also used by Gotoh and Renals (2000) and Shriberg et al. (2000) for detecting sentence boundaries. Another approach, based on a maximum entropy model, was developed by Huang and Zweig (2002) to recover punctuation in the Switchboard corpus using textual cues. Different modeling approaches, combining prosodic and textual features, have also recently been investigated by other authors, such as Liu et al. (2006) for sentence boundary detection and Batista et al. (2007a) for punctuation recovery on Portuguese broadcast news.
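As an illustration of the maximum entropy approach with lexical cues, the following sketch classifies each inter-word boundary from a small word window; scikit-learn's multinomial logistic regression (equivalent to a maxent model) stands in for a dedicated maxent toolkit, and the training data is toy:

```python
# Maxent-style punctuation recovery in the spirit of Huang and Zweig
# (2002): each inter-word boundary is classified from lexical features
# of a small window around it.

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def boundary_features(words, i):
    # Lexical features for the boundary after word i; a real system
    # would add prosodic cues such as pause duration and pitch.
    return {"w0": words[i],
            "w+1": words[i + 1] if i + 1 < len(words) else "</s>",
            "w-1": words[i - 1] if i > 0 else "<s>"}

# Toy training set: (tokenized sentence, punctuation after each word).
train = [("olá como está".split(), ["<comma>", "<none>", "<full_stop>"]),
         ("sim claro".split(), ["<comma>", "<full_stop>"])]

X = [boundary_features(ws, i) for ws, ys in train for i in range(len(ws))]
y = [lab for _, ys in train for lab in ys]

model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X, y)

test = "olá como vai".split()
print([model.predict([boundary_features(test, i)])[0]
       for i in range(len(test))])
```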
Performance measures
The following well-known performance measures are used for both the punctuation and the capitalization task: precision, recall, and slot error rate (SER) (Makhoul et al., 1999), defined in Eqs. (1)-(3). For the punctuation task, a slot corresponds to the occurrence of a punctuation mark in the corpus; for the capitalization task, a slot corresponds to every word not written in its lower-case form:

$\text{precision} = \frac{C}{C+S+I}$ (1)

$\text{recall} = \frac{C}{C+S+D}$ (2)

$\text{SER} = \frac{S+D+I}{C+S+D}$ (3)

In the equations, C is the number of correct slots, S the number of substitutions, D the number of deletions, and I the number of insertions; the denominator C + S + D is the total number of slots in the reference.
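For instance, the three measures can be computed directly from the slot alignment counts; a minimal sketch with illustrative count values:

```python
# Precision, recall and slot error rate from slot alignment counts,
# following Makhoul et al. (1999); the counts passed below are
# illustrative, not results from the paper.

def scores(C, S, D, I):
    """C correct slots, S substitutions, D deletions, I insertions."""
    ref_slots = C + S + D          # slots in the reference
    hyp_slots = C + S + I          # slots in the hypothesis
    precision = C / hyp_slots
    recall = C / ref_slots
    ser = (S + D + I) / ref_slots  # may exceed 1.0 with many insertions
    return precision, recall, ser

print(scores(C=70, S=10, D=20, I=15))  # ≈ (0.74, 0.70, 0.45)
```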
Information sources
Both the capitalization and the punctuation task described here share the same spoken corpus; however, for the capitalization task other information sources were also used, including a written newspaper corpus and two small lexica containing case information. The following subsections provide more details about each of these data sources.
Capitalization task
The present study explores three ways of writing a word: lower-case, all-upper, and first-capitalized, not covering mixed-case words such as “McDonald’s” and “SuSE”.
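A small helper makes the scheme concrete; this is a sketch for illustration, with mixed-case forms simply flagged as uncovered:

```python
# Classify the graphical form of a token into the three classes used in
# this study, plus a catch-all for the mixed-case words it does not cover.

def case_form(word):
    if word.islower():
        return "lower-case"
    if word.isupper():
        return "all-upper"
    if word[:1].isupper() and word[1:].islower():
        return "first-capitalized"
    return "mixed-case"  # e.g. "McDonald's", "SuSE": not covered here

for w in ["casa", "NATO", "Lisboa", "SuSE"]:
    print(w, "->", case_form(w))
```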
The experiments were conducted both on written newspaper corpora and on spoken transcriptions, making it possible to analyze the impact of the different methodologies on these two types of data. The written newspaper corpus, the lexica, and the spoken transcriptions were combined in order to provide richer training sets and reduce the problem of unseen words.
Punctuation task
In order to better understand the usage of each punctuation mark, occurrences were counted in written newspaper corpora, using RecPUB and published statistics for the WSJ. The results, shown in Table 10, reveal that the comma is the most common punctuation mark in Portuguese written corpora. The full stop frequency is lower for Portuguese, indicating that written Portuguese contains longer sentences than written English.
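A frequency study of this kind can be reproduced with a simple counting pass over a raw text corpus; the sketch below (hypothetical file name and a naive character-level tokenization) reports per-1000-word frequencies and the implied average sentence length:

```python
# Count punctuation marks and words in a raw text corpus to compare
# relative frequencies and average sentence length. The file name and
# the tokenization are illustrative, not those used in the paper.

from collections import Counter

MARKS = {",": "comma", ".": "full stop", "?": "question mark",
         "!": "exclamation mark", ":": "colon", ";": "semicolon"}

counts, n_words = Counter(), 0
with open("corpus.txt", encoding="utf-8") as f:
    for line in f:
        for tok in line.split():
            if tok.strip(".,?!:;"):      # token has word material
                n_words += 1
            for ch in tok:
                if ch in MARKS:
                    counts[MARKS[ch]] += 1

for mark, c in counts.most_common():
    print(f"{mark}: {c} ({1000 * c / max(n_words, 1):.1f} per 1000 words)")
print("avg. sentence length:", n_words / max(counts["full stop"], 1))
```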
An equivalent study was also performed on the Europarl corpus (Koehn, 2005).
Concrete example
Fig. 11 shows an example of text extracted from an automatic speech transcription, where the first word of each sentence is marked in bold for easier identification of sentence beginnings. The text at the top is split into sentences according to the segmentation proposed by the APP/ASR module. The text in the middle was automatically enriched with full stops and capitalization information, with segmentation performed according to the full stop predictions. The first word of each sentence was then capitalized accordingly.
Concluding remarks
This paper addresses two tasks that contribute to the enrichment of the output of an ASR system.
Concerning the capitalization task, three different methods were described and results were presented, both for manual transcriptions of speech and for written newspaper corpora. The experiments show that the speech recognition corpus used is too small to cover much of the vocabulary. Another conclusion is that manually built lexica can help enhance the results when the training dataset is small.
Future work
For the capitalization task, only three ways of writing a word were explored: lower-case, all-upper, and first-capitalized, not covering mixed-case words such as McDonald's and SuSE. Such words are now being addressed by a small lexicon, but no evaluation has been performed so far to assess the performance improvement.
The training and test corpora used in these experiments consisted of manually corrected and annotated speech transcriptions. A strategy must be defined in order to perform these tasks on actual recognition output as well.
Acknowledgements
This paper has been partially funded by the FCT Projects LECTRA (POSC/PLP/58697/2004) and DIGA (POSI/PLP/41319/2001), and by the PRIME National Project TECNOVOZ number 03/165. INESC-ID Lisboa had support from the POSI program of the “Quadro Comunitário de Apoio III”.
References

- Kim, J., Woodland, P.C., 2004. Automatic capitalisation generation for speech input. Comput. Speech Lang.
- Liu, Y., et al., 2006. Enriching speech recognition with automatic detection of sentence boundaries and disfluencies. IEEE Trans. Audio, Speech Lang. Process.
- Shriberg, E., et al., 2000. Prosody-based automatic segmentation of speech into sentences and topics. Speech Commun.
- Amaral, R., Meinedo, H., Caseiro, D., Trancoso, I., Neto, J.P., 2007. A prototype system for selective dissemination of...
- Batista, F., Caseiro, D., Mamede, N.J., Trancoso, I., 2007a. Recovering punctuation marks for automatic speech...
- Batista, F., Mamede, N.J., Caseiro, D., Trancoso, I., 2007b. A lightweight on-the-fly capitalization system for...
- Beeferman, D., et al., 1998. Cyberpunc: a lightweight punctuation annotation system for speech. In: Proc. IEEE ICASSP.
- Berger, A.L., et al., 1996. A maximum entropy approach to natural language processing. Comput. Linguist.
- Chelba, C., Acero, A., 2004. Adaptation of maximum entropy capitalizer: little data can help a lot. In:...
- Christensen, H., Gotoh, Y., Renals, S., 2001. Punctuation annotation using statistical prosody models. In: Proc. ISCA...
- Jurafsky, D., Martin, J.H., 2000. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition.