Simultaneous translation of lectures and speeches

Machine Translation

Abstract

With increasing globalization, communication across language and cultural boundaries is becoming an essential requirement of doing business, delivering education, and providing public services. Due to the considerable cost of human translation services, only a small fraction of text documents and an even smaller percentage of spoken encounters, such as international meetings and conferences, are translated, with most resorting to the use of a common language (e.g. English) or not taking place at all. Technology may provide a potentially revolutionary way out if real-time, domain-independent, simultaneous speech translation can be realized. In this paper, we present a simultaneous speech translation system based on statistical recognition and translation technology. We discuss the technology and various system improvements, and propose mechanisms for user-friendly delivery of the result. Based on extensive component and end-to-end system evaluations and comparisons with human translation performance, we conclude that machines can already deliver comprehensible simultaneous translation output. Moreover, while machine performance is affected by recognition errors (and thus can be improved), human performance is limited by the cognitive challenge of performing the task in real time.

Abbreviations

AMI: Meeting transcription data from Augmented Multi-party Interaction (see Table 3)

ASR: Automatic speech recognition

BN: Data from broadcast news corpora (see Table 3)

CHIL: Computers in the human interaction loop

cMLLR: Constrained maximum likelihood linear regression

DG: Directorate general

EC: European Commission

EM: Expectation maximization

EPPS: European Parliament plenary sessions

GALE: Global Autonomous Language Exploitation

GWRD: Data from Gigaword corpus (see Table 3)

HNSRD: Data from UK Parliament debates (see Table 3)

ICSI: (Data recorded at) International Computer Science Institute

JRTk: Janus Recognition Toolkit

MLLR: Maximum likelihood linear regression

MT: Machine translation

MTG: Meeting transcription data (see Table 3)

NIST: National Institute of Standards and Technology; MT evaluation measure (Doddington 2002); data recorded at NIST

OOV: Out of vocabulary

PROC: Data from conference proceedings (see Table 3)

RTF: Real-time factor

RT-06S: NIST 2006 Rich Transcription evaluation

RWTH: Rheinisch-Westfälische Technische Hochschule (Aachen)

SMNR: Lectures and seminars from the CHIL project

SMT: Statistical machine translation

SRI: Stanford Research Institute

SST: Speech-to-speech translation

STR-DUST: Speech TRanslation: Domain-Unlimited, Spontaneous and Trainable

SWB: Data from switchboard transcriptions (see Table 3)

TC-STAR: Technologies and Corpora for Speech-to-Speech-Translation

TED: Translanguage English database

TH: Technische Hochschule

TTS: Text to speech

UKA-. . .: See Table 3

UW-M: See Table 3

VAD: Voice activity detection

VTLN: Vocal tract length normalization

WER: Word error rate

References

  • Accipio Consulting (2006) Sprachtechnologien für Europa [Language technologies for Europe]. ITC IRST, Trento, Italy. Available at http://www.tc-star.org/pubblicazioni/D17_HLT_DE.pdf. Accessed 29 Oct 2008

  • Al-Khanji R, El-Shiyab S, Hussein R (2000) On the use of compensatory strategies in simultaneous interpretation. Meta J Traduc 45: 544–557

  • Atal B (1974) Effectiveness of linear prediction characteristics of speech wave for automatic speaker identification and verification. J Acoust Soc Am 55: 1304–1312

  • Bahl L, Brown P, de Souza P, Mercer R (1986) Maximum mutual information estimation of hidden Markov model parameters for speech recognition. In: ICASSP ’86, IEEE international conference on acoustics, speech, and signal processing, Tokyo, Japan, pp 49–52

  • Bain K, Basson S, Faisman A, Kanevsky D (2005) Accessibility, transcription, and access everywhere. IBM Syst J 44: 589–603

  • Barik HC (1969) A study of simultaneous interpretation. PhD thesis, University of North Carolina at Chapel Hill

  • Bellegarda JR (2004) Statistical language model adaptation: review and perspectives. Speech Commun 42: 93–108

  • Black AW, Taylor PA (1997) The Festival speech synthesis system: system documentation. Technical Report HCRC/TR-83, Human Communication Research Centre, University of Edinburgh, Edinburgh, Scotland

  • Brown PF, Della Pietra SA, Della Pietra VJ, Mercer RL (1994) The mathematics of statistical machine translation: parameter estimation. Comput Linguist 19: 263–311

  • Bulyko I, Ostendorf M, Stolcke A (2003) Getting more mileage from Web text sources for conversational speech language modeling using class-dependent mixtures. In: HLT-NAACL 2003 Human language technology conference of the North American chapter of the Association for Computational Linguistics, Companion volume: short papers, student research workshop, demonstrations, tutorial abstracts, Edmonton, Alberta, Canada, pp 7–9

  • Burger S, MacLaren V, Waibel A (2004) ISL meeting speech part 1, catalog nbr LDC2004S05, Linguistic Data Consortium, Philadelphia, PA

  • Carletta J, Ashby S, Bourban S, Flynn M, Guillemot M, Hain T, Kadlec J, Karaiskos V, Kraaij W, Kronenthal M, Lathoud G, Lincoln M, Lisowska A, McCowan I, Post W, Reidsma D, Wellner P (2005) The AMI meeting corpus: A pre-announcement. In: 2nd joint workshop on multimodal interaction and related machine learning algorithms MLMI 05, Edinburgh, Scotland, pp 28–39

  • Cettolo M, Falavigna D (1998) Automatic detection of semantic boundaries based on acoustic and lexical knowledge. In: Fifth international conference on spoken language processing, ICSLP’98, Sydney, Australia, pp 1551–1554

  • Cettolo M, Federico M (2006) Text segmentation criteria for statistical machine translation. In: Salakoski T, Ginter F, Pyysalo S, Pahikkala T (eds) Advances in natural language processing, 5th international conference, FinTAL 2006, Turku, Finland, LNCS 4139. Springer Verlag, Berlin, pp 664–673

  • Cettolo M, Brugnara F, Federico M (2004) Advances in the automatic transcription of lectures. In: ICASSP 2004, IEEE international conference on acoustics, speech, and signal processing, Montreal, Canada, pp 769–772

  • Chen CJ (1999) Speech recognition with automatic punctuation. In: Sixth European conference on speech communication and technology (Eurospeech’99), Budapest, Hungary, pp 447–450

  • de Mori R, Federico M (1999) Language model adaptation. In: Ponting K (ed) Computational models of speech pattern processing. Springer Verlag, Berlin, pp 280–303

  • Doddington G (2002) Automatic evaluation of MT quality using n-gram co-occurrence statistics. In: Proceedings of human language technology conference 2002, San Diego, CA, pp 138–145

  • Eide E, Gish H (1996) A parametric approach to vocal tract length normalization. In: 1996 IEEE international conference on acoustics, speech, and signal processing, Atlanta, Georgia, pp 346–348

  • Finke M, Geutner P, Hild H, Kemp T, Ries K, Westphal M (1997) The Karlsruhe-verbmobil speech recognition engine. In: 1997 IEEE international conference on acoustics, speech, and signal processing (ICASSP’97), Munich, Germany, pp 83–86

  • Fiscus J (1997) A post-processing system to yield reduced word error rates: recogniser output voting error reduction (ROVER). In: Proceedings of the 1997 IEEE workshop on automatic speech recognition and understanding, Santa Barbara, CA, pp 347–352

  • Fiscus J, Garofolo J, Przybocki M, Fisher W, Pallett D (1998) 1997 English broadcast news speech (HUB4), catalog nbr LDC98S71, Linguistic Data Consortium, Philadelphia, PA

  • Foster G, Kuhn R, Johnson H (2006) Phrasetable smoothing for statistical machine translation. In: EMNLP 2006 conference on empirical methods in natural language processing, Sydney, Australia, pp 53–61

  • Fritsch J, Rogina I (1996) The bucket box intersection (BBI) algorithm for fast approximative evaluation of diagonal mixture Gaussians. In: 1996 IEEE international conference on acoustics, speech, and signal processing, Atlanta, Georgia, pp 837–840

  • Fügen C, Kolss M (2007) The influence of utterance chunking on machine translation performance. In: Interspeech 2007, 8th annual conference of the International Speech Communication Association, Antwerp, Belgium, pp 2837–2840

  • Fügen C, Westphal M, Schneider M, Schultz T, Waibel A (2001) LingWear: a mobile tourist information system. In: Proceedings of the first international conference on human language technology research, San Diego, California, 5 pp

  • Fügen C, Ikbal S, Kraft F, Kumatani K, Laskowski K, McDonough JW, Ostendorf M, Stüker S, Wölfel M (2006a) The ISL RT-06S speech-to-text system. In: Renals et al (2006), pp 407–418

  • Fügen C, Kolss M, Paulik M, Waibel A (2006b) Open domain speech translation: from seminars and speeches to lectures. In: TC-STAR workshop on speech to speech translation, Barcelona, Spain, pp 81–86

  • Furui S (1986) Cepstral analysis technique for automatic speaker verification. IEEE T Acoust Speech Signal Proc 34:52–59

  • Furui S (2005) Recent progress in corpus-based spontaneous speech recognition. IEICE T Inform Syst E88-D:366–375

  • Furui S (2007) Recent advances in automatic speech summarization. In: Symposium on large-scale knowledge resources (LKR 2007), Tokyo, Japan, pp 49–54

  • Gales MJF (1998) Maximum likelihood linear transformations for HMM-based speech recognition. Comput Speech Lang 12: 75–98

  • Garofolo JS, Michel M, Stanford VM, Tabassi E, Fiscus J, Laprun CD, Pratz N, Lard J (2004) NIST meeting pilot corpus speech, catalog nbr LDC2004S09, Linguistic Data Consortium, Philadelphia, PA

  • Geutner P, Finke M, Scheytt P (1998) Adaptive vocabularies for transcribing multilingual broadcast news. In: Proceedings of the 1998 IEEE international conference on acoustics, speech, and signal processing (ICASSP ’98), Seattle, Washington, pp 925–928

  • Glass J, Hazen TJ, Cyphers S, Malioutov I, Huynh D, Barzilay R (2007) Recent progress in the MIT spoken lecture processing project. In: Interspeech 2007, 8th annual conference of the International Speech Communication Association, Antwerp, Belgium, pp 2553–2556

  • Godfrey JJ, Holliman E (1993) Switchboard-1 transcripts, catalog nbr LDC93T4, Linguistic Data Consortium, Philadelphia, PA

  • Gollan C, Bisani M, Kanthak S, Schlüter R, Ney H (2005) Cross domain automatic transcription on the TC-STAR EPPS corpus. In: ICASSP, 2005 IEEE conference on acoustics, speech, and signal processing, Philadelphia, PA, pp 825–828

  • Gollan C, Hahn S, Schlüter R, Ney H (2007) An improved method for unsupervised training of LVCSR systems. In: Interspeech 2007, 8th annual conference of the International Speech Communication Association, Antwerp, Belgium, pp 2101–2104

  • Graff D (1994) UN parallel text (complete), catalog nbr LDC94T4A, Linguistic Data Consortium, Philadelphia, PA

  • Graff D (2003) English gigaword, catalog nbr LDC2003T05, Linguistic Data Consortium, Philadelphia, PA

  • Graff D, Garofolo J, Fiscus J, Fisher W, Pallett D (1997) 1996 English broadcast news speech (HUB4), catalog nbr LDC97S44, Linguistic Data Consortium, Philadelphia, PA

  • Hamon O, Mostefa D, Choukri K (2007) End-to-end evaluation of a speech-to-speech translation system in TC-STAR. In: Machine translation summit XI, Copenhagen, Denmark, pp 223–230

  • Henderson JA (1982) Some psychological aspects of simultaneous interpreting. Incorp Ling 21(4): 149–150

  • Hendricks PV (1971) Simultaneous interpreting: a practical book. Longman, London

  • Huang J, Zweig G (2002) Maximum entropy model for punctuation annotation from speech. In: 7th international conference on spoken language processing (ICSLP 2002, Interspeech 2002), Denver, Colorado, pp 917–920

  • Huang J, Westphal M, Chen SF, Siohan O, Povey D, Libal V, Soneiro A, Schulz H, Ross T, Potamianos G (2006) The IBM rich transcription spring 2006 speech-to-text system for lecture meetings. In: Renals et al (2006), pp 432–443

  • Janin A, Edwards J, Ellis D, Gelbart D, Morgan N, Peskin B, Pfau T, Shriberg E, Stolcke A, Wooters C (2004) ICSI meeting speech, catalog nbr LDC2004S02, Linguistic Data Consortium, Philadelphia, PA

  • Jones R (1998) Conference interpreting explained. St. Jerome Publishing, Manchester

  • Kim J-H, Woodland PC (2001) The use of prosody in a combined system for punctuation generation and speech recognition. In: Eurospeech 2001 Scandinavia, 7th European conference on speech communication and technology, 2nd Interspeech event, Aalborg, Denmark, pp 2757–2760

  • Klakow D, Peters J (2002) Testing the correlation of word error rate and perplexity. Speech Commun 38: 19–28

  • Koehn P, Axelrod A, Mayne AB, Callison-Burch C, Osborne M, Talbot D (2005) Edinburgh system description for the 2005 IWSLT speech translation evaluation. In: Proceedings of international workshop on spoken language translation, Pittsburgh, PA

  • Kolss M, Zhao B, Vogel S, Hildebrand AS, Niehues J, Venugopal A, Zhang Y (2006) The ISL statistical machine translation system for the TC-STAR spring 2006 evaluation. In: TC-STAR workshop on speech to speech translation, Barcelona, Spain

  • Kopczynski A (1994) Bridging the gap: empirical research in simultaneous interpretation. John Benjamins, Amsterdam/Philadelphia

  • Lamel LF, Schiel F, Fourcin A, Mariani J, Tillmann HG (1994) The translanguage English database TED. In: Third international conference on spoken language processing (ICSLP 94), Yokohama, Japan, pp 1795–1798

  • Lamel L, Bilinski E, Adda G, Gauvain J-L, Schwenk H (2006) The LIMSI RT06s lecture transcription system. In: Renals et al (2006), pp 457–468

  • Lamel L, Gauvain J-L, Adda G, Barras C, Bilinski E, Galibert O, Pujol A, Schwenk H, Zhu X (2007) The LIMSI 2006 TC-STAR EPPS transcription systems. In: ICASSP 2007, international conference on acoustics, speech, and signal processing, Honolulu, Hawaii, pp 997–1000

  • Lederer M (1978) Simultaneous interpretation: units of meaning and other features. In: Gerver D, Sinaiko HW (eds) Language interpretation and communication. Plenum Press, New York, pp 323–332

  • Leggetter CJ, Woodland PC (1995) Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Comput Speech Lang 9: 171–185

  • Liu Y (2004) Structural event detection for rich transcription of speech. PhD thesis, Purdue University, West Lafayette, IN

  • Lööf J, Bisani M, Gollan C, Heigold G, Hoffmeister B, Plahl C, Schlüter R, Ney H (2006) The 2006 RWTH parliamentary speeches transcription system. In: Interspeech 2006 – ICSLP, ninth international conference on spoken language processing, Pittsburgh, PA, pp 105–108

  • Mani I (2001) Automatic summarization. John Benjamins, Amsterdam

  • Matusov E, Leusch G, Bender O, Ney H (2005) Evaluating machine translation output with automatic sentence segmentation. In: Proceedings of international workshop on spoken language translation, Pittsburgh, PA

  • Matusov E, Mauser A, Ney H (2006) Automatic sentence segmentation and punctuation prediction for spoken language translation. In: International workshop on spoken language translation, Kyoto, Japan, pp 158–165

  • Matusov E, Leusch G, Banchs RE, Bertoldi N, Déchelotte D, Federico M, Kolss M, Lee Y-S, Mariño JB, Paulik M, Roukos S, Schwenk H, Ney H (2008) System combination for machine translation of spoken and written language. IEEE T Audio Speech Lang Proc 16: 1222–1237

  • Morimoto T, Takezawa T, Yato F, Sagayama S, Tashiro T, Nagata M, Kurematsu A (1993) ATR’s speech translation system: ASURA. In: European conference on speech communication and technology 1993, Eurospeech 1993, Berlin, Germany, pp 1291–1294

  • Moser-Mercer B, Kunzli A, Korac M (1998) Prolonged turns in interpreting: effects on quality, physiological and psychological stress (Pilot Study). Interpreting: Int J Res Prac Interpreting 3:47–64

  • Normandin Y (1991) Hidden Markov models, maximum mutual information estimation and the speech recognition problem. PhD thesis, McGill University, Montreal, Quebec, Canada

  • Och FJ (2003) Minimum error rate training in statistical machine translation. In: 41st annual meeting of the Association for Computational Linguistics, Sapporo, Japan, pp 160–167

  • Och FJ, Ney H (2003) A systematic comparison of various statistical alignment models. Comput Linguist 29:19–51

  • Olszewski D, Prasetyo F, Linhard K (2005) Steerable highly directional audio beam loudspeaker. In: Interspeech’2005 – Eurospeech, Lisboa, Portugal, pp 137–140

  • Papineni K, Roukos S, Ward T, Zhu W (2002) Bleu: a method for automatic evaluation of machine translation. In: 40th annual meeting of the Association for Computational Linguistics, Philadelphia, Pennsylvania, pp 311–318

  • Paulik M, Waibel A (2008) Extracting clues from human interpreter speech for spoken language translation. In: ICASSP 2008 IEEE international conference on acoustics, speech, and signal processing, Las Vegas, Nevada, pp 5097–5100

  • Ramabhadran B, Huang J, Picheny M (2003) Towards automatic transcription of large spoken archives – English ASR for the Malach project. In: Proceedings of the 2003 IEEE conference on acoustics, speech, and signal processing (ICASSP 2003), Hong Kong, China, pp 216–219

  • Ramabhadran B, Siohan O, Mangu L, Zweig G, Westphal M, Schulz H, Soneiro A (2006) The IBM 2006 speech transcription system for European Parliamentary speeches. In: Interspeech 2006 – ICSLP, ninth international conference on spoken language processing, Pittsburgh, PA, pp 1225–1228

  • Rao S, Lane I, Schultz T (2007) Optimizing sentence segmentation for spoken language translation. In: Interspeech 2007, 8th annual conference of the International Speech Communication Association, Antwerp, Belgium, pp 2845–2848

  • Renals S, Bengio S, Fiscus J (eds) (2006) Machine learning for multimodal interaction: third international workshop, MLMI 2006, Bethesda. Revised selected papers, LNCS 4299, Springer Verlag, Berlin

  • Rogina I, Schaaf T (2002) Lecture and presentation tracking in an intelligent meeting room. In: 4th IEEE international conference on multimodal interfaces (ICMI 2002), Pittsburgh, PA, pp 47–52

  • Roukos S, Graff D, Melamed D (1995) Hansard French/English, catalog nbr LDC95T20, Linguistic Data Consortium, Philadelphia, PA

  • Seleskovitch D (1978) Interpreting for international conferences: problems of language and communication. Pen & Booth, Washington DC

  • Snover M, Dorr B, Schwartz R, Micciulla L, Makhoul J (2006) A study of translation edit rate with targeted human annotation. In: Proceedings of the 7th conference of Association for Machine Translation in the Americas: visions for the future of machine translation, Cambridge, Massachusetts, pp 223–231

  • Soltau H, Metze F, Fügen C, Waibel A (2001) A one-pass decoder based on polymorphic linguistic context assignment. In: ASRU 2001, automatic speech recognition and understanding workshop, Madonna di Campiglio, Trento, Italy, pp 214–217

  • Soltau H, Yu H, Metze F, Fügen C, Jin Q, Jou S-C (2004) The 2003 ISL rich transcription system for conversational telephony speech. In: ICASSP 2004, IEEE international conference on acoustics, speech, and signal processing, Montreal, Canada, pp 773–776

  • Stolcke A (2002) SRILM – an extensible language modeling toolkit. In: 7th international conference on spoken language processing (ICSLP 2002, Interspeech 2002), Denver, Colorado, pp 901–904

  • Stüker S, Fügen C, Hsiao R, Ikbal S, Jin Q, Kraft F, Paulik M, Raab M, Tam Y-C, Wölfel M (2006) The ISL TC-STAR spring 2006 ASR evaluation systems. In: TC-STAR workshop on speech to speech translation, Barcelona, Spain, pp 139–144

  • Stüker S, Paulik M, Kolss M, Fügen C, Waibel A (2007) Speech translation enhanced ASR for European Parliament speeches – on the influence of ASR performance on speech translation. In: ICASSP 2007, international conference on acoustics, speech, and signal processing, Honolulu, Hawaii, pp 1293–1296

  • Trancoso I, Nunes R, Neves L (2006) Classroom lecture recognition. In: Vieira R, Quaresma P, Nunes MdGV, Mamede NJ, Oliveira C, Dias MC (eds) Computational processing of the Portuguese language, 7th international workshop, PROPOR 2006, Itatiaia, Brazil, LNCS 3960, Springer Verlag, Berlin, pp 190–199

  • Vidal M (1997) New study on fatigue confirms need for working in teams. Proteus Newsl NAJIT 6.1

  • Vogel S (2003) SMT decoder dissected: word reordering. In: International conference on natural language processing and knowledge engineering, Beijing, China, pp 561–566

  • Vogel S (2005) PESA: phrase pair extraction as sentence splitting. In: MT summit X, the tenth machine translation summit, Phuket, Thailand, pp 251–258

  • Vogel S, Ney H, Tillmann C (1996) HMM-based word alignment in statistical translation. In: COLING-96, the 16th international conference on computational linguistics, Copenhagen, Denmark, pp 836–841

  • Waibel A, Fügen C (2008) Spoken language translation. IEEE Signal Proc Mag 25(3): 70–79

  • Waibel A, Stiefelhagen R (eds) (2009) Computers in the human interaction loop. Springer Verlag, Berlin

  • Waibel A, Jain AN, McNair AE, Saito H, Hauptmann AG, Tebelskis J (1991) JANUS: a speech-to-speech translation system using connectionist and symbolic processing strategies. In: ICASSP-91, proceedings of the international conference on acoustics, speech, and signal processing, Toronto, Canada, pp 793–796

  • Waibel A, Steusloff H, Stiefelhagen R, the CHIL Project Consortium (2004) CHIL – computers in the human interaction loop. In: WIAMIS 2004, 5th international workshop on image analysis for multimedia interactive services, Lisbon, Portugal, 4 pp

  • Yagi SM (2000) Studying style in simultaneous interpretation. Meta J Traduc 45: 520–547

  • Yuan J, Liberman M, Cieri C (2006) Towards an integrated understanding of speaking rate in conversation. In: Interspeech 2006 – ICSLP, ninth international conference on spoken language processing, Pittsburgh, Pennsylvania, paper Mon3A3O-1

  • Zechner K (2002) Summarization of spoken language – challenges, methods, and prospects. Speech Technol Expert eZine, 6

  • Zhan P, Westphal M (1997) Speaker normalization based on frequency warping. In: 1997 IEEE international conference on acoustics, speech, and signal processing (ICASSP’97), Munich, Germany, p 1039

Author information

Corresponding author

Correspondence to Alex Waibel.

About this article

Cite this article

Fügen, C., Waibel, A. & Kolss, M. Simultaneous translation of lectures and speeches. Machine Translation 21, 209–252 (2007). https://doi.org/10.1007/s10590-008-9047-0
