
A Speech-Centric Perspective for Human-Computer Interface: A Case Study

Published in: Journal of VLSI Signal Processing Systems for Signal, Image and Video Technology

Abstract

Speech technology has been playing a central role in enhancing human-machine interactions, especially for small devices, for which the graphical user interface has obvious limitations. The speech-centric perspective for the human-computer interface advanced in this paper derives from the view that speech is the only natural and expressive modality that enables people to access information from, and interact with, any device. In this paper, we describe recent work conducted at Microsoft Research aimed at developing enabling technologies for speech-centric multimodal human-computer interaction. In particular, we present a case study of a prototype system, called MapPointS, a speech-centric multimodal map-query application for North America. This prototype navigation system provides rich functionality that allows users to obtain map-related information through speech, text, and pointing devices. Users can verbally query for state maps, city maps, directions, places, nearby businesses, and other useful information within North America. They can also verbally control the application, for example, changing the map size and panning the map interactively through speech. In the current system, the results of the queries are presented back to users through a graphical user interface. We first present an overview of the MapPointS system and its major components in detail. This is followed by the software design and engineering principles and considerations adopted in developing MapPointS, and by a description of some key robust speech processing technologies underlying general speech-centric human-computer interaction systems.
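
To make the interaction loop summarized above concrete (spoken query, recognized command, map action, graphical result), the following minimal Python sketch shows how a recognized utterance might be mapped to map-query and map-control actions, with results routed back to a graphical front end. This is not the MapPointS implementation: all names (MapCommand, parse_utterance, dispatch) are hypothetical, and the keyword patterns stand in for the speech recognizer and semantic parser of a real system.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class MapCommand:
    intent: str                 # e.g. "show_map", "zoom", "pan"
    slot: Optional[str] = None  # e.g. a place name or a direction

def parse_utterance(text: str) -> Optional[MapCommand]:
    """Stand-in for the recognizer + semantic parser: map a spoken
    query onto an intent and a slot using simple patterns."""
    text = text.lower().strip()
    m = re.match(r"(?:show|display) (?:the )?map of (.+)", text)
    if m:
        return MapCommand("show_map", m.group(1))
    if "zoom in" in text:
        return MapCommand("zoom", "in")
    if "zoom out" in text:
        return MapCommand("zoom", "out")
    m = re.match(r"pan (left|right|up|down)", text)
    if m:
        return MapCommand("pan", m.group(1))
    return None  # out-of-grammar utterance

def dispatch(cmd: MapCommand, render: Callable[[str], None]) -> None:
    """Route a parsed command to the map back end; results are
    presented back to the user graphically (here, via `render`)."""
    if cmd.intent == "show_map":
        render(f"Rendering map of {cmd.slot}")
    elif cmd.intent == "zoom":
        render(f"Zooming {cmd.slot}")
    elif cmd.intent == "pan":
        render(f"Panning {cmd.slot}")

if __name__ == "__main__":
    for utterance in ["Show the map of Seattle", "zoom in", "pan left"]:
        cmd = parse_utterance(utterance)
        if cmd:
            dispatch(cmd, render=print)
```

In the actual system, of course, the parsing step is performed by a speech recognizer and a semantic parser robust to noise and out-of-grammar input, rather than by regular expressions; the sketch only illustrates the separation between command interpretation and graphical presentation.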



Author information

Li Deng received the B.S. degree from the University of Science and Technology of China in 1982, and the M.S. and Ph.D. degrees from the University of Wisconsin-Madison in 1984 and 1986, respectively. From 1986 to 1989, he worked on large-vocabulary automatic speech recognition in Montreal, Canada. In 1989, he joined the Department of Electrical and Computer Engineering, University of Waterloo, Ontario, Canada, as an Assistant Professor, where he became a tenured Full Professor in 1996. From 1992 to 1993, he conducted sabbatical research at the Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, MA, and from 1997 to 1998 at the ATR Interpreting Telecommunications Research Laboratories, Kyoto, Japan. In 1999, he joined Microsoft Research, Redmond, WA, as a Senior Researcher, and is also an Affiliate Full Professor of Electrical Engineering at the University of Washington, Seattle. His research interests include acoustic-phonetic modeling of speech, speech and speaker recognition, speech synthesis and enhancement, speech production and perception, auditory speech processing, noise-robust speech processing, statistical methods and machine learning, nonlinear signal processing, spoken language systems, multimedia signal processing, and multimodal human-computer interaction. In these areas, he has published over 200 technical papers and book chapters, and is inventor or co-inventor of numerous patents. He co-authored the book “Speech Processing—A Dynamic and Optimization-Oriented Approach” (Marcel Dekker, New York, 2003).

He served on the Education Committee and the Speech Processing Technical Committee of the IEEE Signal Processing Society from 1996 to 2000, and was an Associate Editor of the IEEE Transactions on Speech and Audio Processing from 2002 to 2005. He currently serves on the Multimedia Signal Processing Technical Committee. He was a Technical Chair of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2004). He is a Fellow of the Acoustical Society of America and a Fellow of the IEEE.

Dong Yu joined Microsoft in 1998. He holds a B.S. degree in Electrical Engineering from Zhejiang University, China, an M.S. degree in Electrical Engineering from the Chinese Academy of Sciences, and an M.S. degree in Computer Science from Indiana University Bloomington, USA. He is currently a Ph.D. candidate in Computer Science at the University of Idaho, USA.

His research interests are in the areas of speech recognition and processing, and computer and network security. He has published more than 20 journal and conference papers in these areas, and has applied for more than 10 US and international patents.

He has served as a reviewer for many journals and conferences, including the Journal of Computer Security, ICASSP, and Interspeech.

Rights and permissions

Reprints and permissions

Cite this article

Deng, L., Yu, D. A Speech-Centric Perspective for Human-Computer Interface: A Case Study. J VLSI Sign Process Syst Sign Image Video Technol 41, 255–269 (2005). https://doi.org/10.1007/s11265-005-4150-4

