Automatic visual speech segmentation and recognition using directional motion history images and Zernike moments

Shaikh, Ayaz A.; Kumar, Dinesh K.; Gubbi, Jayavardhana

doi:10.1007/s00371-012-0751-7

Automatic visual speech segmentation and recognition using directional motion history images and Zernike moments

Original Article
Published: 13 September 2012

Volume 29, pages 969–982, (2013)
Cite this article

The Visual Computer Aims and scope Submit manuscript

Ayaz A. Shaikh¹,
Dinesh K. Kumar¹ &
Jayavardhana Gubbi²

530 Accesses
5 Citations
Explore all metrics

Abstract

Appearance-based visual speech recognition using only video signals is presented. The proposed technique is based on the use of directional motion history images (DMHIs), which is an extension of the popular optical-flow method for object tracking. Zernike moments of each DMHI are computed in order to perform the classification. The technique incorporates automatic temporal segmentation of isolated utterances. The segmentation of isolated utterance is achieved using pair-wise pixel comparison. Support vector machine is used for classification and the results are based on leave-one-out paradigm. Experimental results show that the proposed technique achieves better performance in visemes recognition than others reported in literature. The benefit of this proposed visual speech recognition method is that it is suitable for real-time applications due to quick motion tracking system and the fast classification method employed. It has applications in command and control using lip movement to text conversion and can be used in noisy environment and also for assisting speech impaired persons.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Visual Speech Recognition Using Optical Flow and Hidden Markov Model

Article 10 September 2018

Usha Sharma, Sushila Maheshkar, … Rahul Kaushik

Temporal and Spatial Features for Visual Speech Recognition

A comparative study of English viseme recognition methods and algorithms

Article Open access 07 October 2017

Dawid Jachimski, Andrzej Czyzewski & Tomasz Ciszewski

Notes

The experimental procedure was approved by the Human Experiments Ethics Committee of RMIT University (ASEHAPP 12-10).

References

Xiaodong, C., Alwan, A.: Noise robust speech recognition using feature compensation based on polynomial regression of utterance SNR. IEEE Trans. Speech Audio Process. 13(6), 1161–1172 (2005)
Article Google Scholar
Haitian, X., Zheng-Hua, T., Dalsgaard, P., Lindberg, B.: Robust speech recognition by nonlocal means denoising processing. IEEE Signal Process. Lett. 15, 701–704 (2008)
Article Google Scholar
Zheng-Hua, T., Lindberg, B.: Low-complexity variable frame rate analysis for speech recognition and voice activity detection. IEEE J. Sel. Top. Signal Process. 4(5), 798–807 (2010)
Article Google Scholar
Petajan, E.: Automatic lipreading to enhance speech recognition. In: IEEE Global Telecommunications Conference, Atlanta, GA, USA, pp. 265–272. IEEE Computer Society Press, Los Alamitos (1984)
Google Scholar
Arjunan, S.P., Kumar, D.K., Yau, W.C., Weghorn, H.: Unspoken vowel recognition using facial electromyogram. In: Engineering in Medicine and Biology Society. EMBS’06. 28th Annual International Conference of the IEEE, Aug. 30 2006–Sept. 3 2006, pp. 2191–2194 (2006)
Google Scholar
Schultz, T., Wand, M.: Modeling coarticulation in EMG-based continuous speech recognition. Speech Commun. 52(4), 341–353 (2010). doi:10.1016/j.specom.2009.12.002
Article Google Scholar
Medizinelektronik, C.: (2008). http://www.articulograph.de/
Soquet, A., Saerens, M., Lecuit, V.: Complementary cues for speech recognition, pp. 1645–1648 (1999)
Potamianos, G., Neti, C., Gravier, G., Garg, A., Senior, A.W.: Recent advances in the automatic recognition of audiovisual speech. Proc. IEEE 91(9), 1306–1326 (2003)
Article Google Scholar
Yau, W.C., Kumar, D.K., Arjunan, S.P.: Visual speech recognition using dynamic features and support vector machines. Int. J. Image Graph. 8(3), 419–437 (2008)
Article Google Scholar
Xiang, T., Gong, S.: Beyond tracking: modelling activity and understanding behaviour. Int. J. Comput. Vis. 67(1), 21–51 (2006)
Article Google Scholar
Meng, H., Pears, N., Bailey, C.: Motion information combination for fast human action recognition. In: Proc. Computer Vision Theory and Applications, Spain, pp. 21–28 (2007)
Google Scholar
Ma, J., Cole, R., Pellom, B., Ward, W., Wise, B.: Accurate automatic visible speech synthesis of arbitrary 3D models based on concatenation of diviseme motion capture data. Comput. Animat. Virtual Worlds 15(5), 485–500 (2004)
Article Google Scholar
Govokhina, O., Bailly, G., Breton, G.: Learning Optimal Audiovisual Phasing for a HMM-based Control Model for Facial Animation (2007)
Musti, U., Toutios, A., Ouni, S., Colotte, V., Wrobel-Dautcourt, B., Berger, M.O.: HMM-based automatic visual speech segmentation using facial data. In: Proc. of the Interspeech 2010, pp. 1401–1404 (2010)
Google Scholar
Koprinska, I., Carrato, S.: Temporal video segmentation: a survey. Signal Process. Image Commun. 16(5), 477–500 (2001)
Article Google Scholar
Da Silveira, L.G., Facon, J., Borges, D.L.: Visual speech recognition: a solution from feature extraction to words classification. In: XVI Brazilian Symposium on Computer Graphics and Image Processing SIBGRAPI 2003, pp. 399–405 (2003)
Chapter Google Scholar
Luettin, J., Thacker, N.A., Beet, S.W.: Active shape models for visual speech feature extraction. NATO ASI Ser. Comput. Syst. Sci. 150, 383–390 (1996)
Article Google Scholar
Otani, K., Hasegawa, T.: The image input microphone—a new nonacoustic speech communication system by media conversion from oral motion images to speech. IEEE J. Sel. Areas Commun. 13(1), 42–48 (1995)
Article Google Scholar
Hueber, T., Chollet, G., Denby, B., Stone, M., Zouari, L.: Ouisper: corpus based synthesis driven by articulatory data. In: 16th International Congress of Phonetic Sciences, pp. 2193–2196 (2007)
Google Scholar
Yuhas, B.P., Goldstein, M.H. Jr., Sejnowski, T.J.: Integration of acoustic and visual speech signals using neural networks. IEEE Commun. Mag. 27(11), 65–71 (1989)
Article Google Scholar
Bregler, C., Konig, Y.: Eigenlips for robust speech recognition. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP-94, 19–22 Apr 1994, pp. 669–672 (1994)
Google Scholar
Silsbee, P.L., Bovik, A.C.: Computer lipreading for improved accuracy in automatic speech recognition. IEEE Trans. Speech Audio Process. 4(5), 337–351 (1996)
Article Google Scholar
Potamianos, G., Luettin, J., Neti, C.: Hierarchical discriminant features for audio-visual LVCSR. In: Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’01), vol. 161, pp. 165–168 (2001)
Google Scholar
Zhao, G., Barnard, M., Pietikainen, M.: Lipreading with local spatiotemporal descriptors. IEEE Trans. Multimed. 11(7), 1254–1265 (2009)
Article Google Scholar
Yuille, A.L., Hallinan, P.W., Cohen, D.S.: Feature extraction from faces using deformable templates. Int. J. Comput. Vis. 8(2), 99–111 (1992)
Article Google Scholar
Kass, M., Witkin, A., Terzopoulos, D.: Snakes: active contour models. Int. J. Comput. Vis. 1(4), 321–331 (1988)
Article Google Scholar
Papandreou, G., Katsamanis, A., Pitsikalis, V., Maragos, P.: Adaptive multimodal fusion by uncertainty compensation with application to audiovisual speech recognition. IEEE Trans. Audio Speech Lang. Process. 17(3), 423–435 (2009)
Article Google Scholar
Matthews, I., Cootes, T., Bangham, J., Cox, S., Harvey, R.: Extraction of visual features for lipreading. IEEE Trans. Pattern Anal. Mach. Intell. 24(2), 198–213 (2002)
Article Google Scholar
Goldschen, A.J., Garcia, O.N., Petajan, E.: Continuous optical automatic speech recognition by lipreading. In: Conference Record of the Twenty-Eighth Asilomar Conference on Signals, Systems and Computers, 31 Oct–2 Nov, pp. 572–577 (1994)
Google Scholar
Mase, K., Pentland, A.: Automatic lipreading by optical-flow analysis. Syst. Comput. Jpn. 22(6), 67–76 (1991)
Google Scholar
Iwano, K., Yoshinaga, T., Tamura, S., Furui, S.: Audio-visual speech recognition using lip information extracted from side-face images. EURASIP J. Audio Speech Music Process. 2007(1), 4 (2007)
Google Scholar
Rajavel, R., Sathidevi, P.S.: A novel algorithm for acoustic and visual classifiers decision fusion in audio-visual speech recognition system. Signal Process. 4(1), 23–37 (2010)
Google Scholar
Venkatesh Babu, R., Ramakrishnan, K.: Recognition of human actions using motion history information extracted from the compressed video. Image Vis. Comput. 22(8), 597–607 (2004)
Article Google Scholar
Weinland, D., Ronfard, R., Boyer, E.: Free viewpoint action recognition using motion history volumes. Comput. Vis. Image Underst. 104(2–3), 249–257 (2006)
Article Google Scholar
Valstar, M., Patras, I., Pantic, M.: Facial action unit recognition using temporal templates. In: 13th IEEE International Workshop on Robot and Human Interactive Communication, ROMAN, 20–22 Sept 2004, pp. 253–258 (2004)
Google Scholar
Ahad, M.: Analysis of motion self-occlusion problem due to motion overwriting for human activity recognition. J. Multimed. 5(1), 36–46 (2010)
Google Scholar
Young, S., Evermann, G., Kershaw, D., Moore, G., Odell, J., Ollason, D., Valtchev, V., Woodland, P.: The HTK Book, vol. 2. Entropic Cambridge Research Laboratory, Cambridge (1997)
Google Scholar
Su, Q., Silsbee, P.L.: Robust audiovisual integration using semicontinuous hidden Markov models. In: Proc. International Conference on Spoken Language Processing, Philadelphia, PA, pp. 42–45 (2002)
Google Scholar
Krone, G., Talk, B., Wichert, A., Palm, G.: Neural architectures for sensor fusion in speech recognition. In: Proc. European Tutorial Workshop on Audio-Visual Speech Processing, Rhodes, Greece, pp. 57–60 (1997)
Google Scholar
Yau, W., Kumar, D., Arjunan, S.: Voiceless speech recognition using dynamic visual speech features. In: Proc. of HCSNet Workshop on the Use of Vision in HCI, Canberra, Australia, pp. 93–101. Australian Computer Society, Inc., Canberra (2006)
Google Scholar
Duchnowski, P., Meier, U., Waibel, A.: See me, hear me: integrating automatic speech recognition and lip-reading. In: Proceedings of the International Conference on Spoken Language and Processing, Yokohama, Japan, pp. 547–550. Citeseer, Princeton (1994)
Google Scholar
Heckmann, M., Berthommier, F., Kroschel, K.: A hybrid ANN/HMM audio-visual speech recognition system. In: International Conference on Auditory-Visual Speech Processing, Aalborg, Denmark, pp. 189–194. Citeseer, Princeton (2001)
Google Scholar
Yau, W., Kant Kumar, D., Chinnadurai, T.: Lip-reading technique using spatio-temporal templates and support vector machines. In: Progress in Pattern Recognition, Image Analysis and Applications, pp. 610–617 (2008)
Chapter Google Scholar
Ganapathiraju, A., Hamaker, J., Picone, J.: Hybrid SVM/HMM architectures for speech recognition. In: International Conference on Spoken Language Processing, pp. 504–507. Citeseer, Princeton (2000)
Google Scholar
Gordan, M., Kotropoulos, C., Pitas, I.: A support vector machine-based dynamic network for visual speech recognition applications. EURASIP J. Appl. Signal Process. 2002(1), 1248–1259 (2002)
Article MATH Google Scholar
Sun, D., Roth, S., Lewis, J., Black, M.: Learning optical flow. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) Computer Vision—ECCV 2008. Lecture Notes in Computer Science, vol. 5304, pp. 83–97. Springer, Berlin (2008)
Chapter Google Scholar
Otsu, N.: A threshold selection method from gray-level histograms. IEEE Trans. Syst. Man Cybern. 9(1), 62–66 (1979)
Article MathSciNet Google Scholar
Khotanzad, A., Hong, Y.H.: Invariant image recognition by Zernike moments. IEEE Trans. Pattern Anal. Mach. Intell. 12(5), 489–497 (1990)
Article Google Scholar
Teh, C.H., Chin, R.T.: On image analysis by the methods of moments. IEEE Trans. Pattern Anal. Mach. Intell. 10(4), 496–513 (1988)
Article MATH Google Scholar
Chang, C.C., Lin, C.J.: LIBSVM: a Library for Support Vector Machines (2001)
Yau, W.C.: Video Analysis of Mouth Movement Using Motion Templates for Computer-Based Lip-Reading. RMIT University, Melbourne (2008)
Google Scholar

Download references

Author information

Authors and Affiliations

School of Electrical and Computer Engineering and Health Innovations Research Institute, RMIT University, Melbourne, Vic, 3001, Australia
Ayaz A. Shaikh & Dinesh K. Kumar
ISSNIP, Dept of Electrical and Electronic Engineering, The University of Melbourne, Melbourne, Vic, 3010, Australia
Jayavardhana Gubbi

Authors

Ayaz A. Shaikh
View author publications
You can also search for this author in PubMed Google Scholar
Dinesh K. Kumar
View author publications
You can also search for this author in PubMed Google Scholar
Jayavardhana Gubbi
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Ayaz A. Shaikh.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Shaikh, A.A., Kumar, D.K. & Gubbi, J. Automatic visual speech segmentation and recognition using directional motion history images and Zernike moments. Vis Comput 29, 969–982 (2013). https://doi.org/10.1007/s00371-012-0751-7

Download citation

Published: 13 September 2012
Issue Date: October 2013
DOI: https://doi.org/10.1007/s00371-012-0751-7

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Automatic visual speech segmentation and recognition using directional motion history images and Zernike moments

Abstract

Access this article

Similar content being viewed by others

Visual Speech Recognition Using Optical Flow and Hidden Markov Model

Temporal and Spatial Features for Visual Speech Recognition

A comparative study of English viseme recognition methods and algorithms

Notes

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Automatic visual speech segmentation and recognition using directional motion history images and Zernike moments

Abstract

Access this article

Similar content being viewed by others

Visual Speech Recognition Using Optical Flow and Hidden Markov Model

Temporal and Spatial Features for Visual Speech Recognition

A comparative study of English viseme recognition methods and algorithms

Notes

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation