
Automatic visual speech segmentation and recognition using directional motion history images and Zernike moments

Original Article · The Visual Computer

Abstract

An appearance-based visual speech recognition technique that uses only the video signal is presented. The technique is based on directional motion history images (DMHIs), an extension of the popular optical-flow approach to motion tracking. Zernike moments of each DMHI are computed to form the features for classification. The technique incorporates automatic temporal segmentation of isolated utterances, achieved by pair-wise pixel comparison of consecutive frames. A support vector machine performs the classification, and results are reported under a leave-one-out paradigm. Experimental results show that the proposed technique achieves better viseme recognition than other methods reported in the literature. Because both the motion tracking and the classification stages are fast, the method is suitable for real-time applications: it enables lip-movement-to-text command and control, operates in noisy environments, and can assist speech-impaired persons.
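For concreteness, the sketch below illustrates the pipeline the abstract describes: pair-wise pixel comparison for temporal segmentation, four directional motion history images accumulated from optical flow, Zernike-moment features, and an SVM evaluated under leave-one-out. It is a minimal Python sketch, not the authors' implementation: the Farnebäck flow estimator, the thresholds TAU, FLOW_THRESH and PIX_THRESH, the Zernike radius and degree, and the RBF kernel are all assumptions standing in for details given in the full paper.

```python
# Minimal sketch of the DMHI + Zernike-moment + SVM pipeline (assumptions noted above).
import cv2
import numpy as np
import mahotas.features
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

TAU = 255          # MHI memory length in decay steps (assumed)
FLOW_THRESH = 1.0  # minimum flow magnitude (px/frame) counted as motion (assumed)
PIX_THRESH = 10    # grey-level change counted by pair-wise pixel comparison (assumed)

def motion_score(prev, cur):
    """Pair-wise pixel comparison: fraction of pixels that changed between frames.
    Utterance boundaries are placed where this score rises above / falls back to noise level."""
    return float(np.mean(cv2.absdiff(prev, cur) > PIX_THRESH))

def build_dmhis(frames):
    """Accumulate four directional motion history images from a list of grayscale frames."""
    h, w = frames[0].shape
    dmhis = {d: np.zeros((h, w), np.float32) for d in ("up", "down", "left", "right")}
    for prev, cur in zip(frames, frames[1:]):
        # Dense optical flow; the paper's exact flow estimator may differ.
        flow = cv2.calcOpticalFlowFarneback(prev, cur, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        fx, fy = flow[..., 0], flow[..., 1]
        masks = {"right": fx > FLOW_THRESH, "left": fx < -FLOW_THRESH,
                 "down":  fy > FLOW_THRESH, "up":   fy < -FLOW_THRESH}
        for d, moving in masks.items():
            # Standard MHI update: refresh moving pixels to TAU, decay the rest.
            dmhis[d] = np.where(moving, float(TAU), np.maximum(dmhis[d] - 1.0, 0.0))
    return dmhis

def dmhi_features(dmhis, radius=60, degree=8):
    """Concatenate Zernike-moment magnitudes of the four DMHIs into one feature vector."""
    return np.concatenate([
        mahotas.features.zernike_moments(dmhis[d], radius, degree=degree)
        for d in ("up", "down", "left", "right")])

def loo_accuracy(X, y):
    """Leave-one-out accuracy of an SVM on per-utterance feature vectors X with labels y."""
    return cross_val_score(SVC(kernel="rbf"), X, y, cv=LeaveOneOut()).mean()
```

In this sketch, the DMHIs would be rebuilt for each segmented utterance, so every utterance yields a single feature vector; feature scaling before the RBF SVM usually helps but is omitted here for brevity.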



Notes

  1. The experimental procedure was approved by the Human Experiments Ethics Committee of RMIT University (ASEHAPP 12-10).


Author information

Corresponding author

Correspondence to Ayaz A. Shaikh.

Cite this article

Shaikh, A.A., Kumar, D.K. & Gubbi, J. Automatic visual speech segmentation and recognition using directional motion history images and Zernike moments. Vis Comput 29, 969–982 (2013). https://doi.org/10.1007/s00371-012-0751-7
