
Audio-visual Identity Verification: An Introductory Overview

  • Chapter
Progress in Nonlinear Speech Processing

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 4391)

Abstract

Verification of identity is commonly achieved by looking at a person's face and listening to his or her speech. Automatic means of performing this verification have been studied for several decades. Indeed, a talking face offers many features on which a robust verification of identity can be based. The current deployment of videophones creates new opportunities for secure access to remote servers (banking, certification, call centers, etc.). Synchrony between the speech signal and the lip movements is a necessary condition for checking that the observed talking face has not been manipulated and/or synthesized. This overview addresses face, speaker and talking-face verification, as well as face and voice transformation techniques. It is demonstrated that a dedicated impostor needs only limited information about a client to fool state-of-the-art audio-visual identity verification systems.
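
To make the synchrony idea concrete, the sketch below correlates an audio log-energy contour with a mouth-motion contour over a small range of lags; a genuine recording is expected to show a strong correlation near zero lag, while a dubbed or synthesized face tends to score lower. This is only an illustrative baseline under the assumption that both contours are already extracted and frame-aligned; the function name `av_synchrony_score` and the simple lagged-correlation measure are hypothetical choices, not the specific synchrony measures surveyed in the chapter.

```python
import numpy as np

def av_synchrony_score(audio_energy, mouth_motion, max_lag=5):
    """Hypothetical lagged-correlation liveness score.

    audio_energy : 1-D array, short-time log-energy of the speech signal,
                   one value per video frame.
    mouth_motion : 1-D array, e.g. mean absolute pixel difference inside a
                   tracked mouth region, one value per video frame.
    Returns the largest absolute Pearson correlation over lags in
    [-max_lag, +max_lag] frames, and the lag at which it occurs.
    """
    a = np.asarray(audio_energy, dtype=float)
    v = np.asarray(mouth_motion, dtype=float)
    n = min(len(a), len(v))
    a, v = a[:n], v[:n]

    best_corr, best_lag = 0.0, 0
    for lag in range(-max_lag, max_lag + 1):
        # Shift the audio contour by `lag` frames relative to the video contour.
        if lag >= 0:
            x, y = a[lag:], v[:n - lag]
        else:
            x, y = a[:n + lag], v[-lag:]
        if len(x) < 2 or x.std() == 0 or y.std() == 0:
            continue
        r = abs(np.corrcoef(x, y)[0, 1])
        if r > best_corr:
            best_corr, best_lag = r, lag
    return best_corr, best_lag
```

In a complete system of the kind the chapter discusses, such a liveness score would be combined with face and speaker verification scores, and the accept/reject threshold would be tuned on development data.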

Author information

Authors: B. Abboud, H. Bredin, G. Aversano, G. Chollet

Editor information

Editors: Yannis Stylianou, Marcos Faundez-Zanuy, Anna Esposito

Copyright information

© 2007 Springer Berlin Heidelberg

About this chapter

Cite this chapter

Abboud, B., Bredin, H., Aversano, G., Chollet, G. (2007). Audio-visual Identity Verification: An Introductory Overview. In: Stylianou, Y., Faundez-Zanuy, M., Esposito, A. (eds) Progress in Nonlinear Speech Processing. Lecture Notes in Computer Science, vol 4391. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-71505-4_8

  • DOI: https://doi.org/10.1007/978-3-540-71505-4_8

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-71503-0

  • Online ISBN: 978-3-540-71505-4

  • eBook Packages: Computer Science (R0)
