Audio-visual Identity Verification: An Introductory Overview

Abboud, Bouchra; Bredin, Hervé; Aversano, Guido; Chollet, Gérard

doi:10.1007/978-3-540-71505-4_8

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 4391)

Abstract

Identity is commonly verified by looking at a person's face and listening to their speech. Automatic means of achieving this verification have been studied for several decades. Indeed, a talking face offers many features for robust identity verification. The current deployment of videophones creates new opportunities for secure access to remote servers (banking, certification, call centers, etc.). Synchrony between the speech signal and lip movements is a necessary condition for checking that the observed talking face has not been manipulated and/or synthesized. This overview addresses face, speaker and talking-face verification, as well as face and voice transformation techniques. It is demonstrated that a dedicated impostor needs only limited information from a client to fool state-of-the-art audio-visual identity verification systems.
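
The synchrony check mentioned in the abstract can be illustrated with a minimal sketch. It is not the chapter's method: it assumes that two per-frame features are already available, an audio energy value and a mouth-opening value, and scores their agreement with a plain Pearson correlation searched over a few frame lags. The names `pearson`, `synchrony_score`, `audio_energy`, `mouth_opening` and `max_lag` are hypothetical and chosen only for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences of at least 2 samples")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant track carries no synchrony information
    return cov / math.sqrt(var_x * var_y)


def synchrony_score(audio_energy, mouth_opening, max_lag=2):
    """Best correlation between the two tracks over lags of -max_lag..max_lag frames.

    Both inputs are assumed to be sampled at the same frame rate and to have
    the same length (illustrative assumption, not taken from the chapter).
    """
    n = len(mouth_opening)
    best = -1.0
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, v = audio_energy[lag:], mouth_opening[:n - lag]
        else:
            a, v = audio_energy[:lag], mouth_opening[-lag:]
        if len(a) >= 2:
            best = max(best, pearson(a, v))
    return best


# Toy data: the mouth opens when the audio is loud (live talking face)
# versus a still face played over the same audio.
energy = [0.1, 0.9, 0.2, 0.8, 0.1, 0.7, 0.3, 0.9]
mouth_live = [0.0, 1.0, 0.1, 0.9, 0.0, 0.8, 0.2, 1.0]
mouth_still = [0.5] * 8

live_score = synchrony_score(energy, mouth_live)
still_score = synchrony_score(energy, mouth_still)
```

A high score is only consistent with a live talking face; as the abstract notes, synchrony is a necessary condition, so a score like this would be one input to a decision rather than a proof of authenticity.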

Author information

Authors and Affiliations

CNRS-LTCI, GET-ENST, 46 rue Barrault, 75013 Paris, France
Bouchra Abboud, Hervé Bredin, Guido Aversano & Gérard Chollet

Editor information

Yannis Stylianou, Marcos Faundez-Zanuy, Anna Esposito

About this chapter

Cite this chapter

Abboud, B., Bredin, H., Aversano, G., Chollet, G. (2007). Audio-visual Identity Verification: An Introductory Overview. In: Stylianou, Y., Faundez-Zanuy, M., Esposito, A. (eds) Progress in Nonlinear Speech Processing. Lecture Notes in Computer Science, vol 4391. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-71505-4_8

DOI: https://doi.org/10.1007/978-3-540-71505-4_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-71503-0
Online ISBN: 978-3-540-71505-4
eBook Packages: Computer Science (R0)
