Abstract
Identity is commonly verified by looking at a person's face and listening to his or her speech. Automatic means of achieving this verification have been studied for several decades. Indeed, a talking face offers many features for robust identity verification. The current deployment of videophones opens new opportunities for secure access to remote servers (banking, certification, call centers, etc.). Synchrony between the speech signal and lip movements is a necessary condition for checking that the observed talking face has not been manipulated and/or synthesized. This overview addresses face, speaker and talking face verification, as well as face and voice transformation techniques. It is demonstrated that a dedicated impostor needs only limited information from a client to fool state-of-the-art audio-visual identity verification systems.
Copyright information
© 2007 Springer Berlin Heidelberg
About this chapter
Cite this chapter
Abboud, B., Bredin, H., Aversano, G., Chollet, G. (2007). Audio-visual Identity Verification: An Introductory Overview. In: Stylianou, Y., Faundez-Zanuy, M., Esposito, A. (eds) Progress in Nonlinear Speech Processing. Lecture Notes in Computer Science, vol 4391. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-71505-4_8
DOI: https://doi.org/10.1007/978-3-540-71505-4_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-71503-0
Online ISBN: 978-3-540-71505-4
eBook Packages: Computer Science (R0)