Abstract
Identifying persons in TV broadcasts is one of the main tools for indexing this type of video. The classical approach relies on biometric face and speaker models, but covering a reasonable number of persons requires costly annotations. In recent years, several works have proposed using other sources of names to identify people, such as pronounced names and written names. The main idea is to form face/speaker clusters based on their similarities and to propagate these names onto the clusters. In this paper, we propose a method that takes advantage of written names during the diarization process, both to name clusters and to prevent the fusion of two clusters with different names. First, we extract written names with the LOOV tool (Poignant et al. 2012); these names are associated with their co-occurring speaker turns / face tracks. Simultaneously, we build a multi-modal matrix of distances between speaker turns and face tracks. Agglomerative clustering is then performed on this matrix under the constraint that clusters associated with different names are never merged. We also integrate the predictions of a few biometric models (anchors, some journalists) to directly identify speaker turns / face tracks before the clustering process. Our approach was evaluated on the REPERE corpus and reached an F-measure of 68.2 % for speaker identification and 60.2 % for face identification. Adding a few biometric models improves the results to 82.4 % and 65.6 % for speaker and face identification, respectively. By comparison, a mono-modal, supervised person identification system with 706 speaker models trained on matching development data and additional TV and radio data provides 67.8 % F-measure, while 908 face models provide only 30.5 % F-measure.
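The name-constrained clustering step can be illustrated with a minimal sketch. It is not the implementation evaluated in the paper: the average-linkage criterion, the fixed stopping threshold, the function name `name_constrained_clustering` and the toy distance values are all assumptions made for illustration. Only the constraint itself (never merge two clusters carrying different names, and let a named cluster pass its name to an unnamed one) follows the description above.

```python
from itertools import combinations


def name_constrained_clustering(dist, names, threshold):
    """Agglomerative clustering that never merges differently named clusters.

    dist      -- square symmetric matrix (list of lists) of distances between
                 items (speaker turns and face tracks alike)
    names     -- names[i] is the written name attached to item i, or None
    threshold -- hypothetical stopping distance (an assumption of this sketch)
    Returns a list of (name, sorted member indices) tuples.
    """
    clusters = [(names[i], [i]) for i in range(len(names))]
    while True:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            name_a, members_a = clusters[a]
            name_b, members_b = clusters[b]
            # Constraint: two clusters with different names cannot be merged.
            if name_a is not None and name_b is not None and name_a != name_b:
                continue
            # Average linkage between the two clusters (assumed criterion).
            total = sum(dist[i][j] for i in members_a for j in members_b)
            d = total / (len(members_a) * len(members_b))
            if d <= threshold and (best is None or d < best[0]):
                best = (d, a, b)
        if best is None:
            break  # no admissible pair is close enough
        _, a, b = best
        name_a, members_a = clusters[a]
        name_b, members_b = clusters[b]
        # The merged cluster inherits whichever name is available.
        merged_name = name_a if name_a is not None else name_b
        merged = (merged_name, sorted(members_a + members_b))
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters


# Toy example: items 0 and 2 are the closest pair but carry different names.
toy_dist = [
    [0.00, 0.10, 0.05, 0.90],
    [0.10, 0.00, 0.90, 0.90],
    [0.05, 0.90, 0.00, 0.20],
    [0.90, 0.90, 0.20, 0.00],
]
toy_names = ["Alice", None, "Bob", None]
toy_result = name_constrained_clustering(toy_dist, toy_names, threshold=0.5)
```

In this toy case, unconstrained clustering would merge items 0 and 2 first because they are the closest pair; the constraint blocks that merge, so each unnamed item joins the named cluster it is closest to and inherits that name.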
References
Barras C, Zhu X, Meignier S, Gauvain J-L (2006) Multi-stage speaker diarization of broadcast news. IEEE Trans Audio, Speech Language Processing 14(5):1505–1512
Bendris M, Favre B, Charlet D, Damnati G, Auguste R, Martinet J, Senay G (2013) Unsupervised face identification in TV content using audio-visual sources. In: Proceedings of the 11th international workshop on content-based multimedia indexing (CBMI), pp 243–249
Béchet F, Bendris M, Charlet D, Damnati G, Favre B, Rouvier M, Auguste R, Bigot B, Dufour R, Fredouille C, Linares G, Martinet J, Senay G, Tirilly P (2014) Multimodal understanding for person recognition in video broadcasts. In: 15th annual conference of the international speech communication association (INTERSPEECH)
Bredin H, Poignant J (2013) Integer Linear Programming for Speaker Diarization and Cross-Modal Identification in TV Broadcast. In: the 14th annual conference of the international speech communication association (INTERSPEECH)
Bredin H, Poignant J, Tapaswi M, Fortier G, Le VB, Napoleon T, Gao H, Barras C, Rosset S, Besacier L, Verbeek J, Quénot G, Jurie F, Kemal Ekenel H (2012) Fusion of speech, faces and text for person identification in TV broadcast. In: Workshop on information fusion in computer vision for concept recognition, ECCV-IFCVCR, pp 385–394
Bredin H, Poignant J, Fortier G, Tapaswi M, Le VB, Sarkar A, Barras C, Rosset S, Roy A, Yang Q, Gao H, Mignon A, Verbeek J, Besacier L, Quénot G, Kemal Ekenel H, Stiefelhagen R (2013) QCompere at REPERE 2013. In: First workshop on speech, language and audio in multimedia - the 14th annual conference of the international speech communication association, INTERSPEECH-SLAM
Bredin H, Roy A, Le VB, Barras C (2014) Person instance graphs for mono-, cross- and multi-modal person recognition in multimedia data: application to speaker identification in TV broadcast. In: International journal of multimedia information retrieval
Bäuml M, Bernardin K, Fischer M, Ekenel HK, Stiefelhagen R (2010) Multi-pose face recognition for person retrieval in camera networks. In: 7th International conference on advanced video and signal-based surveillance, AVSS, pp 441–447
Canseco-Rodriguez L, Lamel L, Gauvain J-L (2004) Speaker diarization from speech transcripts. In: the 5th annual conference of the international speech communication association, INTERSPEECH
Canseco L, Lamel L, Gauvain J-L (2005) A comparative study using manual and automatic transcriptions for diarization. In: IEEE workshop on automatic speech recognition and understanding, pp 415–419
Chen SS, Gopalakrishnan PS (1998) Speaker, environment and channel change detection and clustering via the Bayesian information criterion. In: DARPA broadcast news transcription and understanding workshop, pp 127–132
Estève Y, Meignier S, Deléglise P, Mauclair J (2007) Extracting true speaker identities from transcriptions. In: the 8th annual conference of the international speech communication association, INTERSPEECH, pp 2601–2604
Favre B, Damnati G, Béchet F, Bendris M, Charlet D, Auguste R, Ayache S, Bigot B, Delteil A, Dufour R, Fredouille C, Linares G, Martinet J, Senay G, Tirilly P (2013) PERCOLI: a person identification system for the 2013 REPERE challenge. In: First workshop on speech, language and audio in multimedia - the 14th annual conference of the international speech communication association, INTERSPEECH
Gay P, Dupuy G, Lailler C, Odobez J-M, Meignier S, Deléglise P (2014) Comparison of two methods for unsupervised person identification in TV shows. In: 12th international workshop on content-based multimedia indexing (CBMI)
Giraudel A, Carré M, Mapelli V, Kahn J, Galibert O, Quintard L (2012) The REPERE corpus: a multimodal corpus for person recognition. In: the 8th international conference on language resources and evaluation, LREC
Guillaumin M, Verbeek J, Schmid C (2009) Is that you? Metric learning approaches for face identification. In: the IEEE 12th international conference on computer vision, pp 498–505
Houghton R (1999) Named faces: putting names to faces. IEEE Intell Syst 14:45–50
Jousse V, Petit-Renaud S, Meignier S, Estève Y, Jacquin C (2009) Automatic named identification of speakers using diarization and ASR systems
Kahn J, Galibert O, Quintard L, Carré M, Giraudel A, Joly P (2012) A presentation of the REPERE challenge. In: the 10th international workshop on content-based multimedia indexing (CBMI), pp 1–6
Khoury E, Sénac C, Joly P (2012) Audiovisual diarization of people in video content. In: Multimedia tools and applications
Le VB, Barras C, Ferràs M (2010) On the use of GSV-SVM for speaker diarization and tracking. In: Odyssey - the speaker and language recognition workshop, pp 146–150
Mauclair J, Meignier S, Estève Y (2006) Speaker diarization: about whom the speaker is talking? In: IEEE Odyssey 2006 - the speaker and language recognition workshop
Petit-Renaud S, Jousse V, Meignier S, Estève Y (2010) Identification of speakers by name using belief functions. In: the 13th international conference on information processing and management of uncertainty in knowledge-based systems, theory and methods, IPMU, pp 179–188
Pham PT, Moens M-F, Tuytelaars T (2010) Naming persons in news video with label propagation. In: IEEE international conference on Multimedia and Expo, ICME, pp 1528–1533
Pham PT, Tuytelaars T, Moens M-F (2011) Naming people in news videos with label propagation. IEEE MultiMedia 18(3):44–55
Poignant J, Besacier L, Quénot G, Thollard F (2012) From text detection in videos to person identification. In: IEEE international conference on multimedia and expo, ICME, pp 854–859
Poignant J, Bredin H, Le VB, Besacier L, Barras C, Quénot G (2012) Unsupervised speaker identification using overlaid texts in TV broadcast. In: the 13th annual conference of the international speech communication association, INTERSPEECH, pp 2650–2653
Poignant J, Besacier L, Quénot G (2013) Nommage non-supervisé des personnes dans les émissions de télévision: une revue du potentiel de chaque modalité. In: la 10ème cOnférence en recherche d’Information et applications, CORIA
Poignant J, Besacier L, Le VB, Rosset S, Quénot G (2013) Unsupervised naming of speakers in broadcast TV: using written names, pronounced names or both? In: the 14th annual conference of the international speech communication association, INTERSPEECH
Poignant J, Bredin H, Besacier L, Quénot G, Barras C (2013) Towards a better integration of written names for unsupervised speakers identification in videos. In: First workshop on speech, language and audio in multimedia - the 14th annual conference of the international speech communication association, INTERSPEECH-SLAM
Poignant J, Besacier L, Quénot G (2014) Nommage non-supervisé des personnes dans les émissions de télévision: utilisation des noms écrits, des noms prononcés ou des deux? In: Documents numériques, pp 37–60
Rouvier M, Meignier S (2012) A Global Optimization Framework For Speaker Diarization. In: Odyssey - the speaker and language recognition workshop
Rouvier M, Favre B, Bendris M, Charlet D, Damnati G (2014) Scene understanding for identifying persons in TV shows: beyond face authentication. In: 12th international workshop on content-based multimedia indexing (CBMI)
Sato T, Kanade T, Hughes TK, Smith MA, Satoh S (1999) Video OCR: Indexing digital news libraries by recognition of superimposed caption. In: ACM Multimedia Systems
Satoh S, Nakamura Y, Kanade T (1999) Name-It: naming and detecting faces in news videos. IEEE Multimedia 6:22–35
Tranter SE (2006) Who really spoke when? finding speaker turns and identities in broadcast news audio. In: the 31st IEEE international conference on acoustics, speech and signal processing, ICASSP, pp 1013–1016
Uřičář M, Franc V, Hlaváč V (2012) Detector of Facial Landmarks Learned by the Structured Output SVM. In: the 7th international conference on computer vision theory and applications, pp 547–556
Yang J, Hauptmann AG (2004) Naming every individual in news video monologues
Yang J, Yan R, Hauptmann AG (2005) Multiple instance learning for labeling faces in broadcasting news video. In: the 13th ACM international conference on multimedia, ACMMM, pp 31–40
Acknowledgments
This work was partly carried out within the Quaero Program and the QCompere project, funded respectively by OSEO (French State agency for innovation) and ANR (French national research agency).
Cite this article
Poignant, J., Fortier, G., Besacier, L. et al. Naming multi-modal clusters to identify persons in TV broadcast. Multimed Tools Appl 75, 8999–9023 (2016). https://doi.org/10.1007/s11042-015-2723-1
DOI: https://doi.org/10.1007/s11042-015-2723-1