Naming multi-modal clusters to identify persons in TV broadcast

Multimedia Tools and Applications

Abstract

Person identification in TV broadcast is one of the main tools for indexing this type of video. The classical approach uses biometric face and speaker models, but covering a decent number of persons requires costly annotations. In recent years, several works have proposed using other sources of names to identify people, such as pronounced names and written names. The main idea is to form face/speaker clusters based on their similarities and to propagate these names onto the clusters. In this paper, we propose a method that takes advantage of written names during the diarization process, in order both to name clusters and to prevent the fusion of two clusters named differently. First, we extract written names with the LOOV tool (Poignant et al. 2012); these names are associated with their co-occurring speaker turns / face tracks. Simultaneously, we build a multi-modal matrix of distances between speaker turns and face tracks. Agglomerative clustering is then performed on this matrix under the constraint that clusters associated with different names are never merged. We also integrate the predictions of a few biometric models (anchors, some journalists) to directly identify speaker turns / face tracks before the clustering process. Our approach was evaluated on the REPERE corpus and reached an F-measure of 68.2 % for speaker identification and 60.2 % for face identification. Adding a few biometric models improves the results to 82.4 % and 65.6 % for speaker and face identification, respectively. By comparison, a mono-modal, supervised person identification system with 706 speaker models trained on matching development data and additional TV and radio data yields a 67.8 % F-measure, while 908 face models yield only a 30.5 % F-measure.
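
The constrained clustering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the average-linkage rule, the stopping threshold, the function name `constrained_clustering`, and the toy distance values are all assumptions made for illustration. Each item stands for a speaker turn or a face track, and `names` holds the written name co-occurring with it, or `None` when there is none.

```python
from itertools import combinations


def constrained_clustering(dist, names, threshold):
    """Agglomerative clustering that never merges clusters carrying different names.

    dist: symmetric matrix (list of lists) of distances between items.
    names: per-item written name, or None when no name co-occurs.
    threshold: hypothetical stopping distance (an assumption of this sketch).
    Returns a list of (member_indices, name) pairs.
    """
    clusters = [([i], names[i]) for i in range(len(names))]

    def linkage(a, b):
        # Average linkage, chosen here for illustration only.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best = None
        for (p, (mem_p, name_p)), (q, (mem_q, name_q)) in combinations(
            enumerate(clusters), 2
        ):
            # Constraint: two clusters named differently cannot be merged.
            if name_p is not None and name_q is not None and name_p != name_q:
                continue
            d = linkage(mem_p, mem_q)
            if best is None or d < best[0]:
                best = (d, p, q)
        # Stop when no merge is allowed or the closest pair is too far apart.
        if best is None or best[0] > threshold:
            break
        _, p, q = best
        # The merged cluster inherits whichever name is available.
        merged = (clusters[p][0] + clusters[q][0], clusters[p][1] or clusters[q][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)] + [merged]
    return clusters


# Toy data: items 0 and 2 are the closest pair but carry different names.
dist = [
    [0.00, 0.10, 0.05, 0.90],
    [0.10, 0.00, 0.80, 0.90],
    [0.05, 0.80, 0.00, 0.15],
    [0.90, 0.90, 0.15, 0.00],
]
names = ["A", None, "B", None]
result = constrained_clustering(dist, names, threshold=0.5)
print(result)  # [([0, 1], 'A'), ([2, 3], 'B')]
```

In this toy case an unconstrained clustering would merge items 0 and 2 first, since they are the closest pair. The name constraint blocks that merge, so each unnamed item is attached to its nearest named neighbour instead and inherits that name.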

Notes

  1. http://www.afcp-parole.org/etape.html

References

  1. Barras C, Zhu X, Meignier S, Gauvain J-L (2006) Multi-stage speaker diarization of broadcast news. IEEE Trans Audio, Speech Language Processing 14(5):1505–1512

  2. Bendris M, Favre B, Charlet D, Damnati G, Auguste R, Martinet J, Senay G (2013) Unsupervised face identification in TV content using audio-visual sources. In: Proceedings of the 11th international workshop on content-based multimedia indexing (CBMI), pp 243–249

  3. Béchet F, Bendris M, Charlet D, Damnati G, Favre B, Rouvier M, Auguste R, Bigot B, Dufour R, Fredouille C, Linares G, Martinet J, Senay G, Tirilly P (2014) Multimodal understanding for person recognition in video broadcasts. In: 15th annual conference of the international speech communication association (INTERSPEECH)

  4. Bredin H, Poignant J (2013) Integer Linear Programming for Speaker Diarization and Cross-Modal Identification in TV Broadcast. In: the 14th annual conference of the international speech communication association, (INTERSPEECH)

  5. Bredin H, Poignant J, Tapaswi M, Fortier G, Le VB, Napoleon T, Gao H, Barras C, Rosset S, Besacier L, Verbeek J, Quénot G, Jurie F, Kemal Ekenel H (2012) Fusion of speech, faces and text for person identification in TV broadcast. In: Workshop on information fusion in computer vision for concept recognition, ECCV-IFCVCR, pp 385–394

  6. Bredin H, Poignant J, Fortier G, Tapaswi M, Le VB, Sarkar A, Barras C, Rosset S, Roy A, Yang Q, Gao H, Mignon A, Verbeek J, Besacier L, Quénot G, Kemal Ekenel H, Stiefelhagen R (2013) QCompere at REPERE 2013. In: First workshop on speech, language and audio in multimedia - the 14th annual conference of the international speech communication association, INTERSPEECH-SLAM

  7. Bredin H, Roy A, Le VB, Barras C (2014) Person instance graphs for mono-, cross- and multi-modal person recognition in multimedia data: application to speaker identification in TV broadcast. In: International journal of multimedia information retrieval

  8. Bäuml M, Bernardin K, Fischer M, Ekenel HK, Stiefelhagen R (2010) Multi-pose face recognition for person retrieval in camera networks. In: 7th International conference on advanced video and signal-based surveillance, AVSS, pp 441–447

  9. Canseco-Rodriguez L, Lamel L, Gauvain J-L (2004) Speaker diarization from speech transcripts. In: the 5th annual conference of the international speech communication association, INTERSPEECH

  10. Canseco L, Lamel L, Gauvain J-L (2005) A comparative study using manual and automatic transcriptions for diarization. In: IEEE workshop on automatic speech recognition and understanding, pp 415–419

  11. Chen SS, Gopalakrishnan PS (1998) Speaker, environment and channel change detection and clustering via the Bayesian information criterion. In: DARPA broadcast news transcription and understanding workshop, pp 127–132

  12. Estève Y, Meignier S, Deléglise P, Mauclair J (2007) Extracting true speaker identities from transcriptions. In: the 8th annual conference of the international speech communication association, INTERSPEECH, pp 2601–2604

  13. Favre B, Damnati G, Béchet F, Bendris M, Charlet D, Auguste R, Ayache S, Bigot B, Delteil A, Dufour R, Fredouille C, Linares G, Martinet J, Senay G, Tirilly P (2013) PERCOLI: a person identification system for the 2013 REPERE challenge. In: First workshop on speech, language and audio in multimedia - the 14th annual conference of the international speech communication association, INTERSPEECH

  14. Gay P, Dupuy G, Lailler C, Odobez J-M, Meignier S, Deléglise P (2014) Comparison of two methods for unsupervised person identification in TV shows. In: 12th international workshop on content-based multimedia indexing (CBMI)

  15. Giraudel A, Carré M, Mapelli V, Kahn J, Galibert O, Quintard L (2012) The REPERE corpus: a multimodal corpus for person recognition. In: the 8th international conference on language resources and evaluation, LREC

  16. Guillaumin M, Verbeek J, Schmid C (2009) Is that you? Metric learning approaches for face identification. In: the IEEE 12th international conference on computer vision, pp 498–505

  17. Houghton R (1999) Named faces: putting names to faces. IEEE Intell Syst 14:45–50

  18. Jousse V, Petit-Renaud S, Meignier S, Estève Y, Jacquin C (2009) Automatic named identification of speakers using diarization and ASR systems

  19. Kahn J, Galibert O, Quintard L, Carré M, Giraudel A, Joly P (2012) A presentation of the REPERE challenge. In: the 10th international workshop on content-based multimedia indexing (CBMI), pp 1–6

  20. Khoury E, Sénac C, Joly P (2012) Audiovisual diarization of people in video content. In: Multimedia tools and applications

  21. Le VB, Barras C, Ferràs M (2010) On the use of GSV-SVM for speaker diarization and tracking. In: Odyssey - the speaker and language recognition workshop, pp 146–150

  22. Mauclair J, Meignier S, Estève Y (2006) Speaker diarization: about whom the speaker is talking? In: IEEE Odyssey 2006 - the speaker and language recognition workshop

  23. Petit-Renaud S, Jousse V, Meignier S, Estève Y (2010) Identification of speakers by name using belief functions. In: the 13th international conference on information processing and management of uncertainty in knowledge-based systems, theory and methods, IPMU, pp 179–188

  24. Pham PT, Moens M-F, Tuytelaars T (2010) Naming persons in news video with label propagation. In: IEEE international conference on multimedia and expo, ICME, pp 1528–1533

  25. Pham PT, Tuytelaars T, Moens M-F (2011) Naming people in news videos with label propagation. IEEE MultiMedia 18(3):44–55

  26. Poignant J, Besacier L, Quénot G, Thollard F (2012) From text detection in videos to person identification. In: IEEE international conference on multimedia and expo, ICME, pp 854–859

  27. Poignant J, Bredin H, Le VB, Besacier L, Barras C, Quénot G (2012) Unsupervised speaker identification using overlaid texts in TV broadcast. In: the 13th annual conference of the international speech communication association, INTERSPEECH, pp 2650–2653

  28. Poignant J, Besacier L, Quénot G (2013) Nommage non-supervisé des personnes dans les émissions de télévision: une revue du potentiel de chaque modalité. In: la 10ème cOnférence en recherche d’Information et applications, CORIA

  29. Poignant J, Besacier L, Le VB, Rosset S, Quénot G (2013) Unsupervised naming of speakers in broadcast TV: using written names, pronounced names or both? In: the 14th annual conference of the international speech communication association, INTERSPEECH

  30. Poignant J, Bredin H, Besacier L, Quénot G, Barras C (2013) Towards a better integration of written names for unsupervised speakers identification in videos. In: First workshop on speech, language and audio in multimedia - the 14th annual conference of the international speech communication association, INTERSPEECH-SLAM

  31. Poignant J, Besacier L, Quénot G (2014) Nommage non-supervisé des personnes dans les émissions de télévision: utilisation des noms écrits, des noms prononcés ou des deux? In: Documents numériques, pp 37–60

  32. Rouvier M, Meignier S (2012) A Global Optimization Framework For Speaker Diarization. In: Odyssey - the speaker and language recognition workshop

  33. Rouvier M, Favre B, Bendris M, Charlet D, Damnati G (2014) Scene understanding for identifying persons in TV shows: beyond face authentication. In: 12th international workshop on content-based multimedia indexing (CBMI)

  34. Sato T, Kanade T, Hughes TK, Smith MA, Satoh S (1999) Video OCR: Indexing digital news libraries by recognition of superimposed caption. In: ACM Multimedia Systems

  35. Satoh S, Nakamura Y, Kanade T (1999) Name-It: naming and detecting faces in news videos. IEEE Multimedia 6:22–35

  36. Tranter SE (2006) Who really spoke when? finding speaker turns and identities in broadcast news audio. In: the 31st IEEE international conference on acoustics, speech and signal processing, ICASSP, pp 1013–1016

  37. Uřičář M, Franc V, Hlaváč V (2012) Detector of Facial Landmarks Learned by the Structured Output SVM. In: the 7th international conference on computer vision theory and applications, pp 547–556

  38. Yang J, Hauptmann AG (2004) Naming every individual in news video monologues

  39. Yang J, Yan R, Hauptmann AG (2005) Multiple instance learning for labeling faces in broadcasting news video. In: the 13th ACM international conference on multimedia, ACMMM, pp 31–40

Acknowledgments

This work was partly carried out within the Quaero Program and the QCompere project, funded respectively by OSEO (French State agency for innovation) and ANR (French national research agency).

Author information

Corresponding author

Correspondence to Johann Poignant.

About this article

Cite this article

Poignant, J., Fortier, G., Besacier, L. et al. Naming multi-modal clusters to identify persons in TV broadcast. Multimed Tools Appl 75, 8999–9023 (2016). https://doi.org/10.1007/s11042-015-2723-1

  • DOI: https://doi.org/10.1007/s11042-015-2723-1
