Best practices for learning video concept detectors from social media examples


Abstract

Learning video concept detectors from social media sources, such as Flickr images and YouTube videos, has the potential to address a wide variety of concept queries for video search. While the potential has been recognized by many, and progress on the topic has been impressive, we argue that key questions, crucial to knowing how to learn effective video concept detectors from social media examples, remain open. As an initial attempt to answer these questions, we conduct an experimental study using a video search engine capable of learning concept detectors from social media examples, be it socially tagged videos or socially tagged images. Within the video search engine we investigate three strategies for selecting positive examples, three strategies for selecting negative examples, and three learning strategies. The performance is evaluated on the challenging TRECVID 2012 benchmark, consisting of 600 h of Internet video. From the experiments we derive four best practices: (1) tagged images are a better source for learning video concepts than tagged videos, (2) selecting tag-relevant positive training examples is always beneficial, (3) selecting relevant negative examples is advantageous and should be treated differently for video and image sources, and (4) selecting relevant training examples before learning is better than incorporating their relevance during the learning process. These best practices give our video search engine state-of-the-art performance in the TRECVID 2013 benchmark for concept detection without manually provided annotations.
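To make best practices (2)-(4) concrete, the minimal Python sketch below illustrates a select-then-learn pipeline: candidate positives are scored by a neighbor-voting tag-relevance measure, only relevant positives and negatives are kept, and an ordinary classifier is trained on the selected data afterwards. The features, the k=50 neighborhood, the vote-minus-prior scoring, and the linear SVM are illustrative assumptions, not the exact components of the paper's video search engine.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors
    from sklearn.svm import LinearSVC

    def tag_relevance(features, has_tag, k=50):
        # Neighbor-voting relevance (an assumed measure): an example scores
        # high when many of its k visual neighbors also carry the target tag,
        # relative to the chance level implied by the tag's overall frequency.
        nn = NearestNeighbors(n_neighbors=k + 1).fit(features)
        _, idx = nn.kneighbors(features)          # idx[:, 0] is the example itself
        votes = has_tag[idx[:, 1:]].sum(axis=1)   # tagged neighbors, self excluded
        prior = k * has_tag.mean()                # expected votes by chance
        return votes - prior

    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 128))              # stand-in visual features
    tagged = rng.random(2000) < 0.2               # which examples carry the tag

    # Best practice (2): keep only tag-relevant positives.
    rel = tag_relevance(X, tagged.astype(float))
    pos = tagged & (rel > 0)

    # Best practice (3): draw negatives from the untagged pool; the paper
    # treats this selection differently for image and video sources
    # (not modeled in this sketch).
    neg = ~tagged

    # Best practice (4): select first, then learn with a standard classifier,
    # rather than weighting examples by relevance inside the learning objective.
    keep = pos | neg
    y = np.where(pos, 1, 0)[keep]
    clf = LinearSVC().fit(X[keep], y)

Swapping LinearSVC for a kernel SVM, or thresholding rel differently, changes only the learning strategy; the point of the sketch is that relevance filtering happens before, not during, training.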



Acknowledgments

This research is supported by the STW STORY project, the Dutch national program COMMIT, the Chinese NSFC (No. 61303184), SRFDP (No. 20130004120006), the Fundamental Research Funds for the Central Universities, the Research Funds of Renmin University of China (No. 14XNLQ01), and by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number D11PC20067. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the U.S. Government.

Author information


Corresponding author

Correspondence to Xirong Li.


About this article


Cite this article

Kordumova, S., Li, X. & Snoek, C.G.M. Best practices for learning video concept detectors from social media examples. Multimed Tools Appl 74, 1291–1315 (2015). https://doi.org/10.1007/s11042-014-2056-5

