DOI: 10.1145/1718487.1718524
research-article

Learning similarity metrics for event identification in social media

Published: 04 February 2010

ABSTRACT

Social media sites (e.g., Flickr, YouTube, and Facebook) are a popular distribution outlet for users looking to share their experiences and interests on the Web. These sites host substantial amounts of user-contributed materials (e.g., photographs, videos, and textual content) for a wide variety of real-world events of different type and scale. By automatically identifying these events and their associated user-contributed social media documents, which is the focus of this paper, we can enable event browsing and search in state-of-the-art search engines. To address this problem, we exploit the rich "context" associated with social media content, including user-provided annotations (e.g., title, tags) and automatically generated information (e.g., content creation time). Using this rich context, which includes both textual and non-textual features, we can define appropriate document similarity metrics to enable online clustering of media to events. As a key contribution of this paper, we explore a variety of techniques for learning multi-feature similarity metrics for social media documents in a principled manner. We evaluate our techniques on large-scale, real-world datasets of event images from Flickr. Our evaluation results suggest that our approach identifies events, and their associated social media documents, more effectively than the state-of-the-art strategies on which we build.
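
To make the idea of a multi-feature similarity metric driving online clustering concrete, here is a minimal sketch. It is not the paper's method: the abstract does not give the features' exact form, the combination rule, or the clustering procedure, so everything below is an assumption for illustration. The feature names (`tags`, `time`), the one-day decay window, the weights, and the threshold are hypothetical, and the weights are set by hand here rather than learned.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two token lists, using raw term counts."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0


def time_sim(t1, t2, window=86400.0):
    """Linear decay: 1.0 for equal timestamps (seconds), 0.0 beyond `window`.

    The one-day window is an assumed value, not taken from the paper.
    """
    return max(0.0, 1.0 - abs(t1 - t2) / window)


def combined_sim(doc, cluster, weights):
    """Weighted sum of per-feature similarities (textual and non-textual)."""
    sims = {
        "tags": cosine(doc["tags"], cluster["tags"]),
        "time": time_sim(doc["time"], cluster["time"]),
    }
    return sum(weights[name] * sims[name] for name in sims)


def online_cluster(docs, weights, threshold):
    """Single-pass clustering: each document joins its most similar cluster
    if the combined similarity reaches `threshold`, else starts a new one.
    Returns the cluster index assigned to each document, in input order.
    """
    clusters, assignment = [], []
    for doc in docs:
        best, best_sim = None, -1.0
        for i, c in enumerate(clusters):
            s = combined_sim(doc, c, weights)
            if s > best_sim:
                best, best_sim = i, s
        if best is not None and best_sim >= threshold:
            c = clusters[best]
            # Update the cluster summary: pooled tags and mean timestamp.
            c["tags"] = c["tags"] + doc["tags"]
            c["time"] = (c["time"] * c["size"] + doc["time"]) / (c["size"] + 1)
            c["size"] += 1
            assignment.append(best)
        else:
            clusters.append(
                {"tags": list(doc["tags"]), "time": doc["time"], "size": 1}
            )
            assignment.append(len(clusters) - 1)
    return assignment
```

In this sketch, two photos tagged with overlapping terms and taken an hour apart would fall into one cluster, while a photo with unrelated tags taken days later would start a new one. Learning the weights from labeled event data, which is the paper's stated contribution, is deliberately left out.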

    • Published in

      WSDM '10: Proceedings of the third ACM international conference on Web search and data mining
      February 2010
      468 pages
      ISBN: 9781605588896
      DOI: 10.1145/1718487

      Copyright © 2010 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      Overall Acceptance Rate: 498 of 2,863 submissions, 17%
