ABSTRACT
Social media sites (e.g., Flickr, YouTube, and Facebook) are popular distribution outlets for users looking to share their experiences and interests on the Web. These sites host substantial amounts of user-contributed material (e.g., photographs, videos, and textual content) for a wide variety of real-world events of different types and scales. By automatically identifying these events and their associated user-contributed social media documents, which is the focus of this paper, we can enable event browsing and search in state-of-the-art search engines. To address this problem, we exploit the rich "context" associated with social media content, including user-provided annotations (e.g., title, tags) and automatically generated information (e.g., content creation time). Using this rich context, which includes both textual and non-textual features, we can define appropriate document similarity metrics to enable online clustering of media into events. As a key contribution of this paper, we explore a variety of techniques for learning multi-feature similarity metrics for social media documents in a principled manner. We evaluate our techniques on large-scale, real-world datasets of event images from Flickr. Our evaluation results suggest that our approach identifies events, and their associated social media documents, more effectively than the state-of-the-art strategies on which we build.
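The idea of combining textual and non-textual features into one similarity score and using it for online clustering can be illustrated with a minimal sketch. This is not the paper's implementation: the two features (tags and creation time), the fixed weights, the one-day time window, the threshold, and all function names are assumptions chosen for illustration, and the weights are set by hand here, whereas the abstract describes learning the metric.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def time_similarity(t1, t2, window=1440.0):
    """1.0 for equal timestamps (in minutes), falling linearly to 0.0 at `window`.
    The one-day window is an assumed value."""
    return max(0.0, 1.0 - abs(t1 - t2) / window)

def combined_similarity(doc, cluster, weights):
    """Weighted average of per-feature similarities (hand-set weights, assumed)."""
    sims = {
        "tags": cosine(doc["tags"], cluster["tags"]),
        "time": time_similarity(doc["time"], cluster["time"]),
    }
    return sum(weights[f] * sims[f] for f in sims) / sum(weights.values())

def cluster_online(docs, weights, threshold):
    """Single-pass clustering: each document joins its most similar cluster
    if the similarity reaches `threshold`, otherwise it starts a new cluster."""
    clusters = []
    for doc in docs:
        best, best_sim = None, 0.0
        for c in clusters:
            s = combined_similarity(doc, c, weights)
            if s > best_sim:
                best, best_sim = c, s
        if best is not None and best_sim >= threshold:
            n = len(best["members"])
            best["tags"] += doc["tags"]  # accumulate tag counts in the cluster
            best["time"] = (best["time"] * n + doc["time"]) / (n + 1)  # running mean
            best["members"].append(doc["id"])
        else:
            clusters.append({
                "tags": Counter(doc["tags"]),  # copy so the document is not mutated
                "time": float(doc["time"]),
                "members": [doc["id"]],
            })
    return clusters

if __name__ == "__main__":
    docs = [
        {"id": "a", "tags": Counter(["concert", "nyc", "radiohead"]), "time": 0},
        {"id": "b", "tags": Counter(["radiohead", "concert", "live"]), "time": 30},
        {"id": "c", "tags": Counter(["marathon", "boston"]), "time": 5000},
    ]
    for c in cluster_online(docs, {"tags": 0.7, "time": 0.3}, threshold=0.5):
        print(c["members"])
```

In this sketch each cluster is summarized by accumulated tag counts and a mean timestamp, so a new document is compared against one representative per cluster rather than against every earlier document.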
Index Terms
- Learning similarity metrics for event identification in social media