research-article

Learning similarity metrics for event identification in social media

Authors:
Hila Becker, Columbia University, New York, NY, USA
Mor Naaman, Rutgers University, New Brunswick, NJ, USA
Luis Gravano, Columbia University, New York, NY, USA

WSDM '10: Proceedings of the third ACM international conference on Web search and data mining, February 2010, Pages 291–300
https://doi.org/10.1145/1718487.1718524

Published: 04 February 2010

ABSTRACT

Social media sites (e.g., Flickr, YouTube, and Facebook) are a popular distribution outlet for users looking to share their experiences and interests on the Web. These sites host substantial amounts of user-contributed materials (e.g., photographs, videos, and textual content) for a wide variety of real-world events of different type and scale. By automatically identifying these events and their associated user-contributed social media documents, which is the focus of this paper, we can enable event browsing and search in state-of-the-art search engines. To address this problem, we exploit the rich "context" associated with social media content, including user-provided annotations (e.g., title, tags) and automatically generated information (e.g., content creation time). Using this rich context, which includes both textual and non-textual features, we can define appropriate document similarity metrics to enable online clustering of media to events. As a key contribution of this paper, we explore a variety of techniques for learning multi-feature similarity metrics for social media documents in a principled manner. We evaluate our techniques on large-scale, real-world datasets of event images from Flickr. Our evaluation results suggest that our approach identifies events, and their associated social media documents, more effectively than the state-of-the-art strategies on which we build.
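The abstract's core idea, combining textual and non-textual context features into one document similarity score and using that score for online clustering, can be illustrated with a minimal sketch. This is not the authors' method: the feature names (`tags`, `time`), the linear time decay, the one-day window, the hand-set weights, and the threshold are all assumptions made for illustration. The paper learns how to combine features, whereas the weights here are fixed.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def time_sim(t1, t2, window=86400.0):
    """Assumed linear decay: 1.0 at identical times, 0.0 once `window` seconds apart."""
    return max(0.0, 1.0 - abs(t1 - t2) / window)


def combined_sim(doc, cluster, weights):
    """Weighted average of per-feature similarities (weights are hand-set here)."""
    sims = {
        "tags": cosine(doc["tags"], cluster["tags"]),
        "time": time_sim(doc["time"], cluster["time"]),
    }
    total = sum(weights.values())
    return sum(weights[f] * sims[f] for f in sims) / total


def cluster_stream(docs, weights, threshold=0.5):
    """Single-pass clustering: each document joins its most similar cluster
    if the combined similarity reaches the threshold, else starts a new one."""
    clusters = []
    for doc in docs:
        best, best_sim = None, 0.0
        for c in clusters:
            s = combined_sim(doc, c, weights)
            if s > best_sim:
                best, best_sim = c, s
        if best is not None and best_sim >= threshold:
            n = len(best["members"])
            best["tags"].update(doc["tags"])  # accumulate tag counts
            best["time"] = (best["time"] * n + doc["time"]) / (n + 1)  # running mean
            best["members"].append(doc["id"])
        else:
            clusters.append({
                "tags": Counter(doc["tags"]),
                "time": doc["time"],
                "members": [doc["id"]],
            })
    return clusters


# Hypothetical documents: tag bags plus creation time in seconds.
docs = [
    {"id": "a", "tags": Counter(["concert", "nyc"]), "time": 0.0},
    {"id": "b", "tags": Counter(["concert", "nyc", "live"]), "time": 3600.0},
    {"id": "c", "tags": Counter(["marathon", "boston"]), "time": 500000.0},
]
clusters = cluster_stream(docs, weights={"tags": 0.5, "time": 0.5})
print([c["members"] for c in clusters])
```

With these hand-picked inputs, the two concert photos taken an hour apart end up in one cluster and the marathon photo starts its own. A learned combination would replace the fixed `weights` dictionary.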

Index Terms

1. Information systems
  1. Information retrieval

Published in
WSDM '10: Proceedings of the third ACM international conference on Web search and data mining
February 2010
468 pages
ISBN: 9781605588896
DOI: 10.1145/1718487
General Chairs: Brian D. Davison (Lehigh University, USA), Torsten Suel (Polytechnic Institute of NYU, USA)
Program Chairs: Nick Craswell (Microsoft, USA), Bing Liu (University of Illinois, Chicago, USA)
Copyright © 2010 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
event identification
similarity metric learning
social media
Acceptance Rates
Overall Acceptance Rate: 498 of 2,863 submissions, 17%