research-article

Exploiting Wikipedia as external knowledge for document clustering

Authors:
Xiaohua Hu (Drexel University, Philadelphia, PA, USA)
Xiaodan Zhang (Drexel University, Philadelphia, PA, USA)
Caimei Lu (Drexel University, Philadelphia, PA, USA)
E. K. Park (University of Missouri at Kansas City, Kansas City, MO, USA)
Xiaohua Zhou (Drexel University, Philadelphia, PA, USA)

KDD '09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, June 2009, Pages 389–396
https://doi.org/10.1145/1557019.1557066

Published: 28 June 2009

ABSTRACT

In traditional text clustering methods, documents are represented as "bags of words" without considering the semantic information of each document. For instance, if two documents use different collections of core words to represent the same topic, they may be falsely assigned to different clusters because they share no core words, even though the core words they use are probably synonyms or otherwise semantically associated. The most common way to solve this problem is to enrich the document representation with background knowledge from an ontology. This approach has two major issues: (1) the coverage of the ontology is limited, even for WordNet or MeSH, and (2) using ontology terms as replacement or additional features may cause information loss or introduce noise. In this paper, we present a novel text clustering method that addresses these two issues by enriching the document representation with Wikipedia concept and category information. We develop two approaches, exact-match and relatedness-match, to map text documents to Wikipedia concepts, and further to Wikipedia categories. The text documents are then clustered based on a similarity metric that combines document content information, concept information, and category information. Experimental results with the proposed clustering framework on three datasets (20-newsgroup, TDT2, and LA Times) show that clustering performance improves significantly when the document representation is enriched with Wikipedia concepts and categories.
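The abstract does not give the exact form of the similarity metric or the matching procedure, so the following is only a minimal sketch under stated assumptions: concepts are found by exact phrase lookup in a toy title set, categories come from a hypothetical concept-to-category table, and the three views are combined as a weighted sum of cosine similarities with illustrative weights `alpha` and `beta`. None of these names or values come from the paper.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (dict-like)."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def exact_match_concepts(tokens, concept_titles, max_len=3):
    """Count word n-grams (up to max_len words) that equal a concept title."""
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in concept_titles:
                counts[phrase] += 1
    return counts


def category_vector(concept_counts, concept_to_categories):
    """Propagate concept counts to the categories each concept belongs to."""
    counts = Counter()
    for concept, c in concept_counts.items():
        for category in concept_to_categories.get(concept, ()):
            counts[category] += c
    return counts


def represent(tokens, concept_titles, concept_to_categories):
    """Build the three views of a document: words, concepts, categories."""
    concepts = exact_match_concepts(tokens, concept_titles)
    return {
        "words": Counter(tokens),
        "concepts": concepts,
        "categories": category_vector(concepts, concept_to_categories),
    }


def combined_similarity(d1, d2, alpha=0.5, beta=0.5):
    """Assumed linear combination; alpha and beta are illustrative weights."""
    return (cosine(d1["words"], d2["words"])
            + alpha * cosine(d1["concepts"], d2["concepts"])
            + beta * cosine(d1["categories"], d2["categories"]))


# Toy stand-ins for Wikipedia titles and categories (hypothetical data).
concept_titles = {"car", "automobile", "engine"}
concept_to_categories = {
    "car": ["Vehicles"],
    "automobile": ["Vehicles"],
    "engine": ["Machines"],
}

rep_a = represent("the car has a new engine".split(),
                  concept_titles, concept_to_categories)
rep_b = represent("an automobile needs fuel".split(),
                  concept_titles, concept_to_categories)

word_only = cosine(rep_a["words"], rep_b["words"])
combined = combined_similarity(rep_a, rep_b)
print(word_only, combined)
```

In this toy case the two documents share no words and no concepts, so a bag-of-words comparison sees them as unrelated, while the shared "Vehicles" category gives the combined score a nonzero value. That is the synonym problem the abstract describes, shown on invented data rather than on the paper's datasets.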

Index Terms

Computing methodologies > Machine learning > Learning paradigms > Unsupervised learning > Cluster analysis

Published in
KDD '09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
June 2009
1426 pages
ISBN:9781605584959
DOI:10.1145/1557019
General Chairs: John Elder (Elder Research, Inc., USA), Françoise Soulié Fogelman (KXEN, France)
Program Chairs: Peter Flach (University of Bristol, UK), Mohammed Zaki (RPI, USA)
Copyright © 2009 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Wikipedia
document representation
text clustering