DOI: 10.1145/1557019.1557066
research-article

Exploiting Wikipedia as external knowledge for document clustering

Published: 28 June 2009

ABSTRACT

In traditional text clustering methods, documents are represented as "bags of words" without considering the semantic information of each document. For instance, if two documents use different collections of core words to represent the same topic, they may be falsely assigned to different clusters due to the lack of shared core words, even though the core words they use are probably synonyms or otherwise semantically associated. The most common way to solve this problem is to enrich the document representation with background knowledge from an ontology. This approach has two major issues: (1) the coverage of the ontology is limited, even for WordNet or MeSH; and (2) using ontology terms as replacement or additional features may cause information loss or introduce noise. In this paper, we present a novel text clustering method that addresses these two issues by enriching document representation with Wikipedia concept and category information. We develop two approaches, exact match and relatedness-match, to map text documents to Wikipedia concepts, and further to Wikipedia categories. The text documents are then clustered using a similarity metric that combines document content information, concept information, and category information. Experimental results using the proposed clustering framework on three datasets (20-newsgroup, TDT2, and LA Times) show that clustering performance improves significantly when document representation is enriched with Wikipedia concepts and categories.
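To make the idea of a combined similarity metric concrete, the following is a minimal sketch, not the method as implemented in the paper. It assumes that the three similarities are cosine similarities combined linearly with weights `alpha` and `beta`; the abstract does not specify the form of the combination, so this is an illustrative choice. The lookup tables `CONCEPTS` and `CATEGORIES` are hypothetical toy stand-ins for Wikipedia data, and only a single-word exact match is sketched (relatedness-match is not shown).

```python
import math
from collections import Counter

# Hypothetical toy tables standing in for Wikipedia (illustrative only).
CONCEPTS = {"car": "Automobile", "automobile": "Automobile",
            "engine": "Engine", "puppy": "Dog", "dog": "Dog"}
CATEGORIES = {"Automobile": ["Vehicles"], "Engine": ["Vehicles"],
              "Dog": ["Animals"]}


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def represent(text):
    """Build word, concept, and category vectors for one document."""
    words = Counter(text.lower().split())
    # Exact match (assumed form): a word maps to a concept by table lookup.
    concepts = Counter()
    for word, count in words.items():
        if word in CONCEPTS:
            concepts[CONCEPTS[word]] += count
    # Each matched concept contributes its count to its categories.
    categories = Counter()
    for concept, count in concepts.items():
        for category in CATEGORIES.get(concept, []):
            categories[category] += count
    return words, concepts, categories


def combined_similarity(doc1, doc2, alpha=0.5, beta=0.5):
    """Assumed linear combination of word, concept, and category similarity."""
    w1, c1, g1 = represent(doc1)
    w2, c2, g2 = represent(doc2)
    return cosine(w1, w2) + alpha * cosine(c1, c2) + beta * cosine(g1, g2)


if __name__ == "__main__":
    a = "the car engine"
    b = "an automobile motor"   # same topic, no shared words with `a`
    c = "a puppy barked"        # different topic
    print(combined_similarity(a, b))
    print(combined_similarity(a, c))
```

In this toy example the first two documents share no words, so a pure bag-of-words similarity is zero, yet the concept and category terms still pull them together, which is the failure case the abstract describes.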

    • Published in

      KDD '09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
      June 2009
      1426 pages
      ISBN: 9781605584959
      DOI: 10.1145/1557019

      Copyright © 2009 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%
