ABSTRACT
In traditional text clustering methods, documents are represented as "bags of words" without considering the semantic information of each document. For instance, two documents that use different sets of core words for the same topic may be wrongly assigned to different clusters because they share no core words, even though those words are likely synonyms or otherwise semantically related. The most common way to solve this problem is to enrich the document representation with background knowledge from an ontology. This approach has two major issues: (1) the coverage of the ontology is limited, even for WordNet or MeSH, and (2) using ontology terms as replacement or additional features may cause information loss or introduce noise. In this paper, we present a novel text clustering method that addresses these two issues by enriching the document representation with Wikipedia concept and category information. We develop two approaches, exact-match and relatedness-match, to map text documents to Wikipedia concepts and, from there, to Wikipedia categories. The text documents are then clustered using a similarity metric that combines document content information, concept information, and category information. Experimental results with the proposed clustering framework on three datasets (20-newsgroup, TDT2, and LA Times) show that clustering performance improves significantly when the document representation is enriched with Wikipedia concepts and categories.
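The combined similarity metric mentioned in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the formula, so the weighted sum of three cosine similarities used here, the weights `alpha` and `beta`, the function names, and the toy documents are all assumptions made for illustration and are not taken from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def combined_similarity(d1, d2, alpha=0.5, beta=0.5):
    """Hypothetical combination: word similarity plus weighted concept
    and category similarities. alpha and beta are illustrative weights."""
    return (cosine(d1["words"], d2["words"])
            + alpha * cosine(d1["concepts"], d2["concepts"])
            + beta * cosine(d1["categories"], d2["categories"]))

# Toy documents (invented): no shared words, but the same concept and category.
doc_a = {"words": {"puma": 1, "habitat": 1},
         "concepts": {"Cougar": 1},
         "categories": {"Felines": 1}}
doc_b = {"words": {"mountain": 1, "lion": 1, "range": 1},
         "concepts": {"Cougar": 1},
         "categories": {"Felines": 1}}

word_only = cosine(doc_a["words"], doc_b["words"])
enriched = combined_similarity(doc_a, doc_b)
print(word_only, enriched)
```

In this toy case the bag-of-words similarity is zero because the two documents share no terms, while the enriched similarity is positive because both documents map to the same concept and category. This is the situation the abstract describes, where synonymous core words would otherwise push documents into different clusters.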
Index Terms
- Exploiting Wikipedia as external knowledge for document clustering