Copyright © 2005 Elsevier Ltd. All rights reserved.
Combining preference- and content-based approaches for improving document clustering effectiveness
Received 3 May 2004;
Abstract
E-commerce and knowledge management applications generate and consume tremendous amounts of online information, typically in the form of textual documents. Efficient and effective management of this ever-increasing volume of documents is essential if organizations and individuals are to access and make use of the documents later. In practice, documents are commonly organized, archived, and accessed through categories (e.g., folders). Document clustering is an appealing way for organizations or individuals to create and maintain such categories automatically. Existing document clustering techniques usually group documents on the basis of their textual content similarity. However, such content-based approaches operate at the lexical level and suffer greatly from the word mismatch problem. This study addresses that problem by exploiting users’ document grouping preferences, as exhibited in their folder sets, to support document clustering. Specifically, we propose a hybrid document clustering technique that combines preference- and content-based approaches. Evaluated against two performance benchmarks, a traditional content-based technique and a preference/content switching technique, the proposed hybrid technique improves clustering effectiveness as measured by both cluster precision and cluster recall.
Keywords: Document clustering; Hierarchical agglomerative clustering; Preference-based document clustering; Document management; Digital library
Article Outline
- 1. Introduction
- 2. Overview of document clustering approaches
  - 2.1. Content-based document clustering approach
  - 2.2. Noncontent-based and hybrid document clustering approaches
- 3. Design of preference- and content-based document clustering technique
  - 3.1. Feature extraction and selection
  - 3.2. Document representation
  - 3.3. Content-based document similarity estimation
  - 3.4. Preference-based document similarity estimation
  - 3.5. Clustering
- 4. Empirical evaluation
  - 4.1. Performance benchmarks
  - 4.2. Data collection
  - 4.3. Evaluation procedure
  - 4.4. Evaluation criteria
  - 4.5. Parameter tuning experiments and results
    - 4.5.1. Tuning experiment on number of features k
    - 4.5.2. Tuning experiment on document representation scheme
    - 4.5.3. Tuning experiment on h for the hybrid technique
  - 4.6. Comparative evaluation results
- 5. In-depth analysis using synthetic data sets
  - 5.1. Generation of synthetic users’ folder sets
  - 5.2. Analysis of effects of h
  - 5.3. Analysis of effects of Sig N
- 6. Conclusions and future research directions
- Acknowledgements
- References














