Information Processing & Management
Volume 42, Issue 2, March 2006, Pages 350-372
 
doi:10.1016/j.ipm.2005.06.008
Copyright © 2005 Elsevier Ltd All rights reserved.

Combining preference- and content-based approaches for improving document clustering effectiveness

Chih-Ping Wei (a, corresponding author), Chin-Sheng Yang (a), Han-Wei Hsiao (b) and Tsang-Hsiang Cheng (c)

(a) Department of Information Management, College of Management, National Sun Yat-sen University, Kaohsiung, Taiwan, ROC
(b) Department of Information Management, National University of Kaohsiung, Kaohsiung, Taiwan, ROC
(c) Department of Business Administration, Southern Taiwan University of Technology, Tainan, Taiwan, ROC

Received 3 May 2004; accepted 16 June 2005; available online 24 August 2005.


Abstract

E-commerce and knowledge management applications generate and consume tremendous amounts of online information, typically in the form of textual documents. To make these documents easy to access and exploit, both organizations and individuals need efficient and effective ways to manage their ever-increasing volume. Document management practices show that categories (e.g., folders) are a popular means of organizing, archiving, and accessing documents. Document clustering is an appealing way for organizations or individuals to create and maintain document categories automatically. Existing document clustering techniques usually group documents on the basis of their textual content similarity. However, such content-based approaches operate at the lexical level and suffer greatly from the word mismatch problem, in which documents on the same topic use different terms. This study addresses the problem by exploiting users’ document grouping preferences, as exhibited in their folder sets, to support document clustering. Specifically, we propose a hybrid document clustering technique that combines preference- and content-based approaches. Using a traditional content-based technique and a preference/content switching technique as performance benchmarks, our empirical evaluation shows that the proposed hybrid technique improves clustering effectiveness as measured by both cluster precision and cluster recall.

Keywords: Document clustering; Hierarchical agglomerative clustering; Preference-based document clustering; Document management; Digital library
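
The abstract names the ingredients of the hybrid technique (content similarity, users' folder sets, hierarchical agglomerative clustering) but not its formulas. As a rough illustration only, the Python sketch below shows one way such ingredients could fit together. Everything specific in it is an assumption rather than the authors' method: the raw term-frequency cosine measure, the "filed in the same folder" preference score, the linear blend with a hypothetical weight `alpha`, the average-link merge rule, and all function names and sample data.

```python
import math
from collections import Counter
from itertools import combinations


def content_similarity(doc_a, doc_b):
    """Cosine similarity between raw term-frequency vectors of two texts."""
    tf_a, tf_b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm = math.sqrt(sum(v * v for v in tf_a.values())) * math.sqrt(
        sum(v * v for v in tf_b.values()))
    return dot / norm if norm else 0.0


def preference_similarity(id_a, id_b, folder_sets):
    """Share of users holding both documents who filed them in one folder."""
    holders = together = 0
    for folders in folder_sets:  # one list of folders (sets of ids) per user
        owned = set().union(*folders)
        if id_a in owned and id_b in owned:
            holders += 1
            together += any(id_a in f and id_b in f for f in folders)
    return together / holders if holders else 0.0


def hybrid_similarity(id_a, id_b, docs, folder_sets, alpha=0.5):
    """Assumed linear blend; alpha is an illustrative weight, not from the paper."""
    return (alpha * preference_similarity(id_a, id_b, folder_sets)
            + (1 - alpha) * content_similarity(docs[id_a], docs[id_b]))


def cluster(docs, folder_sets, n_clusters, alpha=0.5):
    """Average-link hierarchical agglomerative clustering on hybrid similarity."""
    sim = {frozenset(p): hybrid_similarity(p[0], p[1], docs, folder_sets, alpha)
           for p in combinations(docs, 2)}
    clusters = [{d} for d in docs]
    while len(clusters) > n_clusters:
        def avg_link(pair):
            a, b = clusters[pair[0]], clusters[pair[1]]
            return sum(sim[frozenset((x, y))] for x in a for y in b) / (len(a) * len(b))
        # Merge the most similar pair of clusters (i < j, so deleting j is safe).
        i, j = max(combinations(range(len(clusters)), 2), key=avg_link)
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters


# Made-up data: d1/d2 and d3/d4 share topics but no words (word mismatch).
docs = {
    "d1": "car engine repair",
    "d2": "automobile motor maintenance",
    "d3": "stock market report",
    "d4": "equity exchange summary",
}
folder_sets = [
    [{"d1", "d2"}, {"d3", "d4"}],  # hypothetical user 1
    [{"d1", "d2"}, {"d3"}],        # hypothetical user 2
]
result = cluster(docs, folder_sets, n_clusters=2)
print(sorted(sorted(c) for c in result))  # [['d1', 'd2'], ['d3', 'd4']]
```

In this toy data every pair of documents has zero content similarity, so a purely lexical measure has nothing to group on; the folder evidence alone pulls d1 toward d2 and d3 toward d4. How the actual technique weights or combines the two signals is not stated in the abstract.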

Article Outline

1. Introduction
2. Overview of document clustering approaches
2.1. Content-based document clustering approach
2.2. Noncontent-based and hybrid document clustering approaches
3. Design of preference- and content-based document clustering technique
3.1. Feature extraction and selection
3.2. Document representation
3.3. Content-based document similarity estimation
3.4. Preference-based document similarity estimation
3.5. Clustering
4. Empirical evaluation
4.1. Performance benchmarks
4.2. Data collection
4.3. Evaluation procedure
4.4. Evaluation criteria
4.5. Parameter tuning experiments and results
4.5.1. Tuning experiment on number of features k
4.5.2. Tuning experiment on document representation scheme
4.5.3. Tuning experiment on h for the hybrid technique
4.6. Comparative evaluation results
5. In-depth analysis using synthetic data sets
5.1. Generation of synthetic users’ folder sets
5.2. Analysis of effects of h
5.3. Analysis of effects of Sig N
6. Conclusions and future research directions
Acknowledgements
References