Document clustering: An evaluation of some experiments with the Cranfield 1400 collection

doi:10.1016/0306-4573(75)90006-0

Information Processing & Management

Volume 11, Issues 5–7, 1975, Pages 171-182

https://doi.org/10.1016/0306-4573(75)90006-0

Abstract

The single-link cluster method is used to construct a hierarchic classification for the 1400 documents in the Cranfield test collection. A variety of retrieval strategies applied to this hierarchy are evaluated in terms of effectiveness and efficiency. Comparisons are made between our results and those of similar experiments in document clustering on the Smart project.
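The single-link construction mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how documents are represented or compared, so the term-set representation, the Jaccard-based `dissimilarity` function, and the toy `docs` data below are all assumptions made for illustration. The sketch relies only on the standard property that single-link merges can be obtained by scanning document pairs in order of increasing dissimilarity and joining the clusters they fall in.

```python
from itertools import combinations


def dissimilarity(a, b):
    """1 - Jaccard overlap of two term sets (an assumed measure, not the paper's)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def single_link(docs):
    """Return the merge sequence [(level, cluster_a, cluster_b), ...].

    Pairs are visited in order of increasing dissimilarity; a pair whose
    documents lie in different clusters triggers a merge at that level.
    Clusters are frozensets of document indices.
    """
    parent = list(range(len(docs)))
    members = {i: frozenset([i]) for i in range(len(docs))}

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    pairs = sorted(
        (dissimilarity(docs[i], docs[j]), i, j)
        for i, j in combinations(range(len(docs)), 2)
    )
    merges = []
    for level, i, j in pairs:
        ri, rj = find(i), find(j)
        if ri == rj:
            continue  # already in the same cluster at a lower level
        merges.append((level, members[ri], members[rj]))
        parent[rj] = ri
        members[ri] = members[ri] | members.pop(rj)
    return merges


# Hypothetical toy documents, each reduced to a set of index terms.
docs = [
    {"boundary", "layer", "flow"},
    {"boundary", "layer", "heat"},
    {"wing", "flutter"},
    {"wing", "flutter", "panel"},
]
merges = single_link(docs)
for level, left, right in merges:
    print(f"{level:.3f}", sorted(left), sorted(right))
```

Each merge records the dissimilarity level at which two clusters join, so the full list describes a hierarchy (dendrogram) over the documents. Sorting all pairs is quadratic in the number of documents, which is workable for a collection of 1400 but is chosen here for clarity rather than efficiency.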


Cited by (69)

Probabilistic co-relevance for query-sensitive similarity measurement in information retrieval
2013, Information Processing and Management
Citation Excerpt:
This method was motivated by the cluster hypothesis, which states that “closely associated documents tend to be relevant to the same requests” (Jardine & Rijsbergen, 1971; Rijsbergen, 1979). Thus far, numerous studies have been conducted, for instance, initial trials based on hierarchical clustering that employed different types of merging criteria, i.e., single linkage, complete linkage, group average, and Ward’s method (Croft, 1980; El-Hamdouchi & Willett, 1986; Griffiths, Robinson, & Willett, 1984; Jardine & Rijsbergen, 1971; Rijsbergen & Croft, 1975; Voorhees, 1985). There are also more recent language modeling approaches based on partitional clustering (Liu & Croft, 2004; Na, Kang, Roh, & Lee, 2007) and document expansion using nearest neighbors as a cluster (Kurland & Lee, 2004; Tao, Wang, Mei, & Zhai, 2006).
Interdocument similarities are the fundamental information source required in cluster-based retrieval, which is an advanced retrieval approach that significantly improves performance during information retrieval (IR). An effective similarity metric is query-sensitive similarity, which was introduced by Tombros and Rijsbergen as a method to more directly satisfy the cluster hypothesis that forms the basis of cluster-based retrieval. Although this method is reported to be effective, existing applications of query-specific similarity are still limited to vector space models wherein there is no connection to probabilistic approaches. We suggest a probabilistic framework that defines query-sensitive similarity based on probabilistic co-relevance, where the similarity between two documents is proportional to the probability that they are both co-relevant to a specific given query. We further simplify the proposed co-relevance-based similarity by decomposing it into two separate relevance models. We then formulate all the requisite components for the proposed similarity metric in terms of scoring functions used by language modeling methods. Experimental results obtained using standard TREC test collections consistently showed that the proposed query-sensitive similarity measure performs better than term-based similarity and existing query-sensitive similarity in the context of Voorhees’ nearest neighbor test (NNT).
Cluster-based patent retrieval
2007, Information Processing and Management
Through the recent NTCIR workshops, patent retrieval has posed many challenging issues for the information retrieval community. Unlike newspaper articles, patent documents are very long and well structured. These characteristics make it necessary to reassess existing retrieval techniques, which have mainly been developed for short, unstructured documents such as newspaper articles. This study investigates cluster-based retrieval in the context of the invalidity search task of patent retrieval. Cluster-based retrieval assumes that clusters provide additional evidence for matching a user’s information need. Thus far, cluster-based retrieval approaches have relied on automatically created clusters. Fortunately, all patents carry manually assigned cluster information: international patent classification codes. International patent classification is a standard taxonomy for classifying patents, and currently has about 69,000 nodes organized into a five-level hierarchical system. Thus, patent documents could provide the best test bed for developing and evaluating cluster-based retrieval techniques. Experiments using the NTCIR-4 patent collection showed that the cluster-based language model could help improve on the cluster-less baseline language model.
A reliable FAQ retrieval system using a query log classification technique based on latent semantic analysis
2007, Information Processing and Management
To obtain high performance, previous work on FAQ retrieval used high-level knowledge bases or handcrafted rules. However, constructing these knowledge bases and rules whenever the application domain changes is time-consuming and laborious. To overcome this problem, we propose a high-performance FAQ retrieval system that uses only users’ query logs as knowledge sources. At indexing time, the proposed system efficiently clusters users’ query logs using classification techniques based on latent semantic analysis. At retrieval time, the proposed system smooths FAQs using the query log clusters. In the experiment, the proposed system outperformed conventional information retrieval systems in FAQ retrieval. Based on various experiments, we found that the proposed system could alleviate critical lexical disagreement problems in short document retrieval. In addition, we believe that the proposed system is more practical and reliable than previous FAQ retrieval systems because it uses only data-driven methods without high-level knowledge sources.
High-performance FAQ retrieval using an automatic clustering method of query logs
2006, Information Processing and Management
Citation Excerpt:
There have been numerous studies on how clustering can be employed to improve retrieval results (Liu & Croft, 2004). The cluster-based retrieval can be divided into two types: static clustering methods (Jardine & van Rijsbergen, 1971; van Rijsbergen & Croft, 1975) and query specific clustering methods (Hearst & Pedersen, 1996; Tombros, Villa, & van Rijsbergen, 2002). The static clustering methods group entire collections in advance, independent of the user’s query, and clusters are retrieved based on how well their centroids match the user’s query.
To resolve some of the lexical disagreement problems between queries and FAQs, we propose a reliable FAQ retrieval system using query log clustering. At indexing time, the proposed system clusters the logs of users’ queries into predefined FAQ categories. To increase the precision and recall of clustering, the proposed system adopts a new similarity measure using a machine-readable dictionary. At search time, the proposed system calculates the similarities between users’ queries and each cluster in order to smooth FAQs. By virtue of the cluster-based retrieval technique, the proposed system could partially bridge lexical chasms between queries and FAQs. In addition, the proposed system outperforms traditional information retrieval systems in FAQ retrieval.
Tree view self-organisation of web content
2005, Neurocomputing
When browsing a large set of unstructured documents, it is advantageous if the documents have been organised and presented in a way that makes navigation efficient, understanding underlying concepts easy and locating related information quick. This paper proposes a new method termed Treeview self-organising maps (Treeview SOMs) for clustering and organising text documents by means of a series of independently and automatically created, hierarchical one-dimensional SOMs. The method generates a topological taxonomy tree for a set of unstructured text documents in terms of presentation and visualisation. The documents are organised in a hierarchy of dynamically generated and automatically validated topics extracted from the corpus of the documents. The results, presented in a labelled tree view, clearly show the underlying contents of the documents and can help users browse the document set more efficiently than those of previous work using SOMs or hierarchical clustering methods. A brief overview of general document clustering and a review of SOM-based document analysis methods are also provided, together with a comparison among them.
The effectiveness of query-specific hierarchic clustering in information retrieval
2002, Information Processing and Management
Citation Excerpt:
Hierarchic methods, on the other hand, result in tree-like classifications in which small clusters of documents that are found to be strongly similar to each other are nested within larger clusters that contain less similar documents (Willett, 1988). Two main methods, and many variants of them, for matching a query against a document hierarchy have been proposed (Croft, 1980; Jardine & Van Rijsbergen, 1971; Van Rijsbergen, 1974, 1975; Voorhees, 1985): a top–down search, and a bottom–up search. In both types of search, a single cluster that satisfies a retrieval criterion is retrieved.
Hierarchic document clustering has been widely applied to information retrieval (IR) on the grounds of its potential improved effectiveness over inverted file search (IFS). However, previous research has been inconclusive as to whether clustering does bring improvements. In this paper we take the view that if hierarchic clustering is applied to search results (query-specific clustering), then it has the potential to increase the retrieval effectiveness compared both to that of static clustering and of conventional IFS. We conducted a number of experiments using five document collections and four hierarchic clustering methods. Our results show that the effectiveness of query-specific clustering is indeed higher, and suggest that there is scope for its application to IR.


† Present address: Royal Society Fellow at Cambridge University Computing Laboratory, Cambridge, England.
