Copyright © 2002 Elsevier Ltd. All rights reserved.
Experiments in discourse analysis impact on information classification and retrieval algorithms
Received 1 February 2002;
Abstract
Researchers in indexing and retrieval systems have advocated including more contextual information to improve results. The proliferation of full-text databases and advances in computer storage capacity have made it possible to analyse text using both linguistic and extra-linguistic knowledge. Since the mid-1980s, research has paid increasing attention to context, giving discourse analysis a more central role. The research presented in this paper examines whether discourse variables have an impact on modern information retrieval and classification algorithms. To evaluate this hypothesis, a functional framework for information analysis in an automated environment is proposed, in which the n-grams (filtering) algorithm and the k-means and Chen’s classification algorithms are tested against sub-collections of documents built on the following discourse variables: “Genre”, “Register”, “Domain terminology”, and “Document structure”. The results obtained with the algorithms for the different sub-collections were compared to the MeSH information structure. They indicate that the n-grams algorithm shows no clear dependence on discourse variables; that the k-means classification algorithm does, but only on domain terminology and document structure; and that Chen’s algorithm shows a clear dependence on all of the discourse variables. This information could be used to design better classification algorithms that take discourse variables into account. Other minor conclusions drawn from these results are also presented.
Author Keywords: Discourse-model; Context-analysis; Computational-linguistics; Text-analysis-methods; Filtering; n-grams; k-means; Co-wording
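As a rough illustration of the kind of n-gram filtering named in the abstract, the following Python sketch builds character n-gram profiles and compares them by cosine similarity. The abstract does not specify the n-gram variant, the value of n, or the similarity measure used in the study, so the function names `char_ngrams` and `cosine`, the choice of trigrams, and the toy texts are all assumptions made for illustration only.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams in lower-cased, whitespace-normalised text."""
    normalised = " ".join(text.lower().split())
    return Counter(normalised[i:i + n] for i in range(len(normalised) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile is treated as matching nothing
    return dot / (norm_a * norm_b)


# Hypothetical toy texts, not taken from the paper's document set.
profile = char_ngrams("discourse analysis of medical abstracts")
related = char_ngrams("analysis of discourse in medical texts")
unrelated = char_ngrams("quarterly stock market report")

# A filter would keep documents whose similarity to the profile is high.
print(cosine(profile, related), cosine(profile, unrelated))
```

In a setting like the one the abstract describes, such a profile could be computed separately for each discourse-based sub-collection, so that differences in filtering behaviour across genre, register, domain terminology, or document structure become visible; how the study itself did this is not stated here.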
Article Outline
- 1. Introduction
- 2. A theoretical overview of some aspects of discourse
- 3. The experimental framework for studying the impact of discourse variables in indexing and classification algorithms
- 3.1. Document set selection
- 3.2. Discourse-based sub-collection construction
- 3.2.1. Genre sub-collections
- 3.2.2. Register sub-collections
- 3.2.3. Domain terminologies sub-collections
- 3.2.4. Document structure sub-collections
- 3.3. Testing indexing-retrieval algorithms
- 3.4. Testing classification algorithms
- 4. Experiments and results
- 4.1. Experiment 1: Impact of discourse in the n-grams algorithm
- 4.2. Experiment 2: Impact of discourse in the k-means algorithm
- 4.3. Experiment 3: The impact of discourse in the Chen algorithm
- 5. Conclusions
- References














