Information Processing & Management
Volume 39, Issue 6, November 2003, Pages 825-851
 
doi:10.1016/S0306-4573(02)00081-X
Copyright © 2002 Elsevier Ltd. All rights reserved.

Experiments in discourse analysis impact on information classification and retrieval algorithms

J. Morato (corresponding author), J. Llorens, G. Genova and J. A. Moreiro

Department of Computer Science, Universidad Carlos III de Madrid, Av. Universidad, 30-28911, Leganés, Madrid, Spain

Received 1 February 2002; accepted 13 August 2002; available online 19 December 2002.


Abstract

Researchers in indexing and retrieval systems have long advocated including more contextual information to improve results. The proliferation of full-text databases and advances in computer storage capacity have made it possible to analyse text using both linguistic and extra-linguistic knowledge. Since the mid-1980s, research has paid increasing attention to context, giving discourse analysis a more central role. This paper examines whether discourse variables have an impact on modern information retrieval and classification algorithms. To evaluate this hypothesis, a functional framework for information analysis in an automated environment is proposed, in which the n-grams filtering algorithm and the k-means and Chen classification algorithms are tested against sub-collections of documents built from the following discourse variables: “Genre”, “Register”, “Domain terminology”, and “Document structure”. The results each algorithm obtained on the different sub-collections were compared with the MeSH information structure. The comparison shows that the n-grams algorithm has no clear dependence on the discourse variables; that the k-means classification algorithm does depend on them, but only on domain terminology and document structure; and that Chen’s algorithm depends clearly on all of them. These findings could inform the design of better classification algorithms that take discourse variables into account. Other minor conclusions drawn from the results are also presented.

Author Keywords: Discourse-model; Context-analysis; Computational-linguistics; Text-analysis-methods; Filtering; n-grams; k-means; Co-wording

Article Outline

1. Introduction
2. A theoretical overview of some aspects of discourse
2.1. Genre
2.2. Register
2.3. Domain terminology
2.4. Document structure
3. The experimental framework for studying the impact of discourse variables in indexing and classification algorithms
3.1. Document set selection
3.2. Discourse-based sub-collection construction
3.2.1. Genre sub-collections
3.2.2. Register sub-collections
3.2.3. Domain terminologies sub-collections
3.2.4. Document structure sub-collections
3.3. Testing indexing-retrieval algorithms
3.3.1. n-grams term-filtering algorithm
3.3.2. Test framework for n-grams
3.4. Testing classification algorithms
3.4.1. k-means classification algorithm
3.4.2. Test framework for k-means
3.4.3. Chen–Co-wording classification algorithm
3.4.4. Test framework for Chen/Co-wording
4. Experiments and results
4.1. Experiment 1: Impact of discourse in the n-grams algorithm
4.2. Experiment 2: Impact of discourse in the k-means algorithm
4.3. Experiment 3: The impact of discourse in the Chen algorithm
5. Conclusions
References






 