Uncertainty-Based Noise Reduction and Term Selection in Text Categorization

Peters, C.; Koster, C. H. A.

doi:10.1007/3-540-45886-7_17

Uncertainty-Based Noise Reduction and Term Selection in Text Categorization

C. Peters⁷ &
C. H. A. Koster⁷

Conference paper
First Online: 01 January 2002

471 Accesses
5 Citations

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 2291))

Abstract

This paper introduces a new criterium for term selection, which is based on the notion of Uncertainty. Term selection according to this criterium is performed by the elimination of noisy terms on a class-by-class basis, rather than by selecting the most significant ones. Uncertainty-based term selection (UC) is compared to a number of other criteria like Information Gain (IG), simplified χ² (SX), Term Frequency (TF) and Document Frequency (DF) in a Text Categorization setting. Experiments on data sets with different properties (Reuters- 21578, patent abstracts and patent applications) and with two different algorithms (Winnow and Rocchio) show that UC-based term selection is not the most aggressive term selection criterium, but that its effect is quite stable across data sets and algorithms. This makes it a good candidate for a general “install-and-forget” term selection mechanism. We also describe and evaluate a hybrid Term Selection technique, first applying UC to eliminate noisy terms and then using another criterium to select the best terms.

This is a preview of subscription content, log in via an institution.

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Learn about institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

References

ISO/TAG4/WG3, (R. Cohen, P. Clifford, P. Giacomo, O. Mathiesen, C. Peters, B. Taylor, K. Weise), Guide to the expression of uncertainties in measurement, ISO publication ISBN-92-67-10118-9.
Google Scholar
Apté, C. and Damerau, F. (1994) Automated learning of decision rules for text categorization. In: ACM Transactions on Information Systems 12(3):233–251, 1994.
Article Google Scholar
L. Douglas Baker and Andrew Kachites McCallum, Distributional clustering of words for text-classification, In: Proceedings SIGIR 98, pp. 96–103.
Google Scholar
E. Richard Cohen, Uncertainty and error in physical measurements, At: The International summer school of physics “Enrico Fermi”, SIF Course CX, Metrology at the frontiers of physics and technology, Lerici (Italy), 27 June–7 July 1989.
Google Scholar
W.W. Cohen and Y. Singer (1999), Context-sensitive learning methods for text categorization. ACM Transactions on Information Systems 13, 1, 100–111.
Google Scholar
I. Dugan, Y. Karov, D. Roth (1997), Mistake-Driven Learning in Text Categorization. In: Proceedings of the Second Conference on Empirical Methods in NLP, pp. 55–63.
Google Scholar
A. Grove, N. Littlestone, and D. Schuurmans (2001), General convergence results for linear discriminant updates. Machine Learning 43(3), pp. 173–210.
Article MATH Google Scholar
C.H.A. Koster, M. Seutter and J. Beney (2000), Classifying Patent Applications with Winnow, Proceedings Benelearn 2001, Antwerpen, 8pp.
Google Scholar
M. Krier and F. Zaccà (2001), Automatic Categorisation Applications at the European Patent Office, International CHemical Information Conference, Nimes, October 2001, 10 pp.
Google Scholar
L. D. Landau, E.M. Lifschitz, Lehrbuch der theoretischen Physik V, Statistische Physik Teil 1, Akademie Verlag Berlin, 1979.
Google Scholar
David D. Lewis, An evaluation of Phrasal and Clustered representations on a Text Categorization task, Fifteenth Annual International ACM SIGIR, Copenhagen, 1992.
Google Scholar
H. Ragas and C.H.A. Koster, Four classification algorithms compared on a Dutch corpus, Proceedings SIGIR 98, pp. 369–370.
Google Scholar
J.J. Rocchio (1971), Relevance feedback in Information Retrieval, In: Salton, G. (ed.), The Smart Retrieval system-experiments in automatic document processing, Prentice-Hall, Englewood Cliffs, NJ, pp 313–323.
Google Scholar
Fabrizio Sebastiani, Machine learning in automated text categorization, ACM Computing Surveys, Forthcoming, 2002 http://faure.iei.pi.cnr.it/~fabrizio/Publications/ACMCS02.pdf
Yiming Yang and Jan Pederson (1997), Feature selection in statistical learning of text categorization. In: ICML 97, pp. 412–420.
Google Scholar

Download references

Author information

Authors and Affiliations

Department of Computer Science, University of Nijmegen, The Netherlands
C. Peters & C. H. A. Koster

Authors

C. Peters
View author publications
You can also search for this author in PubMed Google Scholar
C. H. A. Koster
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

Department of Computer and Information Sciences, University of Strathclyde, 26 Richmond Street, G1 1XH, Glasgow, UK
Fabio Crestani
School of Information and Communication Technologies, University of Paisley, High Street, PA1 2BE, Paisley, UK
Mark Girolami
Computing Science Department, University of Glasgow, 17 Lilybank Gardens, G12 8RZ, Glasgow, UK
Cornelis Joost van Rijsbergen

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Peters, C., Koster, C.H.A. (2002). Uncertainty-Based Noise Reduction and Term Selection in Text Categorization. In: Crestani, F., Girolami, M., van Rijsbergen, C.J. (eds) Advances in Information Retrieval. ECIR 2002. Lecture Notes in Computer Science, vol 2291. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45886-7_17

Download citation

DOI: https://doi.org/10.1007/3-540-45886-7_17
Published: 14 March 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43343-9
Online ISBN: 978-3-540-45886-9
eBook Packages: Springer Book Archive

Publish with us

Policies and ethics