Uncertainty-Based Noise Reduction and Term Selection in Text Categorization

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2291)

Abstract

This paper introduces a new criterion for term selection, based on the notion of Uncertainty. Term selection according to this criterion is performed by eliminating noisy terms on a class-by-class basis, rather than by selecting the most significant ones. Uncertainty-based term selection (UC) is compared to a number of other criteria, such as Information Gain (IG), simplified χ² (SX), Term Frequency (TF) and Document Frequency (DF), in a Text Categorization setting. Experiments on data sets with different properties (Reuters-21578, patent abstracts and patent applications) and with two different algorithms (Winnow and Rocchio) show that UC-based term selection is not the most aggressive term selection criterion, but that its effect is quite stable across data sets and algorithms. This makes it a good candidate for a general “install-and-forget” term selection mechanism. We also describe and evaluate a hybrid Term Selection technique, which first applies UC to eliminate noisy terms and then uses another criterion to select the best terms.
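The two-stage shape of the hybrid technique (eliminate noisy terms first, then rank the survivors by a second criterion) can be illustrated with a minimal Python sketch. This is not the paper's method: the abstract does not give the UC formula, so the noise test here is a caller-supplied placeholder predicate, and the sketch filters globally rather than class by class. The names `document_frequency`, `hybrid_select` and `is_noisy` are hypothetical, and Document Frequency is used as the second-stage criterion only because it is the simplest of those listed.

```python
from collections import Counter
from typing import Callable, Sequence


def document_frequency(docs: Sequence[Sequence[str]]) -> Counter:
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # set() so a term counts once per document
    return df


def hybrid_select(
    docs: Sequence[Sequence[str]],
    is_noisy: Callable[[str, int, int], bool],
    k: int,
) -> list:
    """Stage 1: drop terms flagged by is_noisy (a stand-in, NOT the UC criterion).
    Stage 2: keep the k remaining terms with the highest document frequency."""
    df = document_frequency(docs)
    n_docs = len(docs)
    kept = [t for t in df if not is_noisy(t, df[t], n_docs)]
    # Sort by descending DF; break ties alphabetically for a stable result.
    kept.sort(key=lambda t: (-df[t], t))
    return kept[:k]


# Toy corpus; the placeholder rule treats a term found in every document as noise.
docs = [
    ["oil", "price", "the"],
    ["oil", "export", "the"],
    ["wheat", "price", "the"],
]
appears_everywhere = lambda term, df_t, n_docs: df_t == n_docs
selected = hybrid_select(docs, appears_everywhere, k=2)
print(selected)
```

Passing the noise test as a function keeps the elimination stage independent of the ranking stage, so either one could be swapped for a different criterion without touching the other.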

Copyright information

© 2002 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Peters, C., Koster, C.H.A. (2002). Uncertainty-Based Noise Reduction and Term Selection in Text Categorization. In: Crestani, F., Girolami, M., van Rijsbergen, C.J. (eds) Advances in Information Retrieval. ECIR 2002. Lecture Notes in Computer Science, vol 2291. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45886-7_17

  • DOI: https://doi.org/10.1007/3-540-45886-7_17

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-43343-9

  • Online ISBN: 978-3-540-45886-9

  • eBook Packages: Springer Book Archive
