
Document Representation and Quality of Text: An Analysis

  • Chapter
Survey of Text Mining II

Three factors are involved in text classification: the classification model, the similarity measure, and the document representation. In this chapter, we focus on document representation and demonstrate that the choice of representation has a profound impact on the quality of the classification. We also show that text quality affects the choice of document representation. Our experiments use centroid-based classification, a simple and robust text classification scheme. We compare four types of document representation: N-grams, single terms, phrases, and a logic-based document representation called RDR. The N-gram representation is string-based and involves no linguistic processing. The single-term approach is based on words, with minimal linguistic processing. The phrase approach is based on linguistically formed phrases and single words. RDR relies on linguistic processing and represents documents as a set of logical predicates. Our experiments on many text collections yielded similar results.
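To make the setup concrete, the following is a minimal sketch of centroid-based classification over a character N-gram representation. It is an illustration under stated assumptions, not the chapter's implementation: the function names (`char_ngrams`, `train`, `classify`), the choice of trigrams, raw-count weighting, cosine similarity, and the toy training data are all hypothetical choices made here for brevity.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Bag of overlapping character n-grams (no linguistic processing)."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def normalize(vec):
    """Scale a sparse vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {k: v / norm for k, v in vec.items()} if norm else {}


def centroid(vectors):
    """Component-wise average of a list of sparse vectors."""
    total = Counter()
    for vec in vectors:
        for key, value in vec.items():
            total[key] += value
    return {k: v / len(vectors) for k, v in total.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(labeled_docs, represent=char_ngrams):
    """Build one centroid per class from (text, label) pairs."""
    by_class = {}
    for text, label in labeled_docs:
        by_class.setdefault(label, []).append(normalize(represent(text)))
    return {label: centroid(vecs) for label, vecs in by_class.items()}


def classify(text, centroids, represent=char_ngrams):
    """Assign the class whose centroid is most similar to the document."""
    vec = normalize(represent(text))
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Toy data, invented for illustration only.
training = [
    ("the stock market fell sharply", "finance"),
    ("investors sold shares and bonds", "finance"),
    ("the team won the football match", "sport"),
    ("the striker scored a late goal", "sport"),
]
centroids = train(training)
predicted = classify("shares and stock prices", centroids)
```

The `represent` parameter is the only part tied to the document representation, so a single-term or phrase-based extractor could be passed in its place while the classification model and similarity measure stay fixed.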




Copyright information

© 2008 Springer-Verlag London Limited

About this chapter

Cite this chapter

Keikha, M., Razavian, N.S., Oroumchian, F., Razi, H.S. (2008). Document Representation and Quality of Text: An Analysis. In: Berry, M.W., Castellanos, M. (eds) Survey of Text Mining II. Springer, London. https://doi.org/10.1007/978-1-84800-046-9_12

  • DOI: https://doi.org/10.1007/978-1-84800-046-9_12

  • Publisher Name: Springer, London

  • Print ISBN: 978-1-84800-045-2

  • Online ISBN: 978-1-84800-046-9

  • eBook Packages: Computer Science, Computer Science (R0)
