Three factors are involved in text classification: the classification model, the similarity measure, and the document representation. In this chapter, we focus on document representation and demonstrate that the choice of representation has a profound impact on classification quality. We also show that text quality affects the choice of document representation. Our experiments use centroid-based classification, a simple and robust text classification scheme. We compare four types of document representation: N-grams, single terms, phrases, and a logic-based document representation called RDR. The N-gram representation is string-based and involves no linguistic processing. The single-term approach is based on words with minimal linguistic processing. The phrase approach is based on linguistically formed phrases and single words. RDR is based on linguistic processing and represents documents as a set of logical predicates. Our experiments on many text collections yielded similar results.
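To make the setup concrete, the following is a minimal sketch of centroid-based classification over two of the four representations (character N-grams and single terms). It is not the authors' implementation: the raw-count weighting, the cosine similarity measure, the helper names, and the toy training sentences are all assumptions chosen for illustration, and the phrase and RDR representations are omitted because they require linguistic processing that is not described here.

```python
import math
from collections import Counter, defaultdict

def char_ngrams(text, n=3):
    """String-based features: overlapping character n-grams, no linguistic processing."""
    s = " ".join(text.lower().split())  # lowercase and collapse whitespace
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def single_terms(text):
    """Word features with minimal processing (assumed here: lowercase, whitespace split)."""
    return text.lower().split()

def to_unit_vector(features):
    """Raw feature counts scaled to unit length (an assumed weighting scheme)."""
    counts = Counter(features)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {f: c / norm for f, c in counts.items()} if norm else {}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def train_centroids(labelled_docs, featurize):
    """Average the unit vectors of each class to obtain one centroid per class."""
    sums = defaultdict(Counter)
    sizes = Counter()
    for text, label in labelled_docs:
        for f, w in to_unit_vector(featurize(text)).items():
            sums[label][f] += w
        sizes[label] += 1
    return {label: {f: w / sizes[label] for f, w in vec.items()}
            for label, vec in sums.items()}

def scores(text, centroids, featurize):
    """Similarity of a document to every class centroid."""
    doc = to_unit_vector(featurize(text))
    return {label: cosine(doc, c) for label, c in centroids.items()}

def classify(text, centroids, featurize):
    """Assign the class whose centroid is most similar to the document."""
    s = scores(text, centroids, featurize)
    return max(s, key=s.get)

# Hypothetical toy training data, not taken from the chapter's collections.
TRAIN = [
    ("the striker scored a late goal", "sport"),
    ("the goalkeeper saved the penalty", "sport"),
    ("the bank raised interest rates", "finance"),
    ("investors sold shares and bonds", "finance"),
]

if __name__ == "__main__":
    noisy = "goalkeper savd penlty"  # simulated low-quality (misspelled) text
    for name, featurize in [("n-grams", char_ngrams), ("single terms", single_terms)]:
        centroids = train_centroids(TRAIN, featurize)
        print(name, scores(noisy, centroids, featurize))
```

In this toy setting, the misspelled query shares no whole words with any training document, so the single-term similarities are all zero, while the character trigrams still overlap with the sport documents. This only illustrates why text quality might interact with the choice of representation; it says nothing about the magnitude of the effect reported in the chapter.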
© 2008 Springer-Verlag London Limited
About this chapter
Cite this chapter
Keikha, M., Razavian, N.S., Oroumchian, F., Razi, H.S. (2008). Document Representation and Quality of Text: An Analysis. In: Berry, M.W., Castellanos, M. (eds) Survey of Text Mining II. Springer, London. https://doi.org/10.1007/978-1-84800-046-9_12
DOI: https://doi.org/10.1007/978-1-84800-046-9_12
Publisher Name: Springer, London
Print ISBN: 978-1-84800-045-2
Online ISBN: 978-1-84800-046-9
eBook Packages: Computer Science, Computer Science (R0)