Document Representation and Quality of Text: An Analysis

Keikha, Mostafa; Razavian, Narjes Sharif; Oroumchian, Farhad; Razi, Hassan Seyed

doi:10.1007/978-1-84800-046-9_12

Three factors are involved in text classification: the classification model, the similarity measure, and the document representation. In this chapter, we focus on document representation and demonstrate that the choice of representation has a profound impact on the quality of the classification. We also show that the quality of the text affects the choice of document representation. Our experiments use centroid-based classification, a simple and robust text classification scheme. We compare four types of document representation: N-grams, single terms, phrases, and a logic-based document representation called RDR. The N-gram representation is string-based and involves no linguistic processing. The single-term approach is based on words and requires minimal linguistic processing. The phrase approach is based on linguistically formed phrases together with single words. RDR relies on linguistic processing and represents each document as a set of logical predicates. Our experiments on many text collections yielded similar results.
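
To make the combination of a string-based representation and centroid-based classification concrete, the following is a minimal sketch, not the implementation used in the chapter. It assumes character trigrams, raw-count weighting, unit-length normalization, and cosine similarity; the function names (char_ngrams, train_centroids, classify) and the toy training data are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Count overlapping character n-grams; no linguistic processing is applied."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def normalize(vec):
    """Scale a sparse vector (a dict of weights) to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    if norm == 0:
        return dict(vec)
    return {k: v / norm for k, v in vec.items()}


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors (their dot product)."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(v * b.get(k, 0.0) for k, v in a.items())


def train_centroids(labeled_docs, n=3):
    """Average the unit-length vectors of each class into one centroid per class."""
    sums = defaultdict(Counter)
    counts = Counter()
    for text, label in labeled_docs:
        for k, v in normalize(char_ngrams(text, n)).items():
            sums[label][k] += v
        counts[label] += 1
    return {
        label: normalize({k: v / counts[label] for k, v in vec.items()})
        for label, vec in sums.items()
    }


def classify(text, centroids, n=3):
    """Assign the class whose centroid is most similar to the document vector."""
    doc = normalize(char_ngrams(text, n))
    return max(centroids, key=lambda label: cosine(doc, centroids[label]))


# Toy data, invented for this sketch only.
training = [
    ("the striker scored a goal in the final match", "sports"),
    ("the team won the league match", "sports"),
    ("the bank raised interest rates", "finance"),
    ("shares fell as the market closed", "finance"),
]
centroids = train_centroids(training)
print(classify("a late goal decided the match", centroids))
```

Only char_ngrams depends on the representation. Swapping it for a function that emits single terms, phrases, or logical predicates would leave the centroid and similarity steps unchanged, which is the kind of comparison the abstract describes.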

Author information

Authors and Affiliations

Department of Electrical and Computer Engineering, University of Tehran, 14395-515, Tehran, Iran
Mostafa Keikha, Narjes Sharif Razavian & Hassan Seyed Razi
College of Information Technology, University of Wollongong in Dubai, 20183, Dubai, UAE
Farhad Oroumchian

Editor information

Editors and Affiliations

Department of Computer Science, University of Tennessee, USA
Michael W. Berry
Hewlett-Packard Laboratories, Palo Alto, California, USA
Malu Castellanos

About this chapter

Cite this chapter

Keikha, M., Razavian, N.S., Oroumchian, F., Razi, H.S. (2008). Document Representation and Quality of Text: An Analysis. In: Berry, M.W., Castellanos, M. (eds) Survey of Text Mining II. Springer, London. https://doi.org/10.1007/978-1-84800-046-9_12

DOI: https://doi.org/10.1007/978-1-84800-046-9_12
Publisher Name: Springer, London
Print ISBN: 978-1-84800-045-2
Online ISBN: 978-1-84800-046-9
eBook Packages: Computer Science, Computer Science (R0)
