Classification of Text Documents Using B-Tree

Harish, B. S.; Guru, D. S.; Manjunath, S.

doi:10.1007/978-3-642-27308-7_66

Part of the book series: Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering (LNICST, volume 85)

Included in the following conference series:

International Conference on Computer Science and Information Technology

Abstract

In this paper, we propose an unconventional method of representing and classifying text documents that preserves the sequence of term occurrence in a test document. The term sequence is preserved with the help of a novel data structure called the ‘Status Matrix’. In addition, to avoid sequential matching during classification, we propose indexing the terms in a B-tree, an efficient indexing scheme. Each term in the B-tree is associated with a list of the class labels of those documents that contain the term. A corresponding classification technique is also proposed. To corroborate the efficacy of the proposed representation and the status-matrix-based classification, we have conducted extensive experiments on various datasets.
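The abstract does not spell out the algorithm, so the following is only a minimal sketch of how the described pieces might fit together, under stated assumptions. A sorted list searched with binary search stands in for the B-tree (it is not a real B-tree, it only offers the same ordered-key lookup). The layout of the status matrix (one row per class, one column per test-document term, in document order) and the decision rule (pick the class with the longest run of consecutive matching terms) are assumptions made for illustration and are not taken from the paper. All names (`SortedTermIndex`, `build_index`, `status_matrix`, `longest_run`, `classify`) are hypothetical.

```python
from bisect import bisect_left, insort


class SortedTermIndex:
    """Ordered term index; a simple stand-in for a B-tree (assumption)."""

    def __init__(self):
        self._terms = []   # distinct terms, kept in sorted order
        self._labels = {}  # term -> set of class labels of documents containing it

    def add(self, term, label):
        if term not in self._labels:
            insort(self._terms, term)
            self._labels[term] = set()
        self._labels[term].add(label)

    def lookup(self, term):
        # Binary search over the sorted keys instead of sequential matching.
        i = bisect_left(self._terms, term)
        if i < len(self._terms) and self._terms[i] == term:
            return self._labels[term]
        return set()


def build_index(training_docs):
    """training_docs: iterable of (list_of_terms, class_label) pairs."""
    index = SortedTermIndex()
    for terms, label in training_docs:
        for term in terms:
            index.add(term, label)
    return index


def status_matrix(index, test_terms, classes):
    """Rows follow `classes`; columns follow the term order of the test document.

    Entry is 1 if the term occurs in some training document of that class.
    This layout is an assumption, not the paper's definition.
    """
    hits = [index.lookup(t) for t in test_terms]
    return [[1 if c in h else 0 for h in hits] for c in classes]


def longest_run(row):
    """Length of the longest run of consecutive 1s in a row."""
    best = cur = 0
    for v in row:
        cur = cur + 1 if v else 0
        best = max(best, cur)
    return best


def classify(index, test_terms, classes):
    """Assumed rule: choose the class whose row has the longest run of 1s."""
    scores = [longest_run(row) for row in status_matrix(index, test_terms, classes)]
    return classes[scores.index(max(scores))]


if __name__ == "__main__":
    docs = [
        (["stock", "market", "price"], "finance"),
        (["football", "match", "score"], "sport"),
        (["market", "share"], "finance"),
    ]
    idx = build_index(docs)
    print(classify(idx, ["stock", "market", "score", "price"], ["finance", "sport"]))
```

Because the columns keep the term order of the test document, a run of consecutive 1s in a row corresponds to a sequence of adjacent test terms all seen in that class, which is one plausible way to make use of the preserved term sequence.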

Author information

Authors and Affiliations

Department of Information Science and Engineering, SJCE, Mysore, Karnataka, India
B. S. Harish & S. Manjunath
Department of Studies in Computer Science, University of Mysore, Manasagangothri, Mysore, Karnataka, India
D. S. Guru

Editor information

Editors and Affiliations

Jackson State University, Jackson, MS, USA
Natarajan Meghanathan
University of Calcutta, Calcutta, India
Nabendu Chaki
Wireilla Net Solutions PTY Ltd., Melbourne, VIC, Australia
Dhinaharan Nagamalai

Cite this paper

Harish, B.S., Guru, D.S., Manjunath, S. (2012). Classification of Text Documents Using B-Tree. In: Meghanathan, N., Chaki, N., Nagamalai, D. (eds) Advances in Computer Science and Information Technology. Computer Science and Engineering. CCSIT 2012. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 85. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-27308-7_66

DOI: https://doi.org/10.1007/978-3-642-27308-7_66
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-27307-0
Online ISBN: 978-3-642-27308-7
eBook Packages: Computer Science, Computer Science (R0)
