Abstract
In this paper, we propose an unconventional method for representing and classifying text documents that preserves the sequence of term occurrence in a test document. The term sequence is preserved with the help of a novel data structure called the ‘Status Matrix’. In addition, to avoid sequential matching during classification, we propose indexing the terms in a B-tree, an efficient indexing scheme. Each term in the B-tree is associated with a list of the class labels of the documents that contain the term. A corresponding classification technique is also proposed. To corroborate the efficacy of the proposed representation and the status-matrix-based classification, we have conducted extensive experiments on various datasets.
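The abstract does not spell out the layout of the Status Matrix or the decision rule, so the following is only an illustrative sketch under stated assumptions. It assumes the matrix has one row per class and one column per test-document term (in order of occurrence), with a 1 where the term is known for that class, and it assumes the predicted class is the one with the longest run of consecutive 1s. A sorted list searched with `bisect` stands in for the B-tree; it plays the same ordered-lookup role but is not a B-tree. All names (`TermIndex`, `status_matrix`, `classify`) are hypothetical.

```python
from bisect import bisect_left


class TermIndex:
    """Ordered term index mapping each term to a set of class labels.

    Assumption: a sorted list with binary search is used here as a
    simple stand-in for the B-tree described in the abstract.
    """

    def __init__(self):
        self._terms = []   # terms kept in sorted order
        self._labels = []  # parallel list: set of class labels per term

    def add(self, term, label):
        i = bisect_left(self._terms, term)
        if i < len(self._terms) and self._terms[i] == term:
            self._labels[i].add(label)
        else:
            self._terms.insert(i, term)
            self._labels.insert(i, {label})

    def lookup(self, term):
        i = bisect_left(self._terms, term)
        if i < len(self._terms) and self._terms[i] == term:
            return self._labels[i]
        return set()


def build_index(training_docs):
    """training_docs: iterable of (list_of_terms, class_label) pairs."""
    index = TermIndex()
    for terms, label in training_docs:
        for term in terms:
            index.add(term, label)
    return index


def status_matrix(index, classes, test_terms):
    """Assumed layout: rows are classes, columns are test terms in order."""
    label_sets = [index.lookup(t) for t in test_terms]  # one lookup per term
    return [[1 if c in labels else 0 for labels in label_sets] for c in classes]


def longest_run(row):
    """Length of the longest run of consecutive 1s in a row."""
    best = cur = 0
    for v in row:
        cur = cur + 1 if v else 0
        best = max(best, cur)
    return best


def classify(index, classes, test_terms):
    """Assumed rule: pick the class whose row has the longest run of 1s."""
    scores = [longest_run(r) for r in status_matrix(index, classes, test_terms)]
    return classes[scores.index(max(scores))]


# Toy data, invented purely for illustration.
training = [
    (["stock", "market", "price"], "finance"),
    (["match", "goal", "team"], "sport"),
    (["market", "team"], "sport"),
]
classes = ["finance", "sport"]
index = build_index(training)
test_doc = ["stock", "market", "price", "team"]
matrix = status_matrix(index, classes, test_doc)
predicted = classify(index, classes, test_doc)
```

Because each column follows the term order of the test document, a long run of 1s in a row means that a contiguous stretch of the test document is made up of terms seen for that class, which is one way the sequence information could be put to use.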
Copyright information
© 2012 ICST Institute for Computer Science, Social Informatics and Telecommunications Engineering
About this paper
Cite this paper
Harish, B.S., Guru, D.S., Manjunath, S. (2012). Classification of Text Documents Using B-Tree. In: Meghanathan, N., Chaki, N., Nagamalai, D. (eds) Advances in Computer Science and Information Technology. Computer Science and Engineering. CCSIT 2012. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 85. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-27308-7_66
DOI: https://doi.org/10.1007/978-3-642-27308-7_66
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-27307-0
Online ISBN: 978-3-642-27308-7
eBook Packages: Computer Science, Computer Science (R0)