Abstract

In this paper, we propose an unconventional method of representing and classifying text documents that preserves the sequence of term occurrence in a test document. The term sequence is preserved with the help of a novel data structure called the ‘Status Matrix’. In addition, to avoid sequential matching during classification, we propose to index the terms in a B-tree, an efficient indexing scheme. Each term in the B-tree is associated with a list of class labels of the documents that contain the term. Further, a corresponding classification technique is proposed. To corroborate the efficacy of the proposed representation and the status matrix based classification, we have conducted extensive experiments on various datasets.
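
The abstract gives no implementation details, so the following is only a minimal sketch of the general idea, under stated assumptions. A sorted list searched with binary search stands in for the B-tree (it offers the same ordered, non-sequential lookup, but it is not a B-tree). The layout of the status matrix (one row per class, one column per test-document term in order of occurrence, binary entries) and the decision rule (pick the class with the longest run of consecutive 1s) are assumptions made for illustration. All names (`TermIndex`, `status_matrix`, `classify`) are hypothetical.

```python
from bisect import bisect_left


class TermIndex:
    """Ordered term index; a sorted-list stand-in for a B-tree (assumption)."""

    def __init__(self):
        self._terms = []   # terms kept in sorted order
        self._labels = []  # _labels[i] is the set of class labels for _terms[i]

    def add_document(self, terms, label):
        """Attach `label` to every term that occurs in a training document."""
        for term in terms:
            i = bisect_left(self._terms, term)
            if i == len(self._terms) or self._terms[i] != term:
                self._terms.insert(i, term)
                self._labels.insert(i, set())
            self._labels[i].add(label)

    def lookup(self, term):
        """Return the class labels stored for `term` (empty set if absent)."""
        i = bisect_left(self._terms, term)
        if i < len(self._terms) and self._terms[i] == term:
            return self._labels[i]
        return set()


def status_matrix(index, classes, test_terms):
    """Assumed layout: rows are classes, columns are test terms in sequence."""
    return [
        [1 if cls in index.lookup(term) else 0 for term in test_terms]
        for cls in classes
    ]


def longest_run(row):
    """Length of the longest run of consecutive 1s in a binary row."""
    best = current = 0
    for value in row:
        current = current + 1 if value else 0
        best = max(best, current)
    return best


def classify(index, classes, test_terms):
    """Assumed rule: choose the class whose row has the longest run of 1s."""
    runs = [longest_run(row) for row in status_matrix(index, classes, test_terms)]
    return classes[runs.index(max(runs))]


if __name__ == "__main__":
    index = TermIndex()
    index.add_document(["goal", "match", "team"], "sports")
    index.add_document(["vote", "election", "party"], "politics")
    print(classify(index, ["politics", "sports"], ["team", "goal", "vote", "match"]))
```

In this toy run, the terms of the test document are looked up in order, so the row for each class records where in the sequence that class is supported. A production version would replace the sorted list with an actual B-tree and would follow the decision rule given in the full paper.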




Copyright information

© 2012 ICST Institute for Computer Science, Social Informatics and Telecommunications Engineering

About this paper

Cite this paper

Harish, B.S., Guru, D.S., Manjunath, S. (2012). Classification of Text Documents Using B-Tree. In: Meghanathan, N., Chaki, N., Nagamalai, D. (eds) Advances in Computer Science and Information Technology. Computer Science and Engineering. CCSIT 2012. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 85. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-27308-7_66

  • DOI: https://doi.org/10.1007/978-3-642-27308-7_66

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-27307-0

  • Online ISBN: 978-3-642-27308-7

  • eBook Packages: Computer Science, Computer Science (R0)
