Structure-Based Supervised Term Weighting and Regularization for Text Classification

  • Conference paper
Natural Language Processing and Information Systems (NLDB 2019)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11608)

Abstract

Text documents contain rich information that is useful for many tasks. How to utilise this information effectively and efficiently for tasks such as text classification is still an active research topic. One approach is to weight the terms in a text document by their relevance to the classification task at hand. Another approach is to utilise structural information in a text document to regularize learning so that the learned model is more accurate. An important question is whether the two approaches can be combined to achieve better performance. This paper presents a novel method that does so. We use supervised term weighting, which utilises the class information in a set of pre-classified training documents, so the resulting term weights are class specific. We also use structured regularization, which incorporates structural information into the learning process. A graph is built for each class from the pre-classified training documents, and the structural information in these graphs is used both to calculate the supervised term weights and to define the groups for structured regularization. Experimental results on six text classification tasks show that utilising structural information in text for both weighting and regularization increases classification accuracy. Using a graph-based text representation for supervised term weighting and structured regularization yields a compact model with a considerable improvement in text classification performance.
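
The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the co-occurrence window, the use of degree centrality as the class-specific term weight, the use of connected components as a stand-in for the grouping step, and all function names are assumptions made for illustration. The penalty function is the standard group lasso penalty, which is one common form of structured regularization.

```python
from collections import defaultdict
import math


def build_class_graphs(docs, labels, window=1):
    """Build one undirected co-occurrence graph per class.

    Assumption: two terms are linked if one appears within `window`
    tokens after the other in a training document of that class.
    """
    graphs = defaultdict(lambda: defaultdict(set))
    for tokens, label in zip(docs, labels):
        for i, term in enumerate(tokens):
            for other in tokens[i + 1 : i + 1 + window]:
                if other != term:
                    graphs[label][term].add(other)
                    graphs[label][other].add(term)
    return graphs


def degree_centrality(graph):
    """Class-specific term weight (assumed): normalised node degree."""
    n = len(graph)
    if n <= 1:
        return {term: 0.0 for term in graph}
    return {term: len(nbrs) / (n - 1) for term, nbrs in graph.items()}


def connected_components(graph):
    """Term groups (assumed): connected components of the class graph."""
    seen, groups = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(graph[node] - comp)
        seen |= comp
        groups.append(comp)
    return groups


def group_lasso_penalty(weights, groups, lam=1.0):
    """Standard group lasso penalty: lam * sum_g sqrt(|g|) * ||w_g||_2."""
    total = 0.0
    for group in groups:
        norm = math.sqrt(sum(weights.get(t, 0.0) ** 2 for t in group))
        total += lam * math.sqrt(len(group)) * norm
    return total


# Toy data, invented for this example only.
docs = [
    ["cheap", "pills", "online"],
    ["cheap", "online", "offer"],
    ["meeting", "agenda", "notes"],
    ["project", "plan"],
]
labels = ["spam", "spam", "ham", "ham"]

graphs = build_class_graphs(docs, labels, window=1)
spam_weights = degree_centrality(graphs["spam"])
ham_groups = connected_components(graphs["ham"])
penalty = group_lasso_penalty({"project": 3.0, "plan": 4.0}, [{"project", "plan"}])
```

In this sketch the same per-class graph feeds both steps: node centrality supplies the term weights, and the graph's partition into groups supplies the blocks that the regularizer shrinks together. A real system would likely replace connected components with a proper community detection method.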

Notes

  1. http://disi.unitn.it/moschitti/corpora.htm

Corresponding author

Correspondence to Niloofer Shanavas.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Shanavas, N., Wang, H., Lin, Z., Hawe, G. (2019). Structure-Based Supervised Term Weighting and Regularization for Text Classification. In: Métais, E., Meziane, F., Vadera, S., Sugumaran, V., Saraee, M. (eds) Natural Language Processing and Information Systems. NLDB 2019. Lecture Notes in Computer Science, vol 11608. Springer, Cham. https://doi.org/10.1007/978-3-030-23281-8_9

  • DOI: https://doi.org/10.1007/978-3-030-23281-8_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-23280-1

  • Online ISBN: 978-3-030-23281-8

  • eBook Packages: Computer Science, Computer Science (R0)
