Utilizing Vector Models for Automatic Text Lemmatization

  • Conference paper
SOFSEM 2016: Theory and Practice of Computer Science (SOFSEM 2016)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9587)

Abstract

In this paper we tackle the problem of lemmatization in inflectional languages. We introduce a new algorithm that utilizes vector models of words. Current approaches in this area require either complete grammar rules or a translation matrix between each word and its base form. However, this information is already encoded in natural text. Our solution uses text corpora to build vector models of words and a small amount of user input to infer lemmas. We evaluated our approach on the Slovak language and present findings on its feasibility for real-world use.
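The abstract does not spell out the algorithm, so the following is only an illustrative sketch of one plausible reading: a few user-supplied (inflected form, lemma) pairs define an average offset in the vector space, and the lemma of a new word is guessed as the nearest neighbour of the word vector shifted by that offset. The four-dimensional vectors, the helper names (`mean_offset`, `guess_lemma`) and the choice of cosine similarity are all assumptions made for this example, not details taken from the paper.

```python
import math

# Toy 4-D "embeddings", invented for illustration only. A real model would be
# trained on a text corpus and have hundreds of dimensions.
VECTORS = {
    "mesto": (1.0, 0.0, 0.0, 0.0),  # lemma
    "mesta": (1.0, 0.0, 0.0, 1.0),  # inflected form
    "žena":  (0.0, 1.0, 0.0, 0.1),
    "ženy":  (0.0, 1.0, 0.0, 1.0),
    "auto":  (0.0, 0.0, 1.0, 0.0),
    "auta":  (0.0, 0.0, 1.0, 0.9),
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_offset(pairs):
    """Average (lemma - form) vector over the user-supplied pairs."""
    diffs = [
        [l - f for f, l in zip(VECTORS[form], VECTORS[lemma])]
        for form, lemma in pairs
    ]
    return [sum(column) / len(diffs) for column in zip(*diffs)]


def guess_lemma(word, pairs):
    """Shift the word vector by the mean offset and return the nearest other word."""
    offset = mean_offset(pairs)
    target = [w + o for w, o in zip(VECTORS[word], offset)]
    candidates = [c for c in VECTORS if c != word]
    return max(candidates, key=lambda c: cosine(VECTORS[c], target))


# "A small amount of user input": two known form/lemma pairs (hypothetical).
known_pairs = [("mesta", "mesto"), ("ženy", "žena")]
print(guess_lemma("auta", known_pairs))  # auto
```

With real embeddings the nearest neighbour would be searched over the whole vocabulary, and the quality of the guess would depend on how consistently the corpus encodes the inflection as a direction in the space.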

Acknowledgments

This work was partially supported by the Scientific Grant Agency of the Slovak Republic, grant No. VG 1/0646/15, and the Cultural and Educational Grant Agency of the Slovak Republic, grant No. KEGA 009STU-4/2014.

Author information

Corresponding author

Correspondence to Marián Šimko.

Copyright information

© 2016 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Gallay, L., Šimko, M. (2016). Utilizing Vector Models for Automatic Text Lemmatization. In: Freivalds, R., Engels, G., Catania, B. (eds) SOFSEM 2016: Theory and Practice of Computer Science. SOFSEM 2016. Lecture Notes in Computer Science, vol 9587. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-49192-8_43

  • DOI: https://doi.org/10.1007/978-3-662-49192-8_43

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-662-49191-1

  • Online ISBN: 978-3-662-49192-8

  • eBook Packages: Computer Science, Computer Science (R0)
