A Survey on NLP Resources, Tools, and Techniques for Marathi Language Processing

Published: 27 December 2022

Abstract

Natural Language Processing (NLP) has been practiced for the past couple of decades, and extensive work has been done for Western languages, particularly English. Eastern languages, especially those of the Indian subcontinent, need attention because comparatively little language processing work has been done on them. Western languages are rich in dictionaries, WordNets, and associated tools, whereas Indian languages lag behind in these resources. Marathi is the third most spoken language in India and the 15th most spoken language worldwide. A lack of resources, complex linguistic features, and the inclusion of prevalent dialects from neighboring regions have resulted in limited work on Marathi. This study provides insight into the linguistic resources, tools, and state-of-the-art techniques applied to Marathi language processing. It first gives a morphological description of Marathi and then discusses the characteristics of the language. It then reviews the corpora, tools, and techniques available for developing NLP tasks for Marathi. Finally, it presents a gap analysis of current research and lists future directions for this new and dynamic area that should benefit the Marathi language processing research community.



      • Published in

        ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 2
        February 2023
        624 pages
        ISSN:2375-4699
        EISSN:2375-4702
        DOI:10.1145/3572719

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 27 December 2022
        • Online AM: 13 July 2022
        • Accepted: 4 July 2022
        • Revised: 8 June 2022
        • Received: 14 May 2021

        Qualifiers

        • research-article
        • Refereed
