research-article

A Survey on NLP Resources, Tools, and Techniques for Marathi Language Processing

Authors:
Pawan Lahoti

Department of Computer Science and Engineering, Malaviya National Institute of Technology, Jaipur, Rajasthan, India

ORCID: 0000-0002-8903-3735

Namita Mittal

Department of Computer Science and Engineering, Malaviya National Institute of Technology, Jaipur, Rajasthan, India

ORCID: 0000-0001-6886-9974

Girdhari Singh

Department of Computer Science and Engineering, Malaviya National Institute of Technology, Jaipur, Rajasthan, India

ORCID: 0000-0003-3971-4687

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 2, Article No. 47, pp. 1–34. https://doi.org/10.1145/3548457

Published: 27 December 2022

Abstract

Natural Language Processing (NLP) has been practiced for the past couple of decades, and extensive work exists for Western languages, particularly English. Their Eastern counterparts, especially the languages of the Indian subcontinent, need attention because comparatively little language processing work has been done on them. Western languages are rich in dictionaries, WordNets, and associated tools, whereas Indian languages lag behind in these resources. Marathi is the third most spoken language in India and the 15th most spoken language worldwide. A lack of resources, complex linguistic features, and the influence of prevalent neighboring dialects have limited work on Marathi. This study aims to provide insight into the linguistic resources, tools, and state-of-the-art techniques applied to Marathi language processing. It first presents morphological descriptions of Marathi and then discusses the characteristics of the language. It then reviews the corpora, tools, and techniques available for developing NLP tasks for Marathi. Finally, it analyzes gaps in current research and lists future directions for this new and dynamic research area that will benefit the Marathi language processing research community.

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Language resources
2. General and reference
  1. Document types
    1. Surveys and overviews

Published in
ACM Transactions on Asian and Low-Resource Language Information Processing Volume 22, Issue 2
February 2023
624 pages
ISSN:2375-4699
EISSN:2375-4702
DOI:10.1145/3572719
Editor:
Imed Zitouni
Google, USA
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 27 December 2022
- Online AM: 13 July 2022
- Accepted: 4 July 2022
- Revised: 8 June 2022
- Received: 14 May 2021
Author Tags
Marathi language
Marathi morphology
Marathi resources
Part-of-Speech (POS) tagging
Named Entity Recognition (NER)
Word Sense Disambiguation (WSD)
Qualifiers
- research-article
- Refereed