Abstract
Natural Language Processing (NLP) has been practiced for the past couple of decades, and extensive work has been done for Western languages, particularly English. Eastern languages, especially those of the Indian subcontinent, need attention because comparatively little language processing work has been done on them. Western languages are rich in dictionaries, WordNets, and associated tools, while Indian languages lag behind in these resources. Marathi is the third most spoken language in India and the 15th most spoken language worldwide. A lack of resources, complex linguistic features, and the inclusion of prevalent dialects from neighboring regions have resulted in limited work on Marathi. The aim of this study is to provide insight into the linguistic resources, tools, and state-of-the-art techniques applied to Marathi language processing. First, a morphological description of Marathi is provided, followed by a discussion of the language's characteristics. Next, the corpora, tools, and techniques available for developing NLP tasks in Marathi are reviewed. Finally, gaps in current research are analyzed, and future directions for this new and dynamic area are listed to benefit the Marathi language processing research community.
Index Terms
- A Survey on NLP Resources, Tools, and Techniques for Marathi Language Processing