Abstract
Machine translation is a core problem in natural language processing research across the globe. However, building a statistical machine translation (SMT) system for low-resource languages remains a challenge. This work proposes a phrase-induced hybrid machine translation system for translation from English to Tamil in a low-resource setting and studies its effect. Unlike conventional hybrid MT systems, the proposed system exploits the free word order of the target language, Tamil, to form a re-ordered target language model and to extend the parallel text corpus used for training the SMT. A novel rule-based phrase-extraction method, implemented using parts-of-speech (POS) and place-of-pause in both languages, is proposed and used to pre-process the training corpus for developing the back-off phrase-induced SMT. Further, out-of-vocabulary (OOV) words are handled using speech-based transliteration and two-level thesaurus intersection techniques, based on the POS tag of the OOV word. To ensure that input containing OOV words does not skip phrase-level translation in the hierarchical model, a phrase-level example-based machine translation approach is adopted to find the closest matching phrase and perform translation, followed by OOV replacement. The proposed system achieves a bilingual evaluation understudy (BLEU) score of 84.78 and a translation edit rate (TER) of 19.12. The performance of the system is compared, in terms of adequacy and fluency, with existing translation systems for this language pair, and the proposed system is observed to outperform its counterparts.
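The phrase-level example-based step described in the abstract (find the closest matching phrase, translate it, then replace the OOV word) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the example base `EXAMPLES`, the `TA_*` placeholder tokens standing in for Tamil words, the word-level alignment dictionaries, the `handle_oov` callback, and the use of the token-level similarity ratio from Python's `difflib` as the closeness measure.

```python
from difflib import SequenceMatcher

# Hypothetical example base: source phrase -> (target tokens, word alignment).
# "TA_*" tokens are placeholders standing in for Tamil words, not real data.
EXAMPLES = {
    "in the big house": (
        ["TA_big", "TA_house", "TA_in"],
        {"big": "TA_big", "house": "TA_house", "in": "TA_in"},
    ),
    "near the old temple": (
        ["TA_old", "TA_temple", "TA_near"],
        {"old": "TA_old", "temple": "TA_temple", "near": "TA_near"},
    ),
}


def closest_example(tokens):
    """Return the stored source phrase with the highest token-level similarity."""
    return max(
        EXAMPLES,
        key=lambda src: SequenceMatcher(None, src.split(), tokens).ratio(),
    )


def translate_phrase(phrase, handle_oov):
    """Translate via the closest stored example, then substitute differing words.

    handle_oov is a caller-supplied function (assumed here) that maps an
    unknown source word to a target-side string, for instance a
    transliteration or a synonym lookup.
    """
    tokens = phrase.split()
    source = closest_example(tokens)
    src_tokens = source.split()
    target, align = EXAMPLES[source]
    target = list(target)  # copy so the example base is not mutated

    # Find the words where the input differs from the matched example.
    matcher = SequenceMatcher(None, src_tokens, tokens)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            for old, new in zip(src_tokens[i1:i2], tokens[j1:j2]):
                if old in align:
                    # Swap the aligned target word for the handled OOV word.
                    target[target.index(align[old])] = handle_oov(new)
    return target


if __name__ == "__main__":
    print(translate_phrase("in the big bungalow", lambda w: f"OOV({w})"))
```

In this sketch the input phrase keeps the target-side word order of the matched example, and only the slot aligned to the differing source word is filled by the OOV handler. A real system would need a learned or rule-based alignment and a similarity measure suited to the language pair.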
Index Terms
- Pause-Based Phrase Extraction and Effective OOV Handling for Low-Resource Machine Translation Systems