Pause-Based Phrase Extraction and Effective OOV Handling for Low-Resource Machine Translation Systems

Published: 14 December 2018
Abstract

Machine translation is a core problem in natural language processing research worldwide. However, building a statistical machine translation (SMT) system for low-resource languages remains a challenge. This work proposes a phrase-induced hybrid machine translation system for English-to-Tamil translation under a low-resource setting and studies its effect. Unlike conventional hybrid MT systems, the proposed system exploits the free word order of the target language, Tamil, to form a re-ordered target language model and to extend the parallel text corpus used for training the SMT. A novel rule-based phrase-extraction method, implemented using parts of speech (POS) and place of pause in both languages, is proposed and used to pre-process the training corpus for developing the back-off phrase-induced SMT. Further, out-of-vocabulary (OOV) words are handled using speech-based transliteration and two-level thesaurus intersection techniques based on the POS tag of the OOV word. To ensure that input containing OOV words does not skip phrase-level translation in the hierarchical model, a phrase-level example-based machine translation approach is adopted to find the closest matching phrase and translate it, followed by OOV replacement. The proposed system achieves a bilingual evaluation understudy (BLEU) score of 84.78 and a translation edit rate (TER) of 19.12. Its performance is also compared, in terms of adequacy and fluency, with existing translation systems for this language pair, and the proposed system is observed to outperform its counterparts.
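For readers unfamiliar with the two automatic metrics reported above, the following is a minimal Python sketch of how a bilingual evaluation understudy (BLEU) score and an edit-rate score can be computed for a single sentence pair. It is an illustration only, not the evaluation code used in this work. It assumes one reference translation, whitespace-tokenized input, uniform n-gram weights up to 4-grams, and no smoothing. The function names `ngrams`, `bleu`, and `edit_rate` are hypothetical. The edit-rate function omits the phrase-shift operation of the full translation edit rate, so it is closer to a word error rate. The tokens in the usage lines are toy placeholders, not data from the paper.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU on a 0-100 scale against a single reference.

    Uses clipped (modified) n-gram precision, uniform weights, and the
    brevity penalty. No smoothing: any zero precision gives a score of 0.
    """
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        total = sum(cand.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return 100 * brevity_penalty * math.exp(sum(log_precisions) / max_n)


def edit_rate(candidate, reference):
    """Word-level edit distance divided by reference length, as a percentage.

    Simplification: insertions, deletions, and substitutions only; the
    shift operation of full TER is not modeled. Assumes a non-empty reference.
    """
    prev = list(range(len(reference) + 1))
    for i, cand_word in enumerate(candidate, 1):
        curr = [i]
        for j, ref_word in enumerate(reference, 1):
            curr.append(min(prev[j] + 1,                            # delete
                            curr[j - 1] + 1,                        # insert
                            prev[j - 1] + (cand_word != ref_word)))  # substitute
        prev = curr
    return 100 * prev[-1] / len(reference)


# Toy usage with placeholder tokens (not data from the paper).
reference = "a b c d e".split()
candidate = "a b c d x".split()
print(round(bleu(candidate, reference), 2), edit_rate(candidate, reference))
```

Higher BLEU indicates closer n-gram overlap with the reference, while a lower edit rate indicates that fewer edits are needed to match it. Corpus-level scores such as those reported above aggregate counts over all test sentences rather than averaging per-sentence values.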

      • Published in

        ACM Transactions on Asian and Low-Resource Language Information Processing  Volume 18, Issue 2
        June 2019
        208 pages
        ISSN:2375-4699
        EISSN:2375-4702
        DOI:10.1145/3300146
        Copyright © 2018 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 14 December 2018
        • Accepted: 1 August 2018
        • Revised: 1 May 2018
        • Received: 1 February 2018
        Qualifiers

        • research-article
        • Research
        • Refereed
