research-article

Pause-Based Phrase Extraction and Effective OOV Handling for Low-Resource Machine Translation Systems

Authors:
K. Mrinalini, SSN College of Engineering, India (ORCID: 0000-0003-4632-2518)
T. Nagarajan, SSN College of Engineering, India
P. Vijayalakshmi, SSN College of Engineering, India

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 18, Issue 2, Article No. 12, pp. 1–22. https://doi.org/10.1145/3265751

Published: 14 December 2018

Abstract

Machine translation is a core problem in natural language processing research worldwide. However, building a statistical machine translation (SMT) system for low-resource languages remains a challenge. This work proposes a phrase-induced hybrid machine translation system for English-to-Tamil translation in a low-resource setting and studies its effect. Unlike conventional hybrid MT systems, it exploits the free word order of the target language, Tamil, to form a re-ordered target language model and to extend the parallel text corpus used to train the SMT system. A novel rule-based phrase-extraction method, implemented using part-of-speech (POS) tags and place-of-pause in both languages, is proposed and used to pre-process the training corpus for developing the back-off phrase-induced SMT system. Further, out-of-vocabulary (OOV) words are handled using speech-based transliteration and two-level thesaurus intersection techniques, based on the POS tag of the OOV word. To ensure that input containing OOV words does not skip phrase-level translation in the hierarchical model, a phrase-level example-based machine translation approach is adopted to find the closest matching phrase and translate it, followed by OOV replacement. The proposed system achieves a bilingual evaluation understudy (BLEU) score of 84.78 and a translation edit rate (TER) of 19.12. Its performance is compared, in terms of adequacy and fluency, with existing translation systems for this language pair, and the proposed system is observed to outperform its counterparts.
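For readers unfamiliar with the BLEU figure quoted above, the following is a minimal sketch of how a BLEU score is computed for a single sentence against a single reference, reported on a 0 to 100 scale. It follows the standard definition (clipped n-gram precisions up to 4-grams, geometric mean, brevity penalty). The abstract does not say which scoring tool or settings the authors used, so the function names `ngrams` and `sentence_bleu`, the whitespace tokenisation, and the absence of smoothing are all assumptions made for illustration. Reported scores are normally computed at corpus level rather than per sentence.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Unsmoothed single-reference BLEU on a 0-100 scale (illustrative only)."""
    cand = candidate.split()  # assumption: plain whitespace tokenisation
    ref = reference.split()
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        if total == 0 or clipped == 0:
            # Without smoothing, any zero precision makes the score zero.
            return 0.0
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty: penalise candidates that are not longer than the reference.
    if len(cand) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1.0 - len(ref) / len(cand))

    # Geometric mean of the n-gram precisions with uniform weights.
    return 100.0 * brevity_penalty * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    print(sentence_bleu("the cat sat on the mat", reference))
    print(sentence_bleu("the cat sat on a mat", reference))
```

Because a single substituted word breaks every n-gram that spans it, the second call scores far below the first even though five of six words match. This sensitivity is one reason BLEU is usually read alongside an edit-based measure such as TER and human adequacy and fluency judgments.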

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Language resources
      2. Machine translation

Published in
ACM Transactions on Asian and Low-Resource Language Information Processing Volume 18, Issue 2
June 2019
208 pages
ISSN:2375-4699
EISSN:2375-4702
DOI:10.1145/3300146
Editor:
Nianwen Xue
Brandeis University, Waltham, USA
Copyright © 2018 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 14 December 2018
- Accepted: 1 August 2018
- Revised: 1 May 2018
- Received: 1 February 2018
Author Tags
Low-resource machine translation
PL-EBMT
POS
place-of-pause based phrase extraction
thesaurus intersection
Qualifiers
- research-article
- Research
- Refereed