Abstract
Natural Language Processing (NLP) is the field that strives to bridge the communication gap between different sections of society. Machine Translation (MT), one of the NLP tasks, translates text from one natural language into another by understanding the source language and generating the target language. Research on Myanmar-English MT has been under way since 2010. However, translation accuracy has not improved much because of the complex syntactic structure of the Myanmar language and the scarcity of resources. We found that it takes a lot of time to collect language resources such as Myanmar-English aligned corpora and treebanks. This paper presents current work on a Myanmar-English machine translation system based on statistical methods. The aim of this paper is to introduce the Myanmar-English translation model and a comparative study using Asian Language Treebank (ALT) data.
1 Introduction
Machine translation involves many tasks. For the Myanmar language, word segmentation is the first step because Myanmar text does not use spaces between words. After that, syntactic analysis, semantic analysis, and synthesis have to be carried out to complete the translation. Three approaches are used in MT: rule-based, example-based, and statistical. Our main motivation for this research is to investigate Myanmar-English MT based on statistical methods. We evaluate the accuracy and also contribute a comparative study of the translation model using ALT data.
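To illustrate why segmentation has to come first, the following is a minimal sketch of greedy longest-match segmentation over unspaced text. It is not the segmenter used in this work; the function name, the toy lexicon, and the use of unspaced English as a stand-in for Myanmar script are all assumptions made for illustration.

```python
def longest_match_segment(text, lexicon):
    """Greedy left-to-right longest-match segmentation (illustrative sketch).

    `lexicon` is a hypothetical set of known words. Characters that start
    no known word are emitted as single-character tokens.
    """
    max_len = max((len(word) for word in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink until a word matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Unspaced English stands in for unspaced Myanmar text here.
toy_lexicon = {"she", "read", "reads", "a", "book"}
segments = longest_match_segment("shereadsabook", toy_lexicon)
```

A real Myanmar segmenter would typically operate on syllables rather than single characters and would need a strategy for ambiguous matches; this sketch only shows the basic idea.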
2 Related Work
Few studies currently address the automatic translation of Myanmar to English. This section reviews previous work on machine translation for the Myanmar language. Recent statistical machine translation systems are based on phrases or word groups and use a probabilistic model, following either the source-channel approach or a direct probability model (log-linear model).
Czajkowski and Wai [4, 7] studied a Myanmar-English bidirectional machine translation system using a transfer-based approach. In the analysis stage, the input source sentence is parsed using existing parsers: the Stanford Parser for English [13] and the Myanmar 3 parser for Myanmar [11]. They also used a tree-to-tree transformation approach, with Synchronous Context Free Grammar (SCFG) rules, to change the source sentence structure into the target sentence structure. An example is shown in Fig. 1.
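The two model families named above can be written compactly. The notation is not taken from the cited works and is an assumption for illustration: f denotes the source (Myanmar) sentence, e a candidate target (English) sentence, h_m are feature functions, and λ_m are their weights.

```latex
% Source-channel (noisy-channel) model: translation model times language model
\hat{e} = \arg\max_{e} P(e \mid f) = \arg\max_{e} P(f \mid e)\, P(e)

% Direct (log-linear) model: weighted sum of M feature functions
\hat{e} = \arg\max_{e} \sum_{m=1}^{M} \lambda_m \, h_m(e, f)
```

The source-channel model is the special case of the log-linear model with two features, log P(f | e) and log P(e), each weighted by one.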
Input:
Output: She is a beautiful girl.
They translated English to Myanmar with the same tree-to-tree transformation approach based on SCFG rules. Morphological synthesis is also applied to make the translation smoother, because Myanmar is a morphologically rich language. It pays particular attention to the articles (a, an, the) and cardinal numbers. They are translated as and so on, but in the Myanmar language the translation of an article depends on the noun. To solve this problem, the sense of the noun is obtained from Myanmar WordNet. They also used a Myanmar-English bilingual lexicon of 13,373 words.
Foster et al. [3] proposed string-to-tree and tree-to-string statistical machine translation for the Myanmar language. They evaluated the quality of string-to-tree (S2T) and tree-to-string (T2S) statistical machine translation methods between Myanmar and Chinese, English, French, and German in both directions. They used the multilingual Basic Travel Expressions Corpus (BTEC), which is a collection of travel-related expressions [12]. The BLEU score is 44.53 for Myanmar to English and 42.83 for English to Myanmar.
Thu et al. [15] studied factored machine translation for Myanmar to English and Japanese, and vice versa. Factored machine translation models extend traditional Phrase-Based Statistical Machine Translation (PB-SMT) by taking into account not only the surface form of the words but also linguistic knowledge such as the dictionary form (lemma), part-of-speech (POS), and morphological tags. They also used the Basic Travel Expressions Corpus (BTEC) [13]. The BLEU score for Myanmar to English is 20.74. They assumed that, due to the lack of training data for the POS tag factor, the Myanmar annotations for that factor tended to be incomplete and potentially inaccurate. Most previous research used small corpora, and most of the NLP work is rule-based. In this work, we focus on machine translation with ALT data.
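As a small illustration of the factored representation described above, the sketch below attaches a lemma and a POS tag to each surface word. The pipe-separated layout follows a convention commonly used for factored input in phrase-based toolkits. Whether [15] used exactly this layout is an assumption, and the function name and the English example tokens are hypothetical.

```python
def to_factored(tokens, separator="|"):
    """Join (surface, lemma, pos) triples into one factored sentence string."""
    return " ".join(separator.join(factors) for factors in tokens)


# Hypothetical English example; each token carries three factors.
sentence = [
    ("girls", "girl", "NNS"),
    ("are", "be", "VBP"),
    ("beautiful", "beautiful", "JJ"),
]
factored = to_factored(sentence)
```

With this input, `factored` is the string `girls|girl|NNS are|be|VBP beautiful|beautiful|JJ`, so a model can back off from the surface form to the lemma or POS factor when the surface form is rare.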
3 Overview of Myanmar-English Translation Model
Figure 2 shows an overview of machine translation from Myanmar to English. In the writing systems of many Asian languages, such as Myanmar, Chinese, Japanese, and Thai, words are not delimited by spaces, so Myanmar text contains no blanks that mark word boundaries. In the segmentation step, we used the Myanmar Language Segmentator published by the UCSY NLP Lab [14].
In the part-of-speech tagging step, the segmented sentence is tagged using the bigram part-of-speech tagger for the Myanmar language [10]. That tagger uses 20 POS tags and 6 finer tags. The category of a word can be constructed from the features of that word. For instance, when a POS tag carries a category, a word is tagged with NN.Person (Person category of the Noun tag), PPM.Direction (Direction type of Postpositional Marker), PRN.Person (Person type of Pronoun), JJ.Dem (Demonstrative sense of Adjective), RB.State (State of Adverb), and so on. The ALT data defines 14 POS tags for the ALT corpus in order to capture more detailed syntactic information for both source and target languages. They are Abbreviation (ABB), Adjective (ADJ), Adverb (ADV), Conjunction (CONJ), Foreign word (FOR), Interjection (INT), Noun (N), Number (NUM), Particle (PART), Postpositional marker (PPM), Pronoun (PRON), Punctuation (PUNC), Symbol (SB), and Verb (V).
In the translation phase, the tagged Myanmar sentence is translated into an English sentence using the phrase-based Myanmar-English translation model proposed by Thet et al. [2]. Myanmar is an inflected language, and very few corpora have been created or studied for it compared with other languages. Therefore, the Myanmar phrase translation model is based on the syntactic structure and morphology of the Myanmar language (see Fig. 3).
Moreover, this translation model also interacts with Word Sense Disambiguation (WSD) [5] to resolve ambiguities when a phrase has more than one sense. For example, the polysemous Myanmar noun would translate to three different English words in the following three sentences:
The Myanmar-English bilingual corpus proposed in [6] is used as the main knowledge source for this phrase translation and word sense disambiguation. Such bitext corpora play an important role in the development of machine translation. The metadata annotation includes: (i) part-of-speech information, (ii) lemma information, (iii) segmented words, (iv) word/phrase alignment, and (v) locality information. The full format specification is available as a txt file. In total, the corpus consists of approximately 5000 parallel sentences from the general domain (such as local newspapers, dictionaries, and middle school textbooks) [16]. Moreover, Myanmar is a verb-final language, so reordering is needed when translating between Myanmar and languages with a different word order. This system uses the reordering rules proposed in [2], namely automatic reordering rule generation and the application of the generated reordering rules in a stochastic reordering model (see Fig. 4).
The ALT project was first proposed by the National Institute of Information and Communications Technology, Japan (NICT) in 2014. NICT started to build the Japanese and English ALT and worked with the University of Computer Studies, Yangon, Myanmar (UCSY) to build the Myanmar ALT in 2015. ALT has about 20,000 sentences extracted from English Wikinews. These sentences were translated into the six languages and annotated with word segmentation, POS tags, and syntactic analysis, in addition to word alignment information. Figure 5 shows the word alignment annotation between an English sentence and the corresponding translated Myanmar sentence, and Fig. 6 shows the constituency tree building [9].
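The ALT tag set and the TAG.Category notation described above can be captured in a few lines. This is a sketch only: the tag abbreviations and names come from the text, while the helper names `split_tag` and `unknown_tags` and the example tokens are hypothetical.

```python
# The 14 ALT POS tags listed in the text.
ALT_POS_TAGS = {
    "ABB": "Abbreviation",
    "ADJ": "Adjective",
    "ADV": "Adverb",
    "CONJ": "Conjunction",
    "FOR": "Foreign word",
    "INT": "Interjection",
    "N": "Noun",
    "NUM": "Number",
    "PART": "Particle",
    "PPM": "Postpositional marker",
    "PRON": "Pronoun",
    "PUNC": "Punctuation",
    "SB": "Symbol",
    "V": "Verb",
}


def split_tag(tag):
    """Split 'NN.Person' into ('NN', 'Person'); a plain tag gives (tag, None)."""
    base, _, category = tag.partition(".")
    return base, category or None


def unknown_tags(tagged_tokens):
    """Return the (word, tag) pairs whose tag is not in the ALT tag set."""
    return [(word, tag) for word, tag in tagged_tokens if tag not in ALT_POS_TAGS]


# Hypothetical tagged tokens; English glosses stand in for Myanmar words.
bad = unknown_tags([("book", "N"), ("reads", "V"), ("oops", "XYZ")])
```

A check like `unknown_tags` is useful when mixing data from the bigram tagger, which uses a different tag inventory, with ALT-annotated data.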
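To make the need for reordering concrete, the sketch below applies one hand-written rule that moves a clause-final verb to the position right after the first noun or pronoun. It is a toy illustration only: the rules in [2] are generated automatically and applied in a stochastic model, which this sketch does not reproduce. The function name and the example tokens (English glosses in Myanmar word order) are hypothetical.

```python
def reorder_verb_final(tagged_tokens):
    """Move a clause-final verb group to just after the first N/PRON token.

    `tagged_tokens` is a list of (word, tag) pairs using ALT-style tags.
    The input is returned unchanged if there is no clause-final verb or
    no noun/pronoun before it.
    """
    tagged = list(tagged_tokens)

    # Skip trailing punctuation so it stays at the end of the sentence.
    end = len(tagged)
    while end > 0 and tagged[end - 1][1] == "PUNC":
        end -= 1

    # Collect the run of verbs that closes the clause.
    start = end
    while start > 0 and tagged[start - 1][1] == "V":
        start -= 1
    if start == end:
        return tagged

    # Treat the first noun or pronoun as the subject (a strong simplification).
    subject = next(
        (i for i, (_, tag) in enumerate(tagged[:start]) if tag in {"N", "PRON"}),
        None,
    )
    if subject is None:
        return tagged

    verbs = tagged[start:end]
    return tagged[:subject + 1] + verbs + tagged[subject + 1:start] + tagged[end:]


# English glosses in verb-final order: "she a book reads ."
source_order = [("she", "PRON"), ("a", "PART"), ("book", "N"), ("reads", "V"), (".", "PUNC")]
target_order = [word for word, _ in reorder_verb_final(source_order)]
```

Here `target_order` becomes `["she", "reads", "a", "book", "."]`. Real sentences need many such rules with learned probabilities, which is what motivates generating them automatically.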
4 Evaluation and Results
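BLEU is the best-known and most widely adopted automatic evaluation metric for machine translation [17]. It is based on the geometric mean of n-gram precisions. To compute the BLEU score, one has to count the number of n-grams in the test translation that have a match in the corresponding reference translations. The formula used to calculate the n-gram precision is simple. For unigrams, the words from a candidate translation that match a word in the reference translation (human translation) are counted and then divided by the number of words in the candidate translation; higher-order n-grams are handled in the same way. IBM’s formula for calculating the BLEU score is as follows [18]:
$$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$
where N is the maximum n-gram order, w_n are positive weights that sum to one (commonly N = 4 and w_n = 1/N), and the brevity penalty BP is calculated as:
$$\mathrm{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{if } c \le r \end{cases}$$
where c is the length of the corpus of hypothesis translations, and r is the effective reference corpus length (calculated as the sum, over all reference sets, of the length of the single reference translation that is closest to the hypothesis translation). The n-gram precision p_n is calculated as:
$$p_n = \frac{\sum_{i} \sum_{ngram \in s_i} count(ngram)}{\sum_{i} \sum_{ngram \in s_i} count_{sys}(ngram)}$$
where s_i is the i-th hypothesis sentence and r_i is its reference, count(ngram) is the count of n-grams found both in s_i and r_i, and count_sys(ngram) is the count of that n-gram in s_i alone.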
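The BLEU computation described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the evaluation script used for the reported scores: it assumes one reference per hypothesis, uniform weights over n-gram orders 1 to 4, no smoothing, and pre-tokenized input. The function names are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU in [0, 1], single reference, uniform weights."""
    matched = [0] * max_n   # n-grams found in both hypothesis and reference
    total = [0] * max_n     # n-grams found in the hypothesis
    c = r = 0               # hypothesis and reference corpus lengths
    for hyp, ref in zip(hypotheses, references):
        c += len(hyp)
        r += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp, n)
            ref_counts = ngram_counts(ref, n)
            # Clip each match by the number of times it occurs in the reference.
            matched[n - 1] += sum(
                min(count, ref_counts[gram]) for gram, count in hyp_counts.items()
            )
            total[n - 1] += sum(hyp_counts.values())
    if min(matched) == 0:
        return 0.0  # without smoothing, any zero precision gives a zero score
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_precision)


reference = "she is a beautiful girl".split()
perfect = corpus_bleu([reference], [reference])
too_short = corpus_bleu(["she is a beautiful".split()], [reference])
```

The function returns a value between 0 and 1; scores such as those quoted in this paper are on a 0 to 100 scale, which corresponds to multiplying by 100. In the second call every n-gram of the hypothesis matches, so the score is lowered only by the brevity penalty.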
According to Fig. 7, the current phrase translation BLEU score is 79.7, obtained with a parallel corpus of 12,817 sentences [2]. The best results were obtained by adding the morphology and POS of the Myanmar language to the baseline system. Postpositional markers are ambiguous in translation. By using POS tags, the system reduced the ambiguity of postpositional markers, especially between the following pairs: Subject PPM “has, have, had” and Place PPM “at”; Subject PPM “null” and Leave PPM “from”; Used PPM “with” and Compare PPM “and”; Used PPM “with” and Cause PPM “because of”; and Place PPM “at” and Extract PPM “among”.
Myanmar word segmentation accuracy is 97 % [14] and Myanmar part of speech tagging accuracy is 597 % [10].
In [5], for evaluation purposes, the test sentences are divided into two groups, with 150 sentences for Type 1 (test sentences taken from the training set) and Type 2 (test sentences composed of words that occur in the training sentences, but that are not themselves sentences in the training set). The accuracy is 98 % for Type 1 and 90 % for Type 2. The studies in [2, 5, 6] used the same training data.
The word alignment accuracy for Myanmar-to-English translation is 89 % [6].
For reordering, the accuracy is 98.9 % for simple sentences, 95.4 % for complex sentences, and 93.6 % for compound sentences [1].
The evaluation result is 82.14 % for Myanmar-to-English machine translation and 80.45 % for English-to-Myanmar machine translation [7].
In the future, more training data will be used, and we expect the accuracy to improve. We also have to test the ALT data for bidirectional Myanmar-English machine translation using human evaluation with bilingual judges.
5 Conclusion
In conclusion, machine translation has been and continues to be a key focus of research in natural language processing, and this research has produced many positive results. Nevertheless, perfection is still far away. Most previous work on Myanmar machine translation used small corpora and rule-based methods. We focused on constructing a statistical MT model to increase the performance of the machine translation system.
This paper also discussed the ALT project. ALT is intended to accelerate NLP development in low-resource Asian languages. The corpus consists of about 20,000 sentences from the news domain, with Asian language translations of a shared English source text together with accompanying word segmentation, word alignment, POS tagging, and syntax trees. ALT includes English, Indonesian, Japanese, Khmer, Malay, Myanmar, and Vietnamese in the short term, and will extend to other languages in the long term through collaboration with international research organizations.
References
Smith, T.F., Waterman, M.S.: Identification of common molecular subsequences. J. Mol. Biol. 147, 195–197 (1981)
May, P., Ehrlich, H.-C., Steinke, T.: ZIB structure prediction pipeline: composing a complex biological workflow through web services. In: Nagel, W.E., Walter, W.V., Lehner, W. (eds.) Euro-Par 2006. LNCS, vol. 4128, pp. 1148–1158. Springer, Heidelberg (2006). doi:10.1007/11823285_121
Foster, I., Kesselman, C.: The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, San Francisco (1999)
Czajkowski, K., Fitzgerald, S., Foster, I., Kesselman, C.: Grid information services for distributed resource sharing. In: 10th IEEE International Symposium on High Performance Distributed Computing, pp. 181–184. IEEE Press, New York (2001)
Foster, I., Kesselman, C., Nick, J., Tuecke, S.: The physiology of the grid: an open grid services architecture for distributed systems integration. Technical report, Global Grid Forum (2002)
National Center for Biotechnology Information. http://www.ncbi.nlm.nih.gov
Wai, T.T., Htwe, T.M., Thein, N.L.: Automatic reordering rule generation and application of reordering rules in stochastic reordering model for English-Myanmar machine translation. Int. J. Comput. Appl. 27(8), 19–25 (2011)
Zin, T.T., Soe, K.M., Thein, N.L.: Translation model of Myanmar phrases for statistical machine translation. In: Huang, D.-S., Gan, Y., Gupta, P., Gromiha, M. (eds.) ICIC 2011. LNCS, vol. 6839, pp. 235–242. Springer, Heidelberg (2012). doi:10.1007/978-3-642-25944-9_31. ISBN 978-3-642-25943-2
Thu, Y.K., Finch, A., Sumita, E., Pa, W.P., Htike, K.W.W.: String to tree and tree to string statistical machine translation for Myanmar language. In: 14th International Conference on Computer Applications, ICCA (2016)
Win, Y.Y., Nwe, T.H.: Myanmar-English bidirectional machine translation system by using transfer based approach. In: 13th International Conference on Computer Applications, ICCA (2015)
Aung, N.T.T., Thein, N.L.: Myanmar word disambiguation for Myanmar-English machine translation. Int. J. Comput. Appl. 27(8) (2011)
Nwet, K.T., Thein, N.L., Soe, K.M.: Word alignment system based on hybrid approach for Myanmar-English machine translation. In: SICE Annual Conference, 13–18 September 2011. Waseda University, Tokyo, Japan (2011)
Win, Y.Y., Thida, A.: English to Myanmar translation system with numerical particle identification. Int. J. Inf. Technol. Comput. Sci. 6, 37–43 (2016). Accessed June 2016 in MECS. http://www.mecs-press.org/
Win, A.T.: Words to phrase reordering machine translation system in Myanmar-English using English grammar rules (2011)
Thu, Y.K., Finch, A., Sumita, E., Pa, W.P.: Introducing the Asian Language Treebank (ALT)
Htay, H.H., Murthy, K.N.: Myanmar word segmentation using syllable level longest matching. In: The 6th Workshop on Asian Language Resources (2008)
Myint, P.H., Htwe, T.M., Thein, N.L.: Bigram part-of-speech for Myanmar language. In: Proceedings of the 2011 International Conference on Information Communication and Management (ICICM 2011) (2011)
Phu, S.L.: Development of Lexico-conceptual knowledge resources and syntax analyzer for Myanmar language. Ph.D. thesis, University of Computer Studies, Mandalay
Kikui, G., Sumita, E., Takezawa, T., Yamamoto, S.: Creating corpora for speech to speech translation. In: Proceedings of EUROSPEECH 2003, pp. 381–384 (2003)
Thu, Y.K., Finch, A., Sumita, E., Sagisaka, Y.: Factored machine translation for Myanmar to English, Japanese and vice versa. ICCA (2012)
Nwet, K.T.: Developing word to phrase alignment for Myanmar-English machine translation. ICCA (2016)
Acknowledgment
This work is partly supported by the ASEAN IVO Project “Open Collaboration for Developing and Using Asian Language Treebank”.
Copyright information
© 2017 Springer International Publishing AG
Cite this paper
Nwet, K.T., Soe, K.M. (2017). Myanmar-English Machine Translation Model. In: Pan, JS., Lin, JW., Wang, CH., Jiang, X. (eds) Genetic and Evolutionary Computing. ICGEC 2016. Advances in Intelligent Systems and Computing, vol 536. Springer, Cham. https://doi.org/10.1007/978-3-319-48490-7_23
DOI: https://doi.org/10.1007/978-3-319-48490-7_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-48489-1
Online ISBN: 978-3-319-48490-7