1 Introduction

Machine translation (MT) involves many tasks. For the Myanmar language, word segmentation is the first step, since Myanmar does not use spaces between words. After that, syntactic analysis, semantic analysis, and synthesis have to be performed to complete the translation. There are three main approaches to MT: rule-based, example-based, and statistical. Our main motivation for this research is to investigate Myanmar-English MT based on statistical methods. We evaluate the accuracy and contribute a comparative study of translation models using Asian Language Treebank (ALT) data.

2 Related Work

Studies of automatic Myanmar-to-English translation are still very few. In this section, previous work on machine translation for the Myanmar language is reviewed. Recent statistical machine translation systems are based on phrases or word groups and use a probabilistic model, following either the source-channel approach or a direct probability model (log-linear model).
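For reference, the two formulations mentioned above are usually written as follows. The notation is the common one from the SMT literature and is an assumption here, not taken from the works reviewed below: f is the source (Myanmar) sentence, e a candidate target (English) sentence, h_m are feature functions, and λ_m are their weights. The source-channel approach decomposes the search into a language model and a translation model:

$$ \hat{e} = \mathop{\arg\max}_{e} P(e \mid f) = \mathop{\arg\max}_{e} P(e)\,P(f \mid e) $$

The log-linear model instead scores each candidate directly with a weighted sum of features:

$$ \hat{e} = \mathop{\arg\max}_{e} \sum_{m = 1}^{M} \lambda_{m} h_{m}(e, f) $$

The source-channel model is the special case with two features, log P(e) and log P(f | e), and equal weights.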

Czajkowski and Wai [4, 7] studied a Myanmar-English bidirectional machine translation system using a transfer-based approach. In the analysis stage, the input source sentence is parsed using existing parsers. They used the Stanford Parser for parsing English [13] and the Myanmar 3 parser for parsing Myanmar [11]. They also used a tree-to-tree transformation approach based on Synchronous Context Free Grammar (SCFG) rules to change the source sentence structure into the target sentence structure. An example is shown in Fig. 1.

Fig. 1. Myanmar sentence structure before transformation and Myanmar sentence structure after transformation

Input:

Output: She is a beautiful girl.

They translated English to Myanmar using the same tree-to-tree transformation approach with Synchronous Context Free Grammar (SCFG) rules. Morphological synthesis is also applied to make the translation smoother, because Myanmar is a morphologically rich language. It considers articles (a, an, the) and cardinal numbers in particular. In Myanmar, the translation of an article depends on the noun it accompanies. To solve this problem, the sense of the noun is obtained from Myanmar WordNet. They also used a Myanmar-English bilingual lexicon of 13,373 words.

Foster et al. [3] proposed string-to-tree and tree-to-string statistical machine translation for the Myanmar language. They published an evaluation of the quality of string-to-tree (S2T) and tree-to-string (T2S) statistical machine translation methods between Myanmar and Chinese, English, French, and German in both directions. They used the multilingual Basic Travel Expressions Corpus (BTEC), which is a collection of travel-related expressions [12]. The BLEU score is 44.53 for Myanmar to English and 42.83 for English to Myanmar.

Thu et al. [15] studied factored machine translation from Myanmar to English and Japanese, and vice versa. Factored machine translation models extend traditional phrase-based statistical machine translation (PB-SMT) by taking into account not only the surface form of the words, but also linguistic knowledge such as the dictionary form (lemma), part-of-speech (POS), and morphological tags. They also used the Basic Travel Expressions Corpus (BTEC) [13]. The BLEU score for Myanmar to English is 20.74. They assumed that, due to the lack of training data for the POS tag factor, the Myanmar annotations for that factor tended to be incomplete and potentially inaccurate.

Most previous research used small corpora, and most of the NLP work is rule-based. In this work, we focus on machine translation with ALT data.

3 Overview of Myanmar-English Translation Model

Figure 2 shows an overview of machine translation from Myanmar to English. In the writing systems of many Asian languages, such as Myanmar, Chinese, Japanese and Thai, words are not delimited by spaces, so Myanmar text has no blanks to mark word boundaries. In the segmentation step, we used the Myanmar Language Segmentator published by the UCSY NLP Lab [14].

In the part-of-speech tagging step, the segmented sentence is tagged using a bigram part-of-speech tagger for the Myanmar language [10]. That tagger uses 20 POS tags and 6 finer tags. The category of a word can be constructed from the features of that word. Examples of POS tags with a category are NN.Person (Person category of the Noun tag), PPM.Direction (Direction type of Postpositional Marker), PRN.Person (Person type of Pronoun), JJ.Dem (Demonstrative sense of Adjective), RB.State (State of Adverb), and so on. The ALT data define 14 POS tags for use in the ALT corpus in order to capture more detailed syntactic information for both source and target languages. They are Abbreviation (ABB), Adjective (ADJ), Adverb (ADV), Conjunction (CONJ), Foreign words (FOR), Interjection (INT), Noun (N), Number (NUM), Particle (PART), Postpositional marker (PPM), Pronoun (PRON), Punctuation (PUNC), Symbol (SB), and Verb (V).

In the translation phase, the tagged Myanmar sentence is translated into an English sentence using the phrase-based Myanmar-English translation model proposed by Thet et al. [2]. Myanmar is an inflected language, and very few corpora have been created or studied for Myanmar compared to other languages. Therefore, the Myanmar phrase translation model is based on the syntactic structure and morphology of the Myanmar language (see Fig. 3).
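As a small illustration of the two tag inventories described in this section, the 14 ALT tags can be held in a lookup table, and a fine-grained tag such as NN.Person can be split into its POS and category parts. This is only a sketch: the names `ALT_POS_TAGS` and `split_tag` are hypothetical and do not come from the ALT distribution or from the tagger in [10].

```python
# The 14 coarse POS tags of the ALT corpus, as listed in the text.
# The dictionary name is hypothetical (chosen for this sketch).
ALT_POS_TAGS = {
    "ABB": "Abbreviation",
    "ADJ": "Adjective",
    "ADV": "Adverb",
    "CONJ": "Conjunction",
    "FOR": "Foreign words",
    "INT": "Interjection",
    "N": "Noun",
    "NUM": "Number",
    "PART": "Particle",
    "PPM": "Postpositional marker",
    "PRON": "Pronoun",
    "PUNC": "Punctuation",
    "SB": "Symbol",
    "V": "Verb",
}


def split_tag(tag):
    """Split a tag such as 'NN.Person' into (pos, category).

    The category is None when the tag has no '.Category' suffix.
    """
    pos, _, category = tag.partition(".")
    return pos, (category or None)
```

For example, `split_tag("PPM.Direction")` separates the postpositional marker tag from its Direction category, while a plain tag such as `"V"` yields no category.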

Fig. 2. Example of one Myanmar word with three English word senses

Fig. 3. Overview of Myanmar to English machine translation

Moreover, this translation model also interacts with Word Sense Disambiguation (WSD) [5] to resolve ambiguities when a phrase has more than one sense. For example, a polysemous Myanmar noun can translate to three different English words in three different sentences.

The Myanmar-English bilingual corpus proposed in [6] is used as the main knowledge source for phrase translation and Word Sense Disambiguation. Such bitext corpora play an important role in the development of machine translation. The meta-data annotation includes: (i) part-of-speech information, (ii) lemma information, (iii) segmented words, (iv) word/phrase alignment, and (v) locality information. The full format specification is available as a txt file. In total, the corpus consists of approximately 5000 parallel sentences from the general domain (such as local newspapers, dictionaries, and middle school textbooks) [16]. Moreover, Myanmar is a verb-final language, so reordering is needed when translating between Myanmar and languages with different word orders. This system used the reordering rules proposed in [2], with automatic reordering rule generation and application of the generated rules in a stochastic reordering model (see Fig. 4).
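To make the need for reordering concrete, the sketch below moves the verb of a single verb-final clause into English subject-verb-object order. It is a toy deterministic rule under strong assumptions (one simple clause, chunks already labelled with the hypothetical roles S, O and V, English glosses instead of Myanmar text); it is not the stochastic reordering model of [2].

```python
# Target position of each clause role in English (subject, verb, object).
# Role labels and function name are hypothetical, chosen for this sketch.
SVO_RANK = {"S": 0, "V": 1, "O": 2}


def reorder_sov_to_svo(chunks):
    """Reorder one simple clause from S-O-V order to S-V-O order.

    chunks: list of (text, role) pairs with role in {"S", "O", "V"}.
    sorted() is stable, so chunks sharing a role keep their relative order.
    """
    return sorted(chunks, key=lambda chunk: SVO_RANK[chunk[1]])
```

With glossed input `[("she", "S"), ("rice", "O"), ("eats", "V")]`, the verb chunk is placed between the subject and the object. A learned reordering model would instead assign probabilities to competing rules.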

Fig. 4. Phrase based translation

The ALT project was first proposed by the National Institute of Information and Communications Technology, Japan (NICT) in 2014. NICT started to build the Japanese and English ALT and worked with the University of Computer Studies, Yangon, Myanmar (UCSY) to build the Myanmar ALT in 2015. ALT has about 20,000 sentences extracted from English Wikinews. These sentences have been translated into six languages and are annotated with word segmentation, POS tags, and syntax analysis, in addition to word alignment information. Figure 5 shows the word alignment annotation between an English sentence and the corresponding translated Myanmar sentence, and Fig. 6 shows the constituency tree building [9].

Fig. 5. Word alignment interface

Fig. 6. Tree building interface

4 Evaluation and Results

BLEU is the best-known and most widely adopted automatic evaluation metric for machine translation [17]. It is based on the geometric mean of n-gram precisions. To compute the BLEU score, one counts the number of n-grams in the test translation that have a match in the corresponding reference translations. The formula used to calculate the n-gram precision is simple. For unigrams, the words from a candidate translation that match a word in the reference translation (human translation) are counted, and the count is then divided by the number of words in the candidate translation. IBM's formula for calculating the BLEU score is as follows [18]:

$$ \text{BLEU} = \text{BP} \times \exp\left( \sum_{n = 1}^{4} \frac{1}{4} \log P_{n} \right) $$
(1)

where the brevity penalty (BP) is calculated as:

$$ \text{BP} = \min\left( 1, e^{1 - r/c} \right) $$
(2)

where c is the length of the corpus of hypothesis translations and r is the effective reference corpus length, calculated as the sum of the lengths of the single reference translation from each set that is closest in length to the hypothesis translation. The n-gram precision is calculated as:

$$ P_{n} = \frac{\sum_{i = 1}^{I} \sum_{ngram \in S_{i}} count(ngram)}{\sum_{i = 1}^{I} \sum_{ngram \in S_{i}} count_{sys}(ngram)} $$
(3)

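Here $S_{i}$ is the i-th system translation and $r_{i}$ is its reference translation. $count(ngram)$ is the count of n-grams found in both $S_{i}$ and $r_{i}$, and $count_{sys}(ngram)$ is the count of n-grams found in $S_{i}$.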
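The computation described by Eqs. (1) to (3) can be sketched in a few lines of Python. This is a minimal illustration and not the scoring tool used in the cited works. It assumes one reference per hypothesis (so r is simply the total reference length), pre-tokenized input, uniform weights of 1/4 over n = 1 to 4 (the geometric mean), clipping of each matched n-gram by its count in the reference, and no smoothing. The function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU with a single reference per hypothesis.

    hypotheses, references: lists of token lists, aligned by index.
    Returns a score in [0, 1]; multiply by 100 for the usual scale.
    """
    matched = [0] * max_n  # numerator of P_n: clipped n-gram matches
    total = [0] * max_n    # denominator of P_n: n-grams in system output
    c = r = 0              # hypothesis and reference corpus lengths
    for hyp, ref in zip(hypotheses, references):
        c += len(hyp)
        r += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            # Clip each n-gram count by its count in the reference.
            matched[n - 1] += sum(
                min(count, ref_counts[gram]) for gram, count in hyp_counts.items()
            )
            total[n - 1] += sum(hyp_counts.values())
    if c == 0 or min(matched) == 0:
        return 0.0  # log(0) is undefined; unsmoothed BLEU is zero here
    # Uniform weights 1/max_n give the geometric mean of the precisions.
    log_mean = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    brevity_penalty = min(1.0, math.exp(1.0 - r / c))
    return brevity_penalty * math.exp(log_mean)
```

A hypothesis identical to its reference scores 1.0, and a hypothesis that is shorter than its reference but otherwise matches is scaled down only by the brevity penalty. Because the sketch is unsmoothed, any hypothesis set with no matching 4-gram scores zero, which is why BLEU is normally reported at corpus level.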

According to Fig. 7, the BLEU score of the current phrase translation model is 79.7, obtained with a parallel corpus of 12,817 sentences [2]. The best results were obtained by adding morphology and POS information for the Myanmar language to the baseline system. Postpositional markers (PPM) have ambiguous meanings in translation. By using POS tags, the system reduced the ambiguity of postpositional markers, especially between the following pairs: Subject PPM “has, have, had” and Place PPM “at”; Subject PPM “null” and Leave PPM “from”; Used PPM “with” and Compare PPM “and”; Used PPM “with” and Cause PPM “because of”; and Place PPM “at” and Extract PPM “among”.

Fig. 7. BLEU scores for translating to English

Myanmar word segmentation accuracy is 97 % [14] and Myanmar part of speech tagging accuracy is 597 % [10].

In [5], for evaluation purposes, the test sentences are divided into two groups, with 150 sentences for Type 1 (test sentences taken from the training set) and Type 2 (test sentences composed of words that occur in the training sentences but that are not identical to any sentence in the training set). The accuracy is 98 % for Type 1 and 90 % for Type 2. The works in [2, 5, 6] used the same training data.

The word alignment accuracy for Myanmar to English translation is 89 % [6].

For reordering, the accuracy is 98.9 % for simple sentences, 95.4 % for complex sentences and 93.6 % for compound sentences [1].

The evaluation result is 82.14 % for Myanmar to English machine translation and 80.45 % for English to Myanmar machine translation [7].

In the future, the system will be trained on more data, which is expected to improve accuracy. We also have to test ALT data for bidirectional Myanmar-English machine translation using human evaluation with bilingual judges.

5 Conclusion

In conclusion, machine translation has been and continues to be a key focus of research in natural language processing, and this research has led to many positive results. However, perfection is still far away. Most previous work on machine translation for the Myanmar language used small corpora and rule-based methods. We focused on the construction of a statistical MT model in order to increase the performance of the machine translation system.

This paper also discussed the ALT project. ALT is intended to accelerate NLP development for low-resource Asian languages. The corpus consists of about 20,000 sentences from the news domain, with Asian language translations of a shared English source text together with accompanying word segmentation, word alignment, POS tagging, and syntax trees. ALT includes English, Indonesian, Japanese, Khmer, Malay, Myanmar and Vietnamese in the short term, and will be extended to other languages in the long term through collaboration with international research organizations.