
Computer Speech & Language

Volume 45, September 2017, Pages 137-148

On integrating a language model into neural machine translation

https://doi.org/10.1016/j.csl.2017.01.014

Abstract

Recent advances in end-to-end neural machine translation models have achieved promising results on high-resource language pairs such as En→Fr and En→De. One of the major factors behind these successes is the availability of high-quality parallel corpora. We explore two strategies for leveraging the abundant monolingual data available for neural machine translation. We observe improvements both from combining the scores of a neural language model trained only on target monolingual data with those of the neural machine translation model, and from fusing the hidden states of the two models. We obtain up to 2 BLEU points of improvement over hierarchical and phrase-based baselines on a low-resource language pair, Turkish→English. Our method was initially motivated by tasks with little parallel data, but we also show that it extends to high-resource language pairs such as Cs→En and De→En, where we obtain 0.39 and 0.47 BLEU improvements over the neural machine translation baselines, respectively.

Introduction

Neural machine translation (NMT) is an end-to-end neural network-based approach to statistical machine translation (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Cho et al., 2014; Bahdanau et al., 2014) which has recently been able to achieve state-of-the-art translation quality on many language pairs (Sutskever et al., 2014; Jean et al., 2014; Chung et al., 2016; Luong and Manning). For the recent advances made since the publication of the initial technical report version of this manuscript (Gulcehre et al., 2015), we refer readers to Section 2.

A large part of the recent success of NMT has been due to the availability of large amounts of high-quality, sentence-aligned corpora. In the case of low-resource language pairs, however, or in tasks with severe domain restrictions, large parallel corpora may not be available. In contrast, monolingual corpora are almost always abundant and universally available. Despite being “unlabeled”, monolingual corpora still exhibit rich linguistic structure that can be useful for machine translation.

In this work, we explore the question of how to leverage monolingual corpora. Specifically, we explore two ways of integrating a language model (LM) trained only on monolingual data (in the target language) into an NMT system: first, by combining the LM and NMT system at the output level (which we term shallow fusion), and second, by fusing the LM and NMT at the level of their hidden states, enabling us to combine them nonlinearly (which we term deep fusion).
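As a concrete illustration of the first strategy, the following is a minimal sketch of shallow fusion at decoding time. The callables nmt_log_probs and lm_log_probs are hypothetical stand-ins for the per-step output distributions of the two pretrained models, and the LM weight beta is a hyperparameter tuned on a development set; the precise scoring scheme used in our experiments is given in Section 4.

# Minimal sketch of shallow fusion at decoding time (illustrative; not the
# exact decoder used in the experiments). It assumes two pretrained models
# exposing per-step next-token log-probabilities over the target vocabulary:
#   nmt_log_probs(src, prefix) -> np.ndarray of shape (vocab_size,)
#   lm_log_probs(prefix)       -> np.ndarray of shape (vocab_size,)
# These helper names are hypothetical.
import numpy as np

def shallow_fusion_step(nmt_log_probs, lm_log_probs, src, prefix, beta=0.1):
    """Combine NMT and LM scores at the output level:
    log p_NMT(y_t | y_<t, x) + beta * log p_LM(y_t | y_<t)."""
    fused = nmt_log_probs(src, prefix) + beta * lm_log_probs(prefix)
    return int(np.argmax(fused))  # greedy choice; beam search scores hypotheses analogously

def greedy_decode(nmt_log_probs, lm_log_probs, src, eos_id, beta=0.1, max_len=100):
    prefix = []
    for _ in range(max_len):
        y = shallow_fusion_step(nmt_log_probs, lm_log_probs, src, prefix, beta)
        prefix.append(y)
        if y == eos_id:
            break
    return prefix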

We evaluate the proposed fusion strategies under two distinct settings. The first setting involves low-resource translation tasks (Turkish→English and Chinese→English SMS chat), where training an NMT system on a small parallel corpus alone often leads to severe overfitting. Second, we further evaluate our methods on higher-resource tasks (German→English and Czech→English). We observe an improvement in translation quality not only on the small-scale tasks but also in the large-scale setting. Furthermore, our experiments show that deep fusion improves translation quality more than shallow fusion does.

In the following section (Section 2), we review recent work in neural machine translation. We present our basic model architecture in Section 3 and describe our shallow and deep fusion approaches in Section 4. Next, we describe our datasets and experiments in Sections 5–6.


Background: neural machine translation

Statistical machine translation (SMT) systems maximize the conditional probability $p(\mathbf{y} \mid \mathbf{x})$ of a correct target translation $\mathbf{y}$ given a source sentence $\mathbf{x}$. This is done by separately modeling a language model $p(\mathbf{y})$ and an (inverse) translation model $p(\mathbf{x} \mid \mathbf{y})$, combined via Bayes’ rule: $p(\mathbf{y} \mid \mathbf{x}) \propto p(\mathbf{x} \mid \mathbf{y})\, p(\mathbf{y})$.
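Written out as a decoding objective, this is the standard noisy-channel formulation (the marginal $p(\mathbf{x})$ does not affect the arg max):

\hat{\mathbf{y}} = \operatorname*{arg\,max}_{\mathbf{y}} \, p(\mathbf{y} \mid \mathbf{x})
                 = \operatorname*{arg\,max}_{\mathbf{y}} \, p(\mathbf{x} \mid \mathbf{y})\, p(\mathbf{y}).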

This decomposition into a language model and a translation model is meant to make full use of the available corpora, i.e., monolingual corpora for fitting the language model and parallel corpora

Model description

We use the RNNSearch model (Bahdanau et al., 2014), which learns to jointly (soft-)align and translate, as the baseline neural machine translation system in this paper; we refer to it as “NMT”.

The encoder of the NMT is a bidirectional RNN that consists of a forward and a backward RNN (Schuster and Paliwal, 1997). The forward RNN reads the input sentence $\mathbf{x} = (x_1, \ldots, x_T)$ from left to right, resulting in a sequence of hidden states $(h_1, \ldots, h_T)$. The backward RNN reads $\mathbf{x}$ in the opposite direction and
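For concreteness, a minimal sketch of such a bidirectional encoder follows; the GRU cells, layer sizes, and module names are illustrative assumptions rather than the exact configuration of the baseline system.

# Minimal sketch of a bidirectional RNN encoder (illustrative assumptions;
# not the exact baseline configuration). The forward and backward hidden
# states are concatenated per source position, yielding the annotations
# that the attention-based decoder attends over.
import torch
import torch.nn as nn

class BiRNNEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=620, hid_dim=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, bidirectional=True, batch_first=True)

    def forward(self, src_tokens):
        # src_tokens: (batch, T) indices of the source words x_1 ... x_T
        emb = self.embed(src_tokens)
        annotations, _ = self.rnn(emb)  # (batch, T, 2 * hid_dim): [forward; backward] per word
        return annotations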

Integrating language model into the decoder

We propose two approaches for integrating a neural language model (NLM) into a neural machine translation system. Without loss of generality, we use a language model implemented as a recurrent neural network (RNNLM; Mikolov et al., 2011), which is equivalent to the decoder described in the previous section except that it is not conditioned on a context vector (i.e., $c_t = 0$ in Eqs. (4)–(5)).
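To make the second approach more concrete, the sketch below shows one plausible realization of deep fusion at the output layer, in which the hidden state of the pretrained RNNLM is scaled by a learned gate and concatenated with the decoder hidden state before predicting the next word. The module name, dimensions, and the exact gating parameterization here are illustrative and not necessarily identical to the formulation used in the experiments.

# Illustrative sketch of deep fusion at the output layer. A scalar gate
# computed from the LM hidden state controls how much the LM contributes at
# each step; the gated LM state is concatenated with the NMT decoder state
# before the softmax over the target vocabulary.
import torch
import torch.nn as nn

class DeepFusionOutput(nn.Module):
    def __init__(self, dec_dim, lm_dim, vocab_size):
        super().__init__()
        self.gate = nn.Linear(lm_dim, 1)                     # controller computed from the LM state
        self.proj = nn.Linear(dec_dim + lm_dim, vocab_size)  # fused readout

    def forward(self, dec_state, lm_state):
        # dec_state: (batch, dec_dim) hidden state of the NMT decoder
        # lm_state:  (batch, lm_dim) hidden state of the pretrained RNNLM
        g = torch.sigmoid(self.gate(lm_state))               # (batch, 1), broadcast over lm_dim
        fused = torch.cat([dec_state, g * lm_state], dim=-1)
        return torch.log_softmax(self.proj(fused), dim=-1)   # log p(y_t | y_<t, x)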

In the following sections, we assume that both an NMT model (on parallel corpora) as well as a recurrent

Datasets

We evaluate the proposed approaches for integrating a monolingual LM into NMT on four different language pairs: Chinese to English (Zh→En), Turkish to English (Tr→En), German to English (De→En) and Czech to English (Cs→En). We consider the Tr→En (IWSLT’14) and Zh→En (OpenMT’15) tasks as low-resource translation tasks. By including the De→En and Cs→En WMT’15 translation tasks in our experiments, we evaluate the performance of our proposed model on high-resource translation tasks as well. In the

Experimental settings

We compare the results of each model after fine-tuning its hyperparameters for each language pair separately on the development set. We treat the type of optimization algorithm as a hyperparameter of the model and, as a result, use a different optimization algorithm for each model.

Results and analysis

In our experiments, we first provide results on low-resource tasks, such as Tr→En IWSLT’14 and Zh→En OpenMT’15 (SMS and call-center transcripts). We also provide results on high-resource tasks, such as De→En and Cs→En, showing that incorporating language models into the decoder of an NMT system can improve the results of these models as well.

A review of recent advances in NMT

Since we proposed the fusion methods in the earlier technical report (Gulcehre et al., 2015), many advances have been made in the field of neural machine translation, specifically in the context of integrating monolingual corpora and low-resource translation. In this section, we review some of the related work and its relation to our proposal.

Conclusion and future work

In this paper, we propose and compare two methods, shallow and deep fusion, for incorporating monolingual corpora into an existing neural machine translation system. We empirically evaluate these approaches on low-resource translation (Tr→En), focused-domain translation (Zh→En, SMS/Chat and conversational speech) and high-resource translation (Cs→En and De→En) tasks. On the first two tasks, we observed that the proposed deep fusion improves the translation quality by up to +2.0 BLEU against the

Acknowledgments

The authors would like to thank the developers of Theano (Bergstra et al., 2010; Bastien et al.). We acknowledge the support of the following organizations for research funding and computing support: NSERC, Samsung, Calcul Québec, Compute Canada, the Canada Research Chairs and CIFAR. O.F. was funded by the TUBITAK 2214-A Program during this work. KC

References (35)

  • Sennrich, R., Haddow, B., Birch, A., 2015. Neural machine translation of rare words with subword units. arXiv preprint...
  • Bahdanau, D., Cho, K., Bengio, Y., 2014. Neural machine translation by jointly learning to align and translate. arXiv...
  • Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I., Bergeron, A., Bouchard, N., Warde-Farley, D.,...
  • J. Bergstra et al.

    Theano: a CPU and GPU math expression compiler

    Proceedings of the Python for Scientific Computing Conference (SciPy)

    (2010)
  • M. Cettolo et al.

    Wit3: Web inventory of transcribed and translated talks

    Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT)

    (2012)
  • D. Chiang

    A hierarchical phrase-based model for statistical machine translation

    Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics

    (2005)
  • K. Cho et al.

    Learning phrase representations using RNN encoder-decoder for statistical machine translation

    Proceedings of the Empiricial Methods in Natural Language Processing (EMNLP 2014)

    (2014)
  • Chung, J., Cho, K., Bengio, Y., 2016. A character-level decoder without explicit segmentation for neural machine...
  • Firat, O., Cho, K., Bengio, Y., 2016. Multi-way, multilingual neural machine translation with a shared attention...
  • I. Goodfellow et al.

    Maxout networks

    Proceedings of The 30th International Conference on Machine Learning

    (2013)
  • A. Graves

    Practical variational inference for neural networks

    Advances in Neural Information Processing Systems

    (2011)
  • Gulcehre, C., Firat, O., Xu, K., Cho, K., Barrault, L., Lin, H.-C., Bougares, F., Schwenk, H., Bengio, Y., 2015. On...
  • Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R. R., 2012. Improving neural networks by...
  • S. Hochreiter et al.

    Long short-term memory

    Neural Comput.

    (1997)
  • Jean, S., Cho, K., Memisevic, R., Bengio, Y., 2014. On using very large target vocabulary for neural machine...
  • N. Kalchbrenner et al.

    Recurrent continuous translation models

    Proceedings of the ACL Conference on Empirical Methods in Natural Language Processing (EMNLP)

    (2013)
  • Kingma, D. P., Ba, J., 2014. Adam: a method for stochastic optimization. arXiv:1412.6980...

    This paper is an extended version of an earlier technical report (Gulcehre et al., 2015).

    1 Equally contributed.
