Data Augmentation by Concatenation for Low-Resource Translation: A Mystery and a Solution

Toan Q. Nguyen; Kenton Murray; David Chiang

doi:10.18653/v1/2021.iwslt-1.33

Abstract

In this paper, we investigate the driving factors behind concatenation, a simple but effective data augmentation method for low-resource neural machine translation. Concatenation improves BLEU by about +1 across four language pairs, and our experiments suggest that discourse context is unlikely to be the cause of this improvement. Instead, we demonstrate that the improvement comes from three other factors unrelated to discourse: context diversity, length diversity, and (to a lesser extent) position shifting.
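As a rough illustration of what concatenation-based augmentation can look like, the sketch below joins each sentence pair with a second pair and adds the result to the training data. The abstract does not spell out the procedure, so everything here is an assumption: the function name `concat_augment`, the uniformly random choice of partner, the single-space separator, and the toy corpus are all hypothetical. A random partner is used only because the abstract reports that discourse context is unlikely to be what helps.

```python
import random


def concat_augment(pairs, seed=0, sep=" "):
    """Return the original pairs plus one concatenated pair per original.

    Assumption of this sketch: each pair is joined with a partner drawn
    uniformly at random from the same corpus, so the partner is usually
    not a discourse neighbour of the original sentence.
    """
    rng = random.Random(seed)      # fixed seed so the sketch is repeatable
    augmented = list(pairs)        # keep every original pair unchanged
    for src, tgt in pairs:
        other_src, other_tgt = rng.choice(pairs)
        # Concatenate source with source and target with target, in the
        # same order, so the new pair stays a valid translation.
        augmented.append((src + sep + other_src, tgt + sep + other_tgt))
    return augmented


# Toy corpus, invented purely for illustration.
corpus = [
    ("source one", "target one"),
    ("source two", "target two"),
    ("source three", "target three"),
]
augmented = concat_augment(corpus)
```

In this sketch the augmented set is twice the size of the original, and the concatenated examples are longer and place each sentence next to varied neighbours, which loosely mirrors the length diversity and context diversity named in the abstract.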

Anthology ID: 2021.iwslt-1.33
Volume: Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021)
Month: August
Year: 2021
Address: Bangkok, Thailand (online)
Editors: Marcello Federico, Alex Waibel, Marta R. Costa-jussà, Jan Niehues, Sebastian Stüker, Elizabeth Salesky
Venue: IWSLT
SIG: SIGSLT
Publisher: Association for Computational Linguistics
Pages: 287–293
URL: https://aclanthology.org/2021.iwslt-1.33
DOI: 10.18653/v1/2021.iwslt-1.33
Cite (ACL): Toan Q. Nguyen, Kenton Murray, and David Chiang. 2021. Data Augmentation by Concatenation for Low-Resource Translation: A Mystery and a Solution. In Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021), pages 287–293, Bangkok, Thailand (online). Association for Computational Linguistics.
Cite (Informal): Data Augmentation by Concatenation for Low-Resource Translation: A Mystery and a Solution (Nguyen et al., IWSLT 2021)
PDF: https://aclanthology.org/2021.iwslt-1.33.pdf
