Data Augmentation by Concatenation for Low-Resource Translation: A Mystery and a Solution

Toan Q. Nguyen, Kenton Murray, David Chiang


Abstract
In this paper, we investigate the driving factors behind concatenation, a simple but effective data augmentation method for low-resource neural machine translation. Our experiments suggest that discourse context is unlikely to be the cause of the roughly +1 BLEU improvement that concatenation yields across four language pairs. Instead, we demonstrate that the improvement comes from three other factors unrelated to discourse: context diversity, length diversity, and (to a lesser extent) position shifting.
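As a rough illustration of what concatenation-based augmentation can look like, the sketch below appends to a parallel corpus one new example per sentence pair, formed by joining that pair with another pair drawn at random. The function name `concat_augment`, the random choice of partner, the single-space separator, and the toy corpus are all assumptions made for illustration; the abstract does not specify how sentences are paired, so this should not be read as the authors' exact procedure.

```python
import random


def concat_augment(pairs, seed=0, sep=" "):
    """Return the original pairs plus one concatenated pair per original pair.

    pairs: list of (source_sentence, target_sentence) tuples.
    The partner for each pair is drawn uniformly at random (an assumption),
    and source and target sides are concatenated in the same order so the
    new example remains a valid translation pair.
    """
    rng = random.Random(seed)
    augmented = list(pairs)
    for src, tgt in pairs:
        other_src, other_tgt = rng.choice(pairs)
        augmented.append((src + sep + other_src, tgt + sep + other_tgt))
    return augmented


# Hypothetical toy corpus, for illustration only.
corpus = [
    ("ein Haus", "a house"),
    ("ein Hund", "a dog"),
    ("guten Morgen", "good morning"),
]
augmented = concat_augment(corpus)
```

In this sketch, each concatenated example places its second sentence after a different first sentence and at a later position, and the resulting examples are longer than the originals, which loosely corresponds to the context diversity, position shifting, and length diversity named in the abstract.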
Anthology ID:
2021.iwslt-1.33
Volume:
Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021)
Month:
August
Year:
2021
Address:
Bangkok, Thailand (online)
Editors:
Marcello Federico, Alex Waibel, Marta R. Costa-jussà, Jan Niehues, Sebastian Stüker, Elizabeth Salesky
Venue:
IWSLT
SIG:
SIGSLT
Publisher:
Association for Computational Linguistics
Pages:
287–293
URL:
https://aclanthology.org/2021.iwslt-1.33
DOI:
10.18653/v1/2021.iwslt-1.33
Cite (ACL):
Toan Q. Nguyen, Kenton Murray, and David Chiang. 2021. Data Augmentation by Concatenation for Low-Resource Translation: A Mystery and a Solution. In Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021), pages 287–293, Bangkok, Thailand (online). Association for Computational Linguistics.
Cite (Informal):
Data Augmentation by Concatenation for Low-Resource Translation: A Mystery and a Solution (Nguyen et al., IWSLT 2021)
PDF:
https://aclanthology.org/2021.iwslt-1.33.pdf