ABSTRACT
We introduce BLANC, a family of dynamic, trainable evaluation metrics for machine translation. Flexible, parametrized models can be learned from past data and automatically optimized to correlate well with human judgments for different criteria (e.g. adequacy, fluency) using different correlation measures. Towards this end, we discuss ACS (all common skip-ngrams), a practical algorithm with trainable parameters that estimates reference-candidate translation overlap by computing a weighted sum of all common skip-ngrams in polynomial time. We show that the BLEU and ROUGE metric families are special cases of BLANC, and we compare correlations with human judgments across these three metric families. We analyze the algorithmic complexity of ACS and argue that it is more powerful in modeling both local meaning and sentence-level structure, while offering the same practicality as the established algorithms it generalizes.
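To make the skip-ngram overlap idea concrete, the sketch below computes a skip-bigram F-measure between a candidate and a reference, in the style of ROUGE-S. This is only an illustration of the simplest special case that ACS generalizes: ACS extends this to a weighted sum over common skip-ngrams of all lengths via dynamic programming, with trainable weights, and the function and variable names here are our own, not the paper's.

```python
from itertools import combinations
from collections import Counter

def skip_bigrams(tokens):
    """Multiset of all ordered token pairs (skip-bigrams), any gap size."""
    return Counter(combinations(tokens, 2))

def skip_bigram_f1(candidate, reference):
    """F-measure over skip-bigrams common to candidate and reference.

    Illustrative only: ACS replaces this unweighted bigram count with a
    parametrized weighted sum over common skip-ngrams of every length.
    """
    cand = skip_bigrams(candidate.split())
    ref = skip_bigrams(reference.split())
    # Counter intersection keeps the min count of each shared skip-bigram.
    common = sum((cand & ref).values())
    if common == 0:
        return 0.0
    precision = common / sum(cand.values())
    recall = common / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Because skip-bigrams allow arbitrary gaps, this score rewards sentence-level word order beyond contiguous n-gram matches, which is the property the abstract highlights.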
REFERENCES
- Y. Akiba, K. Imamura, and E. Sumita. 2001. Using multiple edit distances to automatically rank machine translation output. MT Summit VIII.
- C. Culy and S. Z. Riehemann. 2003. The limits of n-gram translation evaluation metrics. MT Summit IX.
- G. Doddington. 2002. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. Human Language Technology Conference (HLT).
- V. I. Levenshtein. 1965. Binary codes capable of correcting deletions, insertions, and reversals. Doklady Akademii Nauk SSSR.
- C. Y. Lin and F. J. Och. 2004. Automatic evaluation of machine translation quality using longest common subsequence and skip bigram statistics. ACL.
- S. Niessen, F. J. Och, G. Leusch, and H. Ney. 2000. An evaluation tool for machine translation: fast evaluation for MT research. LREC.
- K. Papineni, S. Roukos, T. Ward, and W. J. Zhu. 2001. Bleu: a method for automatic evaluation of machine translation. IBM Research Report.
- R. Soricut and E. Brill. 2004. A unified framework for automatic evaluation using n-gram co-occurrence statistics. ACL.
- K. Y. Su, M. W. Wu, and J. S. Chang. 1992. A new quantitative quality measure for machine translation systems. COLING.
- J. P. Turian, L. Shen, and I. D. Melamed. 2003. Evaluation of machine translation and its evaluation. MT Summit IX.
- C. J. Van Rijsbergen. 1979. Information Retrieval. Butterworths.
- BLANC: learning evaluation metrics for MT