Translationese in Machine Translation Evaluation
Pith reviewed 2026-05-25 17:34 UTC · model grok-4.3
The pith
Differences between original writing and translated text distort machine translation evaluation results.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The analysis finds evidence that text originally written in a language differs from text translated into that language, and that this difference can reduce the accuracy of machine translation evaluations. For this reason, reverse-created test data should be omitted from future machine translation test sets. In addition, a re-evaluation of a past high-profile evaluation that claimed human parity for MT identifies ways to improve reliability, including attention to the statistical power of significance tests.
What carries the argument
Translationese: the presence of unusual features in translated text that distinguish it from text originally written in the same language. The analysis shows that it affects evaluation accuracy.
If this is right
- Reverse-created test data should be omitted from future machine translation test sets.
- Re-evaluations of past human-parity claims can identify ways to improve reliability.
- Power analysis indicates a suitable minimum sample size of translations for studies investigating human parity.
- Using the provided comprehensive check-list can help ensure accuracy and reliability in future MT evaluations.
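The sample-size point in the list above can be illustrated with a minimal sketch. It assumes a two-sided, two-sample comparison under a normal approximation, which is a stand-in and not necessarily the test the paper analyzes. The function name `min_sample_size` and the default `alpha` and `power` values are illustrative choices, not taken from the paper.

```python
from math import ceil
from statistics import NormalDist


def min_sample_size(effect_size: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Per-group sample size for a two-sided, two-sample z-test.

    effect_size is a standardized mean difference (Cohen's d).
    This is a normal approximation; an exact t-test or a rank-based
    test would give slightly different numbers.
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)  # critical value for Type I error
    z_beta = standard_normal.inv_cdf(power)           # quantile for 1 - Type II error
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)


if __name__ == "__main__":
    # Hypothetical effect sizes, chosen only to show how n grows as d shrinks.
    for d in (0.8, 0.5, 0.2):
        print(d, min_sample_size(d))
```

Under these assumptions, a difference of 0.5 standard deviations needs 63 translations per group, while a difference of 0.2 needs 393. A human-parity study built on a small sample can therefore report a tie simply because it lacked the power to detect a real gap.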
Where Pith is reading between the lines
- MT systems trained or evaluated on mixed data may learn to produce translationese rather than natural language.
- Similar biases could affect evaluations in related areas, such as translation across other modalities or multilingual tasks.
- Future work could develop automated methods to create or select test sets free of translationese effects.
Load-bearing premise
That the observed differences between original and translated text are primarily attributable to translationese rather than other factors and that these differences are the main driver of inaccurate evaluation outcomes.
What would settle it
Run the same MT evaluation on test sets whose target side was originally written and on test sets whose references are translations, then check whether system rankings or human-parity conclusions change significantly.
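One way to make this comparison concrete is to rank systems separately on each subset and measure how far the two rankings agree. The sketch below is a hypothetical illustration: the helper names, the system labels, and every score are invented for the example, and Kendall's tau is just one reasonable agreement measure, not one the paper is known to use.

```python
from statistics import mean


def rank_systems(scores_by_system):
    """Return system names ordered from highest to lowest mean score."""
    return sorted(scores_by_system, key=lambda name: mean(scores_by_system[name]), reverse=True)


def kendall_tau(order_a, order_b):
    """Kendall's tau between two rankings of the same systems (no ties)."""
    if set(order_a) != set(order_b):
        raise ValueError("both rankings must cover the same systems")
    position_b = {name: index for index, name in enumerate(order_b)}
    concordant = discordant = 0
    for i in range(len(order_a)):
        for j in range(i + 1, len(order_a)):
            # order_a puts order_a[i] ahead of order_a[j]; does order_b agree?
            if position_b[order_a[i]] < position_b[order_a[j]]:
                concordant += 1
            else:
                discordant += 1
    total = concordant + discordant
    return (concordant - discordant) / total if total else 1.0


# Invented toy scores, not results from the paper.
forward_scores = {"human": [80, 82, 78], "mt_a": [75, 77, 76], "mt_b": [70, 71, 69]}
reverse_scores = {"human": [78, 79, 77], "mt_a": [81, 80, 82], "mt_b": [70, 72, 71]}

forward_rank = rank_systems(forward_scores)
reverse_rank = rank_systems(reverse_scores)
tau = kendall_tau(forward_rank, reverse_rank)
```

A tau near 1 would mean the two subsets rank systems the same way. A clearly lower value, backed by a significance test on the underlying scores, would suggest that the test-set creation direction changes the conclusions.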
Original abstract
The term translationese has been used to describe the presence of unusual features of translated text. In this paper, we provide a detailed analysis of the adverse effects of translationese on machine translation evaluation results. Our analysis shows evidence to support differences in text originally written in a given language relative to translated text and this can potentially negatively impact the accuracy of machine translation evaluations. For this reason we recommend that reverse-created test data be omitted from future machine translation test sets. In addition, we provide a re-evaluation of a past high-profile machine translation evaluation claiming human-parity of MT, as well as analysis of the since re-evaluations of it. We find potential ways of improving the reliability of all three past evaluations. One important issue not previously considered is the statistical power of significance tests applied in past evaluations that aim to investigate human-parity of MT. Since the very aim of such evaluations is to reveal legitimate ties between human and MT systems, power analysis is of particular importance, where low power could result in claims of human parity that in fact simply correspond to Type II error. We therefore provide a detailed power analysis of tests used in such evaluations to provide an indication of a suitable minimum sample size of translations for such studies. Subsequently, since no past evaluation that aimed to investigate claims of human parity ticks all boxes in terms of accuracy and reliability, we rerun the evaluation of the systems claiming human parity. Finally, we provide a comprehensive check-list for future machine translation evaluation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper analyzes adverse effects of translationese on MT evaluation, presenting evidence of differences between original and translated text, recommending omission of reverse-created test data from future MT test sets, re-evaluating a high-profile human-parity claim (including power analysis of significance tests), and providing a checklist for reliable future evaluations.
Significance. If substantiated, the work could improve MT evaluation reliability by addressing translationese and ensuring adequate statistical power in human-parity studies. The power analysis and checklist are constructive contributions that directly support more robust experimental design.
Major comments (2)
- [power analysis and re-evaluation sections] The claim that low statistical power in past human-parity evaluations risks Type II error is central to the re-evaluation, but the paper does not report the exact effect sizes or variance estimates used to derive the recommended minimum sample size. This prevents verification that the suggested N is sufficient across typical MT score distributions.
- [analysis and recommendation sections] The recommendation to omit reverse-created test data rests on observed text differences driving evaluation inaccuracy, yet the manuscript does not isolate translationese from confounders (e.g., domain shift, sentence length, or reference quality) via controlled ablation or regression. The re-evaluation therefore cannot establish that translationese is the primary causal factor rather than a correlate.
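A lightweight version of the confounder control requested in the second major comment is to compare scores within strata of a suspected confounder. The sketch below stratifies by sentence length. It is an assumption-laden illustration: the record fields (`direction`, `length`, `score`), the bucket edges, and the toy values are all hypothetical, and stratification is a simpler substitute for the ablation or regression the referee names.

```python
from collections import defaultdict
from statistics import mean


def length_bucket(n_tokens, edges=(10, 20, 40)):
    """Map a token count to a coarse length bucket label."""
    for edge in edges:
        if n_tokens <= edge:
            return f"<={edge}"
    return f">{edges[-1]}"


def stratified_gap(segments):
    """Mean score gap (reverse minus forward) within each length bucket."""
    grouped = defaultdict(lambda: {"forward": [], "reverse": []})
    for segment in segments:
        bucket = length_bucket(segment["length"])
        grouped[bucket][segment["direction"]].append(segment["score"])
    gaps = {}
    for bucket, by_direction in grouped.items():
        # A bucket with only one direction cannot support a comparison.
        if by_direction["forward"] and by_direction["reverse"]:
            gaps[bucket] = mean(by_direction["reverse"]) - mean(by_direction["forward"])
    return gaps


# Invented toy records, not data from the paper.
segments = [
    {"direction": "forward", "length": 8, "score": 70.0},
    {"direction": "reverse", "length": 9, "score": 76.0},
    {"direction": "forward", "length": 30, "score": 60.0},
    {"direction": "reverse", "length": 35, "score": 68.0},
    {"direction": "reverse", "length": 50, "score": 55.0},
]
gaps = stratified_gap(segments)
```

If the gap between directions persists inside every length bucket, sentence length alone is unlikely to explain it. Domain shift and reference quality would need their own strata or a joint model.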
Minor comments (2)
- [checklist section] The checklist for future evaluations would benefit from being presented as a numbered table rather than inline text for easier reference.
- [figures on text differences] Some figure captions describing text differences could more explicitly state the statistical test and sample size used to support significance claims.
Simulated Author's Rebuttal
We thank the referee for their thoughtful review and constructive suggestions. We address the major comments point by point below, and will incorporate revisions to improve the clarity and rigor of the manuscript.
Point-by-point responses
- Referee: [power analysis and re-evaluation sections] The claim that low statistical power in past human-parity evaluations risks Type II error is central to the re-evaluation, but the paper does not report the exact effect sizes or variance estimates used to derive the recommended minimum sample size. This prevents verification that the suggested N is sufficient across typical MT score distributions.
  Authors: We agree that providing the exact effect sizes and variance estimates is necessary for full reproducibility and verification of the power analysis. In the revised manuscript, we will explicitly report the effect size (Cohen's d or equivalent based on observed score differences) and variance estimates derived from the WMT datasets used in the re-evaluation. This will allow readers to confirm that the recommended minimum sample size is appropriate for typical MT evaluation score distributions.
  Revision: yes
- Referee: [analysis and recommendation sections] The recommendation to omit reverse-created test data rests on observed text differences driving evaluation inaccuracy, yet the manuscript does not isolate translationese from confounders (e.g., domain shift, sentence length, or reference quality) via controlled ablation or regression. The re-evaluation therefore cannot establish that translationese is the primary causal factor rather than a correlate.
  Authors: We acknowledge that our analysis shows correlations between translationese features and evaluation differences but does not include explicit ablation studies or regression models to isolate translationese from all potential confounders. However, the test sets analyzed are from controlled WMT shared tasks where domain and other factors are standardized, and the differences align with known properties of translationese from prior linguistic studies. To strengthen this, we will add a section discussing potential confounders and note the limitations of correlational evidence, while maintaining the recommendation based on the substantial observed effects. A full causal analysis would require additional experiments beyond the scope of this work but is suggested as future research.
  Revision: partial
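The effect size promised in the first response could be reported along the following lines. This is a minimal sketch of Cohen's d with a pooled standard deviation; the function name and the sample values are hypothetical, and the authors may choose an equivalent measure better suited to their score distributions.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(sample_a, sample_b):
    """Standardized mean difference between two independent samples.

    Uses the pooled sample variance, so it assumes the two groups
    have roughly similar spread.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each sample needs at least two observations")
    pooled_variance = (
        (n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)
    ) / (n_a + n_b - 2)
    if pooled_variance == 0:
        raise ValueError("pooled variance is zero; d is undefined")
    return (mean(sample_a) - mean(sample_b)) / sqrt(pooled_variance)


# Invented toy scores, not values from the WMT data.
human_scores = [2, 4, 6]
mt_scores = [1, 3, 5]
d = cohens_d(human_scores, mt_scores)
```

Reporting d together with the per-group variances would let readers recompute the recommended minimum sample size for themselves.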
Circularity Check
No significant circularity; empirical analysis stands on observed data and standard statistical methods.
Full rationale
The paper's core claims rest on direct comparisons of original vs. translated text, re-evaluation of prior MT systems, and power analysis of significance tests. No equations, fitted parameters renamed as predictions, or self-citations are invoked as load-bearing premises that reduce the result to its inputs by construction. The recommendation to omit reverse-created data follows from reported text differences rather than any definitional loop or ansatz smuggled via prior work. This is a standard empirical study with independent content.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Differences exist between originally written text and translated text that can impact MT evaluation accuracy.
Forward citations
Cited by 2 Pith papers
- Maistros: A Greek Large Language Model Adapted Through Knowledge Distillation From Large Reasoning Models
  Maistros 8B is a new state-of-the-art open-weights Greek LLM built via knowledge distillation from large reasoning models on the CulturaQA dataset.
- CHORUS: Effort-Aware Multi-Agent Human-AI Collaboration for Professional Translation
  CHORUS multi-agent system reduced professional translation time by 33.8% while lowering cognitive effort and raising BLEU/COMET scores in a 30-participant within-subject study.