
arxiv: 1906.09833 · v1 · pith:DYN33KKBnew · submitted 2019-06-24 · 💻 cs.CL · cs.AI

Translationese in Machine Translation Evaluation

Pith reviewed 2026-05-25 17:34 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords translationese · machine translation evaluation · human parity · test sets · statistical power · reverse-created data · evaluation reliability

The pith

Differences between original writing and translated text distort machine translation evaluation results.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how translationese, or unusual features in translated text, creates systematic differences from text originally written in the target language. These differences can lead to inaccurate judgments about the quality of machine translation systems. The authors re-evaluate a high-profile study that claimed human parity for MT and point out problems with statistical power in the tests used. They recommend excluding test data created by reverse translation from future evaluations and provide a checklist for more reliable assessments.

Core claim

The analysis finds evidence that text originally written in a given language differs from text translated into that language, and that this difference can reduce the accuracy of machine translation evaluations. For this reason, reverse-created test data should be omitted from future machine translation test sets. In addition, a re-evaluation of a past high-profile evaluation that claimed human parity for MT identifies ways to improve reliability, including attention to the statistical power of significance tests.

What carries the argument

Translationese: the presence of unusual features in translated text that distinguish it from text originally written in the same language. The paper's analysis indicates that these features affect evaluation accuracy.

If this is right

  • Reverse-created test data should be omitted from future machine translation test sets.
  • Re-evaluations of past human-parity claims can identify ways to improve reliability.
  • Power analysis indicates a suitable minimum sample size of translations for studies investigating human parity.
  • Using the provided comprehensive check-list can help ensure accuracy and reliability in future MT evaluations.
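
The link between power and minimum sample size can be illustrated with a minimal sketch. It assumes a two-sided, two-sample comparison of mean scores with equal group sizes and uses the standard normal approximation; the paper's own tests and parameters are not given on this page, so the function names, the default alpha of 0.05 and the target power of 0.8 are illustrative choices only.

```python
from math import ceil
from statistics import NormalDist


def min_sample_size(effect_size: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Per-group sample size for a two-sided, two-sample comparison of means.

    Normal approximation: n = 2 * ((z_{1-alpha/2} + z_{power}) / d)^2,
    where d is the standardized effect size (illustrative assumption).
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value of the two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the target power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


def achieved_power(effect_size: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of the same test for a given per-group sample size."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    return z.cdf(effect_size * (n_per_group / 2) ** 0.5 - z_alpha)


if __name__ == "__main__":
    for d in (0.2, 0.5, 0.8):
        n = min_sample_size(d)
        print(f"d={d}: n per group >= {n}, power ~ {achieved_power(d, n):.3f}")
```

Small effects drive the required sample size up quickly, which is why a test with few translations can fail to detect a real difference between human and machine output (a Type II error) and so appear to support parity.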

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • MT systems trained or evaluated on mixed data may learn to produce translationese rather than natural language.
  • Similar biases could affect evaluation in related areas, such as translation in other modalities or multilingual tasks.
  • Future work could develop automated methods to create or select test sets free of translationese effects.

Load-bearing premise

That the observed differences between original and translated text are primarily attributable to translationese rather than other factors and that these differences are the main driver of inaccurate evaluation outcomes.

What would settle it

Run MT evaluations separately on test data whose target side was originally written in the target language and on test data whose references are translations, then check whether system rankings or human-parity conclusions change significantly.
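
A check of this kind could be sketched as follows. The record layout, the direction labels `forward` and `reverse`, the system names and all scores are made up for illustration; they are not data from the paper.

```python
from collections import defaultdict
from statistics import mean


def rank_by_direction(records):
    """Rank systems by mean score separately for each test-set direction.

    records: iterable of (system, direction, score) tuples, where direction
    is a hypothetical label such as "forward" or "reverse".
    Returns {direction: [systems ordered from best to worst mean score]}.
    """
    scores = defaultdict(lambda: defaultdict(list))
    for system, direction, score in records:
        scores[direction][system].append(score)

    rankings = {}
    for direction, per_system in scores.items():
        means = {system: mean(values) for system, values in per_system.items()}
        rankings[direction] = sorted(means, key=means.get, reverse=True)
    return rankings


# Made-up segment-level scores for two hypothetical systems.
records = [
    ("system_A", "forward", 70.0), ("system_A", "forward", 72.0),
    ("system_B", "forward", 74.0), ("system_B", "forward", 76.0),
    ("system_A", "reverse", 81.0), ("system_A", "reverse", 83.0),
    ("system_B", "reverse", 78.0), ("system_B", "reverse", 80.0),
]

rankings = rank_by_direction(records)
ranking_changed = rankings["forward"] != rankings["reverse"]

if __name__ == "__main__":
    print(rankings)
    print("ranking changed:", ranking_changed)
```

A changed ranking in toy data proves nothing by itself; in a real study the per-direction differences would also need a significance test with adequate power.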

Figures

Figures reproduced from arXiv: 1906.09833 by Barry Haddow, Philipp Koehn, Yvette Graham.

Figure 1: Creation of MT test sets for machine transla...
Figure 2: Differences in human evaluation DA scores for test sentences created in the reverse direction to testing...
Figure 3: Differences in BLEU scores for systems participating in WMT-15–WMT-18 news translation task com...
Figure 4: Sentence length distribution in test data of WMT-15–WMT-18 news translation task for text in non...
Figure 5: Sentence length distribution in test data of...
Figure 6: Differences in unigram precision for systems participating in WMT-15–WMT-18 news translation...
Figure 7: Differences in BLEU scores for pairs of sys...
Figure 8: Differences in Human evaluation DA scores...

Original abstract

The term translationese has been used to describe the presence of unusual features of translated text. In this paper, we provide a detailed analysis of the adverse effects of translationese on machine translation evaluation results. Our analysis shows evidence to support differences in text originally written in a given language relative to translated text and this can potentially negatively impact the accuracy of machine translation evaluations. For this reason we recommend that reverse-created test data be omitted from future machine translation test sets. In addition, we provide a re-evaluation of a past high-profile machine translation evaluation claiming human-parity of MT, as well as analysis of the since re-evaluations of it. We find potential ways of improving the reliability of all three past evaluations. One important issue not previously considered is the statistical power of significance tests applied in past evaluations that aim to investigate human-parity of MT. Since the very aim of such evaluations is to reveal legitimate ties between human and MT systems, power analysis is of particular importance, where low power could result in claims of human parity that in fact simply correspond to Type II error. We therefore provide a detailed power analysis of tests used in such evaluations to provide an indication of a suitable minimum sample size of translations for such studies. Subsequently, since no past evaluation that aimed to investigate claims of human parity ticks all boxes in terms of accuracy and reliability, we rerun the evaluation of the systems claiming human parity. Finally, we provide a comprehensive check-list for future machine translation evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper analyzes adverse effects of translationese on MT evaluation, presenting evidence of differences between original and translated text, recommending omission of reverse-created test data from future MT test sets, re-evaluating a high-profile human-parity claim (including power analysis of significance tests), and providing a checklist for reliable future evaluations.

Significance. If substantiated, the work could improve MT evaluation reliability by addressing translationese and ensuring adequate statistical power in human-parity studies. The power analysis and checklist are constructive contributions that directly support more robust experimental design.

major comments (2)
  1. [power analysis and re-evaluation sections] Re-evaluation and power analysis section: the claim that low statistical power in past human-parity evaluations risks Type II error is central to the re-evaluation, but the paper does not report the exact effect sizes or variance estimates used to derive the recommended minimum sample size, preventing verification that the suggested N is sufficient across typical MT score distributions.
  2. [analysis and recommendation sections] Analysis of translationese effects and recommendation: the recommendation to omit reverse-created test data rests on observed text differences driving evaluation inaccuracy, yet the manuscript does not isolate translationese from confounders (e.g., domain shift, sentence length, or reference quality) via controlled ablation or regression; the re-evaluation therefore cannot establish that translationese is the primary causal factor rather than a correlate.
minor comments (2)
  1. [checklist section] The checklist for future evaluations would benefit from being presented as a numbered table rather than inline text for easier reference.
  2. [figures on text differences] Some figure captions describing text differences could more explicitly state the statistical test and sample size used to support significance claims.
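
Major comment 2 asks for translationese to be separated from confounders such as sentence length. One simple way to probe a single confounder, short of the regression the referee suggests, is to compare scores within sentence-length strata. The sketch below is only an illustration under assumed inputs: the row layout, the `forward` and `reverse` labels, the bucket width and the numbers are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def stratified_gap(rows, bucket_width=10):
    """Mean score gap (reverse minus forward) within sentence-length buckets.

    rows: iterable of (direction, sentence_length, score), with direction
    being exactly "forward" or "reverse" (hypothetical labels).
    Buckets that lack one of the two directions are skipped.
    """
    buckets = defaultdict(lambda: {"forward": [], "reverse": []})
    for direction, length, score in rows:
        start = (length // bucket_width) * bucket_width  # lower edge of the bucket
        buckets[start][direction].append(score)

    gaps = {}
    for start in sorted(buckets):
        groups = buckets[start]
        if groups["forward"] and groups["reverse"]:
            gaps[start] = mean(groups["reverse"]) - mean(groups["forward"])
    return gaps


# Made-up rows: (direction, sentence length in tokens, score).
rows = [
    ("forward", 8, 70.0), ("reverse", 9, 75.0),
    ("forward", 15, 68.0), ("forward", 17, 72.0), ("reverse", 12, 76.0),
    ("reverse", 33, 80.0),  # no forward sentence in this bucket, so it is skipped
]

gaps = stratified_gap(rows)

if __name__ == "__main__":
    print(gaps)
```

If the gap persists at similar size in every bucket, sentence length alone is an unlikely explanation; this controls for one variable only and does not establish causation.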

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive suggestions. We address the major comments point by point below, and will incorporate revisions to improve the clarity and rigor of the manuscript.

  1. Referee: [power analysis and re-evaluation sections] Re-evaluation and power analysis section: the claim that low statistical power in past human-parity evaluations risks Type II error is central to the re-evaluation, but the paper does not report the exact effect sizes or variance estimates used to derive the recommended minimum sample size, preventing verification that the suggested N is sufficient across typical MT score distributions.

    Authors: We agree that providing the exact effect sizes and variance estimates is necessary for full reproducibility and verification of the power analysis. In the revised manuscript, we will explicitly report the effect size (Cohen's d or equivalent based on observed score differences) and variance estimates derived from the WMT datasets used in the re-evaluation. This will allow readers to confirm that the recommended minimum sample size is appropriate for typical MT evaluation score distributions.

    Revision: yes

  2. Referee: [analysis and recommendation sections] Analysis of translationese effects and recommendation: the recommendation to omit reverse-created test data rests on observed text differences driving evaluation inaccuracy, yet the manuscript does not isolate translationese from confounders (e.g., domain shift, sentence length, or reference quality) via controlled ablation or regression; the re-evaluation therefore cannot establish that translationese is the primary causal factor rather than a correlate.

    Authors: We acknowledge that our analysis shows correlations between translationese features and evaluation differences but does not include explicit ablation studies or regression models to isolate translationese from all potential confounders. However, the test sets analyzed are from controlled WMT shared tasks where domain and other factors are standardized, and the differences align with known properties of translationese from prior linguistic studies. To strengthen this, we will add a section discussing potential confounders and note the limitations of correlational evidence, while maintaining the recommendation based on the substantial observed effects. A full causal analysis would require additional experiments beyond the scope of this work but is suggested as future research.

    Revision: partial
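
The first response mentions reporting Cohen's d. As a reminder of what that quantity is, here is a minimal sketch using the usual pooled-standard-deviation form; the function name and the toy samples are illustrative, and nothing here reproduces the paper's own estimates.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(sample_a, sample_b):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    n_a, n_b = len(sample_a), len(sample_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each sample needs at least two observations")
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = (
        (n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)
    ) / (n_a + n_b - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; effect size is undefined")
    return (mean(sample_a) - mean(sample_b)) / sqrt(pooled_var)


if __name__ == "__main__":
    # Toy samples: means 4 and 3, both with sample variance 4.
    print(cohens_d([2, 4, 6], [1, 3, 5]))
```

An effect size computed this way, together with the variance estimates, is what a reader would need in order to rerun a sample-size calculation independently.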

Circularity Check

0 steps flagged

No significant circularity; empirical analysis stands on observed data and standard statistical methods.


The paper's core claims rest on direct comparisons of original vs. translated text, re-evaluation of prior MT systems, and power analysis of significance tests. No equations, fitted parameters renamed as predictions, or self-citations are invoked as load-bearing premises that reduce the result to its inputs by construction. The recommendation to omit reverse-created data follows from reported text differences rather than any definitional loop or ansatz smuggled via prior work. This is a standard empirical study with independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, invented entities, or detailed axioms listed. Main domain assumption is that translationese differences exist and affect evaluation accuracy.

axioms (1)
  • Domain assumption: differences exist between originally written text and translated text that can impact MT evaluation accuracy.
    Central premise invoked to support the recommendation to omit reverse-created test data.

pith-pipeline@v0.9.0 · 5787 in / 1190 out tokens · 24766 ms · 2026-05-25T17:34:40.716491+00:00 · methodology


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Maistros: A Greek Large Language Model Adapted Through Knowledge Distillation From Large Reasoning Models

    cs.CL 2026-05 unverdicted novelty 6.0

    Maistros 8B is a new state-of-the-art open-weights Greek LLM built via knowledge distillation from large reasoning models on the CulturaQA dataset.

  2. CHORUS: Effort-Aware Multi-Agent Human-AI Collaboration for Professional Translation

    cs.HC 2026-02 unverdicted novelty 6.0

    CHORUS multi-agent system reduced professional translation time by 33.8% while lowering cognitive effort and raising BLEU/COMET scores in a 30-participant within-subject study.
