Who Watches the Watchmen? Humans Disagree With Translation Metrics on Unseen Domains
Pith reviewed 2026-05-10 06:02 UTC · model grok-4.3
The pith
Automatic translation metrics appear to agree well with humans on unseen domains, but that apparent robustness shrinks once differences among human annotators are factored into the comparison.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce a Cross-Domain Error-Span-Annotation dataset containing 18.8k human error-span annotations across three language pairs. Within each pair the annotators remain the same and the six translation systems remain the same; only the domain of the source text changes, from news (seen) to IT and chemical (both unseen). Automatic metrics show up to 0.69 agreement with humans at the segment level, but this figure declines once inter-annotator agreement is taken into account. Averaging the annotations raises inter-annotator agreement by as much as 0.11. On the unseen chemical domain, human agreement reaches 0.78–0.83 while metrics perform noticeably worse. The central recommendation is to compare metric-human agreement against measured inter-annotator agreement rather than relying on raw metric-human agreement alone.
What carries the argument
The CD-ESA dataset, which fixes annotators and translation systems while varying only the domain, together with the practice of benchmarking metric-human agreement against measured inter-annotator agreement.
If this is right
- Metrics that look adequate on standard news benchmarks may underperform when judged against human consistency on technical content.
- Averaging several human annotations per segment raises the standard that a metric must meet to be considered reliable.
- The chemical domain exposes larger gaps between metric scores and human consensus than the IT or news domains do.
- Evaluation reports should include inter-annotator agreement as a reference line so that raw metric-human numbers are not misinterpreted across domains.
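The averaging effect and the recommended comparison can be sketched concretely. The simulation below is a minimal illustration, not the paper's data: the latent-quality model, noise levels, and annotator count are all invented assumptions, and agreement is measured with Pearson correlation as one plausible choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 200 segments, 3 annotators. Each annotator's score is
# a noisy view of a shared latent quality (all numbers are illustrative).
n_segments, n_annotators = 200, 3
latent = rng.normal(0.0, 1.0, n_segments)
annotations = latent[None, :] + rng.normal(0.0, 0.6, (n_annotators, n_segments))

# A "metric" that tracks the latent quality less faithfully than humans do.
metric = latent + rng.normal(0.0, 1.0, n_segments)

def pearson(a, b):
    return float(np.corrcoef(a, b)[0, 1])

# Inter-annotator agreement (IAA): mean pairwise Pearson correlation.
pairs = [(i, j) for i in range(n_annotators) for j in range(i + 1, n_annotators)]
iaa = float(np.mean([pearson(annotations[i], annotations[j]) for i, j in pairs]))

# Metric-human agreement against a single annotator vs. the averaged label.
single = pearson(metric, annotations[0])
averaged = pearson(metric, annotations.mean(axis=0))

print(f"IAA (pairwise):            {iaa:.2f}")
print(f"metric vs. one annotator:  {single:.2f}")
print(f"metric vs. averaged label: {averaged:.2f}")
```

Averaging annotations cancels part of the per-annotator noise, so the averaged label sits closer to the latent quality; the paper's recommendation amounts to reading the metric-human number next to the IAA number rather than in isolation.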
Where Pith is reading between the lines
- Developers may need to create or adapt metrics specifically for technical content rather than relying on news-tuned ones.
- The same controlled multi-annotator design could be applied to other NLP tasks such as summarization or question answering to test their domain robustness.
- If the pattern holds, standard practice in domain-shift studies would shift toward always reporting inter-annotator baselines instead of absolute agreement scores.
Load-bearing premise
Holding the same annotators and the same six translation systems fixed while only changing the domain is enough to isolate domain-shift effects from annotator noise and system variation.
What would settle it
If a replication that uses entirely new annotators for the technical domains fails to reproduce the drop in metric-human agreement after averaging, the claim that domain shift itself, rather than annotator-specific behavior, drives the loss of robustness would be weakened.
Figures
Original abstract
Automatic evaluation metrics are central to the development of machine translation systems, yet their robustness under domain shift remains unclear. Most metrics are developed on the Workshop on Machine Translation (WMT) benchmarks, raising concerns about their robustness to unseen domains. Prior studies that analyze unseen domains vary translation systems, annotators, or evaluation conditions, confounding domain effects with human annotation noise. To address these biases, we introduce a systematic multi-annotator Cross-Domain Error-Span-Annotation dataset (CD-ESA), comprising 18.8k human error span annotations across three language pairs, where we fix annotators within each language pair and evaluate translations of the same six translation systems across one seen news domain and two unseen technical domains. Using this dataset, we first find that automatic metrics appear surprisingly robust to domain-shifts at the segment level (up to 0.69 agreement), but this robustness largely disappears once we account for human label variation. Averaging annotations increases inter-annotator agreement by up to +0.11. Metrics struggle on the unseen chemical domain compared to humans (inter-annotator agreement of 0.78-0.83 vs. 0.96). We recommend comparing metric-human agreement against inter-annotator agreement, rather than comparing raw metric-human agreement alone, when evaluating across different domains.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Cross-Domain Error-Span-Annotation (CD-ESA) dataset with 18.8k multi-annotator error-span labels across three language pairs. By holding annotators and six MT systems fixed while varying only the domain (seen news vs. two unseen technical domains), the authors evaluate automatic metric robustness. They report segment-level metric-human agreement up to 0.69 that appears robust to domain shift, but claim this robustness largely vanishes after averaging annotations to account for human variation; metrics underperform humans on the chemical domain (IAA 0.78-0.83 vs. 0.96). The authors recommend benchmarking metric-human agreement against IAA rather than raw agreement alone when assessing cross-domain performance.
Significance. If the controlled construction isolates domain effects as intended, the work provides a useful caution against over-interpreting raw metric-human agreement in unseen domains and supplies a new multi-annotator resource with fixed systems and annotators. The explicit comparison to IAA is a constructive methodological suggestion that could improve evaluation practices in machine translation.
major comments (3)
- [Abstract and §3] Abstract and §3 (Dataset Construction): The central claim that apparent metric robustness 'largely disappears' once human label variation is accounted for rests on CD-ESA successfully isolating domain shift by fixing annotators and the six systems. However, the reported IAA varies substantially by domain (0.78-0.83 on chemical vs. 0.96 on news), which raises the possibility that technical domains alter annotation consistency or error-span labeling behavior rather than simply exposing metric failure; this directly affects whether the drop after averaging can be attributed to domain shift per se.
- [Abstract and Results] Abstract and Results: The segment-level agreement figure of 'up to 0.69' is given without error bars, confidence intervals, or details on the number of segments, annotators per segment, or averaging procedure across domains. This makes it impossible to determine whether the reported robustness (and its subsequent disappearance) is statistically reliable or sensitive to the particular domain-sampling and aggregation choices.
- [§4] §4 (or equivalent Results/Discussion): The recommendation to compare metric-human agreement against IAA assumes IAA is a domain-independent gold standard for human performance. Yet the observed IAA variation across domains suggests that the chemical-domain drop may partly reflect noisier or more conservative human labels rather than metric inadequacy; additional analysis (e.g., per-domain label distributions or annotator consistency metrics) is needed to support the recommendation.
minor comments (3)
- The manuscript does not specify the exact annotation guidelines or error-span criteria used for technical domains, nor whether they were adapted from the news domain.
- No information is provided on data release, repository, or licensing for the CD-ESA annotations, which limits reproducibility.
- Notation for agreement metrics (e.g., how segment-level vs. averaged agreement is computed) could be clarified with a small table or equation.
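The last minor comment asks how error-span annotations become segment-level numbers. One hypothetical reduction, in the spirit of MQM-style severity weighting, is sketched below; the weights, severity labels, and penalty cap are assumptions for illustration, not the paper's actual procedure.

```python
from dataclasses import dataclass

@dataclass
class ErrorSpan:
    start: int      # character offset where the error begins
    end: int        # character offset where the error ends
    severity: str   # "minor" or "major" (assumed two-level scheme)

# Assumed severity weights, loosely MQM-like; not taken from the paper.
SEVERITY_WEIGHTS = {"minor": 1.0, "major": 5.0}

def segment_score(spans, max_penalty=25.0):
    """Map a list of annotated error spans to a quality score in [0, 100]."""
    penalty = sum(SEVERITY_WEIGHTS[s.severity] for s in spans)
    penalty = min(penalty, max_penalty)  # cap so one bad segment bottoms out at 0
    return 100.0 * (1.0 - penalty / max_penalty)

spans = [ErrorSpan(3, 9, "minor"), ErrorSpan(14, 20, "major")]
print(segment_score(spans))  # 6 penalty points out of 25 -> 76.0
```

Under a scheme like this, "segment-level agreement" compares these derived scores across annotators or against a metric, while "averaged agreement" first averages the scores over annotators; making that pipeline explicit is what the reviewer's suggested table or equation would accomplish.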
Simulated Author's Rebuttal
Thank you for your thoughtful and constructive review. We appreciate the opportunity to clarify our methodology and have revised the manuscript to address the points raised. Below we respond to each major comment.
Point-by-point responses
-
Referee: [Abstract and §3] Abstract and §3 (Dataset Construction): The central claim that apparent metric robustness 'largely disappears' once human label variation is accounted for rests on CD-ESA successfully isolating domain shift by fixing annotators and the six systems. However, the reported IAA varies substantially by domain (0.78-0.83 on chemical vs. 0.96 on news), which raises the possibility that technical domains alter annotation consistency or error-span labeling behavior rather than simply exposing metric failure; this directly affects whether the drop after averaging can be attributed to domain shift per se.
Authors: We thank the referee for highlighting this nuance. The fixed-annotator design isolates domain effects because the same annotators label all domains; any IAA drop therefore reflects domain-induced changes in annotation consistency. Our central recommendation is to benchmark metrics against the domain-specific IAA precisely to incorporate such variation rather than assuming uniform human performance. The chemical-domain result shows metrics underperform even this lower IAA benchmark. We will revise the abstract and §3 to state this interpretation explicitly and to underscore that per-domain IAA serves as the appropriate reference. revision: partial
-
Referee: [Abstract and Results] Abstract and Results: The segment-level agreement figure of 'up to 0.69' is given without error bars, confidence intervals, or details on the number of segments, annotators per segment, or averaging procedure across domains. This makes it impossible to determine whether the reported robustness (and its subsequent disappearance) is statistically reliable or sensitive to the particular domain-sampling and aggregation choices.
Authors: We agree that statistical details are required to evaluate reliability. In the revised manuscript we will add error bars and confidence intervals to all agreement scores, report the exact number of segments and total annotations per domain, specify the number of annotators per segment, and describe the averaging procedure used for both IAA and metric-human agreement. These additions will allow readers to assess robustness across domains and aggregation choices. revision: yes
-
Referee: [§4] §4 (or equivalent Results/Discussion): The recommendation to compare metric-human agreement against IAA assumes IAA is a domain-independent gold standard for human performance. Yet the observed IAA variation across domains suggests that the chemical-domain drop may partly reflect noisier or more conservative human labels rather than metric inadequacy; additional analysis (e.g., per-domain label distributions or annotator consistency metrics) is needed to support the recommendation.
Authors: We do not treat IAA as a domain-independent gold standard; we explicitly report its variation and recommend benchmarking against the IAA observed within each domain. This approach already accounts for any domain-specific differences in label noise or conservatism. To further support the recommendation we will add per-domain analyses of label distributions (e.g., error-span length and type statistics) and annotator consistency metrics (e.g., pairwise agreements) in the revised §4. revision: partial
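The per-domain reporting style the authors defend might look like the following. The numbers here are illustrative placeholders that only loosely echo the abstract's figures (0.69 metric-human, 0.78–0.83 chemical IAA, 0.96 news IAA); the IT row is entirely invented.

```python
# Illustrative per-domain report: metric-human agreement shown next to the
# IAA measured in the same domain. Placeholder values, not the paper's data.
results = {
    "news":     {"metric_human": 0.69, "iaa": 0.96},
    "IT":       {"metric_human": 0.62, "iaa": 0.88},
    "chemical": {"metric_human": 0.55, "iaa": 0.80},
}

for domain, r in results.items():
    gap = r["iaa"] - r["metric_human"]
    print(f"{domain:9s} metric-human={r['metric_human']:.2f} "
          f"IAA={r['iaa']:.2f} gap={gap:+.2f}")
```

Reading the gap column per domain, rather than the raw metric-human column alone, is exactly the comparison the recommendation calls for.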
Circularity Check
No circularity: empirical data collection and direct comparisons
Full rationale
The paper constructs a new multi-annotator dataset (CD-ESA) by fixing annotators and systems while varying domains, then reports direct empirical measurements of metric-human agreement versus inter-annotator agreement. No equations, fitted parameters, or derivations appear; the central recommendation follows immediately from the collected annotations without reduction to any author-defined quantity or self-citation chain. The study is self-contained against external benchmarks and exhibits none of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Human error-span annotations provide a stable, domain-independent measure of translation quality that can serve as ground truth for metric evaluation.