Who Watches the Watchmen? Humans Disagree With Translation Metrics on Unseen Domains
Pith reviewed 2026-05-10 06:02 UTC · model grok-4.3
The pith
Automatic translation metrics appear to agree well with humans on unseen domains, but that apparent robustness shrinks once differences among human annotators are factored into the comparison.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce a Cross-Domain Error-Span-Annotation dataset containing 18.8k human error-span annotations across three language pairs. Within each pair the annotators remain the same and the six translation systems remain the same; only the domain of the source text changes, from news (seen) to IT and chemical (both unseen). Automatic metrics show up to 0.69 agreement with humans at the segment level, but this figure declines once inter-annotator agreement is taken into account. Averaging the annotations raises inter-annotator agreement by as much as 0.11. On the unseen chemical domain, human agreement reaches 0.78–0.83 while metrics perform noticeably worse. The central recommendation is to compare metric-human agreement against measured inter-annotator agreement rather than relying on raw metric-human agreement alone.
What carries the argument
The CD-ESA dataset, which fixes annotators and translation systems while varying only the domain, together with the practice of benchmarking metric-human agreement against measured inter-annotator agreement.
If this is right
- Metrics that look adequate on standard news benchmarks may underperform when judged against human consistency on technical content.
- Averaging several human annotations per segment raises the standard that a metric must meet to be considered reliable.
- The chemical domain exposes larger gaps between metric scores and human consensus than the IT or news domains do.
- Evaluation reports should include inter-annotator agreement as a reference line so that raw metric-human numbers are not misinterpreted across domains.
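The averaging effect and the recommended comparison can be sketched concretely. The simulation below is a minimal illustration, not the paper's data: the latent-quality model, noise levels, and annotator count are all invented assumptions, and agreement is measured with Pearson correlation as one plausible choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 200 segments, 3 annotators. Each annotator's score is
# a noisy view of a shared latent quality (all numbers are illustrative).
n_segments, n_annotators = 200, 3
latent = rng.normal(0.0, 1.0, n_segments)
annotations = latent[None, :] + rng.normal(0.0, 0.6, (n_annotators, n_segments))

# A "metric" that tracks the latent quality less faithfully than humans do.
metric = latent + rng.normal(0.0, 1.0, n_segments)

def pearson(a, b):
    return float(np.corrcoef(a, b)[0, 1])

# Inter-annotator agreement (IAA): mean pairwise Pearson correlation.
pairs = [(i, j) for i in range(n_annotators) for j in range(i + 1, n_annotators)]
iaa = float(np.mean([pearson(annotations[i], annotations[j]) for i, j in pairs]))

# Metric-human agreement against a single annotator vs. the averaged label.
single = pearson(metric, annotations[0])
averaged = pearson(metric, annotations.mean(axis=0))

print(f"IAA (pairwise):            {iaa:.2f}")
print(f"metric vs. one annotator:  {single:.2f}")
print(f"metric vs. averaged label: {averaged:.2f}")
```

Averaging annotations cancels part of the per-annotator noise, so the averaged label sits closer to the latent quality; the paper's recommendation amounts to reading the metric-human number next to the IAA number rather than in isolation.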
Where Pith is reading between the lines
- Developers may need to create or adapt metrics specifically for technical content rather than relying on news-tuned ones.
- The same controlled multi-annotator design could be applied to other NLP tasks such as summarization or question answering to test their domain robustness.
- If the pattern holds, standard practice in domain-shift studies would shift toward always reporting inter-annotator baselines instead of absolute agreement scores.
Load-bearing premise
Holding the same annotators and the same six translation systems fixed while only changing the domain is enough to isolate domain-shift effects from annotator noise and system variation.
What would settle it
If a replication that uses entirely new annotators for the technical domains fails to reproduce the drop in metric-human agreement after averaging, the claim that domain shift itself, rather than annotator-specific behavior, drives the loss of robustness would be weakened.
Figures
Original abstract
Automatic evaluation metrics are central to the development of machine translation systems, yet their robustness under domain shift remains unclear. Most metrics are developed on the Workshop on Machine Translation (WMT) benchmarks, raising concerns about their robustness to unseen domains. Prior studies that analyze unseen domains vary translation systems, annotators, or evaluation conditions, confounding domain effects with human annotation noise. To address these biases, we introduce a systematic multi-annotator Cross-Domain Error-Span-Annotation dataset (CD-ESA), comprising 18.8k human error span annotations across three language pairs, where we fix annotators within each language pair and evaluate translations of the same six translation systems across one seen news domain and two unseen technical domains. Using this dataset, we first find that automatic metrics appear surprisingly robust to domain-shifts at the segment level (up to 0.69 agreement), but this robustness largely disappears once we account for human label variation. Averaging annotations increases inter-annotator agreement by up to +0.11. Metrics struggle on the unseen chemical domain compared to humans (inter-annotator agreement of 0.78-0.83 vs. 0.96). We recommend comparing metric-human agreement against inter-annotator agreement, rather than comparing raw metric-human agreement alone, when evaluating across different domains.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Cross-Domain Error-Span-Annotation (CD-ESA) dataset with 18.8k multi-annotator error-span labels across three language pairs. By holding annotators and six MT systems fixed while varying only the domain (seen news vs. two unseen technical domains), the authors evaluate automatic metric robustness. They report segment-level metric-human agreement up to 0.69 that appears robust to domain shift, but claim this robustness largely vanishes after averaging annotations to account for human variation; metrics underperform humans on the chemical domain (IAA 0.78-0.83 vs. 0.96). The authors recommend benchmarking metric-human agreement against IAA rather than raw agreement alone when assessing cross-domain performance.
Significance. If the controlled construction isolates domain effects as intended, the work provides a useful caution against over-interpreting raw metric-human agreement in unseen domains and supplies a new multi-annotator resource with fixed systems and annotators. The explicit comparison to IAA is a constructive methodological suggestion that could improve evaluation practices in machine translation.
major comments (3)
- [Abstract and §3] Abstract and §3 (Dataset Construction): The central claim that apparent metric robustness 'largely disappears' once human label variation is accounted for rests on CD-ESA successfully isolating domain shift by fixing annotators and the six systems. However, the reported IAA varies substantially by domain (0.78-0.83 on chemical vs. 0.96 on news), which raises the possibility that technical domains alter annotation consistency or error-span labeling behavior rather than simply exposing metric failure; this directly affects whether the drop after averaging can be attributed to domain shift per se.
- [Abstract and Results] Abstract and Results: The segment-level agreement figure of 'up to 0.69' is given without error bars, confidence intervals, or details on the number of segments, annotators per segment, or averaging procedure across domains. This makes it impossible to determine whether the reported robustness (and its subsequent disappearance) is statistically reliable or sensitive to the particular domain-sampling and aggregation choices.
- [§4] §4 (or equivalent Results/Discussion): The recommendation to compare metric-human agreement against IAA assumes IAA is a domain-independent gold standard for human performance. Yet the observed IAA variation across domains suggests that the chemical-domain drop may partly reflect noisier or more conservative human labels rather than metric inadequacy; additional analysis (e.g., per-domain label distributions or annotator consistency metrics) is needed to support the recommendation.
minor comments (3)
- The manuscript does not specify the exact annotation guidelines or error-span criteria used for technical domains, nor whether they were adapted from the news domain.
- No information is provided on data release, repository, or licensing for the CD-ESA annotations, which limits reproducibility.
- Notation for agreement metrics (e.g., how segment-level vs. averaged agreement is computed) could be clarified with a small table or equation.
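The last minor comment asks how error-span annotations become segment-level numbers. One hypothetical reduction, in the spirit of MQM-style severity weighting, is sketched below; the weights, severity labels, and penalty cap are assumptions for illustration, not the paper's actual procedure.

```python
from dataclasses import dataclass

@dataclass
class ErrorSpan:
    start: int      # character offset where the error begins
    end: int        # character offset where the error ends
    severity: str   # "minor" or "major" (assumed two-level scheme)

# Assumed severity weights, loosely MQM-like; not taken from the paper.
SEVERITY_WEIGHTS = {"minor": 1.0, "major": 5.0}

def segment_score(spans, max_penalty=25.0):
    """Map a list of annotated error spans to a quality score in [0, 100]."""
    penalty = sum(SEVERITY_WEIGHTS[s.severity] for s in spans)
    penalty = min(penalty, max_penalty)  # cap so one bad segment bottoms out at 0
    return 100.0 * (1.0 - penalty / max_penalty)

spans = [ErrorSpan(3, 9, "minor"), ErrorSpan(14, 20, "major")]
print(segment_score(spans))  # 6 penalty points out of 25 -> 76.0
```

Under a scheme like this, "segment-level agreement" compares these derived scores across annotators or against a metric, while "averaged agreement" first averages the scores over annotators; making that pipeline explicit is what the reviewer's suggested table or equation would accomplish.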
Simulated Author's Rebuttal
Thank you for your thoughtful and constructive review. We appreciate the opportunity to clarify our methodology and have revised the manuscript to address the points raised. Below we respond to each major comment.
Point-by-point responses
-
Referee: [Abstract and §3] Abstract and §3 (Dataset Construction): The central claim that apparent metric robustness 'largely disappears' once human label variation is accounted for rests on CD-ESA successfully isolating domain shift by fixing annotators and the six systems. However, the reported IAA varies substantially by domain (0.78-0.83 on chemical vs. 0.96 on news), which raises the possibility that technical domains alter annotation consistency or error-span labeling behavior rather than simply exposing metric failure; this directly affects whether the drop after averaging can be attributed to domain shift per se.
Authors: We thank the referee for highlighting this nuance. The fixed-annotator design isolates domain effects because the same annotators label all domains; any IAA drop therefore reflects domain-induced changes in annotation consistency. Our central recommendation is to benchmark metrics against the domain-specific IAA precisely to incorporate such variation rather than assuming uniform human performance. The chemical-domain result shows metrics underperform even this lower IAA benchmark. We will revise the abstract and §3 to state this interpretation explicitly and to underscore that per-domain IAA serves as the appropriate reference. revision: partial
-
Referee: [Abstract and Results] Abstract and Results: The segment-level agreement figure of 'up to 0.69' is given without error bars, confidence intervals, or details on the number of segments, annotators per segment, or averaging procedure across domains. This makes it impossible to determine whether the reported robustness (and its subsequent disappearance) is statistically reliable or sensitive to the particular domain-sampling and aggregation choices.
Authors: We agree that statistical details are required to evaluate reliability. In the revised manuscript we will add error bars and confidence intervals to all agreement scores, report the exact number of segments and total annotations per domain, specify the number of annotators per segment, and describe the averaging procedure used for both IAA and metric-human agreement. These additions will allow readers to assess robustness across domains and aggregation choices. revision: yes
-
Referee: [§4] §4 (or equivalent Results/Discussion): The recommendation to compare metric-human agreement against IAA assumes IAA is a domain-independent gold standard for human performance. Yet the observed IAA variation across domains suggests that the chemical-domain drop may partly reflect noisier or more conservative human labels rather than metric inadequacy; additional analysis (e.g., per-domain label distributions or annotator consistency metrics) is needed to support the recommendation.
Authors: We do not treat IAA as a domain-independent gold standard; we explicitly report its variation and recommend benchmarking against the IAA observed within each domain. This approach already accounts for any domain-specific differences in label noise or conservatism. To further support the recommendation we will add per-domain analyses of label distributions (e.g., error-span length and type statistics) and annotator consistency metrics (e.g., pairwise agreements) in the revised §4. revision: partial
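The per-domain reporting style the authors defend might look like the following. The numbers here are illustrative placeholders that only loosely echo the abstract's figures (0.69 metric-human, 0.78–0.83 chemical IAA, 0.96 news IAA); the IT row is entirely invented.

```python
# Illustrative per-domain report: metric-human agreement shown next to the
# IAA measured in the same domain. Placeholder values, not the paper's data.
results = {
    "news":     {"metric_human": 0.69, "iaa": 0.96},
    "IT":       {"metric_human": 0.62, "iaa": 0.88},
    "chemical": {"metric_human": 0.55, "iaa": 0.80},
}

for domain, r in results.items():
    gap = r["iaa"] - r["metric_human"]
    print(f"{domain:9s} metric-human={r['metric_human']:.2f} "
          f"IAA={r['iaa']:.2f} gap={gap:+.2f}")
```

Reading the gap column per domain, rather than the raw metric-human column alone, is exactly the comparison the recommendation calls for.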
Circularity Check
No circularity: empirical data collection and direct comparisons
Full rationale
The paper constructs a new multi-annotator dataset (CD-ESA) by fixing annotators and systems while varying domains, then reports direct empirical measurements of metric-human agreement versus inter-annotator agreement. No equations, fitted parameters, or derivations appear; the central recommendation follows immediately from the collected annotations without reduction to any author-defined quantity or self-citation chain. The study is self-contained against external benchmarks and exhibits none of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Human error-span annotations provide a stable, domain-independent measure of translation quality that can serve as ground truth for metric evaluation.