Who Watches the Watchmen? Humans Disagree With Translation Metrics on Unseen Domains

2026 · cs.CL · arXiv 2604.17393

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

Automatic evaluation metrics are central to the development of machine translation systems, yet their robustness under domain shift remains unclear. Most metrics are developed on the Workshop on Machine Translation (WMT) benchmarks, raising concerns about their robustness to unseen domains. Prior studies that analyze unseen domains vary translation systems, annotators, or evaluation conditions, confounding domain effects with human annotation noise. To address these biases, we introduce a systematic multi-annotator Cross-Domain Error-Span-Annotation dataset (CD-ESA), comprising 18.8k human error span annotations across three language pairs, where we fix annotators within each language pair and evaluate translations of the same six translation systems across one seen news domain and two unseen technical domains. Using this dataset, we first find that automatic metrics appear surprisingly robust to domain shifts at the segment level (up to 0.69 agreement), but this robustness largely disappears once we account for human label variation. Averaging annotations increases inter-annotator agreement by up to 0.11. Metrics struggle on the unseen chemical domain compared to humans (inter-annotator agreement of 0.78-0.83 vs. 0.96). We recommend comparing metric-human agreement against inter-annotator agreement, rather than comparing raw metric-human agreement alone, when evaluating across different domains.
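
The closing recommendation of the abstract can be illustrated with a minimal sketch. The abstract does not say which agreement statistic the paper uses, so the pairwise ranking agreement below, the helper name `pairwise_agreement`, and the toy scores are all assumptions made for illustration. They are not the paper's method or data.

```python
from itertools import combinations


def pairwise_agreement(a, b):
    """Fraction of segment pairs that two score lists rank in the same order.

    Assumed stand-in for the paper's agreement measure. Pairs that are
    tied in either list are skipped.
    """
    agree = total = 0
    for i, j in combinations(range(len(a)), 2):
        diff_a, diff_b = a[i] - a[j], b[i] - b[j]
        if diff_a == 0 or diff_b == 0:
            continue
        total += 1
        agree += (diff_a > 0) == (diff_b > 0)
    return agree / total if total else float("nan")


# Made-up scores for five segments (higher means a better translation).
annotator_1 = [90, 70, 40, 85, 20]
annotator_2 = [80, 75, 30, 90, 25]
metric = [0.8, 0.9, 0.5, 0.7, 0.3]

# Inter-annotator agreement acts as the reference level for this domain.
human_ceiling = pairwise_agreement(annotator_1, annotator_2)

# Raw metric-human agreement, reported next to the reference level
# rather than on its own.
metric_human = pairwise_agreement(metric, annotator_1)

print(f"inter-annotator agreement: {human_ceiling:.2f}")
print(f"metric-human agreement:    {metric_human:.2f}")
print(f"gap to human agreement:    {human_ceiling - metric_human:.2f}")
```

Reporting the gap per domain, rather than the raw metric-human number, keeps a domain where annotators themselves disagree from being read as a domain where the metric fails.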

representative citing papers

HardMTBench: Stress-Testing Chinese-English Translation on Knowledge-Intensive Domains

cs.CL · 2026-05-27 · unverdicted · novelty 6.0

HardMTBench is a difficulty-aware benchmark of 20,000 directional test items across 12 domains that widens GEMBA score ranges by a factor of two and reveals domain-specific weaknesses in 22 MT systems.

citing papers explorer

Showing 1 of 1 citing paper.

HardMTBench: Stress-Testing Chinese-English Translation on Knowledge-Intensive Domains

cs.CL · 2026-05-27 · unverdicted · none · ref 10 · internal anchor
