Who Watches the Watchmen? Humans Disagree With Translation Metrics on Unseen Domains
1 Pith paper cites this work.
abstract
Automatic evaluation metrics are central to the development of machine translation systems, yet their robustness under domain shift remains unclear. Most metrics are developed on the Workshop on Machine Translation (WMT) benchmarks, raising concerns about their robustness to unseen domains. Prior studies that analyze unseen domains vary translation systems, annotators, or evaluation conditions, confounding domain effects with human annotation noise. To address these biases, we introduce a systematic multi-annotator Cross-Domain Error-Span-Annotation dataset (CD-ESA), comprising 18.8k human error span annotations across three language pairs, where we fix annotators within each language pair and evaluate translations of the same six translation systems across one seen news domain and two unseen technical domains. Using this dataset, we first find that automatic metrics appear surprisingly robust to domain shift at the segment level (up to 0.69 agreement), but this robustness largely disappears once we account for human label variation. Averaging annotations increases inter-annotator agreement by up to 0.11. Metrics struggle on the unseen chemical domain compared to humans (inter-annotator agreement of 0.78-0.83 vs. 0.96). We recommend comparing metric-human agreement against inter-annotator agreement, rather than comparing raw metric-human agreement alone, when evaluating across different domains.
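The recommendation in the abstract can be illustrated with a minimal sketch. The abstract does not say which agreement statistic the paper uses, so the pairwise ranking agreement below is an assumption, as are the function name `pairwise_agreement` and the toy scores. The point is only that metric-human agreement is reported next to inter-annotator agreement on the same segments rather than on its own.

```python
from itertools import combinations


def pairwise_agreement(a, b):
    """Fraction of segment pairs that a and b rank in the same order.

    Pairs tied in either score list are skipped. This statistic is an
    assumption for illustration, not necessarily the paper's measure.
    """
    agree = total = 0
    for i, j in combinations(range(len(a)), 2):
        da, db = a[i] - a[j], b[i] - b[j]
        if da == 0 or db == 0:
            continue
        total += 1
        agree += (da > 0) == (db > 0)
    return agree / total if total else float("nan")


# Hypothetical toy scores for five segments from one domain.
annotator_1 = [90, 70, 50, 30, 10]
annotator_2 = [85, 75, 40, 45, 5]
metric = [0.9, 0.5, 0.6, 0.2, 0.3]

# Average the two annotators to reduce human label variation.
human_mean = [(x + y) / 2 for x, y in zip(annotator_1, annotator_2)]

# Human ceiling: how well the annotators agree with each other.
iaa = pairwise_agreement(annotator_1, annotator_2)
# Raw metric-human agreement against the averaged human scores.
metric_human = pairwise_agreement(metric, human_mean)
# One possible normalisation (an assumption): metric agreement relative to the ceiling.
relative = metric_human / iaa

print(f"inter-annotator: {iaa:.2f}, metric-human: {metric_human:.2f}, relative: {relative:.2f}")
```

Repeating this per domain puts the metric's agreement and the human ceiling side by side, which is the comparison the abstract argues for when domains differ in annotation difficulty.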
fields: cs.CL (1)
years: 2026 (1)
verdicts: UNVERDICTED (1)
representative citing papers
HardMTBench: Stress-Testing Chinese-English Translation on Knowledge-Intensive Domains
HardMTBench is a difficulty-aware benchmark of 20,000 directional test items across 12 domains that widens GEMBA score ranges by a factor of two and reveals domain-specific weaknesses in 22 MT systems.