Automatic translation metrics show lower agreement with humans on unseen technical domains than humans show with each other, and their robustness claims weaken when benchmarked against inter-annotator agreement instead of raw scores.
arXiv preprint arXiv:2310.13988 , year=
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.CL 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Who Watches the Watchmen? Humans Disagree With Translation Metrics on Unseen Domains
Automatic translation metrics show lower agreement with humans on unseen technical domains than humans show with each other, and their robustness claims weaken when benchmarked against inter-annotator agreement instead of raw scores.