Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics
3 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.CL (3)
years: 2026 (3)
verdicts: UNVERDICTED (3)
citing papers explorer
- Creativity Bias: How Machine Evaluation Struggles with Creativity in Literary Translations
Automatic evaluation tools for literary translations correlate poorly with expert human judgments on creativity and exhibit bias favoring machine-translated texts.
- Who Watches the Watchmen? Humans Disagree With Translation Metrics on Unseen Domains
Automatic translation metrics show lower agreement with humans on unseen technical domains than humans show with each other, and their robustness claims weaken when benchmarked against inter-annotator agreement instead of raw scores.
- NLG Evaluation: Past, Present, Future
A historical review of NLG evaluation practices from 1990 to 2026, noting the rise of experimental methods and predicting increased focus on impact, qualitative, and safety evaluation.