Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics

Mathur, Nitika, Baldwin, Timothy, Cohn, Trevor · 2020 · DOI 10.18653/v1/2020.acl-main.448

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Creativity Bias: How Machine Evaluation Struggles with Creativity in Literary Translations

cs.CL · 2026-05-13 · unverdicted · novelty 7.0

Automatic evaluation tools for literary translations correlate poorly with expert human judgments on creativity and exhibit bias favoring machine-translated texts.

Who Watches the Watchmen? Humans Disagree With Translation Metrics on Unseen Domains

cs.CL · 2026-04-19 · unverdicted · novelty 6.0

Automatic translation metrics show lower agreement with humans on unseen technical domains than humans show with each other, and their robustness claims weaken when benchmarked against inter-annotator agreement instead of raw scores.

NLG Evaluation: Past, Present, Future

cs.CL · 2026-05-22 · unverdicted · novelty 1.0

A historical review of NLG evaluation practices from 1990 to 2026, noting the rise of experimental methods and predicting increased focus on impact, qualitative, and safety evaluation.

citing papers explorer

Creativity Bias: How Machine Evaluation Struggles with Creativity in Literary Translations · cs.CL · 2026-05-13 · unverdicted · none · ref 67
Who Watches the Watchmen? Humans Disagree With Translation Metrics on Unseen Domains · cs.CL · 2026-04-19 · unverdicted · none · ref 47
NLG Evaluation: Past, Present, Future · cs.CL · 2026-05-22 · unverdicted · none · ref 30
