Why We Need New Evaluation Metrics for NLG

Amanda Cercas Curry; Jekaterina Novikova; Ondřej Dušek; Verena Rieser

arxiv: 1707.06875 · v1 · pith:LMEEBE27new · submitted 2017-07-21 · 💻 cs.CL

Jekaterina Novikova, Ondřej Dušek, Amanda Cercas Curry, Verena Rieser

classification 💻 cs.CL

keywords metrics, automatic, evaluation, system, need, novel, bleu, cases

The majority of NLG evaluation relies on automatic metrics, such as BLEU. In this paper, we motivate the need for novel, system- and data-independent automatic evaluation methods: We investigate a wide range of metrics, including state-of-the-art word-based and novel grammar-based ones, and demonstrate that they only weakly reflect human judgements of system outputs as generated by data-driven, end-to-end NLG. We also show that metric performance is data- and system-specific. Nevertheless, our results also suggest that automatic metrics perform reliably at system-level and can support system development by finding cases where a system performs poorly.
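
The claim that metrics "only weakly reflect human judgements" is usually checked by correlating per-output metric scores with human ratings. The sketch below shows one way to do that with a Spearman rank correlation. The abstract does not name the correlation measure, so that choice is an assumption here, and the scores, ratings, and function names are hypothetical toy values for illustration only, not results from the paper.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j over the run of values tied with the one at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mean_x, mean_y = mean(x), mean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical toy data: one automatic metric score and one human rating
# per system output. These numbers are invented for illustration.
metric_scores = [0.10, 0.40, 0.35, 0.80, 0.55]
human_ratings = [3, 1, 2, 5, 4]

rho = spearman(metric_scores, human_ratings)
print(f"Spearman rho between metric and human ratings: {rho:.2f}")
```

A value of rho near 0 would mean the metric's ordering of outputs says little about the human ordering, while a value near 1 would mean the two agree. Running the same computation separately per dataset or per system is one way to probe whether a metric's behaviour is data- or system-specific.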

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Position: Uncertainty Quantification in LLMs is Just Unsupervised Clustering
cs.CL · 2026-05 · unverdicted · novelty 5.0

Mainstream UQ for LLMs reduces to unsupervised clustering of internal generation consistency and therefore cannot detect confident hallucinations or provide reliable safety signals.