arxiv: 1707.06875 · v1 · pith:LMEEBE27new · submitted 2017-07-21 · 💻 cs.CL

Why We Need New Evaluation Metrics for NLG

classification 💻 cs.CL
keywords metrics, automatic, evaluation, system, need, novel, bleu, cases

The majority of NLG evaluation relies on automatic metrics, such as BLEU. In this paper, we motivate the need for novel, system- and data-independent automatic evaluation methods: we investigate a wide range of metrics, including state-of-the-art word-based and novel grammar-based ones, and demonstrate that they only weakly reflect human judgements of system outputs as generated by data-driven, end-to-end NLG. We also show that metric performance is data- and system-specific. Nevertheless, our results also suggest that automatic metrics perform reliably at system-level and can support system development by finding cases where a system performs poorly.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Position: Uncertainty Quantification in LLMs is Just Unsupervised Clustering

    cs.CL 2026-05 unverdicted novelty 5.0

    Mainstream UQ for LLMs reduces to unsupervised clustering of internal generation consistency and therefore cannot detect confident hallucinations or provide reliable safety signals.