Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization

Shen, Chenhui, Cheng, Liying, Nguyen, Xuan-Phi, You, Yang, Bing, Lidong · 2023 · DOI 10.18653/v1/2023.findings-emnlp.278

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

open at publisher browse 3 citing papers

representative citing papers

Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models

cs.CL · 2024-04-29 · conditional · novelty 7.0

A panel of smaller diverse LLMs outperforms a single large model as an evaluator of generations, showing less intra-model bias and over 7x lower cost.

SCURank: Ranking Multiple Candidate Summaries with Summary Content Units for Enhanced Summarization

cs.CL · 2026-04-21 · unverdicted · novelty 6.0

SCURank ranks multiple summary candidates with Summary Content Units to outperform ROUGE and LLM-based methods in summarization distillation.

Lessons from the Trenches on Reproducible Evaluation of Language Models

cs.CL · 2024-05-23

citing papers explorer

Showing 3 of 3 citing papers.

Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models cs.CL · 2024-04-29 · conditional · none · ref 47
A panel of smaller diverse LLMs outperforms a single large model as an evaluator of generations, showing less intra-model bias and over 7x lower cost.
SCURank: Ranking Multiple Candidate Summaries with Summary Content Units for Enhanced Summarization cs.CL · 2026-04-21 · unverdicted · none · ref 34
SCURank ranks multiple summary candidates with Summary Content Units to outperform ROUGE and LLM-based methods in summarization distillation.
Lessons from the Trenches on Reproducible Evaluation of Language Models cs.CL · 2024-05-23 · unreviewed · ref 151

Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization

fields

years

verdicts

representative citing papers

citing papers explorer