Are large language models good evaluators for abstractive summarization?

Chenhui Shen, Liying Cheng, Yang You, Lidong Bing · 2023 · arXiv 2305.13091

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

representative citing papers

TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models

cs.CL · 2025-04-29 · unverdicted · novelty 7.0

The authors generate and publicly release the first large-scale open dataset of three million structured moral fables produced by small open language models together with a reproducible LLM-judge evaluation pipeline.

ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate

cs.CL · 2023-08-14 · conditional · novelty 6.0

Multi-agent debate among LLMs yields more reliable text evaluations than single-agent prompting by simulating collaborative human judgment.

Instruction-Following Evaluation for Large Language Models

cs.CL · 2023-11-14 · unverdicted · novelty 5.0

IFEval is a new benchmark of 25 verifiable instruction types and ~500 prompts for objective, reproducible evaluation of LLMs' instruction-following abilities.

citing papers explorer

Showing 3 of 3 citing papers.

TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models cs.CL · 2025-04-29 · unverdicted · none · ref 18
The authors generate and publicly release the first large-scale open dataset of three million structured moral fables produced by small open language models together with a reproducible LLM-judge evaluation pipeline.
ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate cs.CL · 2023-08-14 · conditional · none · ref 22
Multi-agent debate among LLMs yields more reliable text evaluations than single-agent prompting by simulating collaborative human judgment.
Instruction-Following Evaluation for Large Language Models cs.CL · 2023-11-14 · unverdicted · none · ref 19
IFEval is a new benchmark of 25 verifiable instruction types and ~500 prompts for objective, reproducible evaluation of LLMs' instruction-following abilities.

Are large language models good evaluators for abstractive summarization?

fields

years

verdicts

representative citing papers

citing papers explorer