Are large language models good evaluators for abstractive summarization?

Are Large Language Models Good Evaluators for Abstractive Summarization? · 2023 · arXiv 2305.13091

4 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models

cs.CL · 2025-04-29 · unverdicted · novelty 7.0

The authors generate and publicly release the first large-scale open dataset of three million structured moral fables produced by small open language models together with a reproducible LLM-judge evaluation pipeline.

ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate

cs.CL · 2023-08-14 · conditional · novelty 6.0

Multi-agent debate among LLMs yields more reliable text evaluations than single-agent prompting by simulating collaborative human judgment.

Instruction-Following Evaluation for Large Language Models

cs.CL · 2023-11-14 · unverdicted · novelty 5.0

IFEval is a new benchmark of 25 verifiable instruction types and ~500 prompts for objective, reproducible evaluation of LLMs' instruction-following abilities.

A Tree-of-Thoughts Inspired Hybrid Approach for Legal Case Judgement Summarization using LLMs

cs.CL · 2026-06-26 · unverdicted · novelty 3.0

A tree-of-thoughts inspired hybrid extractive-abstractive LLM prompt yields better legal case judgment summaries than standard extractive or abstractive prompts.

citing papers explorer

Showing 2 of 2 citing papers after filters.

ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate

cs.CL · 2023-08-14 · conditional · none · ref 22

Multi-agent debate among LLMs yields more reliable text evaluations than single-agent prompting by simulating collaborative human judgment.

Instruction-Following Evaluation for Large Language Models

cs.CL · 2023-11-14 · unverdicted · none · ref 19

IFEval is a new benchmark of 25 verifiable instruction types and ~500 prompts for objective, reproducible evaluation of LLMs' instruction-following abilities.
