Benchmarking Large Language Models for News Summarization

Esin Durmus; Faisal Ladhak; Kathleen McKeown; Percy Liang; Tatsunori B. Hashimoto; Tianyi Zhang

arxiv: 2301.13848 · v1 · pith:6L7Q2QPMnew · submitted 2023-01-31 · 💻 cs.CL · cs.AI · cs.LG

keywords human, llms, summaries, summarization, evaluation, find, language, large

Large language models (LLMs) have shown promise for automatic summarization, but the reasons behind their successes are poorly understood. By conducting a human evaluation on ten LLMs across different pretraining methods, prompts, and model scales, we make two important observations. First, we find that instruction tuning, and not model size, is the key to the LLM's zero-shot summarization capability. Second, existing studies have been limited by low-quality references, leading to underestimates of human performance and lower few-shot and finetuning performance. To better evaluate LLMs, we perform human evaluation over high-quality summaries we collect from freelance writers. Despite major stylistic differences such as the amount of paraphrasing, we find that LLM summaries are judged to be on par with human-written summaries.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
cs.LG 2023-06 unverdicted novelty 6.0

H2O evicts non-heavy-hitter tokens from the KV cache using a dynamic submodular policy, retaining recent and frequent-co-occurrence tokens to reduce memory while preserving accuracy.

HieraSparse: Hierarchical Semi-Structured Sparse KV Attention
cs.DC 2026-04 unverdicted novelty 5.0

HieraSparse delivers a hierarchical semi-structured sparse KV attention system that achieves 1.2x KV compression and 4.57x decode attention speedup versus prior unstructured sparsity methods at equivalent sparsity, pl...

Hallucination of Multimodal Large Language Models: A Survey
cs.CV 2024-04 accept novelty 5.0

The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.

A Tree-of-Thoughts Inspired Hybrid Approach for Legal Case Judgement Summarization using LLMs
cs.CL 2026-06 unverdicted novelty 3.0

A tree-of-thoughts inspired hybrid extractive-abstractive LLM prompt yields better legal case judgment summaries than standard extractive or abstractive prompts.