100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability?

Hongye Jin; Qifan Wang; Shaochen Zhong; Song Jiang; Vipin Chaudhary; Wang Yang; Xiaotian Han

arxiv: 2505.19293 · v2 · pith:NLMF3BJFnew · submitted 2025-05-25 · 💻 cs.CL · cs.AI · cs.LG

Wang Yang, Hongye Jin, Shaochen Zhong, Song Jiang, Qifan Wang, Vipin Chaudhary, Xiaotian Han

keywords: long-context · benchmarks · ability · baseline · evaluating · llms · longbench · model

Long-context capability is considered one of the most important abilities of LLMs, as a truly long context-capable LLM enables users to effortlessly process many originally exhausting tasks -- e.g., digesting a long-form document to find answers vs. directly asking an LLM about it. However, existing real-task-based long-context evaluation benchmarks have two major shortcomings. First, benchmarks like LongBench often do not provide proper metrics to separate long-context performance from the model's baseline ability, making cross-model comparison unclear. Second, such benchmarks are usually constructed with fixed input lengths, which limits their applicability across different models and fails to reveal when a model begins to break down. To address these issues, we introduce a length-controllable long-context benchmark and a novel metric that disentangles baseline knowledge from true long-context capabilities. Experiments demonstrate the superiority of our approach in effectively evaluating LLMs.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Positional Failures in Long-Context LLMs: A Blind Spot in Reasoning Benchmarks
cs.CL · 2026-05 · conditional · novelty 7.0

Audits reveal that no reasoning benchmark jointly controls position, filler, and length; CRE shows that LLMs drop by up to 88 percentage points on middle-position tasks at 64K context, with a diagnostic probe supporting a positional cause.