LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks
Pith reviewed 2026-05-16 20:34 UTC · model grok-4.3
The pith
LongBench v2 shows that the best direct-answer LLM reaches only 50.1% on its long-context reasoning tasks, while the reasoning-augmented o1-preview scores 57.7%, exceeding the 53.7% human baseline.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LongBench v2 consists of 503 challenging multiple-choice questions across six categories, with contexts ranging from 8k to 2M words; human experts score 53.7% under a 15-minute time limit, the best direct-answer model reaches only 50.1%, and the reasoning-augmented o1-preview achieves 57.7%, exceeding the human baseline by 4 points.
What carries the argument
The LongBench v2 benchmark, a collection of 503 multiple-choice questions requiring deep understanding and reasoning over long contexts from real-world multitasks.
Load-bearing premise
The 503 questions genuinely require deep multi-step understanding rather than being solvable through surface cues or training data patterns.
What would settle it
A model that scores above 65% accuracy using only direct answers without extended reasoning would show the benchmark does not require the claimed level of reasoning.
read the original abstract
This paper introduces LongBench v2, a benchmark designed to assess the ability of LLMs to handle long-context problems requiring deep understanding and reasoning across real-world multitasks. LongBench v2 consists of 503 challenging multiple-choice questions, with contexts ranging from 8k to 2M words, across six major task categories: single-document QA, multi-document QA, long in-context learning, long-dialogue history understanding, code repository understanding, and long structured data understanding. To ensure the breadth and the practicality, we collect data from nearly 100 highly educated individuals with diverse professional backgrounds. We employ both automated and manual review processes to maintain high quality and difficulty, resulting in human experts achieving only 53.7% accuracy under a 15-minute time constraint. Our evaluation reveals that the best-performing model, when answering the questions directly, achieves only 50.1% accuracy. In contrast, the o1-preview model, which includes longer reasoning, achieves 57.7%, surpassing the human baseline by 4%. These results highlight the importance of enhanced reasoning ability and scaling inference-time compute to tackle the long-context challenges in LongBench v2. The project is available at https://longbench2.github.io.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LongBench v2, a benchmark of 503 expert-sourced multiple-choice questions spanning six task categories (single- and multi-document QA, long in-context learning, dialogue history, code repositories, and structured data) with contexts from 8k to 2M words. After automated and manual review, human experts achieve 53.7% accuracy under a 15-minute time limit; the strongest direct-answer model reaches 50.1% while o1-preview with extended reasoning reaches 57.7%.
Significance. The benchmark supplies a high-quality, expert-validated resource for measuring long-context understanding beyond retrieval. The reported human baseline and the gap between direct-answer models and o1-preview, if the evaluation protocol is shown to be fair, would usefully quantify the value of inference-time compute for realistic multitask reasoning.
major comments (1)
- [Human Evaluation] Human Evaluation section: the headline claim that o1-preview surpasses humans (57.7% vs. 53.7%) rests on the 15-minute time limit constituting a fair test of deep understanding. For contexts up to 2M words this constraint permits humans to read only a few thousand words at normal speed, while models receive the entire context; the manuscript must supply (a) measured or estimated fraction of context actually read by participants, (b) evidence that the 503 questions cannot be solved from surface cues or partial context, and (c) any sensitivity analysis showing how human accuracy changes with longer time allowances.
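A concrete way to supply the evidence requested in (b) is a partial-context ablation: re-score a strong model on truncated contexts and check whether accuracy collapses toward chance. The sketch below is illustrative only, not the paper's protocol; it assumes each item exposes context, question, choice_A through choice_D, and a gold answer letter, and ask_model is a hypothetical wrapper around the model under evaluation.

def truncate_words(text: str, max_words: int | None) -> str:
    # Keep only the first max_words whitespace-delimited words (None = full context).
    return " ".join(text.split()[:max_words])

def mcq_accuracy(items, max_words, ask_model) -> float:
    # Direct-answer accuracy on four-option multiple-choice items.
    correct = 0
    for item in items:
        prompt = (
            f"{truncate_words(item['context'], max_words)}\n\n"
            f"Question: {item['question']}\n"
            f"A. {item['choice_A']}\nB. {item['choice_B']}\n"
            f"C. {item['choice_C']}\nD. {item['choice_D']}\n"
            "Answer with a single letter."
        )
        prediction = ask_model(prompt).strip()[:1].upper()
        correct += prediction == item["answer"]
    return correct / len(items)

# If accuracy at a 4,000-word budget (roughly one 15-minute human read) stays
# close to the full-context score, the questions are likely answerable from
# surface cues or partial context; a large drop supports the paper's claim.
# for budget in (4_000, 32_000, None): print(budget, mcq_accuracy(items, budget, ask_model))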
minor comments (2)
- [§3] §3 (Dataset Construction): state the exact number of questions per category and the precise exclusion criteria applied during the manual review stage.
- [Results] Table 2 or Results section: report per-category accuracy for both humans and the top models rather than only aggregate scores.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the human evaluation protocol. We address the major comment point by point below and will revise the manuscript to strengthen the discussion of evaluation fairness.
read point-by-point responses
-
Referee: [Human Evaluation] Human Evaluation section: the headline claim that o1-preview surpasses humans (57.7% vs. 53.7%) rests on the 15-minute time limit constituting a fair test of deep understanding. For contexts up to 2M words this constraint permits humans to read only a few thousand words at normal speed, while models receive the entire context; the manuscript must supply (a) measured or estimated fraction of context actually read by participants, (b) evidence that the 503 questions cannot be solved from surface cues or partial context, and (c) any sensitivity analysis showing how human accuracy changes with longer time allowances.
Authors: We agree that clarifying the human evaluation protocol is essential. For (a), we did not measure the exact fraction of context read by participants. Based on average reading speeds of 200-300 words per minute, we estimate participants could read 3,000-4,500 words in 15 minutes; we will add this estimate and a discussion of its implications to the revised Human Evaluation section. For (b), questions were created by domain experts and underwent automated filtering plus manual review specifically to require deep reasoning over the full context rather than surface-level cues; the resulting human accuracy of only 53.7% provides supporting evidence. We will expand the question collection and validation description to make this explicit. For (c), we have not conducted a sensitivity analysis with longer time allowances, as it would require substantial additional expert time. We will add an explicit limitations paragraph acknowledging this gap. These changes will be incorporated in the revision.
Revision status: partial. Not addressed:
- Exact measured (rather than estimated) fraction of context read by human participants
- Empirical sensitivity analysis of human accuracy under longer time allowances
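As a back-of-envelope check on the time-budget concern, the reading-speed estimate in the response above can be made explicit. A minimal sketch, using only the 200-300 words-per-minute range cited in the rebuttal and the 8k-2M word context range from the abstract:

TIME_LIMIT_MIN = 15                     # verification time limit in minutes
READING_SPEED_WPM = (200, 300)          # slow and fast reading speeds
CONTEXT_WORDS = (8_000, 2_000_000)      # benchmark context range in words

for wpm in READING_SPEED_WPM:
    readable = wpm * TIME_LIMIT_MIN     # 3,000 or 4,500 words per session
    for ctx in CONTEXT_WORDS:
        print(f"{wpm} wpm, {ctx:>9,}-word context: "
              f"reads {readable:,} words ({readable / ctx:.2%} of the context)")

# At the 2M-word end even a fast reader covers under 0.25% of the context,
# which is the core of the fairness objection to the 53.7% human baseline.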
Circularity Check
No circularity: direct empirical measurements on new benchmark
full rationale
The paper introduces LongBench v2 as a newly constructed dataset of 503 questions and reports straightforward accuracy measurements (human 53.7% under 15-minute limit, best direct model 50.1%, o1-preview 57.7%). No mathematical derivations, fitted parameters, predictions, or equations are claimed. All results are direct empirical evaluations on the provided contexts and questions, with no self-definitional reductions, self-citation load-bearing steps, or renamings of known results. The construction process (data collection from experts, automated/manual review) is described as independent of the final performance numbers, rendering the analysis self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: expert-collected multiple-choice questions with long contexts validly measure deep understanding and reasoning in LLMs and humans
Forward citations
Cited by 20 Pith papers
-
An Answer is just the Start: Related Insight Generation for Open-Ended Document-Grounded QA
InsightGen uses thematic clustering and graph neighborhood selection to generate diverse, relevant insights for open-ended document-grounded questions and releases the SCOpE-QA dataset of 3000 questions.
-
Telegraph English: Semantic Prompt Compression via Structured Symbolic Rewriting
Telegraph English compresses prompts via structured symbolic rewriting into atomic facts, achieving roughly 50% token reduction with 99.1% key-fact accuracy on LongBench-v2 and outperforming token-deletion baselines a...
-
Internalized Reasoning for Long-Context Visual Document Understanding
A synthetic pipeline creates and internalizes reasoning traces in VLMs for long-context visual document understanding, with a 32B model surpassing a 235B model on MMLongBenchDoc and showing 12.4x fewer output tokens.
-
MultiPath Memory Access: Breaking Host-GPU Bandwidth Bottlenecks in LLM Services
MMA routes host-GPU transfers over multiple available paths to deliver 4.62x higher peak bandwidth and lower latencies in LLM serving without hardware or driver changes.
-
Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions
MemoryAgentBench is a new multi-turn benchmark assessing four memory competencies in LLM agents—accurate retrieval, test-time learning, long-range understanding, and selective forgetting—showing that existing methods ...
-
Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction
A unified learnable KV eviction policy with cross-layer calibration reduces memory and matches or exceeds full-cache performance on long-context tasks by retaining useful tokens and limiting attention dilution.
-
Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving
SPIN co-designs sparse attention with hierarchical memory to achieve 1.66-5.66x higher throughput, 7-9x lower TTFT, and up to 58% lower TPOT than vLLM and original sparse implementations.
-
SparKV: Overhead-Aware KV Cache Loading for Efficient On-Device LLM Inference
SparKV reduces time-to-first-token by 1.3x-5.1x and energy use by 1.5x-3.3x for on-device LLM inference by adaptively choosing between cloud KV streaming and local computation while overlapping execution and adjusting...
-
Evaluating Multi-Hop Reasoning in RAG Systems: A Comparison of LLM-Based Retriever Evaluation Strategies
CARE, a context-aware LLM judge, outperforms standard methods when evaluating multi-hop retrieval quality in RAG systems.
-
MemExplorer: Navigating the Heterogeneous Memory Design Space for Agentic Inference NPUs
MemExplorer optimizes heterogeneous memory systems for agentic LLM inference on NPUs and reports up to 2.3x higher energy efficiency than baselines under fixed power budgets.
-
PolicyLong: Towards On-Policy Context Extension
PolicyLong shifts long-context data synthesis to an on-policy loop that re-screens contexts using the evolving model's entropy landscape, producing a self-curriculum that outperforms static offline baselines with larg...
-
S2O: Early Stopping for Sparse Attention via Online Permutation
S2O uses online permutation and importance-based early stopping to increase effective sparsity in attention, delivering 7.51x attention and 3.81x end-to-end speedups on Llama-3.1-8B at 128K context with preserved accuracy.
-
Efficient Evaluation of LLM Performance with Statistical Guarantees
Factorized Active Querying (FAQ) provides up to 5 times more effective samples for LLM accuracy estimation by using Bayesian factor models and adaptive querying under a fixed budget with guaranteed coverage.
-
BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding
BLASST dynamically sparsifies attention by thresholding softmax scores to skip blocks, delivering 1.5x speedups at 70%+ sparsity while preserving benchmark accuracy.
-
Kimi Linear: An Expressive, Efficient Attention Architecture
Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.
-
MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
MiniMax-M1 is a 456B parameter hybrid-attention MoE model trained with CISPO RL that achieves performance comparable or superior to DeepSeek-R1 and Qwen3-235B on reasoning and software engineering tasks while training...
-
An Efficient Hybrid Sparse Attention with CPU-GPU Parallelism for Long-Context Inference
Fluxion achieves 1.5x-3.7x speedup in long-context LLM inference with CPU KV caches while limiting accuracy degradation to at most 0.26 relative to full attention.
-
Too long; didn't solve
Longer prompts and solutions in a new expert-authored math dataset correlate with higher failure rates across LLMs, with length linked to empirical difficulty after difficulty adjustment.
-
Kimi K2.5: Visual Agentic Intelligence
Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.
-
Kimi K2: Open Agentic Intelligence
Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.