LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks
Pith reviewed 2026-05-16 20:34 UTC · model grok-4.3
The pith
LongBench v2 shows that the best direct-answer LLM reaches only 50.1% on its long-context reasoning tasks, while the reasoning-augmented o1-preview scores 57.7%, exceeding the 53.7% human baseline.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LongBench v2 consists of 503 challenging multiple-choice questions across six categories, with contexts ranging from 8k to 2M words; human experts score 53.7% under a 15-minute time limit, the best direct-answer model reaches only 50.1%, and the reasoning-augmented o1-preview achieves 57.7%, exceeding the human baseline by 4 points.
What carries the argument
The LongBench v2 benchmark, a collection of 503 multiple-choice questions requiring deep understanding and reasoning over long contexts from real-world multitasks.
Load-bearing premise
The 503 questions genuinely require deep multi-step understanding rather than being solvable through surface cues or training data patterns.
What would settle it
A model that scores above 65% accuracy using only direct answers without extended reasoning would show the benchmark does not require the claimed level of reasoning.
read the original abstract
This paper introduces LongBench v2, a benchmark designed to assess the ability of LLMs to handle long-context problems requiring deep understanding and reasoning across real-world multitasks. LongBench v2 consists of 503 challenging multiple-choice questions, with contexts ranging from 8k to 2M words, across six major task categories: single-document QA, multi-document QA, long in-context learning, long-dialogue history understanding, code repository understanding, and long structured data understanding. To ensure the breadth and the practicality, we collect data from nearly 100 highly educated individuals with diverse professional backgrounds. We employ both automated and manual review processes to maintain high quality and difficulty, resulting in human experts achieving only 53.7% accuracy under a 15-minute time constraint. Our evaluation reveals that the best-performing model, when answering the questions directly, achieves only 50.1% accuracy. In contrast, the o1-preview model, which includes longer reasoning, achieves 57.7%, surpassing the human baseline by 4%. These results highlight the importance of enhanced reasoning ability and scaling inference-time compute to tackle the long-context challenges in LongBench v2. The project is available at https://longbench2.github.io.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LongBench v2, a benchmark of 503 expert-sourced multiple-choice questions spanning six task categories (single- and multi-document QA, long in-context learning, dialogue history, code repositories, and structured data) with contexts from 8k to 2M words. After automated and manual review, human experts achieve 53.7% accuracy under a 15-minute time limit; the strongest direct-answer model reaches 50.1% while o1-preview with extended reasoning reaches 57.7%.
Significance. The benchmark supplies a high-quality, expert-validated resource for measuring long-context understanding beyond retrieval. The reported human baseline and the gap between direct-answer models and o1-preview, if the evaluation protocol is shown to be fair, would usefully quantify the value of inference-time compute for realistic multitask reasoning.
major comments (1)
- [Human Evaluation] Human Evaluation section: the headline claim that o1-preview surpasses humans (57.7% vs. 53.7%) rests on the 15-minute time limit constituting a fair test of deep understanding. For contexts up to 2M words this constraint permits humans to read only a few thousand words at normal speed, while models receive the entire context; the manuscript must supply (a) measured or estimated fraction of context actually read by participants, (b) evidence that the 503 questions cannot be solved from surface cues or partial context, and (c) any sensitivity analysis showing how human accuracy changes with longer time allowances.
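A concrete way to supply the evidence requested in (b) is a partial-context ablation: re-score a strong model on truncated contexts and check whether accuracy collapses toward chance. The sketch below is illustrative only, not the paper's protocol; it assumes each item exposes context, question, choice_A through choice_D, and a gold answer letter, and ask_model is a hypothetical wrapper around the model under evaluation.

def truncate_words(text: str, max_words: int | None) -> str:
    # Keep only the first max_words whitespace-delimited words (None = full context).
    return " ".join(text.split()[:max_words])

def mcq_accuracy(items, max_words, ask_model) -> float:
    # Direct-answer accuracy on four-option multiple-choice items.
    correct = 0
    for item in items:
        prompt = (
            f"{truncate_words(item['context'], max_words)}\n\n"
            f"Question: {item['question']}\n"
            f"A. {item['choice_A']}\nB. {item['choice_B']}\n"
            f"C. {item['choice_C']}\nD. {item['choice_D']}\n"
            "Answer with a single letter."
        )
        prediction = ask_model(prompt).strip()[:1].upper()
        correct += prediction == item["answer"]
    return correct / len(items)

# If accuracy at a 4,000-word budget (roughly one 15-minute human read) stays
# close to the full-context score, the questions are likely answerable from
# surface cues or partial context; a large drop supports the paper's claim.
# for budget in (4_000, 32_000, None): print(budget, mcq_accuracy(items, budget, ask_model))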
minor comments (2)
- [§3] §3 (Dataset Construction): state the exact number of questions per category and the precise exclusion criteria applied during the manual review stage.
- [Results] Table 2 or Results section: report per-category accuracy for both humans and the top models rather than only aggregate scores.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the human evaluation protocol. We address the major comment point by point below and will revise the manuscript to strengthen the discussion of evaluation fairness.
read point-by-point responses
-
Referee: [Human Evaluation] Human Evaluation section: the headline claim that o1-preview surpasses humans (57.7% vs. 53.7%) rests on the 15-minute time limit constituting a fair test of deep understanding. For contexts up to 2M words this constraint permits humans to read only a few thousand words at normal speed, while models receive the entire context; the manuscript must supply (a) measured or estimated fraction of context actually read by participants, (b) evidence that the 503 questions cannot be solved from surface cues or partial context, and (c) any sensitivity analysis showing how human accuracy changes with longer time allowances.
Authors: We agree that clarifying the human evaluation protocol is essential. For (a), we did not measure the exact fraction of context read by participants. Based on average reading speeds of 200-300 words per minute, we estimate participants could read 3,000-4,500 words in 15 minutes; we will add this estimate and a discussion of its implications to the revised Human Evaluation section. For (b), questions were created by domain experts and underwent automated filtering plus manual review specifically to require deep reasoning over the full context rather than surface-level cues; the resulting human accuracy of only 53.7% provides supporting evidence. We will expand the question collection and validation description to make this explicit. For (c), we have not conducted a sensitivity analysis with longer time allowances, as it would require substantial additional expert time. We will add an explicit limitations paragraph acknowledging this gap. These changes will be incorporated in the revision.
Revision status: partial. Not addressed:
- Exact measured (rather than estimated) fraction of context read by human participants
- Empirical sensitivity analysis of human accuracy under longer time allowances
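As a back-of-envelope check on the time-budget concern, the reading-speed estimate in the response above can be made explicit. A minimal sketch, using only the 200-300 words-per-minute range cited in the rebuttal and the 8k-2M word context range from the abstract:

TIME_LIMIT_MIN = 15                     # verification time limit in minutes
READING_SPEED_WPM = (200, 300)          # slow and fast reading speeds
CONTEXT_WORDS = (8_000, 2_000_000)      # benchmark context range in words

for wpm in READING_SPEED_WPM:
    readable = wpm * TIME_LIMIT_MIN     # 3,000 or 4,500 words per session
    for ctx in CONTEXT_WORDS:
        print(f"{wpm} wpm, {ctx:>9,}-word context: "
              f"reads {readable:,} words ({readable / ctx:.2%} of the context)")

# At the 2M-word end even a fast reader covers under 0.25% of the context,
# which is the core of the fairness objection to the 53.7% human baseline.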
Circularity Check
No circularity: direct empirical measurements on new benchmark
full rationale
The paper introduces LongBench v2 as a newly constructed dataset of 503 questions and reports straightforward accuracy measurements (human 53.7% under 15-minute limit, best direct model 50.1%, o1-preview 57.7%). No mathematical derivations, fitted parameters, predictions, or equations are claimed. All results are direct empirical evaluations on the provided contexts and questions, with no self-definitional reductions, self-citation load-bearing steps, or renamings of known results. The construction process (data collection from experts, automated/manual review) is described as independent of the final performance numbers, rendering the analysis self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: expert-collected multiple-choice questions with long contexts validly measure deep understanding and reasoning in LLMs and humans
Forward citations
Cited by 20 Pith papers
-
An Answer is just the Start: Related Insight Generation for Open-Ended Document-Grounded QA
InsightGen uses thematic clustering and graph neighborhood selection to generate diverse, relevant insights for open-ended document-grounded questions and releases the SCOpE-QA dataset of 3000 questions.
-
Telegraph English: Semantic Prompt Compression via Structured Symbolic Rewriting
Telegraph English compresses prompts via structured symbolic rewriting into atomic facts, achieving roughly 50% token reduction with 99.1% key-fact accuracy on LongBench-v2 and outperforming token-deletion baselines a...
-
Internalized Reasoning for Long-Context Visual Document Understanding
A synthetic pipeline creates and internalizes reasoning traces in VLMs for long-context visual document understanding, with a 32B model surpassing a 235B model on MMLongBenchDoc and showing 12.4x fewer output tokens.
-
MultiPath Memory Access: Breaking Host-GPU Bandwidth Bottlenecks in LLM Services
MMA routes host-GPU transfers over multiple available paths to deliver 4.62x higher peak bandwidth and lower latencies in LLM serving without hardware or driver changes.
-
Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions
MemoryAgentBench is a new multi-turn benchmark assessing four memory competencies in LLM agents—accurate retrieval, test-time learning, long-range understanding, and selective forgetting—showing that existing methods ...
-
Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction
A unified learnable KV eviction policy with cross-layer calibration reduces memory and matches or exceeds full-cache performance on long-context tasks by retaining useful tokens and limiting attention dilution.
-
Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving
SPIN co-designs sparse attention with hierarchical memory to achieve 1.66-5.66x higher throughput, 7-9x lower TTFT, and up to 58% lower TPOT than vLLM and original sparse implementations.
-
SparKV: Overhead-Aware KV Cache Loading for Efficient On-Device LLM Inference
SparKV reduces time-to-first-token by 1.3x-5.1x and energy use by 1.5x-3.3x for on-device LLM inference by adaptively choosing between cloud KV streaming and local computation while overlapping execution and adjusting...
-
Evaluating Multi-Hop Reasoning in RAG Systems: A Comparison of LLM-Based Retriever Evaluation Strategies
CARE, a context-aware LLM judge, outperforms standard methods when evaluating multi-hop retrieval quality in RAG systems.
-
MemExplorer: Navigating the Heterogeneous Memory Design Space for Agentic Inference NPUs
MemExplorer optimizes heterogeneous memory systems for agentic LLM inference on NPUs and reports up to 2.3x higher energy efficiency than baselines under fixed power budgets.
-
PolicyLong: Towards On-Policy Context Extension
PolicyLong shifts long-context data synthesis to an on-policy loop that re-screens contexts using the evolving model's entropy landscape, producing a self-curriculum that outperforms static offline baselines with larg...
-
S2O: Early Stopping for Sparse Attention via Online Permutation
S2O uses online permutation and importance-based early stopping to increase effective sparsity in attention, delivering 7.51x attention and 3.81x end-to-end speedups on Llama-3.1-8B at 128K context with preserved accuracy.
-
Efficient Evaluation of LLM Performance with Statistical Guarantees
Factorized Active Querying (FAQ) provides up to 5 times more effective samples for LLM accuracy estimation by using Bayesian factor models and adaptive querying under a fixed budget with guaranteed coverage.
-
BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding
BLASST dynamically sparsifies attention by thresholding softmax scores to skip blocks, delivering 1.5x speedups at 70%+ sparsity while preserving benchmark accuracy.
-
Kimi Linear: An Expressive, Efficient Attention Architecture
Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.
-
MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
MiniMax-M1 is a 456B parameter hybrid-attention MoE model trained with CISPO RL that achieves performance comparable or superior to DeepSeek-R1 and Qwen3-235B on reasoning and software engineering tasks while training...
-
An Efficient Hybrid Sparse Attention with CPU-GPU Parallelism for Long-Context Inference
Fluxion achieves 1.5x-3.7x speedup in long-context LLM inference with CPU KV caches while limiting accuracy degradation to at most 0.26 relative to full attention.
-
Too long; didn't solve
Longer prompts and solutions in a new expert-authored math dataset correlate with higher failure rates across LLMs, with length linked to empirical difficulty after difficulty adjustment.
-
Kimi K2.5: Visual Agentic Intelligence
Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.
-
Kimi K2: Open Agentic Intelligence
Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.