{"paper":{"title":"LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"LongBench v2 shows current LLMs score 50% on long-context reasoning tasks while reasoning models exceed the 54% human baseline.","cross_cats":["cs.AI"],"primary_cat":"cs.CL","authors_text":"Hao Peng, Jiajie Zhang, Jiazheng Xu, Jie Tang, Juanzi Li, Lei Hou, Shangqing Tu, Shulin Cao, Xiaozhi Wang, Xin Lv, Yushi Bai, Yuxiao Dong","submitted_at":"2024-12-19T18:59:17Z","abstract_excerpt":"This paper introduces LongBench v2, a benchmark designed to assess the ability of LLMs to handle long-context problems requiring deep understanding and reasoning across real-world multitasks. LongBench v2 consists of 503 challenging multiple-choice questions, with contexts ranging from 8k to 2M words, across six major task categories: single-document QA, multi-document QA, long in-context learning, long-dialogue history understanding, code repository understanding, and long structured data understanding. To ensure the breadth and the practicality, we collect data from nearly 100 highly educate"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"The best-performing model, when directly answers the questions, achieves only 50.1% accuracy. 
In contrast, the o1-preview model, which includes longer reasoning, achieves 57.7%, surpassing the human baseline by 4%.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the 503 questions genuinely require deep understanding and multi-step reasoning rather than being solvable through surface cues or training-data leakage, and that the 15-minute human time limit produces a fair comparison to model performance.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"LongBench v2 benchmark shows current LLMs underperform humans on deep long-context reasoning tasks, but extended inference-time reasoning enables surpassing the human baseline.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"LongBench v2 shows current LLMs score 50% on long-context reasoning tasks while reasoning models exceed the 54% human baseline.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"54b1e97657495c00fa6902432bc3adeed38c363a8c001933653dda0d1a4c2117"},"source":{"id":"2412.15204","kind":"arxiv","version":2},"verdict":{"id":"143973a6-ec2d-413c-b051-ac0698797f19","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T20:34:54.088333Z","strongest_claim":"The best-performing model, when directly answers the questions, achieves only 50.1% accuracy. 
In contrast, the o1-preview model, which includes longer reasoning, achieves 57.7%, surpassing the human baseline by 4%.","one_line_summary":"LongBench v2 benchmark shows current LLMs underperform humans on deep long-context reasoning tasks, but extended inference-time reasoning enables surpassing the human baseline.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the 503 questions genuinely require deep understanding and multi-step reasoning rather than being solvable through surface cues or training-data leakage, and that the 15-minute human time limit produces a fair comparison to model performance.","pith_extraction_headline":"LongBench v2 shows current LLMs score 50% on long-context reasoning tasks while reasoning models exceed the 54% human baseline."},"references":{"count":23,"sample":[{"doi":"","year":2024,"title":"Agrawal, P., Craig, N., Madden, A., and Lombera, I","work_id":"1ff25d89-9966-4623-9f61-876eb43549b4","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2021,"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","ref_index":2,"cited_arxiv_id":"2407.21783","is_internal_anchor":true},{"doi":"","year":2024,"title":"ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools","work_id":"de9ce5af-0d8d-4b94-9793-64968d9bc06d","ref_index":3,"cited_arxiv_id":"2406.12793","is_internal_anchor":true},{"doi":"","year":2024,"title":"RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems","work_id":"24b24ba5-b21c-43c6-866f-4b874305372e","ref_index":4,"cited_arxiv_id":"2306.03091","is_internal_anchor":true},{"doi":"","year":2024,"title":"In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11621–11640, Bangkok, 
Thailand","work_id":"4ad83798-4e7c-471d-97e7-97ca3a0a1127","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":23,"snapshot_sha256":"8e30c0b650334039f4a8687c289f063ab987dc14efa5fddade629e7581467b79","internal_anchors":3},"formal_canon":{"evidence_count":1,"snapshot_sha256":"cd08eaf3270df4b78a2cfb664e5934400260c13312d1d9b889577aae2b8d8e40"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}