Pith Number

pith:G2XA7TKE

pith:2024:G2XA7TKEYV53AO5KYTG6OUGBP5
not attested · not anchored · not stored · refs resolved

LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks

Hao Peng, Jiajie Zhang, Jiazheng Xu, Jie Tang, Juanzi Li, Lei Hou, Shangqing Tu, Shulin Cao, Xiaozhi Wang, Xin Lv, Yushi Bai, Yuxiao Dong

LongBench v2 shows that the best current LLMs score about 50% on long-context reasoning tasks when answering directly, while reasoning models exceed the roughly 54% human baseline.

arxiv:2412.15204 v2 · 2024-12-19 · cs.CL · cs.AI

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{G2XA7TKEYV53AO5KYTG6OUGBP5}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

The best-performing model, when answering the questions directly, achieves only 50.1% accuracy. In contrast, the o1-preview model, which includes longer reasoning, achieves 57.7%, surpassing the human baseline by 4 percentage points.

C2 · weakest assumption

That the 503 questions genuinely require deep understanding and multi-step reasoning rather than being solvable through surface cues or training-data leakage, and that the 15-minute human time limit produces a fair comparison to model performance.

C3 · one-line summary

The LongBench v2 benchmark shows that current LLMs underperform humans on deep long-context reasoning tasks, but that extended inference-time reasoning lets models surpass the human baseline.

References

23 extracted · 23 resolved · 3 Pith anchors

[1] Agrawal, P., Craig, N., Madden, A., and Lombera, I 2024
[2] The Llama 3 Herd of Models 2024 · arXiv:2407.21783
[3] ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools 2024 · arXiv:2406.12793
[4] RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems 2024 · arXiv:2306.03091
[5] In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11621–11640, Bangkok, Thailand 2024

Formal links

1 machine-checked theorem link

Cited by

26 papers in Pith

Receipt and verification
First computed 2026-05-17T23:38:46.654233Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

36ae0fcd44c57bb03baac4cde750c17f5fc633d6e2b8c874bcd85ed980ec3b75

Aliases

arxiv: 2412.15204 · arxiv_version: 2412.15204v2 · doi: 10.48550/arxiv.2412.15204 · pith_short_12: G2XA7TKEYV53 · pith_short_16: G2XA7TKEYV53AO5K · pith_short_8: G2XA7TKE
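The short aliases listed above are plain prefixes of the full Pith Number. In addition, for the values shown on this page, the 26-character Pith Number matches the first 26 characters of the RFC 4648 base32 encoding of the canonical hash. The sketch below checks both relationships. Treat the base32 relationship as an observation about this record, not a documented rule of the scheme; the function names `short_aliases` and `base32_prefix` are hypothetical helpers introduced for illustration.

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "36ae0fcd44c57bb03baac4cde750c17f5fc633d6e2b8c874bcd85ed980ec3b75"
PITH_NUMBER = "G2XA7TKEYV53AO5KYTG6OUGBP5"


def short_aliases(pith_number: str) -> dict:
    """Return the 8-, 12-, and 16-character prefix aliases of a Pith Number."""
    return {f"pith_short_{n}": pith_number[:n] for n in (8, 12, 16)}


def base32_prefix(hex_digest: str, length: int = 26) -> str:
    """Base32-encode a hex digest (RFC 4648 alphabet) and keep the first `length` characters.

    Assumption: the 26-character length is inferred from this page only.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


if __name__ == "__main__":
    for name, value in short_aliases(PITH_NUMBER).items():
        print(f"{name}: {value}")
    print("base32 prefix matches Pith Number:", base32_prefix(CANONICAL_HASH) == PITH_NUMBER)
```

If the observed relationship holds in general, a client could recompute the identifier from the canonical hash alone, but that would need to be confirmed against the schema before relying on it.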
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/G2XA7TKEYV53AO5KYTG6OUGBP5 \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 36ae0fcd44c57bb03baac4cde750c17f5fc633d6e2b8c874bcd85ed980ec3b75
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "7765192ced9a40be15cb5d5ecd09e4647b36f808cf665312205b0b87976cb5f6",
    "cross_cats_sorted": [
      "cs.AI"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2024-12-19T18:59:17Z",
    "title_canon_sha256": "4998e049c23af4c78fd2e5f612dad7ae2284185f686b6fa03754a436ae679944"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2412.15204",
    "kind": "arxiv",
    "version": 2
  }
}
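The verification command above serializes the canonical record with sorted keys, compact separators, and non-ASCII characters left unescaped, then takes the SHA-256 of the UTF-8 bytes. A minimal sketch of the same steps as a reusable Python function follows, for cases where the record has already been downloaded. The function names `canonical_sha256` and `verify_record` are hypothetical and not part of any Pith client library.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Hash a record the same way as the one-liner in the verification command.

    Keys are sorted and whitespace is removed, so the digest does not depend
    on the key order or formatting of the input JSON.
    """
    canonical = json.dumps(
        record,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    ).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def verify_record(record: dict, expected_hex: str) -> bool:
    """Return True if the record's canonical hash equals the expected digest."""
    return canonical_sha256(record) == expected_hex.strip().lower()


if __name__ == "__main__":
    # Two dicts with the same content but different key order hash identically.
    a = {"source": {"id": "2412.15204", "kind": "arxiv", "version": 2}, "schema_version": "1.0"}
    b = {"schema_version": "1.0", "source": {"version": 2, "kind": "arxiv", "id": "2412.15204"}}
    print(canonical_sha256(a) == canonical_sha256(b))
```

To check this record, pass the `canonical_record` object returned by the API together with the canonical hash shown on the page to `verify_record`.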