The LongBench v2 benchmark shows that current LLMs underperform humans on deep long-context reasoning tasks, but that extended inference-time reasoning lets models surpass the human baseline.
Questions should be natural and close to users’ real needs; they should not be made unreasonably challenging just to increase difficulty.
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CL (1)
Years: 2024 (1)
Verdicts: ACCEPT (1)
Representative citing papers:
- LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks