Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP)
1 Pith paper cites this work.
Fields: cs.CL (1)
Years: 2026 (1)
Verdicts: CONDITIONAL (1)
Representative citing papers
Positional Failures in Long-Context LLMs: A Blind Spot in Reasoning Benchmarks
An audit finds that no reasoning benchmark jointly controls for position, filler, and length; CRE shows LLMs drop by up to 88 pp on middle-position tasks at 64K context, with a diagnostic probe supporting a positional cause.