
arxiv: 2605.23170 · v1 · pith:W7O4RW2Znew · submitted 2026-05-22 · 💻 cs.CL · cs.AI · cs.LG

Positional Failures in Long-Context LLMs: A Blind Spot in Reasoning Benchmarks

Pith reviewed 2026-05-25 04:56 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords long-context LLMs · reasoning benchmarks · positional failures · context length · evaluation gap · filler interference · Context Rot Evaluation

The pith

Long-context reasoning benchmarks miss large accuracy drops when tasks are placed in the middle rather than at the end of the context.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that mainstream reasoning benchmarks for long-context LLMs do not jointly control task position, filler content, and context length, leaving positional vulnerabilities unmeasured. It introduces Context Rot Evaluation to vary all three factors and tests models on GSM8K and ARC-Challenge, finding sharp middle-position drops that worsen with length. Newer model releases show reduced but persistent drops under some conditions. This matters because uncontrolled benchmarks cannot reveal how performance changes with task location, so reported results may not reflect behavior in realistic varied positions. The claim rests on error patterns dominated by filler matching at middle positions and a probe that recovers accuracy by copying the task to the end.

Core claim

None of the 11 audited long-context reasoning benchmarks controls task position together with filler content and context length. In controlled tests, accuracy can fall by as much as 88 percentage points when the target task moves from the end to the middle of the context, and the drop grows at longer contexts. Under questions-only filler the drops persist across all four newer releases (16 to 56 points across 8K, 32K, and 64K). In the initial five-model set, middle-position errors match filler text 76 percent of the time versus 22 percent at the end position. At 8K, adding an end-position copy of the target task restores middle accuracy to within 4 points of the end baseline.

What carries the argument

Context Rot Evaluation (CRE), the framework that jointly varies task position, filler content, and context length on reasoning tasks to expose positional effects.
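
As a rough illustration of what jointly varying position, filler, and length could look like in prompt construction, here is a minimal sketch. It is not the authors' implementation: the function names `approx_tokens` and `build_context`, the whitespace-based token estimate, and the blank-line separator are all assumptions made for this example. The optional `copy_at_end` flag mimics the end-position copy used in the diagnostic probe.

```python
from typing import List


def approx_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated words (an assumption, not a real tokenizer)."""
    return len(text.split())


def build_context(target: str, filler: List[str], position: str,
                  budget: int, copy_at_end: bool = False) -> str:
    """Place `target` at the middle or end of the filler items, within a token budget."""
    if position not in ("middle", "end"):
        raise ValueError("position must be 'middle' or 'end'")

    # The end copy only makes sense when the target sits in the middle.
    add_copy = copy_at_end and position == "middle"

    # Reserve room for the target (twice if the end copy is added).
    used = approx_tokens(target) * (2 if add_copy else 1)
    chosen: List[str] = []
    for item in filler:
        cost = approx_tokens(item)
        if used + cost > budget:
            break
        chosen.append(item)
        used += cost

    # Insert halfway through the filler for "middle", after all filler for "end".
    idx = len(chosen) // 2 if position == "middle" else len(chosen)
    parts = chosen[:idx] + [target] + chosen[idx:]
    if add_copy:
        parts.append(target)
    return "\n\n".join(parts)
```

Holding `filler` and `budget` fixed while switching only `position` is what isolates the positional effect; a real harness would replace `approx_tokens` with the tokenizer of the model under test.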

If this is right

  • Middle-position accuracy drops grow with context length for vulnerable models, reaching 88 points for MiMo-v2-Flash at 64K under with_solutions filler.
  • Among the four newer releases, three stay within 6 points of end-position accuracy at 64K, but all four show larger gaps (16 to 56 points) under questions_only_v2 filler.
  • In the initial five-model set, 76 percent of middle-position errors match surrounding filler text, versus 22 percent at the end position.
  • At 8K, adding a target-task copy at the end brings middle accuracy within 4 points of the end baseline across all nine tested models.
  • Standard benchmark result tables that omit position-controlled tests cannot detect these growing vulnerabilities.
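
The filler-match figure in the list above (76 percent versus 22 percent) is a rate computed over incorrect answers only. The sketch below shows one plausible way to compute such a rate. The paper's actual matching rule is not described on this page, so the `Trial` record, the `normalize` helper, and the exact-match criterion are assumptions for illustration.

```python
from typing import Iterable, List, NamedTuple


class Trial(NamedTuple):
    predicted: str             # answer extracted from the model output
    gold: str                  # correct answer for the target task
    filler_answers: List[str]  # answers that appear in the surrounding filler


def normalize(ans: str) -> str:
    """Light normalization before comparison (assumed, not the paper's rule)."""
    return ans.strip().lower().rstrip(".")


def filler_match_rate(trials: Iterable[Trial]) -> float:
    """Fraction of incorrect trials whose prediction equals some filler answer."""
    errors = [t for t in trials if normalize(t.predicted) != normalize(t.gold)]
    if not errors:
        return 0.0
    matched = sum(
        1 for t in errors
        if normalize(t.predicted) in {normalize(a) for a in t.filler_answers}
    )
    return matched / len(errors)
```

Computing this rate separately for middle-position and end-position trials gives the two numbers being compared; correct answers are excluded from the denominator by construction.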

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Benchmarks would need to randomize or systematically vary task position to measure full robustness rather than position-favored performance.
  • The same positional gap may affect agentic and coding benchmarks that currently appear in main result tables without position controls.
  • Position-controlled testing could be extended to other long-context tasks to check whether filler interference is a general failure mode.
  • Vendors could add position as an explicit variable in evaluations to identify models that remain stable across locations.

Load-bearing premise

The accuracy drops are driven primarily by task position and filler interference rather than other model-specific or setup factors.

What would settle it

A controlled test in which middle-position accuracy stays within a few points of end-position accuracy across models and lengths when filler and context length are held fixed.
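
The settling condition can be phrased as a simple check over per-condition accuracies. This is a sketch only: the `Results` layout, the function names, and the default tolerance of 5 points (standing in for "a few points") are assumptions, not values from the paper.

```python
from typing import Dict, Tuple

# (model, context_length) -> (end_accuracy, middle_accuracy), both in percent.
Results = Dict[Tuple[str, str], Tuple[float, float]]


def pp_drop(end_acc: float, middle_acc: float) -> float:
    """Percentage-point drop from end position to middle position."""
    return end_acc - middle_acc


def position_stable(results: Results, tolerance_pp: float = 5.0) -> bool:
    """True if every condition keeps |end - middle| within the tolerance."""
    return all(abs(pp_drop(e, m)) <= tolerance_pp for e, m in results.values())
```

For reference, if the reported 88pp drop is read as end minus middle, then the reported middle accuracy of 8% implies an end accuracy of 96%, and `pp_drop(96.0, 8.0)` returns `88.0`.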

Figures

Figures reproduced from arXiv: 2605.23170 by Chuyifei Zhang, Hongyu Cui, Jitao Sang, Xiaowen Huang.

Figure 1: CRE evaluation framework. (A) Main experiment design: a ...

Abstract

Position-controlled evaluation is standard for retrieval tasks such as Needle-in-a-Haystack and RULER, but mainstream reasoning benchmarks do not control positional placement of target tasks in long contexts. We audit 11 long-context benchmarks and find none jointly controls task position, filler content, and context length for reasoning. An audit of four flagship long-context releases finds no main result-table entry for NIAH, RULER, or LongBench-family benchmarks, while agentic and coding benchmarks appear in main result-tables across all four. We propose Context Rot Evaluation (CRE), a controlled framework varying all three factors, and evaluate nine LLMs on GSM8K and ARC-Challenge across two rounds: an initial five-model set and four newer vendor releases. Models can drop sharply when the target task moves from end to middle, and the drop grows worse with context length for vulnerable models. MiMo-v2-Flash drops 88pp at 64K under with_solutions filler (middle accuracy 8%). Newer releases show smaller drops: at 64K, three of four stay within +/-6pp of end-position accuracy; MiMo-V2.5-Pro narrows the MiMo-v2-Flash 88pp drop to 32pp. Under questions_only_v2 filler, middle-position drops persist across all four (range -16pp to -56pp across 8K, 32K, 64K). At 8K, a diagnostic probe adding a target-task copy at the end brings middle accuracy within +/-4pp of end baseline across all nine models, consistent with a positional explanation. In the initial five-model set, 76% of middle-position errors match surrounding filler text versus 22% at the end position, consistent with filler-answer interference as a dominant error mode. These results expose a structural evaluation gap in current reasoning benchmark design and vendor evaluation practice: positional vulnerabilities that grow with context length cannot be measured when task position is not controlled.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript audits 11 long-context benchmarks and finds that none jointly controls task position, filler content, and context length for reasoning. It introduces Context Rot Evaluation (CRE) and reports results from nine LLMs on GSM8K and ARC-Challenge, documenting sharp accuracy drops when target tasks move from end to middle position (e.g., an 88pp drop for MiMo-v2-Flash at 64K under with_solutions filler), with drops scaling with length for some models. Newer releases show smaller but persistent drops under questions_only_v2 filler. A diagnostic probe (end-position copy) recovers middle accuracy to within ±4pp of baseline at 8K across all nine models, and error analysis on the initial five-model set shows that 76% of middle-position errors match filler text vs. 22% at the end position. The work concludes that positional vulnerabilities cannot be measured without position-controlled benchmarks.

Significance. If the empirical patterns hold, the paper identifies a structural gap in reasoning benchmark design that affects interpretation of long-context model capabilities. The controlled variation of position, filler, and length, the cross-model consistency, and the direct diagnostic probe constitute concrete, falsifiable evidence for a positional account. This strengthens the case for revising evaluation practices, particularly as long-context releases increasingly appear in vendor tables without such controls.

major comments (2)
  1. [Methods] Methods / Experimental Setup: The manuscript provides no full description of prompt construction, response parsing rules, or exclusion criteria for invalid outputs. This directly limits verification of the central quantitative claims, including the 88pp drop, the 76%/22% filler-match rates, and the diagnostic probe recovery to ±4pp.
  2. [Results] Results sections (CRE experiments on GSM8K/ARC-Challenge): No error bars, confidence intervals, or statistical significance tests accompany the reported percentage-point drops across positions and lengths. Without these, the robustness of the position-dependent degradation and the claim that drops worsen with context length cannot be fully assessed.
minor comments (1)
  1. [Benchmark Audit] The benchmark audit would benefit from an explicit table listing the 11 benchmarks and the three control criteria with yes/no entries for each.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The two major comments identify clear opportunities to strengthen reproducibility and statistical presentation. We address both below and will incorporate the requested details and analyses in a revised manuscript.

point-by-point responses
  1. Referee: [Methods] Methods / Experimental Setup: The manuscript provides no full description of prompt construction, response parsing rules, or exclusion criteria for invalid outputs. This directly limits verification of the central quantitative claims, including the 88pp drop, the 76%/22% filler-match rates, and the diagnostic probe recovery to ±4pp.

    Authors: We agree that the current manuscript lacks sufficient detail on these aspects. In the revised version we will add a dedicated Experimental Setup subsection that specifies: (1) exact prompt templates and how contexts are assembled for each position/filler/length combination, (2) the deterministic parsing rules used to extract answers (including handling of non-numeric or malformed outputs), and (3) any exclusion criteria applied to invalid responses. These additions will enable independent reproduction of the reported accuracy figures, error-type distributions, and probe results.
    revision: yes

  2. Referee: [Results] Results sections (CRE experiments on GSM8K/ARC-Challenge): No error bars, confidence intervals, or statistical significance tests accompany the reported percentage-point drops across positions and lengths. Without these, the robustness of the position-dependent degradation and the claim that drops worsen with context length cannot be fully assessed.

    Authors: We acknowledge the absence of statistical measures in the presented results. Although the largest observed drops (e.g., 88 pp) are unlikely to arise from sampling variability alone, formal quantification is warranted. In revision we will (a) report results aggregated over multiple independent runs where feasible, (b) include error bars (standard deviation or 95% CI), and (c) apply appropriate statistical tests (paired proportion tests or bootstrap resampling) to assess the significance of position and length effects. These additions will be placed in the main results tables and figures.
    revision: yes
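
To make the bootstrap option in the second response concrete, here is a minimal sketch of a paired percentile bootstrap for the end-minus-middle accuracy drop. It assumes each test item is scored once at each position (1 for correct, 0 for incorrect); the function name, defaults, and percentile method are choices made for this example, not details from the manuscript.

```python
import random
from typing import List, Sequence, Tuple


def paired_bootstrap_ci(end_correct: Sequence[int], middle_correct: Sequence[int],
                        n_resamples: int = 2000, alpha: float = 0.05,
                        seed: int = 0) -> Tuple[float, float, float]:
    """Return (drop, lower, upper) in percentage points for end-minus-middle accuracy."""
    if len(end_correct) != len(middle_correct) or not end_correct:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(end_correct)
    # Per-item difference: +1 if only the end placement is correct, -1 if only the middle.
    diffs = [e - m for e, m in zip(end_correct, middle_correct)]
    point = 100.0 * sum(diffs) / n

    rng = random.Random(seed)
    stats: List[float] = []
    for _ in range(n_resamples):
        sample = rng.choices(diffs, k=n)  # resample items with replacement
        stats.append(100.0 * sum(sample) / n)
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return point, lower, upper
```

Resampling the per-item differences, rather than the two accuracy lists independently, keeps the pairing between positions intact, which matters because the same items are evaluated at both placements.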

Circularity Check

0 steps flagged

No significant circularity

The paper's central claims rest on a factual audit of 11 benchmarks (none jointly control position/filler/length) and direct empirical measurements of accuracy drops under controlled CRE conditions across nine models. No equations, fitted parameters, or derivations are present; the end-copy diagnostic and filler-match error rates (76% vs 22%) are measured outcomes rather than self-referential quantities. The audit is an enumeration, not an inference that reduces to prior self-citations. All load-bearing steps are externally verifiable against public benchmarks and model outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on empirical observations from benchmark audits and model evaluations rather than mathematical derivations. The primary added element is the CRE framework itself.

axioms (2)
  • [domain assumption] Accuracy on GSM8K and ARC-Challenge serves as a valid proxy for reasoning capability in long-context settings.
    The evaluation uses these datasets to measure the effect of position on reasoning performance.
  • [domain assumption] Task position, filler content, and context length can be independently controlled in prompt construction.
    This underpins the design of the CRE framework described in the abstract.
invented entities (1)
  • Context Rot Evaluation (CRE) [no independent evidence]
    purpose: Controlled framework that jointly varies task position, filler content, and context length for reasoning benchmarks.
    Newly proposed evaluation method in the paper.
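
The second axiom, that the three factors can be controlled independently, amounts to running a full factorial grid. The sketch below enumerates such a grid using only the factor levels named in the abstract. Whether these are the complete sets used in the paper is not stated here, and the exact token counts behind "8K", "32K", and "64K" are assumed.

```python
from itertools import product
from typing import Iterator, NamedTuple


class Condition(NamedTuple):
    position: str
    filler: str
    context_tokens: int


POSITIONS = ("end", "middle")
FILLERS = ("with_solutions", "questions_only_v2")  # filler names quoted in the abstract
LENGTHS = (8_000, 32_000, 64_000)                  # "8K", "32K", "64K"; exact counts assumed


def condition_grid() -> Iterator[Condition]:
    """Yield every combination so that no factor is confounded with another."""
    for position, filler, length in product(POSITIONS, FILLERS, LENGTHS):
        yield Condition(position, filler, length)
```

With these levels the grid has 2 × 2 × 3 = 12 conditions per model and dataset.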

pith-pipeline@v0.9.0 · 5911 in / 1438 out tokens · 33563 ms · 2026-05-25T04:56:17.102728+00:00 · methodology

