Language models fail at extended rule following
Pith reviewed 2026-05-21 00:03 UTC · model grok-4.3
The pith
Language models cannot preserve exact state during repeated rule applications beyond a limited threshold.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Models rely on a finite number of internal states to mimic counting as a rule and fail once these states are exhausted, producing abrupt inaccuracies above a model-dependent, syntax-sensitive threshold that persists even with larger size, extra inference computation, or external tools.
What carries the argument
Finite internal states that models allocate to simulate repeated application of the counting rule.
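As a toy illustration of this claim only (it is not the paper's probing method, and it makes no assumption about how a real model stores state), the sketch below counts with a fixed budget of states. The names `saturating_count`, `first_failure_length`, and `num_states` are hypothetical. The counter is exact until the budget runs out and wrong for every longer input, which is the abrupt, threshold-style failure described above.

```python
from typing import Optional


def saturating_count(text: str, target: str, num_states: int) -> int:
    """Count occurrences of `target` using only `num_states` distinct states.

    States are 0 .. num_states - 1. Once the top state is reached the counter
    cannot advance, which stands in for the hypothesised state exhaustion.
    """
    state = 0
    for ch in text:
        if ch == target and state < num_states - 1:
            state += 1
    return state


def first_failure_length(num_states: int, max_length: int = 1000) -> Optional[int]:
    """Return the shortest run length at which the bounded counter is wrong."""
    for n in range(max_length + 1):
        if saturating_count("a" * n, "a", num_states) != n:
            return n
    return None  # no failure observed up to max_length
```

With a budget of 8 states the counter is correct for runs of 0 to 7 characters and first fails at length 8; raising the budget moves the threshold but does not remove it.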
If this is right
- Similar abrupt failures will appear in any task that demands sustained exact state over many rule steps.
- Scaling model size, increasing inference compute, or adding external tools will not raise the effective counting capacity.
- Mechanistic inspection of internal states can expose the shared mechanism for both counting and more complex rule-based tasks.
- Autonomous agents built from today's models will lack truly reliable rule-following capabilities.
Where Pith is reading between the lines
- Syntax sensitivity implies that small changes in how a rule is phrased can shift the point at which state exhaustion occurs.
- The same internal-state bottleneck may limit performance on long-horizon planning or multi-step simulation.
- Architectures that maintain explicit, expandable state outside the finite internal representation could avoid these hard limits.
Load-bearing premise
The repeated-character counting task requires and reveals the ability to preserve an exact state while applying a rule many times in succession.
What would settle it
A controlled test in which any current language model counts a string of several hundred identical characters with zero errors using only its standard forward pass.
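A test of this kind could be scripted roughly as follows. This is a minimal sketch under stated assumptions: the prompt wording, the `query_model` callable, and the `stub_model` stand-in are all hypothetical and are not taken from the paper, and a real run would replace the stub with a call to an actual model.

```python
import re
from typing import Callable, Dict, Iterable, Optional


def make_prompt(char: str, length: int) -> str:
    # Hypothetical prompt wording; the paper's exact prompts are not given here.
    return (
        f"How many times does the character '{char}' appear in the string "
        f"on the next line? Answer with a single integer.\n{char * length}"
    )


def parse_count(reply: str) -> Optional[int]:
    """Extract the first integer from a reply, tolerating thousands separators."""
    match = re.search(r"\d+", reply.replace(",", ""))
    return int(match.group()) if match else None


def exact_match_by_length(
    query_model: Callable[[str], str],
    lengths: Iterable[int],
    char: str = "a",
    trials: int = 5,
) -> Dict[int, float]:
    """Fraction of trials with an exactly correct count, per run length."""
    results = {}
    for n in lengths:
        correct = sum(
            parse_count(query_model(make_prompt(char, n))) == n
            for _ in range(trials)
        )
        results[n] = correct / trials
    return results


def stub_model(prompt: str) -> str:
    # Stand-in for a real model: counts exactly up to 100, then saturates.
    run = prompt.splitlines()[-1]
    return str(min(len(run), 100))
```

Calling `exact_match_by_length(stub_model, [10, 100, 101, 300])` gives full accuracy at lengths 10 and 100 and zero accuracy at 101 and 300. Under the criterion above, the claim would be settled against the paper by a real model that scores 1.0 at every length into the several hundreds.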
Original abstract
Large language models are highly capable of answering difficult questions by retrieving, recombining, and attending to information in long contexts. For agentic tasks, an additional capability is required: the preservation of an exact state while repeatedly applying rules. We find that this reliability is absent across language models. To demonstrate, we query 126 leading model variants with the task of counting a long string of repeated characters, and we find they all cannot accurately count above a model-dependent, syntax-sensitive counting capacity threshold. Failures are abrupt and persist even with increasing model size, inference time computation, and external tool. Mechanistic probing indicates that models use a finite number of internal states to mimic counting as a rule and fail once these states are exhausted. Furthermore, such states are the basis for performing complex tasks beyond counting. These results indicate that fundamentally new model architectures are required for autonomous agents to achieve truly reliable rule following capabilities.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that large language models cannot reliably preserve an exact state while repeatedly applying rules, as evidenced by their inability to accurately count beyond a model-dependent threshold in long strings of repeated characters. This limitation is syntax-sensitive, abrupt, and persists with model scaling, increased inference-time computation, and the use of external tools. Mechanistic probing reveals that models employ a finite set of internal states to approximate counting, and the authors posit that these states underpin performance on more complex tasks, implying the need for fundamentally new architectures to enable reliable rule following in autonomous agents.
Significance. If substantiated, these findings would be significant for the field of AI agents and reliable reasoning systems. The work highlights a potential architectural limitation in current LLMs for long-horizon, stateful rule application, which is critical for agentic behaviors. By evaluating 126 model variants and incorporating mechanistic analysis, it offers empirical breadth and some insight into internal mechanisms. This could encourage development of models with unbounded or explicit state tracking capabilities. The broad testing and persistence of failures under various mitigations strengthen the case for the observed limitation.
major comments (2)
- [§5] §5 (Mechanistic Analysis): The assertion that the finite internal states 'are the basis for performing complex tasks beyond counting' is central to the architectural recommendation, yet the evidence appears limited to probing results on the repeated-character counting task. Without causal interventions such as activation patching or ablation on other multi-step rule-following tasks (e.g., state tracking in conditional instruction sequences), the generalization from counting failures to general rule following remains an extrapolation rather than a demonstrated mechanistic link.
- [§3] §3 (Experimental Setup): The claim of syntax sensitivity and abrupt failure thresholds is load-bearing for the core empirical result, but the manuscript should clarify whether controls were included for context length effects versus pure state exhaustion (e.g., by comparing to non-repetitive but equally long rule-application sequences). This distinction affects whether the counting task truly isolates the targeted capability.
minor comments (2)
- [Abstract] Abstract and §2: The phrasing 'persist even with ... external tool' would benefit from a brief description of the tool-use protocol (e.g., which tool and how state was passed) to allow readers to assess the mitigation attempt.
- [§4] Figure captions and §4: Ensure all plots of accuracy versus length explicitly label the model-dependent thresholds and include error bars or statistical tests for the abruptness of the drop-off.
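The error bars and abruptness check requested in the last comment could be computed along the following lines. This is a sketch under assumptions, not the authors' analysis: it uses a Wilson score interval for per-length accuracy and reports the adjacent pair of lengths with the largest accuracy drop. The function names and the example counts in the usage note are hypothetical.

```python
import math
from typing import Dict, Optional, Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def sharpest_drop(
    successes_by_length: Dict[int, int], trials: int
) -> Optional[Tuple[int, int, float]]:
    """Return (shorter, longer, drop) for the largest accuracy fall between
    adjacent tested lengths, or None if fewer than two lengths were tested."""
    lengths = sorted(successes_by_length)
    best = None
    for shorter, longer in zip(lengths, lengths[1:]):
        drop = (successes_by_length[shorter] - successes_by_length[longer]) / trials
        if best is None or drop > best[2]:
            best = (shorter, longer, drop)
    return best
```

For example, with 20 trials per length and made-up success counts `{50: 20, 100: 20, 150: 19, 200: 2, 250: 0}`, `sharpest_drop` locates the fall between lengths 150 and 200. Non-overlapping Wilson intervals on either side of that pair would be one simple way to support calling the drop abrupt.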
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which have helped us improve the clarity and rigor of our manuscript. We address each major comment in detail below, indicating the revisions made.
Referee: [§5] §5 (Mechanistic Analysis): The assertion that the finite internal states 'are the basis for performing complex tasks beyond counting' is central to the architectural recommendation, yet the evidence appears limited to probing results on the repeated-character counting task. Without causal interventions such as activation patching or ablation on other multi-step rule-following tasks (e.g., state tracking in conditional instruction sequences), the generalization from counting failures to general rule following remains an extrapolation rather than a demonstrated mechanistic link.
Authors: We agree that the link to complex tasks is an important claim and that our primary evidence comes from the counting task. The mechanistic probing demonstrates that models rely on a limited set of internal states for this rule-following behavior. We posit that this mechanism generalizes because many complex tasks, such as multi-step reasoning or instruction following, similarly require maintaining precise states over extended sequences. To address the concern, we have expanded the discussion in §5 to include references to prior work on state tracking in LLMs and clarified that the counting task serves as a minimal example of rule application. We have also noted the need for future causal studies on other tasks as a limitation. This revision strengthens the presentation without overclaiming.
Revision: partial
Referee: [§3] §3 (Experimental Setup): The claim of syntax sensitivity and abrupt failure thresholds is load-bearing for the core empirical result, but the manuscript should clarify whether controls were included for context length effects versus pure state exhaustion (e.g., by comparing to non-repetitive but equally long rule-application sequences). This distinction affects whether the counting task truly isolates the targeted capability.
Authors: We appreciate this point and have clarified the experimental controls in the revised §3. Our original experiments included variations in sequence length and syntax to show that the failure thresholds are model-dependent and occur well below the context limits, as models succeed on other long-context tasks. To directly address the distinction, we have added new control experiments using non-repetitive but long rule-application sequences (e.g., following conditional instructions over extended contexts). These controls confirm that failures are tied to repeated state updates rather than context length per se. The results are incorporated into the manuscript, with updated figures and text.
Revision: yes
Circularity Check
No circularity: purely empirical measurements with no self-referential derivations
The paper reports direct experimental results from querying 126 model variants on a repeated-character counting task, measuring abrupt failure thresholds, and performing mechanistic probing to observe finite internal states. No equations, fitted parameters, or derivations are presented that reduce any claimed result to its own inputs by construction. The generalization that the observed states form the basis for complex rule-following is an interpretive claim supported by the counting experiments rather than a self-citation chain or definitional loop, leaving the work self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Counting a long string of repeated characters requires preservation of an exact state while repeatedly applying rules.