Language models fail at extended rule following

Jonathan Fan; Tianxiang Dai

arxiv: 2605.02028 · v2 · pith:7KRUJZFKnew · submitted 2026-05-03 · 💻 cs.CL

Pith reviewed 2026-05-21 00:03 UTC · model grok-4.3

keywords: language models, rule following, counting task, internal states, agentic tasks, state preservation, mechanistic probing

The pith

Language models cannot preserve exact state during repeated rule applications beyond a limited threshold.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether language models can maintain an exact internal state while repeatedly applying the same rule, a skill needed for agentic tasks. It does this by asking 126 model variants to count long strings of repeated characters and finds that every model fails abruptly once it reaches a capacity that depends on both the model and the exact syntax of the input. These errors do not disappear when models are scaled up, given more inference-time compute, or supplied with external tools. Probing the models' activations shows they simulate the counting rule with only a finite set of internal states that eventually get exhausted. The same states appear to support more complex rule-following behavior, indicating that current architectures cannot deliver reliable extended rule following.
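A minimal sketch of how a counting capacity of this kind could be measured. The prompt wording, the `model` callable, and the `toy_model` stand-in are all assumptions made for illustration; the paper's actual prompts and scoring protocol are not given in this summary.

```python
from typing import Callable


def make_prompt(n: int, char: str = "a") -> str:
    """Build a counting prompt with n copies of char (hypothetical wording)."""
    return f"How many '{char}' characters are in this string? {char * n}"


def counting_capacity(model: Callable[[str], str], lengths: range, char: str = "a") -> int:
    """Return the largest tested length answered exactly before the first error."""
    capacity = 0
    for n in lengths:
        reply = model(make_prompt(n, char)).strip()
        if reply != str(n):
            break  # first wrong answer marks the capacity boundary
        capacity = n
    return capacity


def toy_model(prompt: str) -> str:
    """Stand-in for a real model: exact up to 50, then a rounded guess."""
    n = len(prompt.split("? ", 1)[1])
    return str(n) if n <= 50 else "100"


print(counting_capacity(toy_model, range(1, 201)))
```

In practice `toy_model` would be replaced by a function that sends the prompt to a real model and returns its text reply; the stand-in only imitates the abrupt failure pattern described in the summary.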

Core claim

Models rely on a finite number of internal states to mimic counting as a rule and fail once these states are exhausted, producing abrupt inaccuracies above a model-dependent, syntax-sensitive threshold that persists even with larger size, extra inference computation, or external tools.

What carries the argument

Finite internal states that models allocate to simulate repeated application of the counting rule.
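A toy illustration of why a finite pool of states would produce an abrupt failure rather than gradual drift. The saturating counter below is an assumption chosen for simplicity; it is not the mechanism recovered by the paper's probing, only one simple way a finite state set can run out.

```python
from typing import Optional


def saturating_count(text: str, num_states: int) -> int:
    """Count characters with a counter limited to the values 0..num_states-1."""
    state = 0
    for _ in text:
        # Once the last state is reached, further input cannot be distinguished.
        state = min(state + 1, num_states - 1)
    return state


def first_failure(num_states: int, max_len: int = 1000) -> Optional[int]:
    """Return the shortest length at which the limited counter is wrong."""
    for n in range(1, max_len + 1):
        if saturating_count("a" * n, num_states) != n:
            return n
    return None


print(first_failure(64))
```

With 64 states the counter is exact for every length up to 63 and wrong for every length from 64 onward, which mirrors the perfect-then-failing pattern the summary describes.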

If this is right

Similar abrupt failures will appear in any task that demands sustained exact state over many rule steps.
Scaling model size, increasing inference compute, or adding external tools will not raise the effective counting capacity.
Mechanistic inspection of internal states can expose the shared mechanism for both counting and more complex rule-based tasks.
Autonomous agents built from today's models will lack truly reliable rule-following capabilities.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Syntax sensitivity implies that small changes in how a rule is phrased can shift the point at which state exhaustion occurs.
The same internal-state bottleneck may limit performance on long-horizon planning or multi-step simulation.
Architectures that maintain explicit, expandable state outside the finite internal representation could avoid these hard limits.
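The syntax-sensitivity point can be probed by presenting the same count in different surface forms. The sketch below generates such variants; the specific separators are hypothetical and are not claimed to be the ones tested in the paper.

```python
def syntax_variants(n: int, char: str = "a") -> dict:
    """Return the same n-item sequence rendered with different separators."""
    items = [char] * n
    return {
        "contiguous": "".join(items),
        "space_separated": " ".join(items),
        "comma_separated": ",".join(items),
        "newline_separated": "\n".join(items),
    }


variants = syntax_variants(5)
for name, text in variants.items():
    print(name, repr(text))
```

Each variant holds exactly the same number of target characters, so any difference in the length at which a model starts to miscount would be attributable to surface form rather than to the count itself.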

Load-bearing premise

The repeated-character counting task requires and reveals the ability to preserve an exact state while applying a rule many times in succession.

What would settle it

A controlled test in which any current language model counts a string of several hundred identical characters with zero errors using only its standard forward pass.

Figures

Figures reproduced from arXiv: 2605.02028 by Jonathan Fan, Tianxiang Dai.

**Figure 1.** Stable Counting Capacity as a fully mechanical benchmark for rule execution evaluation. a, Classes of LLM benchmarks. Knowledge-dependent benchmarks (left) evaluate a mixture of reasoning, factual recall, and tool usage, and they can be impacted by data contamination and leaderboard saturation. Mechanical benchmarks (right) isolate structural processing by applying a simple rule to a minimal sequence witho…

**Figure 2.** Model behavior at the point of counting failure. a, The tracking behavior of a representative model during a counting run. The model predicts the exact count perfectly before abruptly failing and defaulting to highly specific rounded numbers. b, A high resolution overlay of boundary behavior across all models. The transition from perfect rule execution to chaotic output is sudden, showing no controlled or …

**Figure 3.** Impact of token consumption and test-time compute on procedural state maintenance. a, Average total token consumption evaluated at the CC boundary. Higher token expenditure does not guarantee a greater counting capacity. b, A matched comparison between base non-reasoning models and their reasoning variants. Reasoning models consume dramatically more tokens during inference, but they show negligible improve…

Original abstract

Large language models are highly capable of answering difficult questions by retrieving, recombining, and attending to information in long contexts. For agentic tasks, an additional capability is required: the preservation of an exact state while repeatedly applying rules. We find that this reliability is absent across language models. To demonstrate, we query 126 leading model variants with the task of counting a long string of repeated characters, and we find they all cannot accurately count above a model-dependent, syntax-sensitive counting capacity threshold. Failures are abrupt and persist even with increasing model size, inference time computation, and external tool. Mechanistic probing indicates that models use a finite number of internal states to mimic counting as a rule and fail once these states are exhausted. Furthermore, such states are the basis for performing complex tasks beyond counting. These results indicate that fundamentally new model architectures are required for autonomous agents to achieve truly reliable rule following capabilities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that large language models cannot reliably preserve an exact state while repeatedly applying rules, as evidenced by their inability to accurately count beyond a model-dependent threshold in long strings of repeated characters. This limitation is syntax-sensitive, abrupt, and persists with model scaling, increased inference-time computation, and the use of external tools. Mechanistic probing reveals that models employ a finite set of internal states to approximate counting, and the authors posit that these states underpin performance on more complex tasks, implying the need for fundamentally new architectures to enable reliable rule following in autonomous agents.

Significance. If substantiated, these findings would be significant for the field of AI agents and reliable reasoning systems. The work highlights a potential architectural limitation in current LLMs for long-horizon, stateful rule application, which is critical for agentic behaviors. By evaluating 126 model variants and incorporating mechanistic analysis, it offers empirical breadth and some insight into internal mechanisms. This could encourage development of models with unbounded or explicit state tracking capabilities. The broad testing and persistence of failures under various mitigations strengthen the case for the observed limitation.

major comments (2)

§5 (Mechanistic Analysis): The assertion that the finite internal states 'are the basis for performing complex tasks beyond counting' is central to the architectural recommendation, yet the evidence appears limited to probing results on the repeated-character counting task. Without causal interventions such as activation patching or ablation on other multi-step rule-following tasks (e.g., state tracking in conditional instruction sequences), the generalization from counting failures to general rule following remains an extrapolation rather than a demonstrated mechanistic link.
§3 (Experimental Setup): The claim of syntax sensitivity and abrupt failure thresholds is load-bearing for the core empirical result, but the manuscript should clarify whether controls were included for context length effects versus pure state exhaustion (e.g., by comparing to non-repetitive but equally long rule-application sequences). This distinction affects whether the counting task truly isolates the targeted capability.

minor comments (2)

[Abstract] Abstract and §2: The phrasing 'persist even with ... external tool' would benefit from a brief description of the tool-use protocol (e.g., which tool and how state was passed) to allow readers to assess the mitigation attempt.
[§4] Figure captions and §4: Ensure all plots of accuracy versus length explicitly label the model-dependent thresholds and include error bars or statistical tests for the abruptness of the drop-off.
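One way the requested error bars could be computed is a Wilson score interval on per-length accuracy. This is a sketch of a standard method, offered as an assumption about what would satisfy the comment; it is not something the manuscript is reported to use, and the trial counts below are invented for illustration.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


# Hypothetical counts: 10/10 correct just below a threshold, 0/10 just above.
print(wilson_interval(10, 10))
print(wilson_interval(0, 10))
```

If the intervals at adjacent lengths on either side of a threshold do not overlap, that supports describing the drop as abrupt rather than as noise around a gradual decline.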

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us improve the clarity and rigor of our manuscript. We address each major comment in detail below, indicating the revisions made.

Referee: §5 (Mechanistic Analysis): The assertion that the finite internal states 'are the basis for performing complex tasks beyond counting' is central to the architectural recommendation, yet the evidence appears limited to probing results on the repeated-character counting task. Without causal interventions such as activation patching or ablation on other multi-step rule-following tasks (e.g., state tracking in conditional instruction sequences), the generalization from counting failures to general rule following remains an extrapolation rather than a demonstrated mechanistic link.

Authors: We agree that the link to complex tasks is an important claim and that our primary evidence comes from the counting task. The mechanistic probing demonstrates that models rely on a limited set of internal states for this rule-following behavior. We posit that this mechanism generalizes because many complex tasks, such as multi-step reasoning or instruction following, similarly require maintaining precise states over extended sequences. To address the concern, we have expanded the discussion in §5 to include references to prior work on state tracking in LLMs and clarified that the counting task serves as a minimal example of rule application. We have also noted the need for future causal studies on other tasks as a limitation. This revision strengthens the presentation without overclaiming.

Revision: partial

Referee: §3 (Experimental Setup): The claim of syntax sensitivity and abrupt failure thresholds is load-bearing for the core empirical result, but the manuscript should clarify whether controls were included for context length effects versus pure state exhaustion (e.g., by comparing to non-repetitive but equally long rule-application sequences). This distinction affects whether the counting task truly isolates the targeted capability.

Authors: We appreciate this point and have clarified the experimental controls in the revised §3. Our original experiments included variations in sequence length and syntax to show that the failure thresholds are model-dependent and occur well below the context limits, as models succeed on other long-context tasks. To directly address the distinction, we have added new control experiments using non-repetitive but long rule-application sequences (e.g., following conditional instructions over extended contexts). These controls confirm that failures are tied to repeated state updates rather than context length per se. The results are incorporated into the manuscript, with updated figures and text.

Revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical measurements with no self-referential derivations

The paper reports direct experimental results from querying 126 model variants on a repeated-character counting task, measuring abrupt failure thresholds, and performing mechanistic probing to observe finite internal states. No equations, fitted parameters, or derivations are presented that reduce any claimed result to its own inputs by construction. The generalization that the observed states form the basis for complex rule-following is an interpretive claim supported by the counting experiments rather than a self-citation chain or definitional loop, leaving the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on the domain assumption that exact state preservation is required for rule following and that internal state exhaustion explains both the counting failures and broader task limitations.

axioms (1)

Domain assumption: Counting a long string of repeated characters requires preservation of an exact state while repeatedly applying rules.
Explicitly stated in the abstract as the additional capability needed for agentic tasks beyond retrieval and recombination.


Showing first 80 references.

[1] [1]

Transactions on Machine Learning Research , year =

Holistic Evaluation of Language Models , author =. Transactions on Machine Learning Research , year =

work page

[2] [2]

International Conference on Learning Representations , year =

Measuring Massive Multitask Language Understanding , author =. International Conference on Learning Representations , year =

work page

[3] [3]

Transactions on Machine Learning Research , year =

Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models , author =. Transactions on Machine Learning Research , year =

work page

[4] [4]

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

Rein, David and Hou, Betty Li and Stickland, Asa Cooper and Petty, Jackson and Pang, Richard Yuanzhe and Dirani, Julien and Michael, Julian and Bowman, Samuel R. , year =. doi:10.48550/arXiv.2311.12022 , url =. 2311.12022 , archivePrefix=

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2311.12022

[5] [5]

and Yang, John and Wettig, Alexander and Yao, Shunyu and Pei, Kexin and Press, Ofir and Narasimhan, Karthik , booktitle =

Jimenez, Carlos E. and Yang, John and Wettig, Alexander and Yao, Shunyu and Pei, Kexin and Press, Ofir and Narasimhan, Karthik , booktitle =. 2024 , url =

work page 2024

[6] [6]

LiveBench: A Challenging, Contamination-Limited LLM Benchmark

White, Colin and Dooley, Samuel and Roberts, Manley and Pal, Arka and Feuer, Ben and Jain, Siddhartha and Shwartz-Ziv, Ravid and Jain, Neel and Saifullah, Khalid and Naidu, Siddartha and Hegde, Chinmay and LeCun, Yann and Goldstein, Tom and Neiswanger, Willie and Goldblum, Micah , year =. LiveBench: A Challenging, Contamination-Free. doi:10.48550/arXiv.24...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2406.19314

[7] [7]

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , year =

Benchmarking Large Language Models Under Data Contamination: A Survey from Static to Dynamic Evaluation , author =. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , year =. doi:10.18653/v1/2025.emnlp-main.511 , url =

work page doi:10.18653/v1/2025.emnlp-main.511 2025

[8] [8]

Hai, Nam Le and Nguyen, Dung Manh and Bui, Nghi D. Q. , year =. doi:10.48550/arXiv.2406.11927 , url =. 2406.11927 , archivePrefix=

work page doi:10.48550/arxiv.2406.11927

[9] [9]

LongGenBench: Benchmarking Long-Form Generation in Long Context

Wu, Yuhao and Hee, Ming Shan and Hu, Zhiqing and Lee, Roy Ka-Wei , year =. LongGenBench: Benchmarking Long-Form Generation in Long Context. doi:10.48550/arXiv.2409.02076 , url =. 2409.02076 , archivePrefix=

work page doi:10.48550/arxiv.2409.02076

[10] [10]

2023 , url =

Yao, Shunyu and Zhao, Jeffrey and Yu, Dian and Du, Nan and Shafran, Izhak and Narasimhan, Karthik and Cao, Yuan , booktitle =. 2023 , url =

work page 2023

[11] [11]

Advances in Neural Information Processing Systems , year =

Toolformer: Language Models Can Teach Themselves to Use Tools , author =. Advances in Neural Information Processing Systems , year =

work page

[12] [12]

Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang

Lost in the Middle: How Language Models Use Long Contexts , author =. Transactions of the Association for Computational Linguistics , volume =. 2024 , publisher =. doi:10.1162/tacl_a_00638 , url =

work page doi:10.1162/tacl_a_00638 2024

[13] [13]

NeedleBench: Evaluating

Li, Mo and Zhang, Songyang and Zhang, Taolin and Duan, Haodong and Liu, Yunxin and Chen, Kai , journal =. NeedleBench: Evaluating. 2025 , url =

work page 2025

[14] [14]

Advances in Neural Information Processing Systems , year =

Attention Is All You Need , author =. Advances in Neural Information Processing Systems , year =

work page

[15]

Elhage, Nelson and Nanda, Neel and Olsson, Catherine and Henighan, Tom and Joseph, Nicholas and Mann, Ben and Askell, Amanda and Bai, Yuntao and others. 2021.

[16]

In-context Learning and Induction Heads. 2022. doi:10.48550/arXiv.2209.11895

[17]

Chollet, Francois and Knoop, Mike and Kamradt, Gregory and Landers, Bryan and Pinkard, Henry. ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems. arXiv:2505.11831. doi:10.48550/arXiv.2505.11831

[18]

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Advances in Neural Information Processing Systems.

[19]

Large Language Models are Zero-Shot Reasoners. Advances in Neural Information Processing Systems.

[20]

Snell, Charlie and Lee, Jaehoon and Xu, Kelvin and Kumar, Aviral. Scaling. arXiv:2408.03314. doi:10.48550/arXiv.2408.03314

[21]

Theoretical Limitations of Self-Attention in Neural Sequence Models. Transactions of the Association for Computational Linguistics, 2020.

[22]

Neural Networks and the Chomsky Hierarchy. International Conference on Learning Representations.

[23]

What Formal Languages Can Transformers Express? A Survey. Transactions of the Association for Computational Linguistics, 2024.

[24]

Transformers Represent Belief State Geometry in Their Residual Stream. 2024. doi:10.48550/arXiv.2405.15943

[25]

Scaling and Evaluating Sparse Autoencoders. International Conference on Learning Representations.

[26]

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. International Conference on Learning Representations.

[27]

Dettmers, Tim and Pagnoni, Artidoro and Holtzman, Ari and Zettlemoyer, Luke. 2023.

[28]

Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Chen, Wei-Ming and Wang, Wei-Chen and Xiao, Guangxuan and Dang, Xingyu and Gan, Chuang and Han, Song. 2024.

[29]

Dai, Zihang and Yang, Zhilin and Yang, Yiming and Carbonell, Jaime and Le, Quoc V. and Salakhutdinov, Ruslan. Transformer-. 2019. doi:10.18653/v1/P19-1285

[30]

Recurrent Memory Transformer. Advances in Neural Information Processing Systems.

[31]

Memorizing Transformers. International Conference on Learning Representations.

[32]

Improving Language Models by Retrieving from Trillions of Tokens. Proceedings of the 39th International Conference on Machine Learning.

[33]

Gemma 3 Technical Report. 2025.

[34]

Gemma Scope 2 - Technical Paper. 2025.

[35]

Interpreting the Repeated Token Phenomenon in Large Language Models. Proceedings of the 42nd International Conference on Machine Learning, 2025.

[36]

When Can Transformers Count to n? International Conference on Learning Representations.

[37]

arXiv preprint arXiv:2501.12948.

[38]

Hsieh, Cheng-Ping and Sun, Simeng and Kriman, Samuel and Acharya, Shantanu and Rekesh, Dima and Jia, Fei and Ginsburg, Boris. 2024.

[39]

Suzgun, Mirac and Scales, Nathan and Sch. Challenging. Findings of the Association for Computational Linguistics: ACL 2023. doi:10.18653/v1/2023.findings-acl.824

[40]

Gu, Alex and Roziere, Baptiste and Leather, Hugh James and Solar-Lezama, Armando and Synnaeve, Gabriel and Wang, Sida. 2024.

[41]

Let's Verify Step by Step. International Conference on Learning Representations.

[42]

Wang, Yubo and Ma, Xueguang and Zhang, Ge and Ni, Yuansheng and Chandra, Abhranil and Guo, Shiguang and Ren, Weiming and Arulraj, Aaran and He, Xuan and Jiang, Ziyan and Li, Tianle and Ku, Max and Wang, Kai and Zhuang, Alex and Fan, Rongqi and Yue, Xiang and Chen, Wenhu. 2024.

[43]

Solving olympiad geometry without human demonstrations. Nature, 2024.

[44]

He, Chaoqun and Luo, Renjie and Bai, Yuzhuo and Hu, Shengding and Thai, Zhen and Shen, Junhao and Hu, Jinyi and Han, Xu and Huang, Yujie and Zhang, Yuxiang and Liu, Jie and Qi, Lei and Liu, Zhiyuan and Sun, Maosong. OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems. 2024. doi:10.18653/v1/2024.acl-long.211

[45]

Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374. doi:10.48550/arXiv.2107.03374

[46]

Li, Yujia and Choi, David and Chung, Junyoung and Kushman, Nate and Schrittwieser, Julian and Leblond, R. Competition-Level Code Generation with. Science, 2022.

[47]

Bai, Yushi and Lv, Xin and Zhang, Jiajie and Lyu, Hongchang and Tang, Jiankai and Huang, Zhidian and Du, Zhengxiao and Liu, Xiao and Zeng, Aohan and Hou, Lei and Dong, Yuxiao and Tang, Jie and Li, Juanzi. LongBench:. 2024. doi:10.18653/v1/2024.acl-long.172

[48]

Counting Ability of Large Language Models and Impact of Tokenization. arXiv preprint arXiv:2410.19730. doi:10.48550/arXiv.2410.19730

[49]

Fu, Tairan and Ferrando, Raquel and Conde, Javier and Arriaga, Carlos and Reviriego, Pedro. Why Do Large Language Models (. 2024. arXiv:2412.18626

[50]

Xu, Nan and Ma, Xuezhe. 2025. doi:10.18653/v1/2025.naacl-long.172

[51]

The Strawberry Problem: Emergence of Character-level Understanding in Tokenized Language Models. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025.

[52]

Interpreting the Repeated Token Phenomenon in Large Language Models. Proceedings of the International Conference on Machine Learning (ICML).

[53]

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs. 2023.

[54]

Robert Lupoiu and Yixuan Shao and Tianxiang Dai and Chenkai Mao and Kofi Edée and Jonathan A. Fan. Science Advances, 2025. doi:10.1126/sciadv.adx8006. https://www.science.org/doi/pdf/10.1126/sciadv.adx8006

[55]

Claude 4 (Claude Opus 4) announcement.

[56]

Claude Opus 4.1 announcement.

[57]

Claude Opus 4.5 announcement.

[58]

Claude Sonnet 4.5 announcement.

[59]

Claude Opus 4.6 announcement.

[60]

gpt-5.4 model page.

[61]

Claude 4 (Claude Sonnet 4) announcement.

[62]

Claude Sonnet 4.6 announcement.

[63]

Claude 3.7 Sonnet announcement.

[64]

Extended thinking for Claude 3.7 Sonnet.

[65]

gemini-3-pro-preview entry in Gemini models guide.

[66]

gemini-3.1-pro-preview entry in Gemini models guide.

[67]

gemini-3-flash-preview entry in Gemini models guide.

[68]

gpt-5.3-codex model page.

[69]

Claude Haiku 4.5 announcement.

[70]

gpt-5.2 model page.

[71]

Claude 3.5 Haiku addendum.

[72]

gpt-5.4-mini model page.

[73]

gemini-2.5-pro entry in Gemini models guide.

[74]

gemini-3.1-flash-lite-preview entry in Gemini models guide.

[75]

gpt-5.2-codex model page.

[76]

gpt-5.1 model page.

[77]

Kimi K2 Instruct 0905 model card.

[78]

gpt-5 model page.

[79]

Llama 4 Maverick 17B 128E Instruct model card.

[80]

gpt-4.1 model page.