arxiv: 2605.02028 · v2 · pith:7KRUJZFKnew · submitted 2026-05-03 · 💻 cs.CL

Language models fail at extended rule following

Pith reviewed 2026-05-21 00:03 UTC · model grok-4.3

classification 💻 cs.CL
keywords: language models · rule following · counting task · internal states · agentic tasks · state preservation · mechanistic probing

The pith

Language models cannot preserve exact state during repeated rule applications beyond a limited threshold.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether language models can maintain an exact internal state while repeatedly applying the same rule, a skill needed for agentic tasks. It does this by asking 126 model variants to count long strings of repeated characters and finds that every model fails abruptly once it reaches a capacity that depends on both the model and the exact syntax of the input. These errors do not disappear when models are scaled up, given more inference-time compute, or supplied with external tools. Probing the models' activations shows they simulate the counting rule with only a finite set of internal states that eventually get exhausted. The same states appear to support more complex rule-following behavior, indicating that current architectures cannot deliver reliable extended rule following.
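
As a rough illustration of the setup described above, the sketch below builds a repeated-character counting query and finds the largest length a model answers exactly. The prompt wording, the names `make_prompt`, `counting_capacity`, and `toy_model`, and the "largest length before the first error" definition of capacity are all assumptions made for this example; the paper's actual protocol may differ.

```python
from typing import Callable


def make_prompt(n: int, char: str = "a") -> str:
    """Build a counting query over n repeated characters (hypothetical wording)."""
    return f"How many times does '{char}' appear in the following string?\n{char * n}"


def counting_capacity(model: Callable[[str], str], lengths: range, char: str = "a") -> int:
    """Return the largest tested length reached before the first wrong answer."""
    capacity = 0
    for n in lengths:
        reply = model(make_prompt(n, char)).strip()
        if reply != str(n):
            break  # first inexact answer ends the run
        capacity = n
    return capacity


def toy_model(prompt: str) -> str:
    """Stand-in for a real model call: exact up to 50, then a round number."""
    body = prompt.split("\n", 1)[1]  # the repeated-character string
    n = len(body)
    return str(n) if n <= 50 else "100"


print(counting_capacity(toy_model, range(1, 201)))  # 50
```

Swapping `toy_model` for a function that calls a real model would give a per-model estimate under this assumed definition. Changing `char` or the prompt wording is one way to look for the syntax sensitivity the paper reports.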

Core claim

Models rely on a finite number of internal states to mimic counting as a rule and fail once these states are exhausted, producing abrupt inaccuracies above a model-dependent, syntax-sensitive threshold that persists even with larger size, extra inference computation, or external tools.

What carries the argument

Finite internal states that models allocate to simulate repeated application of the counting rule.
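
A toy model can make the state-exhaustion idea concrete. The sketch below is not the paper's probing method. It assumes a counter with a fixed number of distinguishable states that saturates at its last state, and the names `FiniteStateCounter`, `count_with`, and `first_failure` are hypothetical. Saturation is itself an illustrative choice, since the paper reports that real models tend to default to rounded numbers after failure.

```python
from typing import Optional


class FiniteStateCounter:
    """Counter with a fixed set of distinguishable states: 0 .. num_states - 1."""

    def __init__(self, num_states: int) -> None:
        self.num_states = num_states
        self.state = 0

    def step(self) -> None:
        # Apply the "+1" rule; at the last state there is no further state to move to.
        self.state = min(self.state + 1, self.num_states - 1)


def count_with(num_states: int, n: int) -> int:
    """Apply the rule n times and return the count the toy counter reports."""
    counter = FiniteStateCounter(num_states)
    for _ in range(n):
        counter.step()
    return counter.state


def first_failure(num_states: int, max_n: int) -> Optional[int]:
    """Smallest n in 0..max_n at which the reported count is wrong, if any."""
    for n in range(max_n + 1):
        if count_with(num_states, n) != n:
            return n
    return None


print(first_failure(64, 200))  # 64
```

The toy counter is exact for every length below its state budget and wrong for every length at or above it, which mirrors the abrupt threshold described in the core claim.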

If this is right

  • Similar abrupt failures will appear in any task that demands sustained exact state over many rule steps.
  • Scaling model size, increasing inference compute, or adding external tools will not remove the counting capacity threshold.
  • Mechanistic inspection of internal states can expose the shared mechanism for both counting and more complex rule-based tasks.
  • Autonomous agents built from today's models will lack truly reliable rule-following capabilities.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Syntax sensitivity implies that small changes in how a rule is phrased can shift the point at which state exhaustion occurs.
  • The same internal-state bottleneck may limit performance on long-horizon planning or multi-step simulation.
  • Architectures that maintain explicit, expandable state outside the finite internal representation could avoid these hard limits.

Load-bearing premise

The repeated-character counting task requires and reveals the ability to preserve an exact state while applying a rule many times in succession.

What would settle it

A controlled test in which a current language model counts a string of several hundred identical characters with zero errors, using only its standard forward pass.

Figures

Figures reproduced from arXiv: 2605.02028 by Jonathan Fan, Tianxiang Dai.

Figure 1: Stable Counting Capacity as a fully mechanical benchmark for rule execution evaluation. a, Classes of LLM benchmarks. Knowledge-dependent benchmarks (left) evaluate a mixture of reasoning, factual recall, and tool usage, and they can be impacted by data contamination and leaderboard saturation. Mechanical benchmarks (right) isolate structural processing by applying a simple rule to a minimal sequence witho…
Figure 2: Model behavior at the point of counting failure. a, The tracking behavior of a representative model during a counting run. The model predicts the exact count perfectly before abruptly failing and defaulting to highly specific rounded numbers. b, A high resolution overlay of boundary behavior across all models. The transition from perfect rule execution to chaotic output is sudden, showing no controlled or …
Figure 3: Impact of token consumption and test-time compute on procedural state maintenance. a, Average total token consumption evaluated at the CC boundary. Higher token expenditure does not guarantee a greater counting capacity. b, A matched comparison between base non-reasoning models and their reasoning variants. Reasoning models consume dramatically more tokens during inference, but they show negligible improve…
Figure 4
Original abstract

Large language models are highly capable of answering difficult questions by retrieving, recombining, and attending to information in long contexts. For agentic tasks, an additional capability is required: the preservation of an exact state while repeatedly applying rules. We find that this reliability is absent across language models. To demonstrate, we query 126 leading model variants with the task of counting a long string of repeated characters, and we find they all cannot accurately count above a model-dependent, syntax-sensitive counting capacity threshold. Failures are abrupt and persist even with increasing model size, inference time computation, and external tool. Mechanistic probing indicates that models use a finite number of internal states to mimic counting as a rule and fail once these states are exhausted. Furthermore, such states are the basis for performing complex tasks beyond counting. These results indicate that fundamentally new model architectures are required for autonomous agents to achieve truly reliable rule following capabilities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it. The pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that large language models cannot reliably preserve an exact state while repeatedly applying rules, as evidenced by their inability to accurately count beyond a model-dependent threshold in long strings of repeated characters. This limitation is syntax-sensitive, abrupt, and persists with model scaling, increased inference-time computation, and the use of external tools. Mechanistic probing reveals that models employ a finite set of internal states to approximate counting, and the authors posit that these states underpin performance on more complex tasks, implying the need for fundamentally new architectures to enable reliable rule following in autonomous agents.

Significance. If substantiated, these findings would be significant for the field of AI agents and reliable reasoning systems. The work highlights a potential architectural limitation in current LLMs for long-horizon, stateful rule application, which is critical for agentic behaviors. By evaluating 126 model variants and incorporating mechanistic analysis, it offers empirical breadth and some insight into internal mechanisms. This could encourage development of models with unbounded or explicit state tracking capabilities. The broad testing and persistence of failures under various mitigations strengthen the case for the observed limitation.

major comments (2)
  1. [§5] §5 (Mechanistic Analysis): The assertion that the finite internal states 'are the basis for performing complex tasks beyond counting' is central to the architectural recommendation, yet the evidence appears limited to probing results on the repeated-character counting task. Without causal interventions such as activation patching or ablation on other multi-step rule-following tasks (e.g., state tracking in conditional instruction sequences), the generalization from counting failures to general rule following remains an extrapolation rather than a demonstrated mechanistic link.
  2. [§3] §3 (Experimental Setup): The claim of syntax sensitivity and abrupt failure thresholds is load-bearing for the core empirical result, but the manuscript should clarify whether controls were included for context length effects versus pure state exhaustion (e.g., by comparing to non-repetitive but equally long rule-application sequences). This distinction affects whether the counting task truly isolates the targeted capability.
minor comments (2)
  1. [Abstract] Abstract and §2: The phrasing 'persist even with ... external tool' would benefit from a brief description of the tool-use protocol (e.g., which tool and how state was passed) to allow readers to assess the mitigation attempt.
  2. [§4] Figure captions and §4: Ensure all plots of accuracy versus length explicitly label the model-dependent thresholds and include error bars or statistical tests for the abruptness of the drop-off.
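
Minor comment 2 asks for error bars on accuracy-versus-length plots. One way to produce them, assuming each length is queried several times, is a Wilson score interval per length. The sketch below also flags lengths whose interval lies entirely below the previous length's interval, which is a conservative heuristic for an abrupt drop rather than a formal test. The function names and the example data are made up for illustration and do not come from the paper.

```python
import math
from typing import Dict, List, Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def abrupt_drops(results: Dict[int, Tuple[int, int]]) -> List[int]:
    """Lengths whose interval sits entirely below the previous length's interval."""
    drops = []
    lengths = sorted(results)
    for prev, cur in zip(lengths, lengths[1:]):
        prev_lo, _ = wilson_interval(*results[prev])
        _, cur_hi = wilson_interval(*results[cur])
        if cur_hi < prev_lo:
            drops.append(cur)
    return drops


# Made-up data for illustration only: length -> (exact answers, trials)
example = {40: (20, 20), 50: (20, 20), 60: (1, 20), 70: (0, 20)}
print(abrupt_drops(example))  # [60]
```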

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us improve the clarity and rigor of our manuscript. We address each major comment in detail below, indicating the revisions made.

  1. Referee: [§5] §5 (Mechanistic Analysis): The assertion that the finite internal states 'are the basis for performing complex tasks beyond counting' is central to the architectural recommendation, yet the evidence appears limited to probing results on the repeated-character counting task. Without causal interventions such as activation patching or ablation on other multi-step rule-following tasks (e.g., state tracking in conditional instruction sequences), the generalization from counting failures to general rule following remains an extrapolation rather than a demonstrated mechanistic link.

    Authors: We agree that the link to complex tasks is an important claim and that our primary evidence comes from the counting task. The mechanistic probing demonstrates that models rely on a limited set of internal states for this rule-following behavior. We posit that this mechanism generalizes because many complex tasks, such as multi-step reasoning or instruction following, similarly require maintaining precise states over extended sequences. To address the concern, we have expanded the discussion in §5 to include references to prior work on state tracking in LLMs and clarified that the counting task serves as a minimal example of rule application. We have also noted the need for future causal studies on other tasks as a limitation. This revision strengthens the presentation without overclaiming. revision: partial

  2. Referee: [§3] §3 (Experimental Setup): The claim of syntax sensitivity and abrupt failure thresholds is load-bearing for the core empirical result, but the manuscript should clarify whether controls were included for context length effects versus pure state exhaustion (e.g., by comparing to non-repetitive but equally long rule-application sequences). This distinction affects whether the counting task truly isolates the targeted capability.

    Authors: We appreciate this point and have clarified the experimental controls in the revised §3. Our original experiments included variations in sequence length and syntax to show that the failure thresholds are model-dependent and occur well below the context limits, as models succeed on other long-context tasks. To directly address the distinction, we have added new control experiments using non-repetitive but long rule-application sequences (e.g., following conditional instructions over extended contexts). These controls confirm that failures are tied to repeated state updates rather than context length per se. The results are incorporated into the manuscript, with updated figures and text. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical measurements with no self-referential derivations

The paper reports direct experimental results from querying 126 model variants on a repeated-character counting task, measuring abrupt failure thresholds, and performing mechanistic probing to observe finite internal states. No equations, fitted parameters, or derivations are presented that reduce any claimed result to its own inputs by construction. The generalization that the observed states form the basis for complex rule-following is an interpretive claim supported by the counting experiments rather than a self-citation chain or definitional loop, leaving the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper rests on the domain assumption that exact state preservation is required for rule following and that internal state exhaustion explains both the counting failures and broader task limitations.

axioms (1)
  • Domain assumption: counting a long string of repeated characters requires preservation of an exact state while repeatedly applying rules.
    Explicitly stated in the abstract as the additional capability needed for agentic tasks beyond retrieval and recombination.


