Recognition: 3 Lean theorem links
Counting as a minimal probe of language model reliability
Pith reviewed 2026-05-08 19:34 UTC · model grok-4.3
The pith
Language models have a stable counting capacity far below their context limits; they appear to rely on a finite set of count-like internal states and collapse to guessing once that resource is exhausted.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Model behavior is consistent neither with open-ended logic nor with stable application of a learned rule, but instead with use of a finite set of count-like internal states, analogous to counting on fingers. Once this resource is exhausted, the appearance of rule following disappears and exact execution collapses into guessing.
Load-bearing premise
That the observed collapse in counting performance directly indicates a lack of general logical competence rather than a task-specific limitation in how models represent or maintain count information.
Original abstract
Large language models perform strongly on benchmarks in mathematical reasoning, coding and document analysis, suggesting a broad ability to follow instructions. However, it remains unclear whether such success reflects general logical competence, repeated application of learned procedures, or pattern matching that mimics rule execution. We investigate this question by introducing Stable Counting Capacity, an assay in which models count repeated symbols until failure. The assay removes knowledge dependencies, semantics and ambiguity from evaluation, avoids lexical and tokenization confounds, and provides a direct measure of procedural reliability beyond standard knowledge-based benchmarks. Here we show, across more than 100 model variants, that stable counting capacity remains far below advertised context limits. Model behavior is consistent neither with open-ended logic nor with stable application of a learned rule, but instead with use of a finite set of count-like internal states, analogous to counting on fingers. Once this resource is exhausted, the appearance of rule following disappears and exact execution collapses into guessing, even with additional test-time compute. These findings show that fluent performance in current language models does not guarantee general, reliable rule following.
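The assay the abstract describes can be pictured as a simple harness: present a run of n identical symbols, ask for the count, and report the largest n answered exactly on every trial. The sketch below is illustrative only, not the paper's actual protocol; `ask_count`, `make_prompt`, and the `finger_counter` stub are hypothetical names standing in for a model API, with the stub deliberately built to have k count-like states and to guess beyond them.

```python
import random

def make_prompt(symbol: str, n: int) -> str:
    """Homogeneous-sequence prompt: no semantics or world knowledge needed."""
    return f"Count the symbols: {symbol * n}"

def stable_counting_capacity(ask_count, symbol: str = "x",
                             n_max: int = 256, trials: int = 5) -> int:
    """Largest run length n answered exactly on every trial.

    `ask_count(prompt) -> int` stands in for a model call; the assay
    sweeps n upward until exact execution gives way to guessing.
    """
    capacity = 0
    for n in range(1, n_max + 1):
        if all(ask_count(make_prompt(symbol, n)) == n for _ in range(trials)):
            capacity = n
        else:
            break  # first unreliable n ends the stable regime
    return capacity

def finger_counter(k: int = 12, seed: int = 0):
    """Toy respondent with k count-like states: exact up to k, then guesses."""
    rng = random.Random(seed)
    def ask_count(prompt: str) -> int:
        payload = prompt.split(": ", 1)[1]
        n = len(payload)
        return n if n <= k else rng.randint(0, k)  # resource exhausted: guess
    return ask_count

print(stable_counting_capacity(finger_counter(k=12)))  # prints 12
```

Run against the stub, the measured capacity recovers k exactly, which is the signature behavior the paper attributes to real models: a sharp boundary rather than graceful degradation.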
Editorial analysis
A structured set of objections, weighed in public.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The counting assay removes knowledge dependencies, semantics, ambiguity, and tokenization confounds.
invented entities (1)
- finite set of count-like internal states (no independent evidence)
Lean theorems connected to this paper
- Foundation/ArithmeticFromLogic.lean: LogicNat / equivNat (initial Peano object from distinction) · tag: unclear
  unclear: relation between the paper passage and the cited Recognition theorem.
  passage: "model behavior is consistent neither with open-ended logic nor with stable application of a learned rule, but instead with use of a finite set of count-like internal states"
- Foundation/ArithmeticOf.lean: ArithmeticOf.canonical / PeanoSurface · tag: unclear
  unclear: relation between the paper passage and the cited Recognition theorem.
  passage: "Stable Counting Capacity (SCC), a purely mechanical assay that utilizes a minimal probe based on homogeneous sequence counting"
- Cost (J-cost), Constants (φ, c, ℏ, G): n/a (no φ-ladder or J-cost structure invoked) · tag: unclear
  unclear: relation between the paper passage and the cited Recognition theorem.
  passage: "Counting capacities... spanning a broad range, with newer models often supporting larger CCs"
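The cited Recognition theorems are not reproduced on this page, so their exact statements are unknown here. As a purely illustrative sketch of the shape an "initial Peano object" construction like LogicNat / equivNat could take, assuming nothing beyond a zero/successor inductive type and a round-trip with Lean's built-in Nat (the names below are placeholders, not the canon's):

```lean
-- Illustrative only: a minimal zero/successor type and one direction
-- of an equivalence with Nat, sketching an "initial Peano object".
inductive LogicNat where
  | zero : LogicNat
  | succ : LogicNat → LogicNat

def toNat : LogicNat → Nat
  | .zero   => 0
  | .succ n => toNat n + 1

def ofNat : Nat → LogicNat
  | 0     => .zero
  | n + 1 => .succ (ofNat n)

theorem toNat_ofNat (n : Nat) : toNat (ofNat n) = n := by
  induction n with
  | zero => rfl
  | succ n ih => simp [toNat, ofNat, ih]
```

Whether the canon's actual theorems bear on the paper's empirical claim is exactly what the "unclear" tags above flag.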
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Trinh, T. H., Wu, Y., Le, Q. V., He, H. & Luong, T. Solving olympiad geometry without human demonstrations. Nature 625, 476–482 (2024).
- [2] He, C. et al. OlympiadBench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems (2024). URL https://aclanthology.org/2024.acl-long.211/
- [3] Chen, M. et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021).
- [4] Bai, Y. et al. LongBench: A bilingual, multitask benchmark for long context understanding (2024). URL https://aclanthology.org/2024.acl-long.172/
- [5] Yehudai, G. et al. When can transformers count to n? (2024). URL https://openreview.net/forum?id=WULjblaCoc
- [6] Delétang, G. et al. Neural networks and the Chomsky hierarchy (2023). URL https://openreview.net/forum?id=WbxHAzkeQcn
- [7] Chen, S. et al. Benchmarking large language models under data contamination: A survey from static to dynamic evaluation (2025). URL https://aclanthology.org/2025.emnlp-main.511/
- [8] Cosma, A., Ruseti, S., Radoi, E. & Dascalu, M. The strawberry problem: Emergence of character-level understanding in tokenized language models (2025).
- [9] Rein, D. et al. GPQA: A graduate-level Google-proof Q&A benchmark (2023). URL https://arxiv.org/abs/2311.12022. arXiv:2311.12022
- [10] Jimenez, C. E. et al. SWE-bench: Can language models resolve real-world GitHub issues? (2024). URL https://proceedings.iclr.cc/paper_files/paper/2024/hash/edac78c3e300629acfe6cbe9ca88fb84-Abstract-Conference.html
- [11] Wei, J. et al. Chain-of-thought prompting elicits reasoning in large language models (2022). URL https://proceedings.neurips.cc/paper_files/paper/2022/hash/9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html
- [12] Kojima, T., Gu, S. S., Reid, M., Matsuo, Y. & Iwasawa, Y. Large language models are zero-shot reasoners (2022). URL https://papers.nips.cc/paper_files/paper/2022/hash/8bb0d291acd4acf06ef112099c16f326-Abstract-Conference.html
- [13] Snell, C., Lee, J., Xu, K. & Kumar, A. Scaling LLM test-time compute optimally can be more effective than scaling model parameters (2024). URL https://arxiv.org/abs/2408.03314. arXiv:2408.03314
- [14] Liang, P. et al. Holistic evaluation of language models. Transactions on Machine Learning Research (2023). URL https://openreview.net/forum?id=iO4LZibEqW
- [15] Hendrycks, D. et al. Measuring massive multitask language understanding (2021). URL https://openreview.net/forum?id=d7KBjmI3GmQ
- [16] Srivastava, A. et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research (2023). URL https://openreview.net/forum?id=uyTL5Bvosj
- [17] White, C. et al. LiveBench: A challenging, contamination-free LLM benchmark (2024). URL https://arxiv.org/abs/2406.19314. arXiv:2406.19314
- [18] Hai, N. L., Nguyen, D. M. & Bui, N. D. Q. REPOEXEC: Evaluate code generation with a repository-level executable benchmark (2024). URL https://arxiv.org/abs/2406.11927. arXiv:2406.11927
- [19] Wu, Y., Hee, M. S., Hu, Z. & Lee, R. K.-W. LongGenBench: Benchmarking long-form generation in long context LLMs (2024). URL https://arxiv.org/abs/2409.02076. arXiv:2409.02076
- [20] Yao, S. et al. ReAct: Synergizing reasoning and acting in language models (2023). URL https://openreview.net/forum?id=WE_vluYUL-X
- [21] Schick, T. et al. Toolformer: Language models can teach themselves to use tools (2023). URL https://proceedings.neurips.cc/paper_files/paper/2023/hash/d842425e4bf79ba039352da0f658a906-Abstract-Conference.html
- [22] Chollet, F., Knoop, M., Kamradt, G., Landers, B. & Pinkard, H. ARC-AGI-2: A new challenge for frontier AI reasoning systems (2025). URL https://arxiv.org/abs/2505.11831. arXiv:2505.11831
- [23] Barbero, F. et al. Interpreting the repeated token phenomenon in large language models (2024).
- [24] DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025).
- [25] Gemma Team. Gemma 3 technical report (2025). URL https://arxiv.org/abs/2503.19786. arXiv:2503.19786
- [26] McDougall, C. et al. Gemma Scope 2 technical paper (2025). URL https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/gemma-scope-2-helping-the-ai-safety-community-deepen-understanding-of-complex-language-model-behavior/Gemma_Scope_2_Technical_Paper.pdf. Google technical paper.
- [27] Gao, L. et al. Scaling and evaluating sparse autoencoders (2025). URL https://openreview.net/forum?id=tcsZt9ZNKD
- [28] ARC Prize Foundation. ARC-AGI-3: A new challenge for frontier agentic intelligence (2026). arXiv:2603.24621
- [29] Shazeer, N. et al. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer (2017). URL https://openreview.net/forum?id=B1ckMDqlg
- [30] Dettmers, T., Pagnoni, A., Holtzman, A. & Zettlemoyer, L. QLoRA: Efficient finetuning of quantized LLMs (2023). URL https://proceedings.neurips.cc/paper_files/paper/2023/hash/1feb87871436031bdc0f2beaa62a049b-Abstract-Conference.html
- [31] Lin, J. et al. AWQ: Activation-aware weight quantization for LLM compression and acceleration (2024). URL https://proceedings.mlsys.org/paper_files/paper/2024/hash/42a452cbafa9dd64e9ba4aa95cc1ef21-Abstract-Conference.html
- [32] Hahn, M. Theoretical limitations of self-attention in neural sequence models. Transactions of the Association for Computational Linguistics 8, 156–171 (2020). URL https://aclanthology.org/2020.tacl-1.11/
- [33] Strobl, L., Merrill, W., Weiss, G., Chiang, D. & Angluin, D. What formal languages can transformers express? A survey. Transactions of the Association for Computational Linguistics 12 (2024). URL https://aclanthology.org/2024.tacl-1.30/
- [34] Dai, Z. et al. Transformer-XL: Attentive language models beyond a fixed-length context (2019). URL https://aclanthology.org/P19-1285/
- [35] Bulatov, A., Kuratov, Y. & Burtsev, M. S. Recurrent memory transformer (2022). URL https://proceedings.neurips.cc/paper_files/paper/2022/hash/47e288629a6996a17ce50b90a056a0e1-Abstract-Conference.html
- [36] Wu, Y., Rabe, M. N., Hutchins, D. & Szegedy, C. Memorizing transformers (2022). URL https://openreview.net/forum?id=TrjbxzRcnf-
- [37] Borgeaud, S. et al. Improving language models by retrieving from trillions of tokens (2022). URL https://proceedings.mlr.press/v162/borgeaud22a.html
- [38] SWE-bench. SWE-bench leaderboard and verified subset (2026). URL https://www.swebench.com/. Official benchmark website.
- [39] Epoch AI. OTIS mock AIME 2024–2025 (2025). URL https://epoch.ai/benchmarks/otis-mock-aime-2024-2025. Official benchmark description page.
- [40] Suzgun, M. et al. Challenging BIG-bench tasks and whether chain-of-thought can solve them (2023). URL https://aclanthology.org/2023.findings-acl.824/
- [41] Gu, A. et al. CRUXEval: A benchmark for code reasoning, understanding and execution (2024). URL https://proceedings.mlr.press/v235/gu24d.html
- [42] Lightman, H. et al. Let's verify step by step (2024). URL https://openreview.net/forum?id=v8L0pN6EOi
- [43] Wang, Y. et al. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark (2024). URL https://proceedings.neurips.cc/paper_files/paper/2024/hash/ad236edc564f3e3156e1b2feafb99a24-Abstract-Datasets_and_Benchmarks_Track.html
- [44] Elhage, N. et al. A mathematical framework for transformer circuits (2021). URL https://transformer-circuits.pub/2021/framework/index.html. Transformer Circuits thread.
- [45] Shai, A. S., Marzen, S. E., Teixeira, L., Oldenziel, A. G. & Riechers, P. M. Transformers represent belief state geometry in their residual stream (2024). URL https://arxiv.org/abs/2405.15943. arXiv:2405.15943