pith. machine review for the scientific record.

arxiv: 2603.11161 · v2 · submitted 2026-03-11 · 💻 cs.LG · cond-mat.dis-nn · stat.ML

Recognition: 2 theorem links · Lean Theorem

Algorithmic Task Capture, Computational Complexity, and Inductive Bias of Infinite Transformers

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 13:06 UTC · model grok-4.3

classification 💻 cs.LG · cond-mat.dis-nn · stat.ML
keywords algorithmic capture · inductive bias · infinite transformers · computational complexity · combinatorial tasks · scaling · lazy regime · rich regime

The pith

Transformers possess an inductive bias that disfavors higher-complexity algorithmic tasks despite universal expressivity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines algorithmic capture as the ability of a transformer to extrapolate combinatorial tasks to arbitrary sizes with controllable error and only logarithmic sample adaptation. This definition supplies a scaling criterion that separates genuine logic internalization from statistical interpolation. Analysis of infinite-width transformers in lazy and rich regimes produces upper bounds on the inference-time computational complexity of tasks that can be captured. The resulting inductive bias is consistent with observed capture on low-complexity tasks such as induction heads, sorting, and string matching. A sympathetic reader would care because the bias explains why scaling succeeds on some algorithmic problems but encounters hard limits on others.
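Read formally, the capture criterion amounts to the following schematic (our notation; the abstract states the criterion only in words, so take this as a reconstruction rather than the paper's exact definition):

```latex
% Schematic form of the capture criterion (our notation, not the paper's).
% A task family indexed by size T is "captured" from base size T_0 if, for
% every tolerance \delta > 0, there is a constant C(\delta) such that
\[
  \mathrm{err}_T\!\bigl(f_{\theta_T}\bigr) \le \delta
  \qquad\text{with}\qquad
  P_{\mathrm{extra}}(T,\delta) \;\le\; C(\delta)\,\log\!\frac{T}{T_0}
  \qquad\text{for all } T > T_0,
\]
% where P_extra counts the additional fine-tuning samples needed to regain
% \delta-accuracy at size T (the quantity fitted in Figure 1 below).
```

The log(T/T0) form is the same one the paper fits empirically in Figure 1.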

Core claim

Infinite transformers exhibit an inductive bias that disfavors higher-complexity algorithmic procedures within the efficient polynomial-time heuristic scheme class. Although these networks are universally expressive, analysis in both the lazy and rich regimes yields upper bounds on the inference-time computational complexity of the combinatorial tasks they can capture, a result consistent with successful capture on simpler tasks such as induction heads, sort, and string matching.

What carries the argument

Algorithmic capture, defined as extrapolation to arbitrary task sizes with controllable error and logarithmic sample adaptation, serves as the scaling criterion that distinguishes logic internalization from statistical interpolation.

If this is right

  • Transformers capture simpler combinatorial tasks such as induction heads, sorting, and string matching.
  • Higher-complexity procedures within the efficient polynomial-time class are disfavored by the inductive bias.
  • Upper bounds exist on the inference-time computational complexity of capturable tasks.
  • Empirical scaling across up to 2.5 orders of magnitude shows both capture and non-capture regimes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The bias may limit performance on complex reasoning benchmarks even when models are scaled further.
  • Architectures that alter the lazy or rich regime dynamics could potentially raise the complexity ceiling.
  • Similar capture definitions could be applied to other architectures to compare their inductive biases on algorithmic tasks.

Load-bearing premise

The formal definition of algorithmic capture sharply separates logic internalization from statistical interpolation.

What would settle it

Demonstrating that an infinite transformer captures a higher-complexity polynomial-time task, such as a specific quadratic heuristic, by extrapolating to arbitrary sizes with controllable error and logarithmic samples would falsify the claimed upper bounds.
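As one concrete instance of such a falsification target (our choice of task, not the paper's), the longest-common-subsequence length has a textbook Θ(T²) dynamic program. A minimal, self-contained generator of supervised instances would look like this:

```python
# Hypothetical falsification target: LCS length, whose standard dynamic
# program runs in Theta(T^2). The task choice is illustrative; the paper
# only says "a specific quadratic heuristic".
import random

def lcs_length(a, b):
    """Standard Theta(len(a) * len(b)) dynamic program for the LCS length."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def make_lcs_instance(T, alphabet="abcd", seed=None):
    """One supervised example of size T: a pair of strings and their LCS length."""
    rng = random.Random(seed)
    a = "".join(rng.choice(alphabet) for _ in range(T))
    b = "".join(rng.choice(alphabet) for _ in range(T))
    return (a, b), lcs_length(a, b)

print(make_lcs_instance(12, seed=0))
```

Sustained logarithmic sample adaptation on a task like this at arbitrary T is exactly what the derived upper bounds say should not happen.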

Figures

Figures reproduced from arXiv: 2603.11161 by Orit Davidovich, Zohar Ringel.

Figure 1
Figure 1: Empirical Verification of Algorithmic Capture. We train models on problem instances of size T0, reaching accuracy δ after seeing P0(δ) data points. We then fine-tune these models on larger instance sizes (T > T0), re-achieving δ-accuracy after seeing P extra datapoints. Dots are average empirical values based on 20-40 transformer networks, and solid lines are best fits to C log(T/T0), meant to guide the eye…
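A minimal sketch of the fit the caption describes, P(T) ≈ C·log(T/T0); the datapoint counts below are invented placeholders, not values read off the figure:

```python
# Fit extra fine-tuning samples P(T) to C * log(T / T0), as in the caption.
# The counts here are illustrative stand-ins, not the paper's data.
import numpy as np
from scipy.optimize import curve_fit

T0 = 8.0
T = np.array([16, 32, 64, 128, 256, 512], dtype=float)             # fine-tune instance sizes
P_extra = np.array([210, 430, 610, 820, 1010, 1240], dtype=float)  # illustrative sample counts

def log_adaptation(T, C):
    """Caption's functional form: extra samples grow like C * log(T / T0)."""
    return C * np.log(T / T0)

(C_fit,), _ = curve_fit(log_adaptation, T, P_extra, p0=[100.0])
print(f"best-fit C = {C_fit:.0f}; predicted P_extra at T=1024: {log_adaptation(1024.0, C_fit):.0f}")
```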
read the original abstract

We formally define algorithmic capture of combinatorial tasks as the ability of a transformer to extrapolate to arbitrary task sizes with controllable error and logarithmic sample adaptation, providing a sharp scaling criterion for distinguishing logic internalization from statistical interpolation. Empirically, across scaling ranges spanning up to 2.5 orders of magnitude, we observe evidence of capture and non-capture. By analyzing infinite-width transformers in both the lazy and rich regimes, we derive upper bounds on the inference-time computational complexity of the combinatorial tasks these networks can capture. We show that, despite their universal expressivity, transformers possess an inductive bias that disfavors higher-complexity algorithmic procedures within the efficient polynomial-time heuristic scheme class, consistent with successful capture on simpler combinatorial tasks such as induction heads, sort, and string matching.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper formally defines algorithmic capture of combinatorial tasks as extrapolation to arbitrary sizes with controllable error and logarithmic sample adaptation. It reports empirical evidence of capture on simpler tasks (induction heads, sort, string matching) across scaling ranges up to 2.5 orders of magnitude, and non-capture on others. For infinite-width transformers in lazy and rich regimes, it derives upper bounds on inference-time computational complexity, concluding that transformers possess an inductive bias disfavoring higher-complexity polynomial-time heuristic procedures despite universal expressivity.

Significance. If the central claims hold, the work supplies a sharp scaling criterion for separating logic internalization from statistical interpolation, together with explicit complexity upper bounds that could explain observed transformer biases toward simpler algorithmic procedures. The combination of a formal capture definition with regime-specific bounds is a substantive contribution to understanding inductive bias in sequence models.

major comments (2)
  1. [theoretical analysis of infinite-width regimes] The upper bounds on inference-time complexity are derived exclusively for infinite-width transformers in the lazy and rich regimes, yet the empirical capture results (extrapolation on sort, string matching, induction heads) are obtained with finite-width models. No explicit limit theorem or continuity argument is provided showing that the complexity restriction or the logarithmic-sample-adaptation property survives the finite-to-infinite transition.
  2. [discussion of inductive bias] The central claim that transformers disfavor higher-complexity procedures within the efficient polynomial-time heuristic scheme class rests on the infinite-width bounds; without a bridging result to finite models, the inductive-bias conclusion does not directly apply to the networks tested in the empirical sections.
minor comments (2)
  1. [empirical results] Empirical scaling plots lack error bars, data-exclusion criteria, and any statement of the number of random seeds or training runs.
  2. [experimental setup] The manuscript does not indicate whether code or trained models will be released, which would be required to verify the reported capture thresholds.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address the two major points below, acknowledging the gap between the infinite-width analysis and finite-width experiments.

read point-by-point responses
  1. Referee: The upper bounds on inference-time complexity are derived exclusively for infinite-width transformers in the lazy and rich regimes, yet the empirical capture results (extrapolation on sort, string matching, induction heads) are obtained with finite-width models. No explicit limit theorem or continuity argument is provided showing that the complexity restriction or the logarithmic-sample-adaptation property survives the finite-to-infinite transition.

    Authors: We agree that the upper bounds are derived strictly for the infinite-width limit in both regimes, while all empirical capture results use finite-width models, and no explicit limit theorem or continuity argument is supplied. This is a genuine limitation of the current manuscript. In the revision we will insert a new subsection in the discussion that explicitly states the bounds apply only in the infinite-width case and that the finite-width experiments are presented as consistent empirical illustrations rather than as a proof that the bounds carry over exactly. revision: yes

  2. Referee: The central claim that transformers disfavor higher-complexity procedures within the efficient polynomial-time heuristic scheme class rests on the infinite-width bounds; without a bridging result to finite models, the inductive-bias conclusion does not directly apply to the networks tested in the empirical sections.

    Authors: The referee correctly notes that the inductive-bias claim, as currently worded, depends on the infinite-width bounds. We do not possess a bridging result, so the precise complexity upper bounds cannot be asserted for the finite-width networks used in the experiments. In revision we will qualify the relevant statements in the abstract, introduction, and conclusion to make clear that the disfavoring of higher-complexity procedures is established for infinite-width transformers, while the finite-width results provide supporting evidence of capture on simpler tasks. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation uses independent infinite-width analysis

full rationale

The paper formally defines algorithmic capture as extrapolation with controllable error and log-sample adaptation, then analyzes infinite-width transformers in lazy/rich regimes to obtain complexity upper bounds. No quoted step reduces a derived bound or prediction to a fitted parameter, self-citation chain, or definitional tautology. The infinite-width bounds are derived from standard regime assumptions rather than from the empirical finite-width capture results, and the definition itself is not shown to be equivalent to the claimed inductive bias by construction. This is a standard non-circular structure.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on the infinite-width limit, the lazy/rich regime distinction, and the new capture definition. No explicit free parameters or invented physical entities are mentioned; the main added entity is the capture criterion itself.

axioms (1)
  • domain assumption: Infinite-width transformer limits exist and separate lazy and rich regimes
    Invoked to derive complexity upper bounds; standard in the cited theoretical literature on neural tangent kernels and feature learning. A toy sketch of the lazy/rich distinction follows this ledger.
invented entities (1)
  • algorithmic capture: no independent evidence
    purpose: Sharp scaling criterion to distinguish logic internalization from statistical interpolation
    Newly defined formal notion; no independent falsifiable handle outside the paper is provided in the abstract.
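The domain assumption above turns on the lazy/rich split. A toy two-layer sketch (ours; not a transformer and not the paper's construction) of how NTK-style 1/√width output scaling keeps hidden features nearly frozen while mean-field 1/width scaling with a width-scaled learning rate lets them move:

```python
# Toy two-layer illustration (ours, not the paper's transformer setup) of the
# lazy/rich split. "Lazy" uses NTK-style 1/sqrt(width) output scaling with an
# O(1) learning rate; "rich" uses mean-field 1/width scaling with a learning
# rate scaled up by the width.
import torch

def feature_movement(width, regime, steps=200, lr=0.1, seed=0):
    """Train briefly on a toy regression task and return the relative change
    in the hidden-layer weights, a crude proxy for feature learning."""
    torch.manual_seed(seed)
    x = torch.randn(64, 4)
    y = torch.sin(x.sum(dim=1, keepdim=True))
    W = torch.randn(width, 4, requires_grad=True)   # hidden-layer weights
    a = torch.randn(width, 1, requires_grad=True)   # readout weights
    W0 = W.detach().clone()
    scale = width ** 0.5 if regime == "lazy" else width   # output normalization
    step = lr if regime == "lazy" else lr * width         # rich regime: lr ~ width
    opt = torch.optim.SGD([W, a], lr=step)
    for _ in range(steps):
        opt.zero_grad()
        pred = torch.relu(x @ W.t()) @ a / scale
        loss = ((pred - y) ** 2).mean()
        loss.backward()
        opt.step()
    return float((W.detach() - W0).norm() / W0.norm())

for width in (64, 1024):
    print(width, {r: round(feature_movement(width, r), 4) for r in ("lazy", "rich")})
```

Run at increasing widths, the relative feature movement shrinks in the lazy column and stays roughly constant in the rich column, which is the separation the axiom assumes carries over to infinite attention layers.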

pith-pipeline@v0.9.0 · 5433 in / 1319 out tokens · 32012 ms · 2026-05-15T13:06:08.545146+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 3 internal anchors
