pith. machine review for the scientific record.

arxiv: 2603.11161 · v2 · submitted 2026-03-11 · 💻 cs.LG · cond-mat.dis-nn · stat.ML

Recognition: 2 theorem links · Lean Theorem

Algorithmic Task Capture, Computational Complexity, and Inductive Bias of Infinite Transformers

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 13:06 UTC · model grok-4.3

classification 💻 cs.LG · cond-mat.dis-nn · stat.ML
keywords algorithmic capture · inductive bias · infinite transformers · computational complexity · combinatorial tasks · scaling · lazy regime · rich regime

The pith

Transformers possess an inductive bias that disfavors higher-complexity algorithmic tasks despite universal expressivity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines algorithmic capture as the ability of a transformer to extrapolate combinatorial tasks to arbitrary sizes with controllable error and only logarithmic sample adaptation. This definition supplies a scaling criterion that separates genuine logic internalization from statistical interpolation. Analysis of infinite-width transformers in lazy and rich regimes produces upper bounds on the inference-time computational complexity of tasks that can be captured. The resulting inductive bias is consistent with observed capture on low-complexity tasks such as induction heads, sorting, and string matching. A sympathetic reader would care because the bias explains why scaling succeeds on some algorithmic problems but encounters hard limits on others.
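Read formally, the capture criterion amounts to the following schematic (our notation; the abstract states the criterion only in words, so take this as a reconstruction rather than the paper's exact definition):

```latex
% Schematic form of the capture criterion (our notation, not the paper's).
% A task family indexed by size T is "captured" from base size T_0 if, for
% every tolerance \delta > 0, there is a constant C(\delta) such that
\[
  \mathrm{err}_T\!\bigl(f_{\theta_T}\bigr) \le \delta
  \qquad\text{with}\qquad
  P_{\mathrm{extra}}(T,\delta) \;\le\; C(\delta)\,\log\!\frac{T}{T_0}
  \qquad\text{for all } T > T_0,
\]
% where P_extra counts the additional fine-tuning samples needed to regain
% \delta-accuracy at size T (the quantity fitted in Figure 1 below).
```

The log(T/T0) form is the same one the paper fits empirically in Figure 1.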

Core claim

Infinite transformers exhibit an inductive bias that disfavors higher-complexity algorithmic procedures within the efficient polynomial-time heuristic scheme class. Although these networks are universally expressive, analysis in both the lazy and rich regimes yields upper bounds on the inference-time computational complexity of the combinatorial tasks they can capture, a result consistent with successful capture on simpler tasks such as induction heads, sort, and string matching.

What carries the argument

Algorithmic capture, defined as extrapolation to arbitrary task sizes with controllable error and logarithmic sample adaptation, serves as the scaling criterion that distinguishes logic internalization from statistical interpolation.

If this is right

  • Transformers capture simpler combinatorial tasks such as induction heads, sorting, and string matching.
  • Higher-complexity procedures within the efficient polynomial-time class are disfavored by the inductive bias.
  • Upper bounds exist on the inference-time computational complexity of capturable tasks.
  • Empirical scaling across up to 2.5 orders of magnitude shows both capture and non-capture regimes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The bias may limit performance on complex reasoning benchmarks even when models are scaled further.
  • Architectures that alter the lazy or rich regime dynamics could potentially raise the complexity ceiling.
  • Similar capture definitions could be applied to other architectures to compare their inductive biases on algorithmic tasks.

Load-bearing premise

The formal definition of algorithmic capture sharply separates logic internalization from statistical interpolation.

What would settle it

Demonstrating that an infinite transformer captures a higher-complexity polynomial-time task, such as a specific quadratic heuristic, by extrapolating to arbitrary sizes with controllable error and logarithmic samples would falsify the claimed upper bounds.
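As one concrete instance of such a falsification target (our choice of task, not the paper's), the longest-common-subsequence length has a textbook Θ(T²) dynamic program. A minimal, self-contained generator of supervised instances would look like this:

```python
# Hypothetical falsification target: LCS length, whose standard dynamic
# program runs in Theta(T^2). The task choice is illustrative; the paper
# only says "a specific quadratic heuristic".
import random

def lcs_length(a, b):
    """Standard Theta(len(a) * len(b)) dynamic program for the LCS length."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def make_lcs_instance(T, alphabet="abcd", seed=None):
    """One supervised example of size T: a pair of strings and their LCS length."""
    rng = random.Random(seed)
    a = "".join(rng.choice(alphabet) for _ in range(T))
    b = "".join(rng.choice(alphabet) for _ in range(T))
    return (a, b), lcs_length(a, b)

print(make_lcs_instance(12, seed=0))
```

Sustained logarithmic sample adaptation on a task like this at arbitrary T is exactly what the derived upper bounds say should not happen.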

Figures

Figures reproduced from arXiv: 2603.11161 by Orit Davidovich, Zohar Ringel.

Figure 1
Figure 1: Empirical Verification of Algorithmic Capture. We train models on problem instances of size T0, reaching accuracy δ after seeing P0(δ) data points. We then fine-tune these models on larger instance sizes (T > T0), re-achieving δ-accuracy after seeing P extra datapoints. Dots are average empirical values based on 20-40 transformer networks, and solid lines are best fits to C log(T/T0), meant to guide the eye…
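A minimal sketch of the fit the caption describes, P(T) ≈ C·log(T/T0); the datapoint counts below are invented placeholders, not values read off the figure:

```python
# Fit extra fine-tuning samples P(T) to C * log(T / T0), as in the caption.
# The counts here are illustrative stand-ins, not the paper's data.
import numpy as np
from scipy.optimize import curve_fit

T0 = 8.0
T = np.array([16, 32, 64, 128, 256, 512], dtype=float)             # fine-tune instance sizes
P_extra = np.array([210, 430, 610, 820, 1010, 1240], dtype=float)  # illustrative sample counts

def log_adaptation(T, C):
    """Caption's functional form: extra samples grow like C * log(T / T0)."""
    return C * np.log(T / T0)

(C_fit,), _ = curve_fit(log_adaptation, T, P_extra, p0=[100.0])
print(f"best-fit C = {C_fit:.0f}; predicted P_extra at T=1024: {log_adaptation(1024.0, C_fit):.0f}")
```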
read the original abstract

We formally define algorithmic capture of combinatorial tasks as the ability of a transformer to extrapolate to arbitrary task sizes with controllable error and logarithmic sample adaptation, providing a sharp scaling criterion for distinguishing logic internalization from statistical interpolation. Empirically, across scaling ranges spanning up to 2.5 orders of magnitude, we observe evidence of capture and non-capture. By analyzing infinite-width transformers in both the lazy and rich regimes, we derive upper bounds on the inference-time computational complexity of the combinatorial tasks these networks can capture. We show that, despite their universal expressivity, transformers possess an inductive bias that disfavors higher-complexity algorithmic procedures within the efficient polynomial-time heuristic scheme class, consistent with successful capture on simpler combinatorial tasks such as induction heads, sort, and string matching.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper formally defines algorithmic capture of combinatorial tasks as extrapolation to arbitrary sizes with controllable error and logarithmic sample adaptation. It reports empirical evidence of capture on simpler tasks (induction heads, sort, string matching) across scaling ranges up to 2.5 orders of magnitude, and non-capture on others. For infinite-width transformers in lazy and rich regimes, it derives upper bounds on inference-time computational complexity, concluding that transformers possess an inductive bias disfavoring higher-complexity polynomial-time heuristic procedures despite universal expressivity.

Significance. If the central claims hold, the work supplies a sharp scaling criterion for separating logic internalization from statistical interpolation, together with explicit complexity upper bounds that could explain observed transformer biases toward simpler algorithmic procedures. The combination of a formal capture definition with regime-specific bounds is a substantive contribution to understanding inductive bias in sequence models.

major comments (2)
  1. [theoretical analysis of infinite-width regimes] The upper bounds on inference-time complexity are derived exclusively for infinite-width transformers in the lazy and rich regimes, yet the empirical capture results (extrapolation on sort, string matching, induction heads) are obtained with finite-width models. No explicit limit theorem or continuity argument is provided showing that the complexity restriction or the logarithmic-sample-adaptation property survives the finite-to-infinite transition.
  2. [discussion of inductive bias] The central claim that transformers disfavor higher-complexity procedures within the efficient polynomial-time heuristic scheme class rests on the infinite-width bounds; without a bridging result to finite models, the inductive-bias conclusion does not directly apply to the networks tested in the empirical sections.
minor comments (2)
  1. [empirical results] Empirical scaling plots lack error bars, data-exclusion criteria, and any statement of the number of random seeds or training runs.
  2. [experimental setup] The manuscript does not indicate whether code or trained models will be released, which would be required to verify the reported capture thresholds.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address the two major points below, acknowledging the gap between the infinite-width analysis and finite-width experiments.

read point-by-point responses
  1. Referee: The upper bounds on inference-time complexity are derived exclusively for infinite-width transformers in the lazy and rich regimes, yet the empirical capture results (extrapolation on sort, string matching, induction heads) are obtained with finite-width models. No explicit limit theorem or continuity argument is provided showing that the complexity restriction or the logarithmic-sample-adaptation property survives the finite-to-infinite transition.

    Authors: We agree that the upper bounds are derived strictly for the infinite-width limit in both regimes, while all empirical capture results use finite-width models, and no explicit limit theorem or continuity argument is supplied. This is a genuine limitation of the current manuscript. In the revision we will insert a new subsection in the discussion that explicitly states the bounds apply only in the infinite-width case and that the finite-width experiments are presented as consistent empirical illustrations rather than as a proof that the bounds carry over exactly. revision: yes

  2. Referee: The central claim that transformers disfavor higher-complexity procedures within the efficient polynomial-time heuristic scheme class rests on the infinite-width bounds; without a bridging result to finite models, the inductive-bias conclusion does not directly apply to the networks tested in the empirical sections.

    Authors: The referee correctly notes that the inductive-bias claim, as currently worded, depends on the infinite-width bounds. We do not possess a bridging result, so the precise complexity upper bounds cannot be asserted for the finite-width networks used in the experiments. In revision we will qualify the relevant statements in the abstract, introduction, and conclusion to make clear that the disfavoring of higher-complexity procedures is established for infinite-width transformers, while the finite-width results provide supporting evidence of capture on simpler tasks. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation uses independent infinite-width analysis

full rationale

The paper formally defines algorithmic capture as extrapolation with controllable error and log-sample adaptation, then analyzes infinite-width transformers in lazy/rich regimes to obtain complexity upper bounds. No quoted step reduces a derived bound or prediction to a fitted parameter, self-citation chain, or definitional tautology. The infinite-width bounds are derived from standard regime assumptions rather than from the empirical finite-width capture results, and the definition itself is not shown to be equivalent to the claimed inductive bias by construction. This is a standard non-circular structure.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on the infinite-width limit, the lazy/rich regime distinction, and the new capture definition. No explicit free parameters or invented physical entities are mentioned; the main added entity is the capture criterion itself.

axioms (1)
  • domain assumption: Infinite-width transformer limits exist and separate lazy and rich regimes
    Invoked to derive complexity upper bounds; standard in the cited theoretical literature on neural tangent kernels and feature learning. A toy sketch of the lazy/rich distinction follows this ledger.
invented entities (1)
  • algorithmic capture: no independent evidence
    purpose: Sharp scaling criterion to distinguish logic internalization from statistical interpolation
    Newly defined formal notion; no independent falsifiable handle outside the paper is provided in the abstract.
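The domain assumption above turns on the lazy/rich split. A toy two-layer sketch (ours; not a transformer and not the paper's construction) of how NTK-style 1/√width output scaling keeps hidden features nearly frozen while mean-field 1/width scaling with a width-scaled learning rate lets them move:

```python
# Toy two-layer illustration (ours, not the paper's transformer setup) of the
# lazy/rich split. "Lazy" uses NTK-style 1/sqrt(width) output scaling with an
# O(1) learning rate; "rich" uses mean-field 1/width scaling with a learning
# rate scaled up by the width.
import torch

def feature_movement(width, regime, steps=200, lr=0.1, seed=0):
    """Train briefly on a toy regression task and return the relative change
    in the hidden-layer weights, a crude proxy for feature learning."""
    torch.manual_seed(seed)
    x = torch.randn(64, 4)
    y = torch.sin(x.sum(dim=1, keepdim=True))
    W = torch.randn(width, 4, requires_grad=True)   # hidden-layer weights
    a = torch.randn(width, 1, requires_grad=True)   # readout weights
    W0 = W.detach().clone()
    scale = width ** 0.5 if regime == "lazy" else width   # output normalization
    step = lr if regime == "lazy" else lr * width         # rich regime: lr ~ width
    opt = torch.optim.SGD([W, a], lr=step)
    for _ in range(steps):
        opt.zero_grad()
        pred = torch.relu(x @ W.t()) @ a / scale
        loss = ((pred - y) ** 2).mean()
        loss.backward()
        opt.step()
    return float((W.detach() - W0).norm() / W0.norm())

for width in (64, 1024):
    print(width, {r: round(feature_movement(width, r), 4) for r in ("lazy", "rich")})
```

Run at increasing widths, the relative feature movement shrinks in the lazy column and stays roughly constant in the rich column, which is the separation the axiom assumes carries over to infinite attention layers.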

pith-pipeline@v0.9.0 · 5433 in / 1319 out tokens · 32012 ms · 2026-05-15T13:06:08.545146+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 3 internal anchors
