Recognition: 2 theorem links
Algorithmic Task Capture, Computational Complexity, and Inductive Bias of Infinite Transformers
Pith reviewed 2026-05-15 13:06 UTC · model grok-4.3
The pith
Transformers possess an inductive bias that disfavors higher-complexity algorithmic tasks despite universal expressivity.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Infinite transformers exhibit an inductive bias that disfavors higher-complexity algorithmic procedures within the efficient polynomial-time heuristic scheme class. Although these networks are universally expressive, analysis in both the lazy and rich regimes yields upper bounds on the inference-time computational complexity of the combinatorial tasks they can capture, consistent with successful capture of simpler tasks such as induction heads, sort, and string matching.
What carries the argument
Algorithmic capture, defined as extrapolation to arbitrary task sizes with controllable error and logarithmic sample adaptation, serves as the scaling criterion that distinguishes logic internalization from statistical interpolation.
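The capture definition can be read as two empirical checks on accuracy-versus-size data: error stays below a chosen tolerance at sizes well beyond the training range, and the number of adaptation samples grows at most logarithmically in task size. A minimal sketch of such a check, with illustrative data and thresholds that are our own rather than the paper's:

```python
import math

def is_captured(results, tolerance=0.05, max_train_size=32):
    """Toy version of the capture criterion.

    results: dict mapping task size n -> (error_rate, samples_needed).
    Capture requires (i) error below `tolerance` at sizes past the
    training range and (ii) samples_needed growing at most ~ c*log(n).
    All thresholds here are illustrative, not the paper's.
    """
    extrapolation = {n: v for n, v in results.items() if n > max_train_size}
    if not extrapolation:
        return False
    # (i) controllable error at every extrapolated size
    if any(err > tolerance for err, _ in extrapolation.values()):
        return False
    # (ii) logarithmic sample adaptation: samples / log(n) stays bounded
    ratios = [s / math.log(n) for n, (_, s) in extrapolation.items()]
    return max(ratios) <= 2 * min(ratios)  # crude boundedness check

# Toy data consistent with capture: low error, ~log-many samples.
captured = {64: (0.01, 12), 256: (0.02, 17), 1024: (0.03, 21)}
# Toy data consistent with non-capture: error blows up with size.
not_captured = {64: (0.02, 12), 256: (0.20, 40), 1024: (0.45, 200)}

print(is_captured(captured), is_captured(not_captured))  # True False
```

The point of the sketch is only that the criterion is operational: given scaling curves, capture versus non-capture is a decidable property of the data, which is what lets it distinguish logic internalization from statistical interpolation.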
If this is right
- Transformers capture simpler combinatorial tasks such as induction heads, sorting, and string matching.
- Higher-complexity procedures within the efficient polynomial-time class are disfavored by the inductive bias.
- Upper bounds exist on the inference-time computational complexity of capturable tasks.
- Empirical scaling across up to 2.5 orders of magnitude shows both capture and non-capture regimes.
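For concreteness, the three simpler tasks named above can be instantiated as size-parameterized problems, so that extrapolation to larger n is well defined. A minimal sketch of such generators (our illustration, not the paper's exact setup):

```python
import random

def induction_head_instance(n, vocab=8):
    """[a, b, ..., a] -> b: recall the token that followed a's first occurrence."""
    seq = [random.randrange(vocab) for _ in range(n)]
    a, b = seq[0], seq[1]
    seq.append(a)       # query token repeats the first token
    return seq, b       # target: the token that followed it

def sort_instance(n, vocab=100):
    """Input sequence -> its sorted permutation."""
    seq = [random.randrange(vocab) for _ in range(n)]
    return seq, sorted(seq)

def string_match_instance(n, m=3, vocab=4):
    """Binary task: does the length-m pattern occur in the length-n text?"""
    text = [random.randrange(vocab) for _ in range(n)]
    pattern = [random.randrange(vocab) for _ in range(m)]
    hit = any(text[i:i + m] == pattern for i in range(n - m + 1))
    return (text, pattern), hit

random.seed(0)
seq, target = induction_head_instance(16)
assert seq[-1] == seq[0] and target == seq[1]
```

Each generator scales with n, so a trained model can be evaluated at sizes far past its training range, which is exactly the regime where the capture criterion bites.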
Where Pith is reading between the lines
- The bias may limit performance on complex reasoning benchmarks even when models are scaled further.
- Architectures that alter the lazy or rich regime dynamics could potentially raise the complexity ceiling.
- Similar capture definitions could be applied to other architectures to compare their inductive biases on algorithmic tasks.
Load-bearing premise
The formal definition of algorithmic capture sharply separates logic internalization from statistical interpolation.
What would settle it
Demonstrating that an infinite transformer captures a higher-complexity polynomial-time task, such as a specific quadratic heuristic, by extrapolating to arbitrary sizes with controllable error and logarithmic samples would falsify the claimed upper bounds.
Original abstract
We formally define algorithmic capture of combinatorial tasks as the ability of a transformer to extrapolate to arbitrary task sizes with controllable error and logarithmic sample adaptation, providing a sharp scaling criterion for distinguishing logic internalization from statistical interpolation. Empirically, across scaling ranges spanning up to 2.5 orders of magnitude, we observe evidence of capture and non-capture. By analyzing infinite-width transformers in both the lazy and rich regimes, we derive upper bounds on the inference-time computational complexity of the combinatorial tasks these networks can capture. We show that, despite their universal expressivity, transformers possess an inductive bias that disfavors higher-complexity algorithmic procedures within the efficient polynomial-time heuristic scheme class, consistent with successful capture on simpler combinatorial tasks such as induction heads, sort, and string matching.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper formally defines algorithmic capture of combinatorial tasks as extrapolation to arbitrary sizes with controllable error and logarithmic sample adaptation. It reports empirical evidence of capture on simpler tasks (induction heads, sort, string matching) across scaling ranges up to 2.5 orders of magnitude, and non-capture on others. For infinite-width transformers in lazy and rich regimes, it derives upper bounds on inference-time computational complexity, concluding that transformers possess an inductive bias disfavoring higher-complexity polynomial-time heuristic procedures despite universal expressivity.
Significance. If the central claims hold, the work supplies a sharp scaling criterion to separate logic internalization from statistical interpolation, together with explicit complexity upper bounds that could explain observed transformer biases toward simpler algorithmic procedures. The combination of a formal capture definition with regime-specific bounds is a substantive contribution to understanding inductive bias in sequence models.
major comments (2)
- [theoretical analysis of infinite-width regimes] The upper bounds on inference-time complexity are derived exclusively for infinite-width transformers in the lazy and rich regimes, yet the empirical capture results (extrapolation on sort, string matching, induction heads) are obtained with finite-width models. No explicit limit theorem or continuity argument is provided showing that the complexity restriction or the logarithmic-sample-adaptation property survives the finite-to-infinite transition.
- [discussion of inductive bias] The central claim that transformers disfavor higher-complexity procedures within the efficient polynomial-time heuristic scheme class rests on the infinite-width bounds; without a bridging result to finite models, the inductive-bias conclusion does not directly apply to the networks tested in the empirical sections.
minor comments (2)
- [empirical results] Empirical scaling plots lack error bars, data-exclusion criteria, and any statement of the number of random seeds or training runs.
- [experimental setup] The manuscript does not indicate whether code or trained models will be released, which would be required to verify the reported capture thresholds.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address the two major points below, acknowledging the gap between the infinite-width analysis and finite-width experiments.
Point-by-point responses
- Referee: The upper bounds on inference-time complexity are derived exclusively for infinite-width transformers in the lazy and rich regimes, yet the empirical capture results (extrapolation on sort, string matching, induction heads) are obtained with finite-width models. No explicit limit theorem or continuity argument is provided showing that the complexity restriction or the logarithmic-sample-adaptation property survives the finite-to-infinite transition.
Authors: We agree that the upper bounds are derived strictly for the infinite-width limit in both regimes, while all empirical capture results use finite-width models, and no explicit limit theorem or continuity argument is supplied. This is a genuine limitation of the current manuscript. In the revision we will insert a new subsection in the discussion that explicitly states the bounds apply only in the infinite-width case and that the finite-width experiments are presented as consistent empirical illustrations rather than as a proof that the bounds carry over exactly. revision: yes
- Referee: The central claim that transformers disfavor higher-complexity procedures within the efficient polynomial-time heuristic scheme class rests on the infinite-width bounds; without a bridging result to finite models, the inductive-bias conclusion does not directly apply to the networks tested in the empirical sections.
Authors: The referee correctly notes that the inductive-bias claim, as currently worded, depends on the infinite-width bounds. We do not possess a bridging result, so the precise complexity upper bounds cannot be asserted for the finite-width networks used in the experiments. In revision we will qualify the relevant statements in the abstract, introduction, and conclusion to make clear that the disfavoring of higher-complexity procedures is established for infinite-width transformers, while the finite-width results provide supporting evidence of capture on simpler tasks. revision: yes
Circularity Check
No circularity: derivation uses independent infinite-width analysis
full rationale
The paper formally defines algorithmic capture as extrapolation with controllable error and log-sample adaptation, then analyzes infinite-width transformers in lazy/rich regimes to obtain complexity upper bounds. No quoted step reduces a derived bound or prediction to a fitted parameter, self-citation chain, or definitional tautology. The infinite-width bounds are derived from standard regime assumptions rather than from the empirical finite-width capture results, and the definition itself is not shown to be equivalent to the claimed inductive bias by construction. This is a standard non-circular structure.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Infinite-width transformer limits exist and separate the lazy and rich regimes.
invented entities (1)
- algorithmic capture (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
unclear: Relation between the paper passage and the cited Recognition theorem.
By analyzing infinite-width transformers in both the lazy and rich regimes, we derive upper bounds on the inference-time computational complexity... O(P N_MC T^3)
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction
unclear: Relation between the paper passage and the cited Recognition theorem.
transformers possess an inductive bias that disfavors higher-complexity algorithmic procedures within the efficient polynomial-time heuristic scheme class
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.