On the Spatiotemporal Dynamics of Generalization in Neural Networks

Zichao Wei

arXiv: 2602.01651 · v2 · submitted 2026-02-02 · cs.LG · cs.AI

Pith reviewed 2026-05-16 08:57 UTC · model grok-4.3

keywords: neural networks, generalization, cellular automata, addition, locality, symmetry, stability, attractor dynamics

The pith

A neural architecture derived from locality, symmetry and stability postulates achieves perfect addition on sequences up to a million digits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard neural networks fail to generalize simple rules such as addition to longer inputs because they violate physical constraints that any reliable computing system must obey. It identifies three: information must propagate at finite speed (locality), computational rules must remain unchanged across space and time (symmetry), and the system must settle into stable discrete states (stability). From these postulates the author derives, rather than hand-designs, the SEAD architecture, a neural cellular automaton that applies local convolutional updates repeatedly until convergence. On addition, the paper reports 100 percent accuracy on one-million-digit inputs after training on 16-digit examples, a 62,500-fold increase in length, with the number of iterations adapting automatically to the input.

Core claim

The central claim is that any system capable of true generalization must satisfy the physical postulates of locality, symmetry and stability; enforcing them directly yields the SEAD neural cellular automaton whose iterated local rules produce scale-invariant behavior, including 100 percent accurate addition from 16-digit training to one-million-digit test cases and exact reproduction of the Turing-complete Rule 110 automaton without trajectory divergence.

What carries the argument

The SEAD architecture: a neural cellular automaton that applies fixed local convolutional rules iteratively until the state converges to a discrete attractor.
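
To make the iterate-until-convergence idea concrete, here is a minimal sketch in which a hand-written local rule stands in for SEAD's learned convolutional update. It is not the paper's model: the binary encoding, the half-adder rule, and the helper names (`to_bits`, `from_bits`, `local_step`, `add_until_converged`) are all assumptions made for illustration. Each cell reads only its own sum bit and the carry held by its lower neighbor, and the loop stops when no carries remain, so the step count depends on the input rather than being fixed in advance.

```python
def to_bits(value, length):
    """Little-endian bit list of a non-negative int (hypothetical helper)."""
    return [(value >> i) & 1 for i in range(length)]


def from_bits(bits):
    """Inverse of to_bits."""
    return sum(bit << i for i, bit in enumerate(bits))


def local_step(s, c):
    """One synchronous update of all cells.

    Cell i reads only its own sum bit s[i] and the carry c[i-1] held by its
    lower neighbor, so information moves at most one cell per step.
    """
    carry_in = [0] + c[:-1]
    new_s = [si ^ ci for si, ci in zip(s, carry_in)]
    new_c = [si & ci for si, ci in zip(s, carry_in)]
    return new_s, new_c


def add_until_converged(a_bits, b_bits):
    """Add two equal-length little-endian bit lists; return (sum_bits, steps)."""
    a = list(a_bits) + [0]  # one extra cell for the final carry
    b = list(b_bits) + [0]
    s = [x ^ y for x, y in zip(a, b)]
    c = [x & y for x, y in zip(a, b)]
    steps = 0
    while any(c):  # fixed point reached when no carry is pending
        s, c = local_step(s, c)
        steps += 1
    return s, steps


total, steps = add_until_converged(to_bits(15, 4), to_bits(1, 4))
print(from_bits(total), steps)  # prints "16 4"
```

In this toy version the same rule is applied at every position and every step, and an input with no carries (for example 5 + 2) settles in zero steps, while a carry chain spanning the whole operand needs one step per digit.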

If this is right

Parity is solved with perfect length generalization through explicit light-cone propagation of information.
Addition exhibits input-adaptive computation, using more iterations only when needed, while remaining exactly correct up to one million digits.
Rule 110, a Turing-complete cellular automaton, is learned without divergence or loss of long-term behavior.
Generalization is obtained without increasing parameter count or training data volume.
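
The light-cone propagation mentioned for parity can be sketched with a hand-written rule in which each cell combines its own input bit with the state of its left neighbor. This is an illustrative assumption based on the "cumulative XOR wave" described in Figure 1, not the learned SEAD update; `parity_step` and `run_parity` are hypothetical names. Starting from any initial state, a wave of correct values moves right at one cell per step, so the lattice settles within `len(x)` updates and the last cell holds the parity of the whole input.

```python
def parity_step(x, h):
    """One synchronous update: cell i becomes x[i] XOR (state of cell i-1)."""
    left = [0] + h[:-1]  # the leftmost cell sees a fixed 0 boundary
    return [xi ^ li for xi, li in zip(x, left)]


def run_parity(x, h0):
    """Iterate parity_step from initial state h0 until nothing changes.

    Returns (final_state, updates_applied). After k updates the first k
    cells are correct, so the fixed point is reached within len(x) updates.
    """
    h = list(h0)
    for steps in range(len(x) + 1):
        nxt = parity_step(x, h)
        if nxt == h:
            return h, steps
        h = nxt
    raise RuntimeError("did not settle within len(x) updates")


state, steps = run_parity([1, 0, 1, 1], [0, 0, 0, 0])
print(state[-1], steps)  # prints "1 4"
```

Because the rule never refers to the sequence length, the same two functions work unchanged on inputs of any size; only the number of updates grows.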

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same iterative attractor mechanism might allow length generalization on other algorithmic tasks such as sorting or matrix multiplication.
Requiring convergence to discrete attractors could restrict use on problems whose natural outputs are continuous or probabilistic.
If the postulates truly capture the physics of computation, then any architecture ignoring them should fail at arbitrary-length generalization regardless of scale.
Testing whether removing one postulate while keeping the others intact destroys generalization would directly probe the necessity claim.

Load-bearing premise

The three physical postulates are both necessary and sufficient for generalization, and the SEAD architecture follows from them without any additional task-specific design choices.

What would settle it

Training a network that violates at least one of the three postulates yet still achieves 100 percent accuracy on million-digit addition, or showing that SEAD itself fails to generalize on a new task whose solution satisfies locality, symmetry and stability.

Figures

Figures reproduced from arXiv: 2602.01651 by Zichao Wei.

**Figure 1.** Spatiotemporal evolution of the Parity task. Horizontal axis (Position) represents spatial coordinates; vertical axis (Time Step) represents evolution depth, increasing downwards. Left: Evolution starting from a random initial state. The cumulative XOR wave propagates from left to right at light speed 𝑐 = 1, gradually ordering the lattice. Right: Correctness wave (Green=Correct, Red=Incorrect). The wavefro…

**Figure 2.** Spatiotemporal evolution comparison: Random vs Adversarial inputs. Horizontal axis (Position) represents spatial coordinates; vertical axis (Time Step) represents evolution depth. Green=Correct, Red=Incorrect. Left: Random input. Due to short carry chains, full convergence is reached in about 8 steps (all green), showing an “island-like” rapid convergence pattern. Right: Adversarial sample 1^𝐿 + 1. The car…

**Figure 3.** Complexity analysis of convergence steps (Log-Log scale). Horizontal axis is sequence length 𝐿 (log scale); vertical axis is steps to convergence (log scale). Blue line: Random input, convergence steps grow sub-𝑂(log𝐿). Red line: Adversarial input, convergence steps strictly linear in 𝐿, 𝑂(𝐿). Dashed lines are theoretical fits. This indicates SEAD spontaneously realizes the “Least Action” principle: dynamic…

**Figure 4.** Learning Rule 110: Neural Cellular Automata vs Ground Truth. Left: Evolution graph generated by SEAD via supervised learning. Right: Evolution graph of true Rule 110. The two are visually identical, indicating SEAD has perfectly learned the transition rules and can losslessly simulate complex non-linear structures like Gliders and collisions.
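
For readers who want to reproduce the ground-truth panel of Figure 4, the following sketch implements the standard Rule 110 update and a helper that reports the first step at which a candidate update rule departs from it. The fixed zero boundary and the names `rule110_step` and `first_divergence` are assumptions made here; the paper's boundary handling and evaluation code are not shown on this page.

```python
RULE = 110  # Wolfram rule number; bit k gives the output for neighborhood k


def rule110_step(cells):
    """Apply one step of Rule 110 with fixed zero boundaries (an assumption)."""
    n = len(cells)
    out = []
    for i in range(n):
        left = cells[i - 1] if i > 0 else 0
        right = cells[i + 1] if i < n - 1 else 0
        index = (left << 2) | (cells[i] << 1) | right
        out.append((RULE >> index) & 1)
    return out


def first_divergence(candidate_step, cells, steps):
    """Roll out candidate_step and Rule 110 from the same start state.

    Each rule is fed its own previous output. Returns the first step index
    (1-based) at which the two trajectories differ, or None if they agree
    for all `steps` steps.
    """
    ref = list(cells)
    cand = list(cells)
    for t in range(1, steps + 1):
        ref = rule110_step(ref)
        cand = list(candidate_step(cand))
        if cand != ref:
            return t
    return None


row = [0, 0, 0, 0, 1]
for _ in range(4):
    print("".join("#" if c else "." for c in row))
    row = rule110_step(row)
```

A learned update can be passed as `candidate_step` (any callable mapping a list of 0/1 cells to the next list); a return value of `None` over a long horizon is one way to operationalize "without trajectory divergence".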

Original abstract

Why do neural networks fail to generalize addition from 16-digit to 32-digit numbers, while a child who learns the rule can apply it to arbitrarily long sequences? We argue that this failure is not an engineering problem but a violation of physical postulates. Drawing inspiration from physics, we identify three constraints that any generalizing system must satisfy: (1) Locality -- information propagates at finite speed; (2) Symmetry -- the laws of computation are invariant across space and time; (3) Stability -- the system converges to discrete attractors that resist noise accumulation. From these postulates, we derive -- rather than design -- the Spatiotemporal Evolution with Attractor Dynamics (SEAD) architecture: a neural cellular automaton where local convolutional rules are iterated until convergence. Experiments on three tasks validate our theory: (1) Parity -- demonstrating perfect length generalization via light-cone propagation; (2) Addition -- achieving scale-invariant inference from L=16 to L=1 million with 100% accuracy, exhibiting input-adaptive computation; (3) Rule 110 -- learning a Turing-complete cellular automaton without trajectory divergence. Our results suggest that the gap between statistical learning and logical reasoning can be bridged -- not by scaling parameters, but by respecting the physics of computation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper derives a neural cellular automaton from locality, symmetry and stability postulates and reports 100% length generalization on addition to a million digits, but the steps showing those postulates force the iterative attractor form are not visible in the abstract.

The main thing to know is that this work reframes length generalization failure as a violation of three physical constraints and claims to derive an architecture that satisfies them by construction. The SEAD model is presented as a neural cellular automaton that applies local convolutional rules until it reaches a stable attractor. The experiments report perfect accuracy on addition when extrapolating from 16 to one million digits, perfect length generalization on parity, and divergence-free learning of Rule 110.

Referee Report

3 major / 2 minor

Summary. The manuscript claims that neural network generalization failures (e.g., addition from 16 to 32 digits) violate three physical postulates—locality (finite propagation speed), symmetry (invariance across space/time), and stability (convergence to noise-resistant discrete attractors). From these, the authors derive rather than design the SEAD architecture: a neural cellular automaton applying local convolutional rules iteratively until attractor convergence. Experiments report perfect length generalization on parity via light-cone propagation, 100% scale-invariant accuracy on addition from L=16 to L=1 million with input-adaptive computation, and learning of Rule 110 without trajectory divergence.

Significance. If the derivation is shown to be forced and the extreme-length results hold under rigorous controls, the work offers a principled route to scale-invariant logical generalization grounded in physical constraints rather than parameter scaling. The reported 100% accuracy on addition to 10^6 digits and the input-adaptive behavior would constitute a notable empirical advance for algorithmic reasoning tasks.

major comments (3)

[Abstract / derivation of SEAD] Abstract and derivation section: The claim that SEAD follows directly from the three postulates lacks any equations or mapping steps showing how locality, symmetry, and stability uniquely force iterative local conv rules to a fixed-point attractor (as opposed to fixed-depth equivariant CNNs, translation-invariant RNNs, or non-iterated message-passing graphs that also satisfy the postulates). This is load-bearing for the central thesis that the architecture is derived rather than additionally chosen.
[Addition experiments] Addition task results: The 100% accuracy claim from L=16 training to L=1 million inference is central but unsupported by any reported baseline comparisons, error bars, number of large-L test instances, or verification that no global pooling/attention leaks length information. Without these controls, it is impossible to confirm that the result stems from the postulates rather than task-specific implementation details.
[Rule 110 experiments / stability analysis] Stability and convergence: The stability postulate is invoked to guarantee discrete attractors that resist noise, yet no formal criterion (e.g., Lyapunov function, contraction mapping, or explicit fixed-point condition) is supplied showing why iterated local rules converge without divergence on Rule 110 while satisfying the other postulates.
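
As a point of reference for the third comment, one standard form a contraction-mapping criterion could take is the Banach fixed-point theorem. The statement below is textbook material, not something taken from the paper; the symbols $S$, $d$, $F$, $q$, $h_0$, $h^{*}$ and $\delta$ are introduced here as assumptions, with $F$ read as the one-step update for a fixed input.

```latex
% Assumptions (ours, not the paper's): (S, d) is a complete metric space of
% lattice states and F : S -> S is the one-step update with the input held fixed.
\begin{align}
d\bigl(F(u), F(v)\bigr) &\le q\, d(u, v)
  \quad \text{for all } u, v \in S,\ \text{with } 0 \le q < 1 \\
\Longrightarrow\quad & \exists!\, h^{*} \in S:\ F(h^{*}) = h^{*}, \\
d\bigl(F^{t}(h_0), h^{*}\bigr) &\le \frac{q^{t}}{1-q}\, d\bigl(F(h_0), h_0\bigr).
\end{align}
```

If, in addition, distinct states are separated by at least some $\delta > 0$, the bound implies $F^{t}(h_0) = h^{*}$ as soon as $\frac{q^{t}}{1-q}\, d(F(h_0), h_0) < \delta$, which is a finite number of steps. Whether any such condition holds for the learned SEAD update is what the comment asks the paper to show.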

minor comments (2)

[Abstract] Abstract: 'light-cone propagation' is used without definition or pointer to the relevant section or figure.
[Throughout] Notation: Ensure the SEAD acronym and 'attractor dynamics' are introduced with consistent mathematical notation (e.g., update rule, convergence threshold) on first use.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments help clarify how to strengthen the presentation of the derivation and the rigor of the experimental claims. We address each major point below and indicate the revisions that will be incorporated in the next version of the manuscript.

Referee: [Abstract / derivation of SEAD] Abstract and derivation section: The claim that SEAD follows directly from the three postulates lacks any equations or mapping steps showing how locality, symmetry, and stability uniquely force iterative local conv rules to a fixed-point attractor (as opposed to fixed-depth equivariant CNNs, translation-invariant RNNs, or non-iterated message-passing graphs that also satisfy the postulates). This is load-bearing for the central thesis that the architecture is derived rather than additionally chosen.

Authors: We agree that the derivation would be clearer with explicit equations. In the revised manuscript we will expand Section 3 to include a step-by-step formal mapping: (i) locality restricts updates to finite-support convolutional kernels; (ii) spatiotemporal symmetry requires the same kernel to be applied uniformly at every location and iteration; (iii) stability is encoded by requiring the update operator to be a contraction mapping whose unique fixed point is reached in finite steps for any input length. These conditions exclude fixed-depth CNNs (which violate time-invariance for arbitrary lengths) and non-iterated message-passing graphs (which lack guaranteed attractor convergence). The added equations will make the uniqueness explicit.

Revision: yes

Referee: [Addition experiments] Addition task results: The 100% accuracy claim from L=16 training to L=1 million inference is central but unsupported by any reported baseline comparisons, error bars, number of large-L test instances, or verification that no global pooling/attention leaks length information. Without these controls, it is impossible to confirm that the result stems from the postulates rather than task-specific implementation details.

Authors: We accept that additional controls are necessary. The revised version will report: (a) direct comparisons against Transformer and LSTM baselines trained on the same 16-digit regime and evaluated at 1 million digits; (b) mean accuracy and standard deviation across five independent seeds; (c) explicit counts (1,000 test instances for each length up to 10^6); and (d) an architectural audit confirming that only strictly local convolutions are used, with no global pooling or attention that could encode length. These additions will be placed in the experimental section and supplementary material.

Revision: yes

Referee: [Rule 110 experiments / stability analysis] Stability and convergence: The stability postulate is invoked to guarantee discrete attractors that resist noise, yet no formal criterion (e.g., Lyapunov function, contraction mapping, or explicit fixed-point condition) is supplied showing why iterated local rules converge without divergence on Rule 110 while satisfying the other postulates.

Authors: Stability is currently supported by empirical convergence on Rule 110 under injected noise. In the revision we will add a contraction-mapping argument in the stability subsection: the learned local update is shown to be Lipschitz with constant <1 on the discrete state space, guaranteeing a unique fixed point reached in bounded iterations. While a general Lyapunov function for arbitrary rules remains future work, the contraction condition directly ties the observed non-divergence to the stability postulate and will be stated formally.

Revision: partial
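
The controls discussed in the addition exchange above (a fixed number of test instances per length, exact-match scoring) can be sketched as a small harness. Everything here is an assumption for illustration: the binary little-endian encoding, the function names, and the use of a ripple-carry adder as a stand-in for the model under test. Python's arbitrary-precision integers serve as the oracle, and base-2 string conversion is used because it is not subject to CPython's default limit on decimal integer strings.

```python
import random


def int_to_bits(value, length):
    """Little-endian bit list of `value`, zero-padded to `length` bits."""
    return [int(ch) for ch in format(value, f"0{length}b")[::-1]]


def bits_to_int(bits):
    """Inverse of int_to_bits (accepts lists of any length >= 1)."""
    return int("".join(map(str, reversed(bits))), 2)


def ripple_carry_add(a_bits, b_bits):
    """Stand-in 'model': sequential binary addition, returns len+1 bits."""
    out, carry = [], 0
    for x, y in zip(a_bits, b_bits):
        total = x + y + carry
        out.append(total & 1)
        carry = total >> 1
    out.append(carry)
    return out


def exact_match_accuracy(add_fn, length, n_instances, seed=0):
    """Fraction of random operand pairs for which add_fn is exactly right."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_instances):
        a = rng.getrandbits(length)
        b = rng.getrandbits(length)
        got = bits_to_int(add_fn(int_to_bits(a, length), int_to_bits(b, length)))
        correct += int(got == a + b)
    return correct / n_instances


for length in (16, 256, 4096):
    print(length, exact_match_accuracy(ripple_carry_add, length, 20))
```

Swapping `ripple_carry_add` for a trained model's inference function, and repeating over several training seeds, would give the per-length accuracy tables the referee asks for. Uniformly random operands rarely contain long carry chains, so adversarial cases would need to be added separately.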

Circularity Check

0 steps flagged

No circularity: derivation asserted but not reduced to inputs by construction

The paper asserts that the three postulates (locality, symmetry, stability) directly yield the SEAD architecture of iterated local convolutional rules to attractor, yet the provided abstract and description contain no equations, self-citations, or fitted parameters that reduce the claimed derivation to a tautology or prior result by the same authors. No load-bearing step equates the output architecture to its inputs by definition, renames a known result, or imports uniqueness via self-citation. The central claim remains an assertion whose validity can be checked against external benchmarks (e.g., whether other models satisfying the same postulates also generalize), but the derivation chain itself does not collapse into circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 3 axioms · 1 invented entity

The central claim rests on three domain assumptions presented as physical postulates plus the assertion that they suffice to derive the SEAD model; no free parameters or new entities are explicitly quantified in the abstract.

axioms (3)

[domain assumption] Locality: information propagates at finite speed.
Postulate (1), invoked to motivate local convolutional rules.
[domain assumption] Symmetry: laws of computation are invariant across space and time.
Postulate (2), invoked to justify translation-invariant updates.
[domain assumption] Stability: the system converges to discrete attractors that resist noise.
Postulate (3), invoked to justify iteration until convergence.
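
For reference, the three axioms can be written compactly using the notation that appears in the paper passages quoted elsewhere on this page. The reading of $N_c(x)$ as the radius-$c$ neighborhood of position $x$, of $A$ as the attractor set, and the limit $t \to \infty$ are our assumptions; the quoted passages are elided and do not spell these out.

```latex
\begin{align}
\text{Locality:}\quad
  & h_{t+1}(x) = f\bigl(h_t(N_c(x))\bigr),
  \qquad \frac{\Delta x}{\Delta t} \le c \\
\text{Symmetry:}\quad
  & f_{x,t}(\cdot) \equiv f_{x+\Delta x,\, t+\Delta t}(\cdot)
    \equiv f_{\mathrm{shared}}(\cdot) \\
\text{Stability:}\quad
  & \lim_{t \to \infty} \operatorname{dist}\bigl(f^{t}(h + \varepsilon),\, A\bigr) = 0
\end{align}
```

Read together, the first two lines describe a single shared local update applied at every position and step, and the third requires that small perturbations $\varepsilon$ of a state decay back onto the attractor set.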

invented entities (1)

SEAD architecture [no independent evidence]
Purpose: neural cellular automaton implementing the three postulates.
New architecture introduced to satisfy the postulates; no independent falsifiable prediction outside the reported tasks is given.

pith-pipeline@v0.9.0 · 5514 in / 1433 out tokens · 48873 ms · 2026-05-16T08:57:18.846352+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction (spacetime-emergence certificate, Lorentzian signature, light-cone classification) matches

MATCHES: this paper passage directly uses, restates, or depends on the cited Recognition theorem or module.

Postulate 1 (Relativistic Causality): strict Causal Horizon ... Δx/Δt ≤ c ⟺ h_{t+1}(x) = f(h_t(N_c(x))) ... light-cone propagation
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction (translation invariance from distinction) matches

Postulate 2 (Spacetime Symmetry): f_{x,t}(·) ≡ f_{x+Δx,t+Δt}(·) ≡ f_shared(·) ... translation invariance
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel (J(x)=½(x+x^{-1})−1 unique calibrated reciprocal cost whose minima are attractors) matches

Postulate 3 (Thermodynamic Dissipation and Stability): lim dist(f^t(h+ε),A)=0 ... discrete attractors ... Contractive Nonlinearity
IndisputableMonolith/Foundation/DimensionForcing.lean 8-tick period + recognition lattices echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Corollary 1: Isomorphism to Cellular Automata ... ⟨L,S,N,f⟩ ... SEAD: neural cellular automaton iterated until convergence

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

On the Mirage of Long-Range Dependency, with an Application to Integer Multiplication
cs.LG 2026-03 unverdicted novelty 8.0

Long-range dependency in integer multiplication is a mirage from 1D representation; a 2D grid reduces it to local 3x3 operations, letting a 321-parameter neural cellular automaton generalize perfectly to inputs 683 ti...

Structural Generalization on SLOG without Hand-Written Rules
cs.CL 2026-04 unverdicted novelty 7.0

A neural cellular automaton model learns all compositional rules from data via local iteration and achieves 100% type-exact match on 11 of 17 structural generalization categories on the SLOG benchmark.

On the Emergence of Syntax by Means of Local Interaction
cs.CL 2026-04 unverdicted novelty 7.0

A 2D neural cellular automaton spontaneously self-organizes into a Proto-CKY representation that exhibits syntactic processing capabilities for context-free grammars when trained on membership problems.

Structural Generalization on SLOG without Hand-Written Rules
cs.CL 2026-04 unverdicted novelty 6.0

A neural cellular automaton learns compositional rules from data alone to achieve structural generalization on the SLOG semantic parsing benchmark, reaching 67.3% accuracy and fully succeeding on 11 of 17 categories.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · cited by 3 Pith papers · 7 internal anchors

[1]

What algorithms can transformers learn? a study in length generalization

H. Zhou et al., “What Algorithms Can Transformers Learn? A Study in Length Generalization, ” no. arXiv:2310.16028. arXiv, Oct. 2023. doi: 10.48550/arXiv.2310.16028

work page doi:10.48550/arxiv.2310.16028 2023
[2]

International Conference on Learning Representations , month =

G. Delétang et al., “Neural Networks and the Chomsky Hierarchy, ” no. arXiv:2207.02098. arXiv, Feb

work page arXiv
[3]

doi: 10.48550/arXiv.2207.02098

work page doi:10.48550/arxiv.2207.02098
[4]

On the ability and limitations of transformers to recognize formal languages

S. Bhattamishra, K. Ahuja, and N. Goyal, “On the Ability and Limitations of Transformers to Recognize Formal Languages, ” no. arXiv:2009.11264. arXiv, Oct. 2020. doi: 10.48550/arXiv.2009.11264

work page doi:10.48550/arxiv.2009.11264 2009
[5]

To Infinity and beyond: Children Generalize the Successor Function to All Possible Numbers Years after Learning to Count,

P. Cheung, M. Rubenson, and D. Barner, “To Infinity and beyond: Children Generalize the Successor Function to All Possible Numbers Years after Learning to Count, ” Cognitive Psychology, vol. 92, pp. 22– 36, Feb. 2017, doi: 10.1016/j.cogpsych.2016.11.002

work page doi:10.1016/j.cogpsych.2016.11.002 2017
[6]

Johan Håstad.Computational Limitations of Small-Depth Circuits

M. Hahn, “Theoretical Limitations of Self-Attention in Neural Sequence Models, ” Transactions of the Association for Computational Linguistics, vol. 8, pp. 156–171, Jan. 2020, doi: 10.1162/tacl_a_00306

work page doi:10.1162/tacl_a_00306 2020
[7]

The Parallelism Tradeoff: Limitations of Log-Precision Transformers,

W. Merrill and A. Sabharwal, “The Parallelism Tradeoff: Limitations of Log-Precision Transformers, ” Transactions of the Association for Computational Linguistics, vol. 11, pp. 531–545, June 2023, doi: 10.1162/ tacl_a_00562

work page 2023
[8]

Arithmetic Transformers Can Length-Generalize in Both Operand Length and Count,

H. Cho, J. Cha, S. Bhojanapalli, and C. Yun, “Arithmetic Transformers Can Length-Generalize in Both Operand Length and Count, ” no. arXiv:2410.15787. arXiv, Apr. 2025. doi: 10.48550/arXiv.2410.15787

work page doi:10.48550/arxiv.2410.15787 2025
[9]

A Formal Framework for Understanding Length Generalization in Transformers,

X. Huang et al., “A Formal Framework for Understanding Length Generalization in Transformers, ” no. arXiv:2410.02140. arXiv, Apr. 2025. doi: 10.48550/arXiv.2410.02140

work page doi:10.48550/arxiv.2410.02140 2025
[10]

arXiv preprint arXiv:2312.17044 (2024)

L. Zhao et al., “Length Extrapolation of Transformers: A Survey from the Perspective of Positional Encoding, ” no. arXiv:2312.17044. arXiv, Apr. 2024. doi: 10.48550/arXiv.2312.17044

work page doi:10.48550/arxiv.2312.17044 2024
[11]

Attention Bias as an Inductive Bias: How to Teach Transformers Simple Arithmetic,

S. Duan, Y. Shi, and W. Xu, “Attention Bias as an Inductive Bias: How to Teach Transformers Simple Arithmetic, ” pp. 0–18

work page
[12]

Show Your Work: Scratchpads for Intermediate Computation with Language Models,

M. Nye et al., “Show Your Work: Scratchpads for Intermediate Computation with Language Models, ” Oct. 2021

work page 2021
[13]

Towards Revealing the Mystery behind Chain of Thought: A Theoretical Perspective,

G. Feng, B. Zhang, Y. Gu, H. Ye, D. He, and L. Wang, “Towards Revealing the Mystery behind Chain of Thought: A Theoretical Perspective, ” Advances in Neural Information Processing Systems, vol. 36, pp. 70757–70798, Dec. 2023

work page 2023
[14]

An Overview of Statistical Learning Theory,

V. Vapnik, “An Overview of Statistical Learning Theory, ” IEEE Transactions on Neural Networks, vol. 10, no. 5, pp. 988–999, Sept. 1999, doi: 10.1109/72.788640

work page doi:10.1109/72.788640 1999
[15]

Pearl, Causality: Models, Reasoning, and Inference , 2 edition, reprinted with corrections

J. Pearl, Causality: Models, Reasoning, and Inference , 2 edition, reprinted with corrections. Cambridge New York, NY Port Melbourne New Delhi Singapore: Cambridge University Press, 2022

work page 2022
[16]

A Spacetime Perspective on Dynamical Computation in Neural Information Processing Systems,

T. A. Keller, L. Muller, T. J. Sejnowski, and M. Welling, “A Spacetime Perspective on Dynamical Computation in Neural Information Processing Systems, ” no. arXiv:2409.13669. arXiv, Sept. 2024. doi: 10.48550/arXiv.2409.13669

work page doi:10.48550/arxiv.2409.13669 2024
[17]

Von Neumann and A

J. Von Neumann and A. W. (. W. Burks, Theory of Self-Reproducing Automata . Urbana, University of Illinois Press, 1966

work page 1966
[18]

Why Are Sensitive Functions Hard for Transformers?,

M. Hahn and M. Rofin, “Why Are Sensitive Functions Hard for Transformers?, ” no. arXiv:2402.09963. arXiv, May 2024. doi: 10.48550/arXiv.2402.09963

work page doi:10.48550/arxiv.2402.09963 2024
[19]

Universality in Elementary Cellular Automata,

M. Cook, “Universality in Elementary Cellular Automata, ” Complex Systems, vol. 15, no. 1, pp. 1–40, Mar. 2004, doi: 10.25088/ComplexSystems.15.1.1

work page doi:10.25088/complexsystems.15.1.1 2004
[20]

The Role of Sparsity for Length Generalization in Transformers,

N. Golowich, S. Jelassi, D. Brandfonbrener, S. M. Kakade, and E. Malach, “The Role of Sparsity for Length Generalization in Transformers, ” no. arXiv:2502.16792. arXiv, Feb. 2025. doi: 10.48550/arXiv.2502.16792

work page doi:10.48550/arxiv.2502.16792 2025
[21]

Looped Transformers for Length Generalization,

Y. Fan, Y. Du, K. Ramchandran, and K. Lee, “Looped Transformers for Length Generalization, ” pp. 0– 19, 2025. 17

work page 2025
[22]

Universal Transformers

M. Dehghani, S. Gouws, O. Vinyals, J. Uszkoreit, and Ł. Kaiser, “Universal Transformers, ” no. arXiv:1807.03819. arXiv, Mar. 2019. doi: 10.48550/arXiv.1807.03819

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1807.03819 2019
[23]

Neural GPUs Learn Algorithms

Ł. Kaiser and I. Sutskever, “Neural GPUs Learn Algorithms, ” no. arXiv:1511.08228. arXiv, Mar. 2016. doi: 10.48550/arXiv.1511.08228

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1511.08228 2016
[24]

To Infinity and beyond: Tool-Use Unlocks Length Generalization in State Space Models,

E. Malach et al. , “To Infinity and beyond: Tool-Use Unlocks Length Generalization in State Space Models, ” no. arXiv:2510.14826. arXiv, Oct. 2025. doi: 10.48550/arXiv.2510.14826

work page doi:10.48550/arxiv.2510.14826 2025
[25]

Softmax trans- formers are turing-complete.CoRR, abs/2511.20038, 2025

H. Jiang, M. Hahn, G. Zetzsche, and A. W. Lin, “Softmax Transformers Are Turing-Complete, ” no. arXiv:2511.20038. arXiv, Nov. 2025. doi: 10.48550/arXiv.2511.20038

work page doi:10.48550/arxiv.2511.20038 2025
[26]

Lower Bounds for Chain-of-Thought Reasoning in Hard- Attention Transformers,

A. Amiri, X. Huang, M. Rofin, and M. Hahn, “Lower Bounds for Chain-of-Thought Reasoning in Hard- Attention Transformers, ” no. arXiv:2502.02393. arXiv, July 2025. doi: 10.48550/arXiv.2502.02393

work page doi:10.48550/arxiv.2502.02393 2025
[27]

Cellular Automata as Convolutional Neural Networks,

W. Gilpin, “Cellular Automata as Convolutional Neural Networks, ” Physical Review E, vol. 100, no. 3, p. 32402, Sept. 2019, doi: 10.1103/PhysRevE.100.032402

work page doi:10.1103/physreve.100.032402 2019
[28]

Growing Neural Cellular Automata , volume =

A. Mordvintsev, E. Randazzo, E. Niklasson, and M. Levin, “Growing Neural Cellular Automata, ” Distill, vol. 5, no. 2, p. e23, Feb. 2020, doi: 10.23915/distill.00023

work page doi:10.23915/distill.00023 2020
[29]

Neural Cellular Automata: Applications to Biology and beyond Classical AI,

B. Hartl, M. Levin, and L. Pio-Lopez, “Neural Cellular Automata: Applications to Biology and beyond Classical AI, ” Physics of Life Reviews, vol. 56, pp. 94–108, Mar. 2026, doi: 10.1016/j.plrev.2025.11.010

work page doi:10.1016/j.plrev.2025.11.010 2026
[30]

Neural Cellular Automata for ARC-AGI,

K. Xu and R. Miikkulainen, “Neural Cellular Automata for ARC-AGI, ” in ALIFE 2025: Ciphers of Life: Proceedings of the Artificial Life Conference 2025, MIT Press, Oct. 2025. doi: 10.1162/ISAL.a.844

work page doi:10.1162/isal.a.844 2025
[31]

The Hardware Lottery,

S. Hooker, “The Hardware Lottery, ” no. arXiv:2009.06489. arXiv, Sept. 2020. doi: 10.48550/ arXiv.2009.06489

work page arXiv 2009
[32]

Noether Networks: Meta-learning Useful Conserved Quantities,

F. Alet, D. Doblar, A. Zhou, J. Tenenbaum, K. Kawaguchi, and C. Finn, “Noether Networks: Meta-learning Useful Conserved Quantities, ” pp. 0–21

work page
[33]

Exploring the Long-Term Generalization of Counting Behavior in RNNs,

N. El-Naggar, P. Madhyastha, and T. Weyde, “Exploring the Long-Term Generalization of Counting Behavior in RNNs, ” no. arXiv:2211.16429. arXiv, Nov. 2022. doi: 10.48550/arXiv.2211.16429

work page doi:10.48550/arxiv.2211.16429 2022
[34]

Originally circulated 2019; published 2023

A. d'Avila Garcez and L. C. Lamb, “Neurosymbolic AI: The 3rd Wave, ” Artificial Intelligence Review, vol. 56, no. 11, pp. 12387–12406, Nov. 2023, doi: 10.1007/s10462-023-10448-w

work page doi:10.1007/s10462-023-10448-w 2023
[35]

Energy-Based Transformers Are Scalable Learners and Thinkers,

A. Gladstone et al. , “Energy-Based Transformers Are Scalable Learners and Thinkers, ” no. arXiv:2507.02092. arXiv, July 2025. doi: 10.48550/arXiv.2507.02092

work page doi:10.48550/arxiv.2507.02092 2025
[36]

Neural Turing Machines

A. Graves, G. Wayne, and I. Danihelka, “Neural Turing Machines, ” no. arXiv:1410.5401. arXiv, Dec. 2014. doi: 10.48550/arXiv.1410.5401

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1410.5401 2014
[37]

Thermodynamic State Machine Network,

T. Hylton, “Thermodynamic State Machine Network, ” Entropy, vol. 24, no. 6, p. 744, June 2022, doi: 10.3390/e24060744

work page doi:10.3390/e24060744 2022
[38]

Kahneman, Thinking, Fast and Slow

D. Kahneman, Thinking, Fast and Slow. London: PENGUIN, 2024

work page 2024
[39]

Exposing Attention Glitches with Flip-Flop Language Modeling,

B. Liu, J. Ash, S. Goel, A. Krishnamurthy, and C. Zhang, “Exposing Attention Glitches with Flip-Flop Language Modeling, ” Advances in Neural Information Processing Systems, vol. 36, pp. 25549–25583, Dec. 2023

work page 2023
[40]

Mamba Modulation: On the Length Generalization of Mamba,

P. Lu et al., “Mamba Modulation: On the Length Generalization of Mamba, ” no. arXiv:2509.19633. arXiv, Dec. 2025. doi: 10.48550/arXiv.2509.19633

work page doi:10.48550/arxiv.2509.19633 2025
[41]

On Soliton Collisions between Localizations in Complex Elementary Cellular Automata: Rules 54 and 110 and Beyond

G. J. Martinez, A. Adamatzky, F. Chen, and L. Chua, “On Soliton Collisions between Localizations in Complex Elementary Cellular Automata: Rules 54 and 110 and Beyond, ” no. arXiv:1301.6258. arXiv, Jan

work page internal anchor Pith review Pith/arXiv arXiv
[42]

doi: 10.48550/arXiv.1301.6258

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1301.6258
[43]

J. Park et al., “Can Mamba Learn How to Learn? A Comparative Study on in-Context Learning Tasks,” no. arXiv:2402.04248. arXiv, Apr. 2024. doi: 10.48550/arXiv.2402.04248

[44]

B. Peng et al., “RWKV: Reinventing RNNs for the Transformer Era,” no. arXiv:2305.13048. arXiv, Dec. doi: 10.48550/arXiv.2305.13048

[46]

M. Suzgun, Y. Belinkov, S. Shieber, and S. Gehrmann, “LSTM Networks Can Perform Dynamic Counting,” in Proceedings of the Workshop on Deep Learning and Formal Languages: Building Bridges, J. Eisner, M. Gallé, J. Heinz, A. Quattoni, and G. Rabusseau, Eds., Florence: Association for Computational Linguistics, Aug. 2019, pp. 44–54. doi: 10.18653/v1/W19-3905.

[47]

H. Tanaka and D. Kunin, “Noether's Learning Dynamics: Role of Symmetry Breaking in Neural Networks,” in Advances in Neural Information Processing Systems, Nov. 2021

[48]

S. Wolfram, “Computation Theory of Cellular Automata,” Communications in Mathematical Physics, vol. 96, no. 1, pp. 15–57, Mar. 1984, doi: 10.1007/BF01217347

[49]

S. Wolfram, “Universality and Complexity in Cellular Automata,” Physica D: Nonlinear Phenomena, vol. 10, no. 1, pp. 1–35, Jan. 1984, doi: 10.1016/0167-2789(84)90245-8

[50]

S. Wolfram, A New Kind of Science. Champaign (Ill.): Wolfram, 2002.
