
arxiv: 2602.01651 · v2 · submitted 2026-02-02 · 💻 cs.LG · cs.AI

On the Spatiotemporal Dynamics of Generalization in Neural Networks

Pith reviewed 2026-05-16 08:57 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords neural networks · generalization · cellular automata · addition · locality · symmetry · stability · attractor dynamics

The pith

A neural architecture derived from locality, symmetry and stability postulates achieves perfect addition on sequences up to a million digits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard neural networks fail to generalize simple rules like addition to longer inputs because, the paper argues, they violate basic physical constraints that any reliable computing system must obey. The paper identifies three such constraints: information must propagate at finite speed, computational rules must remain unchanged across space and time, and the system must settle into stable discrete states. From these postulates the authors derive, rather than hand-design, the SEAD architecture: a neural cellular automaton that applies local convolutional updates repeatedly until convergence. On addition this produces 100 percent accuracy when tested on inputs 62,500 times longer than the training examples (16 digits in training, one million at test), with the number of iterations adapting automatically to the input (to its carry chains rather than to its raw length).

Core claim

The central claim is that any system capable of true generalization must satisfy the physical postulates of locality, symmetry and stability; enforcing them directly yields the SEAD neural cellular automaton whose iterated local rules produce scale-invariant behavior, including 100 percent accurate addition from 16-digit training to one-million-digit test cases and exact reproduction of the Turing-complete Rule 110 automaton without trajectory divergence.

What carries the argument

The SEAD architecture: a neural cellular automaton that applies fixed local convolutional rules iteratively until the state converges to a discrete attractor.
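
The "iterate a fixed local rule until the state stops changing" loop can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the rule below is hand-written rather than learned, and the names `iterate_to_fixed_point` and `make_prefix_parity_step` are hypothetical. It uses the Parity task, where each cell reads only its own input bit and its left neighbour's state.

```python
from typing import Callable, List, Tuple

State = List[int]


def iterate_to_fixed_point(
    step: Callable[[State], State], state: State, max_steps: int
) -> Tuple[State, int]:
    """Apply `step` repeatedly until the state stops changing.

    Returns the fixed point and the number of updates that changed the state.
    """
    for t in range(max_steps):
        nxt = step(state)
        if nxt == state:  # discrete fixed point reached
            return state, t
        state = nxt
    raise RuntimeError("no fixed point within max_steps")


def make_prefix_parity_step(bits: State) -> Callable[[State], State]:
    """Hand-written local rule (assumption: stands in for a learned kernel).

    Cell i reads only its own input bit and its left neighbour's current
    state, so information moves one cell per step.
    """
    def step(state: State) -> State:
        return [bits[i] ^ (state[i - 1] if i > 0 else 0) for i in range(len(bits))]
    return step


bits = [1, 0, 1, 1, 0, 1]
final, steps = iterate_to_fixed_point(
    make_prefix_parity_step(bits), [0] * len(bits), max_steps=len(bits) + 1
)
print(final, steps)  # -> [1, 1, 0, 1, 1, 0] 6
```

The last cell of the fixed point holds the parity of the whole input. The same rule works unchanged for any input length, because nothing in it depends on the length; only the number of iterations grows.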

If this is right

  • Parity is solved with perfect length generalization through explicit light-cone propagation of information.
  • Addition exhibits input-adaptive computation, using more iterations only when needed, while remaining exactly correct up to one million digits.
  • Rule 110, a Turing-complete cellular automaton, is learned without divergence or loss of long-term behavior.
  • Generalization is obtained without increasing parameter count or training data volume.
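
Input-adaptive computation on addition can be illustrated with a hand-written carry-propagation rule. This is a sketch under stated assumptions, not the learned SEAD model: digits are stored least-significant first, and `add_by_local_carries` and `to_int` are hypothetical helper names. Each update lets a carry move one position, and the loop stops as soon as no carry changes.

```python
import random
from typing import List, Tuple


def add_by_local_carries(a: List[int], b: List[int]) -> Tuple[List[int], int]:
    """Add two equal-length little-endian decimal digit lists.

    Returns (sum digits with one extra top digit, number of update steps).
    """
    n = len(a)
    carry = [0] * (n + 1)  # carry[i] is the carry INTO position i
    steps = 0
    while True:
        # Each cell looks only at its own digits and its incoming carry.
        new = [0] + [1 if a[i] + b[i] + carry[i] >= 10 else 0 for i in range(n)]
        if new == carry:  # fixed point: no carry changed
            break
        carry = new
        steps += 1
    digits = [(a[i] + b[i] + carry[i]) % 10 for i in range(n)] + [carry[n]]
    return digits, steps


def to_int(digits: List[int]) -> int:
    """Convert little-endian digits to a Python int for checking."""
    return int("".join(map(str, reversed(digits))))


n = 64
rng = random.Random(0)
rand_a = [rng.randrange(10) for _ in range(n)]
rand_b = [rng.randrange(10) for _ in range(n)]
rand_sum, rand_steps = add_by_local_carries(rand_a, rand_b)

# Adversarial case: 99...9 + 1 forces the carry to travel the full length.
adv_sum, adv_steps = add_by_local_carries([9] * n, [1] + [0] * (n - 1))
print(rand_steps, adv_steps)
```

On the adversarial input the step count equals the length `n`, since the carry must cross every position; on random digits it is bounded by the longest carry chain, which is typically much shorter.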

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same iterative attractor mechanism might allow length generalization on other algorithmic tasks such as sorting or matrix multiplication.
  • Requiring convergence to discrete attractors could restrict use on problems whose natural outputs are continuous or probabilistic.
  • If the postulates truly capture the physics of computation, then any architecture ignoring them should fail at arbitrary-length generalization regardless of scale.
  • Testing whether removing one postulate while keeping the others intact destroys generalization would directly probe the necessity claim.

Load-bearing premise

The three physical postulates are both necessary and sufficient for generalization, and the SEAD architecture follows from them without any additional task-specific design choices.

What would settle it

Training a network that violates at least one of the three postulates yet still achieves 100 percent accuracy on million-digit addition, or showing that SEAD itself fails to generalize on a new task whose solution satisfies locality, symmetry and stability.

Figures

Figures reproduced from arXiv: 2602.01651 by Zichao Wei.

Figure 1: Spatiotemporal evolution of the Parity task. Horizontal axis (Position) represents spatial coordinates; vertical axis (Time Step) represents evolution depth, increasing downwards. Left: Evolution starting from a random initial state. The cumulative XOR wave propagates from left to right at light speed 𝑐 = 1, gradually ordering the lattice. Right: Correctness wave (Green=Correct, Red=Incorrect). The wavefro…
Figure 2: Spatiotemporal evolution comparison: Random vs Adversarial inputs. Horizontal axis (Position) represents spatial coordinates; vertical axis (Time Step) represents evolution depth. Green=Correct, Red=Incorrect. Left: Random input. Due to short carry chains, full convergence is reached in about 8 steps (all green), showing an “island-like” rapid convergence pattern. Right: Adversarial sample 1 𝐿 + 1. The car…
Figure 3: Complexity analysis of convergence steps (Log-Log scale). Horizontal axis is sequence length 𝐿 (log scale); vertical axis is steps to convergence (log scale). Blue line: Random input, convergence steps grow sub-𝑂(log𝐿). Red line: Adversarial input, convergence steps strictly linear in 𝐿, 𝑂(𝐿). Dashed lines are theoretical fits. This indicates SEAD spontaneously realizes the “Least Action” principle: dynamic…
Figure 4: Learning Rule 110: Neural Cellular Automata vs Ground Truth. Left: Evolution graph generated by SEAD via supervised learning. Right: Evolution graph of true Rule 110. The two are visually identical, indicating SEAD has perfectly learned the transition rules and can losslessly simulate complex non-linear structures like Gliders and collisions.
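
The Rule 110 comparison in Figure 4 can be made exact rather than visual by stepping a candidate rule alongside the ground truth and reporting the first step at which they differ. The sketch below is an illustration only: the periodic boundary, the function names `eca_step` and `first_divergence`, and the use of Rule 30 as a deliberately wrong stand-in for a learned model are all assumptions.

```python
from typing import Callable, List, Optional


def eca_step(cells: List[int], rule: int = 110) -> List[int]:
    """One step of an elementary cellular automaton on a ring.

    The new state of a cell is bit (4*left + 2*centre + right) of `rule`.
    """
    n = len(cells)
    return [
        (rule >> (4 * cells[(i - 1) % n] + 2 * cells[i] + cells[(i + 1) % n])) & 1
        for i in range(n)
    ]


def first_divergence(
    candidate: Callable[[List[int]], List[int]], cells: List[int], horizon: int
) -> Optional[int]:
    """Return the first step at which `candidate` leaves the true Rule 110
    trajectory, or None if the two agree for `horizon` steps."""
    truth, guess = list(cells), list(cells)
    for t in range(1, horizon + 1):
        truth = eca_step(truth, 110)
        guess = candidate(guess)
        if guess != truth:
            return t
    return None


start = [0, 0, 0, 0, 0, 0, 0, 1]
print(eca_step(start))  # -> [0, 0, 0, 0, 0, 0, 1, 1]
print(first_divergence(lambda c: eca_step(c, 110), start, 50))  # -> None
print(first_divergence(lambda c: eca_step(c, 30), start, 50))   # -> 1
```

A learned model would be plugged in as `candidate` after thresholding its output to 0/1; a result of `None` over a long horizon is the "no trajectory divergence" property the paper reports.
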
Original abstract

Why do neural networks fail to generalize addition from 16-digit to 32-digit numbers, while a child who learns the rule can apply it to arbitrarily long sequences? We argue that this failure is not an engineering problem but a violation of physical postulates. Drawing inspiration from physics, we identify three constraints that any generalizing system must satisfy: (1) Locality -- information propagates at finite speed; (2) Symmetry -- the laws of computation are invariant across space and time; (3) Stability -- the system converges to discrete attractors that resist noise accumulation. From these postulates, we derive -- rather than design -- the Spatiotemporal Evolution with Attractor Dynamics (SEAD) architecture: a neural cellular automaton where local convolutional rules are iterated until convergence. Experiments on three tasks validate our theory: (1) Parity -- demonstrating perfect length generalization via light-cone propagation; (2) Addition -- achieving scale-invariant inference from L=16 to L=1 million with 100% accuracy, exhibiting input-adaptive computation; (3) Rule 110 -- learning a Turing-complete cellular automaton without trajectory divergence. Our results suggest that the gap between statistical learning and logical reasoning can be bridged -- not by scaling parameters, but by respecting the physics of computation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript claims that neural network generalization failures (e.g., addition from 16 to 32 digits) stem from violating three physical postulates: locality (finite propagation speed), symmetry (invariance across space and time), and stability (convergence to noise-resistant discrete attractors). From these, the authors derive rather than design the SEAD architecture: a neural cellular automaton applying local convolutional rules iteratively until attractor convergence. Experiments report perfect length generalization on parity via light-cone propagation, 100% scale-invariant accuracy on addition from L=16 to L=1 million with input-adaptive computation, and learning of Rule 110 without trajectory divergence.

Significance. If the derivation is shown to be forced and the extreme-length results hold under rigorous controls, the work offers a principled route to scale-invariant logical generalization grounded in physical constraints rather than parameter scaling. The reported 100% accuracy on addition to 10^6 digits and the input-adaptive behavior would constitute a notable empirical advance for algorithmic reasoning tasks.

major comments (3)
  1. [Abstract / derivation of SEAD] Abstract and derivation section: The claim that SEAD follows directly from the three postulates lacks any equations or mapping steps showing how locality, symmetry, and stability uniquely force iterative local conv rules to a fixed-point attractor (as opposed to fixed-depth equivariant CNNs, translation-invariant RNNs, or non-iterated message-passing graphs that also satisfy the postulates). This is load-bearing for the central thesis that the architecture is derived rather than additionally chosen.
  2. [Addition experiments] Addition task results: The 100% accuracy claim from L=16 training to L=1 million inference is central but unsupported by any reported baseline comparisons, error bars, number of large-L test instances, or verification that no global pooling/attention leaks length information. Without these controls, it is impossible to confirm that the result stems from the postulates rather than task-specific implementation details.
  3. [Rule 110 experiments / stability analysis] Stability and convergence: The stability postulate is invoked to guarantee discrete attractors that resist noise, yet no formal criterion (e.g., Lyapunov function, contraction mapping, or explicit fixed-point condition) is supplied showing why iterated local rules converge without divergence on Rule 110 while satisfying the other postulates.
minor comments (2)
  1. [Abstract] Abstract: 'light-cone propagation' is used without definition or pointer to the relevant section or figure.
  2. [Throughout] Notation: Ensure the SEAD acronym and 'attractor dynamics' are introduced with consistent mathematical notation (e.g., update rule, convergence threshold) on first use.
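
The locality control requested in major comment 2, and the "light-cone" term flagged in minor comment 1, can be made concrete with a perturbation probe: flip one input cell and check that after t updates every changed cell lies within radius × t of the flip. The sketch below is a hypothetical audit, not something the paper reports; `local_step` and `leaky_step` are toy rules invented for the demonstration. Passing the probe is evidence of locality, not a proof.

```python
import random
from typing import Callable, List

Step = Callable[[List[int]], List[int]]


def local_step(cells: List[int]) -> List[int]:
    """Toy radius-1 rule: XOR of the three-cell neighbourhood, zero-padded."""
    p = [0] + cells + [0]
    return [p[i] ^ p[i + 1] ^ p[i + 2] for i in range(len(cells))]


def leaky_step(cells: List[int]) -> List[int]:
    """Toy rule with a global-pooling leak: every cell sees the global parity."""
    g = sum(cells) % 2
    return [c ^ g for c in cells]


def violates_light_cone(
    step: Step, cells: List[int], radius: int, t_max: int, rng: random.Random
) -> bool:
    """Flip one random cell and report whether any difference escapes
    the cone |i - j| <= radius * t within t_max steps."""
    j = rng.randrange(len(cells))
    a = list(cells)
    b = list(cells)
    b[j] ^= 1
    for t in range(1, t_max + 1):
        a, b = step(a), step(b)
        for i, (x, y) in enumerate(zip(a, b)):
            if x != y and abs(i - j) > radius * t:
                return True
    return False


rng = random.Random(0)
cells = [rng.randrange(2) for _ in range(64)]
local_ok = not any(violates_light_cone(local_step, cells, 1, 10, rng) for _ in range(20))
leak_found = any(violates_light_cone(leaky_step, cells, 1, 10, rng) for _ in range(20))
print(local_ok, leak_found)  # -> True True
```

The strictly local rule never breaks the cone, while the rule that mixes in a global sum is caught on the first step, because the flipped bit changes cells far outside the neighbourhood.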

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments help clarify how to strengthen the presentation of the derivation and the rigor of the experimental claims. We address each major point below and indicate the revisions that will be incorporated in the next version of the manuscript.

Point-by-point responses
  1. Referee: [Abstract / derivation of SEAD] Abstract and derivation section: The claim that SEAD follows directly from the three postulates lacks any equations or mapping steps showing how locality, symmetry, and stability uniquely force iterative local conv rules to a fixed-point attractor (as opposed to fixed-depth equivariant CNNs, translation-invariant RNNs, or non-iterated message-passing graphs that also satisfy the postulates). This is load-bearing for the central thesis that the architecture is derived rather than additionally chosen.

    Authors: We agree that the derivation would be clearer with explicit equations. In the revised manuscript we will expand Section 3 to include a step-by-step formal mapping: (i) locality restricts updates to finite-support convolutional kernels; (ii) spatiotemporal symmetry requires the same kernel to be applied uniformly at every location and iteration; (iii) stability is encoded by requiring the update operator to be a contraction mapping whose unique fixed point is reached in finite steps for any input length. These conditions exclude fixed-depth CNNs (which violate time-invariance for arbitrary lengths) and non-iterated message-passing graphs (which lack guaranteed attractor convergence). The added equations will make the uniqueness explicit. revision: yes

  2. Referee: [Addition experiments] Addition task results: The 100% accuracy claim from L=16 training to L=1 million inference is central but unsupported by any reported baseline comparisons, error bars, number of large-L test instances, or verification that no global pooling/attention leaks length information. Without these controls, it is impossible to confirm that the result stems from the postulates rather than task-specific implementation details.

    Authors: We accept that additional controls are necessary. The revised version will report: (a) direct comparisons against Transformer and LSTM baselines trained on the same 16-digit regime and evaluated at 1 million digits; (b) mean accuracy and standard deviation across five independent seeds; (c) explicit counts (1,000 test instances for each length up to 10^6); and (d) an architectural audit confirming that only strictly local convolutions are used, with no global pooling or attention that could encode length. These additions will be placed in the experimental section and supplementary material. revision: yes

  3. Referee: [Rule 110 experiments / stability analysis] Stability and convergence: The stability postulate is invoked to guarantee discrete attractors that resist noise, yet no formal criterion (e.g., Lyapunov function, contraction mapping, or explicit fixed-point condition) is supplied showing why iterated local rules converge without divergence on Rule 110 while satisfying the other postulates.

    Authors: Stability is currently supported by empirical convergence on Rule 110 under injected noise. In the revision we will add a contraction-mapping argument in the stability subsection: the learned local update is shown to be Lipschitz with constant <1 on the discrete state space, guaranteeing a unique fixed point reached in bounded iterations. While a general Lyapunov function for arbitrary rules remains future work, the contraction condition directly ties the observed non-divergence to the stability postulate and will be stated formally. revision: partial
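
The mapping promised in responses 1 and 3 could be written along the following lines. This is an illustrative notation only: the symbols $s$, $x$, $f_\theta$, $F_{\theta,x}$, $r$, $q$ and the metric $d$ are assumptions introduced here, not taken from the paper, and whether the learned update actually satisfies the contraction condition is the authors' claim, not something shown here.

```latex
% Illustrative notation (all symbols are assumptions, not the paper's).
% s_i^t : state of cell i at iteration t;  x_i : input at cell i.
\begin{align}
  s_i^{t+1} &= f_\theta\!\left(s_{i-r}^{t},\dots,s_{i+r}^{t};\; x_{i-r},\dots,x_{i+r}\right)
    && \text{locality: finite radius } r \\
  f_\theta &\ \text{does not depend on } i \text{ or } t
    && \text{symmetry: one kernel shared over space and time} \\
  d\!\left(F_{\theta,x}(s),\, F_{\theta,x}(s')\right) &\le q\, d(s, s'), \quad 0 \le q < 1
    && \text{stability: contraction} \\
  s^{\ast} &= F_{\theta,x}(s^{\ast})
    && \text{fixed point, read out as the answer}
\end{align}
% F_{\theta,x} is the global map that applies f_\theta at every position.
```

If the third line holds on a complete metric space, the Banach fixed-point theorem gives a unique fixed point $s^{\ast}$ and the bound $d(s^{t}, s^{\ast}) \le \frac{q^{t}}{1-q}\, d(s^{1}, s^{0})$, which is the kind of explicit convergence criterion the referee asks for.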

Circularity Check

0 steps flagged

No circularity: derivation asserted but not reduced to inputs by construction

Full rationale

The paper asserts that the three postulates (locality, symmetry, stability) directly yield the SEAD architecture of iterated local convolutional rules to attractor, yet the provided abstract and description contain no equations, self-citations, or fitted parameters that reduce the claimed derivation to a tautology or prior result by the same authors. No load-bearing step equates the output architecture to its inputs by definition, renames a known result, or imports uniqueness via self-citation. The central claim remains an assertion whose validity can be checked against external benchmarks (e.g., whether other models satisfying the same postulates also generalize), but the derivation chain itself does not collapse into circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 3 axioms · 1 invented entity

The central claim rests on three domain assumptions presented as physical postulates plus the assertion that they suffice to derive the SEAD model; no free parameters or new entities are explicitly quantified in the abstract.

axioms (3)
  • [domain assumption] Locality: information propagates at finite speed
    Postulate (1) invoked to motivate local convolutional rules
  • [domain assumption] Symmetry: laws of computation invariant across space and time
    Postulate (2) invoked to justify translation-invariant updates
  • [domain assumption] Stability: system converges to discrete attractors resisting noise
    Postulate (3) invoked to justify iteration until convergence
invented entities (1)
  • SEAD architecture [no independent evidence]
    purpose: Neural cellular automaton implementing the three postulates
    New architecture introduced to satisfy the postulates; no independent falsifiable prediction outside the reported tasks is given




Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. On the Mirage of Long-Range Dependency, with an Application to Integer Multiplication

    cs.LG 2026-03 unverdicted novelty 8.0

    Long-range dependency in integer multiplication is a mirage from 1D representation; a 2D grid reduces it to local 3x3 operations, letting a 321-parameter neural cellular automaton generalize perfectly to inputs 683 ti...

  2. Structural Generalization on SLOG without Hand-Written Rules

    cs.CL 2026-04 unverdicted novelty 7.0

    A neural cellular automaton model learns all compositional rules from data via local iteration and achieves 100% type-exact match on 11 of 17 structural generalization categories on the SLOG benchmark.

  3. On the Emergence of Syntax by Means of Local Interaction

    cs.CL 2026-04 unverdicted novelty 7.0

    A 2D neural cellular automaton spontaneously self-organizes into a Proto-CKY representation that exhibits syntactic processing capabilities for context-free grammars when trained on membership problems.

  4. Structural Generalization on SLOG without Hand-Written Rules

    cs.CL 2026-04 unverdicted novelty 6.0

    A neural cellular automaton learns compositional rules from data alone to achieve structural generalization on the SLOG semantic parsing benchmark, reaching 67.3% accuracy and fully succeeding on 11 of 17 categories.
