Recognition: 2 theorem links · Lean Theorem
MinMax Recurrent Neural Cascades
Pith reviewed 2026-05-11 01:43 UTC · model grok-4.3
The pith
MinMax Recurrent Neural Cascades express every regular language, evaluate in logarithmic parallel time, and can carry a state gradient of exactly one across arbitrary distances.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MinMax RNCs are built by cascading layers whose recurrence follows the MinMax algebra. This construction yields models whose formal expressivity includes all regular languages, whose parallel evaluation runs in time logarithmic in the sequence length, whose states and activations remain uniformly bounded for every input length, and whose loss gradients exist and stay bounded almost everywhere. Their state gradients need not vanish: the derivative of a state with respect to an earlier state can equal one independently of the time gap between them.
What carries the argument
The MinMax recurrence step, which replaces the standard linear-plus-nonlinearity update with direct min and max operations over previous states and current inputs, then cascaded across multiple layers.
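A minimal sketch of such a step, assuming the (max, min)-semiring reading of the recurrence x_t = (R_t ⊗ x_{t−1}) ⊕ s_t quoted further down this page, with ⊕ taken as elementwise max and ⊗ as the max-min matrix-vector product; the paper's exact parameterization and layer wiring may differ:

```python
import numpy as np

def maxmin_matvec(R, x):
    """(R ⊗ x)_i = max_j min(R_ij, x_j): the max-min analogue of a matrix-vector product."""
    return np.max(np.minimum(R, x[None, :]), axis=1)

def minmax_step(R_t, x_prev, s_t):
    """One recurrence step x_t = (R_t ⊗ x_{t-1}) ⊕ s_t, reading ⊕ as elementwise max."""
    return np.maximum(maxmin_matvec(R_t, x_prev), s_t)

def run_layer(x0, Rs, ss):
    """Sequential evaluation of a single cascade layer over a length-T sequence."""
    x, states = x0, []
    for R_t, s_t in zip(Rs, ss):
        x = minmax_step(R_t, x, s_t)
        states.append(x)
    return np.stack(states)
```

Because every operation selects one of its arguments, the Jacobian of x_t with respect to x_{t−1} is a 0/1 matrix wherever the compared values are distinct, which is where the boundedness and constant-gradient claims originate.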
If this is right
- Any regular language can be recognized exactly by some MinMax RNC of finite size.
- Sequence evaluation can be performed either sequentially or in parallel with runtime O(log n) given sufficient processors (a sketch of the associative scan behind this follows the list).
- States and activations stay inside fixed numerical bounds regardless of how long the input grows.
- The gradient of any state with respect to a state from arbitrarily far in the past can remain exactly one at almost all parameter settings.
- On synthetic tasks that stress long-range dependencies the models reach perfect solutions while standard recurrent networks do not.
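The O(log n) claim rests on maps of the form x ↦ (R ⊗ x) ⊕ s being closed under composition, because min distributes over max; a prefix scan over the per-step pairs (R_t, s_t) can then replace the sequential loop. A toy sketch under the same semiring reading as above (a real implementation would use a work-efficient parallel scan across processors rather than Python recursion):

```python
import numpy as np

def mm_matmul(A, B):
    """Max-min matrix product: (A ⊗ B)_ij = max_k min(A_ik, B_kj)."""
    return np.max(np.minimum(A[:, :, None], B[None, :, :]), axis=1)

def mm_matvec(A, x):
    """Max-min matrix-vector product: (A ⊗ x)_i = max_j min(A_ij, x_j)."""
    return np.max(np.minimum(A, x[None, :]), axis=1)

def combine(first, second):
    """Associative composition of two steps, each a pair (R, s) encoding x -> (R ⊗ x) ⊕ s.
    Applying `first` then `second` yields (R2 ⊗ R1, (R2 ⊗ s1) ⊕ s2)."""
    R1, s1 = first
    R2, s2 = second
    return mm_matmul(R2, R1), np.maximum(mm_matvec(R2, s1), s2)

def compose_range(steps):
    """Combine T steps in a balanced tree of depth O(log T); a full prefix scan of
    the same operator yields every intermediate state map at the same depth."""
    if len(steps) == 1:
        return steps[0]
    mid = len(steps) // 2
    return combine(compose_range(steps[:mid]), compose_range(steps[mid:]))

# Recovering the final state from the composed map:
#   R_tot, s_tot = compose_range(list(zip(Rs, ss)))
#   x_T = np.maximum(mm_matvec(R_tot, x0), s_tot)
```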
Where Pith is reading between the lines
- The uniform bounds and non-vanishing gradients could allow reliable training on sequences far longer than those feasible for vanilla RNNs or LSTMs.
- Hybrid architectures that insert MinMax layers into larger networks might inherit stability while adding other inductive biases.
- The non-differentiable loci might still require careful initialization or smoothing in very deep cascades before the theoretical advantages appear on noisy real-world data.
Load-bearing premise
The points where min and max are non-differentiable do not block reliable gradient-based training, and the algebraic bounds proved for single layers continue to hold once the layers are stacked and the whole network is optimized on actual data.
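One way to make this premise concrete, with the scalar clamp update x_t = max(min(x_{t−1}, a_t), s_t) standing in for the general cascade step (a sketch under that simplification, not the paper's construction):

```latex
\frac{\partial x_t}{\partial x_{t-1}}
  = \mathbf{1}\!\left[\, s_t < x_{t-1} < a_t \,\right] \in \{0,1\}
  \quad \text{a.e.},
\qquad
\frac{\partial x_T}{\partial x_s}
  = \prod_{t=s+1}^{T} \frac{\partial x_t}{\partial x_{t-1}} \in \{0,1\}
  \quad \text{a.e.}
```

The product over time is therefore itself either 0 or 1, never a cascade of factors strictly inside (0, 1); what the premise asserts is that the measure-zero tie set, where the indicator is undefined and only a Clarke-style subgradient is available, does not derail training once layers are stacked and optimized on data.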
What would settle it
Training a MinMax RNC on a regular language such as balanced parentheses and observing either failure to reach perfect accuracy or a state gradient that shrinks toward zero with increasing distance would falsify the expressivity or gradient claims.
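The gradient half of that test is easy to probe in autograd. The toy below is a stand-in, not the paper's architecture: an elementwise clamp-style chain whose bounds never bind, so the state gradient should equal one per entry at any distance T; making the bounds random would flip individual entries to hard zeros, but never to a shrinking product:

```python
import torch

def state_gradient_after(T=1000, n=4):
    """Probe d(sum x_T)/d x_0 through T elementwise min/max updates.
    With non-binding bounds x_T == x_0, so every gradient entry is exactly 1
    regardless of T; a vanishing-gradient architecture would shrink it instead."""
    x0 = torch.rand(n, requires_grad=True)
    upper, lower = torch.ones(n), torch.zeros(n)
    x = x0
    for _ in range(T):
        x = torch.maximum(torch.minimum(x, upper), lower)
    grad, = torch.autograd.grad(x.sum(), x0)
    return grad

print(state_gradient_after(T=1000))  # tensor([1., 1., 1., 1.])
```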
Original abstract
We show that the MinMax algebra provides a form of recurrence that is expressively powerful, efficiently implementable, and most importantly it is not affected by vanishing or exploding gradient. We call MinMax Recurrent Neural Cascades (RNCs) the models obtained by cascading several layers of neurons that employ such recurrence. We show that MinMax RNCs enjoy many favourable theoretical properties. First, their formal expressivity includes all regular languages, arguably the maximal expressivity for a finite-memory system. Second, they can be evaluated in parallel with a runtime that is logarithmic in the input length given enough processors; and they can also be evaluated sequentially. Third, their state and activations are bounded uniformly for all input lengths. Fourth, at almost all points, their loss gradient exists and it is bounded. Fifth, they do not exhibit a vanishing state gradient: the gradient of a state w.r.t. a past state can have constant value one regardless of the time distance between the two states. Finally, we find empirical evidence that the favourable theoretical properties of MinMax RNCs are matched by their practical capabilities: they are able to perfectly solve a number of synthetic tasks, showing superior performance compared to the considered state-of-the-art recurrent neural networks; also, we train a MinMax RNC of 127M parameters on next-token prediction, and the obtained model shows competitive performance for its size, providing evidence of the potential of MinMax RNCs on real-world tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MinMax Recurrent Neural Cascades (RNCs) based on min-max algebra for recurrence. It claims these models express all regular languages, support parallel logarithmic-time evaluation (and sequential), maintain uniform bounds on states and activations for any input length, have loss gradients that exist at almost all points and are bounded, and exhibit non-vanishing state gradients where the partial derivative of a state at time t w.r.t. a past state at s equals 1 independently of |t-s|. Empirical support is given via perfect performance on synthetic tasks (outperforming standard RNNs) and competitive results from a 127M-parameter model on next-token prediction.
Significance. If the central derivations hold, the work provides a theoretically grounded RNN variant with automata-theoretic expressivity, parallel efficiency, and explicit non-vanishing gradient guarantees derived from the min-max semiring. The explicit connection to regular languages and the scaling demonstration to 127M parameters are notable strengths that could influence recurrent architecture design for long-range tasks.
major comments (2)
- [gradient analysis section] Abstract and the gradient analysis section: the claim that 'the gradient of a state w.r.t. a past state can have constant value one regardless of the time distance' is derived from algebraic identities in the min-max semiring that hold only at differentiable points. Cascading layers composes these operations and enlarges the non-differentiable set (where arguments are equal); the manuscript states gradients exist 'at almost all points' but does not establish that the constant-1 identity survives in the subgradient sense after composition, which is load-bearing for the no-vanishing-gradient and trainability claims.
- [empirical evaluation section] The 127M-parameter next-token experiment: while competitive performance is reported, no ablation or gradient-norm analysis is provided to test whether optimization actually exploits the claimed constant state-gradient property versus other convergence mechanisms. This weakens the link between the theoretical gradient bound and the practical scaling result.
minor comments (2)
- The cascade construction and its relation to standard automata should be illustrated with a small explicit example (e.g., a 2-state acceptor) to make the expressivity proof more accessible; a toy sketch of that flavour follows these comments.
- Notation for the min and max operations and the recurrence update should be unified across the theoretical and experimental sections to avoid ambiguity in the boundedness proofs.
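For the flavour of example being requested, a toy sketch not taken from the paper: the two-state reset acceptor for "contains at least one marked symbol" is realized by the single max recurrence x_t = max(x_{t−1}, s_t); the 0/1 symbol encoding is an assumption of this sketch.

```python
def accepts_contains_marked(word, marked="a"):
    """Two-state acceptor for 'word contains at least one marked symbol',
    run as the max recurrence x_t = max(x_{t-1}, s_t) with state x in {0, 1}
    and s_t = 1 exactly when the t-th symbol is the marked one."""
    x = 0  # start state: nothing seen yet
    for ch in word:
        s_t = 1 if ch == marked else 0
        x = max(x, s_t)
    return x == 1  # accepting state

assert accepts_contains_marked("bbab") and not accepts_contains_marked("bbbb")
```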
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment in detail below, indicating revisions where appropriate to clarify and strengthen the presentation of our results.
Point-by-point responses
Referee: [gradient analysis section] Abstract and the gradient analysis section: the claim that 'the gradient of a state w.r.t. a past state can have constant value one regardless of the time distance' is derived from algebraic identities in the min-max semiring that hold only at differentiable points. Cascading layers composes these operations and enlarges the non-differentiable set (where arguments are equal); the manuscript states gradients exist 'at almost all points' but does not establish that the constant-1 identity survives in the subgradient sense after composition, which is load-bearing for the no-vanishing-gradient and trainability claims.
Authors: The referee correctly identifies that the core algebraic identities for the constant gradient of 1 are established at points of differentiability for the basic min-max recurrence. The manuscript already states that gradients exist at almost all points, which follows because the non-differentiable loci (where min or max arguments are equal) form a lower-dimensional subset of the input space. For cascaded layers, this property is preserved under composition: the overall non-differentiable set remains a union of measure-zero manifolds. We agree that an explicit argument for the subgradient (in the sense of Clarke's generalized gradient) is needed to confirm that 1 remains in the subgradient after chaining the operations. We will revise the gradient analysis section to include a short proof sketch showing that the chain rule for subgradients preserves the constant-1 element in the limiting sense from nearby differentiable points, thereby supporting the non-vanishing claim for the full cascaded model. (Revision: yes.)
Referee: [empirical evaluation section] The 127M-parameter next-token experiment: while competitive performance is reported, no ablation or gradient-norm analysis is provided to test whether optimization actually exploits the claimed constant state-gradient property versus other convergence mechanisms. This weakens the link between the theoretical gradient bound and the practical scaling result.
Authors: We acknowledge that the 127M-parameter experiment reports competitive next-token prediction performance without accompanying ablation studies or explicit gradient-norm measurements that would isolate the contribution of the constant state-gradient property. The empirical section is presented as supporting evidence of practical scalability rather than a direct empirical validation of the gradient theory; the theoretical non-vanishing result stands independently. Nevertheless, the successful training of a model at this scale without divergence is consistent with the bounded-gradient guarantees. We will add a paragraph in the empirical evaluation section discussing how the uniform state bounds and non-vanishing gradient property provide a theoretical basis for stable optimization at large parameter counts, and we will include a brief note on the absence of vanishing gradients observed during training. (Revision: partial.)
Circularity Check
No circularity: properties derived from external min/max algebra and automata theory
Full rationale
The derivation begins from the explicit definition of MinMax recurrence (using min and max as the state-update operators) and proceeds via standard algebraic identities and automata constructions to prove expressivity, parallelizability, uniform bounds, and non-vanishing state gradients. These steps rely on external facts about min/max (which are differentiable almost everywhere) and regular-language theory rather than on any fitted parameter, self-referential definition, or load-bearing self-citation. No claimed prediction or result is shown to equal its own input by construction; the 127M-parameter experiment is presented only as empirical corroboration, not as part of the formal chain.
Axiom & Free-Parameter Ledger
axioms (2)
- [standard math] The min and max operations of the min-max semiring are associative, commutative, and mutually distributive in the manner required for the recurrence definitions (spot-checked in the snippet below).
- [standard math] Finite automata recognize exactly the regular languages.
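A throwaway numerical spot-check of these identities (not part of the paper):

```python
import random

for _ in range(1000):
    a, b, c = (random.random() for _ in range(3))
    assert min(a, max(b, c)) == max(min(a, b), min(a, c))     # min distributes over max
    assert max(a, min(b, c)) == min(max(a, b), max(a, c))     # max distributes over min
    assert min(min(a, b), c) == min(a, min(b, c))              # associativity
    assert max(a, b) == max(b, a) and min(a, b) == min(b, a)   # commutativity
```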
invented entities (1)
- MinMax Recurrent Neural Cascade (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean (J-cost uniqueness, washburn_uniqueness_aczel): reality_from_one_distinction, Jcost_pos_of_ne_one (tagged unclear)
  Why unclear: the relation between the paper passage and the cited Recognition theorem could not be established.
  Paper passage: MinMax recurrence x_t = (R_t ⊗ x_{t−1}) ⊕ s_t … piece-wise linear operator with constant unit norm … ∇_x u_t(x_0) = 1 for all x_0 ∈ (a_t, b_t)
- IndisputableMonolith/Foundation/DimensionForcing.lean (8-tick period, 2^D = 8): alexander_duality_circle_linking (tagged unclear)
  Why unclear: the relation between the paper passage and the cited Recognition theorem could not be established.
  Paper passage: parallel scan … associative operator ⊕⊗ … depth O(log T log N)
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.