Recognition: unknown
Training Transformers as a Universal Computer
Pith reviewed 2026-05-07 16:31 UTC · model grok-4.3
The pith
A transformer can learn to execute any program in a universal language like MicroPy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A standard transformer trained to predict small-step execution of MicroPy programs using PENCIL scaffolding generalizes from random programs to human-written ones, including binary addition, multiplication, and SAT verification, indicating that transformers can be trained to act as universal computers.
What carries the argument
PENCIL scaffolding, which enables space-efficient execution of MicroPy programs within the transformer's bounded context window by predicting small-step operations.
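The page does not reproduce MicroPy's semantics, so it helps to pin down what "small-step execution" means. A minimal sketch over a toy expression grammar (an assumption for illustration; the paper's actual MicroPy syntax is not shown here):

```python
# Minimal sketch of small-step execution over a toy expression grammar.
# This grammar is an assumption for illustration; it is not the paper's
# actual MicroPy syntax.
from dataclasses import dataclass

@dataclass
class Add:
    left: object
    right: object

def step(expr):
    """One small-step reduction; returns (new_expr, is_value)."""
    if isinstance(expr, int):
        return expr, True                       # already a value
    if not isinstance(expr.left, int):          # reduce left operand first
        left, _ = step(expr.left)
        return Add(left, expr.right), False
    if not isinstance(expr.right, int):         # then the right operand
        right, _ = step(expr.right)
        return Add(expr.left, right), False
    return expr.left + expr.right, False        # both are values: add

def trace(expr):
    """Full small-step trace: one intermediate state per step."""
    states, done = [expr], False
    while not done:
        expr, done = step(expr)
        if not done:
            states.append(expr)
    return states

# trace(Add(Add(1, 2), Add(3, 4))) yields the four states
# Add(Add(1, 2), Add(3, 4)) -> Add(3, Add(3, 4)) -> Add(3, 7) -> 10
```

The trace, not just the final value, is the supervision signal: the transformer is trained to emit each intermediate state in turn.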
If this is right
- Transformers can generalize to novel programs not seen during training.
- Successful execution on diverse tasks such as SAT verification and solving follows from MicroPy's computational universality (a sketch of the verification task follows this list).
- The model can perform out-of-distribution generalization on program evaluation.
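SAT verification, one of the reported test tasks, is compact enough to state concretely. A sketch in plain Python of what such a checker computes (the paper's MicroPy source for the task is not shown):

```python
# Sketch of SAT verification in plain Python; the paper's MicroPy
# version of this program is not shown on this page.
def verify_sat(clauses, assignment):
    """Check a CNF formula against a truth assignment.

    clauses: list of clauses, each a list of nonzero ints in DIMACS
             style (positive literal = variable, negative = negation).
    assignment: dict mapping variable index to bool.
    """
    for clause in clauses:
        if not any(assignment[abs(lit)] == (lit > 0) for lit in clause):
            return False  # no satisfied literal in this clause
    return True

# (x1 or not x2) and (x2 or x3) under x1=True, x2=False, x3=True:
assert verify_sat([[1, -2], [2, 3]], {1: True, 2: False, 3: True})
```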
Where Pith is reading between the lines
- If this holds, transformers might be extended to more complex languages or real-world applications by similar training.
- Further work could test whether the approach scales to larger programs or different architectures.
- Connections exist to using AI models for automated program interpretation and execution.
Load-bearing premise
That performance on the specific set of human-written programs tested is sufficient proof that the transformer can handle any computation expressible in MicroPy.
What would settle it
Finding a MicroPy program that the trained transformer cannot execute correctly, despite having been trained on random programs of similar complexity.
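Such a search is mechanical to set up. A minimal property-testing sketch, where `gen_program`, `interpret`, and `model_execute` are hypothetical stand-ins (the paper names none of these interfaces):

```python
# Counterexample search for the settling question. All three callables
# are hypothetical stand-ins; the paper does not name its interpreter
# or model interface.
import random

def find_counterexample(gen_program, interpret, model_execute,
                        trials=10_000, seed=0):
    """Return the first (program, expected, predicted) triple where the
    model diverges from the ground-truth interpreter, or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        prog = gen_program(rng)
        expected = interpret(prog)
        predicted = model_execute(prog)
        if predicted != expected:
            return prog, expected, predicted
    return None
```

A single hit falsifies the universality reading; trials that pass only bound the failure rate on the sampled distribution, which is why the referee asks for quantitative metrics.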
original abstract
We demonstrate that a small transformer can learn to execute programs in MicroPy, a simplified yet computationally universal programming language. Given procedure definitions together with an expression to evaluate, the transformer predicts small-step execution using PENCIL scaffolding for space-efficient execution within a bounded context window. After training on randomly generated, meaningless MicroPy programs, the learned transformer generalizes to various human-written programs including bit copying and flipping, binary addition and multiplication, and SAT verification and solving. We note that the trained model achieves out-of-distribution generalization; i.e., it evaluates novel programs drawn from a different distribution over programs. Since MicroPy can express any computation, our results provide empirical evidence that a standard transformer can be trained to act as a universal computer.
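The training corpus of "randomly generated, meaningless" programs suggests a grammar-driven sampler. A minimal sketch over an assumed toy grammar (the paper's actual generator is not shown on this page):

```python
# Depth-bounded random expression sampler; the grammar is illustrative
# only, not the paper's MicroPy program generator.
import random

def random_expr(rng, depth):
    if depth == 0 or rng.random() < 0.3:
        return str(rng.randint(0, 9))        # leaf: a digit literal
    op = rng.choice(["+", "*", "-"])
    left = random_expr(rng, depth - 1)
    right = random_expr(rng, depth - 1)
    return f"({left} {op} {right})"

rng = random.Random(0)
print(random_expr(rng, depth=3))             # e.g. "((8 * 3) + 5)"
```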
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that a small transformer trained on randomly generated MicroPy programs (a simplified universal language) can generalize to execute human-written programs including bit operations, arithmetic, and SAT verification/solving. It uses PENCIL scaffolding for space-efficient small-step execution within a bounded context window and concludes that this provides empirical evidence that standard transformers can act as universal computers.
Significance. If the empirical results hold with rigorous quantitative validation, the work would offer a notable demonstration that transformers can internalize the semantics of a Turing-complete language from random training data, with implications for neural program execution, reasoning, and the computational expressivity of attention-based models. The scaffolding technique for bounded contexts is a practical engineering contribution.
major comments (3)
- [Abstract] The generalization claim to human-written programs (bit copying/flipping, binary addition/multiplication, SAT) is stated without any quantitative metrics such as accuracy, error rates, success percentages, or failure modes, making it impossible to assess whether the evidence supports the universality conclusion.
- [Results] Success on the listed specific human-written programs does not establish the ability to execute arbitrary MicroPy programs; programs with deeper nesting, longer execution traces, or control structures outside the tested set could fail even if the reported cases succeed, leaving the central claim under-supported.
- [Methods] The PENCIL scaffolding is presented as enabling bounded-context universal execution, but without details on model size, training procedure, dataset statistics, or how the scaffolding interacts with the transformer's context window, the load-bearing mechanism for the universality claim cannot be evaluated.
minor comments (2)
- [Abstract] Clarify the exact definition and implementation of 'out-of-distribution generalization' (novel programs from the program distribution) versus true extrapolation to arbitrary computations.
- [Results] The manuscript would benefit from a table summarizing the human-written test programs, their complexities, and observed performance.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment below and will incorporate revisions to strengthen the manuscript.
point-by-point responses
- Referee: [Abstract] The generalization claim to human-written programs (bit copying/flipping, binary addition/multiplication, SAT) is stated without any quantitative metrics such as accuracy, error rates, success percentages, or failure modes, making it impossible to assess whether the evidence supports the universality conclusion.
Authors: We agree that the abstract would benefit from quantitative metrics to support the generalization claims. In the revised version, we will include specific accuracy rates, success percentages, and notes on observed failure modes for the human-written programs (e.g., bit operations, arithmetic, SAT tasks). This will make the empirical evidence more transparent and allow better evaluation of the universality conclusion. revision: yes
- Referee: [Results] Success on the listed specific human-written programs does not establish the ability to execute arbitrary MicroPy programs; programs with deeper nesting, longer execution traces, or control structures outside the tested set could fail even if the reported cases succeed, leaving the central claim under-supported.
Authors: We acknowledge that success on a finite set of programs provides only empirical support rather than exhaustive proof of arbitrary execution. Our claim rests on MicroPy being Turing-complete combined with observed out-of-distribution generalization. To address the concern, we will expand the results with additional test cases involving deeper nesting, longer traces, and varied control structures, while explicitly discussing the limitations of empirical testing for universality. revision: partial
- Referee: [Methods] The PENCIL scaffolding is presented as enabling bounded-context universal execution, but without details on model size, training procedure, dataset statistics, or how the scaffolding interacts with the transformer's context window, the load-bearing mechanism for the universality claim cannot be evaluated.
Authors: We will substantially expand the Methods section in the revision to include model architecture details (layers, heads, embedding size), full training procedure (optimizer, learning rate, epochs, loss), dataset statistics (program counts, length distributions, generation process), and a precise description of how PENCIL scaffolding bounds the context window while preserving small-step execution semantics. This will allow full evaluation of the mechanism. revision: yes
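The mechanism the rebuttal promises to document is PENCIL's erasure rule (Yang et al., arXiv:2503.14337): once a sub-computation returns, its intermediate tokens are deleted and only the answer stays in context, schematically [CALL] thoughts [SEP] answer [RETURN] -> answer. A minimal sketch with an illustrative token vocabulary (not the paper's exact tokens):

```python
# PENCIL-style context reduction: erase a completed call's intermediate
# "thought" tokens, keeping only its answer. Token names here are
# illustrative, not the paper's exact vocabulary.
CALL, SEP, RETURN = "[CALL]", "[SEP]", "[RETURN]"

def reduce_context(tokens):
    """Repeatedly apply the erasure rule to the innermost finished call."""
    out = list(tokens)
    while RETURN in out:
        r = out.index(RETURN)                              # first return
        s = max(i for i in range(r) if out[i] == SEP)      # its separator
        c = max(i for i in range(s) if out[i] == CALL)     # its call
        out = out[:c] + out[s + 1:r] + out[r + 1:]         # keep answer only
    return out

ctx = ["eval", CALL, "1+2", "=", SEP, "3", RETURN, "next"]
assert reduce_context(ctx) == ["eval", "3", "next"]
```

This is how execution length can grow past the context budget: the live context holds only the frontier of the computation, never the whole trace.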
Circularity Check
No circularity: empirical training and OOD testing on MicroPy execution
full rationale
The paper reports an empirical procedure: train a transformer on randomly generated MicroPy programs, then evaluate small-step execution on a separate set of human-written programs (bit ops, arithmetic, SAT). The universality claim rests on the external property that MicroPy can express any computation plus observed generalization, not on any derivation that reduces to the training inputs by construction. No equations, fitted parameters renamed as predictions, self-citations, or ansatzes appear in the provided text. The test programs are distinct from the random training distribution, so success is not forced by the fitting process itself. This is a standard ML generalization experiment.
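The rationale's key factual claim, that the test programs are distinct from the training distribution, is checkable with a plain overlap test. A sketch with hypothetical corpus names:

```python
# Overlap check implied by the rationale; `train_corpus` and
# `test_programs` are hypothetical names for the two program sets.
def assert_disjoint(train_corpus, test_programs):
    train_set = set(train_corpus)
    leaked = [p for p in test_programs if p in train_set]
    if leaked:
        raise AssertionError(f"{len(leaked)} test programs appear in training data")
```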
Axiom & Free-Parameter Ledger
invented entities (2)
- MicroPy: no independent evidence
- PENCIL scaffolding: no independent evidence
Ethics Statement
We do not identify specific ethical concerns that require separate discussion beyond standard scientific responsibility. We disclose that we used Codex with GPT-5.2 and GPT-5.4 to assist with implementing the experiments, Claude Code to assist with producing figures, and GPT-5.4 and Claude to assist with paraphrasing and writing the paper...
discussion (0)