pith. machine review for the scientific record.

arxiv: 2604.25166 · v1 · submitted 2026-04-28 · 💻 cs.AI

Recognition: unknown

Training Transformers as a Universal Computer


Pith reviewed 2026-05-07 16:31 UTC · model grok-4.3

classification 💻 cs.AI
keywords transformer · program execution · universal computer · MicroPy · generalization · SAT

The pith

A transformer can learn to execute any program in a universal language like MicroPy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a small transformer, trained only on randomly generated, meaningless MicroPy programs, can then correctly run human-written programs that perform bit operations and arithmetic and even verify and solve SAT problems. MicroPy is designed to be computationally universal, meaning it can express any possible computation. This provides empirical evidence that standard transformers can be trained as universal computers capable of handling arbitrary tasks.

Core claim

A standard transformer trained to predict small-step execution of MicroPy programs using PENCIL scaffolding generalizes from random programs to human-written ones, including binary addition, multiplication, and SAT verification, indicating that transformers can be trained to act as universal computers.
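The "small-step execution" being predicted here can be made concrete with a toy reduction sketch. The paper's actual MicroPy syntax and trace format are not reproduced in this review, so the mini expression language below (nested tuples) is an illustrative assumption, not the paper's implementation:

```python
# Hypothetical sketch of small-step execution: each step rewrites one
# innermost reducible sub-expression, producing the kind of trace the
# transformer is trained to predict token by token.

def step(expr):
    """Rewrite one innermost redex; return (new_expr, fully_reduced)."""
    if isinstance(expr, int):
        return expr, True
    op, a, b = expr
    a2, a_done = step(a)
    if not a_done:
        return (op, a2, b), False
    b2, b_done = step(b)
    if not b_done:
        return (op, a, b2), False
    return {"add": a + b, "mul": a * b}[op], False

def trace(expr):
    """Full small-step trace from expr down to a value."""
    steps = [expr]
    while True:
        expr, done = step(expr)
        if done:
            return steps
        steps.append(expr)

# ("add", ("mul", 2, 3), 1) reduces to ("add", 6, 1), then to 7.
```

Training on such traces, rather than on input-output pairs alone, is what lets a next-token predictor carry out arbitrarily long computations one rewrite at a time.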

What carries the argument

PENCIL scaffolding, which enables space-efficient execution of MicroPy programs within the transformer's bounded context window by predicting small-step operations.
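Going by the cited PENCIL paper ("Long thoughts with short memory"), the scaffolding can be understood as a rewrite rule that erases finished intermediate work so the live context stays bounded. A minimal sketch, assuming CALL/SEP/RETURN-style markers (the marker names and exact reduction rule are assumptions, not the paper's spec):

```python
# PENCIL-style context reduction sketch: once a sub-computation between
# CALL and RETURN markers finishes, everything except its answer (the
# tokens between SEP and RETURN) is erased from the context.

CALL, SEP, RETURN = "[CALL]", "[SEP]", "[RETURN]"

def reduce_context(tokens):
    """Collapse the innermost well-formed CALL ... SEP answer RETURN
    span down to just the answer; return tokens unchanged if no RETURN."""
    for j, tok in enumerate(tokens):
        if tok == RETURN:
            i = max(k for k in range(j) if tokens[k] == CALL)   # innermost CALL
            s = max(k for k in range(i, j) if tokens[k] == SEP)  # its SEP
            return tokens[:i] + tokens[s + 1:j] + tokens[j + 1:]
    return tokens
```

Applied repeatedly between generation steps, this keeps the prompt to roughly the size of the live stack rather than the full execution history, which is what makes long executions fit a bounded context window.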

If this is right

  • Transformers can generalize to novel programs not seen during training.
  • Successful execution on diverse tasks like SAT solving follows from the universal nature of MicroPy.
  • The model can perform out-of-distribution generalization on program evaluation.
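For concreteness, SAT verification is representative of the human-written programs the model must execute. A plain-Python analogue (the paper's versions are written in MicroPy; this clause encoding is an assumption):

```python
# A CNF formula as a list of clauses; each clause is a list of signed
# ints, where literal k means variable k and -k means its negation.

def verify_sat(clauses, assignment):
    """Check that every clause has at least one satisfied literal."""
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )

# (x1 or not x2) and (x2 or x3) under x1=True, x2=False, x3=True
clauses = [[1, -2], [2, 3]]
assignment = {1: True, 2: False, 3: True}
```

Executing a program like this correctly requires the model to track nested loops, variable bindings, and short-circuit logic across a long trace, none of which appears verbatim in the random training programs.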

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If this holds, transformers might be extended to more complex languages or real-world applications by similar training.
  • Further work could test whether the approach scales to larger programs or different architectures.
  • Connections exist to using AI models for automated program interpretation and execution.

Load-bearing premise

That performance on the specific set of human-written programs tested is sufficient proof that the transformer can handle any computation expressible in MicroPy.

What would settle it

Finding a MicroPy program that the trained transformer cannot execute correctly, despite having been trained on random programs of similar complexity.

Figures

Figures reproduced from arXiv: 2604.25166 by Chenxiao Yang, David McAllester, Ruize Xu, Yanhong Li.

Figure 1
Figure 1. Evaluation of a MicroPy expression. We define foo and bar to both be the identity …
Figure 2
Figure 2. Sampled plan and matched trace step (general stack). Some details (e.g., binding …
Figure 3
Figure 3. Experiment summary plots. Panel a visualizes the length gap between generator-sampled training programs and the held-out evaluation suite; panel b compares the pooled operation profiles of the training generator and the test programs.
Figure 4
Figure 4. Leaf retrieval procedures. Each panel highlights the current (deepest) frame and the single prompt block it consults (if any). LookupVar retrieves a binding from the top Env(...) frame, while LookupAttr and HasAttr read the store inside State/Effects (after evaluating their object subexpression).
Figure 5
Figure 5. Dictionary-backed steps. (Left) ExpK expansion retrieves D.ExpK = <primitive>(...) from ExpDefs. (Right) Apply (App) retrieves FD f = lambda(...) ExpBody; from FunDefs, then pushes a fresh Env(...) frame binding arguments before evaluating ExpBody.
Figure 6
Figure 6. Control and effect composition. These primitives are compositional: they create sub-calls and combine the returned Eff(...) lists and values. Assert appends an Assertion(o,a,v) to the effect list; Seq concatenates effects and returns the second value; If evaluates the test and rewrites to the selected branch; and Equal compares the identities of the evaluated operands.
Figure 7
Figure 7. Per-task operation profiles for the held-out evaluation suite. Each subplot compares …
Original abstract

We demonstrate that a small transformer can learn to execute programs in MicroPy, a simplified yet computationally universal programming language. Given procedure definitions together with an expression to evaluate, the transformer predicts small-step execution using PENCIL scaffolding for space-efficient execution within a bounded context window. After training on randomly generated, meaningless MicroPy programs, the learned transformer generalizes to various human-written programs including bit copying and flipping, binary addition and multiplication, and SAT verification and solving. We note that the trained model achieves out-of-distribution generalization; i.e., it evaluates novel programs drawn from outside the training distribution. Since MicroPy can express any computation, our results provide empirical evidence that a standard transformer can be trained to act as a universal computer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that a small transformer trained on randomly generated MicroPy programs (a simplified universal language) can generalize to execute human-written programs including bit operations, arithmetic, and SAT verification/solving. It uses PENCIL scaffolding for space-efficient small-step execution within a bounded context window and concludes that this provides empirical evidence that standard transformers can act as universal computers.

Significance. If the empirical results hold with rigorous quantitative validation, the work would offer a notable demonstration that transformers can internalize the semantics of a Turing-complete language from random training data, with implications for neural program execution, reasoning, and the computational expressivity of attention-based models. The scaffolding technique for bounded contexts is a practical engineering contribution.

major comments (3)
  1. [Abstract] The generalization claim to human-written programs (bit copying/flipping, binary addition/multiplication, SAT) is stated without any quantitative metrics such as accuracy, error rates, success percentages, or failure modes, making it impossible to assess whether the evidence supports the universality conclusion.
  2. [Results] Success on the listed specific human-written programs does not establish the ability to execute arbitrary MicroPy programs; programs with deeper nesting, longer execution traces, or control structures outside the tested set could fail even if the reported cases succeed, leaving the central claim under-supported.
  3. [Methods] The PENCIL scaffolding is presented as enabling bounded-context universal execution, but without details on model size, training procedure, dataset statistics, or how the scaffolding interacts with the transformer's context window, the load-bearing mechanism for the universality claim cannot be evaluated.
minor comments (2)
  1. [Abstract] Clarify the exact definition and implementation of 'out-of-distribution generalization' (novel programs outside the training distribution) versus true extrapolation to arbitrary computations.
  2. [Results] The manuscript would benefit from a table summarizing the human-written test programs, their complexities, and observed performance.
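The quantitative reporting requested in the comments above could take a simple form. A hypothetical sketch (the paper does not specify its evaluation protocol; both metric choices here are assumptions):

```python
# Two metrics for program-execution evaluation: exact-match accuracy on
# the final value, and mean per-step agreement with the reference trace.

def evaluate(predicted_traces, reference_traces):
    """Return (final-answer accuracy, mean per-step trace accuracy)."""
    exact, step_acc = 0, 0.0
    for pred, ref in zip(predicted_traces, reference_traces):
        exact += pred[-1] == ref[-1]                 # final value matches
        matched = sum(p == r for p, r in zip(pred, ref))
        step_acc += matched / max(len(ref), 1)       # fraction of steps right
    n = len(reference_traces)
    return exact / n, step_acc / n
```

Separating final-answer accuracy from step-level accuracy also surfaces failure modes: a model can reach the right value through a wrong trace, or track the trace faithfully until one late slip.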

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and will incorporate revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] The generalization claim to human-written programs (bit copying/flipping, binary addition/multiplication, SAT) is stated without any quantitative metrics such as accuracy, error rates, success percentages, or failure modes, making it impossible to assess whether the evidence supports the universality conclusion.

    Authors: We agree that the abstract would benefit from quantitative metrics to support the generalization claims. In the revised version, we will include specific accuracy rates, success percentages, and notes on observed failure modes for the human-written programs (e.g., bit operations, arithmetic, SAT tasks). This will make the empirical evidence more transparent and allow better evaluation of the universality conclusion. revision: yes

  2. Referee: [Results] Success on the listed specific human-written programs does not establish the ability to execute arbitrary MicroPy programs; programs with deeper nesting, longer execution traces, or control structures outside the tested set could fail even if the reported cases succeed, leaving the central claim under-supported.

    Authors: We acknowledge that success on a finite set of programs provides only empirical support rather than exhaustive proof of arbitrary execution. Our claim rests on MicroPy being Turing-complete combined with observed out-of-distribution generalization. To address the concern, we will expand the results with additional test cases involving deeper nesting, longer traces, and varied control structures, while explicitly discussing the limitations of empirical testing for universality. revision: partial

  3. Referee: [Methods] The PENCIL scaffolding is presented as enabling bounded-context universal execution, but without details on model size, training procedure, dataset statistics, or how the scaffolding interacts with the transformer's context window, the load-bearing mechanism for the universality claim cannot be evaluated.

    Authors: We will substantially expand the Methods section in the revision to include model architecture details (layers, heads, embedding size), full training procedure (optimizer, learning rate, epochs, loss), dataset statistics (program counts, length distributions, generation process), and a precise description of how PENCIL scaffolding bounds the context window while preserving small-step execution semantics. This will allow full evaluation of the mechanism. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical training and OOD testing on MicroPy execution

full rationale

The paper reports an empirical procedure: train a transformer on randomly generated MicroPy programs, then evaluate small-step execution on a separate set of human-written programs (bit ops, arithmetic, SAT). The universality claim rests on the external property that MicroPy can express any computation plus observed generalization, not on any derivation that reduces to the training inputs by construction. No equations, fitted parameters renamed as predictions, self-citations, or ansatzes appear in the provided text. The test programs are distinct from the random training distribution, so success is not forced by the fitting process itself. This is a standard ML generalization experiment.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

The paper introduces two key constructed elements: MicroPy as a universal language and PENCIL as scaffolding. No free parameters or mathematical axioms are described in the abstract.

invented entities (2)
  • MicroPy no independent evidence
    purpose: A simplified yet computationally universal programming language for training and testing.
    Defined in the abstract as the basis for all training and generalization experiments.
  • PENCIL scaffolding no independent evidence
    purpose: Enables space-efficient small-step execution inside a bounded transformer context window.
    Mentioned as the mechanism allowing the transformer to perform execution steps.

pith-pipeline@v0.9.0 · 5413 in / 1195 out tokens · 56909 ms · 2026-05-07T16:31:11.145145+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

24 extracted references · 23 canonical work pages · 2 internal anchors

  1. [1] arXiv:2406.06467. NeurIPS.
  2. [2] arXiv:1704.06611. ICLR.
  3. [3] arXiv:2405.20671. NeurIPS.
  4. [4] Angeliki Giannou, Shashank Rajput, Jy-yong Sohn, Kangwook Lee, Jason D. Lee, and Dimitris Papailiopoulos. Looped transformers as programmable computers. arXiv:2305.18654.
  5. [5] Rishi Hazra, Gabriele Venturato, Pedro Zuidberg Dos Martires, and Luc De Raedt. Can large language models reason? A characterization via 3-SAT. arXiv:2301.13196.
  6. [6] Kaiying Hou, David Brandfonbrener, Sham Kakade, Samy Jelassi, and Eran Malach. Universal length generalization with Turing Programs. arXiv:2408.07215.
  7. [7] Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks. arXiv:2407.03310.
  8. [8] https://github.com/karpathy/nanochat. Daniel Keysers, Nathanael Schärli, Nathan Scales, Hylke Buisman, Daniel Furrer, Sergii Kashubin, Nikola Momchev, Danila Sinopalnikov, Lukasz Stafiniak, Tibor Tihon, Dmitry Tsarkov, Xiao Wang, Marc van Zee, and Olivier Bousquet. Measuring compositional generalization: A comprehensive method on realistic data.
  9. [9] Najoung Kim and Tal Linzen. COGS: A compositional generalization challenge based on semantic interpretation. arXiv:2010.05465.
  10. [10] Brenden M. Lake and Marco Baroni. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. arXiv:1711.00350.
  11. [11] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. arXiv:2005.11401.
  12. [12] Zhiyuan Li, Hong Liu, Denny Zhou, and Tengyu Ma. Chain of thought empowers transformers to solve inherently serial problems. arXiv:2402.12875.
  13. [13] arXiv:2305.05383. Findings of ACL.
  14. [14] William Merrill and Ashish Sabharwal. The expressive power of transformers with chain of thought. arXiv:2310.07923.
  15. [15] Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, et al. Show your work: Scratchpads for intermediate computation with language models. arXiv:2112.00114.
  16. [16] arXiv:1905.12941. NeurIPS.
  17. [17] arXiv:1511.06279. ICLR.
  18. [18] Sudoku-bench: Evaluating creative reasoning with sudoku variants. arXiv:2505.16135. Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063.
  19. [19] ISSN 1530-888X. doi:10.1162/neco.1996.8.7.
  20. [20] David H. Wolpert. The Lack of A Priori Distinctions Between Learning Algorithms. Neural Computation, 8(7), 1996. doi:10.1162/neco.1996.8.7.1341. Chenxiao Yang, Nathan Srebro, David McAllester, and Zhiyuan Li. PENCIL: Long thoughts with short memory. arXiv:2503.14337.
  21. [21] Wojciech Zaremba and Ilya Sutskever. Learning to execute. arXiv:1410.4615. Dylan Zhang, Curt Tigges, Zory Zhang, Stella Biderman, Maxim Raginsky, and Talia Ringer. Transformer-based models are not yet perfect at learning to emulate structural recursion. arXiv:2401.12947.
  22. [22] Hattie Zhou, Arwen Bradley, Etai Littwin, Noam Razin, Omid Saremi, Josh Susskind, Samy Bengio, and Preetum Nakkiran. What algorithms can transformers learn? A study in length generalization. arXiv:2310.16028.
  23. [23] arXiv:2402.09371. Mingchen Zhuge, Changsheng Zhao, Haozhe Liu, Zijian Zhou, Shuming Liu, Wenyi Wang, Ernie Chang, Gael Le Lan, Junjie Fei, Wenxuan Zhang, Yasheng Sun, Zhipeng Cai, Zechun Liu, Yunyang Xiong, Yining Yang, Yuandong Tian, Yangyang Shi, Vikas Chandra, and Jürgen Schmidhuber. Neural computers. arXiv:2604.06425.
  24. [24] Ethics Statement: We do not identify specific ethical concerns that require separate discussion beyond standard scientific responsibility. We disclose that we used Codex with GPT-5.2 and GPT-5.4 to assist with implementing the experiments, Claude Code to assist with producing figures, and GPT-5.4 and Claude to assist with paraphrasing and writing the paper...