Transformers Linearly Represent Highly Structured World Models
Pith reviewed 2026-05-20 20:59 UTC · model grok-4.3
The pith
A transformer trained on Sudoku traces builds an internal model organized around rows, columns, and boxes rather than individual cells.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The transformer develops a substructure world model in which information is organized around the rows, columns, and boxes that define Sudoku constraints, rather than representing the board state cell by cell. In addition, a small set of dedicated neurons in the final MLP layer forms a naked-single circuit that detects when exactly one digit remains possible for a given cell and promotes that digit.
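To make the naked-single rule concrete, here is a minimal sketch of the symbolic check that the circuit is claimed to approximate. It is an ordinary Sudoku routine, not the paper's code. The names `Board`, `candidates`, and `naked_singles` are hypothetical, as is the convention that 0 marks an empty cell.

```python
from typing import Dict, List, Set, Tuple

Board = List[List[int]]  # assumed encoding: 9x9 grid, 0 marks an empty cell


def candidates(board: Board, r: int, c: int) -> Set[int]:
    """Digits not yet used in the row, column, or 3x3 box of cell (r, c)."""
    used = set(board[r])                                # same row
    used |= {board[i][c] for i in range(9)}             # same column
    br, bc = 3 * (r // 3), 3 * (c // 3)                 # top-left corner of the box
    used |= {board[i][j] for i in range(br, br + 3) for j in range(bc, bc + 3)}
    return set(range(1, 10)) - used


def naked_singles(board: Board) -> Dict[Tuple[int, int], int]:
    """Map each empty cell with exactly one remaining candidate to that digit."""
    found: Dict[Tuple[int, int], int] = {}
    for r in range(9):
        for c in range(9):
            if board[r][c] == 0:
                cand = candidates(board, r, c)
                if len(cand) == 1:
                    found[(r, c)] = next(iter(cand))
    return found


# Toy position: the first row is missing only its last digit.
example = [[1, 2, 3, 4, 5, 6, 7, 8, 0]] + [[0] * 9 for _ in range(8)]
print(naked_singles(example))  # {(0, 8): 9}
```

Note that the candidate set depends only on the three constraint groups containing the cell, which is the information a row, column, and box representation would make directly available.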
What carries the argument
The substructure world model, in which the transformer organizes representations around the constraint groups of rows, columns, and boxes.
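As an illustration of what a substructure-level description of the board looks like, the sketch below builds the binary targets a probe of the form "is digit d present in substructure S?" would be trained on. With 9 rows, 9 columns, and 9 boxes there are 27 groups, and 27 × 9 digits gives 243 targets. The function names and the 0-for-empty encoding are assumptions made for this example, not details taken from the paper.

```python
from typing import List, Tuple

Board = List[List[int]]  # assumed encoding: 9x9 grid, 0 marks an empty cell


def substructures() -> List[List[Tuple[int, int]]]:
    """Return the 27 constraint groups as lists of (row, col) cells."""
    rows = [[(r, c) for c in range(9)] for r in range(9)]
    cols = [[(r, c) for r in range(9)] for c in range(9)]
    boxes = [[(br + i, bc + j) for i in range(3) for j in range(3)]
             for br in (0, 3, 6) for bc in (0, 3, 6)]
    return rows + cols + boxes


def presence_labels(board: Board) -> List[int]:
    """Flat 0/1 vector of length 243: is digit d placed in substructure S?"""
    labels: List[int] = []
    for group in substructures():
        placed = {board[r][c] for r, c in group}
        labels.extend(int(d in placed) for d in range(1, 10))
    return labels


# Toy position: the first row holds digits 1 to 8, everything else is empty.
example = [[1, 2, 3, 4, 5, 6, 7, 8, 0]] + [[0] * 9 for _ in range(8)]
labels = presence_labels(example)
print(len(labels))  # 243
```

A cell-by-cell description of the same board would instead use 81 × 9 one-hot targets, so the two hypotheses predict different probe targets to be linearly decodable.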
If this is right
- The geometry of an emergent world model inside a transformer is determined by the constraint algebra of the task domain rather than its surface presentation.
- The resulting decision circuits can be sparse, with individual neurons performing monosemantic functions that are fully interpretable.
- Mechanistic interpretability methods can recover an end-to-end algorithmic description of how the transformer solves a combinatorial reasoning task.
Where Pith is reading between the lines
- The same substructure organization might appear in transformers trained on other constraint-satisfaction domains such as graph coloring or scheduling.
- If the pattern holds, training data that explicitly exposes constraint structure could accelerate the emergence of interpretable internal models.
- The finding suggests that linear representations in transformers can encode relational structure without requiring explicit architectural changes.
Load-bearing premise
The mechanistic interpretability techniques correctly isolate the causal mechanisms responsible for the model's Sudoku-solving behavior.
What would settle it
An experiment in which targeted interventions on the row-column-box representations or the candidate naked-single neurons fail to change the model's output accuracy on held-out puzzles.
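One common form of targeted intervention is activation patching. The sketch below shows the idea on a hand-built two-layer toy network in NumPy. The weights, inputs, and the names `forward` and `patching_recovery` are all hypothetical and stand in for the trained transformer, which this page does not describe in enough detail to reproduce.

```python
import numpy as np


def forward(x, W1, W2, patch=None):
    """Two-layer toy MLP; `patch` maps hidden-unit index -> value to overwrite."""
    hidden = np.maximum(W1 @ x, 0.0)        # ReLU hidden layer (fresh array)
    if patch:
        for idx, value in patch.items():
            hidden[idx] = value             # overwrite the selected units
    return W2 @ hidden, hidden


def patching_recovery(x_clean, x_corrupt, W1, W2, units):
    """Fraction of the clean-vs-corrupt output gap restored by patching `units`.

    Assumes the clean and corrupted outputs differ; otherwise the ratio is undefined.
    """
    out_clean, h_clean = forward(x_clean, W1, W2)
    out_corrupt, _ = forward(x_corrupt, W1, W2)
    patch = {u: h_clean[u] for u in units}  # clean activations for chosen units
    out_patched, _ = forward(x_corrupt, W1, W2, patch)
    return float((out_patched - out_corrupt) / (out_clean - out_corrupt))


# Hypothetical toy weights: the output reads mostly from hidden unit 0.
W1 = np.eye(4)
W2 = np.array([2.0, 0.0, 0.0, 0.5])
x_clean, x_corrupt = np.ones(4), np.zeros(4)

print(patching_recovery(x_clean, x_corrupt, W1, W2, [0]))  # 0.8
print(patching_recovery(x_clean, x_corrupt, W1, W2, [1]))  # 0.0
```

In this framing, the falsifying outcome is a recovery near zero when the patched units are the ones claimed to carry the row, column, and box information.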
Original abstract
Do transformers, when trained on sequential reasoning traces, build internal models of the underlying task? And if so, does the structure of those internal representations mirror the structure of the domain? We train an 8-layer transformer on Sudoku solving traces and perform a mechanistic analysis of its internal computation. We establish two results. First, the model builds a substructure world model: it does not represent the board state cell by cell, as a human analyst would expect, but organizes information around the rows, columns, and boxes that Sudoku's constraints act on. Second, we identify a naked-single circuit: a small set of dedicated neurons in the final MLP layer, each individually detecting when exactly one digit remains possible for a specific cell, and reliably promoting that digit. These findings show that the geometry of an emergent world model is shaped by the constraint algebra of the domain, not its surface presentation, and that the resulting decision circuit is sparse, monosemantic, and fully interpretable. More broadly, they demonstrate that mechanistic interpretability tools can recover an end-to-end algorithmic account of how a transformer solves a combinatorial reasoning task.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper trains an 8-layer transformer on sequential Sudoku solving traces and performs mechanistic interpretability analysis using activation patching and neuron inspection. It claims two main results: the model builds a substructure world model organized around rows, columns, and boxes (rather than representing the board cell-by-cell), and a sparse naked-single circuit exists in the final MLP layer where individual neurons detect cells with exactly one possible digit and promote that digit.
Significance. If the causal claims hold, the work would provide evidence that transformer internal representations can mirror the constraint algebra of a combinatorial domain rather than its surface structure, and that standard MI tools can recover sparse, interpretable decision circuits. This would strengthen the case for using mechanistic analysis to obtain end-to-end algorithmic accounts of reasoning in trained models, with potential implications for understanding how neural networks solve structured tasks.
major comments (2)
- [Mechanistic analysis of world model] Mechanistic analysis section on substructure representations: the activation patching and neuron inspection results establish correlations between row/column/box features and internal activations, but the manuscript does not report control experiments comparing performance drops when patching these putative substructure neurons versus neurons identified via cell-wise linear probes; without such specificity tests, the claim that the model organizes information around Sudoku constraints (rather than cells) remains vulnerable to the possibility that cell-level features are also linearly extractable and causally sufficient.
- [Naked-single circuit] Naked-single circuit identification (final MLP layer): while the paper identifies a small set of dedicated neurons that detect single-possibility cells, no quantitative ablation results are provided showing the exact accuracy drop (e.g., fraction of puzzles solved or per-step prediction accuracy) when these neurons are ablated versus random or control neurons; this is needed to confirm they are causally responsible for promoting the correct digit rather than correlational.
minor comments (2)
- [Abstract and Introduction] The abstract and introduction could more explicitly state the training dataset size, number of Sudoku puzzles, and exact sequence format used for the solving traces to allow replication.
- [Results figures] Figure legends for activation patching results should include error bars or statistical significance tests across multiple runs or puzzle sets.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify the strength of our causal claims. We address each major comment point-by-point below.
- Referee: [Mechanistic analysis of world model] Mechanistic analysis section on substructure representations: the activation patching and neuron inspection results establish correlations between row/column/box features and internal activations, but the manuscript does not report control experiments comparing performance drops when patching these putative substructure neurons versus neurons identified via cell-wise linear probes; without such specificity tests, the claim that the model organizes information around Sudoku constraints (rather than cells) remains vulnerable to the possibility that cell-level features are also linearly extractable and causally sufficient.
  Authors: We agree that additional specificity controls are needed to rule out cell-level alternatives. In the revised manuscript we will add experiments that first train cell-wise linear probes, identify the top neurons by probe accuracy, and then compare activation-patching performance drops for those neurons against the substructure neurons. This will directly test whether constraint-based features are more causally relevant than cell-level ones.
  Revision: yes
- Referee: [Naked-single circuit] Naked-single circuit identification (final MLP layer): while the paper identifies a small set of dedicated neurons that detect single-possibility cells, no quantitative ablation results are provided showing the exact accuracy drop (e.g., fraction of puzzles solved or per-step prediction accuracy) when these neurons are ablated versus random or control neurons; this is needed to confirm they are causally responsible for promoting the correct digit rather than correlational.
  Authors: We acknowledge that the main text lacked precise quantitative ablation metrics. The revised version will include a new table and figure reporting the exact drop in puzzle-solving accuracy and per-step prediction accuracy when ablating the naked-single neurons, compared against random ablations of the same number of neurons and against ablations of other high-activation neurons in the same layer. These results will quantify the causal contribution.
  Revision: yes
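The comparison requested in the second comment, targeted ablation against random ablations of the same size, can be sketched as follows. This is a toy illustration under stated assumptions: the activations, the thresholded linear readout, and the names `accuracy` and `ablation_vs_random` are invented for the example and do not come from the paper.

```python
import numpy as np


def accuracy(hidden, readout, labels, ablate=()):
    """Accuracy of a thresholded linear readout after zero-ablating units."""
    h = hidden.copy()
    idx = np.asarray(list(ablate), dtype=int)
    h[:, idx] = 0.0                                   # zero-ablation
    preds = (h @ readout > 0.5).astype(int)
    return float((preds == labels).mean())


def ablation_vs_random(hidden, readout, labels, target, n_controls=100, seed=0):
    """Accuracy drop for `target` units vs. the mean drop for random unit sets."""
    rng = np.random.default_rng(seed)
    target_set = set(target)
    others = [u for u in range(hidden.shape[1]) if u not in target_set]
    base = accuracy(hidden, readout, labels)
    targeted = accuracy(hidden, readout, labels, target)
    controls = [
        accuracy(hidden, readout, labels,
                 rng.choice(others, size=len(target), replace=False))
        for _ in range(n_controls)
    ]
    return {
        "baseline": base,
        "targeted_drop": base - targeted,
        "control_drop": base - float(np.mean(controls)),
    }


# Hypothetical toy data: unit 0 carries the label, units 1 to 7 are a constant offset.
labels = np.array([0, 1] * 10)
hidden = np.full((20, 8), 0.1)
hidden[:, 0] = labels
readout = np.array([1.0] + [0.1] * 7)

result = ablation_vs_random(hidden, readout, labels, target=[0])
print(result["targeted_drop"], result["control_drop"])  # 0.5 0.0
```

A large gap between the targeted drop and the control drop is the pattern that would support a causal reading. Similar drops for both would suggest the identified neurons are not special.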
Circularity Check
No circularity: empirical MI analysis of trained transformer
The paper trains an 8-layer transformer on Sudoku solving traces and applies mechanistic interpretability methods (activation patching, neuron inspection) to identify internal representations organized around rows/columns/boxes and a naked-single circuit in the final MLP. These results are obtained by direct empirical probing of the trained model rather than any derivation, equation, or first-principles argument that reduces to fitted parameters, self-definitions, or self-citation chains. No load-bearing step equates a claimed prediction or world-model geometry to quantities defined by the analysis itself. The work is self-contained against external benchmarks of model behavior and does not invoke uniqueness theorems or ansatzes from prior author work.
Axiom & Free-Parameter Ledger
free parameters (1)
- model depth and width
axioms (1)
- domain assumption: Mechanistic interpretability methods can isolate causal circuits in transformer activations.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "the model builds a substructure world model: it does not represent the board state cell by cell... organizes information around the rows, columns, and boxes that Sudoku's constraints act on"
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "Linear probes for “is digit d present in substructure S?” achieve perfect exact match accuracy across all 243 (substructure, digit) pairs"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.