arxiv: 2605.18847 · v1 · pith:MMAZ5EVQnew · submitted 2026-05-13 · 💻 cs.LG · cs.AI

Transformers Linearly Represent Highly Structured World Models

Pith reviewed 2026-05-20 20:59 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords transformers · mechanistic interpretability · Sudoku · world models · constraint satisfaction · emergent representations · neural circuits · combinatorial reasoning

The pith

A transformer trained on Sudoku traces builds an internal model organized around rows, columns, and boxes rather than individual cells.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper trains an 8-layer transformer on sequences of Sudoku solving steps and applies mechanistic interpretability methods to inspect its internal states. It shows that the model represents the board by grouping information according to the puzzle's constraint groups instead of storing cell-by-cell values. This structure arises because the model aligns its representations with the algebra of the domain's rules. A sympathetic reader would care because it indicates that transformers can discover and use the underlying structure of combinatorial problems without being given that structure explicitly.

Core claim

The transformer develops a substructure world model in which information is organized around the rows, columns, and boxes that define Sudoku constraints, rather than representing the board state cell by cell. In addition, a small set of dedicated neurons in the final MLP layer forms a naked-single circuit that detects when exactly one digit remains possible for a given cell and promotes that digit.
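
For readers unfamiliar with the term, a naked single is an empty cell whose row, column, and box together rule out all digits but one. The sketch below shows that rule as ordinary code, not the model's learned circuit. The grid encoding (a 9x9 list with 0 for empty) and the function names are assumptions made for illustration.

```python
from typing import List, Set, Tuple

Grid = List[List[int]]  # 9x9 board; 0 marks an empty cell (encoding assumed here)


def candidates(grid: Grid, r: int, c: int) -> Set[int]:
    """Digits not yet used in the row, column, or box of cell (r, c)."""
    used = set(grid[r]) | {grid[i][c] for i in range(9)}
    br, bc = 3 * (r // 3), 3 * (c // 3)  # top-left corner of the 3x3 box
    used |= {grid[i][j] for i in range(br, br + 3) for j in range(bc, bc + 3)}
    return set(range(1, 10)) - used


def naked_singles(grid: Grid) -> List[Tuple[int, int, int]]:
    """Return (row, col, digit) for every empty cell with exactly one candidate."""
    found = []
    for r in range(9):
        for c in range(9):
            if grid[r][c] == 0:
                cands = candidates(grid, r, c)
                if len(cands) == 1:
                    found.append((r, c, next(iter(cands))))
    return found
```

Blanking a single cell of any solved grid leaves that cell as the only naked single, which makes a convenient sanity check for the function.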

What carries the argument

The substructure world model, in which the transformer organizes representations around the constraint groups of rows, columns, and boxes.
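
To make the contrast between cell-level and substructure-level state concrete, the sketch below builds both kinds of probe target from one board. The counts follow the figure captions on this page (81 cell labels; 243 = 27 substructures × 9 digits binary labels). The label ordering, the 0-for-empty encoding, and all function names are assumptions, not the paper's code.

```python
from typing import List, Tuple

Grid = List[List[int]]  # 9x9 board; 0 marks an empty cell (encoding assumed here)
Cell = Tuple[int, int]


def substructures() -> List[List[Cell]]:
    """The 27 constraint groups: 9 rows, then 9 columns, then 9 boxes."""
    rows = [[(r, c) for c in range(9)] for r in range(9)]
    cols = [[(r, c) for r in range(9)] for c in range(9)]
    boxes = [
        [(3 * (b // 3) + i, 3 * (b % 3) + j) for i in range(3) for j in range(3)]
        for b in range(9)
    ]
    return rows + cols + boxes


def substructure_targets(grid: Grid) -> List[int]:
    """243 binary labels; index 9*s + (d-1) is 1 if digit d appears in group s."""
    labels: List[int] = []
    for group in substructures():
        present = {grid[r][c] for r, c in group}
        labels.extend(1 if d in present else 0 for d in range(1, 10))
    return labels


def cell_targets(grid: Grid) -> List[int]:
    """81 multi-class labels, one per cell in row-major order (0 = empty)."""
    return [grid[r][c] for r in range(9) for c in range(9)]
```

Placing a single digit on an empty board sets exactly three substructure labels (its row, its column, and its box) but only one cell label, which illustrates why the two encodings carry differently shaped information.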

If this is right

  • The geometry of an emergent world model inside a transformer is determined by the constraint algebra of the task domain rather than its surface presentation.
  • The resulting decision circuits can be sparse, with individual neurons performing monosemantic functions that are fully interpretable.
  • Mechanistic interpretability methods can recover an end-to-end algorithmic description of how the transformer solves a combinatorial reasoning task.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same substructure organization might appear in transformers trained on other constraint-satisfaction domains such as graph coloring or scheduling.
  • If the pattern holds, training data that explicitly exposes constraint structure could accelerate the emergence of interpretable internal models.
  • The finding suggests that linear representations in transformers can encode relational structure without requiring explicit architectural changes.

Load-bearing premise

The mechanistic interpretability techniques correctly isolate the causal mechanisms responsible for the model's Sudoku-solving behavior.

What would settle it

An experiment in which targeted interventions on the row-column-box representations or the candidate naked-single neurons fail to change the model's output accuracy on held-out puzzles.
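
One simple form such a targeted intervention could take is to erase the component of a hidden state along a probe's weight vector and then re-run the rest of the model. The sketch below shows only the erasure step on plain Python lists. Whether the paper uses this particular intervention is not stated on this page, so treat the method and the names `project_out` and `probe_logit` as assumptions.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_out(hidden: Sequence[float], direction: Sequence[float]) -> List[float]:
    """Remove the component of `hidden` along `direction` (e.g. a probe weight vector)."""
    norm_sq = dot(direction, direction)
    if norm_sq == 0.0:
        return list(hidden)  # nothing to remove for a zero direction
    scale = dot(hidden, direction) / norm_sq
    return [h - scale * d for h, d in zip(hidden, direction)]


def probe_logit(hidden: Sequence[float], weights: Sequence[float], bias: float = 0.0) -> float:
    """Pre-sigmoid output of a linear probe on a hidden state."""
    return dot(hidden, weights) + bias
```

After `project_out`, a probe with the same weights and zero bias reads a logit of zero from the edited state. The informative part of the experiment is whether the model's downstream predictions change as well.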

Figures

Figures reproduced from arXiv: 2605.18847 by Nathanaël Fijalkow, Roman Kniazev.

Figure 1. (Left) Mean exact match accuracy of three families of linear probes trained at the [clues_end] token over different layers. Blue: 81 multi-class probes predicting the digit in a cell (top-1 accuracy); Orange: 729 binary probes predicting if a digit is a valid candidate in a cell (exact match across cell); Green: 243 binary probes predicting if a digit is present in a substructure (exact match over a substructu…

Figure 2. (Left) Cross-layer transfer of substructure-state probes: entry (i, j) is the mean exact match accuracy of probes trained on layer i activations and evaluated on layer j activations without retraining; (Right) Cross-position transfer of substructure-state probes trained on different layers: mean exact match accuracy of probes trained at [clues_end] and evaluated on later positions without retraining. accur…

Figure 3. Substructure constraint elimination in mid-layer attention heads. Attention map (left) shows…

Figure 4. (Left) The distribution of the within-cell logit margin (correct digit score − max incorrect digit score) across 80,919 states with unique NS placement. (Right) A representative logit lens trace of tokens of a NS cell. The correct digit candidate (red) separates from other digits of the cell (gray) across the layers, with a general promotion of the logits for all digits of the cell in the last MLP block.

Figure 5. The distribution of activations of an NS neuron.

    Placement              Logit drop        Probability drop
    Target NS placement    11.408 ± 0.084    0.585 ± 0.003
    Other NS placements    −0.655 ± 0.008    –

Figure 6. (Left) Mean squared error between the probes' predicted probability vectors and the one-hot targets, per layer, for the three probe families: 81 cell-state probes (blue), 729 cell-candidate probes (orange), and 243 substructure-state probes (green). Cell-state probes settle around MSE ≈ 0.03 in mid-layers and never reach zero, consistent with their imperfect top-1 accuracy; substructure-state probes drive MS…

Figure 7. Cosine similarities between the probe trained at layer 6 activations to predict if…

Figure 8. Cross-position transfer diagnostics complementing Fig. 2 (right). Frozen substructure…

Figure 9. Pairwise cosine similarity of the unembedding vectors. Each of the nine big cells is a row,…

Figure 10. Mean attention scores from the [clues_end] token to other tokens in the grid, averaged over all digits, computed over 6400 puzzles.

Figure 11. Substructure constraint elimination in mid-layer attention heads. Attention maps (left)…
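
Two metrics recur in the captions above: exact match accuracy (a group counts as correct only if every label in it is right) and the within-cell logit margin (correct digit score minus the best incorrect digit score). A minimal sketch of both follows. The function names and the convention that a cell's nine logits are ordered by digit 1 to 9 are assumptions.

```python
from typing import Sequence


def exact_match_accuracy(
    pred: Sequence[Sequence[int]], gold: Sequence[Sequence[int]]
) -> float:
    """Fraction of groups whose predicted label vector equals the target exactly."""
    if len(pred) != len(gold) or not gold:
        raise ValueError("pred and gold must be non-empty and the same length")
    hits = sum(1 for p, g in zip(pred, gold) if list(p) == list(g))
    return hits / len(gold)


def logit_margin(cell_logits: Sequence[float], correct_digit: int) -> float:
    """Correct-digit logit minus the largest incorrect-digit logit (digits 1..9)."""
    idx = correct_digit - 1
    best_wrong = max(v for i, v in enumerate(cell_logits) if i != idx)
    return cell_logits[idx] - best_wrong
```

A positive margin means the correct digit is the top prediction within the cell; a negative margin means some other digit outranks it.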
Original abstract

Do transformers, when trained on sequential reasoning traces, build internal models of the underlying task? And if so, does the structure of those internal representations mirror the structure of the domain? We train an 8-layer transformer on Sudoku solving traces and perform a mechanistic analysis of its internal computation. We establish two results. First, the model builds a substructure world model: it does not represent the board state cell by cell, as a human analyst would expect, but organizes information around the rows, columns, and boxes that Sudoku's constraints act on. Second, we identify a naked-single circuit: a small set of dedicated neurons in the final MLP layer, each individually detecting when exactly one digit remains possible for a specific cell, and reliably promoting that digit. These findings show that the geometry of an emergent world model is shaped by the constraint algebra of the domain, not its surface presentation, and that the resulting decision circuit is sparse, monosemantic, and fully interpretable. More broadly, they demonstrate that mechanistic interpretability tools can recover an end-to-end algorithmic account of how a transformer solves a combinatorial reasoning task.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper trains an 8-layer transformer on sequential Sudoku solving traces and performs mechanistic interpretability analysis using activation patching and neuron inspection. It claims two main results: the model builds a substructure world model organized around rows, columns, and boxes (rather than representing the board cell-by-cell), and a sparse naked-single circuit exists in the final MLP layer where individual neurons detect cells with exactly one possible digit and promote that digit.

Significance. If the causal claims hold, the work would provide evidence that transformer internal representations can mirror the constraint algebra of a combinatorial domain rather than its surface structure, and that standard MI tools can recover sparse, interpretable decision circuits. This would strengthen the case for using mechanistic analysis to obtain end-to-end algorithmic accounts of reasoning in trained models, with potential implications for understanding how neural networks solve structured tasks.

major comments (2)
  1. [Mechanistic analysis of world model] Mechanistic analysis section on substructure representations: the activation patching and neuron inspection results establish correlations between row/column/box features and internal activations, but the manuscript does not report control experiments comparing performance drops when patching these putative substructure neurons versus neurons identified via cell-wise linear probes; without such specificity tests, the claim that the model organizes information around Sudoku constraints (rather than cells) remains vulnerable to the possibility that cell-level features are also linearly extractable and causally sufficient.
  2. [Naked-single circuit] Naked-single circuit identification (final MLP layer): while the paper identifies a small set of dedicated neurons that detect single-possibility cells, no quantitative ablation results are provided showing the exact accuracy drop (e.g., fraction of puzzles solved or per-step prediction accuracy) when these neurons are ablated versus random or control neurons; this is needed to confirm they are causally responsible for promoting the correct digit rather than correlational.
minor comments (2)
  1. [Abstract and Introduction] The abstract and introduction could more explicitly state the training dataset size, number of Sudoku puzzles, and exact sequence format used for the solving traces to allow replication.
  2. [Results figures] Figure legends for activation patching results should include error bars or statistical significance tests across multiple runs or puzzle sets.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the strength of our causal claims. We address each major comment point-by-point below.

Point-by-point responses
  1. Referee: [Mechanistic analysis of world model] Mechanistic analysis section on substructure representations: the activation patching and neuron inspection results establish correlations between row/column/box features and internal activations, but the manuscript does not report control experiments comparing performance drops when patching these putative substructure neurons versus neurons identified via cell-wise linear probes; without such specificity tests, the claim that the model organizes information around Sudoku constraints (rather than cells) remains vulnerable to the possibility that cell-level features are also linearly extractable and causally sufficient.

    Authors: We agree that additional specificity controls are needed to rule out cell-level alternatives. In the revised manuscript we will add experiments that first train cell-wise linear probes, identify the top neurons by probe accuracy, and then compare activation-patching performance drops for those neurons against the substructure neurons. This will directly test whether constraint-based features are more causally relevant than cell-level ones. revision: yes

  2. Referee: [Naked-single circuit] Naked-single circuit identification (final MLP layer): while the paper identifies a small set of dedicated neurons that detect single-possibility cells, no quantitative ablation results are provided showing the exact accuracy drop (e.g., fraction of puzzles solved or per-step prediction accuracy) when these neurons are ablated versus random or control neurons; this is needed to confirm they are causally responsible for promoting the correct digit rather than correlational.

    Authors: We acknowledge that the main text lacked precise quantitative ablation metrics. The revised version will include a new table and figure reporting the exact drop in puzzle-solving accuracy and per-step prediction accuracy when ablating the naked-single neurons, compared against random ablations of the same number of neurons and against ablations of other high-activation neurons in the same layer. These results will quantify the causal contribution. revision: yes
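
The comparison promised in the second response (target neurons versus random neurons of the same count) can be sketched on a toy linear readout. Everything here is an assumption for illustration: zero-ablation as the intervention, the toy readout in place of the real final MLP, and the function names. It shows the shape of the control, not the authors' procedure.

```python
import random
from typing import Iterable, List, Sequence, Set


def logits_with_ablation(
    acts: Sequence[float], w_out: Sequence[Sequence[float]], ablated: Set[int]
) -> List[float]:
    """Logits of a toy readout sum_n acts[n] * w_out[n][k], with ablated neurons zeroed."""
    n_out = len(w_out[0])
    return [
        sum(a * w_out[n][k] for n, a in enumerate(acts) if n not in ablated)
        for k in range(n_out)
    ]


def logit_drop(
    acts: Sequence[float], w_out: Sequence[Sequence[float]],
    ablated: Iterable[int], target: int,
) -> float:
    """Decrease in the target logit caused by zero-ablating the given neurons."""
    base = logits_with_ablation(acts, w_out, set())[target]
    return base - logits_with_ablation(acts, w_out, set(ablated))[target]


def random_control_drops(
    acts: Sequence[float], w_out: Sequence[Sequence[float]], size: int,
    target: int, exclude: Set[int], trials: int = 100, seed: int = 0,
) -> List[float]:
    """Logit drops from ablating `size` random neurons outside `exclude`, repeated."""
    rng = random.Random(seed)
    pool = [n for n in range(len(acts)) if n not in exclude]
    return [logit_drop(acts, w_out, rng.sample(pool, size), target) for _ in range(trials)]
```

A causal claim about the candidate neurons is supported when their logit drop sits well outside the distribution returned by `random_control_drops`; reporting that distribution alongside the target drop addresses the referee's request directly.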

Circularity Check

0 steps flagged

No circularity: empirical MI analysis of trained transformer

full rationale

The paper trains an 8-layer transformer on Sudoku solving traces and applies mechanistic interpretability methods (activation patching, neuron inspection) to identify internal representations organized around rows/columns/boxes and a naked-single circuit in the final MLP. These results are obtained by direct empirical probing of the trained model rather than any derivation, equation, or first-principles argument that reduces to fitted parameters, self-definitions, or self-citation chains. No load-bearing step equates a claimed prediction or world-model geometry to quantities defined by the analysis itself. The work is self-contained against external benchmarks of model behavior and does not invoke uniqueness theorems or ansatzes from prior author work.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work relies on standard transformer architecture assumptions and the validity of current mechanistic interpretability tools; no new physical or mathematical entities are postulated.

free parameters (1)
  • model depth and width
    8-layer transformer size chosen for the experiment; typical hyperparameter not derived from first principles.
axioms (1)
  • domain assumption: Mechanistic interpretability methods can isolate causal circuits in transformer activations
    Invoked when claiming the naked-single neurons are the dedicated mechanism.


Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · 2 internal anchors

  1. [1]

    Understanding intermediate layers using linear classifier probes

    G. Alain and Y. Bengio. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644, 2016

  2. [2]

    Probing classifiers: Promises, shortcomings, and advances

    Y. Belinkov. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48(1):207–219, 2022

  3. [3]

    A mathematical framework for transformer circuits

    N. Elhage, N. Nanda, C. Olsson, T. Henighan, N. Joseph, B. Mann, A. Askell, Y. Bai, A. Chen, T. Conerly, N. DasSarma, D. Drain, D. Ganguli, Z. Hatfield-Dodds, D. Hernandez, A. Jones, J. Kernion, L. Lovitt, K. Ndousse, D. Amodei, T. Brown, J. Clark, J. Kaplan, S. McCandlish, and C. Olah. A mathematical framework for transformer circuits. 2021. Transformer ...

  4. [4]

    Toy models of superposition

    N. Elhage, T. Hume, C. Olsson, N. Schiefer, T. Henighan, S. Kravec, Z. Hatfield-Dodds, R. Lasenby, D. Drain, C. Chen, R. Grosse, S. McCandlish, J. Kaplan, D. Amodei, M. Wattenberg, and C. Olah. Toy models of superposition. 2022. Transformer Circuits Thread, https://transformer-circuits.pub/2022/toy_model/index.html

  5. [5]

    Causal abstractions of neural networks

    A. Geiger, H. Lu, T. Icard, and C. Potts. Causal abstractions of neural networks. Advances in Neural Information Processing Systems, 34:9574–9586, 2021

  6. [6]

    Teaching transformers to solve combinatorial problems through efficient trial & error

    P. Giannoulis, Y. Pantis, and C. Tzamos. Teaching transformers to solve combinatorial problems through efficient trial & error. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=MLprqOvAAK

  7. [7]

    Linearly structured world representations in maze-solving transformers

    M. Ivanitskiy, A. F. Spies, T. Räuker, G. Corlouer, C. Mathwin, L. Quirke, C. Rager, R. Shah, D. Valentine, C. D. Behn, K. Inoue, and S. W. Fung. Linearly structured world representations in maze-solving transformers. In M. Fumero, E. Rodolá, C. Domine, F. Locatello, K. Dziugaite, and C. Mathilde, editors, Proceedings of UniReps: the First Workshop on U...

  8. [8]

    Emergent world models and latent variable estimation in chess-playing language models

    A. Karvonen. Emergent world models and latent variable estimation in chess-playing language models. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=PPTrmvEnpW

  9. [9]

    K. Li, A. K. Hopkins, D. Bau, F. Viégas, H. Pfister, and M. Wattenberg. Emergent world representations: Exploring a sequence model trained on a synthetic task. In The Eleventh International Conference on Learning Representations (ICLR 2023), 2023. URL https://openreview.net/forum?id=DeG07_TcZvT

  10. [10]

    Linguistic regularities in continuous space word representations

    T. Mikolov, W.-t. Yih, and G. Zweig. Linguistic regularities in continuous space word representations. In L. Vanderwende, H. Daumé III, and K. Kirchhoff, editors, Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, Georgia, June 2013. ...

  11. [11]

    Emergent linear representations in world models of self-supervised sequence models

    N. Nanda, A. Lee, and M. Wattenberg. Emergent linear representations in world models of self-supervised sequence models. In Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 16–30, 2023

  12. [12]

    P. Norvig. Solving every Sudoku puzzle. https://norvig.com/sudoku.html, 2006

  13. [13]

    Interpreting GPT: the logit lens

    nostalgebraist. Interpreting GPT: the logit lens. https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/

  14. [14]

    K. Park, Y. J. Choe, and V. Veitch. The linear representation hypothesis and the geometry of large language models. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=UGpGkLzwpP

  15. [15]

    Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

    A. Power, Y. Burda, H. Edwards, I. Babuschkin, and V. Misra. Grokking: Generalization beyond overfitting on small algorithmic datasets. arXiv preprint arXiv:2201.02177, 2022

  16. [16]

    3 million Sudoku puzzles with ratings

    D. Radcliffe. 3 million Sudoku puzzles with ratings. Kaggle dataset, https://www.kaggle.com/datasets/radcliffe/3-million-sudoku-puzzles-with-ratings, 2020

  17. [17]

    K. Shah, N. Dikkala, X. Wang, and R. Panigrahy. Causal language modeling can elicit search and reasoning capabilities on logic puzzles. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=i5PoejmWoC