
arxiv: 2605.17811 · v1 · pith:QLDBMFJKnew · submitted 2026-05-18 · 💻 cs.LG · cs.AI · math.OC

One Model, Two Roles: Emergent Specialization in a Shared Recurrent Transformer

Pith reviewed 2026-05-20 12:54 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · math.OC
keywords recurrent transformer · emergent specialization · asymmetric input recurrence · shared parameters · state dynamics · sudoku extreme · maze solving · attention analysis

The pith

A clear state-identity signal induces stable functional roles inside a shared-parameter recurrent Transformer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether a single Transformer reused across two recurrent states can develop distinct internal behaviors without being split into separate modules. In the Asymmetric Input Recurrence setup the only built-in difference is that encoded input is injected during L-updates but withheld during H-updates. On Sudoku-Extreme and Maze tasks the H state consistently acts as a committed proposal while the L state retains local uncertainty and shifting structure. Freeze and ablation experiments tie this split to the model's ability to distinguish the two update types. The results indicate that minimal signals suffice to produce related but specialized roles in a shared recurrent architecture.

Core claim

In a two-state recurrent setting, a clear state-identity signal can induce stable, related functional roles inside a shared-parameter recurrent Transformer: zH behaves like a fully committed proposal state whereas zL retains local uncertainty and shifting intermediate structure, with the split arising when the model can tell the update types apart via input-injection asymmetry or a level token.

What carries the argument

Asymmetric Input Recurrence (AIR), a minimal two-state reasoning architecture that reuses the same Transformer parameters for both L and H updates while injecting the encoded input only during L-updates.
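
The schedule described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `shared_update` is a hypothetical toy stand-in for the shared Transformer f(·; θ), the nesting of several L-updates per H-update is an assumption borrowed from hierarchical recurrent designs, and the `inject_on_h` flag is an assumed reading of the symmetric Lx_Hx control. The one property taken from the source is that the same function serves both update types and that the encoded input reaches it only during L-updates.

```python
from typing import List, Optional, Tuple

Vector = List[float]


def shared_update(state: Vector, other: Vector, x_enc: Optional[Vector]) -> Vector:
    """Toy stand-in (hypothetical) for the shared model f(.; theta).

    The same rule is applied for both roles; only the presence of x_enc differs.
    """
    out = []
    for i, s in enumerate(state):
        total = s + other[i]
        if x_enc is not None:  # the only built-in difference between update types
            total += x_enc[i]
        out.append(0.5 * total)
    return out


def air_rollout(
    x_enc: Vector,
    z_h: Vector,
    z_l: Vector,
    n_cycles: int = 2,
    l_steps: int = 2,
    inject_on_h: bool = False,
) -> Tuple[Vector, Vector]:
    """Run the assumed schedule: l_steps L-updates, then one H-update, per cycle."""
    for _ in range(n_cycles):
        for _ in range(l_steps):
            # L-update: encoded input is injected.
            z_l = shared_update(z_l, z_h, x_enc)
        # H-update: encoded input is withheld unless the symmetric control is on.
        z_h = shared_update(z_h, z_l, x_enc if inject_on_h else None)
    return z_h, z_l


z_h, z_l = air_rollout([1.0], [0.0], [0.0], n_cycles=1, l_steps=1)
print(z_h, z_l)
```

Setting `inject_on_h=True` removes the asymmetry, so the shared function receives the same kind of arguments in both roles and has no built-in way to tell the update types apart.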

If this is right

  • Freezing zH reduces content changes in zL while freezing zL increases changes in zH on Sudoku-Extreme.
  • Freezing either state increases content changes in the other state on Maze.
  • L-updates produce consistently more local attention patterns than H-updates in both tasks.
  • Specialization disappears when the model has no signal to distinguish the two update types.
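
The freeze results above are stated in terms of content changes. A minimal sketch of such a metric follows, assuming it counts the decoded-token positions that differ between consecutive sub-steps; the function names are hypothetical and the paper's exact definition may differ.

```python
from typing import List, Sequence


def content_changes(prev_tokens: Sequence[int], next_tokens: Sequence[int]) -> int:
    """Count positions whose decoded token differs between two sub-steps."""
    if len(prev_tokens) != len(next_tokens):
        raise ValueError("decoded states must have the same length")
    return sum(1 for a, b in zip(prev_tokens, next_tokens) if a != b)


def total_content_changes(rollout: List[Sequence[int]]) -> int:
    """Sum content changes over every consecutive pair of decoded states."""
    return sum(content_changes(a, b) for a, b in zip(rollout, rollout[1:]))


# Three decoded sub-steps of a 3-cell toy state; 0 stands for an undecided cell.
rollout = [[1, 0, 3], [1, 2, 3], [4, 2, 3]]
print(total_content_changes(rollout))  # 2
```

Under this reading, a freeze experiment would compare the total for one state with and without the other state held fixed.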

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same minimal asymmetry could produce analogous role splits in other multi-step recurrent reasoning domains.
  • Scaling the architecture while preserving the input-injection rule would test whether the proposal-uncertainty division persists at larger model sizes.
  • Comparing attention locality across additional tasks would clarify whether the local-versus-global pattern is a general consequence of the L/H distinction.

Load-bearing premise

The observed functional split between zH and zL is caused by the input-injection asymmetry or level token rather than by task-specific training dynamics or other unablated factors.

What would settle it

Train the same models on Sudoku-Extreme and Maze after removing both the input-injection asymmetry and the level token; if the proposal-versus-uncertainty split still appears in decoded rollouts and freeze experiments, the claim is false.

Figures

Figures reproduced from arXiv: 2605.17811 by Anastasios Kyrillidis, Barbara Su, Jucheng Shen.

Figure 1: AIR testbed. The same shared Transformer f(·; θ) is reused at two positions in the recurrent schedule: with the encoded input x̃ for L-updates (top) and without it for H-updates (bottom). Despite identical parameters, this input difference lets the shared model tell the two update types apart: zL acts like a shifting-commitment scratchpad while zH acts like a fully committed solution state, and each query …
Figure 2: Sudoku decoded states. Left = zH, Right = zL; zH stays fully committed, while zL keeps some cells as BLANK (shaded light gray) and shifts those blanks across sub-steps.
Figure 3: Maze decoded states. Left = zH, Right = zL; Black = wall, white = open path, blue = solution path, gray = undecided (PAD), red = start, and green = goal. zH stays committed, while zL shows gray undecided cells and other shifting local structure. The symmetric Lx_Hx control removes this decoded role split. Figures 4 and 5 repeat the same four-panel view for symmetric models on Sudoku and Maze. Once the asym…
Figure 4: Sudoku decoded states for the symmetric Lx_Hx variant. Left = zH, Right = zL; decoded intermediate states across two sub-steps for each update type. Relative to …
Figure 5: Maze decoded states for the symmetric Lx_Hx variant. Left = zH, Right = zL; decoded intermediate states across two sub-steps for each update type. Relative to …
Figure 6: Paired freeze experiments measured by content changes (number of decoded-token positions …
Figure 7: Symmetric-variant freeze experiments on Sudoku and Maze, measured by the same content …
Figure 8: Sudoku L−H attention contrasts across query classes and Transformer layers. L-updates are more local than H-updates (∆nbr, ∆ent > 0), while violation-specific concentration (∆viol > 0) appears mainly in deeper layers.
Figure 9: Maze attention contrasts across error-adjacent and control queries. We report per-cell …
Figure 10: Three representative L-update versus H-update attention examples on Sudoku.
Figure 11: Temporal persistence for a single query cell (puzzle …
Figure 12: Three representative L-update versus H-update attention examples on Maze-30 …
Figure 13: Temporal persistence for a single anchor query (puzzle …
Original abstract

Can a shared-weight recurrent Transformer develop distinct internal roles without being partitioned into separate modules? We study this in Asymmetric Input Recurrence (AIR), a minimal two-state reasoning architecture in which the same Transformer model is reused for both updates (per literature, L and H) and the only built-in difference in the update rule is that the encoded input is injected during L-updates but not H-updates. Across Sudoku-Extreme and Maze, decoded rollouts reveal a consistent split: $z_H$ behaves like a fully committed proposal state, whereas $z_L$ retains local uncertainty and shifting intermediate structure. Freeze experiments show that this split is, in practice, related to the model's state dynamics: in Sudoku, freezing $z_H$ reduces $z_L$'s content changes whereas freezing $z_L$ increases $z_H$'s, while in Maze, freezing either state increases content changes in the other state. Ablations show that to induce specialization, the shared model needs to be able to tell the two update types apart, either from input injection asymmetry or from a separate level token. Mechanistically, attention analysis shows that L-updates are consistently more local than H-updates in both Sudoku and Maze. Together, these results show that, in a two-state recurrent setting, a clear state-identity signal can induce stable, related functional roles inside a shared-parameter recurrent Transformer. Code is available at https://github.com/juchengshen/air.
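
The locality claim in the abstract can be made concrete with two simple per-query statistics. The sketch below is illustrative only: it assumes that "local" for Sudoku means attention mass on a cell's row, column, and box peers, and it uses Shannon entropy as a concentration measure. The function names are hypothetical, and the paper's own contrast statistics and sign conventions are not reproduced here.

```python
import math
from typing import Iterable, Sequence, Set


def attention_entropy(weights: Sequence[float]) -> float:
    """Shannon entropy (nats) of one query's attention distribution.

    Lower entropy means the attention is concentrated on fewer positions.
    """
    return -sum(w * math.log(w) for w in weights if w > 0.0)


def neighbor_mass(weights: Sequence[float], neighbors: Iterable[int]) -> float:
    """Fraction of attention mass that lands on the given neighbor positions."""
    return sum(weights[i] for i in neighbors)


def sudoku_peers(cell: int) -> Set[int]:
    """Cells sharing a row, column, or 3x3 box with `cell` in a flattened 9x9 grid."""
    row, col = divmod(cell, 9)
    peers = set()
    for k in range(9):
        peers.add(row * 9 + k)  # same row
        peers.add(k * 9 + col)  # same column
    box_row, box_col = 3 * (row // 3), 3 * (col // 3)
    for dr in range(3):
        for dc in range(3):
            peers.add((box_row + dr) * 9 + (box_col + dc))  # same box
    peers.discard(cell)
    return peers


# Uniform attention over all 81 cells puts 20/81 of its mass on the peers.
uniform = [1.0 / 81] * 81
print(neighbor_mass(uniform, sudoku_peers(40)), attention_entropy(uniform))
```

Comparing these statistics between L-updates and H-updates for the same query cell is one way to express an L-versus-H locality contrast.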

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces Asymmetric Input Recurrence (AIR), a minimal two-state recurrent Transformer in which a shared-parameter model is reused for L- and H-updates, with the sole built-in asymmetry being input injection during L-updates (or an explicit level token). Across Sudoku-Extreme and Maze, decoded rollouts show zH behaving as a committed proposal state while zL retains local uncertainty and shifting structure. Freeze experiments demonstrate interdependence between the states, ablations confirm that distinguishability of update types is required for the split, and attention analysis reveals consistently more local attention in L-updates than H-updates. The central claim is that a clear state-identity signal suffices to induce stable, related functional roles inside a shared recurrent Transformer.

Significance. If the causal link between the minimal asymmetry and the observed specialization holds, the result provides evidence that functional differentiation can emerge in shared-weight recurrent architectures without explicit modularization. This is relevant to designs for multi-step reasoning models. The public code release supports reproducibility and is a positive feature.

major comments (1)
  1. Abstract and freeze-experiment description: the reported interdependence (freezing zH reduces zL changes in Sudoku; freezing either increases changes in the other in Maze) is correlational and does not isolate whether the specific functional roles (zH as committed proposal, zL as uncertain local structure) are induced by the input-injection asymmetry or level token versus arising from task-specific training dynamics that reward phased proposal/refinement once states are distinguishable. The ablations establish only that distinguishability is necessary, not that it is sufficient to produce these particular roles independent of Sudoku-Extreme and Maze objectives.
minor comments (1)
  1. Clarify the precise definition and implementation of the level token versus input-injection asymmetry in the methods section; the abstract treats them as interchangeable but the mechanistic implications may differ.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the major comment below and have revised the manuscript to clarify the scope of our claims.

Point-by-point responses
  1. Referee: Abstract and freeze-experiment description: the reported interdependence (freezing zH reduces zL changes in Sudoku; freezing either increases changes in the other in Maze) is correlational and does not isolate whether the specific functional roles (zH as committed proposal, zL as uncertain local structure) are induced by the input-injection asymmetry or level token versus arising from task-specific training dynamics that reward phased proposal/refinement once states are distinguishable. The ablations establish only that distinguishability is necessary, not that it is sufficient to produce these particular roles independent of Sudoku-Extreme and Maze objectives.

    Authors: We agree that the freeze experiments demonstrate correlational interdependence and do not by themselves establish that the input-injection asymmetry (or level token) is what induces the precise functional roles observed. The ablations show that distinguishability between update types is necessary for any specialization to emerge. We maintain that the consistent appearance of the same roles (committed zH proposal state and shifting zL uncertainty) across two tasks with different objectives and structures provides supporting evidence that the minimal asymmetry plays a causal enabling role. Nevertheless, we acknowledge that task-specific training dynamics may also shape the exact roles once distinguishability is present. We have revised the abstract and the freeze-experiment discussion to more precisely describe the evidence as showing necessity via ablations plus observed consistency, while noting the correlational nature of the interdependence results.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical observations from ablations and rollouts do not reduce to fitted inputs or self-referential definitions.

Full rationale

The paper reports an empirical investigation of emergent specialization in a shared recurrent Transformer under Asymmetric Input Recurrence. Claims rest on decoded rollouts, freeze experiments showing interdependence between states, ablations requiring distinguishable update types, and attention locality differences across Sudoku-Extreme and Maze. No derivation chain, first-principles equations, or predictions are presented that reduce by construction to fitted parameters, self-definitions, or self-citation load-bearing premises. The central result—that a state-identity signal can induce functional roles—is supported by direct experimental contrasts rather than any renaming or smuggling of prior results into the current analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is primarily empirical and introduces no new mathematical axioms or invented entities; the main domain assumption is that the two chosen tasks suffice to reveal generalizable specialization patterns.

axioms (1)
  • Domain assumption: Sudoku-Extreme and Maze tasks are representative environments in which state specialization can be reliably observed and measured.
    All reported rollouts, freeze experiments, and attention analyses are performed on these two tasks.


