
arxiv: 2606.24898 · v1 · pith:UMEVXQSZnew · submitted 2026-06-12 · 💻 cs.LG · cs.AI

Dense Supervision Is Not Enough: The Readout Blind Spot in Looped Language Models

Pith reviewed 2026-06-27 04:25 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords looped language models · recurrent transformers · cross-entropy supervision · hidden state norms · scale-invariant readouts · pre-norm residuals · early exit · variable depth

The pith

Dense per-loop cross-entropy leaves hidden-state scale uncontrolled when readouts are scale-invariant.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Looped language models reuse hidden states across computation steps: each state is decoded for prediction and then fed back as input. The paper shows that cross-entropy loss applied at every loop supervises only the variables the readout exposes, so radial scale stays invisible to the loss when the readout is scale-invariant, as with RMSNorm or LayerNorm. Pre-norm residual connections nonetheless keep carrying and updating that scale. In 44M- and 129M-parameter looped transformers without inter-loop normalization, this produces final hidden-state norms in the thousands or tens of thousands. Scale-visible readouts or explicit norm penalties keep norms in the tens and yield lower perplexity at matched inference depths on the paper's variable-depth benchmarks.
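
The invisibility of scale can be checked numerically on a toy readout. The sketch below is not the paper's code: the dimensions, the random weights `W`, and the helper names are assumptions made for illustration. It multiplies one hidden state by a factor `alpha`, in the spirit of the intervention described for Figure 5, and compares cross-entropy with and without RMSNorm in front of a linear readout.

```python
import math
import random

def rms_norm(h, eps=1e-8):
    """Scale-invariant normalization: divide h by its root-mean-square."""
    rms = math.sqrt(sum(x * x for x in h) / len(h))
    return [x / (rms + eps) for x in h]

def cross_entropy(logits, target):
    """Numerically stable softmax cross-entropy for one target index."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[target]

def readout_ce(h, W, target, normalize):
    """CE of a linear readout W @ h, optionally applying RMSNorm to h first."""
    x = rms_norm(h) if normalize else h
    logits = [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in W]
    return cross_entropy(logits, target)

random.seed(0)
d, vocab = 16, 8  # toy sizes, assumed for illustration only
h = [random.gauss(0, 1) for _ in range(d)]
W = [[random.gauss(0, 1) / math.sqrt(d) for _ in range(d)] for _ in range(vocab)]

for alpha in (0.1, 1.0, 10.0):
    scaled = [alpha * x for x in h]
    ce_norm = readout_ce(scaled, W, target=3, normalize=True)
    ce_raw = readout_ce(scaled, W, target=3, normalize=False)
    print(f"alpha={alpha:>4}: CE with RMSNorm={ce_norm:.4f}, CE without={ce_raw:.4f}")
```

The normalized column should stay constant up to the small `eps` term, while the unnormalized column moves with `alpha`. A loss that cannot tell `alpha = 0.1` from `alpha = 10` has no way to push the scale in either direction.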

Core claim

Dense per-loop cross-entropy through RMSNorm readouts drives final hidden-state norms into the thousands or tens of thousands in looped transformers without inter-loop normalization; scale-visible readouts and explicit norm penalties keep norms in the tens. The resulting design rule is that dense supervision trains exits while recurrent scale control requires either making scale visible to a loss or removing it from the loop.
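
To see why scale stays active in the loop even when the readout ignores it, consider a toy pre-norm residual recurrence. Everything here is an illustrative assumption rather than the paper's model: the `branch` function is an untrained stand-in for the shared block, its weights are a near-identity matrix chosen so that the update points outward, and `renormalize=True` is a hypothetical stand-in for scale-removing recurrence.

```python
import math
import random

def rms(h):
    """Root-mean-square of a vector."""
    return math.sqrt(sum(x * x for x in h) / len(h))

def rms_norm(h, eps=1e-8):
    r = rms(h)
    return [x / (r + eps) for x in h]

def branch(x, W):
    """Stand-in for the shared block: a fixed linear map followed by tanh."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]

def run_loop(h0, W, num_loops, renormalize):
    """Pre-norm residual recurrence H_{k+1} = H_k + F(RMSNorm(H_k)).

    Returns the RMS scale of the state after each loop. With
    renormalize=True the state is RMS-normalized after every loop.
    """
    h, scales = list(h0), []
    for _ in range(num_loops):
        update = branch(rms_norm(h), W)       # branch only sees the direction
        h = [a + b for a, b in zip(h, update)]  # skip path carries the scale
        if renormalize:
            h = rms_norm(h)
        scales.append(rms(h))
    return scales

random.seed(0)
d, num_loops = 16, 30  # toy width; 30 loops echoes the range in Figure 3(a)
W = [[(1.0 if i == j else 0.0) + 0.05 * random.gauss(0, 1) / math.sqrt(d)
      for j in range(d)] for i in range(d)]
h0 = [random.gauss(0, 1) for _ in range(d)]

plain = run_loop(h0, W, num_loops, renormalize=False)
renorm = run_loop(h0, W, num_loops, renormalize=True)
for k in (1, 10, 30):
    print(f"loop {k:2d}: plain RMS={plain[k - 1]:.2f}, renormalized RMS={renorm[k - 1]:.2f}")
```

Because tanh is bounded, this toy can add at most one RMS unit per loop, so it does not reproduce the magnitudes reported in the paper, which come from trained weights. It only illustrates the structural point: the skip path accumulates scale that the normalized branch input never sees, and normalizing between loops removes it.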

What carries the argument

The readout blind spot created by scale-invariant functions such as RMSNorm, which hide radial scale from the immediate cross-entropy loss while pre-norm residual recurrence continues to carry and update that same scale.
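
The blind spot follows from a short homogeneity argument. The derivation below is a sketch under two assumptions: the per-loop loss $\ell$ depends on the hidden state only through a scale-invariant map (the small $\epsilon$ inside RMSNorm is ignored), and the state is decomposed as $H = s\,u$ with scale $s > 0$ and direction $u$, following the notation $H_k = s_k u_k$ in the Figure 2 caption.

```latex
% Scale invariance of the loss as a function of the hidden state:
\ell(\alpha H) = \ell(H) \qquad \text{for all } \alpha > 0 .

% Differentiate with respect to \alpha and evaluate at \alpha = 1:
\frac{d}{d\alpha}\,\ell(\alpha H)\Big|_{\alpha = 1}
  = \big\langle \nabla_H \ell(H),\, H \big\rangle = 0 .

% With H = s u, the derivative with respect to the scale vanishes:
\frac{\partial \ell}{\partial s}
  = \big\langle \nabla_H \ell(H),\, u \big\rangle
  = \frac{1}{s}\,\big\langle \nabla_H \ell(H),\, H \big\rangle = 0 .

% Differentiating \ell(\alpha H) = \ell(H) with respect to H instead:
\nabla_H \ell(\alpha H) = \frac{1}{\alpha}\,\nabla_H \ell(H) .
```

The first three lines say that the immediate loss gradient is orthogonal to the hidden state, so it cannot change $s$ to first order. The last line says that the remaining (angular) gradient shrinks in proportion as the scale grows. With a finite $\epsilon$ these statements hold approximately rather than exactly.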

If this is right

  • Scale-controlled variants achieve lower perplexity at matched inference-depth operating points in variable-depth benchmarks.
  • Per-loop loss can make early exits usable without controlling recurrent scale.
  • Scale-removing recurrence serves as a complementary architectural fix to visible-scale readouts.
  • Dense supervision alone is insufficient to control all variables active in the recurrent transition.
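
One way to make scale visible to a loss is an explicit penalty on hidden-state RMS added to the per-loop cross-entropy. The source does not spell out the penalty (the text accompanying Figure 4 is truncated at that point), so the quadratic form, `target_rms`, and `penalty_weight` below are hypothetical choices for illustration, not the authors' implementation.

```python
import math

def rms(h):
    """Root-mean-square of a hidden-state vector."""
    return math.sqrt(sum(x * x for x in h) / len(h))

def norm_penalty(hidden_states, target_rms=1.0):
    """Mean squared deviation of per-loop RMS scale from a target.

    The quadratic form and the target are assumptions of this sketch.
    """
    return sum((rms(h) - target_rms) ** 2 for h in hidden_states) / len(hidden_states)

def total_loss(per_loop_ce, hidden_states, penalty_weight=0.01, target_rms=1.0):
    """Average per-loop cross-entropy plus a weighted scale penalty."""
    ce = sum(per_loop_ce) / len(per_loop_ce)
    return ce + penalty_weight * norm_penalty(hidden_states, target_rms)

# Two trajectories with the same directions but different scales. A
# scale-invariant readout would assign them the same per-loop CE values,
# so only the penalty term distinguishes them.
per_loop_ce = [2.0, 1.8, 1.7, 1.6]            # made-up numbers for illustration
small = [[1.0, -1.0, 1.0, -1.0]] * 4           # RMS 1 at every loop
large = [[100.0, -100.0, 100.0, -100.0]] * 4   # RMS 100 at every loop

print("loss, small-scale states:", total_loss(per_loop_ce, small))
print("loss, large-scale states:", total_loss(per_loop_ce, large))
```

The design point is only that the second term has a nonzero derivative with respect to scale, which the cross-entropy term through a normalized readout lacks. The weight and target would need tuning in any real setup.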

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same blind spot could appear in other recurrent or stateful architectures that combine invariant readouts with residual loops.
  • Explicit norm penalties might generalize to other scale-related instabilities in deep recurrent networks.
  • Variable-depth evaluation protocols could expose analogous supervision gaps in non-looped models that reuse internal states.

Load-bearing premise

Norm growth is caused specifically by the interaction of scale-invariant readouts with pre-norm residual recurrence rather than other training dynamics or initialization effects.

What would settle it

Training the same 44M and 129M looped models with scale-visible readouts and observing whether final hidden-state norms still reach thousands would directly test the claimed mechanism.

Figures

Figures reproduced from arXiv: 2606.24898 by Rituraj Sharma, Tu Vu.

Figure 1: Exit training and scale control are orthogonal in the core 2×2. The x-axis measures early-exit failure, the cross-entropy gap between the first and fourth recurrent loops, CE(K=1) − CE(K=4), where K is the inference loop count; the axis uses a symlog scale. The y-axis measures final-loop norm drift, with $H_K$ denoting the hidden state after K loops. Lower is better on both axes. Per-loop CE moves models left…
Figure 2: Visibility–activity mismatch. The same hidden state $H_k = s_k u_k$ serves as both a prediction interface and the input to the next recurrent step. Along the readout path, output normalization removes scale, so the immediate CE loss has approximately zero radial gradient. Along the recurrent path, the pre-norm residual update still carries the skip state $H_k$, so scale remains active and can drift through the rec…
Figure 3: Trained checkpoints enter the slow-angular-motion regime predicted by the scale expansion. (a) Hidden-state RMS scale over 30 loops. (b) Euclidean radial residual inner product $\tilde{a}_k = \langle u_k, F(H_k) - H_k \rangle$, whose scaled value $\tilde{a}_k / d$ predicts the RMS-scale increment. (c) $\|u_{k+1} - u_k\|_{\mathrm{rms}}\, s_k$, whose leading-order value is $\|b_\perp(u_k)\|_{\mathrm{rms}}$.
5 Mechanism Diagnostics. The ablation matches the framework, but we also test the loca…
Figure 4: Compute–quality frontier on per-loop-loss 129M models. Each curve sweeps K ∈ {1, 2, 3, 4}. Raw, final-only norm, and norm penalty occupy the lower-PPL region; RMSNorm is above them at every measured operating point.
A simple practical intervention: scale-visibility penalty. For practitioners who want to keep normalized readouts, the norm penalty is the most direct intervention in our study. The implementat…
Figure 5: A normalized readout hides hidden-state scale at billion-parameter scale. We scale the final hidden state of the published Ouro 1.4B checkpoint by α ∈ [0.1, 10] before the readout and measure cross-entropy. With RMSNorm, CE is essentially unchanged across a 100× range, consistent with Lemma 1. Without RMSNorm, the same hidden states produce a sharp scale-sensitive curve. This is a direct measurement on a t…
Figure 6: Direct scale intervention on trained 1.4B checkpoints. We multiply the final hidden state by α before the readout and measure cross-entropy. The baseline (RMSNorm readout) is flat across the whole α range. The no-readout-norm variant produces a sharp scale-sensitive curve. This is the same intervention as…
Figure 7: Variable-depth inference on the published Ouro 1.4B model. Left: perplexity versus…

Original abstract

Looped language models turn hidden states into runtime state: each state is decoded for prediction and fed back into future computation. This creates a basic supervision question: which state variables does cross-entropy actually control? We show that dense per-loop cross-entropy controls the variables exposed by the readout, not every variable active in the recurrent transition. Hidden-state scale gives a concrete failure mode. Scale-invariant readouts such as RMSNorm and LayerNorm hide radial scale from the immediate cross-entropy loss, while pre-norm residual recurrence continues to carry and update that same scale. Thus per-loop loss can make early exits usable without controlling recurrent scale. In 44M and 129M looped transformers without inter-loop normalization, per-loop cross-entropy through RMSNorm readouts still drives final hidden-state norms into the thousands or tens of thousands. Scale-visible readouts and explicit norm penalties keep norms in the tens, and scale-removing recurrence is the complementary architectural fix. The resulting design rule is simple: dense supervision trains exits; recurrent scale control requires either making scale visible to a loss or removing it from the loop. Consistent with this rule, scale-controlled variants achieve lower perplexity at matched inference-depth operating points in our variable-depth benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper claims that dense per-loop cross-entropy in looped transformers fails to control hidden-state radial scale when readouts are scale-invariant (e.g., RMSNorm), because pre-norm residual recurrence continues to propagate and update that scale. In 44M and 129M models without inter-loop normalization, this produces final hidden-state norms in the thousands or tens of thousands; scale-visible readouts or explicit norm penalties keep norms in the tens. Scale-controlled variants also yield lower perplexity at matched inference depths. The resulting design rule is that supervision trains exits while recurrent scale requires either visibility to a loss or removal from the loop.

Significance. If the central empirical contrast holds, the work supplies a concrete, actionable principle for training looped and recurrent language models: standard dense supervision is insufficient for scale control in the recurrence. The demonstration on two model sizes and the link to improved variable-depth perplexity give the finding immediate engineering relevance for architectures that reuse hidden states across loops.

major comments (1)
  1. [Experimental results (abstract and §4)] The abstract reports that scale-visible readouts and explicit norm penalties keep norms in the tens while RMSNorm readouts produce norms in the thousands or tens of thousands, yet it does not describe matched ablations that hold initialization scale, optimizer hyperparameters, the absence of inter-loop normalization, and all other recurrence details fixed while toggling only readout visibility or the penalty term. This isolation is load-bearing for attributing the observed norm growth specifically to the readout blind spot rather than to other uncontrolled training dynamics.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the need for explicit experimental controls. The concern is valid: the current manuscript does not sufficiently document that the reported norm differences arise from toggling only readout visibility or the penalty term. We will revise the text to make the matched ablation design fully transparent.

Point-by-point responses
  1. Referee: [Experimental results (abstract and §4)] The abstract reports that scale-visible readouts and explicit norm penalties keep norms in the tens while RMSNorm readouts produce norms in the thousands or tens of thousands, yet it does not describe matched ablations that hold initialization scale, optimizer hyperparameters, the absence of inter-loop normalization, and all other recurrence details fixed while toggling only readout visibility or the penalty term. This isolation is load-bearing for attributing the observed norm growth specifically to the readout blind spot rather than to other uncontrolled training dynamics.

    Authors: The experiments were run with all listed factors held fixed: identical initialization scales, the same optimizer hyperparameters and schedule, no inter-loop normalization in any condition, and identical recurrence structure. The only differences were the readout (RMSNorm versus scale-visible alternatives such as a linear projection without normalization) or the addition of an explicit norm penalty. We will expand §4 with a dedicated paragraph enumerating these controls and will update the abstract to reference the matched setup. This revision will also include a short table summarizing the fixed versus varied elements across the three conditions.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: claims rest on empirical training runs

Full rationale

The paper reports empirical observations from training 44M and 129M looped transformers under different readout and normalization conditions, showing norm growth under RMSNorm readouts versus controlled norms with scale-visible readouts or penalties. No derivation chain, equations, or predictions are presented that reduce to inputs by construction. No self-citations, ansatzes, or uniqueness theorems are invoked as load-bearing steps. The central claim is an experimental contrast, not a self-referential definition or fitted-input prediction, so the work is self-contained against its own benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests entirely on empirical observations from training specific looped transformer models; the abstract introduces no new free parameters, axioms beyond standard language-modeling assumptions, or invented entities.

axioms (1)
  • domain assumption: Cross-entropy loss applied per loop is the primary supervision signal for the hidden state in looped models.
    The paper's analysis assumes this loss is applied densely and is the mechanism under test.

pith-pipeline@v0.9.1-grok · 5747 in / 1270 out tokens · 55273 ms · 2026-06-27T04:25:14.566197+00:00 · methodology

