Simply Stabilizing the Loop via Fully Looped Transformer

Hechang Chen; Jiankun Zhang; Jing Ma; Rao Fu; Yi Chang; Yu Li; Zixuan Yang

Two parameter-free modifications stabilize looped Transformers for training at up to 12 iterations.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.


T0 review · grok-4.3

2026-06-30 22:13 UTC pith:W5DTY6P2

Load-bearing objection: Two parameter-free tweaks let looped transformers train stably at up to 12 loop iterations and gain up to 13.2% in average downstream performance in regimes where baselines still train.

arxiv 2605.18797 v2 pith:W5DTY6P2 submitted 2026-05-11 cs.LG cs.AI


Rao Fu, Zixuan Yang, Jiankun Zhang, Jing Ma, Hechang Chen, Yu Li, Yi Chang

classification cs.LG cs.AI

keywords: Looped Transformer, training stability, gradient oscillation, residual connections, attention mechanism, iterative models

verification ladder: T0 review · T1 audit · T2 compute · T3 formal · T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Looped Transformers reuse the same blocks multiple times to gain performance without adding parameters, but training becomes unstable as the loop count rises. The paper identifies gradient oscillation and residual explosion as the causes and proposes the Fully Looped Transformer to fix them. The first change distributes inter-loop signals across all layers to mitigate residual explosion. The second reuses the existing attention block to suppress gradient oscillation. Together they let the model train stably at up to 12 loop iterations, where baseline looped models collapse, and they improve average downstream-task performance by up to 13.2% in milder settings where the baseline still trains.

Core claim

By distributing inter-loop signals across all layers and injecting attention outputs to stabilize gradients, the Fully Looped Transformer trains without collapse at up to 12 loop iterations and, in milder settings where standard looped models also train, improves average downstream-task performance by as much as 13.2 percent.

What carries the argument

Fully Looped Architecture, which spreads inter-loop signals to all layers, and Attention Injection, which reuses attention blocks to reduce gradient oscillation.

Load-bearing premise

That the observed instability comes only from gradient oscillation and residual explosion, and that fixing these two issues is sufficient to stabilize training without side effects.

What would settle it

Train both the original Looped Transformer and the Fully Looped version with 12 loop iterations on a standard language modeling task; stable convergence in the new model but collapse in the baseline would support the claim.
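One way to make "collapse" checkable in such a run is a simple rule applied to the logged loss curve. The sketch below is an assumption, not the paper's criterion: `has_collapsed`, its `window` and `blowup` parameters, and the synthetic curves are all hypothetical. It flags a run when the loss becomes non-finite or, after a warmup window, exceeds a multiple of the best loss seen so far.

```python
import math


def has_collapsed(losses, window=50, blowup=2.0):
    """Return the first step at which a run looks collapsed, or None.

    Hypothetical criterion (not taken from the paper): a non-finite loss,
    or, after `window` warmup steps, a loss above `blowup` times the best
    loss seen on earlier steps.
    """
    best = math.inf
    for step, loss in enumerate(losses):
        if not math.isfinite(loss):
            return step
        if step >= window and loss > blowup * best:
            return step
        best = min(best, loss)
    return None


# Synthetic curves for illustration only; these are not results from the paper.
stable = [5.0 * 0.99 ** s + 2.0 for s in range(300)]
diverging = stable[:150] + [stable[149] * 1.5 ** k for k in range(1, 20)]

print("stable run collapsed at:", has_collapsed(stable))
print("diverging run collapsed at:", has_collapsed(diverging))
```

Applying the same rule, with the same thresholds, to both the baseline and the Fully Looped runs keeps the comparison from depending on a visual reading of the curves.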


If this is right

Models can be trained with higher loop iterations without divergence.
Performance gains appear even when standard looped models remain stable.
Inference compute can be adjusted by changing the number of loops after training.
The approach maintains fixed parameter count while scaling effective depth.
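The points above (loop count adjustable after training, fixed parameter count, effective depth growing with loops) can be illustrated with a toy sketch. This is not the paper's architecture: `make_block`, `apply_layer`, `looped_forward`, and `param_count` are hypothetical names, the layers are small tanh residual maps rather than Transformer layers, and neither the Fully Looped Architecture nor Attention Injection is modeled.

```python
import math


def make_block(num_layers, width, scale=0.1):
    """Build one shared block: a list of width x width weight matrices.

    Weights are deterministic toy values, chosen only for reproducibility.
    """
    return [
        [
            [scale * math.sin(1 + layer + i * width + j) for j in range(width)]
            for i in range(width)
        ]
        for layer in range(num_layers)
    ]


def apply_layer(weights, x):
    """Residual update x + tanh(W x), a stand-in for a Transformer layer."""
    return [
        xi + math.tanh(sum(w * xj for w, xj in zip(row, x)))
        for xi, row in zip(x, weights)
    ]


def looped_forward(block, x, num_loops):
    """Apply the same block `num_loops` times, reusing its weights each time."""
    for _ in range(num_loops):
        for weights in block:
            x = apply_layer(weights, x)
    return x


def param_count(block):
    """Number of scalar weights in the shared block."""
    return sum(len(row) for weights in block for row in weights)


block = make_block(num_layers=2, width=4)
x0 = [1.0, -0.5, 0.25, 0.0]
for loops in (3, 6, 12):
    out = looped_forward(block, x0, loops)
    norm = math.sqrt(sum(v * v for v in out))
    print(f"loops={loops:2d} effective_depth={loops * len(block):2d} "
          f"params={param_count(block)} output_norm={norm:.3f}")
```

The parameter count printed on each line is identical, while the effective depth is the loop count times the number of layers in the block; `num_loops` is an ordinary argument, so it can differ between training and inference.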

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar stabilization techniques might apply to other iterative models like recurrent neural networks.
Adjusting loop count at inference could serve as a way to trade accuracy for speed on a per-example basis.
The parameter-free nature suggests the fixes could be added to existing looped models with little overhead.
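The per-example trade-off mentioned above could, for instance, take the form of an early-exit rule. The following is a speculative sketch in the same spirit as these editorial extensions, not something the paper describes: `shared_step` is a hypothetical stand-in for one pass through the shared block, and the stopping rule (state change below `tol`) is an assumed heuristic.

```python
import math


def shared_step(x):
    """Hypothetical stand-in for one loop iteration of a shared block."""
    return [0.5 * xi + 0.5 * math.cos(xi) for xi in x]


def adaptive_loops(x, max_loops=12, tol=1e-3):
    """Loop until the state changes by less than `tol` or the budget runs out.

    Returns the final state and the number of loop iterations actually used.
    """
    if max_loops < 1:
        raise ValueError("max_loops must be at least 1")
    for used in range(1, max_loops + 1):
        nxt = shared_step(x)
        delta = max(abs(a - b) for a, b in zip(nxt, x))
        x = nxt
        if delta < tol:
            break
    return x, used


# An input near the map's fixed point exits early; a distant one uses more loops.
_, easy_loops = adaptive_loops([0.739])
_, hard_loops = adaptive_loops([3.0])
print("loops used (easy input):", easy_loops)
print("loops used (hard input):", hard_loops)
```

Whether a convergence-based exit of this kind preserves accuracy in a real looped model is an open empirical question that the paper's test-time adaptation results do not settle.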

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

Two parameter-free tweaks let looped transformers train stably at up to 12 loop iterations and gain up to 13.2% in average downstream performance in regimes where baselines still train.


The paper identifies gradient oscillation and residual explosion as the main reasons looped transformers become unstable at higher iteration counts. It then adds two changes: spreading inter-loop signals across every layer instead of just the last one, and reusing the attention block to inject a stabilizing signal.

These modifications are presented as new and they appear to work in the reported experiments. The model trains without collapse up to 12 loops while other looped baselines fail, and it still improves average task performance by up to 13.2% in regimes where the original looped version remains trainable. That is a concrete, usable result for anyone who wants to trade extra test-time iterations for accuracy without growing the parameter count.

The evidence is empirical and the stress-test note indicates the full manuscript supplies training curves and ablations that link the gains to the two proposed mechanisms. That part looks internally consistent. The main soft spot is that the performance lift and stability claims still need checking against a broader set of tasks and scales to confirm they are not tied to particular initializations or data regimes. It is also unclear how much of the gain comes from the exact fixes versus incidental changes in signal flow.

This is for people working on efficient scaling and recurrent-style transformers. Readers who need practical stability fixes for looped architectures will find usable ideas here.

Send it to peer review. The changes are simple enough that referees can evaluate them directly.

Referee Report

0 major / 3 minor

Summary. The paper claims that training instability in Looped Transformers arises from gradient oscillation and residual explosion. It introduces the Fully Looped Transformer with two parameter-free changes—Fully Looped Architecture (distributing inter-loop signals across layers) and Attention Injection (reusing attention blocks)—that stabilize training up to 12 iterations (where baselines collapse) and yield up to 13.2% better average downstream performance in milder regimes, while supporting inference-time adaptation via loop count.

Significance. If the empirical results and mechanistic analysis hold, the work provides a lightweight route to deeper effective computation in transformers without added parameters or context length. The parameter-free character of the fixes and the reported stability gains at high iteration counts are concrete strengths; the manuscript supplies the supporting training curves, ablations, and architecture diagrams that tie the gains to the stated mechanisms.

minor comments (3)

Abstract: the 13.2% figure and the 12-iteration stability claim are presented without naming the tasks, baselines, or number of runs; adding one sentence with these details would improve reproducibility.
The manuscript should clarify whether the reported downstream gains are measured at the same loop count used during training or at a different inference-time budget.
Figure captions for training curves should explicitly state the y-axis scale (e.g., loss or gradient norm) and whether shaded regions represent standard deviation across seeds.
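For the last comment, a shaded band that does represent standard deviation across seeds could be computed as sketched below. `band_across_seeds` is a hypothetical helper and the three curves are made-up values for illustration; the paper's number of seeds and its plotting code are not known from this review.

```python
from statistics import mean, stdev


def band_across_seeds(curves):
    """Per-step mean and mean +/- one sample standard deviation.

    `curves` is a list of per-seed loss lists of equal length.
    """
    if len(curves) < 2:
        raise ValueError("need at least two seeds for a standard deviation")
    if len({len(c) for c in curves}) != 1:
        raise ValueError("all seeds must log the same number of steps")
    means, lower, upper = [], [], []
    for values in zip(*curves):  # values = one step across all seeds
        m, s = mean(values), stdev(values)
        means.append(m)
        lower.append(m - s)
        upper.append(m + s)
    return means, lower, upper


# Made-up loss curves for three seeds (illustrative only).
seed_curves = [
    [4.0, 3.2, 2.9, 2.7],
    [4.1, 3.4, 2.8, 2.6],
    [3.9, 3.3, 3.0, 2.8],
]
means, lower, upper = band_across_seeds(seed_curves)
for step, (m, lo, hi) in enumerate(zip(means, lower, upper)):
    print(f"step {step}: mean={m:.3f} band=[{lo:.3f}, {hi:.3f}]")
```

A caption can then state exactly what the band is, for example "shaded region: one sample standard deviation over 3 seeds".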

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive evaluation of our manuscript and for recommending minor revision. The referee's summary correctly identifies the core issues of gradient oscillation and residual explosion in looped training, as well as the parameter-free nature of the Fully Looped Architecture and Attention Injection that enable stable training up to 12 iterations.

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on experiments


The manuscript advances an empirical architecture proposal (Fully Looped Transformer with two parameter-free modifications) whose central claims are validated through training stability curves, ablation studies, and downstream-task metrics rather than any closed mathematical derivation. No equations appear that define a quantity in terms of itself or that rename a fitted parameter as a prediction. The stated sources of instability (gradient oscillation, residual explosion) are presented as observations from the authors' runs, not as self-referential definitions. No load-bearing self-citations or uniqueness theorems imported from prior author work are invoked. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on one domain assumption about the sources of instability; no free parameters or invented entities are introduced.

axioms (1)

domain assumption: Instability in looped transformers arises from gradient oscillation and residual explosion.
Stated directly as the diagnosed cause of collapse when loop count increases.
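To probe this assumption on logged training statistics, the two symptoms can be given rough numerical proxies. The definitions below are assumptions for illustration, not the paper's diagnostics: `oscillation_rate` counts how often a gradient-norm series reverses direction, and `growth_ratio` compares the residual norm at the last loop iteration with that at the first. The sample numbers are invented.

```python
def oscillation_rate(norms):
    """Fraction of consecutive changes in `norms` that reverse direction.

    1.0 means the series zigzags at every step; 0.0 means it is monotone.
    """
    diffs = [b - a for a, b in zip(norms, norms[1:])]
    flips = sum(1 for d0, d1 in zip(diffs, diffs[1:]) if d0 * d1 < 0)
    return flips / max(len(diffs) - 1, 1)


def growth_ratio(norms_per_loop):
    """Residual norm at the last loop iteration divided by the first."""
    if not norms_per_loop or norms_per_loop[0] == 0:
        raise ValueError("need a non-empty series with a non-zero first norm")
    return norms_per_loop[-1] / norms_per_loop[0]


# Invented numbers for illustration; not measurements from the paper.
zigzag_grad_norms = [1.0, 2.0, 1.0, 2.0, 1.0]
smooth_grad_norms = [1.0, 0.9, 0.8, 0.7, 0.6]
residual_norms_by_loop = [2.0, 4.0, 8.0]

print("oscillation (zigzag):", oscillation_rate(zigzag_grad_norms))
print("oscillation (smooth):", oscillation_rate(smooth_grad_norms))
print("residual growth:", growth_ratio(residual_norms_by_loop))
```

If a run collapses while both proxies stay low, that would count against the premise that these two effects are the only sources of instability.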

pith-pipeline@v0.9.1-grok · 5771 in / 1011 out tokens · 28647 ms · 2026-06-30T22:13:59.850735+00:00 · methodology


Original abstract

Scaling model performance typically requires increasing model size. Looped Transformer offers a compelling alternative by iteratively reusing the same Transformer blocks, trading additional computation for improved performance without increasing parameter count or context length. Because the number of loop iterations can be adjusted at inference, it also provides a natural mechanism for balancing performance and test-time compute. However, Looped Transformer still suffers from training instability when the number of loop iterations increases. Our analysis reveals that this instability stems from two sources: gradient oscillation and residual explosion. To address these two problems, we propose the Fully Looped Transformer, which introduces two parameter-free modifications: (1) Fully Looped Architecture, which distributes inter-loop signals across all layers to mitigate residual explosion; (2) Attention Injection, which reuses the existing attention block to suppress gradient oscillation. These modifications stabilize training dynamics, enabling the Fully Looped Transformer to be trained stably up to 12 loop iterations, whereas other baseline looped models collapse in this regime. In milder settings where Looped Transformer does not collapse, Fully Looped Transformer still improves average downstream-task performance by up to 13.2%. Overall, our experiments demonstrate that Fully Looped Transformer improves training stability, enhances downstream performance, and provides preliminary adaptability under different test-time compute budgets by varying loop iterations at inference.

Figures

Figures reproduced from arXiv: 2605.18797 by Hechang Chen, Jiankun Zhang, Jing Ma, Rao Fu, Yi Chang, Yu Li, Zixuan Yang.

**Figure 2.** Training dynamics of LT and FLT during the first 2000 optimizer steps.

**Figure 3.** Training dynamics comparison of the FLT variants and LT variants. All models compared ...

**Figure 4.** The loss of different base size models at the 12-loop setting. All models except FLT collapsed. Smoothed with factor 0.9 for readability.

| Metrics | GQA | MLA | SWA | FA |
| --- | --- | --- | --- | --- |
| Wiki2 ppl ↓ | 39.68 | 38.76 | 38.91 | 41.12 |
| Val bpb ↓ | 0.897 | 0.904 | 0.895 | 0.895 |
| Core ↑ | 16.24 | 15.64 | 15.58 | 15.66 |

**Figure 5.** Test-time adaptation evaluation results of FLT. Models trained with 3, 6, 9, or 12 loop ...

**Figure 6.** Left: the residual norm at the 6th loop iteration. Middle: the residual norm at the 9th loop iteration. Right: the residual norm at the 12th loop iteration.

**Figure 7.** Left: the gradient norm of the LM head block. Middle: the gradient norm of the FFN at the 5th layer. Right: the gradient norm of the attention block at the 5th layer. Figures 6 and 7 provide supplementary evidence for the diagnostic experiment in Section 3.

**Figure 8.** Trend chart of Core Metric changes for FLT with different attention variants throughout the ...

**Figure 9.** The training loss of FLT with different attention variants. Smoothed with factor 0.9.
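The captions for Figures 4 and 9 state that the loss curves are "smoothed with factor 0.9" but do not give the formula. A common reading is an exponential moving average, sketched below under that assumption; `ema_smooth` is a hypothetical helper, and the paper may use a different variant (for example one with bias correction).

```python
def ema_smooth(values, factor=0.9):
    """Exponential moving average: s_t = factor * s_{t-1} + (1 - factor) * x_t.

    The first smoothed value equals the first raw value. This is an assumed
    interpretation of "smoothed with factor 0.9", not the paper's stated method.
    """
    if not 0.0 <= factor < 1.0:
        raise ValueError("factor must be in [0, 1)")
    smoothed = []
    running = None
    for v in values:
        running = v if running is None else factor * running + (1.0 - factor) * v
        smoothed.append(running)
    return smoothed


# Illustrative noisy curve (made-up values).
raw = [3.0, 2.0, 3.5, 1.8, 3.2, 1.9]
print([round(v, 3) for v in ema_smooth(raw)])
```

With a factor this high, a short spike in the raw loss is strongly damped in the plotted curve, so collapse is easier to judge from unsmoothed values or from the point where the loss becomes non-finite.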


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Dense Supervision Is Not Enough: The Readout Blind Spot in Looped Language Models
cs.LG · 2026-06 · unverdicted · novelty 5.0

Dense per-loop cross-entropy in looped transformers fails to control hidden-state scale with scale-invariant readouts like RMSNorm, driving norms to thousands, while scale-visible readouts or norm penalties keep norms...

