Simply Stabilizing the Loop via Fully Looped Transformer

Hechang Chen; Jiankun Zhang; Jing Ma; Rao Fu; Yi Chang; Yu Li; Zixuan Yang

arxiv: 2605.18797 · v2 · pith:W5DTY6P2new · submitted 2026-05-11 · 💻 cs.LG · cs.AI

Rao Fu, Zixuan Yang, Jiankun Zhang, Jing Ma, Hechang Chen, Yu Li, Yi Chang

Pith reviewed 2026-05-20 23:15 UTC · model grok-4.3

keywords looped transformer · training stability · gradient oscillation · residual explosion · attention injection · iterative computation · model scaling · test-time compute

The pith

Two parameter-free changes to the looped transformer fix gradient oscillation and residual explosion to allow stable training at up to 12 iterations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that looped transformers, which reuse the same blocks repeatedly to gain performance without adding parameters, become unstable as the number of iterations grows. The authors trace the instability to oscillating gradients and exploding residuals, then introduce a fully looped architecture that spreads inter-loop signals across every layer and an attention injection step that reuses the existing attention block to dampen oscillations. These changes keep training stable at 12 loop iterations where prior looped models collapse. In regimes where baselines remain stable, the modified model still raises average downstream accuracy by as much as 13.2 percent. The result supplies a practical way to trade extra test-time computation for better performance while keeping parameter count fixed.

Core claim

The Fully Looped Transformer distributes inter-loop signals across all layers to prevent residual explosion and reuses the attention block to suppress gradient oscillation. These two parameter-free modifications stabilize training dynamics up to 12 loop iterations, whereas baseline looped models collapse in the same regime, and they raise average downstream-task performance by up to 13.2 percent in milder settings.

What carries the argument

The Fully Looped Architecture and Attention Injection: the first distributes inter-loop signals across all layers to counter residual explosion, and the second reuses the existing attention block to suppress gradient oscillation.

If this is right

Model capacity can be increased by raising loop count at inference instead of widening or deepening the network.
Test-time compute can be varied after training by choosing different numbers of iterations without retraining.
Training succeeds in regimes where earlier looped designs fail, expanding the usable range of iteration counts.
Downstream accuracy improves even when both models train successfully, showing the fixes also aid optimization.
Parameter count stays constant while effective depth grows, offering a different scaling axis from standard transformers.
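
The loop-count claims above can be illustrated with a toy weight-tied loop. This is a minimal sketch, not the paper's architecture: a single residual `tanh` layer stands in for a Transformer block, and the names `make_block`, `block_forward`, `looped_forward`, and `param_count` are hypothetical. It only shows that the parameter count stays fixed while the number of applications of the shared block varies, and that a residual norm can be recorded at each loop.

```python
import math
import random


def make_block(dim, seed=0, scale=0.1):
    """Create one shared weight matrix; this is the only parameter set."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(dim)]


def block_forward(weights, x):
    """Toy residual block x + tanh(W x); an assumed stand-in for a Transformer block."""
    out = []
    for row, xi in zip(weights, x):
        pre = sum(w * v for w, v in zip(row, x))
        out.append(xi + math.tanh(pre))
    return out


def looped_forward(weights, x, num_loops):
    """Apply the same block num_loops times and record the L2 norm after each loop."""
    norms = []
    for _ in range(num_loops):
        x = block_forward(weights, x)
        norms.append(math.sqrt(sum(v * v for v in x)))
    return x, norms


def param_count(weights):
    """Number of scalar parameters in the shared block."""
    return sum(len(row) for row in weights)


if __name__ == "__main__":
    w = make_block(dim=8)
    x0 = [1.0] * 8
    for loops in (3, 6, 9, 12):
        _, norms = looped_forward(w, x0, loops)
        # The parameter count is identical for every loop setting.
        print(loops, param_count(w), round(norms[-1], 3))
```

Because the weights are shared, changing `num_loops` at inference needs no new parameters, which is the scaling axis the list describes. Whether the output remains useful at loop counts not seen in training is an empirical question the toy does not answer.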

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same signal-distribution and attention-reuse ideas could be tested in other iterative architectures such as recurrent networks or state-space models.
If the fixes generalize, training budgets could shift from adding parameters toward adding loop iterations at inference.
The approach may reduce the need for very deep unrolled networks by letting a shallow block be reused more reliably.
Similar lightweight modifications might address instability in other training regimes that suffer from repeated residual paths.

Load-bearing premise

Instability in looped transformers comes only from gradient oscillation and residual explosion, and the two proposed fixes remove those sources without introducing new instabilities.

What would settle it

Train the Fully Looped Transformer and a standard Looped Transformer with 12 loop iterations on the same data, then compare whether their loss curves remain stable or diverge.
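
Comparing loss curves needs an explicit definition of "diverge". The sketch below is one possible criterion, not one taken from the paper: a run is flagged when the loss becomes non-finite, or when, after a warmup period, it exceeds a multiple of the best loss seen so far. The function name `first_divergence_step` and the defaults `warmup=100` and `factor=2.0` are assumptions chosen for illustration.

```python
import math


def first_divergence_step(losses, warmup=100, factor=2.0):
    """Return the first step where the loss is NaN/inf, or (after `warmup`
    steps) exceeds `factor` times the best loss seen so far.
    Returns None if the curve never trips either check."""
    best = math.inf
    for step, loss in enumerate(losses):
        if not math.isfinite(loss):
            return step  # NaN or inf counts as collapse at any step
        if step >= warmup and loss > factor * best:
            return step  # loss spiked well above its running minimum
        best = min(best, loss)
    return None
```

Applying the same function, with the same thresholds, to both models' logged losses keeps the comparison symmetric.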

Figures

Figures reproduced from arXiv: 2605.18797 by Hechang Chen, Jiankun Zhang, Jing Ma, Rao Fu, Yi Chang, Yu Li, Zixuan Yang.

**Figure 2.** Training dynamics of LT and FLT during the first 2000 optimizer steps.

**Figure 3.** Training dynamics comparison of the FLT variants and LT variants. All models compared

**Figure 4.** The loss of different base size models at 12-loop setting. All models except FLT collapsed. Smoothed with factor 0.9 for readability.

| Metrics | GQA | MLA | SWA | FA |
| --- | --- | --- | --- | --- |
| Wiki2 ppl ↓ | 39.68 | 38.76 | 38.91 | 41.12 |
| Val bpb ↓ | 0.897 | 0.904 | 0.895 | 0.895 |
| Core ↑ | 16.24 | 15.64 | 15.58 | 15.66 |

**Figure 5.** Test-time adaptation evaluation results of FLT. Models trained with 3, 6, 9, or 12 loop

**Figure 6.** Left: the residual norm at the 6th loop iteration. Middle: the residual norm at the 9th loop iteration. Right: the residual norm at the 12th loop iteration.

**Figure 7.** Left: the gradient norm of the LM head block. Middle: the gradient norm of the FFN at the 5th layer. Right: the gradient norm of the attention block at the 5th layer. Figures 6 and 7 provide supplementary evidence for the diagnostic experiment in Section 3.

**Figure 8.** Trend chart of Core Metric changes for FLT with different attention variants throughout the

**Figure 9.** The training loss of FLT with different attention variants. Smoothed with factor 0.9.
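
Two of the captions mention curves smoothed with factor 0.9. The captions do not say which smoothing rule was used. A common choice for training curves is an exponential moving average, sketched here under that assumption; `ema_smooth` is a hypothetical helper, not something from the paper.

```python
def ema_smooth(values, factor=0.9):
    """Exponential moving average: s_t = factor * s_{t-1} + (1 - factor) * x_t,
    initialised with the first value. Larger factors give smoother curves."""
    if not 0.0 <= factor < 1.0:
        raise ValueError("factor must be in [0, 1)")
    smoothed = []
    running = None
    for v in values:
        running = v if running is None else factor * running + (1.0 - factor) * v
        smoothed.append(running)
    return smoothed
```

With a factor of 0.9, a one-step spike contributes only a tenth of its height to the plotted value, so smoothed curves can understate short instabilities. Raw curves are the safer basis for judging collapse.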

Original abstract

Scaling model performance typically requires increasing model size. Looped Transformer offers a compelling alternative by iteratively reusing the same Transformer blocks, trading additional computation for improved performance without increasing parameter count or context length. Because the number of loop iterations can be adjusted at inference, it also provides a natural mechanism for balancing performance and test-time compute. However, Looped Transformer still suffers from training instability when the number of loop iterations increases. Our analysis reveals that this instability stems from two sources: gradient oscillation and residual explosion. To address these two problems, we propose the Fully Looped Transformer, which introduces two parameter-free modifications: (1) Fully Looped Architecture, which distributes inter-loop signals across all layers to mitigate residual explosion; (2) Attention Injection, which reuses the existing attention block to suppress gradient oscillation. These modifications stabilize training dynamics, enabling the Fully Looped Transformer to be trained stably up to 12 loop iterations, whereas other baseline looped models collapse in this regime. In milder settings where Looped Transformer does not collapse, Fully Looped Transformer still improves average downstream-task performance by up to 13.2%. Overall, our experiments demonstrate that Fully Looped Transformer improves training stability, enhances downstream performance, and provides preliminary adaptability under different test-time compute budgets by varying loop iterations at inference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper stabilizes looped transformers up to 12 iterations with two parameter-free changes and reports some performance gains, but the instability diagnosis looks incomplete and experimental details are thin.

The main takeaway is that this work takes the looped transformer setup and adds a fully distributed signal path across layers plus attention reuse to keep gradients in check. That lets training hold together at higher loop counts where prior versions fall apart, and it gives a modest lift on downstream tasks when the baseline stays stable. The changes are parameter-free, which keeps the scaling story clean: more test-time iterations instead of more weights or context.

Referee Report

2 major / 2 minor

Summary. The paper introduces the Fully Looped Transformer to address training instability in Looped Transformers when increasing the number of loop iterations. It identifies gradient oscillation and residual explosion as the two primary sources of instability through analysis, then proposes two parameter-free modifications: (1) Fully Looped Architecture, which distributes inter-loop signals across all layers to mitigate residual explosion, and (2) Attention Injection, which reuses the existing attention block to suppress gradient oscillation. Experiments demonstrate stable training up to 12 loop iterations (where baselines collapse) and up to 13.2% average downstream-task performance gains in milder regimes, with preliminary evidence of adaptability by varying loop count at inference.

Significance. If the empirical stability and gains hold under rigorous controls, the work provides a practical, parameter-free route to deeper effective computation via looping without increasing model size or context length. This directly supports better performance-compute tradeoffs at test time. The explicit identification of instability sources and the parameter-free design are strengths that could influence efficient scaling research, though the completeness of the causal analysis remains a key open question for the central claims.

major comments (2)

§3 (Instability Analysis): The assertion that gradient oscillation and residual explosion constitute the complete and primary causes of collapse at high loop counts is load-bearing for the design of the two fixes and the stability claim up to 12 iterations. The manuscript does not provide exhaustive ablations or theoretical arguments ruling out other mechanisms (e.g., attention pattern drift across iterations or depth-dependent regularization effects), leaving the attribution of success to these specific modifications incomplete.
§4 and §5 (Experimental Setup and Results): The reported 13.2% average performance improvement and stability up to 12 iterations lack details on experimental controls, number of random seeds, statistical significance testing, or correction for multiple comparisons across tasks and loop counts. Without these, it is unclear whether the gains are robust or whether the collapse of baselines is consistently reproduced.
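
The attention-pattern-drift alternative raised in the first major comment could be probed by tracking mean attention entropy at each loop iteration. The sketch below assumes attention weights are available per loop as row-stochastic matrices (each row a probability distribution over keys). The function names are hypothetical, and nothing here is taken from the paper's code.

```python
import math


def row_entropy(probs):
    """Shannon entropy (in nats) of one attention row; zero entries contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_attention_entropy(attn):
    """Average row entropy of one attention matrix (a list of probability rows)."""
    return sum(row_entropy(row) for row in attn) / len(attn)


def entropy_per_loop(attn_by_loop):
    """Mean attention entropy for each loop iteration, in loop order."""
    return [mean_attention_entropy(attn) for attn in attn_by_loop]
```

A steady fall toward zero across loops would indicate attention sharpening onto single tokens, while a rise toward `log(num_keys)` would indicate it flattening. Either trend would be evidence of drift, though not by itself evidence that drift causes the collapse.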

minor comments (2)

Abstract and §2: The description of Attention Injection as 'reusing the existing attention block' would benefit from a precise equation or pseudocode showing how the injection is implemented without altering parameter count or introducing new learnable weights.
Figure 2 or equivalent (Training Dynamics): The plots of gradient norms and residual magnitudes across loops would be clearer if they included error bars from multiple runs and direct comparison to the proposed fixes at each iteration count.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. We address each major comment point by point below, indicating where revisions will be made to strengthen the manuscript.

Referee: §3 (Instability Analysis): The assertion that gradient oscillation and residual explosion constitute the complete and primary causes of collapse at high loop counts is load-bearing for the design of the two fixes and the stability claim up to 12 iterations. The manuscript does not provide exhaustive ablations or theoretical arguments ruling out other mechanisms (e.g., attention pattern drift across iterations or depth-dependent regularization effects), leaving the attribution of success to these specific modifications incomplete.

Authors: We appreciate the referee's emphasis on the need for stronger causal attribution. Section 3 presents empirical gradient analysis and ablation results showing that gradient oscillation and residual explosion are the dominant instability sources in the high-iteration regime, with the proposed fixes directly mitigating them to enable stable training up to 12 iterations. We do not claim these are the sole possible mechanisms. In the revision we will expand the discussion to explicitly acknowledge alternative factors such as attention pattern drift and include additional targeted ablations (e.g., monitoring attention entropy across loops) to further support the primary role of the identified issues. revision: partial
Referee: §4 and §5 (Experimental Setup and Results): The reported 13.2% average performance improvement and stability up to 12 iterations lack details on experimental controls, number of random seeds, statistical significance testing, or correction for multiple comparisons across tasks and loop counts. Without these, it is unclear whether the gains are robust or whether the collapse of baselines is consistently reproduced.

Authors: We agree that greater experimental rigor is required. The revised manuscript will report the number of random seeds (3–5 per setting), include standard deviations or error bars for all metrics, add statistical significance tests (e.g., paired t-tests against baselines), and clarify the primary comparisons of interest to address multiple-testing concerns. These additions will substantiate the robustness of the 13.2% gains and the consistent reproduction of baseline collapse at high loop counts. revision: yes
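
The paired t-test mentioned in the second response can be sketched as follows, assuming each seed yields one score for the baseline and one for the modified model. `paired_t_statistic` is a hypothetical helper that returns only the statistic and degrees of freedom. The scores in the usage block are placeholders for illustration, not results from the paper.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic for per-seed score pairs.
    Returns (t, degrees_of_freedom) with t = mean(d) / (stdev(d) / sqrt(n))."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_d == 0.0:
        raise ValueError("differences have zero variance; t is undefined")
    return mean_d / (sd_d / math.sqrt(len(diffs))), len(diffs) - 1


if __name__ == "__main__":
    # Placeholder per-seed scores, invented purely to exercise the function.
    modified = [2.0, 4.0, 6.0]
    baseline = [1.0, 2.0, 3.0]
    print(paired_t_statistic(modified, baseline))
```

With only 3 to 5 seeds the test has few degrees of freedom, so per-seed differences are worth reporting alongside any p-value. SciPy's `scipy.stats.ttest_rel` computes the same statistic together with a p-value.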

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical validation of parameter-free modifications

The paper presents an empirical analysis identifying gradient oscillation and residual explosion as instability sources, followed by two parameter-free architectural modifications (Fully Looped Architecture and Attention Injection) whose effects are demonstrated through training runs up to 12 iterations and downstream task improvements of up to 13.2%. No equations, predictions, or first-principles derivations are shown that reduce by construction to fitted inputs, self-definitions, or self-citation chains. The central stability and performance claims are supported by external experimental benchmarks rather than internal redefinitions or renamings, rendering the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard transformer training assumptions plus the untested premise that the observed instability sources are exhaustive and that the fixes are neutral with respect to other dynamics.

axioms (1)

Domain assumption: Standard assumptions of transformer training dynamics and gradient flow hold for looped variants.
Invoked when attributing instability to gradient oscillation and residual explosion.

pith-pipeline@v0.9.0 · 5771 in / 1154 out tokens · 26824 ms · 2026-05-20T23:15:33.420617+00:00 · methodology
