How Much Is One Recurrence Worth? Iso-Depth Scaling Laws for Looped Language Models

· 2026 · cs.LG · arXiv 2604.21106

1 Pith paper cite this work. Polarity classification is still indexing.

1 Pith paper citing it

open full Pith review browse 1 citing papers arXiv PDF

abstract

We measure how much one recurrence is worth to a looped (depth-recurrent) transformer, in equivalent unique parameters. From an iso-depth pretraining sweep across recurrence counts $r \in \{1, 2, 4, 8\}$ spanning ${\sim}50\times$ in training compute, we fit a joint scaling law $L = E + A\,(N_\text{once} + r^{\varphi} N_\text{rec})^{-\alpha} + B\,D^{-\beta}$ and measure a recurrence-equivalence exponent $\varphi = 0.46$. Intuitively, $\varphi$ tells us whether looping a block $r$ times is equivalent in validation loss to $r$ unique blocks of a non-looped model (full equivalence, $\varphi{=}1$) or to a single block run repeatedly with no capacity gain ($\varphi{=}0$). Our $\varphi = 0.46$ sits in between, so replacing unique blocks with shared recurrences increases validation loss at matched training compute. For example, at $r{=}4$ a 410M looped model performs on par with a 580M non-looped model, but incurs the training cost of a 1B non-looped one. We demonstrate the utility of $\varphi$ as a diagnostic tool on two case studies: commonly used truncated backpropagation lowers $\varphi$ to $0.38$, indicating that the loop mechanism is poorly trained under truncation, even though validation loss decreases. Conversely, hyperconnections raise $\varphi$ to $0.65$, a genuine capacity gain. Our method separates true loop improvements from training-side gains, a distinction raw validation loss cannot make.

representative citing papers

Hyperloop Transformers

cs.LG · 2026-04-23 · unverdicted · novelty 5.0

Hyperloop Transformers outperform standard and mHC Transformers with roughly 50% fewer parameters by looping a middle block of layers and applying hyper-connections only after each loop.

citing papers explorer

Showing 1 of 1 citing paper.

Hyperloop Transformers cs.LG · 2026-04-23 · unverdicted · none · ref 21 · internal anchor
Hyperloop Transformers outperform standard and mHC Transformers with roughly 50% fewer parameters by looping a middle block of layers and applying hyper-connections only after each loop.

How Much Is One Recurrence Worth? Iso-Depth Scaling Laws for Looped Language Models

fields

years

verdicts

representative citing papers

citing papers explorer