Review Residuals: Update-Conditioned Residual Gating for Transformers

Kyle Kramer

arXiv: 2606.31859 · v1 · pith:O2QKGYYLnew · submitted 2026-06-30 · cs.LG · cs.CL

Pith reviewed 2026-07-01 06:36 UTC · model grok-4.3

keywords: residual connections, transformers, gated residuals, highway networks, scaling laws, depth stability, vanishing gradients

The pith

Review Residuals replace the fixed coefficient of one in residual connections with a gate that sees both the current state and the proposed update, producing gains that emerge at 590M parameters and widen at 1B.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard residual connections add every sublayer update with a fixed coefficient of one without checking reliability. Review Residuals instead compute a learned gate conditioned on both the prior hidden state and the sublayer's proposed update, then scale the update by that gate before adding it. Experiments across five model sizes from 60M to 1B parameters show that an additive version of this gate trains stably at all tested depths while a convex Highway-style version reintroduces vanishing gradients beyond roughly 20 layers. No advantage appears at small scale, but at 590M parameters the method significantly outperforms both a parameter-matched Highway gate and a standard residual connection, with a larger gap at 1B parameters where the benefit continues to grow.

Core claim

Review Residuals replace the fixed coefficient of one in residual connections with a learned, input-dependent gate conditioned on both the current hidden state and the proposed update from the sublayer. This yields two results: an additive formulation trains stably at every depth tested while a convex formulation reintroduces vanishing gradients, and the performance benefit over standard residuals and Highway gates emerges only at large scale and increases with model size.

What carries the argument

The update-conditioned gate r_l = sigmoid(W[RMSNorm(h_{l-1}), RMSNorm(u_l)]) that multiplies the proposed update u_l before adding it to the previous state h_{l-1}.
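A minimal sketch of one reading of this gate, in plain Python. The source gives only the formula, so several details here are assumptions made for illustration: the per-channel gate (W maps the 2d-dimensional concatenation to d gate values), the absence of a bias term, an RMSNorm without a learned scale, the epsilon value, and every function name.

```python
import math

def rms_norm(x, eps=1e-6):
    """RMSNorm without a learned scale (assumption): x / sqrt(mean(x^2) + eps)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def review_gate(W, h_prev, u):
    """r_l = sigmoid(W [RMSNorm(h_{l-1}), RMSNorm(u_l)]).

    W is assumed to have shape (d, 2d), giving one gate value per channel.
    """
    z = rms_norm(h_prev) + rms_norm(u)  # list concatenation, length 2d
    return [sigmoid(sum(w * v for w, v in zip(row, z))) for row in W]

def review_residual(W, h_prev, u):
    """h_l = h_{l-1} + r_l * u_l (additive form; the identity path is untouched)."""
    r = review_gate(W, h_prev, u)
    return [h + g * v for h, g, v in zip(h_prev, r, u)]

# With W = 0 every gate equals sigmoid(0) = 0.5, so half of each update is added.
d = 3
W_zero = [[0.0] * (2 * d) for _ in range(d)]
h_prev = [1.0, -2.0, 0.5]
u = [0.2, 0.4, -1.0]
h_next = review_residual(W_zero, h_prev, u)
```

Under these assumptions, a state-only (Highway-style) gate would differ only in dropping the `rms_norm(u)` half of the concatenation.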

If this is right

Transformers can be trained stably to greater depths using the additive Review Residual form.
Performance advantages of update-conditioned gating increase rather than diminish as model size grows from 590M to 1B parameters.
Parameter-matched comparisons show statistical significance (p<0.05) favoring Review Residuals at 590M parameters, with a larger advantage at 1B.
Small models (60M) show no benefit, indicating the mechanism's utility is scale-dependent.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same update-conditioning principle could be tested in non-residual components such as attention or feed-forward layers.
If update quality becomes more variable at larger scales, explicit review gates may become a general requirement rather than an optional refinement.
The depth-stability result suggests that any gating method intended for very deep networks should preserve an identity path rather than rely on convex combinations.
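A back-of-the-envelope illustration of why the identity path matters, assuming the convex form is the usual Highway-style update h_l = (1 - r_l) * h_{l-1} + r_l * u_l. The sketch tracks only the coefficient on the direct skip path and ignores how r_l and u_l themselves depend on h_{l-1}; the constant gate value 0.5 is illustrative, not a measured one.

```python
def identity_path_scale(gates, convex):
    """Product of the coefficients on h_{l-1} along the direct skip path.

    Convex (Highway-style) form:  h_l = (1 - r_l) * h_{l-1} + r_l * u_l
    Additive form:                h_l = h_{l-1} + r_l * u_l
    Only the direct path is tracked; the dependence of r_l and u_l
    on h_{l-1} is ignored (a simplification).
    """
    scale = 1.0
    for r in gates:
        scale *= (1.0 - r) if convex else 1.0
    return scale

gates = [0.5] * 20  # illustrative constant gate value over 20 layers
convex_scale = identity_path_scale(gates, convex=True)     # 0.5 ** 20, about 9.5e-7
additive_scale = identity_path_scale(gates, convex=False)  # exactly 1.0
```

In this simplified picture the convex form shrinks the skip-path signal geometrically with depth, while the additive form leaves it at one regardless of the gate values.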

Load-bearing premise

The performance differences arise specifically from conditioning the gate on the proposed update rather than from other differences in training procedure, initialization, or hyperparameter tuning, and the multi-seed statistical tests adequately control for run-to-run variance.
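As a rough sketch of what a multi-seed comparison can look like, the snippet below computes Welch's t statistic and its degrees of freedom for two sets of per-seed losses. The source does not say which test the paper used, so Welch's test is an assumption here, and the loss values are placeholders rather than results from the paper.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Placeholder final losses for three seeds per arm; NOT results from the paper.
baseline_loss = [2.00, 2.02, 2.04]
review_loss = [1.99, 2.00, 2.01]
t_stat, dof = welch_t(baseline_loss, review_loss)
```

A p-value would come from the Student t distribution with `dof` degrees of freedom, which the standard library does not provide. With only a few seeds per arm the degrees of freedom stay small, so a given mean gap needs low run-to-run variance to reach significance.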

What would settle it

A 1B-parameter replication in which a gate conditioned only on the state (without the update) matches or exceeds the update-conditioned version would show that conditioning on the update is not the source of the reported benefit.

Figures

Figures reproduced from arXiv: 2606.31859 by Kyle Kramer.

**Figure 1.** All three variants preserve the identity path (additive). (a) The standard residual adds the update with a fixed coefficient. (b) Highway scales the update by a gate conditioned only on the state h_{l-1}. (c) Review Residuals scale it by a gate conditioned on both the state and the proposed update u_l (red arrow): the network inspects the change before committing it.

**Figure 2.** Review's advantage (baseline loss − Review loss) versus model size, with ±1 s.e. bars. Both curves start near or below zero (the standard residual is better than Review at 60M) and rise to a significant ∼+0.016 nats at 1B. The benefit emerges with scale. Stars mark p < 0.05 (590M). … advantage over both baselines is significant at p < 0.05; the larger 1B advantage, with three seeds, sits just below that th…

Original abstract

Residual connections add every sublayer's proposed update with a fixed coefficient of one; the network never evaluates whether an update is reliable before committing it. Drawing on the human-factors principle of independent verification, we introduce Review Residuals, which scale each update by a learned, input-dependent gate conditioned on both the current state and the proposed update: h_l = h_{l-1} + r_l * u_l with r_l = sigmoid(W[RMSNorm(h_{l-1}), RMSNorm(u_l)]). Conditioning the gate on the update is the property that distinguishes it from prior gated and scaled residuals. We report two findings. First, a depth-stability result: a convex (Highway-style) form of the gate reintroduces vanishing gradients and fails to train beyond ~20 layers, whereas the additive, identity-preserving form trains stably at all depths we tested. Second, an emergence-with-scale result: trained from scratch across five sizes (60M-1B parameters, multi-seed), Review Residuals show no advantage at small scale but at 590M significantly outperform both a parameter-matched Highway gate and a parameter-matched standard residual (p<0.05), with a larger advantage at 1B. The benefit grows with model size rather than shrinking.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's main claim is a scale-emergent win from conditioning the residual gate on the proposed update, but the abstract leaves the isolation of that effect under-specified.

The new piece is the gate that looks at both the current hidden state and the update via RMSNorm on each, then multiplies the update by a sigmoid. That is a clean extension of earlier gated residuals, and the depth-stability result (convex Highway form collapses past ~20 layers while the additive form does not) is a useful sanity check. The scale result is the bigger claim: no gain at 60M–300M but clear outperformance at 590M and 1B against parameter-matched baselines, with the gap widening.

The experiments are run from scratch across five sizes with multiple seeds and p<0.05 tests, which is more than many architecture papers bother with. That deserves credit.

The soft spot is exactly the one the stress-test flags. The abstract asserts parameter matching and identical training but does not state that optimizer, LR schedule, initialization, data order, or normalization placement were locked down across the three residual variants. Without those controls written out, the scale-dependent pattern could still be driven by an uncontrolled difference rather than the update-conditioning itself. The full paper may fix this; the abstract does not.

This is for people who already care about residual modifications inside the transformer stack. A reader who wants a quick, reproducible tweak for large-model training could get value if the isolation holds. It is coherent on its own terms and shows honest engagement with the scaling question, so it deserves a serious referee to check the experimental controls and the exact implementation of the gate projection.

Referee Report

1 major / 1 minor

Summary. The paper introduces Review Residuals, a modification to residual connections in Transformers where the update u_l is scaled by a gate r_l = sigmoid(W [RMSNorm(h_{l-1}), RMSNorm(u_l)]), conditioned on both the previous state and the proposed update. It claims two results: (1) an additive form of the gate trains stably at all tested depths, unlike a convex Highway-style gate, which fails to train beyond roughly 20 layers; and (2) when trained from scratch on models from 60M to 1B parameters with multiple seeds, Review Residuals show no benefit at small scales but significantly outperform parameter-matched Highway and standard residual connections at 590M (p<0.05), with a larger advantage at 1B.

Significance. If the experimental comparisons are properly controlled, the emergence-with-scale finding would be notable, as it identifies a residual formulation whose benefits appear only at larger model sizes rather than diminishing, potentially offering a practical improvement for training large Transformers. The depth-stability result reinforces the importance of preserving the identity path in gated residuals. The multi-seed experiments and reported p<0.05 tests are a strength.

major comments (1)

[Abstract] The abstract states that Review Residuals 'significantly outperform both a parameter-matched Highway gate and a parameter-matched standard residual (p<0.05)' at 590M, with a larger advantage at 1B, but provides no explicit confirmation that all training details (optimizer, learning rate schedule, initialization, data ordering, normalization placement) were held identical across the three residual variants. Without this, the scale-dependent outperformance cannot be confidently attributed to the update-conditioning mechanism.

minor comments (1)

[Abstract] The gate input is written as W[RMSNorm(h_{l-1}), RMSNorm(u_l)]; the notation should make explicit that the two normalized vectors are concatenated before the linear projection.
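One way to make the concatenation explicit, written in LaTeX. The hidden dimension d, the shape of W, and the elementwise product are assumptions for illustration; the abstract does not say whether the gate is per-channel or a single scalar per position.

```latex
\[
  z_l = \begin{bmatrix} \operatorname{RMSNorm}(h_{l-1}) \\ \operatorname{RMSNorm}(u_l) \end{bmatrix} \in \mathbb{R}^{2d},
  \qquad
  r_l = \sigma(W z_l), \quad W \in \mathbb{R}^{d \times 2d},
  \qquad
  h_l = h_{l-1} + r_l \odot u_l .
\]
```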

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the need for explicit confirmation of experimental controls. We address the concern below and will revise the manuscript accordingly.

Referee: [Abstract] The abstract states that Review Residuals 'significantly outperform both a parameter-matched Highway gate and a parameter-matched standard residual (p<0.05)' at 590M, with a larger advantage at 1B, but provides no explicit confirmation that all training details (optimizer, learning rate schedule, initialization, data ordering, normalization placement) were held identical across the three residual variants. Without this, the scale-dependent outperformance cannot be confidently attributed to the update-conditioning mechanism.

Authors: We agree that the abstract should explicitly state the controls. The full methods section already specifies that all three residual variants were trained under identical conditions: the same optimizer (AdamW), learning-rate schedule, initialization scheme, data ordering, batch size, and normalization placement, with only the residual formulation differing. To address the referee's point directly in the abstract, we will add the clause 'with all other training details held identical across variants' to the relevant sentence. This makes the attribution to the update-conditioning mechanism unambiguous without altering any results or claims.

Revision: yes
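The control described in this response can be made checkable in code. The sketch below is hypothetical: the field names and placeholder values are not the paper's settings, and only the choice of AdamW comes from the response above. It builds the three run configurations from one base and reports which fields differ.

```python
from dataclasses import dataclass, replace, asdict

@dataclass(frozen=True)
class RunConfig:
    # Field names are hypothetical; values are placeholders, not the paper's settings.
    residual: str  # "standard", "highway", or "review"
    optimizer: str = "adamw"
    lr_schedule: str = "shared-schedule"
    init_scheme: str = "shared-init"
    data_order_seed: int = 0
    batch_size: int = 1
    norm_placement: str = "shared-placement"

base = RunConfig(residual="standard")
variants = [replace(base, residual=name) for name in ("standard", "highway", "review")]

def differing_fields(a, b):
    """Names of the fields whose values differ between two configs."""
    da, db = asdict(a), asdict(b)
    return sorted(k for k in da if da[k] != db[k])
```

Deriving every variant from a single frozen base makes an accidental difference in any other field show up in `differing_fields`.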

Circularity Check

0 steps flagged

No circularity: empirical comparisons rest on direct training runs, not on self-referential definitions or fitted inputs.

The paper introduces a gated residual form and reports scale-dependent performance differences from multi-seed training experiments across model sizes. No equations are presented that define a quantity in terms of itself, no fitted parameters are relabeled as predictions, and no load-bearing claims rely on self-citations or uniqueness theorems imported from prior author work. The depth-stability and emergence-with-scale findings are framed as outcomes of the reported training procedure rather than algebraic identities, making the derivation chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claims rest on empirical training outcomes of a new gating mechanism whose weights are learned from data; no theoretical derivation or external benchmarks are invoked.

free parameters (1)

Gate projection matrix W
Learned linear weights that compute the gate from the concatenated normalized state and update vectors.

axioms (2)

domain assumption: RMSNorm produces suitable normalized inputs for the gate.
Invoked in the gate definition to stabilize the sigmoid input.
standard math: Sigmoid produces a scaling factor strictly between 0 and 1.
Ensures the gate can only attenuate the update, never amplify it or flip its sign.
