Unlocking Feature Learning in Gated Delta Networks at Scale

Quanquan Gu; Yifeng Liu

arxiv: 2606.04048 · v1 · pith:FJNQWYBTnew · submitted 2026-06-02 · 💻 cs.LG · cs.AI

Pith reviewed 2026-06-28 11:17 UTC · model grok-4.3

keywords Gated Delta Networks · scaling rules · hyperparameter transfer · coordinate-size estimates · recurrent state dynamics · language model pre-training · AdamW · SGD

The pith

Propagating coordinate-size estimates through gates and recurrence yields scaling rules that let Gated Delta Networks transfer learning rates stably across model widths.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper derives explicit scaling rules for Gated Delta Networks by tracking how coordinate magnitudes evolve through the forward computation, the gating operations, and the recurrent state updates. These rules produce model configurations that keep learning-rate transfer intact when width changes, under both AdamW and SGD. Experiments on language-model pre-training show the new parametrization succeeds where ordinary scaling produces instability or collapse. The central goal is to remove the need for per-width hyperparameter retuning when training these structured recurrent models at scale.
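As a rough illustration of what a coordinate-size estimate measures, the sketch below computes the root-mean-square entry size of y = W x for a square Gaussian matrix at several widths. It is a generic toy, not the paper's derivation: the helper names `coord_size` and `linear_output_size`, the square layer, and the two initializations being compared are all assumptions made for this example.

```python
import math
import random

def coord_size(vec):
    """Root-mean-square entry size, used here as the 'coordinate size' of a vector."""
    return math.sqrt(sum(x * x for x in vec) / len(vec))

def linear_output_size(width, weight_std, seed=0):
    """Coordinate size of y = W x for a width-by-width Gaussian W and a Gaussian input x."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]  # input with entries of size ~1
    # Each output entry is a dot product of x with a freshly sampled row of W.
    y = [sum(rng.gauss(0.0, weight_std) * xj for xj in x) for _ in range(width)]
    return coord_size(y)

for width in (32, 128, 512):
    fixed = linear_output_size(width, weight_std=1.0)             # width-independent std
    scaled = linear_output_size(width, weight_std=width ** -0.5)  # std = 1/sqrt(fan_in)
    print(f"width={width:4d}  fixed std: {fixed:6.2f}  1/sqrt(width) std: {scaled:5.2f}")
```

With a width-independent standard deviation the output entries grow roughly like the square root of the width, while the 1/sqrt(width) choice keeps them near 1. Tracking this kind of growth layer by layer is the bookkeeping the summary refers to.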

Core claim

By rigorously propagating coordinate-size estimates through the forward pass, gating mechanisms, and recurrent state dynamics, we derive the scaling rules for Gated Delta Network. Experiments on language-model pre-training confirm that our configurations enable stable learning-rate transfer across model widths under both AdamW and SGD, whereas standard parametrization fails to transfer.

What carries the argument

Coordinate-size estimates tracked through the full forward pass, gating functions, and recurrent state updates to obtain width-dependent scaling factors.
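Gates are one place where a coordinate-size mismatch becomes visible quickly. The sketch below assumes a generic sigmoid gate g = sigmoid(w · x); the page does not show how the paper parametrizes its gates, so `saturated_fraction`, the 0.01/0.99 saturation thresholds, and the two weight scales are illustrative assumptions only.

```python
import math
import random

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def saturated_fraction(width, weight_std, n_gates=200, seed=0):
    """Fraction of gates whose value falls below 0.01 or above 0.99."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    saturated = 0
    for _ in range(n_gates):
        # Pre-activation of one gate: dot product of x with a fresh Gaussian weight row.
        pre = sum(rng.gauss(0.0, weight_std) * xj for xj in x)
        g = sigmoid(pre)
        if g < 0.01 or g > 0.99:
            saturated += 1
    return saturated / n_gates

for width in (64, 256, 1024):
    print(width,
          saturated_fraction(width, weight_std=1.0),            # pre-activation grows with width
          saturated_fraction(width, weight_std=width ** -0.5))  # pre-activation stays near size 1
```

When the pre-activation grows with width, most gates sit at 0 or 1 and pass almost no gradient, which is one way a parametrization can stop learning features as the model widens.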

If this is right

A learning rate tuned at one width remains near-optimal at other widths under the new parametrization.
Both AdamW and SGD optimizers exhibit stable transfer when the coordinate-size rules are followed.
Standard parametrization produces width-dependent optimal learning rates and training instability.
The same propagation method supplies scaling factors for all weight matrices, gates, and state transitions.
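In μP-style setups, transfer usually means that one base learning rate is tuned at a small width and each parameter group then applies its own width-dependent multiplier. The sketch below shows that mechanism only. The function `transferred_lr`, the group names, and the exponents are hypothetical; the exponent 1 for hidden matrices is the commonly cited μP rule under Adam for standard networks, and the exponents the paper derives for Gated Delta Network gates and state transitions are not reproduced on this page.

```python
def transferred_lr(base_lr, base_width, width, exponent):
    """Rescale a learning rate tuned at base_width by (width / base_width) ** (-exponent)."""
    if base_width <= 0 or width <= 0:
        raise ValueError("widths must be positive")
    return base_lr * (width / base_width) ** (-exponent)

# Hypothetical per-group exponents, for illustration only.
EXAMPLE_EXPONENTS = {
    "hidden_matrix": 1.0,  # commonly cited muP rule for hidden matrices under Adam
    "vector_like": 0.0,    # placeholder: left unscaled in this sketch
}

for width in (256, 512, 1024):
    lrs = {name: transferred_lr(3e-3, 256, width, p)
           for name, p in EXAMPLE_EXPONENTS.items()}
    print(width, lrs)
```

Keeping the exponents in a table makes it easy to swap in a different rule per optimizer, since the summary states that both AdamW and SGD are covered.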

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same coordinate-size propagation technique could be applied to other gated recurrent or state-space architectures to obtain transfer rules.
If the rules hold, the hyperparameter search for a large Gated Delta Network can be reduced to a learning-rate sweep at a single small width.
Failure modes observed under standard parametrization may be re-interpreted as mismatches in coordinate growth rather than inherent architectural defects.

Load-bearing premise

Coordinate-size estimates can be carried through the entire forward pass, gating, and recurrent dynamics without missing interactions that would break the derived scaling rules.
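One interaction of this kind can be worked out by hand, assuming the gated delta rule as it is commonly written in the Gated DeltaNet literature. The page does not state the recurrence, so the symbols $S_t$, $\alpha_t$, $\beta_t$, $k_t$, $v_t$ and the shape convention are assumptions, and the paper's own argument may proceed differently.

```latex
% Gated delta-rule state update (notation assumed, not taken from this page)
S_t = \alpha_t\, S_{t-1}\left(I - \beta_t k_t k_t^{\top}\right) + \beta_t v_t k_t^{\top},
\qquad \alpha_t, \beta_t \in (0,1).

% The transition matrix scales the direction k_t and leaves its orthogonal complement unchanged:
\left(I - \beta_t k_t k_t^{\top}\right) k_t = \left(1 - \beta_t \|k_t\|_2^2\right) k_t .

% If k_t \in \mathbb{R}^{d} has \Theta(1) coordinates, then \|k_t\|_2^2 = \Theta(d), so for fixed \beta_t
\left|1 - \beta_t \|k_t\|_2^2\right| = \Theta(d),
% whereas \|k_t\|_2 = 1 gives the contraction factor 1 - \beta_t \in (0,1).
```

Under these assumptions, whether the state stays bounded over many steps depends on how the key norm scales with the key dimension $d$, which is exactly the kind of width dependence a coordinate-size analysis has to account for.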

What would settle it

Apply the derived scaling rules to Gated Delta Networks of increasing width. The claim fails if the optimal learning rate still shifts with width, or if training diverges at the transferred rate.
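A check of this kind could be scripted roughly as follows. The data layout (one dictionary of learning rate to final loss per width), the function names, and the choice to measure drift in powers of two are assumptions for this sketch; the numbers in the usage lines are invented placeholders, not results from the paper.

```python
import math

def best_lr(sweep):
    """Return the learning rate with the lowest finite loss; NaN or inf marks a diverged run."""
    finite = {lr: loss for lr, loss in sweep.items() if math.isfinite(loss)}
    if not finite:
        raise ValueError("every run in this sweep diverged")
    return min(finite, key=finite.get)

def lr_drift_in_octaves(sweeps_by_width):
    """Spread of log2(best learning rate) across widths; 0.0 means the optimum did not move."""
    logs = [math.log2(best_lr(sweep)) for sweep in sweeps_by_width.values()]
    return max(logs) - min(logs)

# Placeholder losses, invented purely to exercise the functions.
toy = {
    256:  {2**-10: 3.9, 2**-8: 3.6, 2**-6: 3.8},
    1024: {2**-10: 3.5, 2**-8: 3.3, 2**-6: float("nan")},
}
print(lr_drift_in_octaves(toy))
```

A drift that stays near zero as widths are added is consistent with transfer, while a drift that grows with the width range, or a sweep in which the transferred rate diverges, would count against it.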

Original abstract

Training and scaling Large Language Models demand enormous computational resources, motivating both efficient sub-quadratic architectures and principled hyperparameter tuning methods. While the Maximal Update Parametrization ($\mu$P) has enabled zero-shot hyperparameter transfer for standard Transformers, its extension to linear models, particularly those with structured state transitions and complicated architectures, remains largely unexplored. By rigorously propagating coordinate-size estimates through the forward pass, gating mechanisms, and recurrent state dynamics, we derive the scaling rules for Gated Delta Network. Experiments on language-model pre-training confirm that our configurations enable stable learning-rate transfer across model widths under both AdamW and SGD, whereas standard parametrization fails to transfer, validating the correctness and practical utility of our analysis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

They derive μP scaling rules for Gated Delta Networks via coordinate propagation and show the rules enable LR transfer in pretraining where standard parametrization does not.

The core result here is a set of scaling rules for Gated Delta Networks obtained by propagating coordinate-size estimates through the forward pass, gating, and recurrent dynamics. The experiments then demonstrate that these rules support stable learning-rate transfer across widths on language-model pretraining under both AdamW and SGD, while the usual parametrization does not.

What the paper does cleanly is close a stated gap: prior μP work covered standard Transformers but left linear models with structured state transitions largely untouched. The authors treat the propagation as a direct extension of existing coordinate estimates rather than introducing new fitted parameters, which keeps the circularity low. The choice of pretraining as the testbed is appropriate for the scaling claim.

The main limitation visible from the abstract is the absence of the actual derivation steps or experimental details such as model widths tested, exact transfer curves, or ablation on whether any gating interactions were approximated. Without those, it is hard to judge whether the propagation missed any recurrent feedback that would invalidate the rules at larger scales. The soundness rating in the report reflects this lack of visible evidence rather than an identified flaw in the approach itself.

This paper is aimed at people working on sub-quadratic architectures who already use or want to use μP-style transfer. A reader already familiar with the μP literature will get the most out of it; others may need the full equations to assess the novelty.

I would send it to peer review. The claim is narrow enough to be checked by referees who can verify the propagation and ask for replication details, and the experimental contrast with standard parametrization is a clear, falsifiable point.

Referee Report

1 major / 0 minor

Summary. The paper claims that by rigorously propagating coordinate-size estimates through the forward pass, gating mechanisms, and recurrent state dynamics of Gated Delta Networks, scaling rules can be derived that extend the Maximal Update Parametrization (μP). Experiments on language-model pre-training are said to confirm that these rules enable stable learning-rate transfer across model widths under both AdamW and SGD, whereas standard parametrization fails to transfer.

Significance. If the derivation and experiments hold, the work would be significant for enabling zero-shot hyperparameter transfer in non-Transformer architectures that include gating and structured recurrence, addressing a gap in scaling methods for efficient sub-quadratic models. The dual-optimizer validation (AdamW and SGD) and focus on practical utility for pre-training add value if the coordinate propagation is shown to be complete and the experiments are reproducible with clear controls.

major comments (1)

The abstract provides no derivation steps, explicit equations, or experimental details (e.g., model widths tested, exact propagation rules, or baseline comparisons), preventing verification of whether coordinate-size estimates propagate without missing interactions in the gating or recurrent dynamics as claimed in the central result.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review of our manuscript. We address the major comment below.

Referee: The abstract provides no derivation steps, explicit equations, or experimental details (e.g., model widths tested, exact propagation rules, or baseline comparisons), preventing verification of whether coordinate-size estimates propagate without missing interactions in the gating or recurrent dynamics as claimed in the central result.

Authors: We acknowledge that the abstract is high-level and omits explicit derivation steps, equations, and experimental details such as model widths, propagation rules, and baseline comparisons. This can limit immediate verification of the completeness of coordinate-size propagation through gating and recurrent dynamics. The full manuscript contains the detailed analysis and experiments. To address the concern directly, we will revise the abstract to include a concise reference to the key scaling rules derived and the experimental setup (model widths and optimizers).

Revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper's central claim is a derivation of scaling rules for Gated Delta Networks obtained by propagating coordinate-size estimates through the forward pass, gating mechanisms, and recurrent state dynamics, extending the existing μP framework. This propagation is presented as a first-principles calculation rather than a fit to data or a renaming of known results. No equations or steps in the provided abstract or description reduce the output scaling rules to the inputs by construction, nor do they rely on load-bearing self-citations whose validity depends on the present work. Experiments serve as external validation rather than the source of the claimed rules. The derivation is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on the prior validity of μP for transformers and the assumption that coordinate-size propagation applies without additional fitted constants specific to this architecture.

axioms (1)

Domain assumption: Maximal Update Parametrization (μP) enables zero-shot hyperparameter transfer for standard Transformers.
The paper builds directly on μP as the foundation for the new scaling rules.

pith-pipeline@v0.9.1-grok · 5639 in / 1084 out tokens · 25896 ms · 2026-06-28T11:17:08.274611+00:00 · methodology

