Unlocking Feature Learning in Gated Delta Networks at Scale
Pith reviewed 2026-06-28 11:17 UTC · model grok-4.3
The pith
Propagating coordinate-size estimates through gates and recurrence yields scaling rules that let Gated Delta Networks transfer learning rates stably across model widths.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By rigorously propagating coordinate-size estimates through the forward pass, gating mechanisms, and recurrent state dynamics, we derive the scaling rules for Gated Delta Networks. Experiments on language-model pre-training confirm that our configurations enable stable learning-rate transfer across model widths under both AdamW and SGD, whereas standard parametrization fails to transfer.
What carries the argument
Coordinate-size estimates tracked through the full forward pass, gating functions, and recurrent state updates to obtain width-dependent scaling factors.
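To make "coordinate size" concrete, here is a minimal sketch for a single dense layer, not the paper's derivation. It measures the root-mean-square size of the coordinates of W x at several widths under two initializations. The function names, widths, and both initialization scales are illustrative assumptions; the gates and recurrent state of a Gated Delta Network are not modelled.

```python
import math
import random


def rms(vec):
    """Root-mean-square coordinate size of a vector."""
    return math.sqrt(sum(v * v for v in vec) / len(vec))


def matvec_rms(width, init_std, seed=0):
    """RMS coordinate size of W @ x for a width-by-width Gaussian W.

    x has coordinates of order 1; W has i.i.d. entries with std init_std.
    Both choices are illustrative assumptions, not the paper's setup.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    out = []
    for _ in range(width):
        # One output coordinate: a sum of `width` independent terms.
        out.append(sum(rng.gauss(0.0, init_std) * xi for xi in x))
    return rms(out)


widths = [128, 256, 512]
# Width-independent init: output coordinates grow roughly like sqrt(width).
unit_std = {n: matvec_rms(n, 1.0) for n in widths}
# Fan-in-scaled init: output coordinates stay of order 1 at every width.
fan_in_std = {n: matvec_rms(n, 1.0 / math.sqrt(n)) for n in widths}

for n in widths:
    print(n, round(unit_std[n], 2), round(fan_in_std[n], 2))
```

The point of the sketch is only the bookkeeping: an estimate of this kind has to be repeated for every intermediate quantity, which is what the paper claims to do for the gates and the recurrent state as well.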
If this is right
- Learning rates tuned at one width remain near-optimal at other widths under the new parametrization.
- Both AdamW and SGD optimizers exhibit stable transfer when the coordinate-size rules are followed.
- Standard parametrization produces width-dependent optimal learning rates and training instability.
- The same propagation method supplies scaling factors for all weight matrices, gates, and state transitions.
Where Pith is reading between the lines
- The same coordinate-size propagation technique could be applied to other gated recurrent or state-space architectures to obtain transfer rules.
- If the rules hold, the hyperparameter search for large Gated Delta Networks can be confined to a sweep at a single small width.
- Failure modes observed under standard parametrization may be re-interpreted as mismatches in coordinate growth rather than inherent architectural defects.
Load-bearing premise
Coordinate-size estimates can be carried through the entire forward pass, gating, and recurrent dynamics without missing interactions that would break the derived scaling rules.
What would settle it
Apply the derived scaling rules to Gated Delta Networks of increasing width. The claim fails if the optimal learning rate still shifts with width or if training diverges at the transferred rate.
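A check of this kind can be sketched as a small analysis over a learning-rate sweep. The code below assumes a hypothetical results layout, `{width: {learning_rate: final_loss}}`, finds the best learning rate per width, and reports how far that optimum moves in powers of two. The two toy sweeps are made-up numbers for illustration only; they are not results from the paper.

```python
import math


def optimal_lrs(sweep):
    """Best learning rate per width: the one with the lowest final loss."""
    return {width: min(losses, key=losses.get) for width, losses in sweep.items()}


def lr_shift_octaves(sweep):
    """Spread of the per-width optimal learning rates, in powers of two."""
    logs = [math.log2(lr) for lr in optimal_lrs(sweep).values()]
    return max(logs) - min(logs)


# Made-up illustrative sweeps, NOT data from the paper.
toy_transfer = {
    256: {1e-3: 3.9, 2e-3: 3.7, 4e-3: 3.8},
    1024: {1e-3: 3.5, 2e-3: 3.3, 4e-3: 3.4},
}
toy_drift = {
    256: {1e-3: 3.9, 2e-3: 3.8, 4e-3: 3.7},
    1024: {1e-3: 3.3, 2e-3: 3.4, 4e-3: 3.6},
}

print(lr_shift_octaves(toy_transfer))  # optimum at the same rate for both widths
print(lr_shift_octaves(toy_drift))     # optimum moves as width grows
```

A shift near zero across widths is consistent with transfer; a shift that grows with width counts against it. Divergence at the transferred rate would have to be checked separately, for example by flagging non-finite losses.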
Original abstract
Training and scaling Large Language Models demand enormous computational resources, motivating both efficient sub-quadratic architectures and principled hyperparameter tuning methods. While the Maximal Update Parametrization ($\mu$P) has enabled zero-shot hyperparameter transfer for standard Transformers, its extension to linear models, particularly those with structured state transitions and complicated architectures, remains largely unexplored. By rigorously propagating coordinate-size estimates through the forward pass, gating mechanisms, and recurrent state dynamics, we derive the scaling rules for Gated Delta Network. Experiments on language-model pre-training confirm that our configurations enable stable learning-rate transfer across model widths under both AdamW and SGD, whereas standard parametrization fails to transfer, validating the correctness and practical utility of our analysis.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that by rigorously propagating coordinate-size estimates through the forward pass, gating mechanisms, and recurrent state dynamics of Gated Delta Networks, scaling rules can be derived that extend the Maximal Update Parametrization (μP). Experiments on language-model pre-training are said to confirm that these rules enable stable learning-rate transfer across model widths under both AdamW and SGD, whereas standard parametrization fails to transfer.
Significance. If the derivation and experiments hold, the work would be significant for enabling zero-shot hyperparameter transfer in non-Transformer architectures that include gating and structured recurrence, addressing a gap in scaling methods for efficient sub-quadratic models. The dual-optimizer validation (AdamW and SGD) and focus on practical utility for pre-training add value if the coordinate propagation is shown to be complete and the experiments are reproducible with clear controls.
major comments (1)
- The abstract provides no derivation steps, explicit equations, or experimental details (e.g., model widths tested, exact propagation rules, or baseline comparisons), preventing verification of whether coordinate-size estimates propagate without missing interactions in the gating or recurrent dynamics as claimed in the central result.
Simulated Author's Rebuttal
We thank the referee for their review of our manuscript. We address the major comment below.
Point-by-point responses
- Referee: The abstract provides no derivation steps, explicit equations, or experimental details (e.g., model widths tested, exact propagation rules, or baseline comparisons), preventing verification of whether coordinate-size estimates propagate without missing interactions in the gating or recurrent dynamics as claimed in the central result.
  Authors: We acknowledge that the abstract is high-level and omits explicit derivation steps, equations, and experimental details such as model widths, propagation rules, and baseline comparisons. This can limit immediate verification of the completeness of coordinate-size propagation through gating and recurrent dynamics. The full manuscript contains the detailed analysis and experiments. To address the concern directly, we will revise the abstract to include a concise reference to the key scaling rules derived and the experimental setup (model widths and optimizers).
  Revision: yes
Circularity Check
No significant circularity in derivation chain
The paper's central claim is a derivation of scaling rules for Gated Delta Networks obtained by propagating coordinate-size estimates through the forward pass, gating mechanisms, and recurrent state dynamics, extending the existing μP framework. This propagation is presented as a first-principles calculation rather than a fit to data or a renaming of known results. No equations or steps in the provided abstract or description reduce the output scaling rules to the inputs by construction, nor do they rely on load-bearing self-citations whose validity depends on the present work. Experiments serve as external validation rather than the source of the claimed rules. The derivation is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Maximal Update Parametrization enables zero-shot hyperparameter transfer for standard Transformers.