Recognition: 2 theorem links
· Lean Theorem: Spectral Condition for μP under Width-Depth Scaling
Pith reviewed 2026-05-15 17:59 UTC · model grok-4.3
The pith
A spectral framework for maximal update parameterization shows that the scaling rules derived for residual blocks with k ≥ 2 transformations stabilize feature learning in residual networks under joint width and depth growth.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For residual networks whose blocks contain k transformations, the spectral conditions on weight norms and per-step updates yield a μP formulation that, when k ≥ 2, produces stable feature learning and robust hyperparameter transfer under simultaneous width-depth scaling; the k = 1 formulation and standard parameterization do not.
What carries the argument
A spectral framework that converts constraints on the norms of weights and their per-step updates into explicit width and depth scaling rules for residual blocks containing k transformations.
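To make the object of these conditions concrete, here is a minimal sketch assuming, for illustration only, the two-transformation (k = 2) block form suggested by the Condition 3.1 excerpt quoted further below; the general case replaces the pair of weight norms with the product over all k transformations in the branch:

$$h_{l+1} = h_l + \alpha_l\, W_l^{(2)}\,\phi\bigl(W_l^{(1)} h_l\bigr), \qquad \alpha_l\,\|W_l^{(2)}\|_R\,\|W_l^{(1)}\|_R = \Theta(1/L),$$

so each residual branch contributes on the order of $1/L$ to the forward signal and the sum over $L$ blocks stays bounded as depth grows.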
If this is right
- The k ≥ 2 scaling rules recover existing μP results for a broad class of optimizers and extend them to additional ones.
- Practical architectures such as Transformers, whose blocks contain multiple transformations, align with the stable regime identified by the framework.
- Hyperparameter transfer from small to large models becomes reliable once the spectral conditions for k ≥ 2 are followed.
- Standard parameterization and the k = 1 formulation lose stability once width and depth are increased together.
Where Pith is reading between the lines
- The same spectral lens could be applied to non-residual architectures to test whether a comparable transition appears.
- If the framework holds, it predicts that attention-based models will continue to benefit from the k ≥ 2 rules even at extreme scales.
- The approach supplies a way to derive μP variants for new optimizers without re-deriving the entire theory from scratch.
Load-bearing premise
The mapping from weight and update norms to stable feature learning holds across joint width-depth regimes without requiring further justification of the spectral-radius conditions inside the residual blocks.
What would settle it
Training a GPT-2-style model at two different widths and depths using the k ≥ 2 μP rules and checking whether the same learning-rate schedule still produces stable loss curves; if the k = 1 rules transfer equally well, the claimed distinction collapses.
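A hypothetical sketch of how such a transfer check is usually wired up. Every specific exponent below (the 1/width Adam-style learning-rate scaling, the 1/√width init scaling, the 1/depth residual multiplier) is an illustrative assumption standing in for the paper's Condition 3.1 recipe, not the paper's actual parameterization.

# Sketch of a width-depth HP-transfer check; scaling exponents are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class ScaledHPs:
    hidden_lr: float       # learning rate for hidden (width x width) matrices
    init_std: float        # init scale for hidden matrices
    residual_alpha: float  # multiplier on each residual branch

def scale_hps(base_lr: float, base_std: float, base_width: int,
              width: int, depth: int) -> ScaledHPs:
    """Map base hyperparameters tuned on a small proxy to a larger width/depth.

    The exponents here are stand-ins for the paper's spectral rules: the point of
    the check is that base_lr and base_std stay fixed while only these deterministic
    rescalings change with model size.
    """
    m = width / base_width
    return ScaledHPs(
        hidden_lr=base_lr / m,          # assumed Adam-style muP width scaling
        init_std=base_std / m ** 0.5,   # keep forward activations O(1) in width
        residual_alpha=1.0 / depth,     # assumed Theta(1/L) branch scaling (k >= 2 case)
    )

if __name__ == "__main__":
    # Sweep base_lr on the small proxy, pick the best value, then reuse it unchanged
    # at larger (width, depth). If loss curves stay stable and the optimum does not
    # drift, HP transfer holds; repeating the sweep with k = 1 rules tests whether
    # the claimed distinction survives.
    for width, depth in [(256, 6), (1024, 24)]:
        hps = scale_hps(base_lr=3e-3, base_std=0.02, base_width=256,
                        width=width, depth=depth)
        print(width, depth, hps)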
read the original abstract
Generative foundation models are increasingly scaled in both width and depth, posing significant challenges for stable feature learning and reliable hyperparameter (HP) transfer across model sizes. While maximal update parameterization ($\mu$P) has provided a principled solution to both problems for width scaling, existing extensions to the joint width-depth scaling regime remain fragmented, architecture- and optimizer-specific, and often rely on technically involved theories. In this work, we develop a simple and unified spectral framework for $\mu$P under joint width-depth scaling. For deep residual networks whose residual blocks contain $k$ transformations, the framework specifies how the norms of weights and their per-step updates should scale with width and depth. It reveals a fundamental transition from $k=1$ to $k\geq 2$, unifying previously disparate $\mu$P formulations and identifying the $k\geq 2$ case as more appropriate for practical architectures with multi-transformation branches such as Transformers. Building on this framework, we derive a general recipe for implementing $\mu$P across a broad class of optimizers by mapping spectral constraints to concrete HP parameterizations, recovering existing results and extending them to additional optimizers. Finally, experiments on GPT-2 style language models show that the $\mu$P formulation derived from the $k\geq 2$ case achieves stable feature learning and robust HP transfer under width-depth scaling, whereas standard parameterization and $\mu$P in the $k=1$ case often fail to do so. These results support the practical effectiveness of the proposed spectral framework.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript develops a spectral framework for maximal update parameterization (μP) under joint width-depth scaling in deep residual networks. For residual blocks containing k transformations, it derives scaling rules for the norms of weights and per-step updates, identifies a transition from the k=1 to k≥2 regime, unifies prior μP formulations, and provides a general recipe mapping spectral constraints to hyperparameter choices across optimizers. GPT-2 experiments are presented to show that the k≥2 formulation yields stable feature learning and robust HP transfer, while standard parameterization and the k=1 case do not.
Significance. If the spectral conditions correctly capture feature-learning dynamics in multi-branch architectures, the framework supplies a simple, architecture-aware route to μP that extends beyond width-only scaling and recovers existing results while covering additional optimizers. The GPT-2 validation, if the residual-block modeling holds, would constitute concrete evidence that the k≥2 rules improve stability and transfer under simultaneous width-depth growth.
major comments (2)
- [Spectral framework derivation (around the k-transition)] The central claim that the k≥2 spectral rules are the appropriate choice for Transformers rests on the assumption that the general residual block with k transformations accurately reproduces the effective Jacobian spectral radius of a self-attention + FFN block. The manuscript does not supply an explicit linearization or eigenvalue-multiplication argument showing how the depth-multiplication factor for k≥2 maps onto the attention-MLP composition; without this step the experimental interpretation that the observed stability is due to the derived parameterization rather than other implementation details remains open.
- [Experimental section on GPT-2 models] In the GPT-2 experiments, the paper reports that the k≥2 μP achieves stable feature learning and HP transfer while k=1 and standard parameterization fail. To make this comparison load-bearing, the manuscript should include a direct check that the implemented residual structure matches the k≥2 model (e.g., by measuring the empirical spectral radius of the block Jacobian or by ablating the number of transformations inside each residual unit).
minor comments (1)
- [Notation and definitions] Notation for the per-step update norm scaling should be introduced once and used consistently; the current presentation mixes “update norm” and “ΔW norm” without a single defining equation.
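A single defining line would resolve this, for example (a suggested notation, not the manuscript's): $\Delta W_l^{(t)} := W_l^{(t+1)} - W_l^{(t)}$, with $\|\Delta W_l^{(t)}\|_R$ measured in the same norm $\|\cdot\|_R$ used for the weights, so that "update norm" and "ΔW norm" name the same quantity.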
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We agree that additional derivations and empirical checks will strengthen the connection between the theoretical framework and the Transformer experiments, and we will incorporate these in the revised manuscript.
read point-by-point responses
- Referee: [Spectral framework derivation (around the k-transition)] The central claim that the k≥2 spectral rules are the appropriate choice for Transformers rests on the assumption that the general residual block with k transformations accurately reproduces the effective Jacobian spectral radius of a self-attention + FFN block. The manuscript does not supply an explicit linearization or eigenvalue-multiplication argument showing how the depth-multiplication factor for k≥2 maps onto the attention-MLP composition; without this step the experimental interpretation that the observed stability is due to the derived parameterization rather than other implementation details remains open.
Authors: We agree that an explicit linearization argument would make the mapping more rigorous. In the revision we will add a dedicated subsection deriving the effective Jacobian spectral radius for a residual block composed of self-attention followed by an FFN. Under the standard assumption that the individual Jacobians have spectral radii controlled by the width scaling, their product yields the depth-multiplication factor matching the k=2 case, thereby justifying the choice of the k≥2 rules for Transformers and clarifying that the observed stability arises from the parameterization. revision: yes
- Referee: [Experimental section on GPT-2 models] In the GPT-2 experiments, the paper reports that the k≥2 μP achieves stable feature learning and HP transfer while k=1 and standard parameterization fail. To make this comparison load-bearing, the manuscript should include a direct check that the implemented residual structure matches the k≥2 model (e.g., by measuring the empirical spectral radius of the block Jacobian or by ablating the number of transformations inside each residual unit).
Authors: We accept that a direct verification is needed to make the comparison conclusive. In the revised experimental section we will report measurements of the empirical spectral radius of the residual-block Jacobians computed on the trained GPT-2 models, confirming that the implemented architecture aligns with the k≥2 regime. We will also add an ablation that varies the number of transformations per residual unit and shows the corresponding change in stability and transfer behavior. revision: yes
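To make the promised measurement concrete, here is a hedged sketch of estimating the spectral norm of a residual branch's input-output Jacobian by power iteration with JVP/VJP products. The toy two-transformation branch, the width, and the 1/depth multiplier are illustrative assumptions, not the GPT-2 blocks or the authors' exact procedure.

# Sketch: empirical spectral norm of a residual-branch Jacobian via power iteration.
import torch

torch.manual_seed(0)
width = 64
alpha = 1.0 / 12  # hypothetical 1/depth branch multiplier (k >= 2-style choice)

W1 = torch.randn(width, width) / width ** 0.5
W2 = torch.randn(width, width) / width ** 0.5

def branch(x: torch.Tensor) -> torch.Tensor:
    """Toy k = 2 residual branch: alpha * W2 @ phi(W1 @ x)."""
    return alpha * (torch.tanh(x @ W1.T) @ W2.T)

def jacobian_spectral_norm(f, x: torch.Tensor, iters: int = 50) -> float:
    """Power iteration for the largest singular value of the Jacobian of f at x."""
    v = torch.randn_like(x)
    v = v / v.norm()
    for _ in range(iters):
        _, jv = torch.autograd.functional.jvp(f, x, v)      # J v
        _, jtjv = torch.autograd.functional.vjp(f, x, jv)   # J^T (J v)
        v = jtjv / (jtjv.norm() + 1e-12)
    _, jv = torch.autograd.functional.jvp(f, x, v)
    return jv.norm().item()

x = torch.randn(width)
print("estimated ||J_branch||_2 ~", jacobian_spectral_norm(branch, x))
# The full residual block is h + branch(h), so its Jacobian is I + J_branch; keeping
# ||J_branch|| on the order of 1/L is what the quoted spectral condition asks of
# every block in a depth-L stack.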
Circularity Check
No significant circularity detected
full rationale
The paper derives its spectral framework for μP under width-depth scaling from first-principles analysis of residual blocks with k transformations, specifying weight-norm and update scaling rules that unify prior μP variants. No equations or steps in the abstract reduce predictions to fitted inputs by construction, nor do they rely on self-citations for load-bearing uniqueness claims. The k≥2 transition and resulting HP recipes are presented as outputs of the spectral radius conditions rather than inputs, and the GPT-2 experiments serve as external validation rather than tautological confirmation. The derivation chain remains self-contained against the stated assumptions.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tagged unclear: relation between the paper passage and the cited Recognition theorem.
  Paper passage: "Condition 3.1 (Spectral condition for μP under joint width-depth scaling) … $\alpha_l\,\|W_l^{(2)}\|_R\,\|W_l^{(1)}\|_R = \Theta(1/L)$"
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · tagged unclear: relation between the paper passage and the cited Recognition theorem.
  Paper passage: "For residual blocks of depth k … product of the α_l and the norms of the k hidden weights to scale as Θ(1/L)"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Lei Jimmy Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. CoRR, abs/1607.06450, 2016.
- [2] Laura Balzano, Tianjiao Ding, Benjamin D. Haeffele, Soo Min Kwon, Qing Qu, Peng Wang, Zhangyang Wang, and Can Yaras. An overview of low-rank structures in the training and adaptation of large models. CoRR, abs/2503.19859, 2025.
- [3] Charlie Blake, Constantin Eichenberg, Josef Dean, Lukas Balles, Luke Yuri Prince, Björn Deiseroth, Andrés Felipe Cruz-Salinas, Carlo Luschi, Samuel Weinbach, and Douglas Orr. u-μP: The unit-scaled maximal update parametrization. In ICLR, 2025.
- [4] Blake Bordelon and Cengiz Pehlevan. Self-consistent dynamical field theory of kernel evolution in wide neural networks. In NeurIPS, 2022.
- [5] Blake Bordelon, Hamza Tahir Chaudhry, and Cengiz Pehlevan. Infinite limits of multi-head transformer dynamics. In NeurIPS, 2024.
- [6] Blake Bordelon, Lorenzo Noci, Mufan Bill Li, Boris Hanin, and Cengiz Pehlevan. Depthwise hyperparameter transfer in residual networks: Dynamics and scaling limit. In ICLR, 2024.
- [7] Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, and Quoc V. Le. Symbolic discovery of optimization algorithms. In NeurIPS, 2023.
- [8] Nolan Dey, Gurpreet Gosal, Zhiming Chen, Hemant Khachane, William Marshall, Ribhu Pathria, Marvin Tom, and Joel Hestness. Cerebras-GPT: Open compute-optimal language models trained on the Cerebras wafer-scale cluster. CoRR, abs/2304.03208, 2023.
- [9] Nolan Dey, Shane Bergsma, and Joel Hestness. Sparse maximal update parameterization: A holistic approach to sparse training dynamics. In NeurIPS, 2024.
- [10] Nolan Dey, Bin Claire Zhang, Lorenzo Noci, Mufan Bill Li, Blake Bordelon, Shane Bergsma, Cengiz Pehlevan, Boris Hanin, and Joel Hestness. Don't be lazy: CompleteP enables compute-efficient deep transformers. CoRR, abs/2505.01618, 2025.
- [11] Aaron Gokaslan and Vanya Cohen. OpenWebText corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019.
- [12] Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor optimization. In International Conference on Machine Learning, pages 1842–1850. PMLR, 2018.
- [13] Moritz Haas, Jin Xu, Volkan Cevher, and Leena Chennuru Vankadara. μP²: Effective sharpness aware minimization requires layerwise perturbation scaling. In NeurIPS, 2024.
- [14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, pages 1026–1034, 2015.
- [15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
- [16] Alex Henry, Prudhvi Raj Dachapally, Shubham Shantaram Pawar, and Yuxuan Chen. Query-key normalization for transformers. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4246–4253, 2020.
- [17] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. CoRR, abs/2203.15556, 2022.
- [18] Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, et al. MiniCPM: Unveiling the potential of small language models with scalable training strategies. arXiv preprint arXiv:2404.06395, 2024.
- [19] Satoki Ishikawa and Ryo Karakida. On the parameterization of second-order optimization effective towards the infinite width. In ICLR, 2024.
- [20] Arthur Jacot, Clément Hongler, and Franck Gabriel. Neural tangent kernel: Convergence and generalization in neural networks. In NeurIPS, pages 8580–8589, 2018.
- [21] Keller Jordan, Yuchen Jin, Vlado Boza, You Jiacheng, Franz Cecista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks. https://kellerjordan.github.io/posts/muon, 2024.
- [22] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. CoRR, abs/2001.08361, 2020.
- [23] Andrej Karpathy. nanoGPT. https://github.com/karpathy/nanoGPT, 2022.
- [24] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024.
- [25] Hong Liu, Zhiyuan Li, David Leo Wright Hall, Percy Liang, and Tengyu Ma. Sophia: A scalable stochastic second-order optimizer for language model pre-training. In ICLR, 2024.
- [26] Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, et al. Muon is scalable for LLM training. arXiv preprint arXiv:2502.16982, 2025.
- [27] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.
- [28] Yurii Nesterov. A method for solving the convex programming problem with convergence rate O(1/k²). In Dokl. Akad. Nauk SSSR, volume 269, page 543, 1983.
- [29] Marieme Ngom, Sam Foreman, Venkatram Vishwanath, et al. Extending μP: Spectral conditions for feature learning across optimizers. In OPT 2025: Optimization for Machine Learning, 2025.
- [30] Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models. CoRR, abs/2502.09992, 2025.
- [31] Shikai Qiu, Zixi Chen, Hoang Phan, Qi Lei, and Andrew Gordon Wilson. Hyperparameter transfer enables consistent gains of matrix-preconditioned optimizers across scales. arXiv preprint arXiv:2512.05620, 2025.
- [32] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
- [33] Samuel S. Schoenholz, Justin Gilmer, Surya Ganguli, and Jascha Sohl-Dickstein. Deep information propagation. In ICLR, 2017.
- [34] Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267, 2025.
- [35] Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, et al. Kimi K2: Open agentic intelligence. arXiv preprint arXiv:2507.20534, 2025.
- [36] Leena Chennuru Vankadara, Jin Xu, Moritz Haas, and Volkan Cevher. On feature learning in structured state space models. In NeurIPS, 2024.
- [37] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, pages 5998–6008, 2017.
- [38] Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018.
- [39] Nikhil Vyas, Depen Morwani, Rosie Zhao, Mujin Kwun, Itai Shapira, David Brandfonbrener, Lucas Janson, and Sham Kakade. SOAP: Improving and stabilizing Shampoo using Adam. arXiv preprint arXiv:2409.11321, 2024.
- [40] Tian Xie, Haoming Luo, Haoyu Tang, Yiwen Hu, Jason Klein Liu, Qingnan Ren, Yang Wang, Wayne Xin Zhao, Rui Yan, Bing Su, et al. Controlled LLM training on spectral sphere. arXiv preprint arXiv:2601.08393, 2026.
- [41] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
- [42] Greg Yang. Tensor Programs III: Neural matrix laws. CoRR, abs/2009.10685, 2020.
- [43] Greg Yang and Edward J. Hu. Tensor Programs IV: Feature learning in infinite-width neural networks. In ICML, volume 139, pages 11727–11737. PMLR, 2021.
- [44] Greg Yang and Etai Littwin. Tensor Programs IVb: Adaptive optimization in the infinite-width limit. CoRR, abs/2308.01814, 2023.
- [45] Greg Yang, Edward J. Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. Tensor Programs V: Tuning large neural networks via zero-shot hyperparameter transfer. CoRR, abs/2203.03466, 2022.
- [46] Greg Yang, James B. Simon, and Jeremy Bernstein. A spectral condition for feature learning. CoRR, abs/2310.17813, 2023.
- [47] Greg Yang, Dingli Yu, Chen Zhu, and Soufiane Hayou. Tensor Programs VI: Feature learning in infinite depth neural networks. In ICLR, 2024.
- [48] Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, and Yuandong Tian. GaLore: Memory-efficient LLM training by gradient low-rank projection. In ICML, 2024.
- [49] Chenyu Zheng, Xinyu Zhang, Rongzhen Wang, Wei Huang, Zhi Tian, Weilin Huang, Jun Zhu, and Chongxuan Li. Scaling diffusion transformers efficiently via μP. CoRR, abs/2505.15270, 2025.