Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate
Pith reviewed 2026-05-21 04:58 UTC · model grok-4.3
The pith
The primary reason maximal update parameterization offers better hyperparameter transfer than standard parameterization is its higher learning rate on the embedding layer.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We find that the overwhelming benefit of μP relative to SP when training with AdamW arises simply from maximizing the learning rate of the embedding layer. In SP, the embedding layer learning rate acts as a bottleneck that induces training instabilities; increasing it by a factor of width to match μP dramatically smooths out training while improving hyperparameter transfer.
What carries the argument
The embedding-layer learning rate, which serves as a training bottleneck in standard parameterization but is scaled with width in maximal update parameterization.
If this is right
- Raising the embedding layer learning rate by a factor of model width in standard parameterization eliminates training instabilities and matches the hyperparameter transfer of μP.
- Weight decay improves the fit quality of scaling laws for optimal hyperparameters.
- Weight decay reduces the robustness of hyperparameter extrapolation in the fixed token-per-parameter regime.
- The three proposed metrics enable quantitative comparison of how well different parameterizations support scale-invariant training.
Where Pith is reading between the lines
- Targeted learning rate adjustments for specific layers like embeddings could be a simpler alternative to adopting full μP for improving transfer in practice.
- The results may apply to other model components or optimizers where certain layers act as scale-dependent bottlenecks.
- This work highlights the value of layer-wise analysis in understanding parameterization effects at large scales.
Load-bearing premise
The comprehensive ablations successfully pinpoint the embedding layer learning rate as the dominant factor distinguishing μP and SP without interference from other training elements or choices.
What would settle it
A direct test would be to train models of varying widths using standard parameterization but with the embedding layer learning rate multiplied by the width, and verify whether training becomes stable and hyperparameter transfer quality approaches that of μP.
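The modified-SP condition in this test can be sketched as a per-layer learning-rate table. This is a minimal illustration, not the authors' code: the function name `layer_learning_rates`, the group names (taken from the layer types named in the referee report), and the example `base_lr` are all assumptions. The page says only that the embedding learning rate is increased "by a factor of width", so the sketch multiplies by the raw width; an implementation might instead use a ratio to some base width.

```python
def layer_learning_rates(base_lr, width, scale_embedding=False):
    """Return a learning rate per layer group for a standard-parameterization model.

    With scale_embedding=False every group shares base_lr (plain SP).
    With scale_embedding=True only the embedding group is multiplied by width;
    all other groups keep the SP value, which is the isolation the ablation needs.
    Group names and the raw-width multiplier are illustrative assumptions.
    """
    rates = {
        "embedding": base_lr,
        "attention": base_lr,
        "mlp": base_lr,
        "output": base_lr,
    }
    if scale_embedding:
        rates["embedding"] = base_lr * width
    return rates


if __name__ == "__main__":
    # Hypothetical sweep: compare plain SP with the embedding-only modification.
    for width in (256, 512, 1024):
        plain = layer_learning_rates(1e-3, width)
        modified = layer_learning_rates(1e-3, width, scale_embedding=True)
        print(width, plain["embedding"], modified["embedding"], modified["mlp"])
```

In a real training script these values would typically be passed to the optimizer as separate parameter groups, one per layer type, so that the non-embedding groups are visibly untouched.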
Original abstract
Hyperparameter transfer allows extrapolating optimal optimization hyperparameters from small to large scales, making it critical for training large language models (LLMs). This is done either by fitting a scaling law to the hyperparameters or by a judicious choice of parameterization, such as Maximal Update (μP), that renders optimal hyperparameters approximately scale invariant. In this paper, we first develop a framework to quantify hyperparameter transfer through three metrics: (1) the quality of the scaling law fit, (2) the robustness to extrapolation errors, and (3) the asymptotic loss penalty due to choice of parameterization. Next, we investigate through a comprehensive series of ablations why μP appears to offer high-quality learning rate transfer relative to standard parameterization (SP), as existing theory is inadequate. We find that the overwhelming benefit of μP relative to SP when training with AdamW arises simply from maximizing the learning rate of the embedding layer. In SP, the embedding layer learning rate acts as a bottleneck that induces training instabilities; increasing it by a factor of width to match μP dramatically smooths out training while improving hyperparameter transfer. We also find that weight decay improves the scaling law fits, while, in the fixed token-per-parameter setting, it hurts the robustness of the extrapolation.
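The scaling-law route to hyperparameter transfer mentioned in the abstract can be illustrated with a small power-law fit. This is a sketch under stated assumptions, not the paper's procedure: it assumes the optimal learning rate follows lr = a * width^b, fits that form by least squares in log-log space, and uses the root-mean-square log residual as a stand-in for metric (1), whose actual definition is not given on this page. The function names and the synthetic data are hypothetical.

```python
import math


def fit_power_law(widths, optimal_lrs):
    """Fit lr = a * width**b by ordinary least squares on log-transformed data.

    Returns (a, b, rmse), where rmse is the root-mean-square residual in
    log space. Using rmse as the fit-quality score is an assumption made
    for illustration only.
    """
    xs = [math.log(w) for w in widths]
    ys = [math.log(lr) for lr in optimal_lrs]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    b = sxy / sxx
    log_a = y_mean - b * x_mean
    residuals = [y - (log_a + b * x) for x, y in zip(xs, ys)]
    rmse = math.sqrt(sum(r * r for r in residuals) / n)
    return math.exp(log_a), b, rmse


def extrapolate(a, b, width):
    """Predict the optimal learning rate at a width outside the fitted range."""
    return a * width ** b


if __name__ == "__main__":
    # Synthetic data generated from lr = 0.5 / width, not measured results.
    widths = [256, 512, 1024, 2048]
    lrs = [0.5 / w for w in widths]
    a, b, rmse = fit_power_law(widths, lrs)
    print(a, b, rmse, extrapolate(a, b, 4096))
```

A parameterization with perfect transfer would yield an exponent b close to zero, so the fitted exponent itself is a quick diagnostic of how scale invariant the optimum is.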
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript develops a framework with three metrics to quantify hyperparameter transfer: scaling-law fit quality, robustness to extrapolation errors, and asymptotic loss penalty from parameterization choice. It then uses a series of ablations to investigate why Maximal Update Parameterization (μP) yields better learning-rate transfer than Standard Parameterization (SP) under AdamW. The central empirical claim is that the dominant advantage of μP arises simply from scaling the embedding-layer learning rate by model width; in SP this layer acts as a bottleneck inducing instabilities, and correcting only the embedding LR (multiplying by width) largely recovers μP’s stability and transfer benefits. Weight decay is reported to improve scaling-law fits while reducing extrapolation robustness in the fixed tokens-per-parameter regime.
Significance. If the isolation of the embedding-layer effect holds, the result is significant for LLM training practice: it suggests that a single, simple modification to SP can approximate most of μP’s hyperparameter-transfer gains without adopting the full reparameterization. The three-metric framework itself supplies a reproducible, falsifiable way to compare transfer across parameterizations. The work is strengthened by its comprehensive ablations and explicit focus on AdamW training dynamics, which are directly relevant to current large-model pipelines.
major comments (2)
- [§4] §4 (ablations isolating embedding LR): the central claim that 'the overwhelming benefit ... arises simply from maximizing the learning rate of the embedding layer' requires explicit verification that the modified-SP condition applies only the width-scaled embedding LR while leaving all other layer-specific scalings (attention, MLP, output) identical to SP. If any μP-style rules for non-embedding layers are inadvertently active, the attribution to embedding LR alone is confounded and the 'simply from' conclusion is not load-bearing.
- [results on weight decay] Fixed tokens-per-parameter experiments (reported in results on weight decay and robustness): the claim that weight decay 'hurts the robustness of the extrapolation' must be shown to be insensitive to the precise tokens-per-parameter schedule; otherwise the interaction between embedding-LR correction and the token schedule could still confound the reported stability gains.
minor comments (2)
- [§3] Notation for the three metrics should be introduced with explicit equations or pseudocode in §3 so that readers can reproduce the scaling-law fit quality and robustness calculations without ambiguity.
- [figures] Figure captions for the training-curve plots should state the exact width values, batch size, and whether error bars represent standard deviation over seeds or runs.
Simulated Author's Rebuttal
We thank the referee for their insightful comments, which help clarify the presentation of our results. We address each major comment in turn below.
- Referee: §4 (ablations isolating embedding LR): the central claim that 'the overwhelming benefit ... arises simply from maximizing the learning rate of the embedding layer' requires explicit verification that the modified-SP condition applies only the width-scaled embedding LR while leaving all other layer-specific scalings (attention, MLP, output) identical to SP. If any μP-style rules for non-embedding layers are inadvertently active, the attribution to embedding LR alone is confounded and the 'simply from' conclusion is not load-bearing.
  Authors: In our ablation experiments described in §4, the modified-SP variant scales only the embedding layer's learning rate by the model width, while strictly adhering to standard parameterization (SP) rules for all other components, including attention, MLP, and output layers. No μP-specific scaling is applied to non-embedding layers. This setup is explicitly stated in the methods and ablation descriptions. We will add a more prominent clarification in the revised §4 to emphasize this isolation and prevent any potential misinterpretation.
  revision: yes
- Referee: Fixed tokens-per-parameter experiments (reported in results on weight decay and robustness): the claim that weight decay 'hurts the robustness of the extrapolation' must be shown to be insensitive to the precise tokens-per-parameter schedule; otherwise the interaction between embedding-LR correction and the token schedule could still confound the reported stability gains.
  Authors: Our weight decay experiments were performed in the fixed tokens-per-parameter regime using a consistent schedule across all runs. While we did not explicitly vary the schedule in the current manuscript, the observed reduction in extrapolation robustness with weight decay is consistent with the training dynamics under AdamW and the embedding LR correction. To fully address this, we will include a brief sensitivity analysis or additional discussion in the revision to confirm that the effect is not an artifact of the specific schedule chosen.
  revision: partial
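The fixed tokens-per-parameter regime discussed in this exchange ties the training-token budget to model size, so wider models train for proportionally longer. The sketch below shows the arithmetic under loudly stated assumptions: the 12 * depth * width^2 parameter count is the common rough estimate for a GPT-style block stack with a 4x MLP expansion and ignores embeddings, and the depth, ratio, batch size, and sequence length are hypothetical values, none of which are reported on this page.

```python
def approx_non_embedding_params(width, depth):
    """Rough parameter count of a GPT-style block stack, excluding embeddings.

    Each block contributes about 4*width^2 (attention) + 8*width^2 (MLP with
    4x expansion). This is an approximation, not the paper's accounting.
    """
    return 12 * depth * width * width


def token_budget(width, depth, tokens_per_param):
    """Training tokens when the tokens-per-parameter ratio is held fixed."""
    return tokens_per_param * approx_non_embedding_params(width, depth)


def training_steps(tokens, batch_size, seq_len):
    """Optimizer steps needed to consume the budget, rounded up."""
    tokens_per_step = batch_size * seq_len
    return -(-tokens // tokens_per_step)  # integer ceiling division


if __name__ == "__main__":
    # Hypothetical settings chosen only to show how the budget grows with width.
    for width in (256, 512, 1024):
        tokens = token_budget(width, depth=4, tokens_per_param=20)
        print(width, tokens, training_steps(tokens, batch_size=64, seq_len=1024))
```

Because the budget grows with the square of the width at fixed depth, a change in the ratio alters the training horizon at every scale at once, which is why the referee asks whether the weight-decay finding is sensitive to it.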
Circularity Check
Empirical framework and ablations are self-contained with independent metrics
The paper defines its three metrics for hyperparameter transfer (scaling law fit quality, extrapolation robustness, asymptotic loss penalty) directly from experimental outcomes without any derivation that reduces them to fitted parameters or prior self-citations. The central finding on embedding-layer learning rate is obtained via ablations that compare SP and μP variants; these comparisons do not invoke equations or uniqueness theorems that loop back to the inputs by construction. No load-bearing step relies on self-citation chains or ansatzes smuggled from prior author work. The analysis remains externally falsifiable through the reported training runs and is therefore scored as having no circularity.
A Experimental Details
We pre-trained GPT-style Transformers on FineWeb-Edu [36], building on the nanoGPT c...