GQA-μP: The maximal parameterization update for grouped query attention

Alexander Moreno; Daria Soboleva; Eric Xing; Huijuan Wang; Joel Hestness; Kyle R. Chickering; Mengxi Wu; Muhao Chen; Xuezhe Ma; Zhengzhong Liu

arxiv: 2605.15290 · v1 · pith:4ZAE2CQ5new · submitted 2026-05-14 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-19 16:31 UTC · model grok-4.3

keywords: maximal update parameterization, grouped query attention, hyperparameter transfer, spectral norm, feature learning, large language models, weight decay scaling

The pith

A modified spectral norm for non-full-rank matrices lets maximal update parameterization apply to grouped-query attention.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper redefines feature learning through spectral norm conditions on weights and introduces a modified norm that keeps scaling laws valid even when matrices lack full rank. This step produces the first derivation of μP rules for grouped-query attention, including depth and weight-decay scalings. The result is concrete transfer of learning rates across the GQA repetition factor and across weight-decay choices. Readers care because such transfer removes the need to retune large language models from scratch when switching attention configurations.

Core claim

By elevating spectral norm conditions to the definition of feature learning and adopting a modified spectral norm that preserves valid weight scaling for non-full-rank matrices, the authors derive μP scalings for grouped-query attention. These scalings produce learning-rate transfer across the GQA repetition hyperparameter and across weight-decay values, as verified in experiments.

What carries the argument

The modified spectral norm that preserves the valid scaling law of network weights when weight matrices are not full rank.
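
One way to picture the rank deficiency at issue is to assume that a shared KV head acts like a weight block repeated once per query head in its group. The sketch below is an illustration of how the ordinary spectral norm behaves under that assumption, not the paper's modified norm, whose definition does not appear on this page. The helper names (`spectral_norm`, `rank`, `repeat_rows`) and the toy sizes are hypothetical. Stacking `kv_reps` copies of a block leaves the rank unchanged but inflates the ordinary spectral norm by a factor of sqrt(kv_reps), so a condition stated in terms of that norm changes with the repetition factor.

```python
import math
import random


def transpose(a):
    return [list(col) for col in zip(*a)]


def matvec(a, x):
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def spectral_norm(a, iters=200):
    """Largest singular value of `a`, estimated by power iteration on a^T a."""
    at = transpose(a)
    x = [1.0] * len(a[0])
    for _ in range(iters):
        y = matvec(at, matvec(a, x))
        scale = math.sqrt(sum(v * v for v in y))
        x = [v / scale for v in y]
    return math.sqrt(sum(v * v for v in matvec(a, x)))


def rank(a, tol=1e-9):
    """Numerical rank via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in a]
    rows, cols = len(m), len(m[0])
    r = 0
    for c in range(cols):
        if r == rows:
            break
        pivot = max(range(r, rows), key=lambda i: abs(m[i][c]))
        if abs(m[pivot][c]) < tol:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, rows):
            f = m[i][c] / m[r][c]
            for j in range(c, cols):
                m[i][j] -= f * m[r][j]
        r += 1
    return r


def repeat_rows(w, reps):
    """Stack `reps` copies of `w` (assumed picture of one KV head shared by `reps` query heads)."""
    return [row[:] for _ in range(reps) for row in w]


random.seed(0)
head_size, width, kv_reps = 4, 8, 3  # toy sizes, chosen only for illustration
w_kv = [[random.gauss(0.0, 1.0) for _ in range(width)] for _ in range(head_size)]
w_expanded = repeat_rows(w_kv, kv_reps)

ratio = spectral_norm(w_expanded) / spectral_norm(w_kv)
print(f"expanded shape: {len(w_expanded)} x {width}, rank {rank(w_expanded)}")
print(f"spectral norm ratio: {ratio:.6f}, sqrt(kv_reps): {math.sqrt(kv_reps):.6f}")
```

The expanded matrix has 12 rows and 8 columns but the same rank as the single block, which is the sense in which it is not full rank.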

If this is right

Learning rates tuned on one GQA configuration transfer to models with different numbers of query groups.
Weight-decay hyperparameters also transfer without retuning when the GQA-μP rules are followed.
Hyperparameter search compute drops because small-model optima apply directly to larger GQA models.
Depth and weight-decay scalings emerge directly from the spectral definition without separate lazy-learning arguments.
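
To make "transfer" concrete, here is a minimal sketch of the vanilla Adam-μP width rule for hidden weight matrices from Yang et al. (2022), where the learning rate shrinks in proportion to 1/width relative to a tuned proxy model. This is not the GQA-specific rule, which this page does not state; the function name and the numeric values are hypothetical.

```python
def mup_adam_hidden_lr(base_lr, base_width, width):
    """Vanilla Adam-muP rule for hidden weight matrices: lr scales as base_width / width."""
    return base_lr * base_width / width


# Hypothetical optimum found on a small proxy model of width 256.
base_lr, base_width = 3e-3, 256

for width in (256, 512, 1024, 2048):
    lr = mup_adam_hidden_lr(base_lr, base_width, width)
    print(f"width={width:5d}  hidden-layer lr={lr:.6f}")
```

Under such a rule the base learning rate is tuned once on the proxy and reused; the paper's claim is that an analogous reuse holds across the GQA repetition factor.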

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same modified-norm step may allow μP derivations for other low-rank attention variants such as multi-query attention.
If the norm modification generalizes, practitioners could apply a single set of scaling rules across many attention architectures instead of deriving each case separately.
The approach suggests testing whether the same spectral redefinition yields transfer for mixture-of-experts layers or other non-square weight structures common in large models.

Load-bearing premise

The modified spectral norm preserves the valid scaling law of network weights when weight matrices are not full rank.

What would settle it

Training runs in which learning rates tuned under one GQA repetition factor fail to transfer when the modified spectral norm is replaced by the ordinary spectral norm.

Figures

Figures reproduced from arXiv: 2605.15290 by Alexander Moreno, Daria Soboleva, Eric Xing, Huijuan Wang, Joel Hestness, Kyle R. Chickering, Mengxi Wu, Muhao Chen, Xuezhe Ma, Zhengzhong Liu.

**Figure 2.** Demonstration of the failure of the spectral norm to accurately capture the behavior for ...

**Figure 3.** Voronoi interpolation for random sweeps over both learning rate and weight decay. The top ...

**Figure 4.** Coordinate checks in the style of Yang et al. (2022) for the activation update norms ...

**Figure 5.** Coordinate checks for ||∆W|| under the vanilla Adam-μP scalings. The model fails the coordinate checks when evaluated using the spectral feature learning condition equation 1. However, as shown in ...

**Figure 6.** Coordinate checks for ||∆W|| under our proposed GQA scalings. The model has eight hidden layers. Additional experimental details are provided in Appendix B.1.1. [...] vanilla Adam-μP implementation and our proposed scaling preserve their qualitative properties across model sizes. For the experiment in ...

**Figure 7.** Voronoi interpolation for random sweeps over both learning rate and ...

**Figure 8.** Learning-rate transfer at 20 tokens-per-parameter (TPP) under vanilla Adam- ...

Original abstract

Hyperparameter transfer across model architectures dramatically reduces the amount of compute necessary for tuning large language models (LLMs). The maximal update parameterization (μP) ensures transfer through principled mathematical analysis but can be challenging to derive for new model architectures. Building on the spectral feature-learning view of Yang et al. (2023a), we make two advances. First, we promote spectral norm conditions on the weights from a heuristic to the definition of feature learning, and as a consequence arrive at the Complete-P depth and weight-decay scalings without recourse to lazy-learning. Second, we consider a modified spectral norm that preserves the valid scaling law of network weights when weight matrices are not full rank. This enables (to our knowledge, the first) derivation of μP scalings for grouped-query attention (GQA). We demonstrate the efficacy of our theoretical derivations by showing learning rate transfer across the GQA repetition hyperparameter as well as experiments regarding transfer over weight decay.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper supplies the first explicit μP rules for GQA by promoting spectral norms to a definition and adding a modified norm for rank-deficient matrices, with positive LR-transfer results, though the modified-norm step lacks an independent check.

The main thing to know is that this work derives μP scalings for grouped query attention, which is already common in production models. They do it by first turning spectral-norm conditions into the definition of feature learning, which directly gives the Complete-P rules for depth and weight decay without lazy-training assumptions. That step is a straightforward extension of Yang et al. 2023a and looks internally consistent on its own terms.

The second move is a modified spectral norm that is supposed to keep the correct scaling when weight matrices lose rank due to GQA's repetition hyperparameter. They report learning-rate transfer across different GQA grouping values plus some weight-decay experiments to show the rules work in practice. Those results are the concrete evidence they offer.

The soft spot is exactly the modified norm. The paper treats it as the key technical device that makes the GQA derivation possible, yet the abstract gives no separate test, such as artificially introducing rank deficiency into a full-rank case and confirming the scaling law still holds. Without that or the full derivations, it is hard to tell whether the transfer experiments validate a derived parameterization or simply a workable one. Dataset details and error analysis are also missing from the high-level description, so the strength of the empirical support is difficult to judge right now.

This paper is for researchers who run scaling experiments on LLMs that already use GQA and want to reduce hyperparameter search cost across model sizes or variants. A reader focused on parameterization and transfer would find the explicit rules and the reported transfers useful even if they plan to re-derive the modified norm themselves. It deserves a serious referee because the claim is practically relevant, the theoretical path is traceable to prior work, and the experiments are at least directionally positive. Referees could usefully press on the justification for the norm modification and ask for more controls in the transfer tests.

Referee Report

1 major / 2 minor

Summary. The paper extends the maximal update parameterization (μP) to grouped-query attention (GQA) by building on the spectral feature-learning framework of Yang et al. (2023a). It promotes spectral-norm conditions on weights from a heuristic to the definition of feature learning, thereby obtaining Complete-P scalings for depth and weight decay without invoking lazy learning. A modified spectral norm is then introduced to preserve the correct scaling law for weight matrices that are not full rank (as occurs due to the GQA repetition/grouping hyperparameter). The resulting GQA-μP scalings are validated by experiments demonstrating learning-rate transfer across the GQA repetition hyperparameter and across weight-decay values.

Significance. If the derivations are rigorous, the work would supply the first principled μP parameterization for GQA, a widely used architectural variant in modern LLMs, thereby reducing the compute required for hyperparameter transfer when scaling models that employ grouped attention. The definitional elevation of spectral conditions and the explicit treatment of rank deficiency could serve as a template for μP derivations in other attention or sparsity patterns. The reported LR-transfer experiments provide concrete evidence of practical utility, though their strength depends on the soundness of the underlying modified-norm construction.

major comments (1)

[Modified spectral norm definition and GQA derivation] The section introducing the modified spectral norm (immediately after the promotion of spectral conditions to a definition): this construction is asserted to preserve the valid scaling law of network weights for non-full-rank matrices and is the explicit technical device that permits the GQA-μP derivation. No independent verification is supplied—e.g., an explicit rank-deficient limit, an artificial rank-reduction test that recovers the known full-rank μP scaling, or a direct comparison against the unmodified spectral norm under controlled rank deficiency. Because the entire GQA extension rests on this step, the absence of such a check makes the central theoretical claim difficult to assess from the given material.

minor comments (2)

[Experiments] The abstract and experimental sections would benefit from explicit statements of the datasets, model sizes, and exact GQA repetition values used in the transfer experiments, together with quantitative metrics (e.g., loss curves or final perplexity) that allow readers to judge the magnitude of the observed transfer.
[Theoretical development] Notation for the modified spectral norm should be introduced with a clear equation number and contrasted side-by-side with the standard spectral norm to make the precise modification transparent.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading of our manuscript and for the constructive feedback. We address the major comment point-by-point below and are happy to revise the manuscript accordingly to strengthen the presentation of our theoretical results.


Referee: The section introducing the modified spectral norm (immediately after the promotion of spectral conditions to a definition): this construction is asserted to preserve the valid scaling law of network weights for non-full-rank matrices and is the explicit technical device that permits the GQA-μP derivation. No independent verification is supplied—e.g., an explicit rank-deficient limit, an artificial rank-reduction test that recovers the known full-rank μP scaling, or a direct comparison against the unmodified spectral norm under controlled rank deficiency. Because the entire GQA extension rests on this step, the absence of such a check makes the central theoretical claim difficult to assess from the given material.

Authors: We appreciate the referee pointing out the need for more explicit verification of the modified spectral norm construction. In the paper, the modified spectral norm is motivated and derived to ensure that the feature learning condition (promoted to a definition) holds for the rank-deficient weight matrices that arise in GQA due to the repetition of key and value heads. The derivation ensures that the scaling of the learning rate and other hyperparameters remains consistent with the full-rank case, adjusted for the grouping factor. While the overall GQA-μP is validated through learning rate transfer experiments across different repetition hyperparameters, we agree that an independent check of the norm itself would be beneficial. In the revised manuscript, we will add a new subsection providing: (1) an explicit rank-deficient limit analysis showing how the modified norm recovers the correct μP scaling laws, and (2) a controlled numerical test where we apply artificial rank reduction to a weight matrix and compare the behavior under modified vs. standard spectral norm. This will directly address the concern and make the technical device more transparent.

Revision: yes

Circularity Check

1 step flagged

The definitional promotion of spectral conditions, together with a modified norm chosen to preserve scaling, reduces GQA-μP to a result by construction

specific steps

self-definitional [Abstract]
"First, we promote spectral norm conditions on the weights from a heuristic to the definition of feature learning, and as a consequence arrive at the Complete-P depth and weight-decay scalings without recourse to lazy-learning. Second, we consider a modified spectral norm that preserves the valid scaling law of network weights when weight matrices are not full rank. This enables (to our knowledge, the first) derivation of μP scalings for grouped-query attention (GQA)."

The spectral conditions are promoted to the definition of feature learning; the scalings are then stated to follow 'as a consequence.' The modified norm is introduced specifically because it 'preserves the valid scaling law' under the rank reduction of GQA. Both the definition and the modification are therefore chosen to make the desired Complete-P and GQA-μP results hold, rendering the derivation tautological with respect to these choices rather than derived from prior independent premises.

full rationale

The paper's two stated advances are (1) elevating spectral-norm conditions to the definition of feature learning, from which Complete-P scalings follow directly, and (2) introducing a modified spectral norm explicitly asserted to preserve the scaling law under rank deficiency induced by GQA. Both steps are load-bearing for the claimed first-principles derivation of μP for grouped-query attention. Because the modification is defined to achieve preservation and the feature-learning definition is chosen to yield the target scalings, the central results reduce to the inputs by construction rather than by independent derivation. Experiments on LR transfer are presented as validation but do not retroactively make the definitional steps non-circular. No fitted parameters or self-citations are shown to be the sole support, so the score remains moderate.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on redefining feature learning via spectral norm conditions and on the new modified spectral norm for rank-deficient matrices; no explicit free parameters are named in the abstract, but the modified norm itself functions as an invented technical device whose independent justification is the preservation of scaling laws.

axioms (1)

Domain assumption: Spectral norm conditions on weights constitute the definition of feature learning rather than a heuristic.
Promoted from prior heuristic status to definitional status to derive Complete-P depth and weight-decay scalings without lazy-learning arguments.

invented entities (1)

Modified spectral norm for non-full-rank weight matrices (no independent evidence)
Purpose: Preserves the valid scaling law of network weights when matrices are rank-deficient, enabling μP derivation for GQA.
Introduced to handle the structure of GQA weight matrices; independent evidence would be a falsifiable prediction that the resulting scalings produce transfer on held-out model sizes.

pith-pipeline@v0.9.0 · 5735 in / 1460 out tokens · 52506 ms · 2026-05-19T16:31:30.644410+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · relation to the paper passage: unclear

we consider a modified spectral norm that preserves the valid scaling law of network weights when weight matrices are not full rank. This enables ... derivation of μP scalings for grouped-query attention (GQA)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · 8 internal anchors

[1]

Joshua Ainslie, James Lee-Thorp, Michiel De Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245.
[2]

Maksym Andriushchenko, Francesco D'Angelo, Aditya Varre, and Nicolas Flammarion. Why do we need weight decay in modern deep learning? arXiv preprint arXiv:2310.04415. URL https://arxiv.org/abs/2310.04415.
[3]

Advances in Neural Information Processing Systems (NeurIPS 2025)

Shane Bergsma, Nolan Dey, Gurpreet Gosal, Gavia Gray, Daria Soboleva, and Joel Hestness. Power lines: Scaling laws for weight decay and batch size in LLM pre-training. arXiv preprint arXiv:2505.13738.
[4]

Nolan Dey, Gurpreet Gosal, Hemant Khachane, William Marshall, Ribhu Pathria, Marvin Tom, Joel Hestness, et al. Cerebras-GPT: Open compute-optimal language models trained on the Cerebras wafer-scale cluster. arXiv preprint arXiv:2304.03208.
[5]

Nolan Dey, Bin Claire Zhang, Lorenzo Noci, Mufan Li, Blake Bordelon, Shane Bergsma, Cengiz Pehlevan, Boris Hanin, and Joel Hestness. Don't be lazy: CompleteP enables compute-efficient deep transformers. arXiv preprint arXiv:2505.01618, 2025.
[6]

Katie Everett, Lechao Xiao, Mitchell Wortsman, Alexander A Alemi, Roman Novak, Peter J Liu, Izzeddin Gur, Jascha Sohl-Dickstein, Leslie Pack Kaelbling, Jaehoon Lee, et al. Scaling exponents across parameterizations and optimizers. arXiv preprint arXiv:2407.05872.
[7]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.
[8]

Gaussian Error Linear Units (GELUs). URL https://arxiv.org/abs/1606.08415.

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. An empirical analysis of compute-optimal large language model training. Advances in Neural Information Processing Systems, 35:30016–30030.
[9]

Albert Qiaochu Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B. ArXiv, abs/23...
[10]

... URL https://kellerjordan.github.io/posts/muon/.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[11]

Zhengzhong Liu, Liping Tang, Linghao Jin, Haonan Li, Nikhil Ranjan, Desai Fan, Shaurya Rohatgi, Richard Fan, Omkar Pangarkar, Huijuan Wang, et al. K2-V2: A 360-open, reasoning-enhanced LLM. arXiv preprint arXiv:2512.06201.
[12]

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
[13]

Bruno Mlodozeniec, Pierre Ablin, Louis Béthune, Dan Busbridge, Michal Klein, Jason Ramapuram, and Marco Cuturi. Completed hyperparameter transfer across modules, width, depth, batch and duration. arXiv preprint arXiv:2512.22382, 2025.
[14]

Saaketh Narayan, Abhay Gupta, Mansheej Paul, and Davis Blalock. μnit scaling: Simple and scalable FP8 LLM training. arXiv preprint arXiv:2502.05967.
[15]

Fabian Schaipp. How to jointly tune learning rate and weight decay for AdamW. https://fabian-sp.github.io/posts/2024/02/decoupling/, 2024.
[16]

Xi Wang and Laurence Aitchison. How to set AdamW's weight decay as you scale model and dataset size. arXiv preprint arXiv:2405.13698.
[17]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388.
[18]

Greg Yang. Tensor programs II: Neural tangent kernel for any architecture. arXiv preprint arXiv:2006.14548, 2020a.

Greg Yang. Tensor programs III: Neural matrix laws. arXiv preprint arXiv:2009.10685, 2020b.

Greg Yang and Edward J Hu. Tensor programs IV: Feature learning in infinite-width neural networks. In International Conference on Machine Learning, pp...
[19]

Greg Yang, Edward J Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. Tensor programs V: Tuning large neural networks via zero-shot hyperparameter transfer. arXiv preprint arXiv:2203.03466, 2022.
[20]

Greg Yang, James B Simon, and Jeremy Bernstein. A spectral condition for feature learning. arXiv preprint arXiv:2310.17813, 2023a.

Greg Yang, Dingli Yu, Chen Zhu, and Soufiane Hayou. Tensor programs VI: Feature learning in infinite-depth neural networks. arXiv preprint arXiv:2310.02244, 2023b.

Yong-Qua Yin, Zhi-Dong Bai, and Pathak R Krishnaiah. On the li...
[21]

Chenyu Zheng, Rongzhen Wang, Xinyu Zhang, and Li Chongxuan. Spectral condition for μP under width-depth scaling. arXiv preprint arXiv:2603.00541v2.
[22]

A ADDITIONAL MATHEMATICAL DETAILS

A.1 DERIVATION FOR ADAM

We demonstrate the applicability of our framework by re-deriving the μP scalings for Adam. Recall that the Adam optimizer (Kingma & Ba, 2014) uses hyperparameters $\beta_1$, $\beta_2$, $\varepsilon$, and $\eta$ and has its optimization steps given by the following components: $g_t = \nabla_W f(W_{t-1})$, $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$, $v_t = \beta_2 v_{t-1} + ...$
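
Because the excerpt above is cut off mid-recursion, the following is a minimal sketch of one standard Adam step as defined by Kingma & Ba, not a reconstruction of the paper's derivation. The second-moment update, bias correction, and final weight update are filled in from the standard algorithm; the function name `adam_step` and the example values are hypothetical.

```python
import math


def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One standard Adam step on lists of floats; `t` is the 1-based step count."""
    # First- and second-moment exponential moving averages.
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
    # Bias correction for the zero initialization of m and v.
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    # Elementwise normalized update.
    w = [wi - lr * mh / (math.sqrt(vh) + eps) for wi, mh, vh in zip(w, m_hat, v_hat)]
    return w, m, v


# Two coordinates with very different gradient magnitudes (illustrative values).
w_new, m_new, v_new = adam_step(
    w=[1.0, -2.0], g=[0.5, -4.0], m=[0.0, 0.0], v=[0.0, 0.0], t=1
)
print(w_new)
```

On the first step each coordinate moves by roughly `lr` in the direction opposite its gradient, regardless of gradient magnitude. That normalization is why learning-rate scalings for Adam are derived separately from those for SGD.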

[23]

If they did, the update could cancel the weights, causing the activations or backpropagated gradients to shrink across layers or training steps

| Width | Depth | Num Heads | Head Size | KV Heads | KV Reps |
|---|---|---|---|---|---|
| 576 | 8 | 12 | 64 | 1 | 12 |
| 576 | 8 | 12 | 64 | 2 | 6 |
| 576 | 8 | 12 | 64 | 3 | 4 |
| 576 | 8 | 12 | 64 | 4 | 3 |
| 576 | 8 | 12 | 64 | 6 | 2 |
| 576 | 8 | 12 | 64 | 12 | 1 |

This assumption captures a basic stability property of high-dimensional neural networks: when an update is added to a weight matrix, the update and the existing weights should not systematically point in opposi...
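
In every row of the configuration table, KV Heads times KV Reps equals the number of query heads (12). A minimal sketch of enumerating such configurations follows; the function name `gqa_configs` is hypothetical.

```python
def gqa_configs(num_heads):
    """All (kv_heads, kv_reps) pairs with kv_heads * kv_reps == num_heads."""
    return [
        (kv_heads, num_heads // kv_heads)
        for kv_heads in range(1, num_heads + 1)
        if num_heads % kv_heads == 0
    ]


configs = gqa_configs(12)
for kv_heads, kv_reps in configs:
    print(f"KV heads = {kv_heads:2d}, KV reps = {kv_reps:2d}")
```

With 12 query heads this yields the same six (KV Heads, KV Reps) pairs that the table sweeps over, from a single shared KV head (KV Reps = 12) to one KV head per query head (KV Reps = 1).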

[24]

(2026) but uses a rounded exponent for tractability

$B = 0.000733 \times \sqrt{n_{\text{tokens}}}$. (16)

Equation 16 follows the isoloss sweep methodology of Bergsma et al. (2026) but uses a rounded exponent for tractability. Specifically, Bergsma et al. (2026) estimates a scaling exponent of 0.46 and recommends rounding to 0.5. Since we ran independent sweeps on our own data, equation 16 is specific to our setup but aligns structu...
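
Equation 16 can be evaluated directly, as in the minimal sketch below. The units of B are not stated in this excerpt, so the function simply returns the raw value of the formula; the function name and the example token count are hypothetical. The optional `exponent` argument is included only to compare the rounded exponent 0.5 with the estimate of 0.46 mentioned in the text, keeping the same coefficient purely for illustration.

```python
import math


def batch_size(n_tokens, coeff=0.000733, exponent=0.5):
    """Evaluate B = coeff * n_tokens ** exponent (equation 16 when exponent = 0.5)."""
    return coeff * n_tokens ** exponent


n_tokens = 1e8  # illustrative token budget, not taken from the paper
b_rounded = batch_size(n_tokens)
b_unrounded = batch_size(n_tokens, exponent=0.46)
print(f"exponent 0.50: B = {b_rounded:.3f}")
print(f"exponent 0.46: B = {b_unrounded:.3f}")
```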

[25]

We set the base weight decay to be $\lambda_0 = 0.1$. We use a base Adam $\varepsilon$ of $10^{-9}/n$, where $n$ is the embedding dimension, to match the predicted Adam $\varepsilon$ scaling of Dey et al. (2025). We take three runs for each data point, using seeds 42, 43, 44 for reproducibility.

Table 4: Model configurations for the GQA transfer experiments from Figure
[26]

The configurations used for this experiment can be found in Table

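| Name | Params (M) | Non-Embd Params (M) | Width | Depth | Num Heads | Head Size | KV Heads | KV Reps | TPP | Dataset Size (Tokens) | Dataset Size (Sequences) | Batch Size (Tokens) | Batch Size (Sequences) | Iterations |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| kvrt1 | 125.55 | 80.62 | 768 | 7 | 12 | 64 | 1 | 12 | 10 | 806200000 | 98413 | 262144 | 32 | 3075 |
| kvrt2 | 126.23 | 81.31 | 768 | 7 | 12 | 64 | 2 | 6 | 10 | 813100000 | 99255 | 262144 | 32 | 3102 |
| kvrt3 | 126.92 | 82 | 768 | 7 | 12 | 64 | 3 | 4 | 10 | 820000000 | 1000... | | | |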
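
The complete rows of this table appear to be internally consistent: dataset tokens equal non-embedding parameters times TPP, and the sequence and iteration counts follow by division. The sketch below reproduces the kvrt1 and kvrt2 rows under two assumptions inferred from the table rather than stated in it: a sequence length of 8192 tokens (262144 / 32) and rounding to the nearest integer. The function name `derive_schedule` is hypothetical.

```python
def derive_schedule(non_embd_params, tpp, batch_tokens, seq_len=8192):
    """Derive dataset and iteration counts from non-embedding params and tokens-per-parameter.

    seq_len=8192 and round-to-nearest are inferred from the table rows, not stated in the source.
    """
    tokens = non_embd_params * tpp
    return {
        "dataset_tokens": tokens,
        "dataset_sequences": round(tokens / seq_len),
        "batch_sequences": batch_tokens // seq_len,
        "iterations": round(tokens / batch_tokens),
    }


kvrt1 = derive_schedule(non_embd_params=80_620_000, tpp=10, batch_tokens=262_144)
kvrt2 = derive_schedule(non_embd_params=81_310_000, tpp=10, batch_tokens=262_144)
print(kvrt1)
print(kvrt2)
```

The derived values for both rows match the Dataset Size, Batch Size (Sequences), and Iterations columns as listed.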

[27]

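| Name | Params (M) | Non-Embd Params (M) | Width | Depth | Num Heads | Head Size | KV Heads | KV Reps | TPP | Dataset Size (Tokens) | Dataset Size (Sequences) | Batch Size (Tokens) | Batch Size (Sequences) | Iters. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| jwd-small | 48.82 | 26.38 | 384 | 4 | 6 | 64 | 6 | 1 | 3 | 79140000 | 9661 | 81920 | 10 | 966 |
| jwd-medium | 125.96 | 81.07 | 768 | 6 | 12 | 64 | 12 | 1 | 3 | 243210000 | 29689 | 147456 | 18 | 1649 |
| jwd-large | 237.17 | 177.31 | 1024 | 10 | 16 | 64 | 16 | 1 | 3 | 53193... | | | | |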
[28]

The top row is standard parameterization

[Figure residue: panels plotting Learning Rate against epoch, titled SP (n_embd=384), SP (n_embd=768), SP (n_embd=1024), P Unit WD (n_embd=384), P Unit WD (n_embd=768), and further panels truncated in the source.]
[29]

| Implementation | Var. LR | Var. WD | Var. Loss |
|---|---|---|---|
| SP | 1.34 | 3.83×10^-1 | 4.87×10^-1 |
| μP | 4.75×10^-2 | 1.38 | 4.87×10^-1 |
| μP + WD | 5.54×10^-3 | 7.51×10^-1 | 4.77×10^-1 |

B.4 MORE RESULTS ABOUT WEIGHT DECAY

We used the same data that was collected from Figure 3 to analyze whether or not our experimental testbed demonstrates transfer over $\tau_{\text{epoch}}$, as is suggested by (Wang & Aitchison, 2024; ...
[30]

Like for the case of weight decay transfer (see Figure 3), we find that our suggested implementation outperforms both the standard parameterization and the vanilla Adam-μP implementation from Yang et al. (2022).

C LLM STATEMENT

We did not use LLMs in a significant way to aid our research during the completion of this work. Our LLM usage did not extend bey...

[1] [1]

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

Joshua Ainslie, James Lee-Thorp, Michiel De Jong, Yury Zemlyanskiy, Federico Lebr´on, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head check- points. arXiv preprint arXiv:2305.13245,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

arXiv: 2310.04415 [cs.LG].url:https://arxiv.org/abs/2310.04415

Maksym Andriushchenko, Francesco D’Angelo, Aditya Varre, and Nicolas Flammarion. Why do we need weight decay in modern deep learning? arXiv preprint arXiv:2310.04415,

work page arXiv

[3] [3]

Advances in Neural Information Processing Systems (NeurIPS 2025) , year =

Shane Bergsma, Nolan Dey, Gurpreet Gosal, Gavia Gray, Daria Soboleva, and Joel Hestness. Power lines: Scaling laws for weight decay and batch size in llm pre-training. arXiv preprint arXiv:2505.13738,

work page arXiv

[4] [4]

S., Chen, Z., Khachane, H., Marshall, W., Pathria, R., Tom, M., and Hestness, J

Nolan Dey, Gurpreet Gosal, Hemant Khachane, William Marshall, Ribhu Pathria, Marvin Tom, Joel Hestness, et al. Cerebras-gpt: Open compute-optimal language models trained on the cerebras wafer-scale cluster. arXiv preprint arXiv:2304.03208,

work page arXiv

[5] [5]

Don’t be lazy: Completep enables compute-efficient deep transformers.arXiv preprint arXiv:2505.01618, 2025

Nolan Dey, Bin Claire Zhang, Lorenzo Noci, Mufan Li, Blake Bordelon, Shane Bergsma, Cengiz Pehlevan, Boris Hanin, and Joel Hestness. Don’t be lazy: Completep enables compute-efficient deep transformers. arXiv preprint arXiv:2505.01618,

work page arXiv

[6] [6]

Everett, L

Katie Everett, Lechao Xiao, Mitchell Wortsman, Alexander A Alemi, Roman Novak, Peter J Liu, Izzeddin Gur, Jascha Sohl-Dickstein, Leslie Pack Kaelbling, Jaehoon Lee, et al. Scaling expo- nents across parameterizations and optimizers. arXiv preprint arXiv:2407.05872,

work page arXiv

[7] [7]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783,

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

Gaussian Error Linear Units (GELUs)

URLhttps:// arxiv.org/abs/1606.08415. Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. An empirical analysis of compute-optimal large language model training. Advances in neural information processing systems, 35:30016–30030,

[9]

Mistral 7B

Albert Qiaochu Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B. ArXiv, abs/23...

[10]

Adam: A Method for Stochastic Optimization

URL https://kellerjordan.github.io/posts/muon/.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980,

[11]

K2-V2: A 360-open, reasoning-enhanced LLM

Zhengzhong Liu, Liping Tang, Linghao Jin, Haonan Li, Nikhil Ranjan, Desai Fan, Shaurya Rohatgi, Richard Fan, Omkar Pangarkar, Huijuan Wang, et al. K2-V2: A 360-open, reasoning-enhanced LLM. arXiv preprint arXiv:2512.06201,

[12]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101,

[13]

Completed hyperparameter transfer across modules, width, depth, batch and duration

Bruno Mlodozeniec, Pierre Ablin, Louis Béthune, Dan Busbridge, Michal Klein, Jason Ramapuram, and Marco Cuturi. Completed hyperparameter transfer across modules, width, depth, batch and duration. arXiv preprint arXiv:2512.22382, 2025.

[14]

µnit Scaling: Simple and scalable FP8 LLM training

Saaketh Narayan, Abhay Gupta, Mansheej Paul, and Davis Blalock. µnit Scaling: Simple and scalable FP8 LLM training. arXiv preprint arXiv:2502.05967,

[15]

How to jointly tune learning rate and weight decay for AdamW

Fabian Schaipp. How to jointly tune learning rate and weight decay for AdamW. https://fabian-sp.github.io/posts/2024/02/decoupling/, 2024.

[16]

How to set AdamW's weight decay as you scale model and dataset size

Xi Wang and Laurence Aitchison. How to set AdamW’s weight decay as you scale model and dataset size. arXiv preprint arXiv:2405.13698,

[17]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388,

[18]

Tensor programs II: Neural tangent kernel for any architecture

Greg Yang. Tensor programs II: Neural tangent kernel for any architecture. arXiv preprint arXiv:2006.14548, 2020a.

Greg Yang. Tensor programs III: Neural matrix laws. arXiv preprint arXiv:2009.10685, 2020b.

Greg Yang and Edward J Hu. Tensor programs IV: Feature learning in infinite-width neural networks. In International Conference on Machine Learning, pp...

[19]

Tensor programs V: Tuning large neural networks via zero-shot hyperparameter transfer

Greg Yang, Edward J Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. Tensor programs V: Tuning large neural networks via zero-shot hyperparameter transfer. arXiv preprint arXiv:2203.03466, 2022.

[20]

A spectral condition for feature learning

Greg Yang, James B Simon, and Jeremy Bernstein. A spectral condition for feature learning. arXiv preprint arXiv:2310.17813, 2023a.

Greg Yang, Dingli Yu, Chen Zhu, and Soufiane Hayou. Tensor programs VI: Feature learning in infinite-depth neural networks. arXiv preprint arXiv:2310.02244, 2023b.

Yong-Qua Yin, Zhi-Dong Bai, and Pathak R Krishnaiah. On the li...

[21]

Spectral Condition for $\mu$P under Width-Depth Scaling

Chenyu Zheng, Rongzhen Wang, Xinyu Zhang, and Li Chongxuan. Spectral condition for µP under width-depth scaling. arXiv preprint arXiv:2603.00541v2,

[22]

A ADDITIONAL MATHEMATICAL DETAILS

A.1 DERIVATION FOR ADAM

We demonstrate the applicability of our framework by re-deriving the µP scalings for Adam. Recall that the Adam optimizer (Kingma & Ba, 2014) uses hyperparameters $\beta_1$, $\beta_2$, $\varepsilon$, and $\eta$ and has its optimization steps given by the following components: $g_t = \nabla_W f(W_{t-1})$, $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$, $v_t = \beta_2 v_{t-1} +$ ...
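The excerpt above is cut off partway through the Adam recursion. As a reminder of how the pieces fit together, here is a minimal pure-Python sketch of one step of standard Adam as described by Kingma & Ba. It is not the paper's derivation: the function name `adam_step`, the flat-list parameter layout, and the default hyperparameter values are assumptions made for illustration, and the bias-correction terms come from the standard algorithm rather than from the truncated text.

```python
import math

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One standard Adam update on a flat list of parameters.

    w, g, m, v are equal-length lists (weights, gradients, first and
    second moment estimates); t is the 1-based step count.
    Hypothetical helper for illustration, not code from the paper.
    """
    new_w, new_m, new_v = [], [], []
    for w_i, g_i, m_i, v_i in zip(w, g, m, v):
        m_i = beta1 * m_i + (1.0 - beta1) * g_i        # first-moment EMA
        v_i = beta2 * v_i + (1.0 - beta2) * g_i * g_i  # second-moment EMA
        m_hat = m_i / (1.0 - beta1 ** t)               # bias correction
        v_hat = v_i / (1.0 - beta2 ** t)
        new_w.append(w_i - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_w, new_m, new_v

# Toy usage: one step on f(w) = w^2 from w = 1, where the gradient is 2w.
w, m, v = adam_step([1.0], [2.0], [0.0], [0.0], t=1)
```

With a negligible ε, the first update has magnitude close to the learning rate regardless of the gradient's scale, which is the kind of normalization any scaling argument for Adam has to account for.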

[23]

If they did, the update could cancel the weights, causing the activations or backpropagated gradients to shrink across layers or training steps

| Width | Depth | Num Heads | Head Size | KV Heads | KV Reps |
|---|---|---|---|---|---|
| 576 | 8 | 12 | 64 | 1 | 12 |
| 576 | 8 | 12 | 64 | 2 | 6 |
| 576 | 8 | 12 | 64 | 3 | 4 |
| 576 | 8 | 12 | 64 | 4 | 3 |
| 576 | 8 | 12 | 64 | 6 | 2 |
| 576 | 8 | 12 | 64 | 12 | 1 |

This assumption captures a basic stability property of high-dimensional neural networks: when an update is added to a weight matrix, the update and the existing weights should not systematically point in opposi...
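In the GQA configurations listed above, each row satisfies KV Heads × KV Reps = Num Heads, so the sweep covers every divisor of 12. A small sketch that enumerates such pairs follows; the helper name `gqa_configs` is hypothetical and not taken from the paper.

```python
def gqa_configs(num_heads):
    """Return all (kv_heads, kv_reps) pairs with kv_heads * kv_reps == num_heads."""
    return [
        (kv_heads, num_heads // kv_heads)
        for kv_heads in range(1, num_heads + 1)
        if num_heads % kv_heads == 0  # KV heads must divide the query heads evenly
    ]

configs = gqa_configs(12)
print(configs)
```

For 12 query heads this yields the six (KV Heads, KV Reps) pairs that appear in the table, from multi-query attention (1 KV head, 12 repetitions) to standard multi-head attention (12 KV heads, 1 repetition).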

[24]

Equation 16 follows the isoloss sweep methodology of Bergsma et al. (2026) but uses a rounded exponent for tractability

$$B = 0.000733 \times \sqrt{n_{\text{tokens}}}. \tag{16}$$

Equation 16 follows the isoloss sweep methodology of Bergsma et al. (2026) but uses a rounded exponent for tractability. Specifically, Bergsma et al. (2026) estimate a scaling exponent of 0.46 and recommend rounding to 0.5. Since we ran independent sweeps on our own data, Equation 16 is specific to our setup but aligns structu...
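Equation 16 can be evaluated directly. The sketch below is a literal transcription under stated assumptions: the function name `batch_size_eq16` is hypothetical, and the excerpt does not say which units B is measured in or how it is rounded to a usable batch size, so the return value is left as a raw float.

```python
def batch_size_eq16(n_tokens, coeff=0.000733, exponent=0.5):
    """Evaluate B = coeff * n_tokens ** exponent (Equation 16 with its rounded exponent).

    The units of B are not stated in the excerpt; treat the result as unitless here.
    """
    return coeff * n_tokens ** exponent

b = batch_size_eq16(1_000_000)  # sqrt(1e6) = 1000, so B is about 0.733
```

Keeping `exponent` as a parameter makes it easy to see how sensitive B is to the choice of 0.5, though the coefficient 0.000733 was fitted for that exponent and would not carry over unchanged to a different one.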

[25]

We use a base Adam $\varepsilon$ of $10^{-9}/n$, where $n$ is the embedding dimension, to match the predicted Adam $\varepsilon$ scaling of Dey et al.

We set the base weight decay to be $\lambda_0 = 0.1$. We use a base Adam $\varepsilon$ of $10^{-9}/n$, where $n$ is the embedding dimension, to match the predicted Adam $\varepsilon$ scaling of Dey et al. (2025). We take three runs for each data point, using seeds 42, 43, and 44 for reproducibility.

Table 4: Model configurations for the GQA transfer experiments from Figure

[26]

The configurations used for this experiment can be found in Table

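| Config | Params (M) | Non-Embd Params (M) | Width | Depth | Num Heads | Head Size | KV Heads | KV Reps | TPP | Dataset Size (Tokens) | Dataset Size (Sequences) | Batch Size (Tokens) | Batch Size (Sequences) | Iterations |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| kvrt1 | 125.55 | 80.62 | 768 | 7 | 12 | 64 | 1 | 12 | 10 | 806200000 | 98413 | 262144 | 32 | 3075 |
| kvrt2 | 126.23 | 81.31 | 768 | 7 | 12 | 64 | 2 | 6 | 10 | 813100000 | 99255 | 262144 | 32 | 3102 |
| kvrt3 | 126.92 | 82 | 768 | 7 | 12 | 64 | 3 | 4 | 10 | 820000000 | 1000...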
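The columns of the configuration table are related by simple arithmetic, which makes a useful sanity check when rebuilding it. The sketch below rests on assumptions inferred from the complete rows rather than stated in the text: parameter counts are in millions, tokens per parameter (TPP) is applied to the non-embedding count, the sequence length is 8192 (262144 batch tokens divided by 32 batch sequences), and derived counts are rounded to the nearest integer. The helper name `derive_budget` is hypothetical.

```python
SEQ_LEN = 8192  # inferred from the table: 262144 batch tokens / 32 batch sequences

def derive_budget(non_embd_params_m, tpp, batch_tokens, seq_len=SEQ_LEN):
    """Recompute the token, sequence, and iteration columns for one table row."""
    tokens = round(non_embd_params_m * 1e6 * tpp)      # TPP times non-embedding params
    return {
        "dataset_tokens": tokens,
        "dataset_sequences": round(tokens / seq_len),  # nearest-integer rounding (assumed)
        "batch_sequences": batch_tokens // seq_len,
        "iterations": round(tokens / batch_tokens),
    }

kvrt1 = derive_budget(80.62, 10, 262144)
print(kvrt1)
```

Under these assumptions the function reproduces the kvrt1 and kvrt2 rows exactly. Rounding to the nearest integer, rather than flooring, is what recovers the 3102 iterations listed for kvrt2.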

[27]

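| Config | Params (M) | Non-Embd Params (M) | Width | Depth | Num Heads | Head Size | KV Heads | KV Reps | TPP | Dataset Size (Tokens) | Dataset Size (Sequences) | Batch Size (Tokens) | Batch Size (Sequences) | Iters. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| jwd-small | 48.82 | 26.38 | 384 | 4 | 6 | 64 | 6 | 1 | 3 | 79140000 | 9661 | 81920 | 10 | 966 |
| jwd-medium | 125.96 | 81.07 | 768 | 6 | 12 | 64 | 12 | 1 | 3 | 243210000 | 29689 | 147456 | 18 | 1649 |
| jwd-large | 237.17 | 177.31 | 1024 | 10 | 16 | 64 | 16 | 1 | 3 | 53193...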

[28]

The top row is standard parameterization

[Figure: grid of panels with axes labeled "epoch" (log scale) and "Learning Rate" (log scale). Recoverable panel titles: SP (n_embd=384), SP (n_embd=768), SP (n_embd=1024), µP Unit WD (n_embd=384), µP Unit WD (n_embd=768), ...]

[29]

| Implementation | Var. LR | Var. WD | Var. Loss |
|---|---|---|---|
| SP | 1.34 | $3.83 \times 10^{-1}$ | $4.87 \times 10^{-1}$ |
| µP | $4.75 \times 10^{-2}$ | 1.38 | $4.87 \times 10^{-1}$ |
| µP + WD | $5.54 \times 10^{-3}$ | $7.51 \times 10^{-1}$ | $4.77 \times 10^{-1}$ |

B.4 MORE RESULTS ABOUT WEIGHT DECAY

We used the same data that was collected for Figure 3 to analyze whether our experimental testbed demonstrates transfer over $\tau_{\text{epoch}}$, as is suggested by (Wang & Aitchison, 2024; ...

[30]

As in the case of weight decay transfer (see Figure 3), we find that our suggested implementation outperforms both the standard parameterization and the vanilla Adam-µP implementation from Yang et al. (2022).

C LLM STATEMENT

We did not use LLMs in a significant way to aid our research during the completion of this work. Our LLM usage did not extend bey...
