GQA-μP: The maximal parameterization update for grouped query attention
Pith reviewed 2026-05-19 16:31 UTC · model grok-4.3
The pith
A modified spectral norm for non-full-rank matrices lets maximal update parameterization apply to grouped-query attention.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By elevating spectral norm conditions to the definition of feature learning and adopting a modified spectral norm that preserves valid weight scaling for non-full-rank matrices, the authors derive μP scalings for grouped-query attention. These scalings produce learning-rate transfer across the GQA repetition hyperparameter and across weight-decay values, as verified in experiments.
What carries the argument
The modified spectral norm that preserves the valid scaling law of network weights when weight matrices are not full rank.
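To make the rank deficiency concrete, here is a minimal pure-Python sketch. It does not implement the paper's modified norm, whose definition this page does not reproduce. It only illustrates two standard linear-algebra facts about the effective key (or value) projection that results when GQA copies each KV head across a group of query heads: the copied matrix cannot exceed the rank of the original, and its ordinary spectral norm grows by the square root of the repetition factor. All names and sizes (`repeat_kv_heads`, `d_model = 8`, and so on) are illustrative assumptions.

```python
import math
import random

def repeat_kv_heads(w_kv, n_kv_heads, head_size, reps):
    """Copy each KV head's block of rows `reps` times (GQA head sharing)."""
    out = []
    for h in range(n_kv_heads):
        block = w_kv[h * head_size:(h + 1) * head_size]
        for _ in range(reps):
            out.extend(row[:] for row in block)
    return out

def rank(w, tol=1e-9):
    """Numerical rank via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in w]
    rows, cols = len(m), len(m[0])
    r = 0
    for c in range(cols):
        if r == rows:
            break
        pivot = max(range(r, rows), key=lambda i: abs(m[i][c]))
        if abs(m[pivot][c]) < tol:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, rows):
            f = m[i][c] / m[r][c]
            for j in range(c, cols):
                m[i][j] -= f * m[r][j]
        r += 1
    return r

def spectral_norm(w, iters=200):
    """Largest singular value, by power iteration on the Gram matrix W^T W."""
    n = len(w[0])
    gram = [[sum(row[i] * row[j] for row in w) for j in range(n)]
            for i in range(n)]
    v = [1.0 / math.sqrt(n)] * n
    top = 0.0
    for _ in range(iters):
        u = [sum(gram[i][j] * v[j] for j in range(n)) for i in range(n)]
        top = math.sqrt(sum(x * x for x in u))
        v = [x / top for x in u]
    return math.sqrt(top)

# Hypothetical toy sizes: 2 KV heads of size 2, each shared by 3 query heads.
random.seed(0)
d_model, n_kv_heads, head_size, reps = 8, 2, 2, 3
w_kv = [[random.gauss(0.0, 1.0) for _ in range(d_model)]
        for _ in range(n_kv_heads * head_size)]
w_eff = repeat_kv_heads(w_kv, n_kv_heads, head_size, reps)

rank_kv, rank_eff = rank(w_kv), rank(w_eff)
ratio = spectral_norm(w_eff) / spectral_norm(w_kv)
print(len(w_eff), len(w_eff[0]), rank_kv, rank_eff, ratio)
```

The effective matrix here has 12 rows and 8 columns, so full rank would be 8, yet copying rows leaves the rank at that of the 4-row original. The ordinary spectral norm changes with the repetition factor alone. That dependence is the kind of effect a rank-aware norm would have to account for.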
If this is right
- Learning rates tuned on one GQA configuration transfer to models with different numbers of query groups.
- Weight-decay hyperparameters also transfer without retuning when the GQA-μP rules are followed.
- Hyperparameter search compute drops because small-model optima apply directly to larger GQA models.
- Depth and weight-decay scalings emerge directly from the spectral definition without separate lazy-learning arguments.
Where Pith is reading between the lines
- The same modified-norm step may allow μP derivations for other low-rank attention variants such as multi-query attention.
- If the norm modification generalizes, practitioners could apply a single set of scaling rules across many attention architectures instead of deriving each case separately.
- The approach suggests testing whether the same spectral redefinition yields transfer for mixture-of-experts layers or other non-square weight structures common in large models.
Load-bearing premise
The modified spectral norm preserves the valid scaling law of network weights when weight matrices are not full rank.
What would settle it
Training runs in which learning rates tuned under one GQA repetition factor fail to transfer when the modified spectral norm is replaced by the ordinary spectral norm.
Original abstract
Hyperparameter transfer across model architectures dramatically reduces the amount of compute necessary for tuning large language models (LLMs). The maximal update parameterization (μP) ensures transfer through principled mathematical analysis but can be challenging to derive for new model architectures. Building on the spectral feature-learning view of Yang et al. (2023a), we make two advances. First, we promote spectral norm conditions on the weights from a heuristic to the definition of feature learning, and as a consequence arrive at the Complete-P depth and weight-decay scalings without recourse to lazy-learning. Second, we consider a modified spectral norm that preserves the valid scaling law of network weights when weight matrices are not full rank. This enables (to our knowledge, the first) derivation of μP scalings for grouped-query attention (GQA). We demonstrate the efficacy of our theoretical derivations by showing learning rate transfer across the GQA repetition hyperparameter as well as experiments regarding transfer over weight decay.
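For orientation, the spectral condition that the abstract builds on can be written as below. This is recalled from Yang et al. (2023a), not quoted from the paper under review, so the paper's notation and exact statement may differ. Here $W_\ell \in \mathbb{R}^{n_\ell \times n_{\ell-1}}$ is the weight matrix of layer $\ell$, $\Delta W_\ell$ is its update from one optimizer step, and $\|\cdot\|_*$ is the ordinary spectral norm.

```latex
\[
\|W_\ell\|_* = \Theta\!\left(\sqrt{\frac{n_\ell}{n_{\ell-1}}}\right),
\qquad
\|\Delta W_\ell\|_* = \Theta\!\left(\sqrt{\frac{n_\ell}{n_{\ell-1}}}\right).
\]
```

The abstract's first advance treats conditions of this kind as the definition of feature learning, and its second replaces $\|\cdot\|_*$ with a modified norm when $W_\ell$ is not full rank.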
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper extends the maximal update parameterization (μP) to grouped-query attention (GQA) by building on the spectral feature-learning framework of Yang et al. (2023a). It promotes spectral-norm conditions on weights from a heuristic to the definition of feature learning, thereby obtaining Complete-P scalings for depth and weight decay without invoking lazy learning. A modified spectral norm is then introduced to preserve the correct scaling law for weight matrices that are not full rank (as occurs due to the GQA repetition/grouping hyperparameter). The resulting GQA-μP scalings are validated by experiments demonstrating learning-rate transfer across the GQA repetition hyperparameter and across weight-decay values.
Significance. If the derivations are rigorous, the work would supply the first principled μP parameterization for GQA, a widely used architectural variant in modern LLMs, thereby reducing the compute required for hyperparameter transfer when scaling models that employ grouped attention. The definitional elevation of spectral conditions and the explicit treatment of rank deficiency could serve as a template for μP derivations in other attention or sparsity patterns. The reported LR-transfer experiments provide concrete evidence of practical utility, though their strength depends on the soundness of the underlying modified-norm construction.
major comments (1)
- [Modified spectral norm definition and GQA derivation] The section introducing the modified spectral norm (immediately after the promotion of spectral conditions to a definition): this construction is asserted to preserve the valid scaling law of network weights for non-full-rank matrices and is the explicit technical device that permits the GQA-μP derivation. No independent verification is supplied—e.g., an explicit rank-deficient limit, an artificial rank-reduction test that recovers the known full-rank μP scaling, or a direct comparison against the unmodified spectral norm under controlled rank deficiency. Because the entire GQA extension rests on this step, the absence of such a check makes the central theoretical claim difficult to assess from the given material.
minor comments (2)
- [Experiments] The abstract and experimental sections would benefit from explicit statements of the datasets, model sizes, and exact GQA repetition values used in the transfer experiments, together with quantitative metrics (e.g., loss curves or final perplexity) that allow readers to judge the magnitude of the observed transfer.
- [Theoretical development] Notation for the modified spectral norm should be introduced with a clear equation number and contrasted side-by-side with the standard spectral norm to make the precise modification transparent.
Simulated Author's Rebuttal
We thank the referee for their careful reading of our manuscript and for the constructive feedback. We address the major comment point-by-point below and are happy to revise the manuscript accordingly to strengthen the presentation of our theoretical results.
Referee: The section introducing the modified spectral norm (immediately after the promotion of spectral conditions to a definition): this construction is asserted to preserve the valid scaling law of network weights for non-full-rank matrices and is the explicit technical device that permits the GQA-μP derivation. No independent verification is supplied—e.g., an explicit rank-deficient limit, an artificial rank-reduction test that recovers the known full-rank μP scaling, or a direct comparison against the unmodified spectral norm under controlled rank deficiency. Because the entire GQA extension rests on this step, the absence of such a check makes the central theoretical claim difficult to assess from the given material.
Authors: We appreciate the referee pointing out the need for more explicit verification of the modified spectral norm construction. In the paper, the modified spectral norm is motivated and derived to ensure that the feature learning condition (promoted to a definition) holds for the rank-deficient weight matrices that arise in GQA due to the repetition of key and value heads. The derivation ensures that the scaling of the learning rate and other hyperparameters remains consistent with the full-rank case, adjusted for the grouping factor. While the overall GQA-μP is validated through learning rate transfer experiments across different repetition hyperparameters, we agree that an independent check of the norm itself would be beneficial. In the revised manuscript, we will add a new subsection providing: (1) an explicit rank-deficient limit analysis showing how the modified norm recovers the correct μP scaling laws, and (2) a controlled numerical test where we apply artificial rank reduction to a weight matrix and compare the behavior under modified vs. standard spectral norm. This will directly address the concern and make the technical device more transparent. revision: yes
Circularity Check
Promoting spectral conditions to a definition, and choosing a modified norm that preserves the scaling, reduces GQA-μP to a construction.
specific steps
- Self-definitional [Abstract]
"First, we promote spectral norm conditions on the weights from a heuristic to the definition of feature learning, and as a consequence arrive at the Complete-P depth and weight-decay scalings without recourse to lazy-learning. Second, we consider a modified spectral norm that preserves the valid scaling law of network weights when weight matrices are not full rank. This enables (to our knowledge, the first) derivation of μP scalings for grouped-query attention (GQA)."
The spectral conditions are promoted to the definition of feature learning; the scalings are then stated to follow 'as a consequence.' The modified norm is introduced specifically because it 'preserves the valid scaling law' under the rank reduction of GQA. Both the definition and the modification are therefore chosen to make the desired Complete-P and GQA-μP results hold, rendering the derivation tautological with respect to these choices rather than derived from prior independent premises.
full rationale
The paper's two stated advances are (1) elevating spectral-norm conditions to the definition of feature learning, from which Complete-P scalings follow directly, and (2) introducing a modified spectral norm explicitly asserted to preserve the scaling law under rank deficiency induced by GQA. Both steps are load-bearing for the claimed first-principles derivation of μP for grouped-query attention. Because the modification is defined to achieve preservation and the feature-learning definition is chosen to yield the target scalings, the central results reduce to the inputs by construction rather than by independent derivation. Experiments on LR transfer are presented as validation but do not retroactively make the definitional steps non-circular. No fitted parameters or self-citations are shown to be the sole support, so the score remains moderate.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: spectral norm conditions on weights constitute the definition of feature learning rather than a heuristic.
invented entities (1)
- Modified spectral norm for non-full-rank weight matrices (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean: reality_from_one_distinction (unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
we consider a modified spectral norm that preserves the valid scaling law of network weights when weight matrices are not full rank. This enables ... derivation of μP scalings for grouped-query attention (GQA)
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Joshua Ainslie, James Lee-Thorp, Michiel De Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245.
- [2] Maksym Andriushchenko, Francesco D’Angelo, Aditya Varre, and Nicolas Flammarion. Why do we need weight decay in modern deep learning? arXiv preprint arXiv:2310.04415. https://arxiv.org/abs/2310.04415
- [3] Shane Bergsma, Nolan Dey, Gurpreet Gosal, Gavia Gray, Daria Soboleva, and Joel Hestness. Power lines: Scaling laws for weight decay and batch size in LLM pre-training. arXiv preprint arXiv:2505.13738. Advances in Neural Information Processing Systems (NeurIPS 2025).
- [4] Nolan Dey, Gurpreet Gosal, Hemant Khachane, William Marshall, Ribhu Pathria, Marvin Tom, Joel Hestness, et al. Cerebras-GPT: Open compute-optimal language models trained on the Cerebras wafer-scale cluster. arXiv preprint arXiv:2304.03208.
- [5] Nolan Dey, Bin Claire Zhang, Lorenzo Noci, Mufan Li, Blake Bordelon, Shane Bergsma, Cengiz Pehlevan, Boris Hanin, and Joel Hestness. Don’t be lazy: CompleteP enables compute-efficient deep transformers. arXiv preprint arXiv:2505.01618.
- [6] Katie Everett, Lechao Xiao, Mitchell Wortsman, Alexander A Alemi, Roman Novak, Peter J Liu, Izzeddin Gur, Jascha Sohl-Dickstein, Leslie Pack Kaelbling, Jaehoon Lee, et al. Scaling exponents across parameterizations and optimizers. arXiv preprint arXiv:2407.05872.
- [7] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.
- [8] Gaussian Error Linear Units (GELUs). URL https://arxiv.org/abs/1606.08415.
  Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. An empirical analysis of compute-optimal large language model training. Advances in Neural Information Processing Systems, 35:30016–30030.
- [9] Albert Qiaochu Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B. ArXiv, abs/23...
- [10] URL https://kellerjordan.github.io/posts/muon/.
  Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- [11] Zhengzhong Liu, Liping Tang, Linghao Jin, Haonan Li, Nikhil Ranjan, Desai Fan, Shaurya Rohatgi, Richard Fan, Omkar Pangarkar, Huijuan Wang, et al. K2-V2: A 360-open, reasoning-enhanced LLM. arXiv preprint arXiv:2512.06201.
- [12] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
- [13] Bruno Mlodozeniec, Pierre Ablin, Louis Béthune, Dan Busbridge, Michal Klein, Jason Ramapuram, and Marco Cuturi. Completed hyperparameter transfer across modules, width, depth, batch and duration. arXiv preprint arXiv:2512.22382.
- [14] Saaketh Narayan, Abhay Gupta, Mansheej Paul, and Davis Blalock. µnit Scaling: Simple and scalable FP8 LLM training. arXiv preprint arXiv:2502.05967.
- [15] Fabian Schaipp. How to jointly tune learning rate and weight decay for AdamW. https://fabian-sp.github.io/posts/2024/02/decoupling/, 2024.
- [16] Xi Wang and Laurence Aitchison. How to set AdamW's weight decay as you scale model and dataset size. arXiv preprint arXiv:2405.13698.
- [17] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388.
- [18] Greg Yang. Tensor programs II: Neural tangent kernel for any architecture. arXiv preprint arXiv:2006.14548, 2020a.
  Greg Yang. Tensor programs III: Neural matrix laws. arXiv preprint arXiv:2009.10685, 2020b.
  Greg Yang and Edward J Hu. Tensor programs IV: Feature learning in infinite-width neural networks. In International Conference on Machine Learning, pp...
- [19] Greg Yang, Edward J Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. Tensor programs V: Tuning large neural networks via zero-shot hyperparameter transfer. arXiv preprint arXiv:2203.03466.
- [20] Greg Yang, James B Simon, and Jeremy Bernstein. A spectral condition for feature learning. arXiv preprint arXiv:2310.17813, 2023a.
  Greg Yang, Dingli Yu, Chen Zhu, and Soufiane Hayou. Tensor programs VI: Feature learning in infinite-depth neural networks. arXiv preprint arXiv:2310.02244, 2023b.
  Yong-Qua Yin, Zhi-Dong Bai, and Pathak R Krishnaiah. On the li...
- [21] Chenyu Zheng, Rongzhen Wang, Xinyu Zhang, and Li Chongxuan. Spectral condition for µP under width-depth scaling. arXiv preprint arXiv:2603.00541v2.
- [22] A ADDITIONAL MATHEMATICAL DETAILS
  A.1 DERIVATION FOR ADAM
  We demonstrate the applicability of our framework by re-deriving the µP scalings for Adam. Recall that the Adam optimizer (Kingma & Ba, 2014) uses hyperparameters $\beta_1$, $\beta_2$, $\varepsilon$, and $\eta$ and has its optimization steps given by the following components: $g_t = \nabla_W f(W_{t-1})$, $m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$, $v_t = \beta_2 v_{t-1} + ...$
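The excerpt above breaks off partway through the Adam recursion. As a reminder of the full step, here is a minimal sketch of standard Adam as described by Kingma & Ba, written for a flat list of parameters. It includes bias correction, which the truncated excerpt does not show, so treat that part as an assumption about what the appendix uses. The function name `adam_step` and the default values are illustrative, not taken from the paper.

```python
import math

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One standard Adam update on flat lists; t is the 1-based step count."""
    # First- and second-moment running averages (the m_t and v_t recursions).
    m = [beta1 * mi + (1.0 - beta1) * gi for mi, gi in zip(m, g)]
    v = [beta2 * vi + (1.0 - beta2) * gi * gi for vi, gi in zip(v, g)]
    # Bias correction (assumption: not visible in the truncated excerpt).
    m_hat = [mi / (1.0 - beta1 ** t) for mi in m]
    v_hat = [vi / (1.0 - beta2 ** t) for vi in v]
    # Parameter update with learning rate lr (eta) and stabilizer eps.
    w = [wi - lr * mh / (math.sqrt(vh) + eps)
         for wi, mh, vh in zip(w, m_hat, v_hat)]
    return w, m, v

# Usage: on the first step the update is roughly lr * sign(gradient).
w0, m0, v0 = [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]
w1, m1, v1 = adam_step(w0, [1.0, -2.0], m0, v0, t=1)
print(w1)
```

Because the first update has magnitude close to `lr` regardless of gradient scale, how `lr` and `eps` are scaled with width matters for μP-style derivations. The specific scalings the paper derives are not reproduced here.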
- [23] Width | Depth | Num Heads | Head Size | KV Heads | KV Reps
  576 | 8 | 12 | 64 | 1 | 12
  576 | 8 | 12 | 64 | 2 | 6
  576 | 8 | 12 | 64 | 3 | 4
  576 | 8 | 12 | 64 | 4 | 3
  576 | 8 | 12 | 64 | 6 | 2
  576 | 8 | 12 | 64 | 12 | 1
  This assumption captures a basic stability property of high-dimensional neural networks: when an update is added to a weight matrix, the update and the existing weights should not systematically point in opposi...
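In the table at the start of this entry, each KV Heads value multiplied by its KV Reps value gives the 12 query heads, so the swept configurations are exactly the divisors of the head count. The short sketch below enumerates them. The helper name `gqa_configs` is hypothetical, and the divisibility requirement is the usual GQA convention, not something quoted from the paper.

```python
def gqa_configs(num_heads):
    """Return (kv_heads, kv_reps) pairs with kv_heads * kv_reps == num_heads."""
    return [(kv, num_heads // kv)
            for kv in range(1, num_heads + 1)
            if num_heads % kv == 0]

configs = gqa_configs(12)
for kv_heads, kv_reps in configs:
    print(f"KV heads = {kv_heads:2d}, KV reps = {kv_reps:2d}")
```

With 12 heads, the extremes are multi-query attention (1 KV head, 12 repetitions) and ordinary multi-head attention (12 KV heads, 1 repetition).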
- [24] $B = 0.000733 \times \sqrt{n_{\text{tokens}}}$. (16)
  Equation 16 follows the isoloss sweep methodology of Bergsma et al. (2026) but uses a rounded exponent for tractability. Specifically, Bergsma et al. (2026) estimate a scaling exponent of 0.46 and recommend rounding to 0.5. Since we ran independent sweeps on our own data, equation 16 is specific to our setup but aligns structu...
- [25] We set the base weight decay to be $\lambda_0 = 0.1$. We use a base Adam $\varepsilon$ of $10^{-9}/n$, where $n$ is the embedding dimension, to match the predicted Adam $\varepsilon$ scaling of Dey et al. (2025). We take three runs for each data point, using seeds 42, 43, 44 for reproducibility.
  Table 4: Model configurations for the GQA transfer experiments from Figure
- [26] The configurations used for this experiment can be found in Table
  Run | Params | Non-Embd Params | Width | Depth | Num Heads | Head Size | KV Heads | KV Reps | TPP | Dataset Size (Tokens) | Dataset Size (Sequences) | Batch Size (Tokens) | Batch Size (Sequences) | Iterations
  kvrt1 | 125.55 | 80.62 | 768 | 7 | 12 | 64 | 1 | 12 | 10 | 806200000 | 98413 | 262144 | 32 | 3075
  kvrt2 | 126.23 | 81.31 | 768 | 7 | 12 | 64 | 2 | 6 | 10 | 813100000 | 99255 | 262144 | 32 | 3102
  kvrt3 | 126.92 | 82 | 768 | 7 | 12 | 64 | 3 | 4 | 10 | 820000000 | 1000...
- [27] Run | Params | Non-Embd Params | Width | Depth | Num Heads | Head Size | KV Heads | KV Reps | TPP | Dataset Size (Tokens) | Dataset Size (Sequences) | Batch Size (Tokens) | Batch Size (Sequences) | Iters.
  jwd-small | 48.82 | 26.38 | 384 | 4 | 6 | 64 | 6 | 1 | 3 | 79140000 | 9661 | 81920 | 10 | 966
  jwd-medium | 125.96 | 81.07 | 768 | 6 | 12 | 64 | 12 | 1 | 3 | 243210000 | 29689 | 147456 | 18 | 1649
  jwd-large | 237.17 | 177.31 | 1024 | 10 | 16 | 64 | 16 | 1 | 3 | 53193...
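The complete rows of the two configuration tables are internally consistent under a few assumptions, and the sketch below checks that arithmetic. The assumptions are inferred from the numbers, not stated in the excerpt: parameter counts are in millions, TPP means tokens per non-embedding parameter, the sequence length is 8192 tokens (the ratio of batch tokens to batch sequences), and sequence and iteration counts are rounded to the nearest integer. The function name `run_budget` is hypothetical.

```python
SEQ_LEN = 8192  # assumption: inferred from batch tokens / batch sequences

def run_budget(non_embd_params_m, tpp, batch_seqs, seq_len=SEQ_LEN):
    """Derive dataset and iteration counts from non-embedding params (millions)."""
    tokens = round(non_embd_params_m * 1e6 * tpp)  # dataset size in tokens
    sequences = round(tokens / seq_len)            # dataset size in sequences
    batch_tokens = batch_seqs * seq_len            # batch size in tokens
    iterations = round(sequences / batch_seqs)     # optimizer steps
    return tokens, sequences, batch_tokens, iterations

# Rows copied from the tables: (non-embd params, TPP, batch size in sequences).
kvrt1 = run_budget(80.62, 10, 32)
kvrt2 = run_budget(81.31, 10, 32)
jwd_small = run_budget(26.38, 3, 10)
jwd_medium = run_budget(81.07, 3, 18)
print(kvrt1, kvrt2, jwd_small, jwd_medium)
```

Under these assumptions the derived values match the table entries for every complete row. The truncated rows (kvrt3, jwd-large) are left out because their remaining columns are not visible in the excerpt.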
- [28] [Figure: grid of panels titled SP (n_embd=384), SP (n_embd=768), SP (n_embd=1024), µP Unit WD (n_embd=384), µP Unit WD (n_embd=768), ...; horizontal axis: epoch (log scale); vertical axis: Learning Rate (log scale).]
  The top row is standard parameterization
- [29] Implementation | Var. LR | Var. WD | Var. Loss
  SP | 1.34 | 3.83×10^−1 | 4.87×10^−1
  µP | 4.75×10^−2 | 1.38 | 4.87×10^−1
  µP + WD | 5.54×10^−3 | 7.51×10^−1 | 4.77×10^−1
  B.4 MORE RESULTS ABOUT WEIGHT DECAY
  We used the same data that was collected from Figure 3 to analyze whether or not our experimental testbed demonstrates transfer over $\tau_{\text{epoch}}$, as is suggested by (Wang & Aitchison, 2024; ...
- [30] As in the case of weight decay transfer (see Figure 3), we find that our suggested implementation outperforms both the standard parameterization and the vanilla Adam-µP implementation from Yang et al. (2022).
  C LLM STATEMENT
  We did not use LLMs in a significant way to aid our research during the completion of this work. Our LLM usage did not extend bey...