Pith Number

pith:4ZAE2CQ5

pith:2026:4ZAE2CQ5P2OAQA7JZEBMVDQNF5

not attested · not anchored · not stored · refs resolved

GQA-μP: The maximal parameterization update for grouped query attention

Alexander Moreno, Daria Soboleva, Eric Xing, Huijuan Wang, Joel Hestness, Kyle R. Chickering, Mengxi Wu, Muhao Chen, Xuezhe Ma, Zhengzhong Liu

A modified spectral norm for non-full-rank matrices lets maximal update parameterization apply to grouped-query attention.

arxiv:2605.15290 v1 · 2026-05-14 · cs.LG · cs.AI


Add to your LaTeX paper

\usepackage{pith}
\pithnumber{4ZAE2CQ5P2OAQA7JZEBMVDQNF5}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

We demonstrate the efficacy of our theoretical derivations by showing learning rate transfer across the GQA repetition hyperparameter as well as experiments regarding transfer over weight decay.

C2 · weakest assumption

The modified spectral norm preserves the valid scaling law of network weights when weight matrices are not full rank; this premise is invoked to enable the GQA derivation and is stated as the key technical step after promoting spectral conditions to a definition.

C3 · one-line summary

Derives μP scalings for GQA via promoted spectral-norm definition of feature learning and a modified norm preserving scaling laws for non-full-rank matrices, with experiments showing learning-rate transfer.

References

30 extracted · 30 resolved · 8 Pith anchors

[1] GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints · arXiv:2305.13245

[2] Why do we need weight decay in modern deep learning? · arXiv:2310.04415

[3] Power lines: Scaling laws for weight decay and batch size in LLM pre-training

[4] Cerebras-GPT: Open compute-optimal language models trained on the Cerebras wafer-scale cluster

[5] Don’t be lazy: CompleteP enables compute-efficient deep transformers, January 2026

Formal links

1 machine-checked theorem link

Receipt and verification

First computed	2026-05-20T00:00:50.909862Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

e6404d0a1d7e9c0803e9c902ca8e0d2f488b932ee2a429751a1c5bc4e859073d

Aliases

arxiv: 2605.15290 · arxiv_version: 2605.15290v1 · doi: 10.48550/arxiv.2605.15290 · pith_short_12: 4ZAE2CQ5P2OA · pith_short_16: 4ZAE2CQ5P2OAQA7J · pith_short_8: 4ZAE2CQ5
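
The short aliases above are prefixes of the full Pith Number, and for this record the full number matches the first 26 characters of the RFC 4648 Base32 encoding of the canonical SHA-256 hash. The page does not state this derivation, so the sketch below is an observation about this one record, not a documented rule; `pith_aliases` is a hypothetical helper name.

```python
import base64

# Canonical hash as printed on this record.
CANONICAL_HASH = "e6404d0a1d7e9c0803e9c902ca8e0d2f488b932ee2a429751a1c5bc4e859073d"


def pith_aliases(hex_digest: str) -> dict:
    """Derive the identifier and its short forms from a hex SHA-256 digest.

    Assumption (inferred from this record only): the identifier is the first
    26 characters of the standard Base32 encoding of the raw digest bytes.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    full = encoded[:26]
    return {
        "pith": full,
        "pith_short_16": full[:16],
        "pith_short_12": full[:12],
        "pith_short_8": full[:8],
    }


aliases = pith_aliases(CANONICAL_HASH)
for name, value in aliases.items():
    print(f"{name}: {value}")
```

If the inference holds in general, a mirror could recompute any alias from the canonical hash alone, without a lookup table.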

Agent API


Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/4ZAE2CQ5P2OAQA7JZEBMVDQNF5 \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: e6404d0a1d7e9c0803e9c902ca8e0d2f488b932ee2a429751a1c5bc4e859073d

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "9a236153d0ed19a973f09d5e91d5efffac6932927864cff9ce38cb74bc5e31b8",
    "cross_cats_sorted": [
      "cs.AI"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2026-05-14T18:03:16Z",
    "title_canon_sha256": "4f1ed45308da2ef4127da20d11a5a70c05ad7e972ddcc4c497973587c1bbc514"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.15290",
    "kind": "arxiv",
    "version": 1
  }
}
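
The canonicalization used by the shell one-liner (sorted keys, compact separators, non-ASCII kept as-is, UTF-8 bytes) can also be run offline against the record printed above. This is a minimal sketch: `canonical_sha256` is a hypothetical helper name, and the record literal is copied from this page. If the printed record is byte-equivalent to what the resolver returns, the digest should equal the canonical hash; that was not checked here, so compare the printed value yourself.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Hash a record the same way the shell one-liner on this page does."""
    payload = json.dumps(
        record,
        sort_keys=True,          # key order must not affect the hash
        separators=(",", ":"),   # no whitespace between tokens
        ensure_ascii=False,      # keep non-ASCII characters unescaped
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Canonical record as shown on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "9a236153d0ed19a973f09d5e91d5efffac6932927864cff9ce38cb74bc5e31b8",
        "cross_cats_sorted": ["cs.AI"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.LG",
        "submitted_at": "2026-05-14T18:03:16Z",
        "title_canon_sha256": "4f1ed45308da2ef4127da20d11a5a70c05ad7e972ddcc4c497973587c1bbc514",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.15290", "kind": "arxiv", "version": 1},
}

digest = canonical_sha256(record)
print(digest)  # compare against the "Canonical hash" value on this page
```

Sorting keys before hashing is what makes the digest independent of how a mirror happens to order the JSON fields.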