If they did, the update could cancel the weights, causing the activations or backpropagated gradients to shrink across layers or training steps.

Width  Depth  Num Heads  Head Size  KV Heads  KV Reps
576    8      12         64         1         12
576    8      12         64         2         6
576    8      12         64         3         4
576    8      12         64         4         3
576    8      12         64         6         2
576    8      12         64         12        1

This assumption captures a basic stability property of high-dimensional neural networks: when · 2023

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

GQA-μP: The maximal parameterization update for grouped query attention

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

Derives μP scalings for GQA via promoted spectral-norm definition of feature learning and a modified norm preserving scaling laws for non-full-rank matrices, with experiments showing learning-rate transfer.

citing papers explorer

GQA-μP: The maximal parameterization update for grouped query attention

cs.LG · 2026-05-14 · unverdicted · none · ref 23

Derives μP scalings for GQA via promoted spectral-norm definition of feature learning and a modified norm preserving scaling laws for non-full-rank matrices, with experiments showing learning-rate transfer.
