How to set AdamW 's weight decay as you scale model and dataset size

Xi Wang, Laurence Aitchison · 2024 · arXiv 2405.13698

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

representative citing papers

GQA-{\mu}P: The maximal parameterization update for grouped query attention

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

Derives μP scalings for GQA via promoted spectral-norm definition of feature learning and a modified norm preserving scaling laws for non-full-rank matrices, with experiments showing learning-rate transfer.

A Two-Parameter Weibull Framework for Diagnosing Transformer Weight Distributions

cs.LG · 2026-05-17 · unverdicted · novelty 6.0

A Weibull diagnostic framework classifies transformer weight matrices into consistent functional classes via the shape parameter k and tracks training progress via the scale parameter lambda across multiple architectures.

Rethinking Language Model Scaling under Transferable Hypersphere Optimization

cs.LG · 2026-03-30 · conditional · novelty 6.0

HyperP transfers optimal learning rates across model width, depth, tokens, and MoE granularity under Frobenius-sphere constraints, delivering stable scaling and 1.58x efficiency gains.

citing papers explorer

Showing 3 of 3 citing papers.

GQA-{\mu}P: The maximal parameterization update for grouped query attention cs.LG · 2026-05-14 · unverdicted · none · ref 16
Derives μP scalings for GQA via promoted spectral-norm definition of feature learning and a modified norm preserving scaling laws for non-full-rank matrices, with experiments showing learning-rate transfer.
A Two-Parameter Weibull Framework for Diagnosing Transformer Weight Distributions cs.LG · 2026-05-17 · unverdicted · none · ref 32
A Weibull diagnostic framework classifies transformer weight matrices into consistent functional classes via the shape parameter k and tracks training progress via the scale parameter lambda across multiple architectures.
Rethinking Language Model Scaling under Transferable Hypersphere Optimization cs.LG · 2026-03-30 · conditional · none · ref 22
HyperP transfers optimal learning rates across model width, depth, tokens, and MoE granularity under Frobenius-sphere constraints, delivering stable scaling and 1.58x efficiency gains.

How to set AdamW 's weight decay as you scale model and dataset size

fields

years

verdicts

representative citing papers

citing papers explorer