pith. sign in

Why Adam Works Better with $\beta_1 = \beta_2$: The Missing Gradient Scale Invariance Principle

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it
abstract

Adam has been at the core of large-scale training for almost a decade, yet a simple empirical fact remains unaccounted for: both validation scores and the qualitative behaviour of the training runs improve when the momentum parameters satisfy $\beta_{1}=\beta_{2}$. Some recent studies have reported this pattern, but there is still no explanation for why this choice helps. We show that this choice is closely tied to a structural property that we refer to as \textit{gradient scale invariance}. We formalize this notion and prove that Adam becomes gradient scale invariant of first order if and only if $\beta_{1}=\beta_{2}$. This perspective places the balanced regime of Adam in direct alignment with the design principles underlying several recent optimizers that explicitly enforce scale-robust updates. The theory is supported by experiments across vision and language tasks, and across different architectural families, in which rescaling the gradient has a markedly smoother effect on the update when $\beta_{1}=\beta_{2}$. Overall, our results offer a coherent explanation for an open question in the behavior of Adam and provide a simple principle that helps guide the design of future optimizers.

fields

cs.LG 2

years

2026 2

verdicts

UNVERDICTED 2

representative citing papers

Refresh-Scaling the Memory of Balanced Adam

cs.LG · 2026-05-11 · unverdicted · novelty 5.0

Setting β in balanced Adam to achieve a refresh count R_β ≈1000 based on effective learning horizon T_ES improves validation robustness over fixed-β baselines across 11 vision and language experiments.

citing papers explorer

Showing 2 of 2 citing papers.

  • Complete-muE: Optimal Hyperparameter Transfer and Scaling for MoE Models cs.LG · 2026-05-22 · unverdicted · none · ref 29 · internal anchor

    Complete-muE combines active-width μP and activated-expert scaling to transfer hyperparameters across dense FFN, dense MoE, and sparse MoE while covering changes in experts, capacity, width, depth, batch size, and duration.

  • Refresh-Scaling the Memory of Balanced Adam cs.LG · 2026-05-11 · unverdicted · none · ref 2 · internal anchor

    Setting β in balanced Adam to achieve a refresh count R_β ≈1000 based on effective learning horizon T_ES improves validation robustness over fixed-β baselines across 11 vision and language experiments.