Why Adam Works Better with β₁ = β₂: The Missing Gradient Scale Invariance Principle
Pith reviewed 2026-05-16 10:13 UTC · model grok-4.3
The pith
Adam becomes first-order gradient scale invariant if and only if its momentum parameters β1 and β2 are set equal.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Adam becomes gradient scale invariant of first order if and only if β1 equals β2. This structural property accounts for the improved validation performance and smoother training dynamics observed when the momentum parameters are balanced, placing that choice in direct alignment with the design principles of scale-robust optimizers.
What carries the argument
First-order gradient scale invariance: the property that the optimizer's parameter update direction remains unchanged (to first order) under uniform rescaling of the gradient.
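To make the force of "first order" concrete (an editorial gloss, not text from the paper): if every gradient in the history is rescaled by the same constant, Adam's moment estimates rescale exactly and the normalized update is unchanged for any β1, β2, up to the ε stabilizer. The interesting content of the theorem therefore concerns rescalings that the moving averages have not yet absorbed, such as a scale change applied from the current step onward.

```latex
% Full-history rescaling: exact invariance for any beta_1, beta_2
% (up to the epsilon stabilizer), by linearity of the moment recursions.
g_s \mapsto c\, g_s \;\;\forall s \le t
\;\Longrightarrow\;
m_t \mapsto c\, m_t, \quad v_t \mapsto c^{2} v_t, \quad
\frac{\hat m_t}{\sqrt{\hat v_t} + \varepsilon} \;\mapsto\;
\frac{c\, \hat m_t}{c\sqrt{\hat v_t} + \varepsilon}
\;\approx\; \frac{\hat m_t}{\sqrt{\hat v_t} + \varepsilon}.
```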
If this is right
- Parameter updates stay consistent under changes in gradient magnitude, yielding more stable training trajectories.
- Validation scores improve consistently across vision and language tasks when the momentum rates are balanced.
- The equal-β regime aligns Adam with newer optimizers built around explicit scale robustness.
- Rescaling the gradient produces markedly smoother changes in the update when β1 equals β2.
Where Pith is reading between the lines
- New optimizers could be designed by directly enforcing first-order scale invariance instead of tuning separate β values.
- The same invariance principle may explain hyperparameter sensitivity in other momentum-based methods.
- Higher-order versions of the invariance could be derived and tested for additional robustness gains.
- The property offers a concrete test for whether a proposed optimizer inherits Adam's practical advantages.
Load-bearing premise
The formal definition of first-order gradient scale invariance accurately captures the practical training benefits seen when β1 equals β2.
What would settle it
An experiment in which β1 equals β2 yet rescaling the gradient by a large constant still produces a noticeably different update direction, or in which unequal β values preserve the update direction under rescaling.
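A minimal numerical sketch of such a test, under assumptions that are illustrative rather than the paper's protocol: a fixed gradient direction, a single mid-run rescaling by c, and a relative-drift metric on the normalized update.

```python
# Sketch: rescale a slowly varying gradient stream from some step onward and
# measure how much Adam's normalized update direction drifts, for balanced
# (beta1 == beta2) vs. unbalanced momentum parameters.
import numpy as np

def adam_updates(grads, beta1, beta2, eps=1e-8):
    """Yield the bias-corrected Adam update direction m_hat / (sqrt(v_hat) + eps)."""
    m = np.zeros_like(grads[0])
    v = np.zeros_like(grads[0])
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        yield m_hat / (np.sqrt(v_hat) + eps)

rng = np.random.default_rng(0)
g = rng.normal(size=1000)          # fixed gradient direction (slowly varying regime)
warmup, horizon, c = 200, 50, 2.0  # rescale the gradient by c after `warmup` steps

base = [g] * (warmup + horizon)
rescaled = [g] * warmup + [c * g] * horizon

for beta1, beta2 in [(0.95, 0.95), (0.9, 0.999)]:
    drifts = [np.linalg.norm(u - u_c) / np.linalg.norm(u)
              for u, u_c in zip(adam_updates(base, beta1, beta2),
                                adam_updates(rescaled, beta1, beta2))]
    print(f"beta1={beta1}, beta2={beta2}: max relative drift = {max(drifts):.3f}")
```

Under these assumptions the balanced run should drift only at second order in the rescaling, while the unbalanced run drifts at first order during the transient in which v_t adapts more slowly than m_t; a large drift in the balanced run, or none in the unbalanced run, would count against the claim.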
Original abstract
Adam has been at the core of large-scale training for almost a decade, yet a simple empirical fact remains unaccounted for: both validation scores and the qualitative behaviour of the training runs improve when the momentum parameters satisfy $\beta_{1}=\beta_{2}$. Some recent studies have reported this pattern, but there is still no explanation for why this choice helps. We show that this choice is closely tied to a structural property that we refer to as \textit{gradient scale invariance}. We formalize this notion and prove that Adam becomes gradient scale invariant of first order if and only if $\beta_{1}=\beta_{2}$. This perspective places the balanced regime of Adam in direct alignment with the design principles underlying several recent optimizers that explicitly enforce scale-robust updates. The theory is supported by experiments across vision and language tasks, and across different architectural families, in which rescaling the gradient has a markedly smoother effect on the update when $\beta_{1}=\beta_{2}$. Overall, our results offer a coherent explanation for an open question in the behavior of Adam and provide a simple principle that helps guide the design of future optimizers.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that Adam exhibits first-order gradient scale invariance—i.e., the normalized update direction u_t satisfies u_t(g) = u_t(c·g) for scalar c>0—if and only if β1=β2. It formalizes this property, proves the iff statement, and reports experiments across vision and language tasks showing that gradient rescaling produces smoother update behavior precisely in the balanced-β regime. The result is positioned as explaining an empirical heuristic and aligning Adam with scale-robust optimizer designs.
Significance. If the formalization and proof are correct, the work supplies a clean, parameter-free explanation for a decade-old tuning observation and a concrete design principle for future optimizers. The alignment with recent scale-invariant methods is a useful conceptual contribution even if the practical scope of the first-order definition is limited.
major comments (2)
- [§3.2, Theorem 1] Definition 1 and Theorem 1: the first-order invariance condition u_t(g)=u_t(c·g) is stated for the normalized direction, yet the second-moment EMA v_t scales as c² and therefore sqrt(v_t) scales as c; the proof must explicitly demonstrate that β1=β2 cancels this scaling in the final normalized update, or else restrict the claim to regimes in which v_t is approximately constant.
- [§4.1] Experiments on gradient rescaling: the reported qualitative smoothness is consistent with the claim, but the quantitative tables do not report the norm of the difference ||u_t(g)-u_t(c·g)|| or its variance across layers and training steps; without these metrics it is difficult to judge how completely the invariance is realized in practice.
minor comments (2)
- [Abstract] The sentence on experimental support would be strengthened by a single quantitative statement (e.g., “update-norm variance reduced by X % when β1=β2”).
- [Notation] The update rule (Eq. 2) should explicitly label the bias-corrected moments m̂_t and v̂_t to avoid ambiguity when the proof later refers to “the normalized update”; the standard bias-corrected form is written out after this list for reference.
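For reference, the standard bias-corrected Adam update as usually written (the original Kingma and Ba form; whether Eq. 2 in the manuscript matches it exactly is an assumption on our part):

```latex
\begin{aligned}
m_t &= \beta_1\, m_{t-1} + (1-\beta_1)\, g_t,
& v_t &= \beta_2\, v_{t-1} + (1-\beta_2)\, g_t^{2},\\
\hat m_t &= \frac{m_t}{1-\beta_1^{t}},
& \hat v_t &= \frac{v_t}{1-\beta_2^{t}},\\
\theta_t &= \theta_{t-1} - \alpha\, \frac{\hat m_t}{\sqrt{\hat v_t} + \varepsilon}. &&
\end{aligned}
```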
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the presentation of the first-order invariance property. We address each major point below and will revise the manuscript accordingly.
Point-by-point responses
- Referee: [§3.2, Theorem 1] Definition 1 and Theorem 1: the first-order invariance condition u_t(g)=u_t(c·g) is stated for the normalized direction, yet the second-moment EMA v_t scales as c² and therefore sqrt(v_t) scales as c; the proof must explicitly demonstrate that β1=β2 cancels this scaling in the final normalized update, or else restrict the claim to regimes in which v_t is approximately constant.
Authors: We agree that the scaling argument deserves a more explicit derivation. In the revised proof of Theorem 1 we will add an inductive step showing that m_t scales linearly with c while v_t scales quadratically, and we will compute the derivative of the normalized update u_t(c) with respect to the scale factor at c=1. The first-order term of this expansion vanishes if and only if β1=β2, confirming the claimed invariance without requiring v_t to be constant. The revised text will also state clearly that “first-order” refers to local invariance under infinitesimal rescaling. (A simplified version of this expansion is sketched after these responses.) revision: yes
- Referee: [§4.1] Experiments on gradient rescaling: the reported qualitative smoothness is consistent with the claim, but the quantitative tables do not report the norm of the difference ||u_t(g)-u_t(c·g)|| or its variance across layers and training steps; without these metrics it is difficult to judge how completely the invariance is realized in practice.
Authors: We appreciate the suggestion for quantitative validation. The revised §4.1 will include new tables (or supplementary figures) reporting the mean and variance of ||u_t(g) − u_t(c·g)|| computed across layers and training steps for several values of c and for both balanced and unbalanced β regimes. These metrics will directly quantify the degree to which the invariance holds in practice. revision: yes
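A simplified version of the expansion described in the first response, under assumptions that are ours rather than the paper's: a constant gradient g, the ε stabilizer and bias correction dropped, and a rescaling by c = 1 + ε applied for the last k steps.

```latex
\begin{aligned}
m_t &= g\bigl(\beta_1^{k} + (1-\beta_1^{k})(1+\epsilon)\bigr), \qquad
v_t = g^{2}\bigl(\beta_2^{k} + (1-\beta_2^{k})(1+\epsilon)^{2}\bigr),\\
u_t &= \frac{m_t}{\sqrt{v_t}}
     = \operatorname{sign}(g)\,
       \frac{1 + (1-\beta_1^{k})\,\epsilon}
            {\sqrt{1 + 2\,(1-\beta_2^{k})\,\epsilon + O(\epsilon^{2})}}
     = \operatorname{sign}(g)\,\Bigl(1 + \bigl(\beta_2^{k} - \beta_1^{k}\bigr)\epsilon + O(\epsilon^{2})\Bigr).
\end{aligned}
```

The O(ε) term vanishes for every k exactly when β1 = β2, which is the shape of cancellation the referee asked to see spelled out; the paper's actual proof presumably treats general gradient sequences rather than this constant-gradient caricature.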
Circularity Check
No circularity: proof derives from newly introduced definition
Full rationale
The paper defines first-order gradient scale invariance as a property on the normalized update direction and proves the iff statement that this holds precisely when β1=β2. This is a direct mathematical derivation from the stated definition and Adam's update equations; it does not reduce any prediction or central claim to a fitted parameter, prior self-citation, or ansatz smuggled in from the authors' own work. Experiments are presented as supporting evidence rather than as the load-bearing justification. The derivation chain is therefore self-contained against external benchmarks and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Gradient scale invariance of first order is a meaningful structural property that explains observed training improvements.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
Unclear relation between the paper passage and the cited Recognition theorem.
Linked passage: Theorem 3.3 ... R(t) = sign(g(t)) (1 + (τ2 − τ1)δ(t) + O(Λ² + Λ′)) ... iff τ1=τ2 (Corollary 3.6)
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean, theorem J_uniquely_calibrated_via_higher_derivative (tag: unclear)
Unclear relation between the paper passage and the cited Recognition theorem.
Linked passage: Definition 3.1 ... R_k invariant under λ·g_k for λ > 0
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- Refresh-Scaling the Memory of Balanced Adam: Choosing β in balanced Adam so the refresh count R_β is approximately 1000 reduces the worst-case validation gap by 33.4% and keeps all runs within 1% of their oracle compared with the best fixed-β baseline.
- Refresh-Scaling the Memory of Balanced Adam: Setting β in balanced Adam to achieve a refresh count R_β ≈ 1000 based on the effective learning horizon T_ES improves validation robustness over fixed-β baselines across 11 vision and language experiments.
discussion (0)