Why Adam Works Better with β₁ = β₂: The Missing Gradient Scale Invariance Principle
Pith reviewed 2026-05-16 10:13 UTC · model grok-4.3
The pith
Adam becomes first-order gradient scale invariant if and only if its momentum parameters β1 and β2 are set equal.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Adam becomes gradient scale invariant of first order if and only if β1 equals β2. This structural property accounts for the improved validation performance and smoother training dynamics observed when the momentum parameters are balanced, placing that choice in direct alignment with the design principles of scale-robust optimizers.
What carries the argument
First-order gradient scale invariance: the property that the optimizer's parameter update direction remains unchanged (to first order) under uniform rescaling of the gradient.
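To make the force of "first order" concrete (an editorial gloss, not text from the paper): if every gradient in the history is rescaled by the same constant, Adam's moment estimates rescale exactly and the normalized update is unchanged for any β1, β2, up to the ε stabilizer. The interesting content of the theorem therefore concerns rescalings that the moving averages have not yet absorbed, such as a scale change applied from the current step onward.

```latex
% Full-history rescaling: exact invariance for any beta_1, beta_2
% (up to the epsilon stabilizer), by linearity of the moment recursions.
g_s \mapsto c\, g_s \;\;\forall s \le t
\;\Longrightarrow\;
m_t \mapsto c\, m_t, \quad v_t \mapsto c^{2} v_t, \quad
\frac{\hat m_t}{\sqrt{\hat v_t} + \varepsilon} \;\mapsto\;
\frac{c\, \hat m_t}{c\sqrt{\hat v_t} + \varepsilon}
\;\approx\; \frac{\hat m_t}{\sqrt{\hat v_t} + \varepsilon}.
```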
If this is right
- Parameter updates stay consistent under changes in gradient magnitude, yielding more stable training trajectories.
- Validation scores improve consistently across vision and language tasks when the momentum rates are balanced.
- The equal-β regime aligns Adam with newer optimizers built around explicit scale robustness.
- Rescaling the gradient produces markedly smoother changes in the update when β1 equals β2.
Where Pith is reading between the lines
- New optimizers could be designed by directly enforcing first-order scale invariance instead of tuning separate β values.
- The same invariance principle may explain hyperparameter sensitivity in other momentum-based methods.
- Higher-order versions of the invariance could be derived and tested for additional robustness gains.
- The property offers a concrete test for whether a proposed optimizer inherits Adam's practical advantages.
Load-bearing premise
The formal definition of first-order gradient scale invariance accurately captures the practical training benefits seen when β1 equals β2.
What would settle it
An experiment in which β1 equals β2 yet rescaling the gradient by a large constant still produces a noticeably different update direction, or in which unequal β values preserve the update direction under rescaling.
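A minimal numerical sketch of such a test, under assumptions that are illustrative rather than the paper's protocol: a fixed gradient direction, a single mid-run rescaling by c, and a relative-drift metric on the normalized update.

```python
# Sketch: rescale a slowly varying gradient stream from some step onward and
# measure how much Adam's normalized update direction drifts, for balanced
# (beta1 == beta2) vs. unbalanced momentum parameters.
import numpy as np

def adam_updates(grads, beta1, beta2, eps=1e-8):
    """Yield the bias-corrected Adam update direction m_hat / (sqrt(v_hat) + eps)."""
    m = np.zeros_like(grads[0])
    v = np.zeros_like(grads[0])
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        yield m_hat / (np.sqrt(v_hat) + eps)

rng = np.random.default_rng(0)
g = rng.normal(size=1000)          # fixed gradient direction (slowly varying regime)
warmup, horizon, c = 200, 50, 2.0  # rescale the gradient by c after `warmup` steps

base = [g] * (warmup + horizon)
rescaled = [g] * warmup + [c * g] * horizon

for beta1, beta2 in [(0.95, 0.95), (0.9, 0.999)]:
    drifts = [np.linalg.norm(u - u_c) / np.linalg.norm(u)
              for u, u_c in zip(adam_updates(base, beta1, beta2),
                                adam_updates(rescaled, beta1, beta2))]
    print(f"beta1={beta1}, beta2={beta2}: max relative drift = {max(drifts):.3f}")
```

Under these assumptions the balanced run should drift only at second order in the rescaling, while the unbalanced run drifts at first order during the transient in which v_t adapts more slowly than m_t; a large drift in the balanced run, or none in the unbalanced run, would count against the claim.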
Original abstract
Adam has been at the core of large-scale training for almost a decade, yet a simple empirical fact remains unaccounted for: both validation scores and the qualitative behaviour of the training runs improve when the momentum parameters satisfy $\beta_{1}=\beta_{2}$. Some recent studies have reported this pattern, but there is still no explanation for why this choice helps. We show that this choice is closely tied to a structural property that we refer to as \textit{gradient scale invariance}. We formalize this notion and prove that Adam becomes gradient scale invariant of first order if and only if $\beta_{1}=\beta_{2}$. This perspective places the balanced regime of Adam in direct alignment with the design principles underlying several recent optimizers that explicitly enforce scale-robust updates. The theory is supported by experiments across vision and language tasks, and across different architectural families, in which rescaling the gradient has a markedly smoother effect on the update when $\beta_{1}=\beta_{2}$. Overall, our results offer a coherent explanation for an open question in the behavior of Adam and provide a simple principle that helps guide the design of future optimizers.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that Adam exhibits first-order gradient scale invariance—i.e., the normalized update direction u_t satisfies u_t(g) = u_t(c·g) for scalar c>0—if and only if β1=β2. It formalizes this property, proves the iff statement, and reports experiments across vision and language tasks showing that gradient rescaling produces smoother update behavior precisely in the balanced-β regime. The result is positioned as explaining an empirical heuristic and aligning Adam with scale-robust optimizer designs.
Significance. If the formalization and proof are correct, the work supplies a clean, parameter-free explanation for a decade-old tuning observation and a concrete design principle for future optimizers. The alignment with recent scale-invariant methods is a useful conceptual contribution even if the practical scope of the first-order definition is limited.
major comments (2)
- [§3.2, Theorem 1] Definition 1 and Theorem 1: the first-order invariance condition u_t(g)=u_t(c·g) is stated for the normalized direction, yet the second-moment EMA v_t scales as c² and therefore sqrt(v_t) scales as c; the proof must explicitly demonstrate that β1=β2 cancels this scaling in the final normalized update, or else restrict the claim to regimes in which v_t is approximately constant.
- [§4.1] Experiments on gradient rescaling: the reported qualitative smoothness is consistent with the claim, but the quantitative tables do not report the norm of the difference ||u_t(g)-u_t(c·g)|| or its variance across layers and training steps; without these metrics it is difficult to judge how completely the invariance is realized in practice.
minor comments (2)
- [Abstract] The sentence on experimental support would be strengthened by a single quantitative statement (e.g., “update-norm variance reduced by X % when β1=β2”).
- [Notation] The update rule (Eq. 2) should explicitly label the bias-corrected moments m̂_t and v̂_t to avoid ambiguity when the proof later refers to “the normalized update”; the standard bias-corrected form is written out after this list for reference.
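For reference, the standard bias-corrected Adam update as usually written (the original Kingma and Ba form; whether Eq. 2 in the manuscript matches it exactly is an assumption on our part):

```latex
\begin{aligned}
m_t &= \beta_1\, m_{t-1} + (1-\beta_1)\, g_t,
& v_t &= \beta_2\, v_{t-1} + (1-\beta_2)\, g_t^{2},\\
\hat m_t &= \frac{m_t}{1-\beta_1^{t}},
& \hat v_t &= \frac{v_t}{1-\beta_2^{t}},\\
\theta_t &= \theta_{t-1} - \alpha\, \frac{\hat m_t}{\sqrt{\hat v_t} + \varepsilon}. &&
\end{aligned}
```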
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the presentation of the first-order invariance property. We address each major point below and will revise the manuscript accordingly.
Point-by-point responses
- Referee: [§3.2, Theorem 1] Definition 1 and Theorem 1: the first-order invariance condition u_t(g)=u_t(c·g) is stated for the normalized direction, yet the second-moment EMA v_t scales as c² and therefore sqrt(v_t) scales as c; the proof must explicitly demonstrate that β1=β2 cancels this scaling in the final normalized update, or else restrict the claim to regimes in which v_t is approximately constant.
Authors: We agree that the scaling argument deserves a more explicit derivation. In the revised proof of Theorem 1 we will add an inductive step showing that m_t scales linearly with c while v_t scales quadratically, and we will compute the derivative of the normalized update u_t(c) with respect to the scale factor at c=1. The first-order term of this expansion vanishes if and only if β1=β2, confirming the claimed invariance without requiring v_t to be constant. The revised text will also state clearly that “first-order” refers to local invariance under infinitesimal rescaling. (A simplified version of this expansion is sketched after these responses.) revision: yes
- Referee: [§4.1] Experiments on gradient rescaling: the reported qualitative smoothness is consistent with the claim, but the quantitative tables do not report the norm of the difference ||u_t(g)-u_t(c·g)|| or its variance across layers and training steps; without these metrics it is difficult to judge how completely the invariance is realized in practice.
Authors: We appreciate the suggestion for quantitative validation. The revised §4.1 will include new tables (or supplementary figures) reporting the mean and variance of ||u_t(g) − u_t(c·g)|| computed across layers and training steps for several values of c and for both balanced and unbalanced β regimes. These metrics will directly quantify the degree to which the invariance holds in practice. revision: yes
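A simplified version of the expansion described in the first response, under assumptions that are ours rather than the paper's: a constant gradient g, the ε stabilizer and bias correction dropped, and a rescaling by c = 1 + ε applied for the last k steps.

```latex
\begin{aligned}
m_t &= g\bigl(\beta_1^{k} + (1-\beta_1^{k})(1+\epsilon)\bigr), \qquad
v_t = g^{2}\bigl(\beta_2^{k} + (1-\beta_2^{k})(1+\epsilon)^{2}\bigr),\\
u_t &= \frac{m_t}{\sqrt{v_t}}
     = \operatorname{sign}(g)\,
       \frac{1 + (1-\beta_1^{k})\,\epsilon}
            {\sqrt{1 + 2\,(1-\beta_2^{k})\,\epsilon + O(\epsilon^{2})}}
     = \operatorname{sign}(g)\,\Bigl(1 + \bigl(\beta_2^{k} - \beta_1^{k}\bigr)\epsilon + O(\epsilon^{2})\Bigr).
\end{aligned}
```

The O(ε) term vanishes for every k exactly when β1 = β2, which is the shape of cancellation the referee asked to see spelled out; the paper's actual proof presumably treats general gradient sequences rather than this constant-gradient caricature.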
Circularity Check
No circularity: proof derives from newly introduced definition
Full rationale
The paper defines first-order gradient scale invariance as a property on the normalized update direction and proves the iff statement that this holds precisely when β1=β2. This is a direct mathematical derivation from the stated definition and Adam's update equations; it does not reduce any prediction or central claim to a fitted parameter, prior self-citation, or ansatz smuggled in from the authors' own work. Experiments are presented as supporting evidence rather than as the load-bearing justification. The derivation chain is therefore self-contained against external benchmarks and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Gradient scale invariance of first order is a meaningful structural property that explains observed training improvements.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
Unclear relation between the paper passage and the cited Recognition theorem.
Linked passage: Theorem 3.3 ... R(t) = sign(g(t)) (1 + (τ2 − τ1)δ(t) + O(Λ² + Λ′)) ... iff τ1=τ2 (Corollary 3.6)
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean, theorem J_uniquely_calibrated_via_higher_derivative (tag: unclear)
Unclear relation between the paper passage and the cited Recognition theorem.
Linked passage: Definition 3.1 ... R_k invariant under λ·g_k for λ > 0
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- Refresh-Scaling the Memory of Balanced Adam: Choosing β in balanced Adam so the refresh count R_β is approximately 1000 reduces the worst-case validation gap by 33.4% and keeps all runs within 1% of their oracle compared with the best fixed-β baseline.
- Refresh-Scaling the Memory of Balanced Adam: Setting β in balanced Adam to achieve a refresh count R_β ≈ 1000 based on the effective learning horizon T_ES improves validation robustness over fixed-β baselines across 11 vision and language experiments.
discussion (0)