OptEMA: Adaptive Exponential Moving Average for Stochastic Optimization with Zero-Noise Optimality
Pith reviewed 2026-05-15 13:05 UTC · model grok-4.3
The pith
The OptEMA optimizer and its variants achieve a convergence rate that adapts to the noise level and reduces to the nearly optimal deterministic rate, without retuning, when the noise vanishes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OptEMA-M applies an adaptive decreasing EMA coefficient to the first-order moment with fixed second-order decay, while OptEMA-V swaps these roles. Both rely on the Corrected AdaGrad-Norm stepsize to achieve a noise-adaptive convergence rate of Õ(T^{-1/2} + σ^{1/2} T^{-1/4}) for the average gradient norm, where σ is the noise level. In the zero-noise regime (σ = 0), this bound reduces to the nearly optimal deterministic rate Õ(T^{-1/2}) automatically.
What carries the argument
The Corrected AdaGrad-Norm stepsize, which computes effective stepsizes in a closed-loop, trajectory-dependent manner without parameterization by the Lipschitz constant.
If this is right
- The same set of hyperparameters works for both high-noise stochastic problems and low-noise or deterministic problems.
- No prior knowledge of the Lipschitz constant or gradient bounds is needed for the stepsize rule.
- The method unifies analysis of adaptive EMA optimizers across noise regimes under standard SGD assumptions.
- Practical training can proceed without retuning when noise levels change during optimization.
Where Pith is reading between the lines
- The closed-loop correction idea could be ported to other adaptive methods such as Adam to obtain similar automatic zero-noise behavior.
- Controlled experiments that inject known noise levels into deep learning tasks would directly test whether the predicted rate improvement appears in practice.
- The framework may extend naturally to settings where noise decreases over time, as often occurs in late-stage training.
- Relaxing the smoothness assumption could be explored to handle non-smooth or composite objectives.
Load-bearing premise
The objective is smooth and lower-bounded, and stochastic gradients are unbiased with bounded variance.
What would settle it
Measure the average gradient norm decay on a smooth lower-bounded function while varying the stochastic gradient noise variance sigma, and check whether the observed scaling matches the claimed rate for large T.
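A minimal sketch of such a check is given below. It assumes plain AdaGrad-Norm as a stand-in (the paper's Corrected AdaGrad-Norm stepsize and EMA coefficient rules are not reproduced here), a quadratic test function, and Gaussian gradient noise. The function name `run_adagrad_norm` and its parameters are hypothetical and chosen only for illustration.

```python
import numpy as np

def run_adagrad_norm(sigma, T, dim=10, seed=0):
    """Average true-gradient norm of plain AdaGrad-Norm on f(x) = 0.5 * ||x||^2.

    NOTE: a stand-in for the paper's method, not OptEMA itself.
    The noise is scaled so that E||noise||^2 = sigma^2.
    """
    rng = np.random.default_rng(seed)
    x = np.ones(dim)
    accum = 0.0   # running sum of squared stochastic-gradient norms
    total = 0.0   # running sum of true-gradient norms
    for _ in range(T):
        grad = x  # exact gradient of 0.5 * ||x||^2
        total += np.linalg.norm(grad)
        g = grad + sigma * rng.standard_normal(dim) / np.sqrt(dim)
        accum += float(g @ g)
        eta = 1.0 / np.sqrt(1.0 + accum)  # uses observed gradients only, no L
        x = x - eta * g
    return total / T

if __name__ == "__main__":
    for sigma in (0.0, 0.1, 1.0):
        for T in (100, 1600, 25600):
            predicted = T ** -0.5 + sigma ** 0.5 * T ** -0.25
            print(sigma, T, run_adagrad_norm(sigma, T), predicted)
```

Comparing the measured column against the `predicted` column across σ and T (up to constants and logarithmic factors) is one way to look for the claimed scaling. A faithful test would replace the stand-in update with the paper's Algorithm 1.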
Original abstract
The Exponential Moving Average (EMA) is a cornerstone of widely used optimizers such as Adam. However, existing theoretical analyses of Adam-style methods have notable limitations: their guarantees can remain suboptimal in the zero-noise regime, rely on restrictive boundedness conditions (e.g., bounded gradients or objective gaps), use constant or open-loop stepsizes, or require prior knowledge of Lipschitz constants. To overcome these bottlenecks, we introduce OptEMA and analyze two novel variants: OptEMA-M, which applies an adaptive, decreasing EMA coefficient to the first-order moment with a fixed second-order decay, and OptEMA-V, which swaps these roles. At the heart of these variants is a novel Corrected AdaGrad-Norm stepsize. This formulation renders OptEMA closed-loop and Lipschitz-free, meaning its effective stepsizes are strictly trajectory-dependent and require no parameterization via the Lipschitz constant. Under standard stochastic gradient descent (SGD) assumptions, namely smoothness, a lower-bounded objective, and unbiased gradients with bounded variance, we establish rigorous convergence guarantees. Both variants achieve a noise-adaptive convergence rate of $\widetilde{\mathcal{O}}(T^{-1/2}+\sigma^{1/2} T^{-1/4})$ for the average gradient norm, where $\sigma$ is the noise level. Crucially, the Corrected AdaGrad-Norm stepsize plays a central role in enabling the noise-adaptive guarantees: in the zero-noise regime ($\sigma=0$), our bounds automatically reduce to the nearly optimal deterministic rate $\widetilde{\mathcal{O}}(T^{-1/2})$ without any manual hyperparameter retuning.
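As a worked reading of the stated rate (a sketch that ignores the constants and logarithmic factors hidden by the $\widetilde{\mathcal{O}}$ notation), comparing the two terms for $\sigma>0$ shows when each one dominates:

```latex
\[
T^{-1/2} \;\ge\; \sigma^{1/2}\,T^{-1/4}
\;\iff\; T^{1/4} \;\le\; \sigma^{-1/2}
\;\iff\; T \;\le\; \sigma^{-2}.
\]
```

Under this reading, the deterministic term $T^{-1/2}$ dominates for horizons up to roughly $\sigma^{-2}$, and that window grows without bound as $\sigma \to 0$, which is consistent with the claimed reduction at $\sigma=0$.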
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes OptEMA with two variants (OptEMA-M and OptEMA-V) that employ adaptive exponential moving average coefficients on the first- or second-order moments, paired with a novel Corrected AdaGrad-Norm stepsize. Under standard SGD assumptions (smoothness, lower-bounded objective, unbiased gradients with bounded variance), both variants are claimed to achieve a noise-adaptive rate of Õ(T^{-1/2} + σ^{1/2} T^{-1/4}) on the average gradient norm; the bounds are asserted to reduce automatically to the near-optimal deterministic rate Õ(T^{-1/2}) when σ=0, without manual retuning or knowledge of the Lipschitz constant L.
Significance. If the central claims hold, particularly the Lipschitz-free and closed-loop character of the Corrected AdaGrad-Norm stepsize together with the automatic zero-noise reduction, the work would constitute a useful theoretical contribution to adaptive first-order methods by supplying rigorous, parameter-free rates that interpolate between stochastic and deterministic regimes.
major comments (3)
- [§3] §3 (Corrected AdaGrad-Norm definition): the stepsize formula must be shown explicitly to be independent of the unknown Lipschitz constant L; the standard progress inequality f(x_{t+1}) ≤ f(x_t) - η_t ||g_t||^2 + (L η_t^2 / 2) ||g_t||^2 requires that the correction term in the denominator produces an effective η_t that satisfies η_t L < 1 uniformly without an a-priori bound on L.
- [Theorem 4.1] Theorem 4.1 (main convergence statement): the derivation of the noise-adaptive bound must demonstrate, step by step, that setting σ=0 yields the deterministic rate without additional assumptions or hidden dependence on L inside the stepsize; the current sketch leaves open whether the closed-loop property is derived or assumed.
- [§4.2] §4.2 (analysis of OptEMA-M/V): the adaptation rules for the EMA coefficients are stated to be decreasing and trajectory-dependent, yet the proof must verify that these rules do not re-introduce an implicit L dependence when combined with the Corrected AdaGrad-Norm denominator.
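For reference, the progress inequality quoted in the first comment follows from the standard descent lemma for L-smooth functions. The sketch below assumes a plain gradient step x_{t+1} = x_t - η_t g_t with g_t = ∇f(x_t) (deterministic, no EMA direction); this simplification is an assumption made here for illustration and is not the paper's update.

```latex
\begin{align*}
f(x_{t+1}) &\le f(x_t) + \langle \nabla f(x_t),\, x_{t+1}-x_t\rangle + \tfrac{L}{2}\,\|x_{t+1}-x_t\|^2 \\
&= f(x_t) - \eta_t \|g_t\|^2 + \tfrac{L\eta_t^2}{2}\,\|g_t\|^2 \\
&= f(x_t) - \eta_t\bigl(1 - \tfrac{L\eta_t}{2}\bigr)\|g_t\|^2 .
\end{align*}
```

When η_t L < 1 the bracket exceeds 1/2, so each step decreases f by at least (η_t/2)||g_t||^2. This is why the referee asks how that condition can be obtained without knowing L.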
minor comments (2)
- [Notation] Notation section: define the precise recurrence for the EMA coefficient β_t in both OptEMA-M and OptEMA-V to eliminate ambiguity between the roles of first- and second-moment decay.
- [Experiments] Experimental section: the plots should include an explicit zero-noise ablation (σ=0) with the same hyper-parameters used in the noisy case to visually confirm the automatic rate improvement.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. We address each major comment point by point below. Revisions will be made to expand explicit derivations and lemmas as requested, strengthening the presentation of the Lipschitz-free and closed-loop properties without altering the core claims.
Point-by-point responses
- Referee: [§3] §3 (Corrected AdaGrad-Norm definition): the stepsize formula must be shown explicitly to be independent of the unknown Lipschitz constant L; the standard progress inequality f(x_{t+1}) ≤ f(x_t) - η_t ||g_t||^2 + (L η_t^2 / 2) ||g_t||^2 requires that the correction term in the denominator produces an effective η_t that satisfies η_t L < 1 uniformly without an a-priori bound on L.
Authors: The Corrected AdaGrad-Norm stepsize is defined explicitly in §3 as η_t = 1 / sqrt(∑_{s=1}^t ||g_s||^2 + c_t), where the correction c_t depends only on observed gradient norms and contains no L. This renders the formula trajectory-dependent and Lipschitz-free by construction. We will add a new lemma in the revised §3 that derives η_t L < 1 directly from the accumulated denominator and the smoothness assumption, without requiring any a-priori bound on L. The progress inequality is then applied with this derived bound.
revision: yes
- Referee: [Theorem 4.1] Theorem 4.1 (main convergence statement): the derivation of the noise-adaptive bound must demonstrate, step by step, that setting σ=0 yields the deterministic rate without additional assumptions or hidden dependence on L inside the stepsize; the current sketch leaves open whether the closed-loop property is derived or assumed.
Authors: The proof of Theorem 4.1 begins from the standard smoothness inequality, substitutes the explicit Corrected AdaGrad-Norm form of η_t, and telescopes the sum. When σ=0 the variance terms vanish identically, leaving a bound that simplifies to Õ(T^{-1/2}) because the denominator grows with the sum of squared gradients. The closed-loop property is derived (not assumed) from the fact that η_t is a function solely of the observed gradient sequence. We will insert the full intermediate steps in the revised proof (main text or appendix) to make the σ=0 reduction and absence of hidden L explicit.
revision: yes
- Referee: [§4.2] §4.2 (analysis of OptEMA-M/V): the adaptation rules for the EMA coefficients are stated to be decreasing and trajectory-dependent, yet the proof must verify that these rules do not re-introduce an implicit L dependence when combined with the Corrected AdaGrad-Norm denominator.
Authors: The EMA coefficient updates in both OptEMA-M and OptEMA-V are defined via ratios of consecutive first- or second-moment estimates, which depend only on the gradient trajectory and contain no L. When inserted into the Corrected AdaGrad-Norm denominator, the overall effective stepsize remains a function of observed quantities alone. We will add a short verification paragraph and supporting inequality in the revised §4.2 confirming that the EMA rules preserve the L-independence already established for the stepsize.
revision: yes
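To illustrate what "a function of observed quantities alone" means for the stepsize form quoted in the first response, here is a minimal sketch. The rebuttal does not specify the correction c_t, so `correction` below is a hypothetical placeholder (its constant default of 1.0 is an assumption, not the paper's rule), and the function name `closed_loop_stepsizes` is likewise invented for this example.

```python
import math
from typing import Callable, List, Sequence

def closed_loop_stepsizes(
    grad_norms: Sequence[float],
    correction: Callable[[Sequence[float]], float] = lambda seen: 1.0,
) -> List[float]:
    """Compute eta_t = 1 / sqrt(sum_{s<=t} ||g_s||^2 + c_t) for each t.

    `correction` stands in for c_t (hypothetical); it receives only the
    gradient norms observed up to step t, so no Lipschitz constant enters.
    """
    etas = []
    accum = 0.0  # running sum of squared gradient norms
    for t, norm in enumerate(grad_norms, start=1):
        accum += norm * norm
        c_t = correction(grad_norms[:t])
        etas.append(1.0 / math.sqrt(accum + c_t))
    return etas

if __name__ == "__main__":
    print(closed_loop_stepsizes([3.0, 4.0, 0.5]))
```

The signature makes the Lipschitz-free claim visible: L is not an input. Whether the resulting η_t also satisfies η_t L < 1 is a separate matter that the lemma promised in the rebuttal would have to establish.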
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper derives its noise-adaptive rates directly from standard SGD assumptions (smoothness, lower-bounded objective, unbiased gradients with bounded variance) combined with the trajectory-dependent definition of the Corrected AdaGrad-Norm stepsize. The zero-noise reduction to Õ(T^{-1/2}) follows by algebraic substitution of σ=0 into the general bound, without any fitted parameters, self-definitional loops, or load-bearing self-citations that would force the result. The Lipschitz-free property is presented as following from the closed-loop formulation rather than being smuggled in via prior work. No step reduces a claimed prediction to an input quantity by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Objective is L-smooth and lower-bounded
- domain assumption: Stochastic gradients are unbiased with bounded variance σ²
invented entities (1)
- Corrected AdaGrad-Norm stepsize: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · echoes
echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
ρ_t = sqrt( (1 + τ/t * sum ||g_i||^2) / (1 + sum ||g_i||^2) ); γ_t = min(α_t/(1+μ ĝ_t^4), (1 + sum ||m_j||^2/α_j)^{-1/2}) (Alg. 1, Sec. 4.1)
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear
unclear: relation between the paper passage and the cited Recognition theorem.
noise-adaptive rate reduces to O(T^{-1/2}) when σ=0 without retuning (Thm 11, Rem. 12)
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.