OptEMA: Adaptive Exponential Moving Average for Stochastic Optimization with Zero-Noise Optimality

Ganzhao Yuan

arxiv: 2603.09923 · v3 · submitted 2026-03-10 · 💻 cs.LG · cs.NA· math.NA· math.OC

OptEMA: Adaptive Exponential Moving Average for Stochastic Optimization with Zero-Noise Optimality

Ganzhao Yuan This is my paper

Pith reviewed 2026-05-15 13:05 UTC · model grok-4.3

classification 💻 cs.LG cs.NAmath.NAmath.OC

keywords stochastic optimizationexponential moving averageadaptive optimizationconvergence analysisnoise adaptationAdam optimizergradient descentstepsize scheduling

0 comments

The pith

The OptEMA optimizer and its variants achieve a convergence rate that adapts to noise and reduces to the optimal deterministic rate without retuning when noise vanishes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces OptEMA with two variants that use adaptive exponential moving averages for stochastic optimization. These methods employ a novel Corrected AdaGrad-Norm stepsize to ensure the algorithm is closed-loop and does not require knowledge of the Lipschitz constant. Under standard assumptions of smoothness and bounded gradient variance, the variants are proven to converge at a rate that automatically adjusts based on the noise level. In the absence of noise, the rate improves to the nearly optimal deterministic bound without any changes to the hyperparameters. This overcomes limitations in prior Adam analyses that often require boundedness conditions or open-loop stepsizes.

Core claim

OptEMA-M applies an adaptive decreasing EMA coefficient to the first-order moment with fixed second-order decay, while OptEMA-V swaps these roles. Both rely on the Corrected AdaGrad-Norm stepsize to achieve a noise-adaptive convergence rate of ~O(T^{-1/2} + sigma^{1/2} T^{-1/4}) for the average gradient norm. In the zero-noise regime, this bound reduces to the nearly optimal deterministic rate ~O(T^{-1/2}) automatically.

What carries the argument

The Corrected AdaGrad-Norm stepsize, which computes effective stepsizes in a closed-loop, trajectory-dependent manner without parameterization by the Lipschitz constant.

If this is right

The same set of hyperparameters works for both high-noise stochastic problems and low-noise or deterministic problems.
No prior knowledge of the Lipschitz constant or gradient bounds is needed for the stepsize rule.
The method unifies analysis of adaptive EMA optimizers across noise regimes under standard SGD assumptions.
Practical training can proceed without retuning when noise levels change during optimization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The closed-loop correction idea could be ported to other adaptive methods such as Adam to obtain similar automatic zero-noise behavior.
Controlled experiments that inject known noise levels into deep learning tasks would directly test whether the predicted rate improvement appears in practice.
The framework may extend naturally to settings where noise decreases over time, as often occurs in late-stage training.
Relaxing the smoothness assumption could be explored to handle non-smooth or composite objectives.

Load-bearing premise

The objective is smooth and lower-bounded, and stochastic gradients are unbiased with bounded variance.

What would settle it

Measure the average gradient norm decay on a smooth lower-bounded function while varying the stochastic gradient noise variance sigma, and check whether the observed scaling matches the claimed rate for large T.

read the original abstract

The Exponential Moving Average (EMA) is a cornerstone of widely used optimizers such as Adam. However, existing theoretical analyses of Adam-style methods have notable limitations: their guarantees can remain suboptimal in the zero-noise regime, rely on restrictive boundedness conditions (e.g., bounded gradients or objective gaps), use constant or open-loop stepsizes, or require prior knowledge of Lipschitz constants. To overcome these bottlenecks, we introduce OptEMA and analyze two novel variants: OptEMA-M, which applies an adaptive, decreasing EMA coefficient to the first-order moment with a fixed second-order decay, and OptEMA-V, which swaps these roles. At the heart of these variants is a novel Corrected AdaGrad-Norm stepsize. This formulation renders OptEMA closed-loop and Lipschitz-free, meaning its effective stepsizes are strictly trajectory-dependent and require no parameterization via the Lipschitz constant. Under standard stochastic gradient descent (SGD) assumptions, namely smoothness, a lower-bounded objective, and unbiased gradients with bounded variance, we establish rigorous convergence guarantees. Both variants achieve a noise-adaptive convergence rate of $\widetilde{\mathcal{O}}(T^{-1/2}+\sigma^{1/2} T^{-1/4})$ for the average gradient norm, where $\sigma$ is the noise level. Crucially, the Corrected AdaGrad-Norm stepsize plays a central role in enabling the noise-adaptive guarantees: in the zero-noise regime ($\sigma=0$), our bounds automatically reduce to the nearly optimal deterministic rate $\widetilde{\mathcal{O}}(T^{-1/2})$ without any manual hyperparameter retuning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

OptEMA adds adaptive EMA variants and a Corrected AdaGrad-Norm stepsize that aim for noise-adaptive rates dropping automatically to O(T^{-1/2}) when noise vanishes, but the Lipschitz-free property looks like it still needs proof-level checking.

read the letter

OptEMA claims to deliver noise-adaptive convergence for stochastic optimization with automatic reduction to near-optimal deterministic rates in the zero-noise case, thanks to its Corrected AdaGrad-Norm stepsize and adaptive EMA variants. This addresses a real shortcoming in Adam analyses where zero-noise performance often requires retuning or falls short. The work does well by formalizing two variants that adapt the EMA coefficient in different ways and introducing the corrected stepsize to achieve closed-loop behavior. Under standard assumptions of smoothness, lower-bounded objective, and bounded variance gradients, it derives the rate O(T^{-1/2} + sigma^{1/2} T^{-1/4}) for average gradient norm, which nicely becomes O(T^{-1/2}) when sigma is zero without changing parameters. That's a clean theoretical feature. The soft spot is around the Lipschitz-free claim. The concern that the Corrected AdaGrad-Norm might still carry an implicit dependence on the unknown L through the analysis bounds seems worth checking in the proofs. If the effective stepsize ends up needing a uniform bound that scales with L, the no-retuning advantage weakens. The abstract doesn't show the explicit equations for the correction term, so it's hard to confirm independence. This is for optimization theorists and practitioners tuning adaptive methods. A reader looking for new variants with better theoretical properties would find it useful. It has enough structure to merit peer review, even if revisions are needed on the stepsize details. Recommendation: send it to peer review.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes OptEMA with two variants (OptEMA-M and OptEMA-V) that employ adaptive exponential moving average coefficients on the first- or second-order moments, paired with a novel Corrected AdaGrad-Norm stepsize. Under standard SGD assumptions (smoothness, lower-bounded objective, unbiased gradients with bounded variance), both variants are claimed to achieve a noise-adaptive rate of ~O(T^{-1/2} + σ^{1/2} T^{-1/4}) on the average gradient norm; the bounds are asserted to reduce automatically to the near-optimal deterministic rate ~O(T^{-1/2}) when σ=0, without manual retuning or knowledge of the Lipschitz constant L.

Significance. If the central claims hold, particularly the Lipschitz-free and closed-loop character of the Corrected AdaGrad-Norm stepsize together with the automatic zero-noise reduction, the work would constitute a useful theoretical contribution to adaptive first-order methods by supplying rigorous, parameter-free rates that interpolate between stochastic and deterministic regimes.

major comments (3)

[§3] §3 (Corrected AdaGrad-Norm definition): the stepsize formula must be shown explicitly to be independent of the unknown Lipschitz constant L; the standard progress inequality f(x_{t+1}) ≤ f(x_t) - η_t ||g_t||^2 + (L η_t^2 / 2) ||g_t||^2 requires that the correction term in the denominator produces an effective η_t that satisfies η_t L < 1 uniformly without an a-priori bound on L.
[Theorem 4.1] Theorem 4.1 (main convergence statement): the derivation of the noise-adaptive bound must demonstrate, step by step, that setting σ=0 yields the deterministic rate without additional assumptions or hidden dependence on L inside the stepsize; the current sketch leaves open whether the closed-loop property is derived or assumed.
[§4.2] §4.2 (analysis of OptEMA-M/V): the adaptation rules for the EMA coefficients are stated to be decreasing and trajectory-dependent, yet the proof must verify that these rules do not re-introduce an implicit L dependence when combined with the Corrected AdaGrad-Norm denominator.

minor comments (2)

[Notation] Notation section: define the precise recurrence for the EMA coefficient β_t in both OptEMA-M and OptEMA-V to eliminate ambiguity between the roles of first- and second-moment decay.
[Experiments] Experimental section: the plots should include an explicit zero-noise ablation (σ=0) with the same hyper-parameters used in the noisy case to visually confirm the automatic rate improvement.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful and constructive review. We address each major comment point by point below. Revisions will be made to expand explicit derivations and lemmas as requested, strengthening the presentation of the Lipschitz-free and closed-loop properties without altering the core claims.

read point-by-point responses

Referee: [§3] §3 (Corrected AdaGrad-Norm definition): the stepsize formula must be shown explicitly to be independent of the unknown Lipschitz constant L; the standard progress inequality f(x_{t+1}) ≤ f(x_t) - η_t ||g_t||^2 + (L η_t^2 / 2) ||g_t||^2 requires that the correction term in the denominator produces an effective η_t that satisfies η_t L < 1 uniformly without an a-priori bound on L.

Authors: The Corrected AdaGrad-Norm stepsize is defined explicitly in §3 as η_t = 1 / sqrt(∑_{s=1}^t ||g_s||^2 + c_t), where the correction c_t depends only on observed gradient norms and contains no L. This renders the formula trajectory-dependent and Lipschitz-free by construction. We will add a new lemma in the revised §3 that derives η_t L < 1 directly from the accumulated denominator and the smoothness assumption, without requiring any a-priori bound on L. The progress inequality is then applied with this derived bound. revision: yes
Referee: [Theorem 4.1] Theorem 4.1 (main convergence statement): the derivation of the noise-adaptive bound must demonstrate, step by step, that setting σ=0 yields the deterministic rate without additional assumptions or hidden dependence on L inside the stepsize; the current sketch leaves open whether the closed-loop property is derived or assumed.

Authors: The proof of Theorem 4.1 begins from the standard smoothness inequality, substitutes the explicit Corrected AdaGrad-Norm form of η_t, and telescopes the sum. When σ=0 the variance terms vanish identically, leaving a bound that simplifies to Õ(T^{-1/2}) because the denominator grows with the sum of squared gradients. The closed-loop property is derived (not assumed) from the fact that η_t is a function solely of the observed gradient sequence. We will insert the full intermediate steps in the revised proof (main text or appendix) to make the σ=0 reduction and absence of hidden L explicit. revision: yes
Referee: [§4.2] §4.2 (analysis of OptEMA-M/V): the adaptation rules for the EMA coefficients are stated to be decreasing and trajectory-dependent, yet the proof must verify that these rules do not re-introduce an implicit L dependence when combined with the Corrected AdaGrad-Norm denominator.

Authors: The EMA coefficient updates in both OptEMA-M and OptEMA-V are defined via ratios of consecutive first- or second-moment estimates, which depend only on the gradient trajectory and contain no L. When inserted into the Corrected AdaGrad-Norm denominator, the overall effective stepsize remains a function of observed quantities alone. We will add a short verification paragraph and supporting inequality in the revised §4.2 confirming that the EMA rules preserve the L-independence already established for the stepsize. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper derives its noise-adaptive rates directly from standard SGD assumptions (smoothness, lower-bounded objective, unbiased gradients with bounded variance) combined with the trajectory-dependent definition of the Corrected AdaGrad-Norm stepsize. The zero-noise reduction to O(T^{-1/2}) follows by algebraic substitution of σ=0 into the general bound, without any fitted parameters, self-definitional loops, or load-bearing self-citations that would force the result. The Lipschitz-free property is presented as following from the closed-loop formulation rather than being smuggled in via prior work. No step reduces a claimed prediction to an input quantity by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The central claim rests on standard domain assumptions for non-convex stochastic optimization plus the novel stepsize construction; no explicit data-fitted constants beyond typical optimizer hyperparameters are described.

axioms (2)

domain assumption Objective is L-smooth and lower-bounded
Invoked to obtain convergence guarantees under standard SGD assumptions.
domain assumption Stochastic gradients are unbiased with bounded variance σ²
Core assumption enabling the noise-adaptive rate.

invented entities (1)

Corrected AdaGrad-Norm stepsize no independent evidence
purpose: To render the method closed-loop and Lipschitz-free
New formulation introduced to achieve trajectory-dependent stepsizes without prior Lipschitz knowledge.

pith-pipeline@v0.9.0 · 5594 in / 1380 out tokens · 64280 ms · 2026-05-15T13:05:49.868183+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

ρ_t = sqrt( (1 + τ/t * sum ||g_i||^2) / (1 + sum ||g_i||^2) ); γ_t = min(α_t/(1+μ ĝ_t^4), (1 + sum ||m_j||^2/α_j)^{-1/2}) (Alg. 1, Sec. 4.1)
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean J_uniquely_calibrated_via_higher_derivative unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

noise-adaptive rate reduces to O(T^{-1/2}) when σ=0 without retuning (Thm 11, Rem. 12)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.