RanSOM: Second-Order Momentum with Randomized Scaling for Constrained and Unconstrained Optimization

El Mahdi Chayti

arxiv: 2602.06824 · v2 · pith:AIKUWQ46new · submitted 2026-02-06 · 🧮 math.OC · cs.LG

RanSOM: Second-Order Momentum with Randomized Scaling for Constrained and Unconstrained Optimization

El Mahdi Chayti This is my paper

Pith reviewed 2026-05-21 14:02 UTC · model grok-4.3

classification 🧮 math.OC cs.LG

keywords momentum optimizationrandomized stepsStein identitiesstochastic optimizationconstrained optimizationconvergence ratesheavy-tailed noise

0 comments

The pith

Randomized step sizes eliminate curvature bias in momentum methods using a single Hessian-vector product.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Momentum methods suffer from curvature-induced bias when gradients are stochastic, which caps convergence at the suboptimal O(ε^{-4}) rate. The paper replaces the usual deterministic step size with a random draw whose expectation is exactly the target η_t. This change lets Stein-type identities deliver an exact unbiased estimate of the bias term from one Hessian-vector product evaluated together with the gradient. The same idea is specialized to unconstrained problems with exponential random steps and to constrained problems with beta random steps that keep iterates feasible. Analysis shows the resulting methods recover the optimal O(ε^{-3}) rate under bounded noise and optimal rates under heavy-tailed noise.

Core claim

By drawing the step size from a distribution whose mean equals the nominal η_t, Stein-type identities can be applied to obtain an exact, unbiased estimator of the momentum bias term that otherwise arises from curvature interacting with stochastic gradients. The estimator uses only a single Hessian-vector product computed jointly with the gradient and requires no auxiliary queries. The framework is realized as RanSOM-E for unconstrained optimization via exponential distributions and RanSOM-B for constrained optimization via beta distributions that strictly preserve feasibility. Theoretical guarantees establish the optimal O(ε^{-3}) convergence rate for standard bounded noise and optimal rates

What carries the argument

Randomized scaling of the step size (drawn from a distribution with exact mean η_t) combined with Stein-type identities to produce an unbiased momentum-bias estimate from a single Hessian-vector product.

If this is right

Momentum methods achieve the optimal O(ε^{-3}) rate instead of the biased O(ε^{-4}) rate in stochastic settings.
Bias correction requires no auxiliary gradient queries beyond the single joint Hessian-vector product.
The same framework works for constrained optimization while strictly preserving feasibility of the iterates.
Optimal rates hold for heavy-tailed noise when the moment index p lies in (1,2].

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The randomization device could be reused to debias other first-order estimators that suffer from curvature-noise interactions.
In deep-network training the method may reduce the number of wasted steps without increasing per-iteration cost.
Different step-size distributions could be designed for additional constraint types or for noise with specific tail properties.

Load-bearing premise

Steps drawn from a distribution whose mean is exactly η_t suffice for Stein identities to yield an unbiased bias estimate without extra smoothness or boundedness conditions.

What would settle it

An empirical check that the Stein-based bias estimate matches the true bias (computed by averaging many independent realizations) or that observed convergence reaches O(ε^{-3}) on a problem where uncorrected momentum stalls at a worse rate.

Figures

Figures reproduced from arXiv: 2602.06824 by El Mahdi Chayti.

**Figure 1.** Figure 1: Momentum Bias. The historic momentum mt−1 approximates ∇f(xt−1), but the optimizer has moved to xt. The deviation (orange vector) is the bias induced by the Hessian ∇2 f. To accelerate convergence, this bias must be corrected. 1.2. The Landscape of Bias Correction To restore the optimal convergence properties of momentum, we must correct the ”lag” in the estimator. Mathematically, this requires construc… view at source ↗

**Figure 2.** Figure 2: Training Loss (left) and Test Accuracy (right) on the Splice dataset with Welsch regularization. RanSOM-E (Muon) shows the fastest convergence and highest final accuracy [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 4.** Figure 4: Average RMSE on Nano MovieLens Matrix Completion. RanSOM-B (green) converges to a lower final error than SFWPolyak and SFW-SOM. 6. Discussion and Conclusion Standard momentum methods suffer from curvatureinduced bias in non-convex settings. To address this, we introduced RanSOM, a framework that corrects this bias via randomized integration. By leveraging Stein-type identities, RanSOM constructs an unbia… view at source ↗

read the original abstract

Momentum methods, such as Polyak's Heavy Ball, are the standard for training deep networks but suffer from curvature-induced bias in stochastic settings, limiting convergence to suboptimal $\mathcal{O}(\epsilon^{-4})$ rates. Existing corrections typically require expensive auxiliary sampling or restrictive smoothness assumptions. We propose \textbf{RanSOM}, a unified framework that eliminates this bias by replacing deterministic step sizes with randomized steps drawn from distributions with mean $\eta_t$. This modification allows us to leverage Stein-type identities to compute an exact, unbiased estimate of the momentum bias using a single Hessian-vector product computed jointly with the gradient, avoiding auxiliary queries. We instantiate this framework in two algorithms: \textbf{RanSOM-E} for unconstrained optimization (using exponentially distributed steps) and \textbf{RanSOM-B} for constrained optimization (using beta-distributed steps to strictly preserve feasibility). Theoretical analysis confirms that RanSOM recovers the optimal $\mathcal{O}(\epsilon^{-3})$ convergence rate under standard bounded noise, and achieves optimal rates for heavy-tailed noise settings ($p \in (1, 2]$).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

RanSOM randomizes momentum steps to apply Stein identities for unbiased bias correction with one HVP, claiming optimal rates, though the identity may not hold cleanly under the stated assumptions.

read the letter

The core idea here is that RanSOM replaces fixed step sizes in momentum with random draws from exponential or beta distributions so Stein identities can turn the bias term into an exact expression using the gradient and one Hessian-vector product. This is meant to recover the O(ε^{-3}) rate in stochastic settings without auxiliary samples or extra smoothness demands beyond bounded noise. They give separate versions for unconstrained and constrained problems, with the beta choice keeping steps inside the feasible set. That practical detail stands out as useful for constrained work. The approach is new in how it combines randomization with the identity to avoid the usual fixes, and the abstract shows clear engagement with the rate gap in standard momentum methods. The write-up is direct about the problem and the claimed fix. The soft spot is the one the stress test flags. Stein identities require the test function to be sufficiently differentiable and boundary terms to vanish at the distribution edges. The paper relies on standard bounded noise and Lipschitz gradients, but it is not immediate that these guarantee the momentum map meets the conditions or that the random step stays independent of the gradient noise in the needed way. If a remainder appears, the unbiased correction and the rate proof would not go through as stated. I would want to check the full derivation for how they handle this. This is for people working on stochastic first-order methods and momentum variants in optimization or deep learning. A reader looking for bias-correction techniques would find the mechanism worth examining even if they later modify it. The paper shows enough structure and a concrete claim to merit a serious referee who can verify the proofs and any experiments.

Referee Report

2 major / 2 minor

Summary. The paper proposes RanSOM, a unified framework for momentum-based optimization that replaces deterministic step sizes with randomized steps drawn from distributions (exponential for unconstrained, beta for constrained) having exact mean η_t. This enables the use of Stein-type identities to obtain an unbiased estimate of the momentum bias term via a single Hessian-vector product computed jointly with the gradient, without auxiliary queries. Two instantiations are given: RanSOM-E and RanSOM-B. The central theoretical claim is that the resulting algorithms recover the optimal O(ε^{-3}) convergence rate under standard bounded noise and also achieve optimal rates for heavy-tailed noise with p ∈ (1,2].

Significance. If the Stein-identity correction is shown to be exact under the stated assumptions and the convergence proofs are complete, the work would be significant: it offers a practical, low-overhead bias correction for second-order momentum that avoids extra sampling and restrictive smoothness requirements, while extending to constrained problems and heavy-tailed noise. The unified treatment of constrained and unconstrained cases via distribution choice is a strength, as is the explicit claim of recovering the optimal rate with only standard noise assumptions.

major comments (2)

[Theoretical Analysis / proof of the unbiased estimator] Abstract and the main theoretical section deriving the unbiased estimator: the application of the Stein identity to replace E[∇f(x + η v)] with an expression involving only ∇f(x) and one Hessian-vector product assumes that the test function (the momentum map) is C^2 and that boundary terms vanish at 0 and ∞ (or 0 and 1 for the beta case). The manuscript states only bounded noise and the usual Lipschitz-gradient assumption; these do not automatically guarantee the required differentiability or independence between the stochastic gradient and the random step size. If a nonzero remainder remains, the single-HVP unbiased correction fails and the O(ε^{-3}) rate proof does not go through.
[Convergence-rate theorem] The section establishing the O(ε^{-3}) rate: the proof sketch invokes the unbiasedness result to recover the optimal rate, but the argument appears to inherit the same smoothness and boundary-term conditions from the Stein step. A concrete verification (or counter-example) that the boundary terms are zero under only the paper’s noise and Lipschitz assumptions is needed; otherwise the rate claim rests on an unverified extension of the identity.

minor comments (2)

The abstract and introduction would benefit from an explicit statement of the density functions used for the exponential and beta distributions, together with the precise form of the Stein identity applied.
Notation for the random step η_t and the momentum vector v should be introduced with a short table or boxed definition early in the manuscript to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments correctly identify that the current statement of assumptions is insufficient to rigorously justify the application of the Stein identity. We will revise the manuscript to strengthen the theoretical foundations as detailed below.

read point-by-point responses

Referee: Abstract and the main theoretical section deriving the unbiased estimator: the application of the Stein identity to replace E[∇f(x + η v)] with an expression involving only ∇f(x) and one Hessian-vector product assumes that the test function (the momentum map) is C^2 and that boundary terms vanish at 0 and ∞ (or 0 and 1 for the beta case). The manuscript states only bounded noise and the usual Lipschitz-gradient assumption; these do not automatically guarantee the required differentiability or independence between the stochastic gradient and the random step size. If a nonzero remainder remains, the single-HVP unbiased correction fails and the O(ε^{-3}) rate proof does not go through.

Authors: We agree that the Lipschitz-gradient assumption alone does not suffice for the Stein identity. In the revision we will add the explicit assumption that f is twice continuously differentiable. We will also prove that, for the exponential distribution (unconstrained case) and beta distribution (constrained case), the boundary terms vanish at the endpoints of the support under this smoothness condition together with the existing bounded-noise assumption. Independence between the random step size and the stochastic gradient will be stated explicitly, as the step size is drawn independently at each iteration. These changes will be placed in the assumptions section and the proof of the unbiased estimator. revision: yes
Referee: The section establishing the O(ε^{-3}) rate: the proof sketch invokes the unbiasedness result to recover the optimal rate, but the argument appears to inherit the same smoothness and boundary-term conditions from the Stein step. A concrete verification (or counter-example) that the boundary terms are zero under only the paper’s noise and Lipschitz assumptions is needed; otherwise the rate claim rests on an unverified extension of the identity.

Authors: We accept that a concrete verification is required. The revised manuscript will include an appendix that directly verifies the vanishing of the boundary terms for both distributions. For the exponential case we use the rapid decay at infinity; for the beta case we exploit the compact support on [0,1]. The verification will be carried out under the added C^2 assumption and the paper’s noise conditions, without introducing stronger smoothness requirements that would limit practical applicability. This will close the gap between the unbiasedness result and the convergence-rate theorem. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation uses external Stein identities under standard assumptions

full rationale

The paper introduces randomized step sizes drawn from distributions with given mean η_t and invokes Stein-type identities to obtain an unbiased momentum bias correction via one Hessian-vector product. These identities are standard external mathematical tools, not defined or derived from the paper's own target rate or fitted quantities. The claimed O(ε^{-3}) rate follows from applying the correction under the usual bounded-noise and Lipschitz-gradient assumptions, which are stated independently of the result itself. No self-definitional equations, fitted inputs renamed as predictions, or load-bearing self-citations appear in the abstract or described derivation chain. The framework is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on standard stochastic optimization assumptions plus the applicability of Stein identities to the chosen step-size distributions; no free parameters or new entities are introduced in the abstract.

axioms (1)

domain assumption Standard bounded noise assumption for convergence analysis
Invoked to obtain the O(ε^{-3}) rate under bounded noise and optimal rates for heavy-tailed noise.

pith-pipeline@v0.9.0 · 5720 in / 1364 out tokens · 62100 ms · 2026-05-21T14:02:36.131979+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

By randomizing the step (e.g., via Exponential or Beta distributions), we utilize identities analogous to Stein’s Lemma to perform 'statistical integration' of the Hessian. This yields an exact, unbiased estimator of the curvature bias using a single Hessian-Vector Product (HVP) at the next iterate xt+1.
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean costAlphaLog_high_calibrated_iff unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Lemma 3.1 (Stein-Type Identities for Optimization). ... E[g(s)−g(0)] = 1/λ E[g′(s)].

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages

[1]

The PDF is f(z) =λe −λz for z≥0

Exponential Identity (Unconstrained).Let s∼Exp(λ) . The PDF is f(z) =λe −λz for z≥0 . The survival function is: 1−F(z) = Z ∞ z λe−λs ds=e −λz. We observe that the survival function is directly proportional to the PDF: 1−F(z) = 1 λ(λe−λz) = 1 λ f(z). Substituting this into the master equation (18): E[g(s)−g(0)] = Z ∞ 0 g′(z) 1 λ f(z) dz = 1 λ Z ∞ 0 g′(z)f(...

work page
[2]

The PDF is given by f(s) = 1 B(1,K) (1−s) K−1 for s∈[0,1]

Beta Identity (Constrained).Let s∼Beta(1, K) . The PDF is given by f(s) = 1 B(1,K) (1−s) K−1 for s∈[0,1] . Since the Beta function isB(1, K) = Γ(1)Γ(K) Γ(1+K) = 1 K , the PDF simplifies to: f(z) =K(1−z) K−1 , z∈[0,1]. The survival function forz∈[0,1]is: 1−F(z) = Z 1 z K(1−s) K−1 ds= −(1−s) K 1 z = (1−z) K. To express the master integral as an expectation,...

work page
[3]

•RanSOM-E(s t ∼Exp(1/η t),w t =η t): –C s = 2

Stein Moment Constants:To bound the variance, we require the following normalized moments of the Stein weight wt and step sizes t: Mw ≜E[|w t/ηt|q], M ws ≜E[|w t/ηt +s t/ηt|q].(19) 3.Correction ConstantC δ:We define the aggregate constant used in the error bound as: Cδ ≜2·2 1−1/q ·max n M1/q w , M1/q ws o .(20) Upper Bounds for Specific Distributions. •Ra...

work page
[4]

BoundingT 1 (Hessian Noise):Using Assumption 4.2 and∥d t∥ ≤ρ: E[∥T1∥q 2]≤E[|w t|q]κqE[∥(Hξ − ∇2f)d t∥q 2 |x t+1] ≤M wηq t ρq (σq h +α q h∥∇f(x t+1)∥q ∗)

work page
[5]

Taking the expectation and using(a+b) q ≤2 q−1(aq +b q)again: E[∥T2∥q 2]≤E[(|w t|+s t)q]κqρq (L0 +L 1∥∇f(x t+1)∥∗)q ≤M wsηq t κqρq2q−1 (Lq 0 +L q 1∥∇f(x t+1)∥q ∗)

Bounding T2 (Approximation Error):Using Assumption 4.1 ( L0, L1 smoothness) and the triangle inequality on the integral: ∥T2∥2 ≤κρ |wt|∥∇2f(x t+1)∥op +s t Z 1 0 ∥∇2f(x τ)∥opdτ ≤κρ(|w t|+s t) (L0 +L 1∥∇f(x t+1)∥∗). Taking the expectation and using(a+b) q ≤2 q−1(aq +b q)again: E[∥T2∥q 2]≤E[(|w t|+s t)q]κqρq (L0 +L 1∥∇f(x t+1)∥∗)q ≤M wsηq t κqρq2q−1 (Lq 0 +L...

work page
[6]

+κ qρq(αq h +L q 1)∥∇f∥ q ∗] ≤C q δ ηq t (¯σq h + ¯αq h∥∇f(x t+1)∥q ∗). B. Auxiliary Lemmas Lemma B.1(Martingale Moment Inequality (V on Bahr & Esseen, 1965)).Let {Xj} be a martingale difference sequence in a Hilbert space. For anyr∈(1,2], there exists a constantC r ≤2such that: E   tX j=1 Xj r 2   ≤C r tX j=1 E[∥Xj∥r 2].(24) Lemma B.2(Affine Noise Pe...

work page 1965
[7]

,1yields the result

We peel off the last term using the sub-additivity of x7→x 1/p (concavity): E[S1/p t |Ft−1]≤(S t−1 +c p t E[∥ξt∥p 2|Ft−1])1/p (Jensen’s Inequality) ≤(S t−1 +c p t σp +c p t αp∥∇f(x t)∥p ∗)1/p ≤(S t−1 +c p t σp)1/p +c tα∥∇f(x t)∥∗ (Sub-additivity) Applying this recursively forj=t, t−1, . . . ,1yields the result. 14 RanSOM: Randomized Second-Order Momentum ...

work page
[8]

Letc j =β(1−β) t+1−j

Bounding the Gradient Noise Term (Mt,grad).We apply the Martingale Moment Inequality (Lemma B.1) with moment p. Letc j =β(1−β) t+1−j. E[∥Mt,grad∥2]≤2   t+1X j=1 cp j E[∥¯ξj∥p 2]   1/p . Using the property of batched means forp∈(1,2],E[∥ ¯ξj∥p 2]≤B −(p−1)E[∥ξj∥p 2]. Substituting Assumption 4.2: E[∥Mt,grad∥2]≤ 2 B p−1 p E     t+1X j=1 cp j(σp g +α ...

work page
[9]

Applying Lemma B.1 with moment q: E[∥Mt,corr∥2]≤ 2 B q−1 q   t+1X j=1 dq j E[∥Ψj∥q 2]   1/q

Bounding the Correction Noise Term (Mt,corr).Similarly, let dj = (1−β) t+2−j. Applying Lemma B.1 with moment q: E[∥Mt,corr∥2]≤ 2 B q−1 q   t+1X j=1 dq j E[∥Ψj∥q 2]   1/q . Using Lemma A.2, we substitute the boundE[∥Ψ j∥q 2]≤C q δ ηq(¯σq h + ¯αq h∥∇f(x j)∥q ∗): E[∥Mt,corr∥2]≤ 2Cδη B q−1 q   t+1X j=1 dq j(¯σq h + ¯αq h∥∇f(x j)∥q ∗)   1/q . Applying ...

work page
[10]

Initialization Term.The geometric series sum is bounded: T−1X t=0 (1−β) t+1∥e0∥2 ≤ ∥e0∥2 ∞X k=1 (1−β) k =∥e 0∥2 1−β β ≤ ∥e0∥2 β

work page
[11]

Summing them simply multiplies byT

Constant Noise Terms.The constant noise terms (involving σg and ¯σh) are independent of t. Summing them simply multiplies byT. Sumσ =T 2β1−1/p p1/pB p−1 p σg + 2ηβ −1/qCδ q1/qB q−1 q ¯σh !

work page
[12]

Consider the generic structure S= PT−1 t=0 CPt+1 j=1(1−β) t+1−jE[∥∇f(x j)∥∗]

Affine Growth Terms (Convolution).We handle the recursive sums using the property of discrete convolution. Consider the generic structure S= PT−1 t=0 CPt+1 j=1(1−β) t+1−jE[∥∇f(x j)∥∗]. By swapping the order of summation (PT−1 t=0 Pt+1 j=1 =PT j=1 PT−1 t=j−1), we bound the inner geometric series: T−1X t=j−1 (1−β) t+1−j = T−jX k=0 (1−β) k ≤ ∞X k=0 (1−β) k =...

work page
[13]

Final Combination.Combining these parts, dividing the entire inequality byT, we obtain the stated result. D. Convergence Analysis D.1. Unconstrained Optimization (RanSOM-E) Theorem D.1(Convergence of RanSOM-E).Let ∆0 =F(x 0)−F ∗ be the initial functional gap and ∥e0∥2 be the initial momentum error. We define the following problem-dependent constants: Cini...

work page
[14]

Let Ccond ≜ B(q−1)/q 32κCδ ¯αh

Step Size V alidity:The step size η must satisfy the following geometric and noise absorption constraints. Let Ccond ≜ B(q−1)/q 32κCδ ¯αh . η≤min 1 2CsρL1 , 1 ρL0 , Ccond ·β .(34) Let¯η= min 1 2CsρL1 , 1 ρL0 denote the constant geometric limit. Result:If we select a sufficiently large initial batch size Binit ≳T p(1−A−K) (p−1)(2A+K) to suppress initializa...

work page
[15]

Main Rate (Noise + Bias Balance) + 2 Cinit gapc1/A 1 C −1 cond A 1+A T − p−1 2p−1 | {z }

work page
[16]

Constraint Rate (Active ifp>q) + 2 p Cinit gapCsmoothT −1/2 | {z }

work page
[17]

Smoothness Rate (Deterministic) + Cinit gap ¯η T −1 | {z }

work page
[18]

Geometric Limit Rate + Cinit gap βinit limitCcond T −1 | {z }

work page
[19]

Constraint Initialization Rate (35) whereA= p−1 p ,K= 1 q ,c 1 = Cnoise grad p1/pBA ,c 2 = Cnoise hess q1/qBK , andC mom = (cK 1 cA 2 ) 1 A+K . Proof. 1. Explicit Bound Derivation.Starting from the descent inequality and substituting error bounds (as shown in previous derivations), we obtain the fundamental upper bound: R ≤ Cinit gap T η +c 1βA +c 2ηβ −K ...

work page
[20]

The optimal β is max(β∗, βcons), where β∗ balances noise/bias and βcons =η/C cond

Optimization of Momentum β.We minimize the momentum error Emom(η, β) =c 1βA +c 2ηβ −K subject to β≥η/C cond. The optimal β is max(β∗, βcons), where β∗ balances noise/bias and βcons =η/C cond. Substituting this back, the effective momentum error function is: Emom(η)≈2C momη A A+K | {z } Ideal Error +c 1C −A condηA | {z } Constraint Error

work page
[21]

The optimalηis the minimum of these candidate step sizes: 1.Ideal Noise Balance: Cinit gap T η = 2Cmomη A A+K =⇒η 1

Optimization of Step Size η.We determine η by balancing the Drift term ( Cinit gap T η ) against every other error source in the bound. The optimalηis the minimum of these candidate step sizes: 1.Ideal Noise Balance: Cinit gap T η = 2Cmomη A A+K =⇒η 1. 18 RanSOM: Randomized Second-Order Momentum 2.Constraint Balance: Cinit gap T η =c 1C −A condηA =⇒η 2. 3...

work page
[22]

Substituting η= min(η i) into the drift term yields the sum of the inverse rates: Cinit gap Tmin(η i) ≤ X i Cinit gap T ηi

Final Summation of Rates.The drift term Cinit gap T η dominates the bound. Substituting η= min(η i) into the drift term yields the sum of the inverse rates: Cinit gap Tmin(η i) ≤ X i Cinit gap T ηi . Evaluating each term explicitly: •Term 1 (Main): Cinit gap T η1 = 2Cinit gap(Cinit gap/2Cmom)− A+K 2A+K T − A 2A+K . •Term 2 (Constraint): Cinit gap T η2 = 2...

work page
[23]

Main Rate (Noise + Bias Balance) + 2 p Cinit gapCsmoothT −1/2 | {z }

work page
[24]

Smoothness Rate + ∆0 T|{z}

work page
[25]

Geometric Limit Rate (46) whereA= p−1 p ,K= 1 q , andC mom = (cK 1 cA 2 ) 1 A+K withc 1 = Cnoise grad p1/pBA , c2 = Cnoise hess q1/qBK . Proof. 1. Explicit Bound Derivation.Summing the descent inequality from t= 0 to T−1 and substituting the error bound from Lemma A.2 (simplified for bounded effective noise): X E[G(xt)]≤ ∆0 η +T C smoothη+ 2D X E[∥et∥2]. ...

work page
[26]

The optimal momentum is β∗(η) = ( c2 c1 ) 1 A+K η 1 A+K

Optimization of Momentum β.We minimize the momentum error c1βA +c 2ηβ −K. The optimal momentum is β∗(η) = ( c2 c1 ) 1 A+K η 1 A+K . The effective error is2C momη A A+K

work page
[27]

We takeηas the minimum of the candidates: 1.Main Balance: Cinit gap T η = 2Cmomη A A+K =⇒η 1 ∝T − A+K 2A+K

Optimization of Step Size η.We balance the Drift term Cinit gap T η against the Momentum Error and the Smoothness term. We takeηas the minimum of the candidates: 1.Main Balance: Cinit gap T η = 2Cmomη A A+K =⇒η 1 ∝T − A+K 2A+K . 2.Smoothness Balance: Cinit gap T η =C smoothη=⇒η 2 = q Cinit gap CsmoothT . 3.Geometric Limit:η≤1. Since the updatex t+1 = (1−s...

work page
[28]

This ensures it does not affect the asymptotic rate

Initialization.The term Cinit mom T ηβ is suppressed by choosing Binit large enough so that Cinit mom ≤C init gapβ. This ensures it does not affect the asymptotic rate. Corollary: Main paper Setting (Bounded Moments). In the simplified setting of the main paper (αg = ¯αh = 0,L 1 = 0), the constants simplify significantly. Corollary D.5(Bounded Moments Rat...

work page

[1] [1]

The PDF is f(z) =λe −λz for z≥0

Exponential Identity (Unconstrained).Let s∼Exp(λ) . The PDF is f(z) =λe −λz for z≥0 . The survival function is: 1−F(z) = Z ∞ z λe−λs ds=e −λz. We observe that the survival function is directly proportional to the PDF: 1−F(z) = 1 λ(λe−λz) = 1 λ f(z). Substituting this into the master equation (18): E[g(s)−g(0)] = Z ∞ 0 g′(z) 1 λ f(z) dz = 1 λ Z ∞ 0 g′(z)f(...

work page

[2] [2]

The PDF is given by f(s) = 1 B(1,K) (1−s) K−1 for s∈[0,1]

Beta Identity (Constrained).Let s∼Beta(1, K) . The PDF is given by f(s) = 1 B(1,K) (1−s) K−1 for s∈[0,1] . Since the Beta function isB(1, K) = Γ(1)Γ(K) Γ(1+K) = 1 K , the PDF simplifies to: f(z) =K(1−z) K−1 , z∈[0,1]. The survival function forz∈[0,1]is: 1−F(z) = Z 1 z K(1−s) K−1 ds= −(1−s) K 1 z = (1−z) K. To express the master integral as an expectation,...

work page

[3] [3]

•RanSOM-E(s t ∼Exp(1/η t),w t =η t): –C s = 2

Stein Moment Constants:To bound the variance, we require the following normalized moments of the Stein weight wt and step sizes t: Mw ≜E[|w t/ηt|q], M ws ≜E[|w t/ηt +s t/ηt|q].(19) 3.Correction ConstantC δ:We define the aggregate constant used in the error bound as: Cδ ≜2·2 1−1/q ·max n M1/q w , M1/q ws o .(20) Upper Bounds for Specific Distributions. •Ra...

work page

[4] [4]

BoundingT 1 (Hessian Noise):Using Assumption 4.2 and∥d t∥ ≤ρ: E[∥T1∥q 2]≤E[|w t|q]κqE[∥(Hξ − ∇2f)d t∥q 2 |x t+1] ≤M wηq t ρq (σq h +α q h∥∇f(x t+1)∥q ∗)

work page

[5] [5]

Taking the expectation and using(a+b) q ≤2 q−1(aq +b q)again: E[∥T2∥q 2]≤E[(|w t|+s t)q]κqρq (L0 +L 1∥∇f(x t+1)∥∗)q ≤M wsηq t κqρq2q−1 (Lq 0 +L q 1∥∇f(x t+1)∥q ∗)

Bounding T2 (Approximation Error):Using Assumption 4.1 ( L0, L1 smoothness) and the triangle inequality on the integral: ∥T2∥2 ≤κρ |wt|∥∇2f(x t+1)∥op +s t Z 1 0 ∥∇2f(x τ)∥opdτ ≤κρ(|w t|+s t) (L0 +L 1∥∇f(x t+1)∥∗). Taking the expectation and using(a+b) q ≤2 q−1(aq +b q)again: E[∥T2∥q 2]≤E[(|w t|+s t)q]κqρq (L0 +L 1∥∇f(x t+1)∥∗)q ≤M wsηq t κqρq2q−1 (Lq 0 +L...

work page

[6] [6]

+κ qρq(αq h +L q 1)∥∇f∥ q ∗] ≤C q δ ηq t (¯σq h + ¯αq h∥∇f(x t+1)∥q ∗). B. Auxiliary Lemmas Lemma B.1(Martingale Moment Inequality (V on Bahr & Esseen, 1965)).Let {Xj} be a martingale difference sequence in a Hilbert space. For anyr∈(1,2], there exists a constantC r ≤2such that: E   tX j=1 Xj r 2   ≤C r tX j=1 E[∥Xj∥r 2].(24) Lemma B.2(Affine Noise Pe...

work page 1965

[7] [7]

,1yields the result

We peel off the last term using the sub-additivity of x7→x 1/p (concavity): E[S1/p t |Ft−1]≤(S t−1 +c p t E[∥ξt∥p 2|Ft−1])1/p (Jensen’s Inequality) ≤(S t−1 +c p t σp +c p t αp∥∇f(x t)∥p ∗)1/p ≤(S t−1 +c p t σp)1/p +c tα∥∇f(x t)∥∗ (Sub-additivity) Applying this recursively forj=t, t−1, . . . ,1yields the result. 14 RanSOM: Randomized Second-Order Momentum ...

work page

[8] [8]

Letc j =β(1−β) t+1−j

Bounding the Gradient Noise Term (Mt,grad).We apply the Martingale Moment Inequality (Lemma B.1) with moment p. Letc j =β(1−β) t+1−j. E[∥Mt,grad∥2]≤2   t+1X j=1 cp j E[∥¯ξj∥p 2]   1/p . Using the property of batched means forp∈(1,2],E[∥ ¯ξj∥p 2]≤B −(p−1)E[∥ξj∥p 2]. Substituting Assumption 4.2: E[∥Mt,grad∥2]≤ 2 B p−1 p E     t+1X j=1 cp j(σp g +α ...

work page

[9] [9]

Applying Lemma B.1 with moment q: E[∥Mt,corr∥2]≤ 2 B q−1 q   t+1X j=1 dq j E[∥Ψj∥q 2]   1/q

Bounding the Correction Noise Term (Mt,corr).Similarly, let dj = (1−β) t+2−j. Applying Lemma B.1 with moment q: E[∥Mt,corr∥2]≤ 2 B q−1 q   t+1X j=1 dq j E[∥Ψj∥q 2]   1/q . Using Lemma A.2, we substitute the boundE[∥Ψ j∥q 2]≤C q δ ηq(¯σq h + ¯αq h∥∇f(x j)∥q ∗): E[∥Mt,corr∥2]≤ 2Cδη B q−1 q   t+1X j=1 dq j(¯σq h + ¯αq h∥∇f(x j)∥q ∗)   1/q . Applying ...

work page

[10] [10]

Initialization Term.The geometric series sum is bounded: T−1X t=0 (1−β) t+1∥e0∥2 ≤ ∥e0∥2 ∞X k=1 (1−β) k =∥e 0∥2 1−β β ≤ ∥e0∥2 β

work page

[11] [11]

Summing them simply multiplies byT

Constant Noise Terms.The constant noise terms (involving σg and ¯σh) are independent of t. Summing them simply multiplies byT. Sumσ =T 2β1−1/p p1/pB p−1 p σg + 2ηβ −1/qCδ q1/qB q−1 q ¯σh !

work page

[12] [12]

Consider the generic structure S= PT−1 t=0 CPt+1 j=1(1−β) t+1−jE[∥∇f(x j)∥∗]

Affine Growth Terms (Convolution).We handle the recursive sums using the property of discrete convolution. Consider the generic structure S= PT−1 t=0 CPt+1 j=1(1−β) t+1−jE[∥∇f(x j)∥∗]. By swapping the order of summation (PT−1 t=0 Pt+1 j=1 =PT j=1 PT−1 t=j−1), we bound the inner geometric series: T−1X t=j−1 (1−β) t+1−j = T−jX k=0 (1−β) k ≤ ∞X k=0 (1−β) k =...

work page

[13] [13]

Final Combination.Combining these parts, dividing the entire inequality byT, we obtain the stated result. D. Convergence Analysis D.1. Unconstrained Optimization (RanSOM-E) Theorem D.1(Convergence of RanSOM-E).Let ∆0 =F(x 0)−F ∗ be the initial functional gap and ∥e0∥2 be the initial momentum error. We define the following problem-dependent constants: Cini...

work page

[14] [14]

Let Ccond ≜ B(q−1)/q 32κCδ ¯αh

Step Size V alidity:The step size η must satisfy the following geometric and noise absorption constraints. Let Ccond ≜ B(q−1)/q 32κCδ ¯αh . η≤min 1 2CsρL1 , 1 ρL0 , Ccond ·β .(34) Let¯η= min 1 2CsρL1 , 1 ρL0 denote the constant geometric limit. Result:If we select a sufficiently large initial batch size Binit ≳T p(1−A−K) (p−1)(2A+K) to suppress initializa...

work page

[15] [15]

Main Rate (Noise + Bias Balance) + 2 Cinit gapc1/A 1 C −1 cond A 1+A T − p−1 2p−1 | {z }

work page

[16] [16]

Constraint Rate (Active ifp>q) + 2 p Cinit gapCsmoothT −1/2 | {z }

work page

[17] [17]

Smoothness Rate (Deterministic) + Cinit gap ¯η T −1 | {z }

work page

[18] [18]

Geometric Limit Rate + Cinit gap βinit limitCcond T −1 | {z }

work page

[19] [19]

Constraint Initialization Rate (35) whereA= p−1 p ,K= 1 q ,c 1 = Cnoise grad p1/pBA ,c 2 = Cnoise hess q1/qBK , andC mom = (cK 1 cA 2 ) 1 A+K . Proof. 1. Explicit Bound Derivation.Starting from the descent inequality and substituting error bounds (as shown in previous derivations), we obtain the fundamental upper bound: R ≤ Cinit gap T η +c 1βA +c 2ηβ −K ...

work page

[20] [20]

The optimal β is max(β∗, βcons), where β∗ balances noise/bias and βcons =η/C cond

Optimization of Momentum β.We minimize the momentum error Emom(η, β) =c 1βA +c 2ηβ −K subject to β≥η/C cond. The optimal β is max(β∗, βcons), where β∗ balances noise/bias and βcons =η/C cond. Substituting this back, the effective momentum error function is: Emom(η)≈2C momη A A+K | {z } Ideal Error +c 1C −A condηA | {z } Constraint Error

work page

[21] [21]

The optimalηis the minimum of these candidate step sizes: 1.Ideal Noise Balance: Cinit gap T η = 2Cmomη A A+K =⇒η 1

Optimization of Step Size η.We determine η by balancing the Drift term ( Cinit gap T η ) against every other error source in the bound. The optimalηis the minimum of these candidate step sizes: 1.Ideal Noise Balance: Cinit gap T η = 2Cmomη A A+K =⇒η 1. 18 RanSOM: Randomized Second-Order Momentum 2.Constraint Balance: Cinit gap T η =c 1C −A condηA =⇒η 2. 3...

work page

[22] [22]

Substituting η= min(η i) into the drift term yields the sum of the inverse rates: Cinit gap Tmin(η i) ≤ X i Cinit gap T ηi

Final Summation of Rates.The drift term Cinit gap T η dominates the bound. Substituting η= min(η i) into the drift term yields the sum of the inverse rates: Cinit gap Tmin(η i) ≤ X i Cinit gap T ηi . Evaluating each term explicitly: •Term 1 (Main): Cinit gap T η1 = 2Cinit gap(Cinit gap/2Cmom)− A+K 2A+K T − A 2A+K . •Term 2 (Constraint): Cinit gap T η2 = 2...

work page

[23] [23]

Main Rate (Noise + Bias Balance) + 2 p Cinit gapCsmoothT −1/2 | {z }

work page

[24] [24]

Smoothness Rate + ∆0 T|{z}

work page

[25] [25]

Geometric Limit Rate (46) whereA= p−1 p ,K= 1 q , andC mom = (cK 1 cA 2 ) 1 A+K withc 1 = Cnoise grad p1/pBA , c2 = Cnoise hess q1/qBK . Proof. 1. Explicit Bound Derivation.Summing the descent inequality from t= 0 to T−1 and substituting the error bound from Lemma A.2 (simplified for bounded effective noise): X E[G(xt)]≤ ∆0 η +T C smoothη+ 2D X E[∥et∥2]. ...

work page

[26] [26]

The optimal momentum is β∗(η) = ( c2 c1 ) 1 A+K η 1 A+K

Optimization of Momentum β.We minimize the momentum error c1βA +c 2ηβ −K. The optimal momentum is β∗(η) = ( c2 c1 ) 1 A+K η 1 A+K . The effective error is2C momη A A+K

work page

[27] [27]

We takeηas the minimum of the candidates: 1.Main Balance: Cinit gap T η = 2Cmomη A A+K =⇒η 1 ∝T − A+K 2A+K

Optimization of Step Size η.We balance the Drift term Cinit gap T η against the Momentum Error and the Smoothness term. We takeηas the minimum of the candidates: 1.Main Balance: Cinit gap T η = 2Cmomη A A+K =⇒η 1 ∝T − A+K 2A+K . 2.Smoothness Balance: Cinit gap T η =C smoothη=⇒η 2 = q Cinit gap CsmoothT . 3.Geometric Limit:η≤1. Since the updatex t+1 = (1−s...

work page

[28] [28]

This ensures it does not affect the asymptotic rate

Initialization.The term Cinit mom T ηβ is suppressed by choosing Binit large enough so that Cinit mom ≤C init gapβ. This ensures it does not affect the asymptotic rate. Corollary: Main paper Setting (Bounded Moments). In the simplified setting of the main paper (αg = ¯αh = 0,L 1 = 0), the constants simplify significantly. Corollary D.5(Bounded Moments Rat...

work page