RanSOM: Second-Order Momentum with Randomized Scaling for Constrained and Unconstrained Optimization
Pith reviewed 2026-05-21 14:02 UTC · model grok-4.3
The pith
Randomized step sizes eliminate curvature bias in momentum methods using a single Hessian-vector product.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By drawing the step size from a distribution whose mean equals the nominal η_t, Stein-type identities can be applied to obtain an exact, unbiased estimator of the momentum bias term that otherwise arises from curvature interacting with stochastic gradients. The estimator uses only a single Hessian-vector product computed jointly with the gradient and requires no auxiliary queries. The framework is realized as RanSOM-E for unconstrained optimization via exponential distributions and RanSOM-B for constrained optimization via beta distributions that strictly preserve feasibility. Theoretical guarantees establish the optimal O(ε^{-3}) convergence rate for standard bounded noise and optimal rates
What carries the argument
Randomized scaling of the step size (drawn from a distribution with exact mean η_t) combined with Stein-type identities to produce an unbiased momentum-bias estimate from a single Hessian-vector product.
If this is right
- Momentum methods achieve the optimal O(ε^{-3}) rate instead of the biased O(ε^{-4}) rate in stochastic settings.
- Bias correction requires no auxiliary gradient queries beyond the single joint Hessian-vector product.
- The same framework works for constrained optimization while strictly preserving feasibility of the iterates.
- Optimal rates hold for heavy-tailed noise when the moment index p lies in (1,2].
Where Pith is reading between the lines
- The randomization device could be reused to debias other first-order estimators that suffer from curvature-noise interactions.
- In deep-network training the method may reduce the number of wasted steps without increasing per-iteration cost.
- Different step-size distributions could be designed for additional constraint types or for noise with specific tail properties.
Load-bearing premise
Steps drawn from a distribution whose mean is exactly η_t suffice for Stein identities to yield an unbiased bias estimate without extra smoothness or boundedness conditions.
What would settle it
An empirical check that the Stein-based bias estimate matches the true bias (computed by averaging many independent realizations) or that observed convergence reaches O(ε^{-3}) on a problem where uncorrected momentum stalls at a worse rate.
Figures
read the original abstract
Momentum methods, such as Polyak's Heavy Ball, are the standard for training deep networks but suffer from curvature-induced bias in stochastic settings, limiting convergence to suboptimal $\mathcal{O}(\epsilon^{-4})$ rates. Existing corrections typically require expensive auxiliary sampling or restrictive smoothness assumptions. We propose \textbf{RanSOM}, a unified framework that eliminates this bias by replacing deterministic step sizes with randomized steps drawn from distributions with mean $\eta_t$. This modification allows us to leverage Stein-type identities to compute an exact, unbiased estimate of the momentum bias using a single Hessian-vector product computed jointly with the gradient, avoiding auxiliary queries. We instantiate this framework in two algorithms: \textbf{RanSOM-E} for unconstrained optimization (using exponentially distributed steps) and \textbf{RanSOM-B} for constrained optimization (using beta-distributed steps to strictly preserve feasibility). Theoretical analysis confirms that RanSOM recovers the optimal $\mathcal{O}(\epsilon^{-3})$ convergence rate under standard bounded noise, and achieves optimal rates for heavy-tailed noise settings ($p \in (1, 2]$).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes RanSOM, a unified framework for momentum-based optimization that replaces deterministic step sizes with randomized steps drawn from distributions (exponential for unconstrained, beta for constrained) having exact mean η_t. This enables the use of Stein-type identities to obtain an unbiased estimate of the momentum bias term via a single Hessian-vector product computed jointly with the gradient, without auxiliary queries. Two instantiations are given: RanSOM-E and RanSOM-B. The central theoretical claim is that the resulting algorithms recover the optimal O(ε^{-3}) convergence rate under standard bounded noise and also achieve optimal rates for heavy-tailed noise with p ∈ (1,2].
Significance. If the Stein-identity correction is shown to be exact under the stated assumptions and the convergence proofs are complete, the work would be significant: it offers a practical, low-overhead bias correction for second-order momentum that avoids extra sampling and restrictive smoothness requirements, while extending to constrained problems and heavy-tailed noise. The unified treatment of constrained and unconstrained cases via distribution choice is a strength, as is the explicit claim of recovering the optimal rate with only standard noise assumptions.
major comments (2)
- [Theoretical Analysis / proof of the unbiased estimator] Abstract and the main theoretical section deriving the unbiased estimator: the application of the Stein identity to replace E[∇f(x + η v)] with an expression involving only ∇f(x) and one Hessian-vector product assumes that the test function (the momentum map) is C^2 and that boundary terms vanish at 0 and ∞ (or 0 and 1 for the beta case). The manuscript states only bounded noise and the usual Lipschitz-gradient assumption; these do not automatically guarantee the required differentiability or independence between the stochastic gradient and the random step size. If a nonzero remainder remains, the single-HVP unbiased correction fails and the O(ε^{-3}) rate proof does not go through.
- [Convergence-rate theorem] The section establishing the O(ε^{-3}) rate: the proof sketch invokes the unbiasedness result to recover the optimal rate, but the argument appears to inherit the same smoothness and boundary-term conditions from the Stein step. A concrete verification (or counter-example) that the boundary terms are zero under only the paper’s noise and Lipschitz assumptions is needed; otherwise the rate claim rests on an unverified extension of the identity.
minor comments (2)
- The abstract and introduction would benefit from an explicit statement of the density functions used for the exponential and beta distributions, together with the precise form of the Stein identity applied.
- Notation for the random step η_t and the momentum vector v should be introduced with a short table or boxed definition early in the manuscript to improve readability.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. The comments correctly identify that the current statement of assumptions is insufficient to rigorously justify the application of the Stein identity. We will revise the manuscript to strengthen the theoretical foundations as detailed below.
read point-by-point responses
-
Referee: Abstract and the main theoretical section deriving the unbiased estimator: the application of the Stein identity to replace E[∇f(x + η v)] with an expression involving only ∇f(x) and one Hessian-vector product assumes that the test function (the momentum map) is C^2 and that boundary terms vanish at 0 and ∞ (or 0 and 1 for the beta case). The manuscript states only bounded noise and the usual Lipschitz-gradient assumption; these do not automatically guarantee the required differentiability or independence between the stochastic gradient and the random step size. If a nonzero remainder remains, the single-HVP unbiased correction fails and the O(ε^{-3}) rate proof does not go through.
Authors: We agree that the Lipschitz-gradient assumption alone does not suffice for the Stein identity. In the revision we will add the explicit assumption that f is twice continuously differentiable. We will also prove that, for the exponential distribution (unconstrained case) and beta distribution (constrained case), the boundary terms vanish at the endpoints of the support under this smoothness condition together with the existing bounded-noise assumption. Independence between the random step size and the stochastic gradient will be stated explicitly, as the step size is drawn independently at each iteration. These changes will be placed in the assumptions section and the proof of the unbiased estimator. revision: yes
-
Referee: The section establishing the O(ε^{-3}) rate: the proof sketch invokes the unbiasedness result to recover the optimal rate, but the argument appears to inherit the same smoothness and boundary-term conditions from the Stein step. A concrete verification (or counter-example) that the boundary terms are zero under only the paper’s noise and Lipschitz assumptions is needed; otherwise the rate claim rests on an unverified extension of the identity.
Authors: We accept that a concrete verification is required. The revised manuscript will include an appendix that directly verifies the vanishing of the boundary terms for both distributions. For the exponential case we use the rapid decay at infinity; for the beta case we exploit the compact support on [0,1]. The verification will be carried out under the added C^2 assumption and the paper’s noise conditions, without introducing stronger smoothness requirements that would limit practical applicability. This will close the gap between the unbiasedness result and the convergence-rate theorem. revision: yes
Circularity Check
No circularity: derivation uses external Stein identities under standard assumptions
full rationale
The paper introduces randomized step sizes drawn from distributions with given mean η_t and invokes Stein-type identities to obtain an unbiased momentum bias correction via one Hessian-vector product. These identities are standard external mathematical tools, not defined or derived from the paper's own target rate or fitted quantities. The claimed O(ε^{-3}) rate follows from applying the correction under the usual bounded-noise and Lipschitz-gradient assumptions, which are stated independently of the result itself. No self-definitional equations, fitted inputs renamed as predictions, or load-bearing self-citations appear in the abstract or described derivation chain. The framework is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Standard bounded noise assumption for convergence analysis
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
By randomizing the step (e.g., via Exponential or Beta distributions), we utilize identities analogous to Stein’s Lemma to perform 'statistical integration' of the Hessian. This yields an exact, unbiased estimator of the curvature bias using a single Hessian-Vector Product (HVP) at the next iterate xt+1.
-
IndisputableMonolith/Foundation/AlphaCoordinateFixation.leancostAlphaLog_high_calibrated_iff unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Lemma 3.1 (Stein-Type Identities for Optimization). ... E[g(s)−g(0)] = 1/λ E[g′(s)].
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
The PDF is f(z) =λe −λz for z≥0
Exponential Identity (Unconstrained).Let s∼Exp(λ) . The PDF is f(z) =λe −λz for z≥0 . The survival function is: 1−F(z) = Z ∞ z λe−λs ds=e −λz. We observe that the survival function is directly proportional to the PDF: 1−F(z) = 1 λ(λe−λz) = 1 λ f(z). Substituting this into the master equation (18): E[g(s)−g(0)] = Z ∞ 0 g′(z) 1 λ f(z) dz = 1 λ Z ∞ 0 g′(z)f(...
-
[2]
The PDF is given by f(s) = 1 B(1,K) (1−s) K−1 for s∈[0,1]
Beta Identity (Constrained).Let s∼Beta(1, K) . The PDF is given by f(s) = 1 B(1,K) (1−s) K−1 for s∈[0,1] . Since the Beta function isB(1, K) = Γ(1)Γ(K) Γ(1+K) = 1 K , the PDF simplifies to: f(z) =K(1−z) K−1 , z∈[0,1]. The survival function forz∈[0,1]is: 1−F(z) = Z 1 z K(1−s) K−1 ds= −(1−s) K 1 z = (1−z) K. To express the master integral as an expectation,...
-
[3]
•RanSOM-E(s t ∼Exp(1/η t),w t =η t): –C s = 2
Stein Moment Constants:To bound the variance, we require the following normalized moments of the Stein weight wt and step sizes t: Mw ≜E[|w t/ηt|q], M ws ≜E[|w t/ηt +s t/ηt|q].(19) 3.Correction ConstantC δ:We define the aggregate constant used in the error bound as: Cδ ≜2·2 1−1/q ·max n M1/q w , M1/q ws o .(20) Upper Bounds for Specific Distributions. •Ra...
-
[4]
BoundingT 1 (Hessian Noise):Using Assumption 4.2 and∥d t∥ ≤ρ: E[∥T1∥q 2]≤E[|w t|q]κqE[∥(Hξ − ∇2f)d t∥q 2 |x t+1] ≤M wηq t ρq (σq h +α q h∥∇f(x t+1)∥q ∗)
-
[5]
Bounding T2 (Approximation Error):Using Assumption 4.1 ( L0, L1 smoothness) and the triangle inequality on the integral: ∥T2∥2 ≤κρ |wt|∥∇2f(x t+1)∥op +s t Z 1 0 ∥∇2f(x τ)∥opdτ ≤κρ(|w t|+s t) (L0 +L 1∥∇f(x t+1)∥∗). Taking the expectation and using(a+b) q ≤2 q−1(aq +b q)again: E[∥T2∥q 2]≤E[(|w t|+s t)q]κqρq (L0 +L 1∥∇f(x t+1)∥∗)q ≤M wsηq t κqρq2q−1 (Lq 0 +L...
-
[6]
+κ qρq(αq h +L q 1)∥∇f∥ q ∗] ≤C q δ ηq t (¯σq h + ¯αq h∥∇f(x t+1)∥q ∗). B. Auxiliary Lemmas Lemma B.1(Martingale Moment Inequality (V on Bahr & Esseen, 1965)).Let {Xj} be a martingale difference sequence in a Hilbert space. For anyr∈(1,2], there exists a constantC r ≤2such that: E tX j=1 Xj r 2 ≤C r tX j=1 E[∥Xj∥r 2].(24) Lemma B.2(Affine Noise Pe...
work page 1965
-
[7]
We peel off the last term using the sub-additivity of x7→x 1/p (concavity): E[S1/p t |Ft−1]≤(S t−1 +c p t E[∥ξt∥p 2|Ft−1])1/p (Jensen’s Inequality) ≤(S t−1 +c p t σp +c p t αp∥∇f(x t)∥p ∗)1/p ≤(S t−1 +c p t σp)1/p +c tα∥∇f(x t)∥∗ (Sub-additivity) Applying this recursively forj=t, t−1, . . . ,1yields the result. 14 RanSOM: Randomized Second-Order Momentum ...
-
[8]
Bounding the Gradient Noise Term (Mt,grad).We apply the Martingale Moment Inequality (Lemma B.1) with moment p. Letc j =β(1−β) t+1−j. E[∥Mt,grad∥2]≤2 t+1X j=1 cp j E[∥¯ξj∥p 2] 1/p . Using the property of batched means forp∈(1,2],E[∥ ¯ξj∥p 2]≤B −(p−1)E[∥ξj∥p 2]. Substituting Assumption 4.2: E[∥Mt,grad∥2]≤ 2 B p−1 p E t+1X j=1 cp j(σp g +α ...
-
[9]
Applying Lemma B.1 with moment q: E[∥Mt,corr∥2]≤ 2 B q−1 q t+1X j=1 dq j E[∥Ψj∥q 2] 1/q
Bounding the Correction Noise Term (Mt,corr).Similarly, let dj = (1−β) t+2−j. Applying Lemma B.1 with moment q: E[∥Mt,corr∥2]≤ 2 B q−1 q t+1X j=1 dq j E[∥Ψj∥q 2] 1/q . Using Lemma A.2, we substitute the boundE[∥Ψ j∥q 2]≤C q δ ηq(¯σq h + ¯αq h∥∇f(x j)∥q ∗): E[∥Mt,corr∥2]≤ 2Cδη B q−1 q t+1X j=1 dq j(¯σq h + ¯αq h∥∇f(x j)∥q ∗) 1/q . Applying ...
-
[10]
Initialization Term.The geometric series sum is bounded: T−1X t=0 (1−β) t+1∥e0∥2 ≤ ∥e0∥2 ∞X k=1 (1−β) k =∥e 0∥2 1−β β ≤ ∥e0∥2 β
-
[11]
Summing them simply multiplies byT
Constant Noise Terms.The constant noise terms (involving σg and ¯σh) are independent of t. Summing them simply multiplies byT. Sumσ =T 2β1−1/p p1/pB p−1 p σg + 2ηβ −1/qCδ q1/qB q−1 q ¯σh !
-
[12]
Consider the generic structure S= PT−1 t=0 CPt+1 j=1(1−β) t+1−jE[∥∇f(x j)∥∗]
Affine Growth Terms (Convolution).We handle the recursive sums using the property of discrete convolution. Consider the generic structure S= PT−1 t=0 CPt+1 j=1(1−β) t+1−jE[∥∇f(x j)∥∗]. By swapping the order of summation (PT−1 t=0 Pt+1 j=1 =PT j=1 PT−1 t=j−1), we bound the inner geometric series: T−1X t=j−1 (1−β) t+1−j = T−jX k=0 (1−β) k ≤ ∞X k=0 (1−β) k =...
-
[13]
Final Combination.Combining these parts, dividing the entire inequality byT, we obtain the stated result. D. Convergence Analysis D.1. Unconstrained Optimization (RanSOM-E) Theorem D.1(Convergence of RanSOM-E).Let ∆0 =F(x 0)−F ∗ be the initial functional gap and ∥e0∥2 be the initial momentum error. We define the following problem-dependent constants: Cini...
-
[14]
Let Ccond ≜ B(q−1)/q 32κCδ ¯αh
Step Size V alidity:The step size η must satisfy the following geometric and noise absorption constraints. Let Ccond ≜ B(q−1)/q 32κCδ ¯αh . η≤min 1 2CsρL1 , 1 ρL0 , Ccond ·β .(34) Let¯η= min 1 2CsρL1 , 1 ρL0 denote the constant geometric limit. Result:If we select a sufficiently large initial batch size Binit ≳T p(1−A−K) (p−1)(2A+K) to suppress initializa...
-
[15]
Main Rate (Noise + Bias Balance) + 2 Cinit gapc1/A 1 C −1 cond A 1+A T − p−1 2p−1 | {z }
-
[16]
Constraint Rate (Active ifp>q) + 2 p Cinit gapCsmoothT −1/2 | {z }
-
[17]
Smoothness Rate (Deterministic) + Cinit gap ¯η T −1 | {z }
-
[18]
Geometric Limit Rate + Cinit gap βinit limitCcond T −1 | {z }
-
[19]
Constraint Initialization Rate (35) whereA= p−1 p ,K= 1 q ,c 1 = Cnoise grad p1/pBA ,c 2 = Cnoise hess q1/qBK , andC mom = (cK 1 cA 2 ) 1 A+K . Proof. 1. Explicit Bound Derivation.Starting from the descent inequality and substituting error bounds (as shown in previous derivations), we obtain the fundamental upper bound: R ≤ Cinit gap T η +c 1βA +c 2ηβ −K ...
-
[20]
The optimal β is max(β∗, βcons), where β∗ balances noise/bias and βcons =η/C cond
Optimization of Momentum β.We minimize the momentum error Emom(η, β) =c 1βA +c 2ηβ −K subject to β≥η/C cond. The optimal β is max(β∗, βcons), where β∗ balances noise/bias and βcons =η/C cond. Substituting this back, the effective momentum error function is: Emom(η)≈2C momη A A+K | {z } Ideal Error +c 1C −A condηA | {z } Constraint Error
-
[21]
Optimization of Step Size η.We determine η by balancing the Drift term ( Cinit gap T η ) against every other error source in the bound. The optimalηis the minimum of these candidate step sizes: 1.Ideal Noise Balance: Cinit gap T η = 2Cmomη A A+K =⇒η 1. 18 RanSOM: Randomized Second-Order Momentum 2.Constraint Balance: Cinit gap T η =c 1C −A condηA =⇒η 2. 3...
-
[22]
Final Summation of Rates.The drift term Cinit gap T η dominates the bound. Substituting η= min(η i) into the drift term yields the sum of the inverse rates: Cinit gap Tmin(η i) ≤ X i Cinit gap T ηi . Evaluating each term explicitly: •Term 1 (Main): Cinit gap T η1 = 2Cinit gap(Cinit gap/2Cmom)− A+K 2A+K T − A 2A+K . •Term 2 (Constraint): Cinit gap T η2 = 2...
-
[23]
Main Rate (Noise + Bias Balance) + 2 p Cinit gapCsmoothT −1/2 | {z }
-
[24]
Smoothness Rate + ∆0 T|{z}
-
[25]
Geometric Limit Rate (46) whereA= p−1 p ,K= 1 q , andC mom = (cK 1 cA 2 ) 1 A+K withc 1 = Cnoise grad p1/pBA , c2 = Cnoise hess q1/qBK . Proof. 1. Explicit Bound Derivation.Summing the descent inequality from t= 0 to T−1 and substituting the error bound from Lemma A.2 (simplified for bounded effective noise): X E[G(xt)]≤ ∆0 η +T C smoothη+ 2D X E[∥et∥2]. ...
-
[26]
The optimal momentum is β∗(η) = ( c2 c1 ) 1 A+K η 1 A+K
Optimization of Momentum β.We minimize the momentum error c1βA +c 2ηβ −K. The optimal momentum is β∗(η) = ( c2 c1 ) 1 A+K η 1 A+K . The effective error is2C momη A A+K
-
[27]
Optimization of Step Size η.We balance the Drift term Cinit gap T η against the Momentum Error and the Smoothness term. We takeηas the minimum of the candidates: 1.Main Balance: Cinit gap T η = 2Cmomη A A+K =⇒η 1 ∝T − A+K 2A+K . 2.Smoothness Balance: Cinit gap T η =C smoothη=⇒η 2 = q Cinit gap CsmoothT . 3.Geometric Limit:η≤1. Since the updatex t+1 = (1−s...
-
[28]
This ensures it does not affect the asymptotic rate
Initialization.The term Cinit mom T ηβ is suppressed by choosing Binit large enough so that Cinit mom ≤C init gapβ. This ensures it does not affect the asymptotic rate. Corollary: Main paper Setting (Bounded Moments). In the simplified setting of the main paper (αg = ¯αh = 0,L 1 = 0), the constants simplify significantly. Corollary D.5(Bounded Moments Rat...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.