pith. sign in

arxiv: 2602.06824 · v2 · pith:AIKUWQ46new · submitted 2026-02-06 · 🧮 math.OC · cs.LG

RanSOM: Second-Order Momentum with Randomized Scaling for Constrained and Unconstrained Optimization

Pith reviewed 2026-05-21 14:02 UTC · model grok-4.3

classification 🧮 math.OC cs.LG
keywords momentum optimizationrandomized stepsStein identitiesstochastic optimizationconstrained optimizationconvergence ratesheavy-tailed noise
0
0 comments X

The pith

Randomized step sizes eliminate curvature bias in momentum methods using a single Hessian-vector product.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Momentum methods suffer from curvature-induced bias when gradients are stochastic, which caps convergence at the suboptimal O(ε^{-4}) rate. The paper replaces the usual deterministic step size with a random draw whose expectation is exactly the target η_t. This change lets Stein-type identities deliver an exact unbiased estimate of the bias term from one Hessian-vector product evaluated together with the gradient. The same idea is specialized to unconstrained problems with exponential random steps and to constrained problems with beta random steps that keep iterates feasible. Analysis shows the resulting methods recover the optimal O(ε^{-3}) rate under bounded noise and optimal rates under heavy-tailed noise.

Core claim

By drawing the step size from a distribution whose mean equals the nominal η_t, Stein-type identities can be applied to obtain an exact, unbiased estimator of the momentum bias term that otherwise arises from curvature interacting with stochastic gradients. The estimator uses only a single Hessian-vector product computed jointly with the gradient and requires no auxiliary queries. The framework is realized as RanSOM-E for unconstrained optimization via exponential distributions and RanSOM-B for constrained optimization via beta distributions that strictly preserve feasibility. Theoretical guarantees establish the optimal O(ε^{-3}) convergence rate for standard bounded noise and optimal rates

What carries the argument

Randomized scaling of the step size (drawn from a distribution with exact mean η_t) combined with Stein-type identities to produce an unbiased momentum-bias estimate from a single Hessian-vector product.

If this is right

  • Momentum methods achieve the optimal O(ε^{-3}) rate instead of the biased O(ε^{-4}) rate in stochastic settings.
  • Bias correction requires no auxiliary gradient queries beyond the single joint Hessian-vector product.
  • The same framework works for constrained optimization while strictly preserving feasibility of the iterates.
  • Optimal rates hold for heavy-tailed noise when the moment index p lies in (1,2].

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The randomization device could be reused to debias other first-order estimators that suffer from curvature-noise interactions.
  • In deep-network training the method may reduce the number of wasted steps without increasing per-iteration cost.
  • Different step-size distributions could be designed for additional constraint types or for noise with specific tail properties.

Load-bearing premise

Steps drawn from a distribution whose mean is exactly η_t suffice for Stein identities to yield an unbiased bias estimate without extra smoothness or boundedness conditions.

What would settle it

An empirical check that the Stein-based bias estimate matches the true bias (computed by averaging many independent realizations) or that observed convergence reaches O(ε^{-3}) on a problem where uncorrected momentum stalls at a worse rate.

Figures

Figures reproduced from arXiv: 2602.06824 by El Mahdi Chayti.

Figure 1
Figure 1. Figure 1: Momentum Bias. The historic momentum mt−1 ap￾proximates ∇f(xt−1), but the optimizer has moved to xt. The deviation (orange vector) is the bias induced by the Hessian ∇2 f. To accelerate convergence, this bias must be corrected. 1.2. The Landscape of Bias Correction To restore the optimal convergence properties of momen￾tum, we must correct the ”lag” in the estimator. Mathe￾matically, this requires construc… view at source ↗
Figure 2
Figure 2. Figure 2: Training Loss (left) and Test Accuracy (right) on the Splice dataset with Welsch regularization. RanSOM-E (Muon) shows the fastest convergence and highest final accuracy [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 4
Figure 4. Figure 4: Average RMSE on Nano MovieLens Matrix Completion. RanSOM-B (green) converges to a lower final error than SFW￾Polyak and SFW-SOM. 6. Discussion and Conclusion Standard momentum methods suffer from curvature￾induced bias in non-convex settings. To address this, we introduced RanSOM, a framework that corrects this bias via randomized integration. By leveraging Stein-type identities, RanSOM constructs an unbia… view at source ↗
read the original abstract

Momentum methods, such as Polyak's Heavy Ball, are the standard for training deep networks but suffer from curvature-induced bias in stochastic settings, limiting convergence to suboptimal $\mathcal{O}(\epsilon^{-4})$ rates. Existing corrections typically require expensive auxiliary sampling or restrictive smoothness assumptions. We propose \textbf{RanSOM}, a unified framework that eliminates this bias by replacing deterministic step sizes with randomized steps drawn from distributions with mean $\eta_t$. This modification allows us to leverage Stein-type identities to compute an exact, unbiased estimate of the momentum bias using a single Hessian-vector product computed jointly with the gradient, avoiding auxiliary queries. We instantiate this framework in two algorithms: \textbf{RanSOM-E} for unconstrained optimization (using exponentially distributed steps) and \textbf{RanSOM-B} for constrained optimization (using beta-distributed steps to strictly preserve feasibility). Theoretical analysis confirms that RanSOM recovers the optimal $\mathcal{O}(\epsilon^{-3})$ convergence rate under standard bounded noise, and achieves optimal rates for heavy-tailed noise settings ($p \in (1, 2]$).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes RanSOM, a unified framework for momentum-based optimization that replaces deterministic step sizes with randomized steps drawn from distributions (exponential for unconstrained, beta for constrained) having exact mean η_t. This enables the use of Stein-type identities to obtain an unbiased estimate of the momentum bias term via a single Hessian-vector product computed jointly with the gradient, without auxiliary queries. Two instantiations are given: RanSOM-E and RanSOM-B. The central theoretical claim is that the resulting algorithms recover the optimal O(ε^{-3}) convergence rate under standard bounded noise and also achieve optimal rates for heavy-tailed noise with p ∈ (1,2].

Significance. If the Stein-identity correction is shown to be exact under the stated assumptions and the convergence proofs are complete, the work would be significant: it offers a practical, low-overhead bias correction for second-order momentum that avoids extra sampling and restrictive smoothness requirements, while extending to constrained problems and heavy-tailed noise. The unified treatment of constrained and unconstrained cases via distribution choice is a strength, as is the explicit claim of recovering the optimal rate with only standard noise assumptions.

major comments (2)
  1. [Theoretical Analysis / proof of the unbiased estimator] Abstract and the main theoretical section deriving the unbiased estimator: the application of the Stein identity to replace E[∇f(x + η v)] with an expression involving only ∇f(x) and one Hessian-vector product assumes that the test function (the momentum map) is C^2 and that boundary terms vanish at 0 and ∞ (or 0 and 1 for the beta case). The manuscript states only bounded noise and the usual Lipschitz-gradient assumption; these do not automatically guarantee the required differentiability or independence between the stochastic gradient and the random step size. If a nonzero remainder remains, the single-HVP unbiased correction fails and the O(ε^{-3}) rate proof does not go through.
  2. [Convergence-rate theorem] The section establishing the O(ε^{-3}) rate: the proof sketch invokes the unbiasedness result to recover the optimal rate, but the argument appears to inherit the same smoothness and boundary-term conditions from the Stein step. A concrete verification (or counter-example) that the boundary terms are zero under only the paper’s noise and Lipschitz assumptions is needed; otherwise the rate claim rests on an unverified extension of the identity.
minor comments (2)
  1. The abstract and introduction would benefit from an explicit statement of the density functions used for the exponential and beta distributions, together with the precise form of the Stein identity applied.
  2. Notation for the random step η_t and the momentum vector v should be introduced with a short table or boxed definition early in the manuscript to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments correctly identify that the current statement of assumptions is insufficient to rigorously justify the application of the Stein identity. We will revise the manuscript to strengthen the theoretical foundations as detailed below.

read point-by-point responses
  1. Referee: Abstract and the main theoretical section deriving the unbiased estimator: the application of the Stein identity to replace E[∇f(x + η v)] with an expression involving only ∇f(x) and one Hessian-vector product assumes that the test function (the momentum map) is C^2 and that boundary terms vanish at 0 and ∞ (or 0 and 1 for the beta case). The manuscript states only bounded noise and the usual Lipschitz-gradient assumption; these do not automatically guarantee the required differentiability or independence between the stochastic gradient and the random step size. If a nonzero remainder remains, the single-HVP unbiased correction fails and the O(ε^{-3}) rate proof does not go through.

    Authors: We agree that the Lipschitz-gradient assumption alone does not suffice for the Stein identity. In the revision we will add the explicit assumption that f is twice continuously differentiable. We will also prove that, for the exponential distribution (unconstrained case) and beta distribution (constrained case), the boundary terms vanish at the endpoints of the support under this smoothness condition together with the existing bounded-noise assumption. Independence between the random step size and the stochastic gradient will be stated explicitly, as the step size is drawn independently at each iteration. These changes will be placed in the assumptions section and the proof of the unbiased estimator. revision: yes

  2. Referee: The section establishing the O(ε^{-3}) rate: the proof sketch invokes the unbiasedness result to recover the optimal rate, but the argument appears to inherit the same smoothness and boundary-term conditions from the Stein step. A concrete verification (or counter-example) that the boundary terms are zero under only the paper’s noise and Lipschitz assumptions is needed; otherwise the rate claim rests on an unverified extension of the identity.

    Authors: We accept that a concrete verification is required. The revised manuscript will include an appendix that directly verifies the vanishing of the boundary terms for both distributions. For the exponential case we use the rapid decay at infinity; for the beta case we exploit the compact support on [0,1]. The verification will be carried out under the added C^2 assumption and the paper’s noise conditions, without introducing stronger smoothness requirements that would limit practical applicability. This will close the gap between the unbiasedness result and the convergence-rate theorem. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation uses external Stein identities under standard assumptions

full rationale

The paper introduces randomized step sizes drawn from distributions with given mean η_t and invokes Stein-type identities to obtain an unbiased momentum bias correction via one Hessian-vector product. These identities are standard external mathematical tools, not defined or derived from the paper's own target rate or fitted quantities. The claimed O(ε^{-3}) rate follows from applying the correction under the usual bounded-noise and Lipschitz-gradient assumptions, which are stated independently of the result itself. No self-definitional equations, fitted inputs renamed as predictions, or load-bearing self-citations appear in the abstract or described derivation chain. The framework is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on standard stochastic optimization assumptions plus the applicability of Stein identities to the chosen step-size distributions; no free parameters or new entities are introduced in the abstract.

axioms (1)
  • domain assumption Standard bounded noise assumption for convergence analysis
    Invoked to obtain the O(ε^{-3}) rate under bounded noise and optimal rates for heavy-tailed noise.

pith-pipeline@v0.9.0 · 5720 in / 1364 out tokens · 62100 ms · 2026-05-21T14:02:36.131979+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages

  1. [1]

    The PDF is f(z) =λe −λz for z≥0

    Exponential Identity (Unconstrained).Let s∼Exp(λ) . The PDF is f(z) =λe −λz for z≥0 . The survival function is: 1−F(z) = Z ∞ z λe−λs ds=e −λz. We observe that the survival function is directly proportional to the PDF: 1−F(z) = 1 λ(λe−λz) = 1 λ f(z). Substituting this into the master equation (18): E[g(s)−g(0)] = Z ∞ 0 g′(z) 1 λ f(z) dz = 1 λ Z ∞ 0 g′(z)f(...

  2. [2]

    The PDF is given by f(s) = 1 B(1,K) (1−s) K−1 for s∈[0,1]

    Beta Identity (Constrained).Let s∼Beta(1, K) . The PDF is given by f(s) = 1 B(1,K) (1−s) K−1 for s∈[0,1] . Since the Beta function isB(1, K) = Γ(1)Γ(K) Γ(1+K) = 1 K , the PDF simplifies to: f(z) =K(1−z) K−1 , z∈[0,1]. The survival function forz∈[0,1]is: 1−F(z) = Z 1 z K(1−s) K−1 ds= −(1−s) K 1 z = (1−z) K. To express the master integral as an expectation,...

  3. [3]

    •RanSOM-E(s t ∼Exp(1/η t),w t =η t): –C s = 2

    Stein Moment Constants:To bound the variance, we require the following normalized moments of the Stein weight wt and step sizes t: Mw ≜E[|w t/ηt|q], M ws ≜E[|w t/ηt +s t/ηt|q].(19) 3.Correction ConstantC δ:We define the aggregate constant used in the error bound as: Cδ ≜2·2 1−1/q ·max n M1/q w , M1/q ws o .(20) Upper Bounds for Specific Distributions. •Ra...

  4. [4]

    BoundingT 1 (Hessian Noise):Using Assumption 4.2 and∥d t∥ ≤ρ: E[∥T1∥q 2]≤E[|w t|q]κqE[∥(Hξ − ∇2f)d t∥q 2 |x t+1] ≤M wηq t ρq (σq h +α q h∥∇f(x t+1)∥q ∗)

  5. [5]

    Taking the expectation and using(a+b) q ≤2 q−1(aq +b q)again: E[∥T2∥q 2]≤E[(|w t|+s t)q]κqρq (L0 +L 1∥∇f(x t+1)∥∗)q ≤M wsηq t κqρq2q−1 (Lq 0 +L q 1∥∇f(x t+1)∥q ∗)

    Bounding T2 (Approximation Error):Using Assumption 4.1 ( L0, L1 smoothness) and the triangle inequality on the integral: ∥T2∥2 ≤κρ |wt|∥∇2f(x t+1)∥op +s t Z 1 0 ∥∇2f(x τ)∥opdτ ≤κρ(|w t|+s t) (L0 +L 1∥∇f(x t+1)∥∗). Taking the expectation and using(a+b) q ≤2 q−1(aq +b q)again: E[∥T2∥q 2]≤E[(|w t|+s t)q]κqρq (L0 +L 1∥∇f(x t+1)∥∗)q ≤M wsηq t κqρq2q−1 (Lq 0 +L...

  6. [6]

    +κ qρq(αq h +L q 1)∥∇f∥ q ∗] ≤C q δ ηq t (¯σq h + ¯αq h∥∇f(x t+1)∥q ∗). B. Auxiliary Lemmas Lemma B.1(Martingale Moment Inequality (V on Bahr & Esseen, 1965)).Let {Xj} be a martingale difference sequence in a Hilbert space. For anyr∈(1,2], there exists a constantC r ≤2such that: E   tX j=1 Xj r 2   ≤C r tX j=1 E[∥Xj∥r 2].(24) Lemma B.2(Affine Noise Pe...

  7. [7]

    ,1yields the result

    We peel off the last term using the sub-additivity of x7→x 1/p (concavity): E[S1/p t |Ft−1]≤(S t−1 +c p t E[∥ξt∥p 2|Ft−1])1/p (Jensen’s Inequality) ≤(S t−1 +c p t σp +c p t αp∥∇f(x t)∥p ∗)1/p ≤(S t−1 +c p t σp)1/p +c tα∥∇f(x t)∥∗ (Sub-additivity) Applying this recursively forj=t, t−1, . . . ,1yields the result. 14 RanSOM: Randomized Second-Order Momentum ...

  8. [8]

    Letc j =β(1−β) t+1−j

    Bounding the Gradient Noise Term (Mt,grad).We apply the Martingale Moment Inequality (Lemma B.1) with moment p. Letc j =β(1−β) t+1−j. E[∥Mt,grad∥2]≤2   t+1X j=1 cp j E[∥¯ξj∥p 2]   1/p . Using the property of batched means forp∈(1,2],E[∥ ¯ξj∥p 2]≤B −(p−1)E[∥ξj∥p 2]. Substituting Assumption 4.2: E[∥Mt,grad∥2]≤ 2 B p−1 p E     t+1X j=1 cp j(σp g +α ...

  9. [9]

    Applying Lemma B.1 with moment q: E[∥Mt,corr∥2]≤ 2 B q−1 q   t+1X j=1 dq j E[∥Ψj∥q 2]   1/q

    Bounding the Correction Noise Term (Mt,corr).Similarly, let dj = (1−β) t+2−j. Applying Lemma B.1 with moment q: E[∥Mt,corr∥2]≤ 2 B q−1 q   t+1X j=1 dq j E[∥Ψj∥q 2]   1/q . Using Lemma A.2, we substitute the boundE[∥Ψ j∥q 2]≤C q δ ηq(¯σq h + ¯αq h∥∇f(x j)∥q ∗): E[∥Mt,corr∥2]≤ 2Cδη B q−1 q   t+1X j=1 dq j(¯σq h + ¯αq h∥∇f(x j)∥q ∗)   1/q . Applying ...

  10. [10]

    Initialization Term.The geometric series sum is bounded: T−1X t=0 (1−β) t+1∥e0∥2 ≤ ∥e0∥2 ∞X k=1 (1−β) k =∥e 0∥2 1−β β ≤ ∥e0∥2 β

  11. [11]

    Summing them simply multiplies byT

    Constant Noise Terms.The constant noise terms (involving σg and ¯σh) are independent of t. Summing them simply multiplies byT. Sumσ =T 2β1−1/p p1/pB p−1 p σg + 2ηβ −1/qCδ q1/qB q−1 q ¯σh !

  12. [12]

    Consider the generic structure S= PT−1 t=0 CPt+1 j=1(1−β) t+1−jE[∥∇f(x j)∥∗]

    Affine Growth Terms (Convolution).We handle the recursive sums using the property of discrete convolution. Consider the generic structure S= PT−1 t=0 CPt+1 j=1(1−β) t+1−jE[∥∇f(x j)∥∗]. By swapping the order of summation (PT−1 t=0 Pt+1 j=1 =PT j=1 PT−1 t=j−1), we bound the inner geometric series: T−1X t=j−1 (1−β) t+1−j = T−jX k=0 (1−β) k ≤ ∞X k=0 (1−β) k =...

  13. [13]

    Final Combination.Combining these parts, dividing the entire inequality byT, we obtain the stated result. D. Convergence Analysis D.1. Unconstrained Optimization (RanSOM-E) Theorem D.1(Convergence of RanSOM-E).Let ∆0 =F(x 0)−F ∗ be the initial functional gap and ∥e0∥2 be the initial momentum error. We define the following problem-dependent constants: Cini...

  14. [14]

    Let Ccond ≜ B(q−1)/q 32κCδ ¯αh

    Step Size V alidity:The step size η must satisfy the following geometric and noise absorption constraints. Let Ccond ≜ B(q−1)/q 32κCδ ¯αh . η≤min 1 2CsρL1 , 1 ρL0 , Ccond ·β .(34) Let¯η= min 1 2CsρL1 , 1 ρL0 denote the constant geometric limit. Result:If we select a sufficiently large initial batch size Binit ≳T p(1−A−K) (p−1)(2A+K) to suppress initializa...

  15. [15]

    Main Rate (Noise + Bias Balance) + 2 Cinit gapc1/A 1 C −1 cond A 1+A T − p−1 2p−1 | {z }

  16. [16]

    Constraint Rate (Active ifp>q) + 2 p Cinit gapCsmoothT −1/2 | {z }

  17. [17]

    Smoothness Rate (Deterministic) + Cinit gap ¯η T −1 | {z }

  18. [18]

    Geometric Limit Rate + Cinit gap βinit limitCcond T −1 | {z }

  19. [19]

    Constraint Initialization Rate (35) whereA= p−1 p ,K= 1 q ,c 1 = Cnoise grad p1/pBA ,c 2 = Cnoise hess q1/qBK , andC mom = (cK 1 cA 2 ) 1 A+K . Proof. 1. Explicit Bound Derivation.Starting from the descent inequality and substituting error bounds (as shown in previous derivations), we obtain the fundamental upper bound: R ≤ Cinit gap T η +c 1βA +c 2ηβ −K ...

  20. [20]

    The optimal β is max(β∗, βcons), where β∗ balances noise/bias and βcons =η/C cond

    Optimization of Momentum β.We minimize the momentum error Emom(η, β) =c 1βA +c 2ηβ −K subject to β≥η/C cond. The optimal β is max(β∗, βcons), where β∗ balances noise/bias and βcons =η/C cond. Substituting this back, the effective momentum error function is: Emom(η)≈2C momη A A+K | {z } Ideal Error +c 1C −A condηA | {z } Constraint Error

  21. [21]

    The optimalηis the minimum of these candidate step sizes: 1.Ideal Noise Balance: Cinit gap T η = 2Cmomη A A+K =⇒η 1

    Optimization of Step Size η.We determine η by balancing the Drift term ( Cinit gap T η ) against every other error source in the bound. The optimalηis the minimum of these candidate step sizes: 1.Ideal Noise Balance: Cinit gap T η = 2Cmomη A A+K =⇒η 1. 18 RanSOM: Randomized Second-Order Momentum 2.Constraint Balance: Cinit gap T η =c 1C −A condηA =⇒η 2. 3...

  22. [22]

    Substituting η= min(η i) into the drift term yields the sum of the inverse rates: Cinit gap Tmin(η i) ≤ X i Cinit gap T ηi

    Final Summation of Rates.The drift term Cinit gap T η dominates the bound. Substituting η= min(η i) into the drift term yields the sum of the inverse rates: Cinit gap Tmin(η i) ≤ X i Cinit gap T ηi . Evaluating each term explicitly: •Term 1 (Main): Cinit gap T η1 = 2Cinit gap(Cinit gap/2Cmom)− A+K 2A+K T − A 2A+K . •Term 2 (Constraint): Cinit gap T η2 = 2...

  23. [23]

    Main Rate (Noise + Bias Balance) + 2 p Cinit gapCsmoothT −1/2 | {z }

  24. [24]

    Smoothness Rate + ∆0 T|{z}

  25. [25]

    Geometric Limit Rate (46) whereA= p−1 p ,K= 1 q , andC mom = (cK 1 cA 2 ) 1 A+K withc 1 = Cnoise grad p1/pBA , c2 = Cnoise hess q1/qBK . Proof. 1. Explicit Bound Derivation.Summing the descent inequality from t= 0 to T−1 and substituting the error bound from Lemma A.2 (simplified for bounded effective noise): X E[G(xt)]≤ ∆0 η +T C smoothη+ 2D X E[∥et∥2]. ...

  26. [26]

    The optimal momentum is β∗(η) = ( c2 c1 ) 1 A+K η 1 A+K

    Optimization of Momentum β.We minimize the momentum error c1βA +c 2ηβ −K. The optimal momentum is β∗(η) = ( c2 c1 ) 1 A+K η 1 A+K . The effective error is2C momη A A+K

  27. [27]

    We takeηas the minimum of the candidates: 1.Main Balance: Cinit gap T η = 2Cmomη A A+K =⇒η 1 ∝T − A+K 2A+K

    Optimization of Step Size η.We balance the Drift term Cinit gap T η against the Momentum Error and the Smoothness term. We takeηas the minimum of the candidates: 1.Main Balance: Cinit gap T η = 2Cmomη A A+K =⇒η 1 ∝T − A+K 2A+K . 2.Smoothness Balance: Cinit gap T η =C smoothη=⇒η 2 = q Cinit gap CsmoothT . 3.Geometric Limit:η≤1. Since the updatex t+1 = (1−s...

  28. [28]

    This ensures it does not affect the asymptotic rate

    Initialization.The term Cinit mom T ηβ is suppressed by choosing Binit large enough so that Cinit mom ≤C init gapβ. This ensures it does not affect the asymptotic rate. Corollary: Main paper Setting (Bounded Moments). In the simplified setting of the main paper (αg = ¯αh = 0,L 1 = 0), the constants simplify significantly. Corollary D.5(Bounded Moments Rat...