Lower Bounds and Proximally Anchored SGD for Non-Convex Minimization Under Unbounded Variance
Pith reviewed 2026-05-10 08:33 UTC · model grok-4.3
The pith
Unbounded variance forces Ω(ε^{-6}) stochastic oracle queries to reach ε-stationary points for smooth non-convex functions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under the BG-0 oracle model, locating an ε-stationary point of a smooth non-convex function requires Ω(ε^{-6}) stochastic queries and Ω(ε^{-4}) queries under mean-square smoothness; these bounds are information-theoretic and strictly worse than the Ω(ε^{-4}) and Ω(ε^{-3}) rates that hold when variance is uniformly bounded. PASTA attains matching upper bounds across smooth, mean-square smooth, weakly convex, star-convex, and Polyak-Łojasiewicz regimes while allowing unbounded domains and unbounded stochastic gradients.
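Written side by side (rates as stated in the abstract, with the bounded-variance complexity on the left of each arrow and the BG-0 complexity on the right):

$$\text{smooth:}\quad \Omega(\epsilon^{-4}) \longrightarrow \Omega(\epsilon^{-6}), \qquad \text{mean-square smooth:}\quad \Omega(\epsilon^{-3}) \longrightarrow \Omega(\epsilon^{-4}).$$

Because PASTA matches the BG-0 lower bounds, both rates on the right are tight, i.e., $\Theta(\epsilon^{-6})$ and $\Theta(\epsilon^{-4})$.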
What carries the argument
Proximally Anchored STochastic Approximation (PASTA), which dynamically couples Halpern anchoring and Tikhonov regularization to counteract the quadratic variance growth permitted by the BG-0 oracle.
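The paper's precise update and parameter schedules are not reproduced on this page. The following is a minimal sketch of how a Halpern anchor and a Tikhonov term can be combined in a single SGD step; the step size `eta` and the schedules `beta_t` and `lam_t` are hypothetical placeholders, not the paper's choices:

```python
import numpy as np

def anchored_sgd_sketch(grad_oracle, x0, eta=0.01, T=1000):
    """Illustrative sketch only, not the paper's exact PASTA update.

    grad_oracle(x) is assumed to return a stochastic gradient at x
    (a BG-0 oracle: its variance may grow with distance to the
    solution). Both schedules below are hypothetical placeholders.
    """
    x = x0.copy()
    for t in range(1, T + 1):
        beta_t = 1.0 / (t + 1)    # hypothetical Halpern anchoring weight
        lam_t = 1.0 / np.sqrt(t)  # hypothetical Tikhonov weight
        g = grad_oracle(x)
        # Tikhonov regularization: penalize drift away from the anchor x0,
        # which keeps ||x_t - x0|| (and hence the BG-0 variance term) in check.
        step = x - eta * (g + lam_t * (x - x0))
        # Halpern anchoring: average the step back toward the anchor x0.
        x = beta_t * x0 + (1.0 - beta_t) * step
    return x
```

The design intent, as described in the abstract, is that both pulls toward the anchor jointly cap the iterates' distance from $x_0$, which is exactly the quantity that inflates the variance under the BG-0 oracle.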
If this is right
- PASTA achieves the minimax rates for smooth non-convex minimization under BG-0.
- The same optimal complexities hold for mean-square smooth, weakly convex, star-convex, and Polyak-Łojasiewicz objectives.
- The algorithm operates over unbounded domains with unbounded stochastic gradients.
- The degradation relative to bounded-variance rates is unavoidable under the BG-0 model.
Where Pith is reading between the lines
- Anchoring and regularization may prove useful for other stochastic first-order methods whenever noise intensity depends on iterate location.
- The quadratic-variance model could be checked empirically on large-scale neural-network training trajectories to see whether observed slowdowns match the predicted complexity gap; a sketch of such a check follows this list.
- Extensions of the lower-bound technique to higher-order oracles or to constrained domains would clarify how general the Ω(ε^{-6}) barrier is.
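One way to run the empirical check suggested above: sample stochastic gradients at checkpoints along a training trajectory, estimate the gradient variance at each checkpoint, and regress it against squared distance to a reference point. Everything here (`grad_oracle`, the reference point `x_ref`, the quadratic form of the fit) is an assumption for illustration, not the paper's protocol:

```python
import numpy as np

def fit_quadratic_variance(grad_oracle, checkpoints, x_ref, n_samples=256):
    """Fit total_var(x) ~ b2 + B2 * ||x - x_ref||^2 across checkpoints.

    grad_oracle(x) is assumed to return one stochastic gradient sample
    at x; checkpoints is a list of parameter vectors from a trajectory.
    """
    dist2, total_var = [], []
    for x in checkpoints:
        samples = np.stack([grad_oracle(x) for _ in range(n_samples)])
        total_var.append(samples.var(axis=0).sum())  # trace of the covariance
        dist2.append(np.sum((x - x_ref) ** 2))
    dist2, total_var = np.asarray(dist2), np.asarray(total_var)
    # Least-squares fit: total_var ~ b2 + B2 * dist2.
    design = np.stack([np.ones_like(dist2), dist2], axis=1)
    (b2, B2), *_ = np.linalg.lstsq(design, total_var, rcond=None)
    return b2, B2
```

A good quadratic fit with $B^2 > 0$ would be consistent with the BG-0 model on that trajectory; a flat fit would suggest the classical bounded-variance rates are the relevant ones.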
Load-bearing premise
The stochastic oracle obeys the BG-0 condition in which variance grows at most quadratically with distance to the solution, together with the stated smoothness or mean-square smoothness of the objective.
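For concreteness, one standard way to write a quadratic-growth variance condition, using the constants $b_v, B_v$ that appear in the proof excerpts quoted below (the paper's exact formulation, e.g. the reference point and constants, may differ):

$$\mathbb{E}\big[\|g(x) - \nabla f(x)\|^2\big] \;\le\; b_v^2 + B_v^2\,\|x - x^*\|^2,$$

where $g(x)$ is the stochastic gradient returned by the oracle at $x$ and $x^*$ is a solution; uniformly bounded variance is the special case $B_v = 0$.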
What would settle it
Any algorithm that reaches an ε-stationary point of a smooth non-convex function with o(ε^{-6}) BG-0 oracle calls on a problem whose gradient variance scales quadratically with distance would refute the lower bound.
Original abstract
Analysis of Stochastic Gradient Descent (SGD) and its variants typically relies on the assumption of uniformly bounded variance, a condition that frequently fails in practical non-convex settings, such as neural network training, as well as in several elementary optimization settings. While several relaxations are explored in the literature, the Blum-Gladyshev (BG-0) condition, which permits the variance to grow quadratically with distance, has recently been shown to be the weakest condition. However, the study of the oracle complexity of stochastic first-order non-convex optimization under BG-0 has remained underexplored. In this paper, we address this gap and establish information-theoretic lower bounds, proving that finding an $\epsilon$-stationary point requires $\Omega(\epsilon^{-6})$ stochastic BG-0 oracle queries for smooth functions and $\Omega(\epsilon^{-4})$ queries under mean-square smoothness. These limits demonstrate an unavoidable degradation from classical bounded-variance complexities, i.e., $\Omega(\epsilon^{-4})$ and $\Omega(\epsilon^{-3})$ for smooth and mean-square smooth cases, respectively. To match these lower bounds, we consider Proximally Anchored STochastic Approximation (PASTA), a unified algorithmic framework that couples Halpern anchoring with Tikhonov regularization to dynamically mitigate the extra variance explosion term permitted by the BG-0 oracle. We prove that PASTA achieves minimax optimal complexities across numerous non-convex regimes, including standard smooth, mean-square smooth, weakly convex, star-convex, and Polyak-Łojasiewicz functions, entirely under an unbounded domain and unbounded stochastic gradients.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper establishes information-theoretic lower bounds showing that Ω(ε^{-6}) stochastic BG-0 oracle queries are required to find an ε-stationary point for smooth non-convex functions and Ω(ε^{-4}) queries under mean-square smoothness. It introduces the Proximally Anchored STochastic Approximation (PASTA) algorithm, which couples Halpern anchoring with Tikhonov regularization, and proves that PASTA attains minimax-optimal rates for smooth, mean-square smooth, weakly convex, star-convex, and Polyak-Łojasiewicz regimes under unbounded domains and unbounded stochastic gradients.
Significance. If the results hold, the work is significant for providing the first matching lower and upper bounds in non-convex stochastic optimization under the BG-0 condition (variance growing quadratically with distance), a setting that arises in neural network training and other practical cases. The unified framework across multiple non-convex classes and the explicit demonstration of the complexity degradation relative to bounded-variance rates (Ω(ε^{-4}) and Ω(ε^{-3})) are strengths.
minor comments (3)
- [Introduction] The abstract and introduction compare the new rates to classical bounded-variance complexities, but a summary table listing all regimes (smooth, mean-square smooth, weakly convex, etc.) with both lower and upper bounds would improve readability.
- [Preliminaries] The BG-0 oracle is described as permitting quadratic variance growth, but the precise mathematical definition (including any constants or the exact form of the variance bound) should appear in the preliminaries section before the lower-bound constructions.
- [Algorithm and Analysis] In the algorithmic analysis, the role of the proximal anchoring parameter and the Tikhonov regularization schedule could be clarified with an explicit statement of how they are chosen to cancel the extra variance term.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our work and for recommending minor revision. The summary correctly captures our contributions on information-theoretic lower bounds under the BG-0 condition and the design of the PASTA algorithm achieving matching rates across multiple non-convex regimes. As no specific major comments were raised in the report, we have no individual points to rebut. We will incorporate minor editorial improvements in the revised version to enhance readability and presentation.
Circularity Check
No significant circularity; lower bounds and algorithm rates are independently derived.
full rationale
The paper derives information-theoretic lower bounds through explicit hard-instance constructions under the BG-0 oracle model and proves matching upper bounds for the newly introduced PASTA algorithm via standard Lyapunov analysis and proximal anchoring arguments. No step reduces a claimed result to a fitted parameter, self-definition, or load-bearing self-citation; the BG-0 condition is treated as an external assumption, and the minimax rates follow from oracle query counting and convergence proofs that do not presuppose the target complexities. The derivation chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Objective function satisfies smoothness or mean-square smoothness.
- domain assumption: Stochastic oracle satisfies the BG-0 condition (variance quadratic in distance to solution).
Forward citations
Cited by 1 Pith paper
- SGD for Variational Inference: Tackling Unbounded Variance via Preconditioning and Dynamic Batching
  Proves ELBO solution existence for elliptic location-scale families and convergence guarantees for preconditioned dynamic-batched projected SGD under Blum-Gladyshev conditions in BBVI.
Reference graph
Works this paper leans on
- [1] (3) For all $x \in \mathbb{R}^T$, $\|\nabla \bar{F}_T(x)\|_\infty \le \gamma_\infty$, where $\gamma_\infty = 23$. (4) For all $x \in \mathbb{R}^T$, $\mathrm{prog}_0(\nabla \bar{F}_T(x)) \le \mathrm{prog}_{1/2}(x) + 1$. (5) For all $x \in \mathbb{R}^T$, if $\mathrm{prog}_1(x) < T$, then $\|\nabla \bar{F}_T(x)\| \ge |\nabla_{\mathrm{prog}_1(x)+1} \bar{F}_T(x)| > 1$. To simulate the difficulty of stochastic optimization, the deterministic chain is evaluated using a stochastic oracle that masks the true gradient. A simple extension to Lemma...
- [2] Taking the total expectation, recursively unrolling from $t = 0$ to $K - 1$, and rearranging, $\sum_{t=0}^{K-1} \mathbb{E}[\|\nabla f(x_t)\|^2] \le \frac{2(f(x_0) - \mathbb{E}[f(x_K)])}{\eta} + L\eta K\sigma^2 \le \frac{2\Delta}{\eta} + KL\eta\sigma^2$ (52). Recall that $\eta \le \frac{\epsilon^2}{2L\sigma^2}$ and $K = \left\lceil \frac{4\Delta}{\eta\epsilon^2} \right\rceil$. To find the total oracle complexity, we need to estimate $\|x_t - x_0\|$, which determines the batch size $N_t$. Using PASTA's update rule, $x_t - x_0 = -\eta \sum_{j=0}^{t-1} \nabla f(x_j)$...
- [3] To determine the expected total stochastic oracle complexity, recall (67) and discard the non-positive suboptimality term to bound $\mathbb{E}\|x_{t+1} - x^*\|^2 \le \mathbb{E}\|x_t - x^*\|^2 + \eta^2\sigma^2$. Taking the total expectation and rearranging, $\eta\,\mathbb{E}[f(x_t) - f(x^*)] \le \mathbb{E}\|x_t - x^*\|^2 - \mathbb{E}\|x_{t+1} - x^*\|^2 + \eta^2\sigma^2$ (68). Summing from $t = 0$ to $K - 1$ and discarding the non-positive terms yields $\sum_{t=0}^{K-1} \mathbb{E}[f(x_t) - f(x^*)] \le \frac{\|x_0 - x^*\|^2}{\eta} + K\eta\sigma^2$ (69). Dividing by $K$, by convexity, and noting the definition of the ergodic average $\bar{x}_K$, $\mathbb{E}[f(\bar{x}_K) - f(x^*)] \le \frac{\|x_0 - x^*\|^2}{K\eta} + \eta\sigma^2$ (70). Since...
- [4] Taking the full unconditional expectation over all random variables yields $\mathbb{E}[\Phi_{t+1}] \le \mathbb{E}[\Phi_t] - \frac{\eta}{2}\mathbb{E}[\|\nabla f(x_t)\|^2] - \frac{\eta}{4}\mathbb{E}[\|g_t\|^2] + \frac{\eta\epsilon^2}{4}$ (98). Telescoping this inequality over $K$ iterations and rearranging, $\sum_{t=0}^{K-1} \mathbb{E}[\|\nabla f(x_t)\|^2] + \frac{1}{2}\sum_{t=0}^{K-1} \mathbb{E}[\|g_t\|^2] \le \frac{2\,\mathbb{E}[\Phi_0]}{\eta} + \frac{K\epsilon^2}{2}$ (99). At initialization, we use $N_0$ oracle calls, which means $\mathbb{E}[V_0] \le \frac{\epsilon^2}{2}$; thereby we have $\mathbb{E}[\Phi_0] \le \Delta + \frac{\eta\epsilon^2}{4p}$. ...
- [5] Subtracting $f(x^*)$ from both sides and using the PL condition $-\|\nabla f(x_t)\|^2 \le -2\mu(f(x_t) - f(x^*))$ establishes $\mathbb{E}_t[f(x_{t+1}) - f(x^*)] \le (1 - \eta\mu)(f(x_t) - f(x^*)) + \frac{L\eta^2 B_v^2}{2}\|x_t - x_0\|^2 + \frac{L\eta^2 b_v^2}{2}$ (110). As we discussed in the main paper, for an $L$-smooth function, the PL condition implies QG [39], i.e., $\|x - x^*\|^2 \le \frac{2}{\mu}(f(x) - f(x^*))$. Expanding $\|x_t - x_0\|^2 \le 2\|x_t - x^*\|^2 + 2\|x_0 - x^*\|^2$...
- [6] This requires $\frac{4L\eta B_v^2}{\mu^2}(f(x_0) - f(x^*)) \le \frac{\epsilon}{3}$, $\frac{L\eta b_v^2}{\mu} \le \frac{\epsilon}{3}$, and $K \ge \frac{2}{\eta\mu}\log\big(\frac{3(f(x_0) - f(x^*))}{\epsilon}\big)$. Since $S = 1$ and $N_t = 1$, the stochastic oracle complexity is equivalent to $K$ and evaluates to $K = \max\big\{\frac{2L}{\mu}\log\frac{3\Delta}{\epsilon},\ \frac{6Lb_v^2}{\mu^2\epsilon}\log\frac{3\Delta}{\epsilon},\ \frac{24LB_v^2\Delta}{\mu^3\epsilon}\log\frac{3\Delta}{\epsilon}\big\}$ (114), which is evidently $O(\epsilon^{-1}\log(\epsilon^{-1}))$. ■ (B.7) Proof of Theorem 9. We set $S = 1$, $N$...