Lower Bounds and Proximally Anchored SGD for Non-Convex Minimization Under Unbounded Variance
Pith reviewed 2026-05-10 08:33 UTC · model grok-4.3
The pith
Unbounded variance forces Ω(ε^{-6}) stochastic oracle queries to reach ε-stationary points for smooth non-convex functions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under the BG-0 oracle model, locating an ε-stationary point of a smooth non-convex function requires Ω(ε^{-6}) stochastic queries and Ω(ε^{-4}) queries under mean-square smoothness; these bounds are information-theoretic and strictly worse than the Ω(ε^{-4}) and Ω(ε^{-3}) rates that hold when variance is uniformly bounded. PASTA attains matching upper bounds across smooth, mean-square smooth, weakly convex, star-convex, and Polyak-Łojasiewicz regimes while allowing unbounded domains and unbounded stochastic gradients.
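Written side by side (rates as stated in the abstract, with the bounded-variance complexity on the left of each arrow and the BG-0 complexity on the right):

$$\text{smooth:}\quad \Omega(\epsilon^{-4}) \longrightarrow \Omega(\epsilon^{-6}), \qquad \text{mean-square smooth:}\quad \Omega(\epsilon^{-3}) \longrightarrow \Omega(\epsilon^{-4}).$$

Because PASTA matches the BG-0 lower bounds, both rates on the right are tight, i.e., $\Theta(\epsilon^{-6})$ and $\Theta(\epsilon^{-4})$.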
What carries the argument
Proximally Anchored STochastic Approximation (PASTA), which dynamically couples Halpern anchoring and Tikhonov regularization to counteract the quadratic variance growth permitted by the BG-0 oracle.
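The paper's precise update and parameter schedules are not reproduced on this page. The following is a minimal sketch of how a Halpern anchor and a Tikhonov term can be combined in a single SGD step; the step size `eta` and the schedules `beta_t` and `lam_t` are hypothetical placeholders, not the paper's choices:

```python
import numpy as np

def anchored_sgd_sketch(grad_oracle, x0, eta=0.01, T=1000):
    """Illustrative sketch only, not the paper's exact PASTA update.

    grad_oracle(x) is assumed to return a stochastic gradient at x
    (a BG-0 oracle: its variance may grow with distance to the
    solution). Both schedules below are hypothetical placeholders.
    """
    x = x0.copy()
    for t in range(1, T + 1):
        beta_t = 1.0 / (t + 1)    # hypothetical Halpern anchoring weight
        lam_t = 1.0 / np.sqrt(t)  # hypothetical Tikhonov weight
        g = grad_oracle(x)
        # Tikhonov regularization: penalize drift away from the anchor x0,
        # which keeps ||x_t - x0|| (and hence the BG-0 variance term) in check.
        step = x - eta * (g + lam_t * (x - x0))
        # Halpern anchoring: average the step back toward the anchor x0.
        x = beta_t * x0 + (1.0 - beta_t) * step
    return x
```

The design intent, as described in the abstract, is that both pulls toward the anchor jointly cap the iterates' distance from $x_0$, which is exactly the quantity that inflates the variance under the BG-0 oracle.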
If this is right
- PASTA achieves the minimax rates for smooth non-convex minimization under BG-0.
- The same optimal complexities hold for mean-square smooth, weakly convex, star-convex, and Polyak-Łojasiewicz objectives.
- The algorithm operates over unbounded domains with unbounded stochastic gradients.
- The degradation relative to bounded-variance rates is unavoidable under the BG-0 model.
Where Pith is reading between the lines
- Anchoring and regularization may prove useful for other stochastic first-order methods whenever noise intensity depends on iterate location.
- The quadratic-variance model could be checked empirically on large-scale neural-network training trajectories to see whether observed slowdowns match the predicted complexity gap; a sketch of such a check follows this list.
- Extensions of the lower-bound technique to higher-order oracles or to constrained domains would clarify how general the Ω(ε^{-6}) barrier is.
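One way to run the empirical check suggested above: sample stochastic gradients at checkpoints along a training trajectory, estimate the gradient variance at each checkpoint, and regress it against squared distance to a reference point. Everything here (`grad_oracle`, the reference point `x_ref`, the quadratic form of the fit) is an assumption for illustration, not the paper's protocol:

```python
import numpy as np

def fit_quadratic_variance(grad_oracle, checkpoints, x_ref, n_samples=256):
    """Fit total_var(x) ~ b2 + B2 * ||x - x_ref||^2 across checkpoints.

    grad_oracle(x) is assumed to return one stochastic gradient sample
    at x; checkpoints is a list of parameter vectors from a trajectory.
    """
    dist2, total_var = [], []
    for x in checkpoints:
        samples = np.stack([grad_oracle(x) for _ in range(n_samples)])
        total_var.append(samples.var(axis=0).sum())  # trace of the covariance
        dist2.append(np.sum((x - x_ref) ** 2))
    dist2, total_var = np.asarray(dist2), np.asarray(total_var)
    # Least-squares fit: total_var ~ b2 + B2 * dist2.
    design = np.stack([np.ones_like(dist2), dist2], axis=1)
    (b2, B2), *_ = np.linalg.lstsq(design, total_var, rcond=None)
    return b2, B2
```

A good quadratic fit with $B^2 > 0$ would be consistent with the BG-0 model on that trajectory; a flat fit would suggest the classical bounded-variance rates are the relevant ones.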
Load-bearing premise
The stochastic oracle obeys the BG-0 condition in which variance grows at most quadratically with distance to the solution, together with the stated smoothness or mean-square smoothness of the objective.
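For concreteness, one standard way to write a quadratic-growth variance condition, using the constants $b_v, B_v$ that appear in the proof excerpts quoted below (the paper's exact formulation, e.g. the reference point and constants, may differ):

$$\mathbb{E}\big[\|g(x) - \nabla f(x)\|^2\big] \;\le\; b_v^2 + B_v^2\,\|x - x^*\|^2,$$

where $g(x)$ is the stochastic gradient returned by the oracle at $x$ and $x^*$ is a solution; uniformly bounded variance is the special case $B_v = 0$.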
What would settle it
Any algorithm that reaches an ε-stationary point of a smooth non-convex function with o(ε^{-6}) BG-0 oracle calls on a problem whose gradient variance scales quadratically with distance would refute the lower bound.
Original abstract
Analysis of Stochastic Gradient Descent (SGD) and its variants typically relies on the assumption of uniformly bounded variance, a condition that frequently fails in practical non-convex settings, such as neural network training, as well as in several elementary optimization settings. While several relaxations are explored in the literature, the Blum-Gladyshev (BG-0) condition, which permits the variance to grow quadratically with distance, has recently been shown to be the weakest condition. However, the study of the oracle complexity of stochastic first-order non-convex optimization under BG-0 has remained underexplored. In this paper, we address this gap and establish information-theoretic lower bounds, proving that finding an $\epsilon$-stationary point requires $\Omega(\epsilon^{-6})$ stochastic BG-0 oracle queries for smooth functions and $\Omega(\epsilon^{-4})$ queries under mean-square smoothness. These limits demonstrate an unavoidable degradation from classical bounded-variance complexities, i.e., $\Omega(\epsilon^{-4})$ and $\Omega(\epsilon^{-3})$ for smooth and mean-square smooth cases, respectively. To match these lower bounds, we consider Proximally Anchored STochastic Approximation (PASTA), a unified algorithmic framework that couples Halpern anchoring with Tikhonov regularization to dynamically mitigate the extra variance explosion term permitted by the BG-0 oracle. We prove that PASTA achieves minimax optimal complexities across numerous non-convex regimes, including standard smooth, mean-square smooth, weakly convex, star-convex, and Polyak-Łojasiewicz functions, entirely under an unbounded domain and unbounded stochastic gradients.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper establishes information-theoretic lower bounds showing that Ω(ε^{-6}) stochastic BG-0 oracle queries are required to find an ε-stationary point for smooth non-convex functions and Ω(ε^{-4}) queries under mean-square smoothness. It introduces the Proximally Anchored STochastic Approximation (PASTA) algorithm, which couples Halpern anchoring with Tikhonov regularization, and proves that PASTA attains minimax-optimal rates for smooth, mean-square smooth, weakly convex, star-convex, and Polyak-Łojasiewicz regimes under unbounded domains and unbounded stochastic gradients.
Significance. If the results hold, the work is significant for providing the first matching lower and upper bounds in non-convex stochastic optimization under the BG-0 condition (variance growing quadratically with distance), a setting that arises in neural network training and other practical cases. The unified framework across multiple non-convex classes and the explicit demonstration of the complexity degradation relative to bounded-variance rates (Ω(ε^{-4}) and Ω(ε^{-3})) are strengths.
minor comments (3)
- [Introduction] The abstract and introduction compare the new rates to classical bounded-variance complexities, but a summary table listing all regimes (smooth, mean-square smooth, weakly convex, etc.) with both lower and upper bounds would improve readability.
- [Preliminaries] The BG-0 oracle is described as permitting quadratic variance growth, but the precise mathematical definition (including any constants or the exact form of the variance bound) should appear in the preliminaries section before the lower-bound constructions.
- [Algorithm and Analysis] In the algorithmic analysis, the role of the proximal anchoring parameter and the Tikhonov regularization schedule could be clarified with an explicit statement of how they are chosen to cancel the extra variance term.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our work and for recommending minor revision. The summary correctly captures our contributions on information-theoretic lower bounds under the BG-0 condition and the design of the PASTA algorithm achieving matching rates across multiple non-convex regimes. As no specific major comments were raised in the report, we have no individual points to rebut. We will incorporate minor editorial improvements in the revised version to enhance readability and presentation.
Circularity Check
No significant circularity; lower bounds and algorithm rates are independently derived.
full rationale
The paper derives information-theoretic lower bounds through explicit hard-instance constructions under the BG-0 oracle model and proves matching upper bounds for the newly introduced PASTA algorithm via standard Lyapunov analysis and proximal anchoring arguments. No step reduces a claimed result to a fitted parameter, self-definition, or load-bearing self-citation; the BG-0 condition is treated as an external assumption, and the minimax rates follow from oracle query counting and convergence proofs that do not presuppose the target complexities. The derivation chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Objective function satisfies smoothness or mean-square smoothness.
- domain assumption: Stochastic oracle satisfies the BG-0 condition (variance quadratic in distance to solution).
Forward citations
Cited by 1 Pith paper
- SGD for Variational Inference: Tackling Unbounded Variance via Preconditioning and Dynamic Batching
  Proves ELBO solution existence for elliptic location-scale families and convergence guarantees for preconditioned dynamic-batched projected SGD under Blum-Gladyshev conditions in BBVI.
Reference graph
Works this paper leans on
- [1] (3) For all $x \in \mathbb{R}^T$, $\|\nabla \bar{F}_T(x)\|_\infty \le \gamma_\infty$, where $\gamma_\infty = 23$. (4) For all $x \in \mathbb{R}^T$, $\mathrm{prog}_0(\nabla \bar{F}_T(x)) \le \mathrm{prog}_{1/2}(x) + 1$. (5) For all $x \in \mathbb{R}^T$, if $\mathrm{prog}_1(x) < T$, then $\|\nabla \bar{F}_T(x)\| \ge |\nabla_{\mathrm{prog}_1(x)+1} \bar{F}_T(x)| > 1$. To simulate the difficulty of stochastic optimization, the deterministic chain is evaluated using a stochastic oracle that masks the true gradient. A simple extension to Lemma...
- [2] Taking the total expectation, recursively unrolling from $t = 0$ to $K - 1$, and rearranging, $\sum_{t=0}^{K-1} \mathbb{E}[\|\nabla f(x_t)\|^2] \le \frac{2(f(x_0) - \mathbb{E}[f(x_K)])}{\eta} + L\eta K\sigma^2 \le \frac{2\Delta}{\eta} + KL\eta\sigma^2$ (52). Recall that $\eta \le \frac{\epsilon^2}{2L\sigma^2}$ and $K = \left\lceil \frac{4\Delta}{\eta\epsilon^2} \right\rceil$. To find the total oracle complexity, we need to estimate $\|x_t - x_0\|$, which determines the batch size $N_t$. Using PASTA's update rule, $x_t - x_0 = -\eta \sum_{j=0}^{t-1} \nabla f(x_j)$...
- [3] To determine the expected total stochastic oracle complexity, recall (67) and discard the non-positive suboptimality term to bound $\mathbb{E}\|x_{t+1} - x^*\|^2 \le \mathbb{E}\|x_t - x^*\|^2 + \eta^2\sigma^2$. Taking the total expectation and rearranging, $\eta\,\mathbb{E}[f(x_t) - f(x^*)] \le \mathbb{E}\|x_t - x^*\|^2 - \mathbb{E}\|x_{t+1} - x^*\|^2 + \eta^2\sigma^2$ (68). Summing from $t = 0$ to $K - 1$ and discarding the non-positive terms yields $\sum_{t=0}^{K-1} \mathbb{E}[f(x_t) - f(x^*)] \le \frac{\|x_0 - x^*\|^2}{\eta} + K\eta\sigma^2$ (69). Dividing by $K$, by convexity, and noting the definition of the ergodic average $\bar{x}_K$, $\mathbb{E}[f(\bar{x}_K) - f(x^*)] \le \frac{\|x_0 - x^*\|^2}{K\eta} + \eta\sigma^2$ (70). Since...
- [4] Taking the full unconditional expectation over all random variables yields $\mathbb{E}[\Phi_{t+1}] \le \mathbb{E}[\Phi_t] - \frac{\eta}{2}\mathbb{E}[\|\nabla f(x_t)\|^2] - \frac{\eta}{4}\mathbb{E}[\|g_t\|^2] + \frac{\eta\epsilon^2}{4}$ (98). Telescoping this inequality over $K$ iterations and rearranging, $\sum_{t=0}^{K-1} \mathbb{E}[\|\nabla f(x_t)\|^2] + \frac{1}{2}\sum_{t=0}^{K-1} \mathbb{E}[\|g_t\|^2] \le \frac{2\,\mathbb{E}[\Phi_0]}{\eta} + \frac{K\epsilon^2}{2}$ (99). At initialization, we use $N_0$ oracle calls, which means $\mathbb{E}[V_0] \le \frac{\epsilon^2}{2}$; thereby we have $\mathbb{E}[\Phi_0] \le \Delta + \frac{\eta\epsilon^2}{4p}$. ...
- [5] Subtracting $f(x^*)$ from both sides and using the PL condition $-\|\nabla f(x_t)\|^2 \le -2\mu(f(x_t) - f(x^*))$ establishes $\mathbb{E}_t[f(x_{t+1}) - f(x^*)] \le (1 - \eta\mu)(f(x_t) - f(x^*)) + \frac{L\eta^2 B_v^2}{2}\|x_t - x_0\|^2 + \frac{L\eta^2 b_v^2}{2}$ (110). As we discussed in the main paper, for an $L$-smooth function, the PL condition implies QG [39], i.e., $\|x - x^*\|^2 \le \frac{2}{\mu}(f(x) - f(x^*))$. Expanding $\|x_t - x_0\|^2 \le 2\|x_t - x^*\|^2 + 2\|x_0 - x^*\|^2$...
- [6] This requires $\frac{4L\eta B_v^2}{\mu^2}(f(x_0) - f(x^*)) \le \frac{\epsilon}{3}$, $\frac{L\eta b_v^2}{\mu} \le \frac{\epsilon}{3}$, and $K \ge \frac{2}{\eta\mu}\log\big(\frac{3(f(x_0) - f(x^*))}{\epsilon}\big)$. Since $S = 1$ and $N_t = 1$, the stochastic oracle complexity is equivalent to $K$ and evaluates to $K = \max\big\{\frac{2L}{\mu}\log\frac{3\Delta}{\epsilon},\ \frac{6Lb_v^2}{\mu^2\epsilon}\log\frac{3\Delta}{\epsilon},\ \frac{24LB_v^2\Delta}{\mu^3\epsilon}\log\frac{3\Delta}{\epsilon}\big\}$ (114), which is evidently $O(\epsilon^{-1}\log(\epsilon^{-1}))$. ■ (B.7) Proof of Theorem 9. We set $S = 1$, $N$...