SGD for Variational Inference: Tackling Unbounded Variance via Preconditioning and Dynamic Batching
Pith reviewed 2026-05-11 02:26 UTC · model grok-4.3
The pith
Preconditioning combined with dynamic batching enables convergence of projected SGD for black-box variational inference under unbounded gradient variance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For parameterized distributions in the elliptic location-scale family, the ELBO maximization problem admits a solution. Moreover, minibatch projected SGD with dynamic batching and preconditioning converges to this solution, with both finite-time rates and asymptotic guarantees, even though the stochastic gradients satisfy only the Blum-Gladyshev condition of quadratically growing variance rather than bounded variance.
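For orientation, a minimal rendering of the objective and of the family in question, assuming the usual BBVI setup (the paper's exact parameterization may differ):

```latex
% ELBO for a variational family q_{m,C} over latents z, with model p(x, z):
\mathrm{ELBO}(m, C) \;=\; \mathbb{E}_{q_{m,C}}\!\big[\log p(x, z) - \log q_{m,C}(z)\big].
% Elliptic location-scale family: an affine push-forward of a fixed
% spherically symmetric base density \varphi,
z \;=\; m + C\,u, \qquad u \sim \varphi .
```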
What carries the argument
Minibatch Projected SGD equipped with dynamic batching and preconditioning, which together tame the quadratic variance growth permitted by the Blum-Gladyshev condition for elliptic location-scale distributions.
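A minimal sketch of the carrier algorithm, assuming a Euclidean ball as the projection set, a fixed preconditioner `Lambda`, a geometrically growing batch schedule, and a user-supplied single-sample gradient oracle `grad_estimate`; the paper's theorems prescribe the actual schedules, preconditioner, and projection set, which are not reproduced here.

```python
import numpy as np

def project_ball(theta, center, radius):
    """Euclidean projection onto the ball {theta : ||theta - center|| <= radius}."""
    diff = theta - center
    norm = np.linalg.norm(diff)
    return theta if norm <= radius else center + radius * diff / norm

def mpsgd(grad_estimate, theta0, Lambda, step, radius, n_iters, batch0=8, growth=1.1):
    """Minibatch projected SGD with preconditioning and dynamic (growing) batches.

    grad_estimate(theta, rng) returns one stochastic gradient sample of the
    negative ELBO at theta; averaging a growing minibatch offsets the
    quadratically growing variance allowed by the Blum-Gladyshev condition.
    """
    rng = np.random.default_rng(0)
    theta = theta0.copy()
    center = theta0.copy()
    Lambda_inv = np.linalg.inv(Lambda)            # preconditioner applied to the averaged gradient
    for k in range(n_iters):
        batch = max(1, int(batch0 * growth ** k))  # dynamic batching: N_k grows with k
        grads = np.stack([grad_estimate(theta, rng) for _ in range(batch)])
        g = grads.mean(axis=0)
        theta = project_ball(theta - step * Lambda_inv @ g, center, radius)
    return theta
```

The growing batch size is what counteracts the variance term that scales with the squared distance to the optimum.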
If this is right
- Existence of an ELBO minimizer is established for the broad class of elliptic location-scale families, grounding a common assumption.
- Finite-time convergence rates are provided for the preconditioned and dynamically batched algorithm.
- Asymptotic convergence is guaranteed under the BG condition without needing bounded variance.
- Practical black-box variational inference on complex models gains theoretical support when these enhancements are used.
Where Pith is reading between the lines
- Similar preconditioning and batching strategies could extend convergence guarantees to other stochastic optimization settings with quadratic variance growth.
- This approach might improve scalability of variational inference in high-dimensional or non-convex landscapes by reducing the impact of distant high-variance regions.
- Testing the method on distributions outside the elliptic family could reveal how broadly the BG condition suffices for convergence.
Load-bearing premise
The parameterized distributions must belong to the elliptic location-scale family and the stochastic gradients must satisfy the Blum-Gladyshev condition with variance growing quadratically away from the optimum.
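Spelled out, this premise is an inequality of roughly the following form (the norm and the constants are placeholders; the paper may state it in a preconditioned norm):

```latex
% Blum-Gladyshev (BG) condition: the stochastic-gradient variance may grow
% quadratically with the distance to the optimum \theta^*, for some a, b \ge 0:
\mathbb{E}\big[\,\|\widehat{g}(\theta) - \nabla F(\theta)\|^{2}\,\big]
\;\le\; a\,\|\theta - \theta^{*}\|^{2} \;+\; b .
```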
What would settle it
A counterexample where the ELBO has no solution for an elliptic location-scale distribution, or an experiment showing that the algorithm diverges when dynamic batching and preconditioning are removed but variance still grows quadratically.
Original abstract
Black-Box Variational Inference (BBVI) typically relies on Stochastic Gradient Descent (SGD) to optimize the Evidence Lower Bound (ELBO). However, the stochastic gradients in BBVI inherently exhibit unbounded variance, violating standard assumptions and instead satisfying the weaker Blum-Gladyshev (BG) condition, where variance grows quadratically with distance from the optimum. In this paper, we bridge the gap between stochastic optimization theory and the practical instances of BBVI. Focusing on the broad elliptic location-scale family of parameterized distributions, we offer two main contributions. First, we prove the existence of an ELBO solution, a foundational property usually assumed a priori in the literature. Second, we establish comprehensive convergence guarantees spanning finite-time and asymptotic regimes for Minibatch Projected SGD (PSGD) equipped with dynamic batching and preconditioning under the BG condition. Our theoretical framework demonstrates that dynamic batching combined with preconditioning systematically enables rigorous guarantees even in complex settings. We illustrate our theoretical findings with numerical results, highlighting the efficacy of our approach for modern inference tasks.
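To see why the variance is unbounded, here is a toy sketch, assuming a one-dimensional Gaussian variational family, a standard-normal target, and the reparameterization-trick gradient of the negative ELBO; the names and the toy target are illustrative, not the paper's experiments. The estimated second moment grows quadratically with the distance of the variational mean from the optimum, which is exactly the BG-type growth described above.

```python
import numpy as np

def neg_elbo_grad_sample(m, s, rng):
    """One reparameterization-trick gradient sample of the negative ELBO
    for q = N(m, s^2) against the toy target p = N(0, 1)."""
    u = rng.standard_normal()
    z = m + s * u              # reparameterized draw z ~ q
    g_m = z                    # d/dm of -log p(z); the entropy term has no m-dependence
    g_s = z * u - 1.0 / s      # d/ds of -log p(z) plus d/ds of (-entropy) = -1/s
    return np.array([g_m, g_s])

rng = np.random.default_rng(0)
for m in [0.0, 1.0, 4.0, 16.0]:  # move the variational mean away from the optimum m* = 0
    grads = np.stack([neg_elbo_grad_sample(m, 1.0, rng) for _ in range(20_000)])
    print(f"||m - m*|| = {m:5.1f}   E||g||^2 ~ {np.mean(np.sum(grads**2, axis=1)):8.1f}")
# The second moment scales like ||m - m*||^2: bounded-variance assumptions fail,
# but a quadratic-growth (BG-type) bound still holds.
```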
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proves existence of an ELBO maximizer for variational families in the elliptic location-scale class and derives finite-time plus asymptotic convergence guarantees for minibatch projected SGD equipped with dynamic batching and preconditioning, under the Blum-Gladyshev condition that stochastic-gradient variance grows quadratically away from the optimum.
Significance. If the proofs close the gap between the existence result and the projection set, the work supplies the first rigorous justification for using SGD on BBVI objectives whose variance is provably unbounded, while showing how dynamic batching and preconditioning restore standard convergence rates; this directly addresses a long-standing theoretical obstacle in the field.
major comments (2)
- [Existence theorem and PSGD convergence analysis] The existence argument (abstract and §3) establishes attainment of the ELBO maximum via coercivity/continuity on the elliptic location-scale family but supplies no explicit radius (in terms of data norms, dimension, or model parameters) that would guarantee the maximizer lies inside any chosen compact projection set. Consequently the finite-time and asymptotic bounds derived for PSGD under the BG condition apply only to the projected problem, not necessarily to the original ELBO solution whose existence was proven.
- [Convergence theorems] The BG-condition analysis and dynamic-batching schedule (presumably §4–5) are carried out inside the projected domain; without a concrete inclusion guarantee, the claimed transfer of guarantees to the unprojected ELBO optimization does not hold, undermining the central claim that the algorithm solves the original inference problem.
minor comments (2)
- [Notation and preliminaries] Notation for the preconditioner and batch-size schedule should be introduced once and used consistently; several symbols are used in the abstract but defined only later in the text.
- [Experiments] The numerical experiments would be strengthened by reporting the chosen projection radius and verifying that the recovered solution lies inside it for the reported runs.
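A minimal post-hoc check of the kind suggested here, assuming the projection set is a Euclidean ball of radius `R` around the initialization (the paper's actual projection set may differ):

```python
import numpy as np

def projection_report(theta_hat, center, radius, margin=1e-6):
    """Report whether the recovered solution lies strictly inside the projection ball,
    i.e. whether the projection constraint was inactive at the returned iterate."""
    dist = np.linalg.norm(theta_hat - center)
    interior = dist < radius - margin
    print(f"projection radius R = {radius:g}, ||theta_hat - center|| = {dist:g}, "
          f"{'interior (constraint inactive)' if interior else 'on/near the boundary'}")
    return interior
```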
Simulated Author's Rebuttal
We thank the referee for their careful reading and constructive comments, which help clarify the relationship between our existence result and the convergence analysis. We respond point by point below.
Point-by-point responses
- Referee: [Existence theorem and PSGD convergence analysis] The existence argument (abstract and §3) establishes attainment of the ELBO maximum via coercivity/continuity on the elliptic location-scale family but supplies no explicit radius (in terms of data norms, dimension, or model parameters) that would guarantee the maximizer lies inside any chosen compact projection set. Consequently the finite-time and asymptotic bounds derived for PSGD under the BG condition apply only to the projected problem, not necessarily to the original ELBO solution whose existence was proven.
Authors: We agree that the existence proof in Section 3 relies on coercivity of the negative ELBO over the unbounded domain and does not supply an explicit radius guaranteeing the maximizer lies inside a pre-specified compact set. The finite-time and asymptotic convergence results in Sections 4–5 are therefore derived for the projected SGD algorithm. In the revised manuscript we will add a dedicated paragraph in Section 3 (and a corresponding remark in the introduction) explaining that the coercivity property used to prove existence also implies the existence of a sufficiently large compact set containing the maximizer; the projection radius can therefore be chosen arbitrarily large so that the projected and unprojected problems coincide. We will not claim an explicit, data-dependent radius, as that would require quantitative bounds on the coercivity that are not available under the stated assumptions. (A schematic of this coercivity argument is sketched after this list.) revision: partial
- Referee: [Convergence theorems] The BG-condition analysis and dynamic-batching schedule (presumably §4–5) are carried out inside the projected domain; without a concrete inclusion guarantee, the claimed transfer of guarantees to the unprojected ELBO optimization does not hold, undermining the central claim that the algorithm solves the original inference problem.
Authors: We acknowledge that the convergence theorems apply directly to the projected problem. The manuscript’s central claim is that dynamic batching and preconditioning restore standard convergence rates for projected SGD under the BG condition that is known to hold for BBVI objectives; the projection is an explicit algorithmic component introduced precisely to make the optimization well-posed. In the revision we will update the abstract, introduction, and conclusion to state more precisely that the guarantees are for the projected algorithm, which recovers the original ELBO maximizer once the projection set is chosen large enough to contain it (as justified by the coercivity argument). This framing keeps the contribution focused on overcoming the unbounded-variance obstacle while remaining accurate about the role of projection. revision: partial
- Left open by the rebuttal: deriving an explicit, finite radius for the ELBO maximizer in terms of data norms, dimension, and model parameters under the current assumptions on the elliptic location-scale family.
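The coercivity-to-compactness step invoked in the first author response can be sketched as follows (a schematic of the standard Weierstrass argument under the stated assumptions, not the paper's exact statement):

```latex
% Coercivity of the negative ELBO F: F(\theta) \to \infty as \|\theta\| \to \infty.
% Pick any \theta_0; the sublevel set S = \{\theta : F(\theta) \le F(\theta_0)\}
% is bounded by coercivity and closed by lower semicontinuity, hence compact,
% so a minimizer \theta^* exists in S. Any ball B(0, R) with S \subseteq B(0, R)
% then makes the projected and unprojected problems share the same solution set:
\arg\min_{\theta} F(\theta) \;=\; \arg\min_{\theta \in B(0,R)} F(\theta)
\quad \text{whenever } S \subseteq B(0, R).
```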
Circularity Check
No circularity detected; derivations rely on external theory and stated assumptions
Full rationale
The paper proves existence of an ELBO maximizer for the elliptic location-scale family and finite-time/asymptotic convergence of minibatch PSGD with dynamic batching and preconditioning under the Blum-Gladyshev condition. These results are constructed from first-principles analysis of the given assumptions (location-scale family properties and quadratic variance growth) together with standard stochastic optimization tools, without reducing any claim to a self-referential definition, a fitted parameter renamed as a prediction, or a load-bearing self-citation chain. No ansatz is smuggled via prior work, no known empirical pattern is merely renamed, and the projection-set argument is handled explicitly within the stated compact-set framework rather than being forced by construction. The derivation chain is therefore self-contained against the paper's own inputs and external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: Stochastic gradients in BBVI satisfy the Blum-Gladyshev condition (variance grows quadratically with distance from the optimum).
- Domain assumption: Parameterized distributions belong to the elliptic location-scale family.
Reference graph
Works this paper leans on
- [1] A. Alacaoglu, Y. Malitsky, and S. J. Wright. Towards weaker variance assumptions for stochastic optimization. arXiv preprint arXiv:2504.09951, 2025.
- [2] S.-I. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
- [3]
- [4] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
- [5] J. R. Blum. Approximation methods which converge with probability one. The Annals of Mathematical Statistics, pages 382–386, 1954.
- [6]
- [7] S. Brazitikos, A. Giannopoulos, P. Valettas, and B.-H. Vritsiou. Geometry of Isotropic Convex Bodies, volume 196. American Mathematical Society, 2014.
- [8] J. Burroni, K. Takatsu, J. Domke, and D. Sheldon. U-statistics for importance-weighted variational inference. arXiv preprint arXiv:2302.13918, 2023.
- [9]
- [10]
- [11] J. Domke. Provable gradient variance guarantees for black-box variational inference. Advances in Neural Information Processing Systems, 32, 2019.
- [12] J. Domke. Provable smoothness guarantees for black-box variational inference. In International Conference on Machine Learning, pages 2587–2596. PMLR, 2020.
- [13]
- [14] A. Fazla, E. C. Kaya, A. Upadhyay, and A. Hashemi. Lower bounds and proximally anchored SGD for non-convex minimization under unbounded variance. arXiv preprint arXiv:2604.16620, 2026.
- [15] S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
- [16] S. Ghadimi, G. Lan, and H. Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1):267–305, 2016.
- [17] R. Giordano, M. Ingram, and T. Broderick. Black box variational inference with a deterministic objective: Faster, more accurate, and even more black box. Journal of Machine Learning Research, 25(18):1–39, 2024.
- [18]
- [19] P. W. Glynn. Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM, 33(10):75–84, 1990.
- [20] T. Guilmeau, H. Hendrikx, and F. Forbes. Convergence of projected stochastic natural gradient variational inference for various step size and sample or batch size schedules. arXiv preprint arXiv:2604.00683, 2026.
- [21] B. Halpern. Fixed points of nonexpanding maps. Bulletin of the American Mathematical Society, 73:957–961, 1967.
- [22] A. M. Hotti, L. A. Van der Goten, and J. Lagergren. Benefits of non-linear scale parameterizations in black box variational inference through smoothness results and gradient variance bounds. In International Conference on Artificial Intelligence and Statistics, pages 3538–3546. PMLR, 2024.
- [23] A. Jacobsen and A. Cutkosky. Unconstrained online learning with unbounded losses. In International Conference on Machine Learning, pages 14590–14630. PMLR, 2023.
- [24] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.
- [25] A. Khaled and P. Richtárik. Better theory for SGD in the non-convex world. Transactions on Machine Learning Research, 2023.
- [26] K. Kim, J. Oh, K. Wu, Y. Ma, and J. Gardner. On the convergence of black-box variational inference. Advances in Neural Information Processing Systems, 36:44615–44657, 2023.
- [27] D. Kingma and M. Welling. Efficient gradient-based inference through transformations between Bayes nets and neural nets. In International Conference on Machine Learning, pages 1782–1790. PMLR, 2014.
- [28] R. Latała. On the Equivalence Between Geometric and Arithmetic Means for Log-Concave Measures, pages 123–128. Mathematical Sciences Research Institute Publications. Cambridge University Press, 1999.
- [29] F. Locatello, G. Dresdner, R. Khanna, I. Valera, and G. Rätsch. Boosting black box variational inference. Advances in Neural Information Processing Systems, 31, 2018.
- [30] V. D. Milman and G. Schechtman. Asymptotic Theory of Finite Dimensional Normed Spaces. Springer, 1986.
- [31]
- [32] C. Modi, R. Gower, C. Margossian, Y. Yao, D. Blei, and L. Saul. Variational inference with Gaussian score matching. Advances in Neural Information Processing Systems, 36:29935–29950, 2023.
- [33] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
- [34]
- [35]
- [36] R. Ranganath, S. Gerrish, and D. Blei. Black box variational inference. In Artificial Intelligence and Statistics, pages 814–822. PMLR, 2014.
- [37] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
- [38] H. Robbins and D. Siegmund. A convergence theorem for non negative almost supermartingales and some applications. In Optimizing Methods in Statistics, pages 233–257. Elsevier, 1971.
- [39] R. Y. Rubinstein. Sensitivity analysis and performance extrapolation for computer models. Operations Research, 37(1):72–81, 1989.
- [40]
- [41] M. Titsias and M. Lázaro-Gredilla. Doubly stochastic variational Bayes for non-conjugate inference. In International Conference on Machine Learning, pages 1971–1979. PMLR, 2014.
- [42] M. Wang and D. P. Bertsekas. Stochastic first-order methods with random constraint projection. SIAM Journal on Optimization, 26(1):681–717, 2016.
- [43] D. Wierstra, T. Schaul, T. Glasmachers, Y. Sun, J. Peters, and J. Schmidhuber. Natural evolution strategies. The Journal of Machine Learning Research, 15(1):949–980, 2014.
- [44] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
- [45]
- [46] D. Zantedeschi and K. Muthuraman. Fisher-geometric diffusion in stochastic gradient descent: Optimal rates, oracle complexity, and information-theoretic limits. arXiv e-prints, 2026.
- [47] …studies the oracle complexity of Natural Gradients with minibatches in a broader optimization context. In contrast to these approaches, our work focuses on Projected SGD applied to elliptic location-scale families, providing convergence guarantees without relying on the restrictive geometric properties or conjugacy requirements of exponential families. St...
- [48] …remains the foundational workhorse for optimization. Classical convergence analyses of SGD typically rely on the assumption of uniformly bounded variance [33, 15, 6], or more recently, on the Expected Smoothness (ABC) condition [25]. However, as highlighted by [1], the bounded variance assumption is often overly restrictive for modern machine learning pro...
- [49] This allows us to choose the theoretically optimal step size $\gamma = \frac{1}{2L\sqrt{E}}$, which is significantly larger than that in vanilla PSGD. • MPSGD without scaling: we apply exactly the method described in the first point of Theorem 4, choosing $\Lambda = I_{d+d^2}$, $\gamma = \frac{1}{2L}$, $N = \sqrt{d+\kappa(p)}\,\sqrt{E}$ and $K = \sqrt{E}/\sqrt{d+\kappa(p)}$. • MPSGD with scaling: the second method described in Theorem 4, choosing...
- [50] …taking the expectation over the whole sequence, $\sum_{k=0}^{\infty} c_k < \infty$ almost surely. The first steps of the proof are identical to those of Theorem 6, without taking $(\gamma_k)_{k\in\mathbb{N}}$ and $(N_k)_{k\in\mathbb{N}}$ as constants. Applying the same steps, it follows that $\mathbb{E}\big[\|\theta_{k+1}-\theta^*\|^2_{\Lambda^{-1}} \mid \theta_k\big] \le \big(1 + \tfrac{a}{N_k}\gamma_k^2\big)\|\theta_k-\theta^*\|^2_{\Lambda^{-1}} + \tfrac{b}{N_k}\gamma_k^2 - 2\gamma_k\big(F(\theta_k) - F(\theta^*)\big)$. Let $\tau_k = \big(1 + \tfrac{a}{N_k}\gamma_k^2\big)^{-1}$. We define weights $\alpha_0 = 1$ and $\alpha_k = \prod_{i=0}^{k-1}\tau_i$ for $k \ge 1$, to...
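For readers tracing the excerpt in [50], one plausible continuation of the weighting step, assuming only the recursion and the weights $\alpha_k$ just quoted (this is the standard Robbins-Siegmund-style manipulation, not a quotation from the paper):

```latex
% Multiplying the quoted recursion by \alpha_{k+1} = \alpha_k \tau_k and using
% \tau_k \, (1 + \tfrac{a}{N_k}\gamma_k^2) = 1 yields an almost-supermartingale:
\alpha_{k+1}\,\mathbb{E}\!\left[\|\theta_{k+1}-\theta^*\|^2_{\Lambda^{-1}} \,\middle|\, \theta_k\right]
\;\le\;
\alpha_k\,\|\theta_k-\theta^*\|^2_{\Lambda^{-1}}
\;+\; \alpha_{k+1}\Big(\tfrac{b}{N_k}\gamma_k^2 - 2\gamma_k\big(F(\theta_k)-F(\theta^*)\big)\Big),
% to which the Robbins-Siegmund theorem [38] applies once
% \sum_k \alpha_{k+1}\tfrac{b}{N_k}\gamma_k^2 < \infty.
```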