SGD for Variational Inference: Tackling Unbounded Variance via Preconditioning and Dynamic Batching
Pith reviewed 2026-05-11 02:26 UTC · model grok-4.3
The pith
Preconditioning combined with dynamic batching enables convergence of projected SGD for black-box variational inference under unbounded gradient variance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For parameterized distributions in the elliptic location-scale family, the ELBO maximization problem admits a solution. Moreover, minibatch projected SGD with dynamic batching and preconditioning converges to this solution, with both finite-time rates and asymptotic guarantees, even though the stochastic gradients satisfy only the Blum-Gladyshev condition of quadratically growing variance rather than bounded variance.
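For orientation, a minimal rendering of the objective and of the family in question, assuming the usual BBVI setup (the paper's exact parameterization may differ):

```latex
% ELBO for a variational family q_{m,C} over latents z, with model p(x, z):
\mathrm{ELBO}(m, C) \;=\; \mathbb{E}_{q_{m,C}}\!\big[\log p(x, z) - \log q_{m,C}(z)\big].
% Elliptic location-scale family: an affine push-forward of a fixed
% spherically symmetric base density \varphi,
z \;=\; m + C\,u, \qquad u \sim \varphi .
```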
What carries the argument
Minibatch Projected SGD equipped with dynamic batching and preconditioning, which together tame the quadratic variance growth permitted by the Blum-Gladyshev condition for elliptic location-scale distributions.
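A minimal sketch of the carrier algorithm, assuming a Euclidean ball as the projection set, a fixed preconditioner `Lambda`, a geometrically growing batch schedule, and a user-supplied single-sample gradient oracle `grad_estimate`; the paper's theorems prescribe the actual schedules, preconditioner, and projection set, which are not reproduced here.

```python
import numpy as np

def project_ball(theta, center, radius):
    """Euclidean projection onto the ball {theta : ||theta - center|| <= radius}."""
    diff = theta - center
    norm = np.linalg.norm(diff)
    return theta if norm <= radius else center + radius * diff / norm

def mpsgd(grad_estimate, theta0, Lambda, step, radius, n_iters, batch0=8, growth=1.1):
    """Minibatch projected SGD with preconditioning and dynamic (growing) batches.

    grad_estimate(theta, rng) returns one stochastic gradient sample of the
    negative ELBO at theta; averaging a growing minibatch offsets the
    quadratically growing variance allowed by the Blum-Gladyshev condition.
    """
    rng = np.random.default_rng(0)
    theta = theta0.copy()
    center = theta0.copy()
    Lambda_inv = np.linalg.inv(Lambda)            # preconditioner applied to the averaged gradient
    for k in range(n_iters):
        batch = max(1, int(batch0 * growth ** k))  # dynamic batching: N_k grows with k
        grads = np.stack([grad_estimate(theta, rng) for _ in range(batch)])
        g = grads.mean(axis=0)
        theta = project_ball(theta - step * Lambda_inv @ g, center, radius)
    return theta
```

The growing batch size is what counteracts the variance term that scales with the squared distance to the optimum.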
If this is right
- Existence of an ELBO minimizer is established for the broad class of elliptic location-scale families, grounding a common assumption.
- Finite-time convergence rates are provided for the preconditioned and dynamically batched algorithm.
- Asymptotic convergence is guaranteed under the BG condition without needing bounded variance.
- Practical black-box variational inference on complex models gains theoretical support when these enhancements are used.
Where Pith is reading between the lines
- Similar preconditioning and batching strategies could extend convergence guarantees to other stochastic optimization settings with quadratic variance growth.
- This approach might improve scalability of variational inference in high-dimensional or non-convex landscapes by reducing the impact of distant high-variance regions.
- Testing the method on distributions outside the elliptic family could reveal how broadly the BG condition suffices for convergence.
Load-bearing premise
The parameterized distributions must belong to the elliptic location-scale family and the stochastic gradients must satisfy the Blum-Gladyshev condition with variance growing quadratically away from the optimum.
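Spelled out, this premise is an inequality of roughly the following form (the norm and the constants are placeholders; the paper may state it in a preconditioned norm):

```latex
% Blum-Gladyshev (BG) condition: the stochastic-gradient variance may grow
% quadratically with the distance to the optimum \theta^*, for some a, b \ge 0:
\mathbb{E}\big[\,\|\widehat{g}(\theta) - \nabla F(\theta)\|^{2}\,\big]
\;\le\; a\,\|\theta - \theta^{*}\|^{2} \;+\; b .
```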
What would settle it
A counterexample where the ELBO has no solution for an elliptic location-scale distribution, or an experiment showing that the algorithm diverges when dynamic batching and preconditioning are removed but variance still grows quadratically.
Original abstract
Black-Box Variational Inference (BBVI) typically relies on Stochastic Gradient Descent (SGD) to optimize the Evidence Lower Bound (ELBO). However, the stochastic gradients in BBVI inherently exhibit unbounded variance, violating standard assumptions and instead satisfying the weaker Blum-Gladyshev (BG) condition, where variance grows quadratically with distance from the optimum. In this paper, we bridge the gap between stochastic optimization theory and the practical instances of BBVI. Focusing on the broad elliptic location-scale family of parameterized distributions, we offer two main contributions. First, we prove the existence of an ELBO solution, a foundational property usually assumed a priori in the literature. Second, we establish comprehensive convergence guarantees spanning finite-time and asymptotic regimes for Minibatch Projected SGD (PSGD) equipped with dynamic batching and preconditioning under the BG condition. Our theoretical framework demonstrates that dynamic batching combined with preconditioning systematically enables rigorous guarantees even in complex settings. We illustrate our theoretical findings with numerical results, highlighting the efficacy of our approach for modern inference tasks.
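To see why the variance is unbounded, here is a toy sketch, assuming a one-dimensional Gaussian variational family, a standard-normal target, and the reparameterization-trick gradient of the negative ELBO; the names and the toy target are illustrative, not the paper's experiments. The estimated second moment grows quadratically with the distance of the variational mean from the optimum, which is exactly the BG-type growth described above.

```python
import numpy as np

def neg_elbo_grad_sample(m, s, rng):
    """One reparameterization-trick gradient sample of the negative ELBO
    for q = N(m, s^2) against the toy target p = N(0, 1)."""
    u = rng.standard_normal()
    z = m + s * u              # reparameterized draw z ~ q
    g_m = z                    # d/dm of -log p(z); the entropy term has no m-dependence
    g_s = z * u - 1.0 / s      # d/ds of -log p(z) plus d/ds of (-entropy) = -1/s
    return np.array([g_m, g_s])

rng = np.random.default_rng(0)
for m in [0.0, 1.0, 4.0, 16.0]:  # move the variational mean away from the optimum m* = 0
    grads = np.stack([neg_elbo_grad_sample(m, 1.0, rng) for _ in range(20_000)])
    print(f"||m - m*|| = {m:5.1f}   E||g||^2 ~ {np.mean(np.sum(grads**2, axis=1)):8.1f}")
# The second moment scales like ||m - m*||^2: bounded-variance assumptions fail,
# but a quadratic-growth (BG-type) bound still holds.
```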
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proves existence of an ELBO maximizer for variational families in the elliptic location-scale class and derives finite-time plus asymptotic convergence guarantees for minibatch projected SGD equipped with dynamic batching and preconditioning, under the Blum-Gladyshev condition that stochastic-gradient variance grows quadratically away from the optimum.
Significance. If the proofs close the gap between the existence result and the projection set, the work supplies the first rigorous justification for using SGD on BBVI objectives whose variance is provably unbounded, while showing how dynamic batching and preconditioning restore standard convergence rates; this directly addresses a long-standing theoretical obstacle in the field.
major comments (2)
- [Existence theorem and PSGD convergence analysis] The existence argument (abstract and §3) establishes attainment of the ELBO maximum via coercivity/continuity on the elliptic location-scale family but supplies no explicit radius (in terms of data norms, dimension, or model parameters) that would guarantee the maximizer lies inside any chosen compact projection set. Consequently the finite-time and asymptotic bounds derived for PSGD under the BG condition apply only to the projected problem, not necessarily to the original ELBO solution whose existence was proven.
- [Convergence theorems] The BG-condition analysis and dynamic-batching schedule (presumably §4–5) are carried out inside the projected domain; without a concrete inclusion guarantee, the claimed transfer of guarantees to the unprojected ELBO optimization does not hold, undermining the central claim that the algorithm solves the original inference problem.
minor comments (2)
- [Notation and preliminaries] Notation for the preconditioner and batch-size schedule should be introduced once and used consistently; several symbols are used in the abstract but defined only later in the text.
- [Experiments] The numerical experiments would be strengthened by reporting the chosen projection radius and verifying that the recovered solution lies inside it for the reported runs.
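A minimal post-hoc check of the kind suggested here, assuming the projection set is a Euclidean ball of radius `R` around the initialization (the paper's actual projection set may differ):

```python
import numpy as np

def projection_report(theta_hat, center, radius, margin=1e-6):
    """Report whether the recovered solution lies strictly inside the projection ball,
    i.e. whether the projection constraint was inactive at the returned iterate."""
    dist = np.linalg.norm(theta_hat - center)
    interior = dist < radius - margin
    print(f"projection radius R = {radius:g}, ||theta_hat - center|| = {dist:g}, "
          f"{'interior (constraint inactive)' if interior else 'on/near the boundary'}")
    return interior
```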
Simulated Author's Rebuttal
We thank the referee for their careful reading and constructive comments, which help clarify the relationship between our existence result and the convergence analysis. We respond point by point below.
Point-by-point responses
- Referee: [Existence theorem and PSGD convergence analysis] The existence argument (abstract and §3) establishes attainment of the ELBO maximum via coercivity/continuity on the elliptic location-scale family but supplies no explicit radius (in terms of data norms, dimension, or model parameters) that would guarantee the maximizer lies inside any chosen compact projection set. Consequently the finite-time and asymptotic bounds derived for PSGD under the BG condition apply only to the projected problem, not necessarily to the original ELBO solution whose existence was proven.
Authors: We agree that the existence proof in Section 3 relies on coercivity of the negative ELBO over the unbounded domain and does not supply an explicit radius guaranteeing the maximizer lies inside a pre-specified compact set. The finite-time and asymptotic convergence results in Sections 4–5 are therefore derived for the projected SGD algorithm. In the revised manuscript we will add a dedicated paragraph in Section 3 (and a corresponding remark in the introduction) explaining that the coercivity property used to prove existence also implies the existence of a sufficiently large compact set containing the maximizer; the projection radius can therefore be chosen arbitrarily large so that the projected and unprojected problems coincide. We will not claim an explicit, data-dependent radius, as that would require quantitative bounds on the coercivity that are not available under the stated assumptions. (A schematic of this coercivity argument is sketched after this list.) revision: partial
- Referee: [Convergence theorems] The BG-condition analysis and dynamic-batching schedule (presumably §4–5) are carried out inside the projected domain; without a concrete inclusion guarantee, the claimed transfer of guarantees to the unprojected ELBO optimization does not hold, undermining the central claim that the algorithm solves the original inference problem.
Authors: We acknowledge that the convergence theorems apply directly to the projected problem. The manuscript’s central claim is that dynamic batching and preconditioning restore standard convergence rates for projected SGD under the BG condition that is known to hold for BBVI objectives; the projection is an explicit algorithmic component introduced precisely to make the optimization well-posed. In the revision we will update the abstract, introduction, and conclusion to state more precisely that the guarantees are for the projected algorithm, which recovers the original ELBO maximizer once the projection set is chosen large enough to contain it (as justified by the coercivity argument). This framing keeps the contribution focused on overcoming the unbounded-variance obstacle while remaining accurate about the role of projection. revision: partial
- Left open by the rebuttal: deriving an explicit, finite radius for the ELBO maximizer in terms of data norms, dimension, and model parameters under the current assumptions on the elliptic location-scale family.
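The coercivity-to-compactness step invoked in the first author response can be sketched as follows (a schematic of the standard Weierstrass argument under the stated assumptions, not the paper's exact statement):

```latex
% Coercivity of the negative ELBO F: F(\theta) \to \infty as \|\theta\| \to \infty.
% Pick any \theta_0; the sublevel set S = \{\theta : F(\theta) \le F(\theta_0)\}
% is bounded by coercivity and closed by lower semicontinuity, hence compact,
% so a minimizer \theta^* exists in S. Any ball B(0, R) with S \subseteq B(0, R)
% then makes the projected and unprojected problems share the same solution set:
\arg\min_{\theta} F(\theta) \;=\; \arg\min_{\theta \in B(0,R)} F(\theta)
\quad \text{whenever } S \subseteq B(0, R).
```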
Circularity Check
No circularity detected; derivations rely on external theory and stated assumptions
Full rationale
The paper proves existence of an ELBO maximizer for the elliptic location-scale family and finite-time/asymptotic convergence of minibatch PSGD with dynamic batching and preconditioning under the Blum-Gladyshev condition. These results are constructed from first-principles analysis of the given assumptions (location-scale family properties and quadratic variance growth) together with standard stochastic optimization tools, without reducing any claim to a self-referential definition, a fitted parameter renamed as a prediction, or a load-bearing self-citation chain. No ansatz is smuggled via prior work, no known empirical pattern is merely renamed, and the projection-set argument is handled explicitly within the stated compact-set framework rather than being forced by construction. The derivation chain is therefore self-contained against the paper's own inputs and external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: Stochastic gradients in BBVI satisfy the Blum-Gladyshev condition (variance grows quadratically with distance from the optimum).
- Domain assumption: Parameterized distributions belong to the elliptic location-scale family.
Reference graph
Works this paper leans on
- [1] A. Alacaoglu, Y. Malitsky, and S. J. Wright. Towards weaker variance assumptions for stochastic optimization. arXiv preprint arXiv:2504.09951, 2025.
- [2] S.-I. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
- [3]
- [4] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
- [5] J. R. Blum. Approximation methods which converge with probability one. The Annals of Mathematical Statistics, pages 382–386, 1954.
- [6]
- [7] S. Brazitikos, A. Giannopoulos, P. Valettas, and B.-H. Vritsiou. Geometry of Isotropic Convex Bodies, volume 196. American Mathematical Society, 2014.
- [8] J. Burroni, K. Takatsu, J. Domke, and D. Sheldon. U-statistics for importance-weighted variational inference. arXiv preprint arXiv:2302.13918, 2023.
- [9]
- [10]
- [11] J. Domke. Provable gradient variance guarantees for black-box variational inference. Advances in Neural Information Processing Systems, 32, 2019.
- [12] J. Domke. Provable smoothness guarantees for black-box variational inference. In International Conference on Machine Learning, pages 2587–2596. PMLR, 2020.
- [13]
- [14] A. Fazla, E. C. Kaya, A. Upadhyay, and A. Hashemi. Lower bounds and proximally anchored SGD for non-convex minimization under unbounded variance. arXiv preprint arXiv:2604.16620, 2026.
- [15] S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
- [16] S. Ghadimi, G. Lan, and H. Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1):267–305, 2016.
- [17] R. Giordano, M. Ingram, and T. Broderick. Black box variational inference with a deterministic objective: Faster, more accurate, and even more black box. Journal of Machine Learning Research, 25(18):1–39, 2024.
- [18]
- [19] P. W. Glynn. Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM, 33(10):75–84, 1990.
- [20] T. Guilmeau, H. Hendrikx, and F. Forbes. Convergence of projected stochastic natural gradient variational inference for various step size and sample or batch size schedules. arXiv preprint arXiv:2604.00683, 2026.
- [21] B. Halpern. Fixed points of nonexpanding maps. Bulletin of the American Mathematical Society, 73:957–961, 1967.
- [22] A. M. Hotti, L. A. Van der Goten, and J. Lagergren. Benefits of non-linear scale parameterizations in black box variational inference through smoothness results and gradient variance bounds. In International Conference on Artificial Intelligence and Statistics, pages 3538–3546. PMLR, 2024.
- [23] A. Jacobsen and A. Cutkosky. Unconstrained online learning with unbounded losses. In International Conference on Machine Learning, pages 14590–14630. PMLR, 2023.
- [24] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.
- [25] A. Khaled and P. Richtárik. Better theory for SGD in the non-convex world. Transactions on Machine Learning Research, 2023.
- [26] K. Kim, J. Oh, K. Wu, Y. Ma, and J. Gardner. On the convergence of black-box variational inference. Advances in Neural Information Processing Systems, 36:44615–44657, 2023.
- [27] D. Kingma and M. Welling. Efficient gradient-based inference through transformations between Bayes nets and neural nets. In International Conference on Machine Learning, pages 1782–1790. PMLR, 2014.
- [28] R. Latała. On the Equivalence Between Geometric and Arithmetic Means for Log-Concave Measures, pages 123–128. Mathematical Sciences Research Institute Publications. Cambridge University Press, 1999.
- [29] F. Locatello, G. Dresdner, R. Khanna, I. Valera, and G. Rätsch. Boosting black box variational inference. Advances in Neural Information Processing Systems, 31, 2018.
- [30] V. D. Milman and G. Schechtman. Asymptotic Theory of Finite Dimensional Normed Spaces. Springer, 1986.
- [31]
- [32] C. Modi, R. Gower, C. Margossian, Y. Yao, D. Blei, and L. Saul. Variational inference with Gaussian score matching. Advances in Neural Information Processing Systems, 36:29935–29950, 2023.
- [33] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
- [34]
- [35]
- [36] R. Ranganath, S. Gerrish, and D. Blei. Black box variational inference. In Artificial Intelligence and Statistics, pages 814–822. PMLR, 2014.
- [37] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
- [38] H. Robbins and D. Siegmund. A convergence theorem for non negative almost supermartingales and some applications. In Optimizing Methods in Statistics, pages 233–257. Elsevier, 1971.
- [39] R. Y. Rubinstein. Sensitivity analysis and performance extrapolation for computer models. Operations Research, 37(1):72–81, 1989.
- [40]
- [41] M. Titsias and M. Lázaro-Gredilla. Doubly stochastic variational Bayes for non-conjugate inference. In International Conference on Machine Learning, pages 1971–1979. PMLR, 2014.
- [42] M. Wang and D. P. Bertsekas. Stochastic first-order methods with random constraint projection. SIAM Journal on Optimization, 26(1):681–717, 2016.
- [43] D. Wierstra, T. Schaul, T. Glasmachers, Y. Sun, J. Peters, and J. Schmidhuber. Natural evolution strategies. The Journal of Machine Learning Research, 15(1):949–980, 2014.
- [44] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
- [45]
- [46] D. Zantedeschi and K. Muthuraman. Fisher-geometric diffusion in stochastic gradient descent: Optimal rates, oracle complexity, and information-theoretic limits. arXiv e-prints, 2026.
- [47] …studies the oracle complexity of Natural Gradients with minibatches in a broader optimization context. In contrast to these approaches, our work focuses on Projected SGD applied to elliptic location-scale families, providing convergence guarantees without relying on the restrictive geometric properties or conjugacy requirements of exponential families. St...
- [48] …remains the foundational workhorse for optimization. Classical convergence analyses of SGD typically rely on the assumption of uniformly bounded variance [33, 15, 6], or more recently, on the Expected Smoothness (ABC) condition [25]. However, as highlighted by [1], the bounded variance assumption is often overly restrictive for modern machine learning pro...
- [49] This allows us to choose the theoretically optimal step size $\gamma = \frac{1}{2L\sqrt{E}}$, which is significantly larger than that in vanilla PSGD. • MPSGD without scaling: we apply exactly the method described in the first point of Theorem 4, choosing $\Lambda = I_{d+d^2}$, $\gamma = \frac{1}{2L}$, $N = \sqrt{d+\kappa(p)}\,\sqrt{E}$ and $K = \sqrt{E}/\sqrt{d+\kappa(p)}$. • MPSGD with scaling: the second method described in Theorem 4, choosing...
- [50] …taking the expectation over the whole sequence, $\sum_{k=0}^{\infty} c_k < \infty$ almost surely. The first steps of the proof are identical to those of Theorem 6, without taking $(\gamma_k)_{k\in\mathbb{N}}$ and $(N_k)_{k\in\mathbb{N}}$ as constants. Applying the same steps, it follows that $\mathbb{E}\big[\|\theta_{k+1}-\theta^*\|^2_{\Lambda^{-1}} \mid \theta_k\big] \le \big(1 + \tfrac{a}{N_k}\gamma_k^2\big)\|\theta_k-\theta^*\|^2_{\Lambda^{-1}} + \tfrac{b}{N_k}\gamma_k^2 - 2\gamma_k\big(F(\theta_k) - F(\theta^*)\big)$. Let $\tau_k = \big(1 + \tfrac{a}{N_k}\gamma_k^2\big)^{-1}$. We define weights $\alpha_0 = 1$ and $\alpha_k = \prod_{i=0}^{k-1}\tau_i$ for $k \ge 1$, to...
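For readers tracing the excerpt in [50], one plausible continuation of the weighting step, assuming only the recursion and the weights $\alpha_k$ just quoted (this is the standard Robbins-Siegmund-style manipulation, not a quotation from the paper):

```latex
% Multiplying the quoted recursion by \alpha_{k+1} = \alpha_k \tau_k and using
% \tau_k \, (1 + \tfrac{a}{N_k}\gamma_k^2) = 1 yields an almost-supermartingale:
\alpha_{k+1}\,\mathbb{E}\!\left[\|\theta_{k+1}-\theta^*\|^2_{\Lambda^{-1}} \,\middle|\, \theta_k\right]
\;\le\;
\alpha_k\,\|\theta_k-\theta^*\|^2_{\Lambda^{-1}}
\;+\; \alpha_{k+1}\Big(\tfrac{b}{N_k}\gamma_k^2 - 2\gamma_k\big(F(\theta_k)-F(\theta^*)\big)\Big),
% to which the Robbins-Siegmund theorem [38] applies once
% \sum_k \alpha_{k+1}\tfrac{b}{N_k}\gamma_k^2 < \infty.
```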