pith. machine review for the scientific record.

arxiv: 2605.07531 · v1 · submitted 2026-05-08 · 💻 cs.LG · math.OC

Recognition: no theorem link

SGD for Variational Inference: Tackling Unbounded Variance via Preconditioning and Dynamic Batching

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 02:26 UTC · model grok-4.3

classification 💻 cs.LG math.OC
keywords variational inference · black-box variational inference · stochastic gradient descent · ELBO optimization · unbounded variance · preconditioning · dynamic batching · Blum-Gladyshev condition

The pith

Preconditioning combined with dynamic batching enables convergence of projected SGD for black-box variational inference under unbounded gradient variance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Black-box variational inference optimizes the evidence lower bound using stochastic gradients that have unbounded variance, which breaks standard convergence theory. This paper focuses on the elliptic location-scale family of distributions and shows that an ELBO solution exists. It then proves that minibatch projected SGD, when equipped with preconditioning and dynamic batching, converges both in finite time and asymptotically when the gradients satisfy the weaker Blum-Gladyshev condition. This bridges theory and practice by allowing rigorous guarantees without assuming bounded variance. A sympathetic reader would care because it justifies the use of these techniques in modern inference tasks where variance grows with distance from the optimum.

Core claim

For parameterized distributions in the elliptic location-scale family, the ELBO has a solution. Moreover, minibatch projected SGD with dynamic batching and preconditioning converges to this solution in finite time and asymptotically, even though the stochastic gradients satisfy only the Blum-Gladyshev condition of quadratically growing variance rather than bounded variance.
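Stated as formulas (a minimal sketch; the symbols F, g, a, b, and θ* are assumed notation, not lifted from the paper), the claim combines an existence statement with the variance condition:

```latex
% Hedged formalization. F denotes the negative ELBO, g a stochastic gradient.
% Existence: the minimum of F is attained over the variational parameters.
\exists\, \theta^\star \in \arg\min_{\theta \in \Theta} F(\theta),
\qquad F(\theta) := -\mathrm{ELBO}(\theta).

% Blum--Gladyshev condition: the gradient noise's second moment grows at most
% quadratically with distance from the optimum; it is never assumed bounded.
\mathbb{E}\,\big\| g(\theta) - \nabla F(\theta) \big\|^2
  \;\le\; a\,\|\theta - \theta^\star\|^2 + b,
\qquad a, b \ge 0.
```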

What carries the argument

Minibatch projected SGD equipped with dynamic batching and preconditioning; together, these two devices tame the quadratic variance growth permitted by the Blum-Gladyshev condition for elliptic location-scale distributions.
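To make that machinery concrete, the sketch below implements the family of methods the review describes: projected SGD with a fixed preconditioner and a growing minibatch, run on a toy objective whose gradient noise grows with distance from the optimum. The schedule, preconditioner, constants, and noise model are illustrative assumptions, not the paper's prescriptions.

```python
import numpy as np

def minibatch_psgd(grad_estimate, project, theta0, precond, n_iters=200,
                   step=0.1, batch0=8, batch_growth=1.05, batch_cap=4096):
    """Hedged sketch of minibatch projected SGD with preconditioning and
    dynamic batching (placeholder schedules, not the paper's choices).
    grad_estimate(theta, n) averages n stochastic gradients, project maps
    iterates back onto a compact set, precond is a fixed PSD matrix."""
    theta = theta0.copy()
    batch = float(batch0)
    for _ in range(n_iters):
        g = grad_estimate(theta, int(batch))          # noise variance ~ 1/N_k
        theta = project(theta - step * precond @ g)   # preconditioned PSGD step
        batch = min(batch * batch_growth, batch_cap)  # dynamic batching: N_k grows
    return theta

# Toy problem: F(theta) = ||theta||^2 / 2, optimum at 0, with BG-type noise
# whose per-sample standard deviation grows with ||theta|| (unbounded variance).
rng = np.random.default_rng(0)
d = 5

def grad_estimate(theta, n):
    noise = rng.standard_normal((n, d)) * (1.0 + np.linalg.norm(theta))
    return theta + noise.mean(axis=0)

def project(theta, radius=10.0):
    nrm = np.linalg.norm(theta)
    return theta if nrm <= radius else theta * (radius / nrm)

theta_hat = minibatch_psgd(grad_estimate, project, np.full(d, 5.0), np.eye(d))
print(np.linalg.norm(theta_hat))  # distance to the optimum; should be small
```

The projection set here is a Euclidean ball, which mirrors the compactness requirement the referee report below puts under pressure.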

If this is right

  • Existence of an ELBO minimizer is established for the broad class of elliptic location-scale families, grounding a common assumption.
  • Finite-time convergence rates are provided for the preconditioned and dynamically batched algorithm.
  • Asymptotic convergence is guaranteed under the BG condition without needing bounded variance.
  • Practical black-box variational inference gains theoretical support for complex models using these enhancements.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar preconditioning and batching strategies could extend convergence guarantees to other stochastic optimization settings with quadratic variance growth.
  • This approach might improve scalability of variational inference in high-dimensional or non-convex landscapes by reducing the impact of distant high-variance regions.
  • Testing the method on distributions outside the elliptic family could reveal how broadly the BG condition suffices for convergence.

Load-bearing premise

The parameterized distributions must belong to the elliptic location-scale family and the stochastic gradients must satisfy the Blum-Gladyshev condition with variance growing quadratically away from the optimum.
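The quadratic-growth premise can be seen in a toy computation. In the sketch below (an illustrative example, not the paper's setting), a Gaussian variational family N(μ, σ²) is fit to a standard normal target; the reparameterized gradient of the negative ELBO with respect to σ is g_σ = (μ + σε)ε − 1/σ, whose variance works out to μ² + 2σ², i.e. quadratic in the distance |μ − μ*|.

```python
import numpy as np

# Empirical probe of BG-style quadratic variance growth (toy assumption:
# q = N(mu, sigma^2) against a standard normal target, mu* = 0, sigma = 1).
rng = np.random.default_rng(1)
sigma = 1.0

for mu in [1.0, 2.0, 4.0, 8.0]:
    eps = rng.standard_normal(200_000)
    g_sigma = (mu + sigma * eps) * eps - 1.0 / sigma  # reparameterized grad wrt sigma
    print(f"|mu - mu*| = {mu:4.1f}   Var[g_sigma] = {g_sigma.var():8.2f}   "
          f"predicted mu^2 + 2*sigma^2 = {mu**2 + 2 * sigma**2:8.2f}")
```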

What would settle it

A counterexample where the ELBO has no solution for an elliptic location-scale distribution, or an experiment showing that the algorithm diverges when dynamic batching and preconditioning are removed but variance still grows quadratically.
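One half of that proposed test is easy to simulate. Reusing the toy problem and `minibatch_psgd` sketch from earlier (same illustrative assumptions), removing dynamic batching while keeping the constant step tends to leave the iterates hovering at a noise floor, whereas the growing-batch run keeps contracting:

```python
# Ablation in the spirit of the test above: batch_growth=1.0 disables
# dynamic batching; everything else (step, preconditioner) stays fixed.
theta_fixed = minibatch_psgd(grad_estimate, project, np.full(d, 5.0),
                             np.eye(d), batch_growth=1.0)
theta_dynamic = minibatch_psgd(grad_estimate, project, np.full(d, 5.0), np.eye(d))
print(np.linalg.norm(theta_fixed), np.linalg.norm(theta_dynamic))
```

This is suggestive only: the regime where the theory predicts outright divergence without these devices would need the full experiment the review calls for.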

Figures

Figures reproduced from arXiv: 2605.07531 by Cesare Molinari, Hippolyte Labarrière, Lorenzo Rosasco, Silvia Villa.

Figure 1: Evolution of the Negative ELBO (approximated with …) · view at source ↗
Figure 2: Evolution of the Negative ELBO (approximated with …) · view at source ↗
Figure 3: Evolution of the Negative ELBO (approximated with …) · view at source ↗
original abstract

Black-Box Variational Inference (BBVI) typically relies on Stochastic Gradient Descent (SGD) to optimize the Evidence Lower Bound (ELBO). However, the stochastic gradients in BBVI inherently exhibit unbounded variance, violating standard assumptions and instead satisfying the weaker Blum-Gladyshev (BG) condition, where variance grows quadratically with distance from the optimum. In this paper, we bridge the gap between stochastic optimization theory and the practical instances of BBVI. Focusing on the broad elliptic location-scale family of parameterized distributions, we offer two main contributions. First, we prove the existence of an ELBO solution, a foundational property usually assumed a priori in the literature. Second, we establish comprehensive convergence guarantees spanning finite-time and asymptotic regimes for Minibatch Projected SGD (PSGD) equipped with dynamic batching and preconditioning under the BG condition. Our theoretical framework demonstrates that dynamic batching combined with preconditioning systematically enables rigorous guarantees even in complex settings. We illustrate our theoretical findings with numerical results, highlighting the efficacy of our approach for modern inference tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proves existence of an ELBO maximizer for variational families in the elliptic location-scale class and derives finite-time and asymptotic convergence guarantees for minibatch projected SGD equipped with dynamic batching and preconditioning, under the Blum-Gladyshev condition that stochastic-gradient variance grows quadratically away from the optimum.

Significance. If the proofs close the gap between the existence result and the projection set, the work supplies the first rigorous justification for using SGD on BBVI objectives whose variance is provably unbounded, while showing how dynamic batching and preconditioning restore standard convergence rates; this directly addresses a long-standing theoretical obstacle in the field.

major comments (2)
  1. [Existence theorem and PSGD convergence analysis] The existence argument (abstract and §3) establishes attainment of the ELBO maximum via coercivity/continuity on the elliptic location-scale family but supplies no explicit radius (in terms of data norms, dimension, or model parameters) that would guarantee the maximizer lies inside any chosen compact projection set. Consequently the finite-time and asymptotic bounds derived for PSGD under the BG condition apply only to the projected problem, not necessarily to the original ELBO solution whose existence was proven.
  2. [Convergence theorems] The BG-condition analysis and dynamic-batching schedule (presumably §4–5) are carried out inside the projected domain; without a concrete inclusion guarantee, the claimed transfer of guarantees to the unprojected ELBO optimization does not hold, undermining the central claim that the algorithm solves the original inference problem.
minor comments (2)
  1. [Notation and preliminaries] Notation for the preconditioner and batch-size schedule should be introduced once and used consistently; several symbols appear first in the abstract and only later in the text.
  2. [Experiments] The numerical experiments would be strengthened by reporting the chosen projection radius and verifying that the recovered solution lies inside it for the reported runs.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their careful reading and constructive comments, which help clarify the relationship between our existence result and the convergence analysis. We respond point by point below.

read point-by-point responses
  1. Referee: [Existence theorem and PSGD convergence analysis] The existence argument (abstract and §3) establishes attainment of the ELBO maximum via coercivity/continuity on the elliptic location-scale family but supplies no explicit radius (in terms of data norms, dimension, or model parameters) that would guarantee the maximizer lies inside any chosen compact projection set. Consequently the finite-time and asymptotic bounds derived for PSGD under the BG condition apply only to the projected problem, not necessarily to the original ELBO solution whose existence was proven.

    Authors: We agree that the existence proof in Section 3 relies on coercivity of the negative ELBO over the unbounded domain and does not supply an explicit radius guaranteeing the maximizer lies inside a pre-specified compact set. The finite-time and asymptotic convergence results in Sections 4–5 are therefore derived for the projected SGD algorithm. In the revised manuscript we will add a dedicated paragraph in Section 3 (and a corresponding remark in the introduction) explaining that the coercivity function used to prove existence also implies the existence of a sufficiently large compact set containing the maximizer; the projection radius can therefore be chosen arbitrarily large so that the projected and unprojected problems coincide. We will not claim an explicit, data-dependent radius, as that would require quantitative bounds on the coercivity that are not available under the stated assumptions. revision: partial

  2. Referee: [Convergence theorems] The BG-condition analysis and dynamic-batching schedule (presumably §4–5) are carried out inside the projected domain; without a concrete inclusion guarantee, the claimed transfer of guarantees to the unprojected ELBO optimization does not hold, undermining the central claim that the algorithm solves the original inference problem.

    Authors: We acknowledge that the convergence theorems apply directly to the projected problem. The manuscript’s central claim is that dynamic batching and preconditioning restore standard convergence rates for projected SGD under the BG condition that is known to hold for BBVI objectives; the projection is an explicit algorithmic component introduced precisely to make the optimization well-posed. In the revision we will update the abstract, introduction, and conclusion to state more precisely that the guarantees are for the projected algorithm, which recovers the original ELBO maximizer once the projection set is chosen large enough to contain it (as justified by the coercivity argument). This framing keeps the contribution focused on overcoming the unbounded-variance obstacle while remaining accurate about the role of projection. revision: partial
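The coercivity step invoked in response 1 above is standard and can be written out in one line; this is a hedged sketch of that argument with assumed notation, not the paper's proof.

```latex
% If the negative ELBO F is continuous and coercive, its sublevel sets are
% compact, so a minimizer exists and lies in some ball B(0, R); any projection
% radius >= R makes the projected and unprojected problems coincide.
\lim_{\|\theta\|\to\infty} F(\theta) = +\infty
\;\Longrightarrow\;
S_0 := \{\theta : F(\theta) \le F(\theta_0)\}\ \text{is compact}
\;\Longrightarrow\;
\emptyset \neq \arg\min_{\theta} F(\theta) \subseteq S_0 \subseteq B(0, R)\ \text{for some}\ R < \infty .
```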

standing simulated objections not resolved
  • Deriving an explicit, finite radius for the ELBO maximizer in terms of data norms, dimension, and model parameters under the current assumptions on the elliptic location-scale family.

Circularity Check

0 steps flagged

No circularity detected; derivations rely on external theory and stated assumptions

full rationale

The paper proves existence of an ELBO maximizer for the elliptic location-scale family and finite-time/asymptotic convergence of minibatch PSGD with dynamic batching and preconditioning under the Blum-Gladyshev condition. These results are constructed from first-principles analysis of the given assumptions (location-scale family properties and quadratic variance growth) together with standard stochastic optimization tools, without reducing any claim to a self-referential definition, a fitted parameter renamed as a prediction, or a load-bearing self-citation chain. No ansatz is smuggled via prior work, no known empirical pattern is merely renamed, and the projection-set argument is handled explicitly within the stated compact-set framework rather than being forced by construction. The derivation chain is therefore self-contained against the paper's own inputs and external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on two domain assumptions: that the target distributions lie in the elliptic location-scale family and that the stochastic gradients obey the Blum-Gladyshev variance growth condition. No free parameters or invented entities are introduced in the abstract.

axioms (2)
  • domain assumption Stochastic gradients in BBVI satisfy the Blum-Gladyshev condition (variance grows quadratically with distance from the optimum).
    Explicitly stated in the abstract as the weaker condition that replaces bounded variance.
  • domain assumption Parameterized distributions belong to the elliptic location-scale family.
    The family is the setting in which the ELBO existence proof and convergence results are derived.

pith-pipeline@v0.9.0 · 5495 in / 1422 out tokens · 37056 ms · 2026-05-11T02:26:18.661364+00:00 · methodology


Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 1 internal anchor

  1. [1] A. Alacaoglu, Y. Malitsky, and S. J. Wright. Towards weaker variance assumptions for stochastic optimization. arXiv preprint arXiv:2504.09951, 2025.
  2. [2] S.-I. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
  3. [3] E. Archer, I. M. Park, L. Buesing, J. Cunningham, and L. Paninski. Black box variational inference for state space models. arXiv preprint arXiv:1511.07367, 2015.
  4. [4] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
  5. [5] J. R. Blum. Approximation methods which converge with probability one. The Annals of Mathematical Statistics, pages 382–386, 1954.
  6. [6] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.
  7. [7] S. Brazitikos, A. Giannopoulos, P. Valettas, and B.-H. Vritsiou. Geometry of Isotropic Convex Bodies, volume 196. American Mathematical Society, 2014.
  8. [8] J. Burroni, K. Takatsu, J. Domke, and D. Sheldon. U-statistics for importance-weighted variational inference. arXiv preprint arXiv:2302.13918, 2023.
  9. [9] D. Cai, C. Modi, L. Pillaud-Vivien, C. C. Margossian, R. M. Gower, D. M. Blei, and L. K. Saul. Batch and match: black-box variational inference with a score-based divergence. arXiv preprint arXiv:2402.14758, 2024.
  10. [10] S. Cui and U. V. Shanbhag. On the analysis of variance-reduced and randomized projection variants of single projection schemes for monotone stochastic variational inequality problems. Set-Valued and Variational Analysis, 29(2):453–499, 2021.
  11. [11] J. Domke. Provable gradient variance guarantees for black-box variational inference. Advances in Neural Information Processing Systems, 32, 2019.
  12. [12] J. Domke. Provable smoothness guarantees for black-box variational inference. In International Conference on Machine Learning, pages 2587–2596. PMLR, 2020.
  13. [13] J. Domke, R. Gower, and G. Garrigos. Provable convergence guarantees for black-box variational inference. Advances in Neural Information Processing Systems, 36:66289–66327, 2023.
  14. [14] A. Fazla, E. C. Kaya, A. Upadhyay, and A. Hashemi. Lower bounds and proximally anchored SGD for non-convex minimization under unbounded variance. arXiv preprint arXiv:2604.16620, 2026.
  15. [15] S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
  16. [16] S. Ghadimi, G. Lan, and H. Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1):267–305, 2016.
  17. [17] R. Giordano, M. Ingram, and T. Broderick. Black box variational inference with a deterministic objective: Faster, more accurate, and even more black box. Journal of Machine Learning Research, 25(18):1–39, 2024.
  18. [18] E. Gladyshev. On stochastic approximation. Theory of Probability & Its Applications, 10(2):275–278, 1965.
  19. [19] P. W. Glynn. Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM, 33(10):75–84, 1990.
  20. [20] T. Guilmeau, H. Hendrikx, and F. Forbes. Convergence of projected stochastic natural gradient variational inference for various step size and sample or batch size schedules. arXiv preprint arXiv:2604.00683, 2026.
  21. [21] B. Halpern. Fixed points of nonexpanding maps. Bulletin of the American Mathematical Society, 73:957–961, 1967.
  22. [22] A. M. Hotti, L. A. Van der Goten, and J. Lagergren. Benefits of non-linear scale parameterizations in black box variational inference through smoothness results and gradient variance bounds. In International Conference on Artificial Intelligence and Statistics, pages 3538–3546. PMLR, 2024.
  23. [23] A. Jacobsen and A. Cutkosky. Unconstrained online learning with unbounded losses. In International Conference on Machine Learning, pages 14590–14630. PMLR, 2023.
  24. [24] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.
  25. [25] A. Khaled and P. Richtárik. Better theory for SGD in the non-convex world. Transactions on Machine Learning Research, 2023.
  26. [26] K. Kim, J. Oh, K. Wu, Y. Ma, and J. Gardner. On the convergence of black-box variational inference. Advances in Neural Information Processing Systems, 36:44615–44657, 2023.
  27. [27] D. Kingma and M. Welling. Efficient gradient-based inference through transformations between Bayes nets and neural nets. In International Conference on Machine Learning, pages 1782–1790. PMLR, 2014.
  28. [28] R. Latała. On the Equivalence Between Geometric and Arithmetic Means for Log-Concave Measures, pages 123–128. Mathematical Sciences Research Institute Publications. Cambridge University Press, 1999.
  29. [29] F. Locatello, G. Dresdner, R. Khanna, I. Valera, and G. Rätsch. Boosting black box variational inference. Advances in Neural Information Processing Systems, 31, 2018.
  30. [30] V. D. Milman and G. Schechtman. Asymptotic Theory of Finite Dimensional Normed Spaces. Springer, 1986.
  31. [31] C. Modi, D. Cai, and L. K. Saul. Batch, match, and patch: low-rank approximations for score-based variational inference. arXiv preprint arXiv:2410.22292, 2024.
  32. [32] C. Modi, R. Gower, C. Margossian, Y. Yao, D. Blei, and L. Saul. Variational inference with Gaussian score matching. Advances in Neural Information Processing Systems, 36:29935–29950, 2023.
  33. [33] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
  34. [34] Y. Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125–161, 2013.
  35. [35] G. Neu and N. Okolo. Dealing with unbounded gradients in stochastic saddle-point optimization. arXiv preprint arXiv:2402.13903, 2024.
  36. [36] R. Ranganath, S. Gerrish, and D. Blei. Black box variational inference. In Artificial Intelligence and Statistics, pages 814–822. PMLR, 2014.
  37. [37] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
  38. [38] H. Robbins and D. Siegmund. A convergence theorem for non negative almost supermartingales and some applications. In Optimizing Methods in Statistics, pages 233–257. Elsevier, 1971.
  39. [39] R. Y. Rubinstein. Sensitivity analysis and performance extrapolation for computer models. Operations Research, 37(1):72–81, 1989.
  40. [40] F. Sun, I. Fatkhullin, and N. He. Natural gradient VI: Guarantees for non-conjugate models. arXiv preprint arXiv:2510.19163, 2025.
  41. [41] M. Titsias and M. Lázaro-Gredilla. Doubly stochastic variational Bayes for non-conjugate inference. In International Conference on Machine Learning, pages 1971–1979. PMLR, 2014.
  42. [42] M. Wang and D. P. Bertsekas. Stochastic first-order methods with random constraint projection. SIAM Journal on Optimization, 26(1):681–717, 2016.
  43. [43] D. Wierstra, T. Schaul, T. Glasmachers, Y. Sun, J. Peters, and J. Schmidhuber. Natural evolution strategies. The Journal of Machine Learning Research, 15(1):949–980, 2014.
  44. [44] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
  45. [45] K. Wu and J. R. Gardner. Understanding stochastic natural gradient variational inference. arXiv preprint arXiv:2406.01870, 2024.
  46. [46] D. Zantedeschi and K. Muthuraman. Fisher-geometric diffusion in stochastic gradient descent: Optimal rates, oracle complexity, and information-theoretic limits. arXiv e-prints, pages arXiv–2603, 2026.

  47. [47] Extracted text anchor, not an external work (related-work discussion, truncated): "…studies the oracle complexity of Natural Gradients with minibatches in a broader optimization context. In contrast to these approaches, our work focuses on Projected SGD applied to elliptic location-scale families, providing convergence guarantees without relying on the restrictive geometric properties or conjugacy requirements of exponential families…"

  48. [48] Extracted text anchor (introduction, truncated): "…remains the foundational workhorse for optimization. Classical convergence analyses of SGD typically rely on the assumption of uniformly bounded variance [33, 15, 6], or more recently, on the Expected Smoothness (ABC) condition [25]. However, as highlighted by [1], the bounded variance assumption is often overly restrictive for modern machine learning pro…"

  49. [49] Extracted text anchor (experimental setup, truncated): "This allows to choose the theoretically optimal step size γ = 1/(2L√E), which is significantly larger than that in vanilla PSGD. • MPSGD without scaling: we apply exactly the method described in the first point of Theorem 4, choosing Λ = I_{d+d²}, γ = 1/(2L), N = √(d+κ(p))·√E and K = √E/√(d+κ(p)). • MPSGD with scaling: the second method described in Theorem 4, choosing…"

  50. [50] Extracted proof fragment ("taking the expectation over the all sequence", truncated): "…∑_{k=0}^∞ c_k < ∞ almost surely. The first steps of the proof are identical to that of Theorem 6, without taking (γ_k)_{k∈ℕ} and (N_k)_{k∈ℕ} as constants. Applying the same steps, it follows that E[‖θ_{k+1} − θ*‖²_{Λ⁻¹} | θ_k] ≤ (1 + (a/N_k)γ_k²)‖θ_k − θ*‖²_{Λ⁻¹} + (b/N_k)γ_k² − 2γ_k(F(θ_k) − F(θ*)). Let τ_k = (1 + (a/N_k)γ_k²)⁻¹. We define weights α_0 = 1 and α_k = ∏_{i=0}^{k−1} τ_i for k ≥ 1, to…"