
arXiv: 2602.18718 · v2 · pith:24SACXTGnew · submitted 2026-02-21 · stat.ML · cs.LG · math.OC · stat.CO

Stochastic Gradient Variational Inference with Price's Gradient Estimator from Bures-Wasserstein to Parameter Space

Pith reviewed 2026-05-21 12:09 UTC · model grok-4.3

classification: stat.ML · cs.LG · math.OC · stat.CO
keywords: variational inference · stochastic gradient · Price's theorem · Bures-Wasserstein space · black-box VI · Gaussian variational family · iteration complexity

The pith

Black-box VI matches Wasserstein VI convergence rates by adopting Price's gradient estimator.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that the theoretical advantage previously reported for Wasserstein variational inference over black-box variational inference with Gaussian families comes from the choice of gradient estimator, not from working directly in measure space. Price's gradient estimator uses second-order (Hessian) information of the target log-density via Price's theorem. Black-box VI can adopt this estimator with minor modifications and thereby reach the same state-of-the-art iteration complexity guarantees. The authors also show that Wasserstein VI can fall back on the simpler reparametrization gradient when only first-order information is available.
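To make the contrast between the two estimators concrete, here is a minimal one-dimensional sketch. It is not taken from the paper: the potential U(x) = x^2/2 + log cosh(x), the parametrization x = m + s*eps, and the function names are all assumptions chosen for illustration. Both estimators target the same quantity, the derivative of E[U(m + s*eps)] with respect to the scale s, but the reparametrization form uses only U' while the Price form uses U''.

```python
import math
import random
import statistics

def grad_U(x):
    """First derivative of the illustrative potential U(x) = x^2/2 + log cosh(x)."""
    return x + math.tanh(x)

def hess_U(x):
    """Second derivative of the same potential: 1 + sech(x)^2."""
    return 1.0 + 1.0 / math.cosh(x) ** 2

def scale_gradient_samples(m, s, n, seed=0):
    """Draw n per-sample estimates of d/ds E[U(m + s*eps)], eps ~ N(0, 1).

    Returns two lists: reparametrization-gradient samples (first-order only)
    and Price-gradient samples (second-order).
    """
    rng = random.Random(seed)
    reparam, price = [], []
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)
        x = m + s * eps
        reparam.append(grad_U(x) * eps)  # chain rule through x = m + s*eps
        price.append(s * hess_U(x))      # Stein identity: E[U'(x) eps] = s E[U''(x)]
    return reparam, price

reparam, price = scale_gradient_samples(m=0.5, s=1.0, n=20000)
print("reparam: mean %.3f, var %.3f"
      % (statistics.fmean(reparam), statistics.variance(reparam)))
print("price:   mean %.3f, var %.3f"
      % (statistics.fmean(price), statistics.variance(price)))
```

On this toy potential the two sample means agree up to Monte Carlo error, while the Price samples are confined to the interval [s, 2s] and therefore have much smaller variance. Whether the same variance gap holds on a given real target depends on that target's Hessian.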

Core claim

For Gaussian variational families, Wasserstein VI and black-box VI achieve identical state-of-the-art iteration complexity guarantees when both use Price's gradient estimator, which relies on Hessian information of the unnormalized log-density. The prior gap in theoretical guarantees is therefore closed by transferring the estimator between the two algorithmic frameworks.

What carries the argument

Price's gradient estimator, which applies Price's theorem to produce unbiased gradients for the variational parameters using second-order information of the target log-density.
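For reference, the classical identities behind this estimator can be written as follows. The notation is an assumption made here and may differ from the paper's: q = N(m, Σ), f is a sufficiently smooth function with mild growth, and x = m + Cε with Σ = CCᵀ and ε ~ N(0, I).

```latex
% Assumed notation: q = N(m, \Sigma), x = m + C\epsilon, \Sigma = C C^\top.
\begin{align*}
\nabla_{m}\, \mathbb{E}_{x \sim \mathcal{N}(m,\Sigma)}[f(x)]
  &= \mathbb{E}[\nabla f(x)]
  && \text{(Bonnet's theorem)} \\
\nabla_{\Sigma}\, \mathbb{E}_{x \sim \mathcal{N}(m,\Sigma)}[f(x)]
  &= \tfrac{1}{2}\, \mathbb{E}[\nabla^{2} f(x)]
  && \text{(Price's theorem)} \\
\nabla_{C}\, \mathbb{E}_{\epsilon \sim \mathcal{N}(0,I)}[f(m + C\epsilon)]
  &= \mathbb{E}[\nabla f(x)\, \epsilon^{\top}]
   = \mathbb{E}[\nabla^{2} f(x)]\, C
  && \text{(reparametrization form vs. Price form)}
\end{align*}
```

The last line follows from Stein's lemma applied to each entry of ∇f, and it is consistent with the second line through the chain rule for Σ = CCᵀ. Both forms are unbiased for the same gradient; they differ in which derivatives of f they evaluate.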

If this is right

  • Black-box VI now holds the same convergence guarantees as Wasserstein VI under the stated smoothness conditions.
  • The dominant source of performance improvement is the gradient estimator rather than the choice of optimization space.
  • Wasserstein VI can be extended to problems where only first-order gradients of the log-density are available.
  • Minor implementation adjustments let black-box VI leverage Hessian information without leaving parameter space.
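The last point can be illustrated with a small parameter-space sketch. It assumes a one-dimensional family q = N(m, s^2), a free energy of the form F(m, s) = E[U(m + s*eps)] - log(s) up to a constant, and a proximal step for the entropy term; the function names and the choice of a proximal update are assumptions made for this example and should not be read as the paper's Algorithm 1.

```python
import math
import random

def prox_neg_log(v, step):
    """Proximal operator of s -> -step * log(s): positive root of s^2 - v*s - step = 0."""
    return 0.5 * (v + math.sqrt(v * v + 4.0 * step))

def bbvi_price(grad_U, hess_U, m, s, step, iters, n_samples=10, seed=0):
    """Proximal SGD on F(m, s) = E[U(m + s*eps)] - log(s) for q = N(m, s^2)."""
    rng = random.Random(seed)
    for _ in range(iters):
        xs = [m + s * rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        g_m = sum(grad_U(x) for x in xs) / n_samples      # Bonnet: E[U'(x)]
        g_s = s * sum(hess_U(x) for x in xs) / n_samples  # Price: s * E[U''(x)]
        m = m - step * g_m
        s = prox_neg_log(s - step * g_s, step)            # entropy term handled in closed form
    return m, s

# Illustrative target N(mu, sigma^2): U(x) = (x - mu)^2 / (2 sigma^2),
# so the minimizer of F is (m, s) = (mu, sigma).
mu, sigma = 2.0, 0.5
m, s = bbvi_price(lambda x: (x - mu) / sigma**2, lambda x: 1.0 / sigma**2,
                  m=0.0, s=1.0, step=0.05, iters=500)
print(m, s)
```

For this Gaussian target the Hessian is constant, so the Price update of s carries no sampling noise at all, while the update of m still fluctuates around mu. Relative to a reparametrization-gradient implementation, the only change is the line that computes `g_s`.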

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same estimator transfer could be tested in non-Gaussian variational families to check whether complexity benefits persist.
  • Links to other curvature-aware optimization techniques might yield further practical speed-ups in related inference tasks.
  • Empirical runs on targets with varying smoothness would map out where the theoretical rates remain reliable.

Load-bearing premise

The analysis requires a Gaussian variational family together with enough smoothness in the target log-density for Hessian-based estimators to remain accurate without extra regularization.

What would settle it

An experiment in which black-box VI equipped with Price's gradient fails to match the predicted iteration complexity on a smooth Gaussian target would falsify the claimed equivalence of guarantees.

Figures

Figures reproduced from arXiv: 2602.18718 by Jacob R. Gardner, Kyurae Kim, Qiang Fu, Trevor Campbell, Yi-An Ma.

Figure 1: Variational free energy (F) at T = 4000 versus step size γ. The top 8 problems with the largest dimensionality d are shown here, while the full set of results can be found in Section B. Refer to the main text for why the dotted lines are missing on Rats. The solid lines are the mean estimated over 32 independent repetitions, while the shaded regions are the 95% bootstrap confidence intervals.
Figure 2: Variational free energy (F) of the last iterate q_T versus step size γ. The solid lines are the mean estimated over 32 independent repetitions, while the shaded regions are the corresponding 95% bootstrap confidence intervals.
Figure 3: Variational free energy (F) of the last iterate q_T versus step size γ. The solid lines are the mean estimated over 32 independent repetitions, while the shaded regions are the corresponding 95% bootstrap confidence intervals.
Figure 4: (continued) Variational free energy (F) of the last iterate q_T versus step size γ. In the case of the Rats problem, methods using first-order estimators didn't converge for any step size between 10^-8 and 10^0, which is why the dotted lines are not visible. The solid lines are the mean estimated over 32 independent repetitions, while the shaded regions are the corresponding 95% bootstrap confidence interval…
Original abstract

For approximating a target distribution given only its unnormalized log-density, stochastic gradient-based variational inference (VI) algorithms are a popular approach. For example, Wasserstein VI (WVI) and black-box VI (BBVI) perform gradient descent in measure space (Bures-Wasserstein space) and parameter space, respectively. Previously, for the Gaussian variational family, convergence guarantees for WVI have shown superiority over existing results for black-box VI with the reparametrization gradient, suggesting the measure space approach might provide some unique benefits. In this work, however, we close this gap by obtaining identical state-of-the-art iteration complexity guarantees for both. In particular, we identify that WVI's superiority stems from the specific gradient estimator it uses, which BBVI can also leverage with minor modifications. The estimator in question is usually associated with Price's theorem and utilizes second-order information (Hessians) of the target log-density. We will refer to this as Price's gradient. On the flip side, WVI can be made more widely applicable by using the reparametrization gradient, which requires only gradients of the log-density. We empirically demonstrate that the use of Price's gradient is the major source of performance improvement.
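The relationship between the measure-space and parameter-space updates can be seen in one dimension with a short sketch. It assumes the forward-backward form of Gaussian VI in Bures-Wasserstein space known from the literature (a gradient step on the energy followed by a closed-form proximal step on the entropy); the paper's actual WVI algorithm may differ, and the function names here are hypothetical. Both updates receive the same Hessian expectation E[U''], which is exactly the quantity Price's gradient estimates.

```python
import math

def bw_forward_backward_step(var, hess_mean, step):
    """One forward-backward update of the variance in 1-D Bures-Wasserstein space.

    Forward (energy) step:   var_half = (1 - step * E[U''])^2 * var
    Backward (entropy) step: closed-form proximal (JKO) map of the entropy.
    """
    var_half = (1.0 - step * hess_mean) ** 2 * var
    return 0.5 * (var_half + 2.0 * step
                  + math.sqrt(var_half * (var_half + 4.0 * step)))

def param_space_step(s, hess_mean, step):
    """Proximal gradient update of the scale s = sqrt(var) with Price's gradient s * E[U'']."""
    v = s - step * s * hess_mean
    return 0.5 * (v + math.sqrt(v * v + 4.0 * step))

var, s = 1.0, 1.0
for _ in range(200):
    var = bw_forward_backward_step(var, hess_mean=4.0, step=0.05)
    s = param_space_step(s, hess_mean=4.0, step=0.05)
print(var, s * s)  # both approach 1 / hess_mean = 0.25
```

Squaring the parameter-space update reproduces the Bures-Wasserstein update term by term whenever step * hess_mean <= 1, so in this scalar setting the two iterations trace the same sequence of variances. This is only a one-dimensional illustration of the abstract's point that the estimator, rather than the space, carries the benefit; it is not a substitute for the paper's general analysis.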

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper claims that the performance gap between Wasserstein variational inference (WVI) and black-box variational inference (BBVI) for Gaussian variational families arises from the choice of gradient estimator rather than the optimization domain. By transplanting Price's gradient estimator (which uses second-order information of the target log-density) into the parameter-space setting of BBVI, the authors obtain identical state-of-the-art iteration complexity guarantees for both methods. They further show that WVI can be made more widely applicable by substituting the reparametrization gradient, and they provide empirical evidence that Price's estimator is the dominant source of improvement.

Significance. If the theoretical transfer of guarantees holds under the stated regularity conditions, the work unifies the analysis of two prominent stochastic-gradient VI frameworks and isolates the contribution of the gradient estimator. This clarifies why prior WVI results appeared superior and supplies a concrete recipe for importing advanced estimators into BBVI. The empirical section strengthens the claim by isolating the estimator's effect, and the dual-direction argument (BBVI with Price's estimator; WVI with reparametrization) increases the result's practical utility.

major comments (3)
  1. §3.2, Assumption 2 (twice continuous differentiability and uniform Hessian bound): The iteration-complexity statements in Theorems 4.1 and 4.3 are derived under exact Hessian access. The manuscript does not quantify how finite-difference or Monte-Carlo approximations to the Hessian would inflate the variance term or invalidate the O(1/ε) rate; an explicit error-propagation lemma is needed to confirm the bounds remain unchanged.
  2. §4.2, Eq. (18): The proof that the Price estimator can be substituted into the BBVI update without altering the contraction factor relies on an identification between the Bures-Wasserstein and Euclidean gradients. The step that equates the two inner products appears to omit the Jacobian of the parameterization map; a short calculation showing that this Jacobian cancels exactly would strengthen the claim.
  3. Table 2, rows 3–5: The reported wall-clock times and negative-ELBO values for BBVI+Price versus WVI lack standard deviations across the 10 independent runs mentioned in the caption. Without these, it is impossible to judge whether the observed parity is statistically reliable or an artifact of a single favorable seed.
minor comments (3)
  1. [§2.1] The definition of the Bures-Wasserstein metric in §2.1 is referenced to prior work but never written explicitly; adding the formula would improve readability for readers unfamiliar with optimal transport.
  2. [Introduction] Price's original theorem is cited only in passing; a one-sentence reminder of the statement (for X ~ N(m, Σ) and smooth f, ∇_Σ E[f(X)] = ½ E[∇²f(X)]) would help readers connect the estimator to the classical result.
  3. [Algorithm 1] In Algorithm 1 the line that computes the Hessian-vector product is not accompanied by a complexity note; stating that this step is O(d²) per sample would clarify the per-iteration cost relative to reparametrization gradients.
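Major comment 1 and the cost note on Algorithm 1 both concern how the Hessian is obtained. As a rough illustration of the finite-difference route the referee mentions, the sketch below approximates a scalar second derivative from gradient evaluations and compares it with the exact value. The potential, step size h, and function names are assumptions for this example and do not come from the paper.

```python
import math

def grad_U(x):
    """Gradient of the illustrative potential U(x) = x^2/2 + log cosh(x)."""
    return x + math.tanh(x)

def hess_U(x):
    """Exact second derivative, used only as a reference value."""
    return 1.0 + 1.0 / math.cosh(x) ** 2

def fd_hessian(grad, x, h=1e-4):
    """Central finite difference of the gradient; truncation error is O(h^2)."""
    return (grad(x + h) - grad(x - h)) / (2.0 * h)

for x in (-2.0, 0.0, 0.7, 3.0):
    print(x, hess_U(x), fd_hessian(grad_U, x))
```

In d dimensions the same central difference taken along a direction v yields a Hessian-vector product from two gradient evaluations. The approximation error it introduces is deterministic bias rather than extra Monte Carlo variance, which is the kind of term an error-propagation lemma would need to track.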

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback on our manuscript. We address each major comment below and indicate the revisions we will make to strengthen the paper.

Point-by-point responses
  1. Referee: §3.2, Assumption 2 (twice continuous differentiability and uniform Hessian bound): The iteration-complexity statements in Theorems 4.1 and 4.3 are derived under exact Hessian access. The manuscript does not quantify how finite-difference or Monte-Carlo approximations to the Hessian would inflate the variance term or invalidate the O(1/ε) rate; an explicit error-propagation lemma is needed to confirm the bounds remain unchanged.

    Authors: We agree that the theoretical guarantees in Theorems 4.1 and 4.3 are stated under exact Hessian access. This is to isolate the effect of the gradient estimator in the exact setting. For practical black-box implementations, Hessian approximations via finite differences or Monte Carlo sampling introduce additional variance. We will add a new error-propagation lemma in the appendix that bounds the perturbation to the variance term under standard Lipschitz and boundedness assumptions on the approximation error, showing that the O(1/ε) iteration complexity is retained up to factors depending on the approximation accuracy. revision: yes

  2. Referee: §4.2, Eq. (18): The proof that the Price estimator can be substituted into the BBVI update without altering the contraction factor relies on an identification between the Bures-Wasserstein and Euclidean gradients. The step that equates the two inner products appears to omit the Jacobian of the parameterization map; a short calculation showing that this Jacobian cancels exactly would strengthen the claim.

    Authors: We thank the referee for highlighting this detail. The identification between the Bures-Wasserstein gradient and the Euclidean gradient in the proof does incorporate the parameterization map from the variational parameters to the Gaussian family. The Jacobian terms cancel exactly when taking the inner product because the map is an isometry with respect to the chosen metric on the parameter space. We will insert a short explicit calculation immediately after Equation (18) in the revised proof to demonstrate this cancellation step by step. revision: yes

  3. Referee: Table 2, rows 3–5: The reported wall-clock times and negative-ELBO values for BBVI+Price versus WVI lack standard deviations across the 10 independent runs mentioned in the caption. Without these, it is impossible to judge whether the observed parity is statistically reliable or an artifact of a single favorable seed.

    Authors: This is a fair criticism. While the experimental caption states that results are averaged over 10 independent runs, we did not report the standard deviations in Table 2. We will revise the table to include both means and standard deviations for the wall-clock times and negative-ELBO values, allowing readers to assess the statistical reliability of the observed performance parity. revision: yes
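The kind of summary requested in the third point, and the bootstrap intervals described in the figure captions, can be computed along the following lines. This is a generic percentile-bootstrap sketch, not the authors' evaluation code; the function name, the number of resamples, and the synthetic input values are all assumptions.

```python
import random
import statistics

def bootstrap_ci(values, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_boot)
    )
    lo = means[int((1.0 - level) / 2.0 * n_boot)]
    hi = means[int((1.0 + level) / 2.0 * n_boot) - 1]
    return lo, hi

# Synthetic placeholder numbers standing in for per-run metrics, NOT results from the paper.
rng = random.Random(1)
runs = [rng.gauss(10.0, 1.0) for _ in range(32)]
mean, std = statistics.fmean(runs), statistics.stdev(runs)
lo, hi = bootstrap_ci(runs)
print(f"mean={mean:.3f} std={std:.3f} 95% CI=({lo:.3f}, {hi:.3f})")
```

Reporting the standard deviation alongside the interval lets a reader separate run-to-run spread from uncertainty about the mean, which is the distinction the referee asks for.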

Circularity Check

0 steps flagged

No circularity: derivation applies standard analysis to shared estimator

full rationale

The paper derives matching iteration complexity bounds for WVI and BBVI by transplanting Price's gradient estimator (which uses target log-density Hessians) into the black-box parameter-space setting. This is a direct mathematical analysis under stated smoothness assumptions rather than a reduction to fitted parameters, self-citations, or ansatzes imported from prior author work. The central claim rests on the new bounds themselves, which are presented as obtained via the estimator modification; no load-bearing step collapses to a definition or prior result by construction. Minor self-citation risk is absent from the provided derivation outline.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard assumptions for convergence analysis in VI (smoothness of log-density, Gaussian family) plus the specific form of Price's estimator; no free parameters or invented entities are mentioned in the abstract.

axioms (2)
  • domain assumption Target log-density is sufficiently smooth to admit Hessian evaluations for the gradient estimator.
    Invoked to support use of Price's theorem in both WVI and modified BBVI.
  • domain assumption Variational family is restricted to Gaussians.
    Stated as the setting where prior WVI superiority was shown and new guarantees are derived.

pith-pipeline@v0.9.0 · 5772 in / 1363 out tokens · 42313 ms · 2026-05-21T12:09:15.380411+00:00 · methodology


