Stochastic Gradient Variational Inference with Price's Gradient Estimator from Bures-Wasserstein to Parameter Space

Jacob R. Gardner; Kyurae Kim; Qiang Fu; Trevor Campbell; Yi-An Ma

arXiv: 2602.18718 · v2 · pith:24SACXTGnew · submitted 2026-02-21 · stat.ML · cs.LG · math.OC · stat.CO

Kyurae Kim, Qiang Fu, Yi-An Ma, Jacob R. Gardner, Trevor Campbell

Pith reviewed 2026-05-21 12:09 UTC · model grok-4.3

classification stat.ML · cs.LG · math.OC · stat.CO

keywords variational inference · stochastic gradient · Price's theorem · Bures-Wasserstein space · black-box VI · Gaussian variational family · iteration complexity

The pith

Black-box VI matches Wasserstein VI convergence rates by adopting Price's gradient estimator.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the previous edge of Wasserstein variational inference over black-box variational inference for Gaussian families comes from the choice of gradient estimator, not from working directly in measure space. Price's gradient estimator draws on second-order Hessian information of the target log-density via Price's theorem. Black-box VI can incorporate this estimator through minor changes and thereby reach the same state-of-the-art iteration complexity bounds. The authors also show that Wasserstein VI can fall back to the simpler reparametrization gradient when only first-order information is available.

Core claim

For Gaussian variational families, Wasserstein VI and black-box VI both achieve identical optimal iteration complexity when they employ Price's gradient estimator that uses Hessian information of the unnormalized log-density. The prior gap in theoretical guarantees is therefore closed by transferring the estimator between the two algorithmic frameworks.

What carries the argument

Price's gradient estimator, which applies Price's theorem to produce unbiased gradients for the variational parameters using second-order information of the target log-density.
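To make the contrast with the reparametrization gradient concrete, here is a minimal one-dimensional sketch. It assumes a Gaussian q = N(m, s²) parametrized by its scale s, and a hypothetical potential U(z) = z²/2 + log cosh(z) chosen only for illustration; the function names are not from the paper, and the paper's own parametrization may differ. Both estimators target the same derivative of E[U(m + sε)] with respect to s, one using U' and the other using U''.

```python
import math
import random


def dU(z):
    """First derivative of the hypothetical potential U(z) = z^2/2 + log(cosh(z))."""
    return z + math.tanh(z)


def d2U(z):
    """Second derivative of the same potential; it lies in (1, 2]."""
    return 1.0 + 1.0 / math.cosh(z) ** 2


def reparam_scale_grad(m, s, eps):
    """First-order estimate of d/ds E[U(m + s*eps)]: U'(z) * eps."""
    return dU(m + s * eps) * eps


def price_scale_grad(m, s, eps):
    """Second-order estimate of the same derivative: s * U''(z)."""
    return s * d2U(m + s * eps)


def mean_and_var(samples):
    """Sample mean and unbiased sample variance."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / (n - 1)
    return mean, var


rng = random.Random(0)
m, s = 0.5, 1.5
draws = [rng.gauss(0.0, 1.0) for _ in range(20000)]

rep_mean, rep_var = mean_and_var([reparam_scale_grad(m, s, e) for e in draws])
pri_mean, pri_var = mean_and_var([price_scale_grad(m, s, e) for e in draws])

print(f"reparametrization: mean={rep_mean:.3f}, variance={rep_var:.3f}")
print(f"Price:             mean={pri_mean:.3f}, variance={pri_var:.3f}")
```

The two sample means should agree up to Monte Carlo error, since Stein's lemma gives E[U'(z)ε] = s E[U''(z)] for Gaussian ε, while the per-sample variances differ. This toy comparison says nothing about the paper's complexity bounds; it only illustrates that the two estimators are interchangeable in expectation.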

If this is right

Black-box VI now holds the same convergence guarantees as Wasserstein VI under the stated smoothness conditions.
The dominant source of performance improvement is the gradient estimator rather than the choice of optimization space.
Wasserstein VI can be extended to problems where only first-order gradients of the log-density are available.
Minor implementation adjustments let black-box VI leverage Hessian information without leaving parameter space.
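As a rough illustration of what such an adjustment might look like, the sketch below runs plain stochastic gradient descent over (m, s) for a one-dimensional Gaussian q = N(m, s²), using the Hessian-based estimate for the scale. Everything here is an assumption made for illustration: the toy Gaussian target, the function name `bbvi_price_1d`, the projection `s >= s_min`, and the direct differentiation of the entropy term. The paper's algorithm (for instance its parametrization and its handling of the entropy) is not reproduced on this page and may differ.

```python
import random

# Hypothetical toy target N(1, 0.5^2), so U(z) = (z - 1)^2 / (2 * 0.25).
TARGET_MEAN, TARGET_STD = 1.0, 0.5


def dU(z):
    """Gradient of the potential (negative unnormalized log-density)."""
    return (z - TARGET_MEAN) / TARGET_STD**2


def d2U(z):
    """Hessian of the potential; constant for a Gaussian target."""
    return 1.0 / TARGET_STD**2


def bbvi_price_1d(steps=2000, step_size=0.01, seed=0, s_min=1e-3):
    """SGD on F(m, s) = E[U(m + s*eps)] - log(s), one sample per step."""
    rng = random.Random(seed)
    m, s = 0.0, 2.0
    for _ in range(steps):
        z = m + s * rng.gauss(0.0, 1.0)
        grad_m = dU(z)                 # d/dm E[U] = E[U'(z)]
        grad_s = s * d2U(z) - 1.0 / s  # Hessian-based energy term plus entropy term
        m -= step_size * grad_m
        s = max(s - step_size * grad_s, s_min)  # keep the scale positive
    return m, s


m, s = bbvi_price_1d()
print(f"m = {m:.3f} (target {TARGET_MEAN}), s = {s:.3f} (target {TARGET_STD})")
```

Relative to a reparametrization-gradient loop, the only line that changes is `grad_s`, which is the sense in which the modification is minor in this sketch. For this Gaussian target the Hessian is constant, so the scale update carries no sampling noise at all.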

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same estimator transfer could be tested in non-Gaussian variational families to check whether complexity benefits persist.
Links to other curvature-aware optimization techniques might yield further practical speed-ups in related inference tasks.
Empirical runs on targets with varying smoothness would map out where the theoretical rates remain reliable.

Load-bearing premise

The analysis requires a Gaussian variational family together with enough smoothness in the target log-density for Hessian-based estimators to remain accurate without extra regularization.

What would settle it

An experiment in which black-box VI equipped with Price's gradient fails to match the predicted iteration complexity on a smooth Gaussian target would falsify the claimed equivalence of guarantees.
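A toy version of such an experiment can be sketched as follows, assuming a one-dimensional Gaussian target so that the free-energy gap is available in closed form as a KL divergence. The target N(1, 1), the step-size grid, the iteration budget, the number of repetitions, and the clamp `s_min` are all hypothetical choices for illustration and are unrelated to the benchmark problems in the paper.

```python
import math
import random

MU, SIGMA = 1.0, 1.0  # hypothetical Gaussian target N(MU, SIGMA^2)


def kl_to_target(m, s):
    """KL(N(m, s^2) || N(MU, SIGMA^2)): the free-energy gap to the optimum."""
    return math.log(SIGMA / s) + (s * s + (m - MU) ** 2) / (2 * SIGMA**2) - 0.5


def run(estimator, step_size, steps, rng, s_min=0.1):
    """One SGD run over (m, s); returns the gap of the last iterate."""
    m, s = 0.0, 2.0
    for _ in range(steps):
        eps = rng.gauss(0.0, 1.0)
        du = (m + s * eps - MU) / SIGMA**2  # U'(z)
        if estimator == "price":
            grad_s = s / SIGMA**2           # s * U''(z), Hessian-based
        else:
            grad_s = du * eps               # reparametrization, first-order
        m -= step_size * du
        s = max(s - step_size * (grad_s - 1.0 / s), s_min)
    return kl_to_target(m, s)


def sweep(step_sizes, steps=500, reps=200, seed=0):
    """Mean last-iterate gap for each (estimator, step size) pair."""
    results = {}
    for estimator in ("reparam", "price"):
        for gamma in step_sizes:
            rng = random.Random(seed)
            gaps = [run(estimator, gamma, steps, rng) for _ in range(reps)]
            results[(estimator, gamma)] = sum(gaps) / reps
    return results


results = sweep([0.01, 0.05, 0.2])
for (estimator, gamma), gap in sorted(results.items()):
    print(f"{estimator:8s} gamma={gamma:<5} mean gap={gap:.4f}")
```

A sweep like this only compares last-iterate gaps at a fixed budget on one easy target. Checking the predicted iteration complexity itself would require tracking how the number of steps needed to reach a given accuracy scales as that accuracy is tightened.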

Figures

Figures reproduced from arXiv: 2602.18718 by Jacob R. Gardner, Kyurae Kim, Qiang Fu, Trevor Campbell, Yi-An Ma.

**Figure 1.** Variational free energy (F) at T = 4000 versus step size γ. The top 8 problems with the largest dimensionality d are shown here, while the full set of results can be found in Section B. Refer to the main text for why the dotted lines are missing on Rats. The solid lines are the mean estimated over 32 independent repetitions, while the shaded regions are the 95% bootstrap confidence intervals.

**Figure 2.** Variational free energy (F) of the last iterate q_T versus step size γ. The solid lines are the mean estimated over 32 independent repetitions, while the shaded regions are the corresponding 95% bootstrap confidence intervals.

**Figure 3.** Variational free energy (F) of the last iterate q_T versus step size γ. The solid lines are the mean estimated over 32 independent repetitions, while the shaded regions are the corresponding 95% bootstrap confidence intervals.

**Figure 4.** (continued) Variational free energy (F) of the last iterate q_T versus step size γ. In the case of the Rats problem, methods using first-order estimators didn’t converge for any step size between 10⁻⁸ and 10⁰, which is why the dotted lines are not visible. The solid lines are the mean estimated over 32 independent repetitions, while the shaded regions are the corresponding 95% bootstrap confidence interval…

Original abstract

For approximating a target distribution given only its unnormalized log-density, stochastic gradient-based variational inference (VI) algorithms are a popular approach. For example, Wasserstein VI (WVI) and black-box VI (BBVI) perform gradient descent in measure space (Bures-Wasserstein space) and parameter space, respectively. Previously, for the Gaussian variational family, convergence guarantees for WVI have shown superiority over existing results for black-box VI with the reparametrization gradient, suggesting the measure space approach might provide some unique benefits. In this work, however, we close this gap by obtaining identical state-of-the-art iteration complexity guarantees for both. In particular, we identify that WVI's superiority stems from the specific gradient estimator it uses, which BBVI can also leverage with minor modifications. The estimator in question is usually associated with Price's theorem and utilizes second-order information (Hessians) of the target log-density. We will refer to this as Price's gradient. On the flip side, WVI can be made more widely applicable by using the reparametrization gradient, which requires only gradients of the log-density. We empirically demonstrate that the use of Price's gradient is the major source of performance improvement.
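The measure-space side of this comparison can be sketched in one dimension as well. The update below uses the commonly stated Bures-Wasserstein gradient step for Gaussians, var ← (1 − γ(h − 1/var))² var, where h estimates E[U''(z)]; that this matches the algorithm analyzed in the paper is an assumption of the sketch. The estimate h is taken either from the Hessian directly or, via Stein's identity E[U''(z)] = E[U'(z)(z − m)]/var, from first-order information only. The toy target and the function name `bw_sgd_1d` are hypothetical.

```python
import random

# Hypothetical toy target N(1, 0.5^2).
TARGET_MEAN, TARGET_VAR = 1.0, 0.25


def dU(z):
    """Gradient of the potential U(z) = (z - 1)^2 / (2 * 0.25)."""
    return (z - TARGET_MEAN) / TARGET_VAR


def d2U(z):
    """Hessian of the potential; constant for a Gaussian target."""
    return 1.0 / TARGET_VAR


def bw_sgd_1d(use_hessian, steps=3000, step_size=0.01, seed=0):
    """Stochastic Bures-Wasserstein-style update on (m, var) for q = N(m, var)."""
    rng = random.Random(seed)
    m, var = 0.0, 4.0
    for _ in range(steps):
        z = m + var**0.5 * rng.gauss(0.0, 1.0)
        if use_hessian:
            h = d2U(z)                 # second-order estimate of E[U'']
        else:
            h = dU(z) * (z - m) / var  # first-order estimate via Stein's identity
        m -= step_size * dU(z)
        scale = 1.0 - step_size * (h - 1.0 / var)
        var = scale * var * scale      # stays nonnegative by construction
    return m, var


m_h, var_h = bw_sgd_1d(use_hessian=True)
m_f, var_f = bw_sgd_1d(use_hessian=False)
print(f"Hessian-based: m={m_h:.3f}, var={var_h:.3f}")
print(f"first-order:   m={m_f:.3f}, var={var_f:.3f}")
```

The update rule is identical in both branches and only the estimate of the expected Hessian changes. This mirrors, in a toy setting, the abstract's point that the estimator and the optimization space can be chosen independently.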

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows that Price's gradient estimator, not the Bures-Wasserstein geometry, explains WVI's better rates and that BBVI can match them by adopting it.

The main point is that this paper traces the better convergence rates in Wasserstein VI back to the gradient estimator it uses, not the space itself. Price's estimator brings in Hessian info from the target density, and once BBVI uses that too, the iteration complexities match. They also show you can go the other way and run WVI with just first-order reparam gradients.

What stands out is how they separate the estimator from the geometry and get the matching bounds for the Gaussian family. The experiments confirm that this estimator switch accounts for most of the practical improvement. The proofs look like a straightforward application of existing analysis tools to the common estimator.

One thing to watch is the twice-differentiability assumption. The bounds rely on having clean access to the Hessian without approximation errors piling up. If you're estimating the second derivatives or the target isn't smooth enough, those extra terms could change the picture, and the abstract doesn't detail the tolerance for that.

This is worth a look for anyone doing theory on variational methods with Gaussians. It clarifies why one approach looked better and gives a way to level the playing field. I'd send it out for review; the contribution is focused but the identification of the estimator is a useful step.

Referee Report

3 major / 3 minor

Summary. The paper claims that the performance gap between Wasserstein variational inference (WVI) and black-box variational inference (BBVI) for Gaussian variational families arises from the choice of gradient estimator rather than the optimization domain. By transplanting Price's gradient estimator (which uses second-order information of the target log-density) into the parameter-space setting of BBVI, the authors obtain identical state-of-the-art iteration complexity guarantees for both methods. They further show that WVI can be made more widely applicable by substituting the reparametrization gradient, and they provide empirical evidence that Price's estimator is the dominant source of improvement.

Significance. If the theoretical transfer of guarantees holds under the stated regularity conditions, the work unifies the analysis of two prominent stochastic-gradient VI frameworks and isolates the contribution of the gradient estimator. This clarifies why prior WVI results appeared superior and supplies a concrete recipe for importing advanced estimators into BBVI. The empirical section strengthens the claim by isolating the estimator's effect, and the dual-direction argument (BBVI with Price's estimator; WVI with reparametrization) increases the result's practical utility.

major comments (3)

§3.2, Assumption 2 (twice continuous differentiability and uniform Hessian bound): The iteration-complexity statements in Theorems 4.1 and 4.3 are derived under exact Hessian access. The manuscript does not quantify how finite-difference or Monte-Carlo approximations to the Hessian would inflate the variance term or invalidate the O(1/ε) rate; an explicit error-propagation lemma is needed to confirm the bounds remain unchanged.
§4.2, Eq. (18): The proof that the Price estimator can be substituted into the BBVI update without altering the contraction factor relies on an identification between the Bures-Wasserstein and Euclidean gradients. The step that equates the two inner products appears to omit the Jacobian of the parameterization map; a short calculation showing that this Jacobian cancels exactly would strengthen the claim.
Table 2, rows 3–5: The reported wall-clock times and negative-ELBO values for BBVI+Price versus WVI lack standard deviations across the 10 independent runs mentioned in the caption. Without these, it is impossible to judge whether the observed parity is statistically reliable or an artifact of a single favorable seed.
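The concern about inexact Hessians in the first comment can be illustrated with a small sketch of one possible approximation: a central finite difference of the gradient. The potential U(z) = z²/2 + z⁴/4, the step `h`, and the function names are assumptions for illustration only. Nothing here reflects how the paper or its experiments obtain Hessians.

```python
def dU(z):
    """Gradient of the hypothetical potential U(z) = z^2/2 + z^4/4."""
    return z + z**3


def d2U(z):
    """Exact second derivative, used only to measure the approximation error."""
    return 1.0 + 3.0 * z**2


def fd_second_derivative(grad, z, h=1e-4):
    """Central finite difference of the gradient; truncation error is O(h^2)."""
    return (grad(z + h) - grad(z - h)) / (2.0 * h)


def approx_price_scale_grad(grad, m, s, eps, h=1e-4):
    """Hessian-based scale gradient s * U''(z) with U'' replaced by the finite difference."""
    z = m + s * eps
    return s * fd_second_derivative(grad, z, h)


z = 1.5
error = abs(fd_second_derivative(dU, z) - d2U(z))
print(f"finite-difference error at z={z}: {error:.2e}")
```

In this one-dimensional case the approximation costs two gradient evaluations per sample and introduces a bias that depends on `h`. An error-propagation lemma of the kind the referee requests would have to account for exactly this sort of bias.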

minor comments (3)

[§2.1] The definition of the Bures-Wasserstein metric in §2.1 is referenced to prior work but never written explicitly; adding the formula would improve readability for readers unfamiliar with optimal transport.
[Introduction] Price's original theorem is cited only in passing; a one-sentence reminder of the statement (for X ∼ N(m, Σ), ∇_Σ E[f(X)] = ½ E[∇²f(X)]) would help readers connect the estimator to the classical result.
[Algorithm 1] In Algorithm 1 the line that computes the Hessian-vector product is not accompanied by a complexity note; stating that this step is O(d²) per sample would clarify the per-iteration cost relative to reparametrization gradients.
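For reference, the classical identities behind the estimator can be written out as follows. This is the standard textbook form for a Gaussian X ∼ N(m, Σ) and a sufficiently regular f, not a transcription of the paper's notation. The last line is a chain-rule consequence for an unconstrained scale matrix C with Σ = CCᵀ, included here as an illustrative assumption about how a parameter-space method might consume the Hessian.

```latex
% X ~ N(m, \Sigma); f twice differentiable with suitably bounded growth.
\begin{align}
  \nabla_{m}\, \mathbb{E}[f(X)]
    &= \mathbb{E}[\nabla f(X)]
    && \text{(Bonnet's theorem)} \\
  \nabla_{\Sigma}\, \mathbb{E}[f(X)]
    &= \tfrac{1}{2}\, \mathbb{E}[\nabla^{2} f(X)]
    && \text{(Price's theorem)} \\
  \nabla_{C}\, \mathbb{E}[f(m + C\varepsilon)]
    &= \mathbb{E}[\nabla^{2} f(m + C\varepsilon)]\, C
    && \text{(chain rule, } \Sigma = C C^{\top},\ \varepsilon \sim \mathcal{N}(0, I_d)\text{)}
\end{align}
```

The third identity follows from the second because a perturbation dC changes Σ by dC Cᵀ + C dCᵀ, and the Hessian is symmetric.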

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback on our manuscript. We address each major comment below and indicate the revisions we will make to strengthen the paper.

Referee: §3.2, Assumption 2 (twice continuous differentiability and uniform Hessian bound): The iteration-complexity statements in Theorems 4.1 and 4.3 are derived under exact Hessian access. The manuscript does not quantify how finite-difference or Monte-Carlo approximations to the Hessian would inflate the variance term or invalidate the O(1/ε) rate; an explicit error-propagation lemma is needed to confirm the bounds remain unchanged.

Authors: We agree that the theoretical guarantees in Theorems 4.1 and 4.3 are stated under exact Hessian access. This is to isolate the effect of the gradient estimator in the exact setting. For practical black-box implementations, Hessian approximations via finite differences or Monte Carlo sampling introduce additional variance. We will add a new error-propagation lemma in the appendix that bounds the perturbation to the variance term under standard Lipschitz and boundedness assumptions on the approximation error, showing that the O(1/ε) iteration complexity is retained up to factors depending on the approximation accuracy.

Revision: yes

Referee: §4.2, Eq. (18): The proof that the Price estimator can be substituted into the BBVI update without altering the contraction factor relies on an identification between the Bures-Wasserstein and Euclidean gradients. The step that equates the two inner products appears to omit the Jacobian of the parameterization map; a short calculation showing that this Jacobian cancels exactly would strengthen the claim.

Authors: We thank the referee for highlighting this detail. The identification between the Bures-Wasserstein gradient and the Euclidean gradient in the proof does incorporate the parameterization map from the variational parameters to the Gaussian family. The Jacobian terms cancel exactly when taking the inner product because the map is an isometry with respect to the chosen metric on the parameter space. We will insert a short explicit calculation immediately after Equation (18) in the revised proof to demonstrate this cancellation step by step.

Revision: yes

Referee: Table 2, rows 3–5: The reported wall-clock times and negative-ELBO values for BBVI+Price versus WVI lack standard deviations across the 10 independent runs mentioned in the caption. Without these, it is impossible to judge whether the observed parity is statistically reliable or an artifact of a single favorable seed.

Authors: This is a fair criticism. While the experimental caption states that results are averaged over 10 independent runs, we did not report the standard deviations in Table 2. We will revise the table to include both means and standard deviations for the wall-clock times and negative-ELBO values, allowing readers to assess the statistical reliability of the observed performance parity.

Revision: yes

Circularity Check

0 steps flagged

No circularity: derivation applies standard analysis to shared estimator

The paper derives matching iteration complexity bounds for WVI and BBVI by transplanting Price's gradient estimator (which uses target log-density Hessians) into the black-box parameter-space setting. This is a direct mathematical analysis under stated smoothness assumptions rather than a reduction to fitted parameters, self-citations, or ansatzes imported from prior author work. The central claim rests on the new bounds themselves, which are presented as obtained via the estimator modification; no load-bearing step collapses to a definition or prior result by construction. Minor self-citation risk is absent from the provided derivation outline.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard assumptions for convergence analysis in VI (smoothness of log-density, Gaussian family) plus the specific form of Price's estimator; no free parameters or invented entities are mentioned in the abstract.

axioms (2)

domain assumption Target log-density is sufficiently smooth to admit Hessian evaluations for the gradient estimator.
Invoked to support use of Price's theorem in both WVI and modified BBVI.
domain assumption Variational family is restricted to Gaussians.
Stated as the setting where prior WVI superiority was shown and new guarantees are derived.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "The estimator in question is usually associated with Price’s theorem and utilizes second-order information (Hessians) of the target log-density."

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "Assumption 3.1. The potential U is twice differentiable and μ I_d ≼ ∇²U(z) ≼ L I_d."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
