arxiv: 2004.06443 · v4 · submitted 2020-04-14 · 📊 stat.ML · cs.LG

Particle-based Energetic Variational Inference

Pith reviewed 2026-05-24 15:35 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords: variational inference · particle-based variational inference · energetic variational inference · SVGD · KL-divergence · approximation-then-variation

The pith

Energetic variational inference derives existing particle methods including SVGD and introduces an approximation-then-variation scheme that reduces KL-divergence each step.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces energetic variational inference as a framework that minimizes the variational inference objective using a prescribed energy-dissipation law. This framework derives many existing particle-based variational inference methods, including SVGD, and supports creation of new schemes. The highlighted new method approximates the density with particles first and then performs the variational update, preserving variational structure at the particle level. Experiments indicate this ordering produces larger KL-divergence reductions per iteration and better fidelity to the target distribution than some prior particle methods.

Core claim

The energetic variational inference framework, based on a prescribed energy-dissipation law, derives many particle-based variational inference methods including SVGD; a new approximation-then-variation scheme performs particle-based density approximation first then the variational procedure, maintains the variational structure at the particle level, and significantly decreases the KL-divergence in each iteration.

What carries the argument

Energetic variational inference (EVI) framework that minimizes the VI objective based on an energy-dissipation law, together with the approximation-then-variation ordering for particle schemes.
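
For orientation, the law can be written in the standard Wasserstein gradient-flow form. This is a sketch in assumed notation (target density $\rho^* \propto e^{-V}$, particle velocity field $u$), not a transcription of the paper's equations:

```latex
\begin{aligned}
\mathcal{F}[\rho] &= \mathrm{KL}(\rho \,\|\, \rho^*)
  = \int \rho \ln\frac{\rho}{\rho^*}\,dx,
  \qquad \rho^* \propto e^{-V}, \\
\partial_t \rho + \nabla\cdot(\rho u) &= 0, \\
\frac{d}{dt}\mathcal{F}[\rho] &= \int \rho\, u \cdot \nabla(\ln\rho + V)\,dx .
\end{aligned}
```

Choosing $u = -\nabla(\ln\rho + V)$ gives $\frac{d}{dt}\mathcal{F}[\rho] = -\int \rho\,|u|^2\,dx \le 0$, which is the sense in which a dissipation law forces the KL-divergence to decrease along the flow.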

Load-bearing premise

Performing the particle-based density approximation first and the variational update second preserves the variational structure at the particle level.
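
To make the premise concrete, here is a minimal sketch of the ordering. It assumes a Gaussian kernel density estimate with bandwidth `h` as the particle approximation, and it uses a plain backtracking gradient step rather than whatever time integrator the paper adopts; the names `discrete_energy`, `discrete_energy_grad`, and `atv_step` are hypothetical. Substituting the estimate into the KL objective first yields an ordinary function of the particle positions, and the update is then a gradient step on that function, so every accepted step lowers the discrete energy by construction.

```python
import numpy as np

def _kernel(x, h):
    """Gaussian kernel matrix K[i, j] and pairwise differences diff[i, j] = x_i - x_j."""
    d = x.shape[1]
    diff = x[:, None, :] - x[None, :, :]
    K = np.exp(-np.sum(diff ** 2, axis=-1) / (2.0 * h ** 2))
    K /= (2.0 * np.pi * h ** 2) ** (d / 2.0)
    return K, diff

def discrete_energy(x, V, h):
    """F_h(x) = (1/N) sum_i [ log rho_h(x_i) + V(x_i) ], rho_h a kernel density estimate."""
    K, _ = _kernel(x, h)
    rho = K.mean(axis=1)
    return np.mean(np.log(rho) + V(x))

def discrete_energy_grad(x, grad_V, h):
    """Exact gradient of discrete_energy with respect to every particle, shape (N, d)."""
    n = x.shape[0]
    K, diff = _kernel(x, h)
    rho = K.mean(axis=1)
    # Particle k appears in its own density estimate and in every other particle's.
    w = K * (1.0 / rho[:, None] + 1.0 / rho[None, :])
    interaction = -(w[:, :, None] * diff).sum(axis=1) / (n * h ** 2)
    return (interaction + grad_V(x)) / n

def atv_step(x, V, grad_V, h, tau=0.5):
    """One gradient step on the discrete energy, halving the step until it decreases."""
    n = x.shape[0]
    e0 = discrete_energy(x, V, h)
    g = n * discrete_energy_grad(x, grad_V, h)  # per-particle velocity is -g
    step = tau
    for _ in range(30):
        x_new = x - step * g
        # Armijo sufficient-decrease test on the discrete energy.
        if discrete_energy(x_new, V, h) <= e0 - 1e-4 * step * np.sum(g * g) / n:
            return x_new
        step *= 0.5
    return x  # no acceptable step found; leave the particles unchanged
```

Here `V` is the negative log of the unnormalized target evaluated per particle (returning an array of length N) and `grad_V` is its gradient. The monotone decrease applies to the discrete energy; how closely that tracks the true KL-divergence depends on the particle count and bandwidth.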

What would settle it

An experiment or calculation showing that the new approximation-then-variation scheme fails to decrease KL-divergence more than existing particle methods or fails to improve fidelity to the target distribution.

Original abstract

We introduce a new variational inference (VI) framework, called energetic variational inference (EVI). It minimizes the VI objective function based on a prescribed energy-dissipation law. Using the EVI framework, we can derive many existing Particle-based Variational Inference (ParVI) methods, including the popular Stein Variational Gradient Descent (SVGD) approach. More importantly, many new ParVI schemes can be created under this framework. For illustration, we propose a new particle-based EVI scheme, which performs the particle-based approximation of the density first and then uses the approximated density in the variational procedure, or "Approximation-then-Variation" for short. Thanks to this order of approximation and variation, the new scheme can maintain the variational structure at the particle level, and can significantly decrease the KL-divergence in each iteration. Numerical experiments show the proposed method outperforms some existing ParVI methods in terms of fidelity to the target distribution.
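
For reference, the SVGD update named in the abstract can be sketched as follows. This is the standard form from the SVGD literature with a fixed-bandwidth RBF kernel, not the EVI derivation of it; the function names, the bandwidth `h`, and the step size are illustrative choices.

```python
import numpy as np

def rbf_kernel(x, h):
    """Return K[j, i] = k(x_j, x_i) and grad_K[j, i] = gradient of k(x_j, x_i) in x_j."""
    diff = x[:, None, :] - x[None, :, :]  # diff[j, i] = x_j - x_i
    K = np.exp(-np.sum(diff ** 2, axis=-1) / (2.0 * h ** 2))
    grad_K = -diff / h ** 2 * K[:, :, None]
    return K, grad_K

def svgd_step(x, grad_log_p, h=1.0, step=0.1):
    """One SVGD update for particles x of shape (N, d)."""
    n = x.shape[0]
    K, grad_K = rbf_kernel(x, h)
    score = grad_log_p(x)  # gradient of the log target at each particle, shape (N, d)
    # phi(x_i) = (1/N) sum_j [ k(x_j, x_i) * score_j + grad_{x_j} k(x_j, x_i) ]
    phi = (K.T @ score + grad_K.sum(axis=0)) / n
    return x + step * phi
```

With `grad_log_p = lambda z: -z` (a standard normal target), repeated calls to `svgd_step` move the particles toward the target. The first term in `phi` pulls particles toward high-density regions and the second keeps them apart.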

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Energetic Variational Inference (EVI), a framework that derives particle-based variational inference (ParVI) methods, including SVGD, from a prescribed energy-dissipation law. It proposes a new 'Approximation-then-Variation' scheme that first approximates the density via particles and then applies the variational update, claiming that this order preserves the variational structure at the particle level, can significantly decrease the KL-divergence in each iteration, and outperforms some existing ParVI methods in numerical experiments.

Significance. If the consistency between the discrete particle scheme and the continuous energy-dissipation law holds, the work supplies a unifying derivation for existing ParVI algorithms and a new scheme whose per-iteration KL decrease is inherited from the underlying variational structure rather than imposed ad hoc. This would strengthen the theoretical grounding of particle-based inference and enable systematic construction of new methods with controllable dissipation properties.

major comments (3)
  1. [Abstract / derivation of new scheme] The central claim that the Approximation-then-Variation scheme 'maintains the variational structure at the particle level' and 'can significantly decrease the KL-divergence in each iteration' (abstract) requires an explicit verification that the particle approximation of the density, when inserted before the variation step, produces a velocity field that remains the Wasserstein gradient of the same energy functional. The manuscript must supply the Euler-Lagrange equation or weak-form derivation showing that the approximated dissipation functional is consistent with the continuous EVI law up to controllable error; without this, the asserted KL decrease does not follow from the framework.
  2. [Section deriving SVGD and other ParVI methods] The derivation that existing ParVI methods (including SVGD) arise from the EVI energy-dissipation law must be checked for parameter-free status. If the particle approximation step introduces kernel bandwidths or other tuning parameters that are fitted rather than prescribed by the dissipation law, the claim that EVI supplies a parameter-free unification is undermined.
  3. [Numerical experiments section] Numerical experiments are cited as showing outperformance, yet the abstract supplies no error bars, convergence plots of KL divergence, or comparison against the continuous-time limit of the scheme. The manuscript should report the measured per-iteration KL decrease and confirm it is not an artifact of the chosen particle count or kernel.
minor comments (2)
  1. Notation for the energy functional and dissipation potential should be introduced once and used consistently; the transition from continuous density to empirical measure needs an explicit symbol.
  2. [Abstract] The abstract states the new scheme 'outperforms some existing ParVI methods' without naming the baselines or reporting quantitative metrics; this should be clarified in the abstract or moved to the results section.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments highlight important points for strengthening the theoretical justification and experimental presentation. We address each major comment below and indicate planned revisions.

Point-by-point responses
  1. Referee: [Abstract / derivation of new scheme] The central claim that the Approximation-then-Variation scheme 'maintains the variational structure at the particle level' and 'can significantly decrease the KL-divergence in each iteration' (abstract) requires an explicit verification that the particle approximation of the density, when inserted before the variation step, produces a velocity field that remains the Wasserstein gradient of the same energy functional. The manuscript must supply the Euler-Lagrange equation or weak-form derivation showing that the approximated dissipation functional is consistent with the continuous EVI law up to controllable error; without this, the asserted KL decrease does not follow from the framework.

    Authors: We agree that an explicit weak-form derivation is required to rigorously connect the particle scheme to the continuous energy-dissipation law. In the revised manuscript we will add the Euler-Lagrange derivation for the approximated dissipation functional, showing that the resulting velocity field is the Wasserstein gradient of the energy (up to a discretization error controlled by particle number and kernel width). This will directly establish the per-iteration KL decrease from the variational structure. revision: yes

  2. Referee: [Section deriving SVGD and other ParVI methods] The derivation that existing ParVI methods (including SVGD) arise from the EVI energy-dissipation law must be checked for parameter-free status. If the particle approximation step introduces kernel bandwidths or other tuning parameters that are fitted rather than prescribed by the dissipation law, the claim that EVI supplies a parameter-free unification is undermined.

    Authors: The EVI framework itself prescribes the form of the update directly from the energy-dissipation law without introducing extra parameters. Kernel bandwidths and similar quantities belong to the choice of particle approximation (as they do in the original SVGD derivation) and are not fitted by the EVI procedure. The unification claim concerns the variational origin of the dynamics, which remains parameter-free at the continuous level. We will insert a clarifying paragraph distinguishing the law from the approximation choices. revision: partial

  3. Referee: [Numerical experiments section] Numerical experiments are cited as showing outperformance, yet the abstract supplies no error bars, convergence plots of KL divergence, or comparison against the continuous-time limit of the scheme. The manuscript should report the measured per-iteration KL decrease and confirm it is not an artifact of the chosen particle count or kernel.

    Authors: We will expand the numerical section to include error bars from multiple independent runs, per-iteration KL-divergence trajectories, and a brief comparison with the continuous-time limit obtained by increasing particle count. These additions will demonstrate that the observed KL decrease is consistent with the theory and not an artifact of specific discretization parameters. revision: yes
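
The reporting promised in the third response could be assembled along the following lines. This is a hypothetical helper, not something from the paper: `summarize_runs` assumes each independent run has already produced a per-iteration KL estimate, and how that estimate is computed is left open.

```python
import numpy as np

def summarize_runs(trajectories):
    """Aggregate a per-iteration metric over independent runs.

    trajectories: array-like of shape (n_runs, n_iters), e.g. KL estimates.
    Returns the mean curve, its standard error, and the mean per-iteration decrease.
    """
    t = np.asarray(trajectories, dtype=float)
    mean = t.mean(axis=0)
    sem = t.std(axis=0, ddof=1) / np.sqrt(t.shape[0])  # error bars across runs
    mean_decrease = (-np.diff(t, axis=1)).mean(axis=0)  # positive when the metric falls
    return mean, sem, mean_decrease
```

Repeating the summary for several particle counts and kernel bandwidths is one way to check that an observed decrease is not tied to a single discretization choice.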

Circularity Check

0 steps flagged

No significant circularity; derivation grounded in external energy-dissipation law

Full rationale

The paper introduces the EVI framework from a prescribed energy-dissipation law (external to any fitted quantities inside the manuscript) and shows that existing ParVI methods including SVGD can be recovered as special cases. The new approximation-then-variation scheme is explicitly defined by the ordering of operations; its claimed preservation of variational structure and KL decrease are asserted as consequences of that ordering and are supported by numerical experiments rather than by re-labeling a fit as a prediction. No load-bearing self-citation chain, uniqueness theorem imported from the same authors, or ansatz smuggled via prior work appears in the abstract or described derivation. The central claims therefore remain independent of the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that variational objectives can be minimized via a prescribed energy-dissipation law; no free parameters or invented entities are mentioned.

axioms (1)
  • domain assumption: The variational inference objective can be minimized based on a prescribed energy-dissipation law.
    This is the core premise of the EVI framework stated in the abstract.

pith-pipeline@v0.9.0 · 5688 in / 1195 out tokens · 29635 ms · 2026-05-24T15:35:40.864830+00:00 · methodology
