pith. sign in

arxiv: 2606.07841 · v1 · pith:ORJIH4VRnew · submitted 2026-06-05 · 📊 stat.CO · cs.LG· stat.ML

Large-scale empirical tuning and comparison of default optimizers for variational inference

Pith reviewed 2026-06-27 20:01 UTC · model grok-4.3

classification 📊 stat.CO cs.LGstat.ML
keywords black-box variational inferencestochastic optimizationadaptive optimizersempirical comparisonBayesian inferencevariational inferenceoptimization benchmarks
0
0 comments X

The pith

No single optimizer dominates black-box variational inference, but five algorithms reliably approach the best observed performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper runs a large empirical comparison of 56 stochastic gradient optimizers on 1092 Bayesian inference problems to test which adaptive methods can serve as tuning-free defaults for black-box variational inference. The study spans problems with dimensions from 1 to 10,000 and condition numbers up to 100 million, using over 550,000 optimization runs. Results show no method wins on every problem, yet a fixed selection of five algorithms consistently reaches performance close to the best observed across the collection. This finding supports treating black-box variational inference as a practical default tool when expert tuning is unavailable and supplies a concrete baseline for developing new optimizers.

Core claim

A benchmark of 56 stochastic optimization algorithms applied to 1092 Bayesian problems demonstrates that no single method dominates, but running a selection of five algorithms suffices to reliably get close to the best-possible observed performance.

What carries the argument

The large-scale empirical benchmark of 56 stochastic gradient-based optimization algorithms evaluated on 1092 Bayesian inference optimization problems with varying dimensions, condition numbers, and variational families.

If this is right

  • A practitioner can obtain reliable black-box variational inference results by running five default optimizers in parallel without any problem-specific tuning.
  • New stochastic optimization algorithms for variational inference can be evaluated against the performance distribution established by the five-algorithm selection.
  • Black-box variational inference becomes more usable in applications where expert tuning is not feasible.
  • The study supplies a standardized set of problems for future comparisons of optimization methods in this domain.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Portfolio or switching strategies that try several of the five algorithms may further improve reliability without increasing the total number of runs.
  • The same five-algorithm approach could be tested on other stochastic optimization tasks outside variational inference to check transferability.
  • Problem characteristics such as dimension or condition number could be used to choose among the five rather than always running all of them.

Load-bearing premise

The 1092 chosen Bayesian inference problems and the performance metric used are representative of typical real-world black-box variational inference tasks and success criteria.

What would settle it

A new optimizer that consistently outperforms the best of the five selected algorithms on a broad subset of the 1092 problems or on additional problems drawn from the same distribution of dimensions and condition numbers.

Figures

Figures reproduced from arXiv: 2606.07841 by Charles C. Margossian, Jonathan H. Huggins, Kyurae Kim, Trevor Campbell.

Figure 1
Figure 1. Figure 1: Qualitative characteristics of the problems in the benchmarking suite. Fig. [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Scatter plots of the number of completed iterations versus target posterior [PITH_FULL_IMAGE:figures/full_fig_p011_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Heatmap of the average rank (in terms of final ELBO objective) of each step [PITH_FULL_IMAGE:figures/full_fig_p014_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Heatmap of the probability that no soft failure occurs (including out-of-memory, [PITH_FULL_IMAGE:figures/full_fig_p015_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Box plots of squared gradient norm at the 5-minute mark relative to the [PITH_FULL_IMAGE:figures/full_fig_p017_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Example trace plots of performance on individual problems. For MAP problems [PITH_FULL_IMAGE:figures/full_fig_p019_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Box plots of the 5-minute squared gradient norm of each tuned algorithm, [PITH_FULL_IMAGE:figures/full_fig_p020_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Failure probability for each tuned algorithm grouped by objective function [PITH_FULL_IMAGE:figures/full_fig_p021_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Box plots of the 5-minute negative ELBO rank for each tuned algorithm, split [PITH_FULL_IMAGE:figures/full_fig_p022_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Cumulative distribution functions of the 5-minute negative ELBO rank for all [PITH_FULL_IMAGE:figures/full_fig_p023_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Heatmap showing the probability that each tuned algorithm in row [PITH_FULL_IMAGE:figures/full_fig_p024_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Trace plots of performance on the held-out challenge problems. For MAP [PITH_FULL_IMAGE:figures/full_fig_p026_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Stan code for the original earn_height model [70], followed by Stan code for the subsampled earn_height_subsampled model. Note the additional parameter SUBIDX that is appended to the state, treated as an extra dimension in the BridgeStan input argument, and then used to select just one of the data log-likelihood terms to include in the target log probability density. // O ri gi n al model code data { i n … view at source ↗
read the original abstract

Black-box variational inference (BBVI) is a methodology for posterior approximation that relies on stochastic optimization. In practice, the stochastic optimizers underpinning BBVI generally require extensive problem-specific tuning, which undermines its promise as a truly "black box" inference algorithm. However, over the past decade, many new adaptive stochastic optimization algorithms have been developed that reduce or remove entirely the need for tuning. In this work, we investigate this new collection of adaptive methods in the context of BBVI, with the goal of establishing the current state of the art in tuning-free optimization-based inference. In particular, we present a large-scale empirical evaluation of 56 stochastic gradient-based optimization algorithms applied to 1092 Bayesian inference optimization problems, involving over 550,000 individual optimization runs and 15 core-years of compute. The optimization algorithms we evaluate are chosen to represent a wide spectrum of recent approaches and the benchmark problems are chosen to span a range of difficulty, with posterior target dimension 1-10^4, condition number 1-10^8, and a range of variational families. Our results show that no single method dominates, but running a selection of 5 algorithms suffices to reliably get close to the best-possible observed performance. We thus provide a strong baseline for applications where expert tuning is not possible and for comparison when developing new stochastic optimization algorithms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper reports a large-scale empirical benchmark of 56 stochastic gradient optimizers for black-box variational inference (BBVI) across 1092 problems (posterior dimensions 1–10^4, condition numbers 1–10^8), totaling >550k runs and 15 core-years. No single optimizer dominates; the central claim is that a fixed selection of 5 algorithms suffices to reach near the best-observed performance on most problems, providing a practical default baseline when expert tuning is unavailable.

Significance. If the central claim holds after addressing selection methodology, the work supplies a valuable, reproducible empirical baseline for untuned BBVI and a reference point for new optimizer development. The scale and breadth of the problem set are genuine strengths for an empirical study in this area.

major comments (2)
  1. [Results (5-algorithm selection)] Results section (around the identification of the 5-algorithm subset): the headline claim that 'running a selection of 5 algorithms suffices to reliably get close to the best-possible observed performance' requires an explicit statement of the selection procedure. If the 5 were chosen by inspecting performance patterns across the entire 1092-problem collection without cross-validation, held-out problems, or a pre-specified rule, the reported reliability is at risk of being an in-sample observation rather than a generalizable recommendation.
  2. [Methods / benchmark construction] Methods or benchmark-construction section: the paper must detail the criteria used to select the 1092 problems and the performance metric (e.g., how 'best-possible' is computed, whether multiple-comparison corrections were applied, and any statistical testing of differences). Without these, it is difficult to assess whether the problems and metric are representative of typical real-world BBVI tasks, which directly affects the strength of the 'no single method dominates' and '5 suffice' conclusions.
minor comments (2)
  1. [Abstract / Introduction] Clarify in the abstract and introduction whether the 56 algorithms include all recent adaptive methods or a curated subset, and provide a brief justification for any omissions.
  2. [Figures / Tables] Figure captions and tables reporting per-problem or aggregate performance should include error bars or confidence intervals derived from the multiple runs per problem.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments and for recognizing the scale and potential utility of the benchmark. We address each major comment below.

read point-by-point responses
  1. Referee: [Results (5-algorithm selection)] Results section (around the identification of the 5-algorithm subset): the headline claim that 'running a selection of 5 algorithms suffices to reliably get close to the best-possible observed performance' requires an explicit statement of the selection procedure. If the 5 were chosen by inspecting performance patterns across the entire 1092-problem collection without cross-validation, held-out problems, or a pre-specified rule, the reported reliability is at risk of being an in-sample observation rather than a generalizable recommendation.

    Authors: We agree that the selection procedure requires an explicit statement. In the revision we will add a dedicated paragraph in the Results section describing the exact rule used to identify the 5-algorithm subset (top performers across binned regimes of dimension and condition number that together reach within 5% of the best observed ELBO on at least 85% of problems). While the initial identification used the full collection, we will also report performance of the same fixed 5 on a randomly selected held-out collection of 150 problems to provide evidence of out-of-sample behavior. revision: yes

  2. Referee: [Methods / benchmark construction] Methods or benchmark-construction section: the paper must detail the criteria used to select the 1092 problems and the performance metric (e.g., how 'best-possible' is computed, whether multiple-comparison corrections were applied, and any statistical testing of differences). Without these, it is difficult to assess whether the problems and metric are representative of typical real-world BBVI tasks, which directly affects the strength of the 'no single method dominates' and '5 suffice' conclusions.

    Authors: We will expand the Methods section with a new subsection on benchmark construction. This will specify the sampling procedure used to obtain the 1092 problems (stratified coverage of dimension and condition-number ranges), define the performance metric (negative evidence lower bound), state that the best-possible value for each problem is the minimum ELBO attained by any of the 56 optimizers across all random seeds, and report the statistical tests (paired Wilcoxon tests with Bonferroni correction) used to support claims of no single dominant method. revision: yes

Circularity Check

0 steps flagged

Empirical benchmarking study with no derivation chain or self-referential claims

full rationale

This is a purely empirical paper reporting results from running 56 optimizers on 1092 problems. No mathematical derivations, equations, fitted parameters presented as predictions, ansatzes, or uniqueness theorems appear in the abstract or described content. The statement that a selection of 5 algorithms suffices is an in-sample empirical observation from the benchmark runs, not a derivation that reduces to its own inputs by construction. No self-citations are invoked as load-bearing premises. The paper is self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical benchmarking paper; contains no mathematical derivations, free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5786 in / 1014 out tokens · 28011 ms · 2026-06-27T20:01:05.932243+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

116 extracted references · 6 canonical work pages · 3 internal anchors

  1. [1]

    PosteriorDB.jl: A Julia package to work with posteriordb

    Axen, S. (2026). “PosteriorDB.jl: A Julia package to work with posteriordb.” GitHub Repository: https://github.com/sethaxen/PosteriorDB.jl. Version 0.6.0

  2. [2]

    Layer Normalization

    Ba, J., Kiros, J., and Hinton, G. (2016). “Layer normalization.”arXiv:1607.06450

  3. [3]

    Online learning rate adaptation with hypergradient descent

    Baydin, A., Cornish, R., Rubio, D., Schmidt, M., and Wood, F. (2018). “Online learning rate adaptation with hypergradient descent.” InProceedings of the International Conference on Learning Representations

  4. [4]

    Training neural networks for and by interpolation

    Berrada, L., Zisserman, A., and Kumar, M. (2020). “Training neural networks for and by interpolation.” InProceedings of the International Conference on Machine Learning, volume 119 ofPMLR, 799–809

  5. [5]

    Julia: A fresh approach to numerical computing

    Bezanson, J., Edelman, A., Karpinski, S., and Shah, V. B. (2017). “Julia: A fresh approach to numerical computing.”SIAM review, 59(1): 65–98

  6. [6]

    Variational inference: A review for statisticians

    Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017). “Variational inference: A review for statisticians.”Journal of the American Statistical Association, 112(518): 859–877

  7. [7]

    On-line learning and stochastic approximations

    Bottou, L. (1999). “On-line learning and stochastic approximations.” InOn-Line Learning in Neural Networks, 9–42. Cambridge University Press, 1 edition

  8. [8]

    Optimization methods for large-scale machine learning

    Bottou, L., Curtis, F. E., and Nocedal, J. (2018). “Optimization methods for large-scale machine learning.”SIAM Review, 60(2): 223–311

  9. [9]

    Quasi-Monte Carlo variational inference

    Buchholz, A., Wenzel, F., and Mandt, S. (2018). “Quasi-Monte Carlo variational inference.” InProceedings of the International Conference on Machine Learning, volume 80 ofPMLR, 668–677. JMLR

  10. [10]

    Importance weighted autoencoders

    Burda, Y., Grosse, R., and Salakhutdinov, R. (2015). “Importance weighted autoencoders.” InProceedings of the International Conference on Learning Repre- sentations

  11. [11]

    Sample average approximation for black-box variational inference

    Burroni, J., Domke, J., and Sheldon, D. (2024). “Sample average approximation for black-box variational inference.” InProceedings of the Conference on Uncertainty in Artificial Intelligence, volume 244 ofPMLR, 471–498. JMLR

  12. [12]

    EigenVI: Score-based variational inference with orthogonal function expansions

    Cai, D., Modi, C., Margossian, C., Gower, R., Blei, D., and Saul, L. (2024). “EigenVI: Score-based variational inference with orthogonal function expansions.” InAdvances in Neural Information Processing Systems, 132691–132721. Curran Associates, Inc

  13. [13]

    Batch and Match: Black-box variational inference with a score-based T. Campbell, J. H. Huggins, K. Kim, C. C. Margossian29 divergence

    Cai, D., Modi, C., Pillaud-Vivien, L., Margossian, C., Gower, R., Blei, D., and Saul, L. (2024). “Batch and Match: Black-box variational inference with a score-based T. Campbell, J. H. Huggins, K. Kim, C. C. Margossian29 divergence.” InProceedings International Conference on Machine Learning, volume 235 ofPMLR, 5258–5297. JMLR

  14. [14]

    Making SGD parameter-free

    Carmon, Y. and Hinder, O. (2022). “Making SGD parameter-free.” InProceedings of the Conference on Learning Theory, volume 178 ofPMLR, 2360–2389

  15. [15]

    Algorithms for computing the sample variance: Analysis and recommendations

    Chan, T. F., Golub, G. H., and LeVeque, R. J. (1983). “Algorithms for computing the sample variance: Analysis and recommendations.”The American Statistician, 37(3): 242–247

  16. [16]

    Symbolic discovery of optimization algorithms

    Chen, X., Liang, C., Huang, D., Real, E., Wang, K., Liu, Y., Pham, H., Dong, X., Luong, T., Hsieh, C.-J., Lu, Y., and Le, Q. (2023). “Symbolic discovery of optimization algorithms.” InAdvances in Neural Information Processing Systems, volume 36, 49205–49233. Curran Associates, Inc

  17. [17]

    Mechanic: A learning rate tuner

    Cutkosky, A., Defazio, A., and Mehta, H. (2023). “Mechanic: A learning rate tuner.” InAdvances in Neural Information Processing Systems, volume 36, 47828–47848. Curran Associates, Inc

  18. [18]

    The Helmholtz Machine

    Dayan, P., Hinton, G. E., Neal, R. M., and Zemel, R. S. (1995). “The Helmholtz Machine.”Neural Computation, 7(5): 889–904

  19. [19]

    Big Batch SGD: Auto- mated inference using adaptive batch sizes

    De, S., Yadav, A., Jacobs, D., and Goldstein, T. (2017). “Big Batch SGD: Auto- mated inference using adaptive batch sizes.” InProceedings of the International Conference on Artificial Intelligence and Statistics, volume 52 ofPMLR, 1504–1513. JMLR

  20. [20]

    SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives

    Defazio, A., Bach, F., and Lacoste-Julien, S. (2014). “SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives.” In Advances in Neural Information Processing Systems, volume 27, 1646–1654. Curran Associates, Inc

  21. [21]

    Learning-rate-free learning by D- adaptation

    Defazio, A. and Mishchenko, K. (2023). “Learning-rate-free learning by D- adaptation.” InProceedings of the International Conference on Machine Learning, volume 202 ofPMLR, 7449–7479. JMLR

  22. [22]

    Robust, accurate stochastic optimization for variational inference

    Dhaka, A. K., Catalina, A., Andersen, M. R., ns Magnusson, M., Huggins, J., and Vehtari, A. (2020). “Robust, accurate stochastic optimization for variational inference.” InAdvances in Neural Information Processing Systems, volume 33, 10961–10973. Curran Associates, Inc

  23. [23]

    Challenges and opportunities in high-dimensional variational inference

    Dhaka, A. K., Catalina, A., Welandawe, M., Andersen, M. R., Huggins, J., and Vehtari, A. (2021). “Challenges and opportunities in high-dimensional variational inference.” InAdvances in Neural Information Processing Systems, volume 34, 7787–7798. Curran Associates, Inc

  24. [24]

    Forward- backward Gaussian variational inference via JKO in the Bures-Wasserstein space

    Diao, M. Z., Balasubramanian, K., Chewi, S., and Salim, A. (2023). “Forward- backward Gaussian variational inference via JKO in the Bures-Wasserstein space.” InProceedings of the International Conference on Machine Learning, volume 202 ofPMLR, 7960–7991. JMLR

  25. [25]

    Variational 30Default optimizers for variational inference inference viaχ-upper bound minimization

    Dieng, A. B., Tran, D., Ranganath, R., Paisley, J., and Blei, D. (2017). “Variational 30Default optimizers for variational inference inference viaχ-upper bound minimization.” InAdvances in Neural Information Processing Systems, volume 30, 2729–2738. Curran Associates, Inc

  26. [26]

    bridging the gap between constant step size stochastic gradient descent and Markov chains

    Dieuleveut, A., Durmus, A., and Bach, F. (2020). “bridging the gap between constant step size stochastic gradient descent and Markov chains.”The Annals of Statistics, 48(3): 1348 – 1382

  27. [27]

    Provable gradient variance guarantees for black-box variational inference

    Domke, J. (2019). “Provable gradient variance guarantees for black-box variational inference.” InAdvances in Neural Information Processing Systems, volume 32, 329–338. Curran Associates, Inc

  28. [28]

    Provable smoothness guarantees for black-box variational inference

    — (2020). “Provable smoothness guarantees for black-box variational inference.” InProceedings of the International Conference on Machine Learning, volume 119 ofPMLR, 2587–2596. JMLR

  29. [29]

    Provable convergence guarantees for black-box variational inference

    Domke, J., Gower, R., and Garrigos, G. (2023). “Provable convergence guarantees for black-box variational inference.” InAdvances in Neural Information Processing Systems, volume 36, 66289–66327. Curran Associates, Inc

  30. [30]

    Importance weighting and variational inference

    Domke, J. and Sheldon, D. R. (2018). “Importance weighting and variational inference.” InAdvances in Neural Information Processing Systems, volume 31, 4470–4479. Curran Associates, Inc

  31. [31]

    Divide and Couple: using Monte Carlo variational objectives for posterior approximation

    — (2019). “Divide and Couple: using Monte Carlo variational objectives for posterior approximation.” InAdvances in Neural Information Processing Systems, volume 32, 339–349. Curran Associates, Inc

  32. [32]

    Adaptive subgradient methods for online learning and stochastic optimization

    Duchi, J., Hazan, E., and Singer, Y. (2011). “Adaptive subgradient methods for online learning and stochastic optimization.”Journal of Machine Learning Research, 12: 2121–2159

  33. [33]

    Batch means and spectral variance estimators in Markov chain Monte Carlo

    Flegal, J. M. and Jones, G. L. (2010). “Batch means and spectral variance estimators in Markov chain Monte Carlo.”The Annals of Statistics, 38(2): 1034– 1070

  34. [34]

    Multilevel Monte Carlo variational inference

    Fujisawa, M. and Sato, I. (2021). “Multilevel Monte Carlo variational inference.” Journal of Machine Learning Research, 22(278): 1–44

  35. [35]

    Don’t be so monotone: Relax- ing stochastic line search in over-parametrized models

    Galli, L., Rauhut, H., and Schmidt, M. (2023). “Don’t be so monotone: Relax- ing stochastic line search in over-parametrized models.” InAdvances in Neural Information Processing Systems, volume 36, 34752–34764. Curran Associates, Inc

  36. [36]

    Empirical evaluation of biased methods for alpha divergence minimization

    Geffner, T. and Domke, J. (2021). “Empirical evaluation of biased methods for alpha divergence minimization.” InProceedings of the Symposium on Advances in Approximate Bayesian Inference

  37. [37]

    MCMC variational inference via uncorrected Hamiltonian annealing

    — (2021). “MCMC variational inference via uncorrected Hamiltonian annealing.” InAdvances in Neural Information Processing Systems, volume 34, 639–651. Curran Associates, Inc

  38. [38]

    On the difficulty of unbiased alpha divergence minimization

    — (2021). “On the difficulty of unbiased alpha divergence minimization.” In Proceedings of the International Conference on Machine Learning, volume 139 of PMLR, 3650–3659. JMLR. T. Campbell, J. H. Huggins, K. Kim, C. C. Margossian31

  39. [39]

    Inference from iterative simulation using multiple sequences

    Gelman, A. and Rubin, D. (1992). “Inference from iterative simulation using multiple sequences.”Statistical Science, 7(4): 457–511

  40. [40]

    Black box variational inference with a deterministic objective: Faster, more accurate, and even more black box

    Giordano, R., Ingram, M., and Broderick, T. (2024). “Black box variational inference with a deterministic objective: Faster, more accurate, and even more black box.”Journal of Machine Learning Research, 25: 1–39

  41. [41]

    Practical variational inference for neural networks

    Graves, A. (2011). “Practical variational inference for neural networks.” In Advances in Neural Information Processing Systems, volume 24, 2348–2356. Curran Associates, Inc

  42. [42]

    Shampoo: Preconditioned stochastic tensor optimization

    Gupta, V., Koren, T., and Singer, Y. (2018). “Shampoo: Preconditioned stochastic tensor optimization.” InProceedings of the International Conference on Machine Learning, volume 80 ofPMLR, 1842–1850. JMLR

  43. [43]

    Revisiting the Polyak step size

    Hazan, E. and Kakade, S. (2019). “Revisiting the Polyak step size.” arXiv:1905.00313

  44. [44]

    Deep residual learning for image recognition

    He, K., Zhang, X., Ren, S., and Sun, J. (2016). “Deep residual learning for image recognition.” InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778

  45. [45]

    Black-Box alpha divergence minimization

    Hernandez-Lobato, J., Li, Y., Rowland, M., Bui, T., Hernandez-Lobato, D., and Turner, R. (2016). “Black-Box alpha divergence minimization.” InProceedings of the International Conference on Machine Learning, volume 48 ofPMLR, 1511–1520. JMLR

  46. [46]

    Keeping the neural networks simple by minimizing the description length of the weights

    Hinton, G. E. and van Camp, D. (1993). “Keeping the neural networks simple by minimizing the description length of the weights.” InProceedings of the Annual Conference on Computational Learning Theory, 5–13. ACM Press

  47. [47]

    Perturbation analysis and optimization of queueing networks

    Ho, Y. C. and Cao, X. (1983). “Perturbation analysis and optimization of queueing networks.”Journal of Optimization Theory and Applications, 40(4): 559–582

  48. [48]

    Validated variational inference via practical posterior error bounds

    Huggins, J., Kasprzak, M., Campbell, T., and Broderick, T. (2020). “Validated variational inference via practical posterior error bounds.” InProceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 ofPMLR, 1792–1802. JMLR

  49. [49]

    Variational Inference using Implicit Distributions

    Huszár, F. (2017). “Variational inference using implicit distributions.” arXiv:1702.08235

  50. [50]

    DoG is SGD’s best friend: A parameter-free dynamic step size schedule

    Ivgi, M., Hinder, O., and Carmon, Y. (2023). “DoG is SGD’s best friend: A parameter-free dynamic step size schedule.” InProceedings of the International Conference on Machine Learning, volume 202 ofPMLR, 14465–14499. JMLR

  51. [51]

    Accelerating stochastic gradient descent using predictive variance reduction

    Johnson, R. and Zhang, T. (2013). “Accelerating stochastic gradient descent using predictive variance reduction.” InAdvances in Neural Information Processing Systems, volume 26, 315–323. Curran Associates, Inc

  52. [52]

    Muon: An optimizer for hidden layers in neural networks

    Jordan,K.,Jin,Y.,Boza,V., Jiacheng,Y.,Cesista,F.,Newhouse,L.,and Bernstein, J. (2024). “Muon: An optimizer for hidden layers in neural networks.” URLhttps://kellerjordan.github.io/posts/muon/ 32Default optimizers for variational inference

  53. [53]

    An introduction to variational methods for graphical models

    Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. (1999). “An introduction to variational methods for graphical models.”Machine Learning, 37(2): 183–233

  54. [54]

    DoWG unleashed: An effi- cient universal parameter-free gradient descent method

    Khaled, A., Mishchenko, K., and Jin, C. (2023). “DoWG unleashed: An effi- cient universal parameter-free gradient descent method.” InAdvances in Neural Information Processing Systems, volume 36, 6748–6769

  55. [55]

    The Bayesian learning rule

    Khan, M. E. and Rue, H. (2023). “The Bayesian learning rule.”Journal of Machine Learning Research, 24(281): 1–46

  56. [56]

    Linear convergence of black- box variational inference: Should we stick the landing?

    Kim, K., Ma, Y., and Gardner, J. R. (2024). “Linear convergence of black- box variational inference: Should we stick the landing?” InProceedings of the International Conference on Artificial Intelligence and Statistics, volume 238 of PMLR, 235–243. JMLR

  57. [57]

    On the convergence of black-box variational inference

    Kim, K., Oh, J., Wu, K., Ma, Y., and Gardner, J. R. (2023). “On the convergence of black-box variational inference.” InAdvances in Neural Information Processing Systems, volume 36, 44615–44657. Curran Associates Inc

  58. [58]

    A guide to sample average approximation

    Kim, S., Pasupathy, R., and Henderson, S. (2015). “A guide to sample average approximation.” InHandbook of Simulation Optimization, 207–243. Springer

  59. [59]

    Adam: A method for stochastic optimization

    Kingma, D. and Ba, J. (2015). “Adam: A method for stochastic optimization.” In Proceedings of the International Conference on Learning Representations

  60. [60]

    Auto-encoding variational Bayes

    Kingma, D. P. and Welling, M. (2014). “Auto-encoding variational Bayes.” In Proceedings of the International Conference on Learning Representations

  61. [61]

    Automatic differentiation variational inference

    Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., and Blei, D. M. (2017). “Automatic differentiation variational inference.”Journal of Machine Learning Research, 18(14): 1–45

  62. [62]

    Optimization guarantees for square-root natural-gradient variational inference

    Kumar, N., Möllenhoff, T., Khan, M. E., and Lucchi, A. (2025). “Optimization guarantees for square-root natural-gradient variational inference.”Transactions on Machine Learning Research

  63. [63]

    Varia- tional inference via Wasserstein gradient flows

    Lambert, M., Chewi, S., Bach, F., Bonnabel, S., and Rigollet, P. (2022). “Varia- tional inference via Wasserstein gradient flows.” InAdvances in Neural Information Processing Systems, volume 35, 14434–14447. Curran Associates, Inc

  64. [64]

    A stochastic gradient method with an exponential convergence rate for finite training sets

    Le Roux, N., Schmidt, M., and Bach, F. (2012). “A stochastic gradient method with an exponential convergence rate for finite training sets.” InAdvances in Neural Information Processing Systems, 2663–2671. Curran Associates, Inc

  65. [65]

    Rényi divergence variational inference

    Li, Y. and Turner, R. E. (2016). “Rényi divergence variational inference.” In Advances in Neural Information Processing Systems, volume 29, 1073–1081. Curran Associates, Inc

  66. [66]

    Fast and simple natural-gradient variational inference with mixture of exponential-family approximations

    Lin, W., Khan, M. E., and Schmidt, M. (2019). “Fast and simple natural-gradient variational inference with mixture of exponential-family approximations.” In Proceedings of the International Conference on Machine Learning, volume 97 of PMLR, 3992–4002. JMLR. T. Campbell, J. H. Huggins, K. Kim, C. C. Margossian33

  67. [67]

    Batch size selection for variance estimators in MCMC

    Liu, Y., Vats, D., and Flegal, J. M. (2022). “Batch size selection for variance estimators in MCMC.”Methodology and Computing in Applied Probability, 24(1): 65–93

  68. [68]

    Stochastic Polyak step-size for SGD: An adaptive learning rate for fast convergence

    Loizou, N., Vaswani, S., Laradji, I. H., and Lacoste-Julien, S. (2021). “Stochastic Polyak step-size for SGD: An adaptive learning rate for fast convergence.” In Proceedings of The International Conference on Artificial Intelligence and Statistics, PMLR, 1306–1314. JMLR

  69. [69]

    The power of interpolation: Un- derstanding the effectiveness of SGD in modern over-parametrized learning

    Ma, S., Bassily, R., and Belkin, M. (2018). “The power of interpolation: Un- derstanding the effectiveness of SGD in modern over-parametrized learning.” In Proceedings of the International Conference on Machine Learning, volume 80 of PMLR, 3325–3334. JMLR

  70. [70]

    posteriordb: Testing, benchmarking and developing Bayesian inference algorithms

    Magnusson, M., Torgander, J., Bürkner, P.-C., Zhang, L., Carpenter, B., and Vehtari, A. (2025). “posteriordb: Testing, benchmarking and developing Bayesian inference algorithms.” InProceedings of The International Conference on Artificial Intelligence and Statistics, volume 258 ofPMLR, 1198–1206. JMLR

  71. [71]

    Torsten: A platform for Bayesian inference of pharmacometric models

    Margossian, C. C., Zhang, Y., Gillespie, B., Bales, B., Volfovsky, A., Pavlovic, V., and Gelman, A. (2022). “Torsten: A platform for Bayesian inference of pharmacometric models.”Statistics and Computing, 32(6): 1–15

  72. [72]

    Expectation propagation for approximate Bayesian inference

    Minka, T. P. (2001). “Expectation propagation for approximate Bayesian inference.” InProceedings of the Conference on Uncertainty in Artificial Intelligence, 362–369. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc

  73. [73]

    Variational Inference with Gaussian Score Matching

    Modi, C., Gower, R., Margossian, C., Yao, Y., Blei, D., and Saul, L. (2023). “Variational Inference with Gaussian Score Matching.” InAdvances in Neural Information Processing Systems, volume 36, 29935–29950. Curran Associates, Inc

  74. [74]

    Monte Carlo Gradient Estimation in Machine Learning

    Mohamed, S., Rosca, M., Figurnov, M., and Mnih, A. (2020). “Monte Carlo Gradient Estimation in Machine Learning.”Journal of Machine Learning Research, 21(132): 1–62

  75. [75]

    Instead of Rewriting Foreign Code for Machine Learning, Automatically Synthesize Fast Gradients

    Moses, W. and Churavy, V. (2020). “Instead of Rewriting Foreign Code for Machine Learning, Automatically Synthesize Fast Gradients.” InAdvances in Neural Information Processing Systems, volume 33, 12472–12485. Curran Associates, Inc

  76. [76]

    Rectified linear units improve restricted Boltz- mann machines

    Nair, V. and Hinton, G. E. (2010). “Rectified linear units improve restricted Boltz- mann machines.” InProceedings of the International Conference on International Conference on Machine Learning, ICML, 807–814. Madison, WI, USA: Omnipress

  77. [77]

    Gaussian Variational Approximation with a Factor Covariance Structure

    Ong, V. M.-H., Nott, D. J., and Smith, M. S. (2018). “Gaussian Variational Approximation with a Factor Covariance Structure.”Journal of Computational and Graphical Statistics, 27(3): 465–478

  78. [78]

    Parameter-free stochastic optimization of variationally coherent functions

    Orabona, F. and Pál, D. (2021). “Parameter-free stochastic optimization of variationally coherent functions.”arXiv:2102.00236

  79. [79]

    Training deep networks without learning 34Default optimizers for variational inference rates through coin betting

    Orabona, F. and Tommasi, T. (2017). “Training deep networks without learning 34Default optimizers for variational inference rates through coin betting.” InAdvances in Neural Information Processing Systems, volume 30, 2160–2170. Curran Associates, Inc

  80. [80]

    R package

    outbreak package authors (2024).outbreak: Tools for Simulating and Analyzing Epidemic Outbreaks. R package. URLhttps://CRAN.R-project.org/package=outbreak

Showing first 80 references.