Large-scale empirical tuning and comparison of default optimizers for variational inference

Charles C. Margossian; Jonathan H. Huggins; Kyurae Kim; Trevor Campbell

arXiv: 2606.07841 · v1 · pith:ORJIH4VRnew · submitted 2026-06-05 · 📊 stat.CO · cs.LG · stat.ML

Pith reviewed 2026-06-27 20:01 UTC · model grok-4.3

classification 📊 stat.CO · cs.LG · stat.ML

keywords black-box variational inference · stochastic optimization · adaptive optimizers · empirical comparison · Bayesian inference · variational inference · optimization benchmarks

The pith

No single optimizer dominates black-box variational inference, but five algorithms reliably approach the best observed performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper runs a large empirical comparison of 56 stochastic gradient optimizers on 1092 Bayesian inference problems to test which adaptive methods can serve as tuning-free defaults for black-box variational inference. The study spans problems with dimensions from 1 to 10,000 and condition numbers up to 100 million, using over 550,000 optimization runs. Results show no method wins on every problem, yet a fixed selection of five algorithms consistently reaches performance close to the best observed across the collection. This finding supports treating black-box variational inference as a practical default tool when expert tuning is unavailable and supplies a concrete baseline for developing new optimizers.

Core claim

A benchmark of 56 stochastic optimization algorithms applied to 1092 Bayesian problems demonstrates that no single method dominates, but running a selection of five algorithms suffices to reliably get close to the best-possible observed performance.
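
To make the two halves of this claim concrete, here is a minimal sketch of how "no single method dominates" and "close to the best observed performance" could be read off a table of final losses. The loss values, the optimizer names (`opt_1`, `opt_2`, `opt_3`), and the helper functions are all hypothetical and are not taken from the paper.

```python
# Toy illustration only: the loss table below is made up, not taken from the paper.
# Keys are problems, inner keys are optimizers; entries are final losses
# (lower is better), e.g. a negative ELBO after a fixed time budget.
losses = {
    "problem_a": {"opt_1": 1.00, "opt_2": 1.40, "opt_3": 2.00},
    "problem_b": {"opt_1": 3.00, "opt_2": 2.00, "opt_3": 2.10},
    "problem_c": {"opt_1": 5.00, "opt_2": 4.80, "opt_3": 4.00},
}


def best_observed(row):
    """Lowest loss reached by any optimizer on one problem."""
    return min(row.values())


def win_counts(table):
    """Number of problems on which each optimizer attains the best observed loss."""
    counts = {}
    for row in table.values():
        best = best_observed(row)
        for name, value in row.items():
            counts[name] = counts.get(name, 0) + (value == best)
    return counts


def portfolio_gaps(table, portfolio):
    """Per-problem gap between the portfolio's best member and the best observed loss."""
    return {
        problem: min(row[name] for name in portfolio) - best_observed(row)
        for problem, row in table.items()
    }


wins = win_counts(losses)
gaps = portfolio_gaps(losses, ["opt_1", "opt_3"])
```

In this toy table every optimizer wins on exactly one problem, so none dominates, while the two-member portfolio is never far from the best observed loss. The paper's own metrics and thresholds may differ from this simplified reading.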

What carries the argument

The large-scale empirical benchmark of 56 stochastic gradient-based optimization algorithms evaluated on 1092 Bayesian inference optimization problems with varying dimensions, condition numbers, and variational families.

If this is right

A practitioner can obtain reliable black-box variational inference results by running five default optimizers in parallel without any problem-specific tuning.
New stochastic optimization algorithms for variational inference can be evaluated against the performance distribution established by the five-algorithm selection.
Black-box variational inference becomes more usable in applications where expert tuning is not feasible.
The study supplies a standardized set of problems for future comparisons of optimization methods in this domain.
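
The first consequence above, running several default optimizers side by side and keeping the best result, can be sketched as follows. This is an illustration under stated assumptions, not the paper's setup: the objective is a toy noisy quadratic standing in for a negative ELBO, and the three portfolio members (`sgd_small`, `sgd_large`, `adagrad`) are placeholders rather than the five algorithms the paper selects.

```python
import random
from concurrent.futures import ThreadPoolExecutor

# Toy ill-conditioned quadratic, a stand-in for a negative ELBO (assumption).
CURVATURES = [1.0, 10.0, 100.0]


def loss(x):
    return 0.5 * sum(c * xi * xi for c, xi in zip(CURVATURES, x))


def noisy_grad(x, rng):
    # Exact gradient plus Gaussian noise, mimicking a Monte Carlo gradient estimate.
    return [c * xi + rng.gauss(0.0, 0.1) for c, xi in zip(CURVATURES, x)]


def sgd(step, iters=500, seed=0):
    """Plain stochastic gradient descent with a fixed step size."""
    rng = random.Random(seed)
    x = [1.0] * len(CURVATURES)
    for _ in range(iters):
        g = noisy_grad(x, rng)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return loss(x)


def adagrad(step, iters=500, seed=0):
    """AdaGrad-style update: per-coordinate steps scaled by accumulated squared gradients."""
    rng = random.Random(seed)
    x = [1.0] * len(CURVATURES)
    accum = [1e-8] * len(CURVATURES)
    for _ in range(iters):
        g = noisy_grad(x, rng)
        accum = [a + gi * gi for a, gi in zip(accum, g)]
        x = [xi - step * gi / a ** 0.5 for xi, gi, a in zip(x, g, accum)]
    return loss(x)


# Hypothetical portfolio; names and settings are illustrative only.
PORTFOLIO = {
    "sgd_small": lambda: sgd(step=0.001),
    "sgd_large": lambda: sgd(step=0.015),
    "adagrad": lambda: adagrad(step=0.5),
}

# Submit every member, then keep whichever reaches the lowest final loss.
with ThreadPoolExecutor() as pool:
    futures = {name: pool.submit(run) for name, run in PORTFOLIO.items()}
    results = {name: fut.result() for name, fut in futures.items()}

best_name = min(results, key=results.get)
best_loss = results[best_name]
```

Threads are used here only to keep the sketch short. For CPU-bound runs in CPython, separate processes or separate machines would be the more natural way to get real parallelism.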

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Portfolio or switching strategies that try several of the five algorithms may further improve reliability without increasing the total number of runs.
The same five-algorithm approach could be tested on other stochastic optimization tasks outside variational inference to check transferability.
Problem characteristics such as dimension or condition number could be used to choose among the five rather than always running all of them.

Load-bearing premise

The 1092 chosen Bayesian inference problems and the performance metric used are representative of typical real-world black-box variational inference tasks and success criteria.

What would settle it

A new optimizer that consistently outperforms the best of the five selected algorithms on a broad subset of the 1092 problems or on additional problems drawn from the same distribution of dimensions and condition numbers.

Figures

Figures reproduced from arXiv: 2606.07841 by Charles C. Margossian, Jonathan H. Huggins, Kyurae Kim, Trevor Campbell.

**Figure 2.** Scatter plots of the number of completed iterations versus target posterior …

**Figure 3.** Heatmap of the average rank (in terms of final ELBO objective) of each step …

**Figure 4.** Heatmap of the probability that no soft failure occurs (including out-of-memory, …

**Figure 5.** Box plots of squared gradient norm at the 5-minute mark relative to the …

**Figure 6.** Example trace plots of performance on individual problems. For MAP problems …

**Figure 7.** Box plots of the 5-minute squared gradient norm of each tuned algorithm, …

**Figure 8.** Failure probability for each tuned algorithm grouped by objective function …

**Figure 9.** Box plots of the 5-minute negative ELBO rank for each tuned algorithm, split …

**Figure 10.** Cumulative distribution functions of the 5-minute negative ELBO rank for all …

**Figure 11.** Heatmap showing the probability that each tuned algorithm in row …

**Figure 12.** Trace plots of performance on the held-out challenge problems. For MAP …

**Figure 13.** Stan code for the original earn_height model [70], followed by Stan code for the subsampled earn_height_subsampled model. Note the additional parameter SUBIDX that is appended to the state, treated as an extra dimension in the BridgeStan input argument, and then used to select just one of the data log-likelihood terms to include in the target log probability density.

Original abstract

Black-box variational inference (BBVI) is a methodology for posterior approximation that relies on stochastic optimization. In practice, the stochastic optimizers underpinning BBVI generally require extensive problem-specific tuning, which undermines its promise as a truly "black box" inference algorithm. However, over the past decade, many new adaptive stochastic optimization algorithms have been developed that reduce or remove entirely the need for tuning. In this work, we investigate this new collection of adaptive methods in the context of BBVI, with the goal of establishing the current state of the art in tuning-free optimization-based inference. In particular, we present a large-scale empirical evaluation of 56 stochastic gradient-based optimization algorithms applied to 1092 Bayesian inference optimization problems, involving over 550,000 individual optimization runs and 15 core-years of compute. The optimization algorithms we evaluate are chosen to represent a wide spectrum of recent approaches and the benchmark problems are chosen to span a range of difficulty, with posterior target dimension 1-10^4, condition number 1-10^8, and a range of variational families. Our results show that no single method dominates, but running a selection of 5 algorithms suffices to reliably get close to the best-possible observed performance. We thus provide a strong baseline for applications where expert tuning is not possible and for comparison when developing new stochastic optimization algorithms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Solid large-scale benchmark of BBVI optimizers shows five defaults get close to best observed performance, but the selection of those five looks data-driven without clear validation.

The paper's core contribution is running 56 stochastic optimizers across 1092 BBVI problems with over 550k total runs. This scale is the real value: it gives a practical picture of how current adaptive methods behave on a wide range of dimensions, condition numbers, and variational families without expert tuning.

It does the empirical work cleanly. The finding that no single optimizer wins everywhere but a fixed selection of five reaches near the best-observed performance on most problems is useful for anyone who needs a default setup. The benchmark design covers enough variation to make the comparison credible as a baseline.

The main soft spot is the selection process for those five. The stress-test concern holds: if the five were identified by looking at performance across the entire collection, the reported reliability is an in-sample observation rather than a tested recommendation for new problems. The abstract gives no indication of held-out problems, cross-validation, or a pre-specified rule, so the generalization claim needs checking in the methods. Minor additional issues are the usual ones for this kind of study—how problems were chosen and whether multiple-comparison adjustments were applied—but those are secondary to the selection question.

This paper is for practitioners who want tuning-free BBVI defaults and for researchers building new optimizers who need a strong reference point. The work is honest empirical benchmarking with no circular claims, so it deserves a serious referee even if revisions are needed on the validation details.

Referee Report

2 major / 2 minor

Summary. The paper reports a large-scale empirical benchmark of 56 stochastic gradient optimizers for black-box variational inference (BBVI) across 1092 problems (posterior dimensions 1–10^4, condition numbers 1–10^8), totaling >550k runs and 15 core-years. No single optimizer dominates; the central claim is that a fixed selection of 5 algorithms suffices to reach near the best-observed performance on most problems, providing a practical default baseline when expert tuning is unavailable.

Significance. If the central claim holds after addressing selection methodology, the work supplies a valuable, reproducible empirical baseline for untuned BBVI and a reference point for new optimizer development. The scale and breadth of the problem set are genuine strengths for an empirical study in this area.

major comments (2)

[Results (5-algorithm selection)] Results section (around the identification of the 5-algorithm subset): the headline claim that 'running a selection of 5 algorithms suffices to reliably get close to the best-possible observed performance' requires an explicit statement of the selection procedure. If the 5 were chosen by inspecting performance patterns across the entire 1092-problem collection without cross-validation, held-out problems, or a pre-specified rule, the reported reliability is at risk of being an in-sample observation rather than a generalizable recommendation.
[Methods / benchmark construction] Methods or benchmark-construction section: the paper must detail the criteria used to select the 1092 problems and the performance metric (e.g., how 'best-possible' is computed, whether multiple-comparison corrections were applied, and any statistical testing of differences). Without these, it is difficult to assess whether the problems and metric are representative of typical real-world BBVI tasks, which directly affects the strength of the 'no single method dominates' and '5 suffice' conclusions.
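
The in-sample concern in the first major comment can be illustrated with a small sketch: choose a portfolio on one subset of problems, then measure how often it comes close to the best observed loss on problems it was not chosen on. Everything here is an assumption made for illustration: the synthetic loss table, the greedy selection rule, the 5% relative tolerance, and the 150/50 split. It also assumes positive losses, since a relative tolerance is not meaningful otherwise.

```python
import random


def coverage(table, portfolio, tol=0.05):
    """Fraction of problems where the portfolio's best loss is within a
    relative tolerance `tol` of the best observed loss (positive losses assumed)."""
    hits = 0
    for row in table:
        best = min(row.values())
        hits += min(row[name] for name in portfolio) <= (1.0 + tol) * best
    return hits / len(table)


def greedy_portfolio(table, k, tol=0.05):
    """Add, one at a time, the optimizer that most increases coverage on `table`."""
    names = sorted(table[0])
    chosen = []
    for _ in range(k):
        candidates = [n for n in names if n not in chosen]
        chosen.append(max(candidates, key=lambda n: coverage(table, chosen + [n], tol)))
    return chosen


# Synthetic stand-in for a benchmark: positive losses for 8 hypothetical
# optimizers on 200 problems. None of these numbers come from the paper.
rng = random.Random(0)
optimizers = [f"opt_{i}" for i in range(8)]
problems = [{name: rng.uniform(1.0, 2.0) for name in optimizers} for _ in range(200)]

# Select on one split, then evaluate the same fixed portfolio on the other.
train, held_out = problems[:150], problems[150:]
portfolio = greedy_portfolio(train, k=3)
in_sample = coverage(train, portfolio)
out_of_sample = coverage(held_out, portfolio)
```

Comparing `in_sample` with `out_of_sample` is the kind of check the comment asks for. A large drop on the held-out split would suggest the selection was fitted to the problems it was chosen on.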

minor comments (2)

[Abstract / Introduction] Clarify in the abstract and introduction whether the 56 algorithms include all recent adaptive methods or a curated subset, and provide a brief justification for any omissions.
[Figures / Tables] Figure captions and tables reporting per-problem or aggregate performance should include error bars or confidence intervals derived from the multiple runs per problem.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments and for recognizing the scale and potential utility of the benchmark. We address each major comment below.

Referee: [Results (5-algorithm selection)] Results section (around the identification of the 5-algorithm subset): the headline claim that 'running a selection of 5 algorithms suffices to reliably get close to the best-possible observed performance' requires an explicit statement of the selection procedure. If the 5 were chosen by inspecting performance patterns across the entire 1092-problem collection without cross-validation, held-out problems, or a pre-specified rule, the reported reliability is at risk of being an in-sample observation rather than a generalizable recommendation.

Authors: We agree that the selection procedure requires an explicit statement. In the revision we will add a dedicated paragraph in the Results section describing the exact rule used to identify the 5-algorithm subset (top performers across binned regimes of dimension and condition number that together reach within 5% of the best observed ELBO on at least 85% of problems). While the initial identification used the full collection, we will also report performance of the same fixed 5 on a randomly selected held-out collection of 150 problems to provide evidence of out-of-sample behavior. revision: yes
Referee: [Methods / benchmark construction] Methods or benchmark-construction section: the paper must detail the criteria used to select the 1092 problems and the performance metric (e.g., how 'best-possible' is computed, whether multiple-comparison corrections were applied, and any statistical testing of differences). Without these, it is difficult to assess whether the problems and metric are representative of typical real-world BBVI tasks, which directly affects the strength of the 'no single method dominates' and '5 suffice' conclusions.

Authors: We will expand the Methods section with a new subsection on benchmark construction. This will specify the sampling procedure used to obtain the 1092 problems (stratified coverage of dimension and condition-number ranges), define the performance metric (negative evidence lower bound, ELBO), state that the best-possible value for each problem is the minimum negative ELBO attained by any of the 56 optimizers across all random seeds, and report the statistical tests (paired Wilcoxon tests with Bonferroni correction) used to support claims of no single dominant method. revision: yes
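
The multiple-comparison step named in this response can be sketched roughly as below. Note the substitution: to stay dependency-free, the sketch uses an exact paired sign test in place of the paired Wilcoxon test the response mentions, so it shows the Bonferroni logic rather than the stated procedure. The loss values and optimizer names are made up.

```python
from itertools import combinations
from math import comb


def sign_test_pvalue(a, b):
    """Two-sided exact sign test on paired losses; ties are dropped."""
    diffs = [x - y for x, y in zip(a, b) if x != y]
    n = len(diffs)
    if n == 0:
        return 1.0
    wins = sum(d < 0 for d in diffs)
    tail = min(wins, n - wins)
    p = 2.0 * sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
    return min(1.0, p)


def pairwise_bonferroni(losses, alpha=0.05):
    """Compare every pair of optimizers and flag the pairs whose p-value
    falls below alpha divided by the number of comparisons."""
    pairs = list(combinations(sorted(losses), 2))
    threshold = alpha / len(pairs)
    return {
        (a, b): sign_test_pvalue(losses[a], losses[b]) < threshold
        for a, b in pairs
    }


# Made-up per-problem losses (same problem order in each list), lower is better.
losses = {
    "opt_1": [1.0, 1.1, 0.9, 1.2, 1.0, 0.8, 1.1, 0.9, 1.0, 1.2],
    "opt_2": [2.0, 2.1, 1.9, 2.2, 2.0, 1.8, 2.1, 1.9, 2.0, 2.2],
    "opt_3": [1.5, 0.6, 1.4, 0.7, 1.5, 0.3, 1.6, 0.4, 1.5, 0.7],
}
significant = pairwise_bonferroni(losses)
```

With three optimizers there are three pairwise comparisons, so each test is held to a threshold of 0.05 / 3. In this toy data `opt_2` is worse than both others on every problem, while `opt_1` and `opt_3` split the problems evenly and are not separated by the test.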

Circularity Check

0 steps flagged

Empirical benchmarking study with no derivation chain or self-referential claims

This is a purely empirical paper reporting results from running 56 optimizers on 1092 problems. No mathematical derivations, equations, fitted parameters presented as predictions, ansatzes, or uniqueness theorems appear in the abstract or described content. The statement that a selection of 5 algorithms suffices is an in-sample empirical observation from the benchmark runs, not a derivation that reduces to its own inputs by construction. No self-citations are invoked as load-bearing premises. The paper is self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical benchmarking paper; contains no mathematical derivations, free parameters, axioms, or invented entities.

Reference graph

Works this paper leans on

116 extracted references · 6 canonical work pages · 3 internal anchors

[1]

PosteriorDB.jl: A Julia package to work with posteriordb

Axen, S. (2026). “PosteriorDB.jl: A Julia package to work with posteriordb.” GitHub Repository: https://github.com/sethaxen/PosteriorDB.jl. Version 0.6.0

2026
[2]

Layer Normalization

Ba, J., Kiros, J., and Hinton, G. (2016). “Layer normalization.” arXiv:1607.06450

2016
[3]

Online learning rate adaptation with hypergradient descent

Baydin, A., Cornish, R., Rubio, D., Schmidt, M., and Wood, F. (2018). “Online learning rate adaptation with hypergradient descent.” In Proceedings of the International Conference on Learning Representations

2018
[4]

Training neural networks for and by interpolation

Berrada, L., Zisserman, A., and Kumar, M. (2020). “Training neural networks for and by interpolation.” In Proceedings of the International Conference on Machine Learning, volume 119 of PMLR, 799–809

2020
[5]

Julia: A fresh approach to numerical computing

Bezanson, J., Edelman, A., Karpinski, S., and Shah, V. B. (2017). “Julia: A fresh approach to numerical computing.” SIAM Review, 59(1): 65–98

2017
[6]

Variational inference: A review for statisticians

Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017). “Variational inference: A review for statisticians.” Journal of the American Statistical Association, 112(518): 859–877

2017
[7]

On-line learning and stochastic approximations

Bottou, L. (1999). “On-line learning and stochastic approximations.” In On-Line Learning in Neural Networks, 9–42. Cambridge University Press, 1st edition

1999
[8]

Optimization methods for large-scale machine learning

Bottou, L., Curtis, F. E., and Nocedal, J. (2018). “Optimization methods for large-scale machine learning.” SIAM Review, 60(2): 223–311

2018
[9]

Quasi-Monte Carlo variational inference

Buchholz, A., Wenzel, F., and Mandt, S. (2018). “Quasi-Monte Carlo variational inference.” In Proceedings of the International Conference on Machine Learning, volume 80 of PMLR, 668–677. JMLR

2018
[10]

Importance weighted autoencoders

Burda, Y., Grosse, R., and Salakhutdinov, R. (2015). “Importance weighted autoencoders.” In Proceedings of the International Conference on Learning Representations

2015
[11]

Sample average approximation for black-box variational inference

Burroni, J., Domke, J., and Sheldon, D. (2024). “Sample average approximation for black-box variational inference.” In Proceedings of the Conference on Uncertainty in Artificial Intelligence, volume 244 of PMLR, 471–498. JMLR

2024
[12]

EigenVI: Score-based variational inference with orthogonal function expansions

Cai, D., Modi, C., Margossian, C., Gower, R., Blei, D., and Saul, L. (2024). “EigenVI: Score-based variational inference with orthogonal function expansions.” In Advances in Neural Information Processing Systems, 132691–132721. Curran Associates, Inc

2024
[13]

Batch and Match: Black-box variational inference with a score-based divergence

Cai, D., Modi, C., Pillaud-Vivien, L., Margossian, C., Gower, R., Blei, D., and Saul, L. (2024). “Batch and Match: Black-box variational inference with a score-based divergence.” In Proceedings of the International Conference on Machine Learning, volume 235 of PMLR, 5258–5297. JMLR

2024
[14]

Making SGD parameter-free

Carmon, Y. and Hinder, O. (2022). “Making SGD parameter-free.” In Proceedings of the Conference on Learning Theory, volume 178 of PMLR, 2360–2389

2022
[15]

Algorithms for computing the sample variance: Analysis and recommendations

Chan, T. F., Golub, G. H., and LeVeque, R. J. (1983). “Algorithms for computing the sample variance: Analysis and recommendations.” The American Statistician, 37(3): 242–247

1983
[16]

Symbolic discovery of optimization algorithms

Chen, X., Liang, C., Huang, D., Real, E., Wang, K., Liu, Y., Pham, H., Dong, X., Luong, T., Hsieh, C.-J., Lu, Y., and Le, Q. (2023). “Symbolic discovery of optimization algorithms.” In Advances in Neural Information Processing Systems, volume 36, 49205–49233. Curran Associates, Inc

2023
[17]

Mechanic: A learning rate tuner

Cutkosky, A., Defazio, A., and Mehta, H. (2023). “Mechanic: A learning rate tuner.” In Advances in Neural Information Processing Systems, volume 36, 47828–47848. Curran Associates, Inc

2023
[18]

The Helmholtz Machine

Dayan, P., Hinton, G. E., Neal, R. M., and Zemel, R. S. (1995). “The Helmholtz Machine.” Neural Computation, 7(5): 889–904

1995
[19]

Big Batch SGD: Automated inference using adaptive batch sizes

De, S., Yadav, A., Jacobs, D., and Goldstein, T. (2017). “Big Batch SGD: Automated inference using adaptive batch sizes.” In Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 52 of PMLR, 1504–1513. JMLR

2017
[20]

SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives

Defazio, A., Bach, F., and Lacoste-Julien, S. (2014). “SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives.” In Advances in Neural Information Processing Systems, volume 27, 1646–1654. Curran Associates, Inc

2014
[21]

Learning-rate-free learning by D-adaptation

Defazio, A. and Mishchenko, K. (2023). “Learning-rate-free learning by D-adaptation.” In Proceedings of the International Conference on Machine Learning, volume 202 of PMLR, 7449–7479. JMLR

2023
[22]

Robust, accurate stochastic optimization for variational inference

Dhaka, A. K., Catalina, A., Andersen, M. R., Magnusson, M., Huggins, J., and Vehtari, A. (2020). “Robust, accurate stochastic optimization for variational inference.” In Advances in Neural Information Processing Systems, volume 33, 10961–10973. Curran Associates, Inc

2020
[23]

Challenges and opportunities in high-dimensional variational inference

Dhaka, A. K., Catalina, A., Welandawe, M., Andersen, M. R., Huggins, J., and Vehtari, A. (2021). “Challenges and opportunities in high-dimensional variational inference.” In Advances in Neural Information Processing Systems, volume 34, 7787–7798. Curran Associates, Inc

2021
[24]

Forward-backward Gaussian variational inference via JKO in the Bures-Wasserstein space

Diao, M. Z., Balasubramanian, K., Chewi, S., and Salim, A. (2023). “Forward-backward Gaussian variational inference via JKO in the Bures-Wasserstein space.” In Proceedings of the International Conference on Machine Learning, volume 202 of PMLR, 7960–7991. JMLR

2023
[25]

Variational inference via χ-upper bound minimization

Dieng, A. B., Tran, D., Ranganath, R., Paisley, J., and Blei, D. (2017). “Variational inference via χ-upper bound minimization.” In Advances in Neural Information Processing Systems, volume 30, 2729–2738. Curran Associates, Inc

2017
[26]

Bridging the gap between constant step size stochastic gradient descent and Markov chains

Dieuleveut, A., Durmus, A., and Bach, F. (2020). “Bridging the gap between constant step size stochastic gradient descent and Markov chains.” The Annals of Statistics, 48(3): 1348–1382

2020
[27]

Provable gradient variance guarantees for black-box variational inference

Domke, J. (2019). “Provable gradient variance guarantees for black-box variational inference.” In Advances in Neural Information Processing Systems, volume 32, 329–338. Curran Associates, Inc

2019
[28]

Provable smoothness guarantees for black-box variational inference

— (2020). “Provable smoothness guarantees for black-box variational inference.” In Proceedings of the International Conference on Machine Learning, volume 119 of PMLR, 2587–2596. JMLR

2020
[29]

Provable convergence guarantees for black-box variational inference

Domke, J., Gower, R., and Garrigos, G. (2023). “Provable convergence guarantees for black-box variational inference.” In Advances in Neural Information Processing Systems, volume 36, 66289–66327. Curran Associates, Inc

2023
[30]

Importance weighting and variational inference

Domke, J. and Sheldon, D. R. (2018). “Importance weighting and variational inference.” In Advances in Neural Information Processing Systems, volume 31, 4470–4479. Curran Associates, Inc

2018
[31]

Divide and Couple: using Monte Carlo variational objectives for posterior approximation

— (2019). “Divide and Couple: using Monte Carlo variational objectives for posterior approximation.” In Advances in Neural Information Processing Systems, volume 32, 339–349. Curran Associates, Inc

2019
[32]

Adaptive subgradient methods for online learning and stochastic optimization

Duchi, J., Hazan, E., and Singer, Y. (2011). “Adaptive subgradient methods for online learning and stochastic optimization.” Journal of Machine Learning Research, 12: 2121–2159

2011
[33]

Batch means and spectral variance estimators in Markov chain Monte Carlo

Flegal, J. M. and Jones, G. L. (2010). “Batch means and spectral variance estimators in Markov chain Monte Carlo.” The Annals of Statistics, 38(2): 1034–1070

2010
[34]

Multilevel Monte Carlo variational inference

Fujisawa, M. and Sato, I. (2021). “Multilevel Monte Carlo variational inference.” Journal of Machine Learning Research, 22(278): 1–44

2021
[35]

Don’t be so monotone: Relaxing stochastic line search in over-parametrized models

Galli, L., Rauhut, H., and Schmidt, M. (2023). “Don’t be so monotone: Relaxing stochastic line search in over-parametrized models.” In Advances in Neural Information Processing Systems, volume 36, 34752–34764. Curran Associates, Inc

2023
[36]

Empirical evaluation of biased methods for alpha divergence minimization

Geffner, T. and Domke, J. (2021). “Empirical evaluation of biased methods for alpha divergence minimization.” In Proceedings of the Symposium on Advances in Approximate Bayesian Inference

2021
[37]

MCMC variational inference via uncorrected Hamiltonian annealing

— (2021). “MCMC variational inference via uncorrected Hamiltonian annealing.” In Advances in Neural Information Processing Systems, volume 34, 639–651. Curran Associates, Inc

2021
[38]

On the difficulty of unbiased alpha divergence minimization

— (2021). “On the difficulty of unbiased alpha divergence minimization.” In Proceedings of the International Conference on Machine Learning, volume 139 of PMLR, 3650–3659. JMLR

2021
[39]

Inference from iterative simulation using multiple sequences

Gelman, A. and Rubin, D. (1992). “Inference from iterative simulation using multiple sequences.” Statistical Science, 7(4): 457–511

1992
[40]

Black box variational inference with a deterministic objective: Faster, more accurate, and even more black box

Giordano, R., Ingram, M., and Broderick, T. (2024). “Black box variational inference with a deterministic objective: Faster, more accurate, and even more black box.” Journal of Machine Learning Research, 25: 1–39

2024
[41]

Practical variational inference for neural networks

Graves, A. (2011). “Practical variational inference for neural networks.” In Advances in Neural Information Processing Systems, volume 24, 2348–2356. Curran Associates, Inc

2011
[42]

Shampoo: Preconditioned stochastic tensor optimization

Gupta, V., Koren, T., and Singer, Y. (2018). “Shampoo: Preconditioned stochastic tensor optimization.” In Proceedings of the International Conference on Machine Learning, volume 80 of PMLR, 1842–1850. JMLR

2018
[43]

Revisiting the Polyak step size

Hazan, E. and Kakade, S. (2019). “Revisiting the Polyak step size.” arXiv:1905.00313

2019
[44]

Deep residual learning for image recognition

He, K., Zhang, X., Ren, S., and Sun, J. (2016). “Deep residual learning for image recognition.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778

2016
[45]

Black-Box alpha divergence minimization

Hernandez-Lobato, J., Li, Y., Rowland, M., Bui, T., Hernandez-Lobato, D., and Turner, R. (2016). “Black-Box alpha divergence minimization.” In Proceedings of the International Conference on Machine Learning, volume 48 of PMLR, 1511–1520. JMLR

2016
[46]

Keeping the neural networks simple by minimizing the description length of the weights

Hinton, G. E. and van Camp, D. (1993). “Keeping the neural networks simple by minimizing the description length of the weights.” In Proceedings of the Annual Conference on Computational Learning Theory, 5–13. ACM Press

1993
[47]

Perturbation analysis and optimization of queueing networks

Ho, Y. C. and Cao, X. (1983). “Perturbation analysis and optimization of queueing networks.” Journal of Optimization Theory and Applications, 40(4): 559–582

1983
[48]

Validated variational inference via practical posterior error bounds

Huggins, J., Kasprzak, M., Campbell, T., and Broderick, T. (2020). “Validated variational inference via practical posterior error bounds.” In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of PMLR, 1792–1802. JMLR

2020
[49]

Variational Inference using Implicit Distributions

Huszár, F. (2017). “Variational inference using implicit distributions.” arXiv:1702.08235

2017
[50]

DoG is SGD’s best friend: A parameter-free dynamic step size schedule

Ivgi, M., Hinder, O., and Carmon, Y. (2023). “DoG is SGD’s best friend: A parameter-free dynamic step size schedule.” In Proceedings of the International Conference on Machine Learning, volume 202 of PMLR, 14465–14499. JMLR

2023
[51]

Accelerating stochastic gradient descent using predictive variance reduction

Johnson, R. and Zhang, T. (2013). “Accelerating stochastic gradient descent using predictive variance reduction.” In Advances in Neural Information Processing Systems, volume 26, 315–323. Curran Associates, Inc

2013
[52]

Muon: An optimizer for hidden layers in neural networks

Jordan, K., Jin, Y., Boza, V., Jiacheng, Y., Cesista, F., Newhouse, L., and Bernstein, J. (2024). “Muon: An optimizer for hidden layers in neural networks.” URL https://kellerjordan.github.io/posts/muon/

2024
[53]

An introduction to variational methods for graphical models

Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. (1999). “An introduction to variational methods for graphical models.” Machine Learning, 37(2): 183–233

1999
[54]

DoWG unleashed: An efficient universal parameter-free gradient descent method

Khaled, A., Mishchenko, K., and Jin, C. (2023). “DoWG unleashed: An efficient universal parameter-free gradient descent method.” In Advances in Neural Information Processing Systems, volume 36, 6748–6769

2023
[55]

The Bayesian learning rule

Khan, M. E. and Rue, H. (2023). “The Bayesian learning rule.” Journal of Machine Learning Research, 24(281): 1–46

2023
[56]

Linear convergence of black-box variational inference: Should we stick the landing?

Kim, K., Ma, Y., and Gardner, J. R. (2024). “Linear convergence of black-box variational inference: Should we stick the landing?” In Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 238 of PMLR, 235–243. JMLR

2024
[57]

On the convergence of black-box variational inference

Kim, K., Oh, J., Wu, K., Ma, Y., and Gardner, J. R. (2023). “On the convergence of black-box variational inference.” In Advances in Neural Information Processing Systems, volume 36, 44615–44657. Curran Associates, Inc

2023
[58]

A guide to sample average approximation

Kim, S., Pasupathy, R., and Henderson, S. (2015). “A guide to sample average approximation.” In Handbook of Simulation Optimization, 207–243. Springer

2015
[59]

Adam: A method for stochastic optimization

Kingma, D. and Ba, J. (2015). “Adam: A method for stochastic optimization.” In Proceedings of the International Conference on Learning Representations

2015
[60]

Auto-encoding variational Bayes

Kingma, D. P. and Welling, M. (2014). “Auto-encoding variational Bayes.” In Proceedings of the International Conference on Learning Representations

2014
[61]

Automatic differentiation variational inference

Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., and Blei, D. M. (2017). “Automatic differentiation variational inference.” Journal of Machine Learning Research, 18(14): 1–45

2017
[62]

Optimization guarantees for square-root natural-gradient variational inference

Kumar, N., Möllenhoff, T., Khan, M. E., and Lucchi, A. (2025). “Optimization guarantees for square-root natural-gradient variational inference.” Transactions on Machine Learning Research

2025
[63]

Variational inference via Wasserstein gradient flows

Lambert, M., Chewi, S., Bach, F., Bonnabel, S., and Rigollet, P. (2022). “Variational inference via Wasserstein gradient flows.” In Advances in Neural Information Processing Systems, volume 35, 14434–14447. Curran Associates, Inc

2022
[64]

A stochastic gradient method with an exponential convergence rate for finite training sets

Le Roux, N., Schmidt, M., and Bach, F. (2012). “A stochastic gradient method with an exponential convergence rate for finite training sets.” In Advances in Neural Information Processing Systems, 2663–2671. Curran Associates, Inc

2012
[65]

Rényi divergence variational inference

Li, Y. and Turner, R. E. (2016). “Rényi divergence variational inference.” In Advances in Neural Information Processing Systems, volume 29, 1073–1081. Curran Associates, Inc

2016
[66]

Fast and simple natural-gradient variational inference with mixture of exponential-family approximations

Lin, W., Khan, M. E., and Schmidt, M. (2019). “Fast and simple natural-gradient variational inference with mixture of exponential-family approximations.” In Proceedings of the International Conference on Machine Learning, volume 97 of PMLR, 3992–4002. JMLR

2019
[67]

Batch size selection for variance estimators in MCMC

Liu, Y., Vats, D., and Flegal, J. M. (2022). “Batch size selection for variance estimators in MCMC.” Methodology and Computing in Applied Probability, 24(1): 65–93

2022
[68]

Stochastic Polyak step-size for SGD: An adaptive learning rate for fast convergence

Loizou, N., Vaswani, S., Laradji, I. H., and Lacoste-Julien, S. (2021). “Stochastic Polyak step-size for SGD: An adaptive learning rate for fast convergence.” In Proceedings of The International Conference on Artificial Intelligence and Statistics, PMLR, 1306–1314. JMLR

2021
[69]

The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning

Ma, S., Bassily, R., and Belkin, M. (2018). “The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning.” In Proceedings of the International Conference on Machine Learning, volume 80 of PMLR, 3325–3334. JMLR

2018
[70]

posteriordb: Testing, benchmarking and developing Bayesian inference algorithms

Magnusson, M., Torgander, J., Bürkner, P.-C., Zhang, L., Carpenter, B., and Vehtari, A. (2025). “posteriordb: Testing, benchmarking and developing Bayesian inference algorithms.” In Proceedings of The International Conference on Artificial Intelligence and Statistics, volume 258 of PMLR, 1198–1206. JMLR

2025
[71]

Torsten: A platform for Bayesian inference of pharmacometric models

Margossian, C. C., Zhang, Y., Gillespie, B., Bales, B., Volfovsky, A., Pavlovic, V., and Gelman, A. (2022). “Torsten: A platform for Bayesian inference of pharmacometric models.” Statistics and Computing, 32(6): 1–15

2022
[72]

Expectation propagation for approximate Bayesian inference

Minka, T. P. (2001). “Expectation propagation for approximate Bayesian inference.” In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 362–369. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc

2001
[73]

Variational Inference with Gaussian Score Matching

Modi, C., Gower, R., Margossian, C., Yao, Y., Blei, D., and Saul, L. (2023). “Variational Inference with Gaussian Score Matching.” In Advances in Neural Information Processing Systems, volume 36, 29935–29950. Curran Associates, Inc

2023
[74]

Monte Carlo Gradient Estimation in Machine Learning

Mohamed, S., Rosca, M., Figurnov, M., and Mnih, A. (2020). “Monte Carlo Gradient Estimation in Machine Learning.” Journal of Machine Learning Research, 21(132): 1–62

2020
[75]

Instead of Rewriting Foreign Code for Machine Learning, Automatically Synthesize Fast Gradients

Moses, W. and Churavy, V. (2020). “Instead of Rewriting Foreign Code for Machine Learning, Automatically Synthesize Fast Gradients.” In Advances in Neural Information Processing Systems, volume 33, 12472–12485. Curran Associates, Inc

2020
[76]

Rectified linear units improve restricted Boltzmann machines

Nair, V. and Hinton, G. E. (2010). “Rectified linear units improve restricted Boltzmann machines.” In Proceedings of the International Conference on International Conference on Machine Learning, ICML, 807–814. Madison, WI, USA: Omnipress

2010
[77]

Gaussian Variational Approximation with a Factor Covariance Structure

Ong, V. M.-H., Nott, D. J., and Smith, M. S. (2018). “Gaussian Variational Approximation with a Factor Covariance Structure.” Journal of Computational and Graphical Statistics, 27(3): 465–478

2018
[78]

Parameter-free stochastic optimization of variationally coherent functions

Orabona, F. and Pál, D. (2021). “Parameter-free stochastic optimization of variationally coherent functions.” arXiv:2102.00236

2021
[79]

Training deep networks without learning rates through coin betting

Orabona, F. and Tommasi, T. (2017). “Training deep networks without learning rates through coin betting.” In Advances in Neural Information Processing Systems, volume 30, 2160–2170. Curran Associates, Inc

2017
[80]

outbreak: Tools for Simulating and Analyzing Epidemic Outbreaks

outbreak package authors (2024). outbreak: Tools for Simulating and Analyzing Epidemic Outbreaks. R package. URL https://CRAN.R-project.org/package=outbreak

2024


[1] [1]

PosteriorDB.jl: A Julia package to work with posteriordb

Axen, S. (2026). “PosteriorDB.jl: A Julia package to work with posteriordb.” GitHub Repository: https://github.com/sethaxen/PosteriorDB.jl. Version 0.6.0

2026

[2] [2]

Layer Normalization

Ba, J., Kiros, J., and Hinton, G. (2016). “Layer normalization.”arXiv:1607.06450

work page internal anchor Pith review Pith/arXiv arXiv 2016

[3] [3]

Online learning rate adaptation with hypergradient descent

Baydin, A., Cornish, R., Rubio, D., Schmidt, M., and Wood, F. (2018). “Online learning rate adaptation with hypergradient descent.” InProceedings of the International Conference on Learning Representations

2018

[4] [4]

Training neural networks for and by interpolation

Berrada, L., Zisserman, A., and Kumar, M. (2020). “Training neural networks for and by interpolation.” InProceedings of the International Conference on Machine Learning, volume 119 ofPMLR, 799–809

2020

[5] [5]

Julia: A fresh approach to numerical computing

Bezanson, J., Edelman, A., Karpinski, S., and Shah, V. B. (2017). “Julia: A fresh approach to numerical computing.”SIAM review, 59(1): 65–98

2017

[6] [6]

Variational inference: A review for statisticians

Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017). “Variational inference: A review for statisticians.”Journal of the American Statistical Association, 112(518): 859–877

2017

[7] [7]

On-line learning and stochastic approximations

Bottou, L. (1999). “On-line learning and stochastic approximations.” InOn-Line Learning in Neural Networks, 9–42. Cambridge University Press, 1 edition

1999

[8] [8]

Optimization methods for large-scale machine learning

Bottou, L., Curtis, F. E., and Nocedal, J. (2018). “Optimization methods for large-scale machine learning.”SIAM Review, 60(2): 223–311

2018

[9] [9]

Quasi-Monte Carlo variational inference

Buchholz, A., Wenzel, F., and Mandt, S. (2018). “Quasi-Monte Carlo variational inference.” InProceedings of the International Conference on Machine Learning, volume 80 ofPMLR, 668–677. JMLR

2018

[10] [10]

Importance weighted autoencoders

Burda, Y., Grosse, R., and Salakhutdinov, R. (2015). “Importance weighted autoencoders.” InProceedings of the International Conference on Learning Repre- sentations

2015

[11] [11]

Sample average approximation for black-box variational inference

Burroni, J., Domke, J., and Sheldon, D. (2024). “Sample average approximation for black-box variational inference.” InProceedings of the Conference on Uncertainty in Artificial Intelligence, volume 244 ofPMLR, 471–498. JMLR

2024

[12] [12]

EigenVI: Score-based variational inference with orthogonal function expansions

Cai, D., Modi, C., Margossian, C., Gower, R., Blei, D., and Saul, L. (2024). “EigenVI: Score-based variational inference with orthogonal function expansions.” InAdvances in Neural Information Processing Systems, 132691–132721. Curran Associates, Inc

2024

[13] [13]

Batch and Match: Black-box variational inference with a score-based T. Campbell, J. H. Huggins, K. Kim, C. C. Margossian29 divergence

Cai, D., Modi, C., Pillaud-Vivien, L., Margossian, C., Gower, R., Blei, D., and Saul, L. (2024). “Batch and Match: Black-box variational inference with a score-based T. Campbell, J. H. Huggins, K. Kim, C. C. Margossian29 divergence.” InProceedings International Conference on Machine Learning, volume 235 ofPMLR, 5258–5297. JMLR

2024

[14] [14]

Making SGD parameter-free

Carmon, Y. and Hinder, O. (2022). “Making SGD parameter-free.” InProceedings of the Conference on Learning Theory, volume 178 ofPMLR, 2360–2389

2022

[15] [15]

Algorithms for computing the sample variance: Analysis and recommendations

Chan, T. F., Golub, G. H., and LeVeque, R. J. (1983). “Algorithms for computing the sample variance: Analysis and recommendations.”The American Statistician, 37(3): 242–247

1983

[16] [16]

Symbolic discovery of optimization algorithms

Chen, X., Liang, C., Huang, D., Real, E., Wang, K., Liu, Y., Pham, H., Dong, X., Luong, T., Hsieh, C.-J., Lu, Y., and Le, Q. (2023). “Symbolic discovery of optimization algorithms.” InAdvances in Neural Information Processing Systems, volume 36, 49205–49233. Curran Associates, Inc

2023

[17] [17]

Mechanic: A learning rate tuner

Cutkosky, A., Defazio, A., and Mehta, H. (2023). “Mechanic: A learning rate tuner.” InAdvances in Neural Information Processing Systems, volume 36, 47828–47848. Curran Associates, Inc

2023

[18] [18]

The Helmholtz Machine

Dayan, P., Hinton, G. E., Neal, R. M., and Zemel, R. S. (1995). “The Helmholtz Machine.”Neural Computation, 7(5): 889–904

1995

[19] [19]

Big Batch SGD: Auto- mated inference using adaptive batch sizes

De, S., Yadav, A., Jacobs, D., and Goldstein, T. (2017). “Big Batch SGD: Auto- mated inference using adaptive batch sizes.” InProceedings of the International Conference on Artificial Intelligence and Statistics, volume 52 ofPMLR, 1504–1513. JMLR

2017

[20] [20]

SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives

Defazio, A., Bach, F., and Lacoste-Julien, S. (2014). “SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives.” In Advances in Neural Information Processing Systems, volume 27, 1646–1654. Curran Associates, Inc

2014

[21] [21]

Learning-rate-free learning by D- adaptation

Defazio, A. and Mishchenko, K. (2023). “Learning-rate-free learning by D- adaptation.” InProceedings of the International Conference on Machine Learning, volume 202 ofPMLR, 7449–7479. JMLR

2023

[22] [22]

Robust, accurate stochastic optimization for variational inference

Dhaka, A. K., Catalina, A., Andersen, M. R., ns Magnusson, M., Huggins, J., and Vehtari, A. (2020). “Robust, accurate stochastic optimization for variational inference.” InAdvances in Neural Information Processing Systems, volume 33, 10961–10973. Curran Associates, Inc

2020

[23] [23]

Challenges and opportunities in high-dimensional variational inference

Dhaka, A. K., Catalina, A., Welandawe, M., Andersen, M. R., Huggins, J., and Vehtari, A. (2021). “Challenges and opportunities in high-dimensional variational inference.” InAdvances in Neural Information Processing Systems, volume 34, 7787–7798. Curran Associates, Inc

2021

[24] [24]

Forward- backward Gaussian variational inference via JKO in the Bures-Wasserstein space

Diao, M. Z., Balasubramanian, K., Chewi, S., and Salim, A. (2023). “Forward- backward Gaussian variational inference via JKO in the Bures-Wasserstein space.” InProceedings of the International Conference on Machine Learning, volume 202 ofPMLR, 7960–7991. JMLR

2023

[25] [25]

Variational 30Default optimizers for variational inference inference viaχ-upper bound minimization

Dieng, A. B., Tran, D., Ranganath, R., Paisley, J., and Blei, D. (2017). “Variational 30Default optimizers for variational inference inference viaχ-upper bound minimization.” InAdvances in Neural Information Processing Systems, volume 30, 2729–2738. Curran Associates, Inc

2017

[26] [26]

bridging the gap between constant step size stochastic gradient descent and Markov chains

Dieuleveut, A., Durmus, A., and Bach, F. (2020). “bridging the gap between constant step size stochastic gradient descent and Markov chains.”The Annals of Statistics, 48(3): 1348 – 1382

2020

[27] [27]

Provable gradient variance guarantees for black-box variational inference

Domke, J. (2019). “Provable gradient variance guarantees for black-box variational inference.” InAdvances in Neural Information Processing Systems, volume 32, 329–338. Curran Associates, Inc

2019

[28] [28]

Provable smoothness guarantees for black-box variational inference

— (2020). “Provable smoothness guarantees for black-box variational inference.” InProceedings of the International Conference on Machine Learning, volume 119 ofPMLR, 2587–2596. JMLR

2020

[29] [29]

Provable convergence guarantees for black-box variational inference

Domke, J., Gower, R., and Garrigos, G. (2023). “Provable convergence guarantees for black-box variational inference.” InAdvances in Neural Information Processing Systems, volume 36, 66289–66327. Curran Associates, Inc

2023

[30] [30]

Importance weighting and variational inference

Domke, J. and Sheldon, D. R. (2018). “Importance weighting and variational inference.” InAdvances in Neural Information Processing Systems, volume 31, 4470–4479. Curran Associates, Inc

2018

[31] [31]

Divide and Couple: using Monte Carlo variational objectives for posterior approximation

— (2019). “Divide and Couple: using Monte Carlo variational objectives for posterior approximation.” InAdvances in Neural Information Processing Systems, volume 32, 339–349. Curran Associates, Inc

2019

[32] [32]

Adaptive subgradient methods for online learning and stochastic optimization

Duchi, J., Hazan, E., and Singer, Y. (2011). “Adaptive subgradient methods for online learning and stochastic optimization.”Journal of Machine Learning Research, 12: 2121–2159

2011

[33] [33]

Batch means and spectral variance estimators in Markov chain Monte Carlo

Flegal, J. M. and Jones, G. L. (2010). “Batch means and spectral variance estimators in Markov chain Monte Carlo.”The Annals of Statistics, 38(2): 1034– 1070

2010

[34] [34]

Multilevel Monte Carlo variational inference

Fujisawa, M. and Sato, I. (2021). “Multilevel Monte Carlo variational inference.” Journal of Machine Learning Research, 22(278): 1–44

2021

[35] [35]

Don’t be so monotone: Relax- ing stochastic line search in over-parametrized models

Galli, L., Rauhut, H., and Schmidt, M. (2023). “Don’t be so monotone: Relax- ing stochastic line search in over-parametrized models.” InAdvances in Neural Information Processing Systems, volume 36, 34752–34764. Curran Associates, Inc

2023

[36] [36]

Empirical evaluation of biased methods for alpha divergence minimization

Geffner, T. and Domke, J. (2021). “Empirical evaluation of biased methods for alpha divergence minimization.” InProceedings of the Symposium on Advances in Approximate Bayesian Inference

2021

[37] [37]

MCMC variational inference via uncorrected Hamiltonian annealing

— (2021). “MCMC variational inference via uncorrected Hamiltonian annealing.” InAdvances in Neural Information Processing Systems, volume 34, 639–651. Curran Associates, Inc

2021

[38] [38]

On the difficulty of unbiased alpha divergence minimization

— (2021). “On the difficulty of unbiased alpha divergence minimization.” In Proceedings of the International Conference on Machine Learning, volume 139 of PMLR, 3650–3659. JMLR. T. Campbell, J. H. Huggins, K. Kim, C. C. Margossian31

2021

[39] [39]

Inference from iterative simulation using multiple sequences

Gelman, A. and Rubin, D. (1992). “Inference from iterative simulation using multiple sequences.”Statistical Science, 7(4): 457–511

1992

[40] [40]

Black box variational inference with a deterministic objective: Faster, more accurate, and even more black box

Giordano, R., Ingram, M., and Broderick, T. (2024). “Black box variational inference with a deterministic objective: Faster, more accurate, and even more black box.”Journal of Machine Learning Research, 25: 1–39

2024

[41] [41]

Practical variational inference for neural networks

Graves, A. (2011). “Practical variational inference for neural networks.” In Advances in Neural Information Processing Systems, volume 24, 2348–2356. Curran Associates, Inc

2011

[42] [42]

Shampoo: Preconditioned stochastic tensor optimization

Gupta, V., Koren, T., and Singer, Y. (2018). “Shampoo: Preconditioned stochastic tensor optimization.” InProceedings of the International Conference on Machine Learning, volume 80 ofPMLR, 1842–1850. JMLR

2018

[43] [43]

Revisiting the Polyak step size

Hazan, E. and Kakade, S. (2019). “Revisiting the Polyak step size.” arXiv:1905.00313

work page arXiv 2019

[44] [44]

Deep residual learning for image recognition

He, K., Zhang, X., Ren, S., and Sun, J. (2016). “Deep residual learning for image recognition.” InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778

2016

[45] [45]

Black-Box alpha divergence minimization

Hernandez-Lobato, J., Li, Y., Rowland, M., Bui, T., Hernandez-Lobato, D., and Turner, R. (2016). “Black-Box alpha divergence minimization.” InProceedings of the International Conference on Machine Learning, volume 48 ofPMLR, 1511–1520. JMLR

2016

[46] [46]

Keeping the neural networks simple by minimizing the description length of the weights

Hinton, G. E. and van Camp, D. (1993). “Keeping the neural networks simple by minimizing the description length of the weights.” InProceedings of the Annual Conference on Computational Learning Theory, 5–13. ACM Press

1993

[47] [47]

Perturbation analysis and optimization of queueing networks

Ho, Y. C. and Cao, X. (1983). “Perturbation analysis and optimization of queueing networks.”Journal of Optimization Theory and Applications, 40(4): 559–582

1983

[48] [48]

Validated variational inference via practical posterior error bounds

Huggins, J., Kasprzak, M., Campbell, T., and Broderick, T. (2020). “Validated variational inference via practical posterior error bounds.” InProceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 ofPMLR, 1792–1802. JMLR

2020

[49] [49]

Variational Inference using Implicit Distributions

Huszár, F. (2017). “Variational inference using implicit distributions.” arXiv:1702.08235

work page internal anchor Pith review Pith/arXiv arXiv 2017

[50] [50]

DoG is SGD’s best friend: A parameter-free dynamic step size schedule

Ivgi, M., Hinder, O., and Carmon, Y. (2023). “DoG is SGD’s best friend: A parameter-free dynamic step size schedule.” InProceedings of the International Conference on Machine Learning, volume 202 ofPMLR, 14465–14499. JMLR

2023

[51] [51]

Accelerating stochastic gradient descent using predictive variance reduction

Johnson, R. and Zhang, T. (2013). “Accelerating stochastic gradient descent using predictive variance reduction.” InAdvances in Neural Information Processing Systems, volume 26, 315–323. Curran Associates, Inc

2013
