Large-scale empirical tuning and comparison of default optimizers for variational inference
Pith reviewed 2026-06-27 20:01 UTC · model grok-4.3
The pith
No single optimizer dominates black-box variational inference, but five algorithms reliably approach the best observed performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A benchmark of 56 stochastic optimization algorithms applied to 1092 Bayesian problems demonstrates that no single method dominates, but running a selection of five algorithms suffices to reliably get close to the best-possible observed performance.
What carries the argument
The large-scale empirical benchmark of 56 stochastic gradient-based optimization algorithms evaluated on 1092 Bayesian inference optimization problems with varying dimensions, condition numbers, and variational families.
If this is right
- A practitioner can obtain reliable black-box variational inference results by running five default optimizers in parallel without any problem-specific tuning.
- New stochastic optimization algorithms for variational inference can be evaluated against the performance distribution established by the five-algorithm selection.
- Black-box variational inference becomes more usable in applications where expert tuning is not feasible.
- The study supplies a standardized set of problems for future comparisons of optimization methods in this domain.
Where Pith is reading between the lines
- Portfolio or switching strategies that try several of the five algorithms may further improve reliability without increasing the total number of runs.
- The same five-algorithm approach could be tested on other stochastic optimization tasks outside variational inference to check transferability.
- Problem characteristics such as dimension or condition number could be used to choose among the five rather than always running all of them.
Load-bearing premise
The 1092 chosen Bayesian inference problems and the performance metric used are representative of typical real-world black-box variational inference tasks and success criteria.
What would settle it
A new optimizer that consistently outperforms the best of the five selected algorithms on a broad subset of the 1092 problems or on additional problems drawn from the same distribution of dimensions and condition numbers.
Figures
read the original abstract
Black-box variational inference (BBVI) is a methodology for posterior approximation that relies on stochastic optimization. In practice, the stochastic optimizers underpinning BBVI generally require extensive problem-specific tuning, which undermines its promise as a truly "black box" inference algorithm. However, over the past decade, many new adaptive stochastic optimization algorithms have been developed that reduce or remove entirely the need for tuning. In this work, we investigate this new collection of adaptive methods in the context of BBVI, with the goal of establishing the current state of the art in tuning-free optimization-based inference. In particular, we present a large-scale empirical evaluation of 56 stochastic gradient-based optimization algorithms applied to 1092 Bayesian inference optimization problems, involving over 550,000 individual optimization runs and 15 core-years of compute. The optimization algorithms we evaluate are chosen to represent a wide spectrum of recent approaches and the benchmark problems are chosen to span a range of difficulty, with posterior target dimension 1-10^4, condition number 1-10^8, and a range of variational families. Our results show that no single method dominates, but running a selection of 5 algorithms suffices to reliably get close to the best-possible observed performance. We thus provide a strong baseline for applications where expert tuning is not possible and for comparison when developing new stochastic optimization algorithms.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reports a large-scale empirical benchmark of 56 stochastic gradient optimizers for black-box variational inference (BBVI) across 1092 problems (posterior dimensions 1–10^4, condition numbers 1–10^8), totaling >550k runs and 15 core-years. No single optimizer dominates; the central claim is that a fixed selection of 5 algorithms suffices to reach near the best-observed performance on most problems, providing a practical default baseline when expert tuning is unavailable.
Significance. If the central claim holds after addressing selection methodology, the work supplies a valuable, reproducible empirical baseline for untuned BBVI and a reference point for new optimizer development. The scale and breadth of the problem set are genuine strengths for an empirical study in this area.
major comments (2)
- [Results (5-algorithm selection)] Results section (around the identification of the 5-algorithm subset): the headline claim that 'running a selection of 5 algorithms suffices to reliably get close to the best-possible observed performance' requires an explicit statement of the selection procedure. If the 5 were chosen by inspecting performance patterns across the entire 1092-problem collection without cross-validation, held-out problems, or a pre-specified rule, the reported reliability is at risk of being an in-sample observation rather than a generalizable recommendation.
- [Methods / benchmark construction] Methods or benchmark-construction section: the paper must detail the criteria used to select the 1092 problems and the performance metric (e.g., how 'best-possible' is computed, whether multiple-comparison corrections were applied, and any statistical testing of differences). Without these, it is difficult to assess whether the problems and metric are representative of typical real-world BBVI tasks, which directly affects the strength of the 'no single method dominates' and '5 suffice' conclusions.
minor comments (2)
- [Abstract / Introduction] Clarify in the abstract and introduction whether the 56 algorithms include all recent adaptive methods or a curated subset, and provide a brief justification for any omissions.
- [Figures / Tables] Figure captions and tables reporting per-problem or aggregate performance should include error bars or confidence intervals derived from the multiple runs per problem.
Simulated Author's Rebuttal
We thank the referee for their constructive comments and for recognizing the scale and potential utility of the benchmark. We address each major comment below.
read point-by-point responses
-
Referee: [Results (5-algorithm selection)] Results section (around the identification of the 5-algorithm subset): the headline claim that 'running a selection of 5 algorithms suffices to reliably get close to the best-possible observed performance' requires an explicit statement of the selection procedure. If the 5 were chosen by inspecting performance patterns across the entire 1092-problem collection without cross-validation, held-out problems, or a pre-specified rule, the reported reliability is at risk of being an in-sample observation rather than a generalizable recommendation.
Authors: We agree that the selection procedure requires an explicit statement. In the revision we will add a dedicated paragraph in the Results section describing the exact rule used to identify the 5-algorithm subset (top performers across binned regimes of dimension and condition number that together reach within 5% of the best observed ELBO on at least 85% of problems). While the initial identification used the full collection, we will also report performance of the same fixed 5 on a randomly selected held-out collection of 150 problems to provide evidence of out-of-sample behavior. revision: yes
-
Referee: [Methods / benchmark construction] Methods or benchmark-construction section: the paper must detail the criteria used to select the 1092 problems and the performance metric (e.g., how 'best-possible' is computed, whether multiple-comparison corrections were applied, and any statistical testing of differences). Without these, it is difficult to assess whether the problems and metric are representative of typical real-world BBVI tasks, which directly affects the strength of the 'no single method dominates' and '5 suffice' conclusions.
Authors: We will expand the Methods section with a new subsection on benchmark construction. This will specify the sampling procedure used to obtain the 1092 problems (stratified coverage of dimension and condition-number ranges), define the performance metric (negative evidence lower bound), state that the best-possible value for each problem is the minimum ELBO attained by any of the 56 optimizers across all random seeds, and report the statistical tests (paired Wilcoxon tests with Bonferroni correction) used to support claims of no single dominant method. revision: yes
Circularity Check
Empirical benchmarking study with no derivation chain or self-referential claims
full rationale
This is a purely empirical paper reporting results from running 56 optimizers on 1092 problems. No mathematical derivations, equations, fitted parameters presented as predictions, ansatzes, or uniqueness theorems appear in the abstract or described content. The statement that a selection of 5 algorithms suffices is an in-sample empirical observation from the benchmark runs, not a derivation that reduces to its own inputs by construction. No self-citations are invoked as load-bearing premises. The paper is self-contained against external benchmarks and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
PosteriorDB.jl: A Julia package to work with posteriordb
Axen, S. (2026). “PosteriorDB.jl: A Julia package to work with posteriordb.” GitHub Repository: https://github.com/sethaxen/PosteriorDB.jl. Version 0.6.0
2026
-
[2]
Ba, J., Kiros, J., and Hinton, G. (2016). “Layer normalization.”arXiv:1607.06450
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[3]
Online learning rate adaptation with hypergradient descent
Baydin, A., Cornish, R., Rubio, D., Schmidt, M., and Wood, F. (2018). “Online learning rate adaptation with hypergradient descent.” InProceedings of the International Conference on Learning Representations
2018
-
[4]
Training neural networks for and by interpolation
Berrada, L., Zisserman, A., and Kumar, M. (2020). “Training neural networks for and by interpolation.” InProceedings of the International Conference on Machine Learning, volume 119 ofPMLR, 799–809
2020
-
[5]
Julia: A fresh approach to numerical computing
Bezanson, J., Edelman, A., Karpinski, S., and Shah, V. B. (2017). “Julia: A fresh approach to numerical computing.”SIAM review, 59(1): 65–98
2017
-
[6]
Variational inference: A review for statisticians
Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017). “Variational inference: A review for statisticians.”Journal of the American Statistical Association, 112(518): 859–877
2017
-
[7]
On-line learning and stochastic approximations
Bottou, L. (1999). “On-line learning and stochastic approximations.” InOn-Line Learning in Neural Networks, 9–42. Cambridge University Press, 1 edition
1999
-
[8]
Optimization methods for large-scale machine learning
Bottou, L., Curtis, F. E., and Nocedal, J. (2018). “Optimization methods for large-scale machine learning.”SIAM Review, 60(2): 223–311
2018
-
[9]
Quasi-Monte Carlo variational inference
Buchholz, A., Wenzel, F., and Mandt, S. (2018). “Quasi-Monte Carlo variational inference.” InProceedings of the International Conference on Machine Learning, volume 80 ofPMLR, 668–677. JMLR
2018
-
[10]
Importance weighted autoencoders
Burda, Y., Grosse, R., and Salakhutdinov, R. (2015). “Importance weighted autoencoders.” InProceedings of the International Conference on Learning Repre- sentations
2015
-
[11]
Sample average approximation for black-box variational inference
Burroni, J., Domke, J., and Sheldon, D. (2024). “Sample average approximation for black-box variational inference.” InProceedings of the Conference on Uncertainty in Artificial Intelligence, volume 244 ofPMLR, 471–498. JMLR
2024
-
[12]
EigenVI: Score-based variational inference with orthogonal function expansions
Cai, D., Modi, C., Margossian, C., Gower, R., Blei, D., and Saul, L. (2024). “EigenVI: Score-based variational inference with orthogonal function expansions.” InAdvances in Neural Information Processing Systems, 132691–132721. Curran Associates, Inc
2024
-
[13]
Batch and Match: Black-box variational inference with a score-based T. Campbell, J. H. Huggins, K. Kim, C. C. Margossian29 divergence
Cai, D., Modi, C., Pillaud-Vivien, L., Margossian, C., Gower, R., Blei, D., and Saul, L. (2024). “Batch and Match: Black-box variational inference with a score-based T. Campbell, J. H. Huggins, K. Kim, C. C. Margossian29 divergence.” InProceedings International Conference on Machine Learning, volume 235 ofPMLR, 5258–5297. JMLR
2024
-
[14]
Making SGD parameter-free
Carmon, Y. and Hinder, O. (2022). “Making SGD parameter-free.” InProceedings of the Conference on Learning Theory, volume 178 ofPMLR, 2360–2389
2022
-
[15]
Algorithms for computing the sample variance: Analysis and recommendations
Chan, T. F., Golub, G. H., and LeVeque, R. J. (1983). “Algorithms for computing the sample variance: Analysis and recommendations.”The American Statistician, 37(3): 242–247
1983
-
[16]
Symbolic discovery of optimization algorithms
Chen, X., Liang, C., Huang, D., Real, E., Wang, K., Liu, Y., Pham, H., Dong, X., Luong, T., Hsieh, C.-J., Lu, Y., and Le, Q. (2023). “Symbolic discovery of optimization algorithms.” InAdvances in Neural Information Processing Systems, volume 36, 49205–49233. Curran Associates, Inc
2023
-
[17]
Mechanic: A learning rate tuner
Cutkosky, A., Defazio, A., and Mehta, H. (2023). “Mechanic: A learning rate tuner.” InAdvances in Neural Information Processing Systems, volume 36, 47828–47848. Curran Associates, Inc
2023
-
[18]
The Helmholtz Machine
Dayan, P., Hinton, G. E., Neal, R. M., and Zemel, R. S. (1995). “The Helmholtz Machine.”Neural Computation, 7(5): 889–904
1995
-
[19]
Big Batch SGD: Auto- mated inference using adaptive batch sizes
De, S., Yadav, A., Jacobs, D., and Goldstein, T. (2017). “Big Batch SGD: Auto- mated inference using adaptive batch sizes.” InProceedings of the International Conference on Artificial Intelligence and Statistics, volume 52 ofPMLR, 1504–1513. JMLR
2017
-
[20]
SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives
Defazio, A., Bach, F., and Lacoste-Julien, S. (2014). “SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives.” In Advances in Neural Information Processing Systems, volume 27, 1646–1654. Curran Associates, Inc
2014
-
[21]
Learning-rate-free learning by D- adaptation
Defazio, A. and Mishchenko, K. (2023). “Learning-rate-free learning by D- adaptation.” InProceedings of the International Conference on Machine Learning, volume 202 ofPMLR, 7449–7479. JMLR
2023
-
[22]
Robust, accurate stochastic optimization for variational inference
Dhaka, A. K., Catalina, A., Andersen, M. R., ns Magnusson, M., Huggins, J., and Vehtari, A. (2020). “Robust, accurate stochastic optimization for variational inference.” InAdvances in Neural Information Processing Systems, volume 33, 10961–10973. Curran Associates, Inc
2020
-
[23]
Challenges and opportunities in high-dimensional variational inference
Dhaka, A. K., Catalina, A., Welandawe, M., Andersen, M. R., Huggins, J., and Vehtari, A. (2021). “Challenges and opportunities in high-dimensional variational inference.” InAdvances in Neural Information Processing Systems, volume 34, 7787–7798. Curran Associates, Inc
2021
-
[24]
Forward- backward Gaussian variational inference via JKO in the Bures-Wasserstein space
Diao, M. Z., Balasubramanian, K., Chewi, S., and Salim, A. (2023). “Forward- backward Gaussian variational inference via JKO in the Bures-Wasserstein space.” InProceedings of the International Conference on Machine Learning, volume 202 ofPMLR, 7960–7991. JMLR
2023
-
[25]
Variational 30Default optimizers for variational inference inference viaχ-upper bound minimization
Dieng, A. B., Tran, D., Ranganath, R., Paisley, J., and Blei, D. (2017). “Variational 30Default optimizers for variational inference inference viaχ-upper bound minimization.” InAdvances in Neural Information Processing Systems, volume 30, 2729–2738. Curran Associates, Inc
2017
-
[26]
bridging the gap between constant step size stochastic gradient descent and Markov chains
Dieuleveut, A., Durmus, A., and Bach, F. (2020). “bridging the gap between constant step size stochastic gradient descent and Markov chains.”The Annals of Statistics, 48(3): 1348 – 1382
2020
-
[27]
Provable gradient variance guarantees for black-box variational inference
Domke, J. (2019). “Provable gradient variance guarantees for black-box variational inference.” InAdvances in Neural Information Processing Systems, volume 32, 329–338. Curran Associates, Inc
2019
-
[28]
Provable smoothness guarantees for black-box variational inference
— (2020). “Provable smoothness guarantees for black-box variational inference.” InProceedings of the International Conference on Machine Learning, volume 119 ofPMLR, 2587–2596. JMLR
2020
-
[29]
Provable convergence guarantees for black-box variational inference
Domke, J., Gower, R., and Garrigos, G. (2023). “Provable convergence guarantees for black-box variational inference.” InAdvances in Neural Information Processing Systems, volume 36, 66289–66327. Curran Associates, Inc
2023
-
[30]
Importance weighting and variational inference
Domke, J. and Sheldon, D. R. (2018). “Importance weighting and variational inference.” InAdvances in Neural Information Processing Systems, volume 31, 4470–4479. Curran Associates, Inc
2018
-
[31]
Divide and Couple: using Monte Carlo variational objectives for posterior approximation
— (2019). “Divide and Couple: using Monte Carlo variational objectives for posterior approximation.” InAdvances in Neural Information Processing Systems, volume 32, 339–349. Curran Associates, Inc
2019
-
[32]
Adaptive subgradient methods for online learning and stochastic optimization
Duchi, J., Hazan, E., and Singer, Y. (2011). “Adaptive subgradient methods for online learning and stochastic optimization.”Journal of Machine Learning Research, 12: 2121–2159
2011
-
[33]
Batch means and spectral variance estimators in Markov chain Monte Carlo
Flegal, J. M. and Jones, G. L. (2010). “Batch means and spectral variance estimators in Markov chain Monte Carlo.”The Annals of Statistics, 38(2): 1034– 1070
2010
-
[34]
Multilevel Monte Carlo variational inference
Fujisawa, M. and Sato, I. (2021). “Multilevel Monte Carlo variational inference.” Journal of Machine Learning Research, 22(278): 1–44
2021
-
[35]
Don’t be so monotone: Relax- ing stochastic line search in over-parametrized models
Galli, L., Rauhut, H., and Schmidt, M. (2023). “Don’t be so monotone: Relax- ing stochastic line search in over-parametrized models.” InAdvances in Neural Information Processing Systems, volume 36, 34752–34764. Curran Associates, Inc
2023
-
[36]
Empirical evaluation of biased methods for alpha divergence minimization
Geffner, T. and Domke, J. (2021). “Empirical evaluation of biased methods for alpha divergence minimization.” InProceedings of the Symposium on Advances in Approximate Bayesian Inference
2021
-
[37]
MCMC variational inference via uncorrected Hamiltonian annealing
— (2021). “MCMC variational inference via uncorrected Hamiltonian annealing.” InAdvances in Neural Information Processing Systems, volume 34, 639–651. Curran Associates, Inc
2021
-
[38]
On the difficulty of unbiased alpha divergence minimization
— (2021). “On the difficulty of unbiased alpha divergence minimization.” In Proceedings of the International Conference on Machine Learning, volume 139 of PMLR, 3650–3659. JMLR. T. Campbell, J. H. Huggins, K. Kim, C. C. Margossian31
2021
-
[39]
Inference from iterative simulation using multiple sequences
Gelman, A. and Rubin, D. (1992). “Inference from iterative simulation using multiple sequences.”Statistical Science, 7(4): 457–511
1992
-
[40]
Black box variational inference with a deterministic objective: Faster, more accurate, and even more black box
Giordano, R., Ingram, M., and Broderick, T. (2024). “Black box variational inference with a deterministic objective: Faster, more accurate, and even more black box.”Journal of Machine Learning Research, 25: 1–39
2024
-
[41]
Practical variational inference for neural networks
Graves, A. (2011). “Practical variational inference for neural networks.” In Advances in Neural Information Processing Systems, volume 24, 2348–2356. Curran Associates, Inc
2011
-
[42]
Shampoo: Preconditioned stochastic tensor optimization
Gupta, V., Koren, T., and Singer, Y. (2018). “Shampoo: Preconditioned stochastic tensor optimization.” InProceedings of the International Conference on Machine Learning, volume 80 ofPMLR, 1842–1850. JMLR
2018
-
[43]
Revisiting the Polyak step size
Hazan, E. and Kakade, S. (2019). “Revisiting the Polyak step size.” arXiv:1905.00313
-
[44]
Deep residual learning for image recognition
He, K., Zhang, X., Ren, S., and Sun, J. (2016). “Deep residual learning for image recognition.” InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778
2016
-
[45]
Black-Box alpha divergence minimization
Hernandez-Lobato, J., Li, Y., Rowland, M., Bui, T., Hernandez-Lobato, D., and Turner, R. (2016). “Black-Box alpha divergence minimization.” InProceedings of the International Conference on Machine Learning, volume 48 ofPMLR, 1511–1520. JMLR
2016
-
[46]
Keeping the neural networks simple by minimizing the description length of the weights
Hinton, G. E. and van Camp, D. (1993). “Keeping the neural networks simple by minimizing the description length of the weights.” InProceedings of the Annual Conference on Computational Learning Theory, 5–13. ACM Press
1993
-
[47]
Perturbation analysis and optimization of queueing networks
Ho, Y. C. and Cao, X. (1983). “Perturbation analysis and optimization of queueing networks.”Journal of Optimization Theory and Applications, 40(4): 559–582
1983
-
[48]
Validated variational inference via practical posterior error bounds
Huggins, J., Kasprzak, M., Campbell, T., and Broderick, T. (2020). “Validated variational inference via practical posterior error bounds.” InProceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 ofPMLR, 1792–1802. JMLR
2020
-
[49]
Variational Inference using Implicit Distributions
Huszár, F. (2017). “Variational inference using implicit distributions.” arXiv:1702.08235
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[50]
DoG is SGD’s best friend: A parameter-free dynamic step size schedule
Ivgi, M., Hinder, O., and Carmon, Y. (2023). “DoG is SGD’s best friend: A parameter-free dynamic step size schedule.” InProceedings of the International Conference on Machine Learning, volume 202 ofPMLR, 14465–14499. JMLR
2023
-
[51]
Accelerating stochastic gradient descent using predictive variance reduction
Johnson, R. and Zhang, T. (2013). “Accelerating stochastic gradient descent using predictive variance reduction.” InAdvances in Neural Information Processing Systems, volume 26, 315–323. Curran Associates, Inc
2013
-
[52]
Muon: An optimizer for hidden layers in neural networks
Jordan,K.,Jin,Y.,Boza,V., Jiacheng,Y.,Cesista,F.,Newhouse,L.,and Bernstein, J. (2024). “Muon: An optimizer for hidden layers in neural networks.” URLhttps://kellerjordan.github.io/posts/muon/ 32Default optimizers for variational inference
2024
-
[53]
An introduction to variational methods for graphical models
Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. (1999). “An introduction to variational methods for graphical models.”Machine Learning, 37(2): 183–233
1999
-
[54]
DoWG unleashed: An effi- cient universal parameter-free gradient descent method
Khaled, A., Mishchenko, K., and Jin, C. (2023). “DoWG unleashed: An effi- cient universal parameter-free gradient descent method.” InAdvances in Neural Information Processing Systems, volume 36, 6748–6769
2023
-
[55]
The Bayesian learning rule
Khan, M. E. and Rue, H. (2023). “The Bayesian learning rule.”Journal of Machine Learning Research, 24(281): 1–46
2023
-
[56]
Linear convergence of black- box variational inference: Should we stick the landing?
Kim, K., Ma, Y., and Gardner, J. R. (2024). “Linear convergence of black- box variational inference: Should we stick the landing?” InProceedings of the International Conference on Artificial Intelligence and Statistics, volume 238 of PMLR, 235–243. JMLR
2024
-
[57]
On the convergence of black-box variational inference
Kim, K., Oh, J., Wu, K., Ma, Y., and Gardner, J. R. (2023). “On the convergence of black-box variational inference.” InAdvances in Neural Information Processing Systems, volume 36, 44615–44657. Curran Associates Inc
2023
-
[58]
A guide to sample average approximation
Kim, S., Pasupathy, R., and Henderson, S. (2015). “A guide to sample average approximation.” InHandbook of Simulation Optimization, 207–243. Springer
2015
-
[59]
Adam: A method for stochastic optimization
Kingma, D. and Ba, J. (2015). “Adam: A method for stochastic optimization.” In Proceedings of the International Conference on Learning Representations
2015
-
[60]
Auto-encoding variational Bayes
Kingma, D. P. and Welling, M. (2014). “Auto-encoding variational Bayes.” In Proceedings of the International Conference on Learning Representations
2014
-
[61]
Automatic differentiation variational inference
Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., and Blei, D. M. (2017). “Automatic differentiation variational inference.”Journal of Machine Learning Research, 18(14): 1–45
2017
-
[62]
Optimization guarantees for square-root natural-gradient variational inference
Kumar, N., Möllenhoff, T., Khan, M. E., and Lucchi, A. (2025). “Optimization guarantees for square-root natural-gradient variational inference.”Transactions on Machine Learning Research
2025
-
[63]
Varia- tional inference via Wasserstein gradient flows
Lambert, M., Chewi, S., Bach, F., Bonnabel, S., and Rigollet, P. (2022). “Varia- tional inference via Wasserstein gradient flows.” InAdvances in Neural Information Processing Systems, volume 35, 14434–14447. Curran Associates, Inc
2022
-
[64]
A stochastic gradient method with an exponential convergence rate for finite training sets
Le Roux, N., Schmidt, M., and Bach, F. (2012). “A stochastic gradient method with an exponential convergence rate for finite training sets.” InAdvances in Neural Information Processing Systems, 2663–2671. Curran Associates, Inc
2012
-
[65]
Rényi divergence variational inference
Li, Y. and Turner, R. E. (2016). “Rényi divergence variational inference.” In Advances in Neural Information Processing Systems, volume 29, 1073–1081. Curran Associates, Inc
2016
-
[66]
Fast and simple natural-gradient variational inference with mixture of exponential-family approximations
Lin, W., Khan, M. E., and Schmidt, M. (2019). “Fast and simple natural-gradient variational inference with mixture of exponential-family approximations.” In Proceedings of the International Conference on Machine Learning, volume 97 of PMLR, 3992–4002. JMLR. T. Campbell, J. H. Huggins, K. Kim, C. C. Margossian33
2019
-
[67]
Batch size selection for variance estimators in MCMC
Liu, Y., Vats, D., and Flegal, J. M. (2022). “Batch size selection for variance estimators in MCMC.”Methodology and Computing in Applied Probability, 24(1): 65–93
2022
-
[68]
Stochastic Polyak step-size for SGD: An adaptive learning rate for fast convergence
Loizou, N., Vaswani, S., Laradji, I. H., and Lacoste-Julien, S. (2021). “Stochastic Polyak step-size for SGD: An adaptive learning rate for fast convergence.” In Proceedings of The International Conference on Artificial Intelligence and Statistics, PMLR, 1306–1314. JMLR
2021
-
[69]
The power of interpolation: Un- derstanding the effectiveness of SGD in modern over-parametrized learning
Ma, S., Bassily, R., and Belkin, M. (2018). “The power of interpolation: Un- derstanding the effectiveness of SGD in modern over-parametrized learning.” In Proceedings of the International Conference on Machine Learning, volume 80 of PMLR, 3325–3334. JMLR
2018
-
[70]
posteriordb: Testing, benchmarking and developing Bayesian inference algorithms
Magnusson, M., Torgander, J., Bürkner, P.-C., Zhang, L., Carpenter, B., and Vehtari, A. (2025). “posteriordb: Testing, benchmarking and developing Bayesian inference algorithms.” InProceedings of The International Conference on Artificial Intelligence and Statistics, volume 258 ofPMLR, 1198–1206. JMLR
2025
-
[71]
Torsten: A platform for Bayesian inference of pharmacometric models
Margossian, C. C., Zhang, Y., Gillespie, B., Bales, B., Volfovsky, A., Pavlovic, V., and Gelman, A. (2022). “Torsten: A platform for Bayesian inference of pharmacometric models.”Statistics and Computing, 32(6): 1–15
2022
-
[72]
Expectation propagation for approximate Bayesian inference
Minka, T. P. (2001). “Expectation propagation for approximate Bayesian inference.” InProceedings of the Conference on Uncertainty in Artificial Intelligence, 362–369. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc
2001
-
[73]
Variational Inference with Gaussian Score Matching
Modi, C., Gower, R., Margossian, C., Yao, Y., Blei, D., and Saul, L. (2023). “Variational Inference with Gaussian Score Matching.” InAdvances in Neural Information Processing Systems, volume 36, 29935–29950. Curran Associates, Inc
2023
-
[74]
Monte Carlo Gradient Estimation in Machine Learning
Mohamed, S., Rosca, M., Figurnov, M., and Mnih, A. (2020). “Monte Carlo Gradient Estimation in Machine Learning.”Journal of Machine Learning Research, 21(132): 1–62
2020
-
[75]
Instead of Rewriting Foreign Code for Machine Learning, Automatically Synthesize Fast Gradients
Moses, W. and Churavy, V. (2020). “Instead of Rewriting Foreign Code for Machine Learning, Automatically Synthesize Fast Gradients.” InAdvances in Neural Information Processing Systems, volume 33, 12472–12485. Curran Associates, Inc
2020
-
[76]
Rectified linear units improve restricted Boltz- mann machines
Nair, V. and Hinton, G. E. (2010). “Rectified linear units improve restricted Boltz- mann machines.” InProceedings of the International Conference on International Conference on Machine Learning, ICML, 807–814. Madison, WI, USA: Omnipress
2010
-
[77]
Gaussian Variational Approximation with a Factor Covariance Structure
Ong, V. M.-H., Nott, D. J., and Smith, M. S. (2018). “Gaussian Variational Approximation with a Factor Covariance Structure.”Journal of Computational and Graphical Statistics, 27(3): 465–478
2018
-
[78]
Parameter-free stochastic optimization of variationally coherent functions
Orabona, F. and Pál, D. (2021). “Parameter-free stochastic optimization of variationally coherent functions.”arXiv:2102.00236
-
[79]
Training deep networks without learning 34Default optimizers for variational inference rates through coin betting
Orabona, F. and Tommasi, T. (2017). “Training deep networks without learning 34Default optimizers for variational inference rates through coin betting.” InAdvances in Neural Information Processing Systems, volume 30, 2160–2170. Curran Associates, Inc
2017
-
[80]
R package
outbreak package authors (2024).outbreak: Tools for Simulating and Analyzing Epidemic Outbreaks. R package. URLhttps://CRAN.R-project.org/package=outbreak
2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.