Low Rank Based Subspace Inference for the Laplace Approximation of Bayesian Neural Networks

Jörg Martin; Josua Faller

arxiv: 2502.02345 · v2 · submitted 2025-02-04 · 💻 cs.LG


Pith reviewed 2026-05-23 03:37 UTC · model grok-4.3

classification 💻 cs.LG

keywords subspace inference · Laplace approximation · Bayesian neural networks · low-rank approximation · uncertainty quantification · covariance matrix · scalable inference


The pith

A low-rank subspace model for the Laplace approximation in Bayesian neural networks is optimal for a given dataset and closely matches the full approximation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper derives an expression for a subspace model in a Laplace-based Bayesian inference setting that is optimal in a certain sense for any specific dataset. It shows empirically that replacing the full covariance matrix with a dimensionally reduced version yields a Laplace approximation nearly identical to the exact one. The work also supplies a practical scalable version of this construction that outperforms prior subspace approaches and introduces a metric for comparing approximation quality when the full Laplace is unavailable.

Core claim

What carries the argument

Low-rank approximation to the covariance matrix within the Laplace approximation, which produces the optimal subspace model for a given dataset.
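
One way to see why a low-rank factorization of the predictive covariance yields a subspace model is the short derivation below. It is a sketch under assumptions not stated on this page: a linearized model with Jacobian $J$, a Gaussian posterior with covariance $\Psi$, and a subspace parametrization $\theta = \hat\theta + Pz$ with $P \in \mathbb{R}^{p \times s}$. The symbols $J$, $\Psi$, $P$ and $U_s$ echo fragments of the paper quoted elsewhere on this page, but the specific choice of $P$ is an illustration and not necessarily the paper's construction.

```latex
\begin{align*}
\Sigma &= J \Psi J^\top = U \Lambda U^\top
  && \text{(full predictive covariance and its eigendecomposition)} \\
\Sigma_P &= J P \,(P^\top \Psi^{-1} P)^{-1} P^\top J^\top
  && \text{(predictive covariance of the subspace model)} \\
P &= \Psi J^\top U_s
  \;\Longrightarrow\; J P = \Sigma U_s = U_s \Lambda_s,
  \quad P^\top \Psi^{-1} P = U_s^\top \Sigma U_s = \Lambda_s \\
\Sigma_P &= U_s \Lambda_s \Lambda_s^{-1} \Lambda_s U_s^\top = U_s \Lambda_s U_s^\top
  && \text{(rank-$s$ truncation of $\Sigma$)}
\end{align*}
```

Because every $\Sigma_P$ has rank at most $s$, the Eckart-Young theorem implies that no other choice of $P$ can achieve a smaller Frobenius-norm error than this truncation, provided the $s$ leading eigenvalues in $\Lambda_s$ are positive.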

If this is right

The derived subspace Laplace model can serve as a baseline for benchmarking other subspace inference methods.
The scalable approximation to the subspace construction can be used in practice on larger models where the full covariance is intractable.
The new metric enables direct qualitative comparison of different subspace models even when the exact Laplace is unknown.
Low-rank covariance reduction preserves the essential uncertainty properties of the Laplace posterior for the datasets tested.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach may extend to other posterior approximations that rely on a covariance or precision matrix, such as variational methods with Gaussian assumptions.
If the low-rank structure holds across training runs, it could reduce memory costs when storing or sampling from the posterior in deployed systems.
The optimality property might be tested by checking whether the reduced model minimizes a chosen divergence to the full Laplace on new data distributions.

Load-bearing premise

A subspace of the parameter space is sufficient to capture the uncertainty that the full Laplace approximation would produce.

What would settle it

On a held-out dataset, compute both the full Laplace and the low-rank subspace Laplace posteriors and measure whether their predictive distributions or calibration metrics differ by more than a small tolerance.
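
A check of this kind can be sketched on a toy problem. The snippet below is a minimal illustration, not the paper's experiment: it assumes a linearized model with a random Jacobian `J`, a precision matrix `H` built from a Gauss-Newton term plus a unit prior precision, and a relative Frobenius error as the comparison measure (the paper's own error definition is not reproduced on this page). The projection `P_lowrank` and the subset heuristic (largest Jacobian column norms) are hypothetical choices made for the example.

```python
import numpy as np

def subspace_predictive_cov(J, H, P):
    """Predictive covariance J P (P^T H P)^{-1} P^T J^T of the model theta = theta_hat + P z."""
    JP = J @ P
    return JP @ np.linalg.solve(P.T @ H @ P, JP.T)

rng = np.random.default_rng(0)
n, p, s = 40, 25, 5
# Toy Jacobian with decaying column scales, so many parameters are nearly inactive.
J = rng.normal(size=(n, p)) * 0.5 ** np.arange(p)
H = J.T @ J + np.eye(p)            # assumed precision: Gauss-Newton term + unit prior precision
Psi = np.linalg.inv(H)             # full Laplace covariance (feasible only for small p)
Sigma_full = J @ Psi @ J.T         # full epistemic predictive covariance

# Low-rank choice: P = Psi J^T U_s with U_s the s leading eigenvectors of Sigma_full.
eigvals, eigvecs = np.linalg.eigh(Sigma_full)   # eigenvalues in ascending order
U_s = eigvecs[:, -s:]
P_lowrank = Psi @ J.T @ U_s

# Subset choice (hypothetical heuristic): keep the s parameters with the largest Jacobian column norm.
top = np.argsort(np.linalg.norm(J, axis=0))[-s:]
P_subset = np.eye(p)[:, top]

def rel_err(Sigma_P):
    """Relative Frobenius distance to the full predictive covariance."""
    return np.linalg.norm(Sigma_full - Sigma_P) / np.linalg.norm(Sigma_full)

err_lowrank = rel_err(subspace_predictive_cov(J, H, P_lowrank))
err_subset = rel_err(subspace_predictive_cov(J, H, P_subset))
print(f"low-rank subspace: {err_lowrank:.3e}   subset: {err_subset:.3e}")
```

On real networks the same comparison would be run on held-out inputs and complemented by calibration metrics; the tolerance that counts as "small" is a modelling decision and is left open here.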

Figures

Figures reproduced from arXiv: 2502.02345 by Jörg Martin, Josua Faller.

**Figure 1.** Comparison of low rank approximations and subset methods for different regression datasets. Different …

**Figure 3.** Relative error (20) (left) and trace criterion (25) (right) for corrupted MNIST datasets [43] and three different dimensions s = 100, 500, 1000 (shown by markers in increasing size). Different choices for P are indicated by different colours and marker shapes: square markers ■ indicate subset based methods, whereas discs • indicate low-rank based methods (proposed in this work). The colour coding is chos…

**Figure 2.** Relative error (20) and logarithm of trace (25) of the epistemic covariance matrix for MNIST and FashionMNIST.

… number of ‘dead parameters’ whose gradient is almost zero, which provides a natural subset to be selected. Indeed, ENB has the most number of dead parameters with 93%. More details on this investigation are given in Appendix E. A comparison between the first and the second row of …

**Figure 4.** Evaluation with the trace criterion (25) for CIFAR10 and ImageNet10 and different choices of P. Missing values in Figure 4b are due to vanishing trace values.

… to approximate $\Sigma_{P,X}$ well.

6 Conclusion

In this work we propose to look at subspace Laplace approximations of Bayesian neural networks through the lens of their predictive covariances. This approach allows us to derive the existence of an optimal subs…

**Figure 5.** Relative error (20) of the epistemic covariance matrix of the studied subset methods for s up to the number of parameters p for MNIST.

B Existence of an Optimal Subspace Model for the Laplace Approximation

Theorem (Existence of an optimal subspace model for the Laplace approximation). Consider the problem (21) with $s \le s_{\max} = \min(nC, p)$. Suppose that $J_X \in \mathbb{R}^{nC \times p}$ has full rank. For any invertible $Q \in \mathbb{R}^{s \times \dots}$ …

**Figure 6.** The top displays a heatmap which highlights the activity of the gradients corresponding to the parameter …

**Figure 7.** The NLL metric (30) for the datasets and subspace models considered in this work. The colour and linestyle coding is identical to the one in …

**Figure 8.** Figure 8a visualizes the prediction quality of the parametric function …

Original abstract

Subspace inference for neural networks assumes that a subspace of their parameter space suffices to produce a reliable uncertainty quantification. In this work, we underpin the validity of this assumption by using low rank techniques. We derive an expression for a subspace model to a Bayesian inference scenario based on the Laplace approximation that is, in a certain sense, optimal given a specific dataset. We empirically show that a Laplace approximation constructed with a dimensionally reduced covariance matrix closely matches the full Laplace approximation obtained using the exact covariance matrix. Where feasible, this subspace model can serve as a baseline for benchmarking the performance of subspace models. In addition, we provide a scalable approximation of this subspace construction that is usable in practice and compare it to existing subspace models from the literature. In general, our approximation scheme outperforms previous work. Furthermore, we present a metric to qualitatively compare the approximation quality of different subspace models even if the exact Laplace approximation is unknown.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Derives an optimal low-rank subspace expression for Laplace BNNs, with a scalable version that outperforms prior subspace methods on the paper's own metrics, but all checks stay inside the Laplace approximation.


This paper derives an expression for an optimal subspace model under the Laplace approximation for Bayesian neural networks and shows that a reduced-covariance version stays close to the full Laplace posterior. It also gives a practical scalable construction that outperforms earlier subspace methods and introduces a new qualitative metric for comparing subspace models when the exact Laplace approximation is unavailable. The derivation applies low-rank techniques to the Hessian-based Gaussian, and the empirical results indicate that the reduced form matches the full one well while the scalable version improves on prior work. The new metric is presented as a way to benchmark even without the full covariance. These two elements, the optimal expression and the comparison metric, appear to be the fresh contributions relative to existing subspace inference. The work is technically grounded in the Laplace setting and supplies a baseline that could be useful for that specific approximation.

The main limitation is that the validation and the claim about underpinning the subspace assumption for uncertainty quantification both remain entirely within the Laplace world. Matching full and reduced Laplace does not test whether the subspace captures directions relevant to the true posterior, which can differ from the Gaussian approximation in high-dimensional non-convex neural net landscapes. No comparisons to MCMC or HMC are described.

This is aimed at people working on scalable Bayesian methods for neural networks, particularly those already using or extending Laplace approximations. Readers focused on low-rank techniques or subspace baselines will get concrete value from the derivation and the metric. The paper shows clear engagement with the math and prior literature on its own terms. It deserves a serious referee to check the derivation details and the experimental comparisons.

Referee Report

2 major / 2 minor

Summary. The paper derives an expression for an optimal low-rank subspace model in a Laplace-approximation Bayesian inference setting for neural networks, empirically demonstrates that a dimensionally reduced covariance Laplace approximation closely matches the full Laplace approximation, introduces a scalable practical version of this construction that outperforms prior subspace methods, and proposes a metric for qualitatively comparing subspace approximations when the exact Laplace is unavailable.

Significance. If the derivations and empirical matches hold under detailed scrutiny, the work supplies a theoretically grounded baseline for subspace inference specifically within the Laplace framework for BNNs, along with a usable scalable method and a comparison metric; these could serve as reference points for evaluating other subspace techniques and clarifying the conditions under which low-dimensional parameter subspaces suffice for uncertainty quantification.

major comments (2)

[Abstract and Introduction] Abstract and introduction: the central motivation—to underpin the general assumption that a subspace of parameter space suffices for reliable uncertainty quantification—is supported only by agreement between reduced and full Laplace approximations. Because the Laplace approximation itself can deviate substantially from the true posterior (particularly in non-convex, high-dimensional BNN landscapes), internal matching within the Laplace setting does not directly validate the subspace assumption for actual Bayesian inference; an external check against sampling-based posteriors (e.g., HMC or MCMC on the same models) would be required to make this claim load-bearing.
[Empirical Evaluation] Empirical section: the reported closeness of reduced to full Laplace and outperformance of the scalable version are presented without accompanying error analysis, variance estimates across random seeds, or ablation on the rank choice; without these, the strength of support for the optimality and practical utility claims cannot be verified.

minor comments (2)

[Derivation] Notation for the low-rank factors and the definition of optimality should be introduced with explicit equations early in the derivation section to improve readability.
[Figures] Figure captions for the qualitative metric comparisons should state the precise models, datasets, and rank values used so that the plots are self-contained.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each of the major comments below and outline the revisions we plan to make.


Referee: [Abstract and Introduction] Abstract and introduction: the central motivation—to underpin the general assumption that a subspace of parameter space suffices for reliable uncertainty quantification—is supported only by agreement between reduced and full Laplace approximations. Because the Laplace approximation itself can deviate substantially from the true posterior (particularly in non-convex, high-dimensional BNN landscapes), internal matching within the Laplace setting does not directly validate the subspace assumption for actual Bayesian inference; an external check against sampling-based posteriors (e.g., HMC or MCMC on the same models) would be required to make this claim load-bearing.

Authors: We appreciate this point. Our manuscript is explicitly focused on the Laplace approximation setting for Bayesian neural networks, as reflected in the title and throughout the text. The goal is to derive an optimal low-rank subspace specifically for the Laplace-approximated posterior and to provide a baseline within that framework. We do not claim that this validates the subspace assumption for the true posterior. To address the concern, we will revise the abstract and introduction to more precisely state that the validation is internal to the Laplace approximation and that the work provides a theoretically grounded baseline for subspace methods in the Laplace context. We note that direct comparisons to HMC or MCMC are computationally infeasible for the network sizes considered in this work.

Revision: partial

Referee: [Empirical Evaluation] Empirical section: the reported closeness of reduced to full Laplace and outperformance of the scalable version are presented without accompanying error analysis, variance estimates across random seeds, or ablation on the rank choice; without these, the strength of support for the optimality and practical utility claims cannot be verified.

Authors: We agree that additional statistical rigor would strengthen the empirical claims. In the revised version, we will include variance estimates across multiple random seeds where applicable, provide error analysis for the reported metrics, and add an ablation study on the effect of the rank choice on the approximation quality.

Revision: yes
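
The kind of reporting requested here (a rank ablation with variability across seeds) could look like the sketch below. Everything in it is an assumption made for illustration: the toy predictive covariance, the seed and rank grids, and the use of the best rank-s Frobenius error as the reported quantity. It does not reproduce any number from the paper.

```python
import numpy as np

def lowrank_rel_error(Sigma, s):
    """Relative Frobenius error of the best rank-s approximation of a symmetric PSD matrix."""
    eigvals = np.linalg.eigvalsh(Sigma)          # ascending order
    tail = eigvals[:-s]                          # eigenvalues discarded by the truncation
    return np.sqrt(np.sum(tail ** 2) / np.sum(eigvals ** 2))

def toy_predictive_cov(seed, n=30, p=50):
    """Hypothetical stand-in for a predictive covariance obtained from one training run."""
    rng = np.random.default_rng(seed)
    J = rng.normal(size=(n, p)) * 0.7 ** np.arange(p)
    H = J.T @ J + np.eye(p)
    return J @ np.linalg.solve(H, J.T)

ranks = [1, 2, 5, 10, 20]
seeds = range(5)
errors = np.array([[lowrank_rel_error(toy_predictive_cov(seed), s) for s in ranks]
                   for seed in seeds])           # shape (num_seeds, num_ranks)

mean = errors.mean(axis=0)
std = errors.std(axis=0, ddof=1)                 # sample standard deviation across seeds
for s, m, sd in zip(ranks, mean, std):
    print(f"s = {s:2d}   relative error = {m:.3e} ± {sd:.1e}")
```

Reporting the mean together with a spread per rank makes it visible whether differences between subspace constructions exceed the seed-to-seed variation.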

Circularity Check

0 steps flagged

No circularity: derivation uses standard low-rank approximation on Laplace covariance without self-referential reduction

full rationale

The paper derives an optimal low-rank subspace expression for the Laplace-approximated posterior by applying standard matrix low-rank techniques (e.g., SVD or similar) directly to the Hessian-derived covariance; this is a conventional approximation step whose optimality follows from the Eckart-Young theorem applied to the given matrix, not from any redefinition of the target quantity in terms of itself. The empirical claim is a direct numerical comparison of the full Laplace (exact covariance) versus the reduced-covariance version on the same models, which is an independent verification within the Laplace setting rather than a fitted input renamed as prediction. No load-bearing step relies on self-citation chains, uniqueness theorems imported from the authors' prior work, or ansatzes smuggled via citation. The derivation remains self-contained against the Laplace approximation as its explicit benchmark.
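
The Eckart-Young property invoked in this rationale can be checked numerically. The snippet below is a generic illustration on a random symmetric positive semidefinite matrix standing in for a covariance; the random-subspace competitor is a hypothetical baseline chosen only to have something of the same rank to compare against.

```python
import numpy as np

rng = np.random.default_rng(1)
n, s = 30, 4
A = rng.normal(size=(n, n))
Sigma = A @ A.T / n                               # stand-in for a covariance matrix

# Truncated SVD: keep the s largest singular values and their vectors.
U, sing, Vt = np.linalg.svd(Sigma)
Sigma_s = (U[:, :s] * sing[:s]) @ Vt[:s]

# Competitor of rank at most s: compress Sigma onto a random s-dimensional subspace.
Q, _ = np.linalg.qr(rng.normal(size=(n, s)))
Sigma_rand = Q @ (Q.T @ Sigma @ Q) @ Q.T

err_svd = np.linalg.norm(Sigma - Sigma_s)         # Frobenius norm by default
err_rand = np.linalg.norm(Sigma - Sigma_rand)
err_bound = np.sqrt(np.sum(sing[s:] ** 2))        # Eckart-Young: error of the best rank-s approximation

print(f"truncated SVD: {err_svd:.4f}   random subspace: {err_rand:.4f}   bound: {err_bound:.4f}")
```

The truncated SVD attains the bound exactly, while any other matrix of rank at most s, including the random-subspace compression, can only match or exceed it.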

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract does not detail any free parameters, axioms, or invented entities; the work extends existing Laplace approximation and low-rank matrix techniques without introducing new postulated quantities.


Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · 13 internal anchors

[1] Yarin Gal. Uncertainty in deep learning.
[2] URL https://www.cs.ox.ac.uk/people/yarin.gal/website/thesis/thesis.pdf
[3] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural network. In International Conference on Machine Learning, pages 1613–1622. PMLR, 2015.
[4] Alex Kendall and Yarin Gal. What uncertainties do we need in Bayesian deep learning for computer vision? arXiv preprint arXiv:1703.04977, 2017.
[5] José Miguel Hernández-Lobato and Ryan Adams. Probabilistic backpropagation for scalable learning of Bayesian neural networks. In International conference on machine learning, pages 1861–1869. PMLR, 2015.
[6] Wesley J Maddox, Pavel Izmailov, Timur Garipov, Dmitry P Vetrov, and Andrew Gordon Wilson. A simple baseline for Bayesian uncertainty in deep learning. Advances in neural information processing systems, 32, 2019.
[7] Diederik P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. arXiv preprint arXiv:1506.02557, 2015.
[8] Michael I. Jordan, Zoubin Ghahramani, T. Jaakkola, and Lawrence K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37:183–233, 1999.
[9] Martin J. Wainwright and Michael I. Jordan. Graphical models, exponential families, and variational inference. Found. Trends Mach. Learn., 1:1–305, 2008.
[10] David John Cameron MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 4:448–472, 1992.
[11] Yann LeCun, John Denker, and Sara Solla. Optimal brain damage. Advances in neural information processing systems, 2, 1989.
[12] Hippolyt Ritter, Aleksandar Botev, and David Barber. A scalable Laplace approximation for neural networks. In International Conference on Learning Representations, 2018.
[13] Erik A. Daxberger, Agustinus Kristiadi, Alexander Immer, Runa Eschenhagen, M. Bauer, and Philipp Hennig. Laplace redux - effortless Bayesian deep learning. In Neural Information Processing Systems, 2021.
[14] Yann N. Dauphin, Razvan Pascanu, Çaglar Gülçehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. arXiv preprint: 1406.2572, 2014. URL http://arxiv.org/abs/1406.2572
[15] Levent Sagun, Léon Bottou, and Yann LeCun. Eigenvalues of the Hessian in deep learning: Singularity and beyond. arXiv preprint: 1611.07476, 2016. URL https://arxiv.org/abs/1611.07476
[16] Vardan Papyan. The full spectrum of deepnet Hessians at scale: Dynamics with SGD training and sample size. arXiv preprint: 1811.07062, 2018. URL https://arxiv.org/abs/1811.07062
[17] Nicol N. Schraudolph. Fast Curvature Matrix-Vector Products for Second-Order Gradient Descent. Neural Computation, 14(7):1723–1738, 07 2002. ISSN 0899-
[18] doi: 10.1162/08997660260028683
[19] Razvan Pascanu and Yoshua Bengio. Revisiting natural gradient for deep networks. arXiv preprint: 1301.3584, 2013.
[21] URL http://arxiv.org/abs/1412.1193
[22] Tim Salimans and Diederik P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. arXiv preprint: 1602.07868, 2016. URL https://arxiv.org/abs/1602.07868
[23] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017.
[25] URL https://arxiv.org/abs/2002.10118
[26] Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Md. Mostofa Ali Patwary, Prabhat, and Ryan P. Adams. Scalable Bayesian optimization using deep neural networks. arXiv preprint: 1502.05700, 2015. URL https://arxiv.org/abs/1502.05700
[27] Pavel Izmailov, Wesley J. Maddox, Polina Kirichenko, Timur Garipov, Dmitry P. Vetrov, and Andrew Gordon Wilson. Subspace inference for Bayesian deep learning. arXiv preprint: 1907.07504, 2019.
[28] E. Daxberger, E. Nalisnick, J. Allingham, J. Antorán, and J. M. Hernández-Lobato. Bayesian deep learning via subnetwork inference. In Proceedings of 38th International Conference on Machine Learning (ICML), volume 139 of Proceedings of Machine Learning Research, pages 2510–2521. PMLR, July 2021. URL https://proceedings.mlr.press/v139/daxberger21a.html
[29] Mrinank Sharma, Sebastian Farquhar, Eric Nalisnick, and Tom Rainforth. Do Bayesian neural networks need to be fully stochastic? arXiv preprint: 2211.06291, 2023. URL https://arxiv.org/abs/2211.06291
[30] Yu Cheng, Duo Wang, Pan Zhou, and Zhang Tao. A survey of model compression and acceleration for deep neural networks. ArXiv, abs/1710.09282, 2017.
[31] Andrew Y. K. Foong, Yingzhen Li, José Miguel Hernández-Lobato, and Richard E. Turner. 'In-between' uncertainty in Bayesian neural networks. arXiv preprint: 1906.11537, 2019. URL https://arxiv.org/abs/1906.11537
[33] URL https://arxiv.org/abs/2008.08400
[34] Zhijie Deng, Feng Zhou, and Jun Zhu. Accelerated linearized Laplace approximation for Bayesian deep learning. ArXiv, abs/2210.12642, 2022.
[35] Luis A. Ortega, Simón Rodríguez Santana, and Daniel Hernández-Lobato. Variational linearized Laplace approximation for Bayesian deep learning. ArXiv, abs/2302.12565, 2023.
[36] David John Cameron MacKay. The evidence framework applied to classification networks. Neural Computation, 4:720–736, 1992.
[37] Aleksandar Botev, Hippolyt Ritter, and David Barber. Practical Gauss-Newton optimisation for deep learning. In International Conference on Machine Learning, 2017.
[38] James Martens and Roger Baker Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning, 2015.
[39] Sergio Calvo-Ordonez, Matthieu Meunier, Francesco Piatti, and Yuantao Shi. Partially stochastic infinitely deep Bayesian neural networks. arXiv preprint: 2402.03495, 2024. URL https://arxiv.org/abs/2402.03495
[40] Tom M. Heskes. On natural learning and pruning in multilayered perceptrons. Neural Computation, 12:881–901, 2000.
[41] Christopher M. Bishop. Pattern recognition and machine learning, 5th Edition. Information science and statistics. Springer, 2007. ISBN 9780387310732. URL https://www.worldcat.org/oclc/71008143
[42] Erhard Schmidt. Zur Theorie der linearen und nichtlinearen Integralgleichungen. Mathematische Annalen, 63(4):433–476, Dec 1907. ISSN 1432-
[43] doi: 10.1007/BF01449770. URL https://doi.org/10.1007/BF01449770
[44] Carl Eckart and Gale Young. The approximation of one matrix by another of lower rank. Psychometrika, 1(3):211–218, Sep 1936. ISSN 1860-
[45] doi: 10.1007/BF02288367. URL https://doi.org/10.1007/BF02288367
[46] L. Mirsky. Symmetric gauge functions and unitarily invariant norms. The Quarterly Journal of Mathematics, 11(1):50–59, 01 1960. ISSN 0033-
[47] doi: 10.1093/qmath/11.1.50. URL https://doi.org/10.1093/qmath/11.1.50
[48] Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. OpenML: Networked science in machine learning. SIGKDD Explorations, 15(2):49–60, 2013. doi: 10.1145/2641190.2641198. URL http://doi.acm.org/10.1145/2641190.2641198
[49] Matthias Feurer, Jan N. van Rijn, Arlind Kadra, Pieter Gijsbers, Neeratyoy Mallik, Sahithya Ravi, Andreas Mueller, Joaquin Vanschoren, and Frank Hutter. OpenML-Python: an extensible Python API for OpenML. arXiv, 1911.02490. URL https://arxiv.org/pdf/1911.02490.pdf
[50] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proc. IEEE, 86:2278–2324, 1998.
[51] Norman Mu and Justin Gilmer. MNIST-C: A robustness benchmark for computer vision, 2019. URL https://arxiv.org/abs/1906.02337
[52] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. ArXiv, abs/1708.07747, 2017.
[53] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009. https://www.cs.toronto.edu/~kriz/cifar.html
[54] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009. doi: 10.1109/CVPR.2009.5206848.
[55] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library, 2019.
[56] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015. URL http://arxiv.org/abs/1512.03385
[57] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in neural information processing systems, 30, 2017.
[58] J. Yao, W. Pan, S. Ghosh, and F. Doshi-Velez. Quality of uncertainty quantification for Bayesian neural network inference. 2019.
[59] Sameer K. Deshpande, Soumya Ghosh, Tin D. Nguyen, and Tamara Broderick. Are you using test log-likelihood correctly? Trans. Mach. Learn. Res., 2024, 2024. URL https://openreview.net/forum?id=n2YifD4Dxo.

| dataset | α_init | n_epoch | warm up / decay |
| --- | --- | --- | --- |
| Red Wine | 0.0004 | 300 | (0.3/0.3) |
| ENB | 0.004 | 1500 | (0.1/0.5) |
| California | 0.0004 | 100 | (0.3/0.5) |
| Naval Propulsion | 0.0004 | 100 | (0.... |

[60] The object $U_s$ needs to be computable.
[61] The computation of the product $J_{X'}^T U_s$ needs to be feasible. Obstacle 2 is rather straightforward to circumvent as we can compute the matrix product via mini-batches from the training data. It turns out that Obstacle 1 sets the actual limit on the subset of training data as we compute $U_s$ via an SVD of the object $J_{X'} \Psi_{\text{approx}} J_{X'}^T \in \mathbb{R}^{nC \times nC}$. For Red Wine a...
[62] First, note that for most models the NLL rises with increasing s. In other words, the NLL evaluates subspace models that use fewer parameters as better.
[63] Second, the full model has the highest NLL value. In other words, the NLL ranks it as the worst performing model, whereas the models that approximate it perform better under this metric. It seems rather implausible that an approximated object yields more precise estimates than the object which it approximates. We therefore feel safe to conclude that the ra...
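
The two computational steps described in the excerpts under [60] and [61] above (obtaining U_s from a decomposition of an nC × nC matrix, then forming the product of the transposed Jacobian with U_s in mini-batches) could be sketched as follows. This is a toy illustration under stated assumptions: the Jacobian is a random matrix held in memory, `psi_diag` is a hypothetical diagonal stand-in for the approximate covariance, and the helper names `top_eigvecs` and `jt_times_u` are invented for the example. An eigendecomposition is used in place of an SVD, which is equivalent for a symmetric positive semidefinite matrix.

```python
import numpy as np

def top_eigvecs(gram, s):
    """Return the s leading eigenvectors (as columns) of a symmetric PSD matrix."""
    _, eigvecs = np.linalg.eigh(gram)            # eigenvalues come back in ascending order
    return eigvecs[:, ::-1][:, :s]

def jt_times_u(jac_batches, U_s):
    """Accumulate J^T U_s batch by batch, never stacking the full Jacobian."""
    out, row = None, 0
    for J_b in jac_batches:
        rows = J_b.shape[0]
        contrib = J_b.T @ U_s[row:row + rows]    # (p, rows) @ (rows, s)
        out = contrib if out is None else out + contrib
        row += rows
    return out

rng = np.random.default_rng(0)
n_rows, p, s, batch = 60, 200, 8, 16             # n_rows plays the role of nC
J = rng.normal(size=(n_rows, p))                 # toy Jacobian
psi_diag = rng.uniform(0.1, 1.0, size=p)         # hypothetical diagonal covariance approximation

gram = (J * psi_diag) @ J.T                      # J Psi_approx J^T, shape (nC, nC)
U_s = top_eigvecs(gram, s)

batches = (J[i:i + batch] for i in range(0, n_rows, batch))
JtU = jt_times_u(batches, U_s)                   # shape (p, s)
```

The memory cost of the second step is one batch of Jacobian rows plus a p × s accumulator, which is consistent with the excerpt's remark that the decomposition of the nC × nC matrix, not the batched product, is the limiting step.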
