Low Rank Based Subspace Inference for the Laplace Approximation of Bayesian Neural Networks
Pith reviewed 2026-05-23 03:37 UTC · model grok-4.3
The pith
A low-rank subspace model for the Laplace approximation in Bayesian neural networks is optimal for a given dataset and closely matches the full approximation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Subspace inference for neural networks assumes that a subspace of their parameter space suffices to produce a reliable uncertainty quantification. In this work, we underpin the validity of this assumption by using low rank techniques. We derive an expression for a subspace model to a Bayesian inference scenario based on the Laplace approximation that is, in a certain sense, optimal given a specific dataset. We empirically show that a Laplace approximation constructed with a dimensionally reduced covariance matrix closely matches the full Laplace approximation obtained using the exact covariance matrix.
What carries the argument
Low-rank approximation to the covariance matrix within the Laplace approximation, which produces the optimal subspace model for a given dataset.
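To make the objects in this summary concrete, the following block writes out a Laplace approximation and a generic subspace model in standard notation. The symbols ($\hat{\theta}$, $H$, $\Psi$, $P$, $\Sigma_z$) are assumptions chosen for illustration and are not taken from the paper, whose own construction and optimality criterion may differ.

```latex
% Generic notation, assumed for illustration; not the paper's own definitions.
% Laplace approximation around a MAP estimate \hat{\theta} \in \mathbb{R}^p:
p(\theta \mid \mathcal{D}) \approx \mathcal{N}\!\left(\hat{\theta}, \Psi\right),
\qquad
\Psi = H^{-1},
\qquad
H = -\nabla_\theta^2 \log p(\theta \mid \mathcal{D}) \Big|_{\theta = \hat{\theta}}

% Generic s-dimensional subspace model with projection P \in \mathbb{R}^{p \times s}:
\theta = \hat{\theta} + P z,
\qquad
z \sim \mathcal{N}(0, \Sigma_z),
\qquad
\operatorname{Cov}(\theta) = P \Sigma_z P^\top,
\qquad
\operatorname{rank}\!\left(P \Sigma_z P^\top\right) \le s
```

In this notation, the question the paper addresses is how to choose $P$ and $\Sigma_z$ so that the reduced covariance stays close to $\Psi$ for a given dataset.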
If this is right
- The derived subspace Laplace model can serve as a baseline for benchmarking other subspace inference methods.
- The scalable approximation to the subspace construction can be used in practice on larger models where the full covariance is intractable.
- The new metric enables direct qualitative comparison of different subspace models even when the exact Laplace is unknown.
- Low-rank covariance reduction preserves the essential uncertainty properties of the Laplace posterior for the datasets tested.
Where Pith is reading between the lines
- The approach may extend to other posterior approximations that rely on a covariance or precision matrix, such as variational methods with Gaussian assumptions.
- If the low-rank structure holds across training runs, it could reduce memory costs when storing or sampling from the posterior in deployed systems.
- The optimality property might be tested by checking whether the reduced model minimizes a chosen divergence to the full Laplace on new data distributions.
Load-bearing premise
A subspace of the parameter space is sufficient to capture the uncertainty that the full Laplace approximation would produce.
What would settle it
On a held-out dataset, compute both the full Laplace and the low-rank subspace Laplace posteriors and measure whether their predictive distributions or calibration metrics differ by more than a small tolerance.
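A minimal sketch of such a check, assuming a model that is linear in its parameters so that the Laplace covariance is available in closed form. The feature matrices, noise variance, and prior precision are hypothetical, and plain eigenvalue truncation stands in for the paper's subspace construction, which may differ.

```python
import numpy as np

rng = np.random.default_rng(1)
n_train, n_test, p = 200, 50, 20
noise_var, prior_prec = 0.1, 1.0  # assumed values, for illustration only

# Hypothetical feature matrices; for a network these would be Jacobians at the MAP.
scales = 1.0 / (1.0 + np.arange(p))
Phi_train = rng.standard_normal((n_train, p)) * scales
Phi_test = rng.standard_normal((n_test, p)) * scales

# Laplace precision and covariance for a Gaussian likelihood and Gaussian prior.
H = Phi_train.T @ Phi_train / noise_var + prior_prec * np.eye(p)
cov_full = np.linalg.inv(H)


def low_rank_cov(cov, s):
    """Keep the s largest eigenpairs of a symmetric PSD matrix."""
    w, V = np.linalg.eigh(cov)  # eigenvalues in ascending order
    V_s, w_s = V[:, -s:], w[-s:]
    return (V_s * w_s) @ V_s.T


def predictive_var(Phi, cov):
    """Per-point variance diag(Phi cov Phi^T) without forming the full matrix."""
    return np.einsum("ij,jk,ik->i", Phi, cov, Phi)


var_full = predictive_var(Phi_test, cov_full)
ranks = [1, 5, 10, 20]
rel_err = []
for s in ranks:
    var_s = predictive_var(Phi_test, low_rank_cov(cov_full, s))
    # Worst-case relative deviation of the held-out predictive variance.
    rel_err.append(float(np.max(np.abs(var_full - var_s) / var_full)))

for s, e in zip(ranks, rel_err):
    print(f"rank {s:2d}: max relative error in predictive variance = {e:.3e}")
```

In this toy setting the deviation can only shrink as the rank grows, and it vanishes at full rank; for a real network the tolerance and the choice of metric (predictive variance, calibration, or a divergence) would have to be fixed in advance.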
Original abstract
Subspace inference for neural networks assumes that a subspace of their parameter space suffices to produce a reliable uncertainty quantification. In this work, we underpin the validity of this assumption by using low rank techniques. We derive an expression for a subspace model to a Bayesian inference scenario based on the Laplace approximation that is, in a certain sense, optimal given a specific dataset. We empirically show that a Laplace approximation constructed with a dimensionally reduced covariance matrix closely matches the full Laplace approximation obtained using the exact covariance matrix. Where feasible, this subspace model can serve as a baseline for benchmarking the performance of subspace models. In addition, we provide a scalable approximation of this subspace construction that is usable in practice and compare it to existing subspace models from the literature. In general, our approximation scheme outperforms previous work. Furthermore, we present a metric to qualitatively compare the approximation quality of different subspace models even if the exact Laplace approximation is unknown.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper derives an expression for an optimal low-rank subspace model in a Laplace-approximation Bayesian inference setting for neural networks, empirically demonstrates that a dimensionally reduced covariance Laplace approximation closely matches the full Laplace approximation, introduces a scalable practical version of this construction that outperforms prior subspace methods, and proposes a metric for qualitatively comparing subspace approximations when the exact Laplace is unavailable.
Significance. If the derivations and empirical matches hold under detailed scrutiny, the work supplies a theoretically grounded baseline for subspace inference specifically within the Laplace framework for BNNs, along with a usable scalable method and a comparison metric; these could serve as reference points for evaluating other subspace techniques and clarifying the conditions under which low-dimensional parameter subspaces suffice for uncertainty quantification.
major comments (2)
- [Abstract and Introduction] Abstract and introduction: the central motivation—to underpin the general assumption that a subspace of parameter space suffices for reliable uncertainty quantification—is supported only by agreement between reduced and full Laplace approximations. Because the Laplace approximation itself can deviate substantially from the true posterior (particularly in non-convex, high-dimensional BNN landscapes), internal matching within the Laplace setting does not directly validate the subspace assumption for actual Bayesian inference; an external check against sampling-based posteriors (e.g., HMC or MCMC on the same models) would be required to make this claim load-bearing.
- [Empirical Evaluation] Empirical section: the reported closeness of reduced to full Laplace and outperformance of the scalable version are presented without accompanying error analysis, variance estimates across random seeds, or ablation on the rank choice; without these, the strength of support for the optimality and practical utility claims cannot be verified.
minor comments (2)
- [Derivation] Notation for the low-rank factors and the definition of optimality should be introduced with explicit equations early in the derivation section to improve readability.
- [Figures] Figure captions for the qualitative metric comparisons should state the precise models, datasets, and rank values used so that the plots are self-contained.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our manuscript. We address each of the major comments below and outline the revisions we plan to make.
- Referee: [Abstract and Introduction] Abstract and introduction: the central motivation, to underpin the general assumption that a subspace of parameter space suffices for reliable uncertainty quantification, is supported only by agreement between reduced and full Laplace approximations. Because the Laplace approximation itself can deviate substantially from the true posterior (particularly in non-convex, high-dimensional BNN landscapes), internal matching within the Laplace setting does not directly validate the subspace assumption for actual Bayesian inference; an external check against sampling-based posteriors (e.g., HMC or MCMC on the same models) would be required to make this claim load-bearing.
  Authors: We appreciate this point. Our manuscript is explicitly focused on the Laplace approximation setting for Bayesian neural networks, as reflected in the title and throughout the text. The goal is to derive an optimal low-rank subspace specifically for the Laplace-approximated posterior and to provide a baseline within that framework. We do not claim that this validates the subspace assumption for the true posterior. To address the concern, we will revise the abstract and introduction to more precisely state that the validation is internal to the Laplace approximation and that the work provides a theoretically grounded baseline for subspace methods in the Laplace context. We note that direct comparisons to HMC or MCMC are computationally infeasible for the network sizes considered in this work.
  Revision: partial
- Referee: [Empirical Evaluation] Empirical section: the reported closeness of reduced to full Laplace and outperformance of the scalable version are presented without accompanying error analysis, variance estimates across random seeds, or ablation on the rank choice; without these, the strength of support for the optimality and practical utility claims cannot be verified.
  Authors: We agree that additional statistical rigor would strengthen the empirical claims. In the revised version, we will include variance estimates across multiple random seeds where applicable, provide error analysis for the reported metrics, and add an ablation study on the effect of the rank choice on the approximation quality.
  Revision: yes
Circularity Check
No circularity: derivation uses standard low-rank approximation on Laplace covariance without self-referential reduction
The paper derives an optimal low-rank subspace expression for the Laplace-approximated posterior by applying standard matrix low-rank techniques (e.g., SVD or similar) directly to the Hessian-derived covariance; this is a conventional approximation step whose optimality follows from the Eckart-Young theorem applied to the given matrix, not from any redefinition of the target quantity in terms of itself. The empirical claim is a direct numerical comparison of the full Laplace (exact covariance) versus the reduced-covariance version on the same models, which is an independent verification within the Laplace setting rather than a fitted input renamed as prediction. No load-bearing step relies on self-citation chains, uniqueness theorems imported from the authors' prior work, or ansatzes smuggled via citation. The derivation remains self-contained against the Laplace approximation as its explicit benchmark.
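The Eckart-Young property invoked in this rationale can be checked numerically. The sketch below uses a synthetic symmetric positive semi-definite matrix as a hypothetical stand-in for a Laplace covariance; it is not the paper's covariance or code. It compares the truncated SVD against a random projection of the same rank.

```python
import numpy as np


def truncated_svd(A, s):
    """Rank-s truncation of A; optimal in Frobenius norm by Eckart-Young."""
    U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :s] * sigma[:s]) @ Vt[:s], sigma


rng = np.random.default_rng(0)
p, s = 30, 5

# Hypothetical covariance: random orthogonal basis with a decaying spectrum.
Q, _ = np.linalg.qr(rng.standard_normal((p, p)))
eigvals = 1.0 / (1.0 + np.arange(p)) ** 2
cov = (Q * eigvals) @ Q.T

cov_s, sigma = truncated_svd(cov, s)
err_opt = np.linalg.norm(cov - cov_s, "fro")

# Eckart-Young: the optimal error is the l2 norm of the discarded singular values.
err_theory = np.sqrt(np.sum(sigma[s:] ** 2))

# Any other matrix of rank at most s does no better, e.g. a random projection.
B, _ = np.linalg.qr(rng.standard_normal((p, s)))
cov_rand = B @ (B.T @ cov @ B) @ B.T
err_rand = np.linalg.norm(cov - cov_rand, "fro")

print(f"truncated SVD error:     {err_opt:.6f}")
print(f"Eckart-Young prediction: {err_theory:.6f}")
print(f"random rank-{s} error:     {err_rand:.6f}")
```

The first two numbers agree to floating-point precision, and the random projection is never better. Note that this only illustrates optimality with respect to a matrix norm; whether that norm is the right notion of closeness for predictive uncertainty is a separate question.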
Excerpts from the paper
dataset            α_init   n_epoch   warm up / decay
Red Wine           0.0004   300       (0.3/0.3)
ENB                0.004    1500      (0.1/0.5)
California         0.0004   100       (0.3/0.5)
Naval Propulsion   0.0004   100       (0....
1. The object $U_s$ needs to be computable.
2. The computation of the product $J_{X'}^\top U_s$ needs to be feasible.
Obstacle 2 is rather straightforward to circumvent, as we can compute the matrix product via mini-batches from the training data. It turns out that Obstacle 1 sets the actual limit on the subset of training data, as we compute $U_s$ via an SVD of the object $J_{X'} \Psi_{\mathrm{approx}} J_{X'}^\top \in \mathbb{R}^{nC \times nC}$. For Red Wine a...
1. First, note that for most models the NLL rises with increasing $s$. In other words, the NLL rates subspace models that use fewer parameters as better.
2. Second, the full model has the highest NLL value. In other words, the NLL ranks it as the worst-performing model, whereas the models that approximate it perform better under this metric. It seems rather implausible that an approximation yields more precise estimates than the object it approximates. We therefore feel safe to conclude that the ra...
... Fisher information matrix of the predictive distribution ...
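The mini-batch workaround for the second obstacle mentioned above can be sketched as follows. This is an illustration under assumptions: the Jacobian is a random matrix, the placeholder for $\Psi_{\mathrm{approx}}$ is the identity, taking the leading $s$ left singular vectors as $U_s$ is a guess at the truncated text, and the helper `jt_u_minibatched` is a hypothetical name.

```python
import numpy as np


def jt_u_minibatched(jacobian_batches, U_s):
    """Accumulate J^T U_s from row blocks of J without stacking the full J."""
    out, row = None, 0
    for J_b in jacobian_batches:
        U_b = U_s[row:row + J_b.shape[0]]  # rows of U_s matching this batch
        term = J_b.T @ U_b
        out = term if out is None else out + term
        row += J_b.shape[0]
    return out


rng = np.random.default_rng(2)
n_rows, p, s, batch = 60, 15, 4, 16  # n_rows plays the role of n * C

J = rng.standard_normal((n_rows, p))  # hypothetical Jacobian on the subset X'
Psi = np.eye(p)                       # placeholder for Psi_approx (assumption)

# Assumption: U_s holds the leading s left singular vectors of J Psi J^T.
# This (n_rows x n_rows) SVD is the step that limits the size of the subset.
U, _, _ = np.linalg.svd(J @ Psi @ J.T)
U_s = U[:, :s]

# Stream the Jacobian in row blocks, as one would with mini-batches of data.
batches = (J[i:i + batch] for i in range(0, n_rows, batch))
P = jt_u_minibatched(batches, U_s)

print(P.shape)
```

The accumulated result matches `J.T @ U_s` computed in one shot, while only one batch of Jacobian rows has to be held in memory at a time. The SVD of the $nC \times nC$ matrix has no comparable streaming shortcut in this sketch, which is consistent with the excerpt's remark that the first obstacle sets the practical limit.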