
arxiv: 2502.02345 · v2 · submitted 2025-02-04 · 💻 cs.LG

Low Rank Based Subspace Inference for the Laplace Approximation of Bayesian Neural Networks

Pith reviewed 2026-05-23 03:37 UTC · model grok-4.3

classification 💻 cs.LG
keywords subspace inference · Laplace approximation · Bayesian neural networks · low-rank approximation · uncertainty quantification · covariance matrix · scalable inference

The pith

A low-rank subspace model for the Laplace approximation in Bayesian neural networks is optimal for a given dataset and closely matches the full approximation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper derives an expression for a subspace model in a Laplace-based Bayesian inference setting that is optimal in a certain sense for any specific dataset. It shows empirically that replacing the full covariance matrix with a dimensionally reduced version yields a Laplace approximation nearly identical to the exact one. The work also supplies a practical scalable version of this construction that outperforms prior subspace approaches and introduces a metric for comparing approximation quality when the full Laplace is unavailable.

Core claim

Subspace inference for neural networks assumes that a subspace of their parameter space suffices to produce a reliable uncertainty quantification. In this work, we underpin the validity of this assumption by using low rank techniques. We derive an expression for a subspace model to a Bayesian inference scenario based on the Laplace approximation that is, in a certain sense, optimal given a specific dataset. We empirically show that a Laplace approximation constructed with a dimensionally reduced covariance matrix closely matches the full Laplace approximation obtained using the exact covariance matrix.

What carries the argument

Low-rank approximation to the covariance matrix within the Laplace approximation, which produces the optimal subspace model for a given dataset.
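
The mechanism can be illustrated numerically. The following is a minimal NumPy sketch, not the authors' implementation: it assumes a linearized Laplace setting with Jacobian `J` and posterior covariance `Psi`, and it assumes the projection `P = Psi @ J.T @ U_s` (the paper's exact expression is not reproduced on this page and may differ). All function names and the toy data are hypothetical. Under these assumptions the subspace predictive covariance coincides with the best rank-s approximation of the full one.

```python
import numpy as np

def full_predictive_cov(J, Psi):
    """Epistemic predictive covariance J Psi J^T of the full (linearized) Laplace model."""
    return J @ Psi @ J.T

def subspace_predictive_cov(J, Psi, P):
    """Predictive covariance when the parameters are restricted to theta_hat + P z."""
    H = np.linalg.inv(Psi)                  # full posterior precision
    Psi_sub = np.linalg.inv(P.T @ H @ P)    # s x s posterior covariance in the subspace
    JP = J @ P
    return JP @ Psi_sub @ JP.T

def low_rank_projection(J, Psi, s):
    """Assumed construction: P = Psi J^T U_s, U_s = top-s eigenvectors of J Psi J^T."""
    Sigma = full_predictive_cov(J, Psi)
    _, eigvecs = np.linalg.eigh(Sigma)      # eigenvalues in ascending order
    U_s = eigvecs[:, ::-1][:, :s]           # reorder to descending, keep the first s
    return Psi @ J.T @ U_s

# Toy problem (random data, purely illustrative)
rng = np.random.default_rng(0)
n_out, p, s = 6, 20, 3
J = rng.standard_normal((n_out, p))
A = rng.standard_normal((p, p))
Psi = np.linalg.inv(A @ A.T + np.eye(p))    # symmetric positive definite covariance

Sigma = full_predictive_cov(J, Psi)
P = low_rank_projection(J, Psi, s)
Sigma_P = subspace_predictive_cov(J, Psi, P)

# Best rank-s approximation of Sigma via its eigendecomposition
eigvals, eigvecs = np.linalg.eigh(Sigma)
top = eigvecs[:, -s:]
Sigma_s = top @ np.diag(eigvals[-s:]) @ top.T

rel_err = np.linalg.norm(Sigma - Sigma_P) / np.linalg.norm(Sigma)
print("matches truncated eigendecomposition:", np.allclose(Sigma_P, Sigma_s))
print("relative Frobenius error:", rel_err)
```

The explicit matrix inverses are acceptable only at toy scale; for real networks neither `Psi` nor its inverse can be formed densely, which is the gap the paper's scalable variant is meant to close.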

If this is right

  • The derived subspace Laplace model can serve as a baseline for benchmarking other subspace inference methods.
  • The scalable approximation to the subspace construction can be used in practice on larger models where the full covariance is intractable.
  • The new metric enables direct qualitative comparison of different subspace models even when the exact Laplace is unknown.
  • Low-rank covariance reduction preserves the essential uncertainty properties of the Laplace posterior for the datasets tested.
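
The comparison of subspace models described in the points above can be sketched as follows. This is an illustration under stated assumptions: the page does not reproduce the paper's equations (20) and (25), so the relative error is assumed to be a normalized Frobenius distance and the trace criterion is assumed to be the trace of the subspace predictive covariance. The "subset" projection simply keeps the first s coordinates, which is an arbitrary stand-in for the subset methods from the literature.

```python
import numpy as np

def subspace_cov(J, H, P):
    """Predictive covariance J P (P^T H P)^{-1} P^T J^T for posterior precision H."""
    JP = J @ P
    return JP @ np.linalg.solve(P.T @ H @ P, JP.T)

def relative_error(Sigma_full, Sigma_sub):
    """Assumed form of the relative error: normalized Frobenius distance."""
    return np.linalg.norm(Sigma_full - Sigma_sub) / np.linalg.norm(Sigma_full)

def trace_criterion(Sigma_sub):
    """Assumed form of the trace criterion: needs no access to the full covariance."""
    return np.trace(Sigma_sub)

# Toy problem (random data, purely illustrative)
rng = np.random.default_rng(1)
n_out, p, s = 8, 30, 4
J = rng.standard_normal((n_out, p))
A = rng.standard_normal((p, p))
H = A @ A.T + np.eye(p)                       # symmetric positive definite precision
Sigma_full = J @ np.linalg.solve(H, J.T)

# Subset-style projection: keep s coordinates (here the first s, an arbitrary choice)
P_subset = np.eye(p)[:, :s]

# Low-rank-style projection (assumed construction): H^{-1} J^T U_s
_, eigvecs = np.linalg.eigh(Sigma_full)
P_lowrank = np.linalg.solve(H, J.T @ eigvecs[:, -s:])

results = {}
for name, P in {"subset": P_subset, "low-rank": P_lowrank}.items():
    Sigma_sub = subspace_cov(J, H, P)
    results[name] = (relative_error(Sigma_full, Sigma_sub), trace_criterion(Sigma_sub))
    print(name, results[name])
```

For any projection, the subspace covariance is dominated by the full covariance in the positive semidefinite order, so its trace never exceeds the full trace. That is why a larger trace can be read as a better approximation even when the full covariance is unavailable, at least under the assumed form of the criterion.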

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach may extend to other posterior approximations that rely on a covariance or precision matrix, such as variational methods with Gaussian assumptions.
  • If the low-rank structure holds across training runs, it could reduce memory costs when storing or sampling from the posterior in deployed systems.
  • The optimality property might be tested by checking whether the reduced model minimizes a chosen divergence to the full Laplace on new data distributions.
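
The memory argument in the second point can be made concrete with a small sketch. It assumes a subspace posterior of the form theta = theta_hat + P z with z drawn from a Gaussian with s x s covariance `Psi_sub`; the function name, the sizes, and the random matrices are hypothetical and only illustrate the bookkeeping.

```python
import numpy as np

def sample_subspace_posterior(theta_hat, P, Psi_sub, n_samples, rng):
    """Draw theta = theta_hat + P z with z ~ N(0, Psi_sub)."""
    L = np.linalg.cholesky(Psi_sub)                              # s x s Cholesky factor
    z = rng.standard_normal((n_samples, P.shape[1])) @ L.T       # rows are samples of z
    return theta_hat + z @ P.T                                   # shape (n_samples, p)

rng = np.random.default_rng(2)
p, s = 1000, 10
theta_hat = rng.standard_normal(p)        # stand-in for the trained parameters
P = rng.standard_normal((p, s))           # stand-in for a subspace basis
B = rng.standard_normal((s, s))
Psi_sub = B @ B.T + np.eye(s)             # symmetric positive definite

samples = sample_subspace_posterior(theta_hat, P, Psi_sub, 5, rng)

stored_subspace = P.size + Psi_sub.size   # p*s + s*s numbers
stored_full = p * p                       # a dense p x p covariance
print(stored_subspace, "vs", stored_full)
```

Every sample differs from `theta_hat` only within the span of `P`, so sampling never touches a p x p object.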

Load-bearing premise

A subspace of the parameter space is sufficient to capture the uncertainty that the full Laplace approximation would produce.

What would settle it

On a held-out dataset, compute both the full Laplace and the low-rank subspace Laplace posteriors and measure whether their predictive distributions or calibration metrics differ by more than a small tolerance.
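
A minimal sketch of such a check, assuming a linearized model in which predictive standard deviations follow from Jacobians and a posterior precision `H`. The GGN-style precision with unit prior precision, the projection construction, and the tolerance value are all assumptions made for illustration; the Jacobians are random, so the numerical outcome carries no meaning for real networks.

```python
import numpy as np

def predictive_std(J, H, P=None):
    """Per-output epistemic standard deviation; P=None selects the full model."""
    if P is None:
        cov = J @ np.linalg.solve(H, J.T)
    else:
        JP = J @ P
        cov = JP @ np.linalg.solve(P.T @ H @ P, JP.T)
    # Clip tiny negative values caused by round-off before taking the square root
    return np.sqrt(np.clip(np.diag(cov), 0.0, None))

rng = np.random.default_rng(3)
p, n_train, n_test, s = 40, 12, 5, 6
J_train = rng.standard_normal((n_train, p))
J_test = rng.standard_normal((n_test, p))
H = J_train.T @ J_train + np.eye(p)       # assumed GGN-style precision, unit prior

# Build the subspace from training data only (assumed construction H^{-1} J^T U_s)
Sigma_train = J_train @ np.linalg.solve(H, J_train.T)
_, U = np.linalg.eigh(Sigma_train)
P = np.linalg.solve(H, J_train.T @ U[:, -s:])

# Compare predictive uncertainty on held-out inputs
std_full = predictive_std(J_test, H)
std_sub = predictive_std(J_test, H, P)
max_gap = np.max(np.abs(std_full - std_sub))
tolerance = 0.05                          # arbitrary threshold for illustration
within_tolerance = bool(max_gap <= tolerance)
print("max gap:", max_gap, "within tolerance:", within_tolerance)
```

Because the subspace covariance can only underestimate the full one, the gap is one-sided: the question on held-out data is how much uncertainty the reduced model loses, not whether it adds any.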

Figures

Figures reproduced from arXiv: 2502.02345 by Jörg Martin, Josua Faller.

Figure 1: Comparison of low rank approximations and subset methods for different regression datasets. Different …
Figure 3: Relative error (20) (left) and trace criterion (25) (right) for corrupted MNIST datasets [43] and three different dimensions s = 100, 500, 1000 (shown by markers in increasing size). Different choices for P are indicated by different colours and marker shapes: square markers ■ indicate subset based methods, whereas discs • indicate low-rank based methods (proposed in this work). The colour coding is chos…
Figure 2: Relative error (20) and logarithm of trace (25) of the epistemic covariance matrix for MNIST and FashionMNIST.
… number of ‘dead parameters’ whose gradient is almost zero, which provides a natural subset to be selected. Indeed, ENB has the largest share of dead parameters, at 93%. More details on this investigation are given in Appendix E. A comparison between the first and the second row of …
Figure 4: Evaluation with the trace criterion (25) for CIFAR10 and ImageNet10 and different choices of P. Missing values in Figure 4b are due to vanishing trace values.
… to approximate $\Sigma_{P,X}$ well. 6 Conclusion. In this work we propose to look at subspace Laplace approximations of Bayesian neural networks through the lens of their predictive covariances. This approach allows us to derive the existence of an optimal subs…
Figure 5: Relative error (20) of the epistemic covariance matrix of the studied subset methods for s up to the number of parameters p for MNIST.
B Existence of an Optimal Subspace Model for the Laplace Approximation. Theorem (Existence of an optimal subspace model for the Laplace approximation). Consider the problem (21) with $s \le s_{\max} = \min(nC, p)$. Suppose that $J_X \in \mathbb{R}^{nC \times p}$ has full rank. For any invertible $Q \in \mathbb{R}^{s \times \ldots}$ …
Figure 6: The top displays a heatmap which highlights the activity of the gradients corresponding to the parameter …
Figure 7: The NLL metric (30) for the datasets and subspace models considered in this work. The colour and linestyle coding is identical to the one in …
Figure 8: Figure 8a visualizes the prediction quality of the parametric function …
Original abstract

Subspace inference for neural networks assumes that a subspace of their parameter space suffices to produce a reliable uncertainty quantification. In this work, we underpin the validity of this assumption by using low rank techniques. We derive an expression for a subspace model to a Bayesian inference scenario based on the Laplace approximation that is, in a certain sense, optimal given a specific dataset. We empirically show that a Laplace approximation constructed with a dimensionally reduced covariance matrix closely matches the full Laplace approximation obtained using the exact covariance matrix. Where feasible, this subspace model can serve as a baseline for benchmarking the performance of subspace models. In addition, we provide a scalable approximation of this subspace construction that is usable in practice and compare it to existing subspace models from the literature. In general, our approximation scheme outperforms previous work. Furthermore, we present a metric to qualitatively compare the approximation quality of different subspace models even if the exact Laplace approximation is unknown.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper derives an expression for an optimal low-rank subspace model in a Laplace-approximation Bayesian inference setting for neural networks, empirically demonstrates that a dimensionally reduced covariance Laplace approximation closely matches the full Laplace approximation, introduces a scalable practical version of this construction that outperforms prior subspace methods, and proposes a metric for qualitatively comparing subspace approximations when the exact Laplace is unavailable.

Significance. If the derivations and empirical matches hold under detailed scrutiny, the work supplies a theoretically grounded baseline for subspace inference specifically within the Laplace framework for BNNs, along with a usable scalable method and a comparison metric; these could serve as reference points for evaluating other subspace techniques and clarifying the conditions under which low-dimensional parameter subspaces suffice for uncertainty quantification.

major comments (2)
  1. [Abstract and Introduction] Abstract and introduction: the central motivation—to underpin the general assumption that a subspace of parameter space suffices for reliable uncertainty quantification—is supported only by agreement between reduced and full Laplace approximations. Because the Laplace approximation itself can deviate substantially from the true posterior (particularly in non-convex, high-dimensional BNN landscapes), internal matching within the Laplace setting does not directly validate the subspace assumption for actual Bayesian inference; an external check against sampling-based posteriors (e.g., HMC or MCMC on the same models) would be required to make this claim load-bearing.
  2. [Empirical Evaluation] Empirical section: the reported closeness of reduced to full Laplace and outperformance of the scalable version are presented without accompanying error analysis, variance estimates across random seeds, or ablation on the rank choice; without these, the strength of support for the optimality and practical utility claims cannot be verified.
minor comments (2)
  1. [Derivation] Notation for the low-rank factors and the definition of optimality should be introduced with explicit equations early in the derivation section to improve readability.
  2. [Figures] Figure captions for the qualitative metric comparisons should state the precise models, datasets, and rank values used so that the plots are self-contained.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each of the major comments below and outline the revisions we plan to make.

Point-by-point responses
  1. Referee: [Abstract and Introduction] Abstract and introduction: the central motivation—to underpin the general assumption that a subspace of parameter space suffices for reliable uncertainty quantification—is supported only by agreement between reduced and full Laplace approximations. Because the Laplace approximation itself can deviate substantially from the true posterior (particularly in non-convex, high-dimensional BNN landscapes), internal matching within the Laplace setting does not directly validate the subspace assumption for actual Bayesian inference; an external check against sampling-based posteriors (e.g., HMC or MCMC on the same models) would be required to make this claim load-bearing.

    Authors: We appreciate this point. Our manuscript is explicitly focused on the Laplace approximation setting for Bayesian neural networks, as reflected in the title and throughout the text. The goal is to derive an optimal low-rank subspace specifically for the Laplace-approximated posterior and to provide a baseline within that framework. We do not claim that this validates the subspace assumption for the true posterior. To address the concern, we will revise the abstract and introduction to more precisely state that the validation is internal to the Laplace approximation and that the work provides a theoretically grounded baseline for subspace methods in the Laplace context. We note that direct comparisons to HMC or MCMC are computationally infeasible for the network sizes considered in this work. revision: partial

  2. Referee: [Empirical Evaluation] Empirical section: the reported closeness of reduced to full Laplace and outperformance of the scalable version are presented without accompanying error analysis, variance estimates across random seeds, or ablation on the rank choice; without these, the strength of support for the optimality and practical utility claims cannot be verified.

    Authors: We agree that additional statistical rigor would strengthen the empirical claims. In the revised version, we will include variance estimates across multiple random seeds where applicable, provide error analysis for the reported metrics, and add an ablation study on the effect of the rank choice on the approximation quality. revision: yes
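
A rank ablation with seed-to-seed variability of the kind requested could look like the sketch below. It is only a toy illustration: the seeds index random synthetic problems rather than training runs, the relative error is assumed to be a normalized Frobenius distance, and the low-rank projection is the assumed construction H^{-1} J^T U_s. None of this reproduces the paper's experiments.

```python
import numpy as np

def low_rank_relative_error(J, H, s):
    """Relative Frobenius error of the rank-s subspace predictive covariance."""
    Sigma = J @ np.linalg.solve(H, J.T)
    _, U = np.linalg.eigh(Sigma)                       # ascending eigenvalues
    P = np.linalg.solve(H, J.T @ U[:, -s:])            # assumed low-rank projection
    JP = J @ P
    Sigma_P = JP @ np.linalg.solve(P.T @ H @ P, JP.T)
    return np.linalg.norm(Sigma - Sigma_P) / np.linalg.norm(Sigma)

ranks = [1, 2, 4, 8]
seeds = range(5)
errors = np.empty((len(seeds), len(ranks)))
for i, seed in enumerate(seeds):
    rng = np.random.default_rng(seed)
    J = rng.standard_normal((8, 25))                   # 8 outputs, 25 parameters
    A = rng.standard_normal((25, 25))
    H = A @ A.T + np.eye(25)                           # symmetric positive definite precision
    for j, s in enumerate(ranks):
        errors[i, j] = low_rank_relative_error(J, H, s)

# Mean and sample standard deviation across seeds, per rank
mean = errors.mean(axis=0)
std = errors.std(axis=0, ddof=1)
for s, m, sd in zip(ranks, mean, std):
    print(f"s={s}: relative error {m:.3f} ± {sd:.3f}")
```

In this toy setting the error decreases monotonically in s and vanishes once s reaches the number of outputs, which gives a sanity check for any real ablation table.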

Circularity Check

0 steps flagged

No circularity: derivation uses standard low-rank approximation on Laplace covariance without self-referential reduction

Full rationale

The paper derives an optimal low-rank subspace expression for the Laplace-approximated posterior by applying standard matrix low-rank techniques (e.g., SVD or similar) directly to the Hessian-derived covariance; this is a conventional approximation step whose optimality follows from the Eckart-Young theorem applied to the given matrix, not from any redefinition of the target quantity in terms of itself. The empirical claim is a direct numerical comparison of the full Laplace (exact covariance) versus the reduced-covariance version on the same models, which is an independent verification within the Laplace setting rather than a fitted input renamed as prediction. No load-bearing step relies on self-citation chains, uniqueness theorems imported from the authors' prior work, or ansatzes smuggled via citation. The derivation remains self-contained against the Laplace approximation as its explicit benchmark.
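
The Eckart-Young step can be written out in a few lines. The derivation below is a sketch in assumed notation ($J$ for the Jacobian, $\Psi$ for the full Laplace covariance, $P$ for the $p \times s$ subspace basis) and with an assumed choice of $P$; the paper's own statement involves an additional invertible matrix $Q$ and may be formulated differently.

```latex
% Assumed notation, not taken verbatim from the paper:
%   J    : Jacobian of the network outputs with respect to the parameters
%   \Psi : covariance of the full Laplace posterior
%   P    : p x s matrix spanning the subspace, \theta = \hat\theta + P z
\begin{align*}
  \Sigma   &= J \Psi J^\top = U \Lambda U^\top
              && \text{(full predictive covariance)} \\
  \Sigma_P &= J P \,\bigl(P^\top \Psi^{-1} P\bigr)^{-1} P^\top J^\top
              && \text{(subspace predictive covariance)}
\end{align*}
% Choose P = \Psi J^\top U_s, where U_s holds the eigenvectors belonging to
% the s largest eigenvalues \Lambda_s of \Sigma. Then:
\begin{align*}
  P^\top \Psi^{-1} P &= U_s^\top J \Psi J^\top U_s = U_s^\top \Sigma U_s = \Lambda_s, \\
  J P                &= J \Psi J^\top U_s = \Sigma U_s = U_s \Lambda_s, \\
  \Sigma_P           &= U_s \Lambda_s \, \Lambda_s^{-1} \, \Lambda_s U_s^\top
                      = U_s \Lambda_s U_s^\top .
\end{align*}
```

The last line is the truncated eigendecomposition of $\Sigma$, which by the Eckart-Young theorem is its best approximation of rank at most $s$ in the Frobenius norm. Since every $\Sigma_P$ has rank at most $s$, no other choice of $P$ can do better under that norm.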

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract does not detail any free parameters, axioms, or invented entities; the work extends existing Laplace approximation and low-rank matrix techniques without introducing new postulated quantities.

pith-pipeline@v0.9.0 · 5684 in / 1048 out tokens · 22522 ms · 2026-05-23T03:37:21.494874+00:00 · methodology


Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · 13 internal anchors

  1. [1]

    Uncertainty in deep learning

    Yarin Gal. Uncertainty in deep learning

  2. [2]

    URL https://www.cs.ox.ac.uk/people/yarin.gal/website/thesis/thesis.pdf

  3. [3]

    Weight uncertainty in neural network

    Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural network. In International Conference on Machine Learning, pages 1613–1622. PMLR, 2015

  4. [4]

    What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?

    Alex Kendall and Yarin Gal. What uncertainties do we need in Bayesian deep learning for computer vision? arXiv preprint arXiv:1703.04977, 2017

  5. [5]

    Probabilistic backpropagation for scalable learning of Bayesian neural networks

    José Miguel Hernández-Lobato and Ryan Adams. Probabilistic backpropagation for scalable learning of Bayesian neural networks. In International conference on machine learning, pages 1861–1869. PMLR, 2015

  6. [6]

    A simple baseline for Bayesian uncertainty in deep learning

    Wesley J Maddox, Pavel Izmailov, Timur Garipov, Dmitry P Vetrov, and Andrew Gordon Wilson. A simple baseline for Bayesian uncertainty in deep learning. Advances in neural information processing systems, 32, 2019

  7. [7]

    Variational Dropout and the Local Reparameterization Trick

    Diederik P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. arXiv preprint arXiv:1506.02557, 2015

  8. [8]

    An introduction to variational methods for graphical models

    Michael I. Jordan, Zoubin Ghahramani, T. Jaakkola, and Lawrence K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37:183–233, 1999

  9. [9]

    Graphical models, exponential families, and variational inference

    Martin J. Wainwright and Michael I. Jordan. Graphical models, exponential families, and variational inference. Found. Trends Mach. Learn., 1:1–305, 2008

  10. [10]

    A practical Bayesian framework for backpropagation networks

    David John Cameron MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 4:448–472, 1992

  11. [11]

    Optimal brain damage

    Yann LeCun, John Denker, and Sara Solla. Optimal brain damage. Advances in neural information processing systems, 2, 1989

  12. [12]

    A scalable Laplace approximation for neural networks

    Hippolyt Ritter, Aleksandar Botev, and David Barber. A scalable Laplace approximation for neural networks. In International Conference on Learning Representations, 2018

  13. [13]

    Laplace redux - effortless Bayesian deep learning

    Erik A. Daxberger, Agustinus Kristiadi, Alexander Immer, Runa Eschenhagen, M. Bauer, and Philipp Hennig. Laplace redux - effortless Bayesian deep learning. In Neural Information Processing Systems, 2021

  14. [14]

    Identifying and attacking the saddle point problem in high-dimensional non-convex optimization

    Yann N. Dauphin, Razvan Pascanu, Çaglar Gülçehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. arXiv preprint: 1406.2572, 2014. URL http://arxiv.org/abs/1406.2572

  15. [15]

    Eigenvalues of the Hessian in Deep Learning: Singularity and Beyond

    Levent Sagun, Léon Bottou, and Yann LeCun. Eigenvalues of the Hessian in deep learning: Singularity and beyond. arXiv preprint: 1611.07476, 2016. URL https://arxiv.org/abs/1611.07476

  16. [16]

    The Full Spectrum of Deepnet Hessians at Scale: Dynamics with SGD Training and Sample Size

    Vardan Papyan. The full spectrum of deepnet Hessians at scale: Dynamics with SGD training and sample size. arXiv preprint: 1811.07062, 2018. URL https://arxiv.org/abs/1811.07062

  17. [17]

    Fast Curvature Matrix-Vector Products for Second-Order Gradient Descent

    Nicol N. Schraudolph. Fast Curvature Matrix-Vector Products for Second-Order Gradient Descent. Neural Computation, 14(7):1723–1738, 07 2002. ISSN 0899-

  18. [18]

    doi: 10.1162/08997660260028683

  19. [19]

    Revisiting Natural Gradient for Deep Networks

    Razvan Pascanu and Yoshua Bengio. Revisiting natural gradient for deep networks. arXiv preprint: 1301.3584, 2013

  20. [21]

    URL http://arxiv.org/abs/1412.1193

  21. [22]

    Tim Salimans and Diederik P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. arXiv preprint: 1602.07868, 2016. URL https://arxiv.org/abs/1602.07868

  22. [23]

    Overcoming catastrophic forgetting in neural networks

    James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017

  23. [25]

    URL https://arxiv.org/abs/2002.10118

  24. [26]

    Scalable Bayesian Optimization Using Deep Neural Networks

    Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Md. Mostofa Ali Patwary, Prabhat, and Ryan P. Adams. Scalable Bayesian optimization using deep neural networks. arXiv preprint: 1502.05700, 2015. URL https://arxiv.org/abs/1502.05700

  25. [27]

    Subspace Inference for Bayesian Deep Learning

    Pavel Izmailov, Wesley J. Maddox, Polina Kirichenko, Timur Garipov, Dmitry P. Vetrov, and Andrew Gordon Wilson. Subspace inference for Bayesian deep learning. arXiv preprint: 1907.07504, 2019.

  26. [28]

    Bayesian deep learning via subnetwork inference

    E. Daxberger, E. Nalisnick, J. Allingham, J. Antorán, and J. M. Hernández-Lobato. Bayesian deep learning via subnetwork inference. In Proceedings of 38th International Conference on Machine Learning (ICML), volume 139 of Proceedings of Machine Learning Research, pages 2510–2521. PMLR, July 2021. URL https://proceedings.mlr.press/v139/daxberger21a.html

  27. [29]

    Do Bayesian neural networks need to be fully stochastic?

    Mrinank Sharma, Sebastian Farquhar, Eric Nalisnick, and Tom Rainforth. Do Bayesian neural networks need to be fully stochastic? arXiv preprint: 2211.06291, 2023. URL https://arxiv.org/abs/2211.06291

  28. [30]

    A survey of model compression and acceleration for deep neural networks

    Yu Cheng, Duo Wang, Pan Zhou, and Zhang Tao. A survey of model compression and acceleration for deep neural networks. ArXiv, abs/1710.09282, 2017

  29. [31]

    Andrew Y. K. Foong, Yingzhen Li, José Miguel Hernández-Lobato, and Richard E. Turner. 'In-between' uncertainty in Bayesian neural networks. arXiv preprint: 1906.11537, 2019. URL https://arxiv.org/abs/1906.11537

  30. [33]

    URL https://arxiv.org/abs/2008.08400

  31. [34]

    Accelerated linearized Laplace approximation for Bayesian deep learning

    Zhijie Deng, Feng Zhou, and Jun Zhu. Accelerated linearized Laplace approximation for Bayesian deep learning. ArXiv, abs/2210.12642, 2022

  32. [35]

    Variational linearized Laplace approximation for Bayesian deep learning

    Luis A. Ortega, Simón Rodríguez Santana, and Daniel Hernández-Lobato. Variational linearized Laplace approximation for Bayesian deep learning. ArXiv, abs/2302.12565, 2023

  33. [36]

    The evidence framework applied to classification networks

    David John Cameron MacKay. The evidence framework applied to classification networks. Neural Computation, 4:720–736, 1992

  34. [37]

    Practical Gauss-Newton optimisation for deep learning

    Aleksandar Botev, Hippolyt Ritter, and David Barber. Practical Gauss-Newton optimisation for deep learning. In International Conference on Machine Learning, 2017

  35. [38]

    Optimizing neural networks with Kronecker-factored approximate curvature

    James Martens and Roger Baker Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning, 2015

  36. [39]

    Partially stochastic infinitely deep Bayesian neural networks

    Sergio Calvo-Ordonez, Matthieu Meunier, Francesco Piatti, and Yuantao Shi. Partially stochastic infinitely deep Bayesian neural networks. arXiv preprint: 2402.03495, 2024. URL https://arxiv.org/abs/2402.03495

  37. [40]

    Tom M. Heskes. On natural learning and pruning in multilayered perceptrons. Neural Computation, 12:881–901, 2000

  38. [41]

    Christopher M. Bishop. Pattern recognition and machine learning, 5th Edition. Information science and statistics. Springer, 2007. ISBN 9780387310732. URL https://www.worldcat.org/oclc/71008143

  39. [42]

    Zur Theorie der linearen und nichtlinearen Integralgleichungen

    Erhard Schmidt. Zur Theorie der linearen und nichtlinearen Integralgleichungen. Mathematische Annalen, 63(4):433–476, Dec 1907. ISSN 1432-

  40. [43]

    URL https://doi.org/10.1007/BF01449770

    doi: 10.1007/BF01449770. URL https://doi.org/10.1007/BF01449770

  41. [44]

    The approximation of one matrix by another of lower rank

    Carl Eckart and Gale Young. The approximation of one matrix by another of lower rank. Psychometrika, 1(3):211–218, Sep 1936. ISSN 1860-

  42. [45]

    URL https://doi.org/10.1007/BF02288367

    doi: 10.1007/BF02288367. URL https://doi.org/10.1007/BF02288367

  43. [46]

    L. Mirsky. Symmetric gauge functions and unitarily invariant norms. The Quarterly Journal of Mathematics, 11(1):50–59, 01 1960. ISSN 0033-

  44. [47]

    URL https://doi.org/10.1093/qmath/11.1.50

    doi: 10.1093/qmath/11.1.50. URL https://doi.org/10.1093/qmath/11.1.50

  45. [48]

    OpenML: Networked science in machine learning

    Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. OpenML: Networked science in machine learning. SIGKDD Explorations, 15(2):49–60, 2013. doi: 10.1145/2641190.2641198. URL http://doi.acm.org/10.1145/2641190.2641198

  46. [49]

    OpenML-Python: an extensible Python API for OpenML

    Matthias Feurer, Jan N. van Rijn, Arlind Kadra, Pieter Gijsbers, Neeratyoy Mallik, Sahithya Ravi, Andreas Mueller, Joaquin Vanschoren, and Frank Hutter. OpenML-Python: an extensible Python API for OpenML. arXiv, 1911.02490. URL https://arxiv.org/pdf/1911.02490.pdf

  47. [50]

    Gradient-based learning applied to document recognition

    Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proc. IEEE, 86:2278–2324, 1998

  48. [51]

    MNIST-C: A Robustness Benchmark for Computer Vision

    Norman Mu and Justin Gilmer. MNIST-C: A robustness benchmark for computer vision, 2019. URL https://arxiv.org/abs/1906.02337

  49. [52]

    Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms

    Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. ArXiv, abs/1708.07747, 2017

  50. [53]

    Learning multiple layers of features from tiny images

    Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009. https://www.cs.toronto.edu/~kriz/cifar.html

  51. [54]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009. doi: 10.1109/CVPR.2009.5206848.

  52. [55]

    Pytorch: An imperative style, high-performance deep learning library, 2019

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-per...

  53. [56]

    Deep Residual Learning for Image Recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015. URL http://arxiv.org/abs/1512.03385

  54. [57]

    Simple and scalable predictive uncertainty estimation using deep ensembles

    Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in neural information processing systems, 30, 2017

  55. [58]

    J. Yao, W. Pan, S. Ghosh, and F. Doshi-Velez. Quality of uncertainty quantification for Bayesian neural network inference. 2019

  56. [59]

    Are you using test log-likelihood correctly?

    Sameer K. Deshpande, Soumya Ghosh, Tin D. Nguyen, and Tamara Broderick. Are you using test log-likelihood correctly? Trans. Mach. Learn. Res., 2024, 2024. URL https://openreview.net/forum?id=n2YifD4Dxo.

    dataset            α_init   n_epoch   warm up / decay
    Red Wine           0.0004   300       (0.3/0.3)
    ENB                0.004    1500      (0.1/0.5)
    California         0.0004   100       (0.3/0.5)
    Naval Propulsion   0.0004   100       (0....

  57. [60]

    The object $U_s$ needs to be computable

  58. [61]

    Obstacle 2 is rather straightforward to circumvent as we can compute the matrix product via mini-batches from the training data

    The computation of the product $J_{X'}^\top U_s$ needs to be feasible. Obstacle 2 is rather straightforward to circumvent, as we can compute the matrix product via mini-batches from the training data. It turns out that Obstacle 1 sets the actual limit on the subset of training data, as we compute $U_s$ via an SVD of the object $J_{X'} \Psi_{\mathrm{approx}} J_{X'}^\top \in \mathbb{R}^{nC \times nC}$. For Red Wine a...

  59. [62]

    In other words, the NLL evaluates subspace models that use fewer parameters as better

    First, note that for most models the NLL rises with increasing s. In other words, the NLL evaluates subspace models that use fewer parameters as better

  60. [63]

    Fisher information matrix of the predictive distribution

    Second, the full model has the highest NLL value. In other words, the NLL ranks it as the worst performing model, whereas the models that approximate it perform better under this metric. It seems rather implausible that an approximated object yields more precise estimates than the object which it approximates. We therefore feel safe to conclude that the ra...