Is the Last Layer Sufficient for Uncertainty Quantification?

Chris van der Heide; Fred Roosta; Joseph Wilson; Liam Hodgkinson

arxiv: 2605.30741 · v1 · pith:OA33ZWJ7new · submitted 2026-05-29 · 📊 stat.ML · cs.LG

Pith reviewed 2026-06-28 21:31 UTC · model grok-4.3

keywords: epistemic uncertainty, deep neural networks, last-layer linearization, Bayesian generalized linear models, random matrix theory, predictive posterior, uncertainty quantification, computational efficiency

The pith

Last-layer linearization matches full-network performance for epistemic uncertainty quantification but with substantially lower computational cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates whether approximating a deep neural network's epistemic uncertainty by linearizing only its final layer produces results as reliable as linearizing the entire network. It applies random matrix theory to compare the two approaches theoretically and finds no meaningful advantage for the full linearization. Large-scale experiments on diverse machine learning tasks then confirm that the last-layer method delivers comparable uncertainty estimates while requiring far less computation to obtain the predictive posterior.

Core claim

By comparing Bayesian generalized linear models obtained from full-network linearization versus last-layer linearization of DNNs, using both random matrix theory for theoretical comparison and empirical evaluation across modern tasks, the analysis concludes that a last-layer approximation yields comparable UQ performance while offering substantially improved computational efficiency.

What carries the argument

Last-layer linearization of a DNN to produce a Bayesian generalized linear model whose predictive posterior supplies the epistemic uncertainty estimate.
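
To make the two linearization scopes concrete, here is a minimal NumPy sketch, not the paper's implementation. It assumes a Gaussian likelihood with known noise variance, an isotropic Gaussian prior on the parameters, and a toy one-hidden-layer network. The names `mlp`, `last_layer_jacobian`, `full_jacobian`, `glm_predictive_variance`, `noise_var` and `prior_prec` are all hypothetical. Under those assumptions the GLM is Bayesian linear regression on the Jacobian features, and the epistemic variance at a test point is a quadratic form in the posterior covariance.

```python
import numpy as np


def mlp(params, x):
    """Toy network f(x) = w2 . tanh(W1 x) with scalar output (illustrative only)."""
    W1, w2 = params
    return np.tanh(W1 @ x) @ w2


def last_layer_jacobian(params, x):
    """Gradient of f with respect to the last-layer weights w2: the hidden features."""
    W1, _ = params
    return np.tanh(W1 @ x)


def full_jacobian(params, x):
    """Gradient of f with respect to all parameters (W1 flattened row-major, then w2)."""
    W1, w2 = params
    h = np.tanh(W1 @ x)
    dW1 = np.outer(w2 * (1.0 - h**2), x)  # d f / d W1[j, k] = w2[j] (1 - h[j]^2) x[k]
    return np.concatenate([dW1.ravel(), h])


def glm_predictive_variance(jac_fn, params, X_train, x_star,
                            noise_var=0.1, prior_prec=1.0):
    """Epistemic variance j*^T (J^T J / noise_var + prior_prec I)^{-1} j*.

    ASSUMPTION: Gaussian likelihood and isotropic Gaussian prior, so the
    linearized model is ordinary Bayesian linear regression.
    """
    J = np.stack([jac_fn(params, x) for x in X_train])  # shape (n, p)
    precision = J.T @ J / noise_var + prior_prec * np.eye(J.shape[1])
    j_star = jac_fn(params, x_star)
    return float(j_star @ np.linalg.solve(precision, j_star))


# Synthetic data and randomly drawn "trained" weights, purely for illustration.
rng = np.random.default_rng(0)
params = (rng.normal(size=(16, 2)), rng.normal(size=16) / 4.0)
X_train = rng.normal(size=(50, 2))
x_star = np.array([3.0, -3.0])  # a point far from the training inputs

var_ll = glm_predictive_variance(last_layer_jacobian, params, X_train, x_star)
var_full = glm_predictive_variance(full_jacobian, params, X_train, x_star)
print("last-layer variance:", var_ll, "parameters:", params[1].size)
print("full-network variance:", var_full, "parameters:", params[0].size + params[1].size)
```

The cost difference shows up in the size of the linear solve: the last-layer model inverts a 16 × 16 matrix here, while the full model inverts a 48 × 48 one, and that gap grows with network depth and width.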

If this is right

Last-layer linearization can be substituted for full linearization without degrading UQ quality in the tested regimes.
The reduced computational burden makes Bayesian UQ practical for larger or deeper networks.
Resources previously spent on full-network posterior approximations can be reallocated to model capacity or data scale.
Existing full-linearization pipelines may be simplified to last-layer versions while preserving performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Uncertainty behavior in these networks may be dominated by the final layer's parameter posterior rather than earlier layers.
Targeted last-layer methods could be designed from the start to further reduce overhead beyond current approximations.
The same sufficiency pattern might appear in other post-training analyses such as gradient-based attributions.
Deployment settings with strict latency constraints can adopt last-layer UQ with reduced risk of performance loss.

Load-bearing premise

The random matrix theory analysis assumes that the linearization approximations and network scaling regimes used in the theoretical comparison accurately capture the behavior of practical DNNs on real data distributions.

What would settle it

An experiment on a standard benchmark where full-network linearization produces statistically significantly better-calibrated uncertainty estimates or superior out-of-distribution detection than last-layer linearization would refute the claim of comparability.
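
One way such a test could be run is a paired bootstrap over matched training runs. The sketch below is only an illustration of that idea: the function name `paired_bootstrap_ci`, the choice of a percentile bootstrap, and the AUROC numbers are all hypothetical and are not taken from the paper.

```python
import numpy as np


def paired_bootstrap_ci(metric_full, metric_ll, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(metric_full - metric_ll).

    Each pair is assumed to come from the same seed/dataset split, so the
    resampling is done over paired differences.
    """
    diffs = np.asarray(metric_full, dtype=float) - np.asarray(metric_ll, dtype=float)
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(diffs), size=(n_boot, len(diffs)))
    boot_means = diffs[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return float(diffs.mean()), float(lo), float(hi)


# Hypothetical per-seed OOD-detection AUROC values for the two methods.
auroc_full = [0.91, 0.90, 0.92, 0.89, 0.91]
auroc_ll = [0.90, 0.91, 0.92, 0.89, 0.90]

mean_diff, lo, hi = paired_bootstrap_ci(auroc_full, auroc_ll)
# The difference counts as significant only if the interval excludes zero.
significant = lo > 0 or hi < 0
print(mean_diff, (lo, hi), significant)
```

An interval that excludes zero in favour of full-network linearization, on a standard benchmark, would be the kind of evidence described above.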

Figures

Figures reproduced from arXiv: 2605.30741 by Chris van der Heide, Fred Roosta, Joseph Wilson, Liam Hodgkinson.

**Figure 1.** Variance of the maximum softmax probability, on the (top) Three Islands dataset and (bottom) Two Moons dataset. We see that a last-layer approximation does not affect quality of UQ. Excerpt from Figure 14a.

**Figure 2.** BFE as a function of γ for the (top) CK and (bottom) NTK. Top left has λ∗ = 1/(1 − τ), bottom left has λ∗ = Σ_{i=0}^{L} a_σ^i/(1 − τ), and the right plots have λ = 0.01. The dotted blue line is the minimum BFE, (1 + ln 2π)/2. We observe that at λ∗, strong descent occurs.

**Figure 4.** BFE as a function of epochs of training, for (top) Gaussian data with λ = λ∗ and (bottom) a teacher network y = sin(wᵀx) with λ = 1, and τ = 10⁻³.

**Figure 5.** Toy regression problem. Here, the red points are training points, the black line is the underlying curve, the blue line is the mean prediction, and the green shading represents ±2σ(x), where σ(x) is the computed variance.

**Figure 6.** VARROC-ID (V-ID), VARROC-OOD (V-OOD), VARROC-MI-ID (MI-ID) and VARROC-MI-OOD (MI-OOD) results for image classification tasks.

**Figure 7.** BFE for (left) CK and (right) NTK for normalized Gaussian input and output data, as a function of layer width γ_l = γ_0 = ⋯ = γ_L, for τ = 0.1, λ = 0.5. The empirical BFE is calculated with Gaussian input data, and is represented with circles, while the limiting BFE in (MP) is represented with dashed lines.

**Figure 8.** Log-determinant of K_X + τλ as a function of epochs of training, for (top) Gaussian data with λ = λ∗ and (bottom) a teacher network y = sin(wᵀx) with λ = 1, and τ = 10⁻³.

**Figure 9.** Data-fit term for K_X + τλ as a function of epochs of training, for (top) Gaussian data with λ = λ∗ and (bottom) a teacher network y = sin(wᵀx) with λ = 1, and τ = 10⁻³. Extracted plot labels: panels CK and NTK; legend d_l = 50, 100, 200; x-axis Epochs; y-axis Mean Bayes Free Energy.

**Figure 10.** Data-fit term for K_X + τλ, under a zero-mean prior, as a function of epochs of training, for (top) Gaussian data with λ = λ∗ and (bottom) a teacher network y = sin(wᵀx) with λ = 1, and τ = 10⁻³.

**Figure 11.** BFE as a function of epochs of training, for (top) MNIST and (bottom) CIFAR-10, with λ = 1, and τ = 10⁻³.

**Figure 12.** BFE as a function of epochs of training, for UCI regression datasets, for a small MLP trained with Adam for 1500 epochs using a learning rate 0.01, with λ = 1, τ = 0.001. We see that BFE for CK drops below that of the NTK eventually.

**Figure 13.** Normalized singular value distribution for final trained features, for several datasets and models. The top rows display regression datasets, while the bottom row gives image classification datasets. We consistently see many small singular values. The dotted black line denotes machine epsilon for float32, while the solid black line denotes machine epsilon for float16.
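
The diagnostic in Figure 13 can be reproduced in spirit with a few lines of NumPy. This is a sketch under stated assumptions: "normalized" is taken to mean division by the largest singular value, the feature matrix is synthetic and low-rank by construction, and the helper names `normalized_singular_values` and `count_below_eps` are hypothetical.

```python
import numpy as np


def normalized_singular_values(features):
    """Singular values of a feature matrix divided by the largest one.

    ASSUMPTION: this is the normalization meant in the caption.
    """
    s = np.linalg.svd(np.asarray(features, dtype=np.float64), compute_uv=False)
    return s / s[0]


def count_below_eps(features):
    """Count normalized singular values below float32 and float16 machine epsilon."""
    s = normalized_singular_values(features)
    return {
        "float32": int(np.sum(s < np.finfo(np.float32).eps)),
        "float16": int(np.sum(s < np.finfo(np.float16).eps)),
    }


# Synthetic stand-in for last-layer features: 200 inputs, 64 features, rank 8.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 8)) @ rng.normal(size=(8, 64))

counts = count_below_eps(features)
print(counts)
```

Directions whose normalized singular value falls below the working precision's machine epsilon are numerically indistinguishable from zero at that precision, which is why the caption draws those two reference lines.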

**Figure 14.** Evaluation of distance-aware property of Bayesian methods.

Original abstract

Epistemic uncertainty quantification (UQ) for deep neural networks (DNNs) is a requirement for safe adoption of AI in mission-critical settings. Several leading methods for UQ linearize DNNs to form Bayesian Generalized Linear Models (GLMs), where epistemic uncertainty is modeled via the predictive posterior distribution. Linearizing around the parameters of the final connected layer of a DNN is a commonly used approximation for reducing the computational burden of such GLMs, though it is often believed to come at the cost of degraded performance. In this work, we compare GLMs arising from full-network and last-layer linearization using both theoretical and empirical approaches. We first employ tools from random matrix theory to conduct a theoretical comparison; this analysis reveals no meaningful improvement in the UQ capabilities of full linearization. Coupled with a large-scale empirical evaluation across a range of modern machine learning tasks, we arrive at the following conclusion: a last-layer approximation yields comparable UQ performance while offering substantially improved computational efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Last-layer linearization matches full-network UQ performance per their RMT comparison and experiments, but the theory rests on unverified scaling assumptions.

The paper concludes that last-layer linearization for Bayesian GLMs in DNNs delivers epistemic UQ performance close to full-network linearization while cutting compute substantially. Random matrix theory is used to show no meaningful difference between the two, and this is paired with empirical tests across several modern tasks.

The RMT analysis is the clearest new piece. Most last-layer work stays empirical or uses simpler heuristics; bringing in random matrix tools for a direct comparison of the linearization scopes, then testing at scale, is a step beyond the usual treatment.

The soft spot is the set of modeling choices in the RMT part. The comparison assumes particular linearization points, width and depth scaling regimes, and random-matrix approximations to the Hessian or feature covariances. If these do not reproduce the posterior geometry of finite-width networks on real, non-Gaussian data, the theoretical half weakens and the claim falls back on the experiments alone. The abstract supplies no derivation details, error bars, or dataset descriptions, so the strength of the empirical side is also hard to judge without the full text.

This is aimed at practitioners who need epistemic UQ in large models but want to avoid the full cost of linearizing every layer. Readers working on efficient safety-critical applications or on theoretical analysis of neural posteriors will get the most from it. The question is relevant and the paper attempts both theory and data, so it deserves a serious referee.

I would send it to peer review.

Referee Report

2 major / 2 minor

Summary. The paper claims that epistemic UQ via Bayesian GLMs obtained by DNN linearization can use a last-layer approximation without meaningful loss in performance relative to full-network linearization. This is supported by a random-matrix-theory comparison showing no improvement from full linearization, together with large-scale empirical results across modern ML tasks, leading to the conclusion that last-layer linearization is sufficient while being substantially more computationally efficient.

Significance. If the central claim holds, the result would justify a widely applicable computational shortcut for UQ in DNNs, reducing the cost of posterior inference while preserving predictive uncertainty quality; this would be a practically useful finding for safety-critical applications.

major comments (2)

[RMT analysis section] The conclusion that full-network linearization yields 'no meaningful improvement' rests on specific modeling choices for the linearization point, width/depth scaling, and random-matrix approximations to the Hessian/feature covariances. The fidelity of these regimes to finite-width networks on real non-Gaussian data is not verified, which directly undermines the theoretical half of the central claim.
[Empirical evaluation] The abstract and available text provide no derivation details, error bars, dataset descriptions, or quantitative metrics (e.g., specific UQ scores or statistical tests), so the claim of 'comparable UQ performance' cannot be assessed for robustness or effect size from the presented material.

minor comments (2)

Clarify the precise definition of 'comparable' (e.g., within what tolerance on which metric) in both the theoretical and empirical parts.
Ensure all notation for the GLM posterior and linearization is introduced with explicit equations before the RMT comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the scope and presentation of our results. We respond to each major comment below and indicate planned revisions.

Referee: [RMT analysis section] The conclusion that full-network linearization yields 'no meaningful improvement' rests on specific modeling choices for the linearization point, width/depth scaling, and random-matrix approximations to the Hessian/feature covariances. The fidelity of these regimes to finite-width networks on real non-Gaussian data is not verified, which directly undermines the theoretical half of the central claim.

Authors: The RMT analysis is conducted under standard asymptotic regimes (infinite width and depth) with explicitly stated modeling choices for the linearization point and covariance approximations; these are common in the literature for obtaining closed-form insights into DNN Hessians and feature maps. The analysis shows that, within these regimes, full-network linearization does not yield meaningful improvement over last-layer linearization. We agree that direct verification of the RMT approximations against finite-width real-data Hessians would be valuable additional evidence, but the manuscript's large-scale empirical evaluation on non-Gaussian real-world tasks provides complementary validation of the practical conclusion. In revision we will add an explicit limitations paragraph discussing the asymptotic assumptions and their relation to the empirical results. Revision: partial.

Referee: [Empirical evaluation] The abstract and available text provide no derivation details, error bars, dataset descriptions, or quantitative metrics (e.g., specific UQ scores or statistical tests), so the claim of 'comparable UQ performance' cannot be assessed for robustness or effect size from the presented material.

Authors: The full manuscript contains a dedicated empirical section that specifies all datasets, the exact UQ metrics (negative log-likelihood, expected calibration error, Brier score), error bars obtained from multiple independent runs with different random seeds, and direct comparisons between last-layer and full-network linearization. The abstract is intentionally concise. To improve accessibility we will add a summary table of key quantitative results and explicit cross-references in the revision. Revision: yes.
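
For readers unfamiliar with the three metrics named in this response, the following NumPy sketch shows one common way to compute them for a classifier. It is illustrative only: the function names, the equal-width 10-bin ECE, and the sum-over-classes Brier convention are assumptions, and the probabilities are made up.

```python
import numpy as np


def nll(probs, labels):
    """Mean negative log-likelihood of the true class."""
    p_true = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(np.clip(p_true, 1e-12, 1.0))))


def brier(probs, labels):
    """Multi-class Brier score: squared error to the one-hot label, summed over classes."""
    onehot = np.eye(probs.shape[1])[labels]
    return float(np.mean(np.sum((probs - onehot) ** 2, axis=1)))


def ece(probs, labels, n_bins=10):
    """Expected calibration error with equal-width confidence bins."""
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            # Bin weight times |accuracy - mean confidence| within the bin.
            total += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return float(total)


# Made-up predictive probabilities for three test points and two classes.
probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
labels = np.array([0, 1, 1])

nll_val, brier_val, ece_val = nll(probs, labels), brier(probs, labels), ece(probs, labels)
print(nll_val, brier_val, ece_val)
```

Reporting these per seed for both linearization scopes, with the spread across seeds, is what would let a reader judge "comparable" quantitatively.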

Circularity Check

0 steps flagged

No circularity: RMT analysis and empirics are independent of the target claim

The derivation chain consists of an external random-matrix-theory comparison (under explicitly stated linearization and scaling assumptions) followed by a separate large-scale empirical evaluation. Neither component reduces to a self-definition, a fitted parameter renamed as a prediction, or a load-bearing self-citation chain; the RMT step is presented as an independent theoretical tool rather than an ansatz or uniqueness result imported from the authors' prior work. The conclusion that last-layer linearization suffices therefore rests on two non-circular inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The claim rests on unstated assumptions of the random matrix theory model and the representativeness of the empirical tasks.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ElemeNet: Multiscale Molecular Machine Learning with Uncertainty Quantification Across the Periodic Table
physics.chem-ph 2026-06 unverdicted novelty 6.0

ElemeNet is a unified ML software package for molecular property prediction across elements 1-100 with built-in uncertainty quantification and competitive benchmarks on diverse chemistry datasets.

Reference graph

Works this paper leans on

9 extracted references · 1 canonical work pages · cited by 1 Pith paper

[1]

Daxberger, E., Kristiadi, A., Immer, A., Eschenhagen, R., Bauer, M., and Hennig, P

URL https://openreview.net/forum?id=ruGY8v10mK. Daxberger, E., Kristiadi, A., Immer, A., Eschenhagen, R., Bauer, M., and Hennig, P. Laplace redux-effortless Bayesian deep learning. Advances in Neural Information Processing Systems, 34:20089–20103, 2021. de Jong, I. P., Sburlea, A. I., and Valdenegro-Toro, M. Uncertainty quantification in machine learnin...

arXiv 2021
[2]

PMLR, 2019. Fan, Z. and Wang, Z. Spectra of the conjugate kernel and neural tangent kernel for linear-width neural networks. Advances in Neural Information Processing Systems, 33:7710–7721, 2020. Feng, R., Zheng, K., Huang, Y., Zhao, D., Jordan, M., and Zha, Z.-J. Rank diminishing in deep n...

Pith/arXiv arXiv 2019
[3]

Krivoruchko, K

PMLR, 2020. Krivoruchko, K. and Gribov, A. Evaluation of empirical Bayesian kriging. Spatial Statistics, 32:100368, 2019. Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in Neural Information Processing Sys...

2020
[4]

Nemani, V., Biggio, L., Huan, X., Hu, Z., Fink, O., Tran, A., Wang, Y., Zhang, X., and Hu, C

Springer Science & Business Media, 2012. Nemani, V., Biggio, L., Huan, X., Hu, Z., Fink, O., Tran, A., Wang, Y., Zhang, X., and Hu, C. Uncertainty quantification in machine learning for engineering design and health prognostics: A tutorial. Mechanical Systems and Signal Processing, 205:110796, 2023. Ortega, L. A., Rodriguez Santana, S., and Hernández...

work page doi:10.7551/mitpress/3206 2012
[5]

Snoek, J., Rippel, O., Swersky, K., Kiros, R., Satish, N., Sundaram, N., Patwary, M., Prabhat, M., and Adams, R

URL https://www.auai.org/uai2018/proceedings/papers/207.pdf. Snoek, J., Rippel, O., Swersky, K., Kiros, R., Satish, N., Sundaram, N., Patwary, M., Prabhat, M., and Adams, R. Scalable Bayesian optimization using deep neural networks. In International Conference on Machine Learning, pp. 2171–2180. PMLR, 2015. Tulino, A. M., Verdú, S., et al. Random matri...

2015
[6]

cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper

URL https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf. Vovk, V., Gammerman, A., and Shafer, G. Algorithmic Learning in a Random World, volume 29. Springer, 2005. Wang, Z., Engel, A., Sarwate, A. D., Dumitriu, I., and Chiang, T. Spectral evolution and invariance in linear-width neural networks. Advanc...

2017
[7]

For a Bayesian method, we generally have a mean predictor µ: ℝ^d → ℝ^c and a covariance function Σ: ℝ^d → ℝ^{c×c}, that output in the softmax space

Firstly, we use the variance of the maximum softmax prediction as our uncertainty score. For a Bayesian method, we generally have a mean predictor µ: ℝ^d → ℝ^c and a covariance function Σ: ℝ^d → ℝ^{c×c}, that output in the softmax space. For the variance of the maximum softmax prediction for a given test point x⋆, we first find ĉ = argmax_k µ(x⋆)_k, where µ(x⋆)_k ...
[8]

Secondly, we compute the AUCROC score for two settings: on an in-distribution test set, where we seek to detect correctly predicted versus incorrectly predicted points (VARROC-ID), and using an OOD test set, where we seek to detect correctly predicted versus OOD points (VARROC-OOD). Note th...

2024
[9]

We take 40 posterior samples, and discard the first 10 as a burn-in

component of SMS-UBU, we take the pre-trained DNN, run SWA (using Adam with weight decay equal to that for our original DNN training) for 5 epochs, and then run SMS-UBU from the averaged parameters. We take 40 posterior samples, and discard the first 10 as a burn-in. GPT-2 The trained GPT-2 weights were taken from HuggingFace; a classification head was then ...
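
The excerpts quoted under entries [7] and [8] above describe a two-step scoring recipe: take the variance attached to the predicted class, then summarize how well that score separates two groups with an AUROC. The sketch below illustrates the idea and is not the paper's code. The source text is truncated right after the argmax step, so reading the score as the diagonal covariance entry for the predicted class is an assumption, as is treating the incorrectly predicted (or OOD) points as the positive class. The function names and all numbers are hypothetical.

```python
import numpy as np


def max_softmax_variance(mu, Sigma):
    """Uncertainty score per test point.

    mu:    (n, c) mean predictions in softmax space.
    Sigma: (n, c, c) predictive covariances in softmax space.
    ASSUMPTION: the score is Sigma[c_hat, c_hat] with c_hat = argmax_k mu_k.
    """
    c_hat = np.argmax(mu, axis=1)
    return Sigma[np.arange(len(mu)), c_hat, c_hat]


def auroc(scores, is_positive):
    """AUROC as the probability that a positive outscores a negative (ties count 1/2).

    Uses all positive/negative pairs, so memory is O(n_pos * n_neg).
    """
    scores = np.asarray(scores, dtype=float)
    is_positive = np.asarray(is_positive, dtype=bool)
    pos, neg = scores[is_positive], scores[~is_positive]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())


# Made-up predictions for four in-distribution test points and three classes.
mu = np.array([[0.8, 0.1, 0.1], [0.4, 0.35, 0.25], [0.2, 0.7, 0.1], [0.34, 0.33, 0.33]])
Sigma = np.stack([np.diag(v) for v in ([0.01, 0.02, 0.02], [0.05, 0.04, 0.03],
                                       [0.02, 0.015, 0.02], [0.06, 0.05, 0.05])])
labels = np.array([0, 1, 1, 2])

scores = max_softmax_variance(mu, Sigma)
incorrect = np.argmax(mu, axis=1) != labels  # positives for a VARROC-ID style score
varroc_id = auroc(scores, incorrect)
print(scores, varroc_id)
```

For the OOD variant, the same `auroc` call would take scores from correctly predicted in-distribution points as negatives and scores from OOD points as positives.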
