Richer Bayesian Last Layers with Subsampled NTK Features

Álvaro Cartea; Jonathan Plenk; Jose Miguel Hernández-Lobato; Kamil Ciosek; Richard Bergna; Sergio Calvo-Ordoñez; Yarin Gal

arxiv: 2602.01279 · v2 · pith:Y3JZHU7Nnew · submitted 2026-02-01 · 💻 cs.LG

Pith reviewed 2026-05-22 10:57 UTC · model grok-4.3

classification 💻 cs.LG

keywords: Bayesian last layers, Neural Tangent Kernel, epistemic uncertainty, subsampling, uncertainty calibration, neural networks, posterior variance

The pith

Projecting NTK features onto last-layer features corrects underestimation of epistemic uncertainty in Bayesian last layers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard Bayesian last layers treat only the final layer as random and therefore underestimate the epistemic uncertainty coming from the rest of the network. The authors address this by projecting Neural Tangent Kernel features of the full network onto the linear span of the last-layer features. The resulting posterior inference remains linear and cheap yet produces variances that are mathematically guaranteed to be at least as large as those of an ordinary Bayesian last layer. Uniform subsampling of the NTK features keeps the method scalable, and the paper supplies approximation error bounds for both the projection and the posterior. Experiments across regression, bandits, classification, and out-of-distribution detection show improved uncertainty calibration compared with baseline Bayesian last layers.

Core claim

By projecting NTK features onto the space spanned by the last-layer features, the method constructs a Bayesian posterior over the last layer whose covariance reflects variability in earlier layers. The authors prove that, for the exact projection, the marginal posterior variance at any test point is at least as large as the variance obtained by a conventional Bayesian last layer. This guarantee follows directly from the geometry of the projection: the NTK component orthogonal to the last-layer span is discarded, but the retained component still enlarges the effective prior covariance.

What carries the argument

The projection of full-network NTK features onto the column space of the last-layer feature matrix; this linear map lets the Bayesian update incorporate information from all layers without leaving the cheap last-layer inference regime.
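
A minimal numerical sketch of this mechanism, under assumptions that are not taken from the paper: random toy features stand in for a real network, the NTK feature vector is assumed to contain the last-layer features as one block (the gradient with respect to the last-layer weights), the standard BLL is assumed to use an identity prior, and the projection is computed by ordinary least squares. All names (`Phi`, `J`, `M`, `posterior_cov`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, p_rest = 200, 5, 40      # data points, last-layer width, remaining NTK coordinates
noise_var = 0.1                # assumed observation noise variance

# Toy last-layer features and toy "NTK features" whose last block is Phi itself.
Phi = rng.normal(size=(n, d))
J_rest = np.tanh(Phi @ rng.normal(size=(d, p_rest))) + 0.1 * rng.normal(size=(n, p_rest))
J = np.hstack([J_rest, Phi])

# Least-squares projection: M (d x P) such that Phi @ M approximates J.
M, *_ = np.linalg.lstsq(Phi, J, rcond=None)

prior_bll = np.eye(d)          # assumed prior of the standard BLL
prior_rich = M @ M.T           # identity plus a positive semidefinite term

def posterior_cov(prior, feats, noise_var):
    """Bayesian linear regression posterior covariance over last-layer weights."""
    precision = np.linalg.inv(prior) + feats.T @ feats / noise_var
    return np.linalg.inv(precision)

S_bll = posterior_cov(prior_bll, Phi, noise_var)
S_rich = posterior_cov(prior_rich, Phi, noise_var)

# Marginal posterior variances phi(x)^T S phi(x) at random test points.
Phi_test = rng.normal(size=(50, d))
var_bll = np.einsum("id,de,ie->i", Phi_test, S_bll, Phi_test)
var_rich = np.einsum("id,de,ie->i", Phi_test, S_rich, Phi_test)

print("min variance gap:", (var_rich - var_bll).min())
```

In this toy setup the least-squares map reproduces the last-layer block exactly, so `M @ M.T` is the identity plus a positive semidefinite matrix; a larger prior covariance in Bayesian linear regression yields a larger posterior covariance, which is the ordering the final comparison checks.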

If this is right

Posterior variances are provably at least as large as those from a standard Bayesian last layer.
Approximation bounds hold for both the projection matrix and the resulting posterior when features are uniformly subsampled.
The enriched model shows improved calibration on UCI regression tasks and competitive performance on contextual bandit problems.
Uncertainty estimates improve on image classification and out-of-distribution detection in both image and tabular data.
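
The subsampling claim can be illustrated with a rough sketch. The paper's actual estimator is not reproduced on this page, so the code below assumes one plausible form: NTK coordinates are drawn uniformly without replacement and the outer-product estimate of the effective prior covariance is rescaled by `P / m` to stay unbiased. The matrix `M` is a random stand-in for a projection matrix, and every name is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
d, P = 5, 2000                                 # last-layer width, number of NTK coordinates
M = rng.normal(size=(d, P)) / np.sqrt(P)       # stand-in projection matrix (hypothetical)
prior_full = M @ M.T                           # effective prior covariance from all coordinates

def subsampled_prior(M, m, rng):
    """Estimate M @ M.T from m uniformly chosen columns (assumed estimator)."""
    total = M.shape[1]
    idx = rng.choice(total, size=m, replace=False)
    M_s = M[:, idx]
    # Each column is kept with probability m / total, so rescaling by total / m
    # makes the estimate unbiased for M @ M.T.
    return (total / m) * (M_s @ M_s.T)

def rel_error(m):
    """Relative spectral-norm error of the subsampled estimate."""
    est = subsampled_prior(M, m, rng)
    return np.linalg.norm(est - prior_full, 2) / np.linalg.norm(prior_full, 2)

errors = {m: rel_error(m) for m in (50, 500, 2000)}
for m, err in errors.items():
    print(f"m = {m:4d}  relative error = {err:.3e}")
```

With `m = P` the estimate coincides with the full matrix; for smaller `m` the error is random and depends on the subsampling ratio, which is the trade-off between cost and approximation quality that the bounds are meant to control.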

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same projection idea could be tested with other kernels that approximate the behavior of deep networks.
Subsampling strategies might be refined by importance sampling rather than uniform selection to further reduce variance in the estimates.
If the method generalizes, it could serve as a drop-in replacement for standard Bayesian last layers in any architecture where NTK features are computable.

Load-bearing premise

The projection of NTK features onto the linear span of the last-layer features is sufficient to capture the epistemic uncertainty induced by earlier layers.

What would settle it

Observing a data point where the enriched Bayesian last layer reports a strictly smaller posterior variance than the standard version would directly contradict the provable inequality; likewise, failure to observe calibration gains on standard benchmark suites would weaken the practical claim.

Original abstract

Bayesian Last Layers (BLLs) provide a convenient and computationally efficient way to estimate uncertainty in neural networks. However, they underestimate epistemic uncertainty because they apply a Bayesian treatment only to the final layer, ignoring uncertainty induced by earlier layers. We propose a method that improves BLLs by leveraging a projection of Neural Tangent Kernel (NTK) features onto the space spanned by the last-layer features. This enables posterior inference that accounts for variability of the full network while retaining the low computational cost of inference of a standard BLL. We show that our method yields posterior variances that are provably greater or equal to those of a standard BLL, correcting its tendency to underestimate epistemic uncertainty. To further reduce computational cost, we introduce a uniform subsampling scheme for estimating the projection matrix and for posterior inference. We derive approximation bounds for both types of subsampling. Empirical evaluations on UCI regression, contextual bandits, image classification, and out-of-distribution detection tasks in image and tabular datasets, demonstrate improved calibration and uncertainty estimates compared to standard BLLs and competitive baselines, while reducing computational cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

NTK projection into BLLs gives a clean variance guarantee over standard last layers, but the subsampling step leaves the inequality unproven for the approximated version.

The main contribution is to take a standard Bayesian last layer, project NTK features onto the span of the last-layer features, and obtain posterior variances that are provably at least as large as in the plain BLL case. This directly targets the underestimation of epistemic uncertainty while keeping inference cheap. The authors then add uniform subsampling for both the projection matrix and the posterior, with separate approximation bounds for each.

Referee Report

2 major / 2 minor

Summary. The paper proposes enriching Bayesian Last Layers (BLLs) by projecting Neural Tangent Kernel (NTK) features onto the linear span of the last-layer features. This construction is shown to yield posterior variances that are provably at least as large as those of a standard BLL, thereby correcting underestimation of epistemic uncertainty while retaining the computational advantages of last-layer Bayesian inference. Uniform subsampling is introduced both to estimate the projection matrix and to perform inference, accompanied by separate approximation bounds on the resulting matrix errors. Experiments across UCI regression, contextual bandits, image classification, and out-of-distribution detection tasks report improved calibration and uncertainty estimates relative to standard BLLs and several baselines.

Significance. If the variance inequality is preserved under the subsampled estimator, the work supplies a practical, theoretically grounded route to richer epistemic uncertainty quantification without incurring the cost of full-network Bayesian inference. The explicit derivation of approximation bounds for both projection estimation and inference, together with the empirical demonstration of improved calibration on regression, bandit, and OOD tasks, constitutes a concrete advance over existing last-layer methods.

major comments (2)

[§3 and §4] §3 (exact-projection case) and §4 (subsampled case): The central claim that posterior variances are provably ≥ those of a standard BLL holds for the exact projection onto the last-layer span. However, the uniform subsampling used both to form the projection matrix and to evaluate the predictive variance introduces separate matrix-norm approximation bounds; these bounds do not automatically guarantee that the quadratic form determining the posterior variance remains above the BLL baseline once the exact projection is replaced by its subsampled estimate. A sufficiently large finite-sample error could reverse the inequality even when the exact case is valid.
[Abstract and §2] Abstract and §2 (method): The assumption that the projection of NTK features onto the span of the last-layer features is sufficient to capture epistemic uncertainty induced by earlier layers is stated but not accompanied by a quantitative characterization of the residual uncertainty orthogonal to that span. If this residual component is non-negligible, the claimed correction to BLL underestimation may be only partial.
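
The first major comment can be made concrete with a small counterexample sketch. It rests on assumptions that are not confirmed by the paper: the effective prior covariance is taken to be `M @ M.T` with an identity block for the last-layer coordinates, and the subsampled estimate is taken to be a rescaled outer product over a uniformly drawn subset of coordinates. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, p_rest, noise_var = 200, 5, 40, 0.1
Phi = rng.normal(size=(n, d))                  # toy last-layer features
M_rest = rng.normal(size=(d, p_rest))          # toy projection of the non-last-layer block
M = np.hstack([M_rest, np.eye(d)])             # exact case: M @ M.T = I + M_rest @ M_rest.T
P = M.shape[1]

def posterior_cov(prior, feats, noise_var):
    """Kernel form of the Bayesian linear regression posterior; works for singular priors."""
    gram = feats @ prior @ feats.T + noise_var * np.eye(feats.shape[0])
    return prior - prior @ feats.T @ np.linalg.solve(gram, feats @ prior)

# One draw that uniform sampling can produce: m = 3 coordinates, fewer than d = 5.
idx = np.arange(3)
prior_sub = (P / len(idx)) * (M[:, idx] @ M[:, idx].T)   # rank at most 3

# Test direction in which the subsampled prior carries (numerically) no mass.
eigvals, eigvecs = np.linalg.eigh(prior_sub)
v = eigvecs[:, 0]

var_bll = v @ posterior_cov(np.eye(d), Phi, noise_var) @ v
var_exact = v @ posterior_cov(M @ M.T, Phi, noise_var) @ v
var_sub = v @ posterior_cov(prior_sub, Phi, noise_var) @ v

print(f"BLL: {var_bll:.3e}  exact: {var_exact:.3e}  subsampled: {var_sub:.3e}")
```

Under this assumed estimator, any subsample with fewer than `d` coordinates gives a rank-deficient prior, so the variance along `v` falls below the BLL value even though the exact projection respects the ordering. An estimator that always retains the last-layer block and subsamples only the remaining coordinates would keep the prior at least as large as the identity; this page does not say which variant the paper uses.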

minor comments (2)

[§4] Notation for the subsampled projection matrix and the resulting approximate kernel should be introduced with an explicit equation number to avoid ambiguity when the approximation bounds are applied.
[Experiments section] The experimental tables would benefit from reporting the effective subsample size (as a fraction of the full feature dimension) alongside the reported metrics so that the computational-accuracy trade-off is immediately visible.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our work. We address the major concerns point by point below, and we will make revisions to improve the clarity of the theoretical claims and limitations.

Referee: [§3 and §4] §3 (exact-projection case) and §4 (subsampled case): The central claim that posterior variances are provably ≥ those of a standard BLL holds for the exact projection onto the last-layer span. However, the uniform subsampling used both to form the projection matrix and to evaluate the predictive variance introduces separate matrix-norm approximation bounds; these bounds do not automatically guarantee that the quadratic form determining the posterior variance remains above the BLL baseline once the exact projection is replaced by its subsampled estimate. A sufficiently large finite-sample error could reverse the inequality even when the exact case is valid.

Authors: We concur that the provable inequality is established strictly for the exact projection case analyzed in §3. For the subsampled estimators in §4, we provide matrix-norm bounds on the approximation errors for both the projection matrix estimation and the inference step. These bounds do not directly imply preservation of the variance inequality for any finite subsample size. We will revise the text in §4 to explicitly acknowledge this point and to clarify that the inequality holds exactly only in the limit of full sampling, while the subsampled version approximates it with controllable error. We will also add a note on the practical implications based on our experimental subsample sizes.
Revision: yes
Referee: [Abstract and §2] Abstract and §2 (method): The assumption that the projection of NTK features onto the span of the last-layer features is sufficient to capture epistemic uncertainty induced by earlier layers is stated but not accompanied by a quantitative characterization of the residual uncertainty orthogonal to that span. If this residual component is non-negligible, the claimed correction to BLL underestimation may be only partial.

Authors: The method projects NTK features to capture contributions from the entire network while restricting to the last-layer span for efficiency. We recognize that a quantitative analysis of the residual uncertainty in the orthogonal complement is not provided. Such a characterization would require a more detailed decomposition of the NTK and its interaction with the network architecture, which is outside the scope of the current work. We will revise §2 to include a clearer statement of this modeling assumption and its potential limitations, indicating that the approach addresses a significant portion of the epistemic uncertainty but may leave some residual unaccounted for.
Revision: yes

Circularity Check

0 steps flagged

No circularity in the derivation of variance bounds

The paper derives the central inequality (posterior variances provably >= standard BLL) from the explicit projection of NTK features onto the linear span of last-layer features; this projection is an external construction using the standard NTK kernel rather than a quantity defined in terms of the target variance. Separate approximation bounds are stated for the uniform subsampling of the projection matrix and for inference, without the main inequality being redefined or forced by those bounds. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided derivation chain; the result retains independent mathematical content from the projection property.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on the domain assumption that NTK features adequately represent full-network variability and that uniform subsampling preserves the necessary geometry for the projection; no new invented entities or heavily fitted parameters are introduced in the abstract description.

free parameters (1)

subsampling ratio
Chosen to trade off computational cost against approximation quality for the projection matrix and posterior inference.

axioms (1)

domain assumption: NTK features capture variability induced by earlier layers
Invoked to justify why the projection onto last-layer space accounts for full-network epistemic uncertainty.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

The Neural Tangent Kernel for Classification
cs.LG 2026-05 unverdicted novelty 6.0

Wide neural networks with cross-entropy loss maintain constant NTK under parameter regularization or non-degenerate targets, enabling linearized approximation and explicit NTK-based solution characterization.