Richer Bayesian Last Layers with Subsampled NTK Features
Pith reviewed 2026-05-22 10:57 UTC · model grok-4.3
The pith
Projecting NTK features onto last-layer features corrects underestimation of epistemic uncertainty in Bayesian last layers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By projecting NTK features onto the space spanned by the last-layer features, the method constructs a Bayesian posterior over the last layer whose covariance reflects variability in earlier layers. The authors prove that, for the exact projection, the marginal posterior variance at any test point is at least as large as the variance obtained by a conventional Bayesian last layer. On this reading, the guarantee follows from the geometry of the projection: the NTK component orthogonal to the last-layer span is discarded, but the retained component still enlarges the effective prior covariance. Uniform subsampling is then used to approximate the projection matrix and the posterior at lower cost.
What carries the argument
The projection of full-network NTK features onto the column space of the last-layer feature matrix; this linear map lets the Bayesian update incorporate information from all layers without leaving the cheap last-layer inference regime.
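The projection step can be pictured with a small numerical sketch. This is an illustration only, not the authors' implementation: it assumes the projection is an ordinary least-squares projection of the columns of an NTK (Jacobian) feature matrix `J` onto the column space of a last-layer feature matrix `Phi`. The names `project_onto_span`, `Phi`, `J`, and the toy sizes are hypothetical, and random matrices stand in for real features.

```python
import numpy as np

def project_onto_span(Phi, J):
    """Least-squares projection of the columns of J onto the column space of Phi.

    Phi: (n, d) last-layer feature matrix (assumed stand-in).
    J:   (n, p) NTK / Jacobian feature matrix (assumed stand-in).
    Returns the coefficient matrix A of shape (d, p) and the projection Phi @ A.
    """
    A, *_ = np.linalg.lstsq(Phi, J, rcond=None)
    return A, Phi @ A

rng = np.random.default_rng(0)
n, d, p = 50, 5, 40            # toy sizes, chosen only for illustration
Phi = rng.normal(size=(n, d))
J = rng.normal(size=(n, p))

A, J_proj = project_onto_span(Phi, J)
residual = J - J_proj

# The part of J that the projection discards is orthogonal to every
# last-layer feature column, up to floating-point error.
orth_error = np.abs(Phi.T @ residual).max()
print(A.shape, J_proj.shape, orth_error < 1e-8)
```

Working with the `d x p` coefficient matrix `A` rather than the `n x n` projector keeps every later computation in the last-layer dimension `d`, which is the property the summary describes as staying in the cheap last-layer inference regime.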
If this is right
- Posterior variances are provably at least as large as those from a standard Bayesian last layer.
- Approximation bounds hold for both the projection matrix and the resulting posterior when features are uniformly subsampled.
- The enriched model shows improved calibration on UCI regression tasks and competitive performance on contextual bandit problems.
- Uncertainty estimates improve on image classification and out-of-distribution detection in both image and tabular data.
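The subsampling bullet above can be illustrated with a toy estimator. The sketch assumes that subsampling acts on the NTK feature coordinates (columns of a hypothetical matrix `J`) and that the quantity being approximated is the Gram matrix `J @ J.T`; the source only states that the subsampling is uniform, so both choices are assumptions. Rescaling by `p / m` makes the estimator unbiased, and drawing all `p` columns recovers the exact matrix.

```python
import numpy as np

def subsampled_kernel(J, m, rng):
    """Estimate K = J @ J.T from m of the p columns of J.

    Columns are drawn uniformly without replacement and the result is
    rescaled by p / m so that the estimate is unbiased for K.
    """
    p = J.shape[1]
    idx = rng.choice(p, size=m, replace=False)
    J_sub = J[:, idx]
    return (p / m) * (J_sub @ J_sub.T)

rng = np.random.default_rng(0)
n, p = 30, 2000                 # toy sizes, chosen only for illustration
J = rng.normal(size=(n, p))
K = J @ J.T

for m in (50, 200, 2000):
    K_hat = subsampled_kernel(J, m, rng)
    # Relative error in spectral norm; it vanishes when m == p.
    rel_err = np.linalg.norm(K_hat - K, 2) / np.linalg.norm(K, 2)
    print(m, rel_err)
```

The relative spectral-norm error printed here is the kind of quantity a matrix-norm approximation bound would control; the sketch makes no claim about the form of the bounds derived in the paper.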
Where Pith is reading between the lines
- The same projection idea could be tested with other kernels that approximate the behavior of deep networks.
- Subsampling strategies might be refined by importance sampling rather than uniform selection to further reduce variance in the estimates.
- If the method generalizes, it could serve as a drop-in replacement for standard Bayesian last layers in any architecture where NTK features are computable.
Load-bearing premise
The projection of NTK features onto the linear span of the last-layer features is sufficient to capture the epistemic uncertainty induced by earlier layers.
What would settle it
Observing a data point where the enriched Bayesian last layer, computed with the exact projection, reports a strictly smaller posterior variance than the standard version would directly contradict the provable inequality; likewise, failure to observe calibration gains on standard benchmark suites would weaken the practical claim.
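A check of this kind is easy to script. The sketch below is not the paper's construction: it assumes the enriched model is a Bayesian linear model on the same last-layer features whose prior covariance equals the standard prior plus a positive semidefinite term, with a random matrix `B @ B.T` as a hypothetical stand-in for the projected NTK contribution. Under that assumption the inequality follows from a standard fact: matrix inversion reverses the positive semidefinite order, so a larger prior covariance gives a larger posterior covariance.

```python
import numpy as np

def posterior_cov(Phi, prior_cov, noise_var):
    """Posterior covariance of Bayesian linear regression (Gaussian prior and noise)."""
    precision = np.linalg.inv(prior_cov) + Phi.T @ Phi / noise_var
    return np.linalg.inv(precision)

def predictive_var(Phi_test, cov):
    """Epistemic variance phi(x)^T cov phi(x) for each row phi(x) of Phi_test."""
    return np.einsum("id,de,ie->i", Phi_test, cov, Phi_test)

rng = np.random.default_rng(0)
n, d = 100, 6                   # toy sizes, chosen only for illustration
Phi = rng.normal(size=(n, d))
Phi_test = rng.normal(size=(20, d))
noise_var = 0.1

S_bll = np.eye(d)               # standard isotropic last-layer prior
B = rng.normal(size=(d, 3))
S_rich = S_bll + B @ B.T        # hypothetical PSD enlargement of the prior

var_bll = predictive_var(Phi_test, posterior_cov(Phi, S_bll, noise_var))
var_rich = predictive_var(Phi_test, posterior_cov(Phi, S_rich, noise_var))

# Indices of test points that would contradict the inequality.
violations = np.flatnonzero(var_rich < var_bll - 1e-12)
print("violations:", violations.size)
```

Running the same comparison with variances produced by an actual implementation, in place of the toy `var_bll` and `var_rich`, would be one direct way to look for the counterexample described above.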
Original abstract
Bayesian Last Layers (BLLs) provide a convenient and computationally efficient way to estimate uncertainty in neural networks. However, they underestimate epistemic uncertainty because they apply a Bayesian treatment only to the final layer, ignoring uncertainty induced by earlier layers. We propose a method that improves BLLs by leveraging a projection of Neural Tangent Kernel (NTK) features onto the space spanned by the last-layer features. This enables posterior inference that accounts for variability of the full network while retaining the low computational cost of inference of a standard BLL. We show that our method yields posterior variances that are provably greater or equal to those of a standard BLL, correcting its tendency to underestimate epistemic uncertainty. To further reduce computational cost, we introduce a uniform subsampling scheme for estimating the projection matrix and for posterior inference. We derive approximation bounds for both types of subsampling. Empirical evaluations on UCI regression, contextual bandits, image classification, and out-of-distribution detection tasks in image and tabular datasets, demonstrate improved calibration and uncertainty estimates compared to standard BLLs and competitive baselines, while reducing computational cost.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes enriching Bayesian Last Layers (BLLs) by projecting Neural Tangent Kernel (NTK) features onto the linear span of the last-layer features. This construction is shown to yield posterior variances that are provably at least as large as those of a standard BLL, thereby correcting underestimation of epistemic uncertainty while retaining the computational advantages of last-layer Bayesian inference. Uniform subsampling is introduced both to estimate the projection matrix and to perform inference, accompanied by separate approximation bounds on the resulting matrix errors. Experiments across UCI regression, contextual bandits, image classification, and out-of-distribution detection tasks report improved calibration and uncertainty estimates relative to standard BLLs and several baselines.
Significance. If the variance inequality is preserved under the subsampled estimator, the work supplies a practical, theoretically grounded route to richer epistemic uncertainty quantification without incurring the cost of full-network Bayesian inference. The explicit derivation of approximation bounds for both projection estimation and inference, together with the empirical demonstration of improved calibration on regression, bandit, and OOD tasks, constitutes a concrete advance over existing last-layer methods.
major comments (2)
- [§3 and §4] §3 (exact-projection case) and §4 (subsampled case): The central claim that posterior variances are provably ≥ those of a standard BLL holds for the exact projection onto the last-layer span. However, the uniform subsampling used both to form the projection matrix and to evaluate the predictive variance introduces separate matrix-norm approximation bounds; these bounds do not automatically guarantee that the quadratic form determining the posterior variance remains above the BLL baseline once the exact projection is replaced by its subsampled estimate. A sufficiently large finite-sample error could reverse the inequality even when the exact case is valid.
- [Abstract and §2] Abstract and §2 (method): The assumption that the projection of NTK features onto the span of the last-layer features is sufficient to capture epistemic uncertainty induced by earlier layers is stated but not accompanied by a quantitative characterization of the residual uncertainty orthogonal to that span. If this residual component is non-negligible, the claimed correction to BLL underestimation may be only partial.
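One way the residual raised in the second major comment could be quantified is sketched below. The notation is assumed for illustration and does not come from the paper: $\Phi \in \mathbb{R}^{n \times d}$ is the last-layer feature matrix (taken to have full column rank), $J \in \mathbb{R}^{n \times p}$ is the NTK feature matrix, and $P_\Phi$ is the orthogonal projector onto the column space of $\Phi$. The ratio $\rho$ is only a crude proxy: it measures how much feature energy the projection discards, not the discarded epistemic uncertainty itself.

```latex
% Orthogonal projector onto the column space of \Phi (assumed full column rank)
P_\Phi = \Phi \left( \Phi^\top \Phi \right)^{-1} \Phi^\top

% Decomposition of the NTK features into retained and discarded parts
J = P_\Phi J + \left( I - P_\Phi \right) J

% The two parts are orthogonal, so their squared Frobenius norms add up
\| J \|_F^2 = \| P_\Phi J \|_F^2 + \| \left( I - P_\Phi \right) J \|_F^2

% Fraction of feature energy lying outside the last-layer span
\rho = \frac{\| \left( I - P_\Phi \right) J \|_F^2}{\| J \|_F^2} \in [0, 1]
```

A value of $\rho$ close to 0 would suggest that little is lost by restricting to the last-layer span, while a value close to 1 would support the referee's concern that the correction is only partial.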
minor comments (2)
- [§4] Notation for the subsampled projection matrix and the resulting approximate kernel should be introduced with an explicit equation number to avoid ambiguity when the approximation bounds are applied.
- [Experiments section] The experimental tables would benefit from reporting the effective subsample size (as a fraction of the full feature dimension) alongside the reported metrics so that the computational-accuracy trade-off is immediately visible.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our work. We address the major concerns point by point below, and we will make revisions to improve the clarity of the theoretical claims and limitations.
point-by-point responses
- Referee: [§3 and §4] §3 (exact-projection case) and §4 (subsampled case): The central claim that posterior variances are provably ≥ those of a standard BLL holds for the exact projection onto the last-layer span. However, the uniform subsampling used both to form the projection matrix and to evaluate the predictive variance introduces separate matrix-norm approximation bounds; these bounds do not automatically guarantee that the quadratic form determining the posterior variance remains above the BLL baseline once the exact projection is replaced by its subsampled estimate. A sufficiently large finite-sample error could reverse the inequality even when the exact case is valid.
Authors: We concur that the provable inequality is established strictly for the exact projection case analyzed in §3. For the subsampled estimators in §4, we provide matrix-norm bounds on the approximation errors for both the projection matrix estimation and the inference step. These bounds do not directly imply preservation of the variance inequality for any finite subsample size. We will revise the text in §4 to explicitly acknowledge this point and to clarify that the inequality holds exactly only in the limit of full sampling, while the subsampled version approximates it with controllable error. We will also add a note on the practical implications based on our experimental subsample sizes. revision: yes
- Referee: [Abstract and §2] Abstract and §2 (method): The assumption that the projection of NTK features onto the span of the last-layer features is sufficient to capture epistemic uncertainty induced by earlier layers is stated but not accompanied by a quantitative characterization of the residual uncertainty orthogonal to that span. If this residual component is non-negligible, the claimed correction to BLL underestimation may be only partial.
Authors: The method projects NTK features to capture contributions from the entire network while restricting to the last-layer span for efficiency. We recognize that a quantitative analysis of the residual uncertainty in the orthogonal complement is not provided. Such a characterization would require a more detailed decomposition of the NTK and its interaction with the network architecture, which is outside the scope of the current work. We will revise §2 to include a clearer statement of this modeling assumption and its potential limitations, indicating that the approach addresses a significant portion of the epistemic uncertainty but may leave some residual unaccounted for. revision: yes
Circularity Check
No circularity in the derivation of variance bounds
full rationale
The paper derives the central inequality (posterior variances provably >= standard BLL) from the explicit projection of NTK features onto the linear span of last-layer features; this projection is an external construction using the standard NTK kernel rather than a quantity defined in terms of the target variance. Separate approximation bounds are stated for the uniform subsampling of the projection matrix and for inference, without the main inequality being redefined or forced by those bounds. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided derivation chain; the result retains independent mathematical content from the projection property.
Axiom & Free-Parameter Ledger
free parameters (1)
- subsampling ratio
axioms (1)
- domain assumption: NTK features capture variability induced by earlier layers
Forward citations
Cited by 1 Pith paper
- The Neural Tangent Kernel for Classification
Wide neural networks with cross-entropy loss maintain constant NTK under parameter regularization or non-degenerate targets, enabling linearized approximation and explicit NTK-based solution characterization.