Singular Bayesian Neural Networks

David A. Stephens; Mame Diarra Toure

arXiv: 2602.00387 · v3 · submitted 2026-01-30 · stat.ML · cs.LG · stat.AP

Pith reviewed 2026-05-16 08:44 UTC · model grok-4.3

keywords: singular posteriors · low-rank Bayesian networks · PAC-Bayes bounds · generalization bounds · neural network uncertainty · rank-r manifold · parameter efficiency

The pith

Low-rank factorization of weights induces singular posteriors in Bayesian neural networks with improved generalization bounds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that representing neural network weight matrices as the product of two lower-dimensional matrices creates a posterior distribution that is singular and concentrates on a lower-dimensional manifold. This structure encodes weight correlations through shared factors, unlike the independence assumption of standard mean-field posteriors. The resulting PAC-Bayes bounds have a complexity term that scales with the square root of rank times the sum of dimensions, allowing substantial parameter reduction. Experiments across MLPs, LSTMs, and Transformers show competitive performance with far fewer parameters and better handling of uncertainty and out-of-distribution data.

Core claim

By parameterizing weights as W = AB^T with A in R^{m x r}, B in R^{n x r}, we induce a posterior that is singular with respect to the Lebesgue measure, concentrating on the rank-r manifold. This singularity captures structured weight correlations through shared latent factors. We derive PAC-Bayes generalization bounds whose complexity term scales as sqrt(r(m+n)) instead of sqrt(m n), and prove loss bounds that decompose the error into optimization and rank-induced bias using the Eckart-Young-Mirsky theorem.
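The two scalings in the core claim can be compared numerically. The following is a minimal sketch; the function names and the layer sizes are illustrative assumptions, not taken from the paper.

```python
import math

def mean_field_params(m: int, n: int) -> int:
    """Number of weight entries in a dense m x n layer."""
    return m * n

def low_rank_params(m: int, n: int, r: int) -> int:
    """Number of factor entries for W = A B^T with A (m x r) and B (n x r)."""
    return r * (m + n)

def complexity_ratio(m: int, n: int, r: int) -> float:
    """Ratio sqrt(r(m+n)) / sqrt(mn) of the two complexity scalings."""
    return math.sqrt(low_rank_params(m, n, r) / mean_field_params(m, n))

# Illustrative sizes chosen for this sketch (hypothetical layer).
m, n, r = 128, 128, 15
print(low_rank_params(m, n, r), mean_field_params(m, n))
print(round(complexity_ratio(m, n, r), 3))
```

The ratio falls below 1 only when r(m+n) < mn, that is, when r < mn/(m+n); for a square layer this means r below half the width.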

What carries the argument

The low-rank matrix factorization W = AB^T that forces the posterior to concentrate on the rank-r manifold rather than the full space.
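To make the support restriction concrete, here is a small sketch that draws one weight matrix from a factorized posterior and one from a mean-field posterior, then checks the 2x2 minors. With r = 1 every 2x2 minor of AB^T vanishes, whereas an independent draw almost surely has nonzero minors. All names, sizes, means, and scales below are assumptions made for illustration.

```python
import random

def sample_gaussian_matrix(mu, sigma, rng):
    """Reparameterized draw: entry = mean + scale * eps, eps ~ N(0, 1)."""
    return [[mean + scale * rng.gauss(0.0, 1.0) for mean, scale in zip(mu_row, sigma_row)]
            for mu_row, sigma_row in zip(mu, sigma)]

def product_abt(a, b):
    """Return W = A B^T for A (m x r) and B (n x r), both nested lists."""
    return [[sum(x * y for x, y in zip(a_row, b_row)) for b_row in b] for a_row in a]

def max_abs_2x2_minor(w):
    """Largest |2x2 minor| of W; it is zero exactly when rank(W) <= 1."""
    rows, cols = len(w), len(w[0])
    best = 0.0
    for i in range(rows):
        for k in range(i + 1, rows):
            for j in range(cols):
                for l in range(j + 1, cols):
                    best = max(best, abs(w[i][j] * w[k][l] - w[i][l] * w[k][j]))
    return best

def constant(rows, cols, value):
    """Helper: rows x cols nested list filled with `value`."""
    return [[value] * cols for _ in range(rows)]

rng = random.Random(0)
m, n, r = 4, 3, 1  # hypothetical sizes; r = 1 makes the rank constraint easy to test

# Low-rank draw: independent Gaussian entries in A and B, then W = A B^T.
a = sample_gaussian_matrix(constant(m, r, 0.5), constant(m, r, 0.3), rng)
b = sample_gaussian_matrix(constant(n, r, -0.2), constant(n, r, 0.3), rng)
w_low_rank = product_abt(a, b)

# Mean-field draw: every entry of W sampled independently.
w_mean_field = sample_gaussian_matrix(constant(m, n, 0.0), constant(m, n, 1.0), rng)

print(max_abs_2x2_minor(w_low_rank))    # zero up to floating-point rounding
print(max_abs_2x2_minor(w_mean_field))  # generically nonzero
```

The vanishing minors are polynomial constraints on the entries of W, which is why the induced distribution on W cannot have a density with respect to Lebesgue measure on the full matrix space.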

If this is right

Generalization complexity scales with sqrt(r(m + n)) when r is chosen smaller than min(m, n).
Error decomposes into optimization error plus a rank-induced bias term given by the Eckart-Young-Mirsky theorem.
Predictive performance is competitive with baselines while using up to 33 times fewer parameters than 5-member Deep Ensembles.
Out-of-distribution detection improves substantially, and calibration often improves, relative to mean-field and perturbation baselines on MLPs, LSTMs, and Transformers.
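The rank-induced bias has a closed form: by the Eckart-Young-Mirsky theorem, the Frobenius error of the best rank-r approximation of a matrix equals the square root of the sum of its squared singular values beyond the r-th. A minimal sketch follows, with a hypothetical function name and made-up spectra.

```python
import math

def rank_bias(singular_values, r):
    """Frobenius error of the best rank-r approximation (Eckart-Young-Mirsky)."""
    if r < 0:
        raise ValueError("r must be non-negative")
    ordered = sorted(singular_values, reverse=True)
    return math.sqrt(sum(s * s for s in ordered[r:]))

# Illustrative spectra (assumed, not measured): fast geometric decay vs. a flat spectrum.
fast_decay = [0.5 ** i for i in range(10)]
flat = [1.0] * 10

for r in (1, 3, 5):
    print(r, round(rank_bias(fast_decay, r), 4), round(rank_bias(flat, r), 4))
```

With fast decay the bias shrinks quickly as r grows; with a flat spectrum it stays large until r approaches full rank.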

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If neural network weights commonly have fast singular value decay, this method could become a default for Bayesian inference in deep learning.
Adaptive selection of rank r during training might further optimize the bias-complexity tradeoff.
Similar low-rank parameterizations could be applied to other Bayesian models with matrix parameters to induce structured posteriors.

Load-bearing premise

Weight matrices must exhibit sufficiently fast singular value decay so that small ranks incur little bias.
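One way to check this premise on a given layer is to ask how small a rank retains a target share of spectral energy. The sketch below defines energy as the sum of squared singular values, which is an assumption of this example; the function name and the spectra are hypothetical.

```python
def smallest_rank_for_energy(singular_values, threshold):
    """Smallest r whose top-r squared singular values reach `threshold` of the total."""
    if not 0.0 < threshold <= 1.0:
        raise ValueError("threshold must lie in (0, 1]")
    squared = sorted((s * s for s in singular_values), reverse=True)
    total = sum(squared)
    if total == 0.0:
        return 0
    running = 0.0
    for r, value in enumerate(squared, start=1):
        running += value
        if running >= threshold * total:
            return r
    return len(squared)

# Illustrative spectra (assumed): geometric decay vs. no decay at all.
fast_decay = [0.5 ** i for i in range(10)]
flat = [1.0] * 10

print(smallest_rank_for_energy(fast_decay, 0.99))
print(smallest_rank_for_energy(flat, 0.99))
```

A layer whose spectrum behaves like the flat case needs nearly full rank to retain the same energy, which is the regime where the premise fails.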

What would settle it

Observing that performance degrades sharply for small r on datasets where weight matrices do not show rapid singular value decay would falsify the practical advantage.

Figures

Figures reproduced from arXiv: 2602.00387 by David A. Stephens, Mame Diarra Toure.

**Figure 1.** … This filters high-frequency noise that does not align with the dominant low-rank structure, providing generalization benefits that independent weight posteriors cannot capture. The rank r controls this expressiveness: higher rank enables richer correlation patterns while maintaining O(r(m + n)) parameters.

**Figure 2.** Geometric distinction between mean-field and low-rank posteriors. 3D projection of weight space $\mathbb{R}^{m \times n}$. Left: rank-$r$ manifold $\mathcal{M}_r$ (blue surface, dimension $r(m + n - r)$). Middle: mean-field posterior $q_{\mathrm{MF}}(W)$ has full-dimensional support (volume). Right: low-rank posterior $q_{\mathrm{LR}}(A, B)$ concentrates on the manifold surface.

**Figure 3.** Empirical generalization bounds for low-rank Bayesian neural networks. PAC-Bayes (left) and Gaussian complexity (right) bounds use empirical values from the trained LSTM model and training data. The PAC-Bayes bound exhibits a critical rank $r^* \approx 11$ at which it transitions from non-vacuous to vacuous (> 1, dashed red line). Gaussian complexity decreases from 45.56 (full-rank) to 18.97 with rank reduction. Full-rank Bayesian m…

**Figure 4.** Model comparison on MIMIC-III (averaged across 5 seeds). Low-Rank Gaussian r=15 (orange) achieves superior OOD detection. Deep Ensemble maintains better calibration and in-domain discrimination. Rank-1 multiplicative (green) achieves better calibration but weaker OOD detection. Full-Rank BBB (blue) shows balanced but moderate performance across metrics.

**Figure 5.** Selective prediction on Beijing Air Quality LSTM. While Deep Ensemble (red) achieves the best point predictions at 100% retention, Bayesian methods outperform it when the most uncertain samples are discarded. Low-Rank (green) achieves the largest improvement (17.4% MAE reduction at 80% retention), demonstrating superior uncertainty quality for selective prediction.

**Figure 6.** Model comparison on Beijing Air Quality LSTM (averaged across 4 seeds). Deep Ensemble (brown) achieves the best OOD detection but poor coverage. Low-Rank BBB (green) shows the best coverage (PICP) and near-best calibration (ECE) with second-best OOD performance. Low-Rank SVD (red) shows strong OOD detection with good coverage. Full-Rank BBB (orange) achieves the best calibration but moderate coverage. Rank-1 BBB (purp…

**Figure 7.** Singular value decay for embedding and value projection matrices, showing rapid decay.

**Figure 8.** Model comparison on SST-2 Transformer (averaged across 4 seeds). Deep Ensemble (green) achieves the best overall performance. Low-Rank BBB (red) shows good OOD detection and the highest epistemic uncertainty discrimination (MI Ratio) among Bayesian models. Full-Rank BBB (orange) demonstrates moderate balanced performance. Rank-1 BBB (brown) maintains good calibration but weaker OOD metrics.

**Figure 9.** Singular value decay and low-rank approximation error for deterministic neural network weight matrices. Layer 0 (44 × 128, left): singular values decay rapidly (top-left, log scale). The red dashed line at r = 15 indicates the selected rank. The bottom-left panel shows the energy retention curve: rank r = 15 achieves 73.9% of total Frobenius norm energy. Layer 1 (128 × 128, right): similar rapid spectral decay (top-right). Red …

**Figure 10.** Trade-off between OOD detection (AUROC OOD MI) and calibration (NLL, ECE) for low-rank Gaussian models with varying rank pairs ($r_1$, $r_2$). Rank selection is performed through ablation studies with a reduced computational budget (fewer epochs, fewer MC samples).

**Figure 11.** OOD detection performance on MIMIC-III. Low-Rank Gaussian (r = 15) achieves an AUROC OOD of 0.802, outperforming Full-Rank BNN (0.770) and Deep Ensemble (0.738) while using ~30% of the full-rank parameters and 12% of the Deep Ensemble parameters.

**Figure 12.** Parameter-efficiency adjusted performance across models. Metrics are normalized to [0, 1] and scaled by $\sqrt{\text{min params}/\text{params}}$ to reward computational efficiency. Low-Rank Gaussian (r = 15) achieves the best efficiency-performance trade-off, particularly excelling in AUPR metrics while maintaining a compact parameterization.

**Figure 13.** Trade-off between OOD detection (AUROC, y-axis) and calibration metrics (NLL and ECE, x-axis) for low-rank LSTM models with varying rank pairs ($r_{ih}$, $r_{hh}$) on Beijing air quality forecasting. Points are labeled with rank configurations; color indicates KL weight. Dashed lines mark baseline thresholds.

**Figure 14.** Trade-off between OOD detection (AUROC, y-axis) and predictive performance (MAE and RMSE, x-axis) for low-rank LSTM models with varying rank configurations on Beijing air quality forecasting. Lower MAE/RMSE indicates better forecasting accuracy.

**Figure 15.** Singular value decay of pretrained LSTM weight matrices. Fast decay is visible for both x-to-gate and h-to-gate layers.

**Figure 16.** Calibration curves for Bayesian LSTM variants. The dashed diagonal represents perfect calibration. Models closer to the diagonal exhibit better-calibrated uncertainty estimates. Selective prediction analysis: selective prediction, where models abstain from predictions on uncertain inputs, is critical for safety-critical applications. We evaluate how well each model's uncertainty estimates enable effective …

**Figure 17.** Parameter-efficiency adjusted performance across models. Metrics are normalized to [0, 1] and scaled by $\sqrt{\text{param efficiency}}$. Low-Rank BBB (red) achieves the best overall trade-off, dominating OOD detection and uncertainty metrics while maintaining competitive accuracy. Uncertainty distribution analysis: Figures 18 and 19 visualize the distribution of predictive uncertainty for in-distribution (SST-2) versus out-…

**Figure 18.** Predictive standard deviation distributions for in-distribution (blue) vs OOD (orange) samples. Low-Rank BBB and Deep Ensemble show clear separation; Rank-1 BBB shows near-degenerate distributions. H.5. Supplementary Low-Rank Ensembling Study on SST-2: to test whether ensembling mitigates the calibration gap observed for a single low-rank posterior, we trained a 5-member low-rank Bayesian ensemble on SST-2…

**Figure 19.** Mutual information distributions for in-distribution (blue) vs OOD (orange) samples. Patterns are similar to those for the standard deviation, with Deep Ensemble showing the strongest separation.

**Figure 20.** Epistemic uncertainty quantification: Full-Rank vs. Low-Rank Bayesian Neural Networks. Training on N = 1024 samples, $x_{\mathrm{train}} \sim \mathrm{Uniform}[-0.1, 0.6]$. Purple: predictive IQR; blue: epistemic IQR; red line: median; ×: training data. Full-Rank shows an epistemic expansion ratio of 1.90× (in-domain to out-of-domain). Low-Rank achieves a 2.01× ratio with 65% fewer parameters, demonstrating conservative and reliable uncert…

Original abstract

Bayesian neural networks promise calibrated uncertainty but require $O(mn)$ parameters for standard mean-field Gaussian posteriors. We argue this cost is often unnecessary, particularly when weight matrices exhibit fast singular value decay. By parameterizing weights as $W = AB^{\top}$ with $A \in \mathbb{R}^{m \times r}$, $B \in \mathbb{R}^{n \times r}$, we induce a posterior that is \emph{singular} with respect to the Lebesgue measure, concentrating on the rank-$r$ manifold. This singularity captures structured weight correlations through shared latent factors, geometrically distinct from mean-field's independence assumption. We derive PAC-Bayes generalization bounds whose complexity term scales as $\sqrt{r(m+n)}$ instead of $\sqrt{m n}$, and prove loss bounds that decompose the error into optimization and rank-induced bias using the Eckart-Young-Mirsky theorem. We further adapt recent Gaussian complexity bounds for low-rank deterministic networks to Bayesian predictive means. Empirically, across MLPs, LSTMs, and Transformers on standard benchmarks, our method achieves competitive predictive performance while using up to $33\times$ fewer parameters than 5-member Deep Ensembles. It substantially improves OOD detection and often improves calibration relative to mean-field and perturbation baselines, while Deep Ensembles can still be stronger on in-distribution likelihood-based metrics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes singular Bayesian neural networks by parameterizing weight matrices as low-rank factorizations W = AB^T (A in R^{m x r}, B in R^{n x r}), inducing posteriors singular with respect to Lebesgue measure on the full matrix space and supported on the rank-r manifold. It derives PAC-Bayes generalization bounds whose complexity term scales as sqrt(r(m+n)) rather than sqrt(mn), proves loss bounds that decompose error into optimization and rank-induced bias via the Eckart-Young-Mirsky theorem, adapts recent Gaussian complexity bounds for low-rank deterministic networks to Bayesian predictive means, and reports empirical results on MLPs, LSTMs, and Transformers showing competitive predictive performance, up to 33x fewer parameters than 5-member Deep Ensembles, improved OOD detection, and often better calibration.

Significance. If the central derivations hold, the work offers a principled reduction in effective dimensionality for Bayesian neural networks by exploiting fast singular-value decay in weights, yielding tighter PAC-Bayes bounds and practical gains in parameter efficiency and uncertainty quantification over mean-field approximations. The explicit use of the Eckart-Young-Mirsky theorem for bias decomposition and the adaptation of Gaussian complexity bounds are strengths that ground the claims in standard tools of the field.

major comments (2)

[§3, Theorem 3.1] The PAC-Bayes bound derivation defines the prior and posterior on the r(m+n)-dimensional space of (A,B) and invokes the push-forward measure on W; the proof must explicitly verify that the KL divergence term remains unchanged under this push-forward, as any Jacobian factor or support restriction could alter the complexity term scaling.
[§5.1, Eq. (14)] The loss decomposition into optimization error plus rank-induced bias invokes the Eckart-Young-Mirsky theorem for the bias term; the statement that this bias is 'parameter-free' is contradicted by the dependence on the chosen rank r, which must be treated as a hyperparameter whose selection affects the bound.

minor comments (2)

[Table 2] The reported parameter counts for the proposed method versus Deep Ensembles should include a column for effective degrees of freedom to make the 33x reduction claim directly verifiable.
[Notation section] The symbol r for rank is used in the abstract and introduction without an explicit definition sentence; add one clarifying sentence before the first use of the factorization.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and positive recommendation for minor revision. We address each major comment below and have incorporated the suggested clarifications into the revised manuscript.

Referee: [§3, Theorem 3.1] The PAC-Bayes bound derivation defines the prior and posterior on the r(m+n)-dimensional space of (A,B) and invokes the push-forward measure on W; the proof must explicitly verify that the KL divergence term remains unchanged under this push-forward, as any Jacobian factor or support restriction could alter the complexity term scaling.

Authors: We agree that an explicit verification is required for rigor. In the revised manuscript we have expanded the proof of Theorem 3.1 with a dedicated paragraph showing that the KL divergence is taken directly between the prior and posterior measures on the (A,B) parameter space (which is equipped with Lebesgue measure in R^{r(m+n)}). The push-forward to the space of rank-r matrices W is a smooth immersion, and because both measures are absolutely continuous with respect to the lower-dimensional Lebesgue measure on this manifold, no Jacobian determinant appears in the KL term. Consequently the complexity term continues to scale as sqrt(r(m+n)) without alteration. revision: yes

Referee: [§5.1, Eq. (14)] The loss decomposition into optimization error plus rank-induced bias invokes the Eckart-Young-Mirsky theorem for the bias term; the statement that this bias is 'parameter-free' is contradicted by the dependence on the chosen rank r, which must be treated as a hyperparameter whose selection affects the bound.

Authors: The referee correctly identifies an imprecise phrasing. We have removed the term 'parameter-free' from the revised text. The bias term is now explicitly described as depending on the chosen rank r, which is treated as a hyperparameter selected via validation or domain knowledge. The decomposition itself (optimization error plus Eckart-Young-Mirsky approximation error) remains valid and is now presented with the dependence on r made transparent in both the statement and the discussion of the bound. revision: yes
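
A standard fact that bears on the push-forward question is the data-processing inequality for KL divergence. It is stated here as background and is not presented as the paper's own proof; the symbols $\phi$, $Q$, and $P$ are notation introduced for this note, with $Q$ and $P$ the posterior and prior on the factor space.

```latex
\[
\phi(A, B) = AB^{\top}, \qquad
\mathrm{KL}\!\left(\phi_{\#} Q \,\middle\|\, \phi_{\#} P\right)
\;\le\;
\mathrm{KL}\!\left(Q \,\middle\|\, P\right).
\]
```

Under this inequality, a bound whose complexity term is the KL divergence computed between measures on $(A, B)$ also upper-bounds the divergence between the induced measures on $W$, so no Jacobian correction is needed for the bound to remain valid.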

Circularity Check

0 steps flagged

No significant circularity; parameterization and bounds follow from explicit modeling choice and standard theory

full rationale

The core construction parameterizes W = AB^T explicitly to restrict support to the rank-r manifold, which is a deliberate low-dimensional prior choice whose dimension r(m+n) directly yields the stated PAC-Bayes complexity scaling; this is definitional of the method rather than a hidden reduction. The Eckart-Young-Mirsky decomposition for bias is invoked as an external deterministic fact. No self-citations appear load-bearing, no fitted parameters are relabeled as predictions, and no ansatz is smuggled via prior work. The derivation chain remains self-contained against external PAC-Bayes and matrix approximation results.
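
To illustrate why the number of parameterized coordinates drives the KL term, the sketch below evaluates the closed-form KL divergence between factorized Gaussians, which is a sum over coordinates. The function name and the numbers are assumptions for illustration; the per-coordinate discrepancy is held fixed so that only the dimension changes.

```python
import math

def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q || p) for factorized Gaussians given per-coordinate means and std devs."""
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        total += math.log(sp / sq) + (sq * sq + (mq - mp) ** 2) / (2.0 * sp * sp) - 0.5
    return total

def kl_for_dimension(d):
    """KL when every coordinate has q = N(1, 1) and p = N(0, 1) (illustrative choice)."""
    return kl_diag_gaussians([1.0] * d, [1.0] * d, [0.0] * d, [1.0] * d)

m, n, r = 128, 128, 15  # hypothetical layer sizes
print(kl_for_dimension(r * (m + n)))  # factor space: r(m+n) coordinates
print(kl_for_dimension(m * n))        # dense mean-field: mn coordinates
```

With an identical per-coordinate gap, the KL term is proportional to the coordinate count, which is the mechanism behind the r(m+n) versus mn comparison; actual trained posteriors will of course not have identical per-coordinate gaps.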

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that weight matrices in trained networks exhibit fast singular value decay, plus standard mathematical results such as the Eckart-Young-Mirsky theorem. The rank r is a free parameter that must be chosen.

free parameters (1)

rank r
Chosen per layer or globally; controls the bias-variance trade-off and must be selected based on observed singular value decay.

axioms (1)

standard math: Eckart-Young-Mirsky theorem
Invoked to decompose total error into optimization error plus rank-induced bias.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation · washburn_uniqueness_aczel · unclear
Paper passage: By parameterizing weights as W=AB^T ... the induced posterior q_W is singular with respect to Lebesgue measure, concentrating on the rank-r manifold (Theorem 3.4)

IndisputableMonolith/Foundation/RealityFromDistinction · reality_from_one_distinction · unclear
Paper passage: PAC-Bayes generalization bounds whose complexity term scales as sqrt(r(m+n)) instead of sqrt(mn)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages

[1]

DOI: https://doi.org/10.24432/C5JS49. Chen, S. PM2.5 Data of Five Chinese Cities. UCI Machine Learning Repository, 2016. DOI: https://doi.org/10.24432/C52K58. Chen, T., Fox, E., and Guestrin, C. Stochastic gradient Hamiltonian Monte Carlo. In International Conference on Machine Learning (ICML), pp. 1683–1691, 2014. Cinquin, T., Immer, A., Horn, M., and For...

work page doi:10.24432/c5js49 2016
[2]

This depends on optimization quality and can be reduced with better training procedures

Learning error: $\|W - W_r^*\|_F$ measures how far the learned $W = AB^{\top}$ is from the optimal rank-$r$ approximation. This depends on optimization quality and can be reduced with better training procedures

work page
[3]

$\ln \frac{\prod_{i=1}^{D} q_i(\theta_i)}{\prod_{i=1}^{D} p_i(\theta_i)} \Big] = \mathbb{E}_{\theta \sim Q}$

Rank bias (approximation error): $\sqrt{\sum_{i=r+1}^{\rho} \sigma_i^2(W^*)}$ is the unavoidable error from restricting to rank $r$. If $W^*$ has rapidly decaying singular values, this term is small. C.4. Extension to Stochastic Weights (Bayesian Setting). Lemma C.7 (Variance and Loss, Lemma 1.2.3 in Nesterov, 2018). Assume $\ell(\cdot, y)$ is differentiable and $\beta$-smooth in its first argument...

work page 2018
[4]

original class)

The BNN predictor class has the same Gaussian complexity as the support class, despite being a much larger set (closed convex hull vs. original class)

work page
[5]

The complexity is bounded by that of Pinto’s deterministic class, allowing us to invoke existing deterministic bounds

work page
[6]

No degradation occurs in the complexity bound from the deterministic to the Bayesian setting. E.7. Generalization from Vector-Valued Gaussian Complexity We now state the generalization theorem from Pinto et al. (2025) for vector-valued Gaussian complexity and Lipschitz losses, which we will apply to our BNN predictor. Theorem E.15(Generalization via vecto...

work page 2025
[7]

A contraction inequality showing that Lipschitz losses contract the vector-valued Gaussian complexity by at most $\sqrt{\pi}\,L$

work page
[8]

We cite it in this combined form to avoid re-deriving technical constants

A concentration inequality relating the empirical Gaussian complexity to generalization. We cite it in this combined form to avoid re-deriving technical constants. Our contribution is the class inclusion argument (Proposition E.13) that enables applying this theorem to $f_{\mathrm{BNN}}$. E.7.1. Explicit Complexity Bound for Pinto's Class. We now recall the explicit upper b...

work page 2025
[9]

Depth dependence: The product $\prod_{i=j+1}^{D} C_0 C_i$ shows how spectral norms accumulate through the network depth

Data dependence: The bound explicitly depends on $\|X\|_F$, the norm of the training inputs, unlike PAC-Bayes bounds, which depend on the posterior-prior KL divergence. 3. Depth dependence: The product $\prod_{i=j+1}^{D} C_0 C_i$ shows how spectral norms accumulate through the network depth

work page
[10]

This is an artifact of the layer-peeling proof technique (specifically, how the chaining argument handles the input layer), not a reflection of a missing rank constraint

First layer: The term $B_1 \sqrt{h_1}$ for the first layer differs in form from the $\sqrt{r_j h_j}$ terms for deeper layers. This is an artifact of the layer-peeling proof technique (specifically, how the chaining argument handles the input layer), not a reflection of a missing rank constraint. The rank of the first layer still enters through $B_1 \le \sqrt{r_1}\, C_1$, so that the first-...

work page 2025
[11]

In the extreme case of a near-deterministic posterior, the PAC-Bayes gap approaches zero while the Gaussian complexity bound remains fixed by the class geometry

Posterior concentration: When the posterior is highly concentrated around a good solution, $C_{\max}$ can be very small, making the KL term tight. In the extreme case of a near-deterministic posterior, the PAC-Bayes gap approaches zero while the Gaussian complexity bound remains fixed by the class geometry

work page
[12]

The Gaussian complexity bound's leading term grows linearly in $\|X\|_F$

Large input norms: PAC-Bayes bounds do not depend on $\|X\|_F$, which is advantageous when inputs have very large norm or heavy tails. The Gaussian complexity bound's leading term grows linearly in $\|X\|_F$

work page
[13]

When $d_{\mathrm{in}} \gg d_{\mathrm{out}}$ and $C_0^2 C_1^4 R^2$ is large, PAC-Bayes can still be favorable despite the $d_{\mathrm{in}}$ term, because it avoids the spectral norm constants entirely

High-dimensional inputs with moderate rank: Looking at the full expressions, PAC-Bayes scales with $r(d_{\mathrm{in}} + d_{\mathrm{out}})$ while Gaussian complexity (with bounded inputs) scales with $C_0^2 C_1^4 R^2 \cdot r\, d_{\mathrm{out}}$. When $d_{\mathrm{in}} \gg d_{\mathrm{out}}$ and $C_0^2 C_1^4 R^2$ is large, PAC-Bayes can still be favorable despite the $d_{\mathrm{in}}$ term, because it avoids the spectral norm constants entirely. Concretel...

work page
[14]

Diffuse posteriors with bounded spectral norms: When $C_{\max}$ is large (the posterior has not concentrated tightly), the PAC-Bayes KL term can be vacuous. The Gaussian complexity bound depends on the spectral norm constraints $C_1$, not on posterior concentration, and can remain informative even when the posterior is spread over a large region of parameter space

work page
[15]

This is the regime where the Gaussian complexity bound is numerically tighter

Normalized inputs with small spectral norm constants: When $\|x_j\|_2 \le R$ with $R$ moderate and spectral norms are well-controlled ($C_1$ not too large), the Gaussian complexity bound's leading constant $\sqrt{\pi}\, L\, C_0 C_1^2 R$ can be smaller than $\sqrt{C_{\max}/2}$. This is the regime where the Gaussian complexity bound is numerically tighter

work page
[16]

This makes them valuable as a prior-free sanity check

No prior specification required: Unlike PAC-Bayes, which requires choosing a prior $P$ and where the bound quality depends sensitively on this choice, Gaussian complexity bounds depend only on the realized network architecture and data. This makes them valuable as a prior-free sanity check

work page
[17]

good event

Architectural transparency: The bound explicitly shows how rank, spectral norms, and layer dimensions interact through the network depth, providing guidance for architectural design that is not available from the PAC-Bayes bound's KL-divergence summary. Remark E.25 (Rank dependence). Both bounds have $\sqrt{r}$ rank dependence in their leading terms: • PAC-Bayes: The p...

work page 2017
[18]

In reparameterized variational inference, the same clipping can be applied to sampled factors

Projection / spectral normalization. After each parameter update during optimization, rescale factors to enforce spectral norm constraints: $A_i \leftarrow \min\!\left(1, \frac{C_i^A}{\|A_i\|_2}\right) A_i$, $B_i \leftarrow \min\!\left(1, \frac{C_i^B}{\|B_i\|_2}\right) B_i$. In reparameterized variational inference, the same clipping can be applied to sampled factors. This ensures almost-sure satisfaction of the bounds

work page
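
The rescaling step described in this entry can be sketched in a few lines. This is an illustrative stdlib-only version under stated assumptions: the spectral norm is estimated by power iteration (an implementation choice of this sketch, not something the source specifies), and the function names are hypothetical.

```python
import math

def spectral_norm(a, iters=200):
    """Estimate ||A||_2 by power iteration on A^T A (A is a nested list)."""
    rows, cols = len(a), len(a[0])
    # All-ones start; assumes it is not orthogonal to the top right singular vector.
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iters):
        u = [sum(a[i][j] * v[j] for j in range(cols)) for i in range(rows)]  # u = A v
        w = [sum(a[i][j] * u[i] for i in range(rows)) for j in range(cols)]  # w = A^T u
        norm_w = math.sqrt(sum(x * x for x in w))
        if norm_w == 0.0:
            return 0.0
        v = [x / norm_w for x in w]
    u = [sum(a[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return math.sqrt(sum(x * x for x in u))

def clip_spectral_norm(a, bound):
    """Return min(1, bound / ||A||_2) * A, leaving A unchanged if already within bound."""
    norm = spectral_norm(a)
    scale = 1.0 if norm == 0.0 else min(1.0, bound / norm)
    return [[scale * x for x in row] for row in a]

factor = [[3.0, 0.0], [0.0, 1.0]]          # spectral norm 3
clipped = clip_spectral_norm(factor, 1.5)  # rescaled so the norm equals the bound
print(clipped, spectral_norm(clipped))
```

In practice a numerical library's singular value routine would replace the power iteration; the clipping rule itself is the part taken from the text above.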
[19]

For example, a truncated Gaussian with support $\{A : \|A\|_2 \le C_A\}$ naturally satisfies the assumption

Compact-support families. Use truncated Gaussians or uniform distributions supported on the spectral norm ball. For example, a truncated Gaussian with support $\{A : \|A\|_2 \le C_A\}$ naturally satisfies the assumption. This requires computing the normalization constant, which may be computationally expensive for high-dimensional matrices. 3. Soft constraints via pen...

work page 2017
[20]

Biases are either stochastic Gaussians or deterministic (both tested); we report the deterministic-bias variant for parameter efficiency

Full-Rank Gaussian (BBB). Following Blundell et al. (2015), we place an independent diagonal Gaussian variational posterior on each weight and bias: $q(W_{ij} \mid \mu_{ij}, \sigma_{ij}^2) = \mathcal{N}(\mu_{ij}, \sigma_{ij}^2)$, reparameterized as $W = \mu + \sigma \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$. Biases are either stochastic Gaussians or deterministic (both tested); we report the deterministic-bias variant for parameter ...

work page 2015
[21]

Each entry of $A$ and $B$ has a diagonal Gaussian posterior: $q(A_{ik}) = \mathcal{N}(\mu_{ik}^A, (\sigma_{ik}^A)^2)$ and similarly for $B$

Low-Rank Gaussian Factorization. We impose a rank-$r$ factorization $W \approx AB^{\top}$ where $A \in \mathbb{R}^{n \times r}$ and $B \in \mathbb{R}^{m \times r}$ with $r \ll \min(n, m)$. Each entry of $A$ and $B$ has a diagonal Gaussian posterior: $q(A_{ik}) = \mathcal{N}(\mu_{ik}^A, (\sigma_{ik}^A)^2)$ and similarly for $B$. The bias remains deterministic. We sweep ranks $r \in \{10, 25, 50\}$ for hidden layers and report the rank that best balances accuracy and ca...

work page
[22]

Rank-1 Multiplicative (Dusenberry et al., 2020). Following Dusenberry et al. (2020a), we fix a deterministic base weight matrix $W$ and apply a rank-1 stochastic multiplicative perturbation: $W' = W \odot (s \otimes r)$, where $s \in \mathbb{R}^n$ and $r \in \mathbb{R}^m$ are stochastic vectors with diagonal Gaussian posteriors. This parameterization dramatically reduces the stochasti...

work page 2020
[23]

This prior encourages sparsity by placing probability mass on both a standard Gaussian and a narrow Gaussian; it is identical across all four models to ensure fair comparison

$+\,(1-\pi)\,\mathcal{N}(0, \sigma_2^2)$, with fixed hyperparameters $\pi = 0.5$, $\sigma_1 = 1.0$, and $\sigma_2 = \exp(-6) \approx 0.00248$. This prior encourages sparsity by placing probability mass on both a standard Gaussian and a narrow Gaussian; it is identical across all four models to ensure fair comparison. Training Procedure. The models are trained by minimizing the negative ELBO: $\mathcal{L} = \mathbb{E}_{q(\theta)}[-\log p(y \mid x, \theta)$...

work page arXiv 2017
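
The log-density of a two-component scale mixture prior can be evaluated stably with a log-sum-exp. The sketch below assumes the full prior is $\pi\,\mathcal{N}(0, \sigma_1^2) + (1-\pi)\,\mathcal{N}(0, \sigma_2^2)$, since the first component is cut off in the excerpt above; the hyperparameter defaults are the ones quoted there, and the function names are hypothetical.

```python
import math

def log_normal_pdf(w, sigma):
    """Log density of N(0, sigma^2) at w."""
    return -0.5 * math.log(2.0 * math.pi) - math.log(sigma) - 0.5 * (w / sigma) ** 2

def log_scale_mixture_prior(w, pi=0.5, sigma1=1.0, sigma2=math.exp(-6.0)):
    """log p(w) for p = pi N(0, sigma1^2) + (1 - pi) N(0, sigma2^2), with 0 < pi < 1."""
    a = math.log(pi) + log_normal_pdf(w, sigma1)
    b = math.log(1.0 - pi) + log_normal_pdf(w, sigma2)
    hi = max(a, b)
    # Log-sum-exp keeps the narrow component from underflowing the whole sum.
    return hi + math.log(math.exp(a - hi) + math.exp(b - hi))

print(round(math.exp(-6.0), 5))
print(log_scale_mixture_prior(0.0), log_scale_mixture_prior(1.0))
```

Near zero the narrow component dominates the density, while away from zero the wide component takes over, which is how this prior favors many near-zero weights without forbidding large ones.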
[24]

Prior: mixture $p(w) = 0.5\,\mathcal{N}(0, 2.0^2) + 0.5\,\mathcal{N}(0, e^{-6})$

Full-Rank BNN (Bayes by Backprop): All weights have diagonal Gaussian posteriors $q(w) = \mathcal{N}(\mu_w, \sigma_w^2)$. Prior: mixture $p(w) = 0.5\,\mathcal{N}(0, 2.0^2) + 0.5\,\mathcal{N}(0, e^{-6})$. Total trainable parameters: 20,802

work page
[25]

Input (1→100) and output (100→1) layers remain full-rank

Low-Rank BNN: Hidden layer (100→100) uses factorization $W \approx AB^{\top}$ with rank $r = 16$. Input (1→100) and output (100→1) layers remain full-rank. Same mixture prior. Total trainable parameters: 7,202 (65% parameter reduction). Training uses KL annealing (ramped over 760 epochs) and $\beta = 0.0001/N$ tempering to avoid KL dominance. I.3.2. Results. Model Single Pass RMSE...

work page
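
The reported totals for the 1→100→100→1 network can be reproduced with simple counting. The sketch assumes every weight and bias entry carries two trainable numbers (a mean and a scale); that assumption is ours, but it matches both reported figures. The helper name is hypothetical.

```python
def gaussian_param_count(weight_entries, bias_entries):
    """Two trainable numbers (mean and scale) per Gaussian-distributed entry."""
    return 2 * (weight_entries + bias_entries)

biases = 100 + 100 + 1  # one bias per unit in the two hidden layers and the output

# Full-rank: dense 1x100, 100x100, and 100x1 weight matrices.
full_rank = gaussian_param_count(1 * 100 + 100 * 100 + 100 * 1, biases)

# Low-rank: the 100x100 hidden layer becomes A (100x16) and B (100x16).
low_rank = gaussian_param_count(1 * 100 + 16 * (100 + 100) + 100 * 1, biases)

reduction = 1 - low_rank / full_rank
print(full_rank, low_rank, round(100 * reduction))
```

Only the hidden layer is factorized, so the saving comes entirely from replacing 10,000 weight entries with 3,200 factor entries.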
[26]

This is a desirable property for safe prediction

Wider absolute OOD band: Low-rank maintains significantly larger out-of-domain uncertainty (0.0940 vs 0.0483), providing more conservative credible intervals where data are absent. This is a desirable property for safe prediction

work page
[27]

This shows low-rank preserves the qualitative epistemic sensitivity: uncertainty grows when leaving the training domain

Sharper expansion ratio: Although in-domain uncertainty is higher in low-rank (0.0467 vs 0.0254), the OOD/in-domain ratio is nearly identical (2.01× vs 1.90×). This shows low-rank preserves the qualitative epistemic sensitivity: uncertainty grows when leaving the training domain

work page
[28]

Rather than under-confident (overfitting) predictions, the low-rank posterior spreads mass more broadly, yielding conservative predictions even in-domain

Stronger in-domain regularization: Low-rank's higher baseline epistemic IQR (0.0467) reflects structured regularization from rank constraints. Rather than under-confident (overfitting) predictions, the low-rank posterior spreads mass more broadly, yielding conservative predictions even in-domain.

work page

[1] [1]

DOI: https://doi.org/10.24432/C5JS49. Chen, S. PM2.5 Data of Five Chinese Cities. UCI Machine Learning Repository, 2016. DOI: https://doi.org/10.24432/C52K58. Chen, T., Fox, E., and Guestrin, C. Stochastic gradient Hamiltonian Monte Carlo. InInternational Conference on Machine Learning (ICML), pp. 1683–1691, 2014. Cinquin, T., Immer, A., Horn, M., and For...

[2]

Learning error: $\|W - W_r^*\|_F$ measures how far the learned $W = AB^\top$ is from the optimal rank-$r$ approximation. This depends on optimization quality and can be reduced with better training procedures.
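
The split between learning error and rank-induced bias can be illustrated numerically. The sketch below assumes a random target matrix `W_star` and a perturbed factorization standing in for a trained one (both hypothetical); it uses the truncated SVD as the best rank-r approximation, as given by the Eckart-Young-Mirsky theorem, and checks the triangle inequality that bounds the total error by the two terms.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 30, 20, 5

W_star = rng.standard_normal((m, n))            # hypothetical target weights
U, s, Vt = np.linalg.svd(W_star, full_matrices=False)
W_star_r = (U[:, :r] * s[:r]) @ Vt[:r]          # best rank-r approximation

# Rank bias: Frobenius error of the truncated SVD,
# equal to sqrt(sum of the discarded squared singular values).
rank_bias = np.sqrt(np.sum(s[r:] ** 2))

# Hypothetical learned factors: the optimum plus a small perturbation.
A = U[:, :r] * s[:r] + 0.01 * rng.standard_normal((m, r))
B = Vt[:r].T
W = A @ B.T
learning_error = np.linalg.norm(W - W_star_r, "fro")

total_error = np.linalg.norm(W - W_star, "fro")
print(total_error <= learning_error + rank_bias)
```

Only the learning error responds to better optimization; the rank bias is fixed by the spectrum of the target and the chosen rank.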

[3]

$\ln \dfrac{\prod_{i=1}^{D} q_i(\theta_i)}{\prod_{i=1}^{D} p_i(\theta_i)} \Big] = \mathbb{E}_{\theta \sim Q}$

Rank bias (approximation error): $\sqrt{\sum_{i=r+1}^{\rho} \sigma_i^2(W^*)}$ is the unavoidable error from restricting to rank $r$. If $W^*$ has rapidly decaying singular values, this term is small.

C.4. Extension to Stochastic Weights (Bayesian Setting)

Lemma C.7 (Variance and Loss, Lemma 1.2.3 in (Nesterov, 2018)). Assume ℓ(·, y) is differentiable and β-smooth in its first argument...

[4]

The BNN predictor class has the same Gaussian complexity as the support class, despite being a much larger set (closed convex hull vs. original class).

[5]

The complexity is bounded by that of Pinto’s deterministic class, allowing us to invoke existing deterministic bounds.

[6]

No degradation occurs in the complexity bound from the deterministic to the Bayesian setting.

E.7. Generalization from Vector-Valued Gaussian Complexity

We now state the generalization theorem from Pinto et al. (2025) for vector-valued Gaussian complexity and Lipschitz losses, which we will apply to our BNN predictor. Theorem E.15 (Generalization via vecto...

[7]

A contraction inequality showing that Lipschitz losses contract the vector-valued Gaussian complexity by at most $\sqrt{\pi}\,L$.

[8]

A concentration inequality relating the empirical Gaussian complexity to generalization. We cite it in this combined form to avoid re-deriving technical constants. Our contribution is the class inclusion argument (Proposition E.13) that enables applying this theorem to $f_{\mathrm{BNN}}$.

E.7.1. EXPLICIT COMPLEXITY BOUND FOR PINTO’S CLASS

We now recall the explicit upper b...

[9]

Data dependence: The bound explicitly depends on $\|X\|_F$, the norm of the training inputs, unlike PAC-Bayes bounds, which depend on the posterior-prior KL divergence.
3. Depth dependence: The product $\prod_{i=j+1}^{D} C_0 C_i$ shows how spectral norms accumulate through the network depth.

[10]

First layer: The term $B_1 \sqrt{h_1}$ for the first layer differs in form from the $\sqrt{r_j h_j}$ terms for deeper layers. This is an artifact of the layer-peeling proof technique (specifically, how the chaining argument handles the input layer), not a reflection of a missing rank constraint. The rank of the first layer still enters through $B_1 \le \sqrt{r_1}\, C_1$, so that the first-...

[11]

Posterior concentration: When the posterior is highly concentrated around a good solution, $C_{\max}$ can be very small, making the KL term tight. In the extreme case of a near-deterministic posterior, the PAC-Bayes gap approaches zero while the Gaussian complexity bound remains fixed by the class geometry.

[12]

Large input norms: PAC-Bayes bounds do not depend on $\|X\|_F$, which is advantageous when inputs have very large norm or heavy tails. The Gaussian complexity bound’s leading term grows linearly in $\|X\|_F$.

[13]

High-dimensional inputs with moderate rank: Looking at the full expressions, PAC-Bayes scales with $r(d_{\mathrm{in}} + d_{\mathrm{out}})$ while Gaussian complexity (with bounded inputs) scales with $C_0^2 C_1^4 R^2 \cdot r\, d_{\mathrm{out}}$. When $d_{\mathrm{in}} \gg d_{\mathrm{out}}$ and $C_0^2 C_1^4 R^2$ is large, PAC-Bayes can still be favorable despite the $d_{\mathrm{in}}$ term, because it avoids the spectral norm constants entirely. Concretel...

[14]

Diffuse posteriors with bounded spectral norms: When $C_{\max}$ is large (the posterior has not concentrated tightly), the PAC-Bayes KL term can be vacuous. The Gaussian complexity bound depends on the spectral norm constraints $C_1$, not on posterior concentration, and can remain informative even when the posterior is spread over a large region of parameter space.

[15]

Normalized inputs with small spectral norm constants: When $\|x_j\|_2 \le R$ with $R$ moderate and spectral norms are well-controlled ($C_1$ not too large), the Gaussian complexity bound’s leading constant $\sqrt{\pi}\, L\, C_0 C_1^2 R$ can be smaller than $\sqrt{C_{\max}/2}$. This is the regime where the Gaussian complexity bound is numerically tighter.
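
To see how the two leading constants trade off, the sketch below evaluates both expressions for a few values. All numbers are hypothetical and chosen only to show the two regimes; the function names are illustrative, and only the leading constants are compared, not the full bounds.

```python
import math

def gaussian_complexity_constant(L: float, C0: float, C1: float, R: float) -> float:
    """Leading constant sqrt(pi) * L * C0 * C1^2 * R of the Gaussian complexity bound."""
    return math.sqrt(math.pi) * L * C0 * C1 ** 2 * R

def pac_bayes_constant(C_max: float) -> float:
    """Leading constant sqrt(C_max / 2) of the PAC-Bayes bound."""
    return math.sqrt(C_max / 2)

# Hypothetical, well-controlled norms: L = C0 = C1 = R = 1.
gc = gaussian_complexity_constant(L=1.0, C0=1.0, C1=1.0, R=1.0)

diffuse = pac_bayes_constant(C_max=8.0)        # posterior not concentrated
concentrated = pac_bayes_constant(C_max=0.5)   # posterior tightly concentrated

print(gc < diffuse, gc < concentrated)
```

Because the spectral norm constant enters as a square, doubling `C1` quadruples the Gaussian complexity constant, so the comparison is sensitive to how tightly the norms are controlled.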

[16]

No prior specification required: Unlike PAC-Bayes, which requires choosing a prior P and where the bound quality depends sensitively on this choice, Gaussian complexity bounds depend only on the realized network architecture and data. This makes them valuable as a prior-free sanity check.

[17]

Architectural transparency: The bound explicitly shows how rank, spectral norms, and layer dimensions interact through the network depth, providing guidance for architectural design that is not available from the PAC-Bayes bound’s KL-divergence summary.

Remark E.25 (Rank dependence). Both bounds have $\sqrt{r}$ rank dependence in their leading terms:
• PAC-Bayes: The p...

[18]

Projection / spectral normalization. After each parameter update during optimization, rescale factors to enforce spectral norm constraints: $A_i \leftarrow \min\!\left(1, \frac{C_i^A}{\|A_i\|_2}\right) A_i$, $B_i \leftarrow \min\!\left(1, \frac{C_i^B}{\|B_i\|_2}\right) B_i$. In reparameterized variational inference, the same clipping can be applied to sampled factors. This ensures almost-sure satisfaction of the bounds.
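
The rescaling rule above can be sketched in a few lines of NumPy. The helper name `clip_spectral_norm`, the matrix sizes, and the budgets `C_A` and `C_B` are assumptions made for illustration; the paper does not specify an implementation.

```python
import numpy as np

def clip_spectral_norm(M: np.ndarray, c: float) -> np.ndarray:
    """Rescale M by min(1, c / ||M||_2) so its spectral norm is at most c."""
    spec = np.linalg.norm(M, 2)                 # largest singular value
    scale = min(1.0, c / spec) if spec > 0 else 1.0
    return scale * M

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 16))              # hypothetical factors
B = rng.standard_normal((100, 16))
C_A, C_B = 2.0, 2.0                             # hypothetical norm budgets

A, B = clip_spectral_norm(A, C_A), clip_spectral_norm(B, C_B)
W = A @ B.T
print(np.linalg.norm(A, 2), np.linalg.norm(B, 2), np.linalg.norm(W, 2))
```

Since the spectral norm is submultiplicative, clipping the two factors also bounds the product: the norm of `W` cannot exceed `C_A * C_B`. A factor already inside the ball is returned unchanged.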

[19]

Compact-support families. Use truncated Gaussians or uniform distributions supported on the spectral norm ball. For example, a truncated Gaussian with support $\{A : \|A\|_2 \le C^A\}$ naturally satisfies the assumption. This requires computing the normalization constant, which may be computationally expensive for high-dimensional matrices.
3. Soft constraints via pen...

[20]

Full-Rank Gaussian (BBB). Following Blundell et al. (2015), we place an independent diagonal Gaussian variational posterior on each weight and bias: $q(W_{ij} \mid \mu_{ij}, \sigma_{ij}^2) = \mathcal{N}(\mu_{ij}, \sigma_{ij}^2)$, reparameterized as $W = \mu + \sigma \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$. Biases are either stochastic Gaussians or deterministic (both tested); we report the deterministic-bias variant for parameter ...

[21]

Low-Rank Gaussian Factorization. We impose a rank-$r$ factorization $W \approx AB^\top$ where $A \in \mathbb{R}^{n \times r}$ and $B \in \mathbb{R}^{m \times r}$ with $r \ll \min(n, m)$. Each entry of $A$ and $B$ has a diagonal Gaussian posterior: $q(A_{ik}) = \mathcal{N}(\mu^A_{ik}, (\sigma^A_{ik})^2)$ and similarly for $B$. The bias remains deterministic. We sweep ranks $r \in \{10, 25, 50\}$ for hidden layers and report the rank that best balances accuracy and ca...
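
A minimal sketch of drawing one weight sample from such a factorized posterior is given below. The softplus parameterization of the standard deviations, the initial values, and the layer sizes are assumptions for illustration (softplus is a common choice in Bayes by Backprop, not confirmed here for this paper).

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 100, 100, 16                          # illustrative sizes

def softplus(x):
    """Map unconstrained parameters to positive standard deviations."""
    return np.log1p(np.exp(x))

# Variational parameters (randomly initialized here, learned in practice).
mu_A, rho_A = 0.1 * rng.standard_normal((n, r)), np.full((n, r), -3.0)
mu_B, rho_B = 0.1 * rng.standard_normal((m, r)), np.full((m, r), -3.0)

def sample_weight() -> np.ndarray:
    """Draw W = A B^T with A and B sampled via the reparameterization trick."""
    A = mu_A + softplus(rho_A) * rng.standard_normal((n, r))
    B = mu_B + softplus(rho_B) * rng.standard_normal((m, r))
    return A @ B.T

W = sample_weight()
print(W.shape, np.linalg.matrix_rank(W))
```

Every sample has rank at most `r`, so the induced distribution over `W` lives on the rank-`r` matrices, and all entries in a row of `W` share the same row of `A`, which is where the weight correlations come from.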

[22]

Rank-1 Multiplicative (Dusenberry et al., 2020). Following Dusenberry et al. (2020a), we fix a deterministic base weight matrix $W$ and apply a rank-1 stochastic multiplicative perturbation: $W' = W \odot (s \otimes r)$, where $s \in \mathbb{R}^n$ and $r \in \mathbb{R}^m$ are stochastic vectors with diagonal Gaussian posteriors. This parameterization dramatically reduces the stochasti...
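
For comparison with the factorized posterior, the rank-1 multiplicative perturbation can be sketched as follows. The sizes, posterior means, and scales are hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3
W_base = rng.standard_normal((n, m))            # deterministic base weights

# Hypothetical posterior means and scales for the two stochastic vectors.
mu_s, sigma_s = np.ones(n), 0.1 * np.ones(n)
mu_r, sigma_r = np.ones(m), 0.1 * np.ones(m)

s = mu_s + sigma_s * rng.standard_normal(n)
r = mu_r + sigma_r * rng.standard_normal(m)

W_perturbed = W_base * np.outer(s, r)           # W' = W elementwise-times (s outer r)
n_stochastic = n + m                            # versus n * m under a full mean-field posterior
print(W_perturbed.shape, n_stochastic)
```

Note that the perturbation `np.outer(s, r)` has rank 1, while the perturbed matrix itself generally keeps the rank of the base matrix.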

[23]

$+\,(1-\pi)\,\mathcal{N}(0, \sigma_2^2)$, with fixed hyperparameters $\pi = 0.5$, $\sigma_1 = 1.0$, and $\sigma_2 = \exp(-6) \approx 0.00248$. This prior encourages sparsity by placing probability mass on both a standard Gaussian and a narrow Gaussian; it is identical across all four models to ensure fair comparison.

Training Procedure. The models are trained by minimizing the negative ELBO: L = E_{q(θ)}[−log p(y | x, θ)...
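
The log-density of the mixture prior enters the KL term of the objective and is easy to get wrong numerically because the narrow component underflows away from zero. The sketch below assumes the first mixture component is $\pi\,\mathcal{N}(0, \sigma_1^2)$, consistent with the description of a standard Gaussian plus a narrow Gaussian, and evaluates the log-density with a log-sum-exp; the function names are illustrative.

```python
import math

PI_MIX, SIGMA_1, SIGMA_2 = 0.5, 1.0, math.exp(-6)   # hyperparameters from the text

def log_normal_pdf(w: float, sigma: float) -> float:
    """Log-density of N(0, sigma^2) at w."""
    return -0.5 * math.log(2 * math.pi) - math.log(sigma) - 0.5 * (w / sigma) ** 2

def log_mixture_prior(w: float) -> float:
    """log[pi * N(w; 0, sigma_1^2) + (1 - pi) * N(w; 0, sigma_2^2)] via log-sum-exp."""
    a = math.log(PI_MIX) + log_normal_pdf(w, SIGMA_1)
    b = math.log(1 - PI_MIX) + log_normal_pdf(w, SIGMA_2)
    hi = max(a, b)
    return hi + math.log(math.exp(a - hi) + math.exp(b - hi))

print(log_mixture_prior(0.0), log_mixture_prior(1.0))
```

Near zero the narrow component dominates and rewards small weights; at `w = 1.0` it contributes essentially nothing and the density reduces to the wide component alone.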

[24]

Full-Rank BNN (Bayes by Backprop): All weights have diagonal Gaussian posteriors $q(w) = \mathcal{N}(\mu_w, \sigma_w^2)$. Prior: mixture $p(w) = 0.5\,\mathcal{N}(0, 2.0^2) + 0.5\,\mathcal{N}(0, e^{-6})$. Total trainable parameters: 20,802.
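
The reported totals (20,802 for the full-rank model and 7,202 for the rank-16 variant) can be reproduced by counting. The sketch below assumes a 1→100→100→1 network in which every weight and bias carries two trainable numbers, a mean and a scale; that bias treatment is an inference that happens to match the reported counts, not something the text states.

```python
def mean_field_param_count(n_weights: int, n_biases: int) -> int:
    """Each stochastic scalar carries a mean and a scale parameter."""
    return 2 * (n_weights + n_biases)

# Assumed architecture: 1 -> 100 -> 100 -> 1.
layers = [(1, 100), (100, 100), (100, 1)]
n_biases = sum(n_out for _, n_out in layers)    # 100 + 100 + 1 = 201

full_weights = sum(n_in * n_out for n_in, n_out in layers)
full_rank_params = mean_field_param_count(full_weights, n_biases)

rank = 16
# Only the hidden layer is factorized: A is 100 x r and B is 100 x r.
low_weights = 1 * 100 + rank * (100 + 100) + 100 * 1
low_rank_params = mean_field_param_count(low_weights, n_biases)

reduction = 1.0 - low_rank_params / full_rank_params
print(full_rank_params, low_rank_params, f"{reduction:.1%}")
```

The saving comes entirely from the hidden layer, where 10,000 weights are replaced by 3,200 factor entries.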
