arxiv: 2604.04673 · v1 · submitted 2026-04-06 · 🧮 math.ST · cs.LG · stat.ML · stat.TH

Minimaxity and Admissibility of Bayesian Neural Networks

Daniel Andrew Coulson, Martin T. Wells

Pith reviewed 2026-05-10 19:28 UTC · model grok-4.3

classification 🧮 math.ST · cs.LG · stat.ML · stat.TH

keywords Bayesian neural networks · minimaxity · admissibility · normal location model · hyperprior · ReLU networks · quadratic loss · Kullback-Leibler loss

The pith

A hyperprior on effective output variance makes deep ReLU BNN decision rules admissible and minimax in the normal location model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies decision rules produced by deep fully connected ReLU Bayesian neural networks when estimating a vector of normal means under quadratic loss. With fixed prior scales these rules are not minimax. The authors introduce a hyperprior on the effective output variance of the BNN prior so that the square root of the marginal density becomes superharmonic; this single property establishes that the induced decision rule is simultaneously admissible and minimax. The same construction is shown to deliver optimality for predictive density estimation under Kullback-Leibler loss, and the claims are checked in simulation.

Core claim

In the normal location model, deep ReLU Bayesian neural networks with fixed prior scales induce Bayes rules that are not minimax under quadratic loss. Introducing a hyperprior on the effective output variance that yields a superharmonic square-root marginal density produces a decision rule that is both admissible and minimax. The construction extends directly to the problem of estimating the predictive density under Kullback-Leibler loss.
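
To see why a superharmonic square-root marginal is the pivotal property, the standard Stein-type argument for the normal location model can be sketched as follows. The notation here ($X \sim N_p(\theta, I_p)$, prior $\pi$, marginal density $m$, Bayes rule $\delta_m$) is assumed for illustration and may differ from the paper's, and the identities hold only under the usual regularity conditions on $m$.

```latex
% Assumed setup: X ~ N_p(\theta, I_p), prior \pi on \theta,
% marginal m(x) = \int \phi_p(x - \theta)\, \pi(d\theta).

% Bayes rule under quadratic loss (Tweedie / Brown representation):
\delta_m(X) = X + \nabla \log m(X)

% Stein's unbiased risk estimate applied to g = \nabla \log m:
R(\theta, \delta_m)
  = p + E_\theta\!\left[ 2\,\frac{\Delta m(X)}{m(X)}
        - \frac{\lVert \nabla m(X) \rVert^2}{m(X)^2} \right]
  = p + 4\, E_\theta\!\left[ \frac{\Delta \sqrt{m(X)}}{\sqrt{m(X)}} \right]

% Hence \Delta \sqrt{m} \le 0 everywhere implies
R(\theta, \delta_m) \le p = R(\theta, X) \quad \text{for all } \theta .
```

Because the estimator $X$ has constant risk $p$ and is minimax, any rule whose risk never exceeds $p$ is minimax as well. This sketch covers only the minimaxity half of the claim; the admissibility argument is not reproduced here.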

What carries the argument

A hyperprior placed on the effective output variance of the BNN prior, chosen so the square root of the resulting marginal density is superharmonic.

If this is right

The BNN decision rule becomes minimax under quadratic loss.
The same rule is admissible.
The optimality result carries over to estimating the predictive density under Kullback-Leibler loss.
Numerical simulations are consistent with the theoretical minimax and admissibility properties.
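
A simulation check of this kind can be sketched with a plain Monte Carlo risk estimate. The block below is a minimal illustration, not the paper's experiment: it uses the positive-part James-Stein rule as a stand-in for the BNN-induced rule (which is not reproduced here), and the function names `mle`, `james_stein_plus`, and `estimated_risk` are hypothetical.

```python
import numpy as np


def mle(x):
    """Usual estimator delta_0(X) = X; its quadratic risk is p for every theta."""
    return x


def james_stein_plus(x):
    """Positive-part James-Stein rule (a stand-in, NOT the paper's BNN rule)."""
    p = x.shape[-1]
    norm2 = np.sum(x ** 2, axis=-1, keepdims=True)
    shrink = np.maximum(0.0, 1.0 - (p - 2) / norm2)
    return shrink * x


def estimated_risk(rule, theta, n_rep=20000, seed=0):
    """Monte Carlo estimate of E||rule(X) - theta||^2 for X ~ N_p(theta, I_p)."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta, dtype=float)
    x = theta + rng.standard_normal((n_rep, theta.size))
    loss = np.sum((rule(x) - theta) ** 2, axis=-1)
    return float(loss.mean())


if __name__ == "__main__":
    p = 10
    for radius in (0.0, 2.0, 5.0, 10.0):
        # theta placed on the diagonal so that ||theta|| equals radius
        theta = np.full(p, radius / np.sqrt(p))
        print(radius, estimated_risk(mle, theta), estimated_risk(james_stein_plus, theta))
```

A rule that is minimax should show an estimated risk at or below p (up to Monte Carlo error) at every value of theta tried. Reusing the same seed for both rules gives common random numbers, which reduces the noise in the comparison.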

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same hyperprior technique could be tested on BNNs with other activation functions or in non-location models to check whether minimaxity follows from analogous marginal-density conditions.
If the superharmonic property can be preserved under approximate inference methods such as variational Bayes, then practical implementations might inherit the optimality guarantees.
The result suggests that prior-scale tuning in deep networks can be reframed as a problem of engineering the marginal density to satisfy classical decision-theoretic criteria.

Load-bearing premise

A hyperprior on effective output variance can be chosen that produces a superharmonic square-root marginal density for the deep ReLU BNN prior in the normal location model.

What would settle it

Direct computation or high-precision numerical integration showing that the square root of the marginal density under the proposed hyperprior fails to be superharmonic would disprove the simultaneous admissibility and minimaxity claim.
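
A numerical check along these lines could look like the sketch below, which evaluates a central finite-difference Laplacian of a candidate square-root marginal at chosen points. Everything in it is an illustrative assumption: the helper names are hypothetical, and the two test functions are textbook stand-ins (the marginal under a fixed-scale Gaussian prior, and the pseudo-marginal ||x||^{-(p-2)}), not the BNN marginal from the paper.

```python
import numpy as np


def laplacian_fd(f, x, h=1e-3):
    """Central finite-difference approximation to the Laplacian of f at x."""
    x = np.asarray(x, dtype=float)
    fx = f(x)
    total = 0.0
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        total += f(x + step) - 2.0 * fx + f(x - step)
    return total / h ** 2


def sqrt_gaussian_marginal(x, tau2=1.0):
    """Unnormalised sqrt of the N(0, (1 + tau2) I) marginal (fixed Gaussian prior)."""
    return np.exp(-np.dot(x, x) / (4.0 * (1.0 + tau2)))


def sqrt_harmonic_pseudo_marginal(x):
    """sqrt of the pseudo-marginal ||x||^{-(p-2)}, superharmonic away from 0 for p >= 3."""
    p = x.size
    return np.dot(x, x) ** (-(p - 2) / 4.0)


if __name__ == "__main__":
    p = 5
    near = np.full(p, 1.0 / np.sqrt(p))   # ||x|| = 1
    far = np.full(p, 6.0 / np.sqrt(p))    # ||x|| = 6
    for name, f in [("gaussian", sqrt_gaussian_marginal),
                    ("harmonic", sqrt_harmonic_pseudo_marginal)]:
        print(name, laplacian_fd(f, near), laplacian_fd(f, far))
```

A positive value at any point is a counterexample to superharmonicity. For the fixed-scale Gaussian stand-in the Laplacian turns positive far from the origin, whereas the harmonic pseudo-marginal stays negative at both points. Applying the same check to the paper's marginal would require evaluating that density to high precision, which this sketch does not attempt.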

Figures

Figures reproduced from arXiv: 2604.04673 by Daniel Andrew Coulson, Martin T. Wells.

**Figure 2.** Estimated risk for several decision rules in dimension …

**Figure 3.** Estimated risk for several decision rules in dimension …

**Figure 4.** Estimated risk for several decision rules in dimension …

**Figure 5.** Estimated risk for several decision rules in dimension …

**Figure 6.** Estimated risk for several decision rules in dimension …

Original abstract

Bayesian neural networks (BNNs) offer a natural probabilistic formulation for inference in deep learning models. Despite their popularity, their optimality has received limited attention through the lens of statistical decision theory. In this paper, we study decision rules induced by deep, fully connected feedforward ReLU BNNs in the normal location model under quadratic loss. We show that, for fixed prior scales, the induced Bayes decision rule is not minimax. We then propose a hyperprior on the effective output variance of the BNN prior that yields a superharmonic square-root marginal density, establishing that the resulting decision rule is simultaneously admissible and minimax. We further extend these results from the quadratic loss setting to the predictive density estimation problem with Kullback--Leibler loss. Finally, we validate our theoretical findings numerically through simulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper shows that fixed-scale deep ReLU BNN priors induce non-minimax Bayes rules in the normal location model, but that a hyperprior on the effective output variance makes the induced rule admissible and minimax via a superharmonic square-root marginal.

The central result is that fixed-scale priors on deep fully connected ReLU BNNs produce Bayes rules that are not minimax under quadratic loss in the normal location model. The authors then introduce a hyperprior on the effective output variance so that the square root of the marginal density becomes superharmonic, which delivers both admissibility and minimaxity. They carry the same construction over to predictive density estimation under Kullback-Leibler loss and include simulation checks.

Referee Report

1 major / 2 minor

Summary. The manuscript examines decision rules induced by deep fully connected ReLU Bayesian neural networks in the normal location model under quadratic loss. It shows that Bayes rules with fixed prior scales are not minimax, then proposes a hyperprior on the effective output variance of the BNN prior that produces a superharmonic square-root marginal density, establishing simultaneous admissibility and minimaxity. The results are extended to predictive density estimation under Kullback-Leibler loss, with numerical simulations provided for validation.

Significance. If the central hyperprior construction is shown to deliver the required superharmonic property, the work would supply a valuable decision-theoretic foundation for BNNs, demonstrating how a carefully chosen hyperprior can achieve minimaxity and admissibility in a non-Gaussian, non-smooth prior setting. The extension to KL loss and the simulation results strengthen the contribution by linking theory to practice.

major comments (1)

[Derivation of superharmonicity via the hyperprior (section containing the main admissibility theorem)] The proof that the proposed hyperprior on effective output variance yields a superharmonic square-root marginal density (central to both the minimaxity and admissibility claims) must explicitly verify the Laplacian inequality in the presence of ReLU-induced non-smoothness. The induced marginal arises from a finite mixture of piecewise-linear maps of Gaussians, so second derivatives exhibit jumps across kink hyperplanes; standard integration-by-parts arguments for superharmonicity assume sufficient smoothness that is violated here. The manuscript should either derive the inequality distributionally or show that the hyperprior choice cancels the non-smooth contributions.

minor comments (2)

[Introduction and model setup] Define the effective output variance and its relation to the BNN prior parameters at the first appearance, rather than deferring the definition.
[Numerical experiments] In the simulation section, report the precise network depths, widths, activation details, and Monte Carlo sample sizes used to approximate the marginal densities and risk functions.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the insightful comments on our manuscript. The point raised about verifying the superharmonicity in the presence of ReLU non-smoothness is important, and we will strengthen the proof accordingly in the revised version.

Point-by-point responses

Referee: [Derivation of superharmonicity via the hyperprior (section containing the main admissibility theorem)] The proof that the proposed hyperprior on effective output variance yields a superharmonic square-root marginal density (central to both the minimaxity and admissibility claims) must explicitly verify the Laplacian inequality in the presence of ReLU-induced non-smoothness. The induced marginal arises from a finite mixture of piecewise-linear maps of Gaussians, so second derivatives exhibit jumps across kink hyperplanes; standard integration-by-parts arguments for superharmonicity assume sufficient smoothness that is violated here. The manuscript should either derive the inequality distributionally or show that the hyperprior choice cancels the non-smooth contributions.

Authors: We thank the referee for highlighting the technical subtlety arising from the non-smoothness of the ReLU activations. We will revise the manuscript to derive the Laplacian inequality in the distributional sense. We will explicitly compute the weak form of the Laplacian for the square-root marginal density, which is induced by the hyperprior on the effective output variance. This involves integrating by parts against smooth test functions and verifying that the contributions from the jumps across the kink hyperplanes are controlled by the choice of hyperprior, ensuring the superharmonicity inequality holds. A new lemma will be added to the section containing the main admissibility theorem to provide this verification.

Revision: yes
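
For orientation, the weak (distributional) form of the inequality mentioned in this exchange can be written out as below. This is the generic textbook definition with assumed notation ($u = \sqrt{m}$, test functions $\varphi$); it is not the lemma promised in the rebuttal, and it says nothing about whether the paper's marginal satisfies it.

```latex
% Assumed notation: u = \sqrt{m} on \mathbb{R}^p, locally integrable.
% "\Delta u \le 0 in the distributional sense" means
\int_{\mathbb{R}^p} u(x)\, \Delta \varphi(x)\, dx \;\le\; 0
\qquad \text{for all } \varphi \in C_c^\infty(\mathbb{R}^p),\ \varphi \ge 0 .

% If u is twice continuously differentiable, integrating by parts twice gives
\int_{\mathbb{R}^p} u\, \Delta \varphi \, dx = \int_{\mathbb{R}^p} \varphi\, \Delta u \, dx ,
% so the condition reduces to the pointwise inequality \Delta u \le 0.
```

The weak form only requires derivatives of the test function, which is why it remains meaningful when second derivatives of $u$ fail to exist on a lower-dimensional set.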

Circularity Check

0 steps flagged

No circularity: explicit hyperprior construction yields superharmonicity without self-referential reduction

The derivation begins by showing that fixed-scale BNN priors induce non-minimax Bayes rules in the normal location model. It then explicitly proposes a hyperprior on effective output variance chosen to ensure the square-root marginal density is superharmonic. This construction directly invokes the external decision-theoretic fact that superharmonic square-root marginals yield admissible minimax rules under quadratic loss (and extends to KL loss). No step defines the hyperprior in terms of the target property, renames a fitted quantity as a prediction, or relies on load-bearing self-citations. The ReLU non-smoothness is addressed by the construction itself rather than assumed away. The numerical simulations are validation only and do not enter the theoretical chain. The argument is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the standard normal location model and quadratic loss as background assumptions, plus the constructed hyperprior chosen specifically to induce the superharmonic property.

free parameters (1)

hyperprior on effective output variance
Introduced to produce the superharmonic square-root marginal density required for the admissibility result.

axioms (1)

domain assumption: Normal location model under quadratic loss
The setting in which the BNN-induced decision rules are analyzed.

pith-pipeline@v0.9.0 · 5438 in / 1206 out tokens · 49093 ms · 2026-05-10T19:28:59.282963+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel (J uniqueness) echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

hyperprior on the effective output variance of the BNN prior that yields a superharmonic square-root marginal density
IndisputableMonolith/Foundation/DimensionForcing.lean alexander_duality_circle_linking (D=3) unclear

Relation between the paper passage and the cited Recognition theorem.

stretched exponential upper bound … Meijer-G … depth d

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
