Minimaxity and Admissibility of Bayesian Neural Networks
Pith reviewed 2026-05-10 19:28 UTC · model grok-4.3
The pith
A hyperprior on effective output variance makes deep ReLU BNN decision rules admissible and minimax in the normal location model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the normal location model, deep ReLU Bayesian neural networks with fixed prior scales induce Bayes rules that are not minimax under quadratic loss. Introducing a hyperprior on the effective output variance that yields a superharmonic square-root marginal density produces a decision rule that is both admissible and minimax. The construction extends directly to the problem of estimating the predictive density under Kullback-Leibler loss.
What carries the argument
A hyperprior placed on the effective output variance of the BNN prior, chosen so the square root of the resulting marginal density is superharmonic.
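The link between a superharmonic square-root marginal and minimaxity can be sketched with Stein's unbiased risk identity. This is the standard argument from the shrinkage literature, not a reproduction of the paper's proof; the notation ($p$ for the dimension, $m$ for the marginal density, $\delta_\pi$ for the Bayes rule under prior $\pi$) is assumed here, and the regularity conditions needed for Stein's identity are omitted.

```latex
% Normal location model: X ~ N_p(theta, I_p), quadratic loss ||delta - theta||^2.
% m is the marginal density of X under the prior pi (assumed notation).
\begin{align}
  m(x) &= \int_{\mathbb{R}^p} (2\pi)^{-p/2}\, e^{-\|x-\theta\|^2/2}\,\pi(d\theta), \\
  \delta_\pi(x) &= x + \nabla \log m(x), \\
  R(\theta,\delta_\pi) &= p + \mathbb{E}_\theta\!\left[\, 4\,\frac{\Delta \sqrt{m(X)}}{\sqrt{m(X)}} \,\right].
\end{align}
```

Because the estimator $\delta_0(X) = X$ has constant risk $p$ and is minimax, $\Delta\sqrt{m} \le 0$ everywhere gives $R(\theta,\delta_\pi) \le p$ for all $\theta$, so $\delta_\pi$ is minimax as well. Admissibility does not follow from this identity and needs a separate argument.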
If this is right
- The BNN decision rule becomes minimax under quadratic loss.
- The same rule is admissible.
- The optimality result carries over to estimating the predictive density under Kullback-Leibler loss.
- Numerical simulations are reported to be consistent with the theoretical minimaxity and admissibility results.
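The kind of risk comparison behind these consequences can be illustrated with a small Monte Carlo experiment. The sketch below is not the paper's simulation and does not involve a BNN: it uses a fixed-scale Gaussian prior as a stand-in for a fixed-scale rule and the classical James-Stein estimator as a stand-in for a minimax rule. The function names, the dimension `p = 10`, the prior variance `tau2`, and the radii are all assumptions made for illustration.

```python
import numpy as np

def mc_risk(estimator, theta, n_rep=20000, seed=0):
    """Monte Carlo estimate of E||estimator(X) - theta||^2 for X ~ N(theta, I)."""
    rng = np.random.default_rng(seed)
    x = theta + rng.standard_normal((n_rep, theta.size))
    return float(np.mean(np.sum((estimator(x) - theta) ** 2, axis=1)))

def mle(x):
    # The observation itself: constant risk p, minimax.
    return x

def fixed_scale_bayes(x, tau2=4.0):
    # Posterior mean under theta ~ N(0, tau2 * I): linear shrinkage (stand-in only).
    return (tau2 / (1.0 + tau2)) * x

def james_stein(x):
    # Classical James-Stein estimator, minimax for p >= 3 (stand-in only).
    p = x.shape[1]
    norm2 = np.sum(x ** 2, axis=1, keepdims=True)
    return (1.0 - (p - 2) / norm2) * x

p = 10
risks = {}
for radius in (0.0, 3.0, 30.0):
    theta = np.full(p, radius / np.sqrt(p))  # vector with ||theta|| = radius
    risks[radius] = {
        "mle": mc_risk(mle, theta),
        "fixed_scale": mc_risk(fixed_scale_bayes, theta),
        "james_stein": mc_risk(james_stein, theta),
    }
    print(radius, risks[radius])
```

For the Gaussian stand-in the risk grows quadratically in the norm of theta and eventually exceeds the minimax value p, which is the same failure mode the paper attributes to fixed prior scales.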
Where Pith is reading between the lines
- The same hyperprior technique could be tested on BNNs with other activation functions or in non-location models to check whether minimaxity follows from analogous marginal-density conditions.
- If the superharmonic property can be preserved under approximate inference methods such as variational Bayes, then practical implementations might inherit the optimality guarantees.
- The result suggests that prior-scale tuning in deep networks can be reframed as a problem of engineering the marginal density to satisfy classical decision-theoretic criteria.
Load-bearing premise
A hyperprior on effective output variance can be chosen that produces a superharmonic square-root marginal density for the deep ReLU BNN prior in the normal location model.
What would settle it
Direct computation or high-precision numerical integration showing that the square root of the marginal density under the proposed hyperprior fails to be superharmonic would disprove the simultaneous admissibility and minimaxity claim.
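A numerical check of this kind can be sketched with a finite-difference Laplacian. The example below does not use the paper's marginal density, which is not given in this summary; it applies the check to two reference functions whose Laplacians are known in closed form. The helper names, the step size `h`, and the choices `p = 5` and `s = 5.0` are assumptions for illustration.

```python
import numpy as np

def laplacian_fd(f, x, h=1e-3):
    """Central finite-difference Laplacian of a scalar function f at the point x."""
    x = np.asarray(x, dtype=float)
    total = 0.0
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        total += (f(x + e) - 2.0 * f(x) + f(x - e)) / h**2
    return total

def relative_laplacian(f, x):
    # Laplacian(f) / f: the sign decides superharmonicity, the scale is normalized.
    x = np.asarray(x, dtype=float)
    return laplacian_fd(f, x) / f(x)

def sqrt_gaussian_marginal(x, s=5.0):
    # Square root of the N(0, s*I) density; s = 1 + tau^2 for a N(0, tau^2 I) prior.
    p = x.size
    log_m = -0.5 * p * np.log(2.0 * np.pi * s) - np.dot(x, x) / (2.0 * s)
    return np.exp(0.5 * log_m)

def sqrt_power_reference(x):
    # Square root of ||x||^{-(p-2)}: a reference function that is superharmonic
    # away from the origin for p >= 3 (not the paper's marginal).
    p = x.size
    return np.dot(x, x) ** (-(p - 2) / 4.0)

origin = np.zeros(5)
far = np.array([10.0, 0.0, 0.0, 0.0, 0.0])
ones = np.ones(5)

print(relative_laplacian(sqrt_gaussian_marginal, origin))  # negative near the origin
print(relative_laplacian(sqrt_gaussian_marginal, far))     # positive far from the origin
print(relative_laplacian(sqrt_power_reference, ones))      # negative
```

The Gaussian case changes sign once ||x||^2 exceeds 2ps, so its square root is not superharmonic. A check against the paper's claim would substitute the actual marginal under the proposed hyperprior and look for any point with a positive value.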
Original abstract
Bayesian neural networks (BNNs) offer a natural probabilistic formulation for inference in deep learning models. Despite their popularity, their optimality has received limited attention through the lens of statistical decision theory. In this paper, we study decision rules induced by deep, fully connected feedforward ReLU BNNs in the normal location model under quadratic loss. We show that, for fixed prior scales, the induced Bayes decision rule is not minimax. We then propose a hyperprior on the effective output variance of the BNN prior that yields a superharmonic square-root marginal density, establishing that the resulting decision rule is simultaneously admissible and minimax. We further extend these results from the quadratic loss setting to the predictive density estimation problem with Kullback--Leibler loss. Finally, we validate our theoretical findings numerically through simulation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript examines decision rules induced by deep fully connected ReLU Bayesian neural networks in the normal location model under quadratic loss. It shows that Bayes rules with fixed prior scales are not minimax, then proposes a hyperprior on the effective output variance of the BNN prior that produces a superharmonic square-root marginal density, establishing simultaneous admissibility and minimaxity. The results are extended to predictive density estimation under Kullback-Leibler loss, with numerical simulations provided for validation.
Significance. If the central hyperprior construction is shown to deliver the required superharmonic property, the work would supply a valuable decision-theoretic foundation for BNNs, demonstrating how a carefully chosen hyperprior can achieve minimaxity and admissibility in a non-Gaussian, non-smooth prior setting. The extension to KL loss and the simulation results strengthen the contribution by linking theory to practice.
major comments (1)
- [Derivation of superharmonicity via the hyperprior (section containing the main admissibility theorem)] The proof that the proposed hyperprior on effective output variance yields a superharmonic square-root marginal density (central to both the minimaxity and admissibility claims) must explicitly verify the Laplacian inequality in the presence of ReLU-induced non-smoothness. The induced marginal arises from a finite mixture of piecewise-linear maps of Gaussians, so second derivatives exhibit jumps across kink hyperplanes; standard integration-by-parts arguments for superharmonicity assume sufficient smoothness that is violated here. The manuscript should either derive the inequality distributionally or show that the hyperprior choice cancels the non-smooth contributions.
minor comments (2)
- [Introduction and model setup] Define the effective output variance and its relation to the BNN prior parameters at the first appearance, rather than deferring the definition.
- [Numerical experiments] In the simulation section, report the precise network depths, widths, activation details, and Monte Carlo sample sizes used to approximate the marginal densities and risk functions.
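The details requested in the minor comments (depth, widths, activation, Monte Carlo sample size) are the kind that a prior-sampling script makes explicit. The sketch below draws scalar outputs from a bias-free, fully connected ReLU network with independent Gaussian weights and estimates the output variance by Monte Carlo. The parameterization (weight variance divided by fan-in), the widths, the input `x0`, and the function name are hypothetical; the paper's definition of effective output variance is not given in this summary and may differ.

```python
import numpy as np

def sample_relu_bnn_outputs(x0, widths, weight_var=2.0, out_var=1.0,
                            n_samples=8000, seed=0):
    """Draw scalar prior outputs f(x0) from a bias-free ReLU network.

    Hidden weights are independent N(0, weight_var / fan_in) and output
    weights are independent N(0, out_var / fan_in) (assumed parameterization).
    """
    rng = np.random.default_rng(seed)
    h = np.broadcast_to(x0, (n_samples, x0.size))
    for width in widths:
        fan_in = h.shape[1]
        w = rng.standard_normal((n_samples, fan_in, width)) * np.sqrt(weight_var / fan_in)
        h = np.maximum(np.einsum("ni,nij->nj", h, w), 0.0)  # one ReLU layer per draw
    fan_in = h.shape[1]
    w_out = rng.standard_normal((n_samples, fan_in)) * np.sqrt(out_var / fan_in)
    return np.sum(h * w_out, axis=1)  # linear read-out

x0 = np.ones(4)
widths = [16, 16, 16]  # depth 3, width 16 (illustrative values)
outputs = sample_relu_bnn_outputs(x0, widths)
empirical_var = float(np.var(outputs))
# Under this parameterization the output variance is
# out_var * (weight_var / 2)^depth * ||x0||^2 / dim(x0).
theoretical_var = 1.0 * (2.0 / 2.0) ** len(widths) * float(x0 @ x0) / x0.size
print(empirical_var, theoretical_var)
```

Reporting the equivalents of `widths`, `weight_var`, `out_var`, and `n_samples` alongside the risk plots would address the referee's request.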
Simulated Author's Rebuttal
We thank the referee for the insightful comments on our manuscript. The point raised about verifying the superharmonicity in the presence of ReLU non-smoothness is important, and we will strengthen the proof accordingly in the revised version.
-
Referee: [Derivation of superharmonicity via the hyperprior (section containing the main admissibility theorem)] The proof that the proposed hyperprior on effective output variance yields a superharmonic square-root marginal density (central to both the minimaxity and admissibility claims) must explicitly verify the Laplacian inequality in the presence of ReLU-induced non-smoothness. The induced marginal arises from a finite mixture of piecewise-linear maps of Gaussians, so second derivatives exhibit jumps across kink hyperplanes; standard integration-by-parts arguments for superharmonicity assume sufficient smoothness that is violated here. The manuscript should either derive the inequality distributionally or show that the hyperprior choice cancels the non-smooth contributions.
Authors: We thank the referee for highlighting the technical subtlety arising from the non-smoothness of the ReLU activations. We will revise the manuscript to derive the Laplacian inequality in the distributional sense. We will explicitly compute the weak form of the Laplacian for the square-root marginal density, which is induced by the hyperprior on the effective output variance. This involves integrating by parts against smooth test functions and verifying that the contributions from the jumps across the kink hyperplanes are controlled by the choice of hyperprior, ensuring the superharmonicity inequality holds. A new lemma will be added to the section containing the main admissibility theorem to provide this verification.
Revision: yes
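For reference, the distributional inequality the authors commit to can be written out as follows. This is the textbook weak formulation, with notation ($u$, $\varphi$, $\Omega_i$, $\Gamma$, $n$) assumed here rather than taken from the manuscript. The second identity assumes $u$ is continuous and $C^2$ up to the boundary on each side of a single smooth interface.

```latex
% u = \sqrt{m}; \varphi ranges over nonnegative smooth test functions with compact support.
\begin{equation}
  \Delta u \le 0 \ \text{in } \mathcal{D}'(\mathbb{R}^p)
  \quad\Longleftrightarrow\quad
  \int_{\mathbb{R}^p} u(x)\,\Delta\varphi(x)\,dx \le 0
  \quad \text{for all } \varphi \in C_c^\infty(\mathbb{R}^p),\ \varphi \ge 0 .
\end{equation}
% Interface \Gamma = \partial\Omega_1 \cap \partial\Omega_2, unit normal n pointing from \Omega_1 into \Omega_2.
\begin{equation}
  \int_{\mathbb{R}^p} u\,\Delta\varphi\,dx
  = \sum_{i} \int_{\Omega_i} \varphi\,\Delta u\,dx
  + \int_{\Gamma} \varphi\,\bigl(\partial_n u|_{\Omega_2} - \partial_n u|_{\Omega_1}\bigr)\,dS .
\end{equation}
```

Under these assumptions the inequality holds when the classical Laplacian is nonpositive on each region and the normal derivative does not jump upward across the interface. Whether the paper's marginal has such kinks, and how the hyperprior controls them, is the point the promised lemma would need to settle.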
Circularity Check
No circularity: explicit hyperprior construction yields superharmonicity without self-referential reduction
The derivation begins by showing that fixed-scale BNN priors induce non-minimax Bayes rules in the normal location model. It then explicitly proposes a hyperprior on effective output variance chosen to ensure the square-root marginal density is superharmonic. This construction directly invokes the external decision-theoretic fact that superharmonic square-root marginals yield admissible minimax rules under quadratic loss (and extends to KL loss). No step defines the hyperprior in terms of the target property, renames a fitted quantity as a prediction, or relies on load-bearing self-citations. The ReLU non-smoothness is addressed by the construction itself rather than assumed away. The numerical simulations are validation only and do not enter the theoretical chain. The argument is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- hyperprior on effective output variance
axioms (1)
- domain assumption: normal location model under quadratic loss
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (J uniqueness). Tag: echoes.
  Echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: "hyperprior on the effective output variance of the BNN prior that yields a superharmonic square-root marginal density"
- IndisputableMonolith/Foundation/DimensionForcing.lean: alexander_duality_circle_linking (D=3). Tag: unclear.
  Unclear: relation between the paper passage and the cited Recognition theorem.
  Passage: "stretched exponential upper bound … Meijer-G … depth d"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.