Direct Bethe Free Energy Minimization for Bayesian Neural Network
Pith reviewed 2026-05-13 06:27 UTC · model grok-4.3
The pith
Direct Bethe free energy minimization for BNNs produces losses strictly between MAP and ELBO, enables joint empirical Bayes in one gradient pass, and matches reference methods on UCI benchmarks at single-pass cost.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Both objectives sit strictly between MAP and the ELBO (L_MAP ≤ L_Bethe ≤ L_ELBO), removing the structural Jensen gap that no choice of variational family can close. The Z-consistent prior formulation makes the prior precision a differentiable parameter, enabling empirical Bayes in a single gradient pass.
Load-bearing premise
The paper assumes that the weight posterior can be restricted to a last-layer Gaussian, which yields analytically tractable losses, and that the factor graph is tree-structured, so the Bethe free energy is exact. If either assumption fails, the closed-form claims and the inequality may fail.
Original abstract
We propose training Bayesian neural networks by directly minimizing the Bethe free energy rather than maximizing a variational lower bound. On tree-structured factor graphs the Bethe free energy is exact; deterministic layers drop out of the objective and are trained by standard backpropagation, so the framework accommodates any mixture of probabilistic and deterministic subgraphs without modification. Restricting the weight posterior to a last-layer Gaussian yields analytically tractable losses: for a Gaussian likelihood the Bethe loss equals the exact marginal likelihood, and for a probit likelihood it reduces to a closed form via the probit-Gaussian convolution. Both objectives sit strictly between MAP and the ELBO ($L_\text{MAP} \leq L_\text{Bethe} \leq L_\text{ELBO}$), removing the structural Jensen gap that no choice of variational family can close. The Z-consistent prior formulation makes the prior precision a differentiable parameter, enabling empirical Bayes - joint optimization of weights, covariance, and hyperparameters - in a single gradient pass, with no cross-validation or outer loop. All variants admit a closed-form predictive at MAP-equivalent inference cost, in contrast to ensemble and sampling-based methods. On 8 UCI regression and 12 UCI classification benchmarks evaluated under a single shared hyperparameter regime, Bethe is competitive with standard reference methods at single-pass cost. Independently, joint single-pass empirical Bayes matches grid-search cross-validation of the prior precision on essentially all dataset-variant combinations, eliminating the outer hyperparameter loop without measurable cost. Isolated optimization gaps on a few datasets reflect numerical rather than principled limitations of the framework.
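The probit-Gaussian convolution mentioned in the abstract is presumably the standard identity ∫ Φ(a) N(a | μ, σ²) da = Φ(μ / sqrt(1 + σ²)). The following is a minimal Python sketch that checks this identity against numerical quadrature. The function names are illustrative and are not taken from the paper, and the sketch does not reproduce the paper's full probit loss.

```python
import math

def std_normal_cdf(x):
    """Standard normal CDF Phi(x), written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def normal_pdf(x, mean, std):
    """Density of N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def probit_gaussian_closed_form(mean, std):
    """E[Phi(a)] for a ~ N(mean, std^2), using the standard identity."""
    return std_normal_cdf(mean / math.sqrt(1.0 + std * std))

def probit_gaussian_quadrature(mean, std, n=20001, width=10.0):
    """Same expectation by the trapezoidal rule on mean +/- width * std."""
    lo = mean - width * std
    h = 2.0 * width * std / (n - 1)
    total = 0.0
    for k in range(n):
        a = lo + k * h
        weight = 0.5 if k in (0, n - 1) else 1.0
        total += weight * std_normal_cdf(a) * normal_pdf(a, mean, std)
    return total * h

if __name__ == "__main__":
    # Hypothetical (mean, std) pairs for the pre-activation distribution.
    for mean, std in [(0.0, 1.0), (1.5, 0.5), (-2.0, 3.0)]:
        print(mean, std,
              probit_gaussian_closed_form(mean, std),
              probit_gaussian_quadrature(mean, std))
```

In a last-layer setting, `mean` and `std` would be the predictive mean and standard deviation of the pre-activation under the Gaussian weight posterior, which is why a closed-form predictive costs no more than a single forward pass.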
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes training Bayesian neural networks by directly minimizing the Bethe free energy on tree-structured factor graphs instead of maximizing a variational lower bound. With a last-layer Gaussian weight posterior, the resulting losses are analytically tractable: the Gaussian-likelihood case recovers the exact marginal likelihood while the probit case yields a closed form via probit-Gaussian convolution. Both objectives satisfy L_MAP ≤ L_Bethe ≤ L_ELBO, and a Z-consistent prior makes the prior precision differentiable, enabling single-pass joint optimization of weights, covariance, and hyperparameters. Deterministic layers are trained by standard backpropagation, closed-form predictives are available at MAP cost, and the method is shown to be competitive with reference approaches on 8 UCI regression and 12 UCI classification benchmarks under a shared hyperparameter regime.
Significance. If the central derivations hold, the work supplies a principled interpolation between MAP and variational inference that removes the structural Jensen gap on trees, together with an efficient single-pass empirical-Bayes procedure and MAP-cost inference. These features address two persistent practical bottlenecks in Bayesian deep learning and are supported by the reported UCI results; the framework could therefore influence scalable BNN training pipelines that mix probabilistic and deterministic components.
major comments (2)
- [Abstract] Abstract: The claim that the Bethe loss equals the exact marginal likelihood for Gaussian likelihood (and yields a closed form for probit) is load-bearing for the tractability and inequality statements; this equality is stated to hold because the factor graph is tree-structured and the posterior is restricted to a last-layer Gaussian. The manuscript must supply the explicit derivation (presumably in the methods section) showing how deterministic layers drop out while preserving exactness, and must state the precise conditions under which the inequality L_MAP ≤ L_Bethe ≤ L_ELBO remains strict rather than becoming an equality by construction.
- [Abstract] Abstract: The Z-consistent prior is presented as making prior precision a differentiable parameter that enables single-pass empirical Bayes without an outer loop. Because this construction is derived inside the same last-layer Gaussian Bethe framework, the manuscript should verify (in the relevant optimization section) that joint gradient updates on the prior precision preserve the claimed positioning relative to the marginal likelihood and do not introduce bias when the tree-structure or Gaussian assumptions are only approximately satisfied.
minor comments (1)
- [Experimental section] The abstract refers to '8 UCI regression and 12 UCI classification benchmarks' evaluated under a single shared hyperparameter regime; the corresponding experimental section should list the exact hyperparameter values used for all competing methods to support reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below and will revise the manuscript to include the requested explicit derivations, clarifications on conditions, and verification of the optimization procedure.
Point-by-point responses
- Referee: [Abstract] Abstract: The claim that the Bethe loss equals the exact marginal likelihood for Gaussian likelihood (and yields a closed form for probit) is load-bearing for the tractability and inequality statements; this equality is stated to hold because the factor graph is tree-structured and the posterior is restricted to a last-layer Gaussian. The manuscript must supply the explicit derivation (presumably in the methods section) showing how deterministic layers drop out while preserving exactness, and must state the precise conditions under which the inequality L_MAP ≤ L_Bethe ≤ L_ELBO remains strict rather than becoming an equality by construction.
  Authors: We agree that an explicit derivation is essential for rigor. In the revised manuscript we will add a dedicated subsection in the Methods section that starts from the Bethe free energy on a tree-structured factor graph, shows that all messages from deterministic layers are deterministic and therefore contribute zero entropy and cancel in the interaction terms, and reduces the objective exactly to the marginal likelihood integral when the likelihood is Gaussian and the last-layer posterior is Gaussian. For the probit case we will derive the closed-form expression via the known Gaussian-probit convolution. We will also state the precise conditions for the inequality: L_Bethe equals the true marginal likelihood (hence L_MAP ≤ L_Bethe ≤ L_ELBO with equality to the marginal likelihood) precisely when the factor graph is a tree and the last-layer posterior is Gaussian; the inequalities are strict for non-linear deterministic layers whose true posterior is non-Gaussian, and become equalities only in the linear-Gaussian case or when the variational family is exact.
  Revision: yes
- Referee: [Abstract] Abstract: The Z-consistent prior is presented as making prior precision a differentiable parameter that enables single-pass empirical Bayes without an outer loop. Because this construction is derived inside the same last-layer Gaussian Bethe framework, the manuscript should verify (in the relevant optimization section) that joint gradient updates on the prior precision preserve the claimed positioning relative to the marginal likelihood and do not introduce bias when the tree-structure or Gaussian assumptions are only approximately satisfied.
  Authors: We will expand the optimization section to include a short verification: because the Z-consistent prior is obtained by reparameterizing the Bethe free energy itself, any gradient step on the prior precision remains inside the same objective and therefore preserves L_MAP ≤ L_Bethe ≤ L_ELBO by construction. When the tree or Gaussian assumptions hold only approximately, the procedure yields a consistent approximation to the marginal likelihood rather than an exact one; we will note this limitation explicitly and point to the UCI results, which show that the single-pass empirical-Bayes estimates match cross-validated values on nearly all dataset-variant pairs, indicating no measurable bias under the regimes tested.
  Revision: yes
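To make the idea of treating the prior precision as a differentiable parameter concrete, here is a minimal Python sketch of empirical Bayes by gradient ascent on the exact log marginal likelihood. It assumes a toy model with a single scalar last-layer weight, fixed features, and a known noise precision; the data, the names `alpha` and `beta`, and the step size are all hypothetical. It uses the textbook evidence for Bayesian linear regression, not the paper's Z-consistent prior construction, which this page does not spell out.

```python
import math

def posterior(phi, y, alpha, beta):
    """Posterior precision and mean of w for y_i = w * phi_i + noise,
    with prior w ~ N(0, 1/alpha) and noise ~ N(0, 1/beta)."""
    prec = alpha + beta * sum(p * p for p in phi)
    mean = beta * sum(p * t for p, t in zip(phi, y)) / prec
    return prec, mean

def log_evidence(phi, y, alpha, beta):
    """Exact log marginal likelihood log p(y | alpha, beta)."""
    n = len(y)
    prec, mean = posterior(phi, y, alpha, beta)
    fit = 0.5 * beta * sum((t - mean * p) ** 2 for p, t in zip(phi, y))
    penalty = 0.5 * alpha * mean * mean
    return (0.5 * math.log(alpha) + 0.5 * n * math.log(beta)
            - fit - penalty - 0.5 * math.log(prec)
            - 0.5 * n * math.log(2.0 * math.pi))

def grad_log_alpha(phi, y, alpha, beta):
    """Derivative of log_evidence with respect to log(alpha)."""
    prec, mean = posterior(phi, y, alpha, beta)
    return 0.5 - 0.5 * alpha * mean * mean - 0.5 * alpha / prec

def fit_prior_precision(phi, y, beta, alpha0=1.0, lr=0.5, steps=500):
    """Plain gradient ascent on log(alpha); no grid search or outer loop."""
    log_alpha = math.log(alpha0)
    for _ in range(steps):
        log_alpha += lr * grad_log_alpha(phi, y, math.exp(log_alpha), beta)
    return math.exp(log_alpha)

if __name__ == "__main__":
    # Hypothetical last-layer features and targets (true weight near 2).
    phi = [0.5, -1.0, 1.5, 2.0, -0.5]
    y = [1.1, -1.9, 3.2, 3.9, -1.2]
    alpha_hat = fit_prior_precision(phi, y, beta=4.0)
    print(alpha_hat, log_evidence(phi, y, alpha_hat, 4.0))
```

Optimizing `log(alpha)` rather than `alpha` keeps the precision positive without a constraint. In a full network the same gradient step would be taken jointly with the weight and covariance updates, which is the single-pass property the authors claim.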
Circularity Check
No circularity; derivation follows from standard Bethe free energy on trees plus explicit last-layer Gaussian restriction
Full rationale
The paper's central results (Bethe loss expressions, L_MAP ≤ L_Bethe ≤ L_ELBO inequalities, and single-pass empirical Bayes via Z-consistent prior) are obtained by substituting the last-layer Gaussian posterior into the standard Bethe free energy formula for tree-structured factor graphs. The Gaussian-likelihood case recovers the exact marginal likelihood and the probit case yields a closed form via convolution; both inequalities are direct consequences of the variational ordering and exactness on trees. No equation reduces to a fitted parameter renamed as a prediction, no uniqueness theorem is imported from self-citation, and no ansatz is smuggled via prior work. The framework is therefore self-contained against external graphical-model benchmarks once the two explicit restrictions are granted.
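The upper half of the ordering, L_Bethe ≤ L_ELBO, can be illustrated under the assumption that L_Bethe is the negative log marginal likelihood and L_ELBO is the negative ELBO. The Python sketch below uses a hypothetical scalar linear-Gaussian model: the log evidence is computed by quadrature, the ELBO in closed form for a Gaussian q, and the ELBO never exceeds the log evidence. How the paper defines L_MAP is not stated on this page, so the lower inequality is not illustrated.

```python
import math

def log_joint(w, phi, y, alpha, beta):
    """log p(y | w) + log p(w) for y_i = w * phi_i + noise."""
    n = len(y)
    sq_err = sum((t - w * p) ** 2 for p, t in zip(phi, y))
    log_lik = 0.5 * n * math.log(beta / (2.0 * math.pi)) - 0.5 * beta * sq_err
    log_prior = 0.5 * math.log(alpha / (2.0 * math.pi)) - 0.5 * alpha * w * w
    return log_lik + log_prior

def exact_posterior(phi, y, alpha, beta):
    """Mean and variance of the exact Gaussian posterior over w."""
    prec = alpha + beta * sum(p * p for p in phi)
    mean = beta * sum(p * t for p, t in zip(phi, y)) / prec
    return mean, 1.0 / prec

def log_evidence_quadrature(phi, y, alpha, beta, n_grid=20001, width=12.0):
    """log of the integral of exp(log_joint) over w (trapezoidal rule)."""
    mean, var = exact_posterior(phi, y, alpha, beta)
    std = math.sqrt(var)
    lo = mean - width * std
    h = 2.0 * width * std / (n_grid - 1)
    total = 0.0
    for k in range(n_grid):
        weight = 0.5 if k in (0, n_grid - 1) else 1.0
        total += weight * math.exp(log_joint(lo + k * h, phi, y, alpha, beta))
    return math.log(total * h)

def elbo(m, s2, phi, y, alpha, beta):
    """Closed-form ELBO for the variational posterior q(w) = N(m, s2)."""
    n = len(y)
    exp_sq_err = sum((t - m * p) ** 2 + s2 * p * p for p, t in zip(phi, y))
    exp_log_lik = (0.5 * n * math.log(beta / (2.0 * math.pi))
                   - 0.5 * beta * exp_sq_err)
    exp_log_prior = (0.5 * math.log(alpha / (2.0 * math.pi))
                     - 0.5 * alpha * (m * m + s2))
    entropy = 0.5 * math.log(2.0 * math.pi * math.e * s2)
    return exp_log_lik + exp_log_prior + entropy

if __name__ == "__main__":
    # Hypothetical data; alpha and beta are fixed for the comparison.
    phi = [0.5, -1.0, 1.5, 2.0, -0.5]
    y = [1.1, -1.9, 3.2, 3.9, -1.2]
    mean, var = exact_posterior(phi, y, 1.0, 4.0)
    print(log_evidence_quadrature(phi, y, 1.0, 4.0))
    print(elbo(mean, var, phi, y, 1.0, 4.0))        # equals the evidence
    print(elbo(mean + 0.3, var, phi, y, 1.0, 4.0))  # strictly lower
```

The difference between the log evidence and the ELBO is the KL divergence from q to the exact posterior, so in this conjugate toy model it vanishes only when q matches the exact posterior.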
Axiom & Free-Parameter Ledger
free parameters (1)
- prior precision
axioms (1)
- domain assumption: Bethe free energy equals the exact marginal free energy on tree-structured factor graphs
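The exactness assumption in the ledger can be checked on a small example. The Python sketch below builds a hypothetical binary chain x1 - f12 - x2 - f23 - x3 (a tree), computes the true marginals by brute force, and evaluates the Bethe free energy in the usual form: the sum over factors of b_a ln(b_a / f_a), minus the sum over variables of (d_i - 1) b_i ln b_i, where d_i is the number of factors touching variable i. The factor tables are made up for illustration.

```python
import itertools
import math

# Hypothetical pairwise factors on the binary chain x1 - f12 - x2 - f23 - x3.
F12 = {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 2.0, (1, 1): 1.5}
F23 = {(0, 0): 0.7, (0, 1): 1.2, (1, 0): 0.3, (1, 1): 2.5}

def joint_unnormalized(x1, x2, x3):
    """Product of the two factors for one joint configuration."""
    return F12[(x1, x2)] * F23[(x2, x3)]

def exact_log_partition():
    """ln Z by summing over all 8 joint configurations."""
    states = itertools.product((0, 1), repeat=3)
    return math.log(sum(joint_unnormalized(*x) for x in states))

def exact_marginals():
    """Brute-force factor marginals b12, b23 and variable marginal b2."""
    z = math.exp(exact_log_partition())
    b12 = {k: 0.0 for k in F12}
    b23 = {k: 0.0 for k in F23}
    b2 = {0: 0.0, 1: 0.0}
    for x1, x2, x3 in itertools.product((0, 1), repeat=3):
        p = joint_unnormalized(x1, x2, x3) / z
        b12[(x1, x2)] += p
        b23[(x2, x3)] += p
        b2[x2] += p
    return b12, b23, b2

def bethe_free_energy(b12, b23, b2):
    """Bethe free energy for the chain.

    Only x2 touches two factors (d = 2); x1 and x3 have d = 1, so their
    (d - 1) entropy corrections vanish.
    """
    factor_term = sum(b * math.log(b / F12[k]) for k, b in b12.items())
    factor_term += sum(b * math.log(b / F23[k]) for k, b in b23.items())
    variable_term = sum(b * math.log(b) for b in b2.values())
    return factor_term - variable_term

if __name__ == "__main__":
    print(bethe_free_energy(*exact_marginals()), -exact_log_partition())
```

On this tree the two printed values agree up to floating-point error, because the joint factorizes as b12 · b23 / b2. On a graph with cycles the same formula would generally give only an approximation.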
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (J uniqueness): echoes
  Echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: "Both objectives sit strictly between MAP and the ELBO (L_MAP ≤ L_Bethe ≤ L_ELBO), removing the structural Jensen gap... Z-consistent prior formulation makes the prior precision a differentiable parameter"
- IndisputableMonolith/Foundation/AlexanderDuality.lean, alexander_duality_circle_linking (D = 3): unclear
  Unclear: relation between the paper passage and the cited Recognition theorem.
  Passage: "On tree-structured factor graphs the Bethe free energy is exact"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.