Direct Bethe Free Energy Minimization for Bayesian Neural Network

Pavel Prochazka

arxiv: 2605.08446 · v2 · submitted 2026-05-08 · 💻 cs.LG


Pith reviewed 2026-05-13 06:27 UTC · model grok-4.3

classification 💻 cs.LG

keywords: bethe, cost, energy, free, likelihood, prior, text, bayes


The pith

Direct Bethe free energy minimization for BNNs produces losses strictly between MAP and ELBO, enables joint empirical Bayes in one gradient pass, and matches reference methods on UCI benchmarks at single-pass cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Bayesian neural networks treat network weights as random variables to capture prediction uncertainty. Standard training uses variational inference to maximize a lower bound called the ELBO on the marginal likelihood. This work instead minimizes the Bethe free energy, a quantity from graphical models that equals the true free energy on tree-structured graphs. Deterministic layers fall out of the objective and train via ordinary backpropagation, so the method mixes probabilistic and deterministic parts without special handling. By restricting the posterior over weights to a Gaussian only on the final layer, the loss becomes analytically tractable. For Gaussian likelihoods it recovers the exact marginal likelihood; for probit likelihoods it reduces to a closed-form convolution. The resulting Bethe objective always lies between the MAP loss and the ELBO, closing a gap no variational family can remove. A Z-consistent prior formulation makes the prior precision a differentiable parameter, so weights, covariance, and hyperparameters optimize together in one gradient step with no outer cross-validation loop. Predictive distributions are also available in closed form at MAP-equivalent cost. Experiments on eight UCI regression and twelve UCI classification tasks under a shared hyperparameter regime show performance competitive with standard methods.

Core claim

Both objectives sit strictly between MAP and the ELBO (L_MAP ≤ L_Bethe ≤ L_ELBO), removing the structural Jensen gap that no choice of variational family can close. The Z-consistent prior formulation makes the prior precision a differentiable parameter, enabling empirical Bayes in a single gradient pass.
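For orientation, the ELBO side of this ordering can be read off the textbook decomposition of the evidence. The notation below ($\mathcal{D}$ for data, $w$ for weights, $q$ for the variational posterior) is an assumption made for this sketch; the paper's own definitions of $L_\text{MAP}$ and $L_\text{Bethe}$ are not reproduced on this page, so only the standard identity is shown.

```latex
\begin{aligned}
-\log p(\mathcal{D})
  &= \underbrace{\mathbb{E}_{q(w)}\!\left[-\log p(\mathcal{D}\mid w)\right]
     + \mathrm{KL}\!\left(q(w)\,\|\,p(w)\right)}_{L_\text{ELBO}(q)}
     \;-\; \mathrm{KL}\!\left(q(w)\,\|\,p(w\mid\mathcal{D})\right) \\
  &\le L_\text{ELBO}(q).
\end{aligned}
```

Whenever a loss equals the exact negative log marginal likelihood, as the abstract states for the Gaussian-likelihood case, this identity places it at or below the negative ELBO of any $q$.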

Load-bearing premise

The argument assumes that restricting the weight posterior to a last-layer Gaussian yields analytically tractable losses, and that the factor graph is tree-structured so that the Bethe free energy is exact. If either assumption fails, the closed-form claims and the inequality may fail.

Figures

Figures reproduced from arXiv: 2605.08446 by Pavel Prochazka.

**Figure 2.** Factor graph for Direct Bethe Optimisation. Circles: variable nodes; filled squares: … (caption truncated in source)

Original abstract

We propose training Bayesian neural networks by directly minimizing the Bethe free energy rather than maximizing a variational lower bound. On tree-structured factor graphs the Bethe free energy is exact; deterministic layers drop out of the objective and are trained by standard backpropagation, so the framework accommodates any mixture of probabilistic and deterministic subgraphs without modification. Restricting the weight posterior to a last-layer Gaussian yields analytically tractable losses: for a Gaussian likelihood the Bethe loss equals the exact marginal likelihood, and for a probit likelihood it reduces to a closed form via the probit-Gaussian convolution. Both objectives sit strictly between MAP and the ELBO ($L_\text{MAP} \leq L_\text{Bethe} \leq L_\text{ELBO}$), removing the structural Jensen gap that no choice of variational family can close. The Z-consistent prior formulation makes the prior precision a differentiable parameter, enabling empirical Bayes - joint optimization of weights, covariance, and hyperparameters - in a single gradient pass, with no cross-validation or outer loop. All variants admit a closed-form predictive at MAP-equivalent inference cost, in contrast to ensemble and sampling-based methods. On 8 UCI regression and 12 UCI classification benchmarks evaluated under a single shared hyperparameter regime, Bethe is competitive with standard reference methods at single-pass cost. Independently, joint single-pass empirical Bayes matches grid-search cross-validation of the prior precision on essentially all dataset-variant combinations, eliminating the outer hyperparameter loop without measurable cost. Isolated optimization gaps on a few datasets reflect numerical rather than principled limitations of the framework.
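The Gaussian-likelihood claim can be illustrated on a toy problem. The sketch below is not the paper's code: it assumes a single scalar last-layer weight `w` with prior N(0, 1/alpha), fixed features `phi` standing in for the deterministic layers, and noise variance `sigma2`. All names and numbers are hypothetical. It evaluates the exact log marginal likelihood in closed form and compares it with the ELBO of a Gaussian `q`.

```python
import math

def posterior(phi, y, alpha, sigma2):
    """Exact Gaussian posterior N(m, v) over the scalar weight w."""
    lam = alpha + sum(p * p for p in phi) / sigma2      # posterior precision
    b = sum(p * t for p, t in zip(phi, y)) / sigma2
    return b / lam, 1.0 / lam

def log_evidence(phi, y, alpha, sigma2):
    """Closed-form log p(y) for y_i ~ N(w * phi_i, sigma2), w ~ N(0, 1/alpha)."""
    n = len(y)
    m, v = posterior(phi, y, alpha, sigma2)
    return (-0.5 * n * math.log(2.0 * math.pi * sigma2)
            + 0.5 * math.log(alpha * v)
            - 0.5 * sum(t * t for t in y) / sigma2
            + 0.5 * m * m / v)

def elbo(phi, y, alpha, sigma2, mu, s2):
    """ELBO for q(w) = N(mu, s2): expected log-likelihood minus KL(q || prior)."""
    n = len(y)
    sq = sum((t - mu * p) ** 2 for p, t in zip(phi, y))
    feat = sum(p * p for p in phi)
    exp_loglik = (-0.5 * n * math.log(2.0 * math.pi * sigma2)
                  - 0.5 * (sq + s2 * feat) / sigma2)
    kl = 0.5 * (alpha * s2 + alpha * mu * mu - 1.0 - math.log(alpha * s2))
    return exp_loglik - kl

# Hypothetical toy data: four feature values and targets.
phi = [0.5, -1.2, 2.0, 0.3]
y = [0.4, -1.0, 2.1, 0.0]
alpha, sigma2 = 2.0, 0.25

m, v = posterior(phi, y, alpha, sigma2)
print("log evidence          :", log_evidence(phi, y, alpha, sigma2))
print("ELBO at exact posterior:", elbo(phi, y, alpha, sigma2, m, v))
print("ELBO at a perturbed q  :", elbo(phi, y, alpha, sigma2, m + 0.3, 2.0 * v))
```

In this conjugate toy the ELBO touches the evidence only when `q` is the exact posterior; any other Gaussian `q` falls strictly below it.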

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This paper gives a direct Bethe free energy route for BNNs that lands between MAP and ELBO under a last-layer Gaussian restriction and folds prior tuning into one gradient pass.


The core move is minimizing the Bethe free energy directly for Bayesian neural nets instead of pushing an ELBO. With the posterior restricted to a Gaussian on the last layer, the resulting losses sit strictly between MAP and the ELBO, and the Z-consistent prior turns the precision into a differentiable parameter so empirical Bayes happens in the same optimization without an outer loop or grid search. On the UCI regression and classification benchmarks the method stays competitive with standard references at single-pass cost, and the joint hyperparameter fit matches cross-validation on almost every dataset-variant pair. Deterministic layers simply fall out of the objective and train by ordinary backprop, which keeps the framework flexible for mixed graphs. For Gaussian likelihoods the loss recovers the exact marginal likelihood; the probit case stays closed form through the convolution. Those are the concrete advances worth noting. The derivations appear internally consistent once the last-layer Gaussian and tree-structured assumptions are granted. The closed-form predictive at MAP-level cost is a clear practical gain over sampling or ensembles.

The soft spots are exactly where the stress-test note flags them. The sandwich inequality and analytic tractability both require the factor graph to be a tree and the posterior to be Gaussian only on the final layer. Introduce probabilistic layers deeper in the net and cycles appear, the Bethe quantity stops being exact, and the positioning relative to MAP and ELBO would need new approximations. The abstract already notes a few numerical gaps on isolated datasets, which suggests stability or initialization sensitivity could surface in wider testing. I would want the full proofs and more architecture variants before treating the “no structural Jensen gap” claim as general.

This is for people working on uncertainty quantification in deep models who want something faster than full variational inference or MCMC but more grounded than plain MAP. A reader already comfortable with graphical models and variational methods will see the value in the explicit MAP-ELBO placement and the single-pass empirical Bayes device. It deserves a serious referee. The ideas are distinct enough and the UCI results are clean enough under a shared hyperparameter regime that the claims should be checked in detail rather than desk-rejected.
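The probit-Gaussian convolution mentioned in the letter is a standard identity: for a logit a ~ N(mu, s2), the expectation of the standard normal CDF Φ(a) equals Φ(mu / sqrt(1 + s2)). The sketch below checks that identity against brute-force quadrature. The function names are hypothetical, and how the paper maps its last-layer posterior onto `mu` and `s2` is not shown on this page.

```python
import math

def std_normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def probit_gaussian_closed_form(mu, s2):
    """E[Phi(a)] for a ~ N(mu, s2), using the probit-Gaussian convolution."""
    return std_normal_cdf(mu / math.sqrt(1.0 + s2))

def probit_gaussian_quadrature(mu, s2, n=4001, width=10.0):
    """Same expectation by trapezoidal integration over mu +/- width * s."""
    s = math.sqrt(s2)
    lo = mu - width * s
    h = 2.0 * width * s / (n - 1)
    total = 0.0
    for i in range(n):
        a = lo + i * h
        pdf = math.exp(-0.5 * ((a - mu) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
        weight = 0.5 if i in (0, n - 1) else 1.0   # trapezoid end-point weights
        total += weight * std_normal_cdf(a) * pdf
    return total * h

# Hypothetical predictive logit: mean 0.7, variance 2.0.
print(probit_gaussian_closed_form(0.7, 2.0))
print(probit_gaussian_quadrature(0.7, 2.0))
```

The closed form costs one CDF evaluation per prediction, which is the sense in which a predictive of this kind can run at MAP-equivalent cost rather than requiring samples.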

Referee Report

2 major / 1 minor

Summary. The paper proposes training Bayesian neural networks by directly minimizing the Bethe free energy on tree-structured factor graphs instead of maximizing a variational lower bound. With a last-layer Gaussian weight posterior, the resulting losses are analytically tractable: the Gaussian-likelihood case recovers the exact marginal likelihood while the probit case yields a closed form via probit-Gaussian convolution. Both objectives satisfy L_MAP ≤ L_Bethe ≤ L_ELBO, and a Z-consistent prior makes the prior precision differentiable, enabling single-pass joint optimization of weights, covariance, and hyperparameters. Deterministic layers are trained by standard backpropagation, closed-form predictives are available at MAP cost, and the method is shown to be competitive with reference approaches on 8 UCI regression and 12 UCI classification benchmarks under a shared hyperparameter regime.

Significance. If the central derivations hold, the work supplies a principled interpolation between MAP and variational inference that removes the structural Jensen gap on trees, together with an efficient single-pass empirical-Bayes procedure and MAP-cost inference. These features address two persistent practical bottlenecks in Bayesian deep learning and are supported by the reported UCI results; the framework could therefore influence scalable BNN training pipelines that mix probabilistic and deterministic components.

major comments (2)

[Abstract] The claim that the Bethe loss equals the exact marginal likelihood for Gaussian likelihood (and yields a closed form for probit) is load-bearing for the tractability and inequality statements; this equality is stated to hold because the factor graph is tree-structured and the posterior is restricted to a last-layer Gaussian. The manuscript must supply the explicit derivation (presumably in the methods section) showing how deterministic layers drop out while preserving exactness, and must state the precise conditions under which the inequality L_MAP ≤ L_Bethe ≤ L_ELBO remains strict rather than becoming an equality by construction.
[Abstract] The Z-consistent prior is presented as making prior precision a differentiable parameter that enables single-pass empirical Bayes without an outer loop. Because this construction is derived inside the same last-layer Gaussian Bethe framework, the manuscript should verify (in the relevant optimization section) that joint gradient updates on the prior precision preserve the claimed positioning relative to the marginal likelihood and do not introduce bias when the tree-structure or Gaussian assumptions are only approximately satisfied.
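The second comment concerns gradient updates on the prior precision. As a rough illustration of what such a check could look like, the sketch below fits the prior precision `alpha` by gradient ascent on the exact log evidence of a scalar-weight Gaussian toy model and compares the result with a grid search over the same evidence. This is plain type-II maximum likelihood under assumed names and data; it is not the paper's Z-consistent formulation, and the grid search is a stand-in for an outer loop, not cross-validation.

```python
import math

def stats(phi, y, sigma2):
    """Sufficient statistics of the toy model y_i ~ N(w * phi_i, sigma2)."""
    s = sum(p * p for p in phi) / sigma2
    b = sum(p * t for p, t in zip(phi, y)) / sigma2
    yy = sum(t * t for t in y) / sigma2
    return s, b, yy, len(y)

def log_evidence(alpha, phi, y, sigma2):
    """Exact log p(y | alpha) with prior w ~ N(0, 1/alpha)."""
    s, b, yy, n = stats(phi, y, sigma2)
    lam = alpha + s                                  # posterior precision
    return (-0.5 * n * math.log(2.0 * math.pi * sigma2)
            + 0.5 * math.log(alpha / lam) - 0.5 * yy + 0.5 * b * b / lam)

def fit_alpha_by_gradient(phi, y, sigma2, rho=0.0, lr=0.5, steps=500):
    """Gradient ascent on the evidence in rho = log(alpha), keeping alpha > 0."""
    s, b, _, _ = stats(phi, y, sigma2)
    for _ in range(steps):
        alpha = math.exp(rho)
        lam = alpha + s
        m = b / lam                                  # posterior mean
        grad = 0.5 - 0.5 * alpha / lam - 0.5 * alpha * m * m
        rho += lr * grad
    return math.exp(rho)

def fit_alpha_by_grid(phi, y, sigma2):
    """Outer-loop baseline: best alpha on a log-spaced grid from 0.01 to 100."""
    grid = [10.0 ** (k / 50.0) for k in range(-100, 101)]
    return max(grid, key=lambda a: log_evidence(a, phi, y, sigma2))

# Hypothetical toy data.
phi = [0.5, -1.2, 2.0, 0.3]
y = [0.4, -1.0, 2.1, 0.0]
sigma2 = 0.25
print("gradient:", fit_alpha_by_gradient(phi, y, sigma2))
print("grid    :", fit_alpha_by_grid(phi, y, sigma2))
```

Optimizing `log(alpha)` rather than `alpha` is a design choice made here to keep the precision positive without constraints. In this toy the stationary point satisfies alpha = 1 / (m² + v), with m and v the posterior mean and variance, which gives an independent check on the optimizer.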

minor comments (1)

[Experimental section] The abstract refers to '8 UCI regression and 12 UCI classification benchmarks' evaluated under a single shared hyperparameter regime; the corresponding experimental section should list the exact hyperparameter values used for all competing methods to support reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and will revise the manuscript to include the requested explicit derivations, clarifications on conditions, and verification of the optimization procedure.


Referee: [Abstract] The claim that the Bethe loss equals the exact marginal likelihood for Gaussian likelihood (and yields a closed form for probit) is load-bearing for the tractability and inequality statements; this equality is stated to hold because the factor graph is tree-structured and the posterior is restricted to a last-layer Gaussian. The manuscript must supply the explicit derivation (presumably in the methods section) showing how deterministic layers drop out while preserving exactness, and must state the precise conditions under which the inequality L_MAP ≤ L_Bethe ≤ L_ELBO remains strict rather than becoming an equality by construction.

Authors: We agree that an explicit derivation is essential for rigor. In the revised manuscript we will add a dedicated subsection in the Methods section that starts from the Bethe free energy on a tree-structured factor graph, shows that all messages from deterministic layers are deterministic and therefore contribute zero entropy and cancel in the interaction terms, and reduces the objective exactly to the marginal likelihood integral when the likelihood is Gaussian and the last-layer posterior is Gaussian. For the probit case we will derive the closed-form expression via the known Gaussian-probit convolution. We will also state the precise conditions for the inequality: L_Bethe equals the true marginal likelihood (hence L_MAP ≤ L_Bethe ≤ L_ELBO with equality to the marginal likelihood) precisely when the factor graph is a tree and the last-layer posterior is Gaussian; the inequalities are strict for non-linear deterministic layers whose true posterior is non-Gaussian, and become equalities only in the linear-Gaussian case or when the variational family is exact. revision: yes
Referee: [Abstract] The Z-consistent prior is presented as making prior precision a differentiable parameter that enables single-pass empirical Bayes without an outer loop. Because this construction is derived inside the same last-layer Gaussian Bethe framework, the manuscript should verify (in the relevant optimization section) that joint gradient updates on the prior precision preserve the claimed positioning relative to the marginal likelihood and do not introduce bias when the tree-structure or Gaussian assumptions are only approximately satisfied.

Authors: We will expand the optimization section to include a short verification: because the Z-consistent prior is obtained by reparameterizing the Bethe free energy itself, any gradient step on the prior precision remains inside the same objective and therefore preserves L_MAP ≤ L_Bethe ≤ L_ELBO by construction. When the tree or Gaussian assumptions hold only approximately, the procedure yields a consistent approximation to the marginal likelihood rather than an exact one; we will note this limitation explicitly and point to the UCI results, which show that the single-pass empirical-Bayes estimates match cross-validated values on nearly all dataset-variant pairs, indicating no measurable bias under the regimes tested. revision: yes

Circularity Check

0 steps flagged

No circularity; derivation follows from standard Bethe free energy on trees plus explicit last-layer Gaussian restriction


The paper's central results (Bethe loss expressions, L_MAP ≤ L_Bethe ≤ L_ELBO inequalities, and single-pass empirical Bayes via Z-consistent prior) are obtained by substituting the last-layer Gaussian posterior into the standard Bethe free energy formula for tree-structured factor graphs. The Gaussian-likelihood case recovers the exact marginal likelihood and the probit case yields a closed form via convolution; both inequalities are direct consequences of the variational ordering and exactness on trees. No equation reduces to a fitted parameter renamed as a prediction, no uniqueness theorem is imported from self-citation, and no ansatz is smuggled via prior work. The framework is therefore self-contained against external graphical-model benchmarks once the two explicit restrictions are granted.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework rests on the exactness of Bethe free energy on trees and the analytic tractability under last-layer Gaussian posteriors; no new entities are postulated.

free parameters (1)

prior precision
Made differentiable via Z-consistent formulation for joint optimization with weights and covariance.

axioms (1)

domain assumption: Bethe free energy equals the exact marginal free energy on tree-structured factor graphs
Invoked to justify exactness for deterministic layers and the overall objective.
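The axiom can be checked numerically on a small discrete tree. The sketch below builds a three-variable chain x1 - x2 - x3 with two pairwise factors, computes the exact marginals by enumeration, and evaluates the Bethe free energy at those marginals. The factor tables are arbitrary hypothetical values, and the example says nothing about the continuous factor graph used in the paper.

```python
import itertools
import math

# Hypothetical positive factor tables on a chain x1 - x2 - x3 (binary variables).
psi12 = {(0, 0): 2.0, (0, 1): 0.5, (1, 0): 1.0, (1, 1): 3.0}
psi23 = {(0, 0): 1.5, (0, 1): 0.7, (1, 0): 0.2, (1, 1): 2.5}

states = list(itertools.product((0, 1), repeat=3))
weights = {x: psi12[x[0], x[1]] * psi23[x[1], x[2]] for x in states}
Z = sum(weights.values())                       # exact partition function
p = {x: w / Z for x, w in weights.items()}      # exact joint distribution

# Exact beliefs: factor marginals and the marginal of the shared variable x2.
b12 = {(a, b): sum(p[a, b, c] for c in (0, 1)) for a in (0, 1) for b in (0, 1)}
b23 = {(b, c): sum(p[a, b, c] for a in (0, 1)) for b in (0, 1) for c in (0, 1)}
b2 = {b: sum(p[a, b, c] for a in (0, 1) for c in (0, 1)) for b in (0, 1)}

def bethe_free_energy():
    """Factor terms minus (degree - 1) * variable-entropy corrections."""
    f = sum(q * math.log(q / psi12[k]) for k, q in b12.items())
    f += sum(q * math.log(q / psi23[k]) for k, q in b23.items())
    # x2 touches two factors, so its correction weight is 1;
    # x1 and x3 touch one factor each and contribute nothing.
    f -= sum(q * math.log(q) for q in b2.values())
    return f

print("Bethe free energy:", bethe_free_energy())
print("-log Z           :", -math.log(Z))
```

On a tree the joint factorizes as the product of factor beliefs divided by the shared variable beliefs, which is why the two printed numbers agree; on a graph with cycles that factorization no longer holds in general.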

pith-pipeline@v0.9.0 · 5577 in / 1338 out tokens · 55366 ms · 2026-05-13T06:27:38.694560+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (J uniqueness) · echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
Passage: "Both objectives sit strictly between MAP and the ELBO (L_MAP ≤ L_Bethe ≤ L_ELBO), removing the structural Jensen gap... Z-consistent prior formulation makes the prior precision a differentiable parameter"

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking (D = 3) · unclear
UNCLEAR: relation between the paper passage and the cited Recognition theorem.
Passage: "On tree-structured factor graphs the Bethe free energy is exact"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

