
arxiv: 2605.08446 · v2 · submitted 2026-05-08 · 💻 cs.LG

Direct Bethe Free Energy Minimization for Bayesian Neural Network

Pith reviewed 2026-05-13 06:27 UTC · model grok-4.3


The pith

Direct Bethe free energy minimization for BNNs produces losses strictly between MAP and ELBO, enables joint empirical Bayes in one gradient pass, and matches reference methods on UCI benchmarks at single-pass cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Bayesian neural networks treat network weights as random variables to capture prediction uncertainty. Standard training uses variational inference to maximize a lower bound called the ELBO on the marginal likelihood. This work instead minimizes the Bethe free energy, a quantity from graphical models that equals the true free energy on tree-structured graphs. Deterministic layers fall out of the objective and train via ordinary backpropagation, so the method mixes probabilistic and deterministic parts without special handling. By restricting the posterior over weights to a Gaussian only on the final layer, the loss becomes analytically tractable. For Gaussian likelihoods it recovers the exact marginal likelihood; for probit likelihoods it reduces to a closed-form convolution. The resulting Bethe objective always lies between the MAP loss and the ELBO, closing a gap no variational family can remove. A Z-consistent prior formulation makes the prior precision a differentiable parameter, so weights, covariance, and hyperparameters optimize together in one gradient step with no outer cross-validation loop. Predictive distributions are also available in closed form at MAP-equivalent cost. Experiments on eight UCI regression and twelve UCI classification tasks under a shared hyperparameter regime show performance competitive with standard methods.
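The relationship between the exact marginal likelihood and the ELBO in the Gaussian-likelihood case can be checked on a toy problem. The sketch below is not the paper's derivation: it assumes a hypothetical last layer with a single scalar weight, made-up data (`PHI`, `Y`), and fixed precisions (`ALPHA`, `BETA`). It shows only the standard fact that the ELBO never exceeds the exact log marginal likelihood and touches it when the variational Gaussian equals the true posterior.

```python
import math

# Hypothetical toy data: scalar last-layer feature phi_i and target y_i.
PHI = [0.5, -1.0, 1.5, 2.0, -0.3]
Y = [1.1, -1.9, 3.2, 3.9, -0.5]
ALPHA = 1.0  # assumed prior precision: w ~ N(0, 1/ALPHA)
BETA = 4.0   # assumed noise precision: y_i ~ N(w * phi_i, 1/BETA)


def posterior(alpha, beta):
    """Exact Gaussian posterior over w: returns (mean, precision)."""
    precision = alpha + beta * sum(p * p for p in PHI)
    mean = beta * sum(p * t for p, t in zip(PHI, Y)) / precision
    return mean, precision


def log_evidence(alpha, beta):
    """Exact log marginal likelihood log p(y | alpha, beta)."""
    n = len(Y)
    mean, precision = posterior(alpha, beta)
    sq_err = sum((t - mean * p) ** 2 for p, t in zip(PHI, Y))
    return (0.5 * math.log(alpha) + 0.5 * n * math.log(beta)
            - 0.5 * beta * sq_err - 0.5 * alpha * mean ** 2
            - 0.5 * math.log(precision) - 0.5 * n * math.log(2 * math.pi))


def elbo(mu, var, alpha, beta):
    """ELBO for q(w) = N(mu, var): E_q[log lik] + E_q[log prior] + H[q]."""
    n = len(Y)
    exp_sq_err = sum((t - mu * p) ** 2 + var * p * p for p, t in zip(PHI, Y))
    exp_log_lik = (0.5 * n * math.log(beta / (2 * math.pi))
                   - 0.5 * beta * exp_sq_err)
    exp_log_prior = (0.5 * math.log(alpha / (2 * math.pi))
                     - 0.5 * alpha * (mu ** 2 + var))
    entropy = 0.5 * math.log(2 * math.pi * math.e * var)
    return exp_log_lik + exp_log_prior + entropy


m, a = posterior(ALPHA, BETA)
print("log evidence      :", log_evidence(ALPHA, BETA))
print("ELBO at posterior :", elbo(m, 1.0 / a, ALPHA, BETA))
print("ELBO, perturbed q :", elbo(m + 0.3, 2.0 / a, ALPHA, BETA))
```

In this linear-Gaussian toy the Gaussian family contains the true posterior, so the gap closes at the optimum; any other choice of `mu` and `var` leaves a strictly positive gap equal to the KL divergence from q to the posterior.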

Core claim

Both objectives sit strictly between MAP and the ELBO (L_MAP ≤ L_Bethe ≤ L_ELBO), removing the structural Jensen gap that no choice of variational family can close. The Z-consistent prior formulation makes the prior precision a differentiable parameter, enabling empirical Bayes in a single gradient pass.
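To illustrate what treating the prior precision as a differentiable parameter can look like, the sketch below runs gradient ascent on the log of the prior precision for a toy scalar linear-Gaussian model. This is classical evidence maximisation (type-II maximum likelihood), swapped in for illustration; it is not the paper's Z-consistent formulation, whose details are not given on this page. The data, the fixed noise precision `BETA`, the step size, and all function names are assumptions.

```python
import math

# Hypothetical toy data: scalar last-layer feature phi_i and target y_i.
PHI = [0.5, -1.0, 1.5, 2.0, -0.3]
Y = [1.1, -1.9, 3.2, 3.9, -0.5]
BETA = 4.0  # assumed known noise precision


def posterior(alpha):
    """Exact Gaussian posterior over the weight: returns (mean, precision)."""
    precision = alpha + BETA * sum(p * p for p in PHI)
    mean = BETA * sum(p * t for p, t in zip(PHI, Y)) / precision
    return mean, precision


def log_evidence(alpha):
    """Exact log marginal likelihood as a function of the prior precision."""
    n = len(Y)
    mean, precision = posterior(alpha)
    sq_err = sum((t - mean * p) ** 2 for p, t in zip(PHI, Y))
    return (0.5 * math.log(alpha) + 0.5 * n * math.log(BETA)
            - 0.5 * BETA * sq_err - 0.5 * alpha * mean ** 2
            - 0.5 * math.log(precision) - 0.5 * n * math.log(2 * math.pi))


def grad_log_alpha(alpha):
    """Analytic derivative of log_evidence with respect to log(alpha)."""
    mean, precision = posterior(alpha)
    return 0.5 * (1.0 - alpha * (mean ** 2 + 1.0 / precision))


def fit_prior_precision(alpha0=1.0, lr=0.5, steps=500):
    """Plain gradient ascent on log(alpha); no grid search, no outer loop."""
    rho = math.log(alpha0)
    for _ in range(steps):
        rho += lr * grad_log_alpha(math.exp(rho))
    return math.exp(rho)


alpha_hat = fit_prior_precision()
print("fitted prior precision:", alpha_hat)
print("log evidence at alpha=1:", log_evidence(1.0))
print("log evidence at fit    :", log_evidence(alpha_hat))
```

Optimising `log(alpha)` rather than `alpha` keeps the precision positive without a constraint. In this toy the weight posterior is available in closed form at every step, so only the hyperparameter is updated by gradient; the paper's setting updates weights, covariance, and hyperparameters together.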

Load-bearing premise

The argument rests on two conditions: restricting the weight posterior to a last-layer Gaussian must yield analytically tractable losses, and the factor graph must be tree-structured so that the Bethe free energy is exact. If either condition fails, the closed-form claims and the inequality may fail.

Figures

Figures reproduced from arXiv: 2605.08446 by Pavel Prochazka.

Figure 1: Two-Moons benchmark: MAP (overconfident), Laplace (poorly scaled), and Bethe (princi…
Figure 2: Factor graph for Direct Bethe Optimisation. Circles: variable nodes; filled squares: …
Original abstract

We propose training Bayesian neural networks by directly minimizing the Bethe free energy rather than maximizing a variational lower bound. On tree-structured factor graphs the Bethe free energy is exact; deterministic layers drop out of the objective and are trained by standard backpropagation, so the framework accommodates any mixture of probabilistic and deterministic subgraphs without modification. Restricting the weight posterior to a last-layer Gaussian yields analytically tractable losses: for a Gaussian likelihood the Bethe loss equals the exact marginal likelihood, and for a probit likelihood it reduces to a closed form via the probit-Gaussian convolution. Both objectives sit strictly between MAP and the ELBO ($L_\text{MAP} \leq L_\text{Bethe} \leq L_\text{ELBO}$), removing the structural Jensen gap that no choice of variational family can close. The Z-consistent prior formulation makes the prior precision a differentiable parameter, enabling empirical Bayes - joint optimization of weights, covariance, and hyperparameters - in a single gradient pass, with no cross-validation or outer loop. All variants admit a closed-form predictive at MAP-equivalent inference cost, in contrast to ensemble and sampling-based methods. On 8 UCI regression and 12 UCI classification benchmarks evaluated under a single shared hyperparameter regime, Bethe is competitive with standard reference methods at single-pass cost. Independently, joint single-pass empirical Bayes matches grid-search cross-validation of the prior precision on essentially all dataset-variant combinations, eliminating the outer hyperparameter loop without measurable cost. Isolated optimization gaps on a few datasets reflect numerical rather than principled limitations of the framework.
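The probit-Gaussian convolution mentioned in the abstract is the standard identity E[Φ(a)] = Φ(μ / sqrt(1 + σ²)) for a ~ N(μ, σ²). The sketch below checks it numerically; the function names and the quadrature settings are illustrative assumptions, and the code is not taken from the paper.

```python
import math


def std_normal_cdf(x):
    """Standard normal CDF, i.e. the probit link Phi(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def probit_predictive(mu, var):
    """Closed form of E[Phi(a)] for a ~ N(mu, var)."""
    return std_normal_cdf(mu / math.sqrt(1.0 + var))


def probit_predictive_quadrature(mu, var, n=20001, width=10.0):
    """Trapezoidal estimate of the same expectation, as a numerical check."""
    sd = math.sqrt(var)
    lo = mu - width * sd
    step = 2.0 * width * sd / (n - 1)
    total = 0.0
    for i in range(n):
        a = lo + i * step
        density = (math.exp(-0.5 * ((a - mu) / sd) ** 2)
                   / (sd * math.sqrt(2.0 * math.pi)))
        weight = 0.5 if i in (0, n - 1) else 1.0  # trapezoid end points
        total += weight * std_normal_cdf(a) * density
    return total * step


print("closed form:", probit_predictive(0.7, 2.0))
print("quadrature :", probit_predictive_quadrature(0.7, 2.0))
print("mean only  :", std_normal_cdf(0.7))
```

The closed form costs one CDF evaluation, the same as a point prediction, which is the sense in which such a predictive has MAP-equivalent cost. Accounting for the variance pulls the predicted probability toward 0.5 relative to plugging in the mean alone.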

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes training Bayesian neural networks by directly minimizing the Bethe free energy on tree-structured factor graphs instead of maximizing a variational lower bound. With a last-layer Gaussian weight posterior, the resulting losses are analytically tractable: the Gaussian-likelihood case recovers the exact marginal likelihood while the probit case yields a closed form via probit-Gaussian convolution. Both objectives satisfy L_MAP ≤ L_Bethe ≤ L_ELBO, and a Z-consistent prior makes the prior precision differentiable, enabling single-pass joint optimization of weights, covariance, and hyperparameters. Deterministic layers are trained by standard backpropagation, closed-form predictives are available at MAP cost, and the method is shown to be competitive with reference approaches on 8 UCI regression and 12 UCI classification benchmarks under a shared hyperparameter regime.

Significance. If the central derivations hold, the work supplies a principled interpolation between MAP and variational inference that removes the structural Jensen gap on trees, together with an efficient single-pass empirical-Bayes procedure and MAP-cost inference. These features address two persistent practical bottlenecks in Bayesian deep learning and are supported by the reported UCI results; the framework could therefore influence scalable BNN training pipelines that mix probabilistic and deterministic components.

major comments (2)
  1. [Abstract] The claim that the Bethe loss equals the exact marginal likelihood for Gaussian likelihood (and yields a closed form for probit) is load-bearing for the tractability and inequality statements; this equality is stated to hold because the factor graph is tree-structured and the posterior is restricted to a last-layer Gaussian. The manuscript must supply the explicit derivation (presumably in the methods section) showing how deterministic layers drop out while preserving exactness, and must state the precise conditions under which the inequality L_MAP ≤ L_Bethe ≤ L_ELBO remains strict rather than becoming an equality by construction.
  2. [Abstract] The Z-consistent prior is presented as making prior precision a differentiable parameter that enables single-pass empirical Bayes without an outer loop. Because this construction is derived inside the same last-layer Gaussian Bethe framework, the manuscript should verify (in the relevant optimization section) that joint gradient updates on the prior precision preserve the claimed positioning relative to the marginal likelihood and do not introduce bias when the tree-structure or Gaussian assumptions are only approximately satisfied.
minor comments (1)
  1. [Experimental section] The abstract refers to '8 UCI regression and 12 UCI classification benchmarks' evaluated under a single shared hyperparameter regime; the corresponding experimental section should list the exact hyperparameter values used for all competing methods to support reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and will revise the manuscript to include the requested explicit derivations, clarifications on conditions, and verification of the optimization procedure.

Point-by-point responses
  1. Referee: [Abstract] The claim that the Bethe loss equals the exact marginal likelihood for Gaussian likelihood (and yields a closed form for probit) is load-bearing for the tractability and inequality statements; this equality is stated to hold because the factor graph is tree-structured and the posterior is restricted to a last-layer Gaussian. The manuscript must supply the explicit derivation (presumably in the methods section) showing how deterministic layers drop out while preserving exactness, and must state the precise conditions under which the inequality L_MAP ≤ L_Bethe ≤ L_ELBO remains strict rather than becoming an equality by construction.

    Authors: We agree that an explicit derivation is essential for rigor. In the revised manuscript we will add a dedicated subsection in the Methods section that starts from the Bethe free energy on a tree-structured factor graph, shows that all messages from deterministic layers are deterministic and therefore contribute zero entropy and cancel in the interaction terms, and reduces the objective exactly to the marginal likelihood integral when the likelihood is Gaussian and the last-layer posterior is Gaussian. For the probit case we will derive the closed-form expression via the known Gaussian-probit convolution. We will also state the precise conditions for the inequality: L_Bethe equals the true marginal likelihood (hence L_MAP ≤ L_Bethe ≤ L_ELBO with equality to the marginal likelihood) precisely when the factor graph is a tree and the last-layer posterior is Gaussian; the inequalities are strict for non-linear deterministic layers whose true posterior is non-Gaussian, and become equalities only in the linear-Gaussian case or when the variational family is exact.

    Revision: yes

  2. Referee: [Abstract] The Z-consistent prior is presented as making prior precision a differentiable parameter that enables single-pass empirical Bayes without an outer loop. Because this construction is derived inside the same last-layer Gaussian Bethe framework, the manuscript should verify (in the relevant optimization section) that joint gradient updates on the prior precision preserve the claimed positioning relative to the marginal likelihood and do not introduce bias when the tree-structure or Gaussian assumptions are only approximately satisfied.

    Authors: We will expand the optimization section to include a short verification: because the Z-consistent prior is obtained by reparameterizing the Bethe free energy itself, any gradient step on the prior precision remains inside the same objective and therefore preserves L_MAP ≤ L_Bethe ≤ L_ELBO by construction. When the tree or Gaussian assumptions hold only approximately, the procedure yields a consistent approximation to the marginal likelihood rather than an exact one; we will note this limitation explicitly and point to the UCI results, which show that the single-pass empirical-Bayes estimates match cross-validated values on nearly all dataset-variant pairs, indicating no measurable bias under the regimes tested.

    Revision: yes

Circularity Check

0 steps flagged

No circularity; derivation follows from standard Bethe free energy on trees plus explicit last-layer Gaussian restriction

full rationale

The paper's central results (Bethe loss expressions, L_MAP ≤ L_Bethe ≤ L_ELBO inequalities, and single-pass empirical Bayes via Z-consistent prior) are obtained by substituting the last-layer Gaussian posterior into the standard Bethe free energy formula for tree-structured factor graphs. The Gaussian-likelihood case recovers the exact marginal likelihood and the probit case yields a closed form via convolution; both inequalities are direct consequences of the variational ordering and exactness on trees. No equation reduces to a fitted parameter renamed as a prediction, no uniqueness theorem is imported from self-citation, and no ansatz is smuggled via prior work. The framework is therefore self-contained against external graphical-model benchmarks once the two explicit restrictions are granted.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The framework rests on the exactness of Bethe free energy on trees and the analytic tractability under last-layer Gaussian posteriors; no new entities are postulated.

free parameters (1)
  • prior precision
    Made differentiable via Z-consistent formulation for joint optimization with weights and covariance.
axioms (1)
  • domain assumption: Bethe free energy equals the exact marginal free energy on tree-structured factor graphs
    Invoked to justify exactness for deterministic layers and the overall objective.
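
For reference, the axiom above refers to the standard Bethe free energy of a factor graph, written here in its usual discrete-variable form. The notation is the conventional one and is an assumption, not taken from this page: f_a are the factors, b_a and b_i are factor and variable beliefs, d_i is the number of factors attached to variable i, and Z is the partition function. For continuous variables, as in the last-layer Gaussian setting, the sums become integrals.

```latex
F_{\text{Bethe}}(b)
  = \sum_{a} \sum_{x_a} b_a(x_a) \ln \frac{b_a(x_a)}{f_a(x_a)}
  \;-\; \sum_{i} (d_i - 1) \sum_{x_i} b_i(x_i) \ln b_i(x_i),
\qquad
\min_{b} F_{\text{Bethe}}(b) = -\ln Z \quad \text{on a tree.}
```

The minimum is taken over beliefs that are normalised and locally consistent, meaning each b_a marginalises to the b_i of the variables it touches. On a tree the minimising beliefs are the exact marginals, which is what makes the objective exact rather than an approximation.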


