arxiv: 2605.29467 · v1 · pith:VWFFZZDZnew · submitted 2026-05-28 · 💻 cs.LG · cs.AI

Composing Non-Conjugate Factor Graphs with Closed-Form Variational Inference

Pith reviewed 2026-06-29 09:12 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords factor graphs · variational inference · closed-form inference · message passing · mixture of experts · probabilistic models · Gaussian messages · Gamma messages

The pith

Any model composed from five factor-graph primitives admits closed-form variational message passing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that stacking probabilistic building blocks typically breaks closed-form inference, but that five specific primitives can be composed while preserving it. The primitives are a bilinear factor, an exponential link, a Gamma prior, a Gaussian likelihood, and an equality node. Under mean-field factorization, each preserves a small set of message families: messages on Gaussian variables remain Gaussian, messages on precision variables remain Gamma, and the exponential link stays tractable via the Gaussian moment-generating function and the sufficient statistics of the Gamma family. The construction supports models of increasing depth, from static ensembles through input-dependent gating to split-branch routing. Stacked routing layers encode decision trees, and the framework yields a Bayesian mixture of experts with inferred gating for time-series forecasting.

Core claim

Any model composed from the five primitives admits closed-form variational message passing because each primitive preserves a small set of message families under mean-field factorization: messages on Gaussian variables remain Gaussian, messages on precision variables remain Gamma, and the exponential link remains tractable through the Gaussian moment-generating function and the sufficient statistics of the Gamma family.
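
As a hedged illustration of the two ingredients named in the claim, the identities below are standard results and are not copied from the paper's derivations. The symbols μ, σ², a, and b are introduced here for illustration only: q(z) is a Gaussian belief with mean μ and variance σ², and q(γ) is a Gamma belief with shape a and rate b.

```latex
\begin{align}
\mathbb{E}_{q(z)}\!\left[e^{z}\right] &= \exp\!\left(\mu + \tfrac{1}{2}\sigma^{2}\right)
  && \text{(Gaussian moment-generating function at } t = 1\text{)} \\
\mathbb{E}_{q(\gamma)}[\gamma] &= \frac{a}{b}, \qquad
\mathbb{E}_{q(\gamma)}[\log \gamma] = \psi(a) - \log b
  && \text{(expected Gamma sufficient statistics)}
\end{align}
```

Here ψ denotes the digamma function. Both expectations are available without numerical integration, which is the sense in which an exponential link between a Gaussian variable and a Gamma-distributed precision can stay in closed form.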

What carries the argument

The five factor-graph primitives (bilinear factor, exponential link, Gamma prior, Gaussian likelihood, equality node) that each preserve Gaussian and Gamma message families under mean-field factorization.

If this is right

  • Stacking routing layers encodes arbitrary decision trees while retaining closed-form inference.
  • Universal function approximation is achieved with closed-form variational message passing.
  • A Bayesian mixture of experts arises in which gating functions are inferred rather than learned.
  • Applied to ensemble time-series forecasting, the approach yields calibrated uncertainty over expert selection across five benchmark datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same preservation of message families might extend to other link functions whose moment-generating functions admit closed-form expectations with Gamma statistics.
  • Deeper compositions could be tested to confirm whether the Gaussian and Gamma families remain closed at arbitrary depth.
  • The framework offers a route to build deep probabilistic models that avoid sampling while still encoding complex routing.

Load-bearing premise

Under mean-field factorization the only non-conjugate interface is the exponential link and it remains tractable through the Gaussian moment-generating function together with the sufficient statistics of the Gamma family.

What would settle it

A concrete counter-example model built only from the five primitives in which at least one variational message update requires numerical integration or approximation outside the claimed Gaussian and Gamma families.
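
One way to probe a candidate update for such a counter-example is to compare its closed-form expectation against brute-force numerical integration. The sketch below does this for the simplest case, E[exp(z)] under a Gaussian belief. It is a minimal illustration using only the Python standard library; the function names and the integration settings are assumptions made here, not part of the paper, and the truncated trapezoid rule is only reliable for moderate variances.

```python
import math


def gaussian_pdf(z, mean, var):
    """Density of N(mean, var) evaluated at z."""
    return math.exp(-0.5 * (z - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def expect_exp_closed_form(mean, var):
    """E[exp(z)] for z ~ N(mean, var), from the Gaussian MGF at t = 1."""
    return math.exp(mean + 0.5 * var)


def expect_exp_quadrature(mean, var, half_width=12.0, steps=20001):
    """Same expectation by a composite trapezoid rule.

    Integrates over mean +/- half_width standard deviations, so it is only
    accurate when the variance is moderate (assumption of this sketch).
    """
    std = math.sqrt(var)
    lo = mean - half_width * std
    hi = mean + half_width * std
    h = (hi - lo) / (steps - 1)
    total = 0.0
    for i in range(steps):
        z = lo + i * h
        weight = 0.5 if i in (0, steps - 1) else 1.0
        total += weight * math.exp(z) * gaussian_pdf(z, mean, var)
    return total * h


def relative_gap(mean, var):
    """Relative difference between the quadrature and closed-form values."""
    exact = expect_exp_closed_form(mean, var)
    return abs(expect_exp_quadrature(mean, var) - exact) / exact


if __name__ == "__main__":
    print(relative_gap(0.3, 1.5))
```

A gap near machine precision is what the claim predicts for this update. A candidate counter-example would be an update built from the five primitives for which no closed-form expression matches the numerically integrated message.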

Figures

Figures reproduced from arXiv: 2605.29467 by Bert de Vries, Dmitry Bagaev, İsmail Şenöz, Jeff Beck, Kyrylo Yemets, Mykola Lukashchuk, Wouter M. Kouw.

Figure 1: The building blocks as factor graph nodes. Square nodes are factors; round nodes represent neighboring nodes to which …
Figure 2: Depth 0 factor graph (static ensemble), shown start…
Figure 3: The precision word π and Depth-1 model. (a) Internal structure: a softdot and exponential link connected by latent z, computing input-dependent precision γ from w, ϕ, and τ. (b) Compact notation; double border indicates a composite word, filled semi-circle marks the τ (input precision) side. (c) Depth 1 factor graph (Precision-Gated Experts) for expert i=1, observation j=1: compared to Depth 0 (…
Figure 4: Depth 2 factor graph (split-branch routing) for one expert, one observation. The router softdot produces …
Figure 5: Two modes of message computation. (a) Under mean-field factorization constraints, the message from the softdot toward z depends only on the marginal types of its other edges: q(w) ∈ N and q(ϕ) ∈ N (solid lines), q(τ) ∈ G (dashed line). Which factors are connected on the other side of these edges is irrelevant; the line styles act as a type system. (b) The exp link uses belief propagation (BP); as a determ…
Figure 6: Posterior prediction for the XOR encoding with two …
Figure 7: Radar charts of log-transformed MSE and NLL averaged over all horizons. Each axis corresponds to a dataset; larger …
Figure 8: Noisy experts factor graph, shown for expert …
Figure 9: Pareto frontier relating model size to radar-chart area. Static and Noisy Diagonal offer the strongest trade-off between …
Figure 10: Comparison of forecasting with confidence interval of Static, Noisy Diagonal and MoE ensembles on electricity dataset …
Figure 11: Comparison of forecasting with confidence interval of Static, Noisy Diagonal and MoE ensembles on exchange rate …
Original abstract

Stacking probabilistic building blocks into deeper architectures typically breaks closed-form inference. We show that closed-form inference can be preserved. We identify five factor-graph primitives: a bilinear factor, an exponential link, a Gamma prior, a Gaussian likelihood, and an equality node, and prove that any model composed from them admits closed-form variational message passing. The construction works because each primitive preserves a small set of message families: under mean-field factorization, messages on Gaussian variables remain Gaussian and messages on precision variables remain Gamma, while the only non-conjugate interface, the exponential link, remains tractable through the Gaussian moment-generating function and the sufficient statistics of the Gamma family. We demonstrate composition at increasing depth, from static ensembles through input-dependent gating to split-branch routing, and show that stacking routing layers encodes arbitrary decision trees, establishing universal function approximation with closed-form inference. Applied to ensemble time-series forecasting, the framework yields a Bayesian mixture of experts in which gating functions are inferred rather than learned, providing calibrated uncertainty over expert selection across five benchmark datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript identifies five factor-graph primitives (bilinear factor, exponential link, Gamma prior, Gaussian likelihood, equality node) and claims that any model composed from them admits closed-form variational message passing. It asserts that, under mean-field factorization, each primitive keeps messages on Gaussian variables Gaussian and messages on precision variables Gamma, with the sole non-conjugate interface (the exponential link) remaining tractable via the Gaussian moment-generating function and the Gamma sufficient statistics. The work demonstrates compositions of increasing depth (static ensembles, input-dependent gating, split-branch routing), argues that stacked routing layers encode arbitrary decision trees and therefore give universal function approximation, and applies the framework to a Bayesian mixture-of-experts model for ensemble time-series forecasting with inferred gating on five benchmarks.

Significance. If the central preservation result holds under arbitrary compositions, the framework would enable scalable closed-form variational inference for deep non-conjugate architectures while retaining calibrated uncertainty, a meaningful advance for Bayesian deep learning. The explicit construction of decision-tree routing with tractable messages and the forecasting application are concrete strengths.

major comments (2)
  1. [Section on primitives and message-family preservation] The central claim requires that local family preservation composes globally. The manuscript must supply an explicit lemma or inductive argument (in the section presenting the five primitives and their message updates) showing that cavity distributions seen by the exponential link remain exactly Gaussian when the link receives messages routed through bilinear factors or equality nodes in multi-branch wirings; mean-field factorization alone does not guarantee this closure without additional propagation rules.
  2. [Experimental evaluation on time-series forecasting] Table or figure reporting the forecasting results: the claim of 'calibrated uncertainty over expert selection' requires quantitative verification (e.g., proper scoring rules or coverage of predictive intervals) that isolates the benefit of closed-form gating inference versus learned alternatives; without these metrics the application does not yet substantiate the broader methodological contribution.
minor comments (1)
  1. Notation for message natural parameters and the precise definition of the exponential-link update should be introduced with a single consistent table or appendix to aid readability across the composition examples.
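
To make the closure question in major comment 1 and the notation request in the minor comment concrete, the following is a minimal sketch of how an equality node is usually handled for univariate Gaussian messages: each message is written in information form (precision-weighted mean, precision), the parameters are added, and the result is again Gaussian. This is the textbook product-of-Gaussians rule, not code or notation taken from the manuscript, and the function names are hypothetical.

```python
def to_information_form(mean, var):
    """Convert (mean, variance) to (precision-weighted mean, precision)."""
    precision = 1.0 / var
    return precision * mean, precision


def equality_node_message(incoming):
    """Combine Gaussian messages meeting at an equality node.

    `incoming` is a list of (mean, variance) pairs. Multiplying Gaussian
    densities adds their information-form parameters, so the outgoing
    message stays in the Gaussian family.
    """
    weighted_mean_sum = 0.0
    precision_sum = 0.0
    for mean, var in incoming:
        xi, lam = to_information_form(mean, var)
        weighted_mean_sum += xi
        precision_sum += lam
    # Convert back to (mean, variance) for readability.
    return weighted_mean_sum / precision_sum, 1.0 / precision_sum


if __name__ == "__main__":
    print(equality_node_message([(0.0, 1.0), (2.0, 1.0)]))
```

The sketch only shows local closure at a single equality node. Whether this closure survives arbitrary multi-branch wirings through bilinear factors is exactly what the requested lemma would need to establish.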

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and will incorporate the suggested changes in the revision.

  1. Referee: [Section on primitives and message-family preservation] The central claim requires that local family preservation composes globally. The manuscript must supply an explicit lemma or inductive argument (in the section presenting the five primitives and their message updates) showing that cavity distributions seen by the exponential link remain exactly Gaussian when the link receives messages routed through bilinear factors or equality nodes in multi-branch wirings; mean-field factorization alone does not guarantee this closure without additional propagation rules.

    Authors: We agree that an explicit inductive argument would strengthen the presentation of the central claim. In the revised manuscript we will insert a new lemma in the section on the five primitives. The lemma will prove by induction on composition depth that, under mean-field factorization, cavity distributions arriving at any exponential link remain exactly Gaussian even when messages are routed through arbitrary wirings of bilinear factors and equality nodes. The base case covers the local updates already stated; the inductive step shows that the message families are closed under the additional propagation rules induced by equality nodes and bilinear factors. revision: yes

  2. Referee: [Experimental evaluation on time-series forecasting] Table or figure reporting the forecasting results: the claim of 'calibrated uncertainty over expert selection' requires quantitative verification (e.g., proper scoring rules or coverage of predictive intervals) that isolates the benefit of closed-form gating inference versus learned alternatives; without these metrics the application does not yet substantiate the broader methodological contribution.

    Authors: We acknowledge that the current experimental section would benefit from additional quantitative verification of calibration. In the revision we will expand the forecasting results to include the Continuous Ranked Probability Score (CRPS) and the empirical coverage of 95% predictive intervals. These metrics will be reported for the closed-form variational gating model and compared against learned-gating baselines on the same five benchmarks, thereby isolating the contribution of the closed-form inference procedure. revision: yes
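
For readers unfamiliar with the two metrics promised in this response, the sketch below computes them under the assumption that each forecast is summarized by a Gaussian predictive distribution; the paper's actual predictive distributions may differ (for example, mixtures over experts). The CRPS expression is the standard closed form for a Gaussian forecast, and the helper names are hypothetical.

```python
import math


def gaussian_crps(mean, std, y):
    """Closed-form CRPS of a Gaussian forecast N(mean, std^2) at outcome y.

    Lower is better; the score is in the same units as y.
    """
    z = (y - mean) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return std * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))


def empirical_coverage(means, stds, ys, z_crit=1.959963984540054):
    """Fraction of outcomes inside the central interval mean +/- z_crit * std.

    The default z_crit corresponds to a 95% interval for a Gaussian forecast.
    """
    inside = sum(1 for m, s, y in zip(means, stds, ys) if abs(y - m) <= z_crit * s)
    return inside / len(ys)


if __name__ == "__main__":
    print(gaussian_crps(0.0, 1.0, 0.0))
    print(empirical_coverage([0.0] * 4, [1.0] * 4, [0.5, -1.0, 2.5, 1.9]))
```

For a well-calibrated model the empirical coverage of nominal 95% intervals should be close to 0.95 on held-out data, while CRPS rewards both calibration and sharpness.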

Circularity Check

0 steps flagged

No circularity: preservation property is asserted as a theorem to be proven from the primitives

full rationale

The paper identifies five factor-graph primitives and states that it proves any composition admits closed-form variational message passing because each primitive preserves Gaussian/Gamma message families (with the exponential link handled via MGF). No equations, fitted parameters, or self-citations are shown that would make the claimed closure reduce to a quantity defined by the same inputs. The central claim is a preservation theorem under mean-field factorization rather than a renaming, fit, or self-referential definition; the derivation is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The result rests on the standard mean-field assumption and on the algebraic closure properties of the five chosen primitives; no free parameters or new entities are introduced in the abstract.

axioms (1)
  • domain assumption: Mean-field factorization is assumed throughout.
    The preservation of Gaussian and Gamma message families is stated to hold under mean-field factorization.
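
To illustrate what the mean-field assumption buys, here is a minimal sketch of the textbook coordinate-ascent updates for a single Gaussian observation with unknown mean and unknown precision. The model (y ~ N(x, 1/γ), x ~ N(m0, v0), γ ~ Gamma(a0, b0) with rate b0), the factorization q(x)q(γ), and all names are assumptions chosen here for illustration; they are not taken from the paper.

```python
def update_precision(a0, b0, y, x_mean, x_var):
    """Gamma update for q(gamma) given a Gaussian q(x) = N(x_mean, x_var)."""
    a = a0 + 0.5
    # E[(y - x)^2] under q(x) is the squared error plus the variance of x.
    b = b0 + 0.5 * ((y - x_mean) ** 2 + x_var)
    return a, b


def update_mean(m0, v0, y, a, b):
    """Gaussian update for q(x) given a Gamma q(gamma) with shape a, rate b."""
    expected_precision = a / b
    post_precision = 1.0 / v0 + expected_precision
    post_mean = (m0 / v0 + expected_precision * y) / post_precision
    return post_mean, 1.0 / post_precision


def coordinate_ascent(y, m0=0.0, v0=1.0, a0=1.0, b0=1.0, sweeps=50):
    """Alternate the two closed-form updates for a fixed number of sweeps."""
    m, v = m0, v0
    a, b = a0, b0
    for _ in range(sweeps):
        a, b = update_precision(a0, b0, y, m, v)
        m, v = update_mean(m0, v0, y, a, b)
    return (m, v), (a, b)


if __name__ == "__main__":
    print(coordinate_ascent(2.0))
```

Each update needs only the expected statistics of the other factor, so the Gaussian belief stays Gaussian and the precision belief stays Gamma. This is the single-node version of the family preservation the axiom refers to.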

pith-pipeline@v0.9.1-grok · 5743 in / 1251 out tokens · 33464 ms · 2026-06-29T09:12:25.206909+00:00 · methodology

