Bias-Variance Trade-Off in Hierarchical Probabilistic Models Using Higher-Order Feature Interactions

Mahito Sugiyama; Simon Luo

arxiv: 1906.12063 · v1 · pith:YRJBRQT7new · submitted 2019-06-28 · 📊 stat.ML · cs.LG

Pith reviewed 2026-05-25 14:01 UTC · model grok-4.3

keywords: bias-variance tradeoff · higher-order interactions · Boltzmann machine · hierarchical probabilistic models · inference algorithm · Gibbs sampling · annealed importance sampling

The pith

Higher-order interactions match hidden layers in total error but show lower variance for small training samples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares hidden layers and higher-order feature interactions inside hierarchical probabilistic models by measuring their bias-variance trade-offs. It develops a practical inference procedure that combines Gibbs sampling with annealed importance sampling for the log-linear higher-order Boltzmann machine. Bias-variance decomposition on fitted models then reveals that both modeling choices produce errors of similar magnitude. The decomposition further shows that higher-order interactions incur noticeably less variance when the number of training examples is small. These observations indicate that, for limited data, higher-order terms can serve as a lower-variance substitute for added hidden layers.

Core claim

Using the proposed Gibbs-plus-annealed-importance-sampling procedure, the authors fit both hidden-layer and higher-order Boltzmann machines and decompose their errors into bias and variance components. They report that the two families of models exhibit total error of the same order of magnitude, while the higher-order-interaction models display lower variance when trained on smaller sample sizes.

What carries the argument

Bias-variance decomposition performed on the log-linear higher-order Boltzmann machine after inference via Gibbs sampling combined with annealed importance sampling.
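To make the decomposition concrete, here is a minimal Python sketch on a toy problem. It is not the authors' procedure: it assumes a first-order (independent-bits) log-linear family whose fit has a closed form, so no Gibbs sampling or annealed importance sampling is needed. The true distribution `p_true`, the add-one smoothing, and all sizes are hypothetical choices made for illustration. For such a family the expected KL divergence splits exactly into a bias term (divergence from the truth to its projection onto the family) and a variance term (mean divergence from the projection to each fitted model).

```python
import itertools
import numpy as np

def kl(p, q):
    """KL divergence between two strictly positive distributions."""
    return float(np.sum(p * np.log(p / q)))

def product_distribution(marginals, states):
    """Joint distribution of independent bits with P(x_i = 1) = marginals[i]."""
    probs = np.where(states == 1, marginals, 1.0 - marginals)
    return probs.prod(axis=1)

rng = np.random.default_rng(0)
n_bits, n_samples, n_replicates = 4, 20, 200
states = np.array(list(itertools.product([0, 1], repeat=n_bits)))

# Hypothetical "true" distribution P*: random and strictly positive.
p_true = rng.dirichlet(np.ones(len(states)))

# Projection of P* onto the first-order family: product of the true marginals.
true_marginals = p_true @ states
p_proj = product_distribution(true_marginals, states)

total, variance = [], []
for _ in range(n_replicates):
    idx = rng.choice(len(states), size=n_samples, p=p_true)
    counts = states[idx].sum(axis=0)
    # Add-one smoothing (an assumption) keeps every estimate strictly positive.
    est_marginals = (counts + 1.0) / (n_samples + 2.0)
    p_hat = product_distribution(est_marginals, states)
    total.append(kl(p_true, p_hat))
    variance.append(kl(p_proj, p_hat))

bias = kl(p_true, p_proj)
mean_total, mean_variance = np.mean(total), np.mean(variance)
print(f"bias={bias:.4f} variance={mean_variance:.4f} total={mean_total:.4f}")
```

In this toy setting the printed total equals bias plus variance up to floating-point error, because the fitted models stay inside the exponential family being projected onto. For models fitted by approximate inference the same identity is only as good as the fit.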

If this is right

Both hidden layers and higher-order interactions can achieve comparable generalization performance.
When training data are scarce, higher-order interactions are expected to produce more stable predictions.
The inference algorithm makes systematic bias-variance studies feasible for models with explicit higher-order terms.
Model designers can trade depth for explicit feature interactions without large changes in overall error.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same variance-reduction pattern may appear in other exponential-family models that admit explicit higher-order potentials.
For very large datasets the variance advantage is likely to shrink as bias terms become dominant in both approaches.
Practical implementations could test whether the observed trade-off persists when the same model class is trained with modern variational methods instead of the proposed sampler.

Load-bearing premise

The bias-variance decomposition on the fitted models accurately reflects their true generalization behavior and the inference algorithm supplies parameter estimates accurate enough for that decomposition to be meaningful.

What would settle it

An experiment on a fresh small-sample dataset in which the measured variance of higher-order-interaction models exceeds the variance of hidden-layer models would falsify the reported variance advantage.

Figures

Figures reproduced from arXiv: 1906.12063 by Mahito Sugiyama, Simon Luo.

Figure 2: An illustration of the decomposition of the bias

Figure 3: Empirical evaluation of the error generated from the bias and variance of the RBM

Figure 4: Empirical evaluation of the error generated from the bias and variance for the HBM

Figure 5: Empirical evaluation of the error generated from the bias and variance for varying hidden nodes in the RBM

Figure 6: Empirical evaluation of the error generated from the bias and variance for varying order of interactions in the HBM

Figure 7: Empirical evaluation of the error generated from the bias and variance for varying sample size in the RBM

Figure 8: Empirical evaluation of the error generated from the bias and variance for varying sample size in the HBM

Figure 9: Comparing empirical error in model for the HBM with RBM against the number of model parameters

Original abstract

Hierarchical probabilistic models are able to use a large number of parameters to create a model with a high representation power. However, it is well known that increasing the number of parameters also increases the complexity of the model which leads to a bias-variance trade-off. Although it is a classical problem, the bias-variance trade-off between hidden layers and higher-order interactions have not been well studied. In our study, we propose an efficient inference algorithm for the log-linear formulation of the higher-order Boltzmann machine using a combination of Gibbs sampling and annealed importance sampling. We then perform a bias-variance decomposition to study the differences in hidden layers and higher-order interactions. Our results have shown that using hidden layers and higher-order interactions have a comparable error with a similar order of magnitude and using higher-order interactions produce less variance for smaller sample size.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper reports comparable total error but lower variance for explicit higher-order interactions versus hidden layers at small n, backed by a Gibbs+AIS inference method for higher-order Boltzmann machines.

The core finding is that in this model class, higher-order interactions achieve similar overall error to hidden layers but with noticeably lower variance when sample sizes are small. The authors introduce a combined Gibbs sampling and annealed importance sampling procedure for inference in the log-linear higher-order Boltzmann machine, then apply a standard bias-variance decomposition to fitted models and report the breakdown by component. That direct empirical comparison on the two ways of increasing model capacity is the main piece of work here. The inference algorithm is presented as practical for this formulation, and the results are framed as guidance on when one approach might be preferable over the other.

The decomposition itself follows the usual formulas and is applied consistently to both model types, so there is no obvious circularity in how the quantities are defined. The citation pattern is light and focused on the relevant prior work on Boltzmann machines and bias-variance analysis. The soft spot is exactly the one flagged in the stress-test note: the variance advantage is claimed in the small-n regime, yet higher-order models carry combinatorially many parameters, and nothing in the provided description shows mixing diagnostics, recovery tests on synthetic data, or other checks that the sampler and AIS estimates remain reliable enough there for the empirical variance term to reflect true generalization rather than estimation noise. If those checks are absent or weak, the reported difference could be an artifact.

This is a targeted, narrow question rather than a broad methodological advance, so the paper is mainly useful to researchers already working with higher-order or hierarchical probabilistic models who need concrete numbers on this particular trade-off. A reader looking for practical model-design heuristics in that niche would find the comparison worth seeing. It is solid enough on its own terms to merit a serious referee, even if the inference validation needs strengthening in revision.

Referee Report

2 major / 1 minor

Summary. The paper proposes a Gibbs sampling + annealed importance sampling inference procedure for the log-linear higher-order Boltzmann machine and uses it to perform a bias-variance decomposition comparing hidden-layer models against higher-order interaction models. The central empirical claim is that the two model classes achieve comparable total error of similar magnitude, while higher-order interactions exhibit lower variance at small sample sizes.

Significance. If the inference procedure recovers parameters sufficiently accurately for the decomposition to be meaningful, the result would provide guidance on model-class choice in small-data regimes for hierarchical probabilistic models. The work also supplies an inference algorithm whose practical utility would be strengthened by explicit accuracy validation.

major comments (2)

[Inference algorithm and experimental sections] The headline bias-variance comparison is only interpretable if the Gibbs+AIS procedure yields sufficiently accurate parameter estimates (especially for the combinatorially many higher-order parameters) in the small-n regime where the variance advantage is asserted. No recovery experiments on synthetic data, mixing diagnostics, or comparison against exact inference on tractable instances are described to support this assumption.

[Experimental results] The abstract (and therefore the experimental claims) provides no information on experimental design, the precise sample sizes tested, the number of hidden units versus interaction order, number of replicates, or statistical significance of the reported variance difference. Without these controls the claim that higher-order interactions “produce less variance for smaller sample size” cannot be evaluated.
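The kind of check the first major comment asks for, a comparison against exact inference on a tractable instance, can be sketched in Python as follows. This is an illustrative sketch only, not the paper's algorithm: it assumes a small fully visible pairwise Boltzmann machine with hypothetical random weights `W` and biases `b`, anneals from the uniform distribution with Gibbs sweeps as the transition, and compares the AIS estimate of the log partition function with brute-force enumeration. Function names and all settings are assumptions.

```python
import itertools
import numpy as np

def log_unnorm(X, b, W):
    """Log of the unnormalized probability for each row of binary X."""
    return X @ b + 0.5 * np.einsum("ci,ij,cj->c", X, W, X)

def gibbs_sweep(X, b, W, beta, rng):
    """One systematic-scan Gibbs sweep at inverse temperature beta."""
    for i in range(X.shape[1]):
        activation = beta * (b[i] + X @ W[:, i])
        p_on = 1.0 / (1.0 + np.exp(-activation))
        X[:, i] = rng.random(X.shape[0]) < p_on
    return X

def ais_log_z(b, W, n_chains=500, n_temps=200, seed=0):
    """AIS estimate of log Z, annealing from the uniform distribution."""
    rng = np.random.default_rng(seed)
    n = len(b)
    X = (rng.random((n_chains, n)) < 0.5).astype(float)
    betas = np.linspace(0.0, 1.0, n_temps + 1)
    log_w = np.zeros(n_chains)
    for beta_prev, beta in zip(betas[:-1], betas[1:]):
        # Importance-weight update, then a transition targeting the new beta.
        log_w += (beta - beta_prev) * log_unnorm(X, b, W)
        X = gibbs_sweep(X, b, W, beta, rng)
    # log Z_0 = n log 2 for the uniform base distribution.
    return n * np.log(2.0) + np.logaddexp.reduce(log_w) - np.log(n_chains)

def exact_log_z(b, W):
    """Brute-force log Z by enumerating all 2^n binary states."""
    states = np.array(list(itertools.product([0.0, 1.0], repeat=len(b))))
    return np.logaddexp.reduce(log_unnorm(states, b, W))

# Hypothetical small instance: symmetric weights with a zero diagonal.
rng = np.random.default_rng(1)
n = 6
W = np.triu(rng.normal(scale=0.5, size=(n, n)), 1)
W = W + W.T
b = rng.normal(scale=0.5, size=n)

estimate, exact = ais_log_z(b, W), exact_log_z(b, W)
print(f"AIS log Z = {estimate:.4f}, exact log Z = {exact:.4f}")
```

Enumeration is only feasible for a handful of units, which is why such a check is usually run on deliberately small models before trusting the sampler on larger ones.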

minor comments (1)

Notation for the higher-order log-linear model and the precise form of the bias-variance decomposition should be stated explicitly with equation numbers.
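As an illustration of what explicit, numbered notation could look like, the two formulas below restate the passages quoted in the Lean-theorem section of this page. The symbols are assumptions about the paper's notation rather than a transcription of it: $S$ is a partially ordered sample space, $\zeta$ its zeta function, $P^{*}$ the true distribution, $P^{*}_{B}$ its projection onto the model family $B$, and $\hat{P}_{B}$ the fitted model. Reading $\operatorname{var}(P^{*}_{B}, B)$ as $\mathbb{E}[D_{\mathrm{KL}}(P^{*}_{B}, \hat{P}_{B})]$ is one plausible interpretation, not a confirmed definition.

```latex
% (1) Log-linear model on a poset S (assumed notation)
\log p(x) = \sum_{s \in S,\; s \le x} \theta(s)
          = \sum_{s \in S} \zeta(s, x)\, \theta(s),
\qquad
\zeta(s, x) =
\begin{cases}
  1 & \text{if } s \le x, \\
  0 & \text{otherwise.}
\end{cases}
\tag{1}

% (2) Bias-variance decomposition as quoted on this page
\mathbb{E}\!\left[ D_{\mathrm{KL}}\!\left(P^{*}, \hat{P}_{B}\right) \right]
  = D_{\mathrm{KL}}\!\left(P^{*}, P^{*}_{B}\right)
  + \operatorname{var}\!\left(P^{*}_{B}, B\right)
\tag{2}
```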

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments, which highlight important aspects for strengthening the manuscript. We address the major comments below and will incorporate revisions to improve the clarity and rigor of the work.

Referee: The headline bias-variance comparison is only interpretable if the Gibbs+AIS procedure yields sufficiently accurate parameter estimates (especially for the combinatorially many higher-order parameters) in the small-n regime where the variance advantage is asserted. No recovery experiments on synthetic data, mixing diagnostics, or comparison against exact inference on tractable instances are described to support this assumption. (Inference algorithm and experimental sections)

Authors: We agree with the referee that validating the accuracy of the Gibbs sampling combined with annealed importance sampling (AIS) inference procedure is essential to ensure the bias-variance decomposition is meaningful, particularly given the large number of higher-order parameters. The original manuscript did not include synthetic recovery experiments or mixing diagnostics. In the revised manuscript, we will add a section on inference validation, including parameter recovery on synthetic data, trace plots or autocorrelation for mixing, and comparisons to exact inference on small tractable models. This will support the reliability of the results in the small-sample regime.

Revision: yes

Referee: The abstract (and therefore the experimental claims) provides no information on experimental design, the precise sample sizes tested, the number of hidden units versus interaction order, number of replicates, or statistical significance of the reported variance difference. Without these controls the claim that higher-order interactions “produce less variance for smaller sample size” cannot be evaluated. (Experimental results)

Authors: We acknowledge that the abstract is missing key details on the experimental design, which makes it difficult to fully evaluate the claims. We will revise the abstract to include specifics such as the range of sample sizes tested (e.g., small n regimes), configurations for hidden units and interaction orders, the number of replicates performed, and any statistical significance testing for the variance differences. These details will also be expanded in the main experimental section for completeness.

Revision: yes
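The mixing diagnostic mentioned in the first response (autocorrelation of a sampler trace) can be sketched in a few lines of Python. This is a generic illustration under stated assumptions, not the authors' validation code: the model is a hypothetical small pairwise Boltzmann machine with random weights, the monitored statistic is the number of active units after each Gibbs sweep, and the function names are invented for the example.

```python
import numpy as np

def gibbs_chain(b, W, n_steps, rng):
    """Run one Gibbs chain and record the number of active units per sweep."""
    n = len(b)
    x = (rng.random(n) < 0.5).astype(float)
    trace = np.empty(n_steps)
    for t in range(n_steps):
        for i in range(n):
            # W has a zero diagonal, so x[i] does not enter its own activation.
            p_on = 1.0 / (1.0 + np.exp(-(b[i] + W[i] @ x)))
            x[i] = float(rng.random() < p_on)
        trace[t] = x.sum()
    return trace

def autocorrelation(trace, max_lag):
    """Sample autocorrelation of a scalar trace for lags 0..max_lag."""
    centered = trace - trace.mean()
    denom = centered @ centered
    return np.array([
        (centered[: len(trace) - lag] @ centered[lag:]) / denom
        for lag in range(max_lag + 1)
    ])

# Hypothetical small instance: symmetric weights with a zero diagonal.
rng = np.random.default_rng(0)
n = 6
W = np.triu(rng.normal(scale=0.5, size=(n, n)), 1)
W = W + W.T
b = rng.normal(scale=0.5, size=n)

trace = gibbs_chain(b, W, n_steps=2000, rng=rng)
acf = autocorrelation(trace, max_lag=20)
print(np.round(acf[:5], 3))
```

An autocorrelation that stays high over many lags suggests slow mixing, in which case sample-based estimates of the variance term would be noisier than the nominal number of samples implies.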

Circularity Check

0 steps flagged

Empirical comparison with no definitional or self-citation reductions

The paper proposes a Gibbs + AIS inference procedure for higher-order Boltzmann machines, fits models, and reports an empirical bias-variance decomposition comparing hidden-layer and higher-order interaction models. The headline result (comparable total error, lower variance for higher-order models at small n) is presented as an observed experimental outcome rather than a quantity obtained by algebraic rearrangement of fitted parameters or by a self-citation chain. No equation is shown to equal its own input by construction, and no load-bearing premise rests on prior work by the same authors. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no information on free parameters, axioms, or invented entities can be extracted.

pith-pipeline@v0.9.0 · 5667 in / 943 out tokens · 23576 ms · 2026-05-25T14:01:56.964981+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We use a bias-variance decomposition... E[D_KL(P*, P̂_B)] = D_KL(P*, P*_B) + var(P*_B, B) via generalized Pythagorean theorem on the dually flat manifold"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "log-linear formulation... zeta function, Möbius function... θ(x) parameters on poset S(B)"
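The zeta and Möbius functions named in the second passage can be illustrated with a short Python sketch. It assumes the simplest possible poset, the Boolean lattice of bit subsets ordered by inclusion, which may differ from the paper's S(B); the distribution `p` is random and hypothetical. Möbius inversion recovers the parameters θ from log-probabilities, and the zeta transform (a sum over everything below x) rebuilds the log-probabilities from θ.

```python
import numpy as np

def subsets(mask):
    """Yield every bitmask s that is a subset of mask (including 0 and mask)."""
    s = mask
    while True:
        yield s
        if s == 0:
            break
        s = (s - 1) & mask

def theta_from_log_p(log_p):
    """Moebius inversion on the Boolean lattice of bitmasks."""
    theta = np.zeros_like(log_p)
    for x in range(len(log_p)):
        for s in subsets(x):
            # Sign is (-1)^(|x| - |s|); x ^ s holds the bits in x but not in s.
            sign = (-1) ** bin(x ^ s).count("1")
            theta[x] += sign * log_p[s]
    return theta

def log_p_from_theta(theta):
    """Zeta transform: log p(x) is the sum of theta(s) over all s below x."""
    return np.array(
        [sum(theta[s] for s in subsets(x)) for x in range(len(theta))]
    )

# Hypothetical strictly positive distribution over 3 binary variables.
rng = np.random.default_rng(0)
n_bits = 3
p = rng.dirichlet(np.ones(2 ** n_bits))

theta = theta_from_log_p(np.log(p))
recovered = np.exp(log_p_from_theta(theta))
print(np.round(theta, 3))
```

In this parameterization θ at the empty set equals log p of the all-zeros state, so it plays the role of the negative log normalizer, and restricting which θ(x) may be nonzero is how a maximum interaction order would be imposed.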

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

