
arxiv: 1906.12063 · v1 · pith:YRJBRQT7new · submitted 2019-06-28 · 📊 stat.ML · cs.LG

Bias-Variance Trade-Off in Hierarchical Probabilistic Models Using Higher-Order Feature Interactions

Pith reviewed 2026-05-25 14:01 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords bias-variance tradeoff · higher-order interactions · Boltzmann machine · hierarchical probabilistic models · inference algorithm · Gibbs sampling · annealed importance sampling

The pith

Higher-order interactions match hidden layers in total error but show lower variance for small training samples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares hidden layers and higher-order feature interactions inside hierarchical probabilistic models by measuring their bias-variance trade-offs. It develops a practical inference procedure that combines Gibbs sampling with annealed importance sampling for the log-linear higher-order Boltzmann machine. Bias-variance decomposition on fitted models then reveals that both modeling choices produce errors of similar magnitude. The decomposition further shows that higher-order interactions incur noticeably less variance when the number of training examples is small. These observations indicate that, for limited data, higher-order terms can serve as a lower-variance substitute for added hidden layers.
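The Gibbs half of the procedure can be illustrated with a minimal sketch. It assumes the common log-linear form p(x) ∝ exp(Σ_s θ_s Π_{i∈s} x_i) over binary vectors x, with one parameter θ_s per index subset s up to a chosen order. The function names (`make_interactions`, `gibbs_sweep`, `sample`) and the dictionary representation of θ are illustrative choices, not the authors' implementation.

```python
import math
import random
from itertools import combinations


def make_interactions(n, order):
    """All nonempty index subsets of size <= order (assumed parameter set)."""
    return [s for k in range(1, order + 1) for s in combinations(range(n), k)]


def gibbs_sweep(x, theta, rng):
    """Resample every variable once from its conditional given the others."""
    for i in range(len(x)):
        # Log-odds of x_i = 1: total weight of the subsets that contain i
        # and whose remaining members are all currently switched on.
        logit = sum(w for s, w in theta.items()
                    if i in s and all(x[j] for j in s if j != i))
        p_one = 1.0 / (1.0 + math.exp(-logit))
        x[i] = 1 if rng.random() < p_one else 0
    return x


def sample(theta, n, n_samples, burn_in=200, rng=None):
    """Run one Gibbs chain and return n_samples states after burn-in."""
    rng = rng or random.Random(0)
    x = [rng.randint(0, 1) for _ in range(n)]
    for _ in range(burn_in):
        gibbs_sweep(x, theta, rng)
    draws = []
    for _ in range(n_samples):
        gibbs_sweep(x, theta, rng)
        draws.append(tuple(x))
    return draws
```

Successive draws from a single chain are correlated, so in practice the samples would be thinned or checked for mixing before being used to estimate expectations.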

Core claim

Using the proposed Gibbs-plus-annealed-importance-sampling procedure, the authors fit both hidden-layer and higher-order Boltzmann machines and decompose their errors into bias and variance components. They report that the two families of models exhibit total errors of the same order of magnitude, while the higher-order-interaction models display lower variance when trained on smaller samples.

What carries the argument

Bias-variance decomposition performed on the log-linear higher-order Boltzmann machine after inference via Gibbs sampling combined with annealed importance sampling.
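This page does not reproduce the paper's exact decomposition, so the following is only a sketch of one standard KL-divergence form, taken here as an assumption: the expected error E[KL(p, q_D)] splits into KL(p, q̄) plus E[KL(q̄, q_D)], where q̄ is the normalized geometric mean of the fitted distributions q_D. The names `kl`, `geometric_mean_dist`, and `bias_variance` are hypothetical, and distributions are plain lists of probabilities over a finite state space.

```python
import math


def kl(p, q):
    """KL divergence KL(p || q) for discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def geometric_mean_dist(fits):
    """Normalized geometric mean of fitted distributions (all entries > 0)."""
    m = len(fits[0])
    log_mean = [sum(math.log(q[j]) for q in fits) / len(fits) for j in range(m)]
    z = sum(math.exp(v) for v in log_mean)
    return [math.exp(v) / z for v in log_mean]


def bias_variance(p_true, fits):
    """Return (bias, variance, total) with total == bias + variance."""
    q_bar = geometric_mean_dist(fits)
    bias = kl(p_true, q_bar)                                  # error of the centroid
    variance = sum(kl(q_bar, q) for q in fits) / len(fits)    # spread around it
    total = sum(kl(p_true, q) for q in fits) / len(fits)      # average error
    return bias, variance, total
```

Each element of `fits` would be the distribution of a model trained on a different resampled training set; the identity total = bias + variance then holds exactly for the empirical average over those fits.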

If this is right

  • Both hidden layers and higher-order interactions can achieve comparable generalization performance.
  • When training data are scarce, higher-order interactions are expected to produce more stable predictions.
  • The inference algorithm makes systematic bias-variance studies feasible for models with explicit higher-order terms.
  • Model designers can trade depth for explicit feature interactions without large changes in overall error.
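To make the trade between hidden units and explicit interactions concrete, the sketch below counts parameters under two assumptions that may differ from the paper's accounting: the restricted Boltzmann machine is fully bipartite with one bias per unit, and the higher-order model has one parameter for every nonempty subset of visible units up to the chosen order. The function names are illustrative.

```python
from math import comb


def rbm_params(n_visible, n_hidden):
    """Weights of a fully bipartite RBM plus one bias per unit."""
    return n_visible * n_hidden + n_visible + n_hidden


def hbm_params(n_visible, order):
    """One parameter per nonempty subset of visible units of size <= order."""
    return sum(comb(n_visible, k) for k in range(1, order + 1))


if __name__ == "__main__":
    print(rbm_params(10, 5))   # 10*5 + 10 + 5 = 65
    print(hbm_params(10, 2))   # 10 + 45 = 55
    print(hbm_params(10, 3))   # 10 + 45 + 120 = 175
```

Under these assumptions the higher-order count grows combinatorially with the interaction order, which is why comparisons at a matched number of parameters are informative.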

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same variance-reduction pattern may appear in other exponential-family models that admit explicit higher-order potentials.
  • For very large datasets the variance advantage is likely to shrink as bias terms become dominant in both approaches.
  • Practical implementations could test whether the observed trade-off persists when the same model class is trained with modern variational methods instead of the proposed sampler.

Load-bearing premise

The bias-variance decomposition on the fitted models accurately reflects their true generalization behavior and the inference algorithm supplies parameter estimates accurate enough for that decomposition to be meaningful.

What would settle it

An experiment on a fresh small-sample dataset in which the measured variance of higher-order-interaction models exceeds the variance of hidden-layer models would falsify the reported variance advantage.

Figures

Figures reproduced from arXiv: 1906.12063 by Mahito Sugiyama, Simon Luo.

Figure 1: Example of Boltzmann machine modeling high
Figure 2: An illustration of the decomposition of the bias
Figure 3: Empirical evaluation of the error generated from the bias and variance of the RBM
Figure 4: Empirical evaluation of the error generated from the bias and variance for the HBM
Figure 5: Empirical evaluation of the error generated from the bias and variance for varying hidden nodes in the RBM
Figure 6: Empirical evaluation of the error generated from the bias and variance for varying order of interactions in the HBM
Figure 7: Empirical evaluation of the error generated from the bias and variance for varying sample size in the RBM
Figure 8: Empirical evaluation of the error generated from the bias and variance for varying sample size in the HBM
Figure 9: Comparing empirical error in model for the HBM with RBM against the number of model parameters
original abstract

Hierarchical probabilistic models are able to use a large number of parameters to create a model with a high representation power. However, it is well known that increasing the number of parameters also increases the complexity of the model which leads to a bias-variance trade-off. Although it is a classical problem, the bias-variance trade-off between hidden layers and higher-order interactions have not been well studied. In our study, we propose an efficient inference algorithm for the log-linear formulation of the higher-order Boltzmann machine using a combination of Gibbs sampling and annealed importance sampling. We then perform a bias-variance decomposition to study the differences in hidden layers and higher-order interactions. Our results have shown that using hidden layers and higher-order interactions have a comparable error with a similar order of magnitude and using higher-order interactions produce less variance for smaller sample size.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a Gibbs sampling + annealed importance sampling inference procedure for the log-linear higher-order Boltzmann machine and uses it to perform a bias-variance decomposition comparing hidden-layer models against higher-order interaction models. The central empirical claim is that the two model classes achieve comparable total error of similar magnitude, while higher-order interactions exhibit lower variance at small sample sizes.

Significance. If the inference procedure recovers parameters sufficiently accurately for the decomposition to be meaningful, the result would provide guidance on model-class choice in small-data regimes for hierarchical probabilistic models. The work also supplies an inference algorithm whose practical utility would be strengthened by explicit accuracy validation.

major comments (2)
  1. [Inference algorithm and experimental sections] The headline bias-variance comparison is only interpretable if the Gibbs+AIS procedure yields sufficiently accurate parameter estimates (especially for the combinatorially many higher-order parameters) in the small-n regime where the variance advantage is asserted. No recovery experiments on synthetic data, mixing diagnostics, or comparison against exact inference on tractable instances are described to support this assumption.
  2. [Experimental results] The abstract (and therefore the experimental claims) provides no information on experimental design, the precise sample sizes tested, the number of hidden units versus interaction order, number of replicates, or statistical significance of the reported variance difference. Without these controls the claim that higher-order interactions “produce less variance for smaller sample size” cannot be evaluated.
minor comments (1)
  1. Notation for the higher-order log-linear model and the precise form of the bias-variance decomposition should be stated explicitly with equation numbers.
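The comparison against exact inference requested in the first major comment can be sketched on an instance small enough to enumerate. The code below is an assumption-laden illustration, not the authors' algorithm: it uses a uniform base distribution, a linear temperature schedule, and one Gibbs sweep per temperature, and all function names are hypothetical.

```python
import math
import random
from itertools import product


def f(x, theta):
    """Unnormalized log-probability: weights of subsets that are fully on."""
    return sum(w for s, w in theta.items() if all(x[i] for i in s))


def exact_log_z(theta, n):
    """Log partition function by enumerating all 2**n states."""
    vals = [f(x, theta) for x in product((0, 1), repeat=n)]
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))


def gibbs_sweep(x, theta, beta, rng):
    """One Gibbs sweep targeting p_beta(x) proportional to exp(beta * f(x))."""
    for i in range(len(x)):
        logit = beta * sum(w for s, w in theta.items()
                           if i in s and all(x[j] for j in s if j != i))
        x[i] = 1 if rng.random() < 1.0 / (1.0 + math.exp(-logit)) else 0


def ais_log_z(theta, n, n_chains=200, n_temps=100, seed=0):
    """Annealed importance sampling estimate of the log partition function."""
    rng = random.Random(seed)
    betas = [t / n_temps for t in range(n_temps + 1)]
    log_w = []
    for _ in range(n_chains):
        x = [rng.randint(0, 1) for _ in range(n)]   # exact draw from uniform base
        lw = 0.0
        for b_prev, b in zip(betas, betas[1:]):
            lw += (b - b_prev) * f(x, theta)        # importance-weight update
            gibbs_sweep(x, theta, b, rng)           # move under the new target
        log_w.append(lw)
    m = max(log_w)
    log_mean_w = m + math.log(sum(math.exp(v - m) for v in log_w) / n_chains)
    return n * math.log(2.0) + log_mean_w           # base log Z is n * log 2
```

Calling `ais_log_z(theta, n)` and `exact_log_z(theta, n)` with the same parameters on a handful of small models gives a direct check of estimator accuracy before the sampler is trusted on models too large to enumerate.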

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments, which highlight important aspects for strengthening the manuscript. We address the major comments below and will incorporate revisions to improve the clarity and rigor of the work.

point-by-point responses
  1. Referee: The headline bias-variance comparison is only interpretable if the Gibbs+AIS procedure yields sufficiently accurate parameter estimates (especially for the combinatorially many higher-order parameters) in the small-n regime where the variance advantage is asserted. No recovery experiments on synthetic data, mixing diagnostics, or comparison against exact inference on tractable instances are described to support this assumption. (Inference algorithm and experimental sections)

    Authors: We agree with the referee that validating the accuracy of the Gibbs sampling combined with annealed importance sampling (AIS) inference procedure is essential to ensure the bias-variance decomposition is meaningful, particularly given the large number of higher-order parameters. The original manuscript did not include synthetic recovery experiments or mixing diagnostics. In the revised manuscript, we will add a section on inference validation, including parameter recovery on synthetic data, trace plots or autocorrelation for mixing, and comparisons to exact inference on small tractable models. This will support the reliability of the results in the small-sample regime. revision: yes

  2. Referee: The abstract (and therefore the experimental claims) provides no information on experimental design, the precise sample sizes tested, the number of hidden units versus interaction order, number of replicates, or statistical significance of the reported variance difference. Without these controls the claim that higher-order interactions “produce less variance for smaller sample size” cannot be evaluated. (Experimental results)

    Authors: We acknowledge that the abstract is missing key details on the experimental design, which makes it difficult to fully evaluate the claims. We will revise the abstract to include specifics such as the range of sample sizes tested (e.g., small n regimes), configurations for hidden units and interaction orders, the number of replicates performed, and any statistical significance testing for the variance differences. These details will also be expanded in the main experimental section for completeness. revision: yes
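The mixing diagnostics promised in the first response could take the form of a simple autocorrelation check on a scalar summary of the chain, such as the state of one unit or the unnormalized log-probability. The sketch below is one hedged possibility: the truncation rule (stop at the first non-positive autocorrelation) is a common heuristic chosen here for illustration, and the function names are hypothetical.

```python
def autocorrelation(chain, lag):
    """Sample autocorrelation of a scalar chain at the given lag."""
    n = len(chain)
    mean = sum(chain) / n
    var = sum((v - mean) ** 2 for v in chain) / n
    if var == 0.0:
        return 1.0 if lag == 0 else 0.0
    cov = sum((chain[t] - mean) * (chain[t + lag] - mean)
              for t in range(n - lag)) / n
    return cov / var


def effective_sample_size(chain, max_lag=100):
    """Chain length divided by the integrated autocorrelation time."""
    tau = 1.0
    for lag in range(1, min(max_lag, len(chain) - 1) + 1):
        rho = autocorrelation(chain, lag)
        if rho <= 0.0:          # heuristic truncation of the sum
            break
        tau += 2.0 * rho
    return len(chain) / tau
```

An effective sample size far below the chain length signals slow mixing, in which case the Monte Carlo error of any estimated expectation is larger than the raw sample count suggests.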

Circularity Check

0 steps flagged

Empirical comparison with no definitional or self-citation reductions

full rationale

The paper proposes a Gibbs + AIS inference procedure for higher-order Boltzmann machines, fits models, and reports an empirical bias-variance decomposition comparing hidden-layer and higher-order interaction models. The headline result (comparable total error, lower variance for higher-order models at small n) is presented as an observed experimental outcome rather than a quantity obtained by algebraic rearrangement of fitted parameters or by a self-citation chain. No equation is shown to equal its own input by construction, and no load-bearing premise rests on prior work by the same authors. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no information on free parameters, axioms, or invented entities can be extracted.



