
arxiv: 2301.09428 · v2 · submitted 2023-01-23 · 💻 cs.LG · cond-mat.dis-nn · cond-mat.stat-mech

Explaining the effects of non-convergent sampling in the training of Energy-Based Models

Pith reviewed 2026-05-24 10:16 UTC · model grok-4.3

classification 💻 cs.LG · cond-mat.dis-nn · cond-mat.stat-mech
keywords Energy-based models · non-convergent sampling · Markov chains · dynamical process · empirical statistics · diffusion models · Boltzmann machine

The pith

EBMs trained with non-persistent short Markov chain runs reproduce empirical data statistics through a precise dynamical process rather than equilibrium convergence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows analytically that Energy-Based Models trained by estimating gradients with short, non-convergent Markov chains starting from random points can exactly match a set of empirical statistics from the data. This match happens through the specific dynamics created by the incomplete sampling, not by converging to the model's equilibrium distribution. A reader would care because the result explains why short-run sampling strategies produce high-quality samples efficiently in practice. It also supplies the analytical basis for treating EBMs as diffusion-style models.

Core claim

EBMs trained with non-persistent short runs to estimate the gradient can perfectly reproduce a set of empirical statistics of the data, not at the level of the equilibrium measure, but through a precise dynamical process. The authors derive this for generic EBMs, work it out explicitly in two solvable models, and verify the predictions numerically on a ConvNet EBM and a Boltzmann machine.
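
The dynamical matching can be illustrated in a one-parameter toy model. The sketch below is not one of the paper's two solvable models. It assumes a 1D quadratic energy E_a(x) = a x²/2, zero-mean data with variance `var_data`, unadjusted Langevin updates with step size `eps`, and chains restarted from N(0, `v0`). Under these assumptions the variance after k steps obeys an exact recursion, so the short-run gradient can be evaluated without sampling noise. All names and numerical values are illustrative.

```python
def short_run_variance(a, k, eps, v0):
    """Variance of x after k Langevin steps
    x <- x - eps * a * x + sqrt(2 * eps) * noise, starting from x ~ N(0, v0).
    Exact for the quadratic energy E(x) = a * x**2 / 2."""
    r = 1.0 - eps * a
    v = v0
    for _ in range(k):
        v = r * r * v + 2.0 * eps
    return v


def train_short_run(var_data, k, eps, v0, a_init=1.0, lr=5.0, n_steps=2000):
    """Gradient ascent on the log-likelihood, with the model average <x^2>
    replaced by its value after k non-convergent steps."""
    a = a_init
    for _ in range(n_steps):
        # d log p / d a = -x^2 / 2 + <x^2>_model / 2, averaged over the data
        grad = 0.5 * (short_run_variance(a, k, eps, v0) - var_data)
        a += lr * grad
    return a


var_data, k, eps, v0 = 0.25, 10, 0.01, 1.0
a_star = train_short_run(var_data, k, eps, v0)

v_short = short_run_variance(a_star, k, eps, v0)  # statistics of the k-step process
v_equilibrium = 1.0 / a_star                      # variance of exp(-E) / Z

print(f"trained a            = {a_star:.4f}")
print(f"k-step variance      = {v_short:.6f} (data: {var_data})")
print(f"equilibrium variance = {v_equilibrium:.6f}")
```

In this toy setting the trained parameter makes the k-step variance equal to the data variance, while the equilibrium variance 1/a of the same model stays well below it, which is the qualitative behavior the core claim describes.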

What carries the argument

Non-persistent short Markov chain runs that begin from random initial conditions and induce a dynamical process whose statistics are encoded exactly into the trained parameters.
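
A minimal sketch of such a non-persistent short run, assuming unadjusted Langevin dynamics and standard Gaussian noise as the random initial condition. The function name `short_run_langevin`, the quadratic energy, and all numerical values are hypothetical choices for illustration, not the paper's setup.

```python
import numpy as np


def short_run_langevin(grad_energy, n_chains, dim, k, eps, rng):
    """Run n_chains independent k-step Langevin chains, each restarted
    from fresh Gaussian noise (non-persistent initialization)."""
    x = rng.standard_normal((n_chains, dim))  # random, data-independent start
    for _ in range(k):
        noise = rng.standard_normal(x.shape)
        x = x - eps * grad_energy(x) + np.sqrt(2.0 * eps) * noise
    return x


# Hypothetical quadratic energy E(x) = a * |x|^2 / 2, so grad E(x) = a * x.
a = 9.0
rng = np.random.default_rng(0)
samples = short_run_langevin(lambda x: a * x, n_chains=200_000, dim=1,
                             k=10, eps=0.01, rng=rng)

print("sample shape:", samples.shape)
print("variance after k steps:", samples.var())
print("equilibrium variance 1/a:", 1.0 / a)
```

Because every call draws new initial points, nothing is carried over between gradient estimates; the chain length `k` and the initial distribution are the only things that define the dynamical process being fitted.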

If this is right

  • EBMs become usable as diffusion models because the dynamical encoding replaces the need for equilibrium sampling.
  • Short runs from random starts constitute an efficient, high-quality sampling method whose effect is now explained from first principles.
  • The effect can be computed in closed form for solvable models, giving exact predictions for how parameters shift under non-convergent training.
  • Numerical checks on neural-network EBMs confirm that the dynamical matching holds beyond the solvable cases.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same dynamical-matching principle could be applied to other approximate-sampling generative models to reduce the cost of training.
  • Training objectives might be redesigned to optimize the dynamical statistics directly instead of an equilibrium loss.
  • The link to diffusion models raises the question of whether EBMs inherit convergence or mixing guarantees known for diffusion processes.

Load-bearing premise

The short runs must begin from random initial conditions so that the incomplete sampling dynamics alone, without equilibrium convergence, exactly capture the empirical statistics.

What would settle it

Train an EBM with the described short runs, then compare two sets of samples from the same trained model: samples generated by the same short-run procedure, and samples from long, equilibrated chains. If the short-run samples reproduce the data statistics while the equilibrated samples do not, the claim is supported.
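
This comparison can be sketched in a toy setting where everything is computable exactly. The example assumes a 1D quadratic energy E_a(x) = a x²/2, zero-mean data, Langevin steps of size `eps`, and chains started from N(0, `v0`); the parameter is fitted by bisection so that the k-step variance matches the data, and the generation time k' is then swept. The model, the helper names, and the numbers are assumptions made for illustration only.

```python
def variance_after(a, n_steps, eps, v0):
    """Exact variance after n_steps Langevin updates for E(x) = a * x**2 / 2,
    starting from x ~ N(0, v0)."""
    r = 1.0 - eps * a
    v = v0
    for _ in range(n_steps):
        v = r * r * v + 2.0 * eps
    return v


def fit_short_run(var_data, k, eps, v0, iters=200):
    """Find a such that the k-step variance equals var_data.
    The k-step variance decreases monotonically in a on (0, 1/eps)."""
    lo, hi = 1e-6, 1.0 / eps
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if variance_after(mid, k, eps, v0) > var_data:
            lo = mid  # variance still too large: increase a
        else:
            hi = mid
    return 0.5 * (lo + hi)


var_data, k, eps, v0 = 0.25, 10, 0.01, 1.0
a_fit = fit_short_run(var_data, k, eps, v0)

# Error on the second moment as a function of the generation time k'.
errors = {kp: abs(variance_after(a_fit, kp, eps, v0) - var_data)
          for kp in range(1, 201)}
best_kp = min(errors, key=errors.get)

print("generation time with smallest error:", best_kp)
print("error at k' = k:  ", errors[k])
print("error at k' = 200:", errors[200])
```

In this sketch the error vanishes only at k' = k and saturates at a finite value for long chains, which is the pattern the proposed test looks for.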

Figures

Figures reproduced from arXiv: 2301.09428 by Aurélien Decelle, Beatriz Seoane, Elisabeth Agoritsas, Giovanni Catania.

Figure 1. Left: Evolution of $J^{(k)}_\alpha(t)$ for different values of $k$ (different colors) and initial conditions for $J^{(k)}_\alpha(0)$. Right: Convergence for a large training time $t$, where the higher $k$, the faster the convergence; dash-dotted lines are the numerical integration of Eq. (18), full lines show the exponential fit $J^{(k)}_\alpha(t) - J^{(k),\infty}_\alpha \sim A \exp(-t/\tau)$ with $\tau$ given by Eq. (20). Inset: the asymptotic values $J^{(k)}_\alpha$…
Figure 2. Left: Numerical solution of the learning dynamics in the presence of two modes. The dashed lines show the solution using convergent dynamics ($k \to \infty$), while the solid lines correspond to $k = 1$. Right: (inset) evolution of $J^{(k)}_\alpha(t)$ for $k = 1$ at various stages of the learning. (Main) The error on the correlation function, $E^{(2)}$ = ( …
Figure 3. Generation results obtained with a ConvNet EBM trained on CIFAR-10 using $k = 150$ Langevin MCMC sampling steps from random initial conditions. Left: We use the Fréchet Inception Distance (FID) score to evaluate the generation quality as a function of the sampling time $k'$. The different colors correspond to different training epochs. We see again that the best score is achieved at $k' = k$ (corresponding to…
Figure 4. Left: Error over the covariance matrices, $E^{(2)} = \sum_{i<j} \left( \langle x_i x_j \rangle_{k',p_0} - \langle x_i x_j \rangle_{p_D} \right)^2 / \binom{N}{2}$, between the training set and the data generated at different learning ages vs the generation time $k'$, for a BM trained with data sampled from a 2D ferromagnetic Ising model with $N = 7^2$ at $k = 5$. Similarly to …
Figure 5. Error over the eigenvalues of the covariance matrix generated after $k'$ steps of MCMC vs the dataset covariance matrix. The setting is the same as in …
Original abstract

In this paper, we quantify the impact of using non-convergent Markov chains to train Energy-Based models (EBMs). In particular, we show analytically that EBMs trained with non-persistent short runs to estimate the gradient can perfectly reproduce a set of empirical statistics of the data, not at the level of the equilibrium measure, but through a precise dynamical process. Our results provide a first-principles explanation for the observations of recent works proposing the strategy of using short runs starting from random initial conditions as an efficient way to generate high-quality samples in EBMs, and lay the groundwork for using EBMs as diffusion models. After explaining this effect in generic EBMs, we analyze two solvable models in which the effect of the non-convergent sampling in the trained parameters can be described in detail. Finally, we test these predictions numerically on a ConvNet EBM and a Boltzmann machine.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that EBMs trained with non-persistent short Markov chain runs (starting from random initial conditions) to estimate gradients can exactly reproduce a chosen set of empirical statistics of the data via the finite-step dynamics of those runs, rather than by matching the equilibrium measure. An analytical derivation is presented for generic EBMs, followed by exact solutions for two solvable models that make the effect explicit, and numerical validation on a ConvNet EBM and a Boltzmann machine. The work positions this as an explanation for the success of short-run sampling strategies and as groundwork for EBMs as diffusion models.

Significance. If the central analytical claim holds under its assumptions, the result supplies a first-principles account of why short non-convergent chains succeed both in training EBMs and in generating samples from them, together with concrete solvable cases and numerical checks. These elements constitute a genuine contribution that could inform both theory and practice in energy-based and diffusion-style models.

major comments (2)
  1. [generic EBMs section] The exact analytical result for generic EBMs is stated to follow from the training dynamics of short runs. However, the derivation assumes that each chain begins from random, data-independent initial conditions, and the manuscript does not justify whether the result still holds when the initial distribution is data-dependent or when the chain length varies. This assumption is load-bearing for the generality claim.
  2. [solvable models sections] Exact solutions are presented for two models, but the manuscript reports no error analysis or sensitivity study with respect to finite chain length or initialization variance. Without this, it is unclear whether the claimed exact reproduction remains robust outside the idealized limits used in the derivations.
minor comments (2)
  1. The abstract should briefly specify which empirical statistics are exactly reproduced by the dynamical process.
  2. [numerical experiments section] Figure captions and text should state the precise chain lengths, initialization distributions, and number of gradient steps used in the ConvNet and Boltzmann machine experiments to support reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments and the positive assessment of the significance of our work. We address the major comments below and will revise the manuscript accordingly to improve clarity and robustness.

  1. Referee: [generic EBMs section] The exact analytical result for generic EBMs is stated to follow from the training dynamics of short runs. However, the derivation assumes that each chain begins from random, data-independent initial conditions, and the manuscript does not justify whether the result still holds when the initial distribution is data-dependent or when the chain length varies. This assumption is load-bearing for the generality claim.

    Authors: The derivation in the generic EBMs section is developed specifically for short runs initialized from random, data-independent initial conditions, which is the setting used in the non-persistent short-run training strategies that the paper seeks to explain. This assumption is stated in the manuscript and is central to the result, since different initializations would lead to different dynamics. We do not claim that the result holds for data-dependent initial distributions. To address the concern, we will revise the text to state the scope of the assumption more explicitly and to briefly justify the focus on data-independent random initializations, namely that this matches the practical strategy whose success we aim to account for. Regarding variation in chain length, the result holds for any fixed finite length under the stated conditions.
    Revision: partial

  2. Referee: [solvable models sections] Exact solutions are presented for two models, but the manuscript reports no error analysis or sensitivity study with respect to finite chain length or initialization variance. Without this, it is unclear whether the claimed exact reproduction remains robust outside the idealized limits used in the derivations.

    Authors: We acknowledge that, although the solvable models allow exact derivations in specific cases, an analysis of robustness to variations in chain length and initialization would be beneficial. In the revised version, we will add a sensitivity study that combines analytical considerations, where feasible, with numerical experiments quantifying the deviation as chain length and initialization variance change. This will help demonstrate the robustness of the effect beyond the idealized limits.
    Revision: yes

Circularity Check

0 steps flagged

No circularity: analytical derivation from short-run Markov dynamics is self-contained

full rationale

The paper derives its central claim analytically from the explicit form of the gradient estimator using finite-length Markov chains initialized at random (data-independent) noise. No step reduces a prediction to a fitted parameter by construction, invokes a self-citation as the sole justification for a uniqueness claim, or renames an input statistic as an output. The assumption on initial conditions is stated explicitly in the abstract and generic-EBM section rather than smuggled in; the subsequent solvable-model sections simply instantiate the same derivation. The result therefore stands or falls on the correctness of the dynamical calculation itself, not on any definitional loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review yields no explicit free parameters, invented entities, or ad-hoc axioms; the derivation rests on standard properties of Markov chains and gradient estimation in EBMs.

axioms (1)
  • [standard math] Markov chain Monte Carlo sampling reaches a stationary distribution under standard conditions.
    Invoked when contrasting convergent vs. non-convergent short runs.

pith-pipeline@v0.9.0 · 5705 in / 1205 out tokens · 29586 ms · 2026-05-24T10:16:40.223082+00:00 · methodology

