Activation Functions, Statistics and Learning of Higher-Order Interactions in Restricted Boltzmann Machines
Pith reviewed 2026-05-20 06:56 UTC · model grok-4.3
The pith
RBMs using exponential activation can represent and learn strong higher-order interactions within an analytically determined parameter range.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The space of models representable by an RBM is fully characterized by the moments of the distribution of interactions induced on the visible variables; for exponential activations this distribution acquires a tail that permits large higher-order terms within a specific, analytically fixed range of the hidden-unit bias and weight scale.
What carries the argument
The duality mapping an RBM ensemble to an effective model of interacting binary variables, with the distribution of induced couplings characterized by its low-order moments.
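To make the duality concrete, here is a minimal sketch; it is not the paper's code. It assumes visible units take values in {0, 1}, that summing out hidden unit mu contributes a term K(sum_i w[i, mu] * v[i] - c[mu]) to the log-probability of a visible configuration, and that the induced interaction among a set of visible units follows from Moebius inversion (inclusion-exclusion) of that term. The name `induced_interaction`, the choices K(x) = x^2/2 and K(x) = exp(x), and the sign convention for the bias c are illustrative assumptions.

```python
from itertools import combinations

import numpy as np


def induced_interaction(K, W, c, subset):
    """Interaction among the visible units in `subset`, summed over hidden units.

    K      : function applied elementwise to hidden-unit inputs (assumed form)
    W      : (N, M) array of weights between visible i and hidden mu
    c      : (M,) array of hidden biases, entering the input as -c (assumption)
    subset : tuple of visible-unit indices
    """
    s = len(subset)
    total = 0.0
    # Inclusion-exclusion over all sub-subsets T: v_i = 1 on T, 0 elsewhere.
    for n in range(s + 1):
        for T in combinations(subset, n):
            idx = np.array(T, dtype=int)
            x = W[idx, :].sum(axis=0) - c      # input to each hidden unit
            total += (-1) ** (s - n) * K(x).sum()
    return total


# Hypothetical small ensemble: 4 visible units, 3 hidden units.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
c = rng.normal(size=3)


def K_quad(x):
    return 0.5 * x ** 2


pair_quad = induced_interaction(K_quad, W, c, (0, 1))
triple_quad = induced_interaction(K_quad, W, c, (0, 1, 2))
triple_exp = induced_interaction(np.exp, W, c, (0, 1, 2))
print(pair_quad, triple_quad, triple_exp)
```

Under these assumptions the quadratic K induces only pairwise couplings (the sum over hidden units of w_i w_j), whereas the exponential K induces couplings of every order, each a sum over hidden units of exp(-c) times a product of (exp(w) - 1) factors.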
If this is right
- Data structures generated by strong higher-order interactions remain hard to represent for Linear, Step, and ReLU activations at any parameter value.
- Exponential activation enlarges the representable set precisely when the hidden-unit bias and coupling scale lie inside the derived interval.
- Quantitative agreement between moment calculations and observed learning trajectories holds across the four activations tested.
- Optimal parameter choices for exponential units can be read directly from the analytic expressions without numerical search.
Where Pith is reading between the lines
- Similar rapidly growing activations could be substituted for the exponential in deeper architectures to improve capture of multi-body correlations.
- The moment characterization offers a diagnostic for whether a given dataset is likely to be learnable by a given RBM before training begins.
- The same duality lens may be applied to other energy-based models to predict which activation choices favor higher-order statistics.
Load-bearing premise
The duality between RBMs and models of interacting binary variables fully determines the representable distributions through the moments of the induced interactions.
What would settle it
Training an exponential RBM on synthetic data generated from a model with large three-body or higher couplings and checking whether the recovered effective interactions match the predicted moments only inside the analytically derived bias-and-scale window.
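The data side of this test can be sketched as follows. This is a toy illustration only: it does not train an RBM. It assumes three binary variables in {0, 1} with log p(v) = h * (v1 + v2 + v3) + J3 * v1 * v2 * v3 - log Z, draws samples by exact enumeration, and recovers the three-body coupling from empirical log-frequencies by Moebius inversion. The function names and the values of h, J3, and the sample size are hypothetical.

```python
from itertools import product

import numpy as np


def three_body_model(h, J3):
    """Exact state probabilities for log p(v) = h*sum(v) + J3*v1*v2*v3 - log Z."""
    states = np.array(list(product([0, 1], repeat=3)))
    log_w = h * states.sum(axis=1) + J3 * states.prod(axis=1)
    p = np.exp(log_w - log_w.max())
    return states, p / p.sum()


def recover_three_body(states, p):
    """Moebius inversion of log p: J3 = sum_v (-1)^(3 - |v|) * log p(v)."""
    signs = (-1.0) ** (3 - states.sum(axis=1))
    return float((signs * np.log(p)).sum())


rng = np.random.default_rng(1)
states, p_true = three_body_model(h=0.0, J3=2.0)

# Sample synthetic data and estimate state frequencies.
draws = rng.choice(len(states), size=200_000, p=p_true)
p_hat = np.bincount(draws, minlength=len(states)) / draws.size

J3_exact = recover_three_body(states, p_true)   # from exact probabilities
J3_hat = recover_three_body(states, p_hat)      # from sampled frequencies
print(J3_exact, J3_hat)
```

In a full experiment the same inversion would be applied to the trained model's effective log-probability rather than to raw frequencies, and the recovered coupling compared inside and outside the predicted parameter window.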
Original abstract
The great success of neural networks in recognizing hidden patterns and correlations in complex data lies in the way they take advantage of the large number of parameters and nonlinear single-unit activation, jointly. Restricted Boltzmann Machines (RBMs) provide a simple yet powerful framework for studying the impact of activation nonlinearities on performance and representation. In this work, we exploit the duality between RBMs and models of interacting binary variables to study the statistics of the interactions induced by RBM ensembles with different hidden unit activation functions. We characterize the space of representable models analytically in terms of moments of the distribution of induced interactions for four commonly used activation functions: Linear, Step, ReLU, and Exponential. Quantitative predictions of the analytical calculations on learning show a very good agreement with results of the simulations of the training process. In particular, our analysis shows that there are certain data structures, namely those generated by models of interacting variables with large interaction terms beyond pairwise, that are difficult to represent, and thus to learn, for any RBM. Yet, we find that rapidly increasing nonlinearities, such as the Exponential function, can facilitate the representation and learning of such data structures for a specific range of parameters that is determined analytically.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper exploits the RBM-Ising duality to map activation functions (Linear, Step, ReLU, Exponential) to distributions of induced couplings in an effective model of interacting binary variables. It analytically computes the moments of these distributions to characterize the space of representable models and identifies an analytically determined parameter window for the Exponential activation in which higher-order moments become large enough to represent data structures with strong multi-body interactions. Quantitative predictions from the moment analysis are reported to agree well with direct simulations of the training dynamics.
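As a rough illustration of what such a moment calculation looks like, the sketch below compares closed-form ensemble moments with a Monte Carlo estimate for exponential units. It rests on assumptions that are not taken from the paper: weights are i.i.d. Gaussian with standard deviation sigma, the bias c is fixed, visible units are in {0, 1}, and an order-s interaction is a sum over M hidden units of exp(-c) times a product of s factors (exp(w) - 1). With gamma_1 = E[exp(w) - 1] and gamma_2 = E[(exp(w) - 1)^2], these assumptions give a mean M * exp(-c) * gamma_1^s and a relative variance ((gamma_2 / gamma_1^2)^s - 1) / M, which has the same form as the expressions quoted from the paper further down this page. All function names are hypothetical.

```python
import numpy as np


def gammas(sigma):
    """gamma_1 = E[e^w - 1] and gamma_2 = E[(e^w - 1)^2] for w ~ N(0, sigma^2)."""
    g1 = np.exp(sigma ** 2 / 2) - 1
    g2 = np.exp(2 * sigma ** 2) - 2 * np.exp(sigma ** 2 / 2) + 1
    return g1, g2


def exp_moments(M, s, sigma, c):
    """Closed-form mean and relative variance of an order-s interaction."""
    g1, g2 = gammas(sigma)
    mean = M * np.exp(-c) * g1 ** s
    rel_var = ((g2 / g1 ** 2) ** s - 1) / M
    return mean, rel_var


def sample_interactions(M, s, sigma, c, n_samples, rng):
    """Draw n_samples independent ensembles and return their order-s interaction."""
    w = rng.normal(0.0, sigma, size=(n_samples, s, M))
    per_hidden = np.exp(-c) * np.prod(np.exp(w) - 1, axis=1)
    return per_hidden.sum(axis=1)


rng = np.random.default_rng(0)
M, s, sigma, c = 10, 2, 0.5, 0.0   # illustrative values only
mean_theory, rel_var_theory = exp_moments(M, s, sigma, c)
samples = sample_interactions(M, s, sigma, c, 200_000, rng)
print("mean (theory, sampled):", mean_theory, samples.mean())
print("relative variance (theory):", rel_var_theory)
```

The relative variance grows geometrically with the interaction order s under these assumptions, which is one way to see why the induced-coupling distribution develops a heavy tail for exponential units.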
Significance. If the central mapping holds, the work supplies a concrete analytical handle on how activation nonlinearities control the capacity to encode higher-order statistics, which is a load-bearing issue for understanding representation power in energy-based models. The explicit parameter range for Exponential activations and the reported agreement between analytics and simulations constitute falsifiable, reproducible elements that could inform activation choice in RBMs and related architectures.
major comments (2)
- [§3] Moment characterization: The claim that the space of representable models is fully delineated by the first few moments of the induced-interaction distribution assumes that moment matching suffices to guarantee reproduction of arbitrary higher-order statistics. For the Exponential activation, whose induced couplings are expected to be non-Gaussian, it is not shown whether residual correlations or higher cumulants outside the reported moments can still prevent the effective Hamiltonian from capturing the target multi-body terms; a bound or explicit counter-example would be needed to secure this step.
- [§4] Simulation validation: The parameter window for Exponential is derived from the same moment calculations used to define the representable set; the reported agreement with training simulations therefore does not constitute an independent test of whether the moment truncation actually enlarges the reachable model space beyond what ReLU or Step functions achieve.
minor comments (2)
- [§2] Notation for the induced coupling distribution should be introduced once and used consistently; the transition from the RBM energy to the effective Ising Hamiltonian is described in two places with slightly different symbols.
- [Figure 3] The error bars on the learning curves for the Exponential case overlap with the ReLU curves in the reported regime; a statistical test or larger sample size would clarify whether the claimed advantage is significant.
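One way to run the kind of statistical test requested for Figure 3 is a two-sided permutation test on the difference in mean final score between activations. The sketch below is a generic illustration: the function name is hypothetical, and the score lists are made-up placeholder numbers, not data from the paper.

```python
import numpy as np


def permutation_test(a, b, n_perm=10_000, seed=0):
    """Two-sided permutation test for a difference in means between a and b."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)          # shuffle group labels
        diff = perm[:a.size].mean() - perm[a.size:].mean()
        count += abs(diff) >= abs(observed)
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (count + 1) / (n_perm + 1)


# Placeholder scores from independent training runs (NOT data from the paper).
exp_runs = [0.81, 0.79, 0.84, 0.80, 0.83, 0.82]
relu_runs = [0.78, 0.80, 0.77, 0.79, 0.81, 0.76]
diff, p_value = permutation_test(exp_runs, relu_runs)
print(diff, p_value)
```

A permutation test makes no normality assumption, which suits small numbers of training runs; with only a handful of runs per activation, though, its resolution is limited and a larger sample remains the more direct remedy.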
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive report. The comments highlight important nuances in our moment-based characterization and the nature of our simulation validation. We address each major comment below, indicating where we will revise the manuscript for greater precision while defending the core contributions of the work.
Point-by-point responses
- Referee: [§3] Moment characterization: The claim that the space of representable models is fully delineated by the first few moments of the induced-interaction distribution assumes that moment matching suffices to guarantee reproduction of arbitrary higher-order statistics. For the Exponential activation, whose induced couplings are expected to be non-Gaussian, it is not shown whether residual correlations or higher cumulants outside the reported moments can still prevent the effective Hamiltonian from capturing the target multi-body terms; a bound or explicit counter-example would be needed to secure this step.
  Authors: We appreciate this observation. Our manuscript characterizes the space of representable models through the moments of the induced coupling distribution rather than claiming that matching the first few moments rigorously guarantees exact reproduction of arbitrary higher-order statistics. The moments quantify the expected magnitude of multi-body interactions; for the Exponential activation these higher moments grow rapidly inside the identified parameter window, indicating enhanced capacity for strong higher-order terms. We acknowledge that non-Gaussian features and residual cumulants could affect precise matching and will add a clarifying paragraph in §3 stating that the moment analysis provides a necessary indicator of representational capacity but is not proven sufficient for all target statistics. A rigorous bound or counter-example lies beyond the present scope.
  Revision: partial
- Referee: [§4] Simulation validation: The parameter window for Exponential is derived from the same moment calculations used to define the representable set; the reported agreement with training simulations therefore does not constitute an independent test of whether the moment truncation actually enlarges the reachable model space beyond what ReLU or Step functions achieve.
  Authors: We agree that the simulations are guided by the same analytical moment calculations and therefore do not furnish a fully independent test of the truncation's effect on reachable model space. The numerical results instead confirm that the analytically predicted window for the Exponential activation corresponds to measurably better learning of higher-order structures, while the same window yields no advantage for Linear, Step or ReLU activations. We will revise the discussion in §4 to emphasize that the simulations validate the practical utility of the moment-derived window rather than independently proving an enlargement of the model space.
  Revision: partial
- A rigorous bound or explicit counter-example showing whether higher cumulants or residual correlations can prevent the effective Hamiltonian from capturing target multi-body terms for the Exponential activation.
Circularity Check
Moment-based analytic characterization of RBM representable spaces is self-contained and externally validated
Full rationale
The paper derives the distribution of induced interactions and their moments directly from the RBM-Ising duality for each activation function (Linear, Step, ReLU, Exponential), obtains closed-form expressions for those moments, and identifies the parameter window for Exponential activation where higher-order moments become large. These analytic results are then compared quantitatively to independent Monte Carlo simulations of the training dynamics on synthetic data generated from models with strong higher-order terms. No equation reduces a prediction to a fitted parameter by construction, no load-bearing premise rests on a self-citation chain, and the duality is used only to map activations to interaction statistics rather than to presuppose the target result. The central claim therefore remains independent of its own outputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- parameter range for exponential activation
axioms (1)
- domain assumption: Duality between RBMs and models of interacting binary variables
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  unclear: Relation between the paper passage and the cited Recognition theorem.
  We exploit the duality between RBMs and models of interacting binary variables to study the statistics of the interactions induced by RBM ensembles with different hidden unit activation functions. We characterize the space of representable models analytically in terms of moments of the distribution of induced interactions
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean: costAlphaLog_high_calibrated_iff
  unclear: Relation between the paper passage and the cited Recognition theorem.
  For the Exponential activation function... I_Exp_s = M γ_1^s ⟨e^{-cμ}⟩ ... Δ_Exp_s = M_0^{-1} [(γ_2/γ_1^2)^s - 1]
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.