Activation Functions, Statistics and Learning of Higher-Order Interactions in Restricted Boltzmann Machines

Giovanni di Sarra; Yasser Roudi

arXiv: 2605.19178 · v1 · pith:IJWRSFZ6new · submitted 2026-05-18 · cond-mat.dis-nn · cond-mat.stat-mech · cs.LG · physics.data-an

Pith reviewed 2026-05-20 06:56 UTC · model grok-4.3

classification: cond-mat.dis-nn · cond-mat.stat-mech · cs.LG · physics.data-an

keywords: Restricted Boltzmann Machines · Activation Functions · Higher-Order Interactions · Induced Couplings · Binary Variable Models · Learning Dynamics · Exponential Nonlinearity · Moment Analysis

The pith

RBMs using exponential activation can represent and learn strong higher-order interactions within an analytically determined parameter range.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how activation functions shape what Restricted Boltzmann Machines can represent when modeling data with interactions beyond pairwise terms among binary variables. Exploiting the known duality to effective spin models, the authors derive the moments of the induced interaction distribution for linear, step, ReLU, and exponential units. They show analytically that large higher-order couplings are difficult for any RBM to capture, yet exponential nonlinearities open a usable window of parameters where such structures become representable. Direct comparison of these moment predictions with the outcome of gradient-based training confirms the analysis holds during learning.

Core claim

The space of models representable by an RBM is fully characterized by the moments of the distribution of interactions induced on the visible variables; for exponential activations this distribution acquires a tail that permits large higher-order terms within a specific, analytically fixed range of the hidden-unit bias and weight scale.

What carries the argument

The duality mapping an RBM ensemble to an effective model of interacting binary variables, with the distribution of induced couplings characterized by its low-order moments.
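
As a rough illustration of this mapping, the sketch below marginalizes a small exponential RBM by hand and reads off one induced interaction by inclusion-exclusion over the visible states. It is not the authors' code. It assumes 0/1 visible units and that each exponential hidden unit contributes $\exp(\sum_i w_{i\mu} v_i - c_\mu)$ to the visible log-weight, a form inferred from the expression $I^{\mathrm{Exp}}_s = M \gamma_1^s \langle e^{-c_\mu} \rangle$ quoted on this page. Visible biases are left out, since they only shift the one-body terms. The function names, the weight statistics and the chosen subset are hypothetical.

```python
import itertools
import numpy as np


def log_weight_exponential_rbm(v, W, c):
    """Unnormalised log-probability of a visible state v in {0, 1}^N.

    Assumption: every exponential hidden unit adds exp(input - c_mu).
    """
    v = np.asarray(v, dtype=float)
    return float(np.sum(np.exp(v @ W - c)))


def induced_interaction(subset, log_weight, n_visible):
    """Interaction among the units in `subset`, by inclusion-exclusion.

    Sums log_weight over all states whose active units form a subset of
    `subset`, with alternating signs (Moebius inversion on the subset lattice).
    """
    total = 0.0
    for r in range(len(subset) + 1):
        for active in itertools.combinations(subset, r):
            v = np.zeros(n_visible)
            for i in active:
                v[i] = 1.0
            sign = (-1) ** (len(subset) - r)
            total += sign * log_weight(v)
    return total


# Hypothetical small ensemble member: N visible units, M hidden units.
rng = np.random.default_rng(0)
N, M = 5, 4
W = rng.normal(0.2, 0.3, size=(N, M))
c = np.zeros(M)

subset = (0, 2, 4)  # a three-body term
numeric = induced_interaction(
    subset, lambda v: log_weight_exponential_rbm(v, W, c), N
)

# Closed form under the same assumption: sum over hidden units of
# exp(-c_mu) * prod_{i in subset} (exp(w_{i mu}) - 1).
closed_form = float(
    np.sum(np.exp(-c) * np.prod(np.exp(W[list(subset), :]) - 1.0, axis=0))
)
print(numeric, closed_form)
```

Under these assumptions `numeric` and `closed_form` agree to floating-point precision, because for 0/1 variables $\exp(\sum_i w_i v_i)$ factorizes into $\prod_i (1 + (e^{w_i} - 1) v_i)$.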

If this is right

Data structures generated by strong higher-order interactions remain hard to represent for linear, step, and ReLU activations at any parameter value.
Exponential activation enlarges the representable set precisely when the hidden-unit bias and coupling scale lie inside the derived interval.
Quantitative agreement between moment calculations and observed learning trajectories holds across the four activations tested.
Optimal parameter choices for exponential units can be read directly from the analytic expressions without numerical search.
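
The window can be made concrete with a small numerical sketch. It takes the relative-fluctuation expression quoted on this page, $\Delta^{\mathrm{Exp}}_s = M_0^{-1}[(\gamma_2/\gamma_1^2)^s - 1]$, and evaluates it under assumptions that the source does not confirm: weights drawn independently from a Gaussian with mean `w0` and standard deviation `sigma`, hidden biases $c_\mu = 0$ (so that $M_0$ reduces to $M$), and $\gamma_1$, $\gamma_2$ read as the first and second moments of $e^{w} - 1$. A sampled estimate is computed alongside as a sanity check. All function names are hypothetical.

```python
import numpy as np


def gamma_moments(w0, sigma):
    """First and second moments of x = exp(w) - 1 for w ~ Normal(w0, sigma^2)."""
    m1 = np.exp(w0 + 0.5 * sigma**2)        # E[exp(w)]
    m2 = np.exp(2.0 * w0 + 2.0 * sigma**2)  # E[exp(2 w)]
    gamma1 = m1 - 1.0
    gamma2 = m2 - 2.0 * m1 + 1.0
    return gamma1, gamma2


def relative_variance(s, M, w0, sigma):
    """Var(I_s) / mean(I_s)^2 for order-s interactions, assuming c_mu = 0."""
    gamma1, gamma2 = gamma_moments(w0, sigma)
    return ((gamma2 / gamma1**2) ** s - 1.0) / M


def sampled_relative_variance(s, M, w0, sigma, n_samples=50_000, seed=0):
    """Same ratio, estimated from randomly drawn weights."""
    rng = np.random.default_rng(seed)
    w = rng.normal(w0, sigma, size=(n_samples, M, s))
    # One interaction per sample: sum over hidden units of a product over
    # the s visible units taking part in the interaction.
    interactions = np.prod(np.exp(w) - 1.0, axis=2).sum(axis=1)
    return interactions.var() / interactions.mean() ** 2


analytic = relative_variance(s=2, M=20, w0=0.2, sigma=0.3)
sampled = sampled_relative_variance(s=2, M=20, w0=0.2, sigma=0.3)
print(f"analytic = {analytic:.3f}, sampled = {sampled:.3f}")
```

Under this reading, values below 1 mean that the average interaction dominates its fluctuations. Scanning `w0` and `sigma` with `relative_variance` gives a crude picture of where that holds for each order `s`.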

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar rapidly growing activations could be substituted for the exponential in deeper architectures to improve capture of multi-body correlations.
The moment characterization offers a diagnostic for whether a given dataset is likely to be learnable by a given RBM before training begins.
The same duality lens may be applied to other energy-based models to predict which activation choices favor higher-order statistics.

Load-bearing premise

The duality between RBMs and models of interacting binary variables fully determines the representable distributions through the moments of the induced interactions.

What would settle it

Training an exponential RBM on synthetic data generated from a model with large three-body or higher couplings and checking whether the recovered effective interactions match the predicted moments only inside the analytically derived bias-and-scale window.
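
A minimal sketch of the data-generation half of such a test follows. It builds the exact distribution of a small lattice gas with a single three-body coupling and draws samples from it; training the RBM and mapping it back to interactions are not shown. The sign convention $P(v) \propto \exp(\sum_S I_S \prod_{i \in S} v_i)$, the value `T = 1.0` and the function name are assumptions made for illustration, not taken from the paper.

```python
import itertools
import numpy as np


def lattice_gas_distribution(n, interactions):
    """Exact probabilities of all 2^n states of a 0/1 lattice gas.

    `interactions` maps a tuple of unit indices to a coupling, for example
    {(0,): h_0, (0, 1): J_01, (0, 1, 2): T_012}.
    """
    states = np.array(list(itertools.product([0, 1], repeat=n)))
    log_weights = np.zeros(len(states))
    for subset, coupling in interactions.items():
        # The product is 1 only when every unit in the subset is active.
        log_weights += coupling * np.prod(states[:, list(subset)], axis=1)
    weights = np.exp(log_weights - log_weights.max())  # stabilised exponent
    return states, weights / weights.sum()


# Hypothetical ground truth: N = 3 units coupled by one three-body term.
T = 1.0
states, probs = lattice_gas_distribution(3, {(0, 1, 2): T})

# Draw synthetic training data from the exact distribution.
rng = np.random.default_rng(0)
samples = states[rng.choice(len(states), size=10_000, p=probs)]
```

With only the three-body term present, every state except (1, 1, 1) has the same probability, so any pairwise or one-body interaction recovered from a trained model would be an artifact of the fit rather than a feature of the data.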

Figures

Figures reproduced from arXiv: 2605.19178 by Giovanni di Sarra, Yasser Roudi.

**Figure 2.** Hidden layer marginalization. The joint distribution of an RBM with $N = 5$ is marginalized with respect to the hidden layer to generate a fully visible network with arbitrary orders of interaction between nodes. In the Linear RBM case, Eq. (2) corresponds to a Hopfield-like pairwise model. In the nonlinear cases, Eq. (2) also includes every higher-order interaction term up to $s = N$. The three-body interact…

**Figure 3.** Solutions of $\Delta^{\mathrm{Exp}}_s = 1$ in the $(\sigma^2, w_0)$ plane for $M_0^{-1} = 0.1$ (left) and $M_0^{-1} = 0.002$ (right). Eq. (13) is plotted with a color corresponding to the order of interaction. The black line shows the divergence $\gamma_1 = 0$, where interaction fluctuations are infinitely larger than the expected value. … interaction terms with increasing order $s$ have larger fluctuation-dominated regions. Furthermore, the size of the…

**Figure 4.** $I^{(s)}_0/M$ versus $w_0$ from Eq. (15), for $s = 1, 2$ for the Linear activation function and $s = 1, 2, 3$ for Exponential, Step and ReLU. Interactions of higher orders are also present for all the activation functions except for Linear. $I^{(s)}_0/M$ with $s > 3$ are smaller than $I^{(3)}_0/M$ and are not shown for visualization purposes. The star indicates the transition point for the Exponential function. The RBM para…

**Figure 5.** Figure 5 shows a different situation for ReLU and Step, where the input to the hidden units changes the interaction structure in a more complicated way.

**Figure 6.** $I_s$ from Eq. (23) and $I_s$ from Eq. (20) (dashed line) versus $w_0$ for $g = 2$. The solid line for the Exponential activation shows Eq. (7). Parameters are $b_i = 0$, $c_\mu = 0$ $\forall i, \mu$, $N = 8$, $M = 20$.

**Figure 7.** $I_s$ from Eq. (23) and $I_s$ from Eq. (20) (dashed line) versus $g$ for $w_0 = 0.2$. The solid line for the Exponential activation shows Eq. (7). Parameters are $b_i = 0$, $c_\mu = 0$ $\forall i, \mu$, $N = 8$, $M = 20$. … the latter deviates from the analytical expressions. In fact, the $\gamma_1 = 1$ transition is not captured by the expansion. Figs. 8 and 9 show a similar set of results as in Figs. 6 and 7 but for the variance of the interaction…

**Figure 8.** $\sigma^2_s$ from Eq. (25) and $\mathrm{Var}(I_{k_1,\cdots,k_s})$ from Eq. (24) (dashed line) versus $w_0$ for $g = 1$. The solid line for the Exponential activation shows Eq. (9). Parameters are $b_i = 0$ $\forall i$, $c_\mu = 0$ $\forall \mu$, $N = 8$, $M = 20$. As in the previous cases, analytical expressions are compared with empirical averages.

**Figure 9.** $\sigma^2_s$ from Eq. (25) and $\mathrm{Var}(I_{k_1,\cdots,k_s})$ from Eq. (24) (dashed line) versus $g$ for $w_0 = 0.2$. The solid line for the Exponential activation shows Eq. (9). The RBM parameters are $b_i = 0$ $\forall i$, $c_\mu = 0$ $\forall \mu$, $N = 8$ and $M = 20$. … the order of interaction $s$. This is well captured by the theoretical expressions and shown both as a function of $w_0$ (…

**Figure 10.** Square root of $I^2_s$ from Eq. (23) ($n = 2$) and square root of $I_{s,2}$ from Eq. (22) (dashed line) versus $w_0$ for $g = 1$. The solid line for the Exponential activation shows the first term in Eq. (9). The RBM parameters are $b_i = 0$ $\forall i$, $c_\mu = 0$ $\forall \mu$, $N = 8$ and $M = 20$. … moments gives an estimate of the average magnitude of the interaction terms. Then, the figures show how lower order interactions are larger in magnitu…

**Figure 11.** RBMs with Exponential activation have a regime where different orders of interaction

**Figure 12.** Fraction of decaying interaction models for Exponential, ReLU and Step activation

**Figure 13.** Decaying and non-decaying ground-truth lattice gas models. Ground-truth lattice gas models with $N = 3$ and interactions $I^{\mathrm{gt}}_{k_1,\ldots,k_s} \sim \mathcal{N}(I^{(s)}_{\mathrm{gt}}, I^{(s)}_{\mathrm{gt}}/5)$. For the decaying interaction model in Eq. (28) (upper left), $I^{(1)}_{\mathrm{gt}} = 0.9$, $I^{(2)}_{\mathrm{gt}} = 0.3$ and $I^{(3)}_{\mathrm{gt}} = 0.1$. For the non-decaying interaction model in Eq. (29) (lower left), the interactions are 3-body, $I^{(3)}_{\mathrm{gt}} = 1$. Edges in the networks repr…

**Figure 14.** Learning a decaying interaction model. An RBM with $N = 3$ and $M = 4$, initialized with zero-mean Gaussian weights ($\sigma = 0.01$), is trained to match Eq. (28) for different activation functions. The model is trained for 2500 epochs with a learning rate of 0.001. The first panel in each row shows the trajectory of the interactions mapped from the RBM, compared with the ground-truth interactions (dashed lines). The…

**Figure 15.** Figure 15 shows the training process when the ground-truth model is the three-body interaction model in

**Figure 16.** Learning interactions from a random RBM. An RBM with $N = 3$ and $M = 4$, initialized with zero-mean Gaussian weights ($\sigma = 0.01$), is trained to match the probability distribution of a Gaussian random RBM ($w_0 = 0.2$ and $g = 0.2/\sqrt{M}$) for different activation functions. The model is trained for 300 epochs with a learning rate of 0.02. The first panel for each activation function shows the training trajectory of t…

**Figure 17.** Figure 17 shows how one-body interaction models in Eq. (31) are learned by an RBM, for different values of $h_i$ and different activation functions. This kind of behavior suggests that the one-body

**Figure 18.** Learning a pairwise lattice gas model. RBMs with $N = 3$ and $M = 4$, initialized with zero-mean Gaussian weights ($\sigma = 0.01$), are trained to match the probability distribution of ground-truth models with pairwise interactions only (Eq. (32)) for different values of $J_{ij}$. Interactions of order 1 and 3 are plotted versus $J$. Each panel shows the comparison between the pairwise ground-truth interaction $J_{ij}$ and th…

**Figure 19.** Learning a three-body lattice gas model. RBMs with $N = 3$ and $M = 4$, initialized with zero-mean Gaussian weights ($\sigma = 0.01$), are trained to match the probability distribution of ground-truth models with one three-body interaction only (Eq. (33)) for different values of $T_{ijk}$. Interactions of $s < 3$ are plotted versus $T$. Each panel shows the comparison between the three-body ground-truth interaction $T_{ijk}$ and t…

**Figure 20.** Learning a non-decaying lattice gas model with the Exponential activation. An RBM with $N = 3$ and $M = 8$, initialized with Gaussian weights ($w_0 = 0.3$ and $g = 3$), is trained on a ground-truth non-decaying model. The left panel shows $(I^{\mathrm{gt}}_{k_1,\cdots,k_s})^2$ (in blue) and $(I_{k_1,\cdots,k_s})^2$ from Eq. (3) for the trained RBM with different activation functions. The lines connect Eq. (23) ($n = 2$) for $s = 2$ and $s = 3$. The rig…

**Figure 21.** RBMs represent a three-body interaction model with $T = 0.5$. The weight configurations solving the nonlinear set of equations given by Eq. (3) for a three-body interaction are shown for each activation function.

**Figure 22.** Solutions of $\Delta^{\mathrm{Lin}}_s = 1$ in the $(\sigma^2, w_0)$ plane for $M_0 = 0.1$ (left) and $M_0 = 0.002$ (right). Eq. (13) is plotted with a color corresponding to the order of interaction. The black line shows the divergence $w_0 = 0$, where interaction fluctuations are infinitely larger than their mean.

**Figure 23.** Square root of $I^2_s$ from Eq. (23) ($n = 2$) and square root of $I_{s,2}$ from Eq. (22) (dashed line) versus $g$ for $w_0 = 0.2$. The solid line for the Exponential activation shows the first term in Eq. (9). The RBM parameters are $b_i = 0$ $\forall i$, $c_\mu = 0$ $\forall \mu$, $N = 8$ and $M = 20$.

**Figure 24.** Learning a large independent lattice gas model. RBMs with $N = 10$ and $M = 15$, initialized with zero-mean Gaussian weights ($\sigma = 0.01$), are trained to match the probability distribution of ground-truth lattice gas models with one-body interactions only (Eq. (31)) for different values of $h_i$ and for different activation functions. Each panel shows the comparison between the one-body ground-truth interaction $h_i$…

**Figure 25.** Learning a large pairwise lattice gas model. RBMs with $N = 10$ and $M = 15$, initialized with zero-mean Gaussian weights ($\sigma = 0.01$), are trained to match the probability distribution of ground-truth lattice gas models with pairwise interactions only (Eq. (32)) for different values of $J_{ij}$ and for different activation functions. Each panel shows the comparison between the pairwise ground-truth interaction $J_{ij}$ a…

**Figure 26.** Learning a large three-body lattice gas model. RBMs with $N = 10$ and $M = 15$, initialized with zero-mean Gaussian weights ($\sigma = 0.01$), are trained to match the probability distribution of ground-truth lattice gas models with three-body interactions only (Eq. (33)) for different values of $T_{ijk}$ and for different activation functions. Each panel shows the comparison between the three-body ground-truth interactio…

**Figure 27.** Learning a non-decaying lattice gas model with the Exponential activation: details. The model is trained for 2500 epochs with a learning rate of $5 \times 10^{-4}$. The first panel in each row shows the trajectory of the interactions mapped from the RBM, compared with the ground-truth interactions (dashed lines). The second panel in each row shows the cross-entropy trajectory, where the target is the ground truth…

Original abstract

The great success of neural networks in recognizing hidden patterns and correlations in complex data lies in the way they take advantage of the large number of parameters and nonlinear single-unit activation, jointly. Restricted Boltzmann Machines (RBMs) provide a simple yet powerful framework for studying the impact of activation nonlinearities on performance and representation. In this work, we exploit the duality between RBMs and models of interacting binary variables to study the statistics of the interactions induced by RBM ensembles with different hidden unit activation functions. We characterize the space of representable models analytically in terms of moments of the distribution of induced interactions for four commonly used activation functions: Linear, Step, ReLU, and Exponential. Quantitative predictions of the analytical calculations on learning show a very good agreement with results of the simulations of the training process. In particular, our analysis shows that there are certain data structures, namely those generated by models of interacting variables with large interaction terms beyond pairwise, that are difficult to represent, and thus to learn, for any RBM. Yet, we find that rapidly increasing nonlinearities, such as the Exponential function, can facilitate the representation and learning of such data structures for a specific range of parameters that is determined analytically.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Analytic moment maps for four RBM activations identify a parameter window where exponential units boost higher-order terms, with solid simulation agreement but open questions on whether low moments fully capture the representable space.

The main thing to know is that this paper derives closed-form moments of the induced couplings for linear, step, ReLU, and exponential activations in RBMs, then shows that the exponential case opens a usable window for representing data with strong multi-body interactions. The authors start from the RBM-Ising duality, express the effective couplings of each order as sums over hidden units, and compute the resulting moment expressions as functions of the weights and activation parameters. The exponential stands out because its steep nonlinearity fattens the tails enough to produce larger higher moments, which the authors link to better coverage of data generated from models with sizable three- and four-body terms.

The analytics line up closely with what appears after training, which is the most concrete part of the work. That match turns the moment formulas into something you can actually use when picking an activation for data that clearly contains higher-order structure.

The softer spot is the assumption that matching the first few moments of the induced-coupling distribution is enough to guarantee the full set of representable models. If the couplings remain correlated or the distribution keeps non-Gaussian features that the low moments miss, the claimed advantage for the exponential might be narrower than the analytic window suggests. The simulations support the central comparison within the tested ranges, but they do not seem to include extra checks on the shape of the full coupling distribution or on whether the learned RBM reproduces the target higher-order statistics beyond what the moments already predict.

This is aimed at people who analyze or train energy-based models and want a principled way to choose nonlinearities when higher-order statistics matter. It is not a broad new framework, but the explicit parameter range and the direct analytic-simulation comparison give it enough substance to justify referee time. I would send it out for review.

Referee Report

2 major / 2 minor

Summary. The paper exploits the RBM-Ising duality to map activation functions (Linear, Step, ReLU, Exponential) to distributions of induced couplings in an effective model of interacting binary variables. It analytically computes the moments of these distributions to characterize the space of representable models and identifies an analytically determined parameter window for the Exponential activation in which higher-order moments become large enough to represent data structures with strong multi-body interactions. Quantitative predictions from the moment analysis are reported to agree well with direct simulations of the training dynamics.

Significance. If the central mapping holds, the work supplies a concrete analytical handle on how activation nonlinearities control the capacity to encode higher-order statistics, which is a load-bearing issue for understanding representation power in energy-based models. The explicit parameter range for Exponential activations and the reported agreement between analytics and simulations constitute falsifiable, reproducible elements that could inform activation choice in RBMs and related architectures.

major comments (2)

[§3] (moment characterization): The claim that the space of representable models is fully delineated by the first few moments of the induced-interaction distribution assumes that moment matching suffices to guarantee reproduction of arbitrary higher-order statistics. For the Exponential activation, whose induced couplings are expected to be non-Gaussian, it is not shown whether residual correlations or higher cumulants outside the reported moments can still prevent the effective Hamiltonian from capturing the target multi-body terms; a bound or explicit counter-example would be needed to secure this step.
[§4] (simulation validation): The parameter window for Exponential is derived from the same moment calculations used to define the representable set; the reported agreement with training simulations therefore does not constitute an independent test of whether the moment truncation actually enlarges the reachable model space beyond what ReLU or Step functions achieve.

minor comments (2)

[§2] Notation for the induced coupling distribution should be introduced once and used consistently; the transition from the RBM energy to the effective Ising Hamiltonian is described in two places with slightly different symbols.
[Figure 3] The error bars on the learning curves for the Exponential case overlap with the ReLU curves in the reported regime; a statistical test or larger sample size would clarify whether the claimed advantage is significant.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the thoughtful and constructive report. The comments highlight important nuances in our moment-based characterization and the nature of our simulation validation. We address each major comment below, indicating where we will revise the manuscript for greater precision while defending the core contributions of the work.

Referee: [§3] (moment characterization): The claim that the space of representable models is fully delineated by the first few moments of the induced-interaction distribution assumes that moment matching suffices to guarantee reproduction of arbitrary higher-order statistics. For the Exponential activation, whose induced couplings are expected to be non-Gaussian, it is not shown whether residual correlations or higher cumulants outside the reported moments can still prevent the effective Hamiltonian from capturing the target multi-body terms; a bound or explicit counter-example would be needed to secure this step.

Authors: We appreciate this observation. Our manuscript characterizes the space of representable models through the moments of the induced coupling distribution rather than claiming that matching the first few moments rigorously guarantees exact reproduction of arbitrary higher-order statistics. The moments quantify the expected magnitude of multi-body interactions; for the Exponential activation these higher moments grow rapidly inside the identified parameter window, indicating enhanced capacity for strong higher-order terms. We acknowledge that non-Gaussian features and residual cumulants could affect precise matching and will add a clarifying paragraph in §3 stating that the moment analysis provides a necessary indicator of representational capacity but is not proven sufficient for all target statistics. A rigorous bound or counter-example lies beyond the present scope.
Revision: partial

Referee: [§4] (simulation validation): The parameter window for Exponential is derived from the same moment calculations used to define the representable set; the reported agreement with training simulations therefore does not constitute an independent test of whether the moment truncation actually enlarges the reachable model space beyond what ReLU or Step functions achieve.

Authors: We agree that the simulations are guided by the same analytical moment calculations and therefore do not furnish a fully independent test of the truncation's effect on reachable model space. The numerical results instead confirm that the analytically predicted window for the Exponential activation corresponds to measurably better learning of higher-order structures, while the same window yields no advantage for Linear, Step or ReLU activations. We will revise the discussion in §4 to emphasize that the simulations validate the practical utility of the moment-derived window rather than independently proving an enlargement of the model space.
Revision: partial

standing simulated objections not resolved

A rigorous bound or explicit counter-example showing whether higher cumulants or residual correlations can prevent the effective Hamiltonian from capturing target multi-body terms for the Exponential activation.

Circularity Check

0 steps flagged

Moment-based analytic characterization of RBM representable spaces is self-contained and externally validated

full rationale

The paper derives the distribution of induced interactions and their moments directly from the RBM-Ising duality for each activation function (Linear, Step, ReLU, Exponential), obtains closed-form expressions for those moments, and identifies the parameter window for Exponential activation where higher-order moments become large. These analytic results are then compared quantitatively with simulations of the training process on synthetic data generated from models with strong higher-order terms. No equation reduces a prediction to a fitted parameter by construction, no load-bearing premise rests on a self-citation chain, and the duality is used only to map activations to interaction statistics rather than to presuppose the target result. The central claim therefore remains independent of its own outputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the standard statistical-mechanics duality between RBMs and interacting binary models plus the assumption that moments of the induced-interaction distribution suffice to characterize representable data structures.

free parameters (1)

parameter range for exponential activation
The specific window of parameters where exponential activation succeeds is determined analytically from the moment calculations and may implicitly depend on data assumptions.

axioms (1)

domain assumption: Duality between RBMs and models of interacting binary variables
Invoked to translate activation nonlinearities into statistics of effective interactions.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "We exploit the duality between RBMs and models of interacting binary variables to study the statistics of the interactions induced by RBM ensembles with different hidden unit activation functions. We characterize the space of representable models analytically in terms of moments of the distribution of induced interactions"

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · costAlphaLog_high_calibrated_iff · unclear

Paper passage: "For the Exponential activation function... $I^{\mathrm{Exp}}_s = M \gamma_1^s \langle e^{-c_\mu} \rangle$ ... $\Delta^{\mathrm{Exp}}_s = M_0^{-1} \left[ (\gamma_2/\gamma_1^2)^s - 1 \right]$"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 2 internal anchors

[1]

Allen-Zhu, Y

Z. Allen-Zhu, Y. Li, and Y. Liang. Learning and generalization in overparameterized neural networks, going beyond two layers.Advances in neural information processing systems, 32, 2019

work page 2019
[2]

Arora, S

S. Arora, S. Du, W. Hu, Z. Li, and R. Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. InInternational conference on machine learning, pages 322–332. PMLR, 2019

work page 2019
[3]

Allen-Zhu, Y

Z. Allen-Zhu, Y. Li, and Z. Song. A convergence theory for deep learning via over- parameterization. InInternational conference on machine learning, pages 242–252. PMLR, 2019

work page 2019
[4]

On the generalization mystery in deep learning.arXiv preprint arXiv:2203.10036, 2022

S. Chatterjee and P. Zielinski. On the generalization mystery in deep learning.arXiv preprint arXiv:2203.10036, 2022

work page arXiv 2022
[5]

Oymak and M

S. Oymak and M. Soltanolkotabi. Toward moderate overparameterization: Global conver- gence guarantees for training shallow neural networks.IEEE Journal on Selected Areas in Information Theory, 1(1):84–105, 2020

work page 2020
[6]

Li and Y

Y. Li and Y. Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data.Advances in neural information processing systems, 31, 2018

work page 2018
[7]

di Sarra, B

G. di Sarra, B. Bravi, and Y. Roudi. The unbearable lightness of restricted boltzmann ma- chines: Theoretical insights and biological applications.Europhysics Letters, 149(2):21002, jan 2025

work page 2025
[8]

Nair and G

V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. InProceedings of the 27th International Conference on International Conference on Machine Learning, ICML’10, pages 807–814, Madison, WI, USA, 2010. Omnipress. 26

work page 2010
[9]

Glorot, A

X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In Geoffrey Gordon, David Dunson, and Miroslav Dud´ ık, editors,Proceedings of the Fourteenth Interna- tional Conference on Artificial Intelligence and Statistics, volume 15 ofProceedings of Machine Learning Research, pages 315–323, Fort Lauderdale, FL, USA, 11–13 Apr 2011. PMLR

work page 2011
[10]

Krizhevsky, I

A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks.Advances in neural information processing systems, 25, 2012

work page 2012
[11]

Searching for Activation Functions

P. Ramachandran, B. Zoph, and Q. V. Le. Searching for activation functions.arXiv preprint arXiv:1710.05941, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[12]

Gaussian Error Linear Units (GELUs)

D. Hendrycks and K. Gimpel. Gaussian error linear units (gelus).arXiv preprint arXiv:1606.08415, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[13]

Fukai and M

T. Fukai and M. Shiino. Large suppression of spurious states in neural networks of nonlinear analog neurons.Phys. Rev. A, 42:7459–7466, Dec 1990

work page 1990
[14]

K¨ uhn, S

R. K¨ uhn, S. B¨ os, and J. L. van Hemmen. Statistical mechanics for networks of graded-response neurons.Phys. Rev. A, 43:2084–2087, Feb 1991

work page 2084
[15]

Graded-response neurons and information encodings in autoassociative memories.Phys

Alessandro Treves. Graded-response neurons and information encodings in autoassociative memories.Phys. Rev. A, 42:2418–2430, Aug 1990

work page 1990
[16]

Localized activity profiles and storage capacity of rate- based autoassociative networks.Phys

Yasser Roudi and Alessandro Treves. Localized activity profiles and storage capacity of rate- based autoassociative networks.Phys. Rev. E, 73:061904, Jun 2006

work page 2006
[17]

Threshold-linear formal neurons in auto-associative nets.Journal of Physics A: Mathematical and General, 23(12):2631–2650, jun 1990

A Treves. Threshold-linear formal neurons in auto-associative nets.Journal of Physics A: Mathematical and General, 23(12):2631–2650, jun 1990

work page 1990
[18]

Sch¨ onsberg, Y

F. Sch¨ onsberg, Y. Roudi, and A. Treves. Efficiency of local learning rules in threshold-linear associative networks.Phys. Rev. Lett., 126:018301, Jan 2021

work page 2021
[19]

K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In2015 IEEE International Conference on Computer Vision (ICCV), pages 1026–1034, Los Alamitos, CA, USA, dec 2015. IEEE Computer Society

work page 2015
[20]

V. Kunc and J. Kléma. Three decades of activations: A comprehensive survey of 400 activation functions for neural networks. arXiv preprint arXiv:2402.09092, 2024
[21]

E. Oostwal, M. Straat, and M. Biehl. Hidden unit specialization in layered neural networks: Relu vs. sigmoidal activation. Physica A: Statistical Mechanics and its Applications, 564:125517, 2021
[22]

O. Citton, F. Richert, and M. Biehl. Phase transition analysis for shallow neural networks with arbitrary activation functions. Physica A: Statistical Mechanics and its Applications, 660:130356, 2025
[23]

S. Nishiyama and M. Ohzeki. Solution space and storage capacity of fully connected two-layer neural networks with generic activation functions. Journal of the Physical Society of Japan, 94(1):014802, 2025
[24]

G. Manzan and D. Tantari. The effect of priors on learning with restricted boltzmann machines. Physica A: Statistical Mechanics and its Applications, 674:130766, 2025
[25]

P. Smolensky. Information Processing in Dynamical Systems: Foundations of Harmony Theory. In: Rumelhart, D. E., McClelland, J. S. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, volume 1, pages 194–281. MIT Press, 1986
[26]

A. Fischer and C. Igel. An introduction to restricted boltzmann machines. In L. Alvarez, M. Mejail, L. Gomez, and J. Jacobo, editors, Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, pages 14–36, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg
[27]

D. H. Ackley, G. E. Hinton, and T. J. Sejnowski. A learning algorithm for boltzmann machines. Cognitive Science, 9(1):147–169, 1985

[28]

N. Le Roux and Y. Bengio. Representational power of restricted boltzmann machines and deep belief networks. Neural computation, 20(6):1631–1649, 2008
[29]

A. Decelle and C. Furtlehner. Restricted boltzmann machine: Recent advances and mean-field theory. Chinese Physics B, 30(4):040202, 2021
[30]

C. Marullo and E. Agliari. Boltzmann machines as generalized hopfield networks: A review of recent results and outlooks. Entropy, 23(1), 2021
[31]

T. Bonnaire, G. Catania, A. Decelle, and B. Seoane. On the role of non-linear latent features in bipartite generative neural networks. SciPost Phys., 19:141, 2025
[32]

A. Barra, A. Bernacchia, E. Santucci, and P. Contucci. On the equivalence of hopfield networks and boltzmann machines. Neural Netw, 34:1–9, Oct 2012
[33]

A. Fachechi, E. Agliari, M. Aquaro, A. Coolen, and M. Mulder. Fundamental operating regimes, hyper-parameter fine-tuning and glassiness: towards an interpretable replica-theory for trained restricted boltzmann machines. Journal of Physics A: Mathematical and Theoretical, 58(6):065004, 2025
[34]

N. Bulso and Y. Roudi. Restricted Boltzmann Machines as Models of Interacting Variables. Neural Computation, 33(10):2646–2681, Sep 2021
[35]

A. Decelle, A. Navas Gómez, and B. Seoane. Inferring higher-order couplings with neural networks. Phys. Rev. Lett., 135:207301, Nov 2025
[36]

A. Barra, G. Genovese, P. Sollich, and D. Tantari. Phase transitions in restricted boltzmann machines with generic priors. Phys. Rev. E, 96:042156, Oct 2017
[37]

A. Barra, G. Genovese, P. Sollich, and D. Tantari. Phase diagram of restricted boltzmann machines and generalized hopfield networks with arbitrary priors. Phys. Rev. E, 97:022310, Feb 2018
[38]

J. Tubiana and R. Monasson. Emergence of compositional representations in restricted boltzmann machines. Phys. Rev. Lett., 118:138301, Mar 2017
[39]

F. E. Leonelli, E. Agliari, L. Albanese, and A. Barra. On the effective initialisation for restricted boltzmann machines via duality with hopfield model. Neural Networks, 143:314–326, 2021
[40]

E. Ventura, S. Cocco, R. Monasson, and Francesco Zamponi. Unlearning regularization for boltzmann machines. Machine Learning: Science and Technology, 5(2):025078, Jun 2024
[41]

H. Shah, K. Tamuly, A. Raghunathan, P. Jain, and P. Netrapalli. The pitfalls of simplicity bias in neural networks. Advances in Neural Information Processing Systems, 33:9573–9585, 2020
[42]

R. Rende, F. Gerace, A. Laio, and S. Goldt. A distributional simplicity bias in the learning dynamics of transformers. arXiv preprint arXiv:2410.19637, 2024
[43]

M. Refinetti, A. Ingrosso, and S. Goldt. Neural networks trained with sgd learn distributions of increasing complexity. In International Conference on Machine Learning, pages 28843–28863. PMLR, 2023
[44]

F. Jangjoo, G. di Sarra, M. Marsili, and Y. Roudi. Lost in retraining: Closed-loop learning and model collapse in exponential families. Phys. Rev. Lett., 136:197301, May 2026

Appendix

The expected interaction in the Linear case

Defining $n \equiv s - p$,

$$\langle I_{k_1,\cdots,k_s}\rangle = \sum_{\mu}\sum_{n=1}^{s}(-1)^{s-n}\sum_{1\le j_1<j_2\cdots<j_n\le n}\frac{1}{2}\Big\{\sum_{l=1}^{n}\langle (w_{k_{j_l},\mu})^2\rangle+\sum_{l\neq l'=1}^{n}\langle w_{k_{j_l},\mu}\rangle\langle w_{k_{j_{l'}},\mu}\ldots$$
