arxiv: 2511.07308 · v2 · pith:77OMDIUI · submitted 2025-11-10 · 💻 cs.LG

Can Stationary Distributions of Scale-Invariant Neural Networks Be Described by the Thermodynamics of an Ideal Gas?

Ildus Sadrtdinov, Ekaterina Lobacheva, Ivan Klimov, Mikhail Burtsev, Mikhail I. Katsnelson, Dmitry Vetrov

Pith reviewed 2026-05-17 23:28 UTC · model grok-4.3

classification 💻 cs.LG

keywords thermodynamic framework, stochastic gradient descent, scale-invariant neural networks, ideal gas, stationary distributions, entropy, weight decay

The pith

Stationary distributions of SGD for scale-invariant networks correspond to ideal gas thermodynamics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a thermodynamic framework for the stationary distributions reached by stochastic gradient descent with weight decay in scale-invariant neural networks. It draws analogies between training hyperparameters like learning rate and weight decay and thermodynamic variables such as temperature, pressure, and volume. Using a simplified isotropic noise model, the authors demonstrate that SGD dynamics closely match those of an ideal gas, as confirmed by theory and simulations. When this framework is applied to training actual neural networks, predictions about the behavior of stationary entropy align with experimental observations. This offers a principled way to interpret training dynamics and could inform hyperparameter choices.

Core claim

We show that the stationary distribution of SGD with weight decay for scale-invariant neural networks can be described by the thermodynamics of an ideal gas. Hyperparameters correspond to thermodynamic variables. Starting from an isotropic noise model, we find a close correspondence that is validated by theory and simulation, and that extends to neural network training, where the behavior of stationary entropy matches experiments.

What carries the argument

The thermodynamic framework that equates SGD dynamics under isotropic noise to ideal gas behavior, with learning rate as temperature and weight decay related to pressure.
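
The balance that produces a stationary weight norm can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the paper's VMF model or code: every stochastic gradient is taken to be pure isotropic noise projected orthogonal to the weights, with norm fixed to `grad_scale / ‖w‖`. The names `simulate_radius`, `predicted_radius`, and `grad_scale` are hypothetical. Under these assumptions the squared radius obeys r² ← (1 − ηλ)² r² + η² · grad_scale² / r², whose fixed point is approximately (η · grad_scale² / (2λ))^(1/4) when ηλ is small.

```python
import math
import random


def simulate_radius(lr, wd, dim=20, grad_scale=1.0, steps=8000, seed=0):
    """Run SGD with weight decay where each 'gradient' is isotropic noise
    projected onto the tangent space of the sphere (orthogonal to w)."""
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for _ in range(steps):
        r2 = sum(x * x for x in w)
        # Draw an isotropic direction, then remove its radial component:
        # for a scale-invariant loss the gradient is orthogonal to w.
        g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        radial = sum(gi * wi for gi, wi in zip(g, w)) / r2
        g = [gi - radial * wi for gi, wi in zip(g, w)]
        g_norm = math.sqrt(sum(x * x for x in g))
        # Toy assumption: the gradient norm is exactly grad_scale / ||w||.
        coef = grad_scale / (math.sqrt(r2) * g_norm)
        # Weight decay shrinks w; the orthogonal step enlarges its norm.
        w = [(1.0 - lr * wd) * wi - lr * coef * gi for wi, gi in zip(w, g)]
    return math.sqrt(sum(x * x for x in w))


def predicted_radius(lr, wd, grad_scale=1.0):
    """Fixed point of r^2 <- (1 - lr*wd)^2 r^2 + lr^2 grad_scale^2 / r^2."""
    return (lr * grad_scale ** 2 / (wd * (2.0 - lr * wd))) ** 0.25


if __name__ == "__main__":
    for lr, wd in [(0.1, 0.01), (0.05, 0.02)]:
        print(lr, wd, simulate_radius(lr, wd), predicted_radius(lr, wd))
```

Because the gradient norm is fixed in this toy model, the radius recursion is deterministic even though the step directions are random, so the simulated and predicted radii should agree closely once the run has relaxed.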

If this is right

The behavior of stationary entropy can be predicted from the thermodynamic variables.
Training hyperparameters can be interpreted and adjusted using gas law analogies.
This provides a foundation for interpreting training dynamics in a physics-based manner.
Future designs of learning rate schedulers may draw from thermodynamic principles.
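
For orientation, the textbook ideal-gas relations behind such analogies are recalled below. These are standard physics rather than the paper's hyperparameter dictionary, which is not reproduced on this page. The symbols $N$ (particle number), $k_B$ (Boltzmann constant), and $C_V$ (heat capacity at constant volume) are the usual ones and are assumptions of this note.

```latex
P V = N k_B T,
\qquad
S(T, V) = C_V \ln T + N k_B \ln V + \text{const},
\qquad
C_V = \tfrac{3}{2} N k_B \ \ \text{(monatomic gas)}.
```

At fixed particle number, the entropy of an ideal gas is therefore linear in $\ln T$ and $\ln V$, which is the kind of dependence a gas-law reading of training hyperparameters would look for.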

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the analogy holds more broadly, it could apply to understanding convergence in other optimization settings.
The framework might connect to existing work on noise in gradient descent by providing a physical interpretation.
Testing the predictions on different network scales could reveal limits of the ideal gas approximation.

Load-bearing premise

The simplified isotropic noise model adequately represents the gradient noise present in actual deep network training.

What would settle it

Measuring the dependence of stationary entropy on learning rate and weight decay in trained scale-invariant networks and finding it does not match the ideal gas predictions would falsify the central claim.

Figures

Figures reproduced from arXiv: 2511.07308 by Dmitry Vetrov, Ekaterina Lobacheva, Ildus Sadrtdinov, Ivan Klimov, Mikhail Burtsev, Mikhail I. Katsnelson.

**Figure 2.** Results for ResNet-18 on CIFAR-10 with fixed LR

**Figure 3.** Results for the VMF isotropic noise model on a fixed sphere with radius

**Figure 4.** Results for the VMF isotropic noise model with fixed ELR

**Figure 5.** Results for ResNet-18 on CIFAR-10 and CIFAR-100 with fixed LR

**Figure 6.** Results for ConvNet on CIFAR-10 and CIFAR-100 with fixed LR

**Figure 7.** Results for ResNet-18 on CIFAR-10 and CIFAR-100 with fixed ELR

**Figure 8.** Results for ConvNet on CIFAR-10 and CIFAR-100 with fixed ELR

**Figure 9.** Results for ResNet-18 on CIFAR-10 and CIFAR-100 on a fixed sphere with radius

**Figure 10.** Results for ConvNet on CIFAR-10 and CIFAR-100 on a fixed sphere with radius

**Figure 11.** The balance between centrifugal and centripetal forces, which preserves the weight norm ∥w_k∥ after the SGD step. Stochastic gradient ∇L_{B_k}(w_k) is orthogonal to the weight vector w_k. To account for this mismatch, we introduce a geometric correction that explicitly considers both the deterministic and stochastic components of the gradient. A similar geometric reasoning has been discussed in Kosson et a…

**Figure 12.** Comparison between the discrete-time and SDE predictions of the stationary radius

**Figure 13.** Results for overparameterized ResNet-18 on CIFAR-10 with fixed ELR

Original abstract

Understanding the training dynamics of deep neural networks remains a major open problem, with physics-inspired approaches offering promising insights. Building on this perspective, we develop a thermodynamic framework to describe the stationary distributions of stochastic gradient descent (SGD) with weight decay for scale-invariant neural networks, a setting that both reflects practical architectures with normalization layers and permits theoretical analysis. We establish analogies between training hyperparameters (e.g., learning rate, weight decay) and thermodynamic variables such as temperature, pressure, and volume. Starting with a simplified isotropic noise model, we uncover a close correspondence between SGD dynamics and ideal gas behavior, validated through theory and simulation. Extending to training of neural networks, we show that key predictions of the framework, including the behavior of stationary entropy, align closely with experimental observations. This framework provides a principled foundation for interpreting training dynamics and may guide future work on hyperparameter tuning and the design of learning rate schedulers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper offers a thermodynamic analogy for SGD on scale-invariant nets but the isotropic noise assumption remains untested against actual gradients.

Dear colleague,

This paper sets up a thermodynamic description of the stationary distribution reached by SGD with weight decay on scale-invariant neural networks by treating it like an ideal gas. The central new element is the mapping of hyperparameters to thermodynamic quantities and the resulting predictions for stationary entropy, which the authors then compare to training runs. They begin with an isotropic noise model, derive the gas-like behavior analytically and in simulation, and then show that the entropy trends in actual network training follow the expected pattern.

The scale-invariance assumption makes the math tractable, which is a reasonable choice for networks with normalization layers. The work is clearest in the simplified setting where everything can be computed exactly. The experimental part gives some support for the entropy behavior.

The weakest part is the jump from the toy noise model to real networks. Nothing in the paper quantifies how isotropic the gradient noise actually is during training: no spectra or effective dimensions are reported. Without that, it is unclear whether the derived stationary distribution applies or whether anisotropy would change the entropy predictions. The reported alignments also lack error bars or tests that vary the noise structure, so the match could be weaker than it appears. The variables are defined from the same hyperparameters they predict, which adds a circularity risk.

This kind of work is for researchers interested in physics-based views of optimization and hyperparameter effects. Someone looking for new ways to think about learning rate schedules might find it useful. I would send it to peer review. The idea is fresh and has enough empirical grounding that a referee could help strengthen the noise justification and experimental controls.

Best regards,

Referee Report

2 major / 2 minor

Summary. The paper develops a thermodynamic framework for stationary distributions of SGD with weight decay in scale-invariant neural networks by analogizing them to ideal-gas thermodynamics. Starting from a simplified isotropic noise model, it maps hyperparameters (learning rate to temperature, weight decay to pressure) to thermodynamic variables, derives ideal-gas-like stationary distributions, validates via theory and simulation, and reports that predictions for stationary entropy align with experiments on neural network training.

Significance. If the analogy and mappings hold, the work supplies a principled physics-based lens on training dynamics with potential to guide hyperparameter tuning and learning-rate schedulers. Strengths include explicit theory-plus-simulation validation of the toy isotropic model and falsifiable predictions for entropy behavior that are checked against experiments; these elements provide concrete, testable content rather than purely qualitative analogy.

major comments (2)

[§3] §3 (isotropic noise model derivation): the central extension from the toy model to real scale-invariant networks rests on an unquantified isotropy assumption for gradient noise; no covariance eigenvalue spectra, effective-dimension estimates, or ablation comparing isotropic vs. anisotropic noise are supplied, leaving open whether anisotropy or correlations invalidate the stationary-distribution predictions.
[Experimental results] Experimental results section: reported alignment of stationary-entropy predictions with observations supplies no error bars, no controls for the noise-model choice, and no quantitative measure of fit (e.g., R² or KL divergence), so the strength of the empirical support cannot be assessed.
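
One way the requested isotropy diagnostic could be computed is sketched below. It uses the participation ratio (Tr C)² / Tr(C²) of a sample covariance C, a common definition of effective dimension that is not necessarily the one the authors would choose. The function names and the synthetic Gaussian samples, which stand in for per-batch gradient noise, are hypothetical.

```python
import random


def participation_ratio(samples):
    """Effective dimension (Tr C)^2 / Tr(C^2) of the sample covariance C.

    Equals the ambient dimension d for perfectly isotropic noise and
    approaches 1 when a single direction dominates.
    """
    n, d = len(samples), len(samples[0])
    mean = [sum(s[j] for s in samples) / n for j in range(d)]
    centered = [[s[j] - mean[j] for j in range(d)] for s in samples]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    trace = sum(cov[i][i] for i in range(d))
    # For a symmetric matrix, Tr(C^2) is the sum of squared entries.
    trace_sq = sum(cov[i][j] ** 2 for i in range(d) for j in range(d))
    return trace ** 2 / trace_sq


def synthetic_noise(scales, n=4000, seed=0):
    """Placeholder data: independent Gaussians with per-axis std `scales`."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, s) for s in scales] for _ in range(n)]


if __name__ == "__main__":
    print("isotropic  :", participation_ratio(synthetic_noise([1.0] * 8)))
    print("anisotropic:", participation_ratio(synthetic_noise([1.0] + [0.1] * 7)))
```

For a real network the full covariance is too large to form explicitly, so a practical version would need a low-rank or stochastic trace estimate; this sketch only shows the quantity being asked for.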

minor comments (2)

[§2] Notation for the thermodynamic mappings (learning rate ↔ temperature, etc.) is introduced without an explicit table or equation summarizing the full dictionary of correspondences.
[Abstract] The abstract states that predictions 'align closely' but does not name the precise entropy estimator or the range of architectures and datasets used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their insightful review. The comments highlight important aspects that we will clarify and expand upon in the revision. We respond to each major comment in turn.

Referee: [§3] §3 (isotropic noise model derivation): the central extension from the toy model to real scale-invariant networks rests on an unquantified isotropy assumption for gradient noise; no covariance eigenvalue spectra, effective-dimension estimates, or ablation comparing isotropic vs. anisotropic noise are supplied, leaving open whether anisotropy or correlations invalidate the stationary-distribution predictions.

Authors: The manuscript explicitly introduces the isotropic noise model as a simplified starting point for the theoretical derivation in §3. We agree that providing quantitative support for this assumption in the context of real networks would strengthen the extension to practical settings. In the revised version, we will add an analysis of the gradient noise covariance matrices from our experiments, including eigenvalue spectra and estimates of effective dimension. We will also include a brief discussion on how deviations from isotropy might affect the predictions, supported by these measurements.

revision: yes

Referee: [Experimental results] Experimental results section: reported alignment of stationary-entropy predictions with observations supplies no error bars, no controls for the noise-model choice, and no quantitative measure of fit (e.g., R² or KL divergence), so the strength of the empirical support cannot be assessed.

Authors: We concur that the empirical validation would benefit from additional statistical rigor. We will revise the experimental results section to include error bars on the reported entropy values, add controls or sensitivity analyses for the noise model assumptions, and provide quantitative fit metrics such as R² or KL divergence between the predicted and observed stationary entropy behaviors.

revision: yes
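
The fit metrics named in this exchange are simple to compute once predicted and observed entropies are available. The following is a minimal sketch with hypothetical function names. The numbers in the usage block are placeholders chosen only to exercise the code and are not results from the paper.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def max_relative_error(observed, predicted):
    """Largest |observed - predicted| / |observed| over all points."""
    return max(abs(o - p) / abs(o) for o, p in zip(observed, predicted))


if __name__ == "__main__":
    # Placeholder values only; not measurements from the paper.
    observed = [1.0, 2.0, 3.0, 4.0]
    predicted = [1.1, 1.9, 3.2, 3.9]
    print("R^2:", r_squared(observed, predicted))
    print("max relative error:", max_relative_error(observed, predicted))
```

Reporting both numbers is useful because a high R² across a sweep can hide a large relative error at an individual hyperparameter setting.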

Circularity Check

0 steps flagged

No significant circularity; derivation remains self-contained

The paper begins from an explicit modeling assumption (simplified isotropic noise for SGD), derives the ideal-gas correspondence and thermodynamic mappings through theoretical analysis of the stationary distribution, and then checks the resulting predictions (including stationary entropy scaling) against separate experimental measurements on actual networks. These steps do not reduce to redefinitions or post-hoc fits of the same quantities; the experimental alignment functions as an external check rather than an input that is renamed as output. No self-citation chain, ansatz smuggling, or uniqueness theorem imported from prior author work is invoked to close the central argument. The framework therefore contains independent theoretical content and is not circular by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework rests on an unproven equivalence between gradient noise and ideal-gas molecular collisions plus the assumption that scale invariance plus weight decay produces a stationary distribution whose entropy can be read off from thermodynamic variables. No independent evidence is supplied for these steps.

free parameters (1)

effective temperature
Identified with learning-rate / weight-decay ratio; its value is chosen to match observed entropy scaling rather than derived from first principles.

axioms (1)

domain assumption: Gradient noise is isotropic and white.
Invoked to obtain the Maxwell-Boltzmann stationary distribution; appears in the abstract's description of the simplified model.
