pith. the verified trust layer for science.

arxiv: 2511.07308 · v2 · pith:77OMDIUI · submitted 2025-11-10 · 💻 cs.LG

Can Stationary Distributions of Scale-Invariant Neural Networks Be Described by the Thermodynamics of an Ideal Gas?

Pith reviewed 2026-05-17 23:28 UTC · model grok-4.3

classification 💻 cs.LG
keywords thermodynamic framework · stochastic gradient descent · scale-invariant neural networks · ideal gas · stationary distributions · entropy · weight decay
Add this Pith Number to your LaTeX paper
\usepackage{pith}
\pithnumber{77OMDIUI}

Prints a linked pith:77OMDIUI badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files.

The pith

Stationary distributions of SGD for scale-invariant networks correspond to ideal gas thermodynamics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a thermodynamic framework for the stationary distributions reached by stochastic gradient descent with weight decay in scale-invariant neural networks. It draws analogies between training hyperparameters like learning rate and weight decay and thermodynamic variables such as temperature, pressure, and volume. Using a simplified isotropic noise model, the authors demonstrate that SGD dynamics closely match those of an ideal gas, as confirmed by theory and simulations. When this framework is applied to training actual neural networks, predictions about the behavior of stationary entropy align with experimental observations. This offers a principled way to interpret training dynamics and could inform hyperparameter choices.

Core claim

We show that the stationary distribution of SGD with weight decay for scale-invariant neural networks can be described by the thermodynamics of an ideal gas. Hyperparameters correspond to thermodynamic variables. Starting from an isotropic noise model, we find a close correspondence that is validated by theory and simulation, and that extends to neural network training, where the predicted entropy behavior matches experiments.

What carries the argument

The thermodynamic framework that equates SGD dynamics under isotropic noise to ideal gas behavior, with learning rate as temperature and weight decay related to pressure.
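
A minimal sketch of the kind of isotropic-noise dynamics this analogy starts from, written in plain Python. It is an illustration under stated assumptions, not the authors' simulation: the noise is taken to be Gaussian, projected onto the tangent space of the weight vector and scaled by 1/‖w‖ (as the gradient of any scale-invariant loss would be), and the names `simulate_radius`, `predicted_radius`, `sigma`, and all default values are hypothetical. Because the noise is orthogonal to w, each step satisfies ‖w'‖² = (1 − ηλ)²‖w‖² + η²‖g‖², so the norm should settle where weight-decay shrinkage balances noise-driven growth.

```python
import math
import random


def simulate_radius(dim=50, lr=0.1, wd=0.1, sigma=1.0, steps=3000, seed=0):
    """Run SGD with weight decay on pure tangent noise; return the mean late-time norm."""
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    radii = []
    for _ in range(steps):
        r = math.sqrt(sum(x * x for x in w))
        u = [x / r for x in w]
        z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        zu = sum(a * b for a, b in zip(z, u))
        # Assumed noise model: isotropic Gaussian, projected orthogonally to w
        # and scaled by 1/r, mimicking the gradient of a scale-invariant loss.
        g = [sigma * (a - zu * b) / r for a, b in zip(z, u)]
        # Plain SGD step with weight decay: w <- (1 - lr * wd) * w - lr * g
        w = [(1.0 - lr * wd) * x - lr * gi for x, gi in zip(w, g)]
        radii.append(math.sqrt(sum(x * x for x in w)))
    tail = radii[steps // 2:]  # discard the transient before averaging
    return sum(tail) / len(tail)


def predicted_radius(dim=50, lr=0.1, wd=0.1, sigma=1.0):
    """Fixed point of r^2 -> (1 - lr*wd)^2 r^2 + lr^2 sigma^2 (dim - 1) / r^2."""
    return (lr * sigma**2 * (dim - 1) / (wd * (2.0 - lr * wd))) ** 0.25
```

Under these assumptions the simulated radius should sit close to the closed-form balance point, which depends on the hyperparameters only through η/λ and ηλ. The sketch makes no claim about how the paper maps these quantities to temperature, pressure, or volume.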

If this is right

  • The behavior of stationary entropy can be predicted from the thermodynamic variables.
  • Training hyperparameters can be interpreted and adjusted using gas law analogies.
  • This provides a foundation for interpreting training dynamics in a physics-based manner.
  • Future designs of learning rate schedulers may draw from thermodynamic principles.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the analogy holds more broadly, it could apply to understanding convergence in other optimization settings.
  • The framework might connect to existing work on noise in gradient descent by providing a physical interpretation.
  • Testing the predictions on different network scales could reveal limits of the ideal gas approximation.

Load-bearing premise

The simplified isotropic noise model adequately represents the gradient noise present in actual deep network training.

What would settle it

Measuring the dependence of stationary entropy on learning rate and weight decay in trained scale-invariant networks and finding it does not match the ideal gas predictions would falsify the central claim.
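
One way to run such a test is to estimate differential entropy directly from samples of the stationary weights. This summary does not name the estimator the paper uses, so the sketch below assumes a Kozachenko-Leonenko 1-nearest-neighbour estimator, a standard nonparametric choice. The function name `knn_entropy` is hypothetical, and the brute-force neighbour search is only practical for small sample counts.

```python
import math


def knn_entropy(samples):
    """Kozachenko-Leonenko 1-nearest-neighbour entropy estimate, in nats.

    `samples` is a list of equal-length lists of floats with no duplicates.
    """
    n = len(samples)
    d = len(samples[0])
    log_dist_sum = 0.0
    for i, x in enumerate(samples):
        # Squared distance from x to its nearest neighbour (brute force, O(n^2)).
        nearest_sq = min(
            sum((a - b) ** 2 for a, b in zip(x, y))
            for j, y in enumerate(samples)
            if j != i
        )
        log_dist_sum += 0.5 * math.log(nearest_sq)
    euler_gamma = 0.5772156649015329
    # Log-volume of the unit d-ball: (d/2) log(pi) - log Gamma(d/2 + 1).
    log_unit_ball = 0.5 * d * math.log(math.pi) - math.lgamma(0.5 * d + 1.0)
    return d * log_dist_sum / n + math.log(n - 1) + log_unit_ball + euler_gamma
```

As a sanity check, samples from a two-dimensional standard normal should give a value near log(2πe) ≈ 2.84 nats. Repeating the estimate over a grid of learning rates and weight decays would then give the entropy surface to compare against a predicted one.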

Figures

Figures reproduced from arXiv: 2511.07308 by Dmitry Vetrov, Ekaterina Lobacheva, Ildus Sadrtdinov, Ivan Klimov, Mikhail Burtsev, Mikhail I. Katsnelson.

Figure 1: Results for the VMF isotropic noise model with fixed LR
Figure 2: Results for ResNet-18 on CIFAR-10 with fixed LR
Figure 3: Results for the VMF isotropic noise model on a fixed sphere with radius
Figure 4: Results for the VMF isotropic noise model with fixed ELR
Figure 5: Results for ResNet-18 on CIFAR-10 and CIFAR-100 with fixed LR
Figure 6: Results for ConvNet on CIFAR-10 and CIFAR-100 with fixed LR
Figure 7: Results for ResNet-18 on CIFAR-10 and CIFAR-100 with fixed ELR
Figure 8: Results for ConvNet on CIFAR-10 and CIFAR-100 with fixed ELR
Figure 9: Results for ResNet-18 on CIFAR-10 and CIFAR-100 on a fixed sphere with radius
Figure 10: Results for ConvNet on CIFAR-10 and CIFAR-100 on a fixed sphere with radius
Figure 11: The balance between centrifugal and centripetal forces, which preserves the weight norm ∥w_k∥ after the SGD step. Stochastic gradient ∇L_{B_k}(w_k) is orthogonal to the weight vector w_k. To account for this mismatch, we introduce a geometric correction that explicitly considers both the deterministic and stochastic components of the gradient. A similar geometric reasoning has been discussed in Kosson et a…
Figure 12: Comparison between the discrete-time and SDE predictions of the stationary radius
Figure 13: Results for overparameterized ResNet-18 on CIFAR-10 with fixed ELR
Original abstract

Understanding the training dynamics of deep neural networks remains a major open problem, with physics-inspired approaches offering promising insights. Building on this perspective, we develop a thermodynamic framework to describe the stationary distributions of stochastic gradient descent (SGD) with weight decay for scale-invariant neural networks, a setting that both reflects practical architectures with normalization layers and permits theoretical analysis. We establish analogies between training hyperparameters (e.g., learning rate, weight decay) and thermodynamic variables such as temperature, pressure, and volume. Starting with a simplified isotropic noise model, we uncover a close correspondence between SGD dynamics and ideal gas behavior, validated through theory and simulation. Extending to training of neural networks, we show that key predictions of the framework, including the behavior of stationary entropy, align closely with experimental observations. This framework provides a principled foundation for interpreting training dynamics and may guide future work on hyperparameter tuning and the design of learning rate schedulers.
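
The scale-invariant setting named in the abstract has two standard consequences that the rest of the analysis relies on: the gradient is orthogonal to the weights, and rescaling the weights by c rescales the gradient by 1/c, so a step with learning rate η at norm r moves the direction exactly like a step on the unit sphere with effective learning rate η/r². The sketch below checks both on a toy loss f(w) = (a · w)/‖w‖. The toy loss, the helper names, and the numeric values are hypothetical, and reading "ELR" in the figure captions as η/‖w‖² is an assumption based on the usual definition.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def unit(v):
    r = math.sqrt(dot(v, v))
    return [x / r for x in v]


def grad_toy_loss(w, a):
    """Gradient of f(w) = (a . w) / ||w||, which satisfies f(c * w) = f(w) for c > 0."""
    r = math.sqrt(dot(w, w))
    aw = dot(a, w)
    return [ai / r - aw * wi / r**3 for ai, wi in zip(a, w)]


def sgd_step(w, a, lr):
    """One plain gradient step on the toy loss (no weight decay)."""
    g = grad_toy_loss(w, a)
    return [wi - lr * gi for wi, gi in zip(w, g)]


w = [1.0, 2.0, -2.0]   # ||w|| = 3
a = [0.5, -1.0, 2.0]
lr = 0.3
r = math.sqrt(dot(w, w))

g = grad_toy_loss(w, a)
orthogonality = dot(g, w)                                # zero up to rounding
g_scaled = grad_toy_loss([2.0 * x for x in w], a)        # equals g / 2
dir_full = unit(sgd_step(w, a, lr))                      # direction after a step at norm r
dir_sphere = unit(sgd_step(unit(w), a, lr / r**2))       # same step on the unit sphere
```

The two directions agree because w − η∇f(w) = r (u − (η/r²)∇f(u)) with u = w/r, which is why weight decay, by controlling r, acts on the effective step size rather than as a conventional regularizer.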

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper develops a thermodynamic framework for stationary distributions of SGD with weight decay in scale-invariant neural networks by analogizing them to ideal-gas thermodynamics. Starting from a simplified isotropic noise model, it maps hyperparameters (learning rate to temperature, weight decay to pressure) to thermodynamic variables, derives ideal-gas-like stationary distributions, validates via theory and simulation, and reports that predictions for stationary entropy align with experiments on neural network training.

Significance. If the analogy and mappings hold, the work supplies a principled physics-based lens on training dynamics with potential to guide hyperparameter tuning and learning-rate schedulers. Strengths include explicit theory-plus-simulation validation of the toy isotropic model and falsifiable predictions for entropy behavior that are checked against experiments; these elements provide concrete, testable content rather than purely qualitative analogy.

major comments (2)
  1. [§3] §3 (isotropic noise model derivation): the central extension from the toy model to real scale-invariant networks rests on an unquantified isotropy assumption for gradient noise; no covariance eigenvalue spectra, effective-dimension estimates, or ablation comparing isotropic vs. anisotropic noise are supplied, leaving open whether anisotropy or correlations invalidate the stationary-distribution predictions.
  2. [Experimental results] Experimental results section: reported alignment of stationary-entropy predictions with observations supplies no error bars, no controls for the noise-model choice, and no quantitative measure of fit (e.g., R² or KL divergence), so the strength of the empirical support cannot be assessed.
minor comments (2)
  1. [§2] Notation for the thermodynamic mappings (learning rate ↔ temperature, etc.) is introduced without an explicit table or equation summarizing the full dictionary of correspondences.
  2. [Abstract] The abstract states that predictions 'align closely' but does not name the precise entropy estimator or the range of architectures and datasets used.
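
The isotropy check requested in the first major comment could be sketched as follows. The referee does not say which effective-dimension estimate is meant, so this assumes the participation ratio (Tr C)² / Tr(C²), one common definition; `sample_covariance` and `participation_ratio` are hypothetical helper names, and the dense covariance is only feasible for small parameter counts or a projected subspace.

```python
def sample_covariance(vectors):
    """Unbiased sample covariance of a list of equal-length gradient vectors."""
    n = len(vectors)
    d = len(vectors[0])
    mean = [sum(v[k] for v in vectors) / n for k in range(d)]
    centred = [[v[k] - mean[k] for k in range(d)] for v in vectors]
    return [
        [sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def participation_ratio(cov):
    """Effective dimension (Tr C)^2 / Tr(C^2) of a symmetric matrix C."""
    trace = sum(cov[i][i] for i in range(len(cov)))
    # For symmetric C, Tr(C^2) equals the sum of squared entries.
    trace_of_square = sum(v * v for row in cov for v in row)
    return trace * trace / trace_of_square
```

For perfectly isotropic noise the ratio equals the ambient dimension, while noise concentrated along a single direction gives 1, so the ratio divided by the dimension is a rough isotropy score between 0 and 1.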

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their insightful review. The comments highlight important aspects that we will clarify and expand upon in the revision. We respond to each major comment in turn.

Point-by-point responses
  1. Referee: [§3] §3 (isotropic noise model derivation): the central extension from the toy model to real scale-invariant networks rests on an unquantified isotropy assumption for gradient noise; no covariance eigenvalue spectra, effective-dimension estimates, or ablation comparing isotropic vs. anisotropic noise are supplied, leaving open whether anisotropy or correlations invalidate the stationary-distribution predictions.

    Authors: The manuscript explicitly introduces the isotropic noise model as a simplified starting point for the theoretical derivation in §3. We agree that providing quantitative support for this assumption in the context of real networks would strengthen the extension to practical settings. In the revised version, we will add an analysis of the gradient noise covariance matrices from our experiments, including eigenvalue spectra and estimates of effective dimension. We will also include a brief discussion on how deviations from isotropy might affect the predictions, supported by these measurements. revision: yes

  2. Referee: [Experimental results] Experimental results section: reported alignment of stationary-entropy predictions with observations supplies no error bars, no controls for the noise-model choice, and no quantitative measure of fit (e.g., R² or KL divergence), so the strength of the empirical support cannot be assessed.

    Authors: We concur that the empirical validation would benefit from additional statistical rigor. We will revise the experimental results section to include error bars on the reported entropy values, add controls or sensitivity analyses for the noise model assumptions, and provide quantitative fit metrics such as R² or KL divergence between the predicted and observed stationary entropy behaviors. revision: yes
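
The fit metrics and error bars promised in this response could be computed along the following lines. This is a sketch under assumptions, not the authors' evaluation code: `r_squared` and `mean_and_stderr` are hypothetical helpers, and the inputs are taken to be paired lists of observed and predicted entropy values, plus repeated measurements across seeds.

```python
import math
import statistics


def r_squared(observed, predicted):
    """Coefficient of determination between observed values and model predictions."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def mean_and_stderr(values):
    """Mean and standard error of repeated measurements, e.g. across random seeds."""
    return statistics.mean(values), statistics.stdev(values) / math.sqrt(len(values))
```

Note that R² computed against a fixed theoretical prediction, rather than a fitted regression, can be negative when the prediction does worse than the mean of the observations.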

Circularity Check

0 steps flagged

No significant circularity; derivation remains self-contained

full rationale

The paper begins from an explicit modeling assumption (simplified isotropic noise for SGD), derives the ideal-gas correspondence and thermodynamic mappings through theoretical analysis of the stationary distribution, and then checks the resulting predictions (including stationary entropy scaling) against separate experimental measurements on actual networks. These steps do not reduce to redefinitions or post-hoc fits of the same quantities; the experimental alignment functions as an external check rather than an input that is renamed as output. No self-citation chain, ansatz smuggling, or uniqueness theorem imported from prior author work is invoked to close the central argument. The framework therefore contains independent theoretical content and is not circular by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework rests on an unproven equivalence between gradient noise and ideal-gas molecular collisions plus the assumption that scale invariance plus weight decay produces a stationary distribution whose entropy can be read off from thermodynamic variables. No independent evidence is supplied for these steps.

free parameters (1)
  • effective temperature
    Identified with learning-rate / weight-decay ratio; its value is chosen to match observed entropy scaling rather than derived from first principles.
axioms (1)
  • domain assumption: Gradient noise is isotropic and white.
    Invoked to obtain the Maxwell-Boltzmann stationary distribution; appears in the abstract's description of the simplified model.

pith-pipeline@v0.9.0 · 5482 in / 1399 out tokens · 23849 ms · 2026-05-17T23:28:24.037250+00:00 · methodology

