Can Stationary Distributions of Scale-Invariant Neural Networks Be Described by the Thermodynamics of an Ideal Gas?
Pith reviewed 2026-05-17 23:28 UTC · model grok-4.3
The pith
Stationary distributions of SGD for scale-invariant networks correspond to ideal gas thermodynamics.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We show that the stationary distribution of SGD with weight decay for scale-invariant neural networks can be described by the thermodynamics of an ideal gas. Training hyperparameters such as learning rate and weight decay correspond to thermodynamic variables such as temperature, pressure, and volume. Starting from a simplified isotropic noise model, we find a close correspondence between SGD dynamics and ideal gas behavior, validated by theory and simulation. In neural network training, key predictions, including the behavior of the stationary entropy, align closely with experiments.
What carries the argument
The thermodynamic framework that equates SGD dynamics under isotropic noise to ideal gas behavior, with learning rate as temperature and weight decay related to pressure.
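A minimal sketch of what an isotropic noise model for a scale-invariant parameter vector could look like. Everything here is an illustrative assumption, not the paper's code: the loss is flat (zero mean gradient), the stochastic gradient is Gaussian noise projected onto the tangent space at w and scaled by 1/||w||, and the names `simulate_radius` and `predicted_radius` are hypothetical. The predicted radius is the fixed point of E||w||^2 under the update written in the sketch.

```python
import math
import random


def simulate_radius(dim=200, lr=0.1, wd=0.01, sigma=1.0, steps=3000, seed=0):
    """Run SGD with weight decay under a toy isotropic noise model.

    Assumptions (hypothetical, for illustration): zero mean gradient, and a
    stochastic gradient equal to Gaussian noise projected orthogonally to w
    and divided by ||w||, as scale invariance requires of gradients.
    Returns the final parameter norm ||w||.
    """
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for _ in range(steps):
        r2 = sum(x * x for x in w)
        r = math.sqrt(r2)
        xi = [rng.gauss(0.0, sigma) for _ in range(dim)]
        # Remove the radial component so the gradient is orthogonal to w.
        radial = sum(a * b for a, b in zip(xi, w)) / r2
        g = [(a - radial * b) / r for a, b in zip(xi, w)]
        # Plain SGD step with L2 weight decay.
        w = [(1.0 - lr * wd) * b - lr * a for a, b in zip(g, w)]
    return math.sqrt(sum(x * x for x in w))


def predicted_radius(dim, lr, wd, sigma):
    """Fixed point of E||w||^2 for the update used in simulate_radius."""
    return (lr * sigma ** 2 * (dim - 1) / (wd * (2.0 - lr * wd))) ** 0.25


if __name__ == "__main__":
    print("simulated radius:", simulate_radius())
    print("predicted radius:", predicted_radius(200, 0.1, 0.01, 1.0))
```

With the default arguments the predicted radius is about 5.6. In this toy setting the equilibrium norm, and with it the effective step size lr / ||w||^2, is set jointly by the learning rate and the weight decay, which is the kind of dependence a hyperparameter-to-thermodynamics dictionary has to encode.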
If this is right
- The behavior of stationary entropy can be predicted from the thermodynamic variables.
- Training hyperparameters can be interpreted and adjusted using gas law analogies.
- This provides a foundation for interpreting training dynamics in a physics-based manner.
- Future designs of learning rate schedulers may draw from thermodynamic principles.
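For orientation, these are the textbook relations for a monatomic ideal gas that any gas law analogy would draw on. How learning rate and weight decay enter as temperature, pressure, or volume is defined in the paper and is not reproduced on this page, so no mapping is assumed here; the symbols below are the standard thermodynamic ones.

```latex
% Standard monatomic ideal gas relations (textbook thermodynamics).
% N: particle count, k_B: Boltzmann constant, S_0: additive constant.
% Requires the amsmath package for the align environment.
\begin{align}
  P V &= N k_B T, \\
  S(T, V) &= N k_B \left[ \ln\frac{V}{N} + \frac{3}{2} \ln T \right] + S_0 .
\end{align}
```

The second line shows why entropy is a natural test quantity: it depends logarithmically on both volume and temperature, so an analogous stationary entropy would be expected to vary linearly in the logarithms of the corresponding hyperparameters.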
Where Pith is reading between the lines
- If the analogy holds more broadly, it could apply to understanding convergence in other optimization settings.
- The framework might connect to existing work on noise in gradient descent by providing a physical interpretation.
- Testing the predictions on different network scales could reveal limits of the ideal gas approximation.
Load-bearing premise
The simplified isotropic noise model adequately represents the gradient noise present in actual deep network training.
What would settle it
Measuring the dependence of stationary entropy on learning rate and weight decay in trained scale-invariant networks and finding it does not match the ideal gas predictions would falsify the central claim.
Original abstract
Understanding the training dynamics of deep neural networks remains a major open problem, with physics-inspired approaches offering promising insights. Building on this perspective, we develop a thermodynamic framework to describe the stationary distributions of stochastic gradient descent (SGD) with weight decay for scale-invariant neural networks, a setting that both reflects practical architectures with normalization layers and permits theoretical analysis. We establish analogies between training hyperparameters (e.g., learning rate, weight decay) and thermodynamic variables such as temperature, pressure, and volume. Starting with a simplified isotropic noise model, we uncover a close correspondence between SGD dynamics and ideal gas behavior, validated through theory and simulation. Extending to training of neural networks, we show that key predictions of the framework, including the behavior of stationary entropy, align closely with experimental observations. This framework provides a principled foundation for interpreting training dynamics and may guide future work on hyperparameter tuning and the design of learning rate schedulers.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops a thermodynamic framework for stationary distributions of SGD with weight decay in scale-invariant neural networks by analogizing them to ideal-gas thermodynamics. Starting from a simplified isotropic noise model, it maps hyperparameters (learning rate to temperature, weight decay to pressure) to thermodynamic variables, derives ideal-gas-like stationary distributions, validates via theory and simulation, and reports that predictions for stationary entropy align with experiments on neural network training.
Significance. If the analogy and mappings hold, the work supplies a principled physics-based lens on training dynamics with potential to guide hyperparameter tuning and learning-rate schedulers. Strengths include explicit theory-plus-simulation validation of the toy isotropic model and falsifiable predictions for entropy behavior that are checked against experiments; these elements provide concrete, testable content rather than purely qualitative analogy.
major comments (2)
- [§3] §3 (isotropic noise model derivation): the central extension from the toy model to real scale-invariant networks rests on an unquantified isotropy assumption for gradient noise; no covariance eigenvalue spectra, effective-dimension estimates, or ablation comparing isotropic vs. anisotropic noise are supplied, leaving open whether anisotropy or correlations invalidate the stationary-distribution predictions.
- [Experimental results] Experimental results section: reported alignment of stationary-entropy predictions with observations supplies no error bars, no controls for the noise-model choice, and no quantitative measure of fit (e.g., R² or KL divergence), so the strength of the empirical support cannot be assessed.
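One way the isotropy concern in the first major comment could be quantified is with a participation ratio of the gradient-noise covariance. This is a hedged sketch: the participation ratio is a common effective-dimension measure chosen here for illustration, not a method the paper or the referee specifies, and the function names are hypothetical. For perfectly isotropic noise it equals the full dimension; for noise concentrated in one direction it equals 1.

```python
def sample_covariance(grads):
    """Unbiased sample covariance of gradient vectors given as a list of lists."""
    n, d = len(grads), len(grads[0])
    mean = [sum(g[j] for g in grads) / n for j in range(d)]
    centered = [[g[j] - mean[j] for j in range(d)] for g in grads]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def participation_ratio(cov):
    """Effective dimension (Tr C)^2 / Tr(C^2) of a symmetric matrix C."""
    trace = sum(cov[i][i] for i in range(len(cov)))
    # For a symmetric matrix, Tr(C^2) equals the sum of squared entries.
    trace_sq = sum(c * c for row in cov for c in row)
    return trace * trace / trace_sq


if __name__ == "__main__":
    isotropic = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    rank_one = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    print(participation_ratio(isotropic), participation_ratio(rank_one))
```

The ratio needs only the trace and the Frobenius norm, so no eigendecomposition is required. A full eigenvalue spectrum, as the referee requests, carries more information and would need a linear algebra library.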
minor comments (2)
- [§2] Notation for the thermodynamic mappings (learning rate ↔ temperature, etc.) is introduced without an explicit table or equation summarizing the full dictionary of correspondences.
- [Abstract] The abstract states that predictions 'align closely' but does not name the precise entropy estimator or the range of architectures and datasets used.
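The review does not say which entropy estimator the paper uses. As an illustration only, one common sample-based choice is the Kozachenko-Leonenko nearest-neighbour estimator, sketched below with a brute-force neighbour search. Entry [14] of the reference graph quotes a constant C(N, d) with the same form as the additive term in this estimator, which suggests, but does not confirm, that the paper uses an estimator of this family. The function name `kl_entropy` is hypothetical.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329


def kl_entropy(points):
    """Kozachenko-Leonenko estimate of differential entropy in nats.

    points: list of equal-length coordinate tuples with no exact duplicates.
    Uses an O(N^2) brute-force nearest-neighbour search for clarity.
    """
    n, d = len(points), len(points[0])
    log_dist_sum = 0.0
    for i, p in enumerate(points):
        nearest = min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        log_dist_sum += math.log(nearest)
    # Log-volume of the d-dimensional unit ball: pi^(d/2) / Gamma(d/2 + 1).
    log_unit_ball = 0.5 * d * math.log(math.pi) - math.lgamma(0.5 * d + 1.0)
    return d * log_dist_sum / n + log_unit_ball + math.log(n - 1) + EULER_GAMMA


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(600)]
    print("estimate:", kl_entropy(samples))
    print("exact 2D standard Gaussian:", math.log(2.0 * math.pi * math.e))
```

The example compares the estimate with the known entropy of a two-dimensional standard Gaussian, log(2πe). For high-dimensional weight vectors a tree-based or approximate neighbour search would be needed in place of the brute-force loop.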
Simulated Author's Rebuttal
We are grateful to the referee for their insightful review. The comments highlight important aspects that we will clarify and expand upon in the revision. We respond to each major comment in turn.
Point-by-point responses
- Referee: [§3] §3 (isotropic noise model derivation): the central extension from the toy model to real scale-invariant networks rests on an unquantified isotropy assumption for gradient noise; no covariance eigenvalue spectra, effective-dimension estimates, or ablation comparing isotropic vs. anisotropic noise are supplied, leaving open whether anisotropy or correlations invalidate the stationary-distribution predictions.
  Authors: The manuscript explicitly introduces the isotropic noise model as a simplified starting point for the theoretical derivation in §3. We agree that providing quantitative support for this assumption in the context of real networks would strengthen the extension to practical settings. In the revised version, we will add an analysis of the gradient noise covariance matrices from our experiments, including eigenvalue spectra and estimates of effective dimension. We will also include a brief discussion on how deviations from isotropy might affect the predictions, supported by these measurements.
  Revision: yes
- Referee: [Experimental results] Experimental results section: reported alignment of stationary-entropy predictions with observations supplies no error bars, no controls for the noise-model choice, and no quantitative measure of fit (e.g., R² or KL divergence), so the strength of the empirical support cannot be assessed.
  Authors: We concur that the empirical validation would benefit from additional statistical rigor. We will revise the experimental results section to include error bars on the reported entropy values, add controls or sensitivity analyses for the noise model assumptions, and provide quantitative fit metrics such as R² or KL divergence between the predicted and observed stationary entropy behaviors.
  Revision: yes
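The fit metrics named in the second exchange can be sketched as follows. This is an illustrative helper, not the authors' analysis code: `r_squared` is the usual coefficient of determination, `max_relative_error` mirrors the maximum relative error quoted in reference-graph entry [18], and the numbers in the usage example are hypothetical placeholders rather than results from the paper.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def max_relative_error(observed, predicted):
    """Largest |observed - predicted| / |observed| over all points."""
    return max(abs(o - p) / abs(o) for o, p in zip(observed, predicted))


if __name__ == "__main__":
    # Hypothetical entropy values, for illustration only.
    observed = [10.0, 12.0, 14.0, 16.0]
    predicted = [10.5, 11.8, 14.1, 15.7]
    print("R^2:", r_squared(observed, predicted))
    print("max relative error:", max_relative_error(observed, predicted))
```

R² summarizes how much of the variation across hyperparameter settings the prediction explains, while the maximum relative error exposes the worst single setting; reporting both avoids a good average hiding a poor outlier.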
Circularity Check
No significant circularity; derivation remains self-contained
full rationale
The paper begins from an explicit modeling assumption (simplified isotropic noise for SGD), derives the ideal-gas correspondence and thermodynamic mappings through theoretical analysis of the stationary distribution, and then checks the resulting predictions (including stationary entropy scaling) against separate experimental measurements on actual networks. These steps do not reduce to redefinitions or post-hoc fits of the same quantities; the experimental alignment functions as an external check rather than an input that is renamed as output. No self-citation chain, ansatz smuggling, or uniqueness theorem imported from prior author work is invoked to close the central argument. The framework therefore contains independent theoretical content and is not circular by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- effective temperature
axioms (1)
- domain assumption: Gradient noise is isotropic and white.
Reference graph
Works this paper leans on
- [1] Alemi, A. A. and Fischer, I. (2018). TherML: Thermodynamics of machine learning. Ali Mehmeti-Göpel, C. H. and Wand, M. (2024). On the weight dynamics of deep normalized networks. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 992–1007. PMLR. Arora, S., Li, Z., and ...
- [2] Chaudhari, P., Choromanska, A., Soatto, S., LeCun, Y., Baldassi, C., Borgs, C., Chayes, J., Sagun, L., and Zecchina, R. (2017). Entropy-SGD: Biasing gradient descent into wide valleys. In International Conference on Learning Representations. Chaudhari, P. and Soatto, S. (2018). Stochastic gradient descent performs variational inference, converges Training...
- [3] Three Factors Influencing Minima in SGD
  Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, pages 448–456. PMLR. Izmailov, P., Wilson, A., Podoprikhin, D., Vetrov, D., and Garipov, T. (2018). Averaging weights leads to wider optima and better generalization. In 34th Confe...
- [4] Le Ny, A. (2008). Introduction to (generalized) Gibbs measures. Ensaios Matemáticos, 15(1-126). Li, Z. and Arora, S. (2020). An exponential learning rate schedule for deep learning. In International Conference on Learning Representations. Li, Z., Lyu, K., and Arora, S. (2020). Reconciling modern deep learning with traditional optimization analyses: The in...
- [5] Liu, Z., Liu, Y., Gore, J., and Tegmark, M. (2025). Neural thermodynamic laws for large language model training. Lobacheva, E., Kodryan, M., Chirkova, N., Malinin, A., and Vetrov, D. P. (2021). On the periodic behavior of neural network training with batch normalization and weight decay. In Advances in Neural Information Processing Systems. Loshchilov,...
- [6] L2 Regularization versus Batch and Weight Normalization
  Sclocchi, A. and Wyart, M. (2024). On the different regimes of stochastic gradient descent. Proceedings of the National Academy of Sciences, 121(9):e2316301121. Smith, S. and Le, Q. V. (2018). A Bayesian perspective on generalization and stochastic gradient descent. In International Conference on Learning Representations. Tishby, N. and Zaslavsky, N. ...
- [7] (i.e., switching between stochastic and full-batch optimization) and convergence toward minima of varying sharpness (Smith and Le, 2018). Chaudhari and Soatto (2018) analyze the stationary Gibbs distribution ρ_w(w) ∝ exp(−Φ(w)/T) and show that the potential Φ(w) equals the training loss L(w) if and only if the stochastic gradient noise is isotropic. The stationary...
- [8] or input features in graph neural networks (Michela et al., 2025). Another approach is to derive temperature directly
  Table 2: Notations used throughout the paper. Left column shows quantities from optimization, right column presents analogous variables from thermodynamics (if applicable).
  Optimization | Thermodynamics
  Weight vector and microstates weight v...
- [9] and LayerNorm (Ba et al., 2016), are indispensable in modern neural architectures. They smooth the loss landscape (Santurkar et al.,
- [10] and more stable (Bjorck et al., 2018). Beyond these benefits, normalization layers induce scale invariance in network parameters, fundamentally altering their optimization dynamics. Arora et al. (2019) show that BatchNorm implicitly tunes the learning rate, while Hoffer et al. (2018) demonstrate that the weight direction evolves according to an effective ...
- [11] Van Laarhoven (2017) argues that, in scale-invariant networks, weight decay does not serve as a regularizer but instead controls the learning rate through the parameter norm, a phenomenon also confirmed by Zhang et al. (2019) and Li and Arora (2020). Another line of research examines the equilibrium behavior of scale-invariant networks. Wan et al. (2021) establish condi...
- [12] and Lion (Chen et al., 2023).
  Our framework. In this work, we extend the previously established analogies between SGD dynamics and thermodynamics. Whereas prior studies primarily focused on quantities such as energy, entropy, and temperature, we demonstrate that the optimization of scale-invariant networks naturally gives rise to a richer thermodynamic framework, one...
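- [13] This component is required to preserve the norm $\|W_t\|_2 = 1$ in the Itô formulation
  Therefore, the SDE for the radius is
  $$dr_t = \Big( \cancelto{0}{\frac{\partial r_t}{\partial t}} - \frac{\eta_{\mathrm{eff}} \|W_t\|^2}{\|W_t\|} \cancelto{0}{W_t^T \nabla L(W_t)} - \frac{\eta_{\mathrm{eff}} \|W_t\|^2}{\|W_t\|} \lambda W_t^T W_t + \frac{\eta_{\mathrm{eff}}^2 \|W_t\|}{2} \operatorname{Tr} \Sigma_{W_t} \Big) dt + \frac{\eta_{\mathrm{eff}} \|W_t\|^2}{\|W_t\|} \cancelto{0}{\big(\Sigma_{W_t}^{1/2} W_t\big)^T} dB_t = \Big( -\eta_{\mathrm{eff}} \lambda r_t^3 + \frac{\eta_{\mathrm{eff}}^2 r_t}{2} \operatorname{Tr} \Sigma_{W_t} \Big) dt. \quad (48)$$
  Now, the derivatives of $\bar{x}$ (here $\delta$ denotes the Kronecker delta):
  $$\frac{\partial \bar{x}}{\partial t} = 0, \quad \frac{\partial \bar{x}_k}{\partial x_i} = \frac{1}{\|x\|} (P_x)_{ik} = \frac{\delta_{ik}}{\|x\|} - \frac{x_i x_k}{\|x\|^3}, \quad \frac{\partial}{\partial x_j} \frac{\delta_{ik}}{\|x\|} = -\frac{\delta_{ik} x_j}{\|x\|^3} ...$$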
- [14] is
  $$C(N, d) = \log(N-1) - \log \Gamma\left(\frac{d}{2} + 1\right) + \frac{d}{2} \log \pi + \gamma, \quad (101)$$
  where $\Gamma$ denotes the gamma function, and $\gamma \approx 0.577$ is the Euler constant.
  E ADDITIONAL RESULTS FOR ISOTROPIC NOISE MODEL
  Statistics of VMF distribution. For consistency with the existing sources, we derive the statistics of the VMF distribution for inverse temperature κ = 1/T and then we rewrite them in terms of ...
- [15] and a ConvNet with four convolutional layers (adapted from Kodryan et al. (2022)), on the CIFAR-10 (Krizhevsky et al., 2009a) and CIFAR-100 (Krizhevsky et al., 2009b) datasets. Both models are made fully scale-invariant by inserting a BatchNorm layer without affine parameters after each convolutional layer. The final linear layer is kept fixed with its wei...
- [16] We use a batch size of B = 128 across all experiments, sampling batches independently at each iteration; thus, there is no notion of epochs. We apply no data augmentations other than channel-wise normalization. For CIFAR-10, we use mean (0.4914, 0.4822, 0.4465) and standard deviation (0.2023, 0.1994, 0.2010); for CIFAR-100, we use mean (0.5071, 0.4867, 0.4408) and sta...
- [17] These cover all four architecture-dataset pairs and three training protocols. Overall, these experiments confirm the results presented in the main text. First, the variance of stochastic gradients σ depends solely on η_eff (in the fixed ELR and fixed sphere settings) and on the product ηλ (in the fixed LR setting). Second, the temperature T generally increases w...
- [18] For most training setups, the maximum relative error remains low (below 10%). Two notable exceptions occur in the ConvNet experiments on CIFAR-10 and CIFAR-100 with a fixed LR, where the discrepancies increase to 17.6% and 23.3%, respectively. These higher errors appe...
  Footnote 5: In the fixed ELR case, we similarly approximate the entropy as a function of log η_eff and log λ.
- [19] Results. Figure 13 shows the learning curves for training loss, parameter radius, and gradient-related metrics (the squared norm of the full-batch gradient, ‖∇L(w)‖², and the trace of the covariance matrix, Tr Σ_w). We observe that all four metrics stabilize for the three largest ELRs, indicating the onset of stationary behavior. In contrast, for smaller ELRs,...