Thermodynamic Irreversibility of Training Algorithms

Adam Levine; Isaac Chuang; Liu Ziyin; Yuanjie Ren

arXiv: 2605.21933 · v1 · pith:4MMOCPCMnew · submitted 2026-05-21 · cond-mat.stat-mech · cs.AI · cs.LG

Pith reviewed 2026-05-22 04:34 UTC · model grok-4.3

classification: cond-mat.stat-mech · cs.AI · cs.LG

keywords: thermodynamic irreversibility · training algorithms · stochastic thermodynamics · entropy production · symmetry breaking · emergent force · machine learning dynamics

The pith

Training algorithms exhibit equivalent irreversibility measures that generate an emergent force preferring minimal-entropy trajectories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a framework for analyzing the irreversibility of training algorithms in AI. It proves that four distinct characterizations of irreversibility in dynamical processes are equivalent to leading order in the step size. This equivalence implies the existence of a time-reversal-symmetry-breaking emergent force. The force breaks non-isometric continuous reparametrization symmetries but preserves orthogonal symmetries, resulting in a preference for learning trajectories that minimize entropy production.

Core claim

Four different ways to characterize the irreversibility of dynamical processes are equivalent to leading order in the step size η: numerical backward error, time-renormalized correction, microscopic time reversal asymmetry, and the regularized stochastic-thermodynamic entropy production. The irreversibility induces a time-reversal-symmetry-breaking emergent force that generically breaks non-isometric continuous reparametrization symmetries, preserves orthogonal symmetries, and leads to a universal preference for learning trajectories that minimize the entropy production rate.

What carries the argument

Equivalence of four irreversibility measures to leading order in step size, which generates a time-reversal-symmetry-breaking emergent force in far-from-equilibrium training dynamics.

If this is right

The four irreversibility characterizations agree at small step sizes.
Non-isometric continuous reparametrization symmetries are broken by the emergent force.
Orthogonal symmetries are preserved.
Learning trajectories that minimize the entropy production rate are preferred.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The preference for minimal entropy production may act as an implicit bias explaining certain generalization behaviors in overparameterized models.
This framework could suggest new ways to regularize training by controlling entropy production rates.
Higher-order corrections beyond leading order in step size might become relevant for very large learning rates or discrete updates in practice.

Load-bearing premise

The training dynamics can be modeled as a far-from-equilibrium stochastic process whose irreversibility measures are well-defined and comparable at leading order in the discrete step size.

What would settle it

A computation of the four irreversibility quantities during training of a simple neural network, checking whether they agree only to leading order in step size or deviate at higher orders.

Figures

Figures reproduced from arXiv: 2605.21933 by Adam Levine, Isaac Chuang, Liu Ziyin, Yuanjie Ren.

**Figure 2.** Anomalous fluctuation effect in transformer training dynamics. There is a qualitative distinction between the effect ...

Abstract

The training algorithms for AI systems all introduce far-from-equilibrium dynamical processes, and understanding the irreversibility of these algorithms is a fundamental step towards understanding the learning dynamics of modern AI systems. In this work, we establish a general framework for defining and analyzing the irreversibility of training algorithms. We show that four different ways to characterize the irreversibility of dynamical processes are equivalent to leading order in the step size $\eta$: numerical backward error $\phi_{\rm DE}$, time-renormalized correction $\phi_{\rm TR}$, microscopic time reversal asymmetry $\phi_{\rm TA}$, and the (regularized) stochastic-thermodynamic entropy production $\phi_{\rm ST}$. The irreversibility gives rise to a time-reversal-symmetry-breaking emergent force that generically breaks non-isometric continuous reparametrization symmetries, preserves orthogonal symmetries, and leads to a universal preference for those learning trajectories that minimize the entropy production rate.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Four irreversibility measures agree to leading order in step size and yield an emergent force that breaks some symmetries while favoring low-entropy trajectories.

The main point is that this paper shows four characterizations of irreversibility in training steps are equivalent at order η and that this common value produces a time-reversal-symmetry-breaking force. The force breaks non-isometric reparametrization symmetries, keeps orthogonal ones intact, and selects trajectories that minimize entropy production rate. That equivalence and the resulting symmetry analysis are the concrete new pieces. They come from expanding the discrete update rule rather than from fitting or circular definitions, which is a clean technical step.

The framework ties numerical backward error, time renormalization, microscopic time asymmetry, and regularized stochastic entropy production into one picture, and the derivation stays grounded in the dynamical equations. This is useful for anyone who wants a precise way to connect discrete optimization to non-equilibrium thermodynamics. The symmetry claims give a testable distinction between which invariances survive training and which do not.

The main limitation is the leading-order restriction. For finite step sizes or stiff regions of the loss, higher-order terms could shift or weaken the emergent force, and the paper would be stronger with an explicit remainder bound or a numerical check under realistic hyperparameters. The regularization choice for the entropy production also needs to be spelled out so readers can see it does not alter the central equivalences.

This work is for theorists who already think about stochastic processes and symmetry in optimization. A reader who wants a mathematical bridge between irreversibility measures will find the equivalences worth examining. It deserves a serious referee because the core identities are falsifiable even if the broader force interpretation needs more support. I would send it to review.

Referee Report

1 major / 2 minor

Summary. The manuscript develops a framework for analyzing irreversibility in training algorithms for AI systems modeled as far-from-equilibrium stochastic processes. It establishes that four characterizations of irreversibility—numerical backward error ϕ_DE, time-renormalized correction ϕ_TR, microscopic time reversal asymmetry ϕ_TA, and regularized stochastic-thermodynamic entropy production ϕ_ST—are equivalent to leading order in the discrete step size η. From this equivalence the authors derive a time-reversal-symmetry-breaking emergent force that generically breaks non-isometric continuous reparametrization symmetries, preserves orthogonal symmetries, and selects trajectories minimizing the entropy production rate.

Significance. If the leading-order equivalences and the resulting emergent force are robust, the work supplies a thermodynamic interpretation of discrete optimization dynamics that could explain universal trajectory preferences and symmetry properties observed in neural network training. The explicit connection between algorithmic irreversibility measures and stochastic thermodynamics is a notable strength, particularly if accompanied by reproducible derivations or checks against standard training hyperparameters.

major comments (1)

[Abstract] Abstract and the derivation of the emergent force: the central claim equates the four irreversibility measures to O(η) and concludes that this produces a symmetry-breaking force selecting minimum-entropy-production trajectories. However, no explicit bound on the O(η²) remainder is supplied, nor is there a numerical demonstration that the leading term dominates the force direction for typical η ≳ 0.01 or in stiff loss landscapes. This leaves open whether higher-order corrections can alter the claimed symmetry-breaking conclusions under standard training conditions.

minor comments (2)

[Section 2] The regularization procedure for ϕ_ST is mentioned but its precise form and dependence on hyperparameters could be stated more explicitly to allow direct reproduction of the equivalence.
[Section 3] Notation for the continuous-time limit and the discrete-to-continuous mapping should be introduced with a short table or diagram to clarify how each ϕ is defined before the leading-order expansion.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading and constructive feedback. We address the single major comment below and have prepared revisions to strengthen the rigor of our leading-order claims.

Referee: [Abstract] Abstract and the derivation of the emergent force: the central claim equates the four irreversibility measures to O(η) and concludes that this produces a symmetry-breaking force selecting minimum-entropy-production trajectories. However, no explicit bound on the O(η²) remainder is supplied, nor is there a numerical demonstration that the leading term dominates the force direction for typical η ≳ 0.01 or in stiff loss landscapes. This leaves open whether higher-order corrections can alter the claimed symmetry-breaking conclusions under standard training conditions.

Authors: We agree that an explicit bound on the O(η²) remainder and supporting numerical checks would strengthen the manuscript. In the revised version we will add a perturbative analysis deriving a uniform O(η²) bound on the difference between the four irreversibility measures under standard Lipschitz and smoothness assumptions on the loss. We will also include numerical experiments for η in the range 0.001–0.05 across both convex quadratic losses and non-convex neural-network landscapes, confirming that the direction of the emergent force remains aligned with the leading-order prediction and is not overturned by higher-order terms. These additions will appear in a new subsection of Section 3 and in the supplementary material.

Revision: yes
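
A remainder check of the kind discussed here could be sketched as follows. This is a minimal illustration and not the authors' planned experiment: the vector field `U` (the gradient of an assumed quadratic-plus-quartic loss), the test point `theta0`, and the two step sizes are all hypothetical. The sketch takes one forward step and one backward step, θ̃ = θ − ηU(θ) + ηU(θ − ηU(θ)), and measures how far the displacement θ − θ̃ lies from its leading Taylor term η² J_U U. If the remainder is third order in η, halving η should shrink it by roughly a factor of 8.

```python
import numpy as np

# Hypothetical test field: U = grad E for E = 0.5 th^T A th + 0.25 * sum(th**4).
A = np.array([[2.0, 0.5], [0.5, 1.0]])

def U(theta):
    return A @ theta + theta**3

def jac_U(theta):
    # Jacobian of U; symmetric because U is a gradient field.
    return A + np.diag(3.0 * theta**2)

def remainder_norm(theta, eta):
    """Norm of the round-trip displacement minus its leading term eta^2 * J_U U."""
    theta_next = theta - eta * U(theta)            # forward step
    theta_back = theta_next + eta * U(theta_next)  # backward step, step-size sign flipped
    displacement = theta - theta_back
    leading = eta**2 * (jac_U(theta) @ U(theta))
    return np.linalg.norm(displacement - leading)

theta0 = np.array([0.8, -0.6])
ratio = remainder_norm(theta0, 0.02) / remainder_norm(theta0, 0.01)
print(f"remainder ratio when halving eta: {ratio:.2f}")
```

A ratio close to 8 is what a cubic remainder would produce at this point. The sketch says nothing about stiff landscapes or larger η, which is where the referee's concern applies.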

Circularity Check

0 steps flagged

Derivation of irreversibility equivalences is self-contained mathematical expansion

The paper establishes the equivalence of ϕ_DE, ϕ_TR, ϕ_TA and regularized ϕ_ST to leading order in η by direct expansion of the discrete training dynamics into continuous-time limits, without any fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations. The emergent force and symmetry-breaking statements are derived consequences of the time-reversal asymmetry already present in the stochastic update rule under the stated far-from-equilibrium modeling assumption. No step reduces to its own input by construction; the central claims remain independent of the target result.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on standard assumptions from stochastic thermodynamics and discrete dynamical systems; no new free parameters or invented entities are introduced in the abstract.

axioms (2)

domain assumption Training updates constitute a far-from-equilibrium Markov process whose continuous-time limit exists for small step size η.
Invoked to define the four irreversibility measures and their leading-order equivalence.
domain assumption The stochastic-thermodynamic entropy production is regularizable in a manner that preserves the equivalence to the other three measures.
Required for ϕ_ST to be comparable with the numerical and microscopic characterizations.

pith-pipeline@v0.9.0 · 5690 in / 1521 out tokens · 40747 ms · 2026-05-22T04:34:00.911805+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation · washburn_uniqueness_aczel · echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

ϕ_DE(θ) = (η/4) ∥U(θ)∥² ... ϕ_TA = (η/4) ∥U∥² + O(η²) ... lim τ→0 (lim σ²→0 σ² Σ) = 8η ϕ_ST(μ) + O(η³) ... Principle of Minimal Dissipation

IndisputableMonolith/Foundation/ArrowOfTime · entropy_monotone · echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

the system dynamics seeks those with the lowest entropy production rate ... emergent force ... minimizes the entropy production rate

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 4 internal anchors

[1] S. Mei, T. Misiakiewicz, and A. Montanari, arXiv preprint arXiv:1902.06015 (2019)
[2] J. Halverson, A. Maiti, and K. Stoner, Machine Learning: Science and Technology 2, 035002 (2021)
[3] G. Rotskoff and E. Vanden-Eijnden, Communications on Pure and Applied Mathematics 75, 1889 (2022)
[4] G. P. Coppola, M. Helias, and Z. Ringel, arXiv preprint arXiv:2510.25553 (2025)
[5] Z. Xie, I. Sato, and M. Sugiyama, arXiv preprint arXiv:2002.03495 (2020)
[6] I. Prigogine and R. Lefever, in Synergetics: Cooperative phenomena in multi-component systems (Springer, 1973) pp. 124–135
[7] U. Seifert, The European Physical Journal B 64, 423 (2008)
[8] J. O’Byrne, Y. Kafri, J. Tailleur, and F. van Wijland, Nature Reviews Physics 4, 167 (2022)
[9] For example, see the discussion in [26]
[10] D. P. Kingma and J. Ba, CoRR abs/1412.6980 (2014)
[11] T. Tieleman and G. Hinton, Lecture 6.5-RmsProp: Divide the gradient by a running average of its recent magnitude, COURSERA: Neural Networks for Machine Learning (2012)
[12] R. M. May, Nature 261, 459 (1976)
[13] E. Hairer, M. Hochbruck, A. Iserles, and C. Lubich, Oberwolfach Reports 3, 805 (2006)
[14] S. L. Smith, B. Dherin, D. G. Barrett, and S. De, arXiv preprint arXiv:2101.12176 (2021)
[15] D. G. Barrett and B. Dherin, arXiv preprint arXiv:2009.11162 (2020)
[16] L. Ziyin, H. Li, and M. Ueda, Physical Review E 111, 065303 (2025)
[17] K. G. Wilson, Physical Review B 4, 3174 (1971)
[18] L. Ziyin and M. Ueda, Physical Review Research 5, 013039 (2023)
[19] K. Liu, Z. Gong, and M. Ueda, arXiv preprint arXiv:1912.11797 (2019)
[20] S. Goldt and U. Seifert, Physical Review Letters 118, 010601 (2017)
[21] Y. Murashita, K. Funo, and M. Ueda, Physical Review E 90, 042110 (2014)
[22] K. Liu, L. Ziyin, and M. Ueda, in International Conference on Machine Learning (PMLR, 2021) pp. 7045–7056
[23] See [26] for a prior derivation of this result when specialized to GD
[24] L. Ziyin, Y. Xu, T. Poggio, and I. Chuang, arXiv preprint arXiv:2502.05300 (2025)
[25] S. Yaida, Fluctuation-dissipation relations for stochastic gradient descent, arXiv preprint arXiv:1810.00004 (2018)
[26] L. Ziyin, Y. Xu, and I. Chuang, NeurIPS (2025)
[27] Y. Xu, P. Beneventano, I. Chuang, and L. Ziyin, arXiv preprint arXiv:2602.05065 (2026)
[28] T. Poggio, R. Rifkin, S. Mukherjee, and P. Niyogi, Nature 428, 419 (2004)
[29] Q. Li, C. Tai, and W. E, Stochastic modified equations and dynamics of stochastic gradient algorithms I: Mathematical foundations (2018), arXiv:1811.01558 [cs.LG]. Appendix A: Notations and setup. Let $U(\theta)$ be a vector field representing the update direction. We denote the Jacobian of $U$ as $J_U$, where $[J_U]_{ij} = \partial_j U_i$. We assume that $U$ is sufficiently smooth (for...
[30] Time Reversal Asymmetry. Definition 1 (Time Reversal Asymmetry $\phi_{\rm TA}$). We quantify the microscopic time reversal asymmetry of the discrete update rule using the difference between the initial parameter state and the state recovered by sequentially applying the forward and backward dynamics: $2\eta \nabla \phi_{\rm TA}(\theta_t) := \theta_t - \tilde{\theta}_t$ (B8), where the recovered state is $\tilde{\theta}_t = \theta_t$ ...
[31] Symmetry for a Vector Field. In the context of optimization, a symmetry usually refers to the invariance of the underlying loss function. Since we are working directly with the vector field $U$ (which corresponds to the gradient of the loss when the update is $\theta_{k+1} = \theta_k - \eta U(\theta_k)$), we must define what “symmetry” means for $U$ directly. Definition 2 (Symmetry Conditi...
[32] Continuous Symmetry Breaking. We now derive the condition under which a continuous symmetry $K(\theta, \lambda)$ is preserved or broken by the effective potential $\phi_{\rm DE}$. Theorem 6 (Continuous Symmetry Breaking). Let $K(\theta, \lambda) = \theta + \lambda Q(\theta) + O(\lambda^2)$ be a continuous symmetry generated by $Q(\theta)$. If $U(\theta)^T \left(J_Q(\theta) + J_Q(\theta)^T\right) U(\theta) \neq 0$, then the entropic potential $\phi_{\rm DE}$ breaks the symmetry to fi...
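
To make the criterion in Theorem 6 concrete, the following sketch evaluates the quantity U^T (J_Q + J_Q^T) U for two linear generators Q(θ) = Gθ, for which J_Q = G. The two generators and the sample value of U are illustrative assumptions. The code only evaluates the indicator; it does not verify that either transformation is actually a symmetry of a particular U.

```python
import numpy as np

def breaking_indicator(U_val, J_Q):
    """Evaluate U^T (J_Q + J_Q^T) U, the quantity appearing in Theorem 6."""
    return float(U_val @ (J_Q + J_Q.T) @ U_val)

# Assumed sample value of the update field at some point theta.
U_val = np.array([1.5, -0.5])

# Assumed generators: uniform rescaling (G = I) and a planar rotation (antisymmetric G).
J_scale = np.eye(2)
J_rot = np.array([[0.0, -1.0], [1.0, 0.0]])

scale_indicator = breaking_indicator(U_val, J_scale)  # equals 2 * ||U||^2
rot_indicator = breaking_indicator(U_val, J_rot)      # J_Q + J_Q^T vanishes
print(scale_indicator, rot_indicator)
```

The rescaling generator gives 2∥U∥², which is nonzero whenever U ≠ 0, while the antisymmetric rotation generator gives exactly zero. That matches the stated pattern: non-isometric continuous symmetries satisfy the breaking condition, rotations do not.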
[33] Discrete Symmetry Preservation. We derive the preservation of discrete orthogonal symmetries. Theorem 7 (Discrete Symmetry Preservation). Let the transformation be $K(\theta) = O\theta$, where $O$ is an orthogonal matrix ($O^T O = I$). If $U$ is symmetric under $K$, then the effective potential $\phi_{\rm DE}$ is invariant under $K$. Proof. The Jacobian is $J_K(\theta) = O$. Using definition (F2), the cond...
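
The invariance in Theorem 7 can be checked numerically in a small case. The sketch below assumes that "symmetric under K" means the equivariance U(Oθ) = O U(θ); the paper's Definition 2 is truncated on this page, so this reading is an assumption. It also takes ϕ_DE(θ) = (η/4)∥U(θ)∥², the leading-order form quoted elsewhere on this page. The field `U` (gradient of the rotation-invariant loss ∥θ∥⁴/4), the angle, and the test point are hypothetical.

```python
import numpy as np

def U(theta):
    # Gradient of the assumed rotation-invariant loss E = 0.25 * ||theta||^4.
    return np.dot(theta, theta) * theta

def phi_de(theta, eta):
    # Leading-order potential (eta / 4) * ||U(theta)||^2.
    return 0.25 * eta * np.dot(U(theta), U(theta))

angle = 0.7
O = np.array([[np.cos(angle), -np.sin(angle)],
              [np.sin(angle),  np.cos(angle)]])  # orthogonal: O^T O = I

theta = np.array([0.8, -0.6])
eta = 1e-2

equivariant = bool(np.allclose(U(O @ theta), O @ U(theta)))
invariant = bool(np.isclose(phi_de(O @ theta, eta), phi_de(theta, eta)))
print(equivariant, invariant)
```

The check works because an orthogonal matrix preserves norms: ∥U(Oθ)∥ = ∥O U(θ)∥ = ∥U(θ)∥.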
[34] Discretization-error potential $\phi_{\rm DE}$. The discretization-error potential measures the mismatch between the discrete update and the continuous-time flow generated by the same vector field. Let $\Theta_{\rm disc}(\theta; \eta)$ denote one discrete Euler step: $\Theta_{\rm disc}(\theta; \eta) = \theta - \eta U(\theta)$ (G3). Let $\Theta_{\rm cont}(\theta; \eta)$ denote the time-one solution of the continuous-time ODE $\frac{d\vartheta}{d\tau} = -\eta U(\vartheta)$, $\vartheta(0) = \theta$, $\Theta$ ...
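
As a rough numerical illustration of this mismatch, the sketch below compares one Euler step with the exact time-one flow for a linear field U(θ) = Aθ, where the flow is exp(−ηA)θ. The matrix `A`, the point `theta`, and the helper names are illustrative assumptions, not values from the paper. The leading term −(η²/2)A²θ comes from Taylor-expanding the matrix exponential; how the paper normalizes this mismatch into ϕ_DE is not visible in the truncated text, so no normalization is applied.

```python
import numpy as np

def euler_step(theta, A, eta):
    """One discrete step theta - eta * U(theta) with U(theta) = A @ theta."""
    return theta - eta * (A @ theta)

def exact_flow(theta, A, eta):
    """Time-one solution of d(vartheta)/d(tau) = -eta * A @ vartheta, symmetric A."""
    evals, evecs = np.linalg.eigh(A)
    return evecs @ (np.exp(-eta * evals) * (evecs.T @ theta))

# Assumed test values.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
theta = np.array([1.0, -1.0])
eta = 1e-2

mismatch = euler_step(theta, A, eta) - exact_flow(theta, A, eta)
leading = -0.5 * eta**2 * (A @ A @ theta)
rel_err = np.linalg.norm(mismatch - leading) / np.linalg.norm(leading)
print(f"relative deviation from leading term: {rel_err:.2e}")
```

The relative deviation is of order η here, so it shrinks as the step size is reduced.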
[35] Time-renormalization potential $\phi_{\rm TR}$. The time-renormalization potential is measured by comparing one coarse step of size $\eta$ to two fine steps of size $\eta/2$. Define $\Theta_{\rm coarse}(\theta; \eta) = \theta - \eta U(\theta)$ (G10) and $\Theta_{\rm fine}(\theta; \eta) = \Theta_{\eta/2} \circ \Theta_{\eta/2}(\theta)$, $\Theta_{\eta/2}(\theta) = \theta - \frac{\eta}{2} U(\theta)$ (G11). The coarse-fine mismatch is $d_{\rm TR}(\theta; \eta) = \Theta_{\rm fine}(\theta; \eta) - \Theta_{\rm coarse}(\theta; \eta)$ (G12). Expanding the two fine steps gives $d_{\rm TR}(\theta$...
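
The coarse-fine comparison is simple to reproduce in code. The sketch below is a minimal version for an assumed linear field U(θ) = Aθ; the matrix, test point, and function names are hypothetical. A Taylor expansion of the second fine step gives a mismatch of (η²/4) J_U U at leading order, and for a linear field that expression is exact, which makes it a convenient sanity check.

```python
import numpy as np

def step(theta, U, eta):
    """One explicit update theta - eta * U(theta)."""
    return theta - eta * U(theta)

def coarse_fine_mismatch(theta, U, eta):
    """Two steps of size eta/2 minus one step of size eta."""
    coarse = step(theta, U, eta)
    fine = step(step(theta, U, eta / 2), U, eta / 2)
    return fine - coarse

# Assumed test values.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
U = lambda th: A @ th
theta = np.array([1.0, -1.0])
eta = 1e-2

d_tr = coarse_fine_mismatch(theta, U, eta)
predicted = 0.25 * eta**2 * (A @ A @ theta)  # (eta^2 / 4) * J_U U with J_U = A
print(d_tr, predicted)
```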
[36] Microscopic time-asymmetry potential $\phi_{\rm TA}$. The microscopic time-asymmetry potential is measured by a forward–backward round trip. Starting at $\theta_t$, take one forward step and then one backward step with the sign of the step size reversed: $\theta_{t+1} = \theta_t - \eta U(\theta_t)$, $\tilde{\theta}_t = \theta_t - \eta U(\theta_t) + \eta U(\theta_{t+1})$ (G17). If the dynamics were microscopically reversible at this step siz...
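
The round trip can be turned into a gradient estimate and compared with the leading-order potential. The sketch below reads ∇ϕ_TA off the displacement as (θ_t − θ̃_t)/(2η), following Definition 1, and compares it with a finite-difference gradient of (η/4)∥U∥². The field `U` is the gradient of a hypothetical quadratic-plus-quartic loss, so its Jacobian is symmetric. All numerical values and helper names are assumptions made for illustration.

```python
import numpy as np

A = np.array([[2.0, 0.5], [0.5, 1.0]])  # assumed symmetric matrix

def U(theta):
    # Gradient of the assumed loss E = 0.5 th^T A th + 0.25 * sum(th**4).
    return A @ theta + theta**3

def round_trip_gradient(theta, eta):
    """(theta - theta_tilde) / (2 * eta) after one forward and one backward step."""
    theta_next = theta - eta * U(theta)
    theta_tilde = theta - eta * U(theta) + eta * U(theta_next)
    return (theta - theta_tilde) / (2.0 * eta)

def phi_ref(theta, eta):
    # Leading-order potential (eta / 4) * ||U(theta)||^2.
    return 0.25 * eta * np.dot(U(theta), U(theta))

def numerical_grad(f, theta, h=1e-6):
    """Central finite-difference gradient of a scalar function f."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = h
        grad[i] = (f(theta + e) - f(theta - e)) / (2.0 * h)
    return grad

theta = np.array([0.8, -0.6])
eta = 1e-3
grad_ta = round_trip_gradient(theta, eta)
grad_ref = numerical_grad(lambda th: phi_ref(th, eta), theta)
rel_gap = np.linalg.norm(grad_ta - grad_ref) / np.linalg.norm(grad_ref)
print(f"relative gap: {rel_gap:.2e}")
```

The two gradients agree up to a relative gap of order η. If the Jacobian of U were not symmetric, the round trip would give (η/2) J_U U while the gradient of (η/4)∥U∥² is (η/2) J_U^T U, and the two would differ already at leading order.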
[37] Stochastic-thermodynamic potential $\phi_{\rm ST}$. The stochastic-thermodynamic potential is measured using a regularized trajectory-probability ratio. We introduce a virtual Gaussian transition kernel $p_\sigma(\theta' \mid \theta) \propto \exp\left(-\frac{\|\theta' - \theta + \eta U(\theta)\|^2}{2\sigma^2}\right)$ (G27), where $\sigma^2$ is a small virtual noise variance. Given the deterministic update $\theta^+ = \theta - \eta U(\theta)$ (G28), we estimate the one-step bath ...
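
One plausible reading of this measurement is sketched below. It assumes that the regularized entropy production Σ is the log-ratio of the forward kernel density for θ → θ⁺ to the reverse kernel density for θ⁺ → θ; the paper's exact estimator is truncated on this page, so that definition is an assumption, as are the linear field and all numerical values. Both kernels share the same variance, so their normalization constants cancel and σ²Σ does not depend on σ². The result is compared with 2η²∥U∥², which is what the relation σ²Σ = 8ηϕ_ST + O(η³) quoted on this page gives when ϕ_ST = (η/4)∥U∥².

```python
import numpy as np

A = np.array([[2.0, 0.5], [0.5, 1.0]])  # assumed symmetric positive definite matrix

def U(theta):
    return A @ theta

def log_kernel(theta_to, theta_from, eta, sigma2):
    """Unnormalized log density of the virtual Gaussian kernel (G27)."""
    residual = theta_to - theta_from + eta * U(theta_from)
    return -np.dot(residual, residual) / (2.0 * sigma2)

def scaled_log_ratio(theta, eta, sigma2):
    """sigma^2 times the forward-to-reverse log-probability ratio for one step."""
    theta_plus = theta - eta * U(theta)  # deterministic update (G28)
    forward = log_kernel(theta_plus, theta, eta, sigma2)
    backward = log_kernel(theta, theta_plus, eta, sigma2)
    return sigma2 * (forward - backward)

theta = np.array([1.0, -1.0])
eta, sigma2 = 1e-3, 1e-6
measured = scaled_log_ratio(theta, eta, sigma2)
leading = 2.0 * eta**2 * np.dot(U(theta), U(theta))  # 8 * eta * (eta / 4) * ||U||^2
rel_gap = abs(measured - leading) / leading
print(f"relative gap: {rel_gap:.2e}")
```

Under this reading the measured value is (η²/2)∥U(θ) + U(θ⁺)∥², which differs from 2η²∥U∥² only at order η³.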
[38] Expected leading-order agreement. The operational measurements above are designed so that, under the smoothness and symmetric-Jacobian assumptions, $\hat{\phi}_{\rm DE}(\theta) \approx \hat{\phi}_{\rm TR}(\theta) \approx \hat{\phi}_{\rm TA}(\theta) \approx \hat{\phi}_{\rm ST}(\theta) \approx \frac{\eta}{4} \|U(\theta)\|^2$ (G33). Here $\hat{\phi}_{\rm TA}$ denotes the normalized quantity defined in Eq. (G25). Therefore, sweeping over $\eta$ at a fixed point $\theta$ should reveal an approximately linear scaling...
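
A step-size sweep of this kind might look like the sketch below. The normalized quantities of Eq. (G25) are not reproduced on this page, so the sketch tracks a stand-in: the norm of the gradient implied by the forward-backward round trip, (θ − θ̃)/(2η), for an assumed linear field U(θ) = Aθ. The matrix, the fixed point, and the η grid are hypothetical. A fitted log-log slope near 1 indicates linear scaling in η.

```python
import numpy as np

A = np.array([[2.0, 0.5], [0.5, 1.0]])  # assumed positive definite matrix
theta = np.array([1.0, -1.0])           # assumed fixed evaluation point

def round_trip_gradient_norm(eta):
    """Norm of (theta - theta_tilde) / (2 * eta) for U(theta) = A @ theta."""
    theta_next = theta - eta * (A @ theta)
    theta_tilde = theta_next + eta * (A @ theta_next)
    return np.linalg.norm((theta - theta_tilde) / (2.0 * eta))

etas = np.logspace(-4, -1, 7)
norms = np.array([round_trip_gradient_norm(eta) for eta in etas])

# Fit log(norm) = slope * log(eta) + intercept.
slope, intercept = np.polyfit(np.log(etas), np.log(norms), 1)
print(f"fitted log-log slope: {slope:.4f}")
```

For a linear field the scaling is exactly linear, so this case mainly validates the measurement code. Departures from slope 1 would only show up for nonlinear fields at larger η.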
[39] Quadratic Test Problem. We now describe the specific test problem used to validate the operational measurements. We consider a quadratic potential $E(\theta) = \frac{1}{2} \theta^\top A \theta$ (G34), where $A \in \mathbb{R}^{d \times d}$ is positive definite. The update vector field is $U(\theta) = \nabla E(\theta) = A\theta$ (G35). The discrete dynamics are therefore $\theta_{k+1} = \theta_k - \eta A \theta_k$ (G36). This problem is useful because $J_U = A$ is constan...
[40] Transformer experiment: tracking the (normalized) Adam update energy under learning-rate schedules. We analyze how the magnitude of Adam parameter updates evolves during training for a small decoder-only Transformer on an algorithmic task. The model is a 2-layer causal Transformer (GPT-style) with $d_{\rm model} = 128$, $n_{\rm head} = 4$ attention heads, and a feedforward dimension of 512, and...
[41] RNN experiment: tracking the (normalized) Adam update energy under learning-rate schedules. We study how the magnitude of parameter updates produced by Adam evolves during training for a recurrent sequence model. Our model is a gated recurrent unit (GRU) language model with $L = 2$ recurrent layers and hidden size $h = 256$. Inputs are embedded into $\mathbb{R}^d$ wi...
[42] Perceptron (linear regression) experiment: tracking the (normalized) Adam update energy under learning-rate schedules. To provide a convex baseline, we repeat the same update-tracking procedure on a single-layer perceptron trained by mean-squared error (i.e. linear regression). The model is $f_\theta(x) = w^\top x$ with parameters $w \in \mathbb{R}^d$ (no bias), where $d =$...
[43] We report the learning-rate-normalized quantity $\tilde{U}_s = U_s / \eta_s$ to facilitate comparisons across schedules. Repetitions and visualization. For each schedule, we perform $R = 3$ runs with different random seeds (affecting initialization and minibatch sampling, and the synthetic dataset generation) and store each $\{\tilde{U}_s\}_{s=1}^{S}$ trajectory as a NumPy array, whi...
