The Thermodynamic Costs of Simple Linear Regression

Anant Sahai; Michael R. DeWeese; Samuel H. D'Ambrosia; Sultan M. Daniels

arxiv: 2605.19195 · v1 · pith:7MUC62TOnew · submitted 2026-05-18 · ❄️ cond-mat.stat-mech · cs.IT · math.IT · stat.ML

Pith reviewed 2026-05-20 06:59 UTC · model grok-4.3

keywords: thermodynamic costs · linear regression · Landauer's principle · energy scaling laws · stochastic gradient descent · generalization error · floating-point arithmetic · entropy production

The pith

Floating-point linear regression carries a thermodynamic lower bound on energy that determines the optimal training dataset size for a given prediction accuracy target.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper applies Landauer's principle to estimate the minimum energy dissipated when irreversibly computing simple linear regression on floating-point numbers, considering both exact solutions and stochastic gradient descent implementations. These energy estimates are combined with the generalization error that improves as more training data is used, producing scaling laws for the dataset size that minimizes total energy when inference must meet a fixed accuracy demand. The authors also outline a way to lower-bound the extra entropy produced when continuous-valued inputs create mismatches with the algorithm's discrete operations.
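The two implementations named above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes a zero-intercept model y = w·x + noise with Gaussian inputs and noise, and the helper names (`make_data`, `exact_fit`, `sgd_fit`), the starting weight `w0 = 0.0`, and the seeds are hypothetical. The learning rate 0.01 and batch size 10 follow the values quoted in the figure captions.

```python
import random


def make_data(n, w_true=1.0, sigma_x=1.0, sigma_xi=1.0, seed=0):
    """Draw n samples from y = w_true * x + noise (zero intercept)."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, sigma_x) for _ in range(n)]
    ys = [w_true * x + rng.gauss(0.0, sigma_xi) for x in xs]
    return xs, ys


def exact_fit(xs, ys):
    """Closed-form least-squares slope: w_hat = (x . y) / (x . x)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / sxx


def sgd_fit(xs, ys, eta=0.01, batch=10, steps=2000, w0=0.0, seed=1):
    """Mini-batch SGD on the squared loss 0.5 * (w * x - y) ** 2."""
    rng = random.Random(seed)
    n = len(xs)
    w = w0  # hypothetical starting point, chosen for illustration
    for _ in range(steps):
        idx = [rng.randrange(n) for _ in range(batch)]
        # Average gradient of the loss over the sampled batch.
        grad = sum((w * xs[i] - ys[i]) * xs[i] for i in idx) / batch
        w -= eta * grad
    return w


if __name__ == "__main__":
    xs, ys = make_data(1000)
    print("exact:", exact_fit(xs, ys))
    print("sgd:  ", sgd_fit(xs, ys))
```

Both estimators should land near the true slope; the SGD estimate keeps fluctuating around the exact one because each step sees only a small random batch.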

Core claim

By counting the irreversible bit erasures that occur in the floating-point arithmetic steps of linear regression, the authors derive a concrete lower bound on dissipated energy that increases with both dataset size and numerical precision. When this cost is weighed against the reduction in generalization error that larger datasets provide, an energy-optimal finite dataset size emerges for any required inference accuracy. The same counting approach is applied to stochastic gradient descent, yielding a distinct but related scaling relation.

What carries the argument

Landauer's principle applied to irreversible bit erasures in floating-point arithmetic steps of exact linear regression or stochastic gradient descent
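For scale, Landauer's principle puts the minimum heat dissipated per erased bit at k_B·T·ln 2. The sketch below only evaluates that textbook formula; the function name and the choice of 300 K are assumptions for illustration, and the paper's own bounds are stated in terms of entropy differences between input and output states rather than a raw bit count.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant in J/K (exact SI value)


def landauer_energy_joules(bits_erased, temperature_kelvin=300.0):
    """Minimum dissipated energy for erasing `bits_erased` bits at temperature T."""
    return bits_erased * K_B * temperature_kelvin * math.log(2)


if __name__ == "__main__":
    # One bit at 300 K: roughly 2.87e-21 J.
    print(landauer_energy_joules(1.0))
    # Hypothetical workload: an entropy drop of 32 bits per sample over 10**6 samples.
    print(landauer_energy_joules(32.0 * 10**6))
```

The bound scales linearly in both temperature and the number of bits erased, which is why costs derived from it grow with dataset size and numerical precision.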

If this is right

Total energy for a linear model with fixed generalization-error target reaches a minimum at a finite dataset size rather than growing without limit.
Energy costs of exact regression and SGD versions scale differently with precision and data volume, allowing direct comparison of their thermodynamic efficiency.
Inference demand that requires lower generalization error shifts the optimal training set size upward in a quantifiable way.
Mismatch between continuous inputs and discrete algorithm steps produces an additional entropy-production term that can be lower-bounded separately.
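The first and third consequences can be illustrated with a toy trade-off. This is not the paper's profit function (its Eqs. (178) and (181) are not reproduced on this page). The sketch assumes a training cost that is linear in the dataset size n and a penalty proportional to the excess risk of zero-intercept least squares with Gaussian inputs, σ_ξ²/(n − 2). The names `cost_per_sample` and `error_penalty` are hypothetical knobs.

```python
import math


def excess_risk(n, sigma_xi=1.0):
    """Expected excess prediction risk of zero-intercept least squares (Gaussian inputs)."""
    if n <= 2:
        raise ValueError("n must exceed 2")
    return sigma_xi ** 2 / (n - 2)


def toy_total_cost(n, cost_per_sample, error_penalty, sigma_xi=1.0):
    """Toy objective: linear training cost plus a penalty on generalization error."""
    return cost_per_sample * n + error_penalty * excess_risk(n, sigma_xi)


def best_n(cost_per_sample, error_penalty, sigma_xi=1.0, n_max=10_000):
    """Brute-force search for the integer dataset size minimizing the toy cost."""
    return min(
        range(3, n_max + 1),
        key=lambda n: toy_total_cost(n, cost_per_sample, error_penalty, sigma_xi),
    )


def best_n_closed_form(cost_per_sample, error_penalty, sigma_xi=1.0):
    """Continuous minimizer: set the derivative of the toy cost in n to zero."""
    return 2.0 + sigma_xi * math.sqrt(error_penalty / cost_per_sample)


if __name__ == "__main__":
    print(best_n(1.0, 1e4), best_n_closed_form(1.0, 1e4))
    print(best_n(1.0, 4e4), best_n_closed_form(1.0, 4e4))
```

In this toy model the optimum is finite because the error term shrinks like 1/n while the training cost keeps growing, and weighting the error more heavily moves the optimum upward, in line with the qualitative behavior listed above.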

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same counting method could be extended to other linear models such as logistic regression or to the forward passes of small neural networks.
Hardware designers could use these bounds to prioritize reductions in floating-point erasure costs when building energy-efficient ML accelerators.
For very large-scale inference workloads the derived scaling laws predict a crossover beyond which adding more training data becomes energetically wasteful.

Load-bearing premise

Landauer's principle can be applied directly to count the irreversible bit operations in floating-point linear regression and stochastic gradient descent without additional hidden costs from memory access or control flow.

What would settle it

An experiment that measures the actual energy consumed by a processor executing floating-point linear regression on a known dataset and finds dissipation below the calculated Landauer bound for the same number of bit erasures and precision would falsify the bound.

Figures

Figures reproduced from arXiv: 2605.19195 by Anant Sahai, Michael R. DeWeese, Samuel H. D'Ambrosia, Sultan M. Daniels.

**Figure 1.** Floating-point structure and Gaussian approximations. (1a) The structure of a floating-point number where each box represents one bit. The true bin size function $\Delta(x)$ is plotted on log-log scale for both midpoint (black solid curve) and floor quantization (blue dotted curve) with $p = 10$ and $E = 4$ along with the smooth approximation $\Delta_s(x)$ (red dashed curve). (1b) shows the entropy of $X_{fp}$ which is a discret…

**Figure 2.** Regimes where the approximations hold. (2a) Shows the exact discrete entropy of $X_{fp}$ as $p$ varies for each $E$ when the underlying continuous random variable is $X \sim \mathcal{N}(0, 1)$. The approximation $\tilde{H}^{0}_{s}(p)$ is also plotted as the dashed red line. Notice that the curves for $E \geq 4$ are directly on top of each other, showing that $\tilde{H}^{0}_{s}(p)$ is close to $H(X_{fp})$ when $E$ is large enough to keep the probability of overfl…

**Figure 3.** The Landauer cost for exact zero-intercept simple linear regression. Input and output states are floating-point numbers with $p = 24$ and $E = 8$. The candidate values of the $\mathrm{SNR} = w^2 \sigma_x^2 / \sigma_\xi^2$ are 0.062, 0.25, 1, 4, and 25. (3a) The output model approximate entropy $\tilde{H}^{w}_{s}(\hat{W}_{fp})$, and its dependence on the number of data samples $n$. (3b) The approximate entropy difference rate between the input data and the…

**Figure 4.** Entropy dynamics of SGD. Input and output states are assumed to be single-precision floating-point numbers, with $p = 24$. Here $\hat{w}_0 = 1$, $\sigma_x^2 = 1$, $\sigma_\xi^2 = 1$, $\eta = 10^{-2}$, and $B = 10$. (4a) shows the approximate floating-point entropy of $\hat{W}$ vs SGD step number $k$. (4b) shows the entropy difference between the input states and the SGD predictor at step $k$. Here the precision contribution is given by $\Delta E^{\mathrm{SGD}}_{p} / (k_B T$ …

**Figure 5.** Optimal dataset size for the exact linear regression formula and for stochastic gradient descent. (5a) shows the profit gained versus the dataset size $n$ for the exact linear regression formula $u_{\mathrm{Ex}}(n)$ given in Eq. (178). (5c) shows $u'_{\mathrm{Ex}}(n)$, the derivative of the profit function with respect to $n$ as given in Eq. (180). (5b) shows the profit gained versus the dataset size $n$ for SGD $u_{\mathrm{SGD}}(n)$ given in Eq. (181…

**Figure 6.** Clipping and midpoint quantization with $K = 3$ representable values $\{u_1, u_2, u_3\}$. The blue vertical lines represent the midpoints, and the arrows depict the regions of the real line that map to each representable value at a black vertical line.

**Figure 7.** Numerical scale of $C_0$ and $\varepsilon_0$ (Corollary C.1.1) for $d = 1$, $\sigma = 1$. Each curve corresponds to a different exponent width $E$. Circles denote Gaussian marginals; squares denote Student's $t_5$ marginals. [Plot residue: panels "Midpoint" and "Floor"; vertical axis $H(X_{fp})$ (bits); legend: Exact, $H_s(p)$, $H^{0}_{s}(p)$.]

**Figure 8.** Simulating approximation 3: The dependence of the entropy of a normally distributed floating-point number on its standard deviation $\sigma$ and mean $\mu$. Dashed lines show Approximation 3 from Eq. (20), while solid lines show the exact entropy from Corollary B.1.1.

**Figure 9.** Histogram of exponent values for a normally distributed random variable. The distribution of exponent states for $X \sim \mathcal{N}(0, \sigma^2)$. From left to right, the distributions plotted have $\sigma = \{10^{-3}, 10^{0}, 10^{3}\}$. All three have a discrete entropy $H(\log[X]) \approx 2.54$ bits. These distributions conform well to observational data in [30], [31].

**Figure 10.** Fit quality for asymptotic stochastic gradient descent. Empirical investigation of the validity of the continuous Ornstein-Uhlenbeck process approximation with a simulation of $\hat{w}$ with $\eta = 0.01$, $\tau = 200$, $\sigma_x^2 = 1$, and $\sigma_\xi^2 = 1$, for 1000 trials. From top to bottom, the batch sizes are $B = \{1, 5, 25\}$ while $\tau = 200$, showing the approximation is already effective for these parameters at low $B$.

**Figure 11.** SGD dynamics for zero-intercept simple linear regression. (11a) SGD parameter $\hat{w}$ distribution at selected iterations, for 5000 separate trials of running SGD. $\sigma_x^2 = 1$, $\sigma_\xi^2 = 1$, $\eta = 0.01$, $B = 10$. (11b) Theoretical mean and standard deviation compared to the empirical mean and standard deviation as step number $k$ increases up to final value $k = \tau = 2000$.

**Figure 12.** The probability density function $f_Z(z)$, with $\sigma_x = \sigma_\xi = 1$. As $n$ increases, the distribution becomes more peaked around $z = 0$. The simulation is of 50000 trials.

**Figure 13.** The Landauer cost for exact zero-intercept simple linear regression. Input and output states are floating-point numbers with $p = 4$ and $E = 4$. The candidate values of the $\mathrm{SNR} = w^2 \sigma_x^2 / \sigma_\xi^2$ are 0.062, 0.25, 1, 4, and 25. (13a) The output model entropy, and its dependence on the data size $n$ and ground truth $w$. (13b) The entropy difference for various values of SNR and a range of $n$.

**Figure 14.** A lower bound on the mismatch cost for continuously parameterized inputs. $\mathrm{MMC}_v(\sigma_x, \sigma_\xi)$ as a function of $\sigma_x$ and $\sigma_\xi$, with each plot assuming the entropy flow $\Delta S_{\mathrm{env,Ex}} = \Delta S_{\mathrm{env,SGD}} = C + \alpha(\sigma_x^2 + \sigma_\xi^2)$, and $w = 1$. A bounded optimization is performed for $0.75 \leq \sigma_x \leq 5$, $0.5 \leq \sigma_\xi \leq 5$. (14a) A sample $\mathrm{MMC}_v$ landscape for exact linear regression with the illustrative $\Delta S_{\mathrm{env,Ex}} = C + \alpha(\sigma_x^2 + \sigma_\xi^2)$, and $n = 10$. 14…

**Figure 15.** The effect of the learning rate and batch size on the optimal dataset size for stochastic gradient descent. (15a) shows $u_{\mathrm{SGD}}(n)$ given in Eq. (181) for varying values of the learning rate $\eta$. Notice that smaller learning rates lead to larger optimal dataset sizes. (15c) shows $u'_{\mathrm{SGD}}(n)$ in Eq. (183) for the different learning rates. (15b) shows $u_{\mathrm{SGD}}(n)$ for various values of the batch size $B$. For the profit …

**Figure 16.** Exact midpoint-quantized entropy vs. standard deviation $\sigma$. For each precision $p \in \{1, \ldots, 8\}$, the exact discrete entropy $H(X_{fp})$ of $X \sim \mathcal{N}(0, \sigma^2)$ (with $\mu = 0$) is plotted as a function of $\sigma$ over a wide log-scale range. Each curve corresponds to a distinct value of exponent bits $E \in \{0, 1, \ldots, 7\}$. The vertical dashed lines mark $\sigma = 2^{e_{\min}}$ and the vertical dotted lines mark $\sigma = 2^{e_{\max}}$ for each $E$, and …

**Figure 17.** Exact midpoint-quantized entropy vs. mean $\mu$, $p = 1$. The exact entropy $H(X_{fp})$ of $X \sim \mathcal{N}(\mu, 1)$ (with $\sigma = 1.0$ fixed and $p = 1$) is plotted as a function of $\mu \in [-100, 100]$. Each panel shows a different number of exponent bits $E$. The solid curve is the exact entropy and the dashed curve is the large-$|\mu|$ approximation $\tilde{H}^{\mu}_{s}(X_{fp})$. These plots were generated by sweeping $\mu$ over 500 linearly-spaced points and e…

**Figure 18.** Exact midpoint-quantized entropy vs. mean $\mu$, $p = 2$. Same experiment as …

**Figure 19.** Exact midpoint-quantized entropy vs. mean $\mu$, $p = 3$. Same experiment as …

**Figure 20.** Exact midpoint-quantized entropy vs. mean $\mu$, $p = 4$. Same experiment as …

**Figure 21.** Exact midpoint-quantized entropy vs. mean $\mu$, $p = 5$. Same experiment as …

**Figure 22.** Exact midpoint-quantized entropy vs. mean $\mu$, $p = 6$. Same experiment as …

**Figure 23.** Exact midpoint-quantized entropy vs. mean $\mu$, $p = 7$. Same experiment as …

Original abstract

The construction of models from data is a significant contributor to the energetic costs of computation. Because of this, understanding how foundational thermodynamic bounds apply to modeling algorithms will be increasingly important. Here, we study the thermodynamic costs of a basic and fundamental modeling algorithm: simple linear regression. Following Landauer, we approximate the thermodynamic lower bound on irreversibly performing both exact linear regression and linear regression via stochastic gradient descent as implemented on floating-point numbers. From this, we derive energy-cost-aware scaling laws for the optimal dataset size for training a linear regression model given a generalization-error-dependent demand for inference. Additionally, we discuss a method to lower bound the entropy production from the mismatch cost for algorithms with continuous input variables.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper applies Landauer bounds to exact and SGD linear regression to derive energy-aware dataset scaling laws, but the practical value depends on whether bit erasures dominate real dissipation.

The main point is that this work approximates thermodynamic lower bounds on the energy cost of linear regression, both the exact closed-form version and the SGD version, when run on floating-point arithmetic. From those bounds it derives scaling relations for the training set size that minimizes total energy given a target generalization error at inference time. It also sketches a way to lower-bound entropy production from mismatch costs when inputs are continuous.

Referee Report

2 major / 2 minor

Summary. The manuscript approximates thermodynamic lower bounds on the irreversible energy costs of exact simple linear regression and its implementation via stochastic gradient descent on floating-point arithmetic, by applying Landauer's principle to count bit erasures. From these bounds it derives energy-cost-aware scaling laws for the optimal training dataset size that balance computational dissipation against a generalization-error requirement at inference time. It additionally outlines a method to lower-bound entropy production arising from mismatch costs when input variables are continuous.

Significance. If the bit-operation counting procedure yields a valid and dominant lower bound, the resulting scaling laws would supply a concrete, falsifiable link between thermodynamic principles and practical choices of training-set size in linear models. The mismatch-cost discussion for continuous variables is a useful technical contribution that could extend to other regression or optimization settings. The work is strongest where it remains within the abstract model of irreversible operations; its practical relevance hinges on whether those operations dominate real hardware dissipation.

major comments (2)

[Thermodynamic bounds and floating-point implementation] The central approximation that counts only irreversible bit operations in floating-point linear regression and SGD (as described in the derivation following the abstract) does not address memory-hierarchy accesses, data movement, or control-flow overhead. These terms are not strictly proportional to bit erasures and can exceed the Landauer floor by orders of magnitude on current hardware; without a quantitative argument that they remain sub-dominant, the claimed lower bound cannot reliably support the derived scaling laws for optimal dataset size.
[Energy-cost aware scaling laws] The scaling laws for optimal dataset size are obtained by minimizing a total cost that includes both the approximated dissipation and a generalization-error term. If the error metric used to define the inference demand is the same quantity that enters the cost function (as suggested by the abstract phrasing), the optimum may be tautological rather than predictive; an explicit statement of the functional form and any free parameters in the scaling relation is needed to assess this.

minor comments (2)

[Mismatch cost discussion] Notation for the mismatch-cost lower bound on continuous variables should be introduced with a short example (e.g., a one-dimensional Gaussian input) to clarify how the continuous-to-discrete translation is performed.
[Introduction] The abstract states that bounds are 'approximated'; a brief paragraph in the introduction or methods section listing the concrete approximations (e.g., neglect of reversible steps, assumption of uniform bit cost) would improve readability.
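The one-dimensional Gaussian example requested in the first minor comment can be sketched as below. It uses the standard fine-grid relation between discrete and differential entropy, H(X_Δ) ≈ h(X) − log2 Δ, on a uniform grid with midpoint quantization. The uniform bin width is a simplifying assumption made here: a floating-point grid has a bin size that depends on magnitude, which is what the paper's Δ(x) describes. Function names are hypothetical.

```python
import math


def gaussian_cdf(x, sigma=1.0):
    """CDF of a zero-mean Gaussian with standard deviation sigma."""
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))


def quantized_entropy_bits(sigma, delta, span_sigmas=12.0):
    """Exact discrete entropy (bits) of a zero-mean Gaussian rounded to a grid of width delta."""
    k_max = int(math.ceil(span_sigmas * sigma / delta))
    total = 0.0
    for k in range(-k_max, k_max + 1):
        # Probability mass of the bin centered at k * delta (midpoint quantization).
        p = gaussian_cdf((k + 0.5) * delta, sigma) - gaussian_cdf((k - 0.5) * delta, sigma)
        if p > 0.0:
            total -= p * math.log2(p)
    return total


def fine_grid_approx_bits(sigma, delta):
    """Differential entropy of the Gaussian minus log2 of the bin width."""
    return 0.5 * math.log2(2.0 * math.pi * math.e * sigma ** 2) - math.log2(delta)


if __name__ == "__main__":
    for delta in (1.0, 0.1, 0.01):
        print(delta, quantized_entropy_bits(1.0, delta), fine_grid_approx_bits(1.0, delta))
```

For σ = 1 and Δ = 0.01 both quantities come out near 8.69 bits; halving Δ adds one bit, which is the sense in which precision enters the entropy of the stored value.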

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us clarify the scope and presentation of our results. Below we respond point-by-point to the major comments. We have revised the manuscript to incorporate additional discussion and explicit statements of the scaling relations where appropriate.

Referee: [Thermodynamic bounds and floating-point implementation] The central approximation that counts only irreversible bit operations in floating-point linear regression and SGD (as described in the derivation following the abstract) does not address memory-hierarchy accesses, data movement, or control-flow overhead. These terms are not strictly proportional to bit erasures and can exceed the Landauer floor by orders of magnitude on current hardware; without a quantitative argument that they remain sub-dominant, the claimed lower bound cannot reliably support the derived scaling laws for optimal dataset size.

Authors: Our derivation applies Landauer's principle strictly to the irreversible bit erasures that occur during the floating-point arithmetic operations of exact linear regression and SGD. This yields a hardware-independent lower bound on the thermodynamic cost of those specific operations. We agree that memory-hierarchy accesses, data movement, and control flow are not included and can dominate dissipation on existing processors. In the revised manuscript we have added an explicit paragraph in the discussion section stating that the reported bounds and scaling laws concern only the Landauer-limited arithmetic component; they are intended as theoretical minima that any physical implementation must respect, rather than as predictions of total energy use on current hardware. Because a quantitative demonstration of sub-dominance would require device-specific models outside the scope of this theoretical study, we have instead emphasized how the scaling laws can be combined with empirical overhead models in future applied work. revision: partial
Referee: [Energy-cost aware scaling laws] The scaling laws for optimal dataset size are obtained by minimizing a total cost that includes both the approximated dissipation and a generalization-error term. If the error metric used to define the inference demand is the same quantity that enters the cost function (as suggested by the abstract phrasing), the optimum may be tautological rather than predictive; an explicit statement of the functional form and any free parameters in the scaling relation is needed to assess this.

Authors: The generalization error appears solely as an external performance requirement at inference time, not as a term inside the training dissipation cost. We minimize the thermodynamic cost of training subject to the constraint that the deployed model must achieve a target generalization error ε. In the revised manuscript we now state the explicit functional form: the optimal training-set size scales as N* ∝ (log(1/ε) + c · precision) / β, where β is the per-sample dissipation coefficient derived from bit erasures and c collects constants from the linear-regression solution. The free parameters are the target error ε, the floating-point precision, and the data variance; these are listed in the new scaling-law subsection. Because the energy cost is incurred only during training while ε is a post-training specification, the resulting optimum is predictive rather than tautological. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation builds from Landauer bounds and statistical scaling without self-referential reduction

The paper starts from Landauer's principle applied to bit erasures in exact linear regression and floating-point SGD, counts irreversible operations, and derives energy-aware scaling laws for optimal dataset size under a generalization-error constraint. No equations or steps reduce the final scaling law to a fitted parameter or prior self-citation by construction; the optimal-N expression emerges from combining the thermodynamic cost model with standard bias-variance or generalization bounds rather than tautologically re-expressing the input error metric. The derivation remains self-contained against external thermodynamic and learning-theoretic benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on Landauer's principle applied to floating-point operations and on the assumption that mismatch costs for continuous variables can be bounded separately from the main computation.

axioms (1)

domain assumption: Landauer's principle supplies the minimum energy cost for each irreversible bit erasure or overwrite performed during linear regression and SGD updates on floating-point numbers.
Explicitly invoked in the abstract with the phrase 'Following Landauer, we approximate the thermodynamic lower bound'.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear


Following Landauer, we approximate the thermodynamic lower bound on irreversibly performing both exact linear regression and linear regression via stochastic gradient descent as implemented on floating-point numbers.
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean embed_injective unclear


We can gain further insight into the entropy of floating-point numbers... relating the differential entropy of a continuous random variable to the discrete entropy of its counterpart discrete representation.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

101 extracted references · 101 canonical work pages · 10 internal anchors

[1]

2024 United States Data Center Energy Usage Report,

A. Shehabi, A. Newkirk, S. Smith, A. Hubbard, N. Lei, M. Siddiket al., “2024 United States Data Center Energy Usage Report,” Lawrence Berkeley National Laboratory, Berkeley, CA, USA, Tech. Rep. LBNL-2001637, 2024. [Online]. Available: https://escholarship.org/uc/item/32d6m0d1

work page 2024
[2]

Power hungry processing: Watts driving the cost of ai deployment?

A. S. Luccioni, Y . Jernite, and E. Strubell, “Power hungry processing: Watts driving the cost of ai deployment?” in Proceedings of the ACM Conference on Fairness, Accountability, and Transparency (FAccT ’24), Rio de Janeiro, Brazil, 2024, pp. 85–99

work page 2024
[3]

The growing energy footprint of artificial intelligence,

A. de Vries, “The growing energy footprint of artificial intelligence,”Joule, vol. 7, no. 10, pp. 2191–2194, Oct 2023. 22

work page 2023
[4]

A systematic review of green ai,

R. Verdecchia, J. Sallou, and L. Cruz, “A systematic review of green ai,”WIREs Data Mining and Knowledge Discovery, vol. 13, no. 4, p. e1507, 2023

work page 2023
[5]

The end of moore’s law: Living without an exponential,

P. Schuster, “The end of moore’s law: Living without an exponential,”Complexity, vol. 21, no. 2, pp. 7–10, 2016

work page 2016
[6]

The end of moore’s law: A new beginning for information technology,

T. M. Conteet al., “The end of moore’s law: A new beginning for information technology,” Computing Community Consortium (CCC), Computing Research Association, Tech. Rep., 2017. [Online]. Available: https: //cra.org/ccc/resources/ccc-led-whitepapers/

work page 2017
[7]

Moore’s law and the energy requirement of computing versus performance,

L. B. Kish, “Moore’s law and the energy requirement of computing versus performance,”IEE Proceedings – Circuits, Devices and Systems, vol. 151, no. 2, pp. 190–194, Apr 2004

work page 2004
[8]

Noninvertible Global Symmet ries in the Standard Model,

N. Zhang, “Moore’s law is dead, long live moore’s law!” arXiv preprint arXiv:2205.05086, 2022. [Online]. Available: https://arxiv.org/abs/2205.05086

work page arXiv 2022
[9]

Irreversibility and heat generation in the computing process,

R. Landauer, “Irreversibility and heat generation in the computing process,”IBM Journal of Research and Development, vol. 5, no. 3, pp. 183–191, Jul 1961

work page 1961
[10]

The thermodynamics of computation—a review,

C. H. Bennett, “The thermodynamics of computation—a review,”International Journal of Theoretical Physics, vol. 21, no. 12, pp. 905–940, Dec 1982

work page 1982
[11]

Ultimate physical limits to computation,

S. Lloyd, “Ultimate physical limits to computation,”Nature, vol. 406, no. 6799, pp. 1047–1054, Aug 2000

work page 2000
[12]

Physical limits of computing,

M. P. Frank, “Physical limits of computing,”Computer, vol. 50, no. 9, pp. 14–23, Sep 2017

work page 2017
[13]

The thermodynamics of computation—a review,

C. H. Bennett, “The thermodynamics of computation—a review,”International Journal of Theoretical Physics, vol. 21, no. 12, pp. 905–940, 1982, same asBennett1982

work page 1982
[14]

The physical limits of communication and computation,

R. Landauer, “The physical limits of communication and computation,”IEEE Spectrum, vol. 9, no. 5, pp. 23–29, May 1972

work page 1972
[15]

Is stochastic thermodynamics the key to understanding the energy costs of computation?

D. H. Wolpert, J. Korbel, C. W. Lynn, F. Tasnim, J. A. Grochow, G. Kardes ¸, J. B. Aimone, V . Balasubramanian, E. D. Giuli, D. Doty, N. Freitas, M. Marsili, T. E. Ouldridge, A. W. Richa, P. Riechers, ´Edgar Rold ´an, B. Rubenstein, Z. Toroczkai, and J. Paradiso, “Is stochastic thermodynamics the key to understanding the energy costs of computation?” Proc...

work page doi:10.1073/pnas.2321112121 2024
[16]

The stochastic thermodynamics of computation,

D. H. Wolpert, “The stochastic thermodynamics of computation,”Journal of Physics A: Mathematical and Theoretical, vol. 52, no. 19, p. 193001, 2019

work page 2019
[17]

Entropy production bounds for systems running computer programs

A. Yadav, F. Caravelli, and D. H. Wolpert, “System-independent lower bounds on entropy production incurred by running a computer program,” arXiv preprint arXiv:2411.16088, 2025. [Online]. Available: https://arxiv.org/abs/2411.16088

work page internal anchor Pith review Pith/arXiv arXiv 2025
[18]

Thermodynamics of computations with absolute irreversibility, unidirectional transitions, and stochastic computation times,

G. Manzano, G. Kardes ¸, ´E. Rold ´an, and D. H. Wolpert, “Thermodynamics of computations with absolute irreversibility, unidirectional transitions, and stochastic computation times,”Physical Review X, vol. 14, no. 2, p. 021026, 2024

work page 2024
[19]

Dependence of integrated, instantaneous, and fluctuating entropy production on the initial state in quantum and classical processes,

A. Kolchinsky and D. H. Wolpert, “Dependence of integrated, instantaneous, and fluctuating entropy production on the initial state in quantum and classical processes,”Physical Review E, vol. 104, no. 5, p. 054107, Nov 2021

work page 2021
[20]

A logical calculus of the ideas immanent in nervous activity,

W. S. McCulloch and W. Pitts, “A logical calculus of the ideas immanent in nervous activity,”Bulletin of Mathematical Biophysics, vol. 5, no. 4, pp. 115–133, Dec 1943

work page 1943
[21]

The perceptron: A probabilistic model for information storage and organization in the brain,

F. Rosenblatt, “The perceptron: A probabilistic model for information storage and organization in the brain,”Psychological Review, vol. 65, no. 6, pp. 386–408, Nov 1958

work page 1958
[22]

Adaptive switching circuits,

B. Widrow and M. E. Hoff, “Adaptive switching circuits,” in1960 IRE WESCON Convention Record – Part 4. New York: Institute of Radio Engineers, 1960, pp. 96–104

work page 1960
[23]

James, D

G. James, D. Witten, T. Hastie, and R. Tibshirani,An introduction to statistical learning: with applications in R, ser. Springer texts in statistics. New York: Springer, 2013

work page 2013
[24]

Eric Hall and Rebecca Willett

S. Goldt and U. Seifert, “Stochastic thermodynamics of learning,”Phys. Rev. Lett., vol. 118, p. 010601, Jan 2017. [Online]. Available: https://link.aps.org/doi/10.1103/PhysRevLett.118.010601

work page doi:10.1103/physrevlett.118.010601 2017
[25]

Energy-Efficient Algorithms,

E. D. Demaine, J. Lynch, G. J. Mirano, and N. Tyagi, “Energy-Efficient Algorithms,” inProceedings of the 2016 ACM Conference on Innovations in Theoretical Computer Science, ser. ITCS ’16. New York, NY , USA: Association for Computing Machinery, Jan. 2016, pp. 321–332. [Online]. Available: https://dl.acm.org/doi/10.1145/2840728.2840756

work page doi:10.1145/2840728.2840756 2016
[26]

Thermodynamic bounds on energy use in deep neural networks,

A. V . Tkachenko, “Thermodynamic bounds on energy use in deep neural networks,” 2025. [Online]. Available: https://arxiv.org/abs/2503.09980

work page arXiv 2025
[27]

NVIDIA Blackwell Architecture Technical Overview,

NVIDIA, “NVIDIA Blackwell Architecture Technical Overview,” NVIDIA, Tech. Rep., 2025. [Online]. Available: https://resources.nvidia.com/en-us-blackwell-architecture

work page 2025
[28]

AMD CDNA 4 Architecture,

I. Advanced Micro Devices, “AMD CDNA 4 Architecture,” AMD, Tech. Rep., Oct. 2025. [Online]. Available: https: //www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-4-architecture-whitepaper.pdf

work page 2025
[29] V. Kostina, “Data Compression With Low Distortion and Finite Blocklength,” IEEE Transactions on Information Theory, vol. 63, no. 7, pp. 4268–4285, Jul. 2017. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/7867787
[30] R. Bordawekar, B. Abali, and M.-H. Chen, “Efloat: Entropy-coded floating point format for compressing vector embedding models,” 2022. [Online]. Available: https://arxiv.org/abs/2102.02705
[31] Y. Hao, Y. Cao, and L. Mou, “NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks,” Oct. 2024, arXiv:2410.20650 [cs]. [Online]. Available: http://arxiv.org/abs/2410.20650
[32] S. Daniels, S. H. D’Ambrosia, M. R. DeWeese, and A. Sahai, “The entropy of floating-point numbers,” 2026. [Online]. Available: https://arxiv.org/abs/2605.11546
[33] N. Sardana, J. Portes, S. Doubov, and J. Frankle, “Beyond chinchilla-optimal: Accounting for inference in language model scaling laws,” 2025. [Online]. Available: https://arxiv.org/abs/2401.00448
[34] E. D. Demaine, J. Lynch, and J. Sun, “An efficient reversible algorithm for linear regression,” in 2021 International Conference on Rebooting Computing (ICRC), 2021, pp. 103–108
[35] D. Maclaurin, D. Duvenaud, and R. P. Adams, “Gradient-based hyperparameter optimization through reversible learning,” [Online]. Available: https://arxiv.org/abs/1502.03492
[37] R. Tolman, The Principles of Statistical Mechanics, ser. International series of monographs on physics. Oxford University Press, 1942. [Online]. Available: https://books.google.com/books?id=Hbr9yAEACAAJ
[38] J. W. Gibbs, The Collected Works of J. Willard Gibbs. Longmans, Green and Company, 1928, vol. 1
[39] O. J. E. Maroney, “The physical basis of the gibbs-von neumann entropy,” 2008. [Online]. Available: https://arxiv.org/abs/quant-ph/0701127
[40] O. J. E. Maroney, “Generalizing landauer’s principle,” Phys. Rev. E, vol. 79, p. 031105, Mar 2009. [Online]. Available: https://link.aps.org/doi/10.1103/PhysRevE.79.031105
[41] H. B. Callen, Thermodynamics and an introduction to thermostatistics. New York, NY: Wiley, 1985. [Online]. Available: https://cds.cern.ch/record/450289
[42] O. Maroney, “The (absence of a) relationship between thermodynamic and logical reversibility,” Studies in History and Philosophy of Science Part B: Studies in History and Philosophy of Modern Physics, vol. 36, no. 2, pp. 355–374, 2005. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1355219805000031
[43] L. D. Landau, E. M. Lifshitz, and L. P. Pitaevskii, Statistical Physics: Part 1, 3rd ed., ser. Course of Theoretical Physics. Oxford: Pergamon Press, 1980, vol. 5
[44] D. Chandler, Introduction to Modern Statistical Mechanics. Oxford University Press, 1987
[45] A. Kuzmin, M. van Baalen, Y. Ren, M. Nagel, J. Peters, and T. Blankevoort, “FP8 Quantization: The Power of the Exponent,” Advances in Neural Information Processing Systems, vol. 35, pp. 14 651–14 662, Dec. 2022. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2022/hash/5e07476b6bd2497e1fbd11b8f0b2de3c-Abstract-Conference.html
[46] B. D. Rouhani, R. Zhao, A. More, M. Hall, A. Khodamoradi, S. Deng, D. Choudhary, M. Cornea, E. Dellinger, K. Denolf, S. Dusan, V. Elango, M. Golub, A. Heinecke, P. James-Roxby, D. Jani, G. Kolhe, M. Langhammer, A. Li, L. Melnick, M. Mesmakhosroshahi, A. Rodriguez, M. Schulte, R. Shafipour, L. Shao, M. Siu, P. Dubey, P. Micikevicius, M. Naumov, C. Verrill... “Microscaling data formats for deep learning,” arXiv preprint arXiv:2310.10537
[47] B. Darvish Rouhani, R. Zhao, V. Elango, R. Shafipour, M. Hall, M. Mesmakhosroshahi, A. More, L. Melnick, M. Golub, G. Varatkar, L. Shao, G. Kolhe, D. Melts, J. Klar, R. L’Heureux, M. Perry, D. Burger, E. Chung, Z. S. Deng, S. Naghshineh, J. Park, and M. Naumov, “With Shared Microexponents, A Little Shifting Goes a Long Way,” in Proceedings of the 50th Ann...
[48] H. Su, M. Kwun, S. Gil, S. Kakade, and N. Anand, “Characterization and Mitigation of Training Instabilities in Microscaling Formats,” Jun. 2025. [Online]. Available: https://arxiv.org/abs/2506.20752v1
[49] J. M. Muller et al., Handbook of floating-point arithmetic. Boston: Birkhäuser, 2010
[50] D. Goldberg, “What every computer scientist should know about floating-point arithmetic,” ACM Comput. Surv., vol. 23, no. 1, p. 5–48, Mar. 1991. [Online]. Available: https://doi.org/10.1145/103162.103163
[51] T. M. Cover and J. A. Thomas, Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). USA: Wiley-Interscience, 2006
[52] A. Rényi, “On the dimension and entropy of probability distributions,” Acta Mathematica Academiae Scientiarum Hungarica, vol. 10, no. 1, pp. 193–215, Mar. 1959. [Online]. Available: https://doi.org/10.1007/BF02063299
[53] E. T. Jaynes, “Information Theory and Statistical Mechanics,” in Statistical Physics, ser. Brandeis Summer Institute. New York, NY: W. A. Benjamin Inc., 1962, pp. 181–218
[54] E. T. Jaynes, “Prior probabilities,” IEEE Transactions on Systems and Cybernetics, no. 3, pp. 227–241, 1968
[55] T. Linder and K. Zeger, “Asymptotic entropy-constrained performance of tessellating and universal randomized lattice quantization,” IEEE Transactions on Information Theory, vol. 40, no. 2, pp. 575–579, Mar. 1994. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/312189
[56] H. Gish and J. Pierce, “Asymptotically efficient quantizing,” IEEE Transactions on Information Theory, vol. 14, no. 5, pp. 676–683, Sep. 1968. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/1054193
[57] R. M. Gray and D. L. Neuhoff, “Quantization,” IEEE Transactions on Information Theory, vol. 44, no. 6, pp. 2325–2383, 1998
[58] C. Shannon, “Communication in the Presence of Noise,” Proceedings of the IRE, vol. 37, no. 1, pp. 10–21, Jan. 1949. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/1697831
[59] N. L. Johnson, S. Kotz, and N. Balakrishnan, Continuous univariate distributions, 2nd ed. New York: Wiley, 1994
[60] S. Mandt, M. D. Hoffman, and D. M. Blei, “Stochastic gradient descent as approximate bayesian inference,” J. Mach. Learn. Res., vol. 18, no. 1, p. 4873–4907, Jan. 2017
[61] S. Mandt, M. D. Hoffman, and D. M. Blei, “A variational analysis of stochastic gradient algorithms,” in Proceedings of the 33rd International Conference on Machine Learning - Volume 48, ser. ICML’16. JMLR.org, 2016, p. 354–363
[62] S. Jastrzebski, Z. Kenton, D. Arpit, N. Ballas, A. Fischer, Y. Bengio, and A. J. Storkey, “Three factors influencing minima in sgd,” ArXiv, vol. abs/1711.04623, 2017. [Online]. Available: https://api.semanticscholar.org/CorpusID:7311295
[63] L. Bottou, F. E. Curtis, and J. Nocedal, “Optimization methods for large-scale machine learning,” SIAM review, vol. 60, no. 2, pp. 223–311, 2018
[64] G. A. Pavliotis, Stochastic processes and applications: diffusion processes, the Fokker-Planck and Langevin equations, ser. Texts in applied mathematics, vol. 60. New York: Springer, 2014
[65] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling laws for neural language models,” 2020. [Online]. Available: https://arxiv.org/abs/2001.08361
[66] J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de Las Casas, L. A. Hendricks, J. Welbl, A. Clark, T. Hennigan, E. Noland, K. Millican, G. van den Driessche, B. Damoc, A. Guy, S. Osindero, K. Simonyan, E. Elsen, J. W. Rae, O. Vinyals, and L. Sifre, “Training compute-optimal large language models,” 2022. [Online]. Available: ...
[67] D. Patterson, J. Gonzalez, Q. Le, C. Liang, L.-M. Munguia, D. Rothchild, D. So, M. Texier, and J. Dean, “Carbon emissions and large neural network training,” 2021. [Online]. Available: https://arxiv.org/abs/2104.10350
[68] R. Schaeffer, B. Miranda, and S. Koyejo, “Are emergent abilities of large language models a mirage?” Advances in Neural Information Processing Systems, vol. 36, 2023
[69] Z. Du, A. Zeng, Y. Dong, and J. Tang, “Understanding Emergent Abilities of Language Models from the Loss Perspective,” Jan. 2025, arXiv:2403.15796 [cs]. [Online]. Available: http://arxiv.org/abs/2403.15796
[70] T. Schmiedl and U. Seifert, “Optimal finite-time processes in stochastic thermodynamics,” Phys. Rev. Lett., vol. 98, p. 108301, Mar 2007. [Online]. Available: https://link.aps.org/doi/10.1103/PhysRevLett.98.108301
[71] D. A. Sivak and G. E. Crooks, “Thermodynamic metrics and optimal paths,” Phys. Rev. Lett., vol. 108, p. 190602, May 2012. [Online]. Available: https://link.aps.org/doi/10.1103/PhysRevLett.108.190602
[73] N. Freitas, J.-C. Delvenne, and M. Esposito, “Stochastic thermodynamics of nonlinear electronic circuits: A realistic framework for computing around kT,” Phys. Rev. X, vol. 11, p. 031064, Sep 2021. [Online]. Available: https://link.aps.org/doi/10.1103/PhysRevX.11.031064
[74] A. Kolchinsky and D. H. Wolpert, “Dependence of dissipation on the initial distribution over states,” Journal of Statistical Mechanics: Theory and Experiment, vol. 2017, 2016. [Online]. Available: https://api.semanticscholar.org/CorpusID:17899737
[75] D. H. Wolpert and A. Kolchinsky, “Thermodynamics of computing with circuits,” New Journal of Physics, vol. 22, no. 6, p. 063047, Jun 2020. [Online]. Available: https://doi.org/10.1088/1367-2630/ab82b8
[76] “BFloat16: The secret to high performance on Cloud TPUs — Google Cloud Blog — cloud.google.com,” https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus, [Accessed 01-12-2025]
[77] A. Gholami, S. Kim, Z. Dong, Z. Yao, M. W. Mahoney, and K. Keutzer, “A survey of quantization methods for efficient neural network inference,” 2021. [Online]. Available: https://arxiv.org/abs/2103.13630
[78] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, “Deep learning with limited numerical precision,” 2015. [Online]. Available: https://arxiv.org/abs/1502.02551
[79] R. Krishnamoorthi, “Quantizing deep convolutional networks for efficient inference: A whitepaper,” 2018. [Online]. Available: https://arxiv.org/abs/1806.08342
[80] S. Ashkboos, B. Verhoef, T. Hoefler, E. Eleftheriou, and M. Dazzi, “Efqat: An efficient framework for quantization-aware training,” CoRR, vol. abs/2411.11038, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2411.11038

Showing first 80 references.

[1] A. Shehabi, A. Newkirk, S. Smith, A. Hubbard, N. Lei, M. Siddik et al., “2024 United States Data Center Energy Usage Report,” Lawrence Berkeley National Laboratory, Berkeley, CA, USA, Tech. Rep. LBNL-2001637, 2024. [Online]. Available: https://escholarship.org/uc/item/32d6m0d1

[2] A. S. Luccioni, Y. Jernite, and E. Strubell, “Power hungry processing: Watts driving the cost of ai deployment?” in Proceedings of the ACM Conference on Fairness, Accountability, and Transparency (FAccT ’24), Rio de Janeiro, Brazil, 2024, pp. 85–99

[3] A. de Vries, “The growing energy footprint of artificial intelligence,” Joule, vol. 7, no. 10, pp. 2191–2194, Oct 2023

[4] R. Verdecchia, J. Sallou, and L. Cruz, “A systematic review of green ai,” WIREs Data Mining and Knowledge Discovery, vol. 13, no. 4, p. e1507, 2023

[5] P. Schuster, “The end of moore’s law: Living without an exponential,” Complexity, vol. 21, no. 2, pp. 7–10, 2016

[6] T. M. Conte et al., “The end of moore’s law: A new beginning for information technology,” Computing Community Consortium (CCC), Computing Research Association, Tech. Rep., 2017. [Online]. Available: https://cra.org/ccc/resources/ccc-led-whitepapers/

[7] L. B. Kish, “Moore’s law and the energy requirement of computing versus performance,” IEE Proceedings – Circuits, Devices and Systems, vol. 151, no. 2, pp. 190–194, Apr 2004

[8] N. Zhang, “Moore’s law is dead, long live moore’s law!” arXiv preprint arXiv:2205.05086, 2022. [Online]. Available: https://arxiv.org/abs/2205.05086

[9] R. Landauer, “Irreversibility and heat generation in the computing process,” IBM Journal of Research and Development, vol. 5, no. 3, pp. 183–191, Jul 1961

[10] C. H. Bennett, “The thermodynamics of computation—a review,” International Journal of Theoretical Physics, vol. 21, no. 12, pp. 905–940, Dec 1982

[11] S. Lloyd, “Ultimate physical limits to computation,” Nature, vol. 406, no. 6799, pp. 1047–1054, Aug 2000

[12] M. P. Frank, “Physical limits of computing,” Computer, vol. 50, no. 9, pp. 14–23, Sep 2017

[13] C. H. Bennett, “The thermodynamics of computation—a review,” International Journal of Theoretical Physics, vol. 21, no. 12, pp. 905–940, 1982, same as Bennett1982

[14] R. Landauer, “The physical limits of communication and computation,” IEEE Spectrum, vol. 9, no. 5, pp. 23–29, May 1972

[15] D. H. Wolpert, J. Korbel, C. W. Lynn, F. Tasnim, J. A. Grochow, G. Kardeş, J. B. Aimone, V. Balasubramanian, E. D. Giuli, D. Doty, N. Freitas, M. Marsili, T. E. Ouldridge, A. W. Richa, P. Riechers, Édgar Roldán, B. Rubenstein, Z. Toroczkai, and J. Paradiso, “Is stochastic thermodynamics the key to understanding the energy costs of computation?” Proc... doi:10.1073/pnas.2321112121, 2024

[16] D. H. Wolpert, “The stochastic thermodynamics of computation,” Journal of Physics A: Mathematical and Theoretical, vol. 52, no. 19, p. 193001, 2019

[17] A. Yadav, F. Caravelli, and D. H. Wolpert, “System-independent lower bounds on entropy production incurred by running a computer program,” arXiv preprint arXiv:2411.16088, 2025. [Online]. Available: https://arxiv.org/abs/2411.16088

[18] G. Manzano, G. Kardeş, É. Roldán, and D. H. Wolpert, “Thermodynamics of computations with absolute irreversibility, unidirectional transitions, and stochastic computation times,” Physical Review X, vol. 14, no. 2, p. 021026, 2024

[19] A. Kolchinsky and D. H. Wolpert, “Dependence of integrated, instantaneous, and fluctuating entropy production on the initial state in quantum and classical processes,” Physical Review E, vol. 104, no. 5, p. 054107, Nov 2021

[20] W. S. McCulloch and W. Pitts, “A logical calculus of the ideas immanent in nervous activity,” Bulletin of Mathematical Biophysics, vol. 5, no. 4, pp. 115–133, Dec 1943

[21] [21]

The perceptron: A probabilistic model for information storage and organization in the brain,

F. Rosenblatt, “The perceptron: A probabilistic model for information storage and organization in the brain,”Psychological Review, vol. 65, no. 6, pp. 386–408, Nov 1958

work page 1958

[22] [22]

Adaptive switching circuits,

B. Widrow and M. E. Hoff, “Adaptive switching circuits,” in1960 IRE WESCON Convention Record – Part 4. New York: Institute of Radio Engineers, 1960, pp. 96–104

work page 1960

[23] [23]

James, D

G. James, D. Witten, T. Hastie, and R. Tibshirani,An introduction to statistical learning: with applications in R, ser. Springer texts in statistics. New York: Springer, 2013

work page 2013

[24] [24]

Eric Hall and Rebecca Willett

S. Goldt and U. Seifert, “Stochastic thermodynamics of learning,”Phys. Rev. Lett., vol. 118, p. 010601, Jan 2017. [Online]. Available: https://link.aps.org/doi/10.1103/PhysRevLett.118.010601

work page doi:10.1103/physrevlett.118.010601 2017

[25] [25]

Energy-Efficient Algorithms,

E. D. Demaine, J. Lynch, G. J. Mirano, and N. Tyagi, “Energy-Efficient Algorithms,” inProceedings of the 2016 ACM Conference on Innovations in Theoretical Computer Science, ser. ITCS ’16. New York, NY , USA: Association for Computing Machinery, Jan. 2016, pp. 321–332. [Online]. Available: https://dl.acm.org/doi/10.1145/2840728.2840756

work page doi:10.1145/2840728.2840756 2016

[26] [26]

Thermodynamic bounds on energy use in deep neural networks,

A. V . Tkachenko, “Thermodynamic bounds on energy use in deep neural networks,” 2025. [Online]. Available: https://arxiv.org/abs/2503.09980

work page arXiv 2025

[27] [27]

NVIDIA Blackwell Architecture Technical Overview,

NVIDIA, “NVIDIA Blackwell Architecture Technical Overview,” NVIDIA, Tech. Rep., 2025. [Online]. Available: https://resources.nvidia.com/en-us-blackwell-architecture

work page 2025

[28] [28]

AMD CDNA 4 Architecture,

I. Advanced Micro Devices, “AMD CDNA 4 Architecture,” AMD, Tech. Rep., Oct. 2025. [Online]. Available: https: //www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-4-architecture-whitepaper.pdf

work page 2025

[29] [29]

Data Compression With Low Distortion and Finite Blocklength,

V . Kostina, “Data Compression With Low Distortion and Finite Blocklength,”IEEE Transactions on Information Theory, vol. 63, no. 7, pp. 4268–4285, Jul. 2017. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/7867787

work page arXiv 2017

[30] [30]

Efloat: Entropy-coded floating point format for compressing vector embedding models,

R. Bordawekar, B. Abali, and M.-H. Chen, “Efloat: Entropy-coded floating point format for compressing vector embedding models,” 2022. [Online]. Available: https://arxiv.org/abs/2102.02705

work page arXiv 2022

[31] [31]

Neuzip: Memory-efficient training and inference with dynamic compression of neu- ral networks.arXiv preprint arXiv:2410.20650,

Y . Hao, Y . Cao, and L. Mou, “NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks,” Oct. 2024, arXiv:2410.20650 [cs]. [Online]. Available: http://arxiv.org/abs/2410.20650

work page arXiv 2024

[32] [32]

The Entropy of Floating-Point Numbers

S. Daniels, S. H. D’Ambrosia, M. R. DeWeese, and A. Sahai, “The entropy of floating-point numbers,” 2026. [Online]. Available: https://arxiv.org/abs/2605.11546 23

work page internal anchor Pith review Pith/arXiv arXiv 2026

[33] [33]

Beyond chinchilla-optimal: Ac- counting for inference in language model scaling laws

N. Sardana, J. Portes, S. Doubov, and J. Frankle, “Beyond chinchilla-optimal: Accounting for inference in language model scaling laws,” 2025. [Online]. Available: https://arxiv.org/abs/2401.00448

work page arXiv 2025

[34] [34]

An efficient reversible algorithm for linear regression,

E. D. Demaine, J. Lynch, and J. Sun, “An efficient reversible algorithm for linear regression,” in2021 International Conference on Rebooting Computing (ICRC), 2021, pp. 103–108

work page 2021

[35] [35]

Gradient-based hyperparameter optimization through reversible learning,

D. Maclaurin, D. Duvenaud, and R. P. Adams, “Gradient-based hyperparameter optimization through reversible learning,”

work page

[36] [36]

Gradient-based Hyperparameter Optimization through Reversible Learning

[Online]. Available: https://arxiv.org/abs/1502.03492

work page internal anchor Pith review Pith/arXiv arXiv

[37] [37]

Tolman,The Principles of Statistical Mechanics, by Richard C

R. Tolman,The Principles of Statistical Mechanics, by Richard C. Tolman ..., ser. International series of monographs on physics. Oxford University Press, 1942. [Online]. Available: https://books.google.com/books?id=Hbr9yAEACAAJ

work page 1942

[38] [38]

J. W. Gibbs,The Collected Works of J. Willard Gibbs. Longmans, Green and Company, 1928, vol. 1

work page 1928

[39] [39]

The Physical Basis of the Gibbs-von Neumann entropy

O. J. E. Maroney, “The physical basis of the gibbs-von neumann entropy,” 2008. [Online]. Available: https://arxiv.org/abs/quant-ph/0701127

work page internal anchor Pith review Pith/arXiv arXiv 2008

[40] [40]

Generalizing landauer’s principle,

——, “Generalizing landauer’s principle,”Phys. Rev. E, vol. 79, p. 031105, Mar 2009. [Online]. Available: https://link.aps.org/doi/10.1103/PhysRevE.79.031105

work page doi:10.1103/physreve.79.031105 2009

[41] [41]

H. B. Callen,Thermodynamics and an introduction to thermostatistics. New York, NY: Wiley, 1985. [Online]. Available: https://cds.cern.ch/record/450289

work page 1985

[42] [42]

The (absence of a) relationship between thermodynamic and logical reversibility,

O. Maroney, “The (absence of a) relationship between thermodynamic and logical reversibility,”Studies in History and Philosophy of Science Part B: Studies in History and Philosophy of Modern Physics, vol. 36, no. 2, pp. 355–374, 2005. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1355219805000031

work page 2005

[43] [43]

L. D. Landau, E. M. Lifshitz, and L. P. Pitaevskii,Statistical Physics: Part 1, 3rd ed., ser. Course of Theoretical Physics. Oxford: Pergamon Press, 1980, vol. 5

work page 1980

[44] [44]

Chandler,Introduction to Modern Statistical Mechanics

D. Chandler,Introduction to Modern Statistical Mechanics. Oxford University Press, 1987

work page 1987

[45] [45]

FP8 Quantization: The Power of the Exponent,

A. Kuzmin, M. van Baalen, Y . Ren, M. Nagel, J. Peters, and T. Blankevoort, “FP8 Quantization: The Power of the Exponent,”Advances in Neural Information Processing Systems, vol. 35, pp. 14 651–14 662, Dec. 2022. [Online]. Available: https://proceedings.neurips.cc/paper files/paper/2022/hash/ 5e07476b6bd2497e1fbd11b8f0b2de3c-Abstract-Conference.html

work page 2022

[46] [46]

Microscaling data formats for deep learning.arXiv preprint arXiv:2310.10537,

B. D. Rouhani, R. Zhao, A. More, M. Hall, A. Khodamoradi, S. Deng, D. Choudhary, M. Cornea, E. Dellinger, K. Denolf, S. Dusan, V . Elango, M. Golub, A. Heinecke, P. James-Roxby, D. Jani, G. Kolhe, M. Langhammer, A. Li, L. Melnick, M. Mesmakhosroshahi, A. Rodriguez, M. Schulte, R. Shafipour, L. Shao, M. Siu, P. Dubey, P. Micikevicius, M. Naumov, C. Verrill...

work page arXiv 2023

[47] [47]

Tpu v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings,

B. Darvish Rouhani, R. Zhao, V . Elango, R. Shafipour, M. Hall, M. Mesmakhosroshahi, A. More, L. Melnick, M. Golub, G. Varatkar, L. Shao, G. Kolhe, D. Melts, J. Klar, R. L’Heureux, M. Perry, D. Burger, E. Chung, Z. S. Deng, S. Naghshineh, J. Park, and M. Naumov, “With Shared Microexponents, A Little Shifting Goes a Long Way,” inProceedings of the 50th Ann...

work page doi:10.1145/3579371.3589351 2023

[48] [48]

Characterization and Mitigation of Training Instabilities in Microscaling Formats,

H. Su, M. Kwun, S. Gil, S. Kakade, and N. Anand, “Characterization and Mitigation of Training Instabilities in Microscaling Formats,” Jun. 2025. [Online]. Available: https://arxiv.org/abs/2506.20752v1

work page arXiv 2025

[49] [49]

J. M. Muller,Handbook of floating-point arithmetic / Jean-Michel Muller [and others].Boston: Birkhauser, 2010

work page 2010

[50] [50]

What every computer scientist should know about floating-point arithmetic,

D. Goldberg, “What every computer scientist should know about floating-point arithmetic,”ACM Comput. Surv., vol. 23, no. 1, p. 5–48, Mar. 1991. [Online]. Available: https://doi.org/10.1145/103162.103163

work page doi:10.1145/103162.103163 1991

[51] [51]

T. M. Cover and J. A. Thomas,Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). USA: Wiley-Interscience, 2006

work page 2006

[52] [52]

On the dimension and entropy of probability distributions,

A. R ´enyi, “On the dimension and entropy of probability distributions,”Acta Mathematica Academiae Scientiarum Hungarica, vol. 10, no. 1, pp. 193–215, Mar. 1959. [Online]. Available: https://doi.org/10.1007/BF02063299

work page doi:10.1007/bf02063299 1959

[53] [53]

Information Theory and Statistical Mechanics,

E. T. Jaynes, “Information Theory and Statistical Mechanics,” inStatistical Physics, ser. Brandeis Summer Institute. New York, NY: W. A. Benjamin Inc., 1962, pp. 181–218

work page 1962

[54] [54]

Prior probabilities,

——, “Prior probabilities,”IEEE Transactions on Systems and Cybernetics, no. 3, pp. 227–241, 1968

work page 1968

[55] [55]

Asymptotic entropy-constrained performance of tessellating and universal randomized lattice quantization,

T. Linder and K. Zeger, “Asymptotic entropy-constrained performance of tessellating and universal randomized lattice quantization,”IEEE Transactions on Information Theory, vol. 40, no. 2, pp. 575–579, Mar. 1994. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/312189

work page 1994

[56] [56]

Asymptotically efficient quantizing,

H. Gish and J. Pierce, “Asymptotically efficient quantizing,”IEEE Transactions on Information Theory, vol. 14, no. 5, pp. 676–683, Sep. 1968. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/1054193

work page arXiv 1968

[57] [57]

Quantization,

R. M. Gray and D. L. Neuhoff, “Quantization,”IEEE Transactions on Information Theory, vol. 44, no. 6, pp. 2325–2383, 1998

work page 1998

[58] [58]

Communication in the Presence of Noise,

C. Shannon, “Communication in the Presence of Noise,”Proceedings of the IRE, vol. 37, no. 1, pp. 10–21, Jan. 1949. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/1697831

work page arXiv 1949

[59] [59]

N. L. Johnson, S. Kotz, and N. Balakrishnan,Continuous univariate distributions, 2nd ed. New York: Wiley, 1994

work page 1994

[60] [60]

Stochastic gradient descent as approximate bayesian inference,

S. Mandt, M. D. Hoffman, and D. M. Blei, “Stochastic gradient descent as approximate bayesian inference,”J. Mach. Learn. Res., vol. 18, no. 1, p. 4873–4907, Jan. 2017

work page 2017

[61] [61]

A variational analysis of stochastic gradient algorithms,

——, “A variational analysis of stochastic gradient algorithms,” inProceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ser. ICML’16. JMLR.org, 2016, p. 354–363. 24

work page 2016

[62] [62]

Three Factors Influencing Minima in SGD

S. Jastrzebski, Z. Kenton, D. Arpit, N. Ballas, A. Fischer, Y . Bengio, and A. J. Storkey, “Three factors influencing minima in sgd,”ArXiv, vol. abs/1711.04623, 2017. [Online]. Available: https://api.semanticscholar.org/CorpusID:7311295

work page internal anchor Pith review Pith/arXiv arXiv 2017

[63] L. Bottou, F. E. Curtis, and J. Nocedal, “Optimization methods for large-scale machine learning,” SIAM Review, vol. 60, no. 2, pp. 223–311, 2018.

[64] G. A. Pavliotis, Stochastic processes and applications: diffusion processes, the Fokker-Planck and Langevin equations, ser. Texts in Applied Mathematics, vol. 60. New York: Springer, 2014.

[65] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling laws for neural language models,” 2020. [Online]. Available: https://arxiv.org/abs/2001.08361

[66] J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de Las Casas, L. A. Hendricks, J. Welbl, A. Clark, T. Hennigan, E. Noland, K. Millican, G. van den Driessche, B. Damoc, A. Guy, S. Osindero, K. Simonyan, E. Elsen, J. W. Rae, O. Vinyals, and L. Sifre, “Training compute-optimal large language models,” 2022. [Online]. Available: ...

[67] D. Patterson, J. Gonzalez, Q. Le, C. Liang, L.-M. Munguia, D. Rothchild, D. So, M. Texier, and J. Dean, “Carbon emissions and large neural network training,” 2021. [Online]. Available: https://arxiv.org/abs/2104.10350

[68] R. Schaeffer, B. Miranda, and S. Koyejo, “Are emergent abilities of large language models a mirage?” Advances in Neural Information Processing Systems, vol. 36, 2023.

[69] Z. Du, A. Zeng, Y. Dong, and J. Tang, “Understanding Emergent Abilities of Language Models from the Loss Perspective,” Jan. 2025, arXiv:2403.15796 [cs]. [Online]. Available: http://arxiv.org/abs/2403.15796

[70] T. Schmiedl and U. Seifert, “Optimal finite-time processes in stochastic thermodynamics,” Phys. Rev. Lett., vol. 98, p. 108301, Mar. 2007. [Online]. Available: https://link.aps.org/doi/10.1103/PhysRevLett.98.108301

[71] D. A. Sivak and G. E. Crooks, “Thermodynamic metrics and optimal paths,” Phys. Rev. Lett., vol. 108, p. 190602, May 2012. [Online]. Available: https://link.aps.org/doi/10.1103/PhysRevLett.108.190602

[73] N. Freitas, J.-C. Delvenne, and M. Esposito, “Stochastic thermodynamics of nonlinear electronic circuits: A realistic framework for computing around kT,” Phys. Rev. X, vol. 11, p. 031064, Sep. 2021. [Online]. Available: https://link.aps.org/doi/10.1103/PhysRevX.11.031064

[74] A. Kolchinsky and D. H. Wolpert, “Dependence of dissipation on the initial distribution over states,” Journal of Statistical Mechanics: Theory and Experiment, vol. 2017, 2016. [Online]. Available: https://api.semanticscholar.org/CorpusID:17899737

[75] D. H. Wolpert and A. Kolchinsky, “Thermodynamics of computing with circuits,” New Journal of Physics, vol. 22, no. 6, p. 063047, Jun. 2020. [Online]. Available: https://doi.org/10.1088/1367-2630/ab82b8

[76] “BFloat16: The secret to high performance on Cloud TPUs,” Google Cloud Blog, https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus, [Accessed 01-12-2025].

[77] A. Gholami, S. Kim, Z. Dong, Z. Yao, M. W. Mahoney, and K. Keutzer, “A survey of quantization methods for efficient neural network inference,” 2021. [Online]. Available: https://arxiv.org/abs/2103.13630

[78] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, “Deep learning with limited numerical precision,” 2015. [Online]. Available: https://arxiv.org/abs/1502.02551

[79] R. Krishnamoorthi, “Quantizing deep convolutional networks for efficient inference: A whitepaper,” 2018. [Online]. Available: https://arxiv.org/abs/1806.08342

[80] S. Ashkboos, B. Verhoef, T. Hoefler, E. Eleftheriou, and M. Dazzi, “EfQAT: An efficient framework for quantization-aware training,” CoRR, vol. abs/2411.11038, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2411.11038