
arXiv: 2605.19195 · v1 · pith:7MUC62TOnew · submitted 2026-05-18 · cond-mat.stat-mech · cs.IT · math.IT · stat.ML

The Thermodynamic Costs of Simple Linear Regression

Pith reviewed 2026-05-20 06:59 UTC · model grok-4.3

classification: cond-mat.stat-mech · cs.IT · math.IT · stat.ML
keywords: thermodynamic costs · linear regression · Landauer's principle · energy scaling laws · stochastic gradient descent · generalization error · floating-point arithmetic · entropy production

The pith

Floating-point linear regression carries a thermodynamic lower bound on energy that determines the optimal training dataset size for a given prediction accuracy target.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper applies Landauer's principle to estimate the minimum energy dissipated when irreversibly computing simple linear regression on floating-point numbers, considering both exact solutions and stochastic gradient descent implementations. These energy estimates are combined with the generalization error that improves as more training data is used, producing scaling laws for the dataset size that minimizes total energy when inference must meet a fixed accuracy demand. The authors also outline a way to lower-bound the extra entropy produced when continuous-valued inputs create mismatches with the algorithm's discrete operations.

Core claim

By counting the irreversible bit erasures that occur in the floating-point arithmetic steps of linear regression, the authors derive a concrete lower bound on dissipated energy that increases with both dataset size and numerical precision. When this cost is weighed against the reduction in generalization error that larger datasets provide, an energy-optimal finite dataset size emerges for any required inference accuracy. The same counting approach is applied to stochastic gradient descent, yielding a distinct but related scaling relation.
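
As a rough illustration of the quantity being counted, the sketch below evaluates the Landauer bound of k_B T ln 2 joules per erased bit. The function names, the 300 K default temperature, and the entropy figures in the usage note are assumptions chosen for illustration; they are not taken from the paper.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant in J/K (exact SI value)


def landauer_energy_joules(bits_erased, temperature_kelvin=300.0):
    """Minimum heat dissipated when erasing `bits_erased` bits at temperature T."""
    return bits_erased * K_B * temperature_kelvin * math.log(2)


def landauer_cost_from_entropies(h_in_bits, h_out_bits, temperature_kelvin=300.0):
    """Bound from the drop in Shannon entropy between input and output states.

    A negative result means the bound is vacuous for that map
    (the output carries more entropy than the input).
    """
    return landauer_energy_joules(h_in_bits - h_out_bits, temperature_kelvin)


if __name__ == "__main__":
    # One bit at room temperature: roughly 2.87e-21 J.
    print(landauer_energy_joules(1))
    # Hypothetical entropies (assumption): 1000 bits of input data
    # compressed into a 20-bit model parameter.
    print(landauer_cost_from_entropies(1000.0, 20.0))
```

The second function reflects the way the figure captions describe the cost, as an entropy difference between input states and the output predictor, rather than a raw count of overwritten memory cells.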

What carries the argument

Landauer's principle applied to irreversible bit erasures in floating-point arithmetic steps of exact linear regression or stochastic gradient descent
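
For readers who want the two algorithms side by side, here is a minimal sketch of zero-intercept simple linear regression computed exactly and by mini-batch SGD on the same stream of samples. The defaults for the learning rate, batch size, and variances follow values quoted in the figure captions; the true slope, the initial guess of zero, and all function and parameter names are assumptions made for illustration. The sketch says nothing about energy; it only shows the computations whose erasures are being counted.

```python
import random


def exact_and_sgd(w_true=2.0, sigma_x=1.0, sigma_xi=1.0,
                  eta=0.01, batch=10, steps=2000, seed=0):
    """Return (exact estimate, SGD estimate) for y = w_true * x + noise."""
    rng = random.Random(seed)
    w_hat = 0.0      # SGD iterate (initial guess is an assumption)
    sum_xy = 0.0     # running sums for the exact estimator
    sum_xx = 0.0
    for _ in range(steps):
        grad = 0.0
        for _ in range(batch):
            x = rng.gauss(0.0, sigma_x)
            y = w_true * x + rng.gauss(0.0, sigma_xi)
            # Gradient of the squared loss 0.5 * (w_hat * x - y)**2 w.r.t. w_hat.
            grad += (w_hat * x - y) * x
            sum_xy += x * y
            sum_xx += x * x
        w_hat -= eta * grad / batch
    # Exact zero-intercept least squares: sum(x*y) / sum(x*x).
    return sum_xy / sum_xx, w_hat


if __name__ == "__main__":
    w_exact, w_sgd = exact_and_sgd()
    print(w_exact, w_sgd)
```

The exact estimator reads every sample once and emits a single number, while SGD overwrites its iterate at every step; that difference in how intermediate state is discarded is why the two can carry different Landauer costs.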

If this is right

  • Total energy for a linear model with fixed generalization-error target reaches a minimum at a finite dataset size rather than growing without limit.
  • Energy costs of exact regression and SGD versions scale differently with precision and data volume, allowing direct comparison of their thermodynamic efficiency.
  • Inference demand that requires lower generalization error shifts the optimal training set size upward in a quantifiable way.
  • Mismatch between continuous inputs and discrete algorithm steps produces an additional entropy-production term that can be lower-bounded separately.
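
The first and third bullets can be illustrated with a toy trade-off. The sketch assumes a training cost that grows linearly in the dataset size n and an error penalty proportional to the excess risk of zero-intercept least squares with Gaussian data, which is the noise variance divided by (n - 2) for n > 2. The cost model, `energy_per_sample`, and `error_price` are hypothetical stand-ins; this is not the paper's profit function from Eq. (178) or Eq. (181).

```python
import math


def toy_total_cost(n, energy_per_sample, error_price, noise_var=1.0):
    """Linear training cost plus a penalty on excess risk noise_var / (n - 2)."""
    excess_risk = noise_var / (n - 2)
    return energy_per_sample * n + error_price * excess_risk


def optimal_n_grid(energy_per_sample, error_price, noise_var=1.0, n_max=10_000):
    """Brute-force search for the integer n (n >= 3) minimizing the toy cost."""
    return min(
        range(3, n_max + 1),
        key=lambda n: toy_total_cost(n, energy_per_sample, error_price, noise_var),
    )


def optimal_n_closed_form(energy_per_sample, error_price, noise_var=1.0):
    """Setting the derivative to zero gives n* = 2 + sqrt(price * var / energy)."""
    return 2 + math.sqrt(error_price * noise_var / energy_per_sample)


if __name__ == "__main__":
    print(optimal_n_grid(1.0, 10_000.0))         # finite optimum
    print(optimal_n_closed_form(1.0, 10_000.0))
    print(optimal_n_grid(1.0, 40_000.0))         # stricter accuracy demand
```

In this toy model a higher price on error moves the optimum to a larger n, and a higher energy cost per sample moves it to a smaller n, which is the qualitative behavior the bullets describe.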

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same counting method could be extended to other linear models such as logistic regression or to the forward passes of small neural networks.
  • Hardware designers could use these bounds to prioritize reductions in floating-point erasure costs when building energy-efficient ML accelerators.
  • For very large-scale inference workloads the derived scaling laws predict a crossover beyond which adding more training data becomes energetically wasteful.

Load-bearing premise

Landauer's principle can be applied directly to count the irreversible bit operations in floating-point linear regression and stochastic gradient descent without additional hidden costs from memory access or control flow.

What would settle it

An experiment that measures the actual energy consumed by a processor executing floating-point linear regression on a known dataset and finds dissipation below the calculated Landauer bound for the same number of bit erasures and precision would falsify the bound.

Figures

Figures reproduced from arXiv: 2605.19195 by Anant Sahai, Michael R. DeWeese, Samuel H. D'Ambrosia, Sultan M. Daniels.

Figure 1: Floating-point structure and Gaussian approximations. (1a) The structure of a floating-point number where each box represents one bit. The true bin size function $\Delta(x)$ is plotted on log-log scale for both midpoint (black solid curve) and floor quantization (blue dotted curve) with $p = 10$ and $E = 4$ along with the smooth approximation $\Delta_s(x)$ (red dashed curve). (1b) shows the entropy of $X_{fp}$ which is a discret…
Figure 2: Regimes where the approximations hold. (2a) Shows the exact discrete entropy of $X_{fp}$ as $p$ varies for each $E$ when the underlying continuous random variable is $X \sim \mathcal{N}(0, 1)$. The approximation $\tilde{H}_s^0(p)$ is also plotted as the dashed red line. Notice that the curves for $E \geq 4$ are directly on top of each other, showing that $\tilde{H}_s^0(p)$ is close to $H(X_{fp})$ when $E$ is large enough to keep the probability of overfl…
Figure 3: The Landauer cost for exact zero-intercept simple linear regression. Input and output states are floating-point numbers with $p = 24$ and $E = 8$. The candidate values of the $\mathrm{SNR} = w^2\sigma_x^2/\sigma_\xi^2$ are 0.062, 0.25, 1, 4, and 25. (3a) The output model approximate entropy $\tilde{H}_s^w(\hat{W}_{fp})$, and its dependence on the number of data samples $n$. (3b) The approximate entropy difference rate between the input data and the…
Figure 4: Entropy dynamics of SGD. Input and output states are assumed to be single-precision floating-point numbers, with $p = 24$. Here $\hat{w}_0 = 1$, $\sigma_x^2 = 1$, $\sigma_\xi^2 = 1$, $\eta = 10^{-2}$, and $B = 10$. (4a) shows the approximate floating-point entropy of $\hat{W}$ vs SGD step number $k$. (4b) shows the entropy difference between the input states and the SGD predictor at step $k$. Here the precision contribution is given by $\Delta E_p^{\mathrm{SGD}}/(k_B T$ …
Figure 5: Optimal dataset size for the exact linear regression formula and for stochastic gradient descent. (5a) shows the profit gained versus the dataset size $n$ for the exact linear regression formula $u_{\mathrm{Ex}}(n)$ given in Eq. (178). (5c) shows $u'_{\mathrm{Ex}}(n)$, the derivative of the profit function with respect to $n$ as given in Eq. (180). (5b) shows the profit gained versus the dataset size $n$ for SGD $u_{\mathrm{SGD}}(n)$ given in Eq. (181…
Figure 6: Clipping and midpoint quantization with $K = 3$ representable values $\{u_1, u_2, u_3\}$. The blue vertical lines represent the midpoints, and the arrows depict the regions of the real line that map to each representable value at a black vertical line. Corollary B.1.1 (Entropy of a Gaussian Random Variable Quantized to a Floating-point Number). Let $X \sim \mathcal{N}(\mu, \sigma_x^2)$. Let $X_{fp}$ be the clipped and midpoint quantized f…
Figure 7: Numerical scale of $C_0$ and $\varepsilon_0$ (Corollary C.1.1) for $d = 1$, $\sigma = 1$. Each curve corresponds to a different exponent width $E$. Circles denote Gaussian marginals; squares denote Student's $t_5$ marginals.
[Plot residue from an adjacent figure. Panels: Midpoint, Floor. Vertical axis: $H(X_{fp})$ (bits). Legend: Exact, $H_s(p)$, $H_s^0(p)$.]
Figure 8: Simulating approximation 3: The dependence of the entropy of a normally distributed floating-point number on its standard deviation $\sigma$ and mean $\mu$. Dashed lines show Approximation 3 from Eq. (20), while solid lines show the exact entropy from Corollary B.1.1. Approximation 3, $E[\log[|x|/\sqrt{2}]]$: We can show that Eq. (18) approximates Eq. (20) in two cases. First, for a Gaussian distribution where its mean is …
Figure 9: Histogram of exponent values for a normally distributed random variable. The distribution of exponent states for $X \sim \mathcal{N}(0, \sigma^2)$. From left to right, the distributions plotted have $\sigma = \{10^{-3}, 10^{0}, 10^{3}\}$. All three have a discrete entropy $H(\log[X]) \approx 2.54$ bits. These distributions conform well to observational data in [30], [31]. Theorem D.1 (Approximating the entropy of a mean-zero univariate Gaussian). Th…
Figure 10: Fit quality for asymptotic stochastic gradient descent. Empirical investigation of the validity of the continuous Ornstein-Uhlenbeck process approximation with a simulation of $\hat{w}$ with $\eta = 0.01$, $\tau = 200$, $\sigma_x^2 = 1$, and $\sigma_\xi^2 = 1$, for 1000 trials. From top to bottom, the batch sizes are $B = \{1, 5, 25\}$ while $\tau = 200$, showing the approximation is already effective for these parameters at low $B$.
Figure 11: SGD dynamics for zero-intercept simple linear regression. (11a) SGD parameter $\hat{w}$ distribution at selected iterations, for 5000 separate trials of running SGD. $\sigma_x^2 = 1$, $\sigma_\xi^2 = 1$, $\eta = 0.01$, $B = 10$. (11b) Theoretical mean and standard deviation compared to the empirical mean and standard deviation as step number $k$ increases up to final value $k = \tau = 2000$. APPENDIX F LANDAUER COST OF AVERAGING AND SUMMING …
Figure 12: The probability density function $f_Z(z)$, with $\sigma_x = \sigma_\xi = 1$. As $n$ increases, the distribution becomes more peaked around $z = 0$. The simulation is of 50000 trials. Lemma G.1. Let $X \sim \mathcal{N}(0, \sigma_x^2 I_n)$ and $\Xi \sim \mathcal{N}(0, \sigma_\xi^2 I_n)$ be independent, with $n \in \mathbb{N}$, and define $Z = \frac{X^T \Xi}{X^T X}$. The probability density function of $Z$ is $f_Z(z) =$ …
Figure 13: The Landauer cost for exact zero-intercept simple linear regression. Input and output states are floating-point numbers with $p = 4$ and $E = 4$. The candidate values of the $\mathrm{SNR} = w^2\sigma_x^2/\sigma_\xi^2$ are 0.062, 0.25, 1, 4, and 25. (13a) The output model entropy, and its dependence on the data size $n$ and ground truth $w$. (13b) The entropy difference for various values of SNR and a range of $n$.
Figure 14: A lower bound on the mismatch cost for continuously parameterized inputs. $\mathrm{MMC}_v(\sigma_x, \sigma_\xi)$ as a function of $\sigma_x$ and $\sigma_\xi$, with each plot assuming the entropy flow $\Delta S_{\mathrm{env,Ex}} = \Delta S_{\mathrm{env,SGD}} = C + \alpha(\sigma_x^2 + \sigma_\xi^2)$, and $w = 1$. A bounded optimization is performed for $0.75 \leq \sigma_x \leq 5$, $0.5 \leq \sigma_\xi \leq 5$. (14a) A sample $\mathrm{MMC}_v$ landscape for exact linear regression with the illustrative $\Delta S_{\mathrm{env,Ex}} = C + \alpha(\sigma_x^2 + \sigma_\xi^2)$, and $n = 10$. 14…
Figure 15: The effect of the learning rate and batch size on the optimal dataset size for stochastic gradient descent. (15a) shows $u_{\mathrm{SGD}}(n)$ given in Eq. (181) for varying values of the learning rate $\eta$. Notice that smaller learning rates lead to larger optimal dataset sizes. (15c) shows $u'_{\mathrm{SGD}}(n)$ in Eq. (183) for the different learning rates. (15b) shows $u_{\mathrm{SGD}}(n)$ for various values of the batch size $B$. For the profit …
Figure 16: Exact midpoint-quantized entropy vs. standard deviation $\sigma$. For each precision $p \in \{1, \ldots, 8\}$, the exact discrete entropy $H(X_{fp})$ of $X \sim \mathcal{N}(0, \sigma^2)$ (with $\mu = 0$) is plotted as a function of $\sigma$ over a wide log-scale range. Each curve corresponds to a distinct value of exponent bits $E \in \{0, 1, \ldots, 7\}$. The vertical dashed lines mark $\sigma = 2^{e_{\min}}$ and the vertical dotted lines mark $\sigma = 2^{e_{\max}}$ for each $E$, and …
Figure 17: Exact midpoint-quantized entropy vs. mean $\mu$, $p = 1$. The exact entropy $H(X_{fp})$ of $X \sim \mathcal{N}(\mu, 1)$ (with $\sigma = 1.0$ fixed and $p = 1$) is plotted as a function of $\mu \in [-100, 100]$. Each panel shows a different number of exponent bits $E$. The solid curve is the exact entropy and the dashed curve is the large-$|\mu|$ approximation $\tilde{H}_s^\mu(X_{fp})$. These plots were generated by sweeping $\mu$ over 500 linearly-spaced points and e…
Figure 18: Exact midpoint-quantized entropy vs. mean $\mu$, $p = 2$. Same experiment as …
Figure 19: Exact midpoint-quantized entropy vs. mean $\mu$, $p = 3$. Same experiment as …
Figure 20: Exact midpoint-quantized entropy vs. mean $\mu$, $p = 4$. Same experiment as …
Figure 21: Exact midpoint-quantized entropy vs. mean $\mu$, $p = 5$. Same experiment as …
Figure 22: Exact midpoint-quantized entropy vs. mean $\mu$, $p = 6$. Same experiment as …
Figure 23: Exact midpoint-quantized entropy vs. mean $\mu$, $p = 7$. Same experiment as …
Original abstract

The construction of models from data is a significant contributor to the energetic costs of computation. Because of this, understanding how foundational thermodynamic bounds apply to modeling algorithms will be increasingly important. Here, we study the thermodynamic costs of a basic and fundamental modeling algorithm: simple linear regression. Following Landauer, we approximate the thermodynamic lower bound on irreversibly performing both exact linear regression and linear regression via stochastic gradient descent as implemented on floating-point numbers. From this, we derive energy-cost-aware scaling laws for the optimal dataset size for training a linear regression model given a generalization-error-dependent demand for inference. Additionally, we discuss a method to lower-bound the entropy production from the mismatch cost for algorithms with continuous input variables.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript approximates thermodynamic lower bounds on the irreversible energy costs of exact simple linear regression and its implementation via stochastic gradient descent on floating-point arithmetic, by applying Landauer's principle to count bit erasures. From these bounds it derives energy-cost-aware scaling laws for the optimal training dataset size that balance computational dissipation against a generalization-error requirement at inference time. It additionally outlines a method to lower-bound entropy production arising from mismatch costs when input variables are continuous.

Significance. If the bit-operation counting procedure yields a valid and dominant lower bound, the resulting scaling laws would supply a concrete, falsifiable link between thermodynamic principles and practical choices of training-set size in linear models. The mismatch-cost discussion for continuous variables is a useful technical contribution that could extend to other regression or optimization settings. The work is strongest where it remains within the abstract model of irreversible operations; its practical relevance hinges on whether those operations dominate real hardware dissipation.

major comments (2)
  1. [Thermodynamic bounds and floating-point implementation] The central approximation that counts only irreversible bit operations in floating-point linear regression and SGD (as described in the derivation following the abstract) does not address memory-hierarchy accesses, data movement, or control-flow overhead. These terms are not strictly proportional to bit erasures and can exceed the Landauer floor by orders of magnitude on current hardware; without a quantitative argument that they remain sub-dominant, the claimed lower bound cannot reliably support the derived scaling laws for optimal dataset size.
  2. [Energy-cost aware scaling laws] The scaling laws for optimal dataset size are obtained by minimizing a total cost that includes both the approximated dissipation and a generalization-error term. If the error metric used to define the inference demand is the same quantity that enters the cost function (as suggested by the abstract phrasing), the optimum may be tautological rather than predictive; an explicit statement of the functional form and any free parameters in the scaling relation is needed to assess this.
minor comments (2)
  1. [Mismatch cost discussion] Notation for the mismatch-cost lower bound on continuous variables should be introduced with a short example (e.g., a one-dimensional Gaussian input) to clarify how the continuous-to-discrete translation is performed.
  2. [Introduction] The abstract states that bounds are 'approximated'; a brief paragraph in the introduction or methods section listing the concrete approximations (e.g., neglect of reversible steps, assumption of uniform bit cost) would improve readability.
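
A one-dimensional Gaussian example of the continuous-to-discrete step, in the spirit of minor comment 1, can be sketched as follows. The sketch builds a hypothetical toy floating-point grid (sign, `E` exponent bits, `p` significand bits including the implicit leading one, no subnormals, plus zero), applies the clipping and midpoint quantization described in the Figure 6 caption, and computes the discrete entropy of the quantized variable. The exponent range convention and all function names are assumptions; the paper's exact format may differ.

```python
import math


def toy_float_values(p, E):
    """Sorted representable values of a toy format (requires p >= 1, E >= 1)."""
    e_min = 1 - 2 ** (E - 1)   # assumed exponent range, 2**E exponents in total
    e_max = 2 ** (E - 1)
    positives = [
        (1 + k / 2 ** (p - 1)) * 2.0 ** e
        for e in range(e_min, e_max + 1)
        for k in range(2 ** (p - 1))
    ]
    return sorted([-v for v in positives] + [0.0] + positives)


def normal_cdf(x, mu=0.0, sigma=1.0):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def quantized_entropy_bits(p, E, mu=0.0, sigma=1.0):
    """Entropy (bits) of N(mu, sigma^2) after clipping and midpoint quantization."""
    values = toy_float_values(p, E)
    # Bin edges are midpoints between neighbors; the outermost bins extend
    # to -inf and +inf, which implements clipping.
    edges = [(a + b) / 2 for a, b in zip(values, values[1:])]
    cdf = [0.0] + [normal_cdf(e, mu, sigma) for e in edges] + [1.0]
    probs = [hi - lo for lo, hi in zip(cdf, cdf[1:])]
    return -sum(q * math.log2(q) for q in probs if q > 0.0)


if __name__ == "__main__":
    for p in (2, 5, 10):
        print(p, quantized_entropy_bits(p, E=4))
```

The entropy returned here is the kind of input-state quantity that enters a Landauer-style entropy difference; increasing `p` refines the grid and raises it.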

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us clarify the scope and presentation of our results. Below we respond point-by-point to the major comments. We have revised the manuscript to incorporate additional discussion and explicit statements of the scaling relations where appropriate.

  1. Referee: [Thermodynamic bounds and floating-point implementation] The central approximation that counts only irreversible bit operations in floating-point linear regression and SGD (as described in the derivation following the abstract) does not address memory-hierarchy accesses, data movement, or control-flow overhead. These terms are not strictly proportional to bit erasures and can exceed the Landauer floor by orders of magnitude on current hardware; without a quantitative argument that they remain sub-dominant, the claimed lower bound cannot reliably support the derived scaling laws for optimal dataset size.

    Authors: Our derivation applies Landauer's principle strictly to the irreversible bit erasures that occur during the floating-point arithmetic operations of exact linear regression and SGD. This yields a hardware-independent lower bound on the thermodynamic cost of those specific operations. We agree that memory-hierarchy accesses, data movement, and control flow are not included and can dominate dissipation on existing processors. In the revised manuscript we have added an explicit paragraph in the discussion section stating that the reported bounds and scaling laws concern only the Landauer-limited arithmetic component; they are intended as theoretical minima that any physical implementation must respect, rather than as predictions of total energy use on current hardware. Because a quantitative demonstration of sub-dominance would require device-specific models outside the scope of this theoretical study, we have instead emphasized how the scaling laws can be combined with empirical overhead models in future applied work. revision: partial

  2. Referee: [Energy-cost aware scaling laws] The scaling laws for optimal dataset size are obtained by minimizing a total cost that includes both the approximated dissipation and a generalization-error term. If the error metric used to define the inference demand is the same quantity that enters the cost function (as suggested by the abstract phrasing), the optimum may be tautological rather than predictive; an explicit statement of the functional form and any free parameters in the scaling relation is needed to assess this.

    Authors: The generalization error appears solely as an external performance requirement at inference time, not as a term inside the training dissipation cost. We minimize the thermodynamic cost of training subject to the constraint that the deployed model must achieve a target generalization error ε. In the revised manuscript we now state the explicit functional form: the optimal training-set size scales as N* ∝ (log(1/ε) + c · precision) / β, where β is the per-sample dissipation coefficient derived from bit erasures and c collects constants from the linear-regression solution. The free parameters are the target error ε, the floating-point precision, and the data variance; these are listed in the new scaling-law subsection. Because the energy cost is incurred only during training while ε is a post-training specification, the resulting optimum is predictive rather than tautological. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation builds from Landauer bounds and statistical scaling without self-referential reduction


The paper starts from Landauer's principle applied to bit erasures in exact linear regression and floating-point SGD, counts irreversible operations, and derives energy-aware scaling laws for optimal dataset size under a generalization-error constraint. No equations or steps reduce the final scaling law to a fitted parameter or prior self-citation by construction; the optimal-N expression emerges from combining the thermodynamic cost model with standard bias-variance or generalization bounds rather than tautologically re-expressing the input error metric. The derivation remains self-contained against external thermodynamic and learning-theoretic benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on Landauer's principle applied to floating-point operations and on the assumption that mismatch costs for continuous variables can be bounded separately from the main computation.

axioms (1)
  • domain assumption Landauer's principle supplies the minimum energy cost for each irreversible bit erasure or overwrite performed during linear regression and SGD updates on floating-point numbers.
    Explicitly invoked in the abstract with the phrase 'Following Landauer, we approximate the thermodynamic lower bound'.



Reference graph

Works this paper leans on

101 extracted references · 101 canonical work pages · 10 internal anchors

  1. A. Shehabi, A. Newkirk, S. Smith, A. Hubbard, N. Lei, M. Siddik et al., “2024 United States Data Center Energy Usage Report,” Lawrence Berkeley National Laboratory, Berkeley, CA, USA, Tech. Rep. LBNL-2001637, 2024. [Online]. Available: https://escholarship.org/uc/item/32d6m0d1

  2. A. S. Luccioni, Y. Jernite, and E. Strubell, “Power hungry processing: Watts driving the cost of AI deployment?” in Proceedings of the ACM Conference on Fairness, Accountability, and Transparency (FAccT ’24), Rio de Janeiro, Brazil, 2024, pp. 85–99.

  3. A. de Vries, “The growing energy footprint of artificial intelligence,” Joule, vol. 7, no. 10, pp. 2191–2194, Oct 2023.

  4. R. Verdecchia, J. Sallou, and L. Cruz, “A systematic review of green AI,” WIREs Data Mining and Knowledge Discovery, vol. 13, no. 4, p. e1507, 2023.

  5. P. Schuster, “The end of Moore’s law: Living without an exponential,” Complexity, vol. 21, no. 2, pp. 7–10, 2016.

  6. T. M. Conte et al., “The end of Moore’s law: A new beginning for information technology,” Computing Community Consortium (CCC), Computing Research Association, Tech. Rep., 2017. [Online]. Available: https://cra.org/ccc/resources/ccc-led-whitepapers/

  7. L. B. Kish, “Moore’s law and the energy requirement of computing versus performance,” IEE Proceedings – Circuits, Devices and Systems, vol. 151, no. 2, pp. 190–194, Apr 2004.

  8. N. Zhang, “Moore’s law is dead, long live Moore’s law!” arXiv preprint arXiv:2205.05086, 2022. [Online]. Available: https://arxiv.org/abs/2205.05086

  9. R. Landauer, “Irreversibility and heat generation in the computing process,” IBM Journal of Research and Development, vol. 5, no. 3, pp. 183–191, Jul 1961.

  10. C. H. Bennett, “The thermodynamics of computation: a review,” International Journal of Theoretical Physics, vol. 21, no. 12, pp. 905–940, Dec 1982.

  11. S. Lloyd, “Ultimate physical limits to computation,” Nature, vol. 406, no. 6799, pp. 1047–1054, Aug 2000.

  12. M. P. Frank, “Physical limits of computing,” Computer, vol. 50, no. 9, pp. 14–23, Sep 2017.

  13. C. H. Bennett, “The thermodynamics of computation: a review,” International Journal of Theoretical Physics, vol. 21, no. 12, pp. 905–940, 1982 (same as Bennett 1982).

  14. R. Landauer, “The physical limits of communication and computation,” IEEE Spectrum, vol. 9, no. 5, pp. 23–29, May 1972.

  15. D. H. Wolpert, J. Korbel, C. W. Lynn, F. Tasnim, J. A. Grochow, G. Kardeş, J. B. Aimone, V. Balasubramanian, E. D. Giuli, D. Doty, N. Freitas, M. Marsili, T. E. Ouldridge, A. W. Richa, P. Riechers, Édgar Roldán, B. Rubenstein, Z. Toroczkai, and J. Paradiso, “Is stochastic thermodynamics the key to understanding the energy costs of computation?” Proc...

  16. D. H. Wolpert, “The stochastic thermodynamics of computation,” Journal of Physics A: Mathematical and Theoretical, vol. 52, no. 19, p. 193001, 2019.

  17. A. Yadav, F. Caravelli, and D. H. Wolpert, “System-independent lower bounds on entropy production incurred by running a computer program,” arXiv preprint arXiv:2411.16088, 2025. [Online]. Available: https://arxiv.org/abs/2411.16088

  18. G. Manzano, G. Kardeş, É. Roldán, and D. H. Wolpert, “Thermodynamics of computations with absolute irreversibility, unidirectional transitions, and stochastic computation times,” Physical Review X, vol. 14, no. 2, p. 021026, 2024.

  19. A. Kolchinsky and D. H. Wolpert, “Dependence of integrated, instantaneous, and fluctuating entropy production on the initial state in quantum and classical processes,” Physical Review E, vol. 104, no. 5, p. 054107, Nov 2021.

  20. W. S. McCulloch and W. Pitts, “A logical calculus of the ideas immanent in nervous activity,” Bulletin of Mathematical Biophysics, vol. 5, no. 4, pp. 115–133, Dec 1943.

  21. F. Rosenblatt, “The perceptron: A probabilistic model for information storage and organization in the brain,” Psychological Review, vol. 65, no. 6, pp. 386–408, Nov 1958.

  22. B. Widrow and M. E. Hoff, “Adaptive switching circuits,” in 1960 IRE WESCON Convention Record – Part 4. New York: Institute of Radio Engineers, 1960, pp. 96–104.

  23. G. James, D. Witten, T. Hastie, and R. Tibshirani, An introduction to statistical learning: with applications in R, ser. Springer texts in statistics. New York: Springer, 2013.

  24. S. Goldt and U. Seifert, “Stochastic thermodynamics of learning,” Phys. Rev. Lett., vol. 118, p. 010601, Jan 2017. [Online]. Available: https://link.aps.org/doi/10.1103/PhysRevLett.118.010601

  25. E. D. Demaine, J. Lynch, G. J. Mirano, and N. Tyagi, “Energy-Efficient Algorithms,” in Proceedings of the 2016 ACM Conference on Innovations in Theoretical Computer Science, ser. ITCS ’16. New York, NY, USA: Association for Computing Machinery, Jan. 2016, pp. 321–332. [Online]. Available: https://dl.acm.org/doi/10.1145/2840728.2840756

  26. A. V. Tkachenko, “Thermodynamic bounds on energy use in deep neural networks,” 2025. [Online]. Available: https://arxiv.org/abs/2503.09980

  27. NVIDIA, “NVIDIA Blackwell Architecture Technical Overview,” NVIDIA, Tech. Rep., 2025. [Online]. Available: https://resources.nvidia.com/en-us-blackwell-architecture

  28. I. Advanced Micro Devices, “AMD CDNA 4 Architecture,” AMD, Tech. Rep., Oct. 2025. [Online]. Available: https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-4-architecture-whitepaper.pdf

  29. V. Kostina, “Data Compression With Low Distortion and Finite Blocklength,” IEEE Transactions on Information Theory, vol. 63, no. 7, pp. 4268–4285, Jul. 2017. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/7867787

  30. R. Bordawekar, B. Abali, and M.-H. Chen, “Efloat: Entropy-coded floating point format for compressing vector embedding models,” 2022. [Online]. Available: https://arxiv.org/abs/2102.02705

  31. Y. Hao, Y. Cao, and L. Mou, “NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks,” Oct. 2024, arXiv:2410.20650 [cs]. [Online]. Available: http://arxiv.org/abs/2410.20650

  32. S. Daniels, S. H. D’Ambrosia, M. R. DeWeese, and A. Sahai, “The entropy of floating-point numbers,” 2026. [Online]. Available: https://arxiv.org/abs/2605.11546

  33. [33]

    Beyond chinchilla-optimal: Ac- counting for inference in language model scaling laws

    N. Sardana, J. Portes, S. Doubov, and J. Frankle, “Beyond chinchilla-optimal: Accounting for inference in language model scaling laws,” 2025. [Online]. Available: https://arxiv.org/abs/2401.00448

  34. [34]

    An efficient reversible algorithm for linear regression,

    E. D. Demaine, J. Lynch, and J. Sun, “An efficient reversible algorithm for linear regression,” in2021 International Conference on Rebooting Computing (ICRC), 2021, pp. 103–108

  35. [35]

    Gradient-based hyperparameter optimization through reversible learning,

    D. Maclaurin, D. Duvenaud, and R. P. Adams, “Gradient-based hyperparameter optimization through reversible learning,” 2015. [Online]. Available: https://arxiv.org/abs/1502.03492

  37. [37]

    The Principles of Statistical Mechanics

    R. Tolman, The Principles of Statistical Mechanics, ser. International series of monographs on physics. Oxford University Press, 1942. [Online]. Available: https://books.google.com/books?id=Hbr9yAEACAAJ

  38. [38]

    J. W. Gibbs, The Collected Works of J. Willard Gibbs. Longmans, Green and Company, 1928, vol. 1

  39. [39]

    The Physical Basis of the Gibbs-von Neumann entropy

    O. J. E. Maroney, “The physical basis of the Gibbs-von Neumann entropy,” 2008. [Online]. Available: https://arxiv.org/abs/quant-ph/0701127

  40. [40]

    Generalizing Landauer’s principle,

    O. J. E. Maroney, “Generalizing Landauer’s principle,” Phys. Rev. E, vol. 79, p. 031105, Mar. 2009. [Online]. Available: https://link.aps.org/doi/10.1103/PhysRevE.79.031105

  41. [41]

    H. B. Callen, Thermodynamics and an introduction to thermostatistics. New York, NY: Wiley, 1985. [Online]. Available: https://cds.cern.ch/record/450289

  42. [42]

    The (absence of a) relationship between thermodynamic and logical reversibility,

    O. Maroney, “The (absence of a) relationship between thermodynamic and logical reversibility,” Studies in History and Philosophy of Science Part B: Studies in History and Philosophy of Modern Physics, vol. 36, no. 2, pp. 355–374, 2005. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1355219805000031

  43. [43]

    L. D. Landau, E. M. Lifshitz, and L. P. Pitaevskii, Statistical Physics: Part 1, 3rd ed., ser. Course of Theoretical Physics. Oxford: Pergamon Press, 1980, vol. 5

  44. [44]

    Introduction to Modern Statistical Mechanics

    D. Chandler, Introduction to Modern Statistical Mechanics. Oxford University Press, 1987

  45. [45]

    FP8 Quantization: The Power of the Exponent,

    A. Kuzmin, M. van Baalen, Y. Ren, M. Nagel, J. Peters, and T. Blankevoort, “FP8 Quantization: The Power of the Exponent,” Advances in Neural Information Processing Systems, vol. 35, pp. 14651–14662, Dec. 2022. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2022/hash/5e07476b6bd2497e1fbd11b8f0b2de3c-Abstract-Conference.html

  46. [46]

    Microscaling data formats for deep learning,

    B. D. Rouhani, R. Zhao, A. More, M. Hall, A. Khodamoradi, S. Deng, D. Choudhary, M. Cornea, E. Dellinger, K. Denolf, S. Dusan, V. Elango, M. Golub, A. Heinecke, P. James-Roxby, D. Jani, G. Kolhe, M. Langhammer, A. Li, L. Melnick, M. Mesmakhosroshahi, A. Rodriguez, M. Schulte, R. Shafipour, L. Shao, M. Siu, P. Dubey, P. Micikevicius, M. Naumov, C. Verrill...

  47. [47]

    With Shared Microexponents, A Little Shifting Goes a Long Way,

    B. Darvish Rouhani, R. Zhao, V. Elango, R. Shafipour, M. Hall, M. Mesmakhosroshahi, A. More, L. Melnick, M. Golub, G. Varatkar, L. Shao, G. Kolhe, D. Melts, J. Klar, R. L’Heureux, M. Perry, D. Burger, E. Chung, Z. S. Deng, S. Naghshineh, J. Park, and M. Naumov, “With Shared Microexponents, A Little Shifting Goes a Long Way,” in Proceedings of the 50th Ann...

  48. [48]

    Characterization and Mitigation of Training Instabilities in Microscaling Formats,

    H. Su, M. Kwun, S. Gil, S. Kakade, and N. Anand, “Characterization and Mitigation of Training Instabilities in Microscaling Formats,” Jun. 2025. [Online]. Available: https://arxiv.org/abs/2506.20752v1

  49. [49]

    J.-M. Muller et al., Handbook of Floating-Point Arithmetic. Boston: Birkhäuser, 2010

  50. [50]

    What every computer scientist should know about floating-point arithmetic,

    D. Goldberg, “What every computer scientist should know about floating-point arithmetic,” ACM Comput. Surv., vol. 23, no. 1, pp. 5–48, Mar. 1991. [Online]. Available: https://doi.org/10.1145/103162.103163

  51. [51]

    T. M. Cover and J. A. Thomas, Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). USA: Wiley-Interscience, 2006

  52. [52]

    On the dimension and entropy of probability distributions,

    A. Rényi, “On the dimension and entropy of probability distributions,” Acta Mathematica Academiae Scientiarum Hungarica, vol. 10, no. 1, pp. 193–215, Mar. 1959. [Online]. Available: https://doi.org/10.1007/BF02063299

  53. [53]

    Information Theory and Statistical Mechanics,

    E. T. Jaynes, “Information Theory and Statistical Mechanics,” in Statistical Physics, ser. Brandeis Summer Institute. New York, NY: W. A. Benjamin Inc., 1962, pp. 181–218

  54. [54]

    Prior probabilities,

    E. T. Jaynes, “Prior probabilities,” IEEE Transactions on Systems Science and Cybernetics, no. 3, pp. 227–241, 1968

  55. [55]

    Asymptotic entropy-constrained performance of tessellating and universal randomized lattice quantization,

    T. Linder and K. Zeger, “Asymptotic entropy-constrained performance of tessellating and universal randomized lattice quantization,” IEEE Transactions on Information Theory, vol. 40, no. 2, pp. 575–579, Mar. 1994. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/312189

  56. [56]

    Asymptotically efficient quantizing,

    H. Gish and J. Pierce, “Asymptotically efficient quantizing,” IEEE Transactions on Information Theory, vol. 14, no. 5, pp. 676–683, Sep. 1968. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/1054193

  57. [57]

    Quantization,

    R. M. Gray and D. L. Neuhoff, “Quantization,” IEEE Transactions on Information Theory, vol. 44, no. 6, pp. 2325–2383, 1998

  58. [58]

    Communication in the Presence of Noise,

    C. Shannon, “Communication in the Presence of Noise,” Proceedings of the IRE, vol. 37, no. 1, pp. 10–21, Jan. 1949. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/1697831

  59. [59]

    N. L. Johnson, S. Kotz, and N. Balakrishnan, Continuous univariate distributions, 2nd ed. New York: Wiley, 1994

  60. [60]

    Stochastic gradient descent as approximate Bayesian inference,

    S. Mandt, M. D. Hoffman, and D. M. Blei, “Stochastic gradient descent as approximate Bayesian inference,” J. Mach. Learn. Res., vol. 18, no. 1, pp. 4873–4907, Jan. 2017

  61. [61]

    A variational analysis of stochastic gradient algorithms,

    S. Mandt, M. D. Hoffman, and D. M. Blei, “A variational analysis of stochastic gradient algorithms,” in Proceedings of the 33rd International Conference on Machine Learning - Volume 48, ser. ICML’16. JMLR.org, 2016, pp. 354–363.

  62. [62]

    Three Factors Influencing Minima in SGD

    S. Jastrzebski, Z. Kenton, D. Arpit, N. Ballas, A. Fischer, Y. Bengio, and A. J. Storkey, “Three factors influencing minima in SGD,” ArXiv, vol. abs/1711.04623, 2017. [Online]. Available: https://api.semanticscholar.org/CorpusID:7311295

  63. [63]

    Optimization methods for large-scale machine learning,

    L. Bottou, F. E. Curtis, and J. Nocedal, “Optimization methods for large-scale machine learning,” SIAM Review, vol. 60, no. 2, pp. 223–311, 2018

  64. [64]

    G. A. Pavliotis, Stochastic Processes and Applications: Diffusion Processes, the Fokker-Planck and Langevin Equations, ser. Texts in Applied Mathematics, vol. 60. New York: Springer, 2014

  65. [65]

    Scaling Laws for Neural Language Models

    J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling laws for neural language models,” 2020. [Online]. Available: https://arxiv.org/abs/2001.08361

  66. [66]

    Training Compute-Optimal Large Language Models

    J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de Las Casas, L. A. Hendricks, J. Welbl, A. Clark, T. Hennigan, E. Noland, K. Millican, G. van den Driessche, B. Damoc, A. Guy, S. Osindero, K. Simonyan, E. Elsen, J. W. Rae, O. Vinyals, and L. Sifre, “Training compute-optimal large language models,” 2022. [Online]. Available: ...

  67. [67]

    Carbon Emissions and Large Neural Network Training

    D. Patterson, J. Gonzalez, Q. Le, C. Liang, L.-M. Munguia, D. Rothchild, D. So, M. Texier, and J. Dean, “Carbon emissions and large neural network training,” 2021. [Online]. Available: https://arxiv.org/abs/2104.10350

  68. [68]

    Are emergent abilities of large language models a mirage?

    R. Schaeffer, B. Miranda, and S. Koyejo, “Are emergent abilities of large language models a mirage?” Advances in Neural Information Processing Systems, vol. 36, 2023

  69. [69]

    Understanding Emergent Abilities of Language Models from the Loss Perspective,

    Z. Du, A. Zeng, Y. Dong, and J. Tang, “Understanding Emergent Abilities of Language Models from the Loss Perspective,” Jan. 2025, arXiv:2403.15796 [cs]. [Online]. Available: http://arxiv.org/abs/2403.15796

  70. [70]

    Optimal finite-time processes in stochastic thermodynamics,

    T. Schmiedl and U. Seifert, “Optimal finite-time processes in stochastic thermodynamics,” Phys. Rev. Lett., vol. 98, p. 108301, Mar. 2007. [Online]. Available: https://link.aps.org/doi/10.1103/PhysRevLett.98.108301

  71. [71]

    Thermodynamic metrics and optimal paths,

    D. A. Sivak and G. E. Crooks, “Thermodynamic metrics and optimal paths,” Phys. Rev. Lett., vol. 108, p. 190602, May 2012. [Online]. Available: https://link.aps.org/doi/10.1103/PhysRevLett.108.190602

  73. [73]

    Stochastic thermodynamics of nonlinear electronic circuits: A realistic framework for computing around kT,

    N. Freitas, J.-C. Delvenne, and M. Esposito, “Stochastic thermodynamics of nonlinear electronic circuits: A realistic framework for computing around kT,” Phys. Rev. X, vol. 11, p. 031064, Sep. 2021. [Online]. Available: https://link.aps.org/doi/10.1103/PhysRevX.11.031064

  74. [74]

    Dependence of dissipation on the initial distribution over states,

    A. Kolchinsky and D. H. Wolpert, “Dependence of dissipation on the initial distribution over states,” Journal of Statistical Mechanics: Theory and Experiment, vol. 2017, 2016. [Online]. Available: https://api.semanticscholar.org/CorpusID:17899737

  75. [75]

    Thermodynamics of computing with circuits,

    D. H. Wolpert and A. Kolchinsky, “Thermodynamics of computing with circuits,” New Journal of Physics, vol. 22, no. 6, p. 063047, Jun. 2020. [Online]. Available: https://doi.org/10.1088/1367-2630/ab82b8

  76. [76]

    BFloat16: The secret to high performance on Cloud TPUs,

    “BFloat16: The secret to high performance on Cloud TPUs,” Google Cloud Blog. [Online]. Available: https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus [Accessed 01-12-2025]

  77. [77]

    A survey of quantization methods for efficient neural network inference

    A. Gholami, S. Kim, Z. Dong, Z. Yao, M. W. Mahoney, and K. Keutzer, “A survey of quantization methods for efficient neural network inference,” 2021. [Online]. Available: https://arxiv.org/abs/2103.13630

  78. [78]

    Deep Learning with Limited Numerical Precision

    S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, “Deep learning with limited numerical precision,” 2015. [Online]. Available: https://arxiv.org/abs/1502.02551

  79. [79]

    Quantizing deep convolutional networks for efficient inference: A whitepaper

    R. Krishnamoorthi, “Quantizing deep convolutional networks for efficient inference: A whitepaper,” 2018. [Online]. Available: https://arxiv.org/abs/1806.08342

  80. [80]

    EfQAT: An efficient framework for quantization-aware training,

    S. Ashkboos, B. Verhoef, T. Hoefler, E. Eleftheriou, and M. Dazzi, “EfQAT: An efficient framework for quantization-aware training,” CoRR, vol. abs/2411.11038, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2411.11038

Showing first 80 references.