arXiv: 2509.09088 · v3 · pith:K4KXYQM3new · submitted 2025-09-11 · 💻 cs.LG · math.DG · math.DS

An entropy formula for the Deep Linear Network

Pith reviewed 2026-05-22 12:50 UTC · model grok-4.3

classification 💻 cs.LG · math.DG · math.DS
keywords deep linear networks · Riemannian geometry · Boltzmann entropy · group orbits · balanced manifold · Riemannian submersion · Jacobi matrices

The pith

A Boltzmann entropy for deep linear networks is defined from group orbits on the balanced manifold.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a thermodynamic description of learning in deep linear networks by studying their Riemannian geometry. Group actions are used to analyze overparametrization, and a Boltzmann entropy is defined and computed from the foliation of the balanced manifold by group orbits. Riemannian submersion from the balanced manifold produces the geometry on the space of observables. The central technical step is an explicit orthonormal basis for the tangent space of the balanced manifold constructed via the theory of Jacobi matrices.
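For readers unfamiliar with the term, a Jacobi matrix is a real symmetric tridiagonal matrix with strictly positive off-diagonal entries. The sketch below does not reproduce the paper's tangent-space construction. It only illustrates, under that standard definition, the two properties that make such matrices convenient for building orthonormal bases: their eigenvalues are simple and their eigenvectors are orthonormal. The function name `jacobi_matrix` and the example entries are illustrative assumptions.

```python
import numpy as np

def jacobi_matrix(diagonal, off_diagonal):
    """Build a Jacobi matrix: symmetric tridiagonal with positive off-diagonal.

    Hypothetical helper for illustration only, not taken from the paper.
    """
    a = np.asarray(diagonal, dtype=float)
    b = np.asarray(off_diagonal, dtype=float)
    if a.size != b.size + 1:
        raise ValueError("off_diagonal must have one entry fewer than diagonal")
    if np.any(b <= 0):
        raise ValueError("off-diagonal entries of a Jacobi matrix must be positive")
    return np.diag(a) + np.diag(b, 1) + np.diag(b, -1)

# Example entries chosen arbitrarily for the illustration.
J = jacobi_matrix([2.0, 2.0, 2.0, 2.0], [1.0, 1.0, 1.0])

# eigh returns eigenvalues in ascending order and orthonormal eigenvectors
# as the columns of the second output.
eigenvalues, eigenvectors = np.linalg.eigh(J)

# Gram matrix of the eigenvectors; it should be the identity up to rounding.
gram = eigenvectors.T @ eigenvectors
```

Inspecting `np.diff(eigenvalues)` shows the gaps between consecutive eigenvalues, which are strictly positive for a Jacobi matrix.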

Core claim

The Riemannian geometry on the space of observables is obtained by Riemannian submersion of the balanced manifold, and a Boltzmann entropy is defined and computed using the foliation of the balanced manifold by group orbits. The explicit construction of an orthonormal basis for the tangent space of the balanced manifold using Jacobi matrices enables the entropy calculation.
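To fix ideas, the objects named in the claim can be written out in the standard notation for deep linear networks. The symbols below are assumptions made for this illustration (square layers of width $d$, depth $L$ as in the referee report, Boltzmann constant set to 1) and may differ from the paper's own notation. The last line restates the page's description of the entropy as the log-volume of a group orbit. It is not the paper's explicit formula.

```latex
% Parameters and observable (assumed notation)
\mathbf{W} = (W_L, \dots, W_1), \qquad W_i \in \mathbb{R}^{d \times d},
\qquad W = W_L W_{L-1} \cdots W_1 .

% Balanced manifold: the standard balancedness conditions
W_{i+1}^{\top} W_{i+1} = W_i W_i^{\top}, \qquad i = 1, \dots, L-1 .

% Orthogonal group action that fixes the observable W and preserves balancedness
(W_L, W_{L-1}, \dots, W_1) \;\longmapsto\;
(W_L Q_L^{\top},\; Q_L W_{L-1} Q_{L-1}^{\top},\; \dots,\; Q_2 W_1),
\qquad Q_i \in O(d) .

% Entropy as log-volume of the orbit through a balanced point over W
S(W) = \log \operatorname{vol}(\mathcal{O}_W) .
```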

What carries the argument

Balanced manifold in parameter space, foliated by group orbits, with Riemannian submersion to the space of observables and tangent-space orthonormal basis from Jacobi matrices.
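A small numerical sketch can make the orbit picture concrete. It assumes square layers of equal width, builds one balanced factorization of a matrix `W` from its singular value decomposition, and then moves along the group orbit with random orthogonal matrices. All function names are hypothetical, and the construction is the textbook one rather than anything taken from the paper.

```python
import numpy as np

def random_orthogonal(d, rng):
    """Random orthogonal matrix from the QR factorization of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))  # flip column signs; q stays orthogonal

def balanced_factors(W, depth, rng):
    """Return [W_1, ..., W_depth], balanced, with W_depth ... W_1 = W."""
    U, s, Vt = np.linalg.svd(W)
    root = np.diag(s ** (1.0 / depth))  # depth-th root of the singular values
    d = W.shape[0]
    Q = [Vt.T] + [random_orthogonal(d, rng) for _ in range(depth - 1)] + [U]
    return [Q[i + 1] @ root @ Q[i].T for i in range(depth)]

def product(Ws):
    """Observable W_depth ... W_1 for Ws = [W_1, ..., W_depth]."""
    P = np.eye(Ws[0].shape[1])
    for Wi in Ws:
        P = Wi @ P
    return P

def balance_defect(Ws):
    """Largest Frobenius norm of W_{i+1}^T W_{i+1} - W_i W_i^T over the layers."""
    return max(np.linalg.norm(Ws[i + 1].T @ Ws[i + 1] - Ws[i] @ Ws[i].T)
               for i in range(len(Ws) - 1))

def act(Ws, Rs):
    """Orbit action W_i -> R_{i+1} W_i R_i^T with R_1 = R_{depth+1} = I."""
    d = Ws[0].shape[0]
    R = [np.eye(d)] + list(Rs) + [np.eye(d)]
    return [R[i + 1] @ Ws[i] @ R[i].T for i in range(len(Ws))]

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 3))
Ws = balanced_factors(W, depth=4, rng=rng)
moved = act(Ws, [random_orthogonal(3, rng) for _ in range(3)])
```

Both `Ws` and `moved` multiply out to the same `W` and have a balance defect at rounding level, while the individual layers differ. That is the sense in which the orbit collects distinct parameters representing one observable.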

If this is right

  • The entropy admits an explicit formula for any depth and width of the deep linear network.
  • Overparametrization is characterized by the dimension of the group orbits.
  • The geometry of observables matches the earlier definition obtained by direct construction.
  • A thermodynamic account of the learning dynamics becomes available through this entropy.
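The link between overparametrization and orbit dimension can be illustrated with a dimension count. The count below is a sketch under assumptions that are not stated on this page: all layers are square of width $d$, the depth is $L$, the observable $W$ is invertible, and the orbits come from the orthogonal action of $O(d)^{L-1}$, which is free when $W$ is invertible.

```latex
\dim(\text{parameter space}) = L d^{2}, \qquad
\dim(\text{observables}) = d^{2}, \qquad
\dim O(d) = \tfrac{1}{2} d(d-1) .

\dim(\text{orbit}) = (L-1)\,\tfrac{1}{2} d(d-1) .

\dim(\text{balanced manifold})
  = d^{2} + (L-1)\,\tfrac{1}{2} d(d-1)
  = L d^{2} - (L-1)\,\tfrac{1}{2} d(d+1) .

% Worked example: d = 2, L = 3
L d^{2} = 12, \qquad d^{2} = 4, \qquad \dim(\text{orbit}) = 2, \qquad
\dim(\text{balanced manifold}) = 4 + 2 = 12 - 6 = 6 .
```

The second expression for the balanced manifold subtracts the $L-1$ symmetric matrix constraints, each contributing $\tfrac{1}{2} d(d+1)$ equations, from the parameter count. Both expressions agree, which is a consistency check on the count rather than a result quoted from the paper.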

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same orbit-foliation construction may supply entropy measures for other overparametrized models whose parameter space admits a balanced manifold.
  • Entropy gradients along this foliation could be compared with the loss landscape to test whether entropy increase tracks generalization.
  • Numerical evaluation of the Jacobi-matrix basis on small networks would give concrete entropy values that can be tracked during gradient descent.
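As a first step toward the kind of numerical experiment suggested in the last bullet, the sketch below runs plain gradient descent on a small deep linear network and tracks how far the weights drift from the balanced manifold. It does not compute the entropy. It relies on the known fact that gradient flow on the squared loss conserves the differences $W_{i+1}^{\top} W_{i+1} - W_i W_i^{\top}$, so discrete gradient descent with a small step leaves them nearly unchanged. The loss, the scaled-identity initialization, the step size, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def chain(mats, d):
    """Ordered product M_k ... M_1 of [M_1, ..., M_k]; identity if the list is empty."""
    P = np.eye(d)
    for M in mats:
        P = M @ P
    return P

def loss_and_grads(Ws, target):
    """Loss 0.5 * ||W_L ... W_1 - target||_F^2 and its gradient in each layer."""
    d = target.shape[0]
    E = chain(Ws, d) - target
    grads = []
    for i in range(len(Ws)):
        later = chain(Ws[i + 1:], d)   # layers applied after layer i
        earlier = chain(Ws[:i], d)     # layers applied before layer i
        grads.append(later.T @ E @ earlier.T)
    return 0.5 * np.sum(E ** 2), grads

def balance_defect(Ws):
    """Largest Frobenius norm of W_{i+1}^T W_{i+1} - W_i W_i^T over the layers."""
    return max(np.linalg.norm(Ws[i + 1].T @ Ws[i + 1] - Ws[i] @ Ws[i].T)
               for i in range(len(Ws) - 1))

def run_gd(d=3, depth=3, lr=1e-4, steps=500, seed=0):
    """Gradient descent from a balanced start; returns (loss0, loss1, final defect)."""
    rng = np.random.default_rng(seed)
    target = rng.standard_normal((d, d))
    # Scaled identities are the simplest balanced initialization.
    Ws = [0.5 * np.eye(d) for _ in range(depth)]
    loss0, _ = loss_and_grads(Ws, target)
    for _ in range(steps):
        _, grads = loss_and_grads(Ws, target)
        Ws = [W - lr * G for W, G in zip(Ws, grads)]
    loss1, _ = loss_and_grads(Ws, target)
    return loss0, loss1, balance_defect(Ws)
```

Calling `run_gd()` returns the initial loss, the final loss, and the final balance defect. The defect grows only at second order in the step size, so an entropy defined on the balanced manifold could be evaluated along such a trajectory by projecting back or by shrinking `lr`.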

Load-bearing premise

The foliation of the balanced manifold by group orbits supplies a natural and physically meaningful definition of Boltzmann entropy for the learning process.

What would settle it

Direct computation of the entropy formula on a trained deep linear network that shows no consistent relation to overparametrization degree or to changes in the learning trajectory would falsify the definition.

Original abstract

We study the Riemannian geometry of the Deep Linear Network (DLN) as a foundation for a thermodynamic description of the learning process. The main tools are the use of group actions to analyze overparametrization and the use of Riemannian submersion from the space of parameters to the space of observables. The foliation of the balanced manifold in the parameter space by group orbits is used to define and compute a Boltzmann entropy. We also show that the Riemannian geometry on the space of observables defined in [2] is obtained by Riemannian submersion of the balanced manifold. The main technical step is an explicit construction of an orthonormal basis for the tangent space of the balanced manifold using the theory of Jacobi matrices.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript develops a Riemannian geometric framework for deep linear networks (DLNs) as a foundation for a thermodynamic description of learning. It analyzes overparametrization via group actions on the parameter space, defines a Boltzmann entropy from the foliation of the balanced manifold by group orbits, and shows that the Riemannian geometry on the space of observables arises via Riemannian submersion from the balanced manifold. The central technical contribution is an explicit construction of an orthonormal basis for the tangent space to the balanced manifold using the theory of Jacobi matrices.

Significance. If the technical claims are established, the work supplies a concrete geometric and entropic toolset for DLNs that could connect optimization dynamics to thermodynamic principles. The explicit orthonormal basis and submersion result are strengths that enable parameter-free derivations of entropy and geometry; these could support falsifiable predictions about learning trajectories once the construction is verified for general architectures.

major comments (2)
  1. [Main technical section on tangent-space basis construction] The central claim that an explicit orthonormal basis for T_p M_bal is obtained via Jacobi matrices (main technical step) must be shown to hold independently of depth L and width. The recurrence relations for tridiagonal Jacobi operators do not automatically guarantee orthonormality once the tangent-space identification involves products of weight matrices whose singular values are constrained only by the balanced condition; this must be verified explicitly for L>2 and unequal widths, as it is load-bearing for both the Riemannian submersion and the well-definedness of the entropy as log-volume of orbits.
  2. [Section defining and computing the Boltzmann entropy] The definition of Boltzmann entropy via the foliation by group orbits assumes this geometric volume supplies a physically meaningful entropy for the learning process. The manuscript should supply a concrete check (e.g., reduction to a known case for L=2 or comparison with an alternative entropy) showing that the resulting formula is not an artifact of the chosen foliation.
minor comments (2)
  1. [Introduction / Preliminaries] Notation for the balanced manifold and the group action should be introduced with a short diagram or explicit coordinate chart in the first technical section to aid readability.
  2. [Abstract] The abstract states that the geometry on observables is recovered by submersion but does not preview the explicit entropy formula; adding one sentence would orient readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading of our manuscript and the constructive comments. We address each major comment below and will revise the manuscript accordingly to strengthen the technical claims and interpretations.

Point-by-point responses
  1. Referee: [Main technical section on tangent-space basis construction] The central claim that an explicit orthonormal basis for T_p M_bal is obtained via Jacobi matrices (main technical step) must be shown to hold independently of depth L and width. The recurrence relations for tridiagonal Jacobi operators do not automatically guarantee orthonormality once the tangent-space identification involves products of weight matrices whose singular values are constrained only by the balanced condition; this must be verified explicitly for L>2 and unequal widths, as it is load-bearing for both the Riemannian submersion and the well-definedness of the entropy as log-volume of orbits.

    Authors: We thank the referee for this observation. The construction in the manuscript uses the theory of Jacobi matrices to produce the orthonormal basis for the tangent space to the balanced manifold, with the balanced condition ensuring the singular values satisfy the recurrence relations that preserve orthonormality for arbitrary depth and widths. However, to address the request for explicit verification, the revised manuscript will include an appendix with direct computations for L=3 and unequal widths (e.g., 4-3-5), confirming that the inner products remain zero off-diagonal and unity on-diagonal under the balanced constraint. This will make the independence from specific L and widths fully explicit.

    Revision: yes

  2. Referee: [Section defining and computing the Boltzmann entropy] The definition of Boltzmann entropy via the foliation by group orbits assumes this geometric volume supplies a physically meaningful entropy for the learning process. The manuscript should supply a concrete check (e.g., reduction to a known case for L=2 or comparison with an alternative entropy) showing that the resulting formula is not an artifact of the chosen foliation.

    Authors: We agree that an explicit consistency check strengthens the interpretation. The revised manuscript will add a subsection reducing the general entropy formula to the L=2 case. In this reduction the balanced manifold and group orbits recover the standard matrix factorization geometry, and the entropy matches the known expression for the volume of orbits in the two-layer setting. This demonstrates that the formula is consistent with prior results rather than an artifact of the foliation. A brief comparison with the entropy induced by the loss level sets will also be included to further support its relevance.

    Revision: yes
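For orientation, the two-layer geometry mentioned in the second response can be written down directly. This is a sketch under the assumption of square layers of width $d$ and an invertible observable. It describes the orbit only and does not state the entropy expression that the simulated rebuttal refers to.

```latex
% Two-layer network and its balancedness condition
W = W_2 W_1, \qquad W_2^{\top} W_2 = W_1 W_1^{\top} .

% A family of balanced factorizations built from the SVD  W = U \Sigma V^{\top}
W_1 = Q\, \Sigma^{1/2} V^{\top}, \qquad
W_2 = U\, \Sigma^{1/2} Q^{\top}, \qquad Q \in O(d) .

% Check: both sides of the balancedness condition equal  Q \Sigma Q^{\top}
W_2^{\top} W_2 = Q \Sigma Q^{\top} = W_1 W_1^{\top}, \qquad
W_2 W_1 = U \Sigma V^{\top} = W .

% Orbit through a balanced point under  (W_2, W_1) \mapsto (W_2 R^{\top}, R W_1)
\mathcal{O}_W = \{ (W_2 R^{\top},\, R W_1) : R \in O(d) \} \cong O(d)
\quad \text{when } W \text{ is invertible} .
```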

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained via explicit construction

Full rationale

The paper constructs an explicit orthonormal basis for the tangent space to the balanced manifold using Jacobi matrix theory and defines Boltzmann entropy directly from the volume of group orbits in the foliation. It then derives the observable-space geometry as a Riemannian submersion of this manifold, referencing [2] only for comparison rather than as a load-bearing premise. No step reduces a claimed prediction or result to a fitted parameter, self-definition, or unverified self-citation chain; the central claims rest on the provided geometric constructions and are independent of the target entropy formula.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on standard differential geometry and Lie group theory with the novel elements being the entropy definition and the explicit tangent-space basis.

axioms (1)
  • [standard math] Standard axioms of Riemannian manifolds and Lie group actions on parameter space.
    Invoked to analyze overparametrization and to define the balanced manifold and its foliation.

pith-pipeline@v0.9.0 · 5639 in / 1287 out tokens · 36042 ms · 2026-05-22T12:50:05.025972+00:00 · methodology



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Geometric and Spectral Alignment for Deep Neural Network I

    cs.LG · 2026-05 · unverdicted · novelty 6.0

    Residual network Jacobians under Frobenius normalization have singular spectra that form trace-normalized Cartan orbits satisfying slack-aware margin inequalities bounding exponent drift to order (log M)/L in zero-sla...

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · cited by 1 Pith paper

  1. [1]

    S. Arora, N. Cohen, and E. Hazan, On the optimization of deep networks: Implicit acceleration by overparameterization, in International Conference on Machine Learning, PMLR, 2018, pp. 244–253

  2. [2]

    B. Bah, H. Rauhut, U. Terstiege, and M. Westdickenberg, Learning deep linear neural networks: Riemannian gradient flows and convergence to global minimizers, Information and Inference: A Journal of the IMA, 11 (2022), pp. 307–353

  3. [3]

    P. Baldi and K. Hornik, Learning in linear neural networks: a survey, IEEE Transactions on Neural Networks, 6 (1995), pp. 837–858

  4. [4]

    M. Belkin, D. Hsu, S. Ma, and S. Mandal, Reconciling modern machine-learning practice and the classical bias–variance trade-off, Proceedings of the National Academy of Sciences, 116 (2019), pp. 15849–15854

  5. [5]

    M. Belkin, D. Hsu, and J. Xu, Two models of double descent for weak features, SIAM Journal on Mathematics of Data Science, 2 (2020), pp. 1167–1180

  6. [6]

    P. Bréchet, K. Papagiannouli, J. An, and G. Montúfar, Critical points and convergence analysis of generative deep linear networks trained with Bures-Wasserstein loss, in International Conference on Machine Learning, PMLR, 2023, pp. 3106–3147

  7. [7]

    R. Brockett, Modeling the transient behavior of stochastic gradient algorithms, in 2011 50th IEEE Conference on Decision and Control and European Control Conference, IEEE, 2011, pp. 4461–4466

  8. [8]

    R. W. Brockett, Dynamical systems that sort lists, diagonalize matrices, and solve linear programming problems, Linear Algebra Appl., 146 (1991), pp. 79–91

  9. [9]

    R. W. Brockett, Dynamical systems and their associated automata, in Systems and networks: mathematical theory and applications, Vol. I (Regensburg, 1993), vol. 77 of Math. Res., Akademie-Verlag, Berlin, 1994, pp. 49–69

  10. [10]

    A. Chen, Geodesics in the deep linear network, Preprint, (2025)

  11. [11]

    A. Chen, T. S. Kotwal, and G. Menon, Equilibrium measures in the deep linear network, Preprint, (2025)

  12. [12]

    T. Chen and P. M. Ewald, Geometric structure of Deep Learning networks and construction of global $L^2$ minimizers, arXiv:2309.10639, (2024)

  13. [13]

    L. Chizat, M. Colombo, X. Fernández-Real, and A. Figalli, Infinite-width limit of deep linear neural networks, Communications on Pure and Applied Mathematics, 77 (2024), pp. 3958–4007

  14. [14]

    H. T. M. Chu, S. Ghosh, C. T. Lam, and S. S. Mukherjee, Implicit regularization via spectral neural networks and non-linear matrix sensing, arXiv:2402.17595, (2024)

  15. [15]

    N. Cohen, G. Menon, and Z. Veraszto, Deep linear networks for matrix completion – an infinite depth limit, SIAM Journal on Applied Dynamical Systems, 22 (2023), pp. 3208–3232

  16. [16]

    R. Ge, C. Jin, and Y. Zheng, No spurious local minima in nonconvex low rank problems: A unified geometric analysis, in International Conference on Machine Learning, PMLR, 2017, pp. 1233–1242

  17. [17]

    G. H. Golub and C. F. Van Loan, Matrix computations, Johns Hopkins Studies in the Mathematical Sciences, Johns Hopkins University Press, Baltimore, MD, fourth ed., 2013

  18. [18]

    S. Gunasekar, J. D. Lee, D. Soudry, and N. Srebro, Implicit bias of gradient descent on linear convolutional networks, Advances in Neural Information Processing Systems, 31 (2018)

  19. [19]

    S. Gunasekar, B. E. Woodworth, S. Bhojanapalli, B. Neyshabur, and N. Srebro, Implicit regularization in matrix factorization, in Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, 2017, pp. 6151–6159

  20. [20]

    Z. Ji and M. Telgarsky, Directional convergence and alignment in deep learning, Advances in Neural Information Processing Systems, 33 (2020), pp. 17176–17186

  21. [21]

    T. Kato, Perturbation theory for linear operators, vol. 132 of Die Grundlehren der mathematischen Wissenschaften, Springer-Verlag New York, Inc., New York, 1966

  22. [22]

    K. Kohn, T. Merkh, G. Montúfar, and M. Trager, Geometry of linear convolutional networks, SIAM J. Appl. Algebra Geom., 6 (2022), pp. 368–406

  23. [23]

    K. Kohn, G. Montúfar, V. Shahverdi, and M. Trager, Function space and critical points of linear convolutional networks, SIAM J. Appl. Algebra Geom., 8 (2024), pp. 333–362

  24. [24]

    K. Lindsey and G. Menon, Regularization implies balancedness in the deep linear network, Preprint, (2025)

  25. [25]

    G. Menon, The geometry of the deep linear network, in XIV Symposium on Probability and Stochastic Processes, C. G. H. Chan, J. A. L. Mimbela, and C. G. P. Sergio I. López, eds., Progress in Probability, Birkhäuser Cham, 2025

  26. [26]

    G. Menon and T. Yu, The Riemannian Langevin equation and conic programs, Bulletin of the Institute of Mathematics Academia Sinica (New Series), 20 (2025), pp. 197–213

  27. [27]

    G. Menon and T. Yu, A Riemannian Langevin equation for the deep linear network, Preprint, (2025)

  28. [28]

    G. M. Nguegnang, H. Rauhut, and U. Terstiege, Convergence of gradient descent for learning linear neural networks, Adv. Contin. Discrete Models, (2024), Paper No. 23, 28 pp.

  29. [29]

    F. W. Ponting and H. S. A. Potter, The volume of orthogonal and unitary space, The Quarterly Journal of Mathematics, os-20 (1949), pp. 146–154

  30. [30]

    G. Vardi, On the implicit bias in deep-learning algorithms, Communications of the ACM, 66 (2023), pp. 86–93