An entropy formula for the Deep Linear Network
Pith reviewed 2026-05-22 12:50 UTC · model grok-4.3
The pith
A Boltzmann entropy for deep linear networks is defined from group orbits on the balanced manifold.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Riemannian geometry on the space of observables is obtained by Riemannian submersion of the balanced manifold, and a Boltzmann entropy is defined and computed using the foliation of the balanced manifold by group orbits. The explicit construction of an orthonormal basis for the tangent space of the balanced manifold using Jacobi matrices enables the entropy calculation.
What carries the argument
The balanced manifold in parameter space, foliated by group orbits, together with a Riemannian submersion onto the space of observables and an orthonormal tangent-space basis built from Jacobi matrices.
If this is right
- The entropy admits an explicit formula for any depth and width of the deep linear network.
- Overparametrization is characterized by the dimension of the group orbits.
- The geometry of observables matches the earlier definition obtained by direct construction.
- A thermodynamic account of the learning dynamics becomes available through this entropy.
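The role of group orbits in overparametrization can be illustrated numerically. The sketch below is a minimal illustration, not the paper's construction: it assumes equal layer widths and that the group acts by orthogonal matrices on the hidden layers, W_p -> Q_p W_p Q_{p-1}^T with Q_0 = Q_N = I. The helper names (`end_to_end`, `act`, `balance_defect`, `random_orthogonal`) are hypothetical. Under these assumptions the action leaves the end-to-end product unchanged and conjugates the balancedness defect W_{p+1}^T W_{p+1} - W_p W_p^T, so its Frobenius norm is preserved.

```python
import numpy as np

def end_to_end(Ws):
    """Product W_N ... W_1 for Ws = [W_1, ..., W_N]."""
    X = Ws[0]
    for W in Ws[1:]:
        X = W @ X
    return X

def act(Ws, Qs):
    """Apply W_p -> Q_p W_p Q_{p-1}^T with Q_0 = Q_N = I (assumed group action).

    Qs = [Q_1, ..., Q_{N-1}] are orthogonal matrices, one per hidden layer.
    """
    d_in, d_out = Ws[0].shape[1], Ws[-1].shape[0]
    full = [np.eye(d_in)] + list(Qs) + [np.eye(d_out)]
    return [full[p + 1] @ W @ full[p].T for p, W in enumerate(Ws)]

def balance_defect(Ws):
    """Largest Frobenius norm of W_{p+1}^T W_{p+1} - W_p W_p^T over adjacent layers."""
    return max(
        np.linalg.norm(Ws[p + 1].T @ Ws[p + 1] - Ws[p] @ Ws[p].T)
        for p in range(len(Ws) - 1)
    )

def random_orthogonal(d, rng):
    """Orthogonal factor from the QR decomposition of a Gaussian matrix."""
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return Q

rng = np.random.default_rng(0)
d, N = 3, 4  # illustrative width and depth
Ws = [rng.standard_normal((d, d)) for _ in range(N)]
Qs = [random_orthogonal(d, rng) for _ in range(N - 1)]
Ws_moved = act(Ws, Qs)

# The observable (end-to-end matrix) is the same at both points of the orbit.
product_gap = np.abs(end_to_end(Ws) - end_to_end(Ws_moved)).max()
print("change in end-to-end matrix:", product_gap)
print("defect before / after:", balance_defect(Ws), balance_defect(Ws_moved))
```

Because the defect norm is preserved, a balanced point (defect zero) is carried to another balanced point with the same observable, which is the sense in which the orbit measures redundancy in this sketch.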
Where Pith is reading between the lines
- The same orbit-foliation construction may supply entropy measures for other overparametrized models whose parameter space admits a balanced manifold.
- Entropy gradients along this foliation could be compared with the loss landscape to test whether entropy increase tracks generalization.
- Numerical evaluation of the Jacobi-matrix basis on small networks would give concrete entropy values that can be tracked during gradient descent.
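A starting point for such numerical experiments is a balanced parameter point with a prescribed end-to-end matrix. The sketch below is an assumption-laden illustration rather than the paper's method: it takes a square matrix X, uses its singular value decomposition X = U Σ V^T, and sets W_1 = Σ^{1/N} V^T, W_p = Σ^{1/N} for the middle layers, and W_N = U Σ^{1/N}. The function names are hypothetical, and the sketch does not compute the entropy itself; it only checks the product and the balancedness condition W_{p+1}^T W_{p+1} = W_p W_p^T.

```python
import numpy as np

def balanced_factors(X, N):
    """Return [W_1, ..., W_N] (N >= 2) with W_N ... W_1 = X and balanced adjacent layers.

    Assumes X is square; the factors are built from the SVD X = U diag(s) V^T.
    """
    if N < 2:
        raise ValueError("need at least two layers")
    U, s, Vt = np.linalg.svd(X)
    root = np.diag(s ** (1.0 / N))  # Sigma^{1/N}
    return [root @ Vt] + [root] * (N - 2) + [U @ root]

def end_to_end(Ws):
    """Product W_N ... W_1 for Ws = [W_1, ..., W_N]."""
    X = Ws[0]
    for W in Ws[1:]:
        X = W @ X
    return X

def balance_defect(Ws):
    """Largest Frobenius norm of W_{p+1}^T W_{p+1} - W_p W_p^T over adjacent layers."""
    return max(
        np.linalg.norm(Ws[p + 1].T @ Ws[p + 1] - Ws[p] @ Ws[p].T)
        for p in range(len(Ws) - 1)
    )

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 3))  # illustrative target observable
Ws = balanced_factors(X, N=3)

print("product matches X:", np.allclose(end_to_end(Ws), X))
print("balancedness defect:", balance_defect(Ws))
```

The same `balance_defect` helper could be evaluated along a gradient descent run to monitor how far the iterates drift from the balanced manifold.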
Load-bearing premise
The foliation of the balanced manifold by group orbits supplies a natural and physically meaningful definition of Boltzmann entropy for the learning process.
What would settle it
A direct computation of the entropy formula on a trained deep linear network showing no consistent relation to the degree of overparametrization or to changes in the learning trajectory would falsify the definition.
Original abstract
We study the Riemannian geometry of the Deep Linear Network (DLN) as a foundation for a thermodynamic description of the learning process. The main tools are the use of group actions to analyze overparametrization and the use of Riemannian submersion from the space of parameters to the space of observables. The foliation of the balanced manifold in the parameter space by group orbits is used to define and compute a Boltzmann entropy. We also show that the Riemannian geometry on the space of observables defined in [2] is obtained by Riemannian submersion of the balanced manifold. The main technical step is an explicit construction of an orthonormal basis for the tangent space of the balanced manifold using the theory of Jacobi matrices.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript develops a Riemannian geometric framework for deep linear networks (DLNs) as a foundation for a thermodynamic description of learning. It analyzes overparametrization via group actions on the parameter space, defines a Boltzmann entropy from the foliation of the balanced manifold by group orbits, and shows that the Riemannian geometry on the space of observables arises via Riemannian submersion from the balanced manifold. The central technical contribution is an explicit construction of an orthonormal basis for the tangent space to the balanced manifold using the theory of Jacobi matrices.
Significance. If the technical claims are established, the work supplies a concrete geometric and entropic toolset for DLNs that could connect optimization dynamics to thermodynamic principles. The explicit orthonormal basis and submersion result are strengths that enable parameter-free derivations of entropy and geometry; these could support falsifiable predictions about learning trajectories once the construction is verified for general architectures.
major comments (2)
- [Main technical section on tangent-space basis construction] The central claim that an explicit orthonormal basis for T_p M_bal is obtained via Jacobi matrices (main technical step) must be shown to hold independently of depth L and width. The recurrence relations for tridiagonal Jacobi operators do not automatically guarantee orthonormality once the tangent-space identification involves products of weight matrices whose singular values are constrained only by the balanced condition; this must be verified explicitly for L>2 and unequal widths, as it is load-bearing for both the Riemannian submersion and the well-definedness of the entropy as log-volume of orbits.
- [Section defining and computing the Boltzmann entropy] The definition of Boltzmann entropy via the foliation by group orbits assumes this geometric volume supplies a physically meaningful entropy for the learning process. The manuscript should supply a concrete check (e.g., reduction to a known case for L=2 or comparison with an alternative entropy) showing that the resulting formula is not an artifact of the chosen foliation.
minor comments (2)
- [Introduction / Preliminaries] Notation for the balanced manifold and the group action should be introduced with a short diagram or explicit coordinate chart in the first technical section to aid readability.
- [Abstract] The abstract states that the geometry on observables is recovered by submersion but does not preview the explicit entropy formula; adding one sentence would orient readers.
Simulated Author's Rebuttal
We thank the referee for the careful reading of our manuscript and the constructive comments. We address each major comment below and will revise the manuscript accordingly to strengthen the technical claims and interpretations.
- Referee: [Main technical section on tangent-space basis construction] The central claim that an explicit orthonormal basis for T_p M_bal is obtained via Jacobi matrices (main technical step) must be shown to hold independently of depth L and width. The recurrence relations for tridiagonal Jacobi operators do not automatically guarantee orthonormality once the tangent-space identification involves products of weight matrices whose singular values are constrained only by the balanced condition; this must be verified explicitly for L>2 and unequal widths, as it is load-bearing for both the Riemannian submersion and the well-definedness of the entropy as log-volume of orbits.
Authors: We thank the referee for this observation. The construction in the manuscript uses the theory of Jacobi matrices to produce the orthonormal basis for the tangent space to the balanced manifold, with the balanced condition ensuring the singular values satisfy the recurrence relations that preserve orthonormality for arbitrary depth and widths. However, to address the request for explicit verification, the revised manuscript will include an appendix with direct computations for L=3 and unequal widths (e.g., 4-3-5), confirming that the inner products remain zero off-diagonal and unity on-diagonal under the balanced constraint. This will make the independence from specific L and widths fully explicit.
Revision: yes
- Referee: [Section defining and computing the Boltzmann entropy] The definition of Boltzmann entropy via the foliation by group orbits assumes this geometric volume supplies a physically meaningful entropy for the learning process. The manuscript should supply a concrete check (e.g., reduction to a known case for L=2 or comparison with an alternative entropy) showing that the resulting formula is not an artifact of the chosen foliation.
Authors: We agree that an explicit consistency check strengthens the interpretation. The revised manuscript will add a subsection reducing the general entropy formula to the L=2 case. In this reduction the balanced manifold and group orbits recover the standard matrix factorization geometry, and the entropy matches the known expression for the volume of orbits in the two-layer setting. This demonstrates that the formula is consistent with prior results rather than an artifact of the foliation. A brief comparison with the entropy induced by the loss level sets will also be included to further support its relevance.
Revision: yes
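The orthonormality check described in the first response (inner products zero off-diagonal and unity on-diagonal) amounts to comparing a Gram matrix with the identity. The sketch below shows only the shape of such a check and makes several assumptions: tangent vectors are stored as one matrix per layer, the inner product is the Frobenius inner product summed over layers, and the widths 4, 3, 5 from the response are read as input, hidden, and output widths. The basis here is a generic orthonormal set obtained by QR, standing in for the Jacobi-matrix basis, which is not reproduced; all names are hypothetical.

```python
import numpy as np

def inner(A, B):
    """Frobenius inner product summed over layers (assumed metric)."""
    return sum(np.sum(a * b) for a, b in zip(A, B))

def gram(basis):
    """Matrix of pairwise inner products of the candidate basis vectors."""
    return np.array([[inner(A, B) for B in basis] for A in basis])

shapes = [(3, 4), (5, 3)]           # W_1 is 3x4, W_2 is 5x3 for widths 4, 3, 5
sizes = [m * n for m, n in shapes]  # entries per layer

def unflatten(v):
    """Split a flat vector into one matrix per layer."""
    out, start = [], 0
    for (m, n), size in zip(shapes, sizes):
        out.append(v[start:start + size].reshape(m, n))
        start += size
    return out

rng = np.random.default_rng(1)
k = 6  # number of stand-in basis vectors
# Reduced QR gives k orthonormal columns in the flattened parameter space.
Q, _ = np.linalg.qr(rng.standard_normal((sum(sizes), k)))
basis = [unflatten(Q[:, j]) for j in range(k)]

G = gram(basis)
max_dev = np.abs(G - np.eye(k)).max()
print("largest deviation of the Gram matrix from the identity:", max_dev)
```

Replacing `basis` with tangent vectors produced by an actual construction would turn this into a direct test of the claim for a given depth and set of widths.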
Circularity Check
No significant circularity; derivation self-contained via explicit construction
The paper constructs an explicit orthonormal basis for the tangent space to the balanced manifold using Jacobi matrix theory and defines Boltzmann entropy directly from the volume of group orbits in the foliation. It then derives the observable-space geometry as a Riemannian submersion of this manifold, referencing [2] only for comparison rather than as a load-bearing premise. No step reduces a claimed prediction or result to a fitted parameter, self-definition, or unverified self-citation chain; the central claims rest on the provided geometric constructions and are independent of the target entropy formula.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math: Standard axioms of Riemannian manifolds and Lie group actions on parameter space.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
The main technical step is an explicit construction of an orthonormal basis for the tangent space of the balanced manifold using the theory of Jacobi matrices.
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_injective · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
S(X) = (N−1) log c_d + ½ log van(Σ²)/van(Σ^{2N})
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Geometric and Spectral Alignment for Deep Neural Network I
Residual network Jacobians under Frobenius normalization have singular spectra that form trace-normalized Cartan orbits satisfying slack-aware margin inequalities bounding exponent drift to order (log M)/L in zero-sla...
Reference graph
Works this paper leans on
- [1]
- [2] B. Bah, H. Rauhut, U. Terstiege, and M. Westdickenberg, Learning deep linear neural networks: Riemannian gradient flows and convergence to global minimizers, Information and Inference: A Journal of the IMA, 11 (2022), pp. 307–353
- [3] P. Baldi and K. Hornik, Learning in linear neural networks: a survey, IEEE Transactions on Neural Networks, 6 (1995), pp. 837–858
- [4]
- [5]
- [6] P. Bréchet, K. Papagiannouli, J. An, and G. Montúfar, Critical points and convergence analysis of generative deep linear networks trained with Bures-Wasserstein loss, in International Conference on Machine Learning, PMLR, 2023, pp. 3106–3147
- [7] R. Brockett, Modeling the transient behavior of stochastic gradient algorithms, in 2011 50th IEEE Conference on Decision and Control and European Control Conference, IEEE, 2011, pp. 4461–4466
- [8] R. W. Brockett, Dynamical systems that sort lists, diagonalize matrices, and solve linear programming problems, Linear Algebra Appl., 146 (1991), pp. 79–91
- [9] R. W. Brockett, Dynamical systems and their associated automata, in Systems and networks: mathematical theory and applications, Vol. I (Regensburg, 1993), vol. 77 of Math. Res., Akademie-Verlag, Berlin, 1994, pp. 49–69
- [10] A. Chen, Geodesics in the deep linear network, Preprint, (2025)
- [11] A. Chen, T. S. Kotwal, and G. Menon, Equilibrium measures in the deep linear network, Preprint, (2025)
- [12] T. Chen and P. M. Ewald, Geometric structure of Deep Learning networks and construction of global L^2 minimizers, arXiv:2309.10639, (2024)
- [13]
- [14]
- [15]
- [16] R. Ge, C. Jin, and Y. Zheng, No spurious local minima in nonconvex low rank problems: A unified geometric analysis, in International Conference on Machine Learning, PMLR, 2017, pp. 1233–1242
- [17] G. H. Golub and C. F. Van Loan, Matrix computations, Johns Hopkins Studies in the Mathematical Sciences, Johns Hopkins University Press, Baltimore, MD, fourth ed., 2013
- [18] S. Gunasekar, J. D. Lee, D. Soudry, and N. Srebro, Implicit bias of gradient descent on linear convolutional networks, Advances in Neural Information Processing Systems, 31 (2018)
- [19] S. Gunasekar, B. E. Woodworth, S. Bhojanapalli, B. Neyshabur, and N. Srebro, Implicit regularization in matrix factorization, in Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, 2017, pp. 6151–6159
- [20]
- [21] T. Kato, Perturbation theory for linear operators, vol. Band 132 of Die Grundlehren der mathematischen Wissenschaften, Springer-Verlag New York, Inc., New York, 1966
- [22] K. Kohn, T. Merkh, G. Montúfar, and M. Trager, Geometry of linear convolutional networks, SIAM J. Appl. Algebra Geom., 6 (2022), pp. 368–406
- [23] K. Kohn, G. Montúfar, V. Shahverdi, and M. Trager, Function space and critical points of linear convolutional networks, SIAM J. Appl. Algebra Geom., 8 (2024), pp. 333–362
- [24] K. Lindsey and G. Menon, Regularization implies balancedness in the deep linear network, Preprint, (2025)
- [25] G. Menon, The geometry of the deep linear network, in XIV Symposium on Probability and Stochastic Processes, C. G. H. Chan, J. A. L. Mimbela, and C. G. P. Sergio I. López, eds., Progress in Probability, Birkhäuser Cham, 2025
- [26] G. Menon and T. Yu, The Riemannian Langevin equation and conic programs, Bulletin of the Institute of Mathematics Academia Sinica (New Series), 20 (2025), pp. 197–213
- [27] G. Menon and T. Yu, A Riemannian Langevin equation for the deep linear network, Preprint, (2025)
- [28] G. M. Nguegnang, H. Rauhut, and U. Terstiege, Convergence of gradient descent for learning linear neural networks, Adv. Contin. Discrete Models, (2024), pp. Paper No. 23, 28
- [29] F. W. Ponting and H. S. A. Potter, The volume of orthogonal and unitary space, The Quarterly Journal of Mathematics, os-20 (1949), pp. 146–154
- [30] G. Vardi, On the implicit bias in deep-learning algorithms, Communications of the ACM, 66 (2023), pp. 86–93