The Implicit Bias of Depth: From Neural Collapse to Softmax Codes
Pith reviewed 2026-05-25 05:24 UTC · model grok-4.3
The pith
Depth in neural networks induces an implicit low-rank bias that favors softmax codes over neural collapse.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the deep unconstrained feature model trained without regularization, depth induces an implicit low-rank bias: low-rank matrices propagate norm more efficiently through successive multiplications, promoting low-rank alternatives to neural collapse. These alternatives correspond to softmax codes. Under spectral initialization, early-time repulsion among singular values drives the low-rank emergence, and depth reduces the basin of attraction for full neural collapse. In randomly initialized networks, increasing width biases training toward higher-rank solutions.
What carries the argument
The deep unconstrained feature model (equivalent to a deep linear network with orthogonal inputs) together with the mechanism of norm propagation efficiency in low-rank matrices under successive multiplications.
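The norm-propagation mechanism can be illustrated with a toy calculation. This is a minimal sketch under assumptions that are not taken from the paper: every layer shares the same singular vectors, has `rank` equal singular values, and is scaled to unit Frobenius norm. Under those assumptions the product of `depth` layers has Frobenius norm rank^((1 - depth) / 2), so lower-rank layers lose less norm as depth grows. The helper name `product_norm` is hypothetical.

```python
import numpy as np

def product_norm(rank: int, depth: int, dim: int = 8) -> float:
    """Frobenius norm of a product of `depth` identical unit-norm layers.

    Toy assumption: each layer is diagonal with `rank` equal singular
    values, scaled so the layer itself has Frobenius norm 1.
    """
    s = np.zeros(dim)
    s[:rank] = 1.0 / np.sqrt(rank)          # unit Frobenius norm per layer
    layer = np.diag(s)
    product = np.linalg.matrix_power(layer, depth)
    return float(np.linalg.norm(product))   # Frobenius norm is the default

# Compare how much norm survives at each depth for ranks 1, 2 and 4.
for depth in (1, 2, 4, 8):
    norms = {r: round(product_norm(r, depth), 4) for r in (1, 2, 4)}
    print(f"depth={depth}: {norms}")
```

At a fixed per-layer norm budget, the rank-1 product keeps norm 1 at every depth, while higher-rank products decay geometrically in depth. Real training trajectories need not keep layers aligned or balanced, so this only shows the direction of the effect.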
If this is right
- Low-rank softmax codes emerge as stable solutions in sufficiently deep networks.
- Early repulsion among singular values accelerates the shift away from neural collapse.
- Deeper networks shrink the region of parameter space that converges to full neural collapse.
- Wider networks counteract the low-rank bias and favor higher-rank solutions.
- The bias arises solely from gradient descent and depth under unregularized cross-entropy.
Where Pith is reading between the lines
- Architectures that enforce higher effective width at each layer may be needed to recover neural collapse in very deep models.
- The same norm-propagation mechanism could be tested in nonlinear networks by measuring effective rank after training.
- This low-rank preference may explain why some deep classifiers generalize differently from their shallow counterparts even at the same training accuracy.
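Measuring effective rank after training, as suggested above, requires choosing a rank proxy. The source does not say which one to use, so the sketch below shows two common options as assumptions: a thresholded numerical rank and an entropy-based effective rank computed from normalized singular values. The function names and the default tolerance are hypothetical.

```python
import numpy as np

def numerical_rank(M: np.ndarray, rel_tol: float = 1e-3) -> int:
    """Count singular values above rel_tol times the largest one."""
    s = np.linalg.svd(M, compute_uv=False)
    if s[0] == 0:
        return 0
    return int(np.sum(s > rel_tol * s[0]))

def entropy_rank(M: np.ndarray) -> float:
    """exp(Shannon entropy) of the normalized singular-value distribution."""
    s = np.linalg.svd(M, compute_uv=False)
    total = s.sum()
    if total == 0:
        return 0.0
    p = s / total
    p = p[p > 0]                      # skip zeros so log is defined
    return float(np.exp(-np.sum(p * np.log(p))))

# A matrix with two equal nonzero singular values has rank 2 under both proxies.
example = np.diag([3.0, 3.0, 0.0, 0.0])
print(numerical_rank(example), entropy_rank(example))
```

The thresholded count is sensitive to `rel_tol`, while the entropy-based value varies smoothly with the spectrum. Reporting both makes a low-rank claim easier to check.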
Load-bearing premise
The deep unconstrained feature model with orthogonal inputs and training without regularization isolates the pure effects of gradient descent and depth on the geometry.
What would settle it
Train a deep unconstrained feature model with spectral initialization and check whether the learned classifier weights have rank strictly below K-1 for K classes (K-1 being the rank of the simplex ETF that full neural collapse produces) while still achieving low loss.
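A rough version of this experiment can be sketched in NumPy. The code below is an illustration under stated assumptions, not the paper's setup: it trains a deep linear network on one-hot inputs (one sample per class) with plain gradient descent on unregularized cross-entropy, and it uses a small Gaussian initialization in place of spectral initialization, whose exact form is not given here. All names and hyperparameters are hypothetical.

```python
import numpy as np

def train_deep_linear(K=4, width=8, depth=4, steps=3000, lr=0.05,
                      init_scale=0.5, seed=0):
    """Gradient descent on unregularized cross-entropy for a deep linear net.

    Inputs are the K one-hot vectors (orthogonal inputs, one per class).
    ASSUMPTION: small Gaussian init, not the paper's spectral initialization.
    """
    rng = np.random.default_rng(seed)
    dims = [K] + [width] * (depth - 1) + [K]
    Ws = [init_scale * rng.standard_normal((dims[i + 1], dims[i]))
          / np.sqrt(dims[i]) for i in range(depth)]
    X = np.eye(K)   # orthogonal inputs
    Y = np.eye(K)   # one-hot labels, column i belongs to class i
    losses = []
    for _ in range(steps):
        # Forward pass, keeping every intermediate activation.
        acts = [X]
        for W in Ws:
            acts.append(W @ acts[-1])
        Z = acts[-1] - acts[-1].max(axis=0, keepdims=True)  # stable softmax
        log_p = Z - np.log(np.exp(Z).sum(axis=0, keepdims=True))
        losses.append(float(-(Y * log_p).sum() / K))
        # Backward pass: D is the gradient w.r.t. the current layer's output.
        D = (np.exp(log_p) - Y) / K
        grads = [None] * depth
        for l in reversed(range(depth)):
            grads[l] = D @ acts[l].T
            D = Ws[l].T @ D
        for W, g in zip(Ws, grads):
            W -= lr * g   # in-place update of each layer
    return Ws, losses

def end_to_end_singular_values(Ws):
    """Singular values of the product W_L ... W_1."""
    M = Ws[0]
    for W in Ws[1:]:
        M = W @ M
    return np.linalg.svd(M, compute_uv=False)

Ws, losses = train_deep_linear()
s = end_to_end_singular_values(Ws)
print("initial / final loss:", losses[0], losses[-1])
print("normalized singular values:", s / s[0])
```

Whether the normalized spectrum shows fewer than K-1 dominant singular values depends on depth, width, initialization scale and training time. The sketch only provides the measurement and does not assert an outcome.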
Original abstract
Neural collapse (NC) describes the structured geometry that emerges in the features and weights of trained classifiers. Recent theory suggests NC can be suboptimal in deep architectures, attributing this to an explicit low-rank bias from L2 regularization. We study the deep unconstrained feature model (UFM), equivalent to a deep linear network with orthogonal inputs, trained without regularization, to isolate how gradient descent and depth alone shape NC. We show that depth induces an implicit low-rank bias: low-rank matrices propagate norm more efficiently through successive multiplications, promoting low-rank alternatives to NC. These alternatives, we argue, correspond to softmax codes: max-margin solutions previously found in width-bottlenecked networks. Analyzing training dynamics under spectral initialization, we identify an early-time repulsion among singular values that drives low-rank emergence, and characterize how depth shrinks NC's basin of attraction. Finally, we show that some effects act in the opposite direction: for randomly initialized networks, increasing width biases training toward higher-rank solutions. Our results provide the first asymptotic and dynamic characterization of implicit bias in deep UFMs trained with unregularized multiclass cross-entropy.
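For reference, the structured geometry usually associated with NC is the simplex equiangular tight frame (ETF). The sketch below builds the standard K-class simplex ETF and checks its basic properties: unit-norm columns, pairwise cosine -1/(K-1), and rank K-1. It is included as background for the rank comparisons above and is not drawn from the paper. The function name is hypothetical.

```python
import numpy as np

def simplex_etf(K: int) -> np.ndarray:
    """K x K matrix whose columns form a simplex ETF (centered, unit norm)."""
    return np.sqrt(K / (K - 1)) * (np.eye(K) - np.ones((K, K)) / K)

K = 4
M = simplex_etf(K)
gram = M.T @ M                                  # pairwise inner products
singular_values = np.linalg.svd(M, compute_uv=False)
rank = int(np.sum(singular_values > 1e-10))     # thresholded rank

print("diagonal of Gram matrix:", np.diag(gram))   # column norms squared
print("off-diagonal cosine:", gram[0, 1])          # equals -1 / (K - 1)
print("rank:", rank)                               # equals K - 1
```

A solution is lower-rank than NC in this sense when its spectrum has fewer than K-1 non-negligible singular values.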
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper studies the deep unconstrained feature model (UFM), equivalent to a deep linear network with orthogonal inputs, trained without regularization on multiclass cross-entropy. It claims that depth alone induces an implicit low-rank bias because low-rank matrices propagate norm more efficiently under successive multiplications, yielding low-rank alternatives to neural collapse (NC) that correspond to softmax codes. The work analyzes training dynamics under spectral initialization to identify early-time singular-value repulsion, shows that depth shrinks NC's basin of attraction, and demonstrates that increasing width can bias toward higher-rank solutions. It positions these results as the first asymptotic and dynamic characterization of implicit bias in deep UFMs.
Significance. If the derivations and dynamics hold, the results provide a principled explanation for why NC can be suboptimal in deep architectures without invoking explicit L2 regularization, linking depth-induced low-rank bias to previously observed max-margin softmax codes. The dynamic analysis of singular-value repulsion and basin shrinkage constitutes a concrete advance over static characterizations of NC. The width-depth contrast is a useful counterpoint that could inform architecture design.
Minor comments (3)
- The abstract states that the deep UFM is 'equivalent to a deep linear network with orthogonal inputs'; the introduction or model section should include an explicit derivation or citation establishing the precise conditions under which this equivalence holds for the unregularized cross-entropy objective.
- The phrase 'softmax codes' is introduced without a self-contained definition or pointer to the prior width-bottleneck literature; a brief recap with the relevant citation should appear at first use.
- The claim of providing the 'first asymptotic and dynamic characterization' would be strengthened by a short table or paragraph in the related-work section contrasting the present dynamic results against the static NC analyses in the cited references.
Simulated Author's Rebuttal
We thank the referee for their positive assessment of the manuscript and the recommendation of minor revision. The provided summary accurately reflects the paper's focus on depth-induced implicit low-rank bias in deep UFMs under unregularized cross-entropy.
Circularity Check
No significant circularity detected
The paper analyzes the deep UFM (equivalent to a deep linear network with orthogonal inputs) under unregularized gradient descent, deriving an implicit low-rank bias from the efficiency of norm propagation under matrix multiplications and characterizing dynamics via spectral initialization. These steps are presented as direct consequences of the model equations and training process rather than reductions to fitted parameters, self-definitions, or load-bearing self-citations. The identification of softmax codes and basin-of-attraction effects follows from the stated assumptions without circular renaming or imported uniqueness theorems. The derivation chain remains self-contained against external benchmarks.