The Implicit Bias of Depth: From Neural Collapse to Softmax Codes

Christos Thrampoulidis; Connall Garrod; Jonathan P. Keating

arXiv: 2605.23087 · v1 · pith:GN4EIQPDnew · submitted 2026-05-21 · cs.LG

Pith reviewed 2026-05-25 05:24 UTC · model grok-4.3

classification: cs.LG

keywords: neural collapse, implicit bias, depth, softmax codes, low-rank bias, unconstrained feature model, cross-entropy, multiclass classification

The pith

Depth in neural networks induces an implicit low-rank bias that favors softmax codes over neural collapse.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies the deep unconstrained feature model, which isolates gradient descent and depth effects in networks equivalent to deep linear models with orthogonal inputs. It establishes that depth alone creates an implicit bias toward low-rank weight matrices because such matrices transmit norm more efficiently across successive layer multiplications. This bias promotes low-rank alternatives to the standard neural collapse geometry, which the authors identify as softmax codes. The work also tracks how depth shrinks the basin of attraction for neural collapse and how width can push solutions toward higher rank. These results characterize the implicit bias of depth under unregularized multiclass cross-entropy training.

Core claim

In the deep unconstrained feature model trained without regularization, depth induces an implicit low-rank bias: low-rank matrices propagate norm more efficiently through successive multiplications, promoting low-rank alternatives to neural collapse. These alternatives correspond to softmax codes. Under spectral initialization, early-time repulsion among singular values drives the low-rank emergence, and depth reduces the basin of attraction for full neural collapse. In randomly initialized networks, increasing width biases training toward higher-rank solutions.

What carries the argument

The deep unconstrained feature model (equivalent to a deep linear network with orthogonal inputs) together with the mechanism of norm propagation efficiency in low-rank matrices under successive multiplications.
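The norm-propagation mechanism can be illustrated with a small numerical sketch. This is not code from the paper: it assumes a simplified setting in which every layer is the same diagonal matrix with `rank` equal nonzero singular values and unit Frobenius norm, and the name `product_norm` is hypothetical. Under that assumption the product of `depth` layers has Frobenius norm `rank ** ((1 - depth) / 2)`, so a rank-1 chain keeps its norm while higher-rank chains lose norm as depth grows.

```python
import numpy as np


def product_norm(rank: int, depth: int, dim: int) -> float:
    """Frobenius norm of a product of `depth` identical layers.

    Each layer is a dim x dim diagonal matrix with `rank` equal nonzero
    singular values, scaled to unit Frobenius norm (an illustrative
    assumption, not the paper's construction).
    """
    singular_values = np.zeros(dim)
    singular_values[:rank] = 1.0 / np.sqrt(rank)  # unit Frobenius norm
    layer = np.diag(singular_values)
    product = np.linalg.matrix_power(layer, depth)
    # np.linalg.norm defaults to the Frobenius norm for 2-D arrays.
    return float(np.linalg.norm(product))


if __name__ == "__main__":
    # Compare ranks 1..3 at depth 3 in dimension 4.
    for rank in (1, 2, 3):
        print(rank, product_norm(rank, depth=3, dim=4))
```

With the same per-layer norm budget, the lower-rank chain delivers the larger output norm, which is the sense in which low rank is "more efficient" here. Whether the paper counts depth the same way as `depth` in this sketch is an assumption.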

If this is right

Low-rank softmax codes emerge as stable solutions in sufficiently deep networks.
Early repulsion among singular values accelerates the shift away from neural collapse.
Deeper networks shrink the region of parameter space that converges to full neural collapse.
Wider networks counteract the low-rank bias and favor higher-rank solutions.
The bias arises solely from gradient descent and depth under unregularized cross-entropy.
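For a concrete picture of a rank-2 softmax code, the Figure 3 caption notes that the angle 360/K corresponds to the d = 2 softmax code. The sketch below assumes this means K unit vectors placed at equal angular spacing on a circle; the helper names `polygon_code` and `adjacent_gaps_deg` are hypothetical and the construction is an illustration, not the authors' code.

```python
import numpy as np


def polygon_code(K: int) -> np.ndarray:
    """Return a 2 x K matrix whose columns are K equally spaced unit vectors."""
    angles = 2.0 * np.pi * np.arange(K) / K
    return np.stack([np.cos(angles), np.sin(angles)])


def adjacent_gaps_deg(X: np.ndarray) -> np.ndarray:
    """Angular gaps (in degrees) between neighbouring columns of a 2 x K matrix."""
    theta = np.sort(np.mod(np.arctan2(X[1], X[0]), 2.0 * np.pi))
    # Close the circle by appending the first angle shifted by a full turn.
    wrapped = np.concatenate([theta, [theta[0] + 2.0 * np.pi]])
    return np.degrees(np.diff(wrapped))


if __name__ == "__main__":
    X = polygon_code(8)
    gaps = adjacent_gaps_deg(X)
    print(gaps.min(), gaps.max(), np.linalg.matrix_rank(X))
```

Reporting the smallest and largest adjacent gap mirrors the diagnostic described in the Figure 3 caption: for an exact equal-spacing code both values coincide with 360/K.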

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Architectures that enforce higher effective width at each layer may be needed to recover neural collapse in very deep models.
The same norm-propagation mechanism could be tested in nonlinear networks by measuring effective rank after training.
This low-rank preference may explain why some deep classifiers generalize differently from their shallow counterparts even at the same training accuracy.
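Measuring effective rank after training, as suggested above, needs a concrete definition. The summary does not say which one the paper uses, so the sketch below assumes the common entropy-based definition (the exponential of the Shannon entropy of the normalized singular values). The function name `effective_rank` and the simplex-ETF test matrix are illustrative choices.

```python
import numpy as np


def effective_rank(M: np.ndarray, eps: float = 1e-12) -> float:
    """Entropy-based effective rank: exp(H(p)) with p the normalized singular values.

    This definition is an assumption; the paper may use a different one.
    """
    s = np.linalg.svd(M, compute_uv=False)
    p = s / s.sum()
    p = p[p > eps]  # drop numerically zero modes so log(p) stays finite
    return float(np.exp(-np.sum(p * np.log(p))))


def simplex_etf(K: int) -> np.ndarray:
    """K x K simplex equiangular tight frame, which has K - 1 equal nonzero singular values."""
    return np.sqrt(K / (K - 1)) * (np.eye(K) - np.ones((K, K)) / K)


if __name__ == "__main__":
    print(effective_rank(simplex_etf(10)))
```

Under this definition a simplex ETF with K classes has effective rank K − 1, which is consistent with the "effective rank 8.9 ≈ K − 1" reading reported for K = 10 in the Figure 7 caption.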

Load-bearing premise

The deep unconstrained feature model with orthogonal inputs and training without regularization isolates the pure effects of gradient descent and depth on the geometry.

What would settle it

Train a deep unconstrained feature model with spectral initialization and check whether the learned classifier weights exhibit rank strictly less than the number of classes while still achieving low loss.
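A minimal sketch of such an experiment follows, with several loudly stated assumptions: it uses random Gaussian initialization rather than the paper's spectral initialization (which this summary does not specify), plain gradient descent with a fixed step as a stand-in for gradient flow, one sample per class, and a convention in which `L` counts every factor in the product including the free feature matrix. The function names are hypothetical.

```python
import numpy as np


def softmax_columns(Z: np.ndarray) -> np.ndarray:
    """Column-wise softmax (each column holds the logits of one sample)."""
    Z = Z - Z.max(axis=0, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=0, keepdims=True)


def train_deep_ufm(K=4, d=8, L=3, steps=2000, lr=0.02, init_std=0.3, seed=0):
    """Unregularized cross-entropy on logits Z = M_L ... M_1 (requires L >= 2).

    M_1 is the free feature matrix (d x K, one sample per class); the other
    factors are weight matrices. All choices here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    shapes = [(d, K)] + [(d, d)] * (L - 2) + [(K, d)]
    mats = [init_std * rng.standard_normal(shape) for shape in shapes]
    Y, idx, losses = np.eye(K), np.arange(K), []
    for _ in range(steps):
        # partial[l] = M_l ... M_1, with partial[0] the identity.
        partial = [np.eye(K)]
        for M in mats:
            partial.append(M @ partial[-1])
        P = softmax_columns(partial[-1])
        losses.append(float(-np.mean(np.log(P[idx, idx]))))
        G = (P - Y) / K  # gradient of the mean loss with respect to the logits
        top_T = np.eye(K)  # transpose of the product of factors above layer l
        grads = [None] * L
        for l in reversed(range(L)):
            grads[l] = top_T @ G @ partial[l].T
            top_T = mats[l].T @ top_T
        for M, g in zip(mats, grads):
            M -= lr * g  # in-place gradient descent step
    Z = mats[0]
    for M in mats[1:]:
        Z = M @ Z
    return Z, losses


if __name__ == "__main__":
    Z, losses = train_deep_ufm()
    print(losses[0], losses[-1], np.linalg.svd(Z, compute_uv=False))
```

Inspecting the singular values of the returned logit matrix shows how many modes carry the solution. A short run like this one is only a starting point: the claim concerns late-time behaviour, so a real test would need much longer training and the initialization the paper actually uses.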

Figures

Figures reproduced from arXiv: 2605.23087 by Christos Thrampoulidis, Connall Garrod, Jonathan P. Keating.

**Figure 1.** Illustration of low-rank bias for $K = 4$, $L = 3$, $n = 1$, with all parameter matrices normalized to unit Frobenius norm. $\mu_{\mathrm{DNC}}, \mu'_{\mathrm{DNC}}$ denote two feature embeddings of a logit matrix $Z$ under DNC, while $\mu_{\mathrm{LR}}, \mu'_{\mathrm{LR}}$ denote two feature embeddings of a logit matrix $Z$ under the low-rank geometry of Eq. (14). Although DNC yields the larger angle between features, the low-rank geometry lies on a large-radius hypersph…

**Figure 3.** Optimal low-rank structures correspond to softmax codes. Left: Columns of the Gram factor $X$ defined in Eq. (5) for one run with $L = 10$ that converged to rank 2. Middle: For each $K$, we trained five deep UFMs that converged to rank 2; shown are the largest and smallest adjacent angle gaps of the columns $x_i$ of the Gram factor $X$ over all five runs. The angle $360^\circ/K$ corresponds to the $d = 2$ softmax code. Right: …

**Figure 4.** Experiments under Hadamard initialization. Left: Evolution of the normalized logit singular values $\hat{a}_i$ under Eq. (6) for depth $L = 2$ and $K = 16$ classes; singular values initialized with uniform entries in $[0, 1]$, then rescaled to have $\ell_1$ norm $10^{-3}$. Note convergence to low-rank solution where many modes approach zero. Middle: Corresponding evolution of the logit norm $\|a\|_1 = \sum_i a_i$. By the time norm increas…

**Figure 5.** Impact of network width on network training. Depicted are the mean and one-standard-deviation error bars over five runs. Left: Frobenius distance metric $M$ (see (61) in App.) between the logit-matrix derivative $\frac{dZ}{dt}(0)$ and the simplex ETF. Middle: Same metric for the logits after training has induced the logit norm to increase by a scale factor. Right: Effective rank of the logit matrix after training. In…

**Figure 6.** Confirmation of theory on the MNIST and CIFAR-10 datasets using the ResNet-20 architecture with a ReLU head. We report the average effective rank of the mean logit matrix at the end of training and one-standard-deviation error bars over five runs. Left: Here the ReLU head has width $d = 50$ and a variety of depths $L$. Middle: Here the ReLU head has depth $L = 3$ for a variety of widths. Right: Scatter plot of t…

**Figure 7.** Demonstrations of different normalized mean logit matrices after training for $L = 3$, $K = 10$. Left: logits that are approximately DNC, with effective rank $8.9 \approx K - 1$. Right: logits of a low-rank alternative, with effective rank 4.3.

**Figure 8.** Plot of the normalized mean logit matrix at convergence of a rank-two solution for $K = 8$, $L = 10$.

**Figure 9.** Left: Empirical mean of the effective rank of the mean logit matrix at $t = 10^3$ under the dynamics of Equation (6) with parameter $K = 16$ for various network depths $L$. The vector $a \in \mathbb{R}^{K-1}$ is initialized with entries uniformly sampled from $[0.5, 1]$, then normalized to have $\ell_1$ norm of $10^{-3}$. For each $L$ the average empirical effective rank over 10 runs is reported, along with one-standard-deviation error bars.…

**Figure 10.** Evolution dynamics under Eq. (6) from the initialization described in Eq. (49) with $\gamma = 0.2$, $\delta = 0.1$ for parameters $L = 2$, $K = 16$. Top Left: Evolution of the logit norm $\|a\|_1 = \sum_i a_i$. Top Right: Evolution of the normalized singular values $\hat{a}_i$. Bottom Left: Evolution of the KL-divergence $D_{\mathrm{KL}}\!\left(\tfrac{1}{K-1}\mathbf{1}_{K-1} \,\|\, \hat{a}\right)$, defined in Eq. (52). Bottom Right: mean logit matrix at the end of training, normalized to have u…

**Figure 11.** Experiments under random initialization for the deep UFM. Left: Evolution of the normalized mean logit singular values $\hat{a}_i$ under gradient flow on the loss function of Eq. (1) for depth $L = 3$, width $d = 100$, $K = 10$ classes and $n = 5$ data points per class. Network was initialized with Gaussian entries of standard deviation 0.03. Note convergence to low-rank solution where many modes approach zero as …

Original abstract

Neural collapse (NC) describes the structured geometry that emerges in the features and weights of trained classifiers. Recent theory suggests NC can be suboptimal in deep architectures, attributing this to an explicit low-rank bias from L2 regularization. We study the deep unconstrained feature model (UFM), which is equivalent to a deep linear network with orthogonal inputs, trained without regularization, to isolate how gradient descent and depth alone shape NC. We show that depth induces an implicit low-rank bias: low-rank matrices propagate norm more efficiently through successive multiplications, promoting low-rank alternatives to NC. These alternatives, we argue, correspond to softmax codes: max-margin solutions previously found in width-bottlenecked networks. Analyzing training dynamics under spectral initialization, we identify an early-time repulsion among singular values that drives low-rank emergence, and characterize how depth shrinks NC's basin of attraction. Finally, we show that some effects act in the opposite direction: for randomly initialized networks, increasing width biases training toward higher-rank solutions. Our results provide the first asymptotic and dynamic characterization of implicit bias in deep UFMs trained with unregularized multiclass cross-entropy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Depth in unregularized deep UFMs creates an implicit low-rank bias that favors softmax codes over standard neural collapse.

Depth induces an implicit low-rank bias in these models, pushing solutions toward softmax codes rather than full neural collapse. The paper gives the first asymptotic and dynamic characterization of implicit bias for deep UFMs under unregularized cross-entropy, which is the core new piece. It moves past earlier NC results that usually rely on regularization or stay with shallow cases. The dynamic analysis under spectral initialization stands out: it identifies early-time repulsion among singular values as the driver of low-rank structure and shows depth shrinking the basin for NC. The counter-finding that wider networks bias toward higher rank is a useful balance and keeps the story from being one-sided. The modeling choice of the deep UFM as a deep linear net with orthogonal inputs cleanly isolates depth and gradient descent, which is a fair move for the question they ask. The link from low-rank propagation to softmax codes is argued directly from the norm-efficiency observation. The main soft spot is that the UFM equivalence, while convenient, leaves open how much carries over to nonlinear activations or non-orthogonal inputs; the paper does not claim universality, so this is more a scope limit than a flaw. The correspondence to softmax codes is plausible on the given evidence but would be stronger with tighter formal ties or a few concrete low-dimensional checks. This paper is for people working on implicit bias, NC geometry, and architecture effects in training dynamics. A reader who already knows the NC literature will get the most out of the depth-specific dynamics. It shows clear engagement with the problem and the prior results, so it deserves a serious referee even if some steps need tightening in review.

Referee Report

0 major / 3 minor

Summary. The paper studies the deep unconstrained feature model (UFM), equivalent to a deep linear network with orthogonal inputs, trained without regularization on multiclass cross-entropy. It claims that depth alone induces an implicit low-rank bias because low-rank matrices propagate norm more efficiently under successive multiplications, yielding low-rank alternatives to neural collapse (NC) that correspond to softmax codes. The work analyzes training dynamics under spectral initialization to identify early-time singular-value repulsion, shows that depth shrinks NC's basin of attraction, and demonstrates that increasing width can bias toward higher-rank solutions. It positions these results as the first asymptotic and dynamic characterization of implicit bias in deep UFMs.

Significance. If the derivations and dynamics hold, the results provide a principled explanation for why NC can be suboptimal in deep architectures without invoking explicit L2 regularization, linking depth-induced low-rank bias to previously observed max-margin softmax codes. The dynamic analysis of singular-value repulsion and basin shrinkage constitutes a concrete advance over static characterizations of NC. The width-depth contrast is a useful counterpoint that could inform architecture design.

minor comments (3)

The abstract states that the deep UFM is 'equivalent to a deep linear network with orthogonal inputs'; the introduction or model section should include an explicit derivation or citation establishing the precise conditions under which this equivalence holds for the unregularized cross-entropy objective.
The phrase 'softmax codes' is introduced without a self-contained definition or pointer to the prior width-bottleneck literature; a brief recap with the relevant citation should appear at first use.
The claim of providing the 'first asymptotic and dynamic characterization' would be strengthened by a short table or paragraph in the related-work section contrasting the present dynamic results against the static NC analyses in the cited references.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the manuscript and the recommendation of minor revision. The provided summary accurately reflects the paper's focus on depth-induced implicit low-rank bias in deep UFMs under unregularized cross-entropy.

Circularity Check

0 steps flagged

No significant circularity detected

The paper analyzes the deep UFM (equivalent to a deep linear network with orthogonal inputs) under unregularized gradient descent, deriving an implicit low-rank bias from the efficiency of norm propagation under matrix multiplications and characterizing dynamics via spectral initialization. These steps are presented as direct consequences of the model equations and training process rather than reductions to fitted parameters, self-definitions, or load-bearing self-citations. The identification of softmax codes and basin-of-attraction effects follows from the stated assumptions without circular renaming or imported uniqueness theorems. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient detail to identify specific free parameters, axioms, or invented entities; the central model relies on the unconstrained feature model but no further breakdown is available.


work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

Advances in Neural Information Processing Systems , volume=

The implicit bias of adam on separable data , author=. Advances in Neural Information Processing Systems , volume=

work page

[12] [12]

arXiv preprint arXiv:2308.16898 , year=

Transformers as support vector machines , author=. arXiv preprint arXiv:2308.16898 , year=

work page arXiv

[13] [13]

Advances in Neural Information Processing Systems , volume=

Implicit bias of gradient descent for logistic regression at the edge of stability , author=. Advances in Neural Information Processing Systems , volume=

work page

[14] [14]

arXiv preprint arXiv:2410.14581 , year=

Optimizing attention with mirror descent: Generalized max-margin token selection , author=. arXiv preprint arXiv:2410.14581 , year=

work page arXiv

[15] [15]

Advances in Neural Information Processing Systems , volume=

(S) GD over Diagonal Linear Networks: Implicit bias, Large Stepsizes and Edge of Stability , author=. Advances in Neural Information Processing Systems , volume=

work page

[16] [16]

Advances in Neural Information Processing Systems , volume=

Implicit bias of mirror flow on separable data , author=. Advances in Neural Information Processing Systems , volume=

work page

[17] [17]

Learning multiple layers of features from tiny images , journal =

Alex Krizhevsky , year =. Learning multiple layers of features from tiny images , journal =

work page

[18] [18]

and Bottou, L

Lecun, Y. and Bottou, L. and Bengio, Y. and Haffner, P. , journal=. Gradient-based learning applied to document recognition , year=

work page

[19] [19]

Transactions on Machine Learning Research , year=

Implicit Bias and Fast Convergence Rates for Self-attention , author=. Transactions on Machine Learning Research , year=

work page

[20] [20]

arXiv preprint arXiv:2110.06084 , year=

Implicit bias of linear equivariant networks , author=. arXiv preprint arXiv:2110.06084 , year=

work page arXiv

[21] [21]

IEEE Transactions on Neural Networks and Learning Systems , volume=

Stochastic mirror descent on overparameterized nonlinear models , author=. IEEE Transactions on Neural Networks and Learning Systems , volume=. 2021 , publisher=

work page 2021

[22] [22]

arXiv preprint arXiv:2502.04664 , year=

Implicit Bias of Spectral Descent and Muon on Multiclass Separable Data , author=. arXiv preprint arXiv:2502.04664 , year=

work page arXiv

[23] [23]

arXiv preprint arXiv:2502.16075 , year=

Implicit Bias of Gradient Descent for Non-Homogeneous Deep Networks , author=. arXiv preprint arXiv:2502.16075 , year=

work page arXiv

[24] [24]

The 22nd International Conference on Artificial Intelligence and Statistics , pages=

Convergence of gradient descent on separable data , author=. The 22nd International Conference on Artificial Intelligence and Statistics , pages=. 2019 , organization=

work page 2019

[25] [25]

Advances in neural information processing systems , volume=

Margin maximizing loss functions , author=. Advances in neural information processing systems , volume=

work page

[26] [26]

Journal of Machine Learning Research , volume=

Boosting as a regularized path to a maximum margin classifier , author=. Journal of Machine Learning Research , volume=

work page

[27] [27]

arXiv preprint arXiv:2505.08348 , year=

Geometry of Semantics in Next-Token Prediction: How Optimization Implicitly Organizes Linguistic Representations , author=. arXiv preprint arXiv:2505.08348 , year=

work page arXiv

[28] [28]

arXiv preprint arXiv:2405.14468 , year=

Neural Collapse versus Low-rank Bias: Is Deep Neural Collapse Really Optimal? , author=. arXiv preprint arXiv:2405.14468 , year=

work page arXiv

[29] [29]

International Conference on Machine Learning , year=

Neural Collapse in Deep Linear Networks: From Balanced to Imbalanced Data , author=. International Conference on Machine Learning , year=

work page

[30] [30]

ArXiv , year=

Neural Collapse is Globally Optimal in Deep Regularized ResNets and Transformers , author=. ArXiv , year=

work page

[31] [31]

International Conference on Learning Representations , year=

Gradient descent aligns the layers of deep linear networks , author=. International Conference on Learning Representations , year=

work page

[32] [32]

ArXiv , year=

Towards Resolving the Implicit Bias of Gradient Descent for Matrix Factorization: Greedy Low-Rank Learning , author=. ArXiv , year=

work page

[33] [33]

arXiv preprint arXiv:2501.19104 , year=

Neural Collapse Beyond the Unconstrained Features Model: Landscape, Dynamics, and Generalization in the Mean-Field Regime , author=. arXiv preprint arXiv:2501.19104 , year=

work page arXiv

[34] [34]

Implicit Regularization in Deep Matrix Factorization , volume =

Arora, Sanjeev and Cohen, Nadav and Hu, Wei and Luo, Yuping , booktitle =. Implicit Regularization in Deep Matrix Factorization , volume =

work page

[35] [35]

ArXiv , year=

Generalization in Deep Learning , author=. ArXiv , year=

work page

[36] [36]

Neural Information Processing Systems , year=

Exploring Generalization in Deep Learning , author=. Neural Information Processing Systems , year=

work page

[37] [37]

International Conference on Learning Representations , year=

Understanding deep learning requires rethinking generalization , author=. International Conference on Learning Representations , year=

work page

[38] [38]

Proceedings of the 35th International Conference on Machine Learning , pages =

On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization , author =. Proceedings of the 35th International Conference on Machine Learning , pages =. 2018 , editor =

work page 2018

[39] [39]

ArXiv , year=

The Persistence of Neural Collapse Despite Low-Rank Bias: An Analytic Perspective Through Unconstrained Features , author=. ArXiv , year=

work page

[40] [40]

Wide Neural Networks Trained with Weight Decay Provably Exhibit Neural Collapse , url =

Jacot, Arthur and S\'. Wide Neural Networks Trained with Weight Decay Provably Exhibit Neural Collapse , url =. International Conference on Learning Representations , pages =

work page

[41] [41]

arXiv preprint arXiv:2405.08920 , year=

Neural collapse meets differential privacy: curious behaviors of NoisyGD with near-perfect representation learning , author=. arXiv preprint arXiv:2405.08920 , year=

work page arXiv

[42] [42]

Advances in Neural Information Processing Systems , volume=

A geometric analysis of neural collapse with unconstrained features , author=. Advances in Neural Information Processing Systems , volume=

work page

[43] [43]

and Johnson, Charles R

Horn, Roger A. and Johnson, Charles R. , year=. Matrix Analysis , publisher=

work page

[44] [44]

arXiv preprint arXiv:2212.12206 , year=

Principled and efficient transfer learning of deep models via neural collapse , author=. arXiv preprint arXiv:2212.12206 , year=

work page arXiv

[45] [45]

ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=

EPA: Neural Collapse Inspired Robust Out-of-distribution Detector , author=. ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=. 2024 , organization=

work page 2024

[46] [46]

Explorations on high dimensional landscapes

Explorations on high dimensional landscapes , author=. arXiv preprint arXiv:1412.6615 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[47] [47]

Artificial intelligence and statistics , pages=

The loss surfaces of multilayer networks , author=. Artificial intelligence and statistics , pages=. 2015 , organization=

work page 2015

[48] [48]

arXiv preprint arXiv:2006.09091 , year=

Flatness is a false friend , author=. arXiv preprint arXiv:2006.09091 , year=

work page arXiv 2006

[49] [49]

arXiv: Numerical Analysis , year=

Low-rank matrix recovery via regularized nuclear norm minimization , author=. arXiv: Numerical Analysis , year=

work page

[50] [50]

Proceedings of the AAAI Conference on Artificial Intelligence , year=

Low-Rank Matrix Recovery via Efficient Schatten p-Norm Minimization , author=. Proceedings of the AAAI Conference on Artificial Intelligence , year=

work page

[51] [51]

Journal of Machine Learning Research , volume=

Learning rates as a function of batch size: A random matrix theory approach to neural network training , author=. Journal of Machine Learning Research , volume=

work page

[52] [52]

Communications of the ACM , volume=

On the implicit bias in deep-learning algorithms , author=. Communications of the ACM , volume=. 2023 , publisher=

work page 2023

[53] [53]

Advances in Neural Information Processing Systems , volume=

On margin maximization in linear and relu networks , author=. Advances in Neural Information Processing Systems , volume=

work page

[54] [54]

International Conference on Algorithmic Learning Theory , pages=

Implicit regularization towards rank minimization in relu networks , author=. International Conference on Algorithmic Learning Theory , pages=. 2023 , organization=

work page 2023

[55] [55]

arXiv preprint arXiv:2303.06484 , year=

Generalizing and decoupling neural collapse via hyperspherical uniformity gap , author=. arXiv preprint arXiv:2303.06484 , year=

work page arXiv

[56] [56]

International Conference on Machine Learning , pages=

Sharp minima can generalize for deep nets , author=. International Conference on Machine Learning , pages=. 2017 , organization=

work page 2017

[57] [57]

Physical Review Letters , volume=

Unveiling the structure of wide flat minima in neural networks , author=. Physical Review Letters , volume=. 2021 , publisher=

work page 2021

[58] [58]

Journal of Statistical Mechanics: Theory and Experiment , volume=

The loss surfaces of neural networks with general activation functions , author=. Journal of Statistical Mechanics: Theory and Experiment , volume=. 2021 , publisher=

work page 2021

[59] [59]

Eigenvalues of the Hessian in Deep Learning: Singularity and Beyond

Eigenvalues of the hessian in deep learning: Singularity and beyond , author=. arXiv preprint arXiv:1611.07476 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[60] [60]

Advances in neural information processing systems , volume=

How regularization affects the critical points in linear networks , author=. Advances in neural information processing systems , volume=

work page

[61] [61]

arXiv preprint arXiv:1810.02281 , year=

A convergence analysis of gradient descent for deep linear neural networks , author=. arXiv preprint arXiv:1810.02281 , year=

work page arXiv

[62] [62]

International Conference on Machine Learning , year=

Deep Linear Networks with Arbitrary Loss: All Local Minima Are Global , author=. International Conference on Machine Learning , year=

work page

[63] [63]

Advances in neural information processing systems , volume=

Deep learning without poor local minima , author=. Advances in neural information processing systems , volume=

work page

[64] [64]

arXiv preprint arXiv:2402.03991 , year=

Provable Emergence of Deep Neural Collapse and Low-Rank Bias in L2-Regularized Nonlinear Networks , author=. arXiv preprint arXiv:2402.03991 , year=

work page arXiv

[65] [65]

Representation Costs of Linear Neural Networks: Analysis and Design , volume =

Dai, Zhen and Karzand, Mina and Srebro, Nathan , booktitle =. Representation Costs of Linear Neural Networks: Analysis and Design , volume =

work page

[66] [66]

Advances in neural information processing systems , volume=

An improved analysis of training over-parameterized deep neural networks , author=. Advances in neural information processing systems , volume=

work page

[67] [67]

IEEE Journal on Selected Areas in Information Theory , volume=

Toward moderate overparameterization: Global convergence guarantees for training shallow neural networks , author=. IEEE Journal on Selected Areas in Information Theory , volume=. 2020 , publisher=

work page 2020

[68] [68]

IEEE Signal Processing Magazine , volume=

The global landscape of neural networks: An overview , author=. IEEE Signal Processing Magazine , volume=. 2020 , publisher=

work page 2020

[69] [69]

Spurious Local Minima are Common in Two-Layer

Safran, Itay and Shamir, Ohad , booktitle =. Spurious Local Minima are Common in Two-Layer. 2018 , volume =

work page 2018

[70] [70]

IEEE Transactions on Pattern Analysis and Machine Intelligence , year=

The Loss Surface of Deep Linear Networks Viewed Through the Algebraic Geometry Lens , author=. IEEE Transactions on Pattern Analysis and Machine Intelligence , year=

work page

[71] [71]

The Low-Rank Simplicity Bias in Deep Networks , author=. Trans. Mach. Learn. Res. , year=

work page

[72] [72]

ArXiv , year=

Training invariances and the low-rank phenomenon: beyond linear networks , author=. ArXiv , year=

work page

[73] [73]

ArXiv , year=

Implicit bias of SGD in L2-regularized linear DNNs: One-way jumps from high to low rank , author=. ArXiv , year=

work page

[74] [74]

Neural Information Processing Systems , year=

Implicit Bias of Gradient Descent on Linear Convolutional Networks , author=. Neural Information Processing Systems , year=

work page

[75] [75]

International Conference on Machine Learning , pages=

Characterizing implicit bias in terms of optimization geometry , author=. International Conference on Machine Learning , pages=. 2018 , organization=

work page 2018

[76] [76]

Proceedings of the 38th International Conference on Machine Learning , pages =

Understanding the Dynamics of Gradient Flow in Overparameterized Linear models , author =. Proceedings of the 38th International Conference on Machine Learning , pages =. 2021 , volume =

work page 2021

[77] [77]

Proceedings of Thirty Third Conference on Learning Theory , pages =

Gradient descent follows the regularization path for general losses , author =. Proceedings of Thirty Third Conference on Learning Theory , pages =. 2020 , volume =

work page 2020

[78] [78]

2018 Information Theory and Applications Workshop (ITA) , year=

Implicit Regularization in Matrix Factorization , author=. 2018 Information Theory and Applications Workshop (ITA) , year=

work page 2018

[79] [79]

Directional convergence and alignment in deep learning , url =

Ji, Ziwei and Telgarsky, Matus , booktitle =. Directional convergence and alignment in deep learning , url =

work page

[80] [80]

ArXiv , year=

Gradient Descent for Deep Matrix Factorization: Dynamics and Implicit Bias towards Low Rank , author=. ArXiv , year=

work page