
arxiv: 2605.23087 · v1 · pith:GN4EIQPDnew · submitted 2026-05-21 · 💻 cs.LG

The Implicit Bias of Depth: From Neural Collapse to Softmax Codes

Pith reviewed 2026-05-25 05:24 UTC · model grok-4.3

classification 💻 cs.LG
keywords neural collapse · implicit bias · depth · softmax codes · low-rank bias · unconstrained feature model · cross-entropy · multiclass classification

The pith

Depth in neural networks induces an implicit low-rank bias that favors softmax codes over neural collapse.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies the deep unconstrained feature model (UFM), which is equivalent to a deep linear network with orthogonal inputs and is used here to isolate the effects of gradient descent and depth. It establishes that depth alone creates an implicit bias toward low-rank weight matrices because such matrices transmit norm more efficiently across successive layer multiplications. This bias promotes low-rank alternatives to the standard neural collapse geometry, which the authors identify as softmax codes. The work also tracks how depth shrinks the basin of attraction for neural collapse and how width can push solutions toward higher rank. Together, these results characterize the implicit bias of depth under unregularized multiclass cross-entropy training.

Core claim

In the deep unconstrained feature model trained without regularization, depth induces an implicit low-rank bias: low-rank matrices propagate norm more efficiently through successive multiplications, promoting low-rank alternatives to neural collapse. These alternatives correspond to softmax codes. Under spectral initialization, early-time repulsion among singular values drives the low-rank emergence, and depth reduces the basin of attraction for full neural collapse. In randomly initialized networks, increasing width biases training toward higher-rank solutions.

What carries the argument

The deep unconstrained feature model (equivalent to a deep linear network with orthogonal inputs) together with the mechanism of norm propagation efficiency in low-rank matrices under successive multiplications.
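
The norm-propagation mechanism can be illustrated with a minimal numerical sketch. It rests on assumptions that are not taken from the paper: every layer is the same diagonal matrix with `rank` equal nonzero singular values, normalized to unit Frobenius norm, so singular vectors are perfectly aligned across layers. In that idealized case the product of L layers has Frobenius norm r^((1 − L)/2), which stays at 1 for rank 1 and decays with depth for higher rank. The helper name `product_norm` is hypothetical.

```python
import numpy as np

def product_norm(rank: int, depth: int, dim: int = 8) -> float:
    """Frobenius norm of a product of `depth` identical unit-norm layers.

    Assumption (illustrative, not the paper's setup): each layer is diagonal
    with `rank` equal nonzero singular values and unit Frobenius norm.
    """
    layer = np.zeros((dim, dim))
    layer[np.arange(rank), np.arange(rank)] = 1.0 / np.sqrt(rank)
    product = np.linalg.matrix_power(layer, depth)
    return float(np.linalg.norm(product))  # Frobenius norm for 2-D input

# Compare how much norm survives at several depths and ranks.
for depth in (1, 3, 10):
    norms = {rank: product_norm(rank, depth) for rank in (1, 2, 4, 8)}
    print(depth, norms)
```

At depth 1 every rank gives norm 1; as depth grows, only the rank-1 product keeps its norm, which is the sense in which low-rank factors are "more efficient" under a fixed per-layer norm budget in this toy setting.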

If this is right

  • Low-rank softmax codes emerge as stable solutions in sufficiently deep networks.
  • Early repulsion among singular values accelerates the shift away from neural collapse.
  • Deeper networks shrink the region of parameter space that converges to full neural collapse.
  • Wider networks counteract the low-rank bias and favor higher-rank solutions.
  • The bias arises solely from gradient descent and depth under unregularized cross-entropy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Architectures that enforce higher effective width at each layer may be needed to recover neural collapse in very deep models.
  • The same norm-propagation mechanism could be tested in nonlinear networks by measuring effective rank after training.
  • This low-rank preference may explain why some deep classifiers generalize differently from their shallow counterparts even at the same training accuracy.

Load-bearing premise

The deep unconstrained feature model with orthogonal inputs and training without regularization isolates the pure effects of gradient descent and depth on the geometry.
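
To make the premise concrete, here is a hedged sketch of what an unregularized deep UFM objective could look like. It assumes the orthogonal inputs are the columns of the identity matrix (one input per class, n = 1), so the logit matrix is simply the product of the weight matrices; the paper's actual loss, its Eq. (1), is not reproduced on this page and may differ in detail. The function name `deep_ufm_loss` and the shapes are hypothetical.

```python
import numpy as np

def deep_ufm_loss(weights, labels):
    """Mean cross-entropy of a deep linear model on orthonormal inputs.

    Assumption: the inputs are the columns of the identity matrix, so the
    logit matrix Z is just the product W_L ... W_1 (shape K x N).
    `weights` is ordered [W_1, ..., W_L]; `labels[j]` is the class of input j.
    """
    Z = weights[0]
    for W in weights[1:]:
        Z = W @ Z
    Z = Z - Z.max(axis=0, keepdims=True)  # stabilise the softmax
    log_probs = Z - np.log(np.exp(Z).sum(axis=0, keepdims=True))
    return float(-log_probs[labels, np.arange(Z.shape[1])].mean())

K, d, L = 4, 6, 3  # classes, width, depth (illustrative values)
rng = np.random.default_rng(0)
shapes = [(d, K)] + [(d, d)] * (L - 2) + [(K, d)]
weights = [0.03 * rng.standard_normal(s) for s in shapes]
labels = np.arange(K)  # one input per class (n = 1)
print(deep_ufm_loss(weights, labels))
```

There is no weight-decay term anywhere in this objective, which is the point of the premise: any rank preference that appears during training would have to come from the optimizer and the depth of the product, not from an explicit penalty.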

What would settle it

Train a deep unconstrained feature model with spectral initialization and check whether the learned logit matrix has effective rank strictly below K − 1, the rank of the neural collapse (simplex ETF) geometry for K classes, while still achieving low loss.
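
A rank check of this kind could be sketched as follows. The entropy-based effective rank used here is a common definition and is an assumption; the paper may use a different measure. The two test matrices are illustrative only: a simplex ETF as a stand-in for neural collapse logits, and the Gram matrix of K points equally spaced on a circle as a stand-in for a rank-2 alternative.

```python
import numpy as np

def effective_rank(matrix: np.ndarray, eps: float = 1e-12) -> float:
    """Exponential of the Shannon entropy of the normalized singular values.

    Assumption: this entropy-based definition is a common choice; the paper
    may use a different effective-rank measure.
    """
    s = np.linalg.svd(matrix, compute_uv=False)
    p = s / s.sum()
    p = p[p > eps]  # drop numerically zero modes before taking logs
    return float(np.exp(-(p * np.log(p)).sum()))

K = 10
# Simplex ETF (neural collapse geometry): K - 1 equal nonzero singular values.
etf = np.eye(K) - np.ones((K, K)) / K
# Rank-2 stand-in: Gram matrix of K points equally spaced on a circle.
angles = 2 * np.pi * np.arange(K) / K
X = np.stack([np.cos(angles), np.sin(angles)])  # shape (2, K)
low_rank = X.T @ X

print(effective_rank(etf))       # close to K - 1
print(effective_rank(low_rank))  # close to 2
```

Applied to the mean logit matrix at the end of training, a value near K − 1 would point to neural collapse, while a clearly smaller value would point to a low-rank alternative.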

Figures

Figures reproduced from arXiv: 2605.23087 by Christos Thrampoulidis, Connall Garrod, Jonathan P. Keating.

Figure 1
Figure 1: Illustration of low-rank bias for $K = 4$, $L = 3$, $n = 1$, with all parameter matrices normalized to unit Frobenius norm. $\mu_{\mathrm{DNC}}, \mu'_{\mathrm{DNC}}$ denote two feature embeddings of a logit matrix $Z$ under DNC, while $\mu_{\mathrm{LR}}, \mu'_{\mathrm{LR}}$ denote two feature embeddings of a logit matrix $Z$ under the low-rank geometry of Eq. (14). Although DNC yields the larger angle between features, the low-rank geometry lies on a large-radius hypersph…
Figure 3
Figure 3: Optimal low-rank structures correspond to softmax codes. Left: Columns of the Gram factor $X$ defined in Eq. (5) for one run with $L = 10$ that converged to rank 2. Middle: For each $K$, we trained five deep UFMs that converged to rank 2; shown are the largest and smallest adjacent angle gaps of the columns $x_i$ of the Gram factor $X$ over all five runs. The angle $360^\circ/K$ corresponds to the $d = 2$ softmax code. Right: …
Figure 4
Figure 4: Experiments under Hadamard initialization. Left: Evolution of the normalized logit singular values $\hat{a}_i$ under Eq. (6) for depth $L = 2$ and $K = 16$ classes; singular values initialized with uniform entries in $[0, 1]$, then rescaled to have $L_1$ norm $10^{-3}$. Note convergence to a low-rank solution where many modes approach zero. Middle: Corresponding evolution of the logit norm $\|a\|_1 = \sum_i a_i$. By the time norm increas…
Figure 5
Figure 5: Impact of network width on network training. Depicted are the mean and one-standard-deviation error bars over five runs. Left: Frobenius distance metric $M$ (see (61) in App.) between the logit-matrix derivative $\frac{dZ}{dt}(0)$ and the simplex ETF. Middle: Same metric for the logits after training has induced the logit norm to increase by a scale factor. Right: Effective rank of the logit matrix after training. In…
Figure 6
Figure 6: Confirmation of theory on the MNIST and CIFAR-10 datasets using the ResNet-20 architecture with a ReLU head. We report the average effective rank of the mean logit matrix at the end of training and one-standard-deviation error bars over five runs. Left: Here the ReLU head has width $d = 50$ and a variety of depths $L$. Middle: Here the ReLU head has depth $L = 3$ for a variety of widths. Right: Scatter plot of t…
Figure 7
Figure 7: Demonstrations of different normalized mean logit matrices after training for $L = 3$, $K = 10$. Left: logits that are approximately DNC, with effective rank $8.9 \approx K - 1$. Right: logits of a low-rank alternative, with effective rank 4.3.
Figure 8
Figure 8: Plot of the normalized mean logit matrix at convergence of a rank-two solution for $K = 8$, $L = 10$.
Figure 9
Figure 9: Left: Empirical mean of the effective rank of the mean logit matrix at $t = 10^3$ under the dynamics of Equation (6) with parameter $K = 16$ for various network depths $L$. The vector $a \in \mathbb{R}^{K-1}$ is initialized with entries uniformly sampled from $[0.5, 1]$, then normalized to have $L_1$ norm of $10^{-3}$. For each $L$ the average empirical effective rank over 10 runs is reported, along with one-standard-deviation error bars.…
Figure 10
Figure 10: Evolution dynamics under Eq. (6) from the initialization described in Eq. (49) with $\gamma = 0.2$, $\delta = 0.1$ for parameters $L = 2$, $K = 16$. Top Left: Evolution of the logit norm $\|a\|_1 = \sum_i a_i$. Top Right: Evolution of the normalized singular values $\hat{a}_i$. Bottom Left: Evolution of the KL-divergence $D_{\mathrm{KL}}\left(\tfrac{1}{K-1}\mathbf{1}_{K-1} \,\|\, \hat{a}\right)$, defined in Eq. (52). Bottom Right: mean logit matrix at the end of training, normalized to have u…
Figure 11
Figure 11: Experiments under random initialization for the deep UFM. Left: Evolution of the normalized mean logit singular values $\hat{a}_i$ under gradient flow on the loss function of Eq. (1) for depth $L = 3$, width $d = 100$, $K = 10$ classes and $n = 5$ data points per class. Network was initialized with Gaussian normal entries of standard deviation 0.03. Note convergence to a low-rank solution where many modes approach zero as …
Original abstract

Neural collapse (NC) describes the structured geometry that emerges in the features and weights of trained classifiers. Recent theory suggests NC can be suboptimal in deep architectures, attributing this to an explicit low-rank bias from L2 regularization. We study the deep unconstrained feature model (UFM), which is equivalent to a deep linear network with orthogonal inputs, trained without regularization, to isolate how gradient descent and depth alone shape NC. We show that depth induces an implicit low-rank bias: low-rank matrices propagate norm more efficiently through successive multiplications, promoting low-rank alternatives to NC. These alternatives, we argue, correspond to softmax codes: max-margin solutions previously found in width-bottlenecked networks. Analyzing training dynamics under spectral initialization, we identify an early-time repulsion among singular values that drives low-rank emergence, and characterize how depth shrinks NC's basin of attraction. Finally, we show that some effects act in the opposite direction: for randomly initialized networks, increasing width biases training toward higher-rank solutions. Our results provide the first asymptotic and dynamic characterization of implicit bias in deep UFMs trained with unregularized multiclass cross-entropy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper studies the deep unconstrained feature model (UFM), equivalent to a deep linear network with orthogonal inputs, trained without regularization on multiclass cross-entropy. It claims that depth alone induces an implicit low-rank bias because low-rank matrices propagate norm more efficiently under successive multiplications, yielding low-rank alternatives to neural collapse (NC) that correspond to softmax codes. The work analyzes training dynamics under spectral initialization to identify early-time singular-value repulsion, shows that depth shrinks NC's basin of attraction, and demonstrates that increasing width can bias toward higher-rank solutions. It positions these results as the first asymptotic and dynamic characterization of implicit bias in deep UFMs.

Significance. If the derivations and dynamics hold, the results provide a principled explanation for why NC can be suboptimal in deep architectures without invoking explicit L2 regularization, linking depth-induced low-rank bias to previously observed max-margin softmax codes. The dynamic analysis of singular-value repulsion and basin shrinkage constitutes a concrete advance over static characterizations of NC. The width-depth contrast is a useful counterpoint that could inform architecture design.

minor comments (3)
  1. The abstract states that the deep UFM is 'equivalent to a deep linear network with orthogonal inputs'; the introduction or model section should include an explicit derivation or citation establishing the precise conditions under which this equivalence holds for the unregularized cross-entropy objective.
  2. The phrase 'softmax codes' is introduced without a self-contained definition or pointer to the prior width-bottleneck literature; a brief recap with the relevant citation should appear at first use.
  3. The claim of providing the 'first asymptotic and dynamic characterization' would be strengthened by a short table or paragraph in the related-work section contrasting the present dynamic results against the static NC analyses in the cited references.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the manuscript and the recommendation of minor revision. The provided summary accurately reflects the paper's focus on depth-induced implicit low-rank bias in deep UFMs under unregularized cross-entropy.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper analyzes the deep UFM (equivalent to a deep linear network with orthogonal inputs) under unregularized gradient descent, deriving an implicit low-rank bias from the efficiency of norm propagation under matrix multiplications and characterizing dynamics via spectral initialization. These steps are presented as direct consequences of the model equations and training process rather than reductions to fitted parameters, self-definitions, or load-bearing self-citations. The identification of softmax codes and basin-of-attraction effects follows from the stated assumptions without circular renaming or imported uniqueness theorems. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient detail to identify specific free parameters, axioms, or invented entities; the central model relies on the unconstrained feature model but no further breakdown is available.

pith-pipeline@v0.9.0 · 5729 in / 1066 out tokens · 29080 ms · 2026-05-25T05:24:42.681843+00:00 · methodology

