
arXiv preprint arXiv:1810.02032

6 Pith papers cite this work. Polarity classification is still indexing.

abstract

This paper establishes risk convergence and asymptotic weight matrix alignment --- a form of implicit regularization --- of gradient flow and gradient descent when applied to deep linear networks on linearly separable data. In more detail, for gradient flow applied to strictly decreasing loss functions (with similar results for gradient descent with particular decreasing step sizes): (i) the risk converges to 0; (ii) the normalized i-th weight matrix asymptotically equals its rank-1 approximation $u_iv_i^{\top}$; (iii) these rank-1 matrices are aligned across layers, meaning $|v_{i+1}^{\top}u_i|\to1$. In the case of the logistic loss (binary cross entropy), more can be said: the linear function induced by the network --- the product of its weight matrices --- converges to the same direction as the maximum margin solution. This last property was identified in prior work, but only under assumptions on gradient descent which here are implied by the alignment phenomenon.
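The claims in the abstract can be probed numerically. The following is a minimal sketch, not the paper's experimental setup: it trains a three-layer linear network with plain gradient descent on the logistic loss and then reports, per layer, how close the weight matrix is to rank 1 (top singular value over Frobenius norm) and how well adjacent layers align ($|v_{i+1}^{\top}u_i|$). The toy data generator, layer widths, initialization scale, constant step size, and all function names are assumptions chosen for illustration; the paper's gradient descent results concern particular decreasing step sizes.

```python
import numpy as np

def make_separable_data(n=40, d=5, margin=0.5, seed=0):
    """Toy linearly separable data (illustrative assumption, not from the paper)."""
    rng = np.random.default_rng(seed)
    w_star = rng.normal(size=d)
    w_star /= np.linalg.norm(w_star)
    X = rng.normal(size=(n, d))
    y = np.where(X @ w_star >= 0, 1.0, -1.0)
    X = X + margin * y[:, None] * w_star  # push every point away from the boundary
    return X, y, w_star

def init_layers(dims, scale=0.3, seed=1):
    """dims = [d_0, ..., d_L] with d_L = 1; layer i has shape (d_i, d_{i-1})."""
    rng = np.random.default_rng(seed)
    return [scale * rng.normal(size=(dims[i + 1], dims[i])) for i in range(len(dims) - 1)]

def risk_and_grads(Ws, X, y):
    """Logistic risk of x -> W_L ... W_1 x and its gradient with respect to each layer."""
    n, d = X.shape
    prefix = [np.eye(d)]                    # prefix[i] = W_i ... W_1 (identity for i = 0)
    for W in Ws:
        prefix.append(W @ prefix[-1])
    w = prefix[-1].ravel()                  # induced linear predictor, shape (d,)
    margins = y * (X @ w)
    risk = np.mean(np.logaddexp(0.0, -margins))
    sig = np.exp(-np.logaddexp(0.0, margins))   # sigmoid(-margins), computed stably
    g = (-(y * sig) @ X / n).reshape(1, d)      # gradient of the risk w.r.t. the product
    grads = [None] * len(Ws)
    suffix = np.eye(Ws[-1].shape[0])        # product of the layers above layer i
    for i in reversed(range(len(Ws))):
        grads[i] = suffix.T @ g @ prefix[i].T
        suffix = suffix @ Ws[i]
    return risk, grads

def train(Ws, X, y, lr=0.02, steps=2000):
    """Constant-step gradient descent (a simplification of the paper's step sizes)."""
    risks = []
    for _ in range(steps):
        risk, grads = risk_and_grads(Ws, X, y)
        risks.append(risk)
        Ws = [W - lr * G for W, G in zip(Ws, grads)]
    return Ws, risks

def alignment_stats(Ws):
    """Return sigma_1 / ||W_i||_F per layer and |v_{i+1}^T u_i| per adjacent pair."""
    svds = [np.linalg.svd(W) for W in Ws]
    rank1 = [S[0] / np.linalg.norm(W) for W, (_, S, _) in zip(Ws, svds)]
    align = [abs(svds[i + 1][2][0] @ svds[i][0][:, 0]) for i in range(len(Ws) - 1)]
    return rank1, align

if __name__ == "__main__":
    X, y, _ = make_separable_data()
    Ws, risks = train(init_layers([5, 4, 4, 1]), X, y)
    rank1, align = alignment_stats(Ws)
    print("risk:", risks[0], "->", risks[-1])
    print("sigma_1 / ||W_i||_F per layer:", rank1)
    print("|v_{i+1}^T u_i| per adjacent pair:", align)
```

Both diagnostics are bounded above by 1, and the abstract's statements (ii) and (iii) say they tend to 1 in the limit. The results are asymptotic, so after a finite number of steps the printed values may still sit visibly below 1.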

citation-role summary

background 1


representative citing papers

Implicit Bias in Deep Linear Discriminant Analysis

cs.LG · 2026-03-03 · unverdicted · novelty 7.0

Gradient flow on deep diagonal linear LDA networks with balanced initialization converts additive updates to multiplicative updates, automatically conserving the (2/L) quasi-norm.

A Theory of Saddle Escape in Deep Nonlinear Networks

cs.LG · 2026-05-02 · conditional · novelty 7.0 · 2 refs

An exact norm-imbalance identity classifies activations into four classes and reduces deep nonlinear training flow to a scalar ODE that predicts saddle escape time scaling as $\epsilon^{-(r-2)}$ for $r$ bottleneck layers.

The Effect of Mini-Batch Noise on the Implicit Bias of Adam

cs.LG · 2026-02-02 · unverdicted · novelty 6.0

Mini-batch noise reverses how Adam's β2 controls anti-regularization, making default momentum values suitable for small batches but requiring β1 closer to β2 for large batches to favor flatter minima.
