arXiv preprint arXiv:1810.02032

Gradient descent aligns the layers of deep linear networks · 2018 · cs.LG · arXiv 1810.02032

6 Pith papers cite this work. Polarity classification is still indexing.

abstract

This paper establishes risk convergence and asymptotic weight matrix alignment --- a form of implicit regularization --- of gradient flow and gradient descent when applied to deep linear networks on linearly separable data. In more detail, for gradient flow applied to strictly decreasing loss functions (with similar results for gradient descent with particular decreasing step sizes): (i) the risk converges to 0; (ii) the normalized i-th weight matrix asymptotically equals its rank-1 approximation $u_iv_i^{\top}$; (iii) these rank-1 matrices are aligned across layers, meaning $|v_{i+1}^{\top}u_i|\to1$. In the case of the logistic loss (binary cross entropy), more can be said: the linear function induced by the network --- the product of its weight matrices --- converges to the same direction as the maximum margin solution. This last property was identified in prior work, but only under assumptions on gradient descent which here are implied by the alignment phenomenon.
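The quantities named in the abstract can be measured on a toy problem. The following is a minimal illustrative sketch, not the paper's experiment: the data generator, layer sizes, initialization scale, constant step size, and all function names (`make_separable_data`, `train_deep_linear`, `alignment_report`) are assumptions made here for illustration. It runs full-batch gradient descent with the logistic loss on a small deep linear network, then reports, for each layer, the share of the Frobenius norm carried by the top singular value (a proxy for closeness to rank 1) and the alignment $|v_{i+1}^{\top}u_i|$ between adjacent layers.

```python
import numpy as np


def make_separable_data(n=40, d=4, margin=0.5, seed=0):
    """Toy linearly separable data (an assumption): label = sign of coordinate 0."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    y = np.where(X[:, 0] >= 0, 1.0, -1.0)
    X[:, 0] += margin * y  # push points away from the separating hyperplane
    return X, y


def train_deep_linear(X, y, depth=3, width=4, lr=0.02, steps=3000, seed=1):
    """Full-batch gradient descent on the logistic risk of x -> W_L ... W_1 x."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    dims = [d] + [width] * (depth - 1) + [1]
    Ws = [0.5 * rng.normal(size=(dims[i + 1], dims[i])) for i in range(depth)]
    risks = []
    for _ in range(steps):
        # Forward pass, keeping the input to every layer.
        acts = [X.T]
        for W in Ws:
            acts.append(W @ acts[-1])
        margins = y * acts[-1].ravel()
        risks.append(float(np.mean(np.logaddexp(0.0, -margins))))
        # d(risk)/d(output) = -y * sigmoid(-margin) / n, computed stably via tanh.
        g = (-y * 0.5 * (1.0 - np.tanh(margins / 2.0)) / n)[None, :]
        grads = [None] * depth
        for i in reversed(range(depth)):
            grads[i] = g @ acts[i].T  # gradient with respect to W_{i+1}
            g = Ws[i].T @ g           # back-propagate to the layer below
        for W, G in zip(Ws, grads):
            W -= lr * G
    return Ws, risks


def alignment_report(Ws):
    """Return (top-singular-value share per layer, |v_{i+1}^T u_i| per adjacent pair)."""
    svds = [np.linalg.svd(W) for W in Ws]
    rank1_share = [float(S[0] / np.linalg.norm(S)) for _, S, _ in svds]
    aligned = [
        float(abs(svds[i + 1][2][0] @ svds[i][0][:, 0]))
        for i in range(len(Ws) - 1)
    ]
    return rank1_share, aligned


if __name__ == "__main__":
    X, y = make_separable_data()
    Ws, risks = train_deep_linear(X, y)
    share, aligned = alignment_report(Ws)
    print("risk: first step", risks[0], "last step", risks[-1])
    print("top singular value share per layer:", share)
    print("alignment between adjacent layers:", aligned)
```

The abstract's statements are asymptotic and, for gradient descent, stated for particular decreasing step sizes, so a finite run with a constant step size like this one can only hint at the trend: both reported quantities would be expected to drift toward 1 as training continues, without any guarantee of how close they get in a given number of steps.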

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Implicit Bias in Deep Linear Discriminant Analysis

cs.LG · 2026-03-03 · unverdicted · novelty 7.0

Gradient flow on deep diagonal linear LDA networks with balanced initialization converts additive updates to multiplicative updates, automatically conserving the (2/L) quasi-norm.

A Theory of Saddle Escape in Deep Nonlinear Networks

cs.LG · 2026-05-02 · conditional · novelty 7.0 · 2 refs

An exact norm-imbalance identity classifies activations into four classes and reduces deep nonlinear training flow to a scalar ODE that predicts saddle escape time scaling as $\varepsilon^{-(r-2)}$ for $r$ bottleneck layers.

The Effect of Mini-Batch Noise on the Implicit Bias of Adam

cs.LG · 2026-02-02 · unverdicted · novelty 6.0

Mini-batch noise reverses how Adam's β2 controls anti-regularization, making default momentum values suitable for small batches but requiring β1 closer to β2 for large batches to favor flatter minima.

Prediction horizon shapes representations in predictive learning

cs.LG · 2025-11-12 · unverdicted · novelty 6.0

Longer prediction horizons in predictive learning interact with model biases to recover the latent geometry of the task.

Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models

cs.LG · 2024-01-02 · unverdicted · novelty 6.0

SPIN lets weak LLMs become strong by self-generating training data from previous model versions and training to prefer human-annotated responses over its own outputs, outperforming DPO even with extra GPT-4 data on benchmarks.

On the global convergence of gradient descent for wide shallow models with bounded nonlinearities

math.OC · 2026-05-11 · unverdicted · novelty 6.0

Gradient descent on wide shallow models with bounded nonlinearities converges globally in the mean-field limit as non-global critical points are unstable under the dynamics.

citing papers explorer

Showing 6 of 6 citing papers.

Implicit Bias in Deep Linear Discriminant Analysis
cs.LG · 2026-03-03 · unverdicted · none · ref 10 · internal anchor

A Theory of Saddle Escape in Deep Nonlinear Networks
cs.LG · 2026-05-02 · conditional · none · ref 24 · 2 links

The Effect of Mini-Batch Noise on the Implicit Bias of Adam
cs.LG · 2026-02-02 · unverdicted · none · ref 26 · internal anchor

Prediction horizon shapes representations in predictive learning
cs.LG · 2025-11-12 · unverdicted · none · ref 2 · internal anchor

Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
cs.LG · 2024-01-02 · unverdicted · none · ref 160 · internal anchor

On the global convergence of gradient descent for wide shallow models with bounded nonlinearities
math.OC · 2026-05-11 · unverdicted · none · ref 47
