Saddle-To-Saddle Dynamics in Deep ReLU Networks: Low-Rank Bias in the First Saddle Escape
Pith reviewed 2026-05-19 12:12 UTC · model grok-4.3
The pith
Deep ReLU networks escape the origin saddle along directions with a low-rank bias that grows with layer depth.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the optimal escape direction from the origin saddle in a deep ReLU network initialized with small weights has a low-rank bias in its deeper layers: the first singular value of the ell-th layer weight matrix is at least ell to the power 1/4 larger than any other singular value. The authors also establish related properties of these escape directions and propose that deep ReLU networks follow saddle-to-saddle dynamics, visiting a sequence of saddles whose bottleneck ranks increase over time.
What carries the argument
Optimal escape direction, defined as the solution to an optimization problem that encodes the leading-order gradient descent flow away from the origin saddle.
Load-bearing premise
Gradient descent is initially dominated by the saddle at the origin, so that escape directions are characterized by solving an optimization problem whose objective reflects the leading-order dynamics near zero.
What would settle it
Run gradient descent from small random weights on a deep ReLU network and check whether the singular values of the weight matrices along the first escape trajectory satisfy the predicted ell to the power 1/4 scaling between the largest and second-largest values.
read the original abstract
When a deep ReLU network is initialized with small weights, gradient descent (GD) is at first dominated by the saddle at the origin in parameter space. We study the so-called escape directions along which GD leaves the origin, which play a similar role as the eigenvectors of the Hessian for strict saddles. We show that the optimal escape direction features a low-rank bias in its deeper layers: the first singular value of the $\ell$-th layer weight matrix is at least $\ell^{\frac{1}{4}}$ larger than any other singular value. We also prove a number of related results about these escape directions. We suggest that deep ReLU networks exhibit saddle-to-saddle dynamics, with GD visiting a sequence of saddles with increasing bottleneck rank (Jacot, 2023).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that in deep ReLU networks initialized with small weights, gradient descent is initially dominated by the saddle at the origin in parameter space. It studies the escape directions from this saddle and shows that the optimal escape direction features a low-rank bias in deeper layers: the first singular value of the ℓ-th layer weight matrix is at least ℓ^{1/4} larger than any other singular value. The work also proves related results about these escape directions and suggests that deep ReLU networks exhibit saddle-to-saddle dynamics, visiting a sequence of saddles with increasing bottleneck rank.
Significance. If the claimed ℓ^{1/4} low-rank bias in the optimal escape direction holds under the stated assumptions, the result would provide a quantitative characterization of the initial implicit bias in deep ReLU training, offering insight into layer-dependent rank preferences and supporting the saddle-to-saddle trajectory picture in neural network optimization.
major comments (2)
- [Abstract] Abstract: The central quantitative claim is that the first singular value of the ℓ-th layer in the optimal escape direction is at least ℓ^{1/4} larger than any other. The abstract supplies neither the explicit optimization problem whose solution defines this 'optimal escape direction' nor any derivation or proof sketch showing how the ℓ^{1/4} gap follows from the leading-order dynamics away from the origin saddle.
- [Abstract] Abstract: The analysis presupposes that gradient descent is initially dominated by the saddle at the origin and that escape directions are characterized as the solution to an optimization problem encoding the leading-order dynamics. The abstract does not state the precise form of this optimization problem or the conditions under which the domination assumption holds, leaving the load-bearing step unverified.
minor comments (1)
- The abstract refers to 'a number of related results about these escape directions' without enumerating them; the full manuscript should list and briefly describe these results to permit evaluation of their scope and novelty.
Simulated Author's Rebuttal
We thank the referee for their comments, which correctly identify that the abstract is too terse regarding the definition of the optimal escape direction and the underlying optimization problem. We will revise the abstract to include a concise statement of the optimization problem and the small-initialization assumption that justifies domination by the origin saddle.
read point-by-point responses
-
Referee: [Abstract] The central quantitative claim is that the first singular value of the ℓ-th layer in the optimal escape direction is at least ℓ^{1/4} larger than any other. The abstract supplies neither the explicit optimization problem whose solution defines this 'optimal escape direction' nor any derivation or proof sketch showing how the ℓ^{1/4} gap follows from the leading-order dynamics away from the origin saddle.
Authors: We agree that the abstract does not state the optimization problem or sketch the derivation. The full manuscript defines the optimal escape direction as the solution to a variational problem that maximizes the leading-order growth rate of the loss under the linearized dynamics near the origin; the ℓ^{1/4} factor then follows from balancing the contributions of successive layers in this variational problem. We will add one sentence to the abstract that names this optimization problem and indicates that the gap is obtained by analyzing the layer-wise singular-value scaling in its solution. revision: yes
-
Referee: [Abstract] The analysis presupposes that gradient descent is initially dominated by the saddle at the origin and that escape directions are characterized as the solution to an optimization problem encoding the leading-order dynamics. The abstract does not state the precise form of this optimization problem or the conditions under which the domination assumption holds, leaving the load-bearing step unverified.
Authors: The manuscript states the domination assumption explicitly in the introduction and derives the optimization problem from the leading-order Taylor expansion of the loss under gradient flow with small initialization. The assumption holds when the initial weights are sufficiently small relative to the data scale. We will insert a short clause in the abstract that both names the optimization problem and notes the small-initialization regime under which the origin saddle dominates the early dynamics. revision: yes
Circularity Check
No significant circularity identified
full rationale
Only the abstract is available, which states the central claim of an ℓ^{1/4} low-rank bias in the optimal escape direction and suggests saddle-to-saddle dynamics via a citation to Jacot (2023). No equations, optimization problem, derivation steps, or proof are provided, so no load-bearing step can be quoted or shown to reduce by construction to its inputs, a fitted parameter, or a self-citation chain. The citation is not used to establish the primary low-rank result within the visible text, leaving the analysis self-contained against external benchmarks as far as the given material permits.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
optimal escape direction features a low-rank bias... first singular value of the ℓ-th layer weight matrix is at least ℓ^{1/4} larger
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
saddle-to-saddle dynamics... increasing bottleneck rank
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
-
A Theory of Saddle Escape in Deep Nonlinear Networks
Derives exact norm-imbalance identity for deep nonlinear nets, classifying activations into four classes and yielding escape time law τ★ = Θ(ε^{-(r-2)}) governed by bottleneck depth r.
-
A Theory of Saddle Escape in Deep Nonlinear Networks
An exact norm-imbalance identity classifies activations into four classes and reduces deep nonlinear training flow to a scalar ODE that predicts saddle escape time scaling as ε to the power of minus (r-2) for r bottle...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.