Saddle-To-Saddle Dynamics in Deep ReLU Networks: Low-Rank Bias in the First Saddle Escape

Arthur Jacot; Ioannis Bantzis; James B. Simon

arXiv: 2505.21722 · v2 · submitted 2025-05-27 · cs.LG · cs.AI · stat.ML

Pith reviewed 2026-05-19 12:12 UTC · model grok-4.3

keywords: deep ReLU networks · saddle escape · low-rank bias · gradient descent dynamics · saddle-to-saddle · bottleneck rank · implicit regularization

The pith

Deep ReLU networks escape the origin saddle along directions with a low-rank bias that grows with layer depth.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies gradient descent in deep ReLU networks that start with small weights. It shows that the network first escapes the saddle at the origin along an optimal direction that carries a low-rank bias in deeper layers, where the leading singular value of the ℓ-th weight matrix exceeds the rest by a factor of at least ℓ^{1/4}. A sympathetic reader would care because this points to training as a saddle-to-saddle process in which the network crosses saddles of steadily rising bottleneck rank and therefore builds effective complexity in stages.

Core claim

The central claim is that the optimal escape direction from the origin saddle in a deep ReLU network initialized with small weights has a low-rank bias in its deeper layers: the first singular value of the ℓ-th layer weight matrix is at least ℓ^{1/4} larger than any other singular value. The authors also establish related properties of these escape directions and propose that deep ReLU networks follow saddle-to-saddle dynamics, visiting a sequence of saddles whose bottleneck ranks increase over time.
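Written out, the bound can be sketched as follows. This is only a transcription of the abstract's wording, read as a multiplicative factor; the symbols W_ℓ (the ℓ-th layer weight matrix of the optimal escape direction) and s_i (the i-th largest singular value) are notation assumed here, not taken from the paper.

```latex
s_1(W_\ell) \;\ge\; \ell^{1/4}\, s_i(W_\ell) \qquad \text{for all } i \ge 2 .
% Illustrative values of the factor \ell^{1/4}:
%   \ell = 2  : 2^{1/4}  \approx 1.19
%   \ell = 16 : 16^{1/4} = 2
%   \ell = 81 : 81^{1/4} = 3
```

The guaranteed gap grows slowly with depth: under this reading, a factor of 2 is only guaranteed from the sixteenth layer on.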

What carries the argument

Optimal escape direction, defined as the solution to an optimization problem that encodes the leading-order gradient descent flow away from the origin saddle.

Load-bearing premise

Gradient descent is initially dominated by the saddle at the origin, so that escape directions are characterized by solving an optimization problem whose objective reflects the leading-order dynamics near zero.
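One way such an optimization problem could arise is sketched below. Everything in it is an assumption made for illustration, since the abstract does not state the problem: a bias-free ReLU network of depth L with parameters θ and output f_θ, a smooth per-sample cost c(f, y), and N training pairs (x_i, y_i). The problem actually used in the paper may differ.

```latex
% Assumption: no biases, so the output is positively homogeneous of degree L.
f_{\lambda\theta}(x) = \lambda^{L} f_{\theta}(x), \qquad \lambda > 0 .
% Expanding the empirical loss around zero output, for a unit direction \|\rho\| = 1:
\mathcal{L}(\lambda\rho)
  = \mathcal{L}(0)
  + \lambda^{L}\,\frac{1}{N}\sum_{i=1}^{N}
      \big\langle \partial_f c(0, y_i),\, f_{\rho}(x_i) \big\rangle
  + O(\lambda^{2L}) .
% Hypothetical form of the optimal escape direction: the unit direction
% along which the leading-order term decreases the loss fastest.
\rho^{\star} \in \arg\min_{\|\rho\| = 1}\;
  \frac{1}{N}\sum_{i=1}^{N}
      \big\langle \partial_f c(0, y_i),\, f_{\rho}(x_i) \big\rangle .
```

Under these assumptions the leading-order term is homogeneous of degree L in the parameters, so the origin has no usable Hessian for L > 2, which is consistent with the abstract's remark that escape directions play the role that Hessian eigenvectors play at strict saddles.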

What would settle it

Run gradient descent from small random weights on a deep ReLU network and check whether, along the first escape trajectory, the ratio of the largest to the second-largest singular value of the ℓ-th weight matrix is at least ℓ^{1/4}.
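A minimal sketch of such a check, using NumPy only, might look like the following. All names here (`init_weights`, `run_until_escape`, `singular_value_gaps`), the squared-error loss, the linear teacher, the layer widths, and the "parameter norm grew tenfold" escape criterion are hypothetical choices for illustration, not the authors' setup. The theorem concerns the optimal escape direction, which a single run from random weights need not follow exactly, so the printed ratios are indicative at best.

```python
import numpy as np

def init_weights(widths, scale, rng):
    """Small Gaussian weights; `scale` is an assumed knob for 'small initialization'."""
    return [scale * rng.standard_normal((n_out, n_in)) / np.sqrt(n_in)
            for n_in, n_out in zip(widths[:-1], widths[1:])]

def forward(weights, X):
    """Bias-free ReLU network with a linear last layer; X has shape (d_in, N)."""
    pre, acts, h = [], [X], X
    last = len(weights) - 1
    for i, W in enumerate(weights):
        z = W @ h
        h = z if i == last else np.maximum(z, 0.0)
        pre.append(z)
        acts.append(h)
    return pre, acts

def loss(weights, X, Y):
    """Half mean squared error (an assumed choice of loss)."""
    _, acts = forward(weights, X)
    return 0.5 * np.sum((acts[-1] - Y) ** 2) / X.shape[1]

def gradients(weights, X, Y):
    """Backpropagation for the loss above."""
    pre, acts = forward(weights, X)
    delta = (acts[-1] - Y) / X.shape[1]          # d loss / d output
    grads = [None] * len(weights)
    for i in reversed(range(len(weights))):
        grads[i] = delta @ acts[i].T
        if i > 0:                                # pass through the ReLU of layer i-1
            delta = (weights[i].T @ delta) * (pre[i - 1] > 0)
    return grads

def singular_value_gaps(weights):
    """Ratio s_1 / s_2 for each weight matrix (each needs at least two singular values)."""
    gaps = []
    for W in weights:
        s = np.linalg.svd(W, compute_uv=False)   # sorted in descending order
        gaps.append(s[0] / max(s[1], np.finfo(float).tiny))
    return gaps

def run_until_escape(weights, X, Y, lr=0.05, growth=10.0, max_steps=20000):
    """Plain GD, stopped once the parameter norm exceeds `growth` times its initial value."""
    def norm(ws):
        return np.sqrt(sum(np.sum(W ** 2) for W in ws))
    start = norm(weights)
    for step in range(max_steps):
        if norm(weights) > growth * start:
            return weights, step
        grads = gradients(weights, X, Y)
        weights = [W - lr * g for W, g in zip(weights, grads)]
    return weights, max_steps

# Usage: compare the measured gap with the predicted lower bound, layer by layer.
rng = np.random.default_rng(0)
widths = [4, 16, 16, 16, 16, 4]
X = rng.standard_normal((4, 64))
Y = rng.standard_normal((4, 4)) @ X              # hypothetical linear teacher
weights, steps = run_until_escape(init_weights(widths, 0.1, rng), X, Y, max_steps=1000)
print(f"stopped after {steps} steps")
for layer, gap in enumerate(singular_value_gaps(weights), start=1):
    print(f"layer {layer}: s1/s2 = {gap:.2f}, predicted lower bound {layer ** 0.25:.2f}")
```

If `steps` equals `max_steps`, the run ended before the escape criterion was met, and the ratios still largely reflect the random initialization. Escape from a small initialization can be slow for deep networks, so a serious test would need a larger step budget, several seeds, and a sweep over `scale`.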

Original abstract

When a deep ReLU network is initialized with small weights, gradient descent (GD) is at first dominated by the saddle at the origin in parameter space. We study the so-called escape directions along which GD leaves the origin, which play a similar role as the eigenvectors of the Hessian for strict saddles. We show that the optimal escape direction features a low-rank bias in its deeper layers: the first singular value of the $\ell$-th layer weight matrix is at least $\ell^{\frac{1}{4}}$ larger than any other singular value. We also prove a number of related results about these escape directions. We suggest that deep ReLU networks exhibit saddle-to-saddle dynamics, with GD visiting a sequence of saddles with increasing bottleneck rank (Jacot, 2023).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper sketches a depth-scaled low-rank bias in the first saddle escape for deep ReLU networks, but the abstract leaves the rigor uncheckable.

The main thing to know is that the optimal escape direction from the origin saddle in deep ReLU nets carries a low-rank bias that scales with depth: the top singular value in layer ℓ is at least ℓ^{1/4} times larger than the rest. This appears at the very beginning of training, when weights are small.

The authors do well in making the escape directions concrete and linking them directly to the saddle-to-saddle picture from earlier work. The depth dependence is a nice addition that could sharpen ideas about why deeper models regularize differently right at the start of training. Framing the escape as solving an optimization problem that captures the leading dynamics away from zero is a reasonable way to make the analogy to Hessian eigenvectors.

The obvious soft spot is the lack of any derivation or proof details in what we have so far. Without seeing how they set up the leading-order dynamics or derive the bound, it's tough to judge whether the result is tight or whether it depends on unstated approximations. The author overlap with the cited Jacot paper is not a problem in itself, but it does put the burden on the new parts to show they are not just rephrasing old quantities. No internal contradictions jump out from the abstract, though.

This kind of paper is for people working on the theory of gradient descent in overparameterized networks. A reader who cares about implicit bias and landscape analysis would find the specific scaling interesting to follow up on, especially if they are already thinking about saddle escapes.

I would send it to peer review. The topic is relevant and the claim is specific enough that referees could check the math once the full text is there. Even with the current limited evidence, it seems worth a closer look rather than a desk reject.

Referee Report

2 major / 1 minor

Summary. The paper claims that in deep ReLU networks initialized with small weights, gradient descent is initially dominated by the saddle at the origin in parameter space. It studies the escape directions from this saddle and shows that the optimal escape direction features a low-rank bias in deeper layers: the first singular value of the ℓ-th layer weight matrix is at least ℓ^{1/4} larger than any other singular value. The work also proves related results about these escape directions and suggests that deep ReLU networks exhibit saddle-to-saddle dynamics, visiting a sequence of saddles with increasing bottleneck rank.

Significance. If the claimed ℓ^{1/4} low-rank bias in the optimal escape direction holds under the stated assumptions, the result would provide a quantitative characterization of the initial implicit bias in deep ReLU training, offering insight into layer-dependent rank preferences and supporting the saddle-to-saddle trajectory picture in neural network optimization.

major comments (2)

[Abstract] The central quantitative claim is that the first singular value of the ℓ-th layer in the optimal escape direction is at least ℓ^{1/4} larger than any other. The abstract supplies neither the explicit optimization problem whose solution defines this 'optimal escape direction' nor any derivation or proof sketch showing how the ℓ^{1/4} gap follows from the leading-order dynamics away from the origin saddle.

[Abstract] The analysis presupposes that gradient descent is initially dominated by the saddle at the origin and that escape directions are characterized as the solution to an optimization problem encoding the leading-order dynamics. The abstract does not state the precise form of this optimization problem or the conditions under which the domination assumption holds, leaving the load-bearing step unverified.

minor comments (1)

The abstract refers to 'a number of related results about these escape directions' without enumerating them; the full manuscript should list and briefly describe these results to permit evaluation of their scope and novelty.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their comments, which correctly identify that the abstract is too terse regarding the definition of the optimal escape direction and the underlying optimization problem. We will revise the abstract to include a concise statement of the optimization problem and the small-initialization assumption that justifies domination by the origin saddle.

Point-by-point responses

Referee: [Abstract] The central quantitative claim is that the first singular value of the ℓ-th layer in the optimal escape direction is at least ℓ^{1/4} larger than any other. The abstract supplies neither the explicit optimization problem whose solution defines this 'optimal escape direction' nor any derivation or proof sketch showing how the ℓ^{1/4} gap follows from the leading-order dynamics away from the origin saddle.

Authors: We agree that the abstract does not state the optimization problem or sketch the derivation. The full manuscript defines the optimal escape direction as the solution to a variational problem that maximizes the leading-order growth rate of the loss under the linearized dynamics near the origin; the ℓ^{1/4} factor then follows from balancing the contributions of successive layers in this variational problem. We will add one sentence to the abstract that names this optimization problem and indicates that the gap is obtained by analyzing the layer-wise singular-value scaling in its solution.
revision: yes

Referee: [Abstract] The analysis presupposes that gradient descent is initially dominated by the saddle at the origin and that escape directions are characterized as the solution to an optimization problem encoding the leading-order dynamics. The abstract does not state the precise form of this optimization problem or the conditions under which the domination assumption holds, leaving the load-bearing step unverified.

Authors: The manuscript states the domination assumption explicitly in the introduction and derives the optimization problem from the leading-order Taylor expansion of the loss under gradient flow with small initialization. The assumption holds when the initial weights are sufficiently small relative to the data scale. We will insert a short clause in the abstract that both names the optimization problem and notes the small-initialization regime under which the origin saddle dominates the early dynamics.
revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

Only the abstract is available, which states the central claim of an ℓ^{1/4} low-rank bias in the optimal escape direction and suggests saddle-to-saddle dynamics via a citation to Jacot (2023). No equations, optimization problem, derivation steps, or proof are provided, so no load-bearing step can be quoted or shown to reduce by construction to its inputs, a fitted parameter, or a self-citation chain. The citation is not used to establish the primary low-rank result within the visible text, leaving the analysis self-contained against external benchmarks as far as the given material permits.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the central claim rests on standard assumptions about ReLU networks, gradient flow near the origin, and the existence of an 'optimal' escape direction whose precise definition is not supplied.

pith-pipeline@v0.9.0 · 5647 in / 1258 out tokens · 42890 ms · 2026-05-19T12:12:26.904135+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "optimal escape direction features a low-rank bias... first singular value of the ℓ-th layer weight matrix is at least ℓ^{1/4} larger"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "saddle-to-saddle dynamics... increasing bottleneck rank"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

A Theory of Saddle Escape in Deep Nonlinear Networks
cs.LG · 2026-05 · unverdicted · novelty 7.0

Derives exact norm-imbalance identity for deep nonlinear nets, classifying activations into four classes and yielding escape time law τ★ = Θ(ε^{-(r-2)}) governed by bottleneck depth r.

A Theory of Saddle Escape in Deep Nonlinear Networks
cs.LG · 2026-05 · conditional · novelty 7.0

An exact norm-imbalance identity classifies activations into four classes and reduces deep nonlinear training flow to a scalar ODE that predicts saddle escape time scaling as ε^{-(r-2)} for r bottle...