pith. machine review for the scientific record.

arxiv: 2512.21075 · v2 · submitted 2025-12-24 · 💻 cs.LG · cs.AI · math.PR · stat.ML

Feature Learning Dynamics in Infinite-Depth Neural Networks

Pith reviewed 2026-05-16 20:13 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · math.PR · stat.ML
keywords ResNet · feature learning · depth-μP scaling · SDE limits · infinite depth · forward-backward coupling · weight reuse

The pith

Finite ResNet training dynamics converge to a decoupled Neural Feature Dynamics limit with O(L^{-1}) error under depth-μP scaling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that in one-layer ResNets under depth-μP scaling, the forward-backward coupling caused by reusing the same weights in backpropagation becomes negligible as depth L grows to infinity. This allows the training process to be described by a simpler system of stochastic differential equations called Neural Feature Dynamics, which keeps the important feature-gradient covariances but drops the depth-dependent correlations. A reader would care because this gives a rigorous way to analyze how features evolve in very deep networks without having to track every correlation induced by weight sharing. The result bridges practical finite networks to a continuous infinite-depth description with explicit error rates.

Core claim

Under nondegeneracy assumptions on the feature-gradient covariance structure, the finite-network training dynamics converge to the Neural Feature Dynamics limit with an O(L^{-1}) depth-discretization error, while the reused-weight coupling term decays faster at O(L^{-2}). At initialization the coupling vanishes as O(n^{-1}) uniformly in depth, and during SGD training the surviving correlation is higher order in depth and accumulates negligibly over layers as L approaches infinity.

What carries the argument

Neural Feature Dynamics (NFD), a forward-backward SDE system with decoupled backward weights that retains the feature-gradient covariance structure generated during training.
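
The difference between a shared-weight backward pass and a decoupled one can be sketched in plain Python. This is a minimal illustration, not the paper's construction: the residual update x_{l+1} = x_l + (L n)^{-1/2} W_l tanh(x_l), the tanh activation, the Gaussian weights, and the names init_weights, forward, and backward are all assumptions chosen for the sketch. The decoupled pass uses an independent draw of the weights, in the spirit of the i.i.d.-copy setup described in the Figure 4 caption.

```python
import math
import random


def init_weights(depth, width, rng):
    """Draw one width x width matrix of N(0, 1) entries per layer (assumed init)."""
    return [[[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(width)]
            for _ in range(depth)]


def forward(x0, weights):
    """Assumed residual update: x_{l+1} = x_l + W_l tanh(x_l) / sqrt(L * n)."""
    depth, width = len(weights), len(x0)
    scale = 1.0 / math.sqrt(depth * width)
    states = [list(x0)]
    for W in weights:
        x = states[-1]
        act = [math.tanh(v) for v in x]
        states.append([x[i] + scale * sum(W[i][j] * act[j] for j in range(width))
                       for i in range(width)])
    return states


def backward(states, back_weights, g_out):
    """Propagate a gradient from the last layer to the input.

    Passing the forward weights gives ordinary backpropagation (shared weights).
    Passing an independent copy gives the decoupled variant.
    """
    depth, width = len(back_weights), len(g_out)
    scale = 1.0 / math.sqrt(depth * width)
    g = list(g_out)
    for layer in reversed(range(depth)):
        W, x = back_weights[layer], states[layer]
        # (W^T g)_j = sum_i W[i][j] * g[i]
        wt_g = [sum(W[i][j] * g[i] for i in range(width)) for j in range(width)]
        # tanh'(x) = 1 - tanh(x)^2
        g = [g[j] + scale * (1.0 - math.tanh(x[j]) ** 2) * wt_g[j]
             for j in range(width)]
    return g


rng = random.Random(0)
depth, width = 16, 8  # illustrative sizes only
weights = init_weights(depth, width, rng)
fresh_copy = init_weights(depth, width, rng)  # independent weights for the decoupled pass
x0 = [rng.gauss(0.0, 1.0) for _ in range(width)]
g_out = [rng.gauss(0.0, 1.0) for _ in range(width)]

states = forward(x0, weights)
g_shared = backward(states, weights, g_out)        # reuses W_l^T
g_decoupled = backward(states, fresh_copy, g_out)  # uses an i.i.d. copy instead
gap = math.sqrt(sum((a - b) ** 2 for a, b in zip(g_shared, g_decoupled)))
print("distance between shared and decoupled input gradients:", gap)
```

Repeating the comparison over a range of depths and widths is one way to see how the two backward passes relate empirically; the sketch itself makes no claim about the rate.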

If this is right

  • The reused-weight forward-backward correlation becomes negligible in the infinite-depth limit.
  • Feature learning dynamics can be modeled by forward and backward processes that are decoupled at each layer.
  • The infinite-depth limit preserves the covariance structure produced by SGD while removing depth-induced correlations.
  • Convergence holds uniformly under the stated nondegeneracy conditions on the covariance process.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar decoupling might appear in other residual architectures when depth-μP scaling is used.
  • The O(L^{-2}) decay of coupling could guide the design of depth-aware training schedules that exploit the faster suppression.
  • The NFD equations might serve as a reduced-order model for studying generalization bounds in deep residual networks.

Load-bearing premise

Nondegeneracy assumptions on the feature-gradient covariance structure generated during training are needed for the SDE limit to exist and for the coupling to remain higher-order in depth.

What would settle it

Numerical experiments that measure the distance between finite-depth ResNet training trajectories and the NFD solution for increasing L and check whether the error scales as O(1/L) rather than slower.
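
One way to run such a check is to fit a slope on a log-log plot of error against depth. The sketch below assumes the errors have already been measured. The values used here are synthetic placeholders generated as 0.5/L, not results from the paper, and loglog_slope is a hypothetical helper name. A fitted slope near -1 would be consistent with O(1/L) decay, and a slope near -2 with O(L^{-2}).

```python
import math


def loglog_slope(depths, errors):
    """Least-squares slope of log(error) against log(depth)."""
    xs = [math.log(d) for d in depths]
    ys = [math.log(e) for e in errors]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


depths = [4, 8, 16, 32, 64, 128]
# Synthetic placeholder errors that decay exactly as 1/L (NOT measured data).
synthetic_errors = [0.5 / L for L in depths]
slope = loglog_slope(depths, synthetic_errors)
print("fitted exponent:", slope)
```

With real measurements the fit is only meaningful once the width is large enough that the finite-width contribution to the error does not dominate the depth contribution.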

Figures

Figures reproduced from arXiv: 2512.21075 by Ruoyu Wu, Tianxiang Gao, Zihan Yao.

Figure 1: Pre- and post-act ResNets debate under depth-µP. In (a), the pre-act variant maintains stable feature across depth, whereas the post-act exhibits rapid growth. In (b)-(c), we train depth-64 width-128 ResNets with ReLU on CIFAR-10 under SGD (LR 0.01, batch size 128). The stability from pre-act design yields faster convergence and lower test loss with reduced variance across runs. This assumption is rigorous…
Figure 2: Effect of the time horizon T on ResNet performance. We train width-128 ReLU ResNets on CIFAR-10 using SGD (LR 0.1, batch size 128) across 3 random seeds. (a–b) At fixed depth 5, moderate T (e.g., 2) improves capacity and performance, whereas very large T (e.g., 32) causes unstable training and degraded performance. (c–d) When varying both T and depth L, moderate increases improve performance, while excess…
Figure 3: Convergence to NFD at initialization and after 30 epochs. We evaluate depth-µP ResNets on CIFAR-10 using SGD (LR 0.01, batch size 128). Widths range from 32 to 1024, and depths from 4 to 128. The approximation error decays as O(1/L + 1/n) when increasing depth and width. This uniform behavior confirms that the width and depth limits are empirically commutable both at initialization and during training. …
Figure 4: Empirical evaluation of GIA restoration. We train all models with width 128 on CIFAR-10 using SGD (LR 0.1 and batch size 128), comparing standard training (shared forward/backward weights) with a decoupled setup where the backward pass uses an i.i.d. copy of the forward weights. As depth increases: (a) Vanilla DNNs suffer from vanishing gradients and make little progress, and the two trajectories remain mi…
Figure 5: Internal learning collapse and recovery in two-layer residual blocks. In Row 1, we train a width-128 ReLU ResNet on CIFAR-10 using SGD (batch size 128, η_c = 0.1) for 300 steps across depths L. Under standard depth-µP, the internal feature-update collapses at rate 1/√L, indicating that the first layer stops learning as depth grows. Our depth-aware learning rate η_1 = η_c n √L restores active feature learni…
Figure 6: Minimum eigenvalues of the covariance matrices during training. We evaluate ResNets on CIFAR-10 using online SGD (learning rate 0.1, batch size 1) across 5 seeds, with 4 hidden layers and widths ranging from 512 to 4096. The minimum eigenvalues of Σ_t^{(k)} and Θ_t^{(k)} remain strictly positive across layers, validating Assumption 1. Although the eigenvalues decrease over training, they grow with network widt…
Figure 7: Empirical evaluation of GIA restoration at width 256. We repeat the experiment of …
Original abstract

Deep neural networks have achieved remarkable success in practice, yet a mechanistic understanding of how features evolve during training remains incomplete, especially in the large-depth limit. For ResNets under depth-$\mu$P scaling, prior work treats the layer index $\ell$ as a continuous time $t_\ell = \ell/L$, yielding SDE descriptions of the training dynamics. A key unresolved issue is that backpropagation reuses each forward weight matrix $W_\ell$ through its transpose $W_\ell^\top$, creating correlations between forward features and backward gradients whose behavior and role in feature learning remain unclear. We study this reused-weight forward--backward coupling in one-layer ResNets under depth-$\mu$P. Using conditional Gaussian representations, we explicitly separate the coupling terms induced by weight reuse from decoupled Gaussian fluctuations before taking any network limit. At initialization, we prove that the coupling is a finite-width effect and vanishes at rate $O(n^{-1})$, uniformly over depth. During training, however, SGD induces a nontrivial forward--backward correlation term that survives the infinite-width limit. The key depth effect is that, under depth-$\mu$P scaling, this surviving term is higher order in depth and its accumulated contribution over layers becomes negligible as $L\to\infty$. This depth-induced suppression motivates Neural Feature Dynamics (NFD), a forward--backward SDE system with decoupled backward weights that retains the feature-gradient covariance structure generated during training. Under nondegeneracy assumptions, we prove that the finite-network training dynamics converge to its NFD limit with an $O(L^{-1})$ depth-discretization error, while the reused-weight coupling term has a faster $O(L^{-2})$ decay. These results provide a rigorous infinite-depth limit for the feature-learning dynamics of one-layer ResNets under depth-$\mu$P.
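
To make the continuous-time reading concrete, the following is a sketch of one commonly used form of a depth-$\mu$P residual update. The exact parametrization, the activation $\phi$, and the Gaussian weight convention are assumptions made here for illustration, since the abstract does not spell them out.

```latex
% Assumed one-layer residual update with width n and depth L,
% where W_\ell has i.i.d. N(0,1) entries (an assumption, not quoted from the paper):
x_{\ell+1} = x_\ell + \frac{1}{\sqrt{L}} \cdot \frac{1}{\sqrt{n}}\, W_\ell\, \phi(x_\ell),
\qquad \ell = 0, \dots, L-1.

% With t_\ell = \ell / L and \Delta t = 1/L, the branch multiplier equals \sqrt{\Delta t}:
x_{t_\ell + \Delta t} - x_{t_\ell}
  = \sqrt{\Delta t}\; \frac{1}{\sqrt{n}}\, W_\ell\, \phi(x_{t_\ell}).
```

Under these assumptions, and at initialization, each increment is of order $\sqrt{\Delta t}$ and is driven by a fresh Gaussian matrix, which is the scaling of an Euler-Maruyama step for a diffusion. That is why an SDE is the natural candidate for the limit as $L \to \infty$.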

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper studies the reused-weight forward-backward coupling in one-layer ResNets under depth-μP scaling. Using conditional Gaussian representations, it separates coupling terms from decoupled fluctuations before taking limits. At initialization the coupling vanishes as O(n^{-1}) uniformly in depth; during SGD training a nontrivial correlation survives infinite width but is shown to be higher-order in depth under depth-μP, becoming negligible as L→∞. This motivates the Neural Feature Dynamics (NFD) forward-backward SDE with decoupled backward weights. Under nondegeneracy assumptions on the feature-gradient covariance, the finite-network dynamics converge to the NFD limit with O(L^{-1}) depth-discretization error while the coupling decays as O(L^{-2}).

Significance. If the central claims hold, the work supplies a rigorous infinite-depth limit for feature-learning dynamics in ResNets, explicitly quantifying the suppression of weight-reuse correlations and providing a decoupled SDE model that preserves training-induced covariances. The conditional-Gaussian separation before limits and the explicit convergence rates are technically valuable contributions to the theory of depth scaling.

major comments (2)
  1. [Abstract] The O(L^{-2}) decay of the reused-weight coupling and the existence of the NFD SDE limit both rest on nondegeneracy of the feature-gradient covariance generated during training; the manuscript states these assumptions but supplies no verification, relaxation, or analysis of whether the minimal eigenvalue remains bounded away from zero as training proceeds.
  2. [Abstract] The claimed convergence of finite-network SGD dynamics to the NFD limit with O(L^{-1}) discretization error requires that the conditional-Gaussian separation and subsequent limits commute with the SGD-induced correlations; without the full derivations it is unclear how the nondegeneracy conditions are used to control the error terms.
minor comments (2)
  1. The term 'one-layer ResNets' should be defined explicitly in the introduction, as it is nonstandard and may be confused with standard residual blocks.
  2. Notation for depth-μP scaling and the continuous-time variable t_ℓ = ℓ/L should be introduced with a short self-contained paragraph before the main results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation of the significance of our results and for the detailed comments. We address each major comment below, indicating where revisions will be made to improve clarity while preserving the theoretical focus of the work.

  1. Referee: [Abstract] The O(L^{-2}) decay of the reused-weight coupling and the existence of the NFD SDE limit both rest on nondegeneracy of the feature-gradient covariance generated during training; the manuscript states these assumptions but supplies no verification, relaxation, or analysis of whether the minimal eigenvalue remains bounded away from zero as training proceeds.

    Authors: We agree that the nondegeneracy assumption on the minimal eigenvalue of the feature-gradient covariance is essential for the stated rates and for invertibility in the error analysis. The manuscript states the assumption explicitly in the theorems but provides neither empirical verification nor relaxation. As the contribution is primarily theoretical, we will add a short discussion paragraph in the revised introduction and conclusion noting that the assumption is expected to hold generically for non-degenerate data distributions (consistent with prior feature-learning analyses) and flagging relaxation as an open direction. No change to the core proofs is planned. revision: partial

  2. Referee: [Abstract] The claimed convergence of finite-network SGD dynamics to the NFD limit with O(L^{-1}) discretization error requires that the conditional-Gaussian separation and subsequent limits commute with the SGD-induced correlations; without the full derivations it is unclear how the nondegeneracy conditions are used to control the error terms.

    Authors: The conditional-Gaussian representation is applied at the finite-width, finite-depth level before any limits are taken, as set out in Section 3. Nondegeneracy is then used to bound the operator norm of the inverse covariance, which in turn controls the accumulation of discretization and correlation errors across the L layers, yielding the O(L^{-1}) rate. The complete argument, including the commutation of the separation with SGD-induced correlations, appears in Appendix B. We will insert a one-paragraph proof sketch immediately after the statement of the main convergence theorem to highlight the role of nondegeneracy without requiring the reader to turn to the appendix. revision: partial
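
The step from nondegeneracy to a bounded inverse is a standard linear-algebra fact, sketched here for a generic symmetric positive definite covariance matrix $\Sigma$. The lower bound $\lambda_0$ is a hypothetical symbol introduced for illustration and is not taken from the paper.

```latex
% For symmetric positive definite \Sigma with \lambda_{\min}(\Sigma) \ge \lambda_0 > 0:
\|\Sigma^{-1}\|_{\mathrm{op}} = \frac{1}{\lambda_{\min}(\Sigma)} \le \frac{1}{\lambda_0},
\qquad
\|\Sigma^{-1} v\| \le \frac{\|v\|}{\lambda_0} \quad \text{for every vector } v.
```

This shows why a minimal eigenvalue that drifts toward zero during training would weaken any bound that passes through $\Sigma^{-1}$, which is the concern raised in the first major comment.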

Circularity Check

0 steps flagged

No significant circularity; derivation proceeds from explicit finite-width representations and stated assumptions

The paper begins with conditional Gaussian representations of finite-width one-layer ResNets, explicitly separates reused-weight forward-backward coupling terms from decoupled fluctuations, proves O(n^{-1}) vanishing at initialization uniformly in depth, shows the coupling survives infinite width under SGD but is suppressed to O(L^{-2}) under depth-μP scaling, and defines NFD as the resulting decoupled SDE. Convergence of finite-network dynamics to this NFD limit is proved with O(L^{-1}) discretization error under explicitly stated nondegeneracy assumptions on the feature-gradient covariance. No equation or claim reduces by construction to a fitted parameter, self-referential definition, or load-bearing self-citation; the nondegeneracy conditions are external to the derivation and required for SDE existence rather than presupposed by it.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The paper relies on depth-μP scaling and conditional Gaussian representations from prior literature as background; the main addition is the analysis of the coupling term and construction of NFD.

axioms (1)
  • [domain assumption] Nondegeneracy assumptions on the feature-gradient covariance structure
    Invoked to ensure the forward-backward SDE system is well-defined and the infinite-depth limit exists.
invented entities (1)
  • Neural Feature Dynamics (NFD) [no independent evidence]
    purpose: Forward-backward SDE system with decoupled backward weights that retains training-induced feature-gradient covariance
    New limiting object introduced after proving the coupling is negligible at infinite depth.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. The Effective Depth Paradox: Evaluating the Relationship between Architectural Topology and Trainability in Deep CNNs

    cs.CV · 2026-02 · unverdicted · novelty 4.0

    Effective depth, an operational count of sequential transformations, predicts CNN trainability better than nominal layer count because shortcuts and branches decouple the two.
