Recognition: 2 theorem links · Lean Theorem
Feature Learning Dynamics in Infinite-Depth Neural Networks
Pith reviewed 2026-05-16 20:13 UTC · model grok-4.3
The pith
Finite ResNet training dynamics converge to a decoupled Neural Feature Dynamics limit with O(L^{-1}) error under depth-μP scaling.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under nondegeneracy assumptions on the feature-gradient covariance structure, the finite-network training dynamics converge to the Neural Feature Dynamics limit with an O(L^{-1}) depth-discretization error, while the reused-weight coupling term decays faster at O(L^{-2}). At initialization the coupling vanishes as O(n^{-1}) uniformly in depth, and during SGD training the surviving correlation is higher order in depth and accumulates negligibly over layers as L approaches infinity.
What carries the argument
Neural Feature Dynamics (NFD), a forward-backward SDE system with decoupled backward weights that retains the feature-gradient covariance structure generated during training.
If this is right
- The reused-weight forward-backward correlation becomes negligible in the infinite-depth limit.
- Feature learning dynamics can be modeled by forward and backward processes that are decoupled at each layer.
- The infinite-depth limit preserves the covariance structure produced by SGD while removing depth-induced correlations.
- Convergence holds uniformly under the stated nondegeneracy conditions on the covariance process.
Where Pith is reading between the lines
- Similar decoupling might appear in other residual architectures when depth-μP scaling is used.
- The O(L^{-2}) decay of coupling could guide the design of depth-aware training schedules that exploit the faster suppression.
- The NFD equations might serve as a reduced-order model for studying generalization bounds in deep residual networks.
Load-bearing premise
Nondegeneracy assumptions on the feature-gradient covariance structure generated during training are needed for the SDE limit to exist and for the coupling to remain higher-order in depth.
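One way to monitor this premise empirically is to track the smallest eigenvalue of an estimated covariance during training. The sketch below is illustrative only. The paper's exact definition of the feature-gradient covariance is not reproduced here, so the code assumes a hypothetical 2x2 second-moment matrix built from paired scalar samples (h, g) of one feature coordinate and one gradient coordinate. The helper names `second_moment_2x2` and `min_eig_2x2` and the sample data are assumptions, not taken from the paper.

```python
import math

def second_moment_2x2(pairs):
    """Empirical second-moment matrix [[a, b], [b, c]] of (feature, gradient) samples."""
    n = len(pairs)
    a = sum(h * h for h, _ in pairs) / n
    b = sum(h * g for h, g in pairs) / n
    c = sum(g * g for _, g in pairs) / n
    return a, b, c

def min_eig_2x2(a, b, c):
    """Smallest eigenvalue of the symmetric matrix [[a, b], [b, c]] in closed form."""
    return 0.5 * (a + c) - math.sqrt((0.5 * (a - c)) ** 2 + b * b)

# Hypothetical samples: in the first set g is an exact multiple of h (degenerate),
# in the second set the two coordinates are uncorrelated with equal variance.
degenerate = [(1.0, 2.0), (-1.0, -2.0), (0.5, 1.0), (-0.5, -1.0)]
nondegenerate = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]

lam_deg = min_eig_2x2(*second_moment_2x2(degenerate))
lam_ok = min_eig_2x2(*second_moment_2x2(nondegenerate))
print(lam_deg, lam_ok)
```

In a real check, the same quantity would be computed from the full covariance at several training steps, and the premise would be in doubt if the smallest eigenvalue drifted toward zero.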
What would settle it
Numerical experiments that measure the distance between finite-depth ResNet training trajectories and the NFD solution for increasing L, and check whether the error decays as O(1/L) rather than more slowly.
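The scaling check described above can be sketched on a toy problem. The snippet below is a minimal illustration and not the paper's experiment. It uses a deterministic scalar residual recursion h <- h + (1/L) f(h) as a stand-in for a finite-depth network, and the exact ODE solution as a stand-in for the infinite-depth limit. The functions `residual_forward` and `fit_loglog_slope` and the choice f(h) = -h are assumptions made for illustration. A real test would replace both with finite-depth ResNet training trajectories and a numerical NFD solution.

```python
import math

def residual_forward(h0, depth):
    """Toy residual stack: h <- h + (1/depth) * f(h), with f(h) = -h (hypothetical stand-in)."""
    h = h0
    for _ in range(depth):
        h = h + (1.0 / depth) * (-h)
    return h

def fit_loglog_slope(depths, errors):
    """Least-squares slope of log(error) against log(depth)."""
    xs = [math.log(d) for d in depths]
    ys = [math.log(e) for e in errors]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

depths = [16, 32, 64, 128, 256, 512]
limit = math.exp(-1.0)  # exact solution of dh/dt = -h at t = 1 with h(0) = 1
errors = [abs(residual_forward(1.0, L) - limit) for L in depths]
slope = fit_loglog_slope(depths, errors)
print(f"fitted slope: {slope:.3f}")  # a slope near -1 is consistent with O(1/L) error
```

A fitted slope clearly shallower than -1 on the real trajectories would count against the claimed rate, while a slope near -1 would be consistent with it.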
Original abstract
Deep neural networks have achieved remarkable success in practice, yet a mechanistic understanding of how features evolve during training remains incomplete, especially in the large-depth limit. For ResNets under depth-$\mu$P scaling, prior work treats the layer index $\ell$ as a continuous time $t_\ell = \ell/L$, yielding SDE descriptions of the training dynamics. A key unresolved issue is that backpropagation reuses each forward weight matrix $W_\ell$ through its transpose $W_\ell^\top$, creating correlations between forward features and backward gradients whose behavior and role in feature learning remain unclear. We study this reused-weight forward--backward coupling in one-layer ResNets under depth-$\mu$P. Using conditional Gaussian representations, we explicitly separate the coupling terms induced by weight reuse from decoupled Gaussian fluctuations before taking any network limit. At initialization, we prove that the coupling is a finite-width effect and vanishes at rate $O(n^{-1})$, uniformly over depth. During training, however, SGD induces a nontrivial forward--backward correlation term that survives the infinite-width limit. The key depth effect is that, under depth-$\mu$P scaling, this surviving term is higher order in depth and its accumulated contribution over layers becomes negligible as $L\to\infty$. This depth-induced suppression motivates Neural Feature Dynamics (NFD), a forward--backward SDE system with decoupled backward weights that retains the feature-gradient covariance structure generated during training. Under nondegeneracy assumptions, we prove that the finite-network training dynamics converge to its NFD limit with an $O(L^{-1})$ depth-discretization error, while the reused-weight coupling term has a faster $O(L^{-2})$ decay. These results provide a rigorous infinite-depth limit for the feature-learning dynamics of one-layer ResNets under depth-$\mu$P.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper studies the reused-weight forward-backward coupling in one-layer ResNets under depth-μP scaling. Using conditional Gaussian representations, it separates coupling terms from decoupled fluctuations before taking limits. At initialization the coupling vanishes as O(n^{-1}) uniformly in depth; during SGD training a nontrivial correlation survives infinite width but is shown to be higher-order in depth under depth-μP, becoming negligible as L→∞. This motivates the Neural Feature Dynamics (NFD) forward-backward SDE with decoupled backward weights. Under nondegeneracy assumptions on the feature-gradient covariance, the finite-network dynamics converge to the NFD limit with O(L^{-1}) depth-discretization error while the coupling decays as O(L^{-2}).
Significance. If the central claims hold, the work supplies a rigorous infinite-depth limit for feature-learning dynamics in ResNets, explicitly quantifying the suppression of weight-reuse correlations and providing a decoupled SDE model that preserves training-induced covariances. The conditional-Gaussian separation before limits and the explicit convergence rates are technically valuable contributions to the theory of depth scaling.
major comments (2)
- [Abstract] The O(L^{-2}) decay of the reused-weight coupling and the existence of the NFD SDE limit both rest on nondegeneracy of the feature-gradient covariance generated during training; the manuscript states these assumptions but supplies no verification, relaxation, or analysis of whether the minimal eigenvalue remains bounded away from zero as training proceeds.
- [Abstract] The claimed convergence of finite-network SGD dynamics to the NFD limit with O(L^{-1}) discretization error requires that the conditional-Gaussian separation and subsequent limits commute with the SGD-induced correlations; without the full derivations it is unclear how the nondegeneracy conditions are used to control the error terms.
minor comments (2)
- The term 'one-layer ResNets' should be defined explicitly in the introduction, as it is nonstandard and may be confused with standard residual blocks.
- Notation for depth-μP scaling and the continuous-time variable t_ℓ = ℓ/L should be introduced with a short self-contained paragraph before the main results.
Simulated Author's Rebuttal
We thank the referee for the positive evaluation of the significance of our results and for the detailed comments. We address each major comment below, indicating where revisions will be made to improve clarity while preserving the theoretical focus of the work.
- Referee: [Abstract] The O(L^{-2}) decay of the reused-weight coupling and the existence of the NFD SDE limit both rest on nondegeneracy of the feature-gradient covariance generated during training; the manuscript states these assumptions but supplies no verification, relaxation, or analysis of whether the minimal eigenvalue remains bounded away from zero as training proceeds.
  Authors: We agree that the nondegeneracy assumption on the minimal eigenvalue of the feature-gradient covariance is essential for the stated rates and for invertibility in the error analysis. The manuscript states the assumption explicitly in the theorems but provides neither empirical verification nor relaxation. As the contribution is primarily theoretical, we will add a short discussion paragraph in the revised introduction and conclusion noting that the assumption is expected to hold generically for non-degenerate data distributions (consistent with prior feature-learning analyses) and flagging relaxation as an open direction. No change to the core proofs is planned.
  Revision: partial
- Referee: [Abstract] The claimed convergence of finite-network SGD dynamics to the NFD limit with O(L^{-1}) discretization error requires that the conditional-Gaussian separation and subsequent limits commute with the SGD-induced correlations; without the full derivations it is unclear how the nondegeneracy conditions are used to control the error terms.
  Authors: The conditional-Gaussian representation is applied at the finite-width, finite-depth level before any limits are taken, as set out in Section 3. Nondegeneracy is then used to bound the operator norm of the inverse covariance, which in turn controls the accumulation of discretization and correlation errors across the L layers, yielding the O(L^{-1}) rate. The complete argument, including the commutation of the separation with SGD-induced correlations, appears in Appendix B. We will insert a one-paragraph proof sketch immediately after the statement of the main convergence theorem to highlight the role of nondegeneracy without requiring the reader to turn to the appendix.
  Revision: partial
Circularity Check
No significant circularity; derivation proceeds from explicit finite-width representations and stated assumptions
The paper begins with conditional Gaussian representations of finite-width one-layer ResNets, explicitly separates reused-weight forward-backward coupling terms from decoupled fluctuations, proves O(n^{-1}) vanishing at initialization uniformly in depth, shows the coupling survives infinite width under SGD but is suppressed to O(L^{-2}) under depth-μP scaling, and defines NFD as the resulting decoupled SDE. Convergence of finite-network dynamics to this NFD limit is proved with O(L^{-1}) discretization error under explicitly stated nondegeneracy assumptions on the feature-gradient covariance. No equation or claim reduces by construction to a fitted parameter, self-referential definition, or load-bearing self-citation; the nondegeneracy conditions are external to the derivation and required for SDE existence rather than presupposed by it.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Nondegeneracy assumptions on the feature-gradient covariance structure
invented entities (1)
- Neural Feature Dynamics (NFD): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "Under nondegeneracy assumptions, we prove that the finite-network training dynamics converge to its NFD limit with an O(L^{-1}) depth-discretization error, while the reused-weight coupling term has a faster O(L^{-2}) decay."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "NFD: a coupled forward-backward SDE system driven by independent Brownian motions"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- The Effective Depth Paradox: Evaluating the Relationship between Architectural Topology and Trainability in Deep CNNs
  Effective depth, an operational count of sequential transformations, predicts CNN trainability better than nominal layer count because shortcuts and branches decouple the two.