
arxiv: 2605.22010 · v1 · pith:NS4FNXUZnew · submitted 2026-05-21 · 📊 stat.ML · cs.LG

Uniform-in-Time Weak Propagation-of-Chaos in Shallow Neural Networks

Pith reviewed 2026-05-22 04:26 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords: propagation of chaos · mean-field limit · shallow neural networks · gradient descent · uniform-in-time bounds · Wasserstein gradient flow · weak convergence

The pith

Finite-width shallow networks stay close to their infinite-width mean-field limit for all training times under polynomial loss decay.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that one-hidden-layer networks trained by gradient descent remain close to their infinite-width mean-field counterparts uniformly in time, over an infinite horizon. Short-time bounds follow from standard Gronwall estimates, but long-time control usually requires (local) strong convexity or noise; here the authors instead use the decay rate of the mean-field loss itself to bound the accumulated fluctuations. This matters because the result implies that poly(d/ε) neurons, samples, and gradient steps suffice to reach loss ε whenever the mean-field dynamics converges faster than t^{-2}, with no landscape assumptions near the optimum and no added noise. The bound extends directly to finite-sample and discrete-time versions.

Core claim

We establish non-asymptotic weak propagation-of-chaos that holds uniformly in time, obtained by exploiting the convergence rate of the mean-field deterministic Wasserstein-gradient-flow dynamics. Denoting by L_t the mean-field excess MSE loss at time t and by m the number of neurons, under standard regularity assumptions and the condition ∫_0^∞ L_t^{1/2} dt = O(log d), we obtain the uniform-in-time bound ‖f_ρ_t^MF − f_ρ̂_t^m‖² ≲ poly(d) m^{-min(1,c/6)} whenever L_t ≲ t^{-c}.
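To make the shape of the stated rate concrete, the following minimal sketch evaluates the exponent min(1, c/6) and the resulting bound for a few decay exponents. The `prefactor` argument is a hypothetical stand-in for the unspecified poly(d) factor, and the function names are illustrative only, not taken from the paper.

```python
def width_exponent(c: float) -> float:
    """Exponent a in the stated rate m**(-a), with a = min(1, c/6)."""
    if c <= 0:
        raise ValueError("decay exponent c must be positive")
    return min(1.0, c / 6.0)


def gap_bound(m: int, c: float, prefactor: float = 1.0) -> float:
    """Evaluate prefactor * m**(-min(1, c/6)).

    `prefactor` is a hypothetical placeholder for the poly(d) factor,
    which the summary does not specify.
    """
    return prefactor * m ** (-width_exponent(c))


if __name__ == "__main__":
    # The exponent grows linearly in c and saturates at 1 once c >= 6.
    for c in (2.5, 3.0, 6.0, 12.0):
        print(f"c={c:5.1f}  exponent={width_exponent(c):.3f}  "
              f"bound at m=10^4: {gap_bound(10_000, c):.2e}")
```

Under this reading, decay faster than t^{-6} does not improve the width dependence beyond m^{-1}, while exponents just above 2 give a width exponent just above 1/3.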

What carries the argument

The mean-field excess MSE loss L_t together with the integral condition on its square root, which controls accumulated fluctuations via the mean-field convergence rate to yield the uniform propagation-of-chaos bound.

If this is right

  • Whenever the mean-field population loss converges faster than t^{-2}, loss ε is attainable with only poly(d/ε) neurons, samples, and gradient steps.
  • The uniform bound extends seamlessly to finite training samples and to time-discretized gradient descent.
  • The result requires no assumptions on landscape geometry near the optimum and holds in noiseless dynamics.
  • The same argument applies to other discretization schemes beyond finite width.
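As back-of-envelope arithmetic for the first bullet, the sketch below inverts the two power laws: the smallest width m for which the stated gap bound falls below ε, and the smallest time t at which a loss of exactly t^{-c} falls below ε. It treats the bounds as if they were tight and replaces poly(d) by a hypothetical constant `prefactor`, so the numbers indicate scaling only, not the paper's actual constants.

```python
import math


def required_width(eps: float, c: float, prefactor: float = 1.0) -> int:
    """Smallest integer m with prefactor * m**(-min(1, c/6)) <= eps.

    `prefactor` is a hypothetical stand-in for the poly(d) factor.
    """
    a = min(1.0, c / 6.0)
    return math.ceil((prefactor / eps) ** (1.0 / a))


def required_time(eps: float, c: float) -> float:
    """Smallest t with t**(-c) <= eps, assuming the loss equals t**(-c)."""
    return eps ** (-1.0 / c)


if __name__ == "__main__":
    for c in (3.0, 6.0):
        print(f"c={c}: width ~ {required_width(1e-2, c)}, "
              f"time ~ {required_time(1e-2, c):.2f}")
```

Both quantities are powers of 1/ε with exponents depending only on c, which is the sense in which the resource counts are polynomial in 1/ε.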

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Practical networks with moderate width may therefore inherit long-time mean-field behavior whenever population loss decays reasonably fast.
  • The same integral-control idea could be tested on deeper networks or on stochastic gradient variants to see how far the uniform bound travels.
  • Experiments could directly measure the observed gap versus predicted scaling as a function of measured loss decay exponent c.

Load-bearing premise

The mean-field loss decays at a polynomial rate such that the integral of its square root over infinite time stays only logarithmic in dimension.
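The premise can be explored on a toy loss curve. The sketch below assumes a hypothetical model L_t = min(1, t^{-c}) (the cap at 1 is only there to avoid the singularity at t = 0) and evaluates the integral of its square root over [0, T] in closed form. It says nothing about the dimension dependence, which the toy model does not capture.

```python
import math


def sqrt_loss_integral(T: float, c: float) -> float:
    """Closed form of the integral of sqrt(min(1, t**-c)) over [0, T].

    Assumes the hypothetical model loss L_t = min(1, t**-c).
    """
    if T <= 1.0:
        return T                      # integrand equals 1 on [0, 1]
    p = c / 2.0                       # the square root halves the decay exponent
    if abs(p - 1.0) < 1e-12:
        return 1.0 + math.log(T)      # c = 2: logarithmic growth in T
    return 1.0 + (T ** (1.0 - p) - 1.0) / (1.0 - p)


if __name__ == "__main__":
    for c in (1.0, 2.0, 4.0):
        values = [sqrt_loss_integral(T, c) for T in (1e2, 1e4, 1e6)]
        print(f"c={c}: " + ", ".join(f"{v:.2f}" for v in values))
```

For this model the integral stays bounded in T exactly when c > 2, with limit 1 + 2/(c − 2); at c = 2 it grows like log T and for c < 2 it grows polynomially.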

What would settle it

Simulate or compute the squared difference between finite-width and mean-field network outputs over long times and check whether it remains bounded by poly(d) m^{-min(1,c/6)} when the observed loss decays as t^{-c} and the integral condition holds.
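One way such a check could be set up is sketched below with NumPy. Everything here is an illustrative assumption rather than the paper's experimental protocol: a tanh network with 1/m output scaling, Gaussian inputs, a hypothetical single-index target, full-batch gradient descent with per-neuron gradients scaled by m, and a very wide network (width `M`) used as a proxy for the mean-field limit. The narrow network reuses the first m neurons of the wide initialization so the two runs are coupled.

```python
import numpy as np


def train_outputs(W, a, X, y, X_eval, lr, steps):
    """Full-batch GD on f(x) = (1/m) * sum_i a_i * tanh(w_i . x).

    Returns the network outputs on X_eval before every step, shape (steps, n_eval).
    """
    W, a = W.copy(), a.copy()
    m, n = W.shape[0], X.shape[0]
    history = []
    for _ in range(steps):
        history.append(np.tanh(X_eval @ W.T) @ a / m)
        H = np.tanh(X @ W.T)              # (n, m) hidden activations
        resid = H @ a / m - y             # (n,) residuals of the current network
        # Gradients of 0.5 * mean squared error per neuron, multiplied by m
        # (mean-field time scaling) so the update size does not vanish with width.
        grad_a = H.T @ resid / n
        grad_W = (resid[:, None] * (1.0 - H**2) * a[None, :]).T @ X / n
        a -= lr * grad_a
        W -= lr * grad_W
    return np.array(history)


def max_gap_by_width(widths, M=2048, d=8, n=128, steps=200, lr=0.5, seed=0):
    """Largest-over-time mean squared output gap between width-m and width-M nets."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d)) / np.sqrt(d)
    X_eval = rng.standard_normal((n, d)) / np.sqrt(d)
    y = np.tanh(2.0 * X[:, 0])            # hypothetical single-index target
    W0 = rng.standard_normal((M, d))
    a0 = rng.standard_normal(M)
    wide = train_outputs(W0, a0, X, y, X_eval, lr, steps)   # mean-field proxy
    result = {}
    for m in widths:
        narrow = train_outputs(W0[:m], a0[:m], X, y, X_eval, lr, steps)
        gaps = np.mean((wide - narrow) ** 2, axis=1)        # squared gap per step
        result[m] = float(gaps.max())
    return result


if __name__ == "__main__":
    for m, gap in max_gap_by_width([16, 64, 256]).items():
        print(f"m={m:4d}  max squared gap over time: {gap:.3e}")
```

A fuller test would also fit the decay exponent c from the wide network's loss curve and compare the slope of the gap against m on a log-log plot with min(1, c/6); the sketch stops at measuring the gap.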

Figures

Figures reproduced from arXiv: 2605.22010 by Joan Bruna, Margalit Glasgow.

Figure 1: Approximate loss L(ρ_t^MF) (left) and ∫_{s=0}^t L(ρ_s^MF)^{1/2} ds (right) for d = 128. We train both layers.
Adjacent text (Section 5, Experiments): In this section, we provide several examples of learning problems which empirically demonstrate fast enough convergence rates to satisfy the conditions of our main theorem. We study two settings, and defer further experimental details to Appendix B. Misspecified Sobolev single-index model W…
Figure 2: Examples of target densities (top), alongside the behavior of …
Figure 3: Decomposing d/dt Δ_t(i) = −ν(ξ_t(w_i), ρ_t^MF) + ν_D̂(ξ̂_{tη}(w_i), ρ̂_{tη}^m). Upper bounds on the approximate differences between the terms in the rectangles are given above the arrows.
Adjacent text: In the case that S = S^{d−1}, the above uniform convergence bounds follow from standard empirical process theory arguments: from Lemma 7 all the random variables are bounded since all neurons are on S^{d−1}, and we can take an ϵ-n…
Figure 4: Plot of ϕ(x_1) = F(arccos(x_1)) for various values of γ. The function becomes smoother as γ increases.
Adjacent text (Appendix B, Supplemental Experimental Details for Misspecified Sobolev single-index model): For γ ∈ {1, 2, 4, 8}, train a wide neural network with gradient descent on n = 1024 data points (x_i, f^γ(x_i)). For the best approximation of the population loss, when d = 2, we use x_i evenly spaced around S^1; otherwise we ch…
Figure 5: Approximate loss L(ρ_t^MF) (left) and ∫_{s=0}^t L(ρ_s^MF)^{1/2} ds (right). Training both layers.
Original abstract

We consider one-hidden layer neural networks trained in the feature-learning regime using gradient descent, and relate the output of the finite-width network $f_{\hat{\rho}_t^m}$ to its infinite-width counterpart $f_{\rho_t^{MF}}$, which evolves in the mean-field dynamics. While constant-time horizon bounds for $\|f_{\rho_t^{MF}} - f_{\hat{\rho}_t^m}\|$ may be obtained via standard Grönwall estimates, the long-time behavior of the fluctuation is a more delicate matter. Uniform-in-time bounds often rely on (local) strong convexity in the landscape or Logarithmic Sobolev inequalities present in noisy gradient dynamics. In this work, we establish non-asymptotic weak propagation-of-chaos that holds uniformly in time, obtained by exploiting instead the convergence rate of the mean-field deterministic Wasserstein-gradient-flow dynamics. Specifically, denoting by $L_t$ the mean-field excess MSE loss at time $t$ and $m$ the number of neurons, under standard regularity assumptions and the condition $\int_0^\infty L_t^{1/2} dt =O(\log d)$, we obtain the uniform in time bound $\|f_{\rho_t^{MF}}- f_{\hat{\rho}_t^m}\|^2 \lesssim \text{poly}(d) m^{-\min(1,c/6)}$ whenever $L_t \lesssim t^{-c}$. Our result holds in a noiseless setting and does not make any assumptions on the geometry of the landscape near the optimum, and extends seamlessly to other forms of discretization, including finite number of samples and time discretization. A key takeaway of our result is that whenever the convergence rate of the mean-field, population-loss dynamics is faster than $t^{-2}$, we can attain a loss of $\epsilon$ with only $\text{poly}(d/\epsilon)$ neurons, training samples, and GD steps.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript establishes non-asymptotic uniform-in-time weak propagation-of-chaos for one-hidden-layer networks in the feature-learning regime. It bounds the squared difference between the finite-width network output f_ρ̂_t^m and its mean-field counterpart f_ρ_t^MF by poly(d) m^{-min(1,c/6)} whenever the mean-field excess loss satisfies L_t ≲ t^{-c} and the integral condition ∫_0^∞ L_t^{1/2} dt = O(log d), under standard regularity assumptions. The argument replaces standard Gronwall estimates with control derived from the deterministic Wasserstein gradient-flow convergence rate of the mean-field dynamics, and extends the bound to finite-sample and time-discretized settings without landscape assumptions near the optimum or added noise.

Significance. If the derivation holds, the result is significant for providing uniform-in-time fluctuation control in the noiseless case without local strong convexity or logarithmic Sobolev inequalities. The explicit dependence on the mean-field decay rate L_t and the implication that poly(d/ε) neurons, samples, and steps suffice for ε-loss when the mean-field dynamics converge faster than t^{-2} offer a concrete scaling guideline. Credit is due for the clean conditional derivation that avoids self-referential constants and for the seamless extension to discretizations.

major comments (2)
  1. [§3.2, Theorem 3.1] The derivation of the exponent min(1,c/6) in the m^{-min(1,c/6)} rate relies on a specific splitting of the fluctuation integral; the manuscript should explicitly verify that the c/6 term arises from the Hölder conjugate applied to the ∫ L_t^{1/2} dt term rather than from an auxiliary constant.
  2. [§4.1, Eq. (4.3)] The uniform bound is stated to hold for the population loss; the extension to the finite-sample empirical loss in §4.2 requires an additional concentration term whose dependence on the number of samples n is only sketched. The manuscript should state the precise n scaling that preserves the poly(d) prefactor.
minor comments (2)
  1. [Notation paragraph] The symbol ρ̂_t^m is used both for the empirical measure and for the network output; a brief clarification in the notation paragraph would avoid confusion.
  2. [Abstract and §3.1] The integral condition ∫ L_t^{1/2} dt = O(log d) is introduced in the abstract and Theorem 3.1 but its necessity is not contrasted with the weaker ∫ L_t dt < ∞ that would suffice for pointwise convergence; a short remark would help readers gauge sharpness.
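The gap between the two integral conditions in the second minor comment can be seen on a pure power-law tail. The calculation below assumes, for illustration only, that L_t = t^{-c} for t ≥ 1; it is not taken from the manuscript.

```latex
% Illustrative assumption: L_t = t^{-c} for t >= 1.
\[
\int_1^\infty L_t \, dt = \frac{1}{c-1} \quad (c > 1),
\qquad
\int_1^\infty L_t^{1/2} \, dt = \int_1^\infty t^{-c/2} \, dt = \frac{2}{c-2} \quad (c > 2),
\]
% and each integral diverges when its condition on c fails.
```

Under this assumption, decay rates with 1 < c ≤ 2 satisfy the weaker condition but not the one used in the theorem, which is the regime where the requested remark on sharpness would be informative.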

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and for identifying points that will improve the clarity of the presentation. We address each major comment below and will incorporate the suggested clarifications in the revised manuscript.

  1. Referee: [§3.2, Theorem 3.1] the derivation of the exponent min(1,c/6) in the m^{-min(1,c/6)} rate relies on a specific splitting of the fluctuation integral; the manuscript should explicitly verify that the c/6 term arises from the Hölder conjugate applied to the ∫ L_t^{1/2} dt term rather than from an auxiliary constant.

    Authors: We agree that an explicit verification of the exponent would enhance readability. In the revision we will insert a short remark immediately after the proof of Theorem 3.1 that isolates the application of Hölder's inequality to the integral term ∫_0^∞ L_t^{1/2} dt. The calculation shows that the conjugate pair (p,q) with 1/p + 1/q = 1 is chosen so that the resulting power on m is exactly -c/6 when the integral is bounded by O(log d); no auxiliary constants enter the exponent beyond those already stated in the theorem hypotheses. revision: yes

  2. Referee: [§4.1, Eq. (4.3)] the uniform bound is stated to hold for the population loss; the extension to the finite-sample empirical loss in §4.2 requires an additional concentration term whose dependence on the number of samples n is only sketched. The manuscript should state the precise n scaling that preserves the poly(d) prefactor.

    Authors: The referee is correct that the dependence on n was only indicated qualitatively. Under the same regularity assumptions used for the population case, standard empirical-process concentration (e.g., via bounded differences or sub-Gaussian tails) yields an additive error of order sqrt((d log n)/n) in the loss. In the revised §4.2 we will state explicitly that choosing n ≳ poly(d) m^{min(1,c/6)} absorbs this term into the existing poly(d) m^{-min(1,c/6)} bound, thereby preserving the overall rate. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained


The central result conditions the uniform-in-time bound explicitly on the externally supplied mean-field excess loss decay L_t ≲ t^{-c} together with the integral condition ∫ L_t^{1/2} dt = O(log d). These quantities are defined from the infinite-width Wasserstein gradient flow and enter the fluctuation control as given inputs; the finite-width deviation is then bounded in terms of them via standard estimates that replace Gronwall with the supplied convergence speed. No step redefines L_t in terms of the finite-network output, fits a parameter to the target quantity, or relies on a load-bearing self-citation whose content reduces to the present claim. The derivation therefore remains independent of its conclusion.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard regularity assumptions for the loss and activation (treated as domain assumptions) plus the explicit integral condition on L_t; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption: Standard regularity assumptions on the loss and network (abstract).
    Invoked to justify the mean-field Wasserstein gradient flow and the fluctuation estimates.
  • ad hoc to paper: ∫_0^∞ L_t^{1/2} dt = O(log d) and L_t ≲ t^{-c}.
    This decay-plus-integral condition is required to close the uniform-in-time argument.

pith-pipeline@v0.9.0 · 5882 in / 1487 out tokens · 51956 ms · 2026-05-22T04:26:37.959895+00:00 · methodology


