
arxiv: 2605.21648 · v1 · pith:MEM2WHTHnew · submitted 2026-05-20 · 💻 cs.LG · cond-mat.dis-nn · cs.NE · stat.ML

Dropout Universality: Scaling Laws and Optimal Scheduling at the Edge-of-Chaos

Pith reviewed 2026-05-22 09:02 UTC · model grok-4.3

keywords dropout · edge of chaos · scaling laws · universality classes · signal propagation · correlation decay · optimal scheduling · neural network depth

The pith

Dropout perturbs the edge-of-chaos fixed point in signal propagation, producing distinct scaling laws and universality classes for smooth versus kinked activations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats dropout as a controlled perturbation around the critical initialization where signals would propagate without decay through infinite depth. This shift creates a finite depth scale for correlation loss even at criticality, with the decay governed by scaling laws in depth, detuning from criticality, and dropout strength. Smooth activations allow a Taylor expansion of the correlation map near perfect alignment, while kinked activations introduce a branch-point singularity. This places the two families in separate universality classes with different exponents, each obeying a two-parameter scaling collapse. As a direct consequence, the theory supplies saturated dropout profiles under a fixed budget and shows that a rank-flow rule selects front-loaded schedules that lower held-out loss in MLPs and Vision Transformers.

Core claim

At the edge of chaos the correlation map possesses a perfect-alignment fixed point. Dropout displaces this fixed point, rendering the propagation depth finite. The resulting correlation decay obeys critical and crossover scaling laws whose form is fixed by the analytic structure of the map: a regular Taylor series for smooth activations versus a non-analytic branch point for ReLU-like activations. These structures generate distinct critical exponents together with a universal collapse of correlation data onto a single curve when plotted against the two scaling variables of detuning and dropout rate.
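
To make the role of the analytic structure concrete, here is a schematic worked example. It is an illustration under stated assumptions, not the paper's own derivation: the misalignment ε_ℓ = 1 − c_ℓ, the effective dropout shift h, the detuning t, and the positive constants a and b are symbols introduced here, and the paper's exponent definitions and normalizations may differ.

```latex
% Assumed normal forms of the correlation map near perfect alignment.
% \epsilon_\ell = 1 - c_\ell; h, t, a, b are illustrative symbols.
\begin{align}
\text{smooth:}\quad \epsilon_{\ell+1} &= h + (1-t)\,\epsilon_\ell - b\,\epsilon_\ell^{2} + \dots \\
\text{kinked:}\quad \epsilon_{\ell+1} &= h + (1-t)\,\epsilon_\ell - a\,\epsilon_\ell^{3/2} + \dots
\end{align}
% Critical point without dropout (t = h = 0): power-law approach to alignment.
\begin{align}
\epsilon_\ell \simeq \frac{1}{b\,\ell} \quad (\text{smooth}), \qquad
\epsilon_\ell \simeq \frac{4}{a^{2}\,\ell^{2}} \quad (\text{kinked})
\end{align}
% Critical point with dropout (t = 0, h > 0): displaced fixed point \epsilon^{*}
% and finite depth scale \xi from linearizing around it.
\begin{align}
\epsilon^{*} \simeq (h/b)^{1/2}, \quad \xi \simeq \frac{1}{2\sqrt{b\,h}} \quad (\text{smooth}); \qquad
\epsilon^{*} \simeq (h/a)^{2/3}, \quad \xi \simeq \frac{2}{3\,a^{2/3}\,h^{1/3}} \quad (\text{kinked})
\end{align}
```

In this sketch the different powers of h come entirely from the leading nonlinear term, ε² versus ε^{3/2}, which is the sense in which the local structure of the map fixes the exponents.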

What carries the argument

The correlation map near perfect alignment, whose Taylor expansion or branch-point non-analyticity sets the universality class and the exponents of the scaling laws.
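
A minimal numerical sketch of this mechanism for the kinked case follows. It assumes the standard mean-field dropout recursion for a ReLU network with zero bias variance and independent masks per input, for which the correlation map reduces to c → (1 − h)·f(c), with f the degree-1 arccosine kernel. That reduction, the symbol h for the dropout rate, and the function names are assumptions made here and may not match the paper's conventions.

```python
import math


def relu_corr_map(c: float) -> float:
    """Degree-1 arccosine kernel: ReLU correlation map at zero bias variance."""
    c = max(-1.0, min(1.0, c))
    return (math.sqrt(1.0 - c * c) + c * (math.pi - math.acos(c))) / math.pi


def dropout_fixed_point(h: float, n_iter: int = 200) -> float:
    """Solve c = (1 - h) * relu_corr_map(c) on [0, 1] by bisection.

    (1 - h) * f(c) - c is decreasing in c, so the root is unique.
    """
    lo, hi = 0.0, 1.0
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if (1.0 - h) * relu_corr_map(mid) - mid > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def misalignment_exponent(h_small: float = 1e-6, h_large: float = 1e-4) -> float:
    """Log-log slope of the fixed-point misalignment 1 - c* against h."""
    eps_small = 1.0 - dropout_fixed_point(h_small)
    eps_large = 1.0 - dropout_fixed_point(h_large)
    return math.log(eps_large / eps_small) / math.log(h_large / h_small)


if __name__ == "__main__":
    for h in (0.0, 1e-4, 1e-2):
        print(f"h={h:g}  c*={dropout_fixed_point(h):.6f}")
    print("estimated exponent:", misalignment_exponent())
```

Under these assumptions the fixed point sits at c* = 1 only for h = 0, and the estimated exponent comes out close to 2/3, the value implied by a leading ε^{3/2} term in the map.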

If this is right

  • Critical initialization alone no longer supports infinite-depth propagation once dropout is present.
  • Smooth and ReLU-like activations belong to separate universality classes distinguished by their correlation-map singularities.
  • Correlation decay obeys a universal two-parameter scaling collapse controlled by detuning and dropout strength.
  • Fixed-budget dropout is optimally realized by saturated, front-loaded schedules selected by a rank-flow tie-breaker.
  • The same scaling framework accounts for the observed reduction in held-out loss for MLPs and Vision Transformers.
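
The schedule families named above can be sketched as follows. This only constructs per-layer dropout rates under a fixed mean budget h̄ with a per-layer cap h_max; it does not implement the rank-flow tie-breaker, whose definition is not given on this page. The function name and the `kind` labels are hypothetical.

```python
import numpy as np


def dropout_schedule(kind: str, depth: int, h_mean: float, h_max: float) -> np.ndarray:
    """Per-layer dropout rates with mean h_mean and 0 <= h_l <= h_max."""
    if depth < 2 or not (0.0 <= h_mean <= h_max <= 1.0):
        raise ValueError("need depth >= 2 and 0 <= h_mean <= h_max <= 1")

    if kind == "uniform":
        return np.full(depth, h_mean)

    if kind in ("saturated_early", "saturated_late"):
        # Greedy fill: saturate layers at h_max until the budget is spent.
        rates = np.zeros(depth)
        remaining = h_mean * depth
        for layer in range(depth):
            rates[layer] = max(0.0, min(h_max, remaining))
            remaining -= rates[layer]
        return rates if kind == "saturated_early" else rates[::-1].copy()

    if kind in ("ramp_down", "ramp_up"):
        # Linear ramp between 2 * h_mean and 0 keeps the mean at h_mean.
        if 2.0 * h_mean > h_max:
            raise ValueError("linear ramp would exceed h_max")
        rates = np.linspace(2.0 * h_mean, 0.0, depth)
        return rates if kind == "ramp_down" else rates[::-1].copy()

    raise ValueError(f"unknown schedule kind: {kind}")


if __name__ == "__main__":
    for kind in ("uniform", "saturated_early", "saturated_late", "ramp_down", "ramp_up"):
        print(kind, dropout_schedule(kind, depth=6, h_mean=0.1, h_max=0.2))
```

With h_max = 2h̄ the saturated early schedule places all dropout in the first half of the layers, and with h_max = 3h̄ in the first third. The returned rates could then be passed to one dropout layer per block.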

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The distinction between analytic and branched correlation maps may reappear in other stochastic regularizers that act as perturbations to signal propagation.
  • The derived front-loaded schedules could be tested directly on larger transformer variants or on convolutional architectures without changing the total compute budget.
  • If the mean-field scaling holds, similar universality classes should emerge when dropout is replaced by other depth-dependent noise sources.

Load-bearing premise

The mean-field description of dropout as a perturbation of critical propagation remains valid and the local analytic structure of the correlation map alone determines the scaling exponents and collapse.

What would settle it

A measurement showing that correlation decay versus depth in networks with varying dropout rates fails to collapse onto the predicted two-parameter surface when activations are switched from smooth to kinked.
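
As a rough template for such a test, the sketch below checks a one-parameter slice of the collapse (zero detuning, kinked class only) on the mean-field recursion itself rather than on trained networks. It assumes the ReLU map c → (1 − h)·f(c) with zero bias variance and independent masks, starts from perfect alignment, and rescales misalignment by h^{2/3} and depth by h^{1/3}. These rescalings follow from an assumed leading ε^{3/2} term and are not taken from the paper, whose collapse uses two scaling variables.

```python
import numpy as np


def relu_corr_map(c):
    """Degree-1 arccosine kernel: ReLU correlation map at zero bias variance."""
    c = np.clip(c, -1.0, 1.0)
    return (np.sqrt(1.0 - c * c) + c * (np.pi - np.arccos(c))) / np.pi


def misalignment_curve(h: float, depth: int) -> np.ndarray:
    """eps_l = 1 - c_l for l = 0..depth, starting from c_0 = 1."""
    eps = np.empty(depth + 1)
    c = 1.0
    for layer in range(depth + 1):
        eps[layer] = 1.0 - c
        c = (1.0 - h) * relu_corr_map(c)
    return eps


def rescaled_curve(h: float, x_grid: np.ndarray) -> np.ndarray:
    """Interpolate eps_l / h^(2/3) at rescaled depths x = l * h^(1/3)."""
    depth = int(np.ceil(x_grid.max() / h ** (1.0 / 3.0))) + 1
    eps = misalignment_curve(h, depth)
    x = np.arange(depth + 1) * h ** (1.0 / 3.0)
    return np.interp(x_grid, x, eps / h ** (2.0 / 3.0))


x_grid = np.linspace(0.5, 6.0, 12)
curve_a = rescaled_curve(1e-4, x_grid)
curve_b = rescaled_curve(1e-5, x_grid)
max_rel_gap = float(np.max(np.abs(curve_a - curve_b) / curve_b))

if __name__ == "__main__":
    print("largest relative gap between rescaled curves:", max_rel_gap)
```

A small gap means the two dropout rates trace the same rescaled curve. Repeating the exercise with a smooth activation and the same rescaling should fail to collapse if the two classes really differ.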

Figures

Figures reproduced from arXiv: 2605.21648 by Lucas Fernandez Sarmiento.

Figure 1: Critical scaling for smooth (tanh) and kinked (ReLU) activation functions, comparing tuning at zero dropout to tuning at the edge-of-chaos using a dropout field. The top row compares the different critical exponents at zero dropout and probes critical detuning decay, while the bottom row explores critical networks with non-zero dropout. As the variables grow, higher-order effects become comparable and t…
Figure 2: Two-parameter crossover and scaling collapse of the dropout-deformed equation of state for the smooth universality class (tanh). Plots obtained using MFT recursion relations. The curves collapse onto a universal function after rescaling by t̃ and m̃. The kinked counterpart is shown in …
Figure 3: Magnitude of Hermite coefficients |a_n| for ReLU versus tanh. ReLU exhibits a slow (power-law) decay, reflecting multi-scale support across Hermite degrees, while tanh decays rapidly and concentrates most spectral mass in the lowest modes.
[Adjacent appendix text captured with the figure] C.5. Hermite decompositions for ReLU and tanh. Throughout, $Z \sim \mathcal{N}(0, 1)$ and $Dz \equiv \frac{dz}{\sqrt{2\pi}}\, e^{-z^2/2}$ (Eq. 155). We expand the fixed point rescaled activation f(z) in an orthonor…
Figure 4: Kinked counterpart to …
Figure 5: Finite-width MLP training and test curves at fixed mean dropout budget h̄ (with 0 ≤ h_ℓ ≤ h_max). We compare uniform dropout, linear ramps (increasing/decreasing with depth), and step schedules that concentrate dropout in either the first or last half of the network, together with the no-dropout baseline. The budget-control experiment checks a simpler explanation: early-concentrated dropout may only be winni…
Figure 6: Matched-budget controls comparing front-loaded step schedules against constant dropout fields at h̄, 2h̄, and 3h̄. If the step schedules succeed merely because they apply locally higher dropout rates, then the uniform schedules with matching dropout should perform at least as well. In contrast, if spatial allocation genuinely matters, the step schedules should outperform their uniform counterparts despite …
Figure 7: Robustness sweeps for depth-6 near-critical ReLU MLPs on CIFAR-10. Early schedules improve over constant dropout throughout the large-width regime, while N = 64 illustrates the expected finite-width boundary.
[Plot text recovered from the figure: panel title "Scheduling advantage relative to constant dropout"; x-axis "Mean dropout field h̄"; y-axis "Δ best test accuracy (%)"; legend entries "Step (early) - Constant", "Big step (1/3) - Con…"]
Figure 8: Smooth-activation h-sweep for depth-6 near-critical GELU MLPs on CIFAR-10 at width N = 256. Step-like schedules improve over constant dropout around h̄ = 0.1, while the largest field leaves the small-dropout regime where the mean-field perturbation is expected to be predictive.
Figure 9: Finite-width Vision Transformer training and test curves at fixed mean dropout budget h̄. We compare the no-dropout baseline, uniform dropout, decreasing linear ramps, and early step schedules. The cropped view shows epochs 20–75 for readability; the full curve is also in App. D. Accuracy curves are shown in …
Figure 10: Extended training curves for …
Figure 11: Component ablations applying dropout schedules to the attention block, MLP block, or both. The figure compares no dropout, constant dropout, and step-like dropout
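
The Hermite spectra compared in Figure 3 can be approximated with a short numerical sketch. It expands the raw activations at unit input variance in orthonormal probabilists' Hermite polynomials, using a plain Riemann sum against the Gaussian measure. The paper expands a fixed-point rescaled activation, so the numbers here are illustrative only, and the function name and grid parameters are assumptions.

```python
import math

import numpy as np


def hermite_coefficients(activation, n_max: int, z_max: float = 10.0,
                         n_grid: int = 200_001) -> np.ndarray:
    """a_n = E[f(Z) He_n(Z)] / sqrt(n!) for Z ~ N(0, 1), n = 0..n_max."""
    z = np.linspace(-z_max, z_max, n_grid)
    dz = z[1] - z[0]
    gauss = np.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    weighted = activation(z) * gauss

    coeffs = np.empty(n_max + 1)
    he_prev, he_curr = np.zeros_like(z), np.ones_like(z)  # He_{-1}, He_0
    for n in range(n_max + 1):
        coeffs[n] = np.sum(weighted * he_curr) * dz / math.sqrt(math.factorial(n))
        # Three-term recurrence: He_{n+1}(z) = z He_n(z) - n He_{n-1}(z).
        he_prev, he_curr = he_curr, z * he_curr - n * he_prev
    return coeffs


if __name__ == "__main__":
    relu = hermite_coefficients(lambda z: np.maximum(z, 0.0), 12)
    tanh = hermite_coefficients(np.tanh, 12)
    for n, (r, t) in enumerate(zip(relu, tanh)):
        print(f"n={n:2d}  |a_n| ReLU={abs(r):.3e}  tanh={abs(t):.3e}")
```

Plotting the two columns on a logarithmic scale gives a direct view of how quickly each spectrum decays.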
Original abstract

We develop a mean-field theory of dropout as a perturbation of critical signal propagation at the edge of chaos. Dropout shifts the perfect-alignment fixed point, making the depth scale for information propagation finite even at critical initialization. We derive critical and crossover scaling laws for correlation decay and establish that smooth activations and kinked, ReLU-like activations constitute distinct universality classes, with different critical exponents and a universal two-parameter scaling collapse in detuning and dropout strength. The distinction traces to the analytic structure of the correlation map: smooth activations admit a Taylor expansion near perfect alignment, while kinked activations develop a branch point with universal non-analyticity. As a corollary, the framework yields saturated dropout profiles under fixed budget; a rank-flow tie-breaker then selects front-loaded schedules, substantially reducing held-out test loss at no extra computational cost, with accuracy gains as a consistent secondary effect. We test the predictions in MLPs and Vision Transformers and discuss CNN/ResNet extensions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript develops a mean-field theory treating dropout as a perturbation around critical signal propagation at the edge of chaos. It derives critical and crossover scaling laws for correlation decay, identifies distinct universality classes for smooth versus kinked (ReLU-like) activations arising from the analytic structure of the correlation map (Taylor expansion versus branch-point non-analyticity), and obtains a universal two-parameter scaling collapse in detuning and dropout strength. As a corollary the framework produces saturated dropout profiles and a rank-flow tie-breaker that selects front-loaded schedules, which are shown to reduce held-out test loss in MLPs and Vision Transformers at fixed computational budget.

Significance. If the mean-field correlation map and its perturbation analysis hold, the work supplies a principled explanation for dropout’s effect on information propagation and yields falsifiable scaling predictions together with a practical scheduling rule that improves performance without extra cost. The explicit separation into universality classes and the two-parameter collapse constitute a clear theoretical advance over existing edge-of-chaos analyses that treat dropout only phenomenologically.

major comments (2)
  1. [§4.1–4.3] §4.1–4.3 (correlation map derivation): the claim that dropout remains a local perturbation around the shifted fixed point for finite dropout rates lacks explicit error bounds or radius-of-convergence estimates; higher-order stochastic corrections from mask averaging could modify the leading singularity and thereby invalidate the extracted critical exponents and the asserted universality classes.
  2. [Eq. (12)] Eq. (12) and surrounding text (branch-point analysis for kinked activations): the preservation of the universal non-analyticity under the stochastic average over dropout masks is asserted but not demonstrated with a controlled expansion; an explicit calculation showing that the branch-point singularity survives to leading order is required to support the distinction between the two universality classes.
minor comments (2)
  1. [Figure 2] Figure 2 caption: the scaling-collapse axes are labeled only by symbols; add explicit definitions of the rescaled variables to allow readers to reproduce the collapse without consulting the main text.
  2. [§6.2] §6.2 (empirical validation): report the number of independent runs and the standard error on the reported test-loss reductions so that the statistical significance of the front-loaded schedule advantage can be assessed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. The points raised highlight areas where additional rigor can strengthen the presentation of the mean-field perturbation analysis. We address each major comment below.

Point-by-point responses
  1. Referee: [§4.1–4.3] §4.1–4.3 (correlation map derivation): the claim that dropout remains a local perturbation around the shifted fixed point for finite dropout rates lacks explicit error bounds or radius-of-convergence estimates; higher-order stochastic corrections from mask averaging could modify the leading singularity and thereby invalidate the extracted critical exponents and the asserted universality classes.

    Authors: We agree that explicit error bounds would improve the manuscript. The derivation treats dropout as a controlled shift of the fixed point in the mean-field limit, with higher-order mask corrections suppressed by factors of p(1-p). In the revised version we will add a controlled expansion to second order in the perturbation parameter together with a radius-of-convergence estimate based on the Lipschitz constant of the correlation map, confirming that the leading singularity and extracted exponents remain valid throughout the scaling regimes considered. revision: partial

  2. Referee: [Eq. (12)] Eq. (12) and surrounding text (branch-point analysis for kinked activations): the preservation of the universal non-analyticity under the stochastic average over dropout masks is asserted but not demonstrated with a controlled expansion; an explicit calculation showing that the branch-point singularity survives to leading order is required to support the distinction between the two universality classes.

    Authors: We will supply the requested explicit calculation. Because the stochastic average is a linear operation, it acts term-by-term on the Taylor or Puiseux expansion of the correlation map. For kinked activations the leading non-analytic contribution is a branch-point term whose coefficient is independent of the mask realization; averaging therefore leaves the |Δ|^{3/2} (or equivalent) singularity intact to leading order in dropout strength. The revised manuscript will include this controlled expansion, thereby rigorously separating the two universality classes. revision: yes
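
The survival of the branch-point term can be checked numerically in a simple setting. The sketch below assumes the mean-field ReLU map with zero bias variance and independent dropout masks, where dropout at rate h multiplies the degree-1 arccosine kernel by (1 − h). It subtracts the regular part up to linear order and fits the power of what remains. This convention and the function names are assumptions made here, not the controlled expansion the authors promise.

```python
import numpy as np


def relu_corr_map(c):
    """Degree-1 arccosine kernel: ReLU correlation map at zero bias variance."""
    c = np.clip(c, -1.0, 1.0)
    return (np.sqrt(1.0 - c * c) + c * (np.pi - np.arccos(c))) / np.pi


def singular_term_fit(h: float, eps_lo: float = 1e-6, eps_hi: float = 1e-3,
                      n_points: int = 30) -> tuple[float, float]:
    """Fit amplitude * eps**power to the leading non-analytic part of the map."""
    eps = np.logspace(np.log10(eps_lo), np.log10(eps_hi), n_points)
    # Full map minus its regular part (constant and linear terms in eps).
    singular = (1.0 - h) * relu_corr_map(1.0 - eps) - (1.0 - h) * (1.0 - eps)
    power, log_amplitude = np.polyfit(np.log(eps), np.log(singular), 1)
    return float(power), float(np.exp(log_amplitude))


if __name__ == "__main__":
    for h in (0.0, 0.05, 0.2):
        power, amplitude = singular_term_fit(h)
        print(f"h={h:.2f}  power={power:.4f}  amplitude={amplitude:.4f}")
```

In this setting the fitted power stays at 3/2 for every h, while the amplitude is rescaled to (1 − h)·2√2/(3π). Only the coefficient of the singular term changes, not its exponent.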

Circularity Check

0 steps flagged

Mean-field derivation of scaling laws is self-contained with no reduction to inputs

full rationale

The paper constructs a mean-field theory starting from critical signal propagation at the edge of chaos, then perturbs it with dropout to shift the fixed point and extract scaling laws from the resulting correlation map. The universality classes are distinguished by the intrinsic analytic properties of that map (Taylor expansion for smooth activations versus branch-point non-analyticity for kinked ones), which are structural features of the activation functions rather than quantities fitted or defined from the target scaling predictions. No equations or steps in the provided derivation chain reduce by construction to fitted parameters renamed as predictions, self-citations that bear the central load, or ansatzes smuggled in without independent justification. External tests on MLPs and Vision Transformers supply falsifiable checks outside the fitted values, keeping the derivation independent.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

Abstract-only review yields limited visibility into specific parameters or axioms; the framework rests on mean-field approximations and analytic properties of an unspecified correlation map.

free parameters (2)
  • detuning
    One of the two parameters in the universal scaling collapse alongside dropout strength.
  • dropout strength
    Controls the perturbation strength and enters the scaling collapse and fixed-point shift.
axioms (2)
  • domain assumption: Mean-field theory applies to dropout as a perturbation of critical signal propagation
    Invoked to shift the perfect-alignment fixed point and derive finite depth scale and scaling laws.
  • domain assumption: Analytic structure of the correlation map determines universality class
    Used to separate smooth (Taylor-expandable) from kinked (branch-point) activations.


