Dropout Universality: Scaling Laws and Optimal Scheduling at the Edge-of-Chaos

Lucas Fernandez Sarmiento

arXiv: 2605.21648 · v2 · pith:MEM2WHTHnew · submitted 2026-05-20 · cs.LG · cond-mat.dis-nn · cs.NE · stat.ML

Pith reviewed 2026-05-22 09:02 UTC · model grok-4.3

keywords: dropout · edge of chaos · scaling laws · universality classes · signal propagation · correlation decay · optimal scheduling · neural network depth

The pith

Dropout perturbs the edge-of-chaos fixed point in signal propagation, producing distinct scaling laws and universality classes for smooth versus kinked activations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats dropout as a controlled perturbation around the critical initialization where signals would propagate without decay through infinite depth. This shift creates a finite depth scale for correlation loss even at criticality, with the decay governed by scaling laws in depth, detuning from criticality, and dropout strength. Smooth activations allow a Taylor expansion of the correlation map near perfect alignment, while kinked activations introduce a branch-point singularity, placing the two families in separate universality classes with different exponents and a shared two-parameter scaling collapse. As a direct consequence, the theory supplies saturated dropout profiles under a fixed budget and shows that a rank-flow rule selects front-loaded schedules that lower held-out loss in MLPs and Vision Transformers.

Core claim

At the edge of chaos the correlation map possesses a perfect-alignment fixed point. Dropout displaces this fixed point, rendering the propagation depth finite. The resulting correlation decay obeys critical and crossover scaling laws whose form is fixed by the analytic structure of the map: a regular Taylor series for smooth activations versus a non-analytic branch point for ReLU-like activations. These structures generate distinct critical exponents together with a universal collapse of correlation data onto a single curve when plotted against the two scaling variables of detuning and dropout rate.
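A minimal numerical sketch of the fixed-point shift. It assumes the standard mean-field correlation map for ReLU (the arc-cosine form at critical weight variance, zero bias) and a dropout convention in which independent masks on the two inputs multiply the map by the keep probability 1 − h. The parameterization and the names `relu_corr_map` and `fixed_point` are illustrative assumptions, not taken from the paper.

```python
import math


def relu_corr_map(c):
    """Dropout-free mean-field correlation map for ReLU at criticality.

    Assumed form: the degree-1 arc-cosine kernel, which satisfies g(1) = 1.
    """
    c = max(-1.0, min(1.0, c))
    return (math.sqrt(1.0 - c * c) + (math.pi - math.acos(c)) * c) / math.pi


def fixed_point(h, iters=20000):
    """Iterate c -> (1 - h) * g(c) from perfect alignment c = 1.

    h is the dropout rate (assumed convention: independent masks per input,
    so only the off-diagonal correlation is reduced by the keep probability).
    """
    c = 1.0
    for _ in range(iters):
        c = (1.0 - h) * relu_corr_map(c)
    return c


# Without dropout, c = 1 stays fixed; any h > 0 pushes the fixed point below 1.
for h in (0.0, 1e-4, 1e-3, 1e-2):
    c_star = fixed_point(h)
    print(f"h = {h:g}  c* = {c_star:.6f}  1 - c* = {1.0 - c_star:.3e}")
```

In this toy map the offset 1 − c* grows roughly like h^(2/3) at small h, which comes from balancing h against the ε^(3/2) term of the ReLU map. That exponent is a property of the sketch under the stated assumptions; the paper's own exponents should be read from the paper.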

What carries the argument

The correlation map near perfect alignment, whose Taylor expansion or branch-point non-analyticity sets the universality class and the exponents of the scaling laws.
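For orientation, the two kinds of local structure can be written out in the simplest dropout-free setting. The ReLU line below is the standard arc-cosine correlation map at critical weight variance with zero bias; the smooth line is a generic Taylor expansion that assumes only g(1) = 1 and analyticity at c = 1. The symbols g, ε, and θ are notation chosen here, and neither line is quoted from the paper.

```latex
% Smooth activation: regular expansion around perfect alignment c = 1 - \varepsilon
g_{\text{smooth}}(1-\varepsilon)
  = 1 - g'(1)\,\varepsilon + \tfrac{1}{2}\,g''(1)\,\varepsilon^{2} + O(\varepsilon^{3})

% ReLU at criticality: with c = \cos\theta,
g_{\text{ReLU}}(c)
  = \frac{1}{\pi}\Bigl[\sqrt{1-c^{2}} + (\pi - \arccos c)\,c\Bigr]
  = c + \frac{\sin\theta - \theta\cos\theta}{\pi}

% Using \theta^{2} = 2\varepsilon + O(\varepsilon^{2}) for c = 1 - \varepsilon:
g_{\text{ReLU}}(1-\varepsilon)
  = 1 - \varepsilon + \frac{2\sqrt{2}}{3\pi}\,\varepsilon^{3/2} + O(\varepsilon^{5/2})
```

The fractional power ε^(3/2) is the branch-point term: no Taylor series in integer powers of ε reproduces it, which is the structural difference the section refers to.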

If this is right

Critical initialization alone no longer supports infinite-depth propagation once dropout is present.
Smooth and ReLU-like activations belong to separate universality classes distinguished by their correlation-map singularities.
Correlation decay obeys a universal two-parameter scaling collapse controlled by detuning and dropout strength.
Fixed-budget dropout is optimally realized by saturated, front-loaded schedules selected by a rank-flow tie-breaker.
The same scaling framework accounts for the observed reduction in held-out loss for MLPs and Vision Transformers.
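The collapse idea in the list above can be illustrated with a reduced, one-parameter version. The sketch below assumes the standard ReLU arc-cosine correlation map with dropout entering as a factor 1 − h, sits exactly at criticality (so detuning is absent), and rescales depth and decorrelation by leading-order scales derived from that toy map. The scales `eps_star` and `xi` and all function names are assumptions of this sketch, not the paper's scaling variables.

```python
import math

# Coefficient of the eps^(3/2) term in the ReLU map near c = 1.
A = 2.0 * math.sqrt(2.0) / (3.0 * math.pi)


def relu_corr_map(c):
    """Assumed dropout-free ReLU correlation map at criticality."""
    c = max(-1.0, min(1.0, c))
    return (math.sqrt(1.0 - c * c) + (math.pi - math.acos(c)) * c) / math.pi


def decorrelation_curve(h, depth):
    """Return eps_l = 1 - c_l for l = 0..depth, starting from c_0 = 1."""
    c, eps = 1.0, [0.0]
    for _ in range(depth):
        c = (1.0 - h) * relu_corr_map(c)
        eps.append(1.0 - c)
    return eps


def rescaled_curve(h, s_values):
    """Sample eps_l / eps_star at depths l = s * xi.

    eps_star = (h / A)^(2/3) is the leading-order plateau of the toy map and
    xi = eps_star / h is the matching depth scale (both small-h estimates).
    """
    eps_star = (h / A) ** (2.0 / 3.0)
    xi = eps_star / h
    depth = int(math.ceil(max(s_values) * xi))
    eps = decorrelation_curve(h, depth)
    return [eps[int(round(s * xi))] / eps_star for s in s_values]


s_grid = [0.5, 1.0, 2.0, 4.0]
for h in (1e-3, 1e-4, 1e-5):
    row = "  ".join(f"{u:.3f}" for u in rescaled_curve(h, s_grid))
    print(f"h = {h:g}: {row}")
```

If the rescaling is appropriate, the printed rows should nearly coincide across dropout rates. The paper's collapse additionally involves detuning as a second scaling variable, which this sketch does not attempt.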

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The distinction between analytic and branched correlation maps may reappear in other stochastic regularizers that act as perturbations to signal propagation.
The derived front-loaded schedules could be tested directly on larger transformer variants or on convolutional architectures without changing the total compute budget.
If the mean-field scaling holds, similar universality classes should emerge when dropout is replaced by other depth-dependent noise sources.

Load-bearing premise

The mean-field description of dropout as a perturbation of critical propagation remains valid and the local analytic structure of the correlation map alone determines the scaling exponents and collapse.

What would settle it

A measurement showing that correlation decay versus depth in networks with varying dropout rates fails to collapse onto the predicted two-parameter surface when activations are switched from smooth to kinked.
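A measurement of this kind could start from a small finite-width simulation such as the sketch below. It assumes a bias-free ReLU MLP with Gaussian weights, inverted dropout with independent masks on the two inputs, and weight variance 2(1 − h)/width so that the signal norm stays roughly constant; it then compares the measured cosine similarity of pre-activations with the toy mean-field recursion c → (1 − h) g(c). The architecture, the dropout convention, and every name in the code are assumptions for illustration, not the paper's experimental setup.

```python
import math
import random


def relu_corr_map(c):
    """Assumed dropout-free ReLU correlation map at criticality."""
    c = max(-1.0, min(1.0, c))
    return (math.sqrt(1.0 - c * c) + (math.pi - math.acos(c)) * c) / math.pi


def mean_field(depth, h):
    """Toy mean-field prediction for the correlation after each layer."""
    c, out = 1.0, []
    for _ in range(depth):
        c = (1.0 - h) * relu_corr_map(c)
        out.append(c)
    return out


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))


def drop(z, rho, rng):
    """ReLU followed by inverted dropout with keep probability rho."""
    return [max(x, 0.0) / rho if rng.random() < rho else 0.0 for x in z]


def simulate(width=200, depth=8, h=0.1, seed=0):
    """Measured correlation per layer for two copies of the same input."""
    rng = random.Random(seed)
    rho = 1.0 - h
    sigma_w = math.sqrt(2.0 * rho / width)  # assumed critical scale for ReLU
    za = [rng.gauss(0.0, 1.0) for _ in range(width)]
    zb = list(za)  # identical inputs: correlation starts at exactly 1
    out = []
    for _ in range(depth):
        xa, xb = drop(za, rho, rng), drop(zb, rho, rng)  # independent masks
        za, zb = [], []
        for _unit in range(width):
            row = [rng.gauss(0.0, sigma_w) for _ in range(width)]
            za.append(sum(w * x for w, x in zip(row, xa)))
            zb.append(sum(w * x for w, x in zip(row, xb)))
        out.append(cosine(za, zb))
    return out


measured = simulate()
predicted = mean_field(depth=8, h=0.1)
for layer, (m, p) in enumerate(zip(measured, predicted), start=1):
    print(f"layer {layer}: measured {m:.3f}  mean-field {p:.3f}")
```

Agreement is expected only up to finite-width fluctuations. Repeating the run for several dropout rates, and for a smooth activation with its own correlation map, would give the raw curves that a collapse test needs.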

Figures

Figures reproduced from arXiv: 2605.21648 by Lucas Fernandez Sarmiento.

**Figure 1.** Critical scaling for smooth (tanh) and kinked (ReLU) activation functions, comparing tuning at zero dropout to tuning at the edge-of-chaos using a dropout field. The top row compares the different critical exponents at zero dropout and probes critical detuning decay, while the bottom row explores on critical networks with non-zero dropout. As the variables grow, higher-order effects become comparable and t…

**Figure 2.** Two-parameter crossover and scaling collapse of the dropout-deformed equation of state for the smooth universality class (tanh). Plots obtained using MFT recursion relations. The curves collapse onto a universal function after rescaling by t̃ and m̃. The kinked counterpart is shown in …

**Figure 3.** Magnitude of Hermite coefficients |a_n| for ReLU versus tanh. ReLU exhibits a slow (power-law) decay, reflecting multi-scale support across Hermite degrees, while tanh decays rapidly and concentrates most spectral mass in the lowest modes. C.5. Hermite decompositions for ReLU and tanh. Throughout, \(Z \sim \mathcal{N}(0, 1)\) and \(Dz \equiv \frac{dz}{\sqrt{2\pi}}\, e^{-z^{2}/2}\) (Eq. 155). We expand the fixed point rescaled activation f(z) in an orthonor…

**Figure 4.** Kinked counterpart to …

**Figure 5.** Finite-width MLP training and test curves at fixed mean dropout budget h̄ (with 0 ≤ h_ℓ ≤ h_max). We compare uniform dropout, linear ramps (increasing/decreasing with depth), and step schedules that concentrate dropout in either the first or last half of the network, together with the no-dropout baseline. The budget-control experiment checks a simpler explanation: early-concentrated dropout may only be winni…

**Figure 6.** Matched-budget controls comparing front-loaded step schedules against constant dropout fields at h̄, 2h̄, and 3h̄. If the step schedules succeed merely because they apply locally higher dropout rates, then the uniform schedules with matching dropout should perform at least as well. In contrast, if spatial allocation genuinely matters, the step schedules should outperform their uniform counterparts despite …

**Figure 7.** Robustness sweeps for depth-6 near-critical ReLU MLPs on CIFAR-10. Early schedules improve over constant dropout throughout the large-width regime, while N = 64 illustrates the expected finite-width boundary. [Plot: "Scheduling advantage relative to constant dropout"; x-axis: mean dropout field h̄; y-axis: Δ best test accuracy (%); legend: Step (early) - Constant, Big step (1/3) - Con…]

**Figure 8.** Smooth-activation h-sweep for depth-6 near-critical GELU MLPs on CIFAR-10 at width N = 256. Step-like schedules improve over constant dropout around h̄ = 0.1, while the largest field leaves the small-dropout regime where the mean-field perturbation is expected to be predictive.

**Figure 9.** Finite-width Vision Transformer training and test curves at fixed mean dropout budget h̄. We compare the no-dropout baseline, uniform dropout, decreasing linear ramps, and early step schedules. The cropped view shows epochs 20–75 for readability; the full curve is also in App. D. Accuracy curves are shown in …

**Figure 10.** Extended training curves for …

**Figure 11.** Component ablations applying dropout schedules to the attention block, MLP block, or both. The figure compares no dropout, constant dropout, and step-like dropout …

Original abstract

We develop a mean-field theory of dropout as a perturbation of critical signal propagation at the edge of chaos, and show that it predicts a simple, no-cost change to standard practice: front-loaded dropout schedules cut test loss by 18–35% over constant dropout in MLPs and Vision Transformers at fixed budget. The theoretical mechanism is that dropout shifts the perfect-alignment fixed point, making the depth scale for information propagation finite even at critical initialization. We derive critical and crossover scaling laws for correlation decay and establish that smooth activations and kinked, ReLU-like activations constitute distinct universality classes, with different critical exponents and a universal two-parameter scaling collapse in detuning and dropout strength. The distinction traces to the analytic structure of the correlation map: smooth activations admit a Taylor expansion near perfect alignment, while kinked activations develop a branch point with universal non-analyticity. As a corollary, the framework yields saturated dropout profiles under fixed budget; a regularization-reach argument then selects front-loaded schedules, with accuracy gains as a consistent secondary effect. We also discuss how the same Gaussian-kernel structure extends the theory beyond MLPs toward CNNs and residual architectures.
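The statement that the depth scale becomes finite even at critical initialization can be checked in a toy setting. The sketch below assumes the standard ReLU arc-cosine correlation map, dropout entering as a factor 1 − h on the map, and the usual definition of the depth scale as ξ = −1 / ln χ, where χ is the slope of the map at its fixed point. These choices and the function names are assumptions of the sketch, not the paper's derivation.

```python
import math


def relu_corr_map(c):
    """Assumed dropout-free ReLU correlation map at criticality."""
    c = max(-1.0, min(1.0, c))
    return (math.sqrt(1.0 - c * c) + (math.pi - math.acos(c)) * c) / math.pi


def relu_corr_slope(c):
    """Derivative of relu_corr_map with respect to c."""
    c = max(-1.0, min(1.0, c))
    return 1.0 - math.acos(c) / math.pi


def depth_scale(h, iters=20000):
    """Depth scale xi = -1 / ln(chi) at the fixed point of c -> (1 - h) g(c)."""
    rho = 1.0 - h
    c = 1.0
    for _ in range(iters):
        c = rho * relu_corr_map(c)
    chi = rho * relu_corr_slope(c)  # contraction rate near the fixed point
    if chi >= 1.0:
        return math.inf  # marginal fixed point: no finite exponential scale
    return -1.0 / math.log(chi)


for h in (0.0, 1e-4, 1e-3, 1e-2, 1e-1):
    print(f"h = {h:g}  xi = {depth_scale(h):.2f} layers")
```

With h = 0 the fixed point at c = 1 is marginal and the function returns infinity; any positive h gives a finite number of layers that shrinks as h grows.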

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper derives dropout scaling laws from a mean-field correlation map at the edge of chaos and shows front-loaded schedules cut test loss at fixed budget.

The core contribution is a perturbation treatment of dropout around the perfect-alignment fixed point. This produces critical and crossover scaling laws for correlation decay, plus a clean split into universality classes: smooth activations allow a Taylor expansion near alignment, while kinked ones like ReLU introduce a branch-point non-analyticity. The result is different exponents and a two-parameter collapse in detuning and dropout strength. As a direct corollary they derive saturated dropout profiles and use a rank-flow rule to select front-loaded schedules, which they test on MLPs and Vision Transformers with measurable held-out loss reductions and no added compute.
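To make the schedule families concrete, here is a small sketch that builds per-layer dropout rates sharing one mean budget. It assumes the budget constraint is a fixed mean rate with each layer's rate bounded by a cap, and it reads "saturated, front-loaded" as filling layers from the input side at the cap until the budget runs out. The function names, the midpoint ramp, and that reading are assumptions for illustration; the paper's exact profiles and its selection rule are not reproduced here.

```python
def uniform(depth, h_bar):
    """Same dropout rate in every layer."""
    return [h_bar] * depth


def linear_decreasing(depth, h_bar):
    """Linear ramp from about 2 * h_bar down to about 0 with mean h_bar."""
    return [2.0 * h_bar * (1.0 - (layer + 0.5) / depth) for layer in range(depth)]


def step_early(depth, h_bar, fraction=0.5):
    """All dropout in the first `fraction` of layers, none afterwards."""
    k = max(1, round(fraction * depth))
    level = h_bar * depth / k
    return [level] * k + [0.0] * (depth - k)


def saturated_front_loaded(depth, h_bar, h_max):
    """Fill layers from the front at the cap h_max until the budget is spent."""
    budget = h_bar * depth
    if budget > h_max * depth:
        raise ValueError("mean budget exceeds the per-layer cap")
    schedule = []
    for _ in range(depth):
        h = min(h_max, budget)
        schedule.append(h)
        budget -= h
    return schedule


depth, h_bar, h_max = 6, 0.1, 0.3
for name, sched in [
    ("uniform", uniform(depth, h_bar)),
    ("linear decreasing", linear_decreasing(depth, h_bar)),
    ("step (early)", step_early(depth, h_bar)),
    ("saturated front-loaded", saturated_front_loaded(depth, h_bar, h_max)),
]:
    mean = sum(sched) / depth
    print(f"{name:>24}: {[round(h, 3) for h in sched]}  mean = {mean:.3f}")
```

Each list can be passed layer by layer to whatever dropout module the model uses. Because all four share the same mean, a comparison between them isolates where the dropout is placed rather than how much is applied.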

Referee Report

2 major / 2 minor

Summary. The manuscript develops a mean-field theory treating dropout as a perturbation around critical signal propagation at the edge of chaos. It derives critical and crossover scaling laws for correlation decay, identifies distinct universality classes for smooth versus kinked (ReLU-like) activations arising from the analytic structure of the correlation map (Taylor expansion versus branch-point non-analyticity), and obtains a universal two-parameter scaling collapse in detuning and dropout strength. As a corollary the framework produces saturated dropout profiles and a rank-flow tie-breaker that selects front-loaded schedules, which are shown to reduce held-out test loss in MLPs and Vision Transformers at fixed computational budget.

Significance. If the mean-field correlation map and its perturbation analysis hold, the work supplies a principled explanation for dropout’s effect on information propagation and yields falsifiable scaling predictions together with a practical scheduling rule that improves performance without extra cost. The explicit separation into universality classes and the two-parameter collapse constitute a clear theoretical advance over existing edge-of-chaos analyses that treat dropout only phenomenologically.

major comments (2)

§4.1–4.3 (correlation map derivation): the claim that dropout remains a local perturbation around the shifted fixed point for finite dropout rates lacks explicit error bounds or radius-of-convergence estimates; higher-order stochastic corrections from mask averaging could modify the leading singularity and thereby invalidate the extracted critical exponents and the asserted universality classes.
Eq. (12) and surrounding text (branch-point analysis for kinked activations): the preservation of the universal non-analyticity under the stochastic average over dropout masks is asserted but not demonstrated with a controlled expansion; an explicit calculation showing that the branch-point singularity survives to leading order is required to support the distinction between the two universality classes.

minor comments (2)

Figure 3 caption: the scaling-collapse axes are labeled only by symbols; add explicit definitions of the rescaled variables to allow readers to reproduce the collapse without consulting the main text.
§6.2 (empirical validation): report the number of independent runs and the standard error on the reported test-loss reductions so that the statistical significance of the front-loaded schedule advantage can be assessed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. The points raised highlight areas where additional rigor can strengthen the presentation of the mean-field perturbation analysis. We address each major comment below.

Referee: §4.1–4.3 (correlation map derivation): the claim that dropout remains a local perturbation around the shifted fixed point for finite dropout rates lacks explicit error bounds or radius-of-convergence estimates; higher-order stochastic corrections from mask averaging could modify the leading singularity and thereby invalidate the extracted critical exponents and the asserted universality classes.

Authors: We agree that explicit error bounds would improve the manuscript. The derivation treats dropout as a controlled shift of the fixed point in the mean-field limit, with higher-order mask corrections suppressed by factors of p(1-p). In the revised version we will add a controlled expansion to second order in the perturbation parameter together with a radius-of-convergence estimate based on the Lipschitz constant of the correlation map, confirming that the leading singularity and extracted exponents remain valid throughout the scaling regimes considered.
Revision: partial

Referee: Eq. (12) and surrounding text (branch-point analysis for kinked activations): the preservation of the universal non-analyticity under the stochastic average over dropout masks is asserted but not demonstrated with a controlled expansion; an explicit calculation showing that the branch-point singularity survives to leading order is required to support the distinction between the two universality classes.

Authors: We will supply the requested explicit calculation. Because the stochastic average is a linear operation, it acts term-by-term on the Taylor or Puiseux expansion of the correlation map. For kinked activations the leading non-analytic contribution is a branch-point term whose coefficient is independent of the mask realization; averaging therefore leaves the |Δ|^{3/2} (or equivalent) singularity intact to leading order in dropout strength. The revised manuscript will include this controlled expansion, thereby rigorously separating the two universality classes.
Revision: yes

Circularity Check

0 steps flagged

Mean-field derivation of scaling laws is self-contained with no reduction to inputs

full rationale

The paper constructs a mean-field theory starting from critical signal propagation at the edge of chaos, then perturbs it with dropout to shift the fixed point and extract scaling laws from the resulting correlation map. The universality classes are distinguished by the intrinsic analytic properties of that map (Taylor expansion for smooth activations versus branch-point non-analyticity for kinked ones), which are structural features of the activation functions rather than quantities fitted or defined from the target scaling predictions. No equations or steps in the provided derivation chain reduce by construction to fitted parameters renamed as predictions, self-citations that bear the central load, or ansatzes smuggled in without independent justification. External tests on MLPs and Vision Transformers supply falsifiable checks outside the fitted values, keeping the derivation independent.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

Abstract-only review yields limited visibility into specific parameters or axioms; the framework rests on mean-field approximations and analytic properties of an unspecified correlation map.

free parameters (2)

detuning: one of the two parameters in the universal scaling collapse, alongside dropout strength.
dropout strength: controls the perturbation strength and enters the scaling collapse and fixed-point shift.

axioms (2)

domain assumption: Mean-field theory applies to dropout as a perturbation of critical signal propagation. Invoked to shift the perfect-alignment fixed point and derive the finite depth scale and scaling laws.
domain assumption: The analytic structure of the correlation map determines the universality class. Used to separate smooth (Taylor-expandable) from kinked (branch-point) activations.

pith-pipeline@v0.9.0 · 5700 in / 1524 out tokens · 38823 ms · 2026-05-22T09:02:25.369422+00:00 · methodology
