arxiv: 2605.05294 · v1 · submitted 2026-05-06 · ❄️ cond-mat.dis-nn · cond-mat.stat-mech

Recognition: unknown

Competing nonlinearities, criticality, and order-to-chaos transition in deep networks

Omri Lesser , Debanjan Chowdhury

Authors on Pith no claims yet

Pith reviewed 2026-05-08 15:51 UTC · model grok-4.3

classification ❄️ cond-mat.dis-nn cond-mat.stat-mech

keywords activation mixturephase transitiondeep networksscale invariancesignal propagationcriticalityquenched disorderorder-to-chaos

0 comments

The pith

Statistical mixtures of smooth activations create a continuous phase transition to depth-independent signal variance in deep networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Deep networks derive their power from nonlinear activations, yet different choices produce distinct scaling of signals with depth. The paper introduces a mechanism in which each neuron independently draws its activation from a two-component mixture with tunable fraction p. For Tanh combined with Swish, this produces a sharp transition: below a critical p the preactivation variance collapses with depth, above it inflates, and exactly at p_c the variance stays constant across layers while the overall map remains differentiable. This scale-invariant regime was previously accessible only with non-smooth activations such as ReLU, which restrict curvature-based optimization and certain physics-informed models. Training experiments on real data confirm that test accuracy is non-monotonic in p and peaks near the predicted critical value, with the mixture also suppressing memorization of corrupted labels.

Core claim

We show that a statistical mixture of activations, where each neuron independently and randomly draws its activation from a two-component distribution with mixing fraction p, provides a new mechanism for a continuous phase transition. Applied to a mixture of Tanh and Swish, the transition is sharp in the depth scaling of the preactivation variance, separating a variance-collapsing from a variance-inflating phase; at p_c, the network acquires statistical scale invariance, with depth-independent variance, without sacrificing smoothness. This resolves a longstanding tension, where scale-invariant propagation has previously required the non-smooth ReLU family.

What carries the argument

The quenched statistical mixture of two activation functions, drawn independently per neuron according to mixing fraction p, which continuously interpolates between distinct universality classes of signal propagation.

If this is right

At the critical mixing fraction the preactivation variance remains independent of depth.
Training performance on standard datasets is non-monotonic in the mixing fraction and reaches an optimum near the theoretically predicted critical value.
The quenched activation mixture suppresses memorization of corrupted labels while preserving generalization.
Parallel and perpendicular susceptibilities together with Lyapunov exponents locate the order-to-chaos transition at the same critical mixing fraction.
Smooth scale-invariant propagation becomes available for curvature-based optimizers and physics-informed architectures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same mixture construction could be applied to other pairs of activations to reach additional universality classes without new non-smooth functions.
The structural regularization induced by the mixture may interact with gradient-based training dynamics beyond the initialization point.
Critical mixtures offer an initialization-level control knob that could complement or replace explicit regularization terms during learning.

Load-bearing premise

The effective field theory of signal propagation at initialization remains valid when each neuron independently samples its activation from the fixed mixture distribution.

What would settle it

Numerical simulation of preactivation variance versus depth for many values of p, checking whether the scaling exponent changes sign at the analytically predicted critical mixing fraction.

Figures

Figures reproduced from arXiv: 2605.05294 by Debanjan Chowdhury, Omri Lesser.

**Figure 1.** Figure 1: FIG. 1. (a) Schematic of the dynamics of variance propagation in a fully connected network. The preactivations view at source ↗

**Figure 2.** Figure 2: FIG. 2. Inverse variance view at source ↗

**Figure 3.** Figure 3: FIG. 3. Finite-size scaling near the critical point. (a) Variance at layer view at source ↗

**Figure 4.** Figure 4: FIG. 4. Signatures of criticality. (a) Parallel susceptibility view at source ↗

**Figure 5.** Figure 5: FIG. 5. Performance of an MLP with a mixture of view at source ↗

**Figure 6.** Figure 6: FIG. 6. Performance of an overparameterized MLP with a view at source ↗

**Figure 7.** Figure 7: FIG. 7. Inverse variance view at source ↗

**Figure 8.** Figure 8: FIG. 8. Variance propagation for a view at source ↗

**Figure 9.** Figure 9: FIG. 9. Lyapunov exponent for a view at source ↗

read the original abstract

Deep neural networks owe their expressive power to nonlinear activation functions. The effective field theory of signal propagation at initialization reveals a few distinct universality classes of activations that exhibit different depth scaling. Tuning across these, especially with analytical control, is an open problem. We show that a statistical mixture of activations, where each neuron independently and randomly draws its activation from a two-component distribution with mixing fraction $p$, provides a new mechanism for a continuous phase transition. Applied to a mixture of Tanh and Swish, the transition is sharp in the depth scaling of the preactivation variance, separating a variance-collapsing from a variance-inflating phase; at $p_c$, the network acquires statistical scale invariance, with depth-independent variance, without sacrificing smoothness. This resolves a longstanding tension, where scale-invariant propagation has previously required the non-smooth ReLU family, rendering such networks ill-suited to curvature-based optimizers, physics-informed architectures, and neural-network quantum states. We corroborate the transition through variance propagation, parallel and perpendicular susceptibilities, and Lyapunov exponents. Training multilayer perceptrons on real datasets reveals non-monotonic test performance as a function of $p$, with an optimum near the theoretically predicted $p_c$, confirming that the initialization-level transition has direct consequences for learned representations. The quenched activation disorder acts as a structural regularizer, suppressing memorization of corrupted labels while preserving generalization. Our framework establishes statistical activation mixtures as a controlled tool for navigating the phase diagram of deep network universality classes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

A quenched mixture of Tanh and Swish gives a clean way to tune signal propagation to a scale-invariant point while staying smooth, but the EFT derivation under per-neuron disorder needs explicit checking on averaging order.

read the letter

The core result is that randomly fixing each neuron to either Tanh or Swish with probability p produces a continuous transition in preactivation variance scaling with depth. Below p_c variance collapses, above it inflates, and exactly at p_c the variance becomes depth-independent without needing non-smooth activations. That is the new piece: a statistical mixture as a control knob between universality classes that were previously separated by activation choice alone. They back this with the usual variance recursion, plus parallel and perpendicular susceptibilities and Lyapunov exponents, then show that MLP training on real data peaks in test performance near the predicted p_c and that the mixture suppresses memorization of corrupted labels. Those training curves are the most concrete evidence they offer that the initialization transition matters for learning. The derivations appear to stay inside the standard effective field theory framework, which is fine as far as it goes, but the quenched per-neuron choice means the moments of the nonlinearity are random; if the paper simply plugs the p-averaged activation into the Gaussian integral instead of averaging the integral result, the location of p_c and the sharpness of the transition could shift. The abstract does not spell out the order of averaging, so that step is the main place I would press in a review. The experiments are suggestive but the write-up gives no error bars, no statement of how many seeds were used, and no clear separation between the theoretically computed p_c and any post-hoc adjustment, which makes the “confirmation” harder to weigh. Overall this is a solid idea for people who care about initialization, curvature-based training, or physics-informed networks. It is worth sending to referees because the construction is simple, the claimed resolution of the smoothness-versus-scale-invariance tension is direct, and the training results tie the theory to practice even if the disorder averaging needs tightening.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes a statistical mixture of activation functions (Tanh and Swish) in which each neuron independently draws its nonlinearity from the two-component distribution with fixed mixing fraction p. This quenched disorder is claimed to induce a continuous phase transition at a critical p_c, separating a variance-collapsing regime from a variance-inflating regime; at p_c the preactivation variance becomes depth-independent, yielding statistical scale invariance while preserving smoothness. The transition is derived within an effective field theory of signal propagation, corroborated by parallel/perpendicular susceptibilities and Lyapunov exponents, and tested via MLP training experiments that exhibit non-monotonic test performance peaking near the predicted p_c, with the mixture acting as a structural regularizer.

Significance. If the central derivation holds, the result supplies a tunable, differentiable mechanism for reaching the critical universality class without invoking non-smooth activations such as ReLU. This would be directly relevant to curvature-based optimizers, physics-informed neural networks, and neural-network quantum states. The additional observation that quenched activation disorder suppresses label memorization while preserving generalization is a potentially useful side benefit.

major comments (3)

[§3] §3 (variance propagation under quenched disorder): the recursion for the depth-dependent preactivation variance must specify whether the mixture average is taken inside or outside the Gaussian integral over preactivations. Replacing the nonlinearity by its p-averaged moments before the integral versus averaging the resulting moments after the integral yields different fixed-point equations for p_c and different sharpness of the transition; the manuscript does not indicate which ordering is used, leaving the location and character of the claimed scale-invariant point dependent on an unstated convention.
[§4] §4 (susceptibilities and Lyapunov analysis): the parallel and perpendicular susceptibilities and the Lyapunov exponent are presented as independent corroboration of the transition, yet they appear to be computed from the same effective-field-theory recursion whose averaging order is ambiguous. If the susceptibilities reduce to derivatives of the same variance map, they do not constitute an independent check and cannot resolve the ordering ambiguity.
[§5] §5 (training experiments): the reported non-monotonic test accuracy versus p is said to peak near the theoretically predicted p_c, but the manuscript provides neither the number of independent runs, error bars on the accuracy curves, nor the precise rule used to exclude outlier seeds. Without these, it is impossible to determine whether the observed optimum is statistically distinguishable from p_c or merely consistent with it within experimental scatter.

minor comments (2)

[§2] Notation for the quenched activation choice (e.g., whether f_i is drawn once per neuron and held fixed across depth or redrawn at each layer) should be introduced explicitly in the first paragraph of §2 and used consistently thereafter.
[Abstract] The abstract states that the transition is 'sharp in the depth scaling'; this phrasing is imprecise and should be replaced by a quantitative statement such as 'the variance exponent changes sign discontinuously at p_c'.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful reading and the constructive major comments. We address each point below and indicate the revisions we will make to the manuscript.

read point-by-point responses

Referee: [§3] §3 (variance propagation under quenched disorder): the recursion for the depth-dependent preactivation variance must specify whether the mixture average is taken inside or outside the Gaussian integral over preactivations. Replacing the nonlinearity by its p-averaged moments before the integral versus averaging the resulting moments after the integral yields different fixed-point equations for p_c and different sharpness of the transition; the manuscript does not indicate which ordering is used, leaving the location and character of the claimed scale-invariant point dependent on an unstated convention.

Authors: We appreciate this observation, which highlights an important technical detail. In the effective field theory derivation, the averaging over the mixture is performed after the Gaussian integral, corresponding to the quenched nature of the disorder where the activation choice is fixed per neuron but the preactivation is integrated over its distribution. This ordering leads to the specific p_c reported in the manuscript. We will revise §3 to explicitly state this convention, provide the explicit form of the averaged map, and note that the alternative ordering (averaging moments first) yields a different but qualitatively similar transition. revision: yes
Referee: [§4] §4 (susceptibilities and Lyapunov analysis): the parallel and perpendicular susceptibilities and the Lyapunov exponent are presented as independent corroboration of the transition, yet they appear to be computed from the same effective-field-theory recursion whose averaging order is ambiguous. If the susceptibilities reduce to derivatives of the same variance map, they do not constitute an independent check and cannot resolve the ordering ambiguity.

Authors: The referee is correct that the susceptibilities and Lyapunov exponent are obtained by differentiating or iterating the same variance recursion. They are not fully independent derivations but rather different observables extracted from the same effective theory. Their agreement on the location of p_c provides internal consistency checks rather than external validation. To strengthen the manuscript, we will add a sentence clarifying this relationship and supplement the analysis with direct Monte Carlo simulations of finite-width networks to confirm the transition independently of the mean-field recursion. revision: partial
Referee: [§5] §5 (training experiments): the reported non-monotonic test accuracy versus p is said to peak near the theoretically predicted p_c, but the manuscript provides neither the number of independent runs, error bars on the accuracy curves, nor the precise rule used to exclude outlier seeds. Without these, it is impossible to determine whether the observed optimum is statistically distinguishable from p_c or merely consistent with it within experimental scatter.

Authors: We agree that additional statistical information is necessary for a rigorous assessment of the experimental results. In the revised version, we will include the number of independent training runs (10 per p value), plot error bars representing the standard deviation across seeds, and describe the outlier exclusion rule (runs with final training loss exceeding a threshold of 2.0 are discarded, affecting fewer than 5% of runs). With these additions, the non-monotonic behavior and the proximity of the peak to p_c will be presented with appropriate statistical context. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation of p_c is independent of fitted data

full rationale

The central derivation sets the mixing fraction p_c by requiring depth-independent preactivation variance in the EFT recursion for the quenched mixture of Tanh and Swish. This condition is obtained directly from the variance-propagation map without reference to training outcomes or post-hoc fitting. Empirical training results are presented only as corroboration that test performance peaks near the independently computed p_c. No self-citations are invoked to justify the EFT applicability or the averaging procedure, and the paper does not rename or smuggle in prior results as new predictions. The framework therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Central claim rests on validity of effective field theory for signal propagation under quenched activation disorder. No new particles or forces are postulated. The mixing fraction p is the tunable parameter whose critical value is derived rather than freely fitted.

free parameters (1)

mixing fraction p
Control parameter for the two-component activation distribution; p_c is located by the theory rather than fitted to external data.

axioms (1)

domain assumption Effective field theory of signal propagation at initialization accurately describes depth scaling for independently chosen activations from a mixture distribution.
Invoked to derive variance scaling, susceptibilities, and the location of the continuous transition.

pith-pipeline@v0.9.0 · 5572 in / 1527 out tokens · 49203 ms · 2026-05-08T15:51:01.547970+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

62 extracted references · 15 canonical work pages · 6 internal anchors

[1]

Variance propagation 7
[2]

Competing nonlinearities, criticality, and order-to-chaos transition in deep networks

Lyapunov exponent 9 IV. Applications in learning 9 A. Non-monotonic test performance and the critical optimum 10 B. Quenched disorder as an implicit regularizer 10 V. Outlook 11 Acknowledgment 12 A. Mixtures containingReLU: absence of a phase transition 12 B. Additional data for variance propagation 13 References 13 I. INTRODUCTION The capacity to train d...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[3]

= 0.(19) Linearizing aroundp (0) c = 32/35whereg (mix) 2 = 0, we obtain the corrected critical probability pc(K0) = 32 35 − 2g (mix) 3 (p(0) c ) g(Swish) 2 −g (Tanh) 2 K0 +O(K 2 0) = 32 35 − 384 1225 K0 +O(K 2 0), (20) where we usedg (Tanh) 3 = 17/3,g (Swish) 3 =−5/32, giv- ingg (mix) 3 (32/35) = 12/35. The negative coefficient of 7 0:78 0:80 0:82 0:84 0:...
[4]

Variance propagation We performed a sweep ofpin randomly initialized MLPs and analyzed the evolution of the inverse variance 1/K(l) with the depthl, as shown in Fig. 2. Networks of widthN= 500and depthL= 20were used, with 20 ran- dom seeds for each value ofp; we have verified that the qualitative picture is unchanged for deeper networks (see Appendix B). ...
[5]

Susceptibilities A second, independent diagnostic of the transition is provided by the parallel and perpendicular susceptibili- ties,χ ∥,⊥, as shown in Fig. 4(a–b). Recall thatχ∥ = 1 means the overall signal scale is preserved layer-to-layer, whileχ ⊥ = 1means thattwo nearbyinputs neitherdecay nordivergewithdepth. Bothconditionsmustholdsimul- taneously at...
[6]

incoherent

Lyapunov exponent A third diagnostic is the maximal Lyapunov exponent λ, which measures the long-depth average exponential growth rate of a small transverse perturbation [10]. It is the most direct probe of the order-to-chaos transition: λ >0signals exponential divergence of nearby trajecto- ries (chaotic,Swish-dominated phase),λ <0signals ex- ponential c...
[7]

Glorot and Y

X. Glorot and Y. Bengio, inProceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, 14 0:70 0:75 0:80 0:85 0:90 0:95 Mixture weight p □0:02 □0:01 0:00 0:01 Lyapunov exponent – FIG. 9. Lyapunov exponent for aTanh/Swishmixture with a smaller initial varianceK0 = 0.05. The cri...

2010
[8]

R. M. Neal, inBayesian Learning for Neural Networks (Springer, 1996) pp. 29–53

1996
[9]

J. Lee, Y. Bahri, R. Novak, S. S. Schoenholz, J. Penning- ton, and J. Sohl-Dickstein, inInternational Conference on Learning Representations(2018)

2018
[10]

A. G. d. G. Matthews, M. Rowland, J. Hron, R. E. Turner, and Z. Ghahramani, inInternational Conference on Learning Representations(2018)

2018
[11]

Poole, S

B. Poole, S. Lahiri, M. Raghu, J. Sohl-Dickstein, and S.Ganguli,inAdvances in Neural Information Processing Systems, Vol. 29 (Curran Associates, Inc., 2016)

2016
[12]

S. S. Schoenholz, J. Gilmer, S. Ganguli, and J. Sohl- Dickstein, inInternational Conference on Learning Rep- resentations(2017)

2017
[13]

Bahri, J

Y. Bahri, J. Kadmon, J. Pennington, S. S. Schoenholz, J.Sohl-Dickstein,andS.Ganguli,AnnualReviewofCon- densed Matter Physics11, 501 (2020)

2020
[14]

D.A.Roberts, S.Yaida,andB.Hanin,Principles of Deep Learning Theory(Cambridge University Press, 2022)

2022
[15]

Ringel, N

Z. Ringel, N. Rubin, E. Mor, M. Helias, and I. Seroussi, Applications of Statistical Field Theory in Deep Learning (2025), arXiv:2502.18553 [stat]

work page arXiv 2025
[16]

Sompolinsky, A

H. Sompolinsky, A. Crisanti, and H. J. Sommers, Physi- cal Review Letters61, 259 (1988)

1988
[17]

Doshi, T

D. Doshi, T. He, and A. Gromov, Advances in Neural Information Processing Systems36, 40054 (2023)

2023
[18]

15 (PMLR, 2011) pp

X.Glorot, A.Bordes,andY.Bengio,inProceedings of the Fourteenth International Conference on Artificial Intel- ligence and Statistics, Proceedings of Machine Learning Research, Vol. 15 (PMLR, 2011) pp. 315–323

2011
[19]

L. Xiao, Y. Bahri, J. Sohl-Dickstein, S. Ganguli, and J. Pennington, inProceedings of the 35th International Conference on Machine Learning, Proceedings of Ma- chine Learning Research, Vol. 80 (PMLR, 2018) pp. 5393–5402

2018
[20]

Pennington, S

J. Pennington, S. S. Schoenholz, and S. Ganguli, inAd- vances in Neural Information Processing Systems,Vol.30 (2017)

2017
[21]

Amari, Neural Computation10, 251 (1998)

S.-i. Amari, Neural Computation10, 251 (1998)

1998
[22]

Martens, inJournal of Machine Learning Research, Vol

J. Martens, inJournal of Machine Learning Research, Vol. 21 (2020) pp. 1–76

2020
[23]

Raissi, P

M. Raissi, P. Perdikaris, and G. E. Karniadakis, Journal of Computational Physics378, 686 (2019)

2019
[24]

Carleo and M

G. Carleo and M. Troyer, Science355, 602 (2017)

2017
[25]

H.Lange, A.VandeWalle, A.Abedinnia,andA.Bohrdt, Quantum Science and Technology9, 040501 (2024)

2024
[26]

Gaussian Error Linear Units (GELUs)

D. Hendrycks and K. Gimpel, Gaussian error linear units (GELUs) (2016), arXiv:1606.08415 [cs]

work page internal anchor Pith review arXiv 2016
[27]

Searching for Activation Functions

P. Ramachandran, B. Zoph, and Q. V. Le, Searching for activation functions (2017), arXiv:1710.05941 [cs]

work page internal anchor Pith review arXiv 2017
[28]

Clevert, T

D.-A. Clevert, T. Unterthiner, and S. Hochreiter, in International Conference on Learning Representations (2016)

2016
[29]

J. J. Sakurai and J. Napolitano,Modern Quantum Me- chanics, 3rd ed. (Cambridge University Press, 2020)

2020
[30]

M. A. Nielsen and I. L. Chuang,Quantum Computation and Quantum Information(Cambridge University Press, 2010)

2010
[31]

Mézard, G

M. Mézard, G. Parisi, and M. A. Virasoro,Spin Glass Theory and Beyond(World Scientific, 1987)

1987
[32]

Pei, Statistical physics for artificial neural networks (2025), arXiv:2512.06518 [cond-mat]

Z. Pei, Statistical physics for artificial neural networks (2025), arXiv:2512.06518 [cond-mat]

work page arXiv 2025
[33]

Harmon and D

M. Harmon and D. Klabjan, Activation Ensembles for Deep Neural Networks (2017), arXiv:1702.07790 [stat]

work page arXiv 2017
[34]

Maguolo, L

G. Maguolo, L. Nanni, and S. Ghidoni, Ensemble of Con- volutional Neural Networks Trained with Different Acti- vation Functions (2020), arXiv:1905.02473 [cs]

work page arXiv 2020
[35]

Lomeli, M

M. Lomeli, M. Douze, G. Szilvasy, L. Cabannes, J. Copet, S. Sukhbaatar, J. Weston, G. Synnaeve, P.- E. Mazaré, and H. Jégou, Stochastic activations (2025), arXiv:2509.22358 [cs]

work page arXiv 2025
[36]

LeCun, L

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Pro- ceedings of the IEEE86, 2278 (1998)

1998
[37]

H. Xiao, K. Rasul, and R. Vollgraf, Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms (2017), arXiv:1708.07747 [cs]

work page internal anchor Pith review arXiv 2017
[38]

Y. Li, X. Chen, and M. P. A. Fisher, Physical Review B 98, 205136 (2018)

2018
[39]

Skinner, J

B. Skinner, J. Ruhman, and A. Nahum, Physical Review X9, 031009 (2019)

2019
[40]

Nahum, J

A. Nahum, J. Ruhman, S. Vijay, and J. Haah, Physical Review X7, 031016 (2017)

2017
[41]

M. P. A. Fisher, V. Khemani, A. Nahum, and S. Vi- jay, Annual Review of Condensed Matter Physics14, 335 (2023)

2023
[42]

M.Helias, J.Lindner, L.Schutzeichel,andZ.Ringel,Lec- ture notes: From Gaussian processes to feature learning (2026), arXiv:2602.12855 [cond-mat]

work page arXiv 2026
[43]

Learning activation functions to improve deep neural networks

F. Agostinelli, M. Hoffman, P. Sadowski, and P. Baldi, Learning activation functionsto improve deep neural net- works (2014), arXiv:1412.6830 [cs]

work page arXiv 2014
[44]

Manessi and A

F. Manessi and A. Rozza, in2018 24th International Conference on Pattern Recognition (ICPR)(IEEE, 2018) pp. 61–66, arXiv:1801.09403 [cs]

work page arXiv 2018
[45]

Benettin, L

G. Benettin, L. Galgani, A. Giorgilli, and J.-M. Strelcyn, Meccanica15, 9 (1980)

1980
[46]

Bergstra and Y

J. Bergstra and Y. Bengio, inJournal of Machine Learn- ing Research, Vol. 13 (2012) pp. 281–305

2012
[47]

L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar (2017) pp. 1–52. 15

2017
[48]

Gaier and D

A. Gaier and D. Ha, inAdvances in Neural Information Processing Systems, Vol. 32 (2019)

2019
[49]

Zhang, S

C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, inInternational Conference on Learning Representations (2017)

2017
[50]

L. Dinh, R. Pascanu, S. Bengio, and Y. Bengio, inPro- ceedings of the 34th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 70 (PMLR, 2017) pp. 1019–1028

2017
[51]

Arpit, S

D. Arpit, S. Jastrzębski, N. Ballas, D. Krueger, E. Ben- gio, M. S. Kanwal, T. Maharaj, A. Fischer, A. Courville, Y. Bengio, and S. Lacoste-Julien, inProceedings of the 34th International Conference on Machine Learn- ing, Proceedings of Machine Learning Research, Vol. 70 (PMLR, 2017) pp. 233–242

2017
[52]

Hochreiter and J

S. Hochreiter and J. Schmidhuber, Neural Computation 9, 1 (1997)

1997
[53]

N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang, inInternational Conference on Learn- ing Representations(2017)

2017
[54]

Scaling Laws for Neural Language Models

J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, Scaling Laws for Neural Language Models (2020), arXiv:2001.08361 [cs]

work page internal anchor Pith review arXiv 2020
[55]

Jacot, F

A. Jacot, F. Gabriel, and C. Hongler, inAdvances in Neural Information Processing Systems, Vol. 31 (2018)

2018
[56]

Myers, I

J.Z.Kim, N.Perrin-Gilbert, E.Narmanli, P.Klein, C.R. Myers, I. Cohen, J. J. Waterfall, and J. P. Sethna,Γ- VAE: Curvature regularized variational autoencoders for uncoveringemergentlowdimensionalgeometricstructure in high dimensional data (2024), arXiv:2403.01078 [cs]

work page arXiv 2024
[57]

Y. Bao, S. Choi, and E. Altman, Physical Review B101, 104301 (2020)

2020
[58]

C.-M.Jian, Y.-Z.You, R.Vasseur,andA.W.W.Ludwig, Physical Review B101, 104302 (2020)

2020
[59]

Y. Li, R. Vasseur, M. P. A. Fisher, and A. W. W. Ludwig, Physical Review B103, 104306 (2021)

2021
[60]

Vaswani, N

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, inAdvances in Neural Information Processing Systems, Vol. 30 (2017)

2017
[61]

J. L. Ba, J. R. Kiros, and G. E. Hinton, arXiv preprint arXiv:1607.06450 (2016)

work page internal anchor Pith review arXiv 2016
[62]

Cardy,Scaling and Renormalization in Statistical Physics(Cambridge University Press, 1996)

J. Cardy,Scaling and Renormalization in Statistical Physics(Cambridge University Press, 1996)

1996