A Random-Matrix Criterion for Initializing Gated Recurrent Neural Networks

Francesco Casola; Riccardo Marcaccioli; Tommaso Fioratti

arXiv: 2605.10650 · v1 · submitted 2026-05-11 · cs.LG · cond-mat.dis-nn


Pith reviewed 2026-05-12 04:31 UTC · model grok-4.3


keywords: random matrix theory, gated recurrent networks, reservoir computing, weight initialization, edge of chaos, chaotic forecasting, critical gain


The pith

A random-matrix criterion estimates the critical initialization gain where gated RNN reservoirs reach peak performance on chaotic tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper derives a simple criterion, based on random-matrix analysis in the infinite-width limit, to locate the critical weight variance g_c that marks the boundary between ordered and chaotic dynamics in gated recurrent networks. This matters because reservoir computing relies on fixed recurrent weights to generate rich yet stable dynamics, and prior work has shown that operating near this phase transition maximizes the reservoir's expressive power. The authors demonstrate that their estimated g_c closely matches the gain at which a gated-RNN reservoir achieves its highest accuracy on a standard chaotic forecasting benchmark. They conclude that the same criterion can guide the design of initialization schemes for recurrent architectures more generally.

Core claim

In the infinite-width limit, meaningful random initializations for a broad class of gated recurrent networks sit at an effective critical point controlled by the weight variance g^2; the transition separates an ordered phase from a chaotic phase in which information progressively degrades. The authors supply an explicit random-matrix criterion that estimates the critical gain g_c for this transition and verify that, on a chaotic time-series forecasting task, reservoir performance peaks near the predicted g_c.
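The ordered/chaotic split can be probed numerically. What follows is a minimal sketch and not the paper's method: it assumes a vanilla (ungated) tanh network h_{t+1} = tanh(g W h_t) with W_ij ~ N(0, 1/N), and estimates the maximal Lyapunov exponent by pushing a tangent vector through the step-wise Jacobian and renormalising it at every step (a Benettin-style procedure). The function name `max_lyapunov`, the width, and the step counts are illustrative choices.

```python
import numpy as np

def max_lyapunov(g, n=200, steps=600, burn_in=200, seed=0):
    """Estimate the maximal Lyapunov exponent of h_{t+1} = tanh(g W h_t).

    Assumption: ungated toy network, not the gated architecture of the paper.
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, n))  # W_ij ~ N(0, 1/N)
    h = rng.normal(0.0, 1.0, size=n)                     # random initial state
    v = rng.normal(0.0, 1.0, size=n)
    v /= np.linalg.norm(v)                               # unit tangent vector
    log_growth = 0.0
    for t in range(burn_in + steps):
        h = np.tanh(g * (w @ h))
        # Jacobian at this step is diag(1 - h_{t+1}^2) g W; apply it to v.
        v = (1.0 - h**2) * (g * (w @ v))
        norm = np.linalg.norm(v)
        v /= norm                                        # renormalise to avoid under/overflow
        if t >= burn_in:
            log_growth += np.log(norm)                   # accumulate log stretching rate
    return log_growth / steps

# Sweep a few gains on either side of the toy transition.
exponents = {g: max_lyapunov(g) for g in (0.5, 1.0, 3.0)}
```

In this toy setting one would expect a negative exponent well inside the ordered phase and a larger one at high gain; the gain at which the estimate crosses zero is a finite-width proxy for g_c.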

What carries the argument

The random-matrix criterion for the critical gain g_c, obtained by locating the point at which the largest eigenvalue of the effective random matrix product equals unity and thereby marks the ordered-to-chaotic transition.
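The "largest eigenvalue equals unity" condition is easiest to see in the simplest ungated case. The sketch below is an assumption-laden toy, not the paper's criterion: for h_{t+1} = tanh(g W h_t) linearised around h = 0, the Jacobian is just g W, and for i.i.d. Gaussian entries of variance 1/N the circular law puts the spectral radius close to g. The helper `spectral_radius` and the variable `toy_g_c` are hypothetical names introduced for illustration.

```python
import numpy as np

def spectral_radius(g, n=400, seed=0):
    """Largest eigenvalue modulus of g * W with W_ij ~ N(0, 1/n)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, n))
    return float(np.max(np.abs(np.linalg.eigvals(g * w))))

# For a fixed W the radius is linear in g, so the gain at which it
# reaches one is simply 1 / rho(W) in this toy linearisation.
toy_g_c = 1.0 / spectral_radius(1.0)

# Radii at a few gains; each should sit near g at this width.
radii = {g: spectral_radius(g) for g in (0.5, 1.0, 1.5)}
```

The gated case replaces g W with an effective random matrix product, so the toy value near 1 should not be read as the paper's g_c.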

If this is right

The same criterion applies without modification to a wide family of recurrent architectures beyond the specific gated RNN tested.
Reservoir performance on chaotic forecasting tasks reaches its maximum when the initialization variance is set to the estimated critical value.
The criterion supplies an explicit, parameter-free design rule that future initialization schemes for recurrent networks can adopt directly.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The criterion could be used to initialize recurrent layers inside larger hybrid models that combine reservoirs with trained weights.
Because the derivation relies only on the spectral properties of the random matrix product, it may extend to recurrent architectures with different gating mechanisms or activation functions.
Testing the same initialization on tasks with long memory requirements but non-chaotic statistics would reveal whether the edge-of-chaos optimum is task-dependent.

Load-bearing premise

That the infinite-width edge-of-chaos transition identified by the random-matrix analysis remains the optimal operating point for finite-width gated RNNs on concrete prediction tasks.

What would settle it

Measure the forecasting error of a gated-RNN reservoir while sweeping the initialization gain g around the predicted g_c; if the error minimum occurs at a gain differing by more than a few percent from the criterion's prediction, the claimed correspondence is falsified.

Figures

Figures reproduced from arXiv: 2605.10650 by Francesco Casola, Riccardo Marcaccioli, Tommaso Fioratti.

**Figure 2.** Maximum Lyapunov exponent

**Figure 3.** Phase diagram under Gaussian bias

**Figure 4.** Training (dashed) and test (solid) mean squared …

Original abstract

Proper weight initialization prior to training has historically been one of the key factors that helped kick off the deep learning revolution. Initialization is even more crucial in "reservoir computing", where the weights of a readout layer are learned linearly while the reservoir weights are fixed and largely determine the richness, stability and memory of the resulting dynamics. In the infinite-width limit it has been shown that meaningful initializations are those sitting at an effective critical point of the randomly initialized model. The phase transition is controlled by the weight variance $g^2$ and separates an ordered phase from a chaotic one where information progressively degrades. Here we derive a simple criterion to estimate the critical $g_c$ for a broad class of recurrent architectures and we show that it closely tracks the gain at which a gated-RNN reservoir achieves peak performance on a chaotic forecasting task. Finally, we argue that our criterion can serve as a design principle for future initialization schemes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper derives a random-matrix criterion for the critical gain in gated RNNs that lines up with peak performance on one chaotic forecasting task.


The core contribution is a direct extension of the usual edge-of-chaos analysis to gated recurrent architectures. They linearize the Jacobian in the infinite-width limit, account for the gate functions, and obtain a simple expression for the critical g_c that separates ordered from chaotic regimes. That derivation is the new piece; prior work covered vanilla RNNs, and this version handles the extra nonlinearities without obvious extra fitting parameters.

The empirical check is also clean: they vary g on a reservoir setup and show the forecasting error bottoms out near the predicted value. This gives reservoir-computing people a calculable starting point instead of pure grid search, which is practical when the readout is trained linearly and the reservoir stays fixed. The math follows the standard mean-field route, so it is reproducible in principle if the steps are written out clearly.

The main limitation is scope. The match is shown on a single task and a narrow set of gated variants; it is not yet clear how sensitive the alignment is to sequence length, gate type, or finite width. A short finite-size correction or a second benchmark would make the claim more convincing, but the current evidence does not contradict the derivation.

The work is aimed at researchers who already use random-matrix tools for initialization or who build fixed-reservoir models for time series. It is narrow enough that not every RNN paper needs to cite it, but the criterion itself is the sort of thing that could be checked or used in follow-up experiments. The paper is coherent on its own terms and deserves a serious referee who can verify the Jacobian calculation and ask for broader tests.

Referee Report

0 major / 4 minor

Summary. The manuscript derives a random-matrix criterion for the critical gain g_c marking the edge-of-chaos transition in a broad class of gated recurrent architectures in the infinite-width limit. It then shows empirically that initializing a gated-RNN reservoir at this g_c produces peak performance on a chaotic time-series forecasting task and proposes the criterion as a general initialization design principle.

Significance. If the derivation and alignment hold, the work supplies a theoretically motivated initialization rule for gated reservoirs that could reduce hyperparameter search costs while improving dynamical stability and forecasting accuracy. Extending mean-field edge-of-chaos analysis to gated units is a useful incremental contribution to the reservoir computing literature.

minor comments (4)

[Abstract] Abstract: the phrase 'closely tracks' should be replaced by a quantitative statement (e.g., relative error or correlation coefficient) that is backed by the results section.
[§3] §3 (or wherever the derivation appears): explicitly list the steps that map the linearized Jacobian of the gated recurrence to the final g_c expression; the extension from vanilla RNNs is not obvious and needs to be shown.
[Figure 4] Figure 4 (performance vs. gain curves): add error bars or report the number of independent runs; without them the visual alignment with the derived g_c is hard to assess.
[Throughout] Notation: consistently distinguish the variance parameter g² from the critical value g_c; the current usage occasionally blurs the two.
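For orientation, the vanilla-RNN baseline that the §3 comment alludes to can be written out in a few lines. This is the standard argument for an ungated discrete-time network, stated under the assumptions of i.i.d. Gaussian weights and a mean-field (infinite-width) decoupling of weights from activations. It is not the paper's gated expression, and the symbols $\phi$, $D_t$, and $q^*$ are notation introduced here.

```latex
\begin{align}
h_{t+1} &= \phi\!\left(g\,W h_t\right), \qquad W_{ij} \sim \mathcal{N}(0, 1/N), \\
J_t &= \frac{\partial h_{t+1}}{\partial h_t} = g\,D_t W, \qquad
       D_t = \operatorname{diag}\!\big(\phi'(g\,W h_t)\big), \\
q^* &= \mathbb{E}_{z \sim \mathcal{N}(0,1)}\!\left[\phi\!\big(g\sqrt{q^*}\,z\big)^2\right], \\
\rho(J_t)^2 &\xrightarrow{\;N \to \infty\;}
       g^2\,\mathbb{E}_{z \sim \mathcal{N}(0,1)}\!\left[\phi'\!\big(g\sqrt{q^*}\,z\big)^2\right], \\
g_c^2\,\mathbb{E}_{z \sim \mathcal{N}(0,1)}\!\left[\phi'\!\big(g_c\sqrt{q^*}\,z\big)^2\right] &= 1 .
\end{align}
```

For $\phi = \tanh$ with zero bias, the ordered phase has $q^* = 0$, so $\phi'(0) = 1$ and this baseline gives $g_c = 1$. The referee's request amounts to showing how the gate functions modify $D_t$ and the resulting expectation.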

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive evaluation of the manuscript and for recommending minor revision. The summary accurately captures the core contribution: a random-matrix derivation of the critical gain g_c for gated RNN reservoirs in the infinite-width limit, together with empirical evidence that this initialization yields peak performance on chaotic forecasting tasks. As the report contains no specific major comments, we see no need for revisions and believe the current version is suitable for publication.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper derives a random-matrix criterion for the critical gain g_c by extending standard mean-field edge-of-chaos analysis to the linearized Jacobian of gated RNN architectures in the infinite-width limit. This derivation is presented as a direct theoretical extension using established variance-controlled phase transitions and does not reduce to any fitted quantity or self-referential definition. The subsequent empirical check that the derived g_c aligns with peak performance on a chaotic forecasting task serves only as validation; the task data is not used to construct or tune the criterion itself. No load-bearing self-citations, ansatz smuggling, or renaming of known results appear in the derivation chain. The central claim therefore remains independent of its empirical test.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based on abstract only; the central claim rests on the infinite-width random-matrix description of RNN phase transitions and the assumption that peak task performance occurs at the critical point.

axioms (1)

domain assumption: Infinite-width limit governs the phase transition controlled by weight variance g^2 in the recurrent architectures considered.
Explicitly invoked in the abstract as the regime where meaningful initializations sit at the critical point.



Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages

[1] X. Glorot and Y. Bengio, in Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (2010)
[2] K. He, X. Zhang, S. Ren, and J. Sun, in Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2015) pp. 1026–1034
[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, in Advances in Neural Information Processing Systems, Vol. 25 (2012)
[4] A. M. Saxe, J. L. McClelland, and S. Ganguli, in International Conference on Learning Representations (2014)
[5] J. Pennington, S. S. Schoenholz, and S. Ganguli, in Advances in Neural Information Processing Systems, Vol. 30 (2017) pp. 4788–4798
[6] N. Bertschinger and T. Natschläger, Neural Computation 16, 1413 (2004)
[7] J. Boedecker, O. Obst, J. T. Lizier, N. M. Mayer, and M. Asada, Theory in Biosciences 131, 205 (2012)
[8] H. Sompolinsky, A. Crisanti, and H.-J. Sommers, Physical Review Letters 61, 259 (1988)
[9] S. S. Schoenholz, J. Gilmer, S. Ganguli, and J. Sohl-Dickstein, in International Conference on Learning Representations (2017)
[10] J. Cardy, Scaling and Renormalization in Statistical Physics (Cambridge University Press, 1996)
[11] L. Molgedey, J. Schuchhardt, and H. G. Schuster, Physical Review Letters 69, 3717 (1992)
[12] Y. Ahmadian, F. Fumarola, and K. D. Miller, Physical Review E 91, 012820 (2015)
[13] T. Can, K. Krishnamurthy, and D. J. Schwab, in Proceedings of the First Mathematical and Scientific Machine Learning Conference, Proceedings of Machine Learning Research, Vol. 107 (2020) pp. 476–511, arXiv:2002.00025 [cs.LG]
[14] K. Krishnamurthy, T. Can, and D. J. Schwab, Physical Review X 12, 011011 (2022)
[15] R. G. Brown, Exponential Smoothing for Predicting Demand (Arthur D. Little Inc., 1956)
[16] P. J. Kaufman, Smarter Trading: Improving Performance in Changing Markets (McGraw-Hill, New York, 1995)
[17] S. F. Edwards and P. W. Anderson, Journal of Physics F: Metal Physics 5, 965 (1975)
[18] B. Poole, S. Lahiri, M. Raghu, J. Sohl-Dickstein, and S. Ganguli, in Advances in Neural Information Processing Systems, Vol. 29 (2016)
[19] G. Benettin, L. Galgani, A. Giorgilli, and J.-M. Strelcyn, Meccanica 15, 21 (1980)
[20] T. Tao and V. Vu, Annals of Probability 38, 2023 (2010), with an appendix by M. Krishnapur
[21] C. Tallec and Y. Ollivier, in International Conference on Learning Representations (2018)
[22] M. C. Mackey and L. Glass, Science 197, 287 (1977)
[23] H. Jaeger, The “echo state” approach to analysing and training recurrent neural networks, Tech. Rep. GMD Report 148 (German National Research Center for Information Technology, 2001)
[24] H. Jaeger and H. Haas, Science 304, 78 (2004)
[25] A. Cowsik, T. Nebabu, X.-L. Qi, and S. Ganguli, Physical Review E 112, 055301 (2025)

Appendix A: Removing the regularizer

The boundary condition (4) is taken from [12] in its full form, with the limits in the order $\lim_{r \to 0^+} \lim_{N \to \infty}$. This order matters in general: for arbitrary deterministic sequences of M, L, R, the unregularized empirical sum may dive...
