Plateaus, Optima, and Overfitting in Multi-Layer Perceptrons: A Saddle-Saddle-Attractor Scenario

Alex Al\`i Maleknia; Yuzuru Sato

arxiv: 2604.02393 · v2 · submitted 2026-04-02 · 💻 cs.LG · nlin.AO

Plateaus, Optima, and Overfitting in Multi-Layer Perceptrons: A Saddle-Saddle-Attractor Scenario

Alex Al\`i Maleknia , Yuzuru Sato This is my paper

Pith reviewed 2026-05-13 22:02 UTC · model grok-4.3

classification 💻 cs.LG nlin.AO

keywords multi-layer perceptronsoverfittingsaddle pointstraining dynamicsplateausdynamical systemsloss landscapemachine learning

0 comments

The pith

For finite noisy datasets, multi-layer perceptron training necessarily converges to an overfitting solution rather than the theoretical optimum.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a dynamical model of learning in multi-layer perceptrons based on a minimal version of earlier work on gradient flows. This model shows that training trajectories move through regions of slow progress called plateaus and areas close to optimal performance, both shaped by saddle points in the loss landscape. The flow eventually enters an overfitting regime. For finite datasets that contain noise, the theoretical global minimum becomes unreachable, so the dynamics settle into a stable overfitting state that can reduce to a single attractor up to symmetry.

Core claim

In the minimal dynamical model of MLP training, the parameter space features saddle structures that organize both plateau regions of slow learning and near-optimal regions. The flow of training dynamics passes through these before entering an overfitting attractor, which under suitable data conditions collapses to a single attractor up to symmetry. For finite noisy datasets, the theoretical optimum is unreachable, and the system necessarily converges to the overfitting solution.

What carries the argument

Saddle structures in the loss landscape of the minimal model that organize plateau and near-optimal regions before directing flow to the overfitting attractor.

If this is right

Training paths move through slow-learning plateaus shaped by saddles.
Near-optimal performance regions are also structured by saddles.
The system converges to an overfitting attractor.
On finite noisy data the theoretical optimum remains unreachable.
The overfitting regime collapses to a single attractor modulo symmetry under suitable data conditions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the minimal model applies, then regularization or early stopping works by interrupting the inevitable flow toward the overfitting attractor.
The saddle-organized dynamics may explain persistent overfitting even in overparameterized networks trained on noisy data.
Similar plateau-to-overfitting transitions could appear in gradient flows for other architectures once noise is present in finite samples.

Load-bearing premise

The minimal model inspired by Fukumizu and Amari faithfully captures the essential dynamical features of real multi-layer perceptron training.

What would settle it

A simulation or analysis of the minimal model on a finite noisy dataset that reaches the theoretical global optimum without entering the overfitting regime would falsify the claim.

Figures

Figures reproduced from arXiv: 2604.02393 by Alex Al\`i Maleknia, Yuzuru Sato.

**Figure 2.** Figure 2: A schematic representation of the saddle-saddle-attractor scenario in MLP gradient [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

**Figure 3.** Figure 3: Graphs obtained after training the minimal model for 2 million iterations. In the [PITH_FULL_IMAGE:figures/full_fig_p010_3.png] view at source ↗

read the original abstract

Vanishing gradients and overfitting are central problems in machine learning, yet are typically analyzed in asymptotic regimes that obscure their dynamical origins. Here we provide a dynamical description of learning in multi-layer perceptrons (MLPs) via a minimal model inspired by Fukumizu and Amari. We show that training dynamics traverse plateau and near-optimal regions, both organized by saddle structures, before converging to an overfitting regime. Under suitable conditions on the data, this regime collapses to a single attractor modulo symmetry. Furthermore, for finite noisy datasets, convergence to the theoretical optimum is impossible, and the dynamics necessarily settle into an overfitting solution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a clean saddle-saddle-attractor story for why MLPs hit plateaus then overfit on finite noisy data, but the minimal model still lacks any direct check against real network trajectories.

read the letter

The new piece is the explicit sequence: training moves through plateau saddles, then near-optimal saddles, and finally settles into a single overfitting attractor modulo symmetry. This extends the Fukumizu-Amari minimal model by organizing those regions into one dynamical path and by stating that finite noisy data makes the theoretical optimum unreachable inside the reduced equations. That framing is straightforward and pulls together phenomena that usually sit in separate papers. The derivations inside the model appear consistent on their own terms, and the claim that the dynamics must collapse to the overfitting regime follows directly once the reductions are accepted. What is missing is evidence that the reductions preserve the key behavior when the omitted terms are restored. No trajectory comparisons with a concrete MLP on the same finite noisy data are shown, so it remains open whether stochastic gradients, higher-order weight interactions, or symmetry-breaking perturbations change the attractor basin. The abstract also does not display the explicit reduced equations, which makes it hard to judge how many choices were needed to reach the saddle sequence. For someone already working on dynamical systems views of optimization this is worth reading; it supplies a mechanistic picture that could suggest new regularization ideas if the transfer holds. A reader focused on practical training would still need to run their own checks. The work is coherent enough on its own terms to deserve referee time rather than a desk reject.

Referee Report

3 major / 2 minor

Summary. The manuscript develops a minimal dynamical model for MLP training inspired by Fukumizu and Amari. It shows that the loss landscape dynamics traverse plateau regions and near-optimal regions, both organized by saddle structures, before settling into an overfitting attractor. The central claim is that, for finite noisy datasets, convergence to the theoretical optimum is impossible and the dynamics necessarily reach an overfitting solution that collapses to a single attractor modulo symmetry.

Significance. If the reduced model faithfully captures the essential features of MLP dynamics, the saddle-saddle-attractor scenario supplies a concrete dynamical mechanism for the coexistence of plateaus, near-optima, and overfitting. This could inform analyses of why stochastic gradient methods on finite data avoid global optima even when they exist in the population loss. The work is strongest where it derives the sequence of saddle structures inside the reduced equations; its impact hinges on whether those structures survive in the full network.

major comments (3)

[§3] §3 (minimal-model construction): the reduction from the full MLP loss to the Fukumizu-Amari-inspired system omits higher-order weight interactions and stochastic-gradient noise; the manuscript does not show that these terms leave the saddle-saddle-attractor sequence qualitatively intact.
[§5] §5 (attractor analysis): the claim that finite noisy data forces the dynamics into the overfitting attractor is derived inside the reduced equations, yet no side-by-side trajectory comparison is supplied between the minimal model and a concrete MLP trained on identical finite noisy data.
[§4.2] §4.2 (symmetry-breaking): the statement that the overfitting regime collapses to a single attractor modulo symmetry assumes that symmetry-breaking perturbations do not open new basins; this assumption is load-bearing for the “necessarily settle” conclusion but is not tested numerically.

minor comments (2)

Notation for the reduced coordinates (e.g., the variables tracking the plateau and near-optimum saddles) is introduced without an explicit table of symbols; a short nomenclature table would improve readability.
Figure 2 (phase portrait) uses line styles that are difficult to distinguish in grayscale; adding distinct markers or a clearer legend would help.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their detailed and constructive report. We address each of the major comments below, proposing revisions to the manuscript where they strengthen the presentation without altering the core claims.

read point-by-point responses

Referee: §3 (minimal-model construction): the reduction from the full MLP loss to the Fukumizu-Amari-inspired system omits higher-order weight interactions and stochastic-gradient noise; the manuscript does not show that these terms leave the saddle-saddle-attractor sequence qualitatively intact.

Authors: We agree that the minimal model is an approximation that neglects higher-order terms and gradient noise. The construction follows the Fukumizu-Amari approach to isolate the essential saddle dynamics. In the revised manuscript, we will expand §3 to include a perturbative argument showing that small higher-order perturbations do not alter the sequence of saddle crossings, thereby preserving the qualitative plateau-near-optima-overfitting structure. revision: partial
Referee: §5 (attractor analysis): the claim that finite noisy data forces the dynamics into the overfitting attractor is derived inside the reduced equations, yet no side-by-side trajectory comparison is supplied between the minimal model and a concrete MLP trained on identical finite noisy data.

Authors: The attractor analysis is performed analytically within the reduced dynamical system. While a direct numerical comparison with full MLP training would provide additional support, such experiments are computationally demanding and lie outside the theoretical scope of the present work. We will revise §5 to explicitly state that the necessity of the overfitting attractor holds in the minimal model, and we will add a remark on the expected robustness to full networks based on the reduction assumptions. revision: partial
Referee: §4.2 (symmetry-breaking): the statement that the overfitting regime collapses to a single attractor modulo symmetry assumes that symmetry-breaking perturbations do not open new basins; this assumption is load-bearing for the “necessarily settle” conclusion but is not tested numerically.

Authors: We acknowledge that the uniqueness modulo symmetry relies on the absence of new basins induced by perturbations. In the revised version, we will include numerical simulations of the reduced equations with added small symmetry-breaking terms to verify that the dynamics still converge to the same attractor class. This will be presented in an expanded §4.2. revision: yes

Circularity Check

0 steps flagged

Derivation self-contained within minimal model; no reduction to inputs by construction

full rationale

The paper defines a minimal dynamical model inspired by Fukumizu-Amari, then analyzes its equations to obtain the plateau-near-optimum-overfitting sequence and the attractor for finite noisy data. This is a direct mathematical consequence of the reduced ODEs rather than a fit or self-referential definition. No load-bearing self-citation, no parameter fitted to data then relabeled prediction, and no ansatz smuggled from the authors' own prior work. The transfer to real MLPs rests on an unvalidated modeling assumption, but that is a correctness issue, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Only the abstract is available, so the full ledger cannot be extracted. The central claim rests on the assumption that the minimal model is representative.

axioms (1)

domain assumption The minimal model inspired by Fukumizu and Amari captures the essential dynamics of MLP training
Invoked to derive the plateau, saddle, and attractor behavior described in the abstract.

pith-pipeline@v0.9.0 · 5410 in / 1145 out tokens · 39171 ms · 2026-05-13T22:02:29.801585+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · 1 internal anchor

[1]

D. P. Kingma, J. Ba, Adam: A method for stochastic optimization , CoRR abs/1412.6980 (2014). URL https://api.semanticscholar.org/CorpusID:6628106

work page internal anchor Pith review Pith/arXiv arXiv 2014
[2]

Jordan, Y

K. Jordan, Y. Jin, V. Boza, J. You, F. Cesista, L. Newhouse, J. Bernstein, Muon: An optimizer for hidden layers in neural networks (2024). URL https://kellerjordan.github.io/posts/muon/

work page 2024
[3]

Z. Hu, J. ZHANG, Y. Ge, Handling vanishing gradient problem using arti- ﬁcial derivative, IEEE Access PP (2021) 1–1. doi:10.1109/ACCESS.2021. 3054915

work page doi:10.1109/access.2021 2021
[4]

Ainsworth, Y

M. Ainsworth, Y. Shin, Plateau phenomenon in gradient descent training of relu networks: Explanation, quantiﬁcation, and avoidance , SIAM Jour- nal on Scientiﬁc Computing 43 (5) (2021) A3438–A3468. doi:10.1137/ 20M1353010. URL https://doi.org/10.1137/20M1353010

work page doi:10.1137/20m1353010 2021
[5]

Fukumizu, S

K. Fukumizu, S. Amari, Local minima and plateaus in hierarchical struc- tures of multilayer perceptrons , Neural Networks 13 (3) (2000) 317–327. doi:https://doi.org/10.1016/S0893-6080(00)00009-5 . URL https://www.sciencedirect.com/science/article/pii/ S0893608000000095

work page doi:10.1016/s0893-6080(00)00009-5 2000
[6]

Simsek, F

B. Simsek, F. Ged, A. Jacot, F. Spadaro, C. Hongler, W. Gerstner, J. Brea, Geometry of the loss landscape in overparameterized neural networks: Sym- metries and invariances, in: M. Meila, T. Zhang (Eds.), Proceedings of the 38th International Conference on Machine Learning, Vol. 139 of Proceed- ings of Machine Learning Research, PMLR, 2021, pp. 9722–9732. 11

work page 2021
[7]

Y. Sato, D. Tsutsui, A. Fujiwara, Noise-induced degeneration in on- line learning , Physica D: Nonlinear Phenomena 430 (2022) 133095. doi:https://doi.org/10.1016/j.physd.2021.133095. URL https://www.sciencedirect.com/science/article/pii/ S0167278921002505

work page doi:10.1016/j.physd.2021.133095 2022
[8]

Zhang, A

Y. Zhang, A. M. Saxe, P. E. Latham, Saddle-to-saddle dynamics explains a simplicity bias across neural network architectures , in: The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=Vit5M0G5Gb

work page 2026
[9]

M. A. Nielsen, Neural Networks and Deep Learning, Determination Press, 2015

work page 2015
[10]

C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics), Springer-Verlag, Berlin, Heidelberg, 2006

work page 2006
[11]

Pascanu, T

R. Pascanu, T. Mikolov, Y. Bengio, On the diﬃculty of training recurrent neural networks, in: S. Dasgupta, D. McAllester (Eds.), Proceedings of the 30th International Conference on Machine Learning, Vol. 28 of Proceedings of Machine Learning Research, PMLR, Atlanta, Georgia, USA, 2013, pp. 1310–1318. URL https://proceedings.mlr.press/v28/pascanu13.html

work page 2013
[12]

doi: 10.1007/s10208-024-09664-9

R. Berthier, A. Montanari, K. Zhou, Learning time-scales in two-layers neural networks, Foundations of Computational Mathematics (Aug 2024). doi:https://doi.org/10.1007/s10208-024-09664-9

work page doi:10.1007/s10208-024-09664-9 2024
[13]

Teramae, D

J.-n. Teramae, D. Tanaka, Robustness of the noise-induced phase synchro- nization in a general class of limit cycle oscillators , Phys. Rev. Lett. 93 (2004) 204103. doi:10.1103/PhysRevLett.93.204103. URL https://link.aps.org/doi/10.1103/PhysRevLett.93.204103

work page doi:10.1103/physrevlett.93.204103 2004
[14]

Absil, R

P. Absil, R. Mahony, B. Andrews, Convergence of the iterates of descent methods for analytic cost functions, SIAM Journal on Optimization 16 (2) (2005) 531–547. doi:10.1137/040605266

work page doi:10.1137/040605266 2005
[15]

Leobacher, A

G. Leobacher, A. Steinicke, Existence, uniqueness and regularity of the projection onto diﬀerentiable manifolds , Annals of Global Analysis and Geometry 60 (3) (2021) 559–587. doi:10.1007/s10455-021-09788-z . URL https://doi.org/10.1007/s10455-021-09788-z

work page doi:10.1007/s10455-021-09788-z 2021
[16]

A. M. Chen, H.-m. Lu, R. Hecht-Nielsen, On the geometry of feedforward neural network error surfaces, Neural Computation 5 (6) (1993) 910–927. doi:10.1162/neco.1993.5.6.910

work page doi:10.1162/neco.1993.5.6.910 1993
[17]

Aamari, J

E. Aamari, J. Kim, F. Chazal, B. Michel, A. Rinaldo, L. Wasserman, Es- timating the reach of a manifold , Electronic Journal of Statistics 13 (1) (2019) 1359 – 1399. doi:10.1214/19-EJS1551. URL https://doi.org/10.1214/19-EJS1551

work page doi:10.1214/19-ejs1551 2019
[18]

Fukumizu, S

K. Fukumizu, S. Yamaguchi, Y.-i. Mototake, M. Tanaka, Semi-ﬂat minima and saddle points by embedding neural networks to overparameterization , in: Advances in Neural Information Processing Systems, Vol. 32, Curran 12 Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/hash/ a4ee59dd868ba016ed2de90d330acb6a-Abstract.html

work page 2019
[19]

Rahaman, A

N. Rahaman, A. Baratin, D. Arpit, F. Draxler, M. Lin, F. Hamprecht, Y. Bengio, A. Courville, On the spectral bias of neural networks , in: K. Chaudhuri, R. Salakhutdinov (Eds.), Proceedings of the 36th Interna- tional Conference on Machine Learning, Vol. 97 of Proceedings of Machine Learning Research, PMLR, 2019, pp. 5301–5310. URL https://proceedings.mlr...

work page 2019

[1] [1]

D. P. Kingma, J. Ba, Adam: A method for stochastic optimization , CoRR abs/1412.6980 (2014). URL https://api.semanticscholar.org/CorpusID:6628106

work page internal anchor Pith review Pith/arXiv arXiv 2014

[2] [2]

Jordan, Y

K. Jordan, Y. Jin, V. Boza, J. You, F. Cesista, L. Newhouse, J. Bernstein, Muon: An optimizer for hidden layers in neural networks (2024). URL https://kellerjordan.github.io/posts/muon/

work page 2024

[3] [3]

Z. Hu, J. ZHANG, Y. Ge, Handling vanishing gradient problem using arti- ﬁcial derivative, IEEE Access PP (2021) 1–1. doi:10.1109/ACCESS.2021. 3054915

work page doi:10.1109/access.2021 2021

[4] [4]

Ainsworth, Y

M. Ainsworth, Y. Shin, Plateau phenomenon in gradient descent training of relu networks: Explanation, quantiﬁcation, and avoidance , SIAM Jour- nal on Scientiﬁc Computing 43 (5) (2021) A3438–A3468. doi:10.1137/ 20M1353010. URL https://doi.org/10.1137/20M1353010

work page doi:10.1137/20m1353010 2021

[5] [5]

Fukumizu, S

K. Fukumizu, S. Amari, Local minima and plateaus in hierarchical struc- tures of multilayer perceptrons , Neural Networks 13 (3) (2000) 317–327. doi:https://doi.org/10.1016/S0893-6080(00)00009-5 . URL https://www.sciencedirect.com/science/article/pii/ S0893608000000095

work page doi:10.1016/s0893-6080(00)00009-5 2000

[6] [6]

Simsek, F

B. Simsek, F. Ged, A. Jacot, F. Spadaro, C. Hongler, W. Gerstner, J. Brea, Geometry of the loss landscape in overparameterized neural networks: Sym- metries and invariances, in: M. Meila, T. Zhang (Eds.), Proceedings of the 38th International Conference on Machine Learning, Vol. 139 of Proceed- ings of Machine Learning Research, PMLR, 2021, pp. 9722–9732. 11

work page 2021

[7] [7]

Y. Sato, D. Tsutsui, A. Fujiwara, Noise-induced degeneration in on- line learning , Physica D: Nonlinear Phenomena 430 (2022) 133095. doi:https://doi.org/10.1016/j.physd.2021.133095. URL https://www.sciencedirect.com/science/article/pii/ S0167278921002505

work page doi:10.1016/j.physd.2021.133095 2022

[8] [8]

Zhang, A

Y. Zhang, A. M. Saxe, P. E. Latham, Saddle-to-saddle dynamics explains a simplicity bias across neural network architectures , in: The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=Vit5M0G5Gb

work page 2026

[9] [9]

M. A. Nielsen, Neural Networks and Deep Learning, Determination Press, 2015

work page 2015

[10] [10]

C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics), Springer-Verlag, Berlin, Heidelberg, 2006

work page 2006

[11] [11]

Pascanu, T

R. Pascanu, T. Mikolov, Y. Bengio, On the diﬃculty of training recurrent neural networks, in: S. Dasgupta, D. McAllester (Eds.), Proceedings of the 30th International Conference on Machine Learning, Vol. 28 of Proceedings of Machine Learning Research, PMLR, Atlanta, Georgia, USA, 2013, pp. 1310–1318. URL https://proceedings.mlr.press/v28/pascanu13.html

work page 2013

[12] [12]

doi: 10.1007/s10208-024-09664-9

R. Berthier, A. Montanari, K. Zhou, Learning time-scales in two-layers neural networks, Foundations of Computational Mathematics (Aug 2024). doi:https://doi.org/10.1007/s10208-024-09664-9

work page doi:10.1007/s10208-024-09664-9 2024

[13] [13]

Teramae, D

J.-n. Teramae, D. Tanaka, Robustness of the noise-induced phase synchro- nization in a general class of limit cycle oscillators , Phys. Rev. Lett. 93 (2004) 204103. doi:10.1103/PhysRevLett.93.204103. URL https://link.aps.org/doi/10.1103/PhysRevLett.93.204103

work page doi:10.1103/physrevlett.93.204103 2004

[14] [14]

Absil, R

P. Absil, R. Mahony, B. Andrews, Convergence of the iterates of descent methods for analytic cost functions, SIAM Journal on Optimization 16 (2) (2005) 531–547. doi:10.1137/040605266

work page doi:10.1137/040605266 2005

[15] [15]

Leobacher, A

G. Leobacher, A. Steinicke, Existence, uniqueness and regularity of the projection onto diﬀerentiable manifolds , Annals of Global Analysis and Geometry 60 (3) (2021) 559–587. doi:10.1007/s10455-021-09788-z . URL https://doi.org/10.1007/s10455-021-09788-z

work page doi:10.1007/s10455-021-09788-z 2021

[16] [16]

A. M. Chen, H.-m. Lu, R. Hecht-Nielsen, On the geometry of feedforward neural network error surfaces, Neural Computation 5 (6) (1993) 910–927. doi:10.1162/neco.1993.5.6.910

work page doi:10.1162/neco.1993.5.6.910 1993

[17] [17]

Aamari, J

E. Aamari, J. Kim, F. Chazal, B. Michel, A. Rinaldo, L. Wasserman, Es- timating the reach of a manifold , Electronic Journal of Statistics 13 (1) (2019) 1359 – 1399. doi:10.1214/19-EJS1551. URL https://doi.org/10.1214/19-EJS1551

work page doi:10.1214/19-ejs1551 2019

[18] [18]

Fukumizu, S

K. Fukumizu, S. Yamaguchi, Y.-i. Mototake, M. Tanaka, Semi-ﬂat minima and saddle points by embedding neural networks to overparameterization , in: Advances in Neural Information Processing Systems, Vol. 32, Curran 12 Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/hash/ a4ee59dd868ba016ed2de90d330acb6a-Abstract.html

work page 2019

[19] [19]

Rahaman, A

N. Rahaman, A. Baratin, D. Arpit, F. Draxler, M. Lin, F. Hamprecht, Y. Bengio, A. Courville, On the spectral bias of neural networks , in: K. Chaudhuri, R. Salakhutdinov (Eds.), Proceedings of the 36th Interna- tional Conference on Machine Learning, Vol. 97 of Proceedings of Machine Learning Research, PMLR, 2019, pp. 5301–5310. URL https://proceedings.mlr...

work page 2019