
arXiv: 2506.14951 · v4 · submitted 2025-06-17 · 💻 cs.LG · cs.AI · cs.NE

Flat Channels to Infinity in Neural Loss Landscapes

Pith reviewed 2026-05-19 08:51 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.NE
keywords loss landscape · flat minima · neural networks · gradient flow · gated linear units · regression · symmetry

The pith

Neural network loss landscapes contain flat channels leading to infinity where neuron pairs form gated linear units.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies special structures in neural network loss landscapes: channels along which the loss decreases extremely slowly. Along these channels the output weights of at least two neurons diverge to positive and negative infinity while their input weight vectors become identical. Gradient flow and related optimizers reach the channels with high probability in regression tasks, yet the channels look like ordinary flat minima with finite weights unless examined closely. The channels run asymptotically parallel to lines of critical points created by network symmetries. At the far end the two neurons together realize a gated linear unit of the form sigma(w · x) plus (v · x) times sigma prime of (w · x).

Core claim

We identify and characterize channels along which the loss decreases extremely slowly, while the output weights of at least two neurons, a_i and a_j, diverge to ±infinity, and their input weight vectors, w_i and w_j, become equal to each other. At convergence, the two neurons implement a gated linear unit: a_i sigma(w_i · x) + a_j sigma(w_j · x) approaches sigma(w · x) + (v · x) sigma prime(w · x). Geometrically, these channels to infinity are asymptotically parallel to symmetry-induced lines of critical points. Gradient flow solvers reach the channels with high probability in diverse regression settings.
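
The limit stated above can be checked numerically. The following is a minimal sketch, assuming the reparameterization suggested by the paper's appendix excerpt (a_i = c/2 + a/(2ε), a_j = c/2 − a/(2ε), w_i = w + εΔ, w_j = w − εΔ) and a softplus activation. The function names, seed, and sizes are illustrative choices, not taken from the paper. With v = aΔ, the pair should approach c σ(w · x) + (v · x) σ'(w · x), with an error that shrinks roughly like ε².

```python
import numpy as np

def softplus(z):
    # Numerically stable softplus: log(1 + exp(z)).
    return np.logaddexp(0.0, z)

def softplus_prime(z):
    # The derivative of softplus is the logistic sigmoid.
    return 1.0 / (1.0 + np.exp(-z))

def neuron_pair(x, c, a, w, delta, eps):
    # Assumed reparameterization: output weights diverge like 1/eps
    # while the two input weight vectors merge as eps -> 0.
    a_i = c / 2 + a / (2 * eps)
    a_j = c / 2 - a / (2 * eps)
    return a_i * softplus(x @ (w + eps * delta)) + a_j * softplus(x @ (w - eps * delta))

def gated_linear_unit(x, c, a, w, delta):
    # Limiting function: c * sigma(w.x) + (v.x) * sigma'(w.x) with v = a * delta.
    return c * softplus(x @ w) + a * (x @ delta) * softplus_prime(x @ w)

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 3))
w, delta = rng.normal(size=3), rng.normal(size=3)
c, a = 0.7, -1.3

errors = {}
for eps in (1e-1, 1e-2, 1e-3):
    diff = neuron_pair(x, c, a, w, delta, eps) - gated_linear_unit(x, c, a, w, delta)
    errors[eps] = np.max(np.abs(diff))
    print(f"eps={eps:g}  a_i={c / 2 + a / (2 * eps):.1f}  max error={errors[eps]:.2e}")
```

Each tenfold reduction of ε makes the output weights roughly ten times larger, while the function computed by the pair barely changes. This is one way to see why the loss can be nearly flat along such a direction.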

What carries the argument

Flat channels to infinity in parameter space, asymptotically parallel to symmetry-induced lines of critical points, that end in a pair of neurons implementing a gated linear unit.

If this is right

  • Gradient flow solvers and SGD or ADAM reach the channels with high probability in diverse regression settings.
  • Without careful inspection the channels appear as flat local minima with finite parameter values.
  • The channels supply a comprehensive picture of quasi-flat regions in terms of gradient dynamics, geometry, and functional form.
  • The emergence of gated linear units at the end of the channels points to a computational capability of fully connected layers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training trajectories that appear to have converged may actually continue slow movement along these channels for many more steps.
  • Detecting such channels could guide regularization methods that penalize large weight divergence.
  • The same geometric mechanism may operate in deeper networks or other activation functions.

Load-bearing premise

The analysis assumes gradient flow and SGD or ADAM reach these channels with high probability in diverse regression settings and that the resulting configuration is asymptotically parallel to symmetry-induced critical lines.

What would settle it

Train a two-layer network on a regression task with gradient descent and check whether pairs of neurons show output weights diverging to plus and minus infinity, input weights aligning, and loss continuing to decrease slowly over long times.
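
One way to make this check concrete is the pair heuristic described in the Figure 5 caption: find the two hidden neurons with the closest input weight vectors (by cosine distance) and report the sum of their absolute output weights. The sketch below is a plain NumPy illustration. The function name `closest_pair_diagnostic` and the example weights are hypothetical, and no thresholds from the paper are assumed.

```python
import numpy as np

def closest_pair_diagnostic(W, a):
    """W: (n_hidden, n_in) input weights; a: (n_hidden,) output weights.

    Returns (i, j, cosine_distance, abs_output_sum) for the pair of hidden
    neurons whose input weight vectors are closest in cosine distance.
    """
    unit = W / np.linalg.norm(W, axis=1, keepdims=True)
    cos_dist = 1.0 - unit @ unit.T
    # Ignore self-pairs and count each pair once (strict upper triangle only).
    cos_dist[np.tril_indices_from(cos_dist)] = np.inf
    i, j = np.unravel_index(np.argmin(cos_dist), cos_dist.shape)
    return int(i), int(j), float(cos_dist[i, j]), float(abs(a[i]) + abs(a[j]))

# Hypothetical weights: neurons 1 and 3 are nearly aligned and carry large
# output weights of opposite sign, the pattern described for a channel.
W = np.array([[1.0, 0.0], [0.6, 0.8], [-1.0, 0.5], [0.6001, 0.7999]])
a = np.array([0.3, 250.0, -0.7, -249.5])
i, j, dist, out_sum = closest_pair_diagnostic(W, a)
print(i, j, dist, out_sum)
```

Logging these two numbers together with the loss over a long training run would be one way to look for the described signature: a shrinking distance, a growing output-weight sum, and a loss that keeps decreasing slowly.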

Figures

Figures reproduced from arXiv: 2506.14951 by Alexander Van Meegen, Berfin \c{S}im\c{s}ek, Flavio Martinelli, Johanni Brea, Wulfram Gerstner.

Figure 1: Saddle lines à la Fukumizu & Amari [1] and channels to infinity. Left: Duplicating a neuron in a network trained to convergence generates lines of saddle points in the loss landscape [1]. Duplicated neurons share the input weights of the original neuron while their output weights γa, (1 − γ)a sum to the original neuron’s output weight a. Middle: Loss landscape of duplicated network projected along the saddl…
Figure 2: Stable plateau-saddles can be found in MLPs with scalar output and no bias: (a) Networks of 1 to 5 hidden neurons are trained on the shown 2D regression target (logarithm of the Rosenbrock function, see Appendix A). Training follows full-batch gradient flow dynamics until convergence to a critical point. A quantification of unique solutions in weight-space (up to permutation symmetries) is shown at the bot…
Figure 3: Loss landscape of plateau-saddles: (a) Schematic of trajectories around saddle line. Repulsive directions of strict saddles (red segments) become attractive for the plateau-saddle (orange segment). (b) Loss landscape along the duplication parameter γ and the direction of smallest eigenvalue of the Hessian αe_min(γ). (c) Example of neuron duplication for loss function shown in b: small perturbations are stable …
Figure 4
Figure 5: Frequency and properties of channels to infinity: (a) As a heuristic to identify channels to infinity, we look at the cosine distance of the pair of closest input weight vectors within a network and the sum of absolute output weights corresponding to that pair. Channel solutions are identified by having a pair of neurons with large output weight norm and a small distance in input weights (top left section …
Figure 6: Convergence in ϵ to gated linear units. (a) Moving along a channel to infinity with the jump procedure described in Appendix C shows that the loss and the approximation error decrease with ϵ², as predicted by the theory, and that c, a, and the cosine similarity cos(∆, w) converge to constant values. A network with 8 input dimensions and 8 hidden softplus neurons (81 parameters) trained on the rosenbrock…
Figure 2: the L∞-norm $\max_i |\nabla_{\theta_i} L(\theta)|$ is used to quantify the gradient norm.
Figure 5: the L∞-norm $\max_i |\nabla_{\theta_i} L(\theta)|$ is used to quantify the gradient norm.
Figure B7: Examples of 2D GP datasets: 2D GP datasets used in
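
The neuron-duplication construction in the Figure 1 caption can be checked numerically. The sketch below assumes a one-hidden-layer tanh network without biases (the activation, sizes, and helper names are illustrative assumptions). It splits one neuron's output weight a into γa and (1 − γ)a while copying its input weights, and confirms that the network function is unchanged for every γ tried.

```python
import numpy as np

def mlp(x, W, a):
    # One hidden layer, scalar output, no biases: f(x) = sum_k a_k * tanh(w_k . x).
    return np.tanh(x @ W.T) @ a

def duplicate_neuron(W, a, k, gamma):
    # Copy neuron k's input weights and split its output weight into
    # gamma * a_k and (1 - gamma) * a_k, so the two shares sum to a_k.
    W_dup = np.vstack([W, W[k]])
    a_dup = np.append(a, (1.0 - gamma) * a[k])
    a_dup[k] = gamma * a[k]
    return W_dup, a_dup

rng = np.random.default_rng(1)
x = rng.normal(size=(50, 2))
W, a = rng.normal(size=(3, 2)), rng.normal(size=3)

for gamma in (-5.0, 0.0, 0.3, 1.0, 20.0):
    W_dup, a_dup = duplicate_neuron(W, a, k=0, gamma=gamma)
    gap = np.max(np.abs(mlp(x, W_dup, a_dup) - mlp(x, W, a)))
    print(f"gamma={gamma:5.1f}  max |f_dup - f| = {gap:.2e}")
```

Because the output is identical along the whole γ line, the loss is constant there. If the original network sits at a critical point, this is the kind of symmetry-induced line of critical points the caption refers to.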
Original abstract

The loss landscapes of neural networks contain minima and saddle points that may be connected in flat regions or appear in isolation. We identify and characterize a special structure in the loss landscape: channels along which the loss decreases extremely slowly, while the output weights of at least two neurons, $a_i$ and $a_j$, diverge to $\pm$infinity, and their input weight vectors, $\mathbf{w_i}$ and $\mathbf{w_j}$, become equal to each other. At convergence, the two neurons implement a gated linear unit: $a_i\sigma(\mathbf{w_i} \cdot \mathbf{x}) + a_j\sigma(\mathbf{w_j} \cdot \mathbf{x}) \rightarrow \sigma(\mathbf{w} \cdot \mathbf{x}) + (\mathbf{v} \cdot \mathbf{x}) \sigma'(\mathbf{w} \cdot \mathbf{x})$. Geometrically, these channels to infinity are asymptotically parallel to symmetry-induced lines of critical points. Gradient flow solvers, and related optimization methods like SGD or ADAM, reach the channels with high probability in diverse regression settings, but without careful inspection they look like flat local minima with finite parameter values. Our characterization provides a comprehensive picture of these quasi-flat regions in terms of gradient dynamics, geometry, and functional interpretation. The emergence of gated linear units at the end of the channels highlights a surprising aspect of the computational capabilities of fully connected layers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper identifies and characterizes 'flat channels to infinity' in neural network loss landscapes. Along these channels the loss decreases extremely slowly while output weights a_i and a_j diverge to ±∞ and input weights w_i, w_j align; the pair asymptotically implements a gated linear unit of the form σ(w·x) + (v·x)σ'(w·x). The channels are asymptotically parallel to symmetry-induced lines of critical points. The authors assert that gradient flow, SGD and ADAM reach these channels with high probability in diverse regression settings, making them appear as flat local minima with finite parameters. The work supplies a unified picture in terms of gradient dynamics, geometry and functional interpretation.

Significance. If the reachability claim is substantiated, the result is significant: it supplies a concrete dynamical and geometric mechanism for the flat regions routinely observed in neural loss landscapes and directly links them to the emergence of gated-linear-unit-like computations inside fully connected layers. The explicit functional limit and the asymptotic parallelism to symmetry lines are strengths that go beyond standard critical-point catalogs and could inform implicit-regularization analyses. The paper earns credit for attempting a comprehensive characterization that integrates dynamics, geometry and expressivity.

major comments (2)
  1. [Abstract] Abstract: the assertion that 'Gradient flow solvers, and related optimization methods like SGD or ADAM, reach the channels with high probability in diverse regression settings' is load-bearing for the claim that these structures explain observed training behavior. No basin-volume estimate, Lyapunov analysis near the channel, or controls ruling out other attractors are supplied; without such support the channels may exist mathematically yet remain irrelevant to typical trajectories.
  2. [Geometry / symmetry analysis] The geometric claim that the channels are 'asymptotically parallel to symmetry-induced lines of critical points' is central to the characterization. The manuscript should supply the explicit symmetry group, the associated conserved quantities or Hessian null directions, and the precise sense in which the flow approaches these lines (e.g., a differential equation for the transverse coordinates).
minor comments (1)
  1. [Abstract / notation] The limiting expression for the gated linear unit in the abstract would benefit from an explicit parametrization (e.g., a scaling parameter t → ∞ along the channel) that makes the divergence rates of a_i, a_j and the alignment of w_i, w_j mathematically precise.
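
One parametrization that would make this explicit is sketched here, using a small parameter ε → 0 in place of t → ∞. It follows the reparameterization quoted in the paper's appendix excerpt; that excerpt is truncated, so the exact form below is an assumption. The expansion itself is an ordinary Taylor expansion in ε around w.

```latex
\begin{align*}
a_i &= \frac{c}{2} + \frac{a}{2\epsilon}, \qquad
a_j = \frac{c}{2} - \frac{a}{2\epsilon}, \qquad
\mathbf{w}_i = \mathbf{w} + \epsilon\Delta, \qquad
\mathbf{w}_j = \mathbf{w} - \epsilon\Delta, \\[4pt]
a_i\,\sigma(\mathbf{w}_i \cdot \mathbf{x}) + a_j\,\sigma(\mathbf{w}_j \cdot \mathbf{x})
&= c\,\sigma(\mathbf{w} \cdot \mathbf{x})
 + a\,(\Delta \cdot \mathbf{x})\,\sigma'(\mathbf{w} \cdot \mathbf{x}) \\
&\quad + \epsilon^2 \left[ \frac{c}{2}\,(\Delta \cdot \mathbf{x})^2\,\sigma''(\mathbf{w} \cdot \mathbf{x})
 + \frac{a}{6}\,(\Delta \cdot \mathbf{x})^3\,\sigma'''(\mathbf{w} \cdot \mathbf{x}) \right]
 + O(\epsilon^4).
\end{align*}
```

Under this assumption the output weights diverge like ±a/(2ε), the input weights differ by 2εΔ, and the gating vector of the limiting unit is v = aΔ.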

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and for recognizing the potential significance of our findings. We provide point-by-point responses to the major comments below and describe the revisions we intend to implement.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the assertion that 'Gradient flow solvers, and related optimization methods like SGD or ADAM, reach the channels with high probability in diverse regression settings' is load-bearing for the claim that these structures explain observed training behavior. No basin-volume estimate, Lyapunov analysis near the channel, or controls ruling out other attractors are supplied; without such support the channels may exist mathematically yet remain irrelevant to typical trajectories.

    Authors: We agree that the reachability of the channels is a key aspect of our claims and that our current support is primarily empirical. The manuscript presents numerical evidence from multiple regression tasks showing that gradient-based optimizers consistently converge to these channels. To address the referee's concern, we will revise the abstract and the relevant sections to temper the language, emphasizing that the claim is based on observed behavior in the studied settings rather than a proven high-probability result for all cases. We will also add further experimental controls and discussion of potential other attractors. A full basin-volume analysis or Lyapunov study is not included and would constitute a substantial extension of the work. revision: partial

  2. Referee: [Geometry / symmetry analysis] The geometric claim that the channels are 'asymptotically parallel to symmetry-induced lines of critical points' is central to the characterization. The manuscript should supply the explicit symmetry group, the associated conserved quantities or Hessian null directions, and the precise sense in which the flow approaches these lines (e.g., a differential equation for the transverse coordinates).

    Authors: We appreciate this suggestion for greater precision in the geometric analysis. The symmetry in question is the permutation symmetry among identical neurons in the hidden layer. We will explicitly identify the symmetry group as the symmetric group S_n acting by permuting the neuron indices. The associated conserved quantities include the loss invariance under such permutations, and the Hessian has null directions corresponding to these infinitesimal symmetries. In the revised version, we will add a subsection detailing these elements and describe the approach to the symmetry lines via the transverse dynamics, including a reduced differential equation for the deviation from the line. This will clarify the asymptotic parallelism. revision: yes
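
The permutation symmetry named in this response is easy to illustrate. Relabeling hidden neurons, that is, applying the same permutation to the rows of the input weights and to the output weights, leaves the network function unchanged, and therefore any loss computed from it. A minimal sketch, assuming a one-hidden-layer tanh network without biases; the data, target, and names are illustrative.

```python
import numpy as np

def mse_loss(W, a, x, y):
    # Mean squared error of f(x) = sum_k a_k * tanh(w_k . x).
    return np.mean((np.tanh(x @ W.T) @ a - y) ** 2)

rng = np.random.default_rng(2)
x = rng.normal(size=(100, 4))
y = np.sin(x[:, 0])  # arbitrary regression target, for illustration only
W, a = rng.normal(size=(5, 4)), rng.normal(size=5)

perm = rng.permutation(5)  # one element of the symmetric group S_5
loss_original = mse_loss(W, a, x, y)
loss_permuted = mse_loss(W[perm], a[perm], x, y)
print(loss_original, loss_permuted)
```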

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained against external loss-landscape geometry

full rationale

The paper derives the channel geometry and the asymptotic gated-linear-unit form directly from the loss Hessian and symmetry-induced critical lines (visible in the abstract's functional limit and geometric parallelism statement). No equation reduces a claimed prediction to a fitted parameter by construction, no self-citation is invoked as a uniqueness theorem, and the reachability statement is framed as an empirical observation under gradient flow rather than a self-referential fit. The central characterization therefore rests on independent geometric analysis rather than circular re-labeling of inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard assumptions of gradient-flow dynamics in overparameterized networks and on the existence of symmetry-induced critical lines; no free parameters or new postulated entities are introduced in the abstract.

axioms (1)
  • domain assumption Gradient flow and related first-order methods govern the trajectories that reach the described channels in regression settings.
    Stated directly in the abstract as the condition under which the channels are reached with high probability.



Reference graph

Works this paper leans on

67 extracted references · 67 canonical work pages · 3 internal anchors

  1. [1]

    Local minima and plateaus in hierarchical structures of multilayer perceptrons

    Kenji Fukumizu and Shun-ichi Amari. Local minima and plateaus in hierarchical structures of multilayer perceptrons. Neural networks, 13(3):317–327, 2000

  2. [2]

    Identifying and attacking the saddle point problem in high-dimensional non-convex optimization

    Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates,...

  3. [3]

    The Loss Surfaces of Multilayer Networks

    Anna Choromanska, Mikael Henaff, Michael Mathieu, Gerard Ben Arous, and Yann LeCun. The Loss Surfaces of Multilayer Networks. In Guy Lebanon and S. V. N. Vishwanathan, editors, Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, volume 38 of Proceedings of Machine Learning Research, pages 192–204. PMLR, 2015

  4. [4]

    Visualizing the loss landscape of neural nets

    Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018

  5. [5]

    Semi-flat minima and saddle points by embedding neural networks to overparameterization

    Kenji Fukumizu, Shoichiro Yamaguchi, Yoh-ichi Mototake, and Mirai Tanaka. Semi-flat minima and saddle points by embedding neural networks to overparameterization. Advances in Neural Information Processing Systems, 32:13868–13876, 2019

  6. [6]

    Geometry of the loss landscape in overparameterized neural networks: Symmetries and invariances

    Berfin Şimşek, François Ged, Arthur Jacot, Francesco Spadaro, Clément Hongler, Wulfram Gerstner, and Johanni Brea. Geometry of the loss landscape in overparameterized neural networks: Symmetries and invariances. In International Conference on Machine Learning, pages 9722–9732. PMLR, 2021

  7. [7]

    Non-attracting regions of local minima in deep and wide neural networks

    Henning Petzka and Cristian Sminchisescu. Non-attracting regions of local minima in deep and wide neural networks. Journal of Machine Learning Research, 22(143):1–34, 2021

  8. [8]

    How to escape saddle points efficiently

    Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to escape saddle points efficiently. In International conference on machine learning, pages 1724–1732. PMLR, 2017

  9. [9]

    The Implicit Bias of Gradient Descent on Separable Data

    Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The Implicit Bias of Gradient Descent on Separable Data. Journal of Machine Learning Research, 19(70):1–57, 2018

  10. [10]

    Embedding principle of loss landscape of deep neural networks

    Yaoyu Zhang, Zhongwang Zhang, Tao Luo, and Zhiqin J Xu. Embedding principle of loss landscape of deep neural networks. Advances in Neural Information Processing Systems, 34:14848–14859, 2021

  11. [11]

    Embedding principle: a hierarchical structure of loss landscape of deep neural networks

    Yaoyu Zhang, Yuqing Li, Zhongwang Zhang, Tao Luo, and Zhi-Qin John Xu. Embedding principle: a hierarchical structure of loss landscape of deep neural networks. arXiv preprint arXiv:2111.15527, 2021

  12. [12]

    Splitting steepest descent for growing neural architectures

    Lemeng Wu, Dilin Wang, and Qiang Liu. Splitting steepest descent for growing neural architectures. Advances in neural information processing systems, 32, 2019

  13. [13]

    Steepest descent neural architecture optimization: Escaping local optimum with signed neural splitting

    Lemeng Wu, Mao Ye, Qi Lei, Jason D Lee, and Qiang Liu. Steepest descent neural architecture optimization: Escaping local optimum with signed neural splitting. arXiv preprint arXiv:2003.10392, 2020

  14. [14]

    An analysis on negative curvature induced by singularity in multi-layer neural-network learning

    Eiji Mizutani and Stuart Dreyfus. An analysis on negative curvature induced by singularity in multi-layer neural-network learning. Advances in Neural Information Processing Systems, 23, 2010

  15. [15]

    Local minima and back propagation

    Timothy Poston, C-N Lee, Y Choie, and Yonghoon Kwon. Local minima and back propagation. In IJCNN-91-Seattle International Joint Conference on Neural Networks, volume 2, pages 173–176. IEEE, 1991

  16. [16]

    No bad local minima: Data independent training error guarantees for multilayer neural networks

    Daniel Soudry and Yair Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016

  17. [17]

    The loss surface of deep and wide neural networks

    Quynh Nguyen and Matthias Hein. The loss surface of deep and wide neural networks. In International conference on machine learning, pages 2603–2612. PMLR, 2017

  18. [18]

    The global landscape of neural networks: An overview

    Ruoyu Sun, Dawei Li, Shiyu Liang, Tian Ding, and Rayadurgam Srikant. The global landscape of neural networks: An overview. IEEE Signal Processing Magazine, 37(5):95–108, 2020

  19. [19]

    Spurious local minima are common in two-layer relu neural networks

    Itay Safran and Ohad Shamir. Spurious local minima are common in two-layer relu neural networks. In International Conference on Machine Learning, pages 4433–4441. PMLR, 2018

  20. [20]

    Analytic study of families of spurious minima in two-layer relu neural networks: a tale of symmetry ii

    Yossi Arjevani and Michael Field. Analytic study of families of spurious minima in two-layer relu neural networks: a tale of symmetry ii. Advances in Neural Information Processing Systems, 34:15162–15174, 2021

  21. [21]

    Expand-and-cluster: Parameter recovery of neural networks

    Flavio Martinelli, Berfin Simsek, Wulfram Gerstner, and Johanni Brea. Expand-and-cluster: Parameter recovery of neural networks. In Forty-first International Conference on Machine Learning, 2024

  22. [22]

    Learning gaussian multi-index models with gradient flow: Time complexity and directional convergence

    Berfin Şimşek, Amire Bendjeddou, and Daniel Hsu. Learning gaussian multi-index models with gradient flow: Time complexity and directional convergence. arXiv preprint arXiv:2411.08798, 2024

  23. [23]

    The effects of mild over-parameterization on the optimization landscape of shallow relu neural networks

    Itay M Safran, Gilad Yehudai, and Ohad Shamir. The effects of mild over-parameterization on the optimization landscape of shallow relu neural networks. In Conference on Learning Theory, pages 3889–3934. PMLR, 2021

  24. [24]

    Who is afraid of big bad minima? analysis of gradient-flow in spiked matrix-tensor models

    Stefano Sarao Mannelli, Giulio Biroli, Chiara Cammarota, Florent Krzakala, and Lenka Zdeborová. Who is afraid of big bad minima? analysis of gradient-flow in spiked matrix-tensor models. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Asso...

  25. [25]

    Radford M. Neal. Bayesian Learning for Neural Networks. Springer New York, 1996

  26. [26]

    Computing with infinite networks

    Christopher Williams. Computing with infinite networks. In M.C. Mozer, M. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems, volume 9. MIT Press, 1996

  27. [27]

    Deep neural networks as gaussian processes

    Jaehoon Lee, Jascha Sohl-Dickstein, Jeffrey Pennington, Roman Novak, Sam Schoenholz, and Yasaman Bahri. Deep neural networks as gaussian processes. In International Conference on Learning Representations, 2018

  28. [28]

    Gaussian process behaviour in wide deep neural networks

    Alexander G d G Matthews, Jiri Hron, Mark Rowland, Richard E Turner, and Zoubin Ghahramani. Gaussian process behaviour in wide deep neural networks. In International Conference on Learning Representations, 2018

  29. [29]

    The loss landscape of overparameterized neural networks

    Yaim Cooper. The loss landscape of overparameterized neural networks. arXiv preprint arXiv:1804.10200, 2018

  30. [30]

    Loss landscapes and optimization in over-parameterized non-linear systems and neural networks

    Chaoyue Liu, Libin Zhu, and Mikhail Belkin. Loss landscapes and optimization in over-parameterized non-linear systems and neural networks. Applied and Computational Harmonic Analysis, 59:85–116, 2022

  31. [31]

    Loss surfaces, mode connectivity, and fast ensembling of dnns

    Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry P Vetrov, and Andrew G Wilson. Loss surfaces, mode connectivity, and fast ensembling of dnns. Advances in neural information processing systems, 31, 2018

  32. [32]

    Linear mode connectivity and the lottery ticket hypothesis

    Jonathan Frankle, Gintare Karolina Dziugaite, Daniel Roy, and Michael Carbin. Linear mode connectivity and the lottery ticket hypothesis. In International Conference on Machine Learning, pages 3259–3269. PMLR, 2020

  33. [33]

    Model fusion via optimal transport

    Sidak Pal Singh and Martin Jaggi. Model fusion via optimal transport. Advances in Neural Information Processing Systems, 33:22045–22055, 2020

  34. [34]

    Git re-basin: Merging models modulo permutation symmetries

    Samuel K Ainsworth, Jonathan Hayase, and Siddhartha Srinivasa. Git re-basin: Merging models modulo permutation symmetries. arXiv preprint arXiv:2209.04836, 2022

  35. [35]

    Exploring neural network landscapes: Star-shaped and geodesic connectivity

    Zhanran Lin, Puheng Li, and Lei Wu. Exploring neural network landscapes: Star-shaped and geodesic connectivity. arXiv preprint arXiv:2404.06391, 2024

  36. [36]

    Do deep neural network solutions form a star domain?

    Ankit Sonthalia, Alexander Rubinstein, Ehsan Abbasnejad, and Seong Joon Oh. Do deep neural network solutions form a star domain? arXiv preprint arXiv:2403.07968, 2024

  37. [37]

    Large scale structure of neural network loss landscapes

    Stanislav Fort and Stanislaw Jastrzebski. Large scale structure of neural network loss landscapes. Advances in Neural Information Processing Systems, 32, 2019

  38. [38]

    Certifying the absence of spurious local minima at infinity

    Cédric Josz and Xiaopeng Li. Certifying the absence of spurious local minima at infinity. SIAM Journal on Optimization, 33(3):1416–1439, 2023

  39. [39]

    Adding one neuron can eliminate all bad local minima

    Shiyu Liang, Ruoyu Sun, Jason D Lee, and Rayadurgam Srikant. Adding one neuron can eliminate all bad local minima. Advances in Neural Information Processing Systems, 31, 2018

  40. [40]

    Elimination of all bad local minima in deep learning

    Kenji Kawaguchi and Leslie Kaelbling. Elimination of all bad local minima in deep learning. In International Conference on Artificial Intelligence and Statistics, pages 853–863. PMLR, 2020

  41. [41]

    Revisiting landscape analysis in deep neural networks: Eliminating decreasing paths to infinity

    Shiyu Liang, Ruoyu Sun, and R Srikant. Revisiting landscape analysis in deep neural networks: Eliminating decreasing paths to infinity. SIAM Journal on Optimization, 32(4):2797–2827, 2022

  42. [42]

    Über merkwürdige diskrete Eigenwerte. Über das Verhalten von Eigenwerten bei adiabatischen Prozessen

    J. von Neumann and E. Wigner. Über merkwürdige diskrete Eigenwerte. Über das Verhalten von Eigenwerten bei adiabatischen Prozessen. Physikalische Zeitschrift, 30:467–470, January 1929

  43. [43]

    MLPGradientFlow: Going with the flow of multilayer perceptrons (and finding minima fast and accurately), January 2023

    Johanni Brea, Flavio Martinelli, Berfin Şimşek, and Wulfram Gerstner. MLPGradientFlow: Going with the flow of multilayer perceptrons (and finding minima fast and accurately), January 2023

  44. [44]

    Language modeling with gated convolutional networks

    Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 933–941. PMLR, 06–11 Aug 2017

  45. [45]

    D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. ArXiv e-prints, December 2014

  46. [46]

    Implicit regularization for deep neural networks driven by an ornstein-uhlenbeck like process

    Guy Blanc, Neha Gupta, Gregory Valiant, and Paul Valiant. Implicit regularization for deep neural networks driven by an ornstein-uhlenbeck like process. In Jacob Abernethy and Shivani Agarwal, editors, Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pages 483–513. PMLR, 09–12 Jul 2020

  47. [47]

    What happens after SGD reaches zero loss? –a mathematical framework

    Zhiyuan Li, Tianhao Wang, and Sanjeev Arora. What happens after SGD reaches zero loss? –a mathematical framework. In International Conference on Learning Representations, 2022

  48. [48]

    Representational drift as a result of implicit regularization

    Aviv Ratzon, Dori Derdikman, and Omri Barak. Representational drift as a result of implicit regularization. April 2024

  49. [49]

    Gradient descent on neural networks typically occurs at the edge of stability

    Jeremy M Cohen, Simran Kaur, Yuanzhi Li, J Zico Kolter, and Ameet Talwalkar. Gradient descent on neural networks typically occurs at the edge of stability. arXiv preprint arXiv:2103.00065, 2021

  50. [50]

    Sharpness-Aware Minimization for Efficiently Improving Generalization

    Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. arXiv preprint arXiv:2010.01412, 2020

  51. [51]

    Bridging mode connectivity in loss landscapes and adversarial robustness

    Pu Zhao, Pin-Yu Chen, Payel Das, Karthikeyan Natesan Ramamurthy, and Xue Lin. Bridging mode connectivity in loss landscapes and adversarial robustness. arXiv preprint arXiv:2005.00060, 2020

  52. [52]

    Exploring diversified adversarial robustness in neural networks via robust mode connectivity

    Ren Wang, Yuxuan Li, and Sijia Liu. Exploring diversified adversarial robustness in neural networks via robust mode connectivity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2346–2352, 2023

  53. [53]

    Linear mode connectivity in multitask and continual learning

    Seyed Iman Mirzadeh, Mehrdad Farajtabar, Dilan Gorur, Razvan Pascanu, and Hassan Ghasemzadeh. Linear mode connectivity in multitask and continual learning. arXiv preprint arXiv:2010.04495, 2020

  54. [54]

    Optimizing mode connectivity for class incremental learning

    Haitao Wen, Haoyang Cheng, Heqian Qiu, Lanxiao Wang, Lili Pan, and Hongliang Li. Optimizing mode connectivity for class incremental learning. In International Conference on Machine Learning, pages 36940–36957. PMLR, 2023

  55. [55]

    Federated learning with matched averaging

    Hongyi Wang, Mikhail Yurochkin, Yuekai Sun, Dimitris Papailiopoulos, and Yasaman Khazaeni. Federated learning with matched averaging. arXiv preprint arXiv:2002.06440, 2020

    A Neuron duplication introduces lines of critical points

    A.1 Adding bias to neurons drastically reduces the number of stable plateau saddles

    In the main text we highlighted specific...

  56. [56]

    Use the reparametrization below Equation 3 to obtain $c^{(t)}$, $a^{(t)}$, $\mathbf{w}^{(t)}$, $\Delta^{(t)}$ and $\epsilon^{(t)}$ for given values of $a_i^{(t)}$, $a_j^{(t)}$, $\mathbf{w}_i^{(t)}$, $\mathbf{w}_j^{(t)}$ in step $t$ of the ODE solver.

    Figure B9: Fraction of updates parallel to saddle line across all datasets. Panels: GP (s = 0.1), GP (s = 0.5), GP (s = 2.0), GP (s = 10), rosenbrock.

  57. [57]

    Move approximately in the direction of the channel by lowering $\epsilon^{(t+1)} = \epsilon^{(t)}/2$, while keeping $c^{(t+1)} = c^{(t)}$, $a^{(t+1)} = a^{(t)}$, $\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)}$, $\Delta^{(t+1)} = \Delta^{(t)}$. This point may not be at the “bottom of the channel”, because the other parameters also move slightly when lowering $\epsilon$.

  58. [58]

    Compute the corresponding parameters $a_i^{(t+1)}$, $a_j^{(t+1)}$, $\mathbf{w}_i^{(t+1)}$, $\mathbf{w}_j^{(t+1)}$ and continue the ODE solver from this point to move again closer to the “bottom of the channel”.

    C.2 Expansion of the loss in ϵ

    We start with the reparameterization in main text Equation 3, $a_i\sigma(\mathbf{w}_i \cdot \mathbf{x}) + a_j\sigma(\mathbf{w}_j \cdot \mathbf{x}) = \frac{c}{2}\big[\sigma((\mathbf{w} + \epsilon\Delta) \cdot \mathbf{x}) + \sigma((\mathbf{w} - \epsilon\Delta) \cdot \mathbf{x})\big] + \frac{a}{2\epsilon}\big[\sigma((\mathbf{w} + \epsilon\Delta) \cdot \mathbf{x}) - \sigma$...

  59. [59]

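    the function $g$ can be computed analytically (see C.3.1). Inserting Equation 15 into Equation 14 leads to $\ell(\theta) = \frac{1}{2}\langle f(\mathbf{x}; \theta^*)^2 \rangle - \sum_{j=1}^{r^*} \sum_{k=1}^{r} a_j^* a_k\, g(b_j^*, b_k, \mathbf{w}_j^* \cdot \mathbf{w}_j^*, \mathbf{w}_j^* \cdot \mathbf{w}_k, \mathbf{w}_k \cdot \mathbf{w}_k) + \frac{1}{2} \sum_{j=1}^{r} \sum_{k=1}^{r} a_j a_k\, g(b_j, b_k, \mathbf{w}_j \cdot \mathbf{w}_j, \mathbf{w}_j \cdot \mathbf{w}_k, \mathbf{w}_k \cdot \mathbf{w}_k)$ (16). We investigate the properties of the landscape using gradient flow, $\dot{\theta} = -\nabla_\theta \ell(\theta)$, where $\ell(\theta)$...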

  60. [60]

    we use 2G(z) − 1 = erf(z/√2), leading to g(µ1, µ2, σ2 1, σ1σ2ρ, σ2

  61. [61]

    $= 4\,\mathrm{BvN}\!\left(\frac{\mu_1}{\sqrt{1 + \sigma_1^2}}, \frac{\mu_2}{\sqrt{1 + \sigma_2^2}}; \rho \frac{\sigma_1 \sigma_2}{\sqrt{1 + \sigma_1^2}\sqrt{1 + \sigma_2^2}}\right) - 2G\!\left(\frac{\mu_1}{\sqrt{1 + \sigma_1^2}}\right) - 2G\!\left(\frac{\mu_2}{\sqrt{1 + \sigma_2^2}}\right) + 1$ (21)

    where we used 10,010.8 from [58], $\int_{-\infty}^{\infty} dx\, G'(x)\, G(a + bx) = G\!\left(\frac{a}{\sqrt{1 + b^2}}\right)$.

    C.3.2 Minimum at infinity

    Here, we derive the stability condition for the minimum at the end of a channel. To this end, we consider the simplified setting where the input i...

  62. [62]

    + α1(ω1 · x)σ′(ω0 · x/ √

  63. [63]

    $+\, O(\epsilon^2)$. (33) Note that the error is indeed $O(\epsilon^2)$ because $\sum_{i=0}^{1} u_{i1}^3 = 0$. Connecting this result with the notation in the main text, we see $c = \sqrt{2}\,\alpha_0$, $a = \alpha_1$, $\mathbf{w} = \omega_0/\sqrt{2}$, and $\Delta = \omega_1$.

    C.5.2 Second Derivative with Three Neurons

    For the second derivative with a three neuron network we change basis to $u_0 = \frac{1}{\sqrt{3}}(1, 1, 1)^\top$, $u_1 = \frac{1}{\sqrt{2}}(1, 0, -1)^\top$, $u_2 = \frac{1}{\sqrt{6}}(1, -2,$ ...

  64. [64]

    $+\, [\alpha_1(\omega_1 \cdot \mathbf{x}) + \alpha_2(\omega_2 \cdot \mathbf{x})]\,\sigma'(\omega_0 \cdot \mathbf{x}/\sqrt{3}) + \frac{1}{12}\alpha_2(\omega_1 \cdot \mathbf{x})^2 \sigma''(\omega_0 \cdot \mathbf{x}/\sqrt{3})$. (35) Note that it is necessary to have the $\omega_2 \cdot \mathbf{x}$ contribution to be $O(\epsilon^2)$, otherwise the $\alpha_2(\omega_2 \cdot \mathbf{x})$ term would diverge with $1/\epsilon$.

    Appendix References

  65. [65]

    Higher-order additive runge–kutta schemes for ordinary differential equations

    Christopher A Kennedy and Mark H Carpenter. Higher-order additive runge–kutta schemes for ordinary differential equations. Applied numerical mathematics, 136:183–205, 2019

  66. [66]

    An introduction to numerical analysis

    Endre Süli and David F Mayers. An introduction to numerical analysis. Cambridge university press, 2003

  67. [67]

    Donald B. Owen. A table of normal integrals. Commun. Stat. Simul. Comput., 9(4):389–419, 1980