Flat Channels to Infinity in Neural Loss Landscapes

Alexander Van Meegen; Berfin Şimşek; Flavio Martinelli; Johanni Brea; Wulfram Gerstner

arXiv: 2506.14951 · v4 · submitted 2025-06-17 · cs.LG · cs.AI · cs.NE

Pith reviewed 2026-05-19 08:51 UTC · model grok-4.3

classification cs.LG · cs.AI · cs.NE

keywords loss landscape · flat minima · neural networks · gradient flow · gated linear units · regression · symmetry

The pith

Neural network loss landscapes contain flat channels leading to infinity where neuron pairs form gated linear units.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies special structures in neural network loss landscapes: channels along which the loss decreases extremely slowly. Along these channels the output weights of at least two neurons diverge to positive and negative infinity while their input weight vectors become identical. Gradient flow and related optimizers reach the channels with high probability in regression tasks, yet the channels look like ordinary flat minima with finite weights unless examined closely. The channels run asymptotically parallel to lines of critical points created by network symmetries. At the far end the two neurons together realize a gated linear unit of the form σ(w · x) + (v · x) σ'(w · x).

Core claim

We identify and characterize channels along which the loss decreases extremely slowly, while the output weights of at least two neurons, a_i and a_j, diverge to ±∞, and their input weight vectors, w_i and w_j, become equal to each other. At convergence, the two neurons implement a gated linear unit: a_i σ(w_i · x) + a_j σ(w_j · x) → σ(w · x) + (v · x) σ'(w · x). Geometrically, these channels to infinity are asymptotically parallel to symmetry-induced lines of critical points. Gradient flow solvers reach the channels with high probability in diverse regression settings.
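A short Taylor expansion shows why a pair of neurons with merging input weights and diverging output weights can approach a gated linear unit. The parametrization below is an assumption chosen for illustration: the symbols c, a, Δ and ϵ appear in the figure captions, but their exact definitions in the paper are not reproduced here, and σ is assumed to be smooth.

```latex
\begin{align*}
\mathbf{w}_i &= \mathbf{w} + \epsilon\,\boldsymbol{\Delta}, \qquad
\mathbf{w}_j = \mathbf{w} - \epsilon\,\boldsymbol{\Delta}, \qquad
a_i = \frac{c}{2} + \frac{a}{2\epsilon}, \qquad
a_j = \frac{c}{2} - \frac{a}{2\epsilon}, \\[4pt]
a_i\,\sigma(\mathbf{w}_i\cdot\mathbf{x}) + a_j\,\sigma(\mathbf{w}_j\cdot\mathbf{x})
&= \frac{c}{2}\bigl[\sigma(u+\epsilon\delta) + \sigma(u-\epsilon\delta)\bigr]
 + \frac{a}{2\epsilon}\bigl[\sigma(u+\epsilon\delta) - \sigma(u-\epsilon\delta)\bigr],
 \qquad u = \mathbf{w}\cdot\mathbf{x},\ \delta = \boldsymbol{\Delta}\cdot\mathbf{x}, \\[4pt]
&= c\,\sigma(u) + a\,\delta\,\sigma'(u) + \mathcal{O}(\epsilon^2) \\
&= c\,\sigma(\mathbf{w}\cdot\mathbf{x}) + (\mathbf{v}\cdot\mathbf{x})\,\sigma'(\mathbf{w}\cdot\mathbf{x}) + \mathcal{O}(\epsilon^2),
 \qquad \mathbf{v} = a\,\boldsymbol{\Delta}.
\end{align*}
```

Under this assumed parametrization, taking ϵ → 0 sends a_i and a_j to infinity with opposite signs while w_i and w_j merge, and the odd-order terms in ϵ cancel, leaving a residual of order ϵ². The expression in the core claim has a unit coefficient on σ(w · x), which is c = 1 in this notation.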

What carries the argument

Flat channels to infinity in parameter space, asymptotically parallel to symmetry-induced lines of critical points, that end in a pair of neurons implementing a gated linear unit.
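The gated-linear-unit limit can be sanity-checked numerically. The sketch below is not the paper's procedure: it assumes a softplus activation and a hypothetical symmetric parametrization (input weights w ± ϵΔ, output weights c/2 ± a/(2ϵ)), and the function names `pair_output` and `glu_output` are illustrative.

```python
import numpy as np

def softplus(z):
    # Numerically stable softplus: log(1 + exp(z))
    return np.logaddexp(0.0, z)

def sigmoid(z):
    # Derivative of softplus
    return 1.0 / (1.0 + np.exp(-z))

def pair_output(X, w, delta, c, a, eps):
    """Two neurons with nearly equal input weights and large, opposite output weights."""
    a_i = c / 2 + a / (2 * eps)
    a_j = c / 2 - a / (2 * eps)
    return a_i * softplus(X @ (w + eps * delta)) + a_j * softplus(X @ (w - eps * delta))

def glu_output(X, w, delta, c, a):
    """Assumed limit: c * sigma(w.x) + (v.x) * sigma'(w.x) with v = a * delta."""
    return c * softplus(X @ w) + a * (X @ delta) * sigmoid(X @ w)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))          # random inputs, for illustration only
w, delta = rng.normal(size=3), rng.normal(size=3)
c, a = 1.0, 0.5

# Maximum deviation between the neuron pair and the limiting unit for shrinking eps
errors = {
    eps: np.max(np.abs(pair_output(X, w, delta, c, a, eps) - glu_output(X, w, delta, c, a)))
    for eps in (1e-1, 1e-2, 1e-3)
}
for eps, err in errors.items():
    print(f"eps={eps:g}  max error={err:.3e}")
```

If the expansion holds, the printed error should shrink by roughly two orders of magnitude for each tenfold reduction of ϵ, which is the ϵ² scaling mentioned in the Figure 6 caption.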

If this is right

Gradient flow solvers and SGD or ADAM reach the channels with high probability in diverse regression settings.
Without careful inspection the channels appear as flat local minima with finite parameter values.
The channels supply a comprehensive picture of quasi-flat regions in terms of gradient dynamics, geometry, and functional form.
The emergence of gated linear units at the end of the channels points to a computational capability of fully connected layers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Training trajectories that appear to have converged may actually continue slow movement along these channels for many more steps.
Detecting such channels could guide regularization methods that penalize large weight divergence.
The same geometric mechanism may operate in deeper networks or other activation functions.

Load-bearing premise

The analysis assumes that gradient flow, SGD, and ADAM reach these channels with high probability in diverse regression settings, and that the channels are asymptotically parallel to symmetry-induced lines of critical points.

What would settle it

Train a two-layer network on a regression task with gradient descent and check whether pairs of neurons show output weights diverging to plus and minus infinity, input weights aligning, and loss continuing to decrease slowly over long times.
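Part of this check can be automated with the heuristic from the Figure 5 caption: find the pair of hidden neurons whose input weight vectors are closest in cosine distance, then look at the sum of their absolute output weights. The sketch below assumes a bias-free two-layer network with input weights `W` of shape (hidden, inputs) and output weights `a` of shape (hidden,). The function names, the opposite-sign test and the thresholds are hypothetical and not taken from the paper.

```python
import numpy as np

def closest_pair_diagnostics(W, a):
    """Return (i, j, cosine_distance, abs_output_sum) for the closest pair of input weight vectors."""
    W = np.asarray(W, dtype=float)
    a = np.asarray(a, dtype=float)
    unit = W / np.linalg.norm(W, axis=1, keepdims=True)
    cos_dist = 1.0 - unit @ unit.T
    np.fill_diagonal(cos_dist, np.inf)  # ignore self-pairs
    i, j = np.unravel_index(np.argmin(cos_dist), cos_dist.shape)
    return int(i), int(j), float(cos_dist[i, j]), float(abs(a[i]) + abs(a[j]))

def looks_like_channel(W, a, max_cos_dist=1e-3, min_abs_sum=1e2):
    """Heuristic flag; both thresholds are illustrative assumptions."""
    i, j, dist, abs_sum = closest_pair_diagnostics(W, a)
    opposite_signs = a[i] * a[j] < 0
    return bool(dist < max_cos_dist and abs_sum > min_abs_sum and opposite_signs)

# Hand-made example: neurons 0 and 1 are nearly aligned with large, opposite output weights
W = np.array([[1.0, 2.0], [1.001, 2.0], [-3.0, 0.5]])
a = np.array([500.0, -499.0, 0.7])
print(closest_pair_diagnostics(W, a))
print(looks_like_channel(W, a))
```

Logging these two quantities together with the loss at several checkpoints would show whether the output weights keep growing and the input weights keep aligning while the loss still decreases slowly. A single snapshot cannot distinguish a channel from a finite flat minimum.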

Figures

Figures reproduced from arXiv: 2506.14951 by Alexander Van Meegen, Berfin Şimşek, Flavio Martinelli, Johanni Brea, Wulfram Gerstner.

**Figure 1.** Saddle lines à la Fukumizu & Amari [1] and channels to infinity. Left: Duplicating a neuron in a network trained to convergence generates lines of saddle points in the loss landscape [1]. Duplicated neurons share the input weights of the original neuron while their output weights γa, (1 − γ)a sum to the original neuron’s output weight a. Middle: Loss landscape of duplicated network projected along the saddl…

**Figure 2.** Stable plateau-saddles can be found in MLPs with scalar output and no bias: (a) Networks of 1 to 5 hidden neurons are trained on the shown 2D regression target (logarithm of the Rosenbrock function, see Appendix A). Training follows full-batch gradient flow dynamics until convergence to a critical point. A quantification of unique solutions in weight-space (up to permutation symmetries) is shown at the bot…

**Figure 3.** Loss landscape of plateau-saddles: (a) Schematic of trajectories around saddle line. Repulsive directions of strict saddles (red segments) become attractive for the plateau-saddle (orange segment). (b) Loss landscape along the duplication parameter γ and the direction of smallest eigenvalue of the Hessian, α e_min(γ). (c) Example of neuron duplication for loss function shown in b: small perturbations are stable …

**Figure 5.** Frequency and properties of channels to infinity: (a) As a heuristic to identify channels to infinity, we look at the cosine distance of the pair of closest input weight vectors within a network and the sum of absolute output weights corresponding to that pair. Channel solutions are identified by having a pair of neurons with large output weight norm and a small distance in input weights (top left section …

**Figure 6.** Convergence in ϵ to gated linear units. (a) Moving along a channel to infinity with the jump procedure described in Appendix C shows that the loss and the approximation error decrease with ϵ², as predicted by the theory, and that c, a, and the cosine similarity cos(∆, w) converge to constant values. A network with 8 input dimensions and 8 hidden softplus neurons (81 parameters) trained on the Rosenbrock…

**Figure 2.** The L∞-norm max_i |∇_{θ_i} L(θ)| is used to quantify the gradient norm.

**Figure 5.** The L∞-norm max_i |∇_{θ_i} L(θ)| is used to quantify the gradient norm. Figure B7: Examples of 2D GP datasets: 2D GP datasets used in

Original abstract

The loss landscapes of neural networks contain minima and saddle points that may be connected in flat regions or appear in isolation. We identify and characterize a special structure in the loss landscape: channels along which the loss decreases extremely slowly, while the output weights of at least two neurons, $a_i$ and $a_j$, diverge to $\pm\infty$, and their input weight vectors, $\mathbf{w}_i$ and $\mathbf{w}_j$, become equal to each other. At convergence, the two neurons implement a gated linear unit: $a_i\sigma(\mathbf{w}_i \cdot \mathbf{x}) + a_j\sigma(\mathbf{w}_j \cdot \mathbf{x}) \rightarrow \sigma(\mathbf{w} \cdot \mathbf{x}) + (\mathbf{v} \cdot \mathbf{x}) \sigma'(\mathbf{w} \cdot \mathbf{x})$. Geometrically, these channels to infinity are asymptotically parallel to symmetry-induced lines of critical points. Gradient flow solvers, and related optimization methods like SGD or ADAM, reach the channels with high probability in diverse regression settings, but without careful inspection they look like flat local minima with finite parameter values. Our characterization provides a comprehensive picture of these quasi-flat regions in terms of gradient dynamics, geometry, and functional interpretation. The emergence of gated linear units at the end of the channels highlights a surprising aspect of the computational capabilities of fully connected layers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper maps channels to infinity in neural loss landscapes along which the output weights of a neuron pair diverge while their input weights become equal, so that the pair forms a gated linear unit. The main open question is the claim that gradient flow reaches these channels.

Hi colleague, the main thing to know is that this paper identifies and characterizes flat channels in neural loss landscapes along which loss decreases very slowly while two neurons' output weights go to plus and minus infinity and their input weights become identical, so the pair asymptotically implements a gated linear unit of the form σ(w·x) + (v·x)σ'(w·x). They tie the geometry to symmetry-induced lines of critical points and describe the gradient dynamics along the channel.

That functional interpretation and the concrete link to emergent gated units in ordinary fully connected layers is the clearest new piece. It builds on existing work on flat minima and symmetries but adds a specific mechanism and limiting computation that was not spelled out before. The geometric and dynamical description looks careful and gives a coherent picture of why these regions look like flat minima at finite parameters but are actually channels to infinity.

The soft spot is the reachability claim. The abstract and stress-test note both state that gradient flow, SGD, and ADAM reach these channels with high probability in diverse regression settings, yet there is no basin-volume estimate, Lyapunov analysis near the channel, or controls showing that typical trajectories avoid other critical sets. Without that quantitative support the channels exist mathematically but their relevance to observed training remains unclear.

This is for readers working on loss-landscape geometry, optimization dynamics, and implicit regularization in deep networks. A serious referee should see it because the characterization is precise enough to be checked and the idea could matter for understanding emergent computational structure, even if the probability part needs tightening.

Referee Report

2 major / 1 minor

Summary. The paper identifies and characterizes 'flat channels to infinity' in neural network loss landscapes. Along these channels the loss decreases extremely slowly while output weights a_i and a_j diverge to ±∞ and input weights w_i, w_j align; the pair asymptotically implements a gated linear unit of the form σ(w·x) + (v·x)σ'(w·x). The channels are asymptotically parallel to symmetry-induced lines of critical points. The authors assert that gradient flow, SGD and ADAM reach these channels with high probability in diverse regression settings, making them appear as flat local minima with finite parameters. The work supplies a unified picture in terms of gradient dynamics, geometry and functional interpretation.

Significance. If the reachability claim is substantiated, the result is significant: it supplies a concrete dynamical and geometric mechanism for the flat regions routinely observed in neural loss landscapes and directly links them to the emergence of gated-linear-unit-like computations inside fully connected layers. The explicit functional limit and the asymptotic parallelism to symmetry lines are strengths that go beyond standard critical-point catalogs and could inform implicit-regularization analyses. The paper earns credit for attempting a comprehensive characterization that integrates dynamics, geometry and expressivity.

major comments (2)

[Abstract] Abstract: the assertion that 'Gradient flow solvers, and related optimization methods like SGD or ADAM, reach the channels with high probability in diverse regression settings' is load-bearing for the claim that these structures explain observed training behavior. No basin-volume estimate, Lyapunov analysis near the channel, or controls ruling out other attractors are supplied; without such support the channels may exist mathematically yet remain irrelevant to typical trajectories.
[Geometry / symmetry analysis] The geometric claim that the channels are 'asymptotically parallel to symmetry-induced lines of critical points' is central to the characterization. The manuscript should supply the explicit symmetry group, the associated conserved quantities or Hessian null directions, and the precise sense in which the flow approaches these lines (e.g., a differential equation for the transverse coordinates).

minor comments (1)

[Abstract / notation] The limiting expression for the gated linear unit in the abstract would benefit from an explicit parametrization (e.g., a scaling parameter t → ∞ along the channel) that makes the divergence rates of a_i, a_j and the alignment of w_i, w_j mathematically precise.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and for recognizing the potential significance of our findings. We provide point-by-point responses to the major comments below and describe the revisions we intend to implement.

Referee: [Abstract] Abstract: the assertion that 'Gradient flow solvers, and related optimization methods like SGD or ADAM, reach the channels with high probability in diverse regression settings' is load-bearing for the claim that these structures explain observed training behavior. No basin-volume estimate, Lyapunov analysis near the channel, or controls ruling out other attractors are supplied; without such support the channels may exist mathematically yet remain irrelevant to typical trajectories.

Authors: We agree that the reachability of the channels is a key aspect of our claims and that our current support is primarily empirical. The manuscript presents numerical evidence from multiple regression tasks showing that gradient-based optimizers consistently converge to these channels. To address the referee's concern, we will revise the abstract and the relevant sections to temper the language, emphasizing that the claim is based on observed behavior in the studied settings rather than a proven high-probability result for all cases. We will also add further experimental controls and discussion of potential other attractors. A full basin-volume analysis or Lyapunov study is not included and would constitute a substantial extension of the work.
revision: partial
Referee: [Geometry / symmetry analysis] The geometric claim that the channels are 'asymptotically parallel to symmetry-induced lines of critical points' is central to the characterization. The manuscript should supply the explicit symmetry group, the associated conserved quantities or Hessian null directions, and the precise sense in which the flow approaches these lines (e.g., a differential equation for the transverse coordinates).

Authors: We appreciate this suggestion for greater precision in the geometric analysis. The symmetry in question is the permutation symmetry among identical neurons in the hidden layer. We will explicitly identify the symmetry group as the symmetric group S_n acting by permuting the neuron indices. The associated conserved quantities include the loss invariance under such permutations, and the Hessian has null directions corresponding to these infinitesimal symmetries. In the revised version, we will add a subsection detailing these elements and describe the approach to the symmetry lines via the transverse dynamics, including a reduced differential equation for the deviation from the line. This will clarify the asymptotic parallelism.
revision: yes
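To make the symmetry-induced lines concrete, the following sketch illustrates the neuron-duplication construction described in the Figure 1 caption: a hidden neuron is copied, both copies keep its input weights, and its output weight a is split into γa and (1 − γ)a. The network function is the same for every γ, so varying γ traces a line in parameter space. The sketch only checks this function invariance, plus invariance under permuting hidden neurons; it does not verify that the points are critical. The tanh activation, the array shapes and the helper names are assumptions made for illustration.

```python
import numpy as np

def mlp(X, W, a, act=np.tanh):
    """Bias-free two-layer network: sum_k a_k * act(w_k . x)."""
    return act(X @ W.T) @ a

def duplicate_neuron(W, a, k, gamma):
    """Copy hidden neuron k and split its output weight into gamma*a_k and (1-gamma)*a_k."""
    W_new = np.vstack([W, W[k]])                 # the copy shares the input weights
    a_new = np.append(a, (1.0 - gamma) * a[k])   # new neuron gets (1 - gamma) * a_k
    a_new[k] = gamma * a[k]                      # original neuron keeps gamma * a_k
    return W_new, a_new

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
W, a = rng.normal(size=(3, 4)), rng.normal(size=3)
y = mlp(X, W, a)

# The output is unchanged along the whole line, including gamma outside [0, 1]
for gamma in (-2.0, 0.3, 5.0):
    W2, a2 = duplicate_neuron(W, a, k=0, gamma=gamma)
    print(gamma, np.max(np.abs(mlp(X, W2, a2) - y)))

# Permuting hidden neurons also leaves the output unchanged
perm = rng.permutation(3)
print(np.max(np.abs(mlp(X, W[perm], a[perm]) - y)))
```

For γ outside [0, 1] the two copies carry output weights of opposite sign, the same sign pattern the abstract describes for the diverging output weights along a channel.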

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained against external loss-landscape geometry

The paper derives the channel geometry and the asymptotic gated-linear-unit form directly from the loss Hessian and symmetry-induced critical lines (visible in the abstract's functional limit and geometric parallelism statement). No equation reduces a claimed prediction to a fitted parameter by construction, no self-citation is invoked as a uniqueness theorem, and the reachability statement is framed as an empirical observation under gradient flow rather than a self-referential fit. The central characterization therefore rests on independent geometric analysis rather than circular re-labeling of inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard assumptions of gradient-flow dynamics in overparameterized networks and on the existence of symmetry-induced critical lines; no free parameters or new postulated entities are introduced in the abstract.

axioms (1)

domain assumption: Gradient flow and related first-order methods govern the trajectories that reach the described channels in regression settings.
Stated directly in the abstract as the condition under which the channels are reached with high probability.

pith-pipeline@v0.9.0 · 5803 in / 1291 out tokens · 38981 ms · 2026-05-19T08:51:07.156620+00:00 · methodology

Reference graph

Works this paper leans on

67 extracted references · 67 canonical work pages · 3 internal anchors

[1] Kenji Fukumizu and Shun-ichi Amari. Local minima and plateaus in hierarchical structures of multilayer perceptrons. Neural Networks, 13(3):317–327, 2000.
[2] Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates,...
[3] Anna Choromanska, Mikael Henaff, Michael Mathieu, Gerard Ben Arous, and Yann LeCun. The Loss Surfaces of Multilayer Networks. In Guy Lebanon and S. V. N. Vishwanathan, editors, Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, volume 38 of Proceedings of Machine Learning Research, pages 192–204. PMLR, 2015.
[4] Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
[5] Kenji Fukumizu, Shoichiro Yamaguchi, Yoh-ichi Mototake, and Mirai Tanaka. Semi-flat minima and saddle points by embedding neural networks to overparameterization. Advances in Neural Information Processing Systems, 32:13868–13876, 2019.
[6] Berfin Şimşek, François Ged, Arthur Jacot, Francesco Spadaro, Clément Hongler, Wulfram Gerstner, and Johanni Brea. Geometry of the loss landscape in overparameterized neural networks: Symmetries and invariances. In International Conference on Machine Learning, pages 9722–9732. PMLR, 2021.
[7] Henning Petzka and Cristian Sminchisescu. Non-attracting regions of local minima in deep and wide neural networks. Journal of Machine Learning Research, 22(143):1–34, 2021.
[8] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to escape saddle points efficiently. In International Conference on Machine Learning, pages 1724–1732. PMLR, 2017.
[9] Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The Implicit Bias of Gradient Descent on Separable Data. Journal of Machine Learning Research, 19(70):1–57, 2018.
[10] Yaoyu Zhang, Zhongwang Zhang, Tao Luo, and Zhiqin J Xu. Embedding principle of loss landscape of deep neural networks. Advances in Neural Information Processing Systems, 34:14848–14859, 2021.
[11] Yaoyu Zhang, Yuqing Li, Zhongwang Zhang, Tao Luo, and Zhi-Qin John Xu. Embedding principle: a hierarchical structure of loss landscape of deep neural networks. arXiv preprint arXiv:2111.15527, 2021.
[12] Lemeng Wu, Dilin Wang, and Qiang Liu. Splitting steepest descent for growing neural architectures. Advances in Neural Information Processing Systems, 32, 2019.
[13] Lemeng Wu, Mao Ye, Qi Lei, Jason D Lee, and Qiang Liu. Steepest descent neural architecture optimization: Escaping local optimum with signed neural splitting. arXiv preprint arXiv:2003.10392, 2020.
[14] Eiji Mizutani and Stuart Dreyfus. An analysis on negative curvature induced by singularity in multi-layer neural-network learning. Advances in Neural Information Processing Systems, 23, 2010.
[15] Timothy Poston, C-N Lee, Y Choie, and Yonghoon Kwon. Local minima and back propagation. In IJCNN-91-Seattle International Joint Conference on Neural Networks, volume 2, pages 173–176. IEEE, 1991.
[16] Daniel Soudry and Yair Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016.
[17] Quynh Nguyen and Matthias Hein. The loss surface of deep and wide neural networks. In International Conference on Machine Learning, pages 2603–2612. PMLR, 2017.
[18] Ruoyu Sun, Dawei Li, Shiyu Liang, Tian Ding, and Rayadurgam Srikant. The global landscape of neural networks: An overview. IEEE Signal Processing Magazine, 37(5):95–108, 2020.
[19] Itay Safran and Ohad Shamir. Spurious local minima are common in two-layer relu neural networks. In International Conference on Machine Learning, pages 4433–4441. PMLR, 2018.
[20] Yossi Arjevani and Michael Field. Analytic study of families of spurious minima in two-layer relu neural networks: a tale of symmetry ii. Advances in Neural Information Processing Systems, 34:15162–15174, 2021.
[21] Flavio Martinelli, Berfin Simsek, Wulfram Gerstner, and Johanni Brea. Expand-and-cluster: Parameter recovery of neural networks. In Forty-first International Conference on Machine Learning, 2024.
[22] Berfin Şimşek, Amire Bendjeddou, and Daniel Hsu. Learning gaussian multi-index models with gradient flow: Time complexity and directional convergence. arXiv preprint arXiv:2411.08798, 2024.
[23] Itay M Safran, Gilad Yehudai, and Ohad Shamir. The effects of mild over-parameterization on the optimization landscape of shallow relu neural networks. In Conference on Learning Theory, pages 3889–3934. PMLR, 2021.
[24] Stefano Sarao Mannelli, Giulio Biroli, Chiara Cammarota, Florent Krzakala, and Lenka Zdeborová. Who is afraid of big bad minima? analysis of gradient-flow in spiked matrix-tensor models. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Asso...
[25] Radford M. Neal. Bayesian Learning for Neural Networks. Springer New York, 1996.
[26] Christopher Williams. Computing with infinite networks. In M.C. Mozer, M. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems, volume 9. MIT Press, 1996.
[27] Jaehoon Lee, Jascha Sohl-Dickstein, Jeffrey Pennington, Roman Novak, Sam Schoenholz, and Yasaman Bahri. Deep neural networks as gaussian processes. In International Conference on Learning Representations, 2018.
[28] Alexander G d G Matthews, Jiri Hron, Mark Rowland, Richard E Turner, and Zoubin Ghahramani. Gaussian process behaviour in wide deep neural networks. In International Conference on Learning Representations, 2018.
[29] Yaim Cooper. The loss landscape of overparameterized neural networks. arXiv preprint arXiv:1804.10200, 2018.
[30] Chaoyue Liu, Libin Zhu, and Mikhail Belkin. Loss landscapes and optimization in over-parameterized non-linear systems and neural networks. Applied and Computational Harmonic Analysis, 59:85–116, 2022.
[31] Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry P Vetrov, and Andrew G Wilson. Loss surfaces, mode connectivity, and fast ensembling of dnns. Advances in Neural Information Processing Systems, 31, 2018.
[32] Jonathan Frankle, Gintare Karolina Dziugaite, Daniel Roy, and Michael Carbin. Linear mode connectivity and the lottery ticket hypothesis. In International Conference on Machine Learning, pages 3259–3269. PMLR, 2020.
[33] Sidak Pal Singh and Martin Jaggi. Model fusion via optimal transport. Advances in Neural Information Processing Systems, 33:22045–22055, 2020.
[34] Samuel K Ainsworth, Jonathan Hayase, and Siddhartha Srinivasa. Git re-basin: Merging models modulo permutation symmetries. arXiv preprint arXiv:2209.04836, 2022.
[35] Zhanran Lin, Puheng Li, and Lei Wu. Exploring neural network landscapes: Star-shaped and geodesic connectivity. arXiv preprint arXiv:2404.06391, 2024.
[36] Ankit Sonthalia, Alexander Rubinstein, Ehsan Abbasnejad, and Seong Joon Oh. Do deep neural network solutions form a star domain? arXiv preprint arXiv:2403.07968, 2024.
[37] Stanislav Fort and Stanislaw Jastrzebski. Large scale structure of neural network loss landscapes. Advances in Neural Information Processing Systems, 32, 2019.
[38] Cédric Josz and Xiaopeng Li. Certifying the absence of spurious local minima at infinity. SIAM Journal on Optimization, 33(3):1416–1439, 2023.
[39] Shiyu Liang, Ruoyu Sun, Jason D Lee, and Rayadurgam Srikant. Adding one neuron can eliminate all bad local minima. Advances in Neural Information Processing Systems, 31, 2018.
[40] Kenji Kawaguchi and Leslie Kaelbling. Elimination of all bad local minima in deep learning. In International Conference on Artificial Intelligence and Statistics, pages 853–863. PMLR, 2020.
[41] Shiyu Liang, Ruoyu Sun, and R Srikant. Revisiting landscape analysis in deep neural networks: Eliminating decreasing paths to infinity. SIAM Journal on Optimization, 32(4):2797–2827, 2022.
[42] J. von Neumann and E. Wigner. Über merkwürdige diskrete Eigenwerte. Über das Verhalten von Eigenwerten bei adiabatischen Prozessen. Physikalische Zeitschrift, 30:467–470, January 1929.
[43] Johanni Brea, Flavio Martinelli, Berfin Şimşek, and Wulfram Gerstner. MLPGradientFlow: Going with the flow of multilayer perceptrons (and finding minima fast and accurately), January 2023.
[44] Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 933–941. PMLR, 06–11 Aug 2017.
[45] D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. ArXiv e-prints, December 2014.
[46] Guy Blanc, Neha Gupta, Gregory Valiant, and Paul Valiant. Implicit regularization for deep neural networks driven by an ornstein-uhlenbeck like process. In Jacob Abernethy and Shivani Agarwal, editors, Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pages 483–513. PMLR, 09–12 Jul 2020.
[47] Zhiyuan Li, Tianhao Wang, and Sanjeev Arora. What happens after SGD reaches zero loss? –a mathematical framework. In International Conference on Learning Representations, 2022.
[48] Aviv Ratzon, Dori Derdikman, and Omri Barak. Representational drift as a result of implicit regularization. April 2024.
[49] Jeremy M Cohen, Simran Kaur, Yuanzhi Li, J Zico Kolter, and Ameet Talwalkar. Gradient descent on neural networks typically occurs at the edge of stability. arXiv preprint arXiv:2103.00065, 2021.
[50] Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. arXiv preprint arXiv:2010.01412, 2020.
[51]

Bridg- ing mode connectivity in loss landscapes and adversarial robustness

Pu Zhao, Pin-Yu Chen, Payel Das, Karthikeyan Natesan Ramamurthy, and Xue Lin. Bridg- ing mode connectivity in loss landscapes and adversarial robustness. arXiv preprint arXiv:2005.00060, 2020

work page arXiv 2005
[52]

Exploring diversified adversarial robustness in neural networks via robust mode connectivity

Ren Wang, Yuxuan Li, and Sijia Liu. Exploring diversified adversarial robustness in neural networks via robust mode connectivity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2346–2352, 2023

work page 2023
[53]

Linear mode connectivity in multitask and continual learning

Seyed Iman Mirzadeh, Mehrdad Farajtabar, Dilan Gorur, Razvan Pascanu, and Hassan Ghasemzadeh. Linear mode connectivity in multitask and continual learning. arXiv preprint arXiv:2010.04495, 2020

work page arXiv 2010
[54]

Optimiz- ing mode connectivity for class incremental learning

Haitao Wen, Haoyang Cheng, Heqian Qiu, Lanxiao Wang, Lili Pan, and Hongliang Li. Optimiz- ing mode connectivity for class incremental learning. In International Conference on Machine Learning, pages 36940–36957. PMLR, 2023

work page 2023
[55]

Federated learning with matched averaging

Hongyi Wang, Mikhail Yurochkin, Yuekai Sun, Dimitris Papailiopoulos, and Yasaman Khazaeni. Federated learning with matched averaging. arXiv preprint arXiv:2002.06440, 2020. 13 A Neuron duplication introduces lines of critical points A.1 Adding bias to neurons drastically reduces the number of stable plateau saddles In the main text we highlighted specific...

work page arXiv 2002
[56]

20 GP ( s = 0.1) GP ( s = 0.5) GP ( s = 2.0) GP ( s = 10) rosenbrock Figure B9: Fraction of updates parallel to saddle line across all datasets

Use the reparametrization below Equation 3 to obtainc(t), a(t), w(t), ∆(t) and ϵ(t) for given values of a(t) i , a(t) j , w(t) i , w(t) j in step t of the ODE solver. 20 GP ( s = 0.1) GP ( s = 0.5) GP ( s = 2.0) GP ( s = 10) rosenbrock Figure B9: Fraction of updates parallel to saddle line across all datasets. GP ( s = 0.1) GP ( s = 0.5) GP ( s = 2.0) GP ...

work page
[57]

bottom of the channel

Move approximately in the direction of the channel by lowering ϵ(t+1) = ϵ(t)/2, while keeping c(t+1) = c(t), a(t+1) = a(t), wt+1 = w(t), ∆(t+1) = ∆(t). This point may not be at the “bottom of the channel”, because the other parameters also move slightly when lowering ϵ

work page
[58]

bottom of the channel

Compute the corresponding parameters a(t+1) i , a(t+1) j , w(t+1) i , w(t+1) j and contine the ODE solver from this point to move again closer to the “bottom of the channel”. C.2 Expansion of the loss in ϵ We start with the reparameterization in main text Equation 3, aiσ(wi · x) + ajσ(wj · x) = c 2 σ (w + ϵ∆) · x + σ (w − ϵ∆) · x + a 2ϵ σ (w + ϵ∆) · x − σ...

work page
[59]

the function g can be computed analytically (see C.3.1). Inserting Equation 15 into Equation 14 leads to ℓ(θ) =1 2 ⟨f(x; θ∗)2⟩ − r∗ X j=1 rX k=1 a∗ j akg(b∗ j , bk, w∗ j · w∗ j , w∗ j · wk, wk · wk) + 1 2 rX j=1 rX k=1 ajakg(bj, bk, wj · wj, wj · wk, wk · wk) (16) We investigate the properties of the landscape using gradient flow, ˙θ = −∇θℓ(θ), where ℓ(θ)...

work page
[60]

we use 2G(z) − 1 = erf(z/ √ 2), leading to g(µ1, µ2, σ2 1, σ1σ2ρ, σ2

work page
[61]

C.3.2 Minimum at infinity Here, we derive the stability condition for the minimum at the end of a channel

=4 BvN µ1p 1 + σ2 1 , µ2p 1 + σ2 2 ; ρ σ1σ2p 1 + σ2 1 p 1 + σ2 2 − 2G µ1p 1 + σ2 1 − 2G µ2p 1 + σ2 2 + 1 (21) where we used 10,010.8 from [58], R ∞ −∞ dx G′(x)G(a + bx) = G( a√ 1+b2 ). C.3.2 Minimum at infinity Here, we derive the stability condition for the minimum at the end of a channel. To this end, we consider the simplified setting where the input i...

work page
[62]

+ α1(ω1 · x)σ′(ω0 · x/ √

work page
[63]

(33) Note that the error is indeed O(ϵ2) becauseP1 i=0 u3 i1 = 0

+ O(ϵ2). (33) Note that the error is indeed O(ϵ2) becauseP1 i=0 u3 i1 = 0. Connecting this result with the notation in the main text, we see c = √ 2α0, a = α1, w = ω0/ √ 2, and ∆ = ω1. C.5.2 Second Derivative with Three Neurons For the second derivative with a three neuron network we change basis to u0 = 1√ 3 1 1 1 ! , u1 = 1√ 2 1 0 −1 ! , u2 = 1√ 6 1 −2 ...

work page
[64]

(35) Note that it is necessary to have the ω2 · x contribution to be O(ϵ2), otherwise the α2(ω2 · x) term would diverge with 1/ϵ

+ [α1(ω1 · x) + α2(ω2 · x)]σ′(ω0 · x/ √ 3) + 1 12 α2(ω1 · x)2σ′′(ω0 · x/ √ 3). (35) Note that it is necessary to have the ω2 · x contribution to be O(ϵ2), otherwise the α2(ω2 · x) term would diverge with 1/ϵ. Appendix References

work page
[65]

Higher-order additive runge–kutta schemes for ordinary differential equations

Christopher A Kennedy and Mark H Carpenter. Higher-order additive runge–kutta schemes for ordinary differential equations. Applied numerical mathematics, 136:183–205, 2019

work page 2019
[66]

An introduction to numerical analysis

Endre Süli and David F Mayers. An introduction to numerical analysis. Cambridge university press, 2003

work page 2003
[67]

Donald B. Owen. A table of normal integrals. Commun. Stat. Simul. Comput., 9(4):389–419, 1980. 26

work page 1980

[1] Kenji Fukumizu and Shun-ichi Amari. Local minima and plateaus in hierarchical structures of multilayer perceptrons. Neural networks, 13(3):317–327, 2000.

[2] Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates,... 2014.

[3] Anna Choromanska, Mikael Henaff, Michael Mathieu, Gerard Ben Arous, and Yann LeCun. The Loss Surfaces of Multilayer Networks. In Guy Lebanon and S. V. N. Vishwanathan, editors, Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, volume 38 of Proceedings of Machine Learning Research, pages 192–204. PMLR, 2015.

[4] Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.

[5] Kenji Fukumizu, Shoichiro Yamaguchi, Yoh-ichi Mototake, and Mirai Tanaka. Semi-flat minima and saddle points by embedding neural networks to overparameterization. Advances in Neural Information Processing Systems, 32:13868–13876, 2019.

[6] Berfin Şimşek, François Ged, Arthur Jacot, Francesco Spadaro, Clément Hongler, Wulfram Gerstner, and Johanni Brea. Geometry of the loss landscape in overparameterized neural networks: Symmetries and invariances. In International Conference on Machine Learning, pages 9722–9732. PMLR, 2021.

[7] Henning Petzka and Cristian Sminchisescu. Non-attracting regions of local minima in deep and wide neural networks. Journal of Machine Learning Research, 22(143):1–34, 2021.

[8] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to escape saddle points efficiently. In International conference on machine learning, pages 1724–1732. PMLR, 2017.

[9] Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The Implicit Bias of Gradient Descent on Separable Data. Journal of Machine Learning Research, 19(70):1–57, 2018.

[10] Yaoyu Zhang, Zhongwang Zhang, Tao Luo, and Zhiqin J Xu. Embedding principle of loss landscape of deep neural networks. Advances in Neural Information Processing Systems, 34:14848–14859, 2021.

[11] Yaoyu Zhang, Yuqing Li, Zhongwang Zhang, Tao Luo, and Zhi-Qin John Xu. Embedding principle: a hierarchical structure of loss landscape of deep neural networks. arXiv preprint arXiv:2111.15527, 2021.

[12] Lemeng Wu, Dilin Wang, and Qiang Liu. Splitting steepest descent for growing neural architectures. Advances in neural information processing systems, 32, 2019.

[13] Lemeng Wu, Mao Ye, Qi Lei, Jason D Lee, and Qiang Liu. Steepest descent neural architecture optimization: Escaping local optimum with signed neural splitting. arXiv preprint arXiv:2003.10392, 2020.

[14] Eiji Mizutani and Stuart Dreyfus. An analysis on negative curvature induced by singularity in multi-layer neural-network learning. Advances in Neural Information Processing Systems, 23, 2010.

[15] Timothy Poston, C-N Lee, Y Choie, and Yonghoon Kwon. Local minima and back propagation. In IJCNN-91-Seattle International Joint Conference on Neural Networks, volume 2, pages 173–176. IEEE, 1991.

[16] Daniel Soudry and Yair Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016.

[17] Quynh Nguyen and Matthias Hein. The loss surface of deep and wide neural networks. In International conference on machine learning, pages 2603–2612. PMLR, 2017.

[18] Ruoyu Sun, Dawei Li, Shiyu Liang, Tian Ding, and Rayadurgam Srikant. The global landscape of neural networks: An overview. IEEE Signal Processing Magazine, 37(5):95–108, 2020.

[19] Itay Safran and Ohad Shamir. Spurious local minima are common in two-layer relu neural networks. In International Conference on Machine Learning, pages 4433–4441. PMLR, 2018.

[20] Yossi Arjevani and Michael Field. Analytic study of families of spurious minima in two-layer relu neural networks: a tale of symmetry ii. Advances in Neural Information Processing Systems, 34:15162–15174, 2021.

[21] Flavio Martinelli, Berfin Simsek, Wulfram Gerstner, and Johanni Brea. Expand-and-cluster: Parameter recovery of neural networks. In Forty-first International Conference on Machine Learning, 2024.

[22] Berfin Şimşek, Amire Bendjeddou, and Daniel Hsu. Learning gaussian multi-index models with gradient flow: Time complexity and directional convergence. arXiv preprint arXiv:2411.08798, 2024.

[23] Itay M Safran, Gilad Yehudai, and Ohad Shamir. The effects of mild over-parameterization on the optimization landscape of shallow relu neural networks. In Conference on Learning Theory, pages 3889–3934. PMLR, 2021.

[24] Stefano Sarao Mannelli, Giulio Biroli, Chiara Cammarota, Florent Krzakala, and Lenka Zdeborová. Who is afraid of big bad minima? analysis of gradient-flow in spiked matrix-tensor models. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Asso... 2019.

[25] Radford M. Neal. Bayesian Learning for Neural Networks. Springer New York, 1996.

[26] Christopher Williams. Computing with infinite networks. In M.C. Mozer, M. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems, volume 9. MIT Press, 1996.

[27] Jaehoon Lee, Jascha Sohl-Dickstein, Jeffrey Pennington, Roman Novak, Sam Schoenholz, and Yasaman Bahri. Deep neural networks as gaussian processes. In International Conference on Learning Representations, 2018.

[28] Alexander G d G Matthews, Jiri Hron, Mark Rowland, Richard E Turner, and Zoubin Ghahramani. Gaussian process behaviour in wide deep neural networks. In International Conference on Learning Representations, 2018.

[29] Yaim Cooper. The loss landscape of overparameterized neural networks. arXiv preprint arXiv:1804.10200, 2018.

[30] Chaoyue Liu, Libin Zhu, and Mikhail Belkin. Loss landscapes and optimization in over-parameterized non-linear systems and neural networks. Applied and Computational Harmonic Analysis, 59:85–116, 2022.

[31] Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry P Vetrov, and Andrew G Wilson. Loss surfaces, mode connectivity, and fast ensembling of dnns. Advances in neural information processing systems, 31, 2018.

[32] Jonathan Frankle, Gintare Karolina Dziugaite, Daniel Roy, and Michael Carbin. Linear mode connectivity and the lottery ticket hypothesis. In International Conference on Machine Learning, pages 3259–3269. PMLR, 2020.

[33] Sidak Pal Singh and Martin Jaggi. Model fusion via optimal transport. Advances in Neural Information Processing Systems, 33:22045–22055, 2020.

[34] Samuel K Ainsworth, Jonathan Hayase, and Siddhartha Srinivasa. Git re-basin: Merging models modulo permutation symmetries. arXiv preprint arXiv:2209.04836, 2022.

[35] Zhanran Lin, Puheng Li, and Lei Wu. Exploring neural network landscapes: Star-shaped and geodesic connectivity. arXiv preprint arXiv:2404.06391, 2024.

[36] Ankit Sonthalia, Alexander Rubinstein, Ehsan Abbasnejad, and Seong Joon Oh. Do deep neural network solutions form a star domain? arXiv preprint arXiv:2403.07968, 2024.

[37] Stanislav Fort and Stanislaw Jastrzebski. Large scale structure of neural network loss landscapes. Advances in Neural Information Processing Systems, 32, 2019.

[38] Cédric Josz and Xiaopeng Li. Certifying the absence of spurious local minima at infinity. SIAM Journal on Optimization, 33(3):1416–1439, 2023.

[39] Shiyu Liang, Ruoyu Sun, Jason D Lee, and Rayadurgam Srikant. Adding one neuron can eliminate all bad local minima. Advances in Neural Information Processing Systems, 31, 2018.

[40] Kenji Kawaguchi and Leslie Kaelbling. Elimination of all bad local minima in deep learning. In International Conference on Artificial Intelligence and Statistics, pages 853–863. PMLR, 2020.

[41] Shiyu Liang, Ruoyu Sun, and R Srikant. Revisiting landscape analysis in deep neural networks: Eliminating decreasing paths to infinity. SIAM Journal on Optimization, 32(4):2797–2827, 2022.

[42] J. von Neumann and E. Wigner. Über merkwürdige diskrete Eigenwerte. Über das Verhalten von Eigenwerten bei adiabatischen Prozessen. Physikalische Zeitschrift, 30:467–470, January 1929.

[43] Johanni Brea, Flavio Martinelli, Berfin Şimşek, and Wulfram Gerstner. MLPGradientFlow: Going with the flow of multilayer perceptrons (and finding minima fast and accurately), January 2023.

[44] Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 933–941. PMLR, 06–11 Aug 2017.

[45] D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. ArXiv e-prints, December 2014.

[46] Guy Blanc, Neha Gupta, Gregory Valiant, and Paul Valiant. Implicit regularization for deep neural networks driven by an ornstein-uhlenbeck like process. In Jacob Abernethy and Shivani Agarwal, editors, Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pages 483–513. PMLR, 09–12 Jul 2020.

[47] Zhiyuan Li, Tianhao Wang, and Sanjeev Arora. What happens after SGD reaches zero loss? –a mathematical framework. In International Conference on Learning Representations, 2022.

[48] Aviv Ratzon, Dori Derdikman, and Omri Barak. Representational drift as a result of implicit regularization. April 2024.

[49] Jeremy M Cohen, Simran Kaur, Yuanzhi Li, J Zico Kolter, and Ameet Talwalkar. Gradient descent on neural networks typically occurs at the edge of stability. arXiv preprint arXiv:2103.00065, 2021.

[50] Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. arXiv preprint arXiv:2010.01412, 2020.

[51] Pu Zhao, Pin-Yu Chen, Payel Das, Karthikeyan Natesan Ramamurthy, and Xue Lin. Bridging mode connectivity in loss landscapes and adversarial robustness. arXiv preprint arXiv:2005.00060, 2020.

[52] Ren Wang, Yuxuan Li, and Sijia Liu. Exploring diversified adversarial robustness in neural networks via robust mode connectivity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2346–2352, 2023.

[53] Seyed Iman Mirzadeh, Mehrdad Farajtabar, Dilan Gorur, Razvan Pascanu, and Hassan Ghasemzadeh. Linear mode connectivity in multitask and continual learning. arXiv preprint arXiv:2010.04495, 2020.

[54] Haitao Wen, Haoyang Cheng, Heqian Qiu, Lanxiao Wang, Lili Pan, and Hongliang Li. Optimizing mode connectivity for class incremental learning. In International Conference on Machine Learning, pages 36940–36957. PMLR, 2023.

[55] Hongyi Wang, Mikhail Yurochkin, Yuekai Sun, Dimitris Papailiopoulos, and Yasaman Khazaeni. Federated learning with matched averaging. arXiv preprint arXiv:2002.06440, 2020.

A Neuron duplication introduces lines of critical points

A.1 Adding bias to neurons drastically reduces the number of stable plateau saddles

In the main text we highlighted specific...

Use the reparametrization below Equation 3 to obtain $c^{(t)}$, $a^{(t)}$, $w^{(t)}$, $\Delta^{(t)}$ and $\epsilon^{(t)}$ for given values of $a_i^{(t)}$, $a_j^{(t)}$, $w_i^{(t)}$, $w_j^{(t)}$ in step $t$ of the ODE solver.

[Figure B9: Fraction of updates parallel to saddle line across all datasets. Panels: GP (s = 0.1), GP (s = 0.5), GP (s = 2.0), GP (s = 10), rosenbrock.]

Move approximately in the direction of the channel by lowering $\epsilon^{(t+1)} = \epsilon^{(t)}/2$, while keeping $c^{(t+1)} = c^{(t)}$, $a^{(t+1)} = a^{(t)}$, $w^{(t+1)} = w^{(t)}$, $\Delta^{(t+1)} = \Delta^{(t)}$. This point may not be at the “bottom of the channel”, because the other parameters also move slightly when lowering $\epsilon$.

Compute the corresponding parameters $a_i^{(t+1)}$, $a_j^{(t+1)}$, $w_i^{(t+1)}$, $w_j^{(t+1)}$ and continue the ODE solver from this point to move again closer to the “bottom of the channel”.

C.2 Expansion of the loss in ϵ

We start with the reparameterization in main text Equation 3,

$a_i\sigma(w_i\cdot x) + a_j\sigma(w_j\cdot x) = \frac{c}{2}\big[\sigma((w+\epsilon\Delta)\cdot x) + \sigma((w-\epsilon\Delta)\cdot x)\big] + \frac{a}{2\epsilon}\big[\sigma((w+\epsilon\Delta)\cdot x) - \sigma$...
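The reparameterization and the ϵ-halving step described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the coordinate maps are read off from the two bracketed terms of Equation 3 as quoted here, the normalization of ∆ to unit length is an assumption, and the helper names (`to_channel_coords`, `from_channel_coords`, `jump_along_channel`) are hypothetical and not taken from the paper's code. The ODE-solver relaxation that follows each jump is not shown.

```python
import numpy as np

def to_channel_coords(a_i, a_j, w_i, w_j):
    """Map (a_i, a_j, w_i, w_j) to (c, a, w, delta, eps).

    Assumption: delta has unit norm, so eps = ||w_i - w_j|| / 2.
    """
    half_diff = 0.5 * (w_i - w_j)
    eps = np.linalg.norm(half_diff)
    delta = half_diff / eps
    w = 0.5 * (w_i + w_j)
    c = a_i + a_j            # coefficient of the symmetric bracket times 2
    a = eps * (a_i - a_j)    # coefficient of the antisymmetric bracket times 2*eps
    return c, a, w, delta, eps

def from_channel_coords(c, a, w, delta, eps):
    """Inverse map: a_i = c/2 + a/(2 eps), a_j = c/2 - a/(2 eps), w_{i,j} = w +/- eps*delta."""
    a_i = 0.5 * c + a / (2.0 * eps)
    a_j = 0.5 * c - a / (2.0 * eps)
    return a_i, a_j, w + eps * delta, w - eps * delta

def jump_along_channel(a_i, a_j, w_i, w_j):
    """Halve eps while keeping c, a, w and delta fixed, then map back."""
    c, a, w, delta, eps = to_channel_coords(a_i, a_j, w_i, w_j)
    return from_channel_coords(c, a, w, delta, eps / 2.0)

# Illustrative values only.
rng = np.random.default_rng(0)
w_i, w_j = rng.normal(size=3), rng.normal(size=3)
a_i, a_j = 2.0, -1.5

roundtrip = from_channel_coords(*to_channel_coords(a_i, a_j, w_i, w_j))
n_ai, n_aj, n_wi, n_wj = jump_along_channel(a_i, a_j, w_i, w_j)
```

After one jump the sum of the output weights is unchanged, their difference doubles, and the input weight vectors are half as far apart, which is the qualitative behaviour expected along a channel.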

the function $g$ can be computed analytically (see C.3.1). Inserting Equation 15 into Equation 14 leads to

$$\ell(\theta) = \frac{1}{2}\langle f(x;\theta^*)^2\rangle - \sum_{j=1}^{r^*}\sum_{k=1}^{r} a_j^* a_k\, g(b_j^*, b_k, w_j^*\cdot w_j^*, w_j^*\cdot w_k, w_k\cdot w_k) + \frac{1}{2}\sum_{j=1}^{r}\sum_{k=1}^{r} a_j a_k\, g(b_j, b_k, w_j\cdot w_j, w_j\cdot w_k, w_k\cdot w_k) \tag{16}$$

We investigate the properties of the landscape using gradient flow, $\dot\theta = -\nabla_\theta \ell(\theta)$, where $\ell(\theta)$...

we use $2G(z) - 1 = \mathrm{erf}(z/\sqrt{2})$, leading to $g(\mu_1, \mu_2, \sigma_1^2, \sigma_1\sigma_2\rho, \sigma_2$

$$= 4\,\mathrm{BvN}\!\left(\frac{\mu_1}{\sqrt{1+\sigma_1^2}}, \frac{\mu_2}{\sqrt{1+\sigma_2^2}}; \frac{\rho\,\sigma_1\sigma_2}{\sqrt{1+\sigma_1^2}\sqrt{1+\sigma_2^2}}\right) - 2G\!\left(\frac{\mu_1}{\sqrt{1+\sigma_1^2}}\right) - 2G\!\left(\frac{\mu_2}{\sqrt{1+\sigma_2^2}}\right) + 1 \tag{21}$$

where we used 10,010.8 from [58], $\int_{-\infty}^{\infty} dx\, G'(x)\, G(a + bx) = G\!\left(\frac{a}{\sqrt{1+b^2}}\right)$.

C.3.2 Minimum at infinity

Here, we derive the stability condition for the minimum at the end of a channel. To this end, we consider the simplified setting where the input i...
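The Gaussian integral identity quoted for Equation 21 can be checked numerically. The sketch below is not part of the paper: it assumes that $G$ is the standard normal CDF (consistent with $2G(z) - 1 = \mathrm{erf}(z/\sqrt{2})$), uses a plain trapezoidal rule on a truncated grid, and the function names and the test values of $a$ and $b$ are chosen here purely for illustration.

```python
import math
import numpy as np

def G(z):
    """Standard normal CDF, written via 2 G(z) - 1 = erf(z / sqrt(2))."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def gaussian_average_of_G(a, b, half_width=10.0, n=20001):
    """Trapezoidal estimate of the integral of G'(x) G(a + b x) over the real line.

    The grid is truncated at +/- half_width, where the Gaussian density G'(x)
    is negligibly small.
    """
    x = np.linspace(-half_width, half_width, n)
    density = np.exp(-0.5 * x**2) / math.sqrt(2.0 * math.pi)  # G'(x)
    values = density * np.array([G(a + b * xi) for xi in x])
    dx = x[1] - x[0]
    return float(np.sum(0.5 * (values[1:] + values[:-1])) * dx)

def closed_form(a, b):
    """Right-hand side of the identity: G(a / sqrt(1 + b^2))."""
    return G(a / math.sqrt(1.0 + b * b))

# Illustrative parameter pairs only.
checks = [(0.3, 1.2), (-1.0, 0.5), (0.0, 2.0)]
gaps = [abs(gaussian_average_of_G(a, b) - closed_form(a, b)) for a, b in checks]
```

For each pair the numerical integral and the closed form should agree to well within the quadrature error.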

+ α1(ω1 · x)σ′(ω0 · x/ √

$+\ O(\epsilon^2)$. (33)

Note that the error is indeed $O(\epsilon^2)$ because $\sum_{i=0}^{1} u_{i1}^3 = 0$. Connecting this result with the notation in the main text, we see $c = \sqrt{2}\alpha_0$, $a = \alpha_1$, $w = \omega_0/\sqrt{2}$, and $\Delta = \omega_1$.

C.5.2 Second Derivative with Three Neurons

For the second derivative with a three neuron network we change basis to $u_0 = \frac{1}{\sqrt{3}}(1, 1, 1)^\top$, $u_1 = \frac{1}{\sqrt{2}}(1, 0, -1)^\top$, $u_2 = \frac{1}{\sqrt{6}}(1, -2,$ ...
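The $O(\epsilon^2)$ error of the two-neuron expansion can be probed numerically in the main-text notation $(c, a, w, \Delta, \epsilon)$. The following sketch is an illustration, not the authors' code: it assumes $\sigma = \tanh$, random inputs and parameter values chosen here, and compares the neuron pair from main text Equation 3 with its leading-order limit $c\,\sigma(w\cdot x) + a\,(\Delta\cdot x)\,\sigma'(w\cdot x)$. If the error is quadratic in $\epsilon$, halving $\epsilon$ should reduce it by roughly a factor of four.

```python
import numpy as np

def dtanh(z):
    """Derivative of tanh."""
    return 1.0 - np.tanh(z) ** 2

def two_neuron(x, c, a, w, delta, eps):
    """Neuron pair in channel coordinates (assumed activation: tanh)."""
    z_plus = x @ (w + eps * delta)
    z_minus = x @ (w - eps * delta)
    symmetric = 0.5 * c * (np.tanh(z_plus) + np.tanh(z_minus))
    antisymmetric = a / (2.0 * eps) * (np.tanh(z_plus) - np.tanh(z_minus))
    return symmetric + antisymmetric

def gated_linear_unit(x, c, a, w, delta):
    """Leading-order limit for eps -> 0: c*sigma(w.x) + a*(delta.x)*sigma'(w.x)."""
    z = x @ w
    return c * np.tanh(z) + a * (x @ delta) * dtanh(z)

# Illustrative inputs and parameters only.
rng = np.random.default_rng(1)
x = rng.normal(size=(256, 4))
w = rng.normal(size=4)
delta = rng.normal(size=4)
delta /= np.linalg.norm(delta)
c, a = 0.7, -1.3

def max_error(eps):
    """Largest absolute gap between the pair and its limit over the sample."""
    gap = two_neuron(x, c, a, w, delta, eps) - gated_linear_unit(x, c, a, w, delta)
    return float(np.max(np.abs(gap)))

err_coarse = max_error(1e-2)
err_fine = max_error(5e-3)
ratio = err_coarse / err_fine  # close to 4 if the error scales as eps^2
```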

$$+\ [\alpha_1(\omega_1\cdot x) + \alpha_2(\omega_2\cdot x)]\,\sigma'(\omega_0\cdot x/\sqrt{3}) + \frac{1}{12}\alpha_2(\omega_1\cdot x)^2\,\sigma''(\omega_0\cdot x/\sqrt{3}). \tag{35}$$

Note that it is necessary to have the $\omega_2\cdot x$ contribution to be $O(\epsilon^2)$, otherwise the $\alpha_2(\omega_2\cdot x)$ term would diverge with $1/\epsilon$.

Appendix References

[65] Christopher A Kennedy and Mark H Carpenter. Higher-order additive runge–kutta schemes for ordinary differential equations. Applied numerical mathematics, 136:183–205, 2019.

[66] Endre Süli and David F Mayers. An introduction to numerical analysis. Cambridge university press, 2003.

[67] Donald B. Owen. A table of normal integrals. Commun. Stat. Simul. Comput., 9(4):389–419, 1980.