Flat Channels to Infinity in Neural Loss Landscapes
Pith reviewed 2026-05-19 08:51 UTC · model grok-4.3
The pith
Neural network loss landscapes contain flat channels leading to infinity where neuron pairs form gated linear units.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We identify and characterize channels along which the loss decreases extremely slowly, while the output weights of at least two neurons, a_i and a_j, diverge to ±∞, and their input weight vectors, w_i and w_j, become equal to each other. At convergence, the two neurons implement a gated linear unit: a_i σ(w_i · x) + a_j σ(w_j · x) approaches σ(w · x) + (v · x) σ'(w · x). Geometrically, these channels to infinity are asymptotically parallel to symmetry-induced lines of critical points. Gradient flow solvers reach the channels with high probability in diverse regression settings.
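The limit in the core claim can be checked numerically. The sketch below is illustrative only: it assumes tanh as the activation, and it borrows the coordinates c, a, w, ∆ and ϵ from the appendix reparameterization quoted further down this page (w_i = w + ϵ∆, w_j = w − ϵ∆, a_i = c/2 + a/(2ϵ), a_j = c/2 − a/(2ϵ)), with v = a∆. All numeric values are made up for the demonstration.

```python
import math

def sigma(z):
    # Activation chosen for illustration (assumption): tanh.
    return math.tanh(z)

def dsigma(z):
    # Derivative of tanh.
    return 1.0 - math.tanh(z) ** 2

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def two_neuron_sum(x, c, a, w, delta, eps):
    # Assumed channel coordinates: input weights differ by 2*eps*delta,
    # output weights differ by a/eps and therefore diverge as eps -> 0.
    w_i = [wk + eps * dk for wk, dk in zip(w, delta)]
    w_j = [wk - eps * dk for wk, dk in zip(w, delta)]
    a_i = c / 2 + a / (2 * eps)
    a_j = c / 2 - a / (2 * eps)
    value = a_i * sigma(dot(w_i, x)) + a_j * sigma(dot(w_j, x))
    return value, (a_i, a_j)

def gated_linear_unit(x, c, a, w, delta):
    # Candidate limit: c*sigma(w.x) + (v.x)*sigma'(w.x) with v = a*delta.
    return c * sigma(dot(w, x)) + a * dot(delta, x) * dsigma(dot(w, x))

# Hypothetical input and parameters.
x = [0.3, -1.2]
c, a = 1.0, 0.7
w, delta = [0.5, 0.4], [1.0, -0.5]

target = gated_linear_unit(x, c, a, w, delta)
errors = []
for eps in (1e-1, 1e-2, 1e-3):
    value, (a_i, a_j) = two_neuron_sum(x, c, a, w, delta, eps)
    errors.append(abs(value - target))
    print(f"eps={eps:g}  a_i={a_i:.1f}  a_j={a_j:.1f}  |error|={errors[-1]:.2e}")
```

As ϵ shrinks, the printed output weights grow with opposite signs while the gap to the gated linear unit closes, which is the behavior the claim describes.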
What carries the argument
Flat channels to infinity in parameter space, asymptotically parallel to symmetry-induced lines of critical points, that end in a pair of neurons implementing a gated linear unit.
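One way to picture a symmetry-induced line is neuron duplication, which the paper's appendix heading ("Neuron duplication introduces lines of critical points") points to. The sketch below is an assumption-laden illustration, not the paper's construction: it splits one tanh neuron into two copies with identical input weights and output weights γa and (1 − γ)a, and checks that the network output does not depend on γ. The helper names and all numbers are hypothetical.

```python
import math

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def network(x, hidden):
    # hidden is a list of (output weight, input weight vector) pairs.
    return sum(a * math.tanh(dot(w, x)) for a, w in hidden)

def duplicate_neuron(hidden, index, gamma):
    # Replace one neuron by two copies that share its input weights and
    # split its output weight into gamma*a and (1 - gamma)*a.
    a, w = hidden[index]
    rest = hidden[:index] + hidden[index + 1:]
    return rest + [(gamma * a, list(w)), ((1 - gamma) * a, list(w))]

# Hypothetical two-neuron network and input.
base = [(0.8, [0.5, -0.2]), (-1.1, [1.0, 0.3])]
x = [0.4, -0.7]

reference = network(x, base)
gammas = (-50.0, 0.0, 0.5, 2.0, 1e3)
outputs = [network(x, duplicate_neuron(base, 0, g)) for g in gammas]
max_gap = max(abs(o - reference) for o in outputs)
print(max_gap)
```

Because the output is the same for every γ, the loss is constant along this line. For γ far outside [0, 1], the two output weights are large and of opposite sign, which resembles the diverging a_i and a_j in the channels.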
If this is right
- Gradient flow solvers, and related optimizers such as SGD or ADAM, reach the channels with high probability in diverse regression settings.
- Without careful inspection, the channels appear as flat local minima with finite parameter values.
- The channels supply a comprehensive picture of quasi-flat regions in terms of gradient dynamics, geometry, and functional form.
- The emergence of gated linear units at the end of the channels points to a computational capability of fully connected layers.
Where Pith is reading between the lines
- Training trajectories that appear to have converged may in fact still be moving slowly along these channels for many more steps.
- Detecting such channels could guide regularization methods that penalize large weight divergence.
- The same geometric mechanism may operate in deeper networks or other activation functions.
Load-bearing premise
The analysis assumes that gradient flow, SGD, and ADAM reach these channels with high probability in diverse regression settings, and that the channels are asymptotically parallel to symmetry-induced lines of critical points.
What would settle it
Train a two-layer network on a regression task with gradient descent and check whether pairs of neurons show output weights diverging to plus and minus infinity, input weights aligning, and loss continuing to decrease slowly over long times.
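A diagnostic for this check could look like the sketch below. It is a hypothetical helper, not something from the paper: given the output weights and input weight vectors of a trained hidden layer, it flags neuron pairs whose output weights are large with opposite signs and whose input weights are nearly equal. The thresholds `max_rel_distance` and `min_magnitude` are arbitrary assumptions and would need tuning per task.

```python
import math

def norm(v):
    return math.sqrt(sum(p * p for p in v))

def find_channel_candidates(output_weights, input_weights,
                            max_rel_distance=0.01, min_magnitude=10.0):
    # Return index pairs (i, j) that match the channel signature:
    # opposite-signed large output weights and nearly equal input weights.
    candidates = []
    n = len(output_weights)
    for i in range(n):
        for j in range(i + 1, n):
            a_i, a_j = output_weights[i], output_weights[j]
            w_i, w_j = input_weights[i], input_weights[j]
            gap = norm([p - q for p, q in zip(w_i, w_j)])
            scale = max(norm(w_i), norm(w_j), 1e-12)
            opposite_signs = a_i * a_j < 0
            both_large = min(abs(a_i), abs(a_j)) >= min_magnitude
            nearly_equal = gap / scale <= max_rel_distance
            if opposite_signs and both_large and nearly_equal:
                candidates.append((i, j))
    return candidates

# Made-up parameters: neurons 0 and 1 look like a channel pair, neuron 2 does not.
a = [350.5, -349.5, 0.9]
W = [[0.501, 0.3995], [0.499, 0.4005], [-1.0, 0.2]]
pairs = find_channel_candidates(a, W)
print(pairs)
```

Since the claim concerns slow drift rather than a fixed configuration, such a check is more informative when run on several checkpoints, looking for output weights that keep growing while the loss keeps decreasing slowly.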
Original abstract
The loss landscapes of neural networks contain minima and saddle points that may be connected in flat regions or appear in isolation. We identify and characterize a special structure in the loss landscape: channels along which the loss decreases extremely slowly, while the output weights of at least two neurons, $a_i$ and $a_j$, diverge to $\pm$infinity, and their input weight vectors, $\mathbf{w_i}$ and $\mathbf{w_j}$, become equal to each other. At convergence, the two neurons implement a gated linear unit: $a_i\sigma(\mathbf{w_i} \cdot \mathbf{x}) + a_j\sigma(\mathbf{w_j} \cdot \mathbf{x}) \rightarrow \sigma(\mathbf{w} \cdot \mathbf{x}) + (\mathbf{v} \cdot \mathbf{x}) \sigma'(\mathbf{w} \cdot \mathbf{x})$. Geometrically, these channels to infinity are asymptotically parallel to symmetry-induced lines of critical points. Gradient flow solvers, and related optimization methods like SGD or ADAM, reach the channels with high probability in diverse regression settings, but without careful inspection they look like flat local minima with finite parameter values. Our characterization provides a comprehensive picture of these quasi-flat regions in terms of gradient dynamics, geometry, and functional interpretation. The emergence of gated linear units at the end of the channels highlights a surprising aspect of the computational capabilities of fully connected layers.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper identifies and characterizes 'flat channels to infinity' in neural network loss landscapes. Along these channels the loss decreases extremely slowly while output weights a_i and a_j diverge to ±∞ and input weights w_i, w_j align; the pair asymptotically implements a gated linear unit of the form σ(w·x) + (v·x)σ'(w·x). The channels are asymptotically parallel to symmetry-induced lines of critical points. The authors assert that gradient flow, SGD and ADAM reach these channels with high probability in diverse regression settings, making them appear as flat local minima with finite parameters. The work supplies a unified picture in terms of gradient dynamics, geometry and functional interpretation.
Significance. If the reachability claim is substantiated, the result is significant: it supplies a concrete dynamical and geometric mechanism for the flat regions routinely observed in neural loss landscapes and directly links them to the emergence of gated-linear-unit-like computations inside fully connected layers. The explicit functional limit and the asymptotic parallelism to symmetry lines are strengths that go beyond standard critical-point catalogs and could inform implicit-regularization analyses. The paper earns credit for attempting a comprehensive characterization that integrates dynamics, geometry and expressivity.
major comments (2)
- [Abstract] Abstract: the assertion that 'Gradient flow solvers, and related optimization methods like SGD or ADAM, reach the channels with high probability in diverse regression settings' is load-bearing for the claim that these structures explain observed training behavior. No basin-volume estimate, Lyapunov analysis near the channel, or controls ruling out other attractors are supplied; without such support the channels may exist mathematically yet remain irrelevant to typical trajectories.
- [Geometry / symmetry analysis] The geometric claim that the channels are 'asymptotically parallel to symmetry-induced lines of critical points' is central to the characterization. The manuscript should supply the explicit symmetry group, the associated conserved quantities or Hessian null directions, and the precise sense in which the flow approaches these lines (e.g., a differential equation for the transverse coordinates).
minor comments (1)
- [Abstract / notation] The limiting expression for the gated linear unit in the abstract would benefit from an explicit parametrization (e.g., a scaling parameter t → ∞ along the channel) that makes the divergence rates of a_i, a_j and the alignment of w_i, w_j mathematically precise.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and for recognizing the potential significance of our findings. We provide point-by-point responses to the major comments below and describe the revisions we intend to implement.
-
Referee: [Abstract] Abstract: the assertion that 'Gradient flow solvers, and related optimization methods like SGD or ADAM, reach the channels with high probability in diverse regression settings' is load-bearing for the claim that these structures explain observed training behavior. No basin-volume estimate, Lyapunov analysis near the channel, or controls ruling out other attractors are supplied; without such support the channels may exist mathematically yet remain irrelevant to typical trajectories.
Authors: We agree that the reachability of the channels is a key aspect of our claims and that our current support is primarily empirical. The manuscript presents numerical evidence from multiple regression tasks showing that gradient-based optimizers consistently converge to these channels. To address the referee's concern, we will revise the abstract and the relevant sections to temper the language, emphasizing that the claim is based on observed behavior in the studied settings rather than a proven high-probability result for all cases. We will also add further experimental controls and discussion of potential other attractors. A full basin-volume analysis or Lyapunov study is not included and would constitute a substantial extension of the work.
Revision: partial
-
Referee: [Geometry / symmetry analysis] The geometric claim that the channels are 'asymptotically parallel to symmetry-induced lines of critical points' is central to the characterization. The manuscript should supply the explicit symmetry group, the associated conserved quantities or Hessian null directions, and the precise sense in which the flow approaches these lines (e.g., a differential equation for the transverse coordinates).
Authors: We appreciate this suggestion for greater precision in the geometric analysis. The symmetry in question is the permutation symmetry among identical neurons in the hidden layer. We will explicitly identify the symmetry group as the symmetric group S_n acting by permuting the neuron indices. The associated conserved quantities include the loss invariance under such permutations, and the Hessian has null directions corresponding to these infinitesimal symmetries. In the revised version, we will add a subsection detailing these elements and describe the approach to the symmetry lines via the transverse dynamics, including a reduced differential equation for the deviation from the line. This will clarify the asymptotic parallelism.
Revision: yes
Circularity Check
No significant circularity; derivation self-contained against external loss-landscape geometry
The paper derives the channel geometry and the asymptotic gated-linear-unit form directly from the loss Hessian and symmetry-induced critical lines (visible in the abstract's functional limit and geometric parallelism statement). No equation reduces a claimed prediction to a fitted parameter by construction, no self-citation is invoked as a uniqueness theorem, and the reachability statement is framed as an empirical observation under gradient flow rather than a self-referential fit. The central characterization therefore rests on independent geometric analysis rather than circular re-labeling of inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: gradient flow and related first-order methods govern the trajectories that reach the described channels in regression settings.
Reference graph
Works this paper leans on
-
[1]
Local minima and plateaus in hierarchical structures of multilayer perceptrons
Kenji Fukumizu and Shun-ichi Amari. Local minima and plateaus in hierarchical structures of multilayer perceptrons. Neural networks, 13(3):317–327, 2000
-
[2]
Identifying and attacking the saddle point problem in high-dimensional non-convex optimization
Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates,...
-
[3]
The Loss Surfaces of Multilayer Networks
Anna Choromanska, Mikael Henaff, Michael Mathieu, Gerard Ben Arous, and Yann LeCun. The Loss Surfaces of Multilayer Networks. In Guy Lebanon and S. V. N. Vishwanathan, editors, Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, volume 38 of Proceedings of Machine Learning Research, pages 192–204. PMLR, 2015
-
[4]
Visualizing the loss landscape of neural nets
Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018
-
[5]
Semi-flat minima and saddle points by embedding neural networks to overparameterization
Kenji Fukumizu, Shoichiro Yamaguchi, Yoh-ichi Mototake, and Mirai Tanaka. Semi-flat minima and saddle points by embedding neural networks to overparameterization. Advances in Neural Information Processing Systems, 32:13868–13876, 2019
-
[6]
Geometry of the loss landscape in overparameterized neural networks: Symmetries and invariances
Berfin Şimşek, François Ged, Arthur Jacot, Francesco Spadaro, Clément Hongler, Wulfram Gerstner, and Johanni Brea. Geometry of the loss landscape in overparameterized neural networks: Symmetries and invariances. In International Conference on Machine Learning, pages 9722–9732. PMLR, 2021
-
[7]
Non-attracting regions of local minima in deep and wide neural networks
Henning Petzka and Cristian Sminchisescu. Non-attracting regions of local minima in deep and wide neural networks. Journal of Machine Learning Research, 22(143):1–34, 2021
-
[8]
How to escape saddle points efficiently
Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to escape saddle points efficiently. In International conference on machine learning, pages 1724–1732. PMLR, 2017
-
[9]
The Implicit Bias of Gradient Descent on Separable Data
Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The Implicit Bias of Gradient Descent on Separable Data. Journal of Machine Learning Research, 19(70):1–57, 2018
-
[10]
Embedding principle of loss landscape of deep neural networks
Yaoyu Zhang, Zhongwang Zhang, Tao Luo, and Zhiqin J Xu. Embedding principle of loss landscape of deep neural networks. Advances in Neural Information Processing Systems, 34:14848–14859, 2021
-
[11]
Embedding principle: a hierarchical structure of loss landscape of deep neural networks
Yaoyu Zhang, Yuqing Li, Zhongwang Zhang, Tao Luo, and Zhi-Qin John Xu. Embedding principle: a hierarchical structure of loss landscape of deep neural networks. arXiv preprint arXiv:2111.15527, 2021
-
[12]
Splitting steepest descent for growing neural architectures
Lemeng Wu, Dilin Wang, and Qiang Liu. Splitting steepest descent for growing neural architectures. Advances in neural information processing systems, 32, 2019
-
[13]
Lemeng Wu, Mao Ye, Qi Lei, Jason D Lee, and Qiang Liu. Steepest descent neural architecture optimization: Escaping local optimum with signed neural splitting. arXiv preprint arXiv:2003.10392, 2020
-
[14]
An analysis on negative curvature induced by singularity in multi-layer neural-network learning
Eiji Mizutani and Stuart Dreyfus. An analysis on negative curvature induced by singularity in multi-layer neural-network learning. Advances in Neural Information Processing Systems, 23, 2010
-
[15]
Local minima and back propagation
Timothy Poston, C-N Lee, Y Choie, and Yonghoon Kwon. Local minima and back propagation. In IJCNN-91-Seattle International Joint Conference on Neural Networks, volume 2, pages 173–176. IEEE, 1991
-
[16]
No bad local minima: Data independent training error guarantees for multilayer neural networks
Daniel Soudry and Yair Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016
-
[17]
The loss surface of deep and wide neural networks
Quynh Nguyen and Matthias Hein. The loss surface of deep and wide neural networks. In International conference on machine learning, pages 2603–2612. PMLR, 2017
-
[18]
The global landscape of neural networks: An overview
Ruoyu Sun, Dawei Li, Shiyu Liang, Tian Ding, and Rayadurgam Srikant. The global landscape of neural networks: An overview. IEEE Signal Processing Magazine, 37(5):95–108, 2020
-
[19]
Spurious local minima are common in two-layer relu neural networks
Itay Safran and Ohad Shamir. Spurious local minima are common in two-layer relu neural networks. In International Conference on Machine Learning, pages 4433–4441. PMLR, 2018
-
[20]
Yossi Arjevani and Michael Field. Analytic study of families of spurious minima in two-layer relu neural networks: a tale of symmetry ii. Advances in Neural Information Processing Systems, 34:15162–15174, 2021
-
[21]
Expand-and-cluster: Parameter recovery of neural networks
Flavio Martinelli, Berfin Simsek, Wulfram Gerstner, and Johanni Brea. Expand-and-cluster: Parameter recovery of neural networks. In Forty-first International Conference on Machine Learning, 2024
-
[22]
Learning gaussian multi-index models with gradient flow: Time complexity and directional convergence
Berfin Şimşek, Amire Bendjeddou, and Daniel Hsu. Learning gaussian multi-index models with gradient flow: Time complexity and directional convergence. arXiv preprint arXiv:2411.08798, 2024
-
[23]
Itay M Safran, Gilad Yehudai, and Ohad Shamir. The effects of mild over-parameterization on the optimization landscape of shallow relu neural networks. In Conference on Learning Theory, pages 3889–3934. PMLR, 2021
-
[24]
Who is afraid of big bad minima? analysis of gradient-flow in spiked matrix-tensor models
Stefano Sarao Mannelli, Giulio Biroli, Chiara Cammarota, Florent Krzakala, and Lenka Zdeborová. Who is afraid of big bad minima? analysis of gradient-flow in spiked matrix-tensor models. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Asso...
-
[25]
Radford M. Neal. Bayesian Learning for Neural Networks. Springer New York, 1996
-
[26]
Computing with infinite networks
Christopher Williams. Computing with infinite networks. In M.C. Mozer, M. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems, volume 9. MIT Press, 1996
-
[27]
Deep neural networks as gaussian processes
Jaehoon Lee, Jascha Sohl-Dickstein, Jeffrey Pennington, Roman Novak, Sam Schoenholz, and Yasaman Bahri. Deep neural networks as gaussian processes. In International Conference on Learning Representations, 2018
-
[28]
Gaussian process behaviour in wide deep neural networks
Alexander G d G Matthews, Jiri Hron, Mark Rowland, Richard E Turner, and Zoubin Ghahramani. Gaussian process behaviour in wide deep neural networks. In International Conference on Learning Representations, 2018
-
[29]
The loss landscape of overparameterized neural networks
Yaim Cooper. The loss landscape of overparameterized neural networks. arXiv preprint arXiv:1804.10200, 2018
-
[30]
Loss landscapes and optimization in over-parameterized non-linear systems and neural networks
Chaoyue Liu, Libin Zhu, and Mikhail Belkin. Loss landscapes and optimization in over-parameterized non-linear systems and neural networks. Applied and Computational Harmonic Analysis, 59:85–116, 2022
-
[31]
Loss surfaces, mode connectivity, and fast ensembling of dnns
Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry P Vetrov, and Andrew G Wilson. Loss surfaces, mode connectivity, and fast ensembling of dnns. Advances in neural information processing systems, 31, 2018
-
[32]
Linear mode connectivity and the lottery ticket hypothesis
Jonathan Frankle, Gintare Karolina Dziugaite, Daniel Roy, and Michael Carbin. Linear mode connectivity and the lottery ticket hypothesis. In International Conference on Machine Learning, pages 3259–3269. PMLR, 2020
-
[33]
Model fusion via optimal transport
Sidak Pal Singh and Martin Jaggi. Model fusion via optimal transport. Advances in Neural Information Processing Systems, 33:22045–22055, 2020
-
[34]
Git re-basin: Merging models modulo permutation symmetries
Samuel K Ainsworth, Jonathan Hayase, and Siddhartha Srinivasa. Git re-basin: Merging models modulo permutation symmetries. arXiv preprint arXiv:2209.04836, 2022
-
[35]
Exploring neural network landscapes: Star-shaped and geodesic connectivity
Zhanran Lin, Puheng Li, and Lei Wu. Exploring neural network landscapes: Star-shaped and geodesic connectivity. arXiv preprint arXiv:2404.06391, 2024
-
[36]
Do deep neural network solutions form a star domain?
Ankit Sonthalia, Alexander Rubinstein, Ehsan Abbasnejad, and Seong Joon Oh. Do deep neural network solutions form a star domain? arXiv preprint arXiv:2403.07968, 2024
-
[37]
Large scale structure of neural network loss landscapes
Stanislav Fort and Stanislaw Jastrzebski. Large scale structure of neural network loss landscapes. Advances in Neural Information Processing Systems, 32, 2019
-
[38]
Certifying the absence of spurious local minima at infinity
Cédric Josz and Xiaopeng Li. Certifying the absence of spurious local minima at infinity. SIAM Journal on Optimization, 33(3):1416–1439, 2023
-
[39]
Adding one neuron can eliminate all bad local minima
Shiyu Liang, Ruoyu Sun, Jason D Lee, and Rayadurgam Srikant. Adding one neuron can eliminate all bad local minima. Advances in Neural Information Processing Systems, 31, 2018
-
[40]
Elimination of all bad local minima in deep learning
Kenji Kawaguchi and Leslie Kaelbling. Elimination of all bad local minima in deep learning. In International Conference on Artificial Intelligence and Statistics, pages 853–863. PMLR, 2020
-
[41]
Revisiting landscape analysis in deep neural networks: Eliminating decreasing paths to infinity
Shiyu Liang, Ruoyu Sun, and R Srikant. Revisiting landscape analysis in deep neural networks: Eliminating decreasing paths to infinity. SIAM Journal on Optimization, 32(4):2797–2827, 2022
-
[42]
J. von Neumann and E. Wigner. Über merkwürdige diskrete Eigenwerte. Über das Verhalten von Eigenwerten bei adiabatischen Prozessen. Physikalische Zeitschrift, 30:467–470, January 1929
-
[43]
Johanni Brea, Flavio Martinelli, Berfin Şimşek, and Wulfram Gerstner. MLPGradientFlow: Going with the flow of multilayer perceptrons (and finding minima fast and accurately), January 2023
-
[44]
Language modeling with gated convolutional networks
Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 933–941. PMLR, 06–11 Aug 2017
-
[45]
D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. ArXiv e-prints, December 2014
-
[46]
Implicit regularization for deep neural networks driven by an ornstein-uhlenbeck like process
Guy Blanc, Neha Gupta, Gregory Valiant, and Paul Valiant. Implicit regularization for deep neural networks driven by an ornstein-uhlenbeck like process. In Jacob Abernethy and Shivani Agarwal, editors, Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pages 483–513. PMLR, 09–12 Jul 2020
-
[47]
What happens after SGD reaches zero loss? –a mathematical framework
Zhiyuan Li, Tianhao Wang, and Sanjeev Arora. What happens after SGD reaches zero loss? –a mathematical framework. In International Conference on Learning Representations, 2022
-
[48]
Representational drift as a result of implicit regularization
Aviv Ratzon, Dori Derdikman, and Omri Barak. Representational drift as a result of implicit regularization. April 2024
-
[49]
Gradient descent on neural networks typically occurs at the edge of stability
Jeremy M Cohen, Simran Kaur, Yuanzhi Li, J Zico Kolter, and Ameet Talwalkar. Gradient descent on neural networks typically occurs at the edge of stability. arXiv preprint arXiv:2103.00065, 2021
-
[50]
Sharpness-Aware Minimization for Efficiently Improving Generalization
Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. arXiv preprint arXiv:2010.01412, 2020
-
[51]
Bridging mode connectivity in loss landscapes and adversarial robustness
Pu Zhao, Pin-Yu Chen, Payel Das, Karthikeyan Natesan Ramamurthy, and Xue Lin. Bridging mode connectivity in loss landscapes and adversarial robustness. arXiv preprint arXiv:2005.00060, 2020
-
[52]
Exploring diversified adversarial robustness in neural networks via robust mode connectivity
Ren Wang, Yuxuan Li, and Sijia Liu. Exploring diversified adversarial robustness in neural networks via robust mode connectivity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2346–2352, 2023
-
[53]
Linear mode connectivity in multitask and continual learning
Seyed Iman Mirzadeh, Mehrdad Farajtabar, Dilan Gorur, Razvan Pascanu, and Hassan Ghasemzadeh. Linear mode connectivity in multitask and continual learning. arXiv preprint arXiv:2010.04495, 2020
-
[54]
Optimizing mode connectivity for class incremental learning
Haitao Wen, Haoyang Cheng, Heqian Qiu, Lanxiao Wang, Lili Pan, and Hongliang Li. Optimizing mode connectivity for class incremental learning. In International Conference on Machine Learning, pages 36940–36957. PMLR, 2023
-
[55]
Federated learning with matched averaging
Hongyi Wang, Mikhail Yurochkin, Yuekai Sun, Dimitris Papailiopoulos, and Yasaman Khazaeni. Federated learning with matched averaging. arXiv preprint arXiv:2002.06440, 2020.
A Neuron duplication introduces lines of critical points
A.1 Adding bias to neurons drastically reduces the number of stable plateau saddles
In the main text we highlighted specific...
-
[56]
Use the reparametrization below Equation 3 to obtain $c^{(t)}$, $a^{(t)}$, $\mathbf{w}^{(t)}$, $\Delta^{(t)}$ and $\epsilon^{(t)}$ for given values of $a_i^{(t)}$, $a_j^{(t)}$, $\mathbf{w}_i^{(t)}$, $\mathbf{w}_j^{(t)}$ in step $t$ of the ODE solver.
[Figure B9: Fraction of updates parallel to saddle line across all datasets. Panels: GP (s = 0.1), GP (s = 0.5), GP (s = 2.0), GP (s = 10), rosenbrock.]
-
[57]
Move approximately in the direction of the channel by lowering $\epsilon^{(t+1)} = \epsilon^{(t)}/2$, while keeping $c^{(t+1)} = c^{(t)}$, $a^{(t+1)} = a^{(t)}$, $\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)}$, $\Delta^{(t+1)} = \Delta^{(t)}$. This point may not be at the “bottom of the channel”, because the other parameters also move slightly when lowering $\epsilon$.
-
[58]
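Compute the corresponding parameters $a_i^{(t+1)}$, $a_j^{(t+1)}$, $\mathbf{w}_i^{(t+1)}$, $\mathbf{w}_j^{(t+1)}$ and continue the ODE solver from this point to move again closer to the “bottom of the channel”.
C.2 Expansion of the loss in $\epsilon$
We start with the reparameterization in main text Equation 3,
$a_i\sigma(\mathbf{w}_i \cdot \mathbf{x}) + a_j\sigma(\mathbf{w}_j \cdot \mathbf{x}) = \frac{c}{2}\big[\sigma((\mathbf{w} + \epsilon\Delta) \cdot \mathbf{x}) + \sigma((\mathbf{w} - \epsilon\Delta) \cdot \mathbf{x})\big] + \frac{a}{2\epsilon}\big[\sigma((\mathbf{w} + \epsilon\Delta) \cdot \mathbf{x}) - \sigma$...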
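The reparameterization and the first two channel-following steps can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, ∆ is normalized to unit length so that ϵ is uniquely defined (the page does not say which convention the paper uses), and the third step, restarting the ODE solver from the new point, is left out.

```python
import math

def to_channel_coords(a_i, a_j, w_i, w_j):
    # Invert a_i = c/2 + a/(2*eps), a_j = c/2 - a/(2*eps),
    # w_i = w + eps*delta, w_j = w - eps*delta (requires w_i != w_j).
    c = a_i + a_j
    w = [(p + q) / 2 for p, q in zip(w_i, w_j)]
    half_diff = [(p - q) / 2 for p, q in zip(w_i, w_j)]
    eps = math.sqrt(sum(h * h for h in half_diff))  # assumed convention: |delta| = 1
    delta = [h / eps for h in half_diff]
    a = eps * (a_i - a_j)
    return c, a, w, delta, eps

def from_channel_coords(c, a, w, delta, eps):
    a_i = c / 2 + a / (2 * eps)
    a_j = c / 2 - a / (2 * eps)
    w_i = [p + eps * d for p, d in zip(w, delta)]
    w_j = [p - eps * d for p, d in zip(w, delta)]
    return a_i, a_j, w_i, w_j

def halve_eps(a_i, a_j, w_i, w_j):
    # Step 2: halve eps while keeping c, a, w and delta fixed.
    c, a, w, delta, eps = to_channel_coords(a_i, a_j, w_i, w_j)
    return from_channel_coords(c, a, w, delta, eps / 2)

# Made-up starting point for one neuron pair.
start = (10.5, -9.5, [0.6, 0.35], [0.4, 0.45])
new_a_i, new_a_j, new_w_i, new_w_j = halve_eps(*start)
print(new_a_i, new_a_j, new_w_i, new_w_j)
```

In this example the difference between the two output weights doubles and the distance between the two input weight vectors halves, while their sum c and the mean input weight w stay fixed. The result of the halving step does not depend on how ∆ is normalized, because only the products ϵ∆ and a/ϵ enter the original parameters.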
-
[59]
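the function $g$ can be computed analytically (see C.3.1). Inserting Equation 15 into Equation 14 leads to
$$\ell(\theta) = \frac{1}{2}\langle f(\mathbf{x}; \theta^*)^2\rangle - \sum_{j=1}^{r^*}\sum_{k=1}^{r} a_j^* a_k\, g(b_j^*, b_k, \mathbf{w}_j^* \cdot \mathbf{w}_j^*, \mathbf{w}_j^* \cdot \mathbf{w}_k, \mathbf{w}_k \cdot \mathbf{w}_k) + \frac{1}{2}\sum_{j=1}^{r}\sum_{k=1}^{r} a_j a_k\, g(b_j, b_k, \mathbf{w}_j \cdot \mathbf{w}_j, \mathbf{w}_j \cdot \mathbf{w}_k, \mathbf{w}_k \cdot \mathbf{w}_k) \quad (16)$$
We investigate the properties of the landscape using gradient flow, $\dot{\theta} = -\nabla_\theta \ell(\theta)$, where $\ell(\theta)$...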
-
[60]
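we use $2G(z) - 1 = \mathrm{erf}(z/\sqrt{2})$, leading to
$$g(\mu_1, \mu_2, \sigma_1^2, \sigma_1\sigma_2\rho, \sigma_2^2) = 4\,\mathrm{BvN}\!\left(\frac{\mu_1}{\sqrt{1 + \sigma_1^2}}, \frac{\mu_2}{\sqrt{1 + \sigma_2^2}}; \rho\,\frac{\sigma_1\sigma_2}{\sqrt{1 + \sigma_1^2}\sqrt{1 + \sigma_2^2}}\right) - 2G\!\left(\frac{\mu_1}{\sqrt{1 + \sigma_1^2}}\right) - 2G\!\left(\frac{\mu_2}{\sqrt{1 + \sigma_2^2}}\right) + 1 \quad (21)$$
where we used 10,010.8 from [58], $\int_{-\infty}^{\infty} dx\, G'(x)\, G(a + bx) = G\!\left(\frac{a}{\sqrt{1 + b^2}}\right)$.
C.3.2 Minimum at infinity
Here, we derive the stability condition for the minimum at the end of a channel. To this end, we consider the simplified setting where the input i...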
-
[62]
$+\ \alpha_1(\boldsymbol{\omega}_1 \cdot \mathbf{x})\,\sigma'(\boldsymbol{\omega}_0 \cdot \mathbf{x}/\sqrt{2}) + O(\epsilon^2). \quad (33)$
Note that the error is indeed $O(\epsilon^2)$ because $\sum_{i=0}^{1} u_{i1}^3 = 0$. Connecting this result with the notation in the main text, we see $c = \sqrt{2}\alpha_0$, $a = \alpha_1$, $\mathbf{w} = \boldsymbol{\omega}_0/\sqrt{2}$, and $\Delta = \boldsymbol{\omega}_1$.
C.5.2 Second Derivative with Three Neurons
For the second derivative with a three neuron network we change basis to $\mathbf{u}_0 = \frac{1}{\sqrt{3}}(1, 1, 1)^\top$, $\mathbf{u}_1 = \frac{1}{\sqrt{2}}(1, 0, -1)^\top$, $\mathbf{u}_2 = \frac{1}{\sqrt{6}}(1, -2,$ ...
-
[64]
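$+\ [\alpha_1(\boldsymbol{\omega}_1 \cdot \mathbf{x}) + \alpha_2(\boldsymbol{\omega}_2 \cdot \mathbf{x})]\,\sigma'(\boldsymbol{\omega}_0 \cdot \mathbf{x}/\sqrt{3}) + \frac{1}{12}\alpha_2(\boldsymbol{\omega}_1 \cdot \mathbf{x})^2\,\sigma''(\boldsymbol{\omega}_0 \cdot \mathbf{x}/\sqrt{3}). \quad (35)$
Note that it is necessary to have the $\boldsymbol{\omega}_2 \cdot \mathbf{x}$ contribution to be $O(\epsilon^2)$, otherwise the $\alpha_2(\boldsymbol{\omega}_2 \cdot \mathbf{x})$ term would diverge with $1/\epsilon$.
Appendix References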
-
[65]
Higher-order additive Runge–Kutta schemes for ordinary differential equations
Christopher A Kennedy and Mark H Carpenter. Higher-order additive Runge–Kutta schemes for ordinary differential equations. Applied numerical mathematics, 136:183–205, 2019
-
[66]
An introduction to numerical analysis
Endre Süli and David F Mayers. An introduction to numerical analysis. Cambridge university press, 2003
-
[67]
Donald B. Owen. A table of normal integrals. Commun. Stat. Simul. Comput., 9(4):389–419, 1980