
arxiv: 1907.02911 · v1 · pith:3FUUPMVPnew · submitted 2019-07-05 · 💻 cs.LG · stat.ML

Weight-space symmetry in deep networks gives rise to permutation saddles, connected by equal-loss valleys across the loss landscape

Pith reviewed 2026-05-25 02:09 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords: permutation symmetry · loss landscape · saddle points · critical points · neural network · flat valleys · weight space symmetry

The pith

Neuron permutation symmetry produces saddle points linked by constant-loss valleys in neural network loss landscapes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors construct smooth paths in weight space that connect equivalent global minima and pass through permutation points, where the weights of two neurons in one layer coincide and can be interchanged. At these points the loss function has a plateau, because at least n_{k+1} Hessian eigenvalues vanish (one for each neuron in the next layer). Each such point opens into a flat valley that lets the network reach any of the n_k! permutations of the layer's neurons while keeping the loss fixed. Higher-order versions of these points, built recursively, outnumber the equivalent minima. Numerical examples on a toy function fit and on MNIST confirm that some of these points behave as first-order saddles.

Core claim

In a network with d-1 hidden layers, paths between equivalent minima transit through permutation points at which two neurons in layer k exchange their input and output weight vectors. These points are critical points of the loss with at least n_{k+1} zero Hessian eigenvalues. The point for neurons i and j continues into an extended plateau of n_{k+1} flat dimensions that accommodates every possible ordering of the n_k neurons at identical loss. The recursive structure further produces Kth-order permutation points, whose number exceeds the number of equivalent global minima by at least the factor sum_{k=1}^{d-1} 1/(2!^K) * binom(n_k - K, K).
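The combinatorial factor in the core claim is easy to evaluate for concrete layer widths. The following is a minimal sketch of that arithmetic only; the function name `permutation_point_factor` and the argument `hidden_sizes` are hypothetical, and the example widths are arbitrary rather than taken from the paper.

```python
from fractions import Fraction
from math import comb


def permutation_point_factor(hidden_sizes, K):
    """Return sum_k (1 / 2!^K) * C(n_k - K, K) over the hidden layers.

    hidden_sizes: widths n_1, ..., n_{d-1} of the hidden layers (hypothetical name).
    K: order of the permutation point.
    """
    total = Fraction(0)
    for n_k in hidden_sizes:
        if n_k - K >= K:  # otherwise the binomial coefficient is zero
            total += Fraction(comb(n_k - K, K), 2 ** K)  # 2!^K == 2^K
    return total


# One hidden layer with 5 neurons, first-order points: (1/2) * C(4, 1) = 2
print(permutation_point_factor([5], K=1))
# Hidden layers with 10 and 6 neurons, second-order points:
# (1/4) * C(8, 2) + (1/4) * C(4, 2) = 7 + 3/2 = 17/2
print(permutation_point_factor([10, 6], K=2))
```

Exact fractions are used so that small cases can be checked by hand against the formula.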

What carries the argument

The argument rests on the permutation point: the configuration in which two neurons in the same hidden layer have identical incoming and outgoing weights, so that they can be continuously swapped while the network function stays unchanged.
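The invariance behind this mechanism can be illustrated numerically. The sketch below assumes a one-hidden-layer tanh network with two output units, in which hidden neurons 0 and 1 share their incoming weight vector; all weights and the helper names `forward` and `outputs_at` are illustrative choices, not the paper's setup. Redistributing the summed outgoing weight between the two merged neurons, independently for each output unit, should leave the output unchanged.

```python
import math


def forward(W, A, x):
    """One-hidden-layer tanh network; W is hidden x input, A is output x hidden."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]
    return [sum(a * hi for a, hi in zip(row, h)) for row in A]


# Neurons 0 and 1 share one incoming weight vector (the merged configuration).
shared = [0.7, -0.3]
W = [shared, shared, [0.2, 0.9]]
summed_out = [1.5, -0.8]   # summed outgoing weight of the merged pair, per output unit
other_out = [0.4, 0.6]     # outgoing weights of the unmerged third neuron
x = [0.5, -1.2]


def outputs_at(lams):
    """Neuron 0 gets fraction lams[n] of the summed weight to output n; neuron 1 the rest."""
    A = [[lam * s, (1.0 - lam) * s, o]
         for lam, s, o in zip(lams, summed_out, other_out)]
    return forward(W, A, x)


reference = outputs_at([0.0, 0.0])
for lams in ([0.25, 0.9], [0.5, 0.5], [1.0, 0.0], [-2.0, 3.0]):
    deviation = max(abs(u - v) for u, v in zip(outputs_at(lams), reference))
    print(lams, deviation)  # deviations stay at floating-point rounding level
```

There is one free split parameter per output unit, which matches the n_{k+1} flat dimensions described in the core claim.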

Load-bearing premise

A differentiable path through weight space can be chosen that keeps the network output exactly constant while two neurons swap their weight vectors continuously.

What would settle it

In a small network construct an explicit permutation point and verify that the Hessian has at least n_{k+1} eigenvalues equal to zero.
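A partial version of this test can be sketched with finite differences. The code below assumes a one-hidden-layer tanh network with a mean-squared-error loss on four arbitrary data points; the data, weights, and helper names are hypothetical. It checks only that the curvature of the loss vanishes along the n_{k+1} valley directions, which is a necessary condition; a full check would also compute the eigenvalues of the complete Hessian and the gradient norm at the point.

```python
import math

# Tiny data set: inputs in R^2, targets in R^2 (so n_{k+1} = 2 output units).
X = [[0.5, -1.2], [1.0, 0.3], [-0.7, 0.8], [0.1, 1.5]]
T = [[0.2, -0.1], [0.9, 0.4], [-0.5, 0.3], [0.0, 1.0]]

shared = [0.7, -0.3]                      # neurons 0 and 1 have merged incoming weights
W = [shared, shared, [0.2, 0.9]]
A = [[0.6, 0.9, 0.4], [-0.5, -0.3, 0.6]]  # A[n][m]: weight from hidden m to output n


def loss(A_mat):
    """Mean squared error of the one-hidden-layer tanh network."""
    total = 0.0
    for x, t in zip(X, T):
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]
        y = [sum(a * hi for a, hi in zip(row, h)) for row in A_mat]
        total += sum((yi - ti) ** 2 for yi, ti in zip(y, t))
    return total / len(X)


def curvature(direction, step=1e-3):
    """Central second difference of the loss along a direction in output-weight space."""
    def shifted(s):
        return [[a + s * d for a, d in zip(ra, rd)] for ra, rd in zip(A, direction)]
    return (loss(shifted(step)) - 2.0 * loss(A) + loss(shifted(-step))) / step ** 2


# One valley direction per output unit n: shift weight from neuron 1 to neuron 0.
valley = [[[1.0 if r == n else 0.0, -1.0 if r == n else 0.0, 0.0] for r in range(2)]
          for n in range(2)]
generic = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]   # change a single output weight only

valley_curvatures = [curvature(d) for d in valley]
generic_curvature = curvature(generic)
print("valley:", valley_curvatures)
print("generic:", generic_curvature)
```

The generic direction is included as a control: it changes the network output, so its curvature is clearly positive, while the valley directions stay flat up to rounding.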

Figures

Figures reproduced from arXiv: 1907.02911 by Berfin Simsek, Bernd Illing, Johanni Brea, Wulfram Gerstner.

Figure 1. Paths to a permutation point in networks with two-dimensional input space, five hidden neurons and one linear output neuron. Merging hidden neurons 1&3 (A) leads to the same configuration and the same loss L as merging neurons 3&4 (B) whereas merging neurons 2&5 leads to a different configuration and higher loss (C). Top row. Configuration of the 5 weight vectors $W^{(1)}_{i,:}/b^{(1)}_i$ of the hidden layer in a …
Figure 2. A. Configuration of two parameter vectors $\vartheta^{(k)}_l$ and $\vartheta^{(k)}_m$ in the teacher network (black) and a potential path (green) towards a permutation point $\theta^{(k)}_{l \Leftrightarrow m}$ (?) in the student network (black filled circles: positions of the student parameter vectors along the path; sample vector shown in red). The path is parametrized by the distance d. Along the path the distance d (blue dashed lines) decreases continu…
Figure 3. A low-loss permutation path in the loss landscape of a network trained on MNIST. A We merged two parameter vectors with high cosine-similarity in the second hidden layer of a three-layer network with H = 10, 15, 20 or 25 trained on MNIST. For each hidden layer size we train 6 teacher networks with different random seeds and display one curve per hidden layer size and seed. The distance d was decreased in 1…
Figure 5. A. The loss landscape (schematic) as a function of two parameters: the difference $W^{(k)}_{m,i} - W^{(k)}_{l,i}$ between the weights from neuron i to neurons m and l in layer k and the weight $W^{(k+1)}_{n,m}$ from neuron m to neuron n in the next layer (see B for network graph). The red curve indicates the path from one of the global minima (red triangle) to a saddle point where the difference between the input weight v…
Figure 6. Visualizing how permutation points arise in between global minima for a (hypothetical) …
Figure 7. A. Zooming in permutation points of one of the permutation sets (red ? in …
Original abstract

The permutation symmetry of neurons in each layer of a deep neural network gives rise not only to multiple equivalent global minima of the loss function, but also to first-order saddle points located on the path between the global minima. In a network of $d-1$ hidden layers with $n_k$ neurons in layers $k = 1, \ldots, d$, we construct smooth paths between equivalent global minima that lead through a `permutation point' where the input and output weight vectors of two neurons in the same hidden layer $k$ collide and interchange. We show that such permutation points are critical points with at least $n_{k+1}$ vanishing eigenvalues of the Hessian matrix of second derivatives indicating a local plateau of the loss function. We find that a permutation point for the exchange of neurons $i$ and $j$ transits into a flat valley (or generally, an extended plateau of $n_{k+1}$ flat dimensions) that enables all $n_k!$ permutations of neurons in a given layer $k$ at the same loss value. Moreover, we introduce high-order permutation points by exploiting the recursive structure in neural network functions, and find that the number of $K^{\text{th}}$-order permutation points is at least by a factor $\sum_{k=1}^{d-1}\frac{1}{2!^K}{n_k-K \choose K}$ larger than the (already huge) number of equivalent global minima. In two tasks, we illustrate numerically that some of the permutation points correspond to first-order saddles (`permutation saddles'): first, in a toy network with a single hidden layer on a function approximation task and, second, in a multilayer network on the MNIST task. Our geometric approach yields a lower bound on the number of critical points generated by weight-space symmetries and provides a simple intuitive link between previous mathematical results and numerical observations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that neuron permutation symmetries in deep networks generate not only equivalent global minima but also 'permutation points' (where input/output weight vectors of two neurons in layer k collide) that are critical points with at least n_{k+1} vanishing Hessian eigenvalues, connected by flat valleys enabling all n_k! permutations at constant loss. It constructs smooth paths through these points between equivalent minima, introduces higher-order permutation points whose number exceeds the number of minima by a combinatorial factor, and numerically illustrates permutation saddles on a toy single-hidden-layer network and on MNIST.

Significance. If the identification of permutation points as critical points holds, the geometric construction supplies an explicit lower bound on symmetry-induced critical points, a concrete link between permutation invariance and the observed abundance of saddles/plateaus, and a mechanism for connected components of equal-loss minima; the numerical examples on toy and MNIST tasks provide direct evidence of the predicted flat directions.

major comments (2)
  1. [construction of permutation points and Hessian analysis (abstract and §3–4)] The central derivation that permutation points are critical points (i.e., have identically zero gradient) is not established by the symmetry argument alone. Symmetry under neuron-label swap forces the per-neuron gradient components to be identical, but places no constraint forcing the common gradient with respect to the shared input weights or the summed output direction to vanish; those components vanish only if the reduced (merged-neuron) network is itself at a critical point of the loss, which the path-construction between global minima does not enforce. This directly affects the claim that the points are saddles rather than merely points with a flat valley in certain directions.
  2. [Hessian eigenvalue claim (abstract and §3)] The lower bound of at least n_{k+1} vanishing Hessian eigenvalues is tied to the output-weight valley, but the full statement that these are zero eigenvalues of the Hessian at a stationary point requires the gradient to be zero first; without that, the eigenvalue count applies only to the restricted Hessian on the valley subspace.
minor comments (2)
  1. [higher-order points] Notation for the recursive definition of higher-order permutation points could be clarified with an explicit small example (e.g., K=2 on a two-layer network) to make the combinatorial factor ∑_{k=1}^{d-1} 1/(2!^K) binom(n_k - K, K) easier to verify.
  2. [numerical experiments on toy network and MNIST] The numerical sections would benefit from reporting the measured gradient norm at the identified permutation points (in addition to loss and Hessian eigenvalues) to allow direct checking of the stationarity claim.
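The stationarity concern raised in the major comments, and the gradient-norm report requested in minor comment 2, can be illustrated on a toy case. The sketch below assumes a one-hidden-layer tanh network with a single linear output and a squared-error loss; the data and weights are arbitrary and the function name `gradients` is hypothetical. At a generic merged configuration the gradient components of the two merged neurons coincide, as symmetry requires, but the gradient norm is not zero.

```python
import math

X = [[0.5, -1.2], [1.0, 0.3], [-0.7, 0.8], [0.1, 1.5]]
T = [0.2, 0.9, -0.5, 0.0]

shared = [0.7, -0.3]
W = [shared, shared, [0.2, 0.9]]   # neurons 0 and 1 merged
a = [0.6, 0.6, 0.4]                # merged pair also shares its outgoing weight


def gradients(W, a):
    """Analytic gradient of L = 1/(2N) * sum (y - t)^2 for y = sum_m a_m tanh(w_m . x)."""
    n = len(X)
    grad_W = [[0.0] * len(X[0]) for _ in W]
    grad_a = [0.0] * len(a)
    for x, t in zip(X, T):
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]
        residual = sum(am * hm for am, hm in zip(a, h)) - t
        for m in range(len(a)):
            grad_a[m] += residual * h[m] / n
            for i, xi in enumerate(x):
                grad_W[m][i] += residual * a[m] * (1.0 - h[m] ** 2) * xi / n
    return grad_W, grad_a


grad_W, grad_a = gradients(W, a)
grad_norm = math.sqrt(sum(g * g for row in grad_W for g in row)
                      + sum(g * g for g in grad_a))

print("grad wrt w_0:", grad_W[0])
print("grad wrt w_1:", grad_W[1])
print("gradient norm:", grad_norm)
```

Reporting `grad_norm` alongside the loss at each candidate point is the kind of check that would separate true saddles from merely flat directions.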

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive comments, which identify a substantive gap in the theoretical argument. We address the two major comments point by point below.

  1. Referee: The central derivation that permutation points are critical points (i.e., have identically zero gradient) is not established by the symmetry argument alone. Symmetry under neuron-label swap forces the per-neuron gradient components to be identical, but places no constraint forcing the common gradient with respect to the shared input weights or the summed output direction to vanish; those components vanish only if the reduced (merged-neuron) network is itself at a critical point of the loss, which the path-construction between global minima does not enforce. This directly affects the claim that the points are saddles rather than merely points with a flat valley in certain directions.

    Authors: We agree with the referee's analysis. The symmetry argument establishes that the per-neuron gradient components are identical but does not force the common value to vanish; this requires the reduced (merged-neuron) network to itself be at a critical point. Our path construction between global minima does not automatically enforce the latter condition at the collision point. We will revise §3–4 and the abstract to state the additional condition explicitly, qualify the claim that permutation points are critical points/saddles, and note that the flat-valley property holds independently of criticality.
    Revision: yes

  2. Referee: The lower bound of at least n_{k+1} vanishing Hessian eigenvalues is tied to the output-weight valley, but the full statement that these are zero eigenvalues of the Hessian at a stationary point requires the gradient to be zero first; without that, the eigenvalue count applies only to the restricted Hessian on the valley subspace.

    Authors: We concur. The stated lower bound on vanishing eigenvalues is rigorously a property of the restricted Hessian along the valley directions unless the point is first shown to be stationary. We will revise the abstract and §3 to make this distinction clear and to tie the full-Hessian claim to the corrected criticality condition introduced in response to the first comment.
    Revision: yes

Circularity Check

0 steps flagged

No circularity: derivation is an explicit geometric construction from standard permutation invariance


The paper constructs explicit smooth paths in weight space that interchange neurons while holding the network output (hence loss) constant, then computes the gradient and Hessian along those paths to identify critical points and flat directions. These steps rely only on the algebraic fact that the network function is invariant under neuron relabeling (a property true for any feed-forward architecture with the usual sum-and-activation structure) and on direct differentiation; no parameters are fitted to data, no result is renamed as a prediction, and no load-bearing premise is justified solely by self-citation. The counting arguments for higher-order permutation points follow recursively from the same invariance. The derivation is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The analysis rests on the standard domain assumption of exact permutation invariance of the network function and on ordinary calculus (differentiability of the loss). No free parameters are introduced and no new entities are postulated.

axioms (2)
  • domain assumption: The loss function is exactly invariant under arbitrary permutations of neurons within each hidden layer.
    This is the foundational symmetry used to construct the constant-loss paths and is stated at the opening of the abstract.
  • standard math: The loss function is twice differentiable so that the Hessian exists at the permutation points.
    Required for the eigenvalue analysis of critical points.
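The first axiom can be checked directly on a small example. This is a minimal sketch assuming a one-hidden-layer tanh network; the weights and the helper `permute_hidden` are hypothetical. Relabeling the hidden neurons reorders the rows of the input weight matrix and the columns of the output weight matrix together, which should leave the output, and hence any loss computed from it, unchanged.

```python
import math


def forward(W, A, x):
    """One-hidden-layer tanh network; W is hidden x input, A is output x hidden."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]
    return [sum(a * hi for a, hi in zip(row, h)) for row in A]


def permute_hidden(W, A, perm):
    """Relabel hidden neurons: new neuron p takes the weights of old neuron perm[p]."""
    W_new = [W[m] for m in perm]
    A_new = [[row[m] for m in perm] for row in A]
    return W_new, A_new


W = [[0.7, -0.3], [0.2, 0.9], [-1.1, 0.4]]
A = [[0.6, 0.9, 0.4], [-0.5, -0.3, 0.6]]
x = [0.5, -1.2]

W_p, A_p = permute_hidden(W, A, perm=[2, 0, 1])
print(forward(W, A, x))
print(forward(W_p, A_p, x))  # same values up to floating-point summation order
```

With three hidden neurons there are 3! = 6 such relabelings, each giving a distinct weight configuration with the same function.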

pith-pipeline@v0.9.0 · 5903 in / 1415 out tokens · 21480 ms · 2026-05-25T02:09:51.377959+00:00 · methodology



Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. A Theory of Saddle Escape in Deep Nonlinear Networks

    cs.LG 2026-05 unverdicted novelty 7.0

    Derives exact norm-imbalance identity for deep nonlinear nets, classifying activations into four classes and yielding escape time law τ★ = Θ(ε^{-(r-2)}) governed by bottleneck depth r.

  2. A Theory of Saddle Escape in Deep Nonlinear Networks

    cs.LG 2026-05 conditional novelty 7.0

    An exact norm-imbalance identity classifies activations into four classes and reduces deep nonlinear training flow to a scalar ODE that predicts saddle escape time scaling as ε to the power of minus (r-2) for r bottle...

  3. The Platonic Representation Hypothesis

    cs.LG 2024-05 unverdicted novelty 5.0

    Representations learned by large AI models are converging toward a shared statistical model of reality.

  4. Nora: Normalized Orthogonal Row Alignment for Scalable Matrix Optimizer

    cs.LG 2026-05 unverdicted novelty 4.0

    Nora is a matrix optimizer that stabilizes weight norms and angular velocities through row-wise momentum projection onto the orthogonal complement of the weights while approximating structured preconditioning with O(m...
