Weight-space symmetry in deep networks gives rise to permutation saddles, connected by equal-loss valleys across the loss landscape
5 Pith papers cite this work. Polarity classification is still indexing.
abstract
The permutation symmetry of neurons in each layer of a deep neural network gives rise not only to multiple equivalent global minima of the loss function, but also to first-order saddle points located on the path between the global minima. In a network of $d-1$ hidden layers with $n_k$ neurons in layers $k = 1, \ldots, d$, we construct smooth paths between equivalent global minima that lead through a 'permutation point' where the input and output weight vectors of two neurons in the same hidden layer $k$ collide and interchange. We show that such permutation points are critical points with at least $n_{k+1}$ vanishing eigenvalues of the Hessian matrix of second derivatives, indicating a local plateau of the loss function. We find that a permutation point for the exchange of neurons $i$ and $j$ transits into a flat valley (or, more generally, an extended plateau of $n_{k+1}$ flat dimensions) that enables all $n_k!$ permutations of neurons in a given layer $k$ at the same loss value. Moreover, we introduce high-order permutation points by exploiting the recursive structure in neural network functions, and find that the number of $K^{\text{th}}$-order permutation points exceeds the (already huge) number of equivalent global minima by at least a factor of $\sum_{k=1}^{d-1}\frac{1}{2!^K}{n_k-K \choose K}$. In two tasks, we illustrate numerically that some of the permutation points correspond to first-order saddles ('permutation saddles'): first, in a toy network with a single hidden layer on a function approximation task and, second, in a multilayer network on the MNIST task. Our geometric approach yields a lower bound on the number of critical points generated by weight-space symmetries and provides a simple intuitive link between previous mathematical results and numerical observations.
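The symmetry the abstract builds on can be illustrated with a minimal NumPy sketch. It is not the paper's code: the bias-free single-hidden-layer tanh network, the layer sizes, and the names `mlp` and `swap_neurons` are assumptions made only for illustration. It checks two things: exchanging two hidden neurons (rows of the input weights together with the matching columns of the output weights) leaves the network function unchanged, and a configuration in which the two neurons' input and output weight vectors coincide is mapped to itself by that exchange, which is the sense in which the weights "collide" at a permutation point.

```python
import numpy as np

def mlp(x, W1, W2):
    """Bias-free single-hidden-layer network: x -> W2 @ tanh(W1 @ x)."""
    return W2 @ np.tanh(W1 @ x)

def swap_neurons(W1, W2, i, j):
    """Exchange hidden neurons i and j.

    Rows i and j of the input weights W1 are swapped together with
    columns i and j of the output weights W2.
    """
    perm = np.arange(W1.shape[0])
    perm[[i, j]] = perm[[j, i]]
    return W1[perm], W2[:, perm]

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 3, 4, 2          # illustrative sizes (assumption)
W1 = rng.normal(size=(n_hidden, n_in))   # input weights of the hidden layer
W2 = rng.normal(size=(n_out, n_hidden))  # output weights of the hidden layer
x = rng.normal(size=(n_in, 5))           # five random input vectors

# 1) Swapping neurons 0 and 1 gives different parameters, same function.
W1_s, W2_s = swap_neurons(W1, W2, 0, 1)
max_diff = float(np.max(np.abs(mlp(x, W1, W2) - mlp(x, W1_s, W2_s))))

# 2) If the two neurons have identical input and output weight vectors,
#    the swap maps the parameter point to itself.
W1_c, W2_c = W1.copy(), W2.copy()
W1_c[1] = W1_c[0]
W2_c[:, 1] = W2_c[:, 0]
W1_cs, W2_cs = swap_neurons(W1_c, W2_c, 0, 1)
is_fixed_point = bool(np.array_equal(W1_c, W1_cs) and np.array_equal(W2_c, W2_cs))

print(f"max output difference after swap: {max_diff:.2e}")
print(f"collided configuration is fixed by the swap: {is_fixed_point}")
```

The sketch only demonstrates the symmetry itself; it does not construct the smooth path between minima or verify that the collided point is a critical point of any loss, both of which depend on the construction described in the paper.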
citation-role summary
citation-polarity summary
fields
cs.LG 5
verdicts
UNVERDICTED 5
roles
background 1
polarities
background 1
representative citing papers
Neural networks admit large families of approximately equivalent solutions via neuron identifiability even without structural symmetry, enabling linear low-loss merging paths without prior alignment.
A geometric classification of stationary points on neuron-splitting plateaus in two-layer NN loss landscapes using the inner Hessian.
Representations learned by large AI models are converging toward a shared statistical model of reality.
Nora is a matrix optimizer that stabilizes weight norms and angular velocities through row-wise momentum projection onto the orthogonal complement of the weights while approximating structured preconditioning with O(mn) complexity and proven scalability.
citing papers explorer
- A Theory of Saddle Escape in Deep Nonlinear Networks
  Derives exact Frobenius norm imbalance identity for deep nonlinear networks, classifies activations into four classes, and obtains critical-depth escape time law τ★ = Θ(ε^{-(r-2)}) from reduction to scalar ODE on permutation-symmetric submanifold.
- Beyond Structural Symmetries: Linear Mode Connectivity via Neuron Identifiability
  Neural networks admit large families of approximately equivalent solutions via neuron identifiability even without structural symmetry, enabling linear low-loss merging paths without prior alignment.
- A Geometric Characterization of the Stationary Plateau for Two-Layer Neural Networks
  A geometric classification of stationary points on neuron-splitting plateaus in two-layer NN loss landscapes using the inner Hessian.
- The Platonic Representation Hypothesis
  Representations learned by large AI models are converging toward a shared statistical model of reality.
- Nora: Normalized Orthogonal Row Alignment for Scalable Matrix Optimizer
  Nora is a matrix optimizer that stabilizes weight norms and angular velocities through row-wise momentum projection onto the orthogonal complement of the weights while approximating structured preconditioning with O(mn) complexity and proven scalability.