A shared mixed-activation network of width 2dN+d+2 yields layer-wise L^p approximation rates bounded by the modulus of continuity at geometric scale N^{-ℓ}, reducing to (2d+1)N^{-ℓ} for 1-Lipschitz targets.
Understanding gradient descent on the edge of stability in deep learning
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
fields
cs.LG 2years
2026 2verdicts
UNVERDICTED 2representative citing papers
Large loss spikes in SGD are polynomially likely and serve as the dominant mechanism for escaping sharp minima toward flatter solutions in the NTK regime.
citing papers explorer
-
Geometric Layer-wise Approximation Rates for Deep Networks
A shared mixed-activation network of width 2dN+d+2 yields layer-wise L^p approximation rates bounded by the modulus of continuity at geometric scale N^{-ℓ}, reducing to (2d+1)N^{-ℓ} for 1-Lipschitz targets.
-
Large Spikes in Stochastic Gradient Descent: A Large-Deviations View
Large loss spikes in SGD are polynomially likely and serve as the dominant mechanism for escaping sharp minima toward flatter solutions in the NTK regime.