The Power of Normalization: Faster Evasion of Saddle Points
read the original abstract
A commonly used heuristic in non-convex optimization is Normalized Gradient Descent (NGD) - a variant of gradient descent in which only the direction of the gradient is taken into account and its magnitude ignored. We analyze this heuristic and show that with carefully chosen parameters and noise injection, this method can provably evade saddle points. We establish the convergence of NGD to a local minimum, and demonstrate rates which improve upon the fastest known first order algorithm due to Ge e al. (2015). The effectiveness of our method is demonstrated via an application to the problem of online tensor decomposition; a task for which saddle point evasion is known to result in convergence to global minima.
This paper has not been read by Pith yet.
Forward citations
Cited by 5 Pith papers
-
Function-free Optimization via Comparison Oracles
Introduces a comparison-oracle framework using preference level-set geometry to achieve near-optimal query complexity for normal direction estimation and descent-based optimization under regularity and convexity.
-
Function-free Optimization via Comparison Oracles
Introduces a geometry-based framework for comparison-oracle optimization, with O(d log(d/ε)) comparisons for normal direction estimation and Õ(d D²/ε²) comparisons to reach ε level-set optimality gap under regularity,...
-
Memory-Efficient LLM Pretraining via Minimalist Optimizer Design
SCALE matches Adam performance in LLM pretraining from 60M to 7B parameters by combining column-wise gradient normalization with last-layer-only momentum, using 35-45% of Adam's memory.
-
Adaptive Federated Optimization
Proposes federated adaptive optimizers (FedAdagrad, FedAdam, FedYogi) with convergence analysis for non-convex objectives under data heterogeneity and reports empirical gains over FedAvg.
-
Accelerated Gradient Descent for Faster Convergence with Minimal Overhead
CT-AGD accelerates first-order optimization in deep learning by using finite-difference curvature estimates and noise-mitigation heuristics, achieving equivalent accuracy with 33% fewer training epochs and overhead co...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.