The Power of Normalization: Faster Evasion of Saddle Points

Kfir Y. Levy

arxiv: 1611.04831 · v1 · pith:IUMKF47Mnew · submitted 2016-11-15 · cs.LG · math.OC · stat.ML

classification cs.LG math.OC stat.ML

keywords gradient, saddle, convergence, descent, evasion, heuristic, known, method

A commonly used heuristic in non-convex optimization is Normalized Gradient Descent (NGD) - a variant of gradient descent in which only the direction of the gradient is taken into account and its magnitude ignored. We analyze this heuristic and show that with carefully chosen parameters and noise injection, this method can provably evade saddle points. We establish the convergence of NGD to a local minimum, and demonstrate rates which improve upon the fastest known first-order algorithm due to Ge et al. (2015). The effectiveness of our method is demonstrated via an application to the problem of online tensor decomposition; a task for which saddle point evasion is known to result in convergence to global minima.
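
To make the idea in the abstract concrete, here is a minimal Python sketch of normalized gradient descent with injected noise. It is an illustration only, not the paper's algorithm: the function name `noisy_ngd`, the step size, the noise scale, the every-step noise schedule, and the toy objective f(x, y) = x²/2 + y⁴/4 − y²/2 (saddle at the origin, minima at (0, ±1)) are all assumptions made for this example.

```python
import numpy as np

def noisy_ngd(grad, x0, step=0.01, noise_scale=0.1, iters=2000, seed=0):
    """Normalized gradient descent with a small random perturbation per step.

    Hypothetical helper for illustration; the parameters and noise schedule
    are assumptions, not the values analyzed in the paper.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(iters):
        g = grad(x)
        norm = np.linalg.norm(g)
        # Keep only the direction of the gradient; ignore its magnitude.
        direction = g / norm if norm > 1e-12 else np.zeros_like(g)
        # Unit-norm random direction used as the injected noise.
        noise = rng.standard_normal(x.shape)
        noise /= np.linalg.norm(noise)
        x = x - step * (direction + noise_scale * noise)
    return x

def toy_grad(p):
    # Gradient of f(x, y) = x^2/2 + y^4/4 - y^2/2 (assumed toy objective).
    x, y = p
    return np.array([x, y**3 - y])

# Start on the line y = 0, along which the gradient never leaves y = 0.
start = [0.5, 0.0]
x_final = noisy_ngd(toy_grad, start)                    # with noise
x_stuck = noisy_ngd(toy_grad, start, noise_scale=0.0)   # without noise
print("with noise:   ", x_final)
print("without noise:", x_stuck)
```

In this toy setup the noiseless run never leaves the line y = 0 and so stays at the saddle, while the noisy run ends near one of the two minima. With a fixed step size the normalized iterate hovers within roughly one step length of the minimum rather than converging exactly.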

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Function-free Optimization via Comparison Oracles
math.OC 2026-04 unverdicted novelty 7.0

Introduces a comparison-oracle framework using preference level-set geometry to achieve near-optimal query complexity for normal direction estimation and descent-based optimization under regularity and convexity.
Function-free Optimization via Comparison Oracles
math.OC 2026-04 unverdicted novelty 7.0

Introduces a geometry-based framework for comparison-oracle optimization, with O(d log(d/ε)) comparisons for normal direction estimation and Õ(d D²/ε²) comparisons to reach ε level-set optimality gap under regularity,...
Memory-Efficient LLM Pretraining via Minimalist Optimizer Design
cs.LG 2025-06 conditional novelty 6.0

SCALE matches Adam performance in LLM pretraining from 60M to 7B parameters by combining column-wise gradient normalization with last-layer-only momentum, using 35-45% of Adam's memory.
Adaptive Federated Optimization
cs.LG 2020-02 unverdicted novelty 6.0

Proposes federated adaptive optimizers (FedAdagrad, FedAdam, FedYogi) with convergence analysis for non-convex objectives under data heterogeneity and reports empirical gains over FedAvg.
Accelerated Gradient Descent for Faster Convergence with Minimal Overhead
cs.LG 2026-05 unverdicted novelty 4.0

CT-AGD accelerates first-order optimization in deep learning by using finite-difference curvature estimates and noise-mitigation heuristics, achieving equivalent accuracy with 33% fewer training epochs and overhead co...