
The Power of Normalization: Faster Evasion of Saddle Points

4 Pith papers cite this work. Polarity classification is still indexing.

abstract

A commonly used heuristic in non-convex optimization is Normalized Gradient Descent (NGD), a variant of gradient descent in which only the direction of the gradient is taken into account and its magnitude is ignored. We analyze this heuristic and show that with carefully chosen parameters and noise injection, this method can provably evade saddle points. We establish the convergence of NGD to a local minimum, and demonstrate rates which improve upon the fastest known first-order algorithm, due to Ge et al. (2015). The effectiveness of our method is demonstrated via an application to the problem of online tensor decomposition, a task for which saddle point evasion is known to result in convergence to global minima.
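
The idea in the abstract (step along the gradient direction only, with added noise) can be illustrated with a minimal sketch. This is not the paper's algorithm or parameter schedule: the function name noisy_ngd, the fixed step size, the unit-sphere noise, the noise_scale value, and the toy objective f(x, y) = x^4/4 - x^2/2 + y^2/2 (saddle at the origin, minima at (±1, 0)) are all assumptions chosen for illustration.

```python
import numpy as np


def noisy_ngd(grad, x0, step=0.01, noise_scale=0.5, iters=2000, seed=0):
    """Normalized gradient descent with injected noise (illustrative sketch).

    All defaults are assumptions for this toy example, not values
    taken from the paper.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(iters):
        g = grad(x)
        norm = np.linalg.norm(g)
        # Keep only the gradient direction; use a zero vector if the gradient vanishes.
        direction = g / norm if norm > 1e-12 else np.zeros_like(g)
        # Random perturbation drawn uniformly from the unit sphere.
        n = rng.standard_normal(x.shape)
        n /= np.linalg.norm(n)
        # Fixed-length step along the normalized direction plus scaled noise.
        x = x - step * (direction + noise_scale * n)
    return x


def toy_grad(x):
    """Gradient of the hypothetical objective x0^4/4 - x0^2/2 + x1^2/2."""
    return np.array([x[0] ** 3 - x[0], x[1]])


# Start exactly at the saddle point, where the plain gradient is zero.
x_final = noisy_ngd(toy_grad, x0=[0.0, 0.0])
print(x_final)
```

Starting exactly at the saddle, the gradient is zero, so only the noise term moves the iterate at first; once it leaves the origin, the normalized step has the same length however small the gradient is. With a fixed step size the iterate ends up jittering in a small neighborhood of one of the two minima rather than converging exactly.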

representative citing papers

Function-free Optimization via Comparison Oracles

math.OC · 2026-04-29 · unverdicted · novelty 7.0

Introduces a geometry-based framework for comparison-oracle optimization, with O(d log(d/ε)) comparisons for normal direction estimation and Õ(d D²/ε²) comparisons to reach ε level-set optimality gap under regularity, convexity, and growth conditions.

Adaptive Federated Optimization

cs.LG · 2020-02-29 · unverdicted · novelty 6.0

Proposes federated adaptive optimizers (FedAdagrad, FedAdam, FedYogi) with convergence analysis for non-convex objectives under data heterogeneity and reports empirical gains over FedAvg.

citing papers explorer

Showing 4 of 4 citing papers.

  • Function-free Optimization via Comparison Oracles · math.OC · 2026-04-29 · unverdicted · none · ref 20 · internal anchor

    Introduces a geometry-based framework for comparison-oracle optimization, with O(d log(d/ε)) comparisons for normal direction estimation and Õ(d D²/ε²) comparisons to reach ε level-set optimality gap under regularity, convexity, and growth conditions.

  • Memory-Efficient LLM Pretraining via Minimalist Optimizer Design · cs.LG · 2025-06-20 · conditional · none · ref 6 · internal anchor

    SCALE matches Adam performance in LLM pretraining from 60M to 7B parameters by combining column-wise gradient normalization with last-layer-only momentum, using 35-45% of Adam's memory.

  • Adaptive Federated Optimization · cs.LG · 2020-02-29 · unverdicted · none · ref 174 · internal anchor

    Proposes federated adaptive optimizers (FedAdagrad, FedAdam, FedYogi) with convergence analysis for non-convex objectives under data heterogeneity and reports empirical gains over FedAvg.

  • Accelerated Gradient Descent for Faster Convergence with Minimal Overhead · cs.LG · 2026-05-15 · unverdicted · none · ref 69 · internal anchor

    CT-AGD accelerates first-order optimization in deep learning by using finite-difference curvature estimates and noise-mitigation heuristics, achieving equivalent accuracy with 33% fewer training epochs and overhead comparable to Adam.