Introduces a geometry-based framework for comparison-oracle optimization, with O(d log(d/ε)) comparisons for normal direction estimation and Õ(d D²/ε²) comparisons to reach ε level-set optimality gap under regularity, convexity, and growth conditions.
The Power of Normalization: Faster Evasion of Saddle Points
4 Pith papers cite this work. Polarity classification is still indexing.
abstract
A commonly used heuristic in non-convex optimization is Normalized Gradient Descent (NGD) - a variant of gradient descent in which only the direction of the gradient is taken into account and its magnitude ignored. We analyze this heuristic and show that with carefully chosen parameters and noise injection, this method can provably evade saddle points. We establish the convergence of NGD to a local minimum, and demonstrate rates which improve upon the fastest known first order algorithm due to Ge e al. (2015). The effectiveness of our method is demonstrated via an application to the problem of online tensor decomposition; a task for which saddle point evasion is known to result in convergence to global minima.
representative citing papers
SCALE matches Adam performance in LLM pretraining from 60M to 7B parameters by combining column-wise gradient normalization with last-layer-only momentum, using 35-45% of Adam's memory.
Proposes federated adaptive optimizers (FedAdagrad, FedAdam, FedYogi) with convergence analysis for non-convex objectives under data heterogeneity and reports empirical gains over FedAvg.
CT-AGD accelerates first-order optimization in deep learning by using finite-difference curvature estimates and noise-mitigation heuristics, achieving equivalent accuracy with 33% fewer training epochs and overhead comparable to Adam.
citing papers explorer
-
Function-free Optimization via Comparison Oracles
Introduces a geometry-based framework for comparison-oracle optimization, with O(d log(d/ε)) comparisons for normal direction estimation and Õ(d D²/ε²) comparisons to reach ε level-set optimality gap under regularity, convexity, and growth conditions.
-
Memory-Efficient LLM Pretraining via Minimalist Optimizer Design
SCALE matches Adam performance in LLM pretraining from 60M to 7B parameters by combining column-wise gradient normalization with last-layer-only momentum, using 35-45% of Adam's memory.
-
Adaptive Federated Optimization
Proposes federated adaptive optimizers (FedAdagrad, FedAdam, FedYogi) with convergence analysis for non-convex objectives under data heterogeneity and reports empirical gains over FedAvg.
-
Accelerated Gradient Descent for Faster Convergence with Minimal Overhead
CT-AGD accelerates first-order optimization in deep learning by using finite-difference curvature estimates and noise-mitigation heuristics, achieving equivalent accuracy with 33% fewer training epochs and overhead comparable to Adam.