Steepest descent under divergence-induced quadratic models is equivalent to a linear-quadratic regulator (LQR) problem, which enables learning diagonal or Kronecker-factored inverse preconditioners via a global layerwise objective for scalable geometry-aware training.
Minghao Xu, Lichuan Xiang, Xu Cai, and Hongkai Wen
4 Pith papers cite this work.
fields
cs.LG 4
representative citing papers
MoLS scales Adam updates using module-level SNR estimates to correct gradient noise imbalance and improve LLM training convergence and generalization.
SCALE matches Adam performance in LLM pretraining from 60M to 7B parameters by combining column-wise gradient normalization with last-layer-only momentum, using 35-45% of Adam's memory.
A retrospective survey and empirical evaluation of deep learning optimization algorithms that identifies trends, design trade-offs, and future directions.
citing papers explorer
- Layerwise LQR for Geometry-Aware Optimization of Deep Networks
Steepest descent under divergence-induced quadratic models is equivalent to a linear-quadratic regulator (LQR) problem, which enables learning diagonal or Kronecker-factored inverse preconditioners via a global layerwise objective for scalable geometry-aware training.
- Revealing Modular Gradient Noise Imbalance in LLMs: Calibrating Adam via Signal-to-Noise Ratio
MoLS scales Adam updates using module-level SNR estimates to correct gradient noise imbalance and improve LLM training convergence and generalization.
- Memory-Efficient LLM Pretraining via Minimalist Optimizer Design
SCALE matches Adam performance in LLM pretraining from 60M to 7B parameters by combining column-wise gradient normalization with last-layer-only momentum, using 35-45% of Adam's memory.
- Evolution of Optimization Methods: Algorithms, Scenarios, and Evaluations
A retrospective survey and empirical evaluation of deep learning optimization algorithms that identifies trends, design trade-offs, and future directions.