Convergence Rate Analysis of LION

Huan Li; Yiming Dong; Zhouchen Lin

arxiv: 2411.07724 · v2 · pith:2ISEK7MSnew · submitted 2024-11-12 · 💻 cs.LG · math.OC

Convergence Rate Analysis of LION

Yiming Dong , Huan Li , Zhouchen Lin This is my paper

classification 💻 cs.LG math.OC

keywords rateconvergencelionproblemgradientlowernormanalysis

0 comments

read the original abstract

The LION (evoLved sIgn mOmeNtum) optimizer for deep neural network training was found by Google via program search, with the simple sign update yet showing impressive performance in training large scale networks. Although previous studies have investigated its convergence properties, a comprehensive analysis, especially the convergence rate, is still desirable. Recognizing that LION can be regarded as solving a specific constrained problem, this paper focuses on demonstrating its convergence to the Karush-Kuhn-Tucker (KKT) point at the rate of $\cal O(\sqrt{d}K^{-1/4})$ measured by gradient $\ell_1$ norm, where $d$ is the problem dimension and $K$ is the number of iteration steps. Step further, we remove the constraint and establish that LION converges to the critical point of the general unconstrained problem at the same rate. This rate not only delivers the currently optimal dependence on the problem dimension $d$ but also tightly matches the theoretical lower bound for nonconvex stochastic optimization algorithms, which is typically measured using the gradient $\ell_2$ norm, with respect to the number of iterations $K$. Through extensive experiments, we not only demonstrate that LION achieves lower loss and higher performance compared to standard SGD, but also empirically confirm that the gradient $\ell_1/\ell_2$ norm ratio aligns with $\Theta(\sqrt{d})$, thus proving that our convergence rate matches the theoretical lower bound with respect to $d$ in the empirical sense.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds
cs.LG 2026-05 unverdicted novelty 8.0

SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices, and this predicts faster GPT-2 pretraining.
Unlearning with Asymmetric Sources: Improved Unlearning-Utility Trade-off with Public Data
cs.LG 2026-05 unverdicted novelty 7.0

ALU uses public data to suppress unlearning cost quadratically while characterizing distribution mismatch effects, enabling mass unlearning with maintained utility.
Unlearning with Asymmetric Sources: Improved Unlearning-Utility Trade-off with Public Data
cs.LG 2026-05 unverdicted novelty 7.0

Asymmetric Langevin Unlearning uses public data to suppress unlearning noise costs by O(1/n_pub²), enabling practical mass unlearning with preserved utility under distribution mismatch.
Accelerating LMO-Based Optimization via Implicit Gradient Transport
cs.LG 2026-05 unverdicted novelty 7.0

LMO-IGT achieves O(ε^{-3.5}) iteration complexity for stochastic LMO optimization via implicit gradient transport with a single gradient per step and introduces the regularized support function as a unified stationari...
StoSignSGD: Unbiased Structural Stochasticity Fixes SignSGD for Training Large Language Models
cs.LG 2026-04 unverdicted novelty 6.0

StoSignSGD resolves SignSGD divergence on non-smooth objectives via structural stochasticity, matching optimal convex rates and improving non-convex bounds while delivering 1.44-2.14x speedups in FP8 LLM pretraining.
CLion: Efficient Cautious Lion Optimizer with Enhanced Generalization
cs.LG 2026-04 unverdicted novelty 6.0

CLion achieves O(1/N) generalization error and O(√d / T^{1/4}) convergence for nonconvex stochastic optimization, improving on Lion's O(1/(N τ^T)) bound.
MLorc: Momentum Low-rank Compression for Memory Efficient Large Language Model Adaptation
cs.LG 2025-06 conditional novelty 6.0

MLorc compresses optimizer momentum with low-rank methods to enable memory-efficient full fine-tuning of LLMs, outperforming LoRA and GaLore while matching full-parameter performance at small ranks.