Cosmos: A hybrid adaptive optimizer for memory-efficient training of llms.arXiv preprint arXiv:2502.17410

Liming Liu, Zhenghao Xu, Zixuan Zhang, Hao Kang, Zichong Li, Chen Liang, Weizhu Chen, Tuo Zhao · 2025 · arXiv 2502.17410

6 Pith papers cite this work. Polarity classification is still indexing.

6 Pith papers citing it

read on arXiv browse 6 citing papers

representative citing papers

A unified convergence theory for adaptive first-order methods in the nonconvex case, including AdaNorm, full and diagonal AdaGrad, Shampoo and Muo

cs.LG · 2026-04-19 · unverdicted · novelty 7.0

A unified stochastic convergence theory is developed for adaptive preconditioned first-order methods including AdaGrad variants, Shampoo, and Muon in nonconvex optimization.

On the Convergence of Muon and Beyond

cs.LG · 2025-09-19 · unverdicted · novelty 7.0

Muon-MVR2 attains the optimal anytime convergence rate of ~O(T^{-1/3}) in stochastic non-convex settings under horizon-free schedules.

Pro-KLShampoo: Projected KL-Shampoo with Whitening Recovered by Orthogonalization

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

Pro-KLShampoo projects KL-Shampoo preconditioners to a spike-and-flat parametric form on an r-dimensional subspace and recovers the full algebraic preconditioner via orthogonalization, outperforming KL-Shampoo on GPT-2 and LLaMA pre-training scales.

Budget-aware Auto Optimizer Configurator

cs.AI · 2026-05-06 · unverdicted · novelty 6.0

BAOC samples gradient streams to compute per-block risk metrics for cheap optimizer configs then solves a constrained optimization to minimize total risk under memory and time budgets while preserving training quality.

RMNP: Row-Momentum Normalized Preconditioning for Scalable Matrix-Based Optimization

cs.LG · 2026-03-20 · conditional · novelty 5.0

RMNP preconditions matrix updates via row-wise L2 normalization instead of Newton-Schulz iteration, reducing complexity to O(mn) while matching Muon's non-convex convergence rate and empirical performance.

Low-rank Orthogonalization for Large-scale Matrix Optimization with Applications to Foundation Model Training

cs.LG · 2025-09-15 · unverdicted · novelty 5.0

Proposes low-rank orthogonalization and derives low-rank Muon and MSGD variants that outperform standard Muon on GPT-2 and LLaMA pretraining while providing iteration complexity bounds.

citing papers explorer

Showing 6 of 6 citing papers.

A unified convergence theory for adaptive first-order methods in the nonconvex case, including AdaNorm, full and diagonal AdaGrad, Shampoo and Muo cs.LG · 2026-04-19 · unverdicted · none · ref 35
A unified stochastic convergence theory is developed for adaptive preconditioned first-order methods including AdaGrad variants, Shampoo, and Muon in nonconvex optimization.
On the Convergence of Muon and Beyond cs.LG · 2025-09-19 · unverdicted · none · ref 32
Muon-MVR2 attains the optimal anytime convergence rate of ~O(T^{-1/3}) in stochastic non-convex settings under horizon-free schedules.
Pro-KLShampoo: Projected KL-Shampoo with Whitening Recovered by Orthogonalization cs.LG · 2026-05-07 · unverdicted · none · ref 11
Pro-KLShampoo projects KL-Shampoo preconditioners to a spike-and-flat parametric form on an r-dimensional subspace and recovers the full algebraic preconditioner via orthogonalization, outperforming KL-Shampoo on GPT-2 and LLaMA pre-training scales.
Budget-aware Auto Optimizer Configurator cs.AI · 2026-05-06 · unverdicted · none · ref 17
BAOC samples gradient streams to compute per-block risk metrics for cheap optimizer configs then solves a constrained optimization to minimize total risk under memory and time budgets while preserving training quality.
RMNP: Row-Momentum Normalized Preconditioning for Scalable Matrix-Based Optimization cs.LG · 2026-03-20 · conditional · none · ref 15
RMNP preconditions matrix updates via row-wise L2 normalization instead of Newton-Schulz iteration, reducing complexity to O(mn) while matching Muon's non-convex convergence rate and empirical performance.
Low-rank Orthogonalization for Large-scale Matrix Optimization with Applications to Foundation Model Training cs.LG · 2025-09-15 · unverdicted · none · ref 36
Proposes low-rank orthogonalization and derives low-rank Muon and MSGD variants that outperform standard Muon on GPT-2 and LLaMA pretraining while providing iteration complexity bounds.

Cosmos: A hybrid adaptive optimizer for memory-efficient training of llms.arXiv preprint arXiv:2502.17410

fields

years

verdicts

representative citing papers

citing papers explorer