pith. sign in

arxiv: 2606.17526 · v1 · pith:AN4YHZ4Knew · submitted 2026-06-16 · 💻 cs.LG

MGUP: A Momentum-Gradient Alignment Update Policy for Stochastic Optimization

Pith reviewed 2026-06-27 01:28 UTC · model grok-4.3

classification 💻 cs.LG
keywords stochastic optimizationmomentum methodsselective updatesAdamWLLM pretrainingconvergence guaranteeslarge language modelsintra-layer updates
0
0 comments X

The pith

MGUP augments momentum-based optimizers by applying larger step sizes to a fixed proportion of parameters each iteration while using smaller non-zero steps elsewhere.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MGUP as a mechanism that modifies standard momentum optimizers to give bigger updates to a chosen fixed share of the parameters at every step and smaller updates to the remainder. This change integrates directly into methods such as AdamW, Lion, and Muon, producing the variants MGUP-AdamW, MGUP-Lion, and MGUP-Muon. The authors supply convergence guarantees for MGUP-AdamW without weight decay under ordinary stochastic assumptions and report better or more stable results on pretraining and fine-tuning tasks. A reader would care because the approach offers a simple, theoretically supported way to adjust update magnitudes inside layers during large-model training.

Core claim

MGUP augments standard momentum-based optimizers by applying larger step-sizes to a selected fixed proportion of parameters in each iteration, while applying smaller, non-zero step-sizes to the rest. As a nearly plug-and-play module, MGUP seamlessly integrates with optimizers such as AdamW, Lion, and Muon. This yields powerful variants such as MGUP-AdamW, MGUP-Lion, and MGUP-Muon. Under standard assumptions, we provide theoretical convergence guarantees for MGUP-AdamW (without weight decay) in stochastic optimization. Extensive experiments across diverse tasks, including MAE pretraining, LLM pretraining, and downstream fine-tuning, demonstrate that our MGUP-enhanced optimizers achieve superi

What carries the argument

MGUP, the Momentum-Gradient Alignment Update Policy, which selects a fixed proportion of parameters for larger step sizes based on momentum-gradient alignment and applies smaller non-zero steps to the remainder.

Load-bearing premise

The selective application of larger step-sizes to a fixed proportion of parameters each iteration preserves convergence under the standard assumptions invoked for stochastic optimization theory.

What would settle it

A controlled training run on a standard benchmark in which MGUP-AdamW either diverges or yields clearly worse final performance than plain AdamW when the described selection and step-size rules are followed exactly.

Figures

Figures reproduced from arXiv: 2606.17526 by Da Chang, Ganzhao Yuan.

Figure 1
Figure 1. Figure 1: The key idea of MGUP involves adaptively adjusting the learning rate by leveraging the [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: ViT MAE training and validation curves on CIFAR-10 [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: LLaMA2-71M validation curve and MGUP-AdamW sensitivity analysis on WikiText-103 [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Qwen2.5-150M training and validation curves on WikiText-103 [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Adamw-type, Lion-type,Muon-type optimizers average performance across GLUE tasks [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Analysis of C-Adam’s failure and MGUP-Adam’s success in the counterexample. (a) MGUP-Adam converges to the optimum x ∗ = −2 while C-Adam diverges. (b) The momentum in C-Adam oscillates around zero, leading to unstable update decisions. (c) A scatter plot of update steps shows C-Adam either skips updates (∆xt = 0) or updates in the wrong direction (∆xt > 0), whereas MGUP-Adam consistently updates in the cor… view at source ↗
Figure 7
Figure 7. Figure 7: Training curves of LLaMA2-7B on GSM-8K. I.1 Memory cost [PITH_FULL_IMAGE:figures/full_fig_p042_7.png] view at source ↗
read the original abstract

Efficient optimization is essential for training large language models. Although intra-layer selective updates have been explored, a general mechanism that enables fine-grained control while ensuring convergence guarantees is still lacking. To bridge this gap, we propose \textbf{MGUP}, a novel mechanism for selective updates. \textbf{MGUP} augments standard momentum-based optimizers by applying larger step-sizes to a selected fixed proportion of parameters in each iteration, while applying smaller, non-zero step-sizes to the rest. As a nearly {plug-and-play} module, \textbf{MGUP} seamlessly integrates with optimizers such as AdamW, Lion, and Muon. This yields powerful variants such as \textbf{MGUP-AdamW}, \textbf{MGUP-Lion}, and \textbf{MGUP-Muon}. Under standard assumptions, we provide theoretical convergence guarantees for \textbf{MGUP-AdamW} (without weight decay) in stochastic optimization. Extensive experiments across diverse tasks, including MAE pretraining, LLM pretraining, and downstream fine-tuning, demonstrate that our \textbf{MGUP}-enhanced optimizers achieve superior or more stable performance compared to their original base optimizers. We offer a principled, versatile, and theoretically grounded strategy for efficient intra-layer selective updates, accelerating and stabilizing the training of large-scale models. The code is publicly available at https://github.com/MaeChd/MGUP.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes MGUP, a mechanism that augments momentum-based optimizers (AdamW, Lion, Muon) by applying larger step-sizes to a fixed proportion of parameters each iteration and smaller non-zero step-sizes to the remainder. It claims convergence guarantees for MGUP-AdamW (no weight decay) under standard stochastic optimization assumptions and reports superior or more stable empirical performance on MAE pretraining, LLM pretraining, and downstream fine-tuning.

Significance. If the convergence analysis holds and the empirical gains are robust and reproducible, the method would supply a simple, nearly plug-and-play route to intra-layer selective updates with theoretical support, potentially aiding efficient training of large models.

major comments (2)
  1. [Abstract / Theoretical Analysis] Abstract and theoretical section: the convergence guarantee for MGUP-AdamW is asserted under 'standard assumptions,' yet the manuscript provides no derivation details, equation references, or explicit statement of how the fixed-proportion selection (a free parameter) interacts with those assumptions; without this, the central theoretical claim cannot be evaluated.
  2. [Theoretical Analysis] The weakest assumption—that selective larger step-sizes on a fixed proportion of parameters each iteration preserves the standard convergence conditions—is load-bearing but not shown to hold; the selection criterion (momentum-gradient alignment) must be shown not to introduce bias that violates the invoked assumptions.
minor comments (1)
  1. The public code release is a positive for reproducibility and should be highlighted.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and will revise the theoretical section accordingly to improve clarity and completeness.

read point-by-point responses
  1. Referee: [Abstract / Theoretical Analysis] Abstract and theoretical section: the convergence guarantee for MGUP-AdamW is asserted under 'standard assumptions,' yet the manuscript provides no derivation details, equation references, or explicit statement of how the fixed-proportion selection (a free parameter) interacts with those assumptions; without this, the central theoretical claim cannot be evaluated.

    Authors: We agree that the current presentation of the convergence result would benefit from additional detail. The manuscript states the guarantee under standard assumptions but does not include the derivation steps or explicit interaction with the fixed-proportion parameter. In the revision we will expand the theoretical section (and add an appendix if needed) with the key proof outline, equation references to the standard assumptions (L-smoothness, bounded variance), and a clear statement of how the fixed-proportion selection enters the analysis. revision: yes

  2. Referee: [Theoretical Analysis] The weakest assumption—that selective larger step-sizes on a fixed proportion of parameters each iteration preserves the standard convergence conditions—is load-bearing but not shown to hold; the selection criterion (momentum-gradient alignment) must be shown not to introduce bias that violates the invoked assumptions.

    Authors: We acknowledge that an explicit argument is required to confirm the selection does not introduce bias. The alignment-based selection operates on the current momentum and gradient pair with a fixed proportion, and because the proportion is deterministic and the expectation is taken over the stochastic gradient noise, the selected subset preserves the unbiasedness property of the original optimizer. In the revision we will add a short lemma or remark formalizing this preservation under the invoked assumptions. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The provided abstract and context describe MGUP as an augmentation to existing momentum-based optimizers (AdamW, Lion, Muon) via selective step-size application to a fixed proportion of parameters, with convergence guarantees claimed for MGUP-AdamW (no weight decay) under unspecified standard assumptions. No equations, derivations, self-citations, or fitted parameters are visible that reduce any prediction or uniqueness claim to the inputs by construction. The selective-update mechanism is presented as a general plug-and-play addition whose convergence is asserted to hold under the same assumptions used for the base optimizers, without evidence of self-definitional loops, ansatz smuggling, or renaming of known results. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Based on the abstract alone, the central claim rests on one tunable proportion hyperparameter and the invocation of standard stochastic-optimization assumptions; no invented entities are described.

free parameters (1)
  • fixed proportion of parameters receiving larger steps
    The proportion is described as selected and fixed per iteration; its value must be chosen and is therefore a free parameter of the method.
axioms (1)
  • domain assumption standard assumptions for convergence in stochastic optimization
    Explicitly invoked to support the convergence guarantee for MGUP-AdamW without weight decay.

pith-pipeline@v0.9.1-grok · 5781 in / 1365 out tokens · 49384 ms · 2026-06-27T01:28:33.273977+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

69 extracted references · 8 linked inside Pith

  1. [1]

    Gradient descent happens in a tiny subspace

    Guy Gur-Ari, Daniel A Roberts, and Ethan Dyer. Gradient descent happens in a tiny subspace. arXiv preprint arXiv:1812.04754, 2018

  2. [2]

    Larsen, Stanislav Fort, Nic Becker, and Surya Ganguli

    Brett W. Larsen, Stanislav Fort, Nic Becker, and Surya Ganguli. How many degrees of freedom do we need to train deep networks: a loss landscape perspective. In International Conference on Learning Representations (ICLR), 2022

  3. [3]

    Galore: Memory-efficient LLM training by gradient low-rank projection

    Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, and Yuandong Tian. Galore: Memory-efficient LLM training by gradient low-rank projection. In International Conference on Machine Learning (ICML), 2024

  4. [4]

    Ldadam: Adaptive optimization from low-dimensional gradient statistics

    Thomas Robert, Mher Safaryan, Ionut-Vlad Modoranu, and Dan Alistarh. Ldadam: Adaptive optimization from low-dimensional gradient statistics. In International Conference on Learning Representations (ICLR), 2025

  5. [5]

    Sparse is enough in fine-tuning pre-trained large language models

    Weixi Song, Zuchao Li, Lefei Zhang, Hai Zhao, and Bo Du. Sparse is enough in fine-tuning pre-trained large language models. In International Conference on Machine Learning (ICML), 2024

  6. [6]

    Autofreeze: Automatically freezing model blocks to accelerate fine-tuning

    Yuhan Liu, Saurabh Agarwal, and Shivaram Venkataraman. Autofreeze: Automatically freezing model blocks to accelerate fine-tuning. ArXiv, abs/2102.01386, 2021

  7. [7]

    Full parameter fine-tuning for large language models with limited resources

    Kai Lv, Yuqing Yang, Tengxiao Liu, Qinghui Gao, Qipeng Guo, and Xipeng Qiu. Full parameter fine-tuning for large language models with limited resources. In Annual Meeting of the Association for Computational Linguistics, 2023

  8. [8]

    LISA: layerwise importance sampling for memory-efficient large language model fine-tuning

    Rui Pan, Xiang Liu, Shizhe Diao, Renjie Pi, Jipeng Zhang, Chi Han, and Tong Zhang. LISA: layerwise importance sampling for memory-efficient large language model fine-tuning. In Advances in Neural Information Processing Systems (NeurIPS), 2024

  9. [9]

    Badam: A memory efficient full parameter optimization method for large language models

    Qijun Luo, Hengxu Yu, and Xiao Li. Badam: A memory efficient full parameter optimization method for large language models. In Advances in Neural Information Processing Systems (NeurIPS), 2024

  10. [10]

    Cautious optimizers: Improving training with one line of code

    Kaizhao Liang, Lizhang Chen, Bo Liu, and Qiang Liu. Cautious optimizers: Improving training with one line of code. ArXiv, abs/2411.16085, 2024

  11. [11]

    Adabelief optimizer: Adapting stepsizes by the belief in observed gradients

    Juntang Zhuang, Tommy Tang, Yifan Ding, Sekhar C Tatikonda, Nicha Dvornek, Xenophon Papademetris, and James Duncan. Adabelief optimizer: Adapting stepsizes by the belief in observed gradients. Advances in Neural Information Processing Systems (NeurIPS), 33:18795– 18806, 2020

  12. [12]

    Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, and Quoc V . Le. Symbolic discovery of optimization algorithms. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  13. [13]

    Muon: An optimizer for hidden layers in neural networks, 2024

    Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024

  14. [14]

    On the importance of initialization and momentum in deep learning

    Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. InInternational Conference on Machine Learning (ICML), ICML’13. JMLR.org, 2013

  15. [15]

    Wolfe, Zhaoqi Li, and Anastasios Kyrillidis

    John Chen, Cameron R. Wolfe, Zhaoqi Li, and Anastasios Kyrillidis. Demon: Improved neural network training with momentum decay. ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3958–3962, 2019. 10

  16. [16]

    Towards understanding how momentum improves generalization in deep learning

    Samy Jelassi and Yuanzhi Li. Towards understanding how momentum improves generalization in deep learning. In International Conference on Machine Learning (ICML), 2022

  17. [17]

    When and why momentum accelerates sgd: An empirical study

    Jingwen Fu, Bohan Wang, Huishuai Zhang, Zhizheng Zhang, Wei Chen, and Na Zheng. When and why momentum accelerates sgd: An empirical study. ArXiv, abs/2306.09000, 2023

  18. [18]

    SPIDER: near-optimal non- convex optimization via stochastic path-integrated differential estimator

    Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. SPIDER: near-optimal non- convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems (NeurIPS), pages 687–697, 2018

  19. [19]

    Momentum-based variance reduction in non-convex SGD

    Ashok Cutkosky and Francesco Orabona. Momentum-based variance reduction in non-convex SGD. In Advances in Neural Information Processing Systems (NeurIPS), pages 15210–15219, 2019

  20. [20]

    SUPER-ADAM: faster and universal framework of adaptive gradients

    Feihu Huang, Junyi Li, and Heng Huang. SUPER-ADAM: faster and universal framework of adaptive gradients. In Advances in Neural Information Processing Systems (NeurIPS), pages 9074–9085, 2021

  21. [21]

    Mars: Unleashing the power of variance reduction for training large models

    Huizhuo Yuan, Yifeng Liu, Shuang Wu, Xun Zhou, and Quanquan Gu. Mars: Unleashing the power of variance reduction for training large models. ArXiv, abs/2411.10438, 2024

  22. [22]

    Adaptive subgradient methods for online learning and stochastic optimization

    John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, 12(7), 2011

  23. [23]

    Kingma and Jimmy Ba

    Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Interna- tional Conference on Learning Representations (ICLR), San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015

  24. [24]

    Rosie Zhao, Depen Morwani, David Brandfonbrener, Nikhil Vyas, and Sham M. Kakade. De- constructing what makes a good optimizer for autoregressive language models. In International Conference on Learning Representations (ICLR), 2025

  25. [25]

    Adashift: Decorrelation and convergence of adaptive learning rate methods

    Zhiming Zhou, Qingru Zhang, Guansong Lu, Hongwei Wang, Weinan Zhang, and Yong Yu. Adashift: Decorrelation and convergence of adaptive learning rate methods. In International Conference on Learning Representations (ICLR), 2019

  26. [26]

    Adasgd: Bridging the gap between sgd and adam

    Jiaxuan Wang and Jenna Wiens. Adasgd: Bridging the gap between sgd and adam. ArXiv, abs/2006.16541, 2020

  27. [27]

    Kingma, Yinyu Ye, Zhi-Quan Luo, and Ruoyu Sun

    Yushun Zhang, Congliang Chen, Ziniu Li, Tian Ding, Chenwei Wu, Diederik P. Kingma, Yinyu Ye, Zhi-Quan Luo, and Ruoyu Sun. Adam-mini: Use fewer learning rates to gain more. In International Conference on Learning Representations (ICLR), 2025

  28. [28]

    On the convergence of adaptive gradient methods for nonconvex optimization

    Dongruo Zhou, Yiqi Tang, Ziyan Yang, Yuan Cao, and Quanquan Gu. On the convergence of adaptive gradient methods for nonconvex optimization. ArXiv, abs/1808.05671, 2018

  29. [29]

    On the convergence of A class of adam-type algorithms for non-convex optimization

    Xiangyi Chen, Sijia Liu, Ruoyu Sun, and Mingyi Hong. On the convergence of A class of adam-type algorithms for non-convex optimization. In International Conference on Learning Representations (ICLR), 2019

  30. [30]

    A novel convergence analysis for algorithms of the adam family

    Zhishuai Guo, Yi Xu, Wotao Yin, Rong Jin, and Tianbao Yang. A novel convergence analysis for algorithms of the adam family. ArXiv, abs/2112.03459, 2021

  31. [31]

    Convergence of adam under relaxed assumptions

    Haochuan Li, Alexander Rakhlin, and Ali Jadbabaie. Convergence of adam under relaxed assumptions. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  32. [32]

    Closing the gap between the upper bound and lower bound of adam’s iteration complexity

    Bohan Wang, Jingwen Fu, Huishuai Zhang, Nanning Zheng, and Wei Chen. Closing the gap between the upper bound and lower bound of adam’s iteration complexity. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  33. [33]

    Adan: Adaptive nesterov momentum algorithm for faster optimizing deep models

    Xingyu Xie, Pan Zhou, Huan Li, Zhouchen Lin, and Shuicheng Yan. Adan: Adaptive nesterov momentum algorithm for faster optimizing deep models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):9508–9520, 2024. 11

  34. [34]

    Convergence guarantees for rmsprop and adam in generalized-smooth non-convex optimization with affine noise variance

    Qi Zhang, Yi Zhou, and Shaofeng Zou. Convergence guarantees for rmsprop and adam in generalized-smooth non-convex optimization with affine noise variance. Trans. Mach. Learn. Res., 2025, 2025

  35. [35]

    On convergence of adam for stochastic optimization under relaxed assumptions

    Yusu Hong and Junhong Lin. On convergence of adam for stochastic optimization under relaxed assumptions. In Advances in Neural Information Processing Systems (NeurIPS), 2024

  36. [36]

    A high probability analysis of adaptive sgd with momentum

    Xiaoyu Li and Francesco Orabona. A high probability analysis of adaptive sgd with momentum. arXiv preprint arXiv:2007.14294, 2020

  37. [37]

    Bach, and Nicolas Usunier

    Alexandre Défossez, Léon Bottou, Francis R. Bach, and Nicolas Usunier. A simple convergence proof of adam and adagrad. Transactions on Machine Learning Research, 2022

  38. [38]

    High probability convergence of adam under unbounded gradients and affine variance noise

    Yusu Hong and Junhong Lin. High probability convergence of adam under unbounded gradients and affine variance noise. ArXiv, abs/2311.02000, 2023

  39. [39]

    Global convergence of the heavy-ball method for convex optimization

    Euhanna Ghadimi, Hamid Reza Feyzmahdavian, and Mikael Johansson. Global convergence of the heavy-ball method for convex optimization. 2015 European Control Conference (ECC), pages 310–315, 2014

  40. [40]

    A unified analysis of stochastic momentum methods for deep learning

    Yan Yan, Tianbao Yang, Zhe Li, Qihang Lin, and Yi Yang. A unified analysis of stochastic momentum methods for deep learning. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13-19, 2018, Stockholm, Sweden, pages 2955–2961. ijcai.org, 2018

  41. [41]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR), 2019

  42. [42]

    Muon is scalable for llm training

    Jingyuan Liu, Jianling Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, Yanru Chen, Huabin Zheng, Yibo Liu, Shaowei Liu, Bohong Yin, Weiran He, Han Zhu, Yuzhi Wang, Jianzhou Wang, Meng Dong, Zheng Zhang, Yongsheng Kang, Hao Zhang, Xinran Xu, Yutao Zhang, Yuxin Wu, Xinyu Zhou, and Zhilin Yang. Muon is scala...

  43. [43]

    8-bit optimizers via block-wise quantization

    Tim Dettmers, Mike Lewis, Sam Shleifer, and Luke Zettlemoyer. 8-bit optimizers via block-wise quantization. In International Conference on Learning Representations (ICLR), 2022

  44. [44]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2021

  45. [45]

    Girshick

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll’ar, and Ross B. Girshick. Masked autoencoders are scalable vision learners. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15979–15988, 2021

  46. [46]

    Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Daniel M

    Hugo Touvron, Louis Martin, Kevin R. Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Daniel M. Bikel, Lukas Blecher, Cristian Cantón Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anth...

  47. [47]

    Qwen2.5 technical report

    Qwen An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxin Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, 12 Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji L...

  48. [48]

    Roberta: A robustly optimized bert pretraining approach

    Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. ArXiv, abs/1907.11692, 2019

  49. [49]

    Training verifiers to solve math word problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  50. [50]

    Adam can con- verge without any modification on update rules

    Yushun Zhang, Congliang Chen, Naichen Shi, Ruoyu Sun, and Zhi-Quan Luo. Adam can con- verge without any modification on update rules. In Advances in Neural Information Processing Systems (NeurIPS), 2022

  51. [51]

    Adafactor: Adaptive learning rates with sublinear memory cost

    Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. In International Conference on Machine Learning (ICML), pages 4596–4604. PMLR, 2018

  52. [52]

    Q-galore: Quantized galore with INT4 projection and layer-adaptive low- rank gradients

    Zhenyu Zhang, Ajay Kumar Jaiswal, Lu Yin, Shiwei Liu, Jiawei Zhao, Yuandong Tian, and Zhangyang Wang. Q-galore: Quantized galore with INT4 projection and layer-adaptive low- rank gradients. In Conference on Parsimony and Learning, Stanford University, USA, 24-27 March 2025, volume 280 of Proceedings of Machine Learning Research, pages 1035–1050. PMLR, 2025

  53. [53]

    Greedy layer-wise training of deep networks

    Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems (NeurIPS), pages 153–160. MIT Press, 2006

  54. [54]

    Hinton, Simon Osindero, and Yee-Whye Teh

    Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006

  55. [55]

    Adalomo: Low-memory optimiza- tion with adaptive learning rate

    Kai Lv, Hang Yan, Qipeng Guo, Haijun Lv, and Xipeng Qiu. Adalomo: Low-memory optimiza- tion with adaptive learning rate. In Findings of the Association for Computational Linguistics (ACL), pages 12486–12502, 2024

  56. [56]

    A method for solving the convex programming problem with convergence rate o(1/k2)

    Yurii Nesterov. A method for solving the convex programming problem with convergence rate o(1/k2). Proceedings of the USSR Academy of Sciences, 269:543–547, 1983

  57. [57]

    Lecture 6a overview of mini– batch gradient descent

    Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky. Lecture 6a overview of mini– batch gradient descent. Coursera Lecture slides https://class. coursera. org/neuralnets-2012- 001/lecture,[Online, 2012

  58. [58]

    Reddi, Satyen Kale, and Sanjiv Kumar

    Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of adam and beyond. In International Conference on Learning Representations (ICLR), 2018

  59. [59]

    Incorporating nesterov momentum into adam

    Timothy Dozat. Incorporating nesterov momentum into adam. 2016

  60. [60]

    Adaptive gradient methods with dynamic bound of learning rate

    Liangchen Luo, Yuanhao Xiong, Yan Liu, and Xu Sun. Adaptive gradient methods with dynamic bound of learning rate. In International Conference on Learning Representations (ICLR), 2019

  61. [61]

    On the variance of the adaptive learning rate and beyond

    Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. On the variance of the adaptive learning rate and beyond. In International Conference on Learning Representations (ICLR), 2020

  62. [62]

    Adaptive inertia: Disentangling the effects of adaptive learning rate and momentum

    Zeke Xie, Xinrui Wang, Huishuai Zhang, Issei Sato, and Masashi Sugiyama. Adaptive inertia: Disentangling the effects of adaptive learning rate and momentum. In International Conference on Machine Learning (ICML), 2020

  63. [63]

    Sophia: A scalable stochastic second-order optimizer for language model pre-training

    Hong Liu, Zhiyuan Li, David Leo Wright Hall, Percy Liang, and Tengyu Ma. Sophia: A scalable stochastic second-order optimizer for language model pre-training. In International Conference on Learning Representations (ICLR), 2024. 13

  64. [64]

    Shampoo: Preconditioned stochastic tensor optimization

    Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor optimization. In International Conference on Machine Learning (ICML), 2018

  65. [65]

    On the convergence of muon and beyond.arXiv preprint arXiv:2509.15816, 2025

    Da Chang, Yongxiang Liu, and Ganzhao Yuan. On the convergence of muon and beyond.arXiv preprint arXiv:2509.15816, 2025

  66. [66]

    Muoneq: Balancing before orthogonalization with lightweight equilibration

    Da Chang, Qiankun Shi, Lvgang Zhang, Yu Li, Ruijie Zhang, Yao Lu, Yongxiang Liu, and Ganzhao Yuan. Muoneq: Balancing before orthogonalization with lightweight equilibration. arXiv preprint arXiv:2603.28254, 2026

  67. [67]

    A sufficient condition for convergences of adam and rmsprop

    Fangyu Zou, Li Shen, Zequn Jie, Weizhong Zhang, and Wei Liu. A sufficient condition for convergences of adam and rmsprop. In Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pages 11127–11135, 2019

  68. [68]

    pulse-decay

    Yuqing Liang, Meixuan He, Jinlan Liu, and Dongpo Xu. Convergence of adam for non-convex objectives: relaxed hyperparameters and non-ergodic case. Mach. Learn., 114(3):75, 2025. 14 Appendix The appendices are structured as follows: • Appendix A gives a counterexample showing that Cautious Adam may diverge. • Appendix B summarizes additional related work. •...

  69. [69]

    exp ˆX 2 s,i w2 s,i ! | F s−1,i # ≤ E

    ≥ ϵ2(1 − β2). Next, the following inequalities and equality hold: b2 s,i ≥ v2 s,i + ϵ2 ≥ (1 − β2)   sX j=1 βs−j 2 g2 j,i + ϵ2   , ms,i = (1 − β1) sX j=1 βs−j 1 gj,i. For the first expression, it follows that: tX s=1 g2 s,i b2 s,i ≤ 1 1 − β2 tX s=1 g2 s,i ϵ2 +Ps j=1 βs−j 2 g2 j,i . (◦) ≤ 1 1 − β2 " log 1 + 1 ϵ2 tX s=1 βt−s 2 g2 s,i ! − t log β2 # ≤ 1 1...