pith. machine review for the scientific record.

arxiv: 2604.12968 · v1 · submitted 2026-04-14 · 💻 cs.LG · cs.CV

Recognition: unknown

Evolution of Optimization Methods: Algorithms, Scenarios, and Evaluations

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:12 UTC · model grok-4.3

classification 💻 cs.LG cs.CV
keywords deep learning optimization · first-order methods · second-order optimization · zeroth-order methods · empirical evaluation · design trade-offs · SGD · Adam

The pith

Review traces optimization methods from SGD and Adam to second- and zeroth-order alternatives while mapping their performance trade-offs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reviews how first-order gradient methods became standard in deep learning but face limits in large models, privacy constraints, and distributed training. It traces the shift toward second-order methods for better performance ceilings and zeroth-order methods for lower memory use. A broad set of experiments across model types and scenarios identifies recurring design patterns and trade-offs among speed, generalization, and efficiency. The resulting synthesis supplies concrete directions for building optimizers that better meet modern requirements.

Core claim

First-order methods such as SGD and Adam anchor current pipelines yet expose shortfalls in privacy protection and memory when models scale or training becomes distributed. Researchers therefore pursue second-order techniques to break first-order ceilings and zeroth-order techniques to cut memory costs. A retrospective mapping of algorithm development combined with systematic benchmarks on varied architectures and scenarios reveals consistent trends and fundamental trade-offs, which together supply practical guidance for next-generation methods that are simultaneously efficient, robust, and trustworthy.
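
For concreteness, here is a minimal sketch, not drawn from the paper, of the two first-order baselines the claim names. The toy quadratic objective, step sizes, and momentum constants are illustrative choices; the point is the per-parameter optimizer state (one momentum buffer for SGD, two moment buffers for Adam) that the memory argument turns on.

```python
# Minimal sketch (illustrative, not the paper's code): SGD with momentum vs. Adam
# on a toy quadratic f(w) = 0.5 * ||w||^2. Adam carries two extra per-parameter
# buffers, which is the optimizer-state overhead at issue when models scale.
import numpy as np

def grad(w):
    # Gradient of the toy objective f(w) = 0.5 * ||w||^2.
    return w

w_sgd = np.ones(5)
w_adam = np.ones(5)
v = np.zeros(5)                   # SGD momentum buffer: one extra copy of the parameters
m, s = np.zeros(5), np.zeros(5)   # Adam moment buffers: two extra copies
lr, mu = 0.1, 0.9                 # illustrative hyperparameters, not tuned values
beta1, beta2, eps = 0.9, 0.999, 1e-8

for t in range(1, 101):
    # SGD with heavy-ball momentum.
    v = mu * v + grad(w_sgd)
    w_sgd -= lr * v

    # Adam: bias-corrected first/second moment estimates.
    g = grad(w_adam)
    m = beta1 * m + (1 - beta1) * g
    s = beta2 * s + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    s_hat = s / (1 - beta2**t)
    w_adam -= lr * m_hat / (np.sqrt(s_hat) + eps)

# Both drive the toy objective toward its minimum; the difference lies in the state kept.
print(np.linalg.norm(w_sgd), np.linalg.norm(w_adam))
```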

What carries the argument

A retrospective evolutionary analysis paired with empirical benchmarks of mainstream optimizers run on multiple model architectures and training scenarios.

If this is right

  • Second-order methods can exceed the convergence and generalization limits of first-order baselines in targeted settings.
  • Zeroth-order methods reduce memory footprint for training very large models (a two-point estimator illustrating why is sketched after this list).
  • Optimizer design must explicitly trade off convergence speed against privacy guarantees and distributed-system constraints.
  • The identified patterns supply starting points for constructing optimizers that simultaneously meet efficiency, robustness, and trustworthiness goals.
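
A minimal sketch of the memory point above, using a generic two-point random-direction estimator in the spirit of SPSA rather than any specific method from the paper: each step needs only two extra forward evaluations of the loss, so no backward graph or per-parameter optimizer state has to be stored. The dimension, step size, and probe scale below are arbitrary illustrative choices.

```python
# Minimal sketch (our illustration): a two-point zeroth-order gradient estimate.
# Only forward evaluations of the loss are required.
import numpy as np

rng = np.random.default_rng(0)

def loss(w):
    # Stand-in objective; in practice this would be one forward pass of a model.
    return 0.5 * np.sum((w - 1.0) ** 2)

d = 50
w = np.zeros(d)
lr, mu = 0.02, 1e-3          # step size and probe scale (assumed, untuned)

for _ in range(2000):
    u = rng.standard_normal(d)                                # random probe direction
    g_dir = (loss(w + mu * u) - loss(w - mu * u)) / (2 * mu)  # directional derivative estimate
    w -= lr * g_dir * u                                       # update along the probe direction

print(loss(w))   # close to 0 on this convex toy problem
```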

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same trade-off lens could be applied to optimization outside supervised deep learning, for example in reinforcement learning or federated settings.
  • Hardware-aware extensions of the benchmarks might reveal additional memory or communication bottlenecks not visible in the current evaluation.
  • Repeating the study after a few years with newly dominant architectures would test whether the distilled trends remain stable.

Load-bearing premise

The chosen collection of optimizers, model sizes, and training conditions is broad enough to reveal trends that hold outside the tested cases.

What would settle it

Evaluate the same set of optimizers on an architecture or scenario outside the paper's test suite; if the reported trade-offs between convergence, memory, and privacy no longer appear in the same form, the distilled guidance does not generalize.

read the original abstract

Balancing convergence speed, generalization capability, and computational efficiency remains a core challenge in deep learning optimization. First-order gradient descent methods, epitomized by stochastic gradient descent (SGD) and Adam, serve as the cornerstone of modern training pipelines. However, large-scale model training, stringent differential privacy requirements, and distributed learning paradigms expose critical limitations in these conventional approaches regarding privacy protection and memory efficiency. To mitigate these bottlenecks, researchers explore second-order optimization techniques to surpass first-order performance ceilings, while zeroth-order methods reemerge to alleviate memory constraints inherent to large-scale training. Despite this proliferation of methodologies, the field lacks a cohesive framework that unifies underlying principles and delineates application scenarios for these disparate approaches. In this work, we retrospectively analyze the evolutionary trajectory of deep learning optimization algorithms and present a comprehensive empirical evaluation of mainstream optimizers across diverse model architectures and training scenarios. We distill key emerging trends and fundamental design trade-offs, pinpointing promising directions for future research. By synthesizing theoretical insights with extensive empirical evidence, we provide actionable guidance for designing next-generation highly efficient, robust, and trustworthy optimization methods. The code is available at https://github.com/APRIL-AIGC/Awesome-Optimizer.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper retrospectively analyzes the evolutionary trajectory of deep learning optimization algorithms, covering first-order methods (SGD, Adam), second-order techniques, and zeroth-order approaches. It presents a comprehensive empirical evaluation of mainstream optimizers across diverse model architectures and training scenarios (large-scale, differential privacy, distributed). From this synthesis of theory and experiments, the authors distill key trends and fundamental design trade-offs, offering actionable guidance for designing next-generation efficient, robust, and trustworthy optimizers. Open code is provided at https://github.com/APRIL-AIGC/Awesome-Optimizer.

Significance. If the empirical evaluation is representative and the distilled trends hold under scrutiny, the work could offer practical value by unifying principles across optimizer families and highlighting trade-offs in challenging regimes like privacy or memory constraints. The availability of code is a clear strength for reproducibility in this survey-plus-experiments format.

major comments (2)
  1. [Abstract and §1] Abstract and §1 (Introduction): The central claim of distilling general design trade-offs and providing actionable guidance rests on the representativeness of the selected optimizers, architectures, and scenarios; the manuscript provides no explicit justification or coverage analysis for why omitted regimes (e.g., models >10B parameters, non-vision modalities, or strict edge-device constraints) would not alter the trends, undermining generalizability.
  2. [§4] §4 (Empirical Evaluation): The abstract asserts a 'comprehensive empirical evaluation' with 'extensive empirical evidence,' yet the text lacks details on data splits, hyperparameter search protocols, number of runs, or statistical controls (e.g., error bars or significance tests); without these, it is impossible to assess whether post-hoc choices or missing baselines affect the reported trends.
minor comments (2)
  1. [Abstract] The GitHub repository link is provided but should include a README with exact reproduction commands and environment specifications to maximize utility.
  2. [§3] Notation for optimizer variants (e.g., distinctions between Adam and AdamW) should be standardized in tables comparing performance.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to improve clarity, rigor, and transparency.

read point-by-point responses
  1. Referee: [Abstract and §1] Abstract and §1 (Introduction): The central claim of distilling general design trade-offs and providing actionable guidance rests on the representativeness of the selected optimizers, architectures, and scenarios; the manuscript provides no explicit justification or coverage analysis for why omitted regimes (e.g., models >10B parameters, non-vision modalities, or strict edge-device constraints) would not alter the trends, undermining generalizability.

    Authors: We agree that an explicit justification for the selected optimizers, architectures, and scenarios would strengthen the claims. In the revision, we will add a new subsection (likely in §1) that outlines the selection criteria: optimizers were chosen based on their prevalence in recent literature and practical adoption; architectures cover representative vision and language models (e.g., ResNet, ViT, BERT, GPT-style); scenarios focus on large-scale training, differential privacy, and distributed settings because these expose the most salient limitations of first-order methods. We will also add an explicit limitations paragraph acknowledging that trends may not directly extend to models >10B parameters, non-vision modalities, or strict edge constraints, and will reference related works on those regimes to contextualize potential differences. revision: yes

  2. Referee: [§4] §4 (Empirical Evaluation): The abstract asserts a 'comprehensive empirical evaluation' with 'extensive empirical evidence,' yet the text lacks details on data splits, hyperparameter search protocols, number of runs, or statistical controls (e.g., error bars or significance tests); without these, it is impossible to assess whether post-hoc choices or missing baselines affect the reported trends.

    Authors: We fully agree that these experimental details are necessary for reproducibility and to allow readers to evaluate the robustness of the reported trends. In the revised §4, we will add: (i) explicit descriptions of data splits for each dataset and task, (ii) the hyperparameter search protocol including ranges, selection method (e.g., validation performance), and any grid or random search details, (iii) the number of independent runs per experiment, and (iv) statistical reporting including mean ± standard deviation across runs, error bars on all figures, and significance tests where comparisons are drawn. We will also clarify baseline selection criteria to address potential post-hoc concerns. revision: yes
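
To make the promised reporting concrete, here is a minimal sketch, with synthetic stand-in scores generated on the spot rather than any results from the paper, of mean ± standard deviation over seeds and a paired significance test between two optimizers. SciPy is assumed available.

```python
# Minimal sketch (illustrative only): per-seed reporting of the kind requested for §4.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-in validation scores over 5 seeds; not results from the paper.
adam = 76.0 + 0.3 * rng.standard_normal(5)
sgd = 75.6 + 0.3 * rng.standard_normal(5)

print(f"Adam: {adam.mean():.2f} ± {adam.std(ddof=1):.2f}")
print(f"SGD : {sgd.mean():.2f} ± {sgd.std(ddof=1):.2f}")

# Paired t-test across seeds; a small p-value suggests the gap is unlikely to be
# seed noise alone (with the usual caveats at n = 5).
t, p = stats.ttest_rel(adam, sgd)
print(f"paired t = {t:.2f}, p = {p:.3f}")
```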

Circularity Check

0 steps flagged

No significant circularity; descriptive synthesis and empirical evaluation

full rationale

The paper performs a retrospective analysis of optimization method evolution and conducts empirical evaluations across architectures and scenarios to distill trends and guidance. No mathematical derivations, fitted parameters, or closed-form predictions are presented that could reduce to their own inputs by construction. The central claims rely on synthesis of prior literature and independent experimental results rather than self-definitional loops, fitted-input predictions, or load-bearing self-citations. The claims are checked against external benchmarks and prior literature rather than against the paper's own constructions, consistent with the default non-circular outcome for survey-style empirical papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a survey and benchmarking paper the central claims rest on the completeness of the reviewed literature and the representativeness of the new experiments; no new free parameters, axioms, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5551 in / 1064 out tokens · 25480 ms · 2026-05-10T15:12:38.958928+00:00 · methodology

