pith. machine review for the scientific record.

arxiv: 2604.12968 · v1 · submitted 2026-04-14 · 💻 cs.LG · cs.CV

Recognition: unknown

Evolution of Optimization Methods: Algorithms, Scenarios, and Evaluations

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:12 UTC · model grok-4.3

classification 💻 cs.LG cs.CV
keywords deep learning optimization · first-order methods · second-order optimization · zeroth-order methods · empirical evaluation · design trade-offs · SGD · Adam

The pith

Review traces optimization methods from SGD and Adam to second- and zeroth-order alternatives while mapping their performance trade-offs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reviews how first-order gradient methods became standard in deep learning but face limits in large models, privacy constraints, and distributed training. It traces the shift toward second-order methods for better performance ceilings and zeroth-order methods for lower memory use. A broad set of experiments across model types and scenarios identifies recurring design patterns and trade-offs among speed, generalization, and efficiency. The resulting synthesis supplies concrete directions for building optimizers that better meet modern requirements.

Core claim

First-order methods such as SGD and Adam anchor current pipelines yet expose shortfalls in privacy protection and memory when models scale or training becomes distributed. Researchers therefore pursue second-order techniques to break first-order ceilings and zeroth-order techniques to cut memory costs. A retrospective mapping of algorithm development combined with systematic benchmarks on varied architectures and scenarios reveals consistent trends and fundamental trade-offs, which together supply practical guidance for next-generation methods that are simultaneously efficient, robust, and trustworthy.
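
For concreteness, here is a minimal sketch, not drawn from the paper, of the two first-order baselines the claim names. The toy quadratic objective, step sizes, and momentum constants are illustrative choices; the point is the per-parameter optimizer state (one momentum buffer for SGD, two moment buffers for Adam) that the memory argument turns on.

```python
# Minimal sketch (illustrative, not the paper's code): SGD with momentum vs. Adam
# on a toy quadratic f(w) = 0.5 * ||w||^2. Adam carries two extra per-parameter
# buffers, which is the optimizer-state overhead at issue when models scale.
import numpy as np

def grad(w):
    # Gradient of the toy objective f(w) = 0.5 * ||w||^2.
    return w

w_sgd = np.ones(5)
w_adam = np.ones(5)
v = np.zeros(5)                   # SGD momentum buffer: one extra copy of the parameters
m, s = np.zeros(5), np.zeros(5)   # Adam moment buffers: two extra copies
lr, mu = 0.1, 0.9                 # illustrative hyperparameters, not tuned values
beta1, beta2, eps = 0.9, 0.999, 1e-8

for t in range(1, 101):
    # SGD with heavy-ball momentum.
    v = mu * v + grad(w_sgd)
    w_sgd -= lr * v

    # Adam: bias-corrected first/second moment estimates.
    g = grad(w_adam)
    m = beta1 * m + (1 - beta1) * g
    s = beta2 * s + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    s_hat = s / (1 - beta2**t)
    w_adam -= lr * m_hat / (np.sqrt(s_hat) + eps)

# Both drive the toy objective toward its minimum; the difference lies in the state kept.
print(np.linalg.norm(w_sgd), np.linalg.norm(w_adam))
```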

What carries the argument

A retrospective evolutionary analysis paired with empirical benchmarks of mainstream optimizers run on multiple model architectures and training scenarios.

If this is right

  • Second-order methods can exceed the convergence and generalization limits of first-order baselines in targeted settings.
  • Zeroth-order methods reduce memory footprint for training very large models (a two-point estimator illustrating why is sketched after this list).
  • Optimizer design must explicitly trade off convergence speed against privacy guarantees and distributed-system constraints.
  • The identified patterns supply starting points for constructing optimizers that simultaneously meet efficiency, robustness, and trustworthiness goals.
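
A minimal sketch of the memory point above, using a generic two-point random-direction estimator in the spirit of SPSA rather than any specific method from the paper: each step needs only two extra forward evaluations of the loss, so no backward graph or per-parameter optimizer state has to be stored. The dimension, step size, and probe scale below are arbitrary illustrative choices.

```python
# Minimal sketch (our illustration): a two-point zeroth-order gradient estimate.
# Only forward evaluations of the loss are required.
import numpy as np

rng = np.random.default_rng(0)

def loss(w):
    # Stand-in objective; in practice this would be one forward pass of a model.
    return 0.5 * np.sum((w - 1.0) ** 2)

d = 50
w = np.zeros(d)
lr, mu = 0.02, 1e-3          # step size and probe scale (assumed, untuned)

for _ in range(2000):
    u = rng.standard_normal(d)                                # random probe direction
    g_dir = (loss(w + mu * u) - loss(w - mu * u)) / (2 * mu)  # directional derivative estimate
    w -= lr * g_dir * u                                       # update along the probe direction

print(loss(w))   # close to 0 on this convex toy problem
```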

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same trade-off lens could be applied to optimization outside supervised deep learning, for example in reinforcement learning or federated settings.
  • Hardware-aware extensions of the benchmarks might reveal additional memory or communication bottlenecks not visible in the current evaluation.
  • Repeating the study after a few years with newly dominant architectures would test whether the distilled trends remain stable.

Load-bearing premise

The chosen collection of optimizers, model sizes, and training conditions is broad enough to reveal trends that hold outside the tested cases.

What would settle it

Evaluate the same set of optimizers on an architecture or scenario outside the paper's test suite; if the reported trade-offs between convergence, memory, and privacy no longer appear in the same form, the distilled guidance does not generalize.

read the original abstract

Balancing convergence speed, generalization capability, and computational efficiency remains a core challenge in deep learning optimization. First-order gradient descent methods, epitomized by stochastic gradient descent (SGD) and Adam, serve as the cornerstone of modern training pipelines. However, large-scale model training, stringent differential privacy requirements, and distributed learning paradigms expose critical limitations in these conventional approaches regarding privacy protection and memory efficiency. To mitigate these bottlenecks, researchers explore second-order optimization techniques to surpass first-order performance ceilings, while zeroth-order methods reemerge to alleviate memory constraints inherent to large-scale training. Despite this proliferation of methodologies, the field lacks a cohesive framework that unifies underlying principles and delineates application scenarios for these disparate approaches. In this work, we retrospectively analyze the evolutionary trajectory of deep learning optimization algorithms and present a comprehensive empirical evaluation of mainstream optimizers across diverse model architectures and training scenarios. We distill key emerging trends and fundamental design trade-offs, pinpointing promising directions for future research. By synthesizing theoretical insights with extensive empirical evidence, we provide actionable guidance for designing next-generation highly efficient, robust, and trustworthy optimization methods. The code is available at https://github.com/APRIL-AIGC/Awesome-Optimizer.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper retrospectively analyzes the evolutionary trajectory of deep learning optimization algorithms, covering first-order methods (SGD, Adam), second-order techniques, and zeroth-order approaches. It presents a comprehensive empirical evaluation of mainstream optimizers across diverse model architectures and training scenarios (large-scale, differential privacy, distributed). From this synthesis of theory and experiments, the authors distill key trends and fundamental design trade-offs, offering actionable guidance for designing next-generation efficient, robust, and trustworthy optimizers. Open code is provided at https://github.com/APRIL-AIGC/Awesome-Optimizer.

Significance. If the empirical evaluation is representative and the distilled trends hold under scrutiny, the work could offer practical value by unifying principles across optimizer families and highlighting trade-offs in challenging regimes like privacy or memory constraints. The availability of code is a clear strength for reproducibility in this survey-plus-experiments format.

major comments (2)
  1. [Abstract and §1] Abstract and §1 (Introduction): The central claim of distilling general design trade-offs and providing actionable guidance rests on the representativeness of the selected optimizers, architectures, and scenarios; the manuscript provides no explicit justification or coverage analysis for why omitted regimes (e.g., models >10B parameters, non-vision modalities, or strict edge-device constraints) would not alter the trends, undermining generalizability.
  2. [§4] §4 (Empirical Evaluation): The abstract asserts a 'comprehensive empirical evaluation' with 'extensive empirical evidence,' yet the text lacks details on data splits, hyperparameter search protocols, number of runs, or statistical controls (e.g., error bars or significance tests); without these, it is impossible to assess whether post-hoc choices or missing baselines affect the reported trends.
minor comments (2)
  1. [Abstract] The GitHub repository link is provided but should include a README with exact reproduction commands and environment specifications to maximize utility.
  2. [§3] Notation for optimizer variants (e.g., distinctions between Adam and AdamW) should be standardized in tables comparing performance.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to improve clarity, rigor, and transparency.

read point-by-point responses
  1. Referee: [Abstract and §1] Abstract and §1 (Introduction): The central claim of distilling general design trade-offs and providing actionable guidance rests on the representativeness of the selected optimizers, architectures, and scenarios; the manuscript provides no explicit justification or coverage analysis for why omitted regimes (e.g., models >10B parameters, non-vision modalities, or strict edge-device constraints) would not alter the trends, undermining generalizability.

    Authors: We agree that an explicit justification for the selected optimizers, architectures, and scenarios would strengthen the claims. In the revision, we will add a new subsection (likely in §1) that outlines the selection criteria: optimizers were chosen based on their prevalence in recent literature and practical adoption; architectures cover representative vision and language models (e.g., ResNet, ViT, BERT, GPT-style); scenarios focus on large-scale training, differential privacy, and distributed settings because these expose the most salient limitations of first-order methods. We will also add an explicit limitations paragraph acknowledging that trends may not directly extend to models >10B parameters, non-vision modalities, or strict edge constraints, and will reference related works on those regimes to contextualize potential differences. revision: yes

  2. Referee: [§4] §4 (Empirical Evaluation): The abstract asserts a 'comprehensive empirical evaluation' with 'extensive empirical evidence,' yet the text lacks details on data splits, hyperparameter search protocols, number of runs, or statistical controls (e.g., error bars or significance tests); without these, it is impossible to assess whether post-hoc choices or missing baselines affect the reported trends.

    Authors: We fully agree that these experimental details are necessary for reproducibility and to allow readers to evaluate the robustness of the reported trends. In the revised §4, we will add: (i) explicit descriptions of data splits for each dataset and task, (ii) the hyperparameter search protocol including ranges, selection method (e.g., validation performance), and any grid or random search details, (iii) the number of independent runs per experiment, and (iv) statistical reporting including mean ± standard deviation across runs, error bars on all figures, and significance tests where comparisons are drawn. We will also clarify baseline selection criteria to address potential post-hoc concerns. revision: yes
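
To make the promised reporting concrete, here is a minimal sketch, with synthetic stand-in scores generated on the spot rather than any results from the paper, of mean ± standard deviation over seeds and a paired significance test between two optimizers. SciPy is assumed available.

```python
# Minimal sketch (illustrative only): per-seed reporting of the kind requested for §4.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-in validation scores over 5 seeds; not results from the paper.
adam = 76.0 + 0.3 * rng.standard_normal(5)
sgd = 75.6 + 0.3 * rng.standard_normal(5)

print(f"Adam: {adam.mean():.2f} ± {adam.std(ddof=1):.2f}")
print(f"SGD : {sgd.mean():.2f} ± {sgd.std(ddof=1):.2f}")

# Paired t-test across seeds; a small p-value suggests the gap is unlikely to be
# seed noise alone (with the usual caveats at n = 5).
t, p = stats.ttest_rel(adam, sgd)
print(f"paired t = {t:.2f}, p = {p:.3f}")
```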

Circularity Check

0 steps flagged

No significant circularity; descriptive synthesis and empirical evaluation

full rationale

The paper performs a retrospective analysis of optimization method evolution and conducts empirical evaluations across architectures and scenarios to distill trends and guidance. No mathematical derivations, fitted parameters, or closed-form predictions are presented that could reduce to their own inputs by construction. The central claims rely on synthesis of prior literature and independent experimental results rather than self-definitional loops, fitted-input predictions, or load-bearing self-citations. The claims are checked against external benchmarks and prior literature rather than against the paper's own constructions, consistent with the default non-circular outcome for survey-style empirical papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a survey and benchmarking paper the central claims rest on the completeness of the reviewed literature and the representativeness of the new experiments; no new free parameters, axioms, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5551 in / 1064 out tokens · 25480 ms · 2026-05-10T15:12:38.958928+00:00 · methodology

