Zero-order Parameter-free Optimization for LMO-based Methods: Novel Approach for Efficient Fine-tuning

Aleksandr Beznosikov; Daniil Medyakov; Dmitriy Bystrov; Dmitry Bylinkin

arxiv: 2606.14970 · v2 · pith:PJ64HOC2new · submitted 2026-06-12 · 💻 cs.LG

Zero-order Parameter-free Optimization for LMO-based Methods: Novel Approach for Efficient Fine-tuning

Dmitriy Bystrov , Daniil Medyakov , Dmitry Bylinkin , Aleksandr Beznosikov This is my paper

Pith reviewed 2026-06-27 04:30 UTC · model grok-4.3

classification 💻 cs.LG

keywords zeroth-order optimizationparameter-free optimizationLLM fine-tuninglinear minimization oracleadaptive optimizationmemory-efficient trainingnon-Euclidean geometry

0 comments

The pith

AdaNAGED provides parameter-free zeroth-order optimization for LMO-based fine-tuning of large language models with convergence guarantees.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AdaNAGED as a method for zeroth-order optimization that automatically adapts its parameters without prior knowledge of problem-dependent constants. It combines gradient-free training with adaptive tuning and geometry-aware updates that use linear minimization oracles to handle heterogeneous parameter blocks. This addresses the high sensitivity of standard zeroth-order methods to stepsize and smoothing choices during large-scale LLM fine-tuning. Convergence guarantees are established for the approach, and it is validated through experiments on the OPT-1.3B model.

Core claim

AdaNAGED unifies gradient-free training, adaptive tuning, and non-Euclidean update geometry for LMO-based ZO optimization, establishing convergence guarantees and demonstrating effectiveness on the OPT-1.3B fine-tuning task.

What carries the argument

AdaNAGED, which performs parameter-free adaptation within LMO-based zeroth-order optimization to enable non-Euclidean updates across heterogeneous parameter blocks.

If this is right

Convergence guarantees hold for the unified LMO-based ZO setting.
Memory overhead is reduced by avoiding storage of activations, gradients, and optimizer states.
Automatic adaptation removes the need for task-specific tuning of algorithmic parameters.
Non-Euclidean geometry better respects the structure of different parameter blocks in large models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could extend to other memory-limited training scenarios such as on-device adaptation.
Hybrid use with selective first-order steps might further improve efficiency on very large models.
Validation beyond 1.3B parameters would test whether the geometry-aware adaptation scales to frontier models.

Load-bearing premise

Large-scale fine-tuning can benefit from geometry-aware updates that account for the heterogeneous structure of parameter blocks, which can be modeled through methods that exploit linear minimization oracle (LMO).

What would settle it

If experiments on OPT-1.3B fine-tuning show that AdaNAGED requires manual stepsize or smoothing tuning to match standard ZO performance, or fails to converge under its stated guarantees, the central claim would be falsified.

read the original abstract

Fine-tuning large language models (LLMs) has become a central application of modern optimization, enabling pretrained models to adapt to diverse downstream tasks and domain-specific data. A major obstacle in large-scale fine-tuning is the memory overhead of backpropagation, which requires storing activations, gradients, and optimizer states. Zeroth-order (ZO) optimization offers a memory-efficient alternative, but its performance is highly sensitive to the stepsize and smoothing parameter, often requiring costly task-specific tuning. Parameter-free (PF) optimization addresses this issue by adapting algorithmic parameters without prior knowledge of problem-dependent constants. Moreover, large-scale fine-tuning can benefit from geometry-aware updates that account for the heterogeneous structure of parameter blocks, which can be modeled through methods that exploit linear minimization oracle (LMO). In this work, we study PF adaptation for LMO-based ZO optimization and introduce $\texttt{AdaNAGED}$, a method that unifies gradient-free training, adaptive tuning, and non-Euclidean update geometry. We establish convergence guarantees and validate the method on large-scale LLM fine-tuning task with $\texttt{OPT}-1.3\mathrm{B}$ model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AdaNAGED claims to unify ZO optimization, parameter-free tuning, and LMO geometry for memory-efficient LLM fine-tuning with convergence guarantees, but the abstract supplies no proof details or experimental evidence to check those claims.

read the letter

The paper's main move is to introduce AdaNAGED, which tries to combine zeroth-order gradient-free updates, automatic adaptation of stepsizes without manual tuning, and non-Euclidean geometry via the linear minimization oracle. The goal is to cut memory use during LLM fine-tuning while handling the fact that different parameter blocks have different structures. They state convergence results and report validation on OPT-1.3B.

The practical framing is reasonable. Memory overhead from backprop and optimizer states is a real constraint, and ZO methods avoid storing gradients. Making the method parameter-free removes one source of task-specific fiddling, and LMO geometry could in principle respect block-wise differences better than plain Euclidean steps. The abstract lays this motivation out plainly.

The soft spots are straightforward. No derivation steps, no list of assumptions for the convergence claim, and no experimental protocol, baselines, or numbers appear. Without those, it is impossible to judge whether the guarantees are meaningful or whether the OPT-1.3B runs actually show an advantage. Novelty is also difficult to assess because the abstract does not spell out how AdaNAGED differs from prior ZO or parameter-free work in a substantive way.

This is aimed at researchers working on efficient adaptation of large models. A reader looking for concrete memory-saving tricks would need the full proofs and tables to extract value. The high-level idea is coherent enough that a serious referee should see the complete manuscript rather than a desk rejection.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces AdaNAGED, a parameter-free zeroth-order optimization method based on linear minimization oracles (LMO) for memory-efficient fine-tuning of large language models. It claims to unify gradient-free training, adaptive parameter tuning, and non-Euclidean geometry, while establishing convergence guarantees and providing empirical validation on the OPT-1.3B model.

Significance. If the claimed convergence guarantees hold under standard ZO assumptions and the method demonstrates practical gains in memory usage and performance without task-specific tuning on large models, the work could meaningfully advance efficient fine-tuning techniques that avoid backpropagation storage costs.

major comments (2)

[Abstract] Abstract: the central claim that convergence guarantees are established is unsupported by any statement of assumptions, proof outline, theorem reference, or derivation steps, rendering the theoretical contribution unverifiable from the provided text.
[Abstract] Abstract: the empirical validation on OPT-1.3B is asserted without reference to experimental protocol, baselines, metrics, number of runs, or error bars, which is load-bearing for the practical contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments correctly identify that the abstract lacks supporting details for its claims on theory and experiments. We will revise the abstract to address both points while preserving conciseness.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim that convergence guarantees are established is unsupported by any statement of assumptions, proof outline, theorem reference, or derivation steps, rendering the theoretical contribution unverifiable from the provided text.

Authors: We agree the abstract is too terse to verify the claim. The full manuscript states the assumptions in Section 3, presents the main result as Theorem 4.1, and provides the proof in Appendix A under standard ZO assumptions (Lipschitz smoothness and bounded gradient variance). We will revise the abstract to reference the theorem and note the assumptions. revision: yes
Referee: [Abstract] Abstract: the empirical validation on OPT-1.3B is asserted without reference to experimental protocol, baselines, metrics, number of runs, or error bars, which is load-bearing for the practical contribution.

Authors: We agree the abstract omits these details. Section 5 and the associated tables/figures specify the protocol (OPT-1.3B fine-tuning on GLUE and language modeling), baselines (MeZO, ZO-Adam), metrics (accuracy, perplexity), 5 runs with standard error bars. We will revise the abstract to briefly indicate the evaluation setup and comparisons. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The abstract and high-level description introduce AdaNAGED as a unification of ZO optimization, PF adaptation, and LMO-based geometry, with stated convergence guarantees and empirical validation on OPT-1.3B. No equations, fitting procedures, self-citations, or derivation steps are visible that reduce any claimed prediction or result to its inputs by construction. The paper's claims remain at a level where the central contribution is presented as independent of any internal self-referential loop, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no equations, proofs, or experimental sections are provided from which free parameters, axioms, or invented entities can be extracted.

pith-pipeline@v0.9.1-grok · 5743 in / 1145 out tokens · 42562 ms · 2026-06-27T04:30:20.425496+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

94 extracted references · 10 linked inside Pith

[1]

Universal language model fine-tuning for text classification

Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. InProceed- ings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339, 2018

2018
[2]

The power of scale for parameter-efficient prompt tuning

Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. InProceedings of the 2021 conference on empirical methods in natural language processing, pages 3045–3059, 2021

2021
[3]

Don’t stop pretraining: Adapt language models to domains and tasks

Suchin Gururangan, Ana Marasovi´ c, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A Smith. Don’t stop pretraining: Adapt language models to domains and tasks. InProceedings of the 58th annual meeting of the association for computational linguistics, pages 8342–8360, 2020

2020
[4]

Dialogpt: Large-scale generative pre-training for conversational response generation

Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and William B Dolan. Dialogpt: Large-scale generative pre-training for conversational response generation. In Proceedings of the 58th annual meeting of the association for computational linguistics: system demonstrations, pages 270–278, 2020

2020
[5]

Llm fine-tuning: Concepts, opportunities, and challenges.Big Data and Cognitive Computing, 9(4):87, 2025

Xiao-Kun Wu, Min Chen, Wanyi Li, Rui Wang, Limeng Lu, Jia Liu, Kai Hwang, Yixue Hao, Yanru Pan, Qingguo Meng, et al. Llm fine-tuning: Concepts, opportunities, and challenges.Big Data and Cognitive Computing, 9(4):87, 2025

2025
[6]

A stochastic approximation method.The annals of mathematical statis- tics, pages 400–407, 1951

Herbert Robbins and Sutton Monro. A stochastic approximation method.The annals of mathematical statis- tics, pages 400–407, 1951

1951
[7]

Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014

Pith/arXiv arXiv 2014
[8]

The backpropagation algorithm

Raul Rojas. The backpropagation algorithm. InNeural networks: a systematic introduction, pages 149–182. Springer, 1996

1996
[9]

Zero: Memory optimizations toward training trillion parameter models

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. InSC20: international conference for high performance computing, net- working, storage and analysis, pages 1–16. IEEE, 2020

2020
[10]

Fine-tuning language models with just forward passes.Advances in Neural Information Processing Systems, 36:53038–53075, 2023

Sadhika Malladi, Tianyu Gao, Eshaan Nichani, Alex Damian, Jason D Lee, Danqi Chen, and Sanjeev Arora. Fine-tuning language models with just forward passes.Advances in Neural Information Processing Systems, 36:53038–53075, 2023

2023
[11]

Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function.Mathematical Programming, 144(1):1–38, 2014

Peter Richt´ arik and Martin Tak´ aˇ c. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function.Mathematical Programming, 144(1):1–38, 2014

2014
[12]

Badam: A memory efficient full parameter optimization method for large language models.Advances in Neural Information Processing Systems, 37:24926–24958, 2024

Qijun Luo, Hengxu Yu, and Xiao Li. Badam: A memory efficient full parameter optimization method for large language models.Advances in Neural Information Processing Systems, 37:24926–24958, 2024

2024
[13]

8-bit optimizers via block-wise quantization

Tim Dettmers, Mike Lewis, Sam Shleifer, and Luke Zettlemoyer. 8-bit optimizers via block-wise quantization. arXiv preprint arXiv:2110.02861, 2021

arXiv 2021
[14]

Few-bit backward: Quantized gradients of activation functions for memory footprint reduction

Georgii Sergeevich Novikov, Daniel Bershatsky, Julia Gusak, Alex Shonenkov, Denis Valerievich Dimitrov, and Ivan Oseledets. Few-bit backward: Quantized gradients of activation functions for memory footprint reduction. InInternational Conference on Machine Learning, pages 26363–26381. PMLR, 2023

2023
[15]

Recent advances in lora: A comprehensive survey.ACM Transactions on Sensor Networks, 18(4):1–44, 2022

Zehua Sun, Huanqi Yang, Kai Liu, Zhimeng Yin, Zhenjiang Li, and Weitao Xu. Recent advances in lora: A comprehensive survey.ACM Transactions on Sensor Networks, 18(4):1–44, 2022

2022
[16]

Lora-pro: Are low-rank adapters properly optimized?arXiv preprint arXiv:2407.18242, 2024

Zhengbo Wang, Jian Liang, Ran He, Zilei Wang, and Tieniu Tan. Lora-pro: Are low-rank adapters properly optimized?arXiv preprint arXiv:2407.18242, 2024

arXiv 2024
[17]

Stochastic first-and zeroth-order methods for nonconvex stochastic pro- gramming.SIAM journal on optimization, 23(4):2341–2368, 2013

Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic pro- gramming.SIAM journal on optimization, 23(4):2341–2368, 2013

2013
[18]

Zeroth-order policy gradient for reinforcement learning from human feedback without reward inference.arXiv preprint arXiv:2409.17401, 2024

Qining Zhang and Lei Ying. Zeroth-order policy gradient for reinforcement learning from human feedback without reward inference.arXiv preprint arXiv:2409.17401, 2024

arXiv 2024
[19]

A survey on zeroth-order optimization for machine learning

Liting Lin, Hansong Ma, Junxiao Wang, and Shiyu Yang. A survey on zeroth-order optimization for machine learning. InInternational Conference on Web Information Systems and Applications, pages 481–497. Springer, 2025

2025
[20]

Zoqo: Zero-order quantized optimization

Noga Bar and Raja Giryes. Zoqo: Zero-order quantized optimization. InICASSP 2025-2025 IEEE Interna- tional Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2025

2025
[21]

Unified optimal analysis of the (stochastic) gradient method.arXiv preprint arXiv:1907.04232, 2019

Sebastian U Stich. Unified optimal analysis of the (stochastic) gradient method.arXiv preprint arXiv:1907.04232, 2019

arXiv 1907
[22]

Statistically precon- ditioned accelerated gradient method for distributed optimization

Hadrien Hendrikx, Lin Xiao, Sebastien Bubeck, Francis Bach, and Laurent Massoulie. Statistically precon- ditioned accelerated gradient method for distributed optimization. InInternational conference on machine learning, pages 4203–4227. PMLR, 2020

2020
[23]

Making sgd parameter-free

Yair Carmon and Oliver Hinder. Making sgd parameter-free. InConference on learning theory, pages 2360–
[24]

Uniformly optimal and parameter-free first-order methods for convex and function-constrained optimization.arXiv preprint arXiv:2412.06319, 2024

Qi Deng, Guanghui Lan, and Zhenwei Lin. Uniformly optimal and parameter-free first-order methods for convex and function-constrained optimization.arXiv preprint arXiv:2412.06319, 2024

Pith/arXiv arXiv 2024
[25]

Accelerated parameter-free stochastic optimization

Itai Kreisler, Maor Ivgi, Oliver Hinder, and Yair Carmon. Accelerated parameter-free stochastic optimization. InThe Thirty Seventh Annual Conference on Learning Theory, pages 3257–3324. PMLR, 2024

2024
[26]

Revisiting zeroth-order optimization: Minimum-variance two-point estimators and directionally aligned perturbations.arXiv preprint arXiv:2510.19975, 2025

Shaocong Ma and Heng Huang. Revisiting zeroth-order optimization: Minimum-variance two-point estimators and directionally aligned perturbations.arXiv preprint arXiv:2510.19975, 2025

arXiv 2025
[27]

Muon: An optimizer for hidden layers in neural networks, 2024.URL https://kellerjordan

Keller Jordan, Yuchen Jin, Vlado Boza, You Jiacheng, Franz Cesista, Laker Newhouse, and Jeremy Bern- stein. Muon: An optimizer for hidden layers in neural networks, 2024.URL https://kellerjordan. github. io/posts/muon, 6(3):4, 2024

2024
[28]

Old optimizer, new norm: An anthology.arXiv preprint arXiv:2409.20325, 2024

Jeremy Bernstein and Laker Newhouse. Old optimizer, new norm: An anthology.arXiv preprint arXiv:2409.20325, 2024

Pith/arXiv arXiv 2024
[29]

Training deep learning models with norm-constrained lmos.arXiv preprint arXiv:2502.07529, 2025

Thomas Pethick, Wanyun Xie, Kimon Antonakopoulos, Zhenyu Zhu, Antonio Silveti-Falls, and Volkan Cevher. Training deep learning models with norm-constrained lmos.arXiv preprint arXiv:2502.07529, 2025

Pith/arXiv arXiv 2025
[30]

Momentum improves normalized sgd

Ashok Cutkosky and Harsh Mehta. Momentum improves normalized sgd. InInternational conference on machine learning, pages 2260–2268. PMLR, 2020

2020
[31]

Momentum ensures convergence of signsgd under weaker assumptions

Tao Sun, Qingsong Wang, Dongsheng Li, and Bao Wang. Momentum ensures convergence of signsgd under weaker assumptions. InInternational Conference on Machine Learning, pages 33077–33099. PMLR, 2023

2023
[32]

Gluon: Making muon & scion great again!(bridging theory and practice of lmo-based optimizers for llms).arXiv preprint arXiv:2505.13416, 2025

Artem Riabinin, Egor Shulgin, Kaja Gruntkowska, and Peter Richt´ arik. Gluon: Making muon & scion great again!(bridging theory and practice of lmo-based optimizers for llms).arXiv preprint arXiv:2505.13416, 2025

arXiv 2025
[33]

An overview of the simultaneous perturbation method for efficient optimization.Johns Hopkins apl technical digest, 19(4):482–492, 1998

James C Spall. An overview of the simultaneous perturbation method for efficient optimization.Johns Hopkins apl technical digest, 19(4):482–492, 1998

1998
[34]

Online convex optimization in the bandit setting: gradient descent without a gradient.arXiv preprint cs/0408007, 2004

Abraham D Flaxman, Adam Tauman Kalai, and H Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient.arXiv preprint cs/0408007, 2004

Pith/arXiv arXiv 2004
[35]

Optimal rates for zero-order convex optimization: The power of two function evaluations.IEEE Transactions on Information Theory, 61 (5):2788–2806, 2015

John C Duchi, Michael I Jordan, Martin J Wainwright, and Andre Wibisono. Optimal rates for zero-order convex optimization: The power of two function evaluations.IEEE Transactions on Information Theory, 61 (5):2788–2806, 2015

2015
[36]

Second-order fine-tuning without pain for llms: A hessian informed zeroth-order optimizer.arXiv preprint arXiv:2402.15173, 2024

Yanjun Zhao, Sizhe Dang, Haishan Ye, Guang Dai, Yi Qian, and Ivor W Tsang. Second-order fine-tuning without pain for llms: A hessian informed zeroth-order optimizer.arXiv preprint arXiv:2402.15173, 2024

arXiv 2024
[37]

Simple statistical gradient-following algorithms for connectionist reinforcement learning

Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3):229–256, 1992

1992
[38]

Zeroth-order optimization is secretly single-step policy optimization.arXiv preprint arXiv:2506.14460, 2025

Junbin Qiu, Zhengpeng Xie, Xiangda Yan, Yongjie Yang, and Yao Shu. Zeroth-order optimization is secretly single-step policy optimization.arXiv preprint arXiv:2506.14460, 2025

arXiv 2025
[39]

Low-rank curvature for zeroth-order optimization in llm fine-tuning

Hyunseok Seung, Jaewoo Lee, and Hyunsuk Ko. Low-rank curvature for zeroth-order optimization in llm fine-tuning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 25235–25242, 2026

2026
[40]

Zo-adamm: Zeroth- order adaptive momentum method for black-box optimization.Advances in neural information processing systems, 32, 2019

Xiangyi Chen, Sijia Liu, Kaidi Xu, Xingguo Li, Xue Lin, Mingyi Hong, and David Cox. Zo-adamm: Zeroth- order adaptive momentum method for black-box optimization.Advances in neural information processing systems, 32, 2019

2019
[41]

Zeroth-order stochas- tic variance reduction for nonconvex optimization.Advances in neural information processing systems, 31, 2018

Sijia Liu, Bhavya Kailkhura, Pin-Yu Chen, Paishun Ting, Shiyu Chang, and Lisa Amini. Zeroth-order stochas- tic variance reduction for nonconvex optimization.Advances in neural information processing systems, 31, 2018

2018
[42]

Improved zeroth-order variance reduced algorithms and analysis for nonconvex optimization

Kaiyi Ji, Zhe Wang, Yi Zhou, and Yingbin Liang. Improved zeroth-order variance reduced algorithms and analysis for nonconvex optimization. InInternational conference on machine learning, pages 3100–3109. PMLR, 2019

2019
[43]

Variance-reduced zeroth-order methods for fine-tuning language models.arXiv preprint arXiv:2404.08080, 2024

Tanmay Gautam, Youngsuk Park, Hao Zhou, Parameswaran Raman, and Wooseok Ha. Variance-reduced zeroth-order methods for fine-tuning language models.arXiv preprint arXiv:2404.08080, 2024

arXiv 2024
[44]

Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models

Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh. Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. InACM AISec, 2017

2017
[45]

Revisiting zeroth-order optimization for memory-efficient llm fine-tuning: A benchmark.arXiv preprint arXiv:2402.11592, 2024

Yihua Zhang, Pingzhi Li, Junyuan Hong, Jiaxiang Li, Yimeng Zhang, Wenqing Zheng, Pin-Yu Chen, Jason D Lee, Wotao Yin, Mingyi Hong, et al. Revisiting zeroth-order optimization for memory-efficient llm fine-tuning: A benchmark.arXiv preprint arXiv:2402.11592, 2024

arXiv 2024
[46]

Problem complexity and method efficiency in optimization

Arkadij Semenoviˇ c Nemirovskij and David Borisovich Yudin. Problem complexity and method efficiency in optimization. 1983

1983
[47]

Adaptive subgradient methods for online learning and stochastic optimization.Journal of machine learning research, 12(7), 2011

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization.Journal of machine learning research, 12(7), 2011

2011
[48]

Adadelta: an adaptive learning rate method.arXiv preprint arXiv:1212.5701, 2012

Matthew D Zeiler. Adadelta: an adaptive learning rate method.arXiv preprint arXiv:1212.5701, 2012

Pith/arXiv arXiv 2012
[49]

A modern introduction to online learning.arXiv preprint arXiv:1912.13213, 2019

Francesco Orabona. A modern introduction to online learning.arXiv preprint arXiv:1912.13213, 2019

Pith/arXiv arXiv 1912
[50]

Dimension-free exponentiated gradient.Advances in Neural Information Processing Systems, 26, 2013

Francesco Orabona. Dimension-free exponentiated gradient.Advances in Neural Information Processing Systems, 26, 2013

2013
[51]

Unconstrained online linear learning in hilbert spaces: Minimax algorithms and normal approximations

H Brendan McMahan and Francesco Orabona. Unconstrained online linear learning in hilbert spaces: Minimax algorithms and normal approximations. InConference on Learning Theory, pages 1020–1039. PMLR, 2014

2014
[52]

Coin betting and parameter-free online learning.Advances in Neural Information Processing Systems, 29, 2016

Francesco Orabona and D´ avid P´ al. Coin betting and parameter-free online learning.Advances in Neural Information Processing Systems, 29, 2016

2016
[53]

Black-box reductions for parameter-free online learning in banach spaces

Ashok Cutkosky and Francesco Orabona. Black-box reductions for parameter-free online learning in banach spaces. InConference On Learning Theory, pages 1493–1529. PMLR, 2018

2018
[54]

Dog is sgd’s best friend: A parameter-free dynamic step size schedule

Maor Ivgi, Oliver Hinder, and Yair Carmon. Dog is sgd’s best friend: A parameter-free dynamic step size schedule. InInternational conference on machine learning, pages 14465–14499. PMLR, 2023

2023
[55]

Learning-rate-free learning by d-adaptation

Aaron Defazio and Konstantin Mishchenko. Learning-rate-free learning by d-adaptation. InInternational conference on machine learning, pages 7449–7479. PMLR, 2023

2023
[56]

Prodigy: An expeditiously adaptive parameter-free learner.arXiv preprint arXiv:2306.06101, 2023

Konstantin Mishchenko and Aaron Defazio. Prodigy: An expeditiously adaptive parameter-free learner.arXiv preprint arXiv:2306.06101, 2023

arXiv 2023
[57]

Momo: Momentum models for adaptive learning rates.arXiv preprint arXiv:2305.07583, 2023

Fabian Schaipp, Ruben Ohana, Michael Eickenberg, Aaron Defazio, and Robert M Gower. Momo: Momentum models for adaptive learning rates.arXiv preprint arXiv:2305.07583, 2023

arXiv 2023
[58]

Sgd with adagrad stepsizes: Full adaptivity with high probability to unknown parameters, unbounded gradients and affine variance

Amit Attia and Tomer Koren. Sgd with adagrad stepsizes: Full adaptivity with high probability to unknown parameters, unbounded gradients and affine variance. InInternational Conference on Machine Learning, pages 1147–1171. PMLR, 2023

2023
[59]

How free is parameter-free stochastic optimization?arXiv preprint arXiv:2402.03126, 2024

Amit Attia and Tomer Koren. How free is parameter-free stochastic optimization?arXiv preprint arXiv:2402.03126, 2024

arXiv 2024
[60]

Dowg unleashed: An efficient universal parameter-free gradient descent method.Advances in Neural Information Processing Systems, 36:6748–6769, 2023

Ahmed Khaled, Konstantin Mishchenko, and Chi Jin. Dowg unleashed: An efficient universal parameter-free gradient descent method.Advances in Neural Information Processing Systems, 36:6748–6769, 2023

2023
[61]

Adaptive gradient descent without descent.arXiv preprint arXiv:1910.09529, 2019

Yura Malitsky and Konstantin Mishchenko. Adaptive gradient descent without descent.arXiv preprint arXiv:1910.09529, 2019

arXiv 1910
[62]

Directional smoothness and gradient methods: Convergence and adaptivity.Advances in Neural Information Processing Systems, 37: 14810–14848, 2024

Aaron Mishkin, Ahmed Khaled, Yuanhao Wang, Aaron Defazio, and Robert M Gower. Directional smoothness and gradient methods: Convergence and adaptivity.Advances in Neural Information Processing Systems, 37: 14810–14848, 2024

2024
[63]

Sign-sgd via parameter-free optimization

Daniil Medyakov, Stanko Sergey, Gleb Molodtsov, Philip Zmushko, Evseev Grigoriy, Egor Petrov, and Alek- sandr Beznosikov. Sign-sgd via parameter-free optimization. InThe Fourteenth International Conference on Learning Representations, 2026

2026
[64]

Frank–wolfe and friends: a journey into projection-free first-order optimization methods.4OR, 19(3):313–345, 2021

Immanuel M Bomze, Francesco Rinaldi, and Damiano Zeffiro. Frank–wolfe and friends: a journey into projection-free first-order optimization methods.4OR, 19(3):313–345, 2021

2021
[65]

Conditional gradient methods.arXiv preprint arXiv:2211.14103, 2022

G´ abor Braun, Alejandro Carderera, Cyrille W Combettes, Hamed Hassani, Amin Karbasi, Aryan Mokhtari, and Sebastian Pokutta. Conditional gradient methods.arXiv preprint arXiv:2211.14103, 2022

arXiv 2022
[66]

signsgd: Com- pressedoptimisationfornon-convexproblems

Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Animashree Anandkumar. signsgd: Com- pressedoptimisationfornon-convexproblems. InInternational conference on machine learning, pages560–569. PMLR, 2018

2018
[67]

Error feedback fixes signsgd and other gradient compression schemes

Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian Stich, and Martin Jaggi. Error feedback fixes signsgd and other gradient compression schemes. InInternational conference on machine learning, pages 3252–3261. PMLR, 2019

2019
[68]

Dissecting adam: The sign, magnitude and variance of stochastic gradients

Lukas Balles and Philipp Hennig. Dissecting adam: The sign, magnitude and variance of stochastic gradients. InInternational Conference on Machine Learning, pages 404–413. PMLR, 2018

2018
[69]

The geometry of sign gradient descent.arXiv preprint arXiv:2002.08056, 2020

Lukas Balles, Fabian Pedregosa, and Nicolas Le Roux. The geometry of sign gradient descent.arXiv preprint arXiv:2002.08056, 2020

arXiv 2002
[70]

Effective quantization of muon optimizer states

Aman Gupta, Rafael Celente, Abhishek Shivanna, DT Braithwaite, Gregory Dexter, Shao Tang, Hiroto Uda- gawa, Daniel Silva, Rohan Ramanath, and S Sathiya Keerthi. Effective quantization of muon optimizer states. arXiv preprint arXiv:2509.23106, 2025

arXiv 2025
[71]

Limuon: Light and fast muon optimizer for large models.arXiv preprint arXiv:2509.14562, 2025

Feihu Huang, Yuning Luo, and Songcan Chen. Limuon: Light and fast muon optimizer for large models.arXiv preprint arXiv:2509.14562, 2025

Pith/arXiv arXiv 2025
[72]

A parameter-free and near-optimal zeroth-order algorithm for stochastic convex optimization.arXiv preprint arXiv:2502.05600, 2025

Kunjie Ren and Luo Luo. A parameter-free and near-optimal zeroth-order algorithm for stochastic convex optimization.arXiv preprint arXiv:2502.05600, 2025

arXiv 2025
[73]

A parameter-free zeroth-order algorithm for decentralized stochastic convex optimization.arXiv preprint arXiv:2603.15219, 2026

Jiawei Chen and Alexander Rogozin. A parameter-free zeroth-order algorithm for decentralized stochastic convex optimization.arXiv preprint arXiv:2603.15219, 2026

arXiv 2026
[74]

Parameter-free variance reduced zeroth-order optimization for non-convex problems

Yuxing Peng, Yuanyuan Liu, Fanhua Shang, and Hongying Liu. Parameter-free variance reduced zeroth-order optimization for non-convex problems
[75]

Adamuon: Adaptive muon optimizer.arXiv preprint arXiv:2507.11005, 2025

Chongjie Si, Debing Zhang, and Wei Shen. Adamuon: Adaptive muon optimizer.arXiv preprint arXiv:2507.11005, 2025

arXiv 2025
[76]

Normuon: Making muon more efficient and scalable.arXiv preprint arXiv:2510.05491, 2025

Zichong Li, Liming Liu, Chen Liang, Weizhu Chen, and Tuo Zhao. Normuon: Making muon more efficient and scalable.arXiv preprint arXiv:2510.05491, 2025

arXiv 2025
[77]

Towards gradient free and projection free stochastic optimization

Anit Kumar Sahu, Manzil Zaheer, and Soummya Kar. Towards gradient free and projection free stochastic optimization. InThe 22nd International Conference on Artificial Intelligence and Statistics, pages 3468–3477. PMLR, 2019

2019
[78]

Parameter-free locally accelerated conditional gradients

Alejandro Carderera, Jelena Diakonikolas, Cheuk Yin Lin, and Sebastian Pokutta. Parameter-free locally accelerated conditional gradients. InInternational Conference on Machine Learning, pages 1283–1293. PMLR, 2021

2021
[79]

A parameter-free conditional gradient method for composite minimization under h¨ older condition.Journal of Machine Learning Research, 24(166):1–34, 2023

Masaru Ito, Zhaosong Lu, and Chuan He. A parameter-free conditional gradient method for composite minimization under h¨ older condition.Journal of Machine Learning Research, 24(166):1–34, 2023

2023
[80]

New aspects of black box conditional gradient: Variance reduction and one point feedback.Chaos, Solitons & Fractals, 189:115654, 2024

Andrey Veprikov, Alexander Bogdanov, Vladislav Minashkin, and Aleksandr Beznosikov. New aspects of black box conditional gradient: Variance reduction and one point feedback.Chaos, Solitons & Fractals, 189:115654, 2024

2024

Showing first 80 references.

[1] [1]

Universal language model fine-tuning for text classification

Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. InProceed- ings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339, 2018

2018

[2] [2]

The power of scale for parameter-efficient prompt tuning

Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. InProceedings of the 2021 conference on empirical methods in natural language processing, pages 3045–3059, 2021

2021

[3] [3]

Don’t stop pretraining: Adapt language models to domains and tasks

Suchin Gururangan, Ana Marasovi´ c, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A Smith. Don’t stop pretraining: Adapt language models to domains and tasks. InProceedings of the 58th annual meeting of the association for computational linguistics, pages 8342–8360, 2020

2020

[4] [4]

Dialogpt: Large-scale generative pre-training for conversational response generation

Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and William B Dolan. Dialogpt: Large-scale generative pre-training for conversational response generation. In Proceedings of the 58th annual meeting of the association for computational linguistics: system demonstrations, pages 270–278, 2020

2020

[5] [5]

Llm fine-tuning: Concepts, opportunities, and challenges.Big Data and Cognitive Computing, 9(4):87, 2025

Xiao-Kun Wu, Min Chen, Wanyi Li, Rui Wang, Limeng Lu, Jia Liu, Kai Hwang, Yixue Hao, Yanru Pan, Qingguo Meng, et al. Llm fine-tuning: Concepts, opportunities, and challenges.Big Data and Cognitive Computing, 9(4):87, 2025

2025

[6] [6]

A stochastic approximation method.The annals of mathematical statis- tics, pages 400–407, 1951

Herbert Robbins and Sutton Monro. A stochastic approximation method.The annals of mathematical statis- tics, pages 400–407, 1951

1951

[7] [7]

Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014

Pith/arXiv arXiv 2014

[8] [8]

The backpropagation algorithm

Raul Rojas. The backpropagation algorithm. InNeural networks: a systematic introduction, pages 149–182. Springer, 1996

1996

[9] [9]

Zero: Memory optimizations toward training trillion parameter models

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. InSC20: international conference for high performance computing, net- working, storage and analysis, pages 1–16. IEEE, 2020

2020

[10] [10]

Fine-tuning language models with just forward passes.Advances in Neural Information Processing Systems, 36:53038–53075, 2023

Sadhika Malladi, Tianyu Gao, Eshaan Nichani, Alex Damian, Jason D Lee, Danqi Chen, and Sanjeev Arora. Fine-tuning language models with just forward passes.Advances in Neural Information Processing Systems, 36:53038–53075, 2023

2023

[11] [11]

Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function.Mathematical Programming, 144(1):1–38, 2014

Peter Richt´ arik and Martin Tak´ aˇ c. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function.Mathematical Programming, 144(1):1–38, 2014

2014

[12] [12]

Badam: A memory efficient full parameter optimization method for large language models.Advances in Neural Information Processing Systems, 37:24926–24958, 2024

Qijun Luo, Hengxu Yu, and Xiao Li. Badam: A memory efficient full parameter optimization method for large language models.Advances in Neural Information Processing Systems, 37:24926–24958, 2024

2024

[13] [13]

8-bit optimizers via block-wise quantization

Tim Dettmers, Mike Lewis, Sam Shleifer, and Luke Zettlemoyer. 8-bit optimizers via block-wise quantization. arXiv preprint arXiv:2110.02861, 2021

arXiv 2021

[14] [14]

Few-bit backward: Quantized gradients of activation functions for memory footprint reduction

Georgii Sergeevich Novikov, Daniel Bershatsky, Julia Gusak, Alex Shonenkov, Denis Valerievich Dimitrov, and Ivan Oseledets. Few-bit backward: Quantized gradients of activation functions for memory footprint reduction. InInternational Conference on Machine Learning, pages 26363–26381. PMLR, 2023

2023

[15] [15]

Recent advances in lora: A comprehensive survey.ACM Transactions on Sensor Networks, 18(4):1–44, 2022

Zehua Sun, Huanqi Yang, Kai Liu, Zhimeng Yin, Zhenjiang Li, and Weitao Xu. Recent advances in lora: A comprehensive survey.ACM Transactions on Sensor Networks, 18(4):1–44, 2022

2022

[16] [16]

Lora-pro: Are low-rank adapters properly optimized?arXiv preprint arXiv:2407.18242, 2024

Zhengbo Wang, Jian Liang, Ran He, Zilei Wang, and Tieniu Tan. Lora-pro: Are low-rank adapters properly optimized?arXiv preprint arXiv:2407.18242, 2024

arXiv 2024

[17] [17]

Stochastic first-and zeroth-order methods for nonconvex stochastic pro- gramming.SIAM journal on optimization, 23(4):2341–2368, 2013

Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic pro- gramming.SIAM journal on optimization, 23(4):2341–2368, 2013

2013

[18] [18]

Zeroth-order policy gradient for reinforcement learning from human feedback without reward inference.arXiv preprint arXiv:2409.17401, 2024

Qining Zhang and Lei Ying. Zeroth-order policy gradient for reinforcement learning from human feedback without reward inference.arXiv preprint arXiv:2409.17401, 2024

arXiv 2024

[19] [19]

A survey on zeroth-order optimization for machine learning

Liting Lin, Hansong Ma, Junxiao Wang, and Shiyu Yang. A survey on zeroth-order optimization for machine learning. InInternational Conference on Web Information Systems and Applications, pages 481–497. Springer, 2025

2025

[20] [20]

Zoqo: Zero-order quantized optimization

Noga Bar and Raja Giryes. Zoqo: Zero-order quantized optimization. InICASSP 2025-2025 IEEE Interna- tional Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2025

2025

[21] [21]

Unified optimal analysis of the (stochastic) gradient method.arXiv preprint arXiv:1907.04232, 2019

Sebastian U Stich. Unified optimal analysis of the (stochastic) gradient method.arXiv preprint arXiv:1907.04232, 2019

arXiv 1907

[22] [22]

Statistically precon- ditioned accelerated gradient method for distributed optimization

Hadrien Hendrikx, Lin Xiao, Sebastien Bubeck, Francis Bach, and Laurent Massoulie. Statistically precon- ditioned accelerated gradient method for distributed optimization. InInternational conference on machine learning, pages 4203–4227. PMLR, 2020

2020

[23] [23]

Making sgd parameter-free

Yair Carmon and Oliver Hinder. Making sgd parameter-free. InConference on learning theory, pages 2360–

[24] [24]

Uniformly optimal and parameter-free first-order methods for convex and function-constrained optimization.arXiv preprint arXiv:2412.06319, 2024

Qi Deng, Guanghui Lan, and Zhenwei Lin. Uniformly optimal and parameter-free first-order methods for convex and function-constrained optimization.arXiv preprint arXiv:2412.06319, 2024

Pith/arXiv arXiv 2024

[25] [25]

Accelerated parameter-free stochastic optimization

Itai Kreisler, Maor Ivgi, Oliver Hinder, and Yair Carmon. Accelerated parameter-free stochastic optimization. InThe Thirty Seventh Annual Conference on Learning Theory, pages 3257–3324. PMLR, 2024

2024

[26] [26]

Revisiting zeroth-order optimization: Minimum-variance two-point estimators and directionally aligned perturbations.arXiv preprint arXiv:2510.19975, 2025

Shaocong Ma and Heng Huang. Revisiting zeroth-order optimization: Minimum-variance two-point estimators and directionally aligned perturbations.arXiv preprint arXiv:2510.19975, 2025

arXiv 2025

[27] [27]

Muon: An optimizer for hidden layers in neural networks, 2024.URL https://kellerjordan

Keller Jordan, Yuchen Jin, Vlado Boza, You Jiacheng, Franz Cesista, Laker Newhouse, and Jeremy Bern- stein. Muon: An optimizer for hidden layers in neural networks, 2024.URL https://kellerjordan. github. io/posts/muon, 6(3):4, 2024

2024

[28] [28]

Old optimizer, new norm: An anthology.arXiv preprint arXiv:2409.20325, 2024

Jeremy Bernstein and Laker Newhouse. Old optimizer, new norm: An anthology.arXiv preprint arXiv:2409.20325, 2024

Pith/arXiv arXiv 2024

[29] [29]

Training deep learning models with norm-constrained lmos.arXiv preprint arXiv:2502.07529, 2025

Thomas Pethick, Wanyun Xie, Kimon Antonakopoulos, Zhenyu Zhu, Antonio Silveti-Falls, and Volkan Cevher. Training deep learning models with norm-constrained lmos.arXiv preprint arXiv:2502.07529, 2025

Pith/arXiv arXiv 2025

[30] [30]

Momentum improves normalized sgd

Ashok Cutkosky and Harsh Mehta. Momentum improves normalized sgd. InInternational conference on machine learning, pages 2260–2268. PMLR, 2020

2020

[31] [31]

Momentum ensures convergence of signsgd under weaker assumptions

Tao Sun, Qingsong Wang, Dongsheng Li, and Bao Wang. Momentum ensures convergence of signsgd under weaker assumptions. InInternational Conference on Machine Learning, pages 33077–33099. PMLR, 2023

2023

[32] [32]

Gluon: Making muon & scion great again!(bridging theory and practice of lmo-based optimizers for llms).arXiv preprint arXiv:2505.13416, 2025

Artem Riabinin, Egor Shulgin, Kaja Gruntkowska, and Peter Richt´ arik. Gluon: Making muon & scion great again!(bridging theory and practice of lmo-based optimizers for llms).arXiv preprint arXiv:2505.13416, 2025

arXiv 2025

[33] [33]

An overview of the simultaneous perturbation method for efficient optimization.Johns Hopkins apl technical digest, 19(4):482–492, 1998

James C Spall. An overview of the simultaneous perturbation method for efficient optimization.Johns Hopkins apl technical digest, 19(4):482–492, 1998

1998

[34] [34]

Online convex optimization in the bandit setting: gradient descent without a gradient.arXiv preprint cs/0408007, 2004

Abraham D Flaxman, Adam Tauman Kalai, and H Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient.arXiv preprint cs/0408007, 2004

Pith/arXiv arXiv 2004

[35] [35]

Optimal rates for zero-order convex optimization: The power of two function evaluations.IEEE Transactions on Information Theory, 61 (5):2788–2806, 2015

John C Duchi, Michael I Jordan, Martin J Wainwright, and Andre Wibisono. Optimal rates for zero-order convex optimization: The power of two function evaluations.IEEE Transactions on Information Theory, 61 (5):2788–2806, 2015

2015

[36] [36]

Second-order fine-tuning without pain for llms: A hessian informed zeroth-order optimizer.arXiv preprint arXiv:2402.15173, 2024

Yanjun Zhao, Sizhe Dang, Haishan Ye, Guang Dai, Yi Qian, and Ivor W Tsang. Second-order fine-tuning without pain for llms: A hessian informed zeroth-order optimizer.arXiv preprint arXiv:2402.15173, 2024

arXiv 2024

[37] [37]

Simple statistical gradient-following algorithms for connectionist reinforcement learning

Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3):229–256, 1992

1992

[38] [38]

Zeroth-order optimization is secretly single-step policy optimization.arXiv preprint arXiv:2506.14460, 2025

Junbin Qiu, Zhengpeng Xie, Xiangda Yan, Yongjie Yang, and Yao Shu. Zeroth-order optimization is secretly single-step policy optimization.arXiv preprint arXiv:2506.14460, 2025

arXiv 2025

[39] [39]

Low-rank curvature for zeroth-order optimization in llm fine-tuning

Hyunseok Seung, Jaewoo Lee, and Hyunsuk Ko. Low-rank curvature for zeroth-order optimization in llm fine-tuning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 25235–25242, 2026

2026

[40] [40]

Zo-adamm: Zeroth- order adaptive momentum method for black-box optimization.Advances in neural information processing systems, 32, 2019

Xiangyi Chen, Sijia Liu, Kaidi Xu, Xingguo Li, Xue Lin, Mingyi Hong, and David Cox. Zo-adamm: Zeroth- order adaptive momentum method for black-box optimization.Advances in neural information processing systems, 32, 2019

2019

[41] [41]

Zeroth-order stochas- tic variance reduction for nonconvex optimization.Advances in neural information processing systems, 31, 2018

Sijia Liu, Bhavya Kailkhura, Pin-Yu Chen, Paishun Ting, Shiyu Chang, and Lisa Amini. Zeroth-order stochas- tic variance reduction for nonconvex optimization.Advances in neural information processing systems, 31, 2018

2018

[42] [42]

Improved zeroth-order variance reduced algorithms and analysis for nonconvex optimization

Kaiyi Ji, Zhe Wang, Yi Zhou, and Yingbin Liang. Improved zeroth-order variance reduced algorithms and analysis for nonconvex optimization. InInternational conference on machine learning, pages 3100–3109. PMLR, 2019

2019

[43] [43]

Variance-reduced zeroth-order methods for fine-tuning language models.arXiv preprint arXiv:2404.08080, 2024

Tanmay Gautam, Youngsuk Park, Hao Zhou, Parameswaran Raman, and Wooseok Ha. Variance-reduced zeroth-order methods for fine-tuning language models.arXiv preprint arXiv:2404.08080, 2024

arXiv 2024

[44] [44]

Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models

Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh. Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. InACM AISec, 2017

2017

[45] [45]

Revisiting zeroth-order optimization for memory-efficient llm fine-tuning: A benchmark.arXiv preprint arXiv:2402.11592, 2024

Yihua Zhang, Pingzhi Li, Junyuan Hong, Jiaxiang Li, Yimeng Zhang, Wenqing Zheng, Pin-Yu Chen, Jason D Lee, Wotao Yin, Mingyi Hong, et al. Revisiting zeroth-order optimization for memory-efficient llm fine-tuning: A benchmark.arXiv preprint arXiv:2402.11592, 2024

arXiv 2024

[46] [46]

Problem complexity and method efficiency in optimization

Arkadij Semenoviˇ c Nemirovskij and David Borisovich Yudin. Problem complexity and method efficiency in optimization. 1983

1983

[47] [47]

Adaptive subgradient methods for online learning and stochastic optimization.Journal of machine learning research, 12(7), 2011

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization.Journal of machine learning research, 12(7), 2011

2011

[48] [48]

Adadelta: an adaptive learning rate method.arXiv preprint arXiv:1212.5701, 2012

Matthew D Zeiler. Adadelta: an adaptive learning rate method.arXiv preprint arXiv:1212.5701, 2012

Pith/arXiv arXiv 2012

[49] [49]

A modern introduction to online learning.arXiv preprint arXiv:1912.13213, 2019

Francesco Orabona. A modern introduction to online learning.arXiv preprint arXiv:1912.13213, 2019

Pith/arXiv arXiv 1912

[50] [50]

Dimension-free exponentiated gradient.Advances in Neural Information Processing Systems, 26, 2013

Francesco Orabona. Dimension-free exponentiated gradient.Advances in Neural Information Processing Systems, 26, 2013

2013

[51] [51]

Unconstrained online linear learning in hilbert spaces: Minimax algorithms and normal approximations

H Brendan McMahan and Francesco Orabona. Unconstrained online linear learning in hilbert spaces: Minimax algorithms and normal approximations. InConference on Learning Theory, pages 1020–1039. PMLR, 2014

2014

[52] [52]

Coin betting and parameter-free online learning.Advances in Neural Information Processing Systems, 29, 2016

Francesco Orabona and D´ avid P´ al. Coin betting and parameter-free online learning.Advances in Neural Information Processing Systems, 29, 2016

2016

[53] [53]

Black-box reductions for parameter-free online learning in banach spaces

Ashok Cutkosky and Francesco Orabona. Black-box reductions for parameter-free online learning in banach spaces. InConference On Learning Theory, pages 1493–1529. PMLR, 2018

2018

[54] [54]

Dog is sgd’s best friend: A parameter-free dynamic step size schedule

Maor Ivgi, Oliver Hinder, and Yair Carmon. Dog is sgd’s best friend: A parameter-free dynamic step size schedule. InInternational conference on machine learning, pages 14465–14499. PMLR, 2023

2023

[55] [55]

Learning-rate-free learning by d-adaptation

Aaron Defazio and Konstantin Mishchenko. Learning-rate-free learning by d-adaptation. InInternational conference on machine learning, pages 7449–7479. PMLR, 2023

2023

[56] [56]

Prodigy: An expeditiously adaptive parameter-free learner.arXiv preprint arXiv:2306.06101, 2023

Konstantin Mishchenko and Aaron Defazio. Prodigy: An expeditiously adaptive parameter-free learner.arXiv preprint arXiv:2306.06101, 2023

arXiv 2023

[57] [57]

Momo: Momentum models for adaptive learning rates.arXiv preprint arXiv:2305.07583, 2023

Fabian Schaipp, Ruben Ohana, Michael Eickenberg, Aaron Defazio, and Robert M Gower. Momo: Momentum models for adaptive learning rates.arXiv preprint arXiv:2305.07583, 2023

arXiv 2023

[58] [58]

Sgd with adagrad stepsizes: Full adaptivity with high probability to unknown parameters, unbounded gradients and affine variance

Amit Attia and Tomer Koren. Sgd with adagrad stepsizes: Full adaptivity with high probability to unknown parameters, unbounded gradients and affine variance. InInternational Conference on Machine Learning, pages 1147–1171. PMLR, 2023

2023

[59] [59]

How free is parameter-free stochastic optimization?arXiv preprint arXiv:2402.03126, 2024

Amit Attia and Tomer Koren. How free is parameter-free stochastic optimization?arXiv preprint arXiv:2402.03126, 2024

arXiv 2024

[60] [60]

Dowg unleashed: An efficient universal parameter-free gradient descent method.Advances in Neural Information Processing Systems, 36:6748–6769, 2023

Ahmed Khaled, Konstantin Mishchenko, and Chi Jin. Dowg unleashed: An efficient universal parameter-free gradient descent method.Advances in Neural Information Processing Systems, 36:6748–6769, 2023

2023

[61] [61]

Adaptive gradient descent without descent.arXiv preprint arXiv:1910.09529, 2019

Yura Malitsky and Konstantin Mishchenko. Adaptive gradient descent without descent.arXiv preprint arXiv:1910.09529, 2019

arXiv 1910

[62] [62]

Directional smoothness and gradient methods: Convergence and adaptivity.Advances in Neural Information Processing Systems, 37: 14810–14848, 2024

Aaron Mishkin, Ahmed Khaled, Yuanhao Wang, Aaron Defazio, and Robert M Gower. Directional smoothness and gradient methods: Convergence and adaptivity.Advances in Neural Information Processing Systems, 37: 14810–14848, 2024

2024

[63] [63]

Sign-sgd via parameter-free optimization

Daniil Medyakov, Stanko Sergey, Gleb Molodtsov, Philip Zmushko, Evseev Grigoriy, Egor Petrov, and Alek- sandr Beznosikov. Sign-sgd via parameter-free optimization. InThe Fourteenth International Conference on Learning Representations, 2026

2026

[64] [64]

Frank–wolfe and friends: a journey into projection-free first-order optimization methods.4OR, 19(3):313–345, 2021

Immanuel M Bomze, Francesco Rinaldi, and Damiano Zeffiro. Frank–wolfe and friends: a journey into projection-free first-order optimization methods.4OR, 19(3):313–345, 2021

2021

[65] [65]

Conditional gradient methods.arXiv preprint arXiv:2211.14103, 2022

G´ abor Braun, Alejandro Carderera, Cyrille W Combettes, Hamed Hassani, Amin Karbasi, Aryan Mokhtari, and Sebastian Pokutta. Conditional gradient methods.arXiv preprint arXiv:2211.14103, 2022

arXiv 2022

[66] [66]

signsgd: Com- pressedoptimisationfornon-convexproblems

Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Animashree Anandkumar. signsgd: Com- pressedoptimisationfornon-convexproblems. InInternational conference on machine learning, pages560–569. PMLR, 2018

2018

[67] [67]

Error feedback fixes signsgd and other gradient compression schemes

Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian Stich, and Martin Jaggi. Error feedback fixes signsgd and other gradient compression schemes. InInternational conference on machine learning, pages 3252–3261. PMLR, 2019

2019

[68] [68]

Dissecting adam: The sign, magnitude and variance of stochastic gradients

Lukas Balles and Philipp Hennig. Dissecting adam: The sign, magnitude and variance of stochastic gradients. InInternational Conference on Machine Learning, pages 404–413. PMLR, 2018

2018

[69] [69]

The geometry of sign gradient descent.arXiv preprint arXiv:2002.08056, 2020

Lukas Balles, Fabian Pedregosa, and Nicolas Le Roux. The geometry of sign gradient descent.arXiv preprint arXiv:2002.08056, 2020

arXiv 2002

[70] [70]

Effective quantization of muon optimizer states

Aman Gupta, Rafael Celente, Abhishek Shivanna, DT Braithwaite, Gregory Dexter, Shao Tang, Hiroto Uda- gawa, Daniel Silva, Rohan Ramanath, and S Sathiya Keerthi. Effective quantization of muon optimizer states. arXiv preprint arXiv:2509.23106, 2025

arXiv 2025

[71] [71]

Limuon: Light and fast muon optimizer for large models.arXiv preprint arXiv:2509.14562, 2025

Feihu Huang, Yuning Luo, and Songcan Chen. Limuon: Light and fast muon optimizer for large models.arXiv preprint arXiv:2509.14562, 2025

Pith/arXiv arXiv 2025

[72] [72]

A parameter-free and near-optimal zeroth-order algorithm for stochastic convex optimization.arXiv preprint arXiv:2502.05600, 2025

Kunjie Ren and Luo Luo. A parameter-free and near-optimal zeroth-order algorithm for stochastic convex optimization.arXiv preprint arXiv:2502.05600, 2025

arXiv 2025

[73] [73]

A parameter-free zeroth-order algorithm for decentralized stochastic convex optimization.arXiv preprint arXiv:2603.15219, 2026

Jiawei Chen and Alexander Rogozin. A parameter-free zeroth-order algorithm for decentralized stochastic convex optimization.arXiv preprint arXiv:2603.15219, 2026

arXiv 2026

[74] [74]

Parameter-free variance reduced zeroth-order optimization for non-convex problems

Yuxing Peng, Yuanyuan Liu, Fanhua Shang, and Hongying Liu. Parameter-free variance reduced zeroth-order optimization for non-convex problems

[75] [75]

Adamuon: Adaptive muon optimizer.arXiv preprint arXiv:2507.11005, 2025

Chongjie Si, Debing Zhang, and Wei Shen. Adamuon: Adaptive muon optimizer.arXiv preprint arXiv:2507.11005, 2025

arXiv 2025

[76] [76]

Normuon: Making muon more efficient and scalable.arXiv preprint arXiv:2510.05491, 2025

Zichong Li, Liming Liu, Chen Liang, Weizhu Chen, and Tuo Zhao. Normuon: Making muon more efficient and scalable.arXiv preprint arXiv:2510.05491, 2025

arXiv 2025

[77] [77]

Towards gradient free and projection free stochastic optimization

Anit Kumar Sahu, Manzil Zaheer, and Soummya Kar. Towards gradient free and projection free stochastic optimization. InThe 22nd International Conference on Artificial Intelligence and Statistics, pages 3468–3477. PMLR, 2019

2019

[78] [78]

Parameter-free locally accelerated conditional gradients

Alejandro Carderera, Jelena Diakonikolas, Cheuk Yin Lin, and Sebastian Pokutta. Parameter-free locally accelerated conditional gradients. InInternational Conference on Machine Learning, pages 1283–1293. PMLR, 2021

2021

[79] [79]

A parameter-free conditional gradient method for composite minimization under h¨ older condition.Journal of Machine Learning Research, 24(166):1–34, 2023

Masaru Ito, Zhaosong Lu, and Chuan He. A parameter-free conditional gradient method for composite minimization under h¨ older condition.Journal of Machine Learning Research, 24(166):1–34, 2023

2023

[80] [80]

New aspects of black box conditional gradient: Variance reduction and one point feedback.Chaos, Solitons & Fractals, 189:115654, 2024

Andrey Veprikov, Alexander Bogdanov, Vladislav Minashkin, and Aleksandr Beznosikov. New aspects of black box conditional gradient: Variance reduction and one point feedback.Chaos, Solitons & Fractals, 189:115654, 2024

2024