Zero-order Parameter-free Optimization for LMO-based Methods: Novel Approach for Efficient Fine-tuning
Pith reviewed 2026-06-27 04:30 UTC · model grok-4.3
The pith
AdaNAGED provides parameter-free zeroth-order optimization for LMO-based fine-tuning of large language models with convergence guarantees.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AdaNAGED unifies gradient-free training, adaptive tuning, and non-Euclidean update geometry for LMO-based ZO optimization, establishing convergence guarantees and demonstrating effectiveness on the OPT-1.3B fine-tuning task.
What carries the argument
AdaNAGED, which performs parameter-free adaptation within LMO-based zeroth-order optimization to enable non-Euclidean updates across heterogeneous parameter blocks.
If this is right
- Convergence guarantees hold for the unified LMO-based ZO setting.
- Memory overhead is reduced by avoiding storage of activations, gradients, and optimizer states.
- Automatic adaptation removes the need for task-specific tuning of algorithmic parameters.
- Non-Euclidean geometry better respects the structure of different parameter blocks in large models.
Where Pith is reading between the lines
- The approach could extend to other memory-limited training scenarios such as on-device adaptation.
- Hybrid use with selective first-order steps might further improve efficiency on very large models.
- Validation beyond 1.3B parameters would test whether the geometry-aware adaptation scales to frontier models.
Load-bearing premise
Large-scale fine-tuning can benefit from geometry-aware updates that account for the heterogeneous structure of parameter blocks, which can be modeled through methods that exploit linear minimization oracle (LMO).
What would settle it
If experiments on OPT-1.3B fine-tuning show that AdaNAGED requires manual stepsize or smoothing tuning to match standard ZO performance, or fails to converge under its stated guarantees, the central claim would be falsified.
read the original abstract
Fine-tuning large language models (LLMs) has become a central application of modern optimization, enabling pretrained models to adapt to diverse downstream tasks and domain-specific data. A major obstacle in large-scale fine-tuning is the memory overhead of backpropagation, which requires storing activations, gradients, and optimizer states. Zeroth-order (ZO) optimization offers a memory-efficient alternative, but its performance is highly sensitive to the stepsize and smoothing parameter, often requiring costly task-specific tuning. Parameter-free (PF) optimization addresses this issue by adapting algorithmic parameters without prior knowledge of problem-dependent constants. Moreover, large-scale fine-tuning can benefit from geometry-aware updates that account for the heterogeneous structure of parameter blocks, which can be modeled through methods that exploit linear minimization oracle (LMO). In this work, we study PF adaptation for LMO-based ZO optimization and introduce $\texttt{AdaNAGED}$, a method that unifies gradient-free training, adaptive tuning, and non-Euclidean update geometry. We establish convergence guarantees and validate the method on large-scale LLM fine-tuning task with $\texttt{OPT}-1.3\mathrm{B}$ model.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces AdaNAGED, a parameter-free zeroth-order optimization method based on linear minimization oracles (LMO) for memory-efficient fine-tuning of large language models. It claims to unify gradient-free training, adaptive parameter tuning, and non-Euclidean geometry, while establishing convergence guarantees and providing empirical validation on the OPT-1.3B model.
Significance. If the claimed convergence guarantees hold under standard ZO assumptions and the method demonstrates practical gains in memory usage and performance without task-specific tuning on large models, the work could meaningfully advance efficient fine-tuning techniques that avoid backpropagation storage costs.
major comments (2)
- [Abstract] Abstract: the central claim that convergence guarantees are established is unsupported by any statement of assumptions, proof outline, theorem reference, or derivation steps, rendering the theoretical contribution unverifiable from the provided text.
- [Abstract] Abstract: the empirical validation on OPT-1.3B is asserted without reference to experimental protocol, baselines, metrics, number of runs, or error bars, which is load-bearing for the practical contribution.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments correctly identify that the abstract lacks supporting details for its claims on theory and experiments. We will revise the abstract to address both points while preserving conciseness.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim that convergence guarantees are established is unsupported by any statement of assumptions, proof outline, theorem reference, or derivation steps, rendering the theoretical contribution unverifiable from the provided text.
Authors: We agree the abstract is too terse to verify the claim. The full manuscript states the assumptions in Section 3, presents the main result as Theorem 4.1, and provides the proof in Appendix A under standard ZO assumptions (Lipschitz smoothness and bounded gradient variance). We will revise the abstract to reference the theorem and note the assumptions. revision: yes
-
Referee: [Abstract] Abstract: the empirical validation on OPT-1.3B is asserted without reference to experimental protocol, baselines, metrics, number of runs, or error bars, which is load-bearing for the practical contribution.
Authors: We agree the abstract omits these details. Section 5 and the associated tables/figures specify the protocol (OPT-1.3B fine-tuning on GLUE and language modeling), baselines (MeZO, ZO-Adam), metrics (accuracy, perplexity), 5 runs with standard error bars. We will revise the abstract to briefly indicate the evaluation setup and comparisons. revision: yes
Circularity Check
No significant circularity identified
full rationale
The abstract and high-level description introduce AdaNAGED as a unification of ZO optimization, PF adaptation, and LMO-based geometry, with stated convergence guarantees and empirical validation on OPT-1.3B. No equations, fitting procedures, self-citations, or derivation steps are visible that reduce any claimed prediction or result to its inputs by construction. The paper's claims remain at a level where the central contribution is presented as independent of any internal self-referential loop, making the derivation self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Universal language model fine-tuning for text classification
Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. InProceed- ings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339, 2018
2018
-
[2]
The power of scale for parameter-efficient prompt tuning
Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. InProceedings of the 2021 conference on empirical methods in natural language processing, pages 3045–3059, 2021
2021
-
[3]
Don’t stop pretraining: Adapt language models to domains and tasks
Suchin Gururangan, Ana Marasovi´ c, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A Smith. Don’t stop pretraining: Adapt language models to domains and tasks. InProceedings of the 58th annual meeting of the association for computational linguistics, pages 8342–8360, 2020
2020
-
[4]
Dialogpt: Large-scale generative pre-training for conversational response generation
Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and William B Dolan. Dialogpt: Large-scale generative pre-training for conversational response generation. In Proceedings of the 58th annual meeting of the association for computational linguistics: system demonstrations, pages 270–278, 2020
2020
-
[5]
Llm fine-tuning: Concepts, opportunities, and challenges.Big Data and Cognitive Computing, 9(4):87, 2025
Xiao-Kun Wu, Min Chen, Wanyi Li, Rui Wang, Limeng Lu, Jia Liu, Kai Hwang, Yixue Hao, Yanru Pan, Qingguo Meng, et al. Llm fine-tuning: Concepts, opportunities, and challenges.Big Data and Cognitive Computing, 9(4):87, 2025
2025
-
[6]
A stochastic approximation method.The annals of mathematical statis- tics, pages 400–407, 1951
Herbert Robbins and Sutton Monro. A stochastic approximation method.The annals of mathematical statis- tics, pages 400–407, 1951
1951
-
[7]
Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014
Pith/arXiv arXiv 2014
-
[8]
The backpropagation algorithm
Raul Rojas. The backpropagation algorithm. InNeural networks: a systematic introduction, pages 149–182. Springer, 1996
1996
-
[9]
Zero: Memory optimizations toward training trillion parameter models
Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. InSC20: international conference for high performance computing, net- working, storage and analysis, pages 1–16. IEEE, 2020
2020
-
[10]
Fine-tuning language models with just forward passes.Advances in Neural Information Processing Systems, 36:53038–53075, 2023
Sadhika Malladi, Tianyu Gao, Eshaan Nichani, Alex Damian, Jason D Lee, Danqi Chen, and Sanjeev Arora. Fine-tuning language models with just forward passes.Advances in Neural Information Processing Systems, 36:53038–53075, 2023
2023
-
[11]
Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function.Mathematical Programming, 144(1):1–38, 2014
Peter Richt´ arik and Martin Tak´ aˇ c. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function.Mathematical Programming, 144(1):1–38, 2014
2014
-
[12]
Badam: A memory efficient full parameter optimization method for large language models.Advances in Neural Information Processing Systems, 37:24926–24958, 2024
Qijun Luo, Hengxu Yu, and Xiao Li. Badam: A memory efficient full parameter optimization method for large language models.Advances in Neural Information Processing Systems, 37:24926–24958, 2024
2024
-
[13]
8-bit optimizers via block-wise quantization
Tim Dettmers, Mike Lewis, Sam Shleifer, and Luke Zettlemoyer. 8-bit optimizers via block-wise quantization. arXiv preprint arXiv:2110.02861, 2021
arXiv 2021
-
[14]
Few-bit backward: Quantized gradients of activation functions for memory footprint reduction
Georgii Sergeevich Novikov, Daniel Bershatsky, Julia Gusak, Alex Shonenkov, Denis Valerievich Dimitrov, and Ivan Oseledets. Few-bit backward: Quantized gradients of activation functions for memory footprint reduction. InInternational Conference on Machine Learning, pages 26363–26381. PMLR, 2023
2023
-
[15]
Recent advances in lora: A comprehensive survey.ACM Transactions on Sensor Networks, 18(4):1–44, 2022
Zehua Sun, Huanqi Yang, Kai Liu, Zhimeng Yin, Zhenjiang Li, and Weitao Xu. Recent advances in lora: A comprehensive survey.ACM Transactions on Sensor Networks, 18(4):1–44, 2022
2022
-
[16]
Lora-pro: Are low-rank adapters properly optimized?arXiv preprint arXiv:2407.18242, 2024
Zhengbo Wang, Jian Liang, Ran He, Zilei Wang, and Tieniu Tan. Lora-pro: Are low-rank adapters properly optimized?arXiv preprint arXiv:2407.18242, 2024
arXiv 2024
-
[17]
Stochastic first-and zeroth-order methods for nonconvex stochastic pro- gramming.SIAM journal on optimization, 23(4):2341–2368, 2013
Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic pro- gramming.SIAM journal on optimization, 23(4):2341–2368, 2013
2013
-
[18]
Qining Zhang and Lei Ying. Zeroth-order policy gradient for reinforcement learning from human feedback without reward inference.arXiv preprint arXiv:2409.17401, 2024
arXiv 2024
-
[19]
A survey on zeroth-order optimization for machine learning
Liting Lin, Hansong Ma, Junxiao Wang, and Shiyu Yang. A survey on zeroth-order optimization for machine learning. InInternational Conference on Web Information Systems and Applications, pages 481–497. Springer, 2025
2025
-
[20]
Zoqo: Zero-order quantized optimization
Noga Bar and Raja Giryes. Zoqo: Zero-order quantized optimization. InICASSP 2025-2025 IEEE Interna- tional Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2025
2025
-
[21]
Unified optimal analysis of the (stochastic) gradient method.arXiv preprint arXiv:1907.04232, 2019
Sebastian U Stich. Unified optimal analysis of the (stochastic) gradient method.arXiv preprint arXiv:1907.04232, 2019
arXiv 1907
-
[22]
Statistically precon- ditioned accelerated gradient method for distributed optimization
Hadrien Hendrikx, Lin Xiao, Sebastien Bubeck, Francis Bach, and Laurent Massoulie. Statistically precon- ditioned accelerated gradient method for distributed optimization. InInternational conference on machine learning, pages 4203–4227. PMLR, 2020
2020
-
[23]
Making sgd parameter-free
Yair Carmon and Oliver Hinder. Making sgd parameter-free. InConference on learning theory, pages 2360–
-
[24]
Qi Deng, Guanghui Lan, and Zhenwei Lin. Uniformly optimal and parameter-free first-order methods for convex and function-constrained optimization.arXiv preprint arXiv:2412.06319, 2024
Pith/arXiv arXiv 2024
-
[25]
Accelerated parameter-free stochastic optimization
Itai Kreisler, Maor Ivgi, Oliver Hinder, and Yair Carmon. Accelerated parameter-free stochastic optimization. InThe Thirty Seventh Annual Conference on Learning Theory, pages 3257–3324. PMLR, 2024
2024
-
[26]
Shaocong Ma and Heng Huang. Revisiting zeroth-order optimization: Minimum-variance two-point estimators and directionally aligned perturbations.arXiv preprint arXiv:2510.19975, 2025
arXiv 2025
-
[27]
Muon: An optimizer for hidden layers in neural networks, 2024.URL https://kellerjordan
Keller Jordan, Yuchen Jin, Vlado Boza, You Jiacheng, Franz Cesista, Laker Newhouse, and Jeremy Bern- stein. Muon: An optimizer for hidden layers in neural networks, 2024.URL https://kellerjordan. github. io/posts/muon, 6(3):4, 2024
2024
-
[28]
Old optimizer, new norm: An anthology.arXiv preprint arXiv:2409.20325, 2024
Jeremy Bernstein and Laker Newhouse. Old optimizer, new norm: An anthology.arXiv preprint arXiv:2409.20325, 2024
Pith/arXiv arXiv 2024
-
[29]
Training deep learning models with norm-constrained lmos.arXiv preprint arXiv:2502.07529, 2025
Thomas Pethick, Wanyun Xie, Kimon Antonakopoulos, Zhenyu Zhu, Antonio Silveti-Falls, and Volkan Cevher. Training deep learning models with norm-constrained lmos.arXiv preprint arXiv:2502.07529, 2025
Pith/arXiv arXiv 2025
-
[30]
Momentum improves normalized sgd
Ashok Cutkosky and Harsh Mehta. Momentum improves normalized sgd. InInternational conference on machine learning, pages 2260–2268. PMLR, 2020
2020
-
[31]
Momentum ensures convergence of signsgd under weaker assumptions
Tao Sun, Qingsong Wang, Dongsheng Li, and Bao Wang. Momentum ensures convergence of signsgd under weaker assumptions. InInternational Conference on Machine Learning, pages 33077–33099. PMLR, 2023
2023
-
[32]
Artem Riabinin, Egor Shulgin, Kaja Gruntkowska, and Peter Richt´ arik. Gluon: Making muon & scion great again!(bridging theory and practice of lmo-based optimizers for llms).arXiv preprint arXiv:2505.13416, 2025
arXiv 2025
-
[33]
An overview of the simultaneous perturbation method for efficient optimization.Johns Hopkins apl technical digest, 19(4):482–492, 1998
James C Spall. An overview of the simultaneous perturbation method for efficient optimization.Johns Hopkins apl technical digest, 19(4):482–492, 1998
1998
-
[34]
Abraham D Flaxman, Adam Tauman Kalai, and H Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient.arXiv preprint cs/0408007, 2004
Pith/arXiv arXiv 2004
-
[35]
Optimal rates for zero-order convex optimization: The power of two function evaluations.IEEE Transactions on Information Theory, 61 (5):2788–2806, 2015
John C Duchi, Michael I Jordan, Martin J Wainwright, and Andre Wibisono. Optimal rates for zero-order convex optimization: The power of two function evaluations.IEEE Transactions on Information Theory, 61 (5):2788–2806, 2015
2015
-
[36]
Yanjun Zhao, Sizhe Dang, Haishan Ye, Guang Dai, Yi Qian, and Ivor W Tsang. Second-order fine-tuning without pain for llms: A hessian informed zeroth-order optimizer.arXiv preprint arXiv:2402.15173, 2024
arXiv 2024
-
[37]
Simple statistical gradient-following algorithms for connectionist reinforcement learning
Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3):229–256, 1992
1992
-
[38]
Junbin Qiu, Zhengpeng Xie, Xiangda Yan, Yongjie Yang, and Yao Shu. Zeroth-order optimization is secretly single-step policy optimization.arXiv preprint arXiv:2506.14460, 2025
arXiv 2025
-
[39]
Low-rank curvature for zeroth-order optimization in llm fine-tuning
Hyunseok Seung, Jaewoo Lee, and Hyunsuk Ko. Low-rank curvature for zeroth-order optimization in llm fine-tuning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 25235–25242, 2026
2026
-
[40]
Zo-adamm: Zeroth- order adaptive momentum method for black-box optimization.Advances in neural information processing systems, 32, 2019
Xiangyi Chen, Sijia Liu, Kaidi Xu, Xingguo Li, Xue Lin, Mingyi Hong, and David Cox. Zo-adamm: Zeroth- order adaptive momentum method for black-box optimization.Advances in neural information processing systems, 32, 2019
2019
-
[41]
Zeroth-order stochas- tic variance reduction for nonconvex optimization.Advances in neural information processing systems, 31, 2018
Sijia Liu, Bhavya Kailkhura, Pin-Yu Chen, Paishun Ting, Shiyu Chang, and Lisa Amini. Zeroth-order stochas- tic variance reduction for nonconvex optimization.Advances in neural information processing systems, 31, 2018
2018
-
[42]
Improved zeroth-order variance reduced algorithms and analysis for nonconvex optimization
Kaiyi Ji, Zhe Wang, Yi Zhou, and Yingbin Liang. Improved zeroth-order variance reduced algorithms and analysis for nonconvex optimization. InInternational conference on machine learning, pages 3100–3109. PMLR, 2019
2019
-
[43]
Tanmay Gautam, Youngsuk Park, Hao Zhou, Parameswaran Raman, and Wooseok Ha. Variance-reduced zeroth-order methods for fine-tuning language models.arXiv preprint arXiv:2404.08080, 2024
arXiv 2024
-
[44]
Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models
Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh. Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. InACM AISec, 2017
2017
-
[45]
Yihua Zhang, Pingzhi Li, Junyuan Hong, Jiaxiang Li, Yimeng Zhang, Wenqing Zheng, Pin-Yu Chen, Jason D Lee, Wotao Yin, Mingyi Hong, et al. Revisiting zeroth-order optimization for memory-efficient llm fine-tuning: A benchmark.arXiv preprint arXiv:2402.11592, 2024
arXiv 2024
-
[46]
Problem complexity and method efficiency in optimization
Arkadij Semenoviˇ c Nemirovskij and David Borisovich Yudin. Problem complexity and method efficiency in optimization. 1983
1983
-
[47]
Adaptive subgradient methods for online learning and stochastic optimization.Journal of machine learning research, 12(7), 2011
John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization.Journal of machine learning research, 12(7), 2011
2011
-
[48]
Adadelta: an adaptive learning rate method.arXiv preprint arXiv:1212.5701, 2012
Matthew D Zeiler. Adadelta: an adaptive learning rate method.arXiv preprint arXiv:1212.5701, 2012
Pith/arXiv arXiv 2012
-
[49]
A modern introduction to online learning.arXiv preprint arXiv:1912.13213, 2019
Francesco Orabona. A modern introduction to online learning.arXiv preprint arXiv:1912.13213, 2019
Pith/arXiv arXiv 1912
-
[50]
Dimension-free exponentiated gradient.Advances in Neural Information Processing Systems, 26, 2013
Francesco Orabona. Dimension-free exponentiated gradient.Advances in Neural Information Processing Systems, 26, 2013
2013
-
[51]
Unconstrained online linear learning in hilbert spaces: Minimax algorithms and normal approximations
H Brendan McMahan and Francesco Orabona. Unconstrained online linear learning in hilbert spaces: Minimax algorithms and normal approximations. InConference on Learning Theory, pages 1020–1039. PMLR, 2014
2014
-
[52]
Coin betting and parameter-free online learning.Advances in Neural Information Processing Systems, 29, 2016
Francesco Orabona and D´ avid P´ al. Coin betting and parameter-free online learning.Advances in Neural Information Processing Systems, 29, 2016
2016
-
[53]
Black-box reductions for parameter-free online learning in banach spaces
Ashok Cutkosky and Francesco Orabona. Black-box reductions for parameter-free online learning in banach spaces. InConference On Learning Theory, pages 1493–1529. PMLR, 2018
2018
-
[54]
Dog is sgd’s best friend: A parameter-free dynamic step size schedule
Maor Ivgi, Oliver Hinder, and Yair Carmon. Dog is sgd’s best friend: A parameter-free dynamic step size schedule. InInternational conference on machine learning, pages 14465–14499. PMLR, 2023
2023
-
[55]
Learning-rate-free learning by d-adaptation
Aaron Defazio and Konstantin Mishchenko. Learning-rate-free learning by d-adaptation. InInternational conference on machine learning, pages 7449–7479. PMLR, 2023
2023
-
[56]
Prodigy: An expeditiously adaptive parameter-free learner.arXiv preprint arXiv:2306.06101, 2023
Konstantin Mishchenko and Aaron Defazio. Prodigy: An expeditiously adaptive parameter-free learner.arXiv preprint arXiv:2306.06101, 2023
arXiv 2023
-
[57]
Momo: Momentum models for adaptive learning rates.arXiv preprint arXiv:2305.07583, 2023
Fabian Schaipp, Ruben Ohana, Michael Eickenberg, Aaron Defazio, and Robert M Gower. Momo: Momentum models for adaptive learning rates.arXiv preprint arXiv:2305.07583, 2023
arXiv 2023
-
[58]
Sgd with adagrad stepsizes: Full adaptivity with high probability to unknown parameters, unbounded gradients and affine variance
Amit Attia and Tomer Koren. Sgd with adagrad stepsizes: Full adaptivity with high probability to unknown parameters, unbounded gradients and affine variance. InInternational Conference on Machine Learning, pages 1147–1171. PMLR, 2023
2023
-
[59]
How free is parameter-free stochastic optimization?arXiv preprint arXiv:2402.03126, 2024
Amit Attia and Tomer Koren. How free is parameter-free stochastic optimization?arXiv preprint arXiv:2402.03126, 2024
arXiv 2024
-
[60]
Dowg unleashed: An efficient universal parameter-free gradient descent method.Advances in Neural Information Processing Systems, 36:6748–6769, 2023
Ahmed Khaled, Konstantin Mishchenko, and Chi Jin. Dowg unleashed: An efficient universal parameter-free gradient descent method.Advances in Neural Information Processing Systems, 36:6748–6769, 2023
2023
-
[61]
Adaptive gradient descent without descent.arXiv preprint arXiv:1910.09529, 2019
Yura Malitsky and Konstantin Mishchenko. Adaptive gradient descent without descent.arXiv preprint arXiv:1910.09529, 2019
arXiv 1910
-
[62]
Directional smoothness and gradient methods: Convergence and adaptivity.Advances in Neural Information Processing Systems, 37: 14810–14848, 2024
Aaron Mishkin, Ahmed Khaled, Yuanhao Wang, Aaron Defazio, and Robert M Gower. Directional smoothness and gradient methods: Convergence and adaptivity.Advances in Neural Information Processing Systems, 37: 14810–14848, 2024
2024
-
[63]
Sign-sgd via parameter-free optimization
Daniil Medyakov, Stanko Sergey, Gleb Molodtsov, Philip Zmushko, Evseev Grigoriy, Egor Petrov, and Alek- sandr Beznosikov. Sign-sgd via parameter-free optimization. InThe Fourteenth International Conference on Learning Representations, 2026
2026
-
[64]
Frank–wolfe and friends: a journey into projection-free first-order optimization methods.4OR, 19(3):313–345, 2021
Immanuel M Bomze, Francesco Rinaldi, and Damiano Zeffiro. Frank–wolfe and friends: a journey into projection-free first-order optimization methods.4OR, 19(3):313–345, 2021
2021
-
[65]
Conditional gradient methods.arXiv preprint arXiv:2211.14103, 2022
G´ abor Braun, Alejandro Carderera, Cyrille W Combettes, Hamed Hassani, Amin Karbasi, Aryan Mokhtari, and Sebastian Pokutta. Conditional gradient methods.arXiv preprint arXiv:2211.14103, 2022
arXiv 2022
-
[66]
signsgd: Com- pressedoptimisationfornon-convexproblems
Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Animashree Anandkumar. signsgd: Com- pressedoptimisationfornon-convexproblems. InInternational conference on machine learning, pages560–569. PMLR, 2018
2018
-
[67]
Error feedback fixes signsgd and other gradient compression schemes
Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian Stich, and Martin Jaggi. Error feedback fixes signsgd and other gradient compression schemes. InInternational conference on machine learning, pages 3252–3261. PMLR, 2019
2019
-
[68]
Dissecting adam: The sign, magnitude and variance of stochastic gradients
Lukas Balles and Philipp Hennig. Dissecting adam: The sign, magnitude and variance of stochastic gradients. InInternational Conference on Machine Learning, pages 404–413. PMLR, 2018
2018
-
[69]
The geometry of sign gradient descent.arXiv preprint arXiv:2002.08056, 2020
Lukas Balles, Fabian Pedregosa, and Nicolas Le Roux. The geometry of sign gradient descent.arXiv preprint arXiv:2002.08056, 2020
arXiv 2002
-
[70]
Effective quantization of muon optimizer states
Aman Gupta, Rafael Celente, Abhishek Shivanna, DT Braithwaite, Gregory Dexter, Shao Tang, Hiroto Uda- gawa, Daniel Silva, Rohan Ramanath, and S Sathiya Keerthi. Effective quantization of muon optimizer states. arXiv preprint arXiv:2509.23106, 2025
arXiv 2025
-
[71]
Limuon: Light and fast muon optimizer for large models.arXiv preprint arXiv:2509.14562, 2025
Feihu Huang, Yuning Luo, and Songcan Chen. Limuon: Light and fast muon optimizer for large models.arXiv preprint arXiv:2509.14562, 2025
Pith/arXiv arXiv 2025
-
[72]
Kunjie Ren and Luo Luo. A parameter-free and near-optimal zeroth-order algorithm for stochastic convex optimization.arXiv preprint arXiv:2502.05600, 2025
arXiv 2025
-
[73]
Jiawei Chen and Alexander Rogozin. A parameter-free zeroth-order algorithm for decentralized stochastic convex optimization.arXiv preprint arXiv:2603.15219, 2026
arXiv 2026
-
[74]
Parameter-free variance reduced zeroth-order optimization for non-convex problems
Yuxing Peng, Yuanyuan Liu, Fanhua Shang, and Hongying Liu. Parameter-free variance reduced zeroth-order optimization for non-convex problems
-
[75]
Adamuon: Adaptive muon optimizer.arXiv preprint arXiv:2507.11005, 2025
Chongjie Si, Debing Zhang, and Wei Shen. Adamuon: Adaptive muon optimizer.arXiv preprint arXiv:2507.11005, 2025
arXiv 2025
-
[76]
Normuon: Making muon more efficient and scalable.arXiv preprint arXiv:2510.05491, 2025
Zichong Li, Liming Liu, Chen Liang, Weizhu Chen, and Tuo Zhao. Normuon: Making muon more efficient and scalable.arXiv preprint arXiv:2510.05491, 2025
arXiv 2025
-
[77]
Towards gradient free and projection free stochastic optimization
Anit Kumar Sahu, Manzil Zaheer, and Soummya Kar. Towards gradient free and projection free stochastic optimization. InThe 22nd International Conference on Artificial Intelligence and Statistics, pages 3468–3477. PMLR, 2019
2019
-
[78]
Parameter-free locally accelerated conditional gradients
Alejandro Carderera, Jelena Diakonikolas, Cheuk Yin Lin, and Sebastian Pokutta. Parameter-free locally accelerated conditional gradients. InInternational Conference on Machine Learning, pages 1283–1293. PMLR, 2021
2021
-
[79]
A parameter-free conditional gradient method for composite minimization under h¨ older condition.Journal of Machine Learning Research, 24(166):1–34, 2023
Masaru Ito, Zhaosong Lu, and Chuan He. A parameter-free conditional gradient method for composite minimization under h¨ older condition.Journal of Machine Learning Research, 24(166):1–34, 2023
2023
-
[80]
New aspects of black box conditional gradient: Variance reduction and one point feedback.Chaos, Solitons & Fractals, 189:115654, 2024
Andrey Veprikov, Alexander Bogdanov, Vladislav Minashkin, and Aleksandr Beznosikov. New aspects of black box conditional gradient: Variance reduction and one point feedback.Chaos, Solitons & Fractals, 189:115654, 2024
2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.