Position: Zeroth-Order Optimization in Deep Learning Is Underexplored, Not Underpowered

Bhavya Kailkhura; Changsheng Wang; Chongyu Fan; James Diffenderfer; Sijia Liu; Soumyadeep Pal; Yancheng Huang; Yicheng Lang; Yihua Zhang

arxiv: 2605.15622 · v2 · pith:FNKHUXSWnew · submitted 2026-05-15 · 💻 cs.LG

Pith reviewed 2026-05-20 21:15 UTC · model grok-4.3

classification 💻 cs.LG

keywords: zeroth-order optimization, deep learning, variance reduction, subspace methods, memory efficiency, forward-only training, black-box optimization, query complexity

The pith

Zeroth-order optimization can handle large deep learning models once development moves past full-space element-wise estimators.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that zeroth-order optimization, which updates models using only function evaluations instead of backpropagated gradients, is dismissed too quickly as unscalable. The authors trace its reported problems with variance and query cost to narrow design choices such as updating every parameter independently across the full space. They lay out six positions that reframe ZO around subspace and spectral structure for variance control, its forward-only execution for systems gains, and cleaner ways to measure performance separate from task difficulty. If these positions hold, ZO becomes a practical route to memory-light training in gray-box or hardware-limited settings. The argument rests on showing that current limitations are engineering artifacts rather than fundamental barriers.

Core claim

Zeroth-order optimization is underexplored rather than underpowered for deep learning; many of its perceived limits arise from myopic full-space, element-wise, estimator-centric designs, and shifting to subspace or spectral views, forward-only systems advantages, and de-obfuscated evaluations can open a viable path to large-scale use.

What carries the argument

Subspace and spectral views of zeroth-order estimators that deliver interpretable variance reduction together with more graceful scaling in the number of queries.

If this is right

Memory usage drops because no gradient storage or backpropagation is required, suiting very deep or distributed models.
Training pipelines become communication-efficient and pipeline-friendly since only forward passes are exchanged.
Resource-constrained or black-box settings gain a usable optimization route without needing internal gradients.
Evaluations of ZO methods can be separated from overall task hardness, revealing true algorithmic progress.
Variance-query tradeoffs become tunable through directional derivatives and spectral structure rather than brute-force sampling.
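The memory point above can be made concrete with a toy sketch. It assumes a two-point estimator with Gaussian directions and regenerates each direction from a seed instead of storing it, in the spirit of MeZO-style fine-tuning; the names `perturb_`, `zo_sgd_step`, and `toy_loss` are hypothetical and do not come from the paper. In this single-array toy the regenerated direction is still a temporary of the same size as `x`; the saving is that no gradient, activation, or direction buffer persists between evaluations.

```python
import numpy as np

def perturb_(x, seed, scale):
    """In-place x += scale * u, with u regenerated from `seed` on each call."""
    rng = np.random.default_rng(seed)
    x += scale * rng.standard_normal(x.shape)

def zo_sgd_step(loss_fn, x, lr, mu, seed):
    """One two-point ZO-SGD step that uses forward evaluations only."""
    perturb_(x, seed, +mu)
    loss_plus = loss_fn(x)                           # f(x + mu * u)
    perturb_(x, seed, -2.0 * mu)
    loss_minus = loss_fn(x)                          # f(x - mu * u)
    perturb_(x, seed, +mu)                           # restore x (up to rounding)
    slope = (loss_plus - loss_minus) / (2.0 * mu)    # directional derivative estimate
    perturb_(x, seed, -lr * slope)                   # x -= lr * slope * u
    return 0.5 * (loss_plus + loss_minus)

def toy_loss(x):
    """Hypothetical objective: a simple quadratic bowl."""
    return 0.5 * float(x @ x)

x = np.ones(10)
initial_loss = toy_loss(x)
for step in range(200):
    zo_sgd_step(toy_loss, x, lr=0.05, mu=1e-3, seed=step)
final_loss = toy_loss(x)
```

Only two scalar losses and one integer seed need to be kept per step, which is also why forward-only schemes of this kind can be cheap to communicate.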

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same subspace framing could extend naturally to non-differentiable objectives in reinforcement learning or combinatorial search.
Hardware with limited memory bandwidth might adopt ZO as a default for on-device adaptation once the query overhead is controlled.
Future work could test whether spectral decompositions allow ZO to match first-order performance on specific layer types rather than whole models.

Load-bearing premise

The proposed changes to subspace and spectral designs plus forward-only execution will cut variance and query demands enough to support practical training of large deep networks.

What would settle it

A direct head-to-head run on a standard large model such as a transformer or ResNet-50 where even subspace ZO variants still require orders of magnitude more queries than backpropagation or fail to reach comparable accuracy.

Figures

Figures reproduced from arXiv: 2605.15622 by Bhavya Kailkhura, Changsheng Wang, Chongyu Fan, James Diffenderfer, Sijia Liu, Soumyadeep Pal, Yancheng Huang, Yicheng Lang, Yihua Zhang.

**Figure 1.** Schematic overview of this position paper. (Left) Publication trends (papers per year) for works with “ZO optimization” in the title in arXiv cs.AI and cs.LG (machine learning). (Right) Conceptual organization of our positioning points (P1–P6).

**Figure 2.** Fine-tuning accuracy of ZO optimization methods, including MeZO (Malladi et al., 2023), Sparse-MeZO (Liu et al., 2025b), HiZOO (Zhao et al., 2025b), and LOZO (Chen et al., 2025), on the SST-2, RTE, and WiC downstream tasks under task-aligned (w/ align) and non-aligned (w/o align) settings.

Original abstract

Zeroth-order (ZO) optimization, learning from finite differences of function evaluations without backpropagation, has recently regained attention in deep learning due to its memory efficiency and applicability to gray- or black-box pipelines. Yet, ZO methods are often dismissed as fundamentally unscalable because of estimator variance and unfavorable query complexity. We argue that this conclusion might be misguided: ZO optimization is underexplored, not underpowered. We show that many perceived limitations stem from myopic development practices, most notably full-space, element-wise, estimator-centric designs. We articulate six positions spanning the algorithmic, systems, and evaluation stack. First, we revisit the feasibility boundaries of estimator-centric ZO methods through variance control, variance-query tradeoffs, and directional-derivative lenses. Then, we identify three underexplored opportunities: (i) subspace and spectral views of ZO that enable interpretable variance reduction with graceful query scaling, (ii) the forward-only nature of ZO as a systems advantage for communication-efficient, pipeline-friendly, and resource-constrained training, and (iii) the need to de-obfuscate ZO evaluations from task complexity. We strongly advocate rethinking ZO optimization around its unique strengths and acting accordingly, opening a viable path toward large-scale, system-aware, and resource-efficient learning with ZO optimization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript is a position paper arguing that zeroth-order (ZO) optimization in deep learning is underexplored rather than underpowered. It attributes perceived limitations such as high estimator variance and unfavorable query complexity to myopic design practices, notably full-space element-wise estimator-centric approaches. The authors articulate six positions across the algorithmic, systems, and evaluation stack, advocating subspace/spectral views for variance reduction, forward-only systems advantages for communication efficiency, and de-obfuscated evaluations to support large-scale resource-efficient learning.

Significance. If the positions are pursued, the paper could usefully redirect research toward ZO methods' memory and black-box strengths for constrained or pipeline settings. The conceptual reframing of limitations as artifacts of design choices rather than fundamentals is a constructive contribution, and the emphasis on forward-only computation as a systems asset is a clear strength that could inform future work on communication-efficient training.

major comments (2)

[Abstract and the six positions section] The claim that subspace and spectral views enable interpretable variance reduction with graceful query scaling (abstract and positions on algorithmic opportunities) is central to the argument that ZO can scale to large deep learning. The manuscript offers only high-level conceptual analysis without a concrete bound, example derivation, or cited prior result showing how query complexity improves relative to full-space estimators in high-dimensional models.
[Positions on underexplored opportunities] The assertion that the proposed shifts (subspace views, forward-only advantages, de-obfuscated evaluations) will sufficiently mitigate variance and query complexity for large-scale use (weakest assumption noted in reader's take) lacks supporting analysis or falsifiable prediction within the manuscript, leaving the viability claim for deep learning dependent on unshown future developments.

minor comments (2)

[Abstract] The abstract refers to 'six positions' without enumerating them; a short numbered list in the abstract or introduction would improve readability.
[Introduction] Terminology such as 'myopic development practices' and 'estimator-centric designs' would benefit from a brief operational definition or example in the introduction to make the critique more precise.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and the recommendation of minor revision. The feedback helps clarify how to better support the central claims of this position paper. Below we respond point-by-point to the major comments, indicating where revisions will be made.

Referee: [Abstract and the six positions section] The claim that subspace and spectral views enable interpretable variance reduction with graceful query scaling (abstract and positions on algorithmic opportunities) is central to the argument that ZO can scale to large deep learning. The manuscript offers only high-level conceptual analysis without a concrete bound, example derivation, or cited prior result showing how query complexity improves relative to full-space estimators in high-dimensional models.

Authors: We appreciate this observation. As a position paper our primary aim is conceptual reframing rather than exhaustive derivation; however, we agree that the scaling claim benefits from additional grounding. In the revised manuscript we will add citations to existing subspace and spectral ZO results that report concrete variance reductions and improved query scaling relative to full-space estimators. We will also include a short illustrative derivation sketch (in an appendix) showing how restricting perturbations to a k-dimensional subspace yields query complexity that scales gracefully with k rather than ambient dimension d. revision: yes
Referee: [Positions on underexplored opportunities] The assertion that the proposed shifts (subspace views, forward-only advantages, de-obfuscated evaluations) will sufficiently mitigate variance and query complexity for large-scale use (weakest assumption noted in reader's take) lacks supporting analysis or falsifiable prediction within the manuscript, leaving the viability claim for deep learning dependent on unshown future developments.

Authors: We acknowledge that the manuscript, being positional, does not contain exhaustive supporting analysis or empirical validation for every proposed shift; its purpose is to identify directions rather than to close them. To address the concern we will expand the relevant section with a short discussion of falsifiable predictions, including expected variance-reduction factors under subspace sampling and communication savings from forward-only execution. These additions will make the viability argument more concrete while preserving the paper's role in guiding future work. revision: partial
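For orientation, the k-versus-d scaling discussed in the first exchange can be sketched with standard Gaussian-smoothing identities. This is an illustrative calculation under stated assumptions, not the appendix derivation the authors promise: Gaussian directions, a matrix P with orthonormal columns, and the small-µ limit in which the finite difference equals the directional derivative.

```latex
% Assumptions (not from the paper): u ~ N(0, I_d), z ~ N(0, I_k),
% P in R^{d x k} with P^T P = I_k, and mu -> 0.
\begin{align*}
\hat{g}_{\text{full}} &= (\nabla f(x)^\top u)\, u,
  & \mathbb{E}[\hat{g}_{\text{full}}] &= \nabla f(x),
  & \mathbb{E}\|\hat{g}_{\text{full}}\|^2 &= (d+2)\,\|\nabla f(x)\|^2, \\
\hat{g}_{\text{sub}} &= (\nabla f(x)^\top P z)\, P z,
  & \mathbb{E}[\hat{g}_{\text{sub}}] &= P P^\top \nabla f(x),
  & \mathbb{E}\|\hat{g}_{\text{sub}}\|^2 &= (k+2)\,\|P^\top \nabla f(x)\|^2.
\end{align*}
```

Under these assumptions the second moment grows with k rather than d, at the price of a bias: the subspace estimator only recovers the projection of the gradient onto the span of P. How much of the gradient that projection retains is the open empirical question the referee raises.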

Circularity Check

0 steps flagged

No significant circularity

The manuscript is a position paper that articulates six interpretive positions advocating shifts in zeroth-order optimization design. It contains no equations, derivations, fitted parameters, predictions, or theorems. The central claim rests on conceptual analysis of existing practices rather than any self-referential reduction, self-citation chain, or ansatz that could be equivalent to its inputs by construction. All load-bearing steps are external to the paper's own content and do not exhibit the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on reinterpretation of known ZO challenges without new parameters or entities; it assumes alternative design paradigms can address variance issues.

axioms (1)

domain assumption: Perceived limitations of ZO methods arise primarily from full-space, element-wise, estimator-centric designs.
Directly stated in the abstract as the source of misguided conclusions about scalability.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: A vanilla ZO gradient estimator ... $\hat{\nabla}_x f(x) = \frac{f(x+\mu u) - f(x)}{\mu}\, u$ (RGE)

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: subspace RGE (S-RGE) ... $\hat{\nabla}_x f(x) = \frac{f(x+\mu P u) - f(x)}{\mu}\, P u$
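The two quoted estimators can be written out as a short NumPy sketch. It makes assumptions the passages do not spell out: Gaussian directions, a forward difference with smoothing parameter `mu`, and an orthonormal basis drawn by QR. The helper `random_orthonormal_basis` and the linear test function are hypothetical illustrations, not the paper's construction of P.

```python
import numpy as np

def rge(f, x, u, mu=1e-3):
    """Forward-difference randomized gradient estimate along direction u."""
    return (f(x + mu * u) - f(x)) / mu * u

def subspace_rge(f, x, P, z, mu=1e-3):
    """Same estimator with the direction restricted to the column span of P.

    P is a (d, k) matrix and z a k-dimensional random vector, so the
    perturbation P @ z lives in a k-dimensional subspace of R^d.
    """
    return rge(f, x, P @ z, mu)

def random_orthonormal_basis(d, k, rng):
    """Hypothetical helper: a (d, k) matrix with orthonormal columns via QR."""
    Q, _ = np.linalg.qr(rng.standard_normal((d, k)))
    return Q

rng = np.random.default_rng(0)
d, k = 50, 5
a = rng.standard_normal(d)

def f(x):
    """Linear test function whose true gradient is `a`."""
    return float(a @ x)

x = np.zeros(d)

u = rng.standard_normal(d)
g_full = rge(f, x, u)                 # equals (a @ u) * u for a linear f

P = random_orthonormal_basis(d, k, rng)
z = rng.standard_normal(k)
g_sub = subspace_rge(f, x, P, z)      # lies in the column span of P
```

A linear test function is used because the forward difference is then exact, which makes both estimates easy to check by hand.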

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

135 extracted references · 135 canonical work pages · 12 internal anchors

[1] Zeroth-order optimization with orthogonal random directions. Mathematical Programming, 2023.
[2] Powering Up Zeroth-Order Training via Subspace Gradient Orthogonalization. arXiv preprint arXiv:2602.17155.
[3] Converge Faster, Talk Less: Hessian-Informed Federated Zeroth-Order Optimization. The Fourteenth International Conference on Learning Representations.
[4] Online Pseudo-Zeroth-Order Training of Neuromorphic Spiking Neural Networks. The Fourteenth International Conference on Learning Representations.
[5] Natural perturbations for black-box training of neural networks by zeroth-order optimization. Forty-second International Conference on Machine Learning.
[6] Refining Adaptive Zeroth-Order Optimization at Ease.
[7] Zeroth-order random subspace algorithm for non-smooth convex optimization. Journal of Optimization Theory and Applications, 2025.
[8] Hessian-aware zeroth-order optimization. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[9] The Multi-Query Paradox in Zeroth-Order Optimization. arXiv preprint arXiv:2509.15552.
[10] Harmony in divergence: Towards fast, accurate, and memory-efficient zeroth-order LLM fine-tuning. arXiv preprint arXiv:2502.03304.
[11] Gemma 2: Improving Open Language Models at a Practical Size. arXiv preprint arXiv:2408.00118.
[12] SuperGLUE: A stickier benchmark for general-purpose language understanding systems. Advances in Neural Information Processing Systems.
[13] Zeroth-Order Optimization is Secretly Single-Step Policy Optimization. arXiv preprint arXiv:2506.14460.
[14] Towards Efficient Low-Order Hybrid Optimizer for Language Model Fine-Tuning. Proceedings of the AAAI Conference on Artificial Intelligence.
[15] Relizo: Sample reusable linear interpolation-based zeroth-order optimization. Advances in Neural Information Processing Systems.
[16] Zeroth-order fine-tuning of LLMs in random subspaces. Proceedings of the IEEE/CVF International Conference on Computer Vision.
[17] Leveraging Coordinate Momentum in SignSGD and Muon: Memory-Optimized Zero-Order. arXiv preprint arXiv:2506.04430.
[18] Elucidating Subspace Perturbation in Zeroth-Order Optimization: Theory and Practice at Scale. arXiv preprint arXiv:2501.19099.
[19] Deep learning with differential privacy. Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, 2016.
[20] Communication-efficient stochastic zeroth-order optimization for federated learning. IEEE Transactions on Signal Processing, 2022.
[21] Achieving dimension-free communication in federated learning via zeroth-order optimization. International Conference on Learning Representations.
[22] Low-rank curvature for zeroth-order optimization in LLM fine-tuning. Proceedings of the AAAI Conference on Artificial Intelligence.
[23] Enabling fast differentially private SGD via just-in-time compilation and vectorization. Advances in Neural Information Processing Systems.
[24] Yanjun Zhao, Sizhe Dang, Haishan Ye, Guang Dai, Yi Qian, Ivor Tsang. Second-Order Fine-Tuning without Pain for … 2025.
[25] MUZO: Leveraging Multiple Queries and Momentum for Zeroth-Order Fine-Tuning of Large Language Models. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025.
[26] Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, Jeremy Bernstein. 2024.
[27] Low-rank Orthogonalization for Large-scale Matrix Optimization with Applications to Foundation Model Training. arXiv preprint arXiv:2509.11983.
[28] Practical efficiency of Muon for pretraining. arXiv preprint arXiv:2505.02222.
[29] Decoupled Weight Decay Regularization. arXiv preprint arXiv:1711.05101.
[30] Dion: Distributed Orthonormalized Updates. arXiv preprint arXiv:2504.05295.
[31] On the Convergence Analysis of Muon. arXiv preprint arXiv:2505.23737.
[32] Muon is Scalable for LLM Training. arXiv preprint arXiv:2502.16982.
[33] Kimi K2: Open Agentic Intelligence. arXiv preprint arXiv:2507.20534.
[34] Newton sketch: A near linear-time optimization algorithm with linear-quadratic convergence. SIAM Journal on Optimization, 2017.
[35] Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, Yuandong Tian. GaLore: Memory-Efficient … 2024.
[36] No task left behind: Isotropic model merging with common and task-specific subspaces. arXiv preprint arXiv:2502.04959.
[37] Sophia: A scalable stochastic second-order optimizer for language model pre-training. arXiv preprint arXiv:2305.14342.
[38] Symbolic discovery of optimization algorithms. Advances in Neural Information Processing Systems.
[39] Rethinking machine unlearning for large language models. Nature Machine Intelligence, 2025.
[40] Hanzhen Zhao, Shihong Ding, Cong Fang, Zhouchen Lin. Pa… 2025.
[41] Learning to optimize: A primer and a benchmark. Journal of Machine Learning Research.
[42] Chongyu Fan, Jinghan Jia, Yihua Zhang, Anil Ramakrishna, Mingyi Hong, Sijia Liu. Towards… 2025.
[43] Sharpness-aware Minimization for Efficiently Improving Generalization. International Conference on Learning Representations.
[44] Yong Liu, Zirui Zhu, Chaoyu Gong, Minhao Cheng, Cho-Jui Hsieh, Yang You. Sparse Me… 2025.
[45] Pengyun Yue, Xuanlin Yang, Mingqing Xiao, Zhouchen Lin. Pseu… 2025.
[46] Yifan Yang, Zhen Zhang, Rupak Vignesh Swaminathan, Jing Liu, Nathan Susanj, Zheng Zhang. Sharp… 2025.
[47] Private Zeroth-Order Optimization with Public Data. The Thirty-ninth Annual Conference on Neural Information Processing Systems.
[48] Fast Zeroth-Order Convex Optimization with Quantum Gradient Methods. The Thirty-ninth Annual Conference on Neural Information Processing Systems.
[49] DeepZero: Scaling Up Zeroth-Order Optimization for Deep Model Training. The Twelfth International Conference on Learning Representations.
[50] Sizhe Dang, yangyangGuo, Yanjun Zhao, Xiaodong Zheng, Guang Dai, Ivor Tsang, Haishan Ye. 2026.
[51] Variance-reduced zeroth-order methods for fine-tuning language models. arXiv preprint arXiv:2404.08080.
[52] Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv preprint arXiv:1909.08053.
[53] PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel. arXiv preprint arXiv:2304.11277.
[54] Yihua Zhang, Pingzhi Li, Junyuan Hong, Jiaxiang Li, Yimeng Zhang, Wenqing Zheng, Pin-Yu Chen, Jason D. Lee, Wotao Yin, Mingyi Hong, Zhangyang Wang, Sijia Liu, Tianlong Chen. Proceedings of the 41st International Conference on Machine Learning (ICML).
[55] Gradient-based algorithms for zeroth-order optimization. Foundations and Trends in Optimization, 2025.
[56] Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory, 2015.
[57] Gradients without backpropagation. arXiv preprint arXiv:2202.08587.
[58] Scaling Forward Gradient With Local Losses. The Eleventh International Conference on Learning Representations.
[59] Learning by directional gradient descent. International Conference on Learning Representations.
[60] Optimization without backpropagation. arXiv preprint arXiv:2209.06302.
[61] Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 2017.
[62] Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Transactions on Automatic Control, 2002.
[63] A primer on zeroth-order optimization in signal processing and machine learning: Principals, recent advances, and applications. IEEE Signal Processing Magazine, 2020.
[64] Enhancing Zeroth-order Fine-tuning for Language Models with Low-rank Structures. The Thirteenth International Conference on Learning Representations.
[65] Direct feedback alignment provides learning in deep neural networks. Advances in Neural Information Processing Systems.
[66] Direct feedback alignment scales to modern deep learning tasks and architectures. Advances in Neural Information Processing Systems.
[67] Align, then memorise: the dynamics of learning with feedback alignment. arXiv preprint arXiv:2011.12428.
[68] Numerical solution of a minimum problem. 1952.
[69] Loss landscapes are all you need: Neural network generalization can be explained without the implicit bias of gradient descent. The Eleventh International Conference on Learning Representations.
[70] Trust region methods. 2000.
[71] Yicheng Lang, Yihua Zhang, Chongyu Fan, Changsheng Wang, Jinghan Jia, Sijia Liu. Downgrade to Upgrade: Optimizer Simplification Enhances Robustness in … 2026.
[72] Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 2015.
[73] A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 2002.
[74] Genetic algorithms and machine learning. Proceedings of the Sixth Annual Conference on Computational Learning Theory.
[75] Particle swarm optimization. Proceedings of ICNN'95 - International Conference on Neural Networks, 1995.
[76] PSwarm: a hybrid solver for linearly constrained global derivative-free optimization. Optimization Methods & Software, 2009.
[77] The simplex gradient and noisy optimization problems. Computational Methods for Optimal Design and Control: Proceedings of the AFOSR Workshop on Optimal Design and Control, Arlington, Virginia, 30 September-3 October, 1997. 1998.
[78] On the convergence of the multidirectional search algorithm. SIAM Journal on Optimization, 1991.
[79] Biologically-plausible learning algorithms can scale to large datasets. arXiv preprint arXiv:1811.03567.
[80] Zeroth-order online alternating direction method of multipliers: Convergence analysis and applications. International Conference on Artificial Intelligence and Statistics, 2018.

Showing first 80 references.
