
arxiv: 2605.15622 · v2 · pith:FNKHUXSWnew · submitted 2026-05-15 · 💻 cs.LG

Position: Zeroth-Order Optimization in Deep Learning Is Underexplored, Not Underpowered

Pith reviewed 2026-05-20 21:15 UTC · model grok-4.3

classification 💻 cs.LG
keywords zeroth-order optimization · deep learning · variance reduction · subspace methods · memory efficiency · forward-only training · black-box optimization · query complexity

The pith

Zeroth-order optimization can handle large deep learning models once development moves past full-space element-wise estimators.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that zeroth-order optimization, which updates models using only function evaluations instead of backpropagated gradients, is dismissed too quickly as unscalable. The authors trace its reported problems with variance and query cost to narrow design choices such as updating every parameter independently across the full space. They lay out six positions that reframe ZO around subspace and spectral structure for variance control, its forward-only execution for systems gains, and cleaner ways to measure performance separate from task difficulty. If these positions hold, ZO becomes a practical route to memory-light training in gray-box or hardware-limited settings. The argument rests on showing that current limitations are engineering artifacts rather than fundamental barriers.
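As a concrete picture of "updates using only function evaluations", here is a minimal sketch of a two-point randomized ZO estimator driving plain SGD. The function names, the toy quadratic loss, the smoothing radius `mu`, and the step size are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def zo_gradient(loss, theta, rng, mu=1e-3):
    """Two-point ZO estimate: a finite difference along one random direction."""
    u = rng.standard_normal(theta.shape)           # random search direction
    slope = (loss(theta + mu * u) - loss(theta - mu * u)) / (2.0 * mu)
    return slope * u                               # directional slope times direction

def zo_sgd(loss, theta0, steps=2000, lr=0.01, seed=0):
    """Gradient-free SGD: every step uses two loss evaluations and no backprop."""
    rng = np.random.default_rng(seed)
    theta = theta0.copy()
    for _ in range(steps):
        theta -= lr * zo_gradient(loss, theta, rng)
    return theta

def quadratic(theta):
    """Toy objective (an assumption for illustration only)."""
    return 0.5 * float(np.sum(theta ** 2))

theta_start = np.ones(10)
theta_end = zo_sgd(quadratic, theta_start)
print(quadratic(theta_start), quadratic(theta_end))
```

The loop never touches a gradient or stores activations; the cost shows up instead as noise in each step, which is the variance problem the paper's positions are concerned with.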

Core claim

Zeroth-order optimization is underexplored rather than underpowered for deep learning; many of its perceived limits arise from myopic full-space, element-wise, estimator-centric designs, and shifting to subspace or spectral views, forward-only systems advantages, and de-obfuscated evaluations can open a viable path to large-scale use.

What carries the argument

Subspace and spectral views of zeroth-order estimators that deliver interpretable variance reduction together with more graceful scaling in the number of queries.
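One possible reading of a "subspace view" is to draw perturbations inside a fixed k-dimensional subspace instead of the full parameter space. The sketch below assumes a random orthonormal basis and a toy quadratic loss; `random_orthonormal_basis`, `subspace_zo_gradient`, and `toy_loss` are hypothetical names, and the paper may have different constructions in mind.

```python
import numpy as np

def random_orthonormal_basis(dim, k, rng):
    """Return a (dim, k) matrix with orthonormal columns via reduced QR."""
    q, _ = np.linalg.qr(rng.standard_normal((dim, k)))
    return q

def subspace_zo_gradient(loss, theta, basis, rng, mu=1e-3):
    """Two-point ZO estimate whose perturbation lives in span(basis)."""
    z = rng.standard_normal(basis.shape[1])        # k random coefficients
    u = basis @ z                                  # lift to the full space
    slope = (loss(theta + mu * u) - loss(theta - mu * u)) / (2.0 * mu)
    return slope * u

def toy_loss(theta):
    """Toy objective (an assumption for illustration only)."""
    return 0.5 * float(np.sum(theta ** 2))

rng = np.random.default_rng(0)
dim, k = 200, 5
theta = np.ones(dim)
basis = random_orthonormal_basis(dim, k, rng)
estimate = subspace_zo_gradient(toy_loss, theta, basis, rng)

# The estimate never leaves the k-dimensional span of `basis`.
in_span = np.allclose(basis @ (basis.T @ estimate), estimate)
print(in_span)
```

The trade-off in this sketch is explicit: the randomness now has k degrees of freedom rather than `dim`, but the estimator only sees the part of the gradient that lies in the chosen subspace, so the choice of basis matters.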

If this is right

  • Memory usage drops because no gradient storage or backpropagation is required, suiting very deep or distributed models.
  • Training pipelines become communication-efficient and pipeline-friendly since only forward passes are exchanged.
  • Resource-constrained or black-box settings gain a usable optimization route without needing internal gradients.
  • Evaluations of ZO methods can be separated from overall task hardness, revealing true algorithmic progress.
  • Variance-query tradeoffs become tunable through directional derivatives and spectral structure rather than brute-force sampling.
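The memory and communication points above can be sketched with a seed-replay scheme in the style popularized by MeZO: the perturbation is regenerated from a seed rather than stored, and a step is fully described by a (seed, scalar) pair. All names here (`perturb_in_place`, `zo_step_message`, `apply_message`) and the toy two-array "model" are assumptions for illustration, not an API from the paper.

```python
import numpy as np

def perturb_in_place(params, seed, scale):
    """Add scale * z to each array, regenerating z from the seed each time."""
    rng = np.random.default_rng(seed)
    for p in params:
        p += scale * rng.standard_normal(p.shape)

def zo_step_message(loss, params, seed, mu=1e-3):
    """Run two forward evaluations and return the (seed, slope) pair."""
    perturb_in_place(params, seed, +mu)
    loss_plus = loss(params)
    perturb_in_place(params, seed, -2.0 * mu)
    loss_minus = loss(params)
    perturb_in_place(params, seed, +mu)            # restore, up to rounding
    return seed, (loss_plus - loss_minus) / (2.0 * mu)

def apply_message(params, message, lr):
    """Reconstruct the update from the seed and one scalar."""
    seed, slope = message
    perturb_in_place(params, seed, -lr * slope)

def toy_loss(params):
    """Toy objective over a list of arrays (illustration only)."""
    return sum(0.5 * float(np.sum(p ** 2)) for p in params)

worker = [np.ones((4, 3)), np.ones(3)]
peer = [p.copy() for p in worker]
initial_loss = toy_loss(worker)
for step in range(200):
    msg = zo_step_message(toy_loss, worker, seed=step)
    apply_message(worker, msg, lr=0.02)
    apply_message(peer, msg, lr=0.02)              # peer never evaluates the loss
final_loss = toy_loss(worker)
print(initial_loss, final_loss)
```

In this sketch the peer replica tracks the worker while receiving only an integer and a float per step, which is the kind of dimension-independent message that forward-only training makes possible.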

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same subspace framing could extend naturally to non-differentiable objectives in reinforcement learning or combinatorial search.
  • Hardware with limited memory bandwidth might adopt ZO as a default for on-device adaptation once the query overhead is controlled.
  • Future work could test whether spectral decompositions allow ZO to match first-order performance on specific layer types rather than whole models.

Load-bearing premise

The proposed changes to subspace and spectral designs plus forward-only execution will cut variance and query demands enough to support practical training of large deep networks.

What would settle it

A direct head-to-head run on a standard large model such as a transformer or ResNet-50 where even subspace ZO variants still require orders of magnitude more queries than backpropagation or fail to reach comparable accuracy.

Figures

Figures reproduced from arXiv: 2605.15622 by Bhavya Kailkhura, Changsheng Wang, Chongyu Fan, James Diffenderfer, Sijia Liu, Soumyadeep Pal, Yancheng Huang, Yicheng Lang, Yihua Zhang.

Figure 1: Schematic overview of this position paper. (Left) Publication trends (papers per year) for works with “ZO optimization” in the title in arXiv cs.AI and cs.LG (machine learning). (Right) Conceptual organization of our positioning points (P1–P6).
Figure 2: Fine-tuning accuracy of ZO optimization methods, including MeZO (Malladi et al., 2023), Sparse-MeZO (Liu et al., 2025b), HiZOO (Zhao et al., 2025b), and LOZO (Chen et al., 2025), on the SST-2, RTE, and WiC downstream tasks under task-aligned (w/ align) and non-aligned (w/o align) settings.
Original abstract

Zeroth-order (ZO) optimization, learning from finite differences of function evaluations without backpropagation, has recently regained attention in deep learning due to its memory efficiency and applicability to gray- or black-box pipelines. Yet, ZO methods are often dismissed as fundamentally unscalable because of estimator variance and unfavorable query complexity. We argue that this conclusion might be misguided: ZO optimization is underexplored, not underpowered. We show that many perceived limitations stem from myopic development practices, most notably full-space, element-wise, estimator-centric designs. We articulate six positions spanning the algorithmic, systems, and evaluation stack. First, we revisit the feasibility boundaries of estimator-centric ZO methods through variance control, variance-query tradeoffs, and directional-derivative lenses. Then, we identify three underexplored opportunities: (i) subspace and spectral views of ZO that enable interpretable variance reduction with graceful query scaling, (ii) the forward-only nature of ZO as a systems advantage for communication-efficient, pipeline-friendly, and resource-constrained training, and (iii) the need to de-obfuscate ZO evaluations from task complexity. We strongly advocate rethinking ZO optimization around its unique strengths and acting accordingly, opening a viable path toward large-scale, system-aware, and resource-efficient learning with ZO optimization.
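The variance-query tradeoff mentioned in the abstract can be illustrated with the simplest possible device: averaging several independent two-point estimates. This is a sketch under assumed settings (toy quadratic loss, Gaussian directions, hypothetical function names), where each query costs two loss evaluations.

```python
import numpy as np

def toy_loss(theta):
    """Toy objective whose true gradient is theta (illustration only)."""
    return 0.5 * float(np.sum(theta ** 2))

def zo_gradient_avg(loss, theta, rng, num_queries, mu=1e-3):
    """Average num_queries independent two-point ZO estimates."""
    est = np.zeros_like(theta)
    for _ in range(num_queries):
        u = rng.standard_normal(theta.shape)
        slope = (loss(theta + mu * u) - loss(theta - mu * u)) / (2.0 * mu)
        est += slope * u
    return est / num_queries

def mean_sq_error(num_queries, trials=200, dim=50, seed=0):
    """Empirical mean squared error against the known gradient."""
    rng = np.random.default_rng(seed)
    theta = np.ones(dim)
    errors = [
        float(np.sum((zo_gradient_avg(toy_loss, theta, rng, num_queries) - theta) ** 2))
        for _ in range(trials)
    ]
    return float(np.mean(errors))

err_q1 = mean_sq_error(1)
err_q16 = mean_sq_error(16)
print(err_q1, err_q16)
```

Because the averaged estimates are independent, the error falls roughly in proportion to the number of queries, so brute-force averaging buys variance reduction only at a linear cost in forward passes. That is the baseline the paper's subspace and spectral proposals aim to improve on.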

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript is a position paper arguing that zeroth-order (ZO) optimization in deep learning is underexplored rather than underpowered. It attributes perceived limitations such as high estimator variance and unfavorable query complexity to myopic design practices, notably full-space element-wise estimator-centric approaches. The authors articulate six positions across the algorithmic, systems, and evaluation stack, advocating subspace/spectral views for variance reduction, forward-only systems advantages for communication efficiency, and de-obfuscated evaluations to support large-scale resource-efficient learning.

Significance. If the positions are pursued, the paper could usefully redirect research toward ZO methods' memory and black-box strengths for constrained or pipeline settings. The conceptual reframing of limitations as artifacts of design choices rather than fundamentals is a constructive contribution, and the emphasis on forward-only computation as a systems asset is a clear strength that could inform future work on communication-efficient training.

major comments (2)
  1. [Abstract and the six positions section] The claim that subspace and spectral views enable interpretable variance reduction with graceful query scaling (abstract and positions on algorithmic opportunities) is central to the argument that ZO can scale to large deep learning. The manuscript offers only high-level conceptual analysis without a concrete bound, example derivation, or cited prior result showing how query complexity improves relative to full-space estimators in high-dimensional models.
  2. [Positions on underexplored opportunities] The assertion that the proposed shifts (subspace views, forward-only advantages, de-obfuscated evaluations) will sufficiently mitigate variance and query complexity for large-scale use (weakest assumption noted in reader's take) lacks supporting analysis or falsifiable prediction within the manuscript, leaving the viability claim for deep learning dependent on unshown future developments.
minor comments (2)
  1. [Abstract] The abstract refers to 'six positions' without enumerating them; a short numbered list in the abstract or introduction would improve readability.
  2. [Introduction] Terminology such as 'myopic development practices' and 'estimator-centric designs' would benefit from a brief operational definition or example in the introduction to make the critique more precise.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and the recommendation of minor revision. The feedback helps clarify how to better support the central claims of this position paper. Below we respond point-by-point to the major comments, indicating where revisions will be made.

Point-by-point responses
  1. Referee: [Abstract and the six positions section] The claim that subspace and spectral views enable interpretable variance reduction with graceful query scaling (abstract and positions on algorithmic opportunities) is central to the argument that ZO can scale to large deep learning. The manuscript offers only high-level conceptual analysis without a concrete bound, example derivation, or cited prior result showing how query complexity improves relative to full-space estimators in high-dimensional models.

    Authors: We appreciate this observation. As a position paper our primary aim is conceptual reframing rather than exhaustive derivation; however, we agree that the scaling claim benefits from additional grounding. In the revised manuscript we will add citations to existing subspace and spectral ZO results that report concrete variance reductions and improved query scaling relative to full-space estimators. We will also include a short illustrative derivation sketch (in an appendix) showing how restricting perturbations to a k-dimensional subspace yields query complexity that scales gracefully with k rather than ambient dimension d. revision: yes

  2. Referee: [Positions on underexplored opportunities] The assertion that the proposed shifts (subspace views, forward-only advantages, de-obfuscated evaluations) will sufficiently mitigate variance and query complexity for large-scale use (weakest assumption noted in reader's take) lacks supporting analysis or falsifiable prediction within the manuscript, leaving the viability claim for deep learning dependent on unshown future developments.

    Authors: We acknowledge that the manuscript, being positional, does not contain exhaustive supporting analysis or empirical validation for every proposed shift; its purpose is to identify directions rather than to close them. To address the concern we will expand the relevant section with a short discussion of falsifiable predictions, including expected variance-reduction factors under subspace sampling and communication savings from forward-only execution. These additions will make the viability argument more concrete while preserving the paper's role in guiding future work. revision: partial
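The derivation sketch mentioned in the first response is not part of this page. One way such a sketch might go is shown below, under assumptions that are ours rather than the authors': Gaussian perturbations, a basis P with orthonormal columns, and the small-smoothing limit in which the finite difference equals the directional derivative. The symbols P, z, k, and d are illustrative notation.

```latex
\begin{aligned}
\hat g_{\mathrm{full}} &= \big(u^\top \nabla f(\theta)\big)\, u,
  \quad u \sim \mathcal{N}(0, I_d)
  &&\Rightarrow\quad
  \mathbb{E}[\hat g_{\mathrm{full}}] = \nabla f(\theta), \quad
  \mathbb{E}\,\|\hat g_{\mathrm{full}}\|^2 = (d+2)\,\|\nabla f(\theta)\|^2, \\
\hat g_{\mathrm{sub}} &= \big(z^\top P^\top \nabla f(\theta)\big)\, P z,
  \quad z \sim \mathcal{N}(0, I_k),\ P^\top P = I_k
  &&\Rightarrow\quad
  \mathbb{E}[\hat g_{\mathrm{sub}}] = P P^\top \nabla f(\theta), \quad
  \mathbb{E}\,\|\hat g_{\mathrm{sub}}\|^2 = (k+2)\,\|P^\top \nabla f(\theta)\|^2 .
\end{aligned}
```

Both lines follow from the Gaussian identity \(\mathbb{E}[(z^\top a)^2\, z z^\top] = \|a\|^2 I + 2\,a a^\top\). In this idealized setting the second moment scales with k instead of d, at the price of estimating only the projected gradient \(P P^\top \nabla f(\theta)\).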

Circularity Check

0 steps flagged

No significant circularity

full rationale

The manuscript is a position paper that articulates six interpretive positions advocating shifts in zeroth-order optimization design. It contains no equations, derivations, fitted parameters, predictions, or theorems. The central claim rests on conceptual analysis of existing practices rather than any self-referential reduction, self-citation chain, or ansatz that could be equivalent to its inputs by construction. All load-bearing steps are external to the paper's own content and do not exhibit the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on reinterpretation of known ZO challenges without new parameters or entities; it assumes alternative design paradigms can address variance issues.

axioms (1)
  • domain assumption: Perceived limitations of ZO methods arise primarily from full-space, element-wise, estimator-centric designs.
    Directly stated in the abstract as the source of misguided conclusions about scalability.

pith-pipeline@v0.9.0 · 5807 in / 1055 out tokens · 74092 ms · 2026-05-20T21:15:32.947085+00:00 · methodology



Reference graph

Works this paper leans on

135 extracted references · 135 canonical work pages · 12 internal anchors

  1. [1] Zeroth-order optimization with orthogonal random directions. Mathematical Programming, 2023.

  2. [2] Powering Up Zeroth-Order Training via Subspace Gradient Orthogonalization. arXiv preprint arXiv:2602.17155.

  3. [3] Converge Faster, Talk Less: Hessian-Informed Federated Zeroth-Order Optimization. The Fourteenth International Conference on Learning Representations.

  4. [4] Online Pseudo-Zeroth-Order Training of Neuromorphic Spiking Neural Networks. The Fourteenth International Conference on Learning Representations.

  5. [5] Natural perturbations for black-box training of neural networks by zeroth-order optimization. Forty-second International Conference on Machine Learning.

  6. [6] Refining Adaptive Zeroth-Order Optimization at Ease.

  7. [7] Zeroth-order random subspace algorithm for non-smooth convex optimization. Journal of Optimization Theory and Applications, 2025.

  8. [8] Hessian-aware zeroth-order optimization. IEEE Transactions on Pattern Analysis and Machine Intelligence.

  9. [9] The Multi-Query Paradox in Zeroth-Order Optimization. arXiv preprint arXiv:2509.15552.

  10. [10] Harmony in divergence: Towards fast, accurate, and memory-efficient zeroth-order LLM fine-tuning. arXiv preprint arXiv:2502.03304.

  11. [11] Gemma 2: Improving Open Language Models at a Practical Size. arXiv preprint arXiv:2408.00118.

  12. [12] SuperGLUE: A stickier benchmark for general-purpose language understanding systems. Advances in Neural Information Processing Systems.

  13. [13] Zeroth-Order Optimization is Secretly Single-Step Policy Optimization. arXiv preprint arXiv:2506.14460.

  14. [14] Towards Efficient Low-Order Hybrid Optimizer for Language Model Fine-Tuning. Proceedings of the AAAI Conference on Artificial Intelligence.

  15. [15] ReLIZO: Sample reusable linear interpolation-based zeroth-order optimization. Advances in Neural Information Processing Systems.

  16. [16] Zeroth-order fine-tuning of LLMs in random subspaces. Proceedings of the IEEE/CVF International Conference on Computer Vision.

  17. [17] Leveraging Coordinate Momentum in SignSGD and Muon: Memory-Optimized Zero-Order. arXiv preprint arXiv:2506.04430.

  18. [18] Elucidating Subspace Perturbation in Zeroth-Order Optimization: Theory and Practice at Scale. arXiv preprint arXiv:2501.19099.

  19. [19] Deep learning with differential privacy. Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security.

  20. [20] Communication-efficient stochastic zeroth-order optimization for federated learning. IEEE Transactions on Signal Processing, 2022.

  21. [21] Achieving dimension-free communication in federated learning via zeroth-order optimization. International Conference on Learning Representations.

  22. [22] Low-rank curvature for zeroth-order optimization in LLM fine-tuning. Proceedings of the AAAI Conference on Artificial Intelligence.

  23. [23] Enabling fast differentially private SGD via just-in-time compilation and vectorization. Advances in Neural Information Processing Systems.

  24. [24] Yanjun Zhao, Sizhe Dang, Haishan Ye, Guang Dai, Yi Qian, Ivor Tsang. Second-Order Fine-Tuning without Pain for … 2025.

  25. [25] MUZO: Leveraging Multiple Queries and Momentum for Zeroth-Order Fine-Tuning of Large Language Models. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing.

  26. [26] Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, Jeremy Bernstein. 2024.

  27. [27] Low-rank Orthogonalization for Large-scale Matrix Optimization with Applications to Foundation Model Training. arXiv preprint arXiv:2509.11983.

  28. [28] Practical efficiency of Muon for pretraining. arXiv preprint arXiv:2505.02222.

  29. [29] Decoupled Weight Decay Regularization. arXiv preprint arXiv:1711.05101.

  30. [30] Dion: Distributed Orthonormalized Updates. arXiv preprint arXiv:2504.05295.

  31. [31] On the Convergence Analysis of Muon. arXiv preprint arXiv:2505.23737.

  32. [32] Muon is Scalable for LLM Training. arXiv preprint arXiv:2502.16982.

  33. [33] Kimi K2: Open Agentic Intelligence. arXiv preprint arXiv:2507.20534.

  34. [34] Newton sketch: A near linear-time optimization algorithm with linear-quadratic convergence. SIAM Journal on Optimization, 2017.

  35. [35] Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, Yuandong Tian. GaLore: Memory-Efficient … 2024.

  36. [36] No task left behind: Isotropic model merging with common and task-specific subspaces. arXiv preprint arXiv:2502.04959.

  37. [37] Sophia: A scalable stochastic second-order optimizer for language model pre-training. arXiv preprint arXiv:2305.14342.

  38. [38] Symbolic discovery of optimization algorithms. Advances in Neural Information Processing Systems.

  39. [39] Rethinking machine unlearning for large language models. Nature Machine Intelligence, 2025.

  40. [40] Hanzhen Zhao, Shihong Ding, Cong Fang, Zhouchen Lin. Pa… 2025.

  41. [41] Learning to optimize: A primer and a benchmark. Journal of Machine Learning Research.

  42. [42] Chongyu Fan, Jinghan Jia, Yihua Zhang, Anil Ramakrishna, Mingyi Hong, Sijia Liu. Towards… 2025.

  43. [43] Sharpness-aware Minimization for Efficiently Improving Generalization. International Conference on Learning Representations.

  44. [44] Yong Liu, Zirui Zhu, Chaoyu Gong, Minhao Cheng, Cho-Jui Hsieh, Yang You. Sparse Me… 2025.

  45. [45] Pengyun Yue, Xuanlin Yang, Mingqing Xiao, Zhouchen Lin. Pseu… 2025.

  46. [46] Yifan Yang, Zhen Zhang, Rupak Vignesh Swaminathan, Jing Liu, Nathan Susanj, Zheng Zhang. Sharp… 2025.

  47. [47] Private Zeroth-Order Optimization with Public Data. The Thirty-ninth Annual Conference on Neural Information Processing Systems.

  48. [48] Fast Zeroth-Order Convex Optimization with Quantum Gradient Methods. The Thirty-ninth Annual Conference on Neural Information Processing Systems.

  49. [49] DeepZero: Scaling Up Zeroth-Order Optimization for Deep Model Training. The Twelfth International Conference on Learning Representations.

  50. [50] Sizhe Dang, Yangyang Guo, Yanjun Zhao, Xiaodong Zheng, Guang Dai, Ivor Tsang, Haishan Ye. 2026.

  51. [51] Variance-reduced zeroth-order methods for fine-tuning language models. arXiv preprint arXiv:2404.08080.

  52. [52] Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv preprint arXiv:1909.08053.

  53. [53] PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel. arXiv preprint arXiv:2304.11277.

  54. [54] Yihua Zhang, Pingzhi Li, Junyuan Hong, Jiaxiang Li, Yimeng Zhang, Wenqing Zheng, Pin-Yu Chen, Jason D. Lee, Wotao Yin, Mingyi Hong, Zhangyang Wang, Sijia Liu, Tianlong Chen. Proceedings of the 41st International Conference on Machine Learning (ICML).

  55. [55] Gradient-based algorithms for zeroth-order optimization. Foundations and Trends in Optimization, 2025.

  56. [56] Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory, 2015.

  57. [57] Gradients without backpropagation. arXiv preprint arXiv:2202.08587.

  58. [58] Scaling Forward Gradient With Local Losses. The Eleventh International Conference on Learning Representations.

  59. [59] Learning by directional gradient descent. International Conference on Learning Representations.

  60. [60] Optimization without backpropagation. arXiv preprint arXiv:2209.06302.

  61. [61] Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 2017.

  62. [62] Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Transactions on Automatic Control, 2002.

  63. [63] A primer on zeroth-order optimization in signal processing and machine learning: Principals, recent advances, and applications. IEEE Signal Processing Magazine, 2020.

  64. [64] Enhancing Zeroth-order Fine-tuning for Language Models with Low-rank Structures. The Thirteenth International Conference on Learning Representations.

  65. [65] Direct feedback alignment provides learning in deep neural networks. Advances in Neural Information Processing Systems.

  66. [66] Direct feedback alignment scales to modern deep learning tasks and architectures. Advances in Neural Information Processing Systems.

  67. [67] Align, then memorise: the dynamics of learning with feedback alignment. arXiv preprint arXiv:2011.12428.

  68. [68] Numerical solution of a minimum problem. 1952.

  69. [69] Loss landscapes are all you need: Neural network generalization can be explained without the implicit bias of gradient descent. The Eleventh International Conference on Learning Representations.

  70. [70] Trust region methods. 2000.

  71. [71] Yicheng Lang, Yihua Zhang, Chongyu Fan, Changsheng Wang, Jinghan Jia, Sijia Liu. Downgrade to Upgrade: Optimizer Simplification Enhances Robustness in … 2026.

  72. [72] Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 2015.

  73. [73] A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 2002.

  74. [74] Genetic algorithms and machine learning. Proceedings of the Sixth Annual Conference on Computational Learning Theory.

  75. [75] Particle swarm optimization. Proceedings of ICNN'95-International Conference on Neural Networks, 1995.

  76. [76] PSwarm: a hybrid solver for linearly constrained global derivative-free optimization. Optimization Methods & Software, 2009.

  77. [77] The simplex gradient and noisy optimization problems. Computational Methods for Optimal Design and Control: Proceedings of the AFOSR Workshop on Optimal Design and Control, Arlington, Virginia, 30 September--3 October, 1997. 1998.

  78. [78] On the convergence of the multidirectional search algorithm. SIAM Journal on Optimization, 1991.

  79. [79] Biologically-plausible learning algorithms can scale to large datasets. arXiv preprint arXiv:1811.03567.

  80. [80] Zeroth-order online alternating direction method of multipliers: Convergence analysis and applications. International Conference on Artificial Intelligence and Statistics, 2018.

Showing first 80 references.