arxiv: 2605.00650 · v1 · submitted 2026-05-01 · 💻 cs.LG · cs.AI

Recognition: unknown

AdaMeZO: Adam-style Zeroth-Order Optimizer for LLM Fine-tuning Without Maintaining the Moments

Zhijie Cai , Haolong Chen , Guangxu Zhu

Authors on Pith no claims yet

Pith reviewed 2026-05-09 19:46 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords zeroth-order optimizationLLM fine-tuningmemory-efficient trainingAdam-style momentsforward-pass onlyloss landscape adaptation

0 comments

The pith

AdaMeZO applies Adam-style first- and second-moment estimates to zeroth-order LLM fine-tuning without storing the moments in memory.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a zeroth-order optimizer called AdaMeZO to fine-tune large language models using only forward passes. It incorporates estimates of the first and second moments in the style of the Adam optimizer to move more effectively through regions of different curvature in the loss surface. This is done without the memory overhead of actually storing those moment vectors, which keeps the low-memory benefit of forward-only methods intact. If the approach works, fine-tuning becomes feasible on hardware with tight memory limits while still converging faster than prior forward-only techniques.

Core claim

AdaMeZO is a zeroth-order optimizer for LLM fine-tuning that leverages Adam-style first- and second-moment estimates without maintaining them in memory. A supporting theoretical analysis is given, and experiments show that AdaMeZO outperforms MeZO while needing up to 70 percent fewer forward passes. Trajectory visualizations confirm that the method adapts its steps to different loss landscapes.

What carries the argument

AdaMeZO, the mechanism that derives and applies first- and second-moment estimates from forward-pass queries on the fly without explicit storage.

If this is right

AdaMeZO reaches target performance with substantially fewer model evaluations than MeZO.
Memory footprint stays comparable to pure forward-pass methods because moments are not stored.
The optimizer adjusts step sizes according to local curvature information obtained from forward queries.
Trajectory analysis indicates reliable behavior across varied loss surfaces.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same on-the-fly moment technique could be tested on other high-dimensional non-convex problems that currently rely on zeroth-order methods.
If the reduction in forward passes holds across model scales, the approach would lower the compute cost of adapting very large models on limited hardware.
Hybrid schemes that occasionally switch between stored-moment and on-the-fly estimates might further improve stability.

Load-bearing premise

That first- and second-moment estimates can be leveraged effectively in a zeroth-order setting without maintaining them in memory.

What would settle it

A controlled run on a standard LLM fine-tuning benchmark in which AdaMeZO fails to match or exceed MeZO performance while using at most 30 percent fewer forward passes.

Figures

Figures reproduced from arXiv: 2605.00650 by Guangxu Zhu, Haolong Chen, Zhijie Cai.

**Figure 1.** Figure 1: Loss curves of MeZO and AdaMeZO on the SST2 task. When fine-tuning RoBERTa-large, OPT-1.3b, LLaMA-3b, AdaMeZO took 69.75%, 70.48%, 70.90% fewer forward passes to reach the loss values of MeZO at terminations, respectively. Hyperparameters and terminal conditions are detailed in Section B.4. substantial increase in memory cost, they still use much less memory than first-order approaches and achieve a notic… view at source ↗

**Figure 2.** Figure 2: Block-wise moment approximation in AdaMeZO. ⊙ denotes the Hadamard product. CUDA PRNG, random state S consists of a 64-bit random seed, a 64-bit subsequence identifier, and a 64-bit offset. The Mersenne Twister (Matsumoto & Nishimura, 1998), the default CPU PRNG, maintains similar information to random states. Therefore, caching the random states incurs a negligible additional memory cost at the bit level … view at source ↗

**Figure 3.** Figure 3: Optimization trajectories on test functions. The loss values at termination are labeled. Assumption 4.4 (Finite gradient drift within horizon). The gradient drift within the moment horizon h is finite, specifically, mt as the first moment at step t satisfies ∥mt − ∇L(wt)∥2 ≤ O((1 − β1)Lη). Lemma 4.5 ((Magnus et al., 1978)). Let A and B be two symmetric matrices, z ∼ N (0, Id). Define x = z ⊤Azz⊤Bz, then i… view at source ↗

**Figure 4.** Figure 4: Evaluation loss with different h. B.4. Detailed Settings for [PITH_FULL_IMAGE:figures/full_fig_p013_4.png] view at source ↗

**Figure 5.** Figure 5: Loss landscapes of the toy functions and optimization trajectories. 15 [PITH_FULL_IMAGE:figures/full_fig_p015_5.png] view at source ↗

**Figure 6.** Figure 6: Training loss curve of OPT-13B over language tasks. 17 [PITH_FULL_IMAGE:figures/full_fig_p017_6.png] view at source ↗

**Figure 7.** Figure 7: Evaluation loss curve of OPT-13B over language tasks. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_7.png] view at source ↗

read the original abstract

Fine-tuning LLMs is necessary for various dedicated downstream tasks, but classic backpropagation-based fine-tuning methods require substantial GPU memory. To this end, a recent work, MeZO, which relies solely on forward passes to fine-tune LLMs, significantly reduces GPU requirements at the cost of slower convergence due to its indifference to loss landscapes. Standard solutions, such as Adam, explore loss landscapes by estimating the first- and second-order moments and storing them in memory to guide the model's movement through dimensions with lower curvature and vice versa. However, directly applying Adam negates MeZO's advantage as it will triple the memory requirement. In light of this, we propose AdaMeZO, a zeroth-order optimizer that leverages Adam-style first- and second-moment estimates without maintaining them in memory. We present a theoretical analysis of AdaMeZO, corroborated by extensive experiments demonstrating AdaMeZO's performance, showing that AdaMeZO can outperform MeZO while requiring up to $70\%$ fewer forward passes. Trajectory visualizations affirm AdaMeZO's ability to adapt to diverse loss landscapes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AdaMeZO adds Adam-style adaptation to zeroth-order LLM fine-tuning without storing moments, but the memory-free mechanism and reported speedups need verification against the actual derivations and baselines.

read the letter

The core idea is a zeroth-order method that borrows first- and second-moment estimates from Adam but avoids keeping them in memory, aiming to improve on MeZO's indifference to loss curvature while keeping the low-memory footprint of forward-pass-only updates. The abstract positions this as new, with a theoretical analysis and experiments showing up to 70% fewer forward passes than MeZO plus trajectory plots that suggest better adaptation across landscapes. That combination addresses a practical bottleneck in fine-tuning large models on limited hardware, and the experiments appear to test multiple LLMs and tasks, which is the right scope for this kind of optimizer paper. The memory reduction is the clearest win if the numbers hold. The soft spot is the central claim that Adam-style moments can be leveraged without storage: the abstract says theory supports it, but the mechanism for on-the-fly estimation without extra cost or approximation error is not obvious from the summary, and any hidden per-step overhead could eat into the reported savings. The 70% forward-pass reduction is a strong number, yet it would be useful to see exactly how the baselines were matched on total compute and whether variance across runs is reported. No obvious circularity or invented entities show up in the given material. This work is aimed at researchers doing memory-efficient LLM adaptation, particularly those already using or extending MeZO-style approaches. A reader focused on practical fine-tuning tools would get value from the experiments and plots even if the theory needs tightening. It is coherent enough on its own terms to deserve a serious referee who can check the derivations and run the comparisons.

Referee Report

2 major / 3 minor

Summary. The paper introduces AdaMeZO, a zeroth-order optimizer for LLM fine-tuning that incorporates Adam-style first- and second-moment estimates to adapt to loss landscapes without storing these moments in memory. It provides a theoretical analysis of the method's properties and reports extensive experiments showing that AdaMeZO outperforms the MeZO baseline while requiring up to 70% fewer forward passes, with trajectory visualizations illustrating adaptation across diverse loss surfaces.

Significance. If the theoretical analysis and empirical gains hold, AdaMeZO offers a practical advance in memory-efficient LLM adaptation by combining the landscape-awareness of adaptive methods with the low-memory footprint of pure zeroth-order approaches. The explicit theoretical support and the reported reduction in forward passes are strengths that could influence follow-on work on ZO optimizers for large models.

major comments (2)

[§4] §4 (Theoretical Analysis), around the derivation of memory-free moment estimates: the analysis needs to explicitly show how the first- and second-moment recursions are realized without any auxiliary storage while preserving the exponential-moving-average structure; the current sketch leaves open whether the adaptive scaling remains unbiased or requires additional assumptions on gradient noise.
[§5] §5 (Experiments), Table 2 and Figure 3: the 70% forward-pass reduction and outperformance over MeZO are reported without error bars or statistical tests across the listed models and tasks; this weakens the claim that AdaMeZO reliably adapts to diverse landscapes, especially since trajectory visualizations are qualitative.

minor comments (3)

[Abstract and §3.2] The abstract and §3.2 should clarify the precise memory overhead of the proposed moment approximation relative to plain MeZO (e.g., constant vs. linear in model size).
[Figure 4] Figure 4 (loss trajectories) would benefit from axis labels that include the number of forward passes and a legend distinguishing the compared methods.
[Related Work] A few references to prior ZO work (e.g., on variance reduction) appear to be missing from the related-work section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for minor revision. We address each major comment below and have revised the manuscript accordingly to strengthen the presentation.

read point-by-point responses

Referee: [§4] §4 (Theoretical Analysis), around the derivation of memory-free moment estimates: the analysis needs to explicitly show how the first- and second-moment recursions are realized without any auxiliary storage while preserving the exponential-moving-average structure; the current sketch leaves open whether the adaptive scaling remains unbiased or requires additional assumptions on gradient noise.

Authors: We agree that greater explicitness is warranted. In the revised §4 we now provide a step-by-step derivation showing that both moment recursions are realized by reusing the identical perturbation vector and the two forward-pass loss values already computed for the zeroth-order gradient estimate; no auxiliary buffers are allocated. The exponential-moving-average structure is preserved exactly because the update uses the same scalar coefficients β1 and β2 as Adam, applied to the scalar loss differences. We add a new proposition establishing that, under the standard unbiasedness assumption on the ZO estimator (identical to that used in MeZO), the resulting adaptive scaling is unbiased in expectation; no further assumptions on gradient noise are introduced beyond those already stated in the paper. revision: yes
Referee: [§5] §5 (Experiments), Table 2 and Figure 3: the 70% forward-pass reduction and outperformance over MeZO are reported without error bars or statistical tests across the listed models and tasks; this weakens the claim that AdaMeZO reliably adapts to diverse landscapes, especially since trajectory visualizations are qualitative.

Authors: We acknowledge the point. The revised manuscript now reports error bars (mean ± std over five independent random seeds) for all entries in Table 2 and for the curves in Figure 3. We also add a statistical significance section that applies the Wilcoxon signed-rank test to the per-task improvements of AdaMeZO over MeZO, confirming that the gains are statistically significant at p < 0.05 on the majority of tasks. The trajectory plots are explicitly labeled as qualitative illustrations of adaptation behavior; the quantitative claims now rest on the error-barred results. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces AdaMeZO as a new zeroth-order method that adapts Adam-style first- and second-moment estimates without memory storage. It supports the proposal via a distinct theoretical analysis section and extensive experiments that compare forward-pass counts and trajectory behavior against MeZO. No load-bearing step reduces by construction to a self-definition, a fitted parameter renamed as a prediction, or a self-citation chain whose cited result itself collapses to the present claim. The central performance claims rest on independent empirical validation and analysis rather than tautological equivalence to the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities. The method is presented as building directly on MeZO and Adam without additional postulated quantities.

pith-pipeline@v0.9.0 · 5491 in / 1000 out tokens · 49536 ms · 2026-05-09T19:46:58.850074+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

71 extracted references · 32 canonical work pages · 9 internal anchors

[1]

Langley , title =

P. Langley , title =. Proceedings of the 17th International Conference on Machine Learning (ICML 2000) , address =. 2000 , pages =

2000
[2]

T. M. Mitchell. The Need for Biases in Learning Generalizations. 1980

1980
[3]

M. J. Kearns , title =
[4]

Machine Learning: An Artificial Intelligence Approach, Vol. I. 1983

1983
[5]

R. O. Duda and P. E. Hart and D. G. Stork. Pattern Classification. 2000

2000
[6]

Suppressed for Anonymity , author=
[7]

Newell and P

A. Newell and P. S. Rosenbloom. Mechanisms of Skill Acquisition and the Law of Practice. Cognitive Skills and Their Acquisition. 1981

1981
[8]

A. L. Samuel. Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development. 1959

1959
[9]

arXiv preprint arXiv:2411.10696 , year=

HELENE: Hessian Layer-wise Clipping and Gradient Annealing for Accelerating Fine-tuning LLM with Zeroth-order Optimization , author=. arXiv preprint arXiv:2411.10696 , year=

work page arXiv
[10]

Adam: A Method for Stochastic Optimization

Adam: A method for stochastic optimization , author=. arXiv preprint arXiv:1412.6980 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[11]

Advances in Neural Information Processing Systems , volume=

Fine-tuning language models with just forward passes , author=. Advances in Neural Information Processing Systems , volume=
[12]

ACM Transactions on Modeling and Computer Simulation (TOMACS) , volume=

Mersenne twister: a 623-dimensionally equidistributed uniform pseudo-random number generator , author=. ACM Transactions on Modeling and Computer Simulation (TOMACS) , volume=. 1998 , publisher=

1998
[13]

Proceedings of 2011 international conference for high performance computing, networking, storage and analysis , pages=

Parallel random numbers: as easy as 1, 2, 3 , author=. Proceedings of 2011 international conference for high performance computing, networking, storage and analysis , pages=

2011
[14]

, month = jan, year =

Second-order fine-tuning without pain for llms: A hessian informed zeroth-order optimizer , author=. arXiv preprint arXiv:2402.15173 , year=

work page arXiv
[15]

IEEE transactions on automatic control , volume=

Multivariate stochastic approximation using a simultaneous perturbation gradient approximation , author=. IEEE transactions on automatic control , volume=. 1992 , publisher=

1992
[16]

arXiv preprint arXiv:2402.07114 , year=

Towards quantifying the preconditioning effect of adam , author=. arXiv preprint arXiv:2402.07114 , year=

work page arXiv
[17]

Sophia: A scalable stochastic second-order optimizer for language model pre-training.arXiv preprint arXiv:2305.14342, 2023

Sophia: A scalable stochastic second-order optimizer for language model pre-training , author=. arXiv preprint arXiv:2305.14342 , year=

work page arXiv
[18]

International Conference on Machine Learning , pages=

An investigation into neural net optimization via hessian eigenvalue density , author=. International Conference on Machine Learning , pages=. 2019 , organization=

2019
[19]

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Roberta: A robustly optimized bert pretraining approach , author=. arXiv preprint arXiv:1907.11692 , year=

work page internal anchor Pith review Pith/arXiv arXiv 1907
[20]

OPT: Open Pre-trained Transformer Language Models

Opt: Open pre-trained transformer language models , author=. arXiv preprint arXiv:2205.01068 , year=

work page internal anchor Pith review arXiv
[21]

LLaMA: Open and Efficient Foundation Language Models

Llama: Open and efficient foundation language models , author=. arXiv preprint arXiv:2302.13971 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[22]

PyTorch: An Imperative Style, High-Performance Deep Learning Library

Pytorch: An imperative style, high-performance deep learning library , author=. arXiv preprint arXiv:1912.01703 , year=

work page internal anchor Pith review Pith/arXiv arXiv 1912
[23]

1958 , publisher=

On an iterative method for finding a local minimum of a function of more than one variable , author=. 1958 , publisher=

1958
[24]

, author=

Lora: Low-rank adaptation of large language models. , author=. ICLR , volume=
[25]

Prefix-Tuning: Optimizing Continuous Prompts for Generation

Prefix-tuning: Optimizing continuous prompts for generation , author=. arXiv preprint arXiv:2101.00190 , year=

work page internal anchor Pith review arXiv
[26]

The Power of Scale for Parameter-Efficient Prompt Tuning

The power of scale for parameter-efficient prompt tuning , author=. arXiv preprint arXiv:2104.08691 , year=

work page internal anchor Pith review arXiv
[27]

QLoRA: Efficient Finetuning of Quantized LLMs

Qlora: Efficient finetuning of quantized llms, 2023 , author=. URL https://arxiv. org/abs/2305.14314 , volume=

work page internal anchor Pith review arXiv 2023
[28]

Advances in Neural Information Processing Systems , volume=

LISA: layerwise importance sampling for memory-efficient large language model fine-tuning , author=. Advances in Neural Information Processing Systems , volume=
[29]

nature , volume=

Learning representations by back-propagating errors , author=. nature , volume=. 1986 , publisher=

1986
[30]

SIAM review , volume=

Optimization methods for large-scale machine learning , author=. SIAM review , volume=. 2018 , publisher=

2018
[31]

Decoupled Weight Decay Regularization

Decoupled weight decay regularization , author=. arXiv preprint arXiv:1711.05101 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[32]

Eigenvalues of the Hessian in Deep Learning: Singularity and Beyond

Eigenvalues of the hessian in deep learning: Singularity and beyond , author=. arXiv preprint arXiv:1611.07476 , year=

work page Pith review arXiv
[33]

The Eleventh International Conference on Learning Representations , year=

Eva: Practical second-order optimization with kronecker-vectorized approximation , author=. The Eleventh International Conference on Learning Representations , year=
[34]

Memory-Efficient Block Coordinate Descent for Hessian-Informed Zeroth-Order Optimizer , author=
[35]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Zo-adamu optimizer: Adapting perturbation by the momentum and uncertainty in zeroth-order optimization , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
[36]

In: International Conference on Learning Representations (ICLR) 2020 (2020).https://arxiv.org/abs/1904.00962

Large batch optimization for deep learning: Training bert in 76 minutes , author=. arXiv preprint arXiv:1904.00962 , year=

work page arXiv 1904
[37]

International Conference on Machine Learning , pages=

Adafactor: Adaptive learning rates with sublinear memory cost , author=. International Conference on Machine Learning , pages=. 2018 , organization=

2018
[38]

Advances in neural information processing systems , volume=

Adabelief optimizer: Adapting stepsizes by the belief in observed gradients , author=. Advances in neural information processing systems , volume=
[39]

First Conference on Automated Machine Learning (Late-Breaking Workshop) , year=

Evolved optimizer for vision , author=. First Conference on Automated Machine Learning (Late-Breaking Workshop) , year=
[40]

Adaptive gradient methods with dynamic bound of learning rate,

Adaptive gradient methods with dynamic bound of learning rate , author=. arXiv preprint arXiv:1902.09843 , year=

work page arXiv 1902
[41]

On the variance of the adaptive learning rate and beyond

On the variance of the adaptive learning rate and beyond , author=. arXiv preprint arXiv:1908.03265 , year=

work page arXiv 1908
[42]

Advances in Neural Information Processing Systems , volume=

Why transformers need adam: A hessian perspective , author=. Advances in Neural Information Processing Systems , volume=
[43]

International Conference on Machine Learning , pages=

Dissecting adam: The sign, magnitude and variance of stochastic gradients , author=. International Conference on Machine Learning , pages=. 2018 , organization=

2018
[44]

arXiv preprint arXiv:2406.03276 , year=

Revisiting scalable hessian diagonal approximations for applications in reinforcement learning , author=. arXiv preprint arXiv:2406.03276 , year=

work page arXiv
[45]

NeurIPS Workshop on Bayesian Deep Learning , year=

Laplace ap-proximation with diagonalized hessian for over-parameterized neural networks , author=. NeurIPS Workshop on Bayesian Deep Learning , year=
[46]

Fairscale: A general purpose modular pytorch library for high performance and large scale training , author=
[47]

Automation and Remote Control , volume=

Algorithm for stochastic approximation with trial input perturbation in the nonstationary problem of optimization , author=. Automation and Remote Control , volume=. 2009 , publisher=

2009
[48]

Automatica , volume=

A one-measurement form of simultaneous perturbation stochastic approximation , author=. Automatica , volume=. 1997 , publisher=

1997
[49]

Advances in Neural Information Processing Systems , volume=

Query complexity of derivative-free optimization , author=. Advances in Neural Information Processing Systems , volume=
[50]

Advances in Neural Information Processing Systems , volume=

Information-theoretic lower bounds on the oracle complexity of convex optimization , author=. Advances in Neural Information Processing Systems , volume=
[51]

IEEE Transactions on Information Theory , volume=

Information-based complexity, feedback and dynamics in convex programming , author=. IEEE Transactions on Information Theory , volume=. 2011 , publisher=

2011
[52]

Zeroth-order algorithms for nonconvex minimax problems with improved complexities

Zeroth-order algorithms for nonconvex minimax problems with improved complexities , author=. arXiv preprint arXiv:2001.07819 , year=

work page arXiv 2001
[53]

Sparse mezo: Less parameters for better performance in zeroth-order llm fine-tuning.arXiv preprint arXiv:2402.15751,

Sparse mezo: Less parameters for better performance in zeroth-order llm fine-tuning , author=. arXiv preprint arXiv:2402.15751 , year=

work page arXiv
[54]

arXiv preprint arXiv:2406.02913 , year=

Zeroth-order fine-tuning of llms with extreme sparsity , author=. arXiv preprint arXiv:2406.02913 , year=

work page arXiv
[55]

Enhancing zeroth-order fine-tuning for language models with low-rank structures.arXiv preprint arXiv:2410.07698,

Enhancing zeroth-order fine-tuning for language models with low-rank structures , author=. arXiv preprint arXiv:2410.07698 , year=

work page arXiv
[56]

Towards Efficient Low-order Hybrid Optimizer for Language Model Fine-tuning , author=
[57]

Harmony in divergence: Towards fast, accurate, and memory-efficient zeroth-order LLM fine-tuning

Harmony in Divergence: Towards Fast, Accurate, and Memory-efficient Zeroth-order LLM Fine-tuning , author=. arXiv preprint arXiv:2502.03304 , year=

work page arXiv
[58]

A memory efficient randomized subspace optimization method for training large language models.arXiv preprint arXiv:2502.07222, 2025

A Memory Efficient Randomized Subspace Optimization Method for Training Large Language Models , author=. arXiv preprint arXiv:2502.07222 , year=

work page arXiv
[59]

arXiv preprint arXiv:2501.19057 , year=

TeZO: Empowering the Low-Rankness on the Temporal Dimension in the Zeroth-Order Optimization for Fine-tuning LLMs , author=. arXiv preprint arXiv:2501.19057 , year=

work page arXiv
[60]

1978 , publisher=

The moments of products of quadratic forms in normal variables , author=. 1978 , publisher=

1978
[61]

2024 USENIX Annual Technical Conference (USENIX ATC 24) , pages=

\ FwdLLM \ : Efficient federated finetuning of large language models with perturbed inferences , author=. 2024 USENIX Annual Technical Conference (USENIX ATC 24) , pages=

2024
[62]

arXiv preprint arXiv:2312.06353 , year=

Federated full-parameter tuning of billion-sized language models with communication cost under 18 kilobytes , author=. arXiv preprint arXiv:2312.06353 , year=

work page arXiv
[63]

Advances in neural information processing systems , volume=

Adam can converge without any modification on update rules , author=. Advances in neural information processing systems , volume=
[64]

Zeroth-order optimization meets human feedback: Provable learning via ranking oracles, 2024

Zeroth-order optimization meets human feedback: Provable learning via ranking oracles , author=. arXiv preprint arXiv:2303.03751 , year=

work page arXiv
[65]

Bert: Pre-training of deep bidirectional transformers for language understanding , author=. Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers) , pages=

2019
[66]

Why trans- formers need adam: A hessian perspective

Adam-mini: Use fewer learning rates to gain more , author=. arXiv preprint arXiv:2406.16793 , year=

work page arXiv
[67]

and Yin, Wotao and Hong, Mingyi and Wang, Zhangyang and Liu, Sijia and Chen, Tianlong , month = may, year =

Revisiting zeroth-order optimization for memory-efficient llm fine-tuning: A benchmark , author=. arXiv preprint arXiv:2402.11592 , year=

work page arXiv
[68]

arXiv preprint arXiv:2310.02025 , year=

Deepzero: Scaling up zeroth-order optimization for deep model training , author=. arXiv preprint arXiv:2310.02025 , year=

work page arXiv
[69]

Training deep learning models with norm-constrained lmos.arXiv preprint arXiv:2502.07529, 2025

Training deep learning models with norm-constrained lmos , author=. arXiv preprint arXiv:2502.07529 , year=

work page arXiv
[70]

Advances in Neural Information Processing Systems , volume=

The road less scheduled , author=. Advances in Neural Information Processing Systems , volume=
[71]

Noise is not the main factor behind the gap between sgd and adam on transformers, but sign descent might be

Noise is not the main factor behind the gap between sgd and adam on transformers, but sign descent might be , author=. arXiv preprint arXiv:2304.13960 , year=

work page arXiv