arxiv: 2605.03869 · v1 · submitted 2026-05-05 · 💻 cs.LG · math.OC

On Adaptivity in Zeroth-Order Optimization

Pith reviewed 2026-05-07 00:40 UTC · model grok-4.3

classification 💻 cs.LG math.OC
keywords adaptive · meazo · memory · optimization · convergence · demonstrate · experiments · fine-tuning

The pith

Adaptive zeroth-order methods bring no convergence gain over well-tuned ZO-SGD in high-dimensional LLM fine-tuning; a new single-scalar adaptive method, MEAZO, matches ZO-Adam's performance at the memory cost of ZO-SGD.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Zeroth-order optimization estimates gradients from function values alone, which is useful when memory is tight because it avoids storing full gradients. Earlier work suggested that making the step size adaptive per coordinate, as Adam does, would speed up convergence. The authors find this does not happen for zeroth-order gradients in the high-dimensional regime of LLM fine-tuning: the estimated gradients show little coordinate-wise variation, so per-coordinate adaptation adds memory without improving the path taken. They therefore replace per-coordinate statistics with one running scalar that tracks a global step-size scale. Under ordinary smoothness and bounded-variance assumptions they prove that this scalar-adaptive method converges at the same rate as standard ZO-SGD. On several LLM families and downstream tasks the new method reaches the same accuracy as ZO-Adam while using only the memory of plain ZO-SGD. It is also less sensitive to the choice of base learning rate, especially when parameters are updated in blocks.
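The mechanics described above can be sketched in a few lines. This is a minimal illustration, not the paper's algorithm: the page does not give MEAZO's update rule, so `zo_scalar_adaptive` is a hypothetical stand-in that keeps one running scalar (an exponential average of the squared directional derivative) and uses it to rescale the whole step. The toy objective `quadratic`, the smoothing radius `mu`, the decay `beta`, and all function names are assumptions made for the example.

```python
import math
import random


def quadratic(x):
    """Toy objective f(x) = 0.5 * ||x||^2, standing in for a training loss."""
    return 0.5 * sum(v * v for v in x)


def zo_directional_estimate(f, x, mu, rng):
    """Two-point ZO estimate from function values only.

    Returns (c, z), where z is a Gaussian direction and c approximates the
    directional derivative along z. The gradient estimate is c * z.
    """
    z = [rng.gauss(0.0, 1.0) for _ in x]
    x_plus = [xi + mu * zi for xi, zi in zip(x, z)]
    x_minus = [xi - mu * zi for xi, zi in zip(x, z)]
    c = (f(x_plus) - f(x_minus)) / (2.0 * mu)
    return c, z


def zo_sgd(f, x0, lr, steps, mu=1e-3, seed=0):
    """Plain ZO-SGD: step along -c * z with a fixed learning rate."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        c, z = zo_directional_estimate(f, x, mu, rng)
        x = [xi - lr * c * zi for xi, zi in zip(x, z)]
    return x


def zo_scalar_adaptive(f, x0, lr, steps, mu=1e-3, beta=0.99, eps=1e-8, seed=0):
    """Hypothetical scalar-adaptive ZO step (NOT the paper's exact rule).

    A single scalar v tracks a running average of c^2; the step is divided
    by sqrt(v), so the only optimizer state is one float.
    """
    rng = random.Random(seed)
    x = list(x0)
    v = 0.0
    for t in range(1, steps + 1):
        c, z = zo_directional_estimate(f, x, mu, rng)
        v = beta * v + (1.0 - beta) * c * c
        v_hat = v / (1.0 - beta ** t)  # Adam-style bias correction
        scale = lr / (math.sqrt(v_hat) + eps)
        x = [xi - scale * c * zi for xi, zi in zip(x, z)]
    return x


if __name__ == "__main__":
    x0 = [1.0] * 20
    print("initial loss:        ", quadratic(x0))
    print("ZO-SGD loss:         ", quadratic(zo_sgd(quadratic, x0, lr=0.02, steps=500)))
    print("scalar-adaptive loss:", quadratic(zo_scalar_adaptive(quadratic, x0, lr=0.02, steps=1000)))
```

Both routines keep no per-coordinate statistics; the adaptive one adds a single float. With a constant `lr`, the normalized step in this sketch settles into a neighborhood of the minimizer rather than converging exactly, which is a property of the stand-in rule and says nothing about the paper's guarantees.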

Core claim

Contrary to prior claims, adaptive ZO methods such as ZO-Adam offer no convergence advantage over well-tuned ZO-SGD, while incurring significant memory overhead.
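To put the memory overhead in rough numbers, here is a back-of-the-envelope sketch. It assumes that ZO-Adam keeps Adam's usual two per-parameter buffers (first and second moments), that ZO-SGD needs no optimizer state because its perturbation can be regenerated from a random seed, and that state is stored in 4-byte floats. The function name and the 7-billion-parameter figure are illustrative, not taken from the paper.

```python
def optimizer_state_bytes(num_params, bytes_per_value=4):
    """Extra optimizer state, in bytes, beyond the model weights themselves.

    Assumptions (not from the paper): ZO-SGD stores nothing extra, ZO-Adam
    stores two buffers the size of the model, and a scalar-adaptive method
    stores a single value.
    """
    return {
        "zo_sgd": 0,
        "zo_adam": 2 * num_params * bytes_per_value,
        "scalar_adaptive": bytes_per_value,
    }


if __name__ == "__main__":
    # Hypothetical 7-billion-parameter model.
    for name, size in optimizer_state_bytes(7_000_000_000).items():
        print(f"{name:16s} {size / 1e9:.3f} GB")
```

Under these assumptions the per-coordinate buffers cost twice the parameter count in stored values, while a scalar-adaptive method costs one value regardless of model size.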

Load-bearing premise

The claim that ZO gradients lack coordinate-wise heterogeneity in high dimensions, which is used to conclude that per-coordinate adaptation is memory-inefficient.
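One standard calculation suggests why this premise is plausible; it is offered as intuition and may differ from the paper's own argument. Assume an idealized Gaussian-direction estimator, i.e. the limit of a two-point estimate as the smoothing radius goes to zero. The symbols below are introduced for this sketch.

```latex
% Idealized ZO estimate with z ~ N(0, I_d):
\hat g = \bigl(z^\top \nabla f(x)\bigr)\, z .

% Per-coordinate second moment, using
% E[z_j^2 z_i^2] = 1 for j \neq i and E[z_i^4] = 3:
\mathbb{E}\bigl[\hat g_i^{\,2}\bigr]
  = \sum_{j \neq i} \bigl(\partial_j f(x)\bigr)^2 + 3\,\bigl(\partial_i f(x)\bigr)^2
  = \|\nabla f(x)\|^2 + 2\,\bigl(\partial_i f(x)\bigr)^2 .

% Since (\partial_i f)^2 \le \|\nabla f\|^2, any two coordinates satisfy
\frac{\mathbb{E}[\hat g_i^{\,2}]}{\mathbb{E}[\hat g_j^{\,2}]} \in \left[\tfrac{1}{3},\, 3\right].
```

The second moment that an Adam-style normalizer tracks is dominated by the shared term $\|\nabla f(x)\|^2$. When no single coordinate carries a large share of the gradient norm, which is the typical situation in high dimensions, the per-coordinate values are nearly identical and a single scalar captures almost all of the information.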

Original abstract

We investigate the effectiveness of adaptive zeroth-order (ZO) optimization for memory-constrained fine-tuning of large language models (LLMs). Contrary to prior claims, we show that adaptive ZO methods such as ZO-Adam offer no convergence advantage over well-tuned ZO-SGD, while incurring significant memory overhead. Our analysis reveals that in high dimensions, ZO gradients lack coordinate-wise heterogeneity, rendering adaptive mechanisms memory inefficient. Leveraging this insight, we propose MEAZO, a memory-efficient adaptive ZO optimizer that tracks only a single scalar for global step size adaptation. We support our method with theoretical convergence guarantees under standard assumptions. Experiments across multiple LLM families and tasks demonstrate that MEAZO matches ZO-Adam's performance with the memory footprint of ZO-SGD. Additional experiments on synthetic quadratic problems and LLM fine-tuning further demonstrate MEAZO's enhanced robustness to step size choices, particularly in grouped or block-structured optimization settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on (1) the empirical observation that high-dimensional ZO gradients lack coordinate-wise heterogeneity and (2) standard smoothness and variance assumptions for the convergence proof. No free parameters or invented entities are introduced beyond the optimizer itself.

axioms (1)
  • domain assumption: Standard smoothness and bounded-variance assumptions for ZO convergence analysis
    Invoked to support theoretical guarantees for MEAZO.
invented entities (1)
  • MEA single-scalar adaptive mechanism (no independent evidence)
    purpose: Global step-size adaptation without per-coordinate memory
    New optimizer component whose only independent evidence is the reported LLM experiments.

pith-pipeline@v0.9.0 · 5439 in / 1282 out tokens · 19470 ms · 2026-05-07T00:40:46.069724+00:00 · methodology
