On Adaptivity in Zeroth-Order Optimization
Pith reviewed 2026-05-07 00:40 UTC · model grok-4.3
The pith
Adaptive zeroth-order methods bring no convergence gain over well-tuned ZO-SGD in high-dimensional LLM fine-tuning; a new single-scalar adaptive method, MEAZO, matches ZO-Adam's performance at ZO-SGD's memory cost.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Contrary to prior claims, adaptive ZO methods such as ZO-Adam offer no convergence advantage over well-tuned ZO-SGD, while incurring significant memory overhead.
Load-bearing premise
The claim that ZO gradients lack coordinate-wise heterogeneity in high dimensions, which is used to conclude that per-coordinate adaptation is memory-inefficient.
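A rough numerical check of this premise, offered as a sketch and not as the paper's analysis. It assumes a Gaussian-direction estimator g_hat = (z . g) z with z ~ N(0, I), using the exact directional derivative in place of a finite difference. Under that assumption, Gaussian fourth-moment identities give E[g_hat_i^2] = ||g||^2 + 2 g_i^2, so per-coordinate second moments can differ by at most a factor of 3, however uneven the true gradient is. The function name, dimension, and gradient scales below are illustrative choices.

```python
import numpy as np

def zo_second_moment(grad, num_samples=20000, batch=1000, seed=0):
    """Monte Carlo estimate of E[g_hat_i^2] for g_hat = (z . grad) * z, z ~ N(0, I).

    Assumption: the exact directional derivative z . grad stands in for the
    finite-difference quotient (the small-smoothing limit).
    """
    rng = np.random.default_rng(seed)
    acc = np.zeros_like(grad)
    for _ in range(num_samples // batch):
        z = rng.standard_normal((batch, grad.shape[0]))
        proj = z @ grad                      # one shared scalar per sample
        acc += ((proj[:, None] * z) ** 2).sum(axis=0)
    return acc / num_samples

d = 1000
rng = np.random.default_rng(1)
# Illustrative gradient whose coordinate scales span three orders of magnitude.
grad = rng.standard_normal(d) * np.logspace(-3, 0, d)

true_sq = grad ** 2                          # what a first-order Adam would track
zo_sq = zo_second_moment(grad)               # what ZO-Adam would track
closed_form = grad @ grad + 2 * grad ** 2    # ||g||^2 + 2 g_i^2

true_spread = true_sq.max() / true_sq.min()
zo_spread = zo_sq.max() / zo_sq.min()
print(f"max/min of true squared gradient: {true_spread:.3g}")
print(f"max/min of ZO second moment:      {zo_spread:.3g}")
```

The shared term ||g||^2 dominates every coordinate once the gradient is spread over many dimensions, which is why a per-coordinate second-moment buffer carries little information beyond a single scalar in this setting.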
Original abstract
We investigate the effectiveness of adaptive zeroth-order (ZO) optimization for memory-constrained fine-tuning of large language models (LLMs). Contrary to prior claims, we show that adaptive ZO methods such as ZO-Adam offer no convergence advantage over well-tuned ZO-SGD, while incurring significant memory overhead. Our analysis reveals that in high dimensions, ZO gradients lack coordinate-wise heterogeneity, rendering adaptive mechanisms memory inefficient. Leveraging this insight, we propose MEAZO, a memory-efficient adaptive ZO optimizer that tracks only a single scalar for global step size adaptation. We support our method with theoretical convergence guarantees under standard assumptions. Experiments across multiple LLM families and tasks demonstrate that MEAZO matches ZO-Adam's performance with the memory footprint of ZO-SGD. Additional experiments on synthetic quadratic problems and LLM fine-tuning further demonstrate MEAZO's enhanced robustness to step size choices, particularly in grouped or block-structured optimization settings.
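The abstract does not spell out MEAZO's update rule, so the following is only a hypothetical illustration of what "a single scalar for global step size adaptation" could look like. It assumes an Adam-style exponential moving average of the squared projected gradient; the function name `zo_scalar_adaptive_sketch`, the hyperparameters, and the quadratic test problem are all assumptions made for this sketch, not the authors' algorithm.

```python
import numpy as np

def zo_scalar_adaptive_sketch(loss, x0, lr=0.01, beta=0.99, mu=1e-3,
                              eps=1e-8, steps=2000, seed=0):
    """Two-point ZO descent with one scalar of optimizer state (hypothetical)."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    v = 0.0                                  # the only state: a single float
    for t in range(1, steps + 1):
        z = rng.standard_normal(x.shape)
        # Central finite difference along z: a scalar projected gradient.
        proj = (loss(x + mu * z) - loss(x - mu * z)) / (2 * mu)
        v = beta * v + (1 - beta) * proj ** 2
        v_hat = v / (1 - beta ** t)          # Adam-style bias correction
        # Same global scaling for every coordinate.
        x -= lr * proj / (np.sqrt(v_hat) + eps) * z
    return x

# Illustrative ill-conditioned quadratic, f(x) = 0.5 * sum(h_i * x_i^2).
h = np.logspace(-2, 0, 100)
quadratic = lambda x: 0.5 * np.sum(h * x ** 2)

x0 = np.ones(100)
initial_loss = quadratic(x0)
final_loss = quadratic(zo_scalar_adaptive_sketch(quadratic, x0))
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
```

The design point is the state size: ZO-Adam keeps two buffers the size of the model, while this sketch keeps one float, which is the kind of memory footprint the abstract attributes to ZO-SGD.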
Editorial analysis
A structured set of objections, weighed in public.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: standard smoothness and bounded-variance assumptions for ZO convergence analysis
invented entities (1)
- MEA single-scalar adaptive mechanism: no independent evidence