
arxiv: 2511.02757 · v2 · submitted 2025-11-04 · 💻 cs.LG · math.OC · stat.ML

ConMeZO: Adaptive Descent-Direction Sampling for Gradient-Free Finetuning of Large Language Models

Pith reviewed 2026-05-18 01:02 UTC · model grok-4.3

classification 💻 cs.LG · math.OC · stat.ML
keywords zeroth-order optimization · large language models · fine-tuning · gradient-free optimization · adaptive sampling · momentum · convergence analysis

The pith

ConMeZO accelerates zeroth-order fine-tuning of large language models by sampling descent directions inside a momentum-guided cone instead of uniformly at random.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Zeroth-order optimization avoids the memory cost of backpropagation when fine-tuning billion-parameter language models, but it makes slow progress because random search directions in high dimensions rarely align with the true gradient. ConMeZO changes the sampling step by drawing directions only inside a cone centered on a running momentum estimate. This restriction is intended to increase the chance that sampled directions point downhill. The paper proves that the modified sampling still delivers the same worst-case convergence guarantee as standard MeZO. On natural-language fine-tuning tasks, the new method reaches target performance up to twice as fast while keeping the same low memory footprint.
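The alignment problem described above can be made concrete with a short worked calculation. This is a sketch under assumptions not stated in this summary: the loss is written $\mathcal{L}$, the perturbation scale $\epsilon$, the direction $z$ is drawn uniformly from the sphere of radius $\sqrt{d}$, and the symmetric two-point form is the one commonly used by MeZO-style methods.

```latex
\hat{g}(x; z)
  = \frac{\mathcal{L}(x + \epsilon z) - \mathcal{L}(x - \epsilon z)}{2\epsilon}\, z
  \;\approx\; \big(z^\top \nabla \mathcal{L}(x)\big)\, z,
\qquad
\mathbb{E}\!\left[\cos^2 \angle\big(z, \nabla \mathcal{L}(x)\big)\right] = \frac{1}{d}.
```

Under these assumptions a uniformly drawn direction captures, on average, only a $1/d$ fraction of the squared gradient norm, which is the dimension penalty that a better-aligned sampling distribution tries to reduce.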

Core claim

ConMeZO restricts zeroth-order direction samples to a cone around a momentum vector, which increases the probability that each perturbation lies closer to the true gradient and thereby lowers the effective dimension penalty that slows uniform random sampling. The worst-case convergence rate nonetheless remains identical to that of MeZO.

What carries the argument

The momentum-centered cone sampler that replaces uniform random direction draws in the zeroth-order gradient estimate.
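A cone sampler of this kind can be sketched in a few lines. The version below is a minimal illustration, not the paper's implementation: it assumes the direction lies exactly on the cone surface at a fixed angle `theta` from the momentum and has norm sqrt(d). The function name `cone_direction` and its parameters are hypothetical.

```python
import numpy as np

def cone_direction(momentum, theta, rng):
    """Return a direction of norm sqrt(d) at angle theta from `momentum`.

    Assumption: the sample sits on the cone surface (fixed angle theta);
    the paper's sampler may choose the angle differently.
    """
    d = momentum.size
    m_hat = momentum / np.linalg.norm(momentum)  # requires nonzero momentum
    u = rng.standard_normal(d)
    u -= (u @ m_hat) * m_hat   # remove the component along m_hat
    u /= np.linalg.norm(u)     # unit vector orthogonal to m_hat
    return np.sqrt(d) * (np.cos(theta) * m_hat + np.sin(theta) * u)

rng = np.random.default_rng(0)
m = rng.standard_normal(1000)
z = cone_direction(m, theta=np.pi / 3, rng=rng)
cos_angle = (z @ m) / (np.linalg.norm(z) * np.linalg.norm(m))
```

In this sketch, `theta = 0` collapses the sample onto the momentum direction and `theta = pi / 2` makes it orthogonal to the momentum, so `theta` controls how strongly the search trusts the momentum estimate.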

If this is right

  • ConMeZO matches the worst-case convergence rate of MeZO.
  • On natural-language fine-tuning benchmarks ConMeZO reaches the same accuracy up to twice as fast as MeZO.
  • The method preserves the memory advantage of zeroth-order optimizers that avoid storing activations or gradients.
  • The adaptive sampling reduces the practical effect of high dimensionality without changing the theoretical guarantee.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same cone restriction might improve other zeroth-order methods that currently rely on uniform sampling in high-dimensional spaces.
  • Combining the cone sampler with variance-reduction techniques already used in first-order methods could yield further practical speed-ups.
  • The approach may transfer to non-language models where gradient storage is the dominant memory cost.

Load-bearing premise

The momentum estimate reliably points toward regions where the true gradient is more likely to lie than a uniform random direction would.
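This premise can be monitored directly on problems where the true gradient is available, for example through the squared cosine similarity that Figure 6 reports. The snippet below is a hedged sketch: the exponential-moving-average momentum rule and the names `update_momentum` and `squared_cosine` are assumptions for illustration, not details confirmed by this summary.

```python
import numpy as np

def update_momentum(momentum, grad_estimate, beta=0.9):
    """Exponential moving average of gradient estimates (assumed update rule)."""
    return beta * momentum + (1.0 - beta) * grad_estimate

def squared_cosine(a, b):
    """cos^2 of the angle between a and b; 0.0 if either vector is zero."""
    denom = (a @ a) * (b @ b)
    return float((a @ b) ** 2 / denom) if denom > 0 else 0.0

d = 10_000
rng = np.random.default_rng(0)
true_grad = rng.standard_normal(d)
random_dir = rng.standard_normal(d)

baseline = squared_cosine(random_dir, true_grad)  # on the order of 1/d
aligned = squared_cosine(true_grad, true_grad)    # perfectly aligned case
```

A momentum vector is useful as a cone center only if its squared cosine with the true gradient stays well above the roughly 1/d level of a random direction.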

What would settle it

An experiment on a high-dimensional quadratic in which the momentum vector consistently points opposite the true gradient, and in which ConMeZO shows no speedup over plain MeZO or reaches a worse final loss.
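One way to set up such a test is sketched below, under stated assumptions: a diagonal quadratic, the symmetric two-point estimate used by MeZO-style methods, a cone direction at a fixed angle `theta` from the momentum, and a step size chosen for this toy problem only. All function names are hypothetical. Note that the two-point estimate is unchanged when z is replaced by -z, so under this estimator an anti-aligned momentum behaves in distribution like an aligned one; the sketch therefore also includes an orthogonal-momentum mode as the less favorable case.

```python
import numpy as np

def quad_loss(x, a):
    """Diagonal quadratic f(x) = 0.5 * sum(a_i * x_i^2)."""
    return 0.5 * np.sum(a * x * x)

def two_point_grad(x, z, a, eps=1e-3):
    """Symmetric two-point zeroth-order estimate along direction z."""
    slope = (quad_loss(x + eps * z, a) - quad_loss(x - eps * z, a)) / (2 * eps)
    return slope * z

def sphere_direction(d, rng):
    """Uniform direction on the sphere of radius sqrt(d)."""
    u = rng.standard_normal(d)
    return np.sqrt(d) * u / np.linalg.norm(u)

def cone_direction(m, theta, rng):
    """Direction of norm sqrt(d) at fixed angle theta from m (assumed sampler)."""
    m_hat = m / np.linalg.norm(m)
    u = rng.standard_normal(m.size)
    u -= (u @ m_hat) * m_hat
    u /= np.linalg.norm(u)
    return np.sqrt(m.size) * (np.cos(theta) * m_hat + np.sin(theta) * u)

def run(mode, d=200, steps=300, theta=1.4, seed=0):
    """Final loss after `steps` zeroth-order updates in the given sampling mode."""
    rng = np.random.default_rng(seed)
    a = np.linspace(1.0, 10.0, d)
    x = np.ones(d)
    lr = 1.0 / (a.max() * (d + 4))  # conservative step size for this toy problem
    for _ in range(steps):
        grad = a * x  # true gradient, used only to construct the test momentum
        if mode == "uniform":
            z = sphere_direction(d, rng)
        elif mode == "opposite":
            z = cone_direction(-grad, theta, rng)
        elif mode == "orthogonal":
            m = rng.standard_normal(d)
            m -= (m @ grad) / (grad @ grad) * grad  # remove gradient component
            z = cone_direction(m, theta, rng)
        else:
            raise ValueError(f"unknown mode: {mode}")
        x = x - lr * two_point_grad(x, z, a)
    return quad_loss(x, a)

results = {mode: run(mode) for mode in ("uniform", "opposite", "orthogonal")}

# Sign symmetry of the two-point estimate: z and -z give the same update.
x0, a0 = np.ones(4), np.arange(1.0, 5.0)
z0 = np.array([1.0, -2.0, 0.5, 3.0])
g_plus = two_point_grad(x0, z0, a0)
g_minus = two_point_grad(x0, -z0, a0)
```

Comparing `results["uniform"]` with the two cone modes across several seeds and values of `theta` is the measurement the paragraph above asks for; the sketch does not assume any particular outcome.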

Figures

Figures reproduced from arXiv: 2511.02757 by Bingcong Li, Kiran Koshy Thekumparampil, Lejs Deen Behric, Liang Zhang.

Figure 1
ConMeZO achieves 2× speedup over MeZO when finetuning OPT-1.3B on SQuAD dataset. Zeroth-order optimization (ZO) methods, such as those employed by MeZO (Malladi et al., 2023), offer a promising alternative. By relying only on forward passes to estimate gradients, ZO methods bypass the memory-intensive backward pass, facilitating finetuning in resource-constrained scenarios. Despite their advantages, ZO me…
Figure 2
2D- and 3D-representation of the cone-sampling approach. (a) Sphere with radius $\sqrt{d}$ and (gray) search space cone of half-angle $\theta$ around promising search direction $\hat{m}_t$. We can set random direction $z_t = z_t^{\parallel} + z_t^{\perp}$ with angle $\gamma$ to $\hat{m}_t$. (b) 3D representation of cone sampling in red area. … thereby striking a more effective balance between exploration and exploitation. Next sections formalize the cone-sampl…
Figure 3
Synthetic Optimization Problem: ConMeZO achieves 2.45× speedup over MeZO on the synthetic quadratic problem.
Figure 4
Peak GPU memory usage (MiB) increase of ConMeZO over MeZO is negligible when compared to the memory usage of first-order methods like AdamW. Top: RoBERTa-Large on SST2 (batch size 64). Bottom: OPT-1.3B on BoolQ (batch size 16). … task and achieves the best average across the tasks. The advantages of ConMeZO we observed in RoBERTa carry over seamlessly to the larger OPT-1.3B and OPT-13B models. That is, ConM…
Figure 5
Heatmaps of Test Accuracy of ConMeZO on TREC dataset for different …
Figure 6
Squared cosine similarity between real gradient and momentum vector during training.
Figure 7
Test Accuracy of ConMeZO with settings mentioned in C.2 compared to MeZO over 10,000 iterations.
Original abstract

Zeroth-order or derivative-free optimization (MeZO) is an attractive strategy for finetuning large language models (LLMs) because it eliminates the memory overhead of backpropagation. However, it converges slowly due to the inherent curse of dimensionality when searching for descent directions in the high-dimensional parameter space of billion-scale LLMs. We propose ConMeZO, a novel zeroth-order optimizer that accelerates convergence by adaptive directional sampling. Instead of drawing the direction uniformly at random, ConMeZO restricts the sampling to a cone centered around a momentum estimate. This concentrates the search in directions where the true gradient is more likely to lie and thus reduces the effect of high dimensions. We prove that ConMeZO achieves the same worst-case convergence rate as MeZO. Empirically, when finetuning LLMs on natural language tasks, ConMeZO is up to 2X faster than MeZO while retaining the low-memory footprint of zeroth-order methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper proposes ConMeZO, a zeroth-order optimizer for finetuning LLMs that samples descent directions from a cone centered on a momentum estimate rather than uniformly from the unit sphere. It claims to achieve the same worst-case convergence rate as MeZO while delivering up to 2X empirical speedup on natural language tasks, all while retaining the low-memory footprint of zeroth-order methods.

Significance. If the convergence-rate claim holds under the modified sampling distribution and the empirical speedups are reproducible with proper controls, this could provide a practical advance for memory-efficient LLM fine-tuning by improving the convergence behavior of zeroth-order methods without added memory cost. The combination of a stated proof and LLM-scale experiments is a positive feature.

major comments (2)
  1. [Abstract / Theoretical Analysis] Abstract and theoretical analysis section: The central claim that ConMeZO matches MeZO's worst-case convergence rate requires explicit treatment of the modified direction distribution. When u is sampled from the cone rather than the full sphere, E[u u^T] is no longer (1/d)I. The analysis must derive or bound the effect on expected descent and variance to confirm the rate is unchanged; otherwise the proof does not fully support the claim. Please supply the key derivation steps, including any adjustments to step-size or constants.
  2. [Experimental Results] Experimental results section: The reported up to 2X speedup is load-bearing for the practical contribution, yet the manuscript provides no details on experimental protocol, number of runs, error bars, cone aperture, momentum update rule, or exact models/tasks. This prevents assessment of robustness and reproducibility.
minor comments (3)
  1. Clarify the precise definition of the cone (aperture angle, centering, and adaptation schedule) and how the momentum estimate is initialized and updated.
  2. Add error bars or confidence intervals to all speedup plots and tables to support the quantitative claims.
  3. Include a brief comparison to other adaptive or momentum-based zeroth-order methods in the related-work section.
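Major comment 1 can be probed numerically before any proof is written down. The sketch below estimates E[u u^T] by Monte Carlo for one assumed sampler (unit vectors at a fixed angle `theta` from a unit momentum direction) and compares it with the uniform value (1/d)I. The sampler, the function name `unit_cone_samples`, and the closed form in the comments apply to this assumed construction only, not necessarily to the paper's.

```python
import numpy as np

def unit_cone_samples(m_hat, theta, n, rng):
    """n unit vectors at angle theta from the unit vector m_hat (assumed sampler)."""
    v = rng.standard_normal((n, m_hat.size))
    v -= np.outer(v @ m_hat, m_hat)                # project out m_hat
    v /= np.linalg.norm(v, axis=1, keepdims=True)  # unit rows orthogonal to m_hat
    return np.cos(theta) * m_hat + np.sin(theta) * v

d, theta, n = 8, np.pi / 4, 200_000
rng = np.random.default_rng(0)
m_hat = np.zeros(d)
m_hat[0] = 1.0

u = unit_cone_samples(m_hat, theta, n, rng)
empirical = u.T @ u / n  # Monte Carlo estimate of E[u u^T]

# Closed form for this sampler: weight cos^2(theta) along m_hat and
# sin^2(theta) / (d - 1) on each direction orthogonal to it.
proj = np.outer(m_hat, m_hat)
closed_form = (np.cos(theta) ** 2 * proj
               + np.sin(theta) ** 2 / (d - 1) * (np.eye(d) - proj))
uniform = np.eye(d) / d  # second moment of a uniform unit direction
max_err = np.abs(empirical - closed_form).max()
```

For this sampler the second moment is anisotropic, so any rate argument has to account for how much of the true gradient lies along the momentum direction, which is the point the comment raises.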

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback on our manuscript. We appreciate the recognition of the potential practical contribution to memory-efficient LLM fine-tuning. We address each major comment below and will revise the manuscript to incorporate the suggested improvements.

  1. Referee: [Abstract / Theoretical Analysis] Abstract and theoretical analysis section: The central claim that ConMeZO matches MeZO's worst-case convergence rate requires explicit treatment of the modified direction distribution. When u is sampled from the cone rather than the full sphere, E[u u^T] is no longer (1/d)I. The analysis must derive or bound the effect on expected descent and variance to confirm the rate is unchanged; otherwise the proof does not fully support the claim. Please supply the key derivation steps, including any adjustments to step-size or constants.

    Authors: We agree that the theoretical analysis would benefit from more explicit treatment of the cone sampling distribution. In the revised manuscript, we will add a short derivation (or lemma) showing that the restricted sampling still yields a worst-case expected descent that is at least as strong as the uniform case up to a dimension-independent constant. Specifically, we bound the minimum alignment probability within the cone and show that the second-moment matrix E[u u^T] deviates from (1/d)I by a multiplicative factor that can be absorbed into the step-size choice without changing the O(1/sqrt(T)) rate. The key steps will be included in the main text or appendix. revision: yes

  2. Referee: [Experimental Results] Experimental results section: The reported up to 2X speedup is load-bearing for the practical contribution, yet the manuscript provides no details on experimental protocol, number of runs, error bars, cone aperture, momentum update rule, or exact models/tasks. This prevents assessment of robustness and reproducibility.

    Authors: We acknowledge that the current experimental section is insufficiently detailed for reproducibility. In the revised manuscript we will expand the experimental protocol to report the number of independent runs, error bars (standard deviation across seeds), the cone aperture angle employed, the precise momentum update rule, and the exact models and tasks used. These additions will allow readers to assess the robustness of the reported speedups. revision: yes

Circularity Check

0 steps flagged

No significant circularity in convergence claim or sampling modification


The paper's central theoretical result is a proof that ConMeZO matches MeZO's worst-case convergence rate despite cone-restricted sampling. This is presented as a direct extension of standard zeroth-order analysis rather than a reduction to a fitted quantity, self-defined term, or prior self-citation chain. The adaptive cone is motivated by momentum but the rate claim is asserted to hold under the modified distribution without the result being tautological by construction. Empirical speedups are reported separately and do not feed back into the proof. No load-bearing self-citation, ansatz smuggling, or renaming of known results is required for the stated derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard zeroth-order convergence assumptions plus the modeling choice that momentum provides a useful centering direction for the sampling cone; no new entities or fitted parameters are introduced in the abstract.

axioms (1)
  • [standard math] Standard assumptions underlying worst-case convergence analysis for zeroth-order methods (smoothness, bounded variance, etc.)
    Invoked to prove that ConMeZO matches MeZO's worst-case rate.


