ConMeZO: Adaptive Descent-Direction Sampling for Gradient-Free Finetuning of Large Language Models
Pith reviewed 2026-05-18 01:02 UTC · model grok-4.3
The pith
ConMeZO accelerates zeroth-order fine-tuning of large language models by sampling descent directions inside a momentum-guided cone instead of uniformly at random.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Restricting zeroth-order direction samples to a cone around a momentum vector raises the probability that each perturbation is closely aligned with the true gradient, which lowers the effective dimension penalty that slows uniform random sampling. The worst-case convergence rate remains identical to that of MeZO.
What carries the argument
The momentum-centered cone sampler that replaces uniform random direction draws in the zeroth-order gradient estimate.
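A minimal sketch of such a sampler, assuming the direction is formed as `cos(theta) * m_hat + sin(theta) * u` with `u` uniform on the unit sphere. This mixing form, the function names, and the parameter `theta` are illustrative assumptions; the page does not state the paper's exact construction.

```python
import math
import random


def unit(v):
    """Scale v to unit Euclidean norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def sample_cone_direction(momentum, theta, rng):
    """Mix the momentum direction with a uniform random unit vector.

    Assumed form (not confirmed by the source):
        z = cos(theta) * m_hat + sin(theta) * u
    theta = 0 returns m_hat; theta = pi/2 is approximately uniform sampling.
    """
    m_hat = unit(momentum)
    # A normalized Gaussian vector is uniform on the unit sphere.
    u = unit([rng.gauss(0.0, 1.0) for _ in m_hat])
    return [math.cos(theta) * a + math.sin(theta) * b
            for a, b in zip(m_hat, u)]


rng = random.Random(0)
z = sample_cone_direction([1.0, 2.0, 3.0, 4.0], theta=0.3, rng=rng)
```

A smaller `theta` keeps samples closer to the momentum direction, so the angle trades exploration against trust in the momentum estimate.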
If this is right
- ConMeZO matches the worst-case convergence rate of MeZO.
- On natural-language fine-tuning benchmarks ConMeZO reaches the same accuracy up to twice as fast as MeZO.
- The method preserves the memory advantage of zeroth-order optimizers that avoid storing activations or gradients.
- The adaptive sampling reduces the practical effect of high dimensionality without changing the theoretical guarantee.
Where Pith is reading between the lines
- The same cone restriction might improve other zeroth-order methods that currently rely on uniform sampling in high-dimensional spaces.
- Combining the cone sampler with variance-reduction techniques already used in first-order methods could yield further practical speed-ups.
- The approach may transfer to non-language models where gradient storage is the dominant memory cost.
Load-bearing premise
The momentum estimate reliably points toward regions where the true gradient is more likely to lie than a uniform random direction would.
What would settle it
An experiment on a high-dimensional quadratic in which the momentum vector consistently points opposite the true gradient, and in which ConMeZO shows no speedup over plain MeZO or reaches a worse final loss.
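A toy harness for this kind of check might look like the sketch below. It is not the paper's experiment. The isotropic quadratic, the momentum rule `m = beta * m + (1 - beta) * g`, the normalized mixing of momentum and noise, and every name and default value are assumptions made for illustration.

```python
import math
import random


def quad_loss(x):
    """Isotropic quadratic; its gradient at x is x itself."""
    return 0.5 * sum(v * v for v in x)


def unit(v):
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v] if n > 0.0 else list(v)


def run_zo(d, steps, lr, seed, theta=None, beta=0.9, eps=1e-3):
    """Return the final loss. theta=None means uniform directions."""
    rng = random.Random(seed)
    x = [1.0] * d
    m = [0.0] * d  # momentum estimate (assumed exponential average)
    for _ in range(steps):
        u = unit([rng.gauss(0.0, 1.0) for _ in range(d)])
        z = u
        if theta is not None and any(m):
            c = unit(m)
            # Assumed cone sample: mix the center with noise, then normalize.
            z = unit([math.cos(theta) * a + math.sin(theta) * b
                      for a, b in zip(c, u)])
        plus = quad_loss([p + eps * q for p, q in zip(x, z)])
        minus = quad_loss([p - eps * q for p, q in zip(x, z)])
        scale = (plus - minus) / (2.0 * eps)  # directional derivative
        g = [scale * q for q in z]
        m = [beta * a + (1.0 - beta) * b for a, b in zip(m, g)]
        x = [p - lr * q for p, q in zip(x, g)]
    return quad_loss(x)


uniform_loss = run_zo(d=50, steps=200, lr=1.0, seed=0)
cone_loss = run_zo(d=50, steps=200, lr=1.0, seed=0, theta=1.2)
```

In this sketch the two-point estimate multiplies a directional derivative by the direction itself, so replacing `z` with `-z` leaves the update unchanged. A deliberately misleading center for this harness would therefore be one that is nearly orthogonal to the gradient; substituting such a center for `c` and comparing final losses across several seeds is one way to probe the premise.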
Original abstract
Zeroth-order or derivative-free optimization (MeZO) is an attractive strategy for finetuning large language models (LLMs) because it eliminates the memory overhead of backpropagation. However, it converges slowly due to the inherent curse of dimensionality when searching for descent directions in the high-dimensional parameter space of billion-scale LLMs. We propose ConMeZO, a novel zeroth-order optimizer that accelerates convergence by adaptive directional sampling. Instead of drawing the direction uniformly at random, ConMeZO restricts the sampling to a cone centered around a momentum estimate. This concentrates the search in directions where the true gradient is more likely to lie and thus reduces the effect of high dimensions. We prove that ConMeZO achieves the same worst-case convergence rate as MeZO. Empirically, when finetuning LLMs on natural language tasks, ConMeZO is up to 2X faster than MeZO while retaining the low-memory footprint of zeroth-order methods.
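For readers unfamiliar with the baseline, the two-point zeroth-order estimate used by MeZO-style methods can be sketched as follows. The function names and the plain-list parameter representation are illustrative; practical implementations perturb model weights in place and regenerate the direction from a random seed to save memory, which this sketch does not attempt.

```python
import random


def two_point_estimate(loss, params, direction, eps=1e-3):
    """Central finite difference along `direction`, times the direction."""
    plus = [p + eps * z for p, z in zip(params, direction)]
    minus = [p - eps * z for p, z in zip(params, direction)]
    scale = (loss(plus) - loss(minus)) / (2.0 * eps)
    return [scale * z for z in direction]


def zo_sgd_step(loss, params, lr, rng, eps=1e-3):
    """One update with a Gaussian direction; only two loss evaluations."""
    z = [rng.gauss(0.0, 1.0) for _ in params]
    grad_est = two_point_estimate(loss, params, z, eps)
    return [p - lr * g for p, g in zip(params, grad_est)]


def sum_of_squares(x):
    return sum(v * v for v in x)


params = zo_sgd_step(sum_of_squares, [1.0, 2.0, 3.0], lr=0.01,
                     rng=random.Random(0))
```

Each step needs only forward evaluations of the loss, which is why no activations or gradients have to be stored.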
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes ConMeZO, a zeroth-order optimizer for finetuning LLMs that samples descent directions from a cone centered on a momentum estimate rather than uniformly from the unit sphere. It claims to achieve the same worst-case convergence rate as MeZO while delivering up to 2X empirical speedup on natural language tasks, all while retaining the low-memory footprint of zeroth-order methods.
Significance. If the convergence-rate claim holds under the modified sampling distribution and the empirical speedups are reproducible with proper controls, this could provide a practical advance for memory-efficient LLM fine-tuning by improving the convergence behavior of zeroth-order methods without added memory cost. The combination of a stated proof and LLM-scale experiments is a positive feature.
major comments (2)
- [Abstract / Theoretical Analysis] Abstract and theoretical analysis section: The central claim that ConMeZO matches MeZO's worst-case convergence rate requires explicit treatment of the modified direction distribution. When u is sampled from the cone rather than the full sphere, E[u u^T] is no longer (1/d)I. The analysis must derive or bound the effect on expected descent and variance to confirm the rate is unchanged; otherwise the proof does not fully support the claim. Please supply the key derivation steps, including any adjustments to step-size or constants.
- [Experimental Results] Experimental results section: The reported up to 2X speedup is load-bearing for the practical contribution, yet the manuscript provides no details on experimental protocol, number of runs, error bars, cone aperture, momentum update rule, or exact models/tasks. This prevents assessment of robustness and reproducibility.
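The second-moment concern in the first major comment can be made concrete with a short calculation. It assumes a cone sample of the form z = cos(θ) m̂ + sin(θ) u, where m̂ is the unit momentum direction and u is uniform on the unit sphere; that form is an assumption of this sketch rather than something stated on this page. Here g denotes the true gradient and ρ the angle between m̂ and g.

```latex
\begin{align*}
% Uniform sampling on the unit sphere
\mathbb{E}[u u^\top] &= \tfrac{1}{d} I
  \quad\Longrightarrow\quad
  \mathbb{E}\big[(u^\top g)^2\big] = \tfrac{1}{d}\,\|g\|^2 , \\
% Assumed cone sample; cross terms vanish because E[u] = 0
\mathbb{E}[z z^\top] &= \cos^2\theta\,\hat m \hat m^\top + \tfrac{\sin^2\theta}{d}\, I , \\
\mathbb{E}\big[(z^\top g)^2\big]
  &= \Big(\cos^2\theta\,\cos^2\rho + \tfrac{\sin^2\theta}{d}\Big)\|g\|^2 , \\
% Gain over uniform sampling
\mathbb{E}\big[(z^\top g)^2\big] - \tfrac{1}{d}\|g\|^2
  &= \cos^2\theta\,\Big(\cos^2\rho - \tfrac{1}{d}\Big)\|g\|^2 .
\end{align*}
```

Under this assumption the cone helps whenever cos²(ρ) > 1/d. In the worst case ρ = π/2, the expected squared alignment is sin²(θ)/d, which differs from the uniform value only by the dimension-independent factor sin²(θ). The same combination, d cos²(θ) cos²(ρ) + sin²(θ), appears in the proof excerpt quoted in the reference graph.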
minor comments (3)
- Clarify the precise definition of the cone (aperture angle, centering, and adaptation schedule) and how the momentum estimate is initialized and updated.
- Add error bars or confidence intervals to all speedup plots and tables to support the quantitative claims.
- Include a brief comparison to other adaptive or momentum-based zeroth-order methods in the related-work section.
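The error-bar request could be met with a summary like the one below. It is a sketch only: the helper names are hypothetical, and pairing runs by seed and measuring speedup as a ratio of steps to a target accuracy are assumptions, not the authors' protocol.

```python
import statistics


def summarize_runs(values):
    """Mean and sample standard deviation across independent seeds."""
    mean = statistics.mean(values)
    std = statistics.stdev(values) if len(values) > 1 else 0.0
    return mean, std


def speedup_with_error(baseline_steps, method_steps):
    """Per-seed ratio of steps needed to reach a fixed target accuracy."""
    ratios = [b / m for b, m in zip(baseline_steps, method_steps)]
    return summarize_runs(ratios)
```

Reporting the per-seed ratio rather than a ratio of means keeps the spread of the speedup itself visible.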
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive feedback on our manuscript. We appreciate the recognition of the potential practical contribution to memory-efficient LLM fine-tuning. We address each major comment below and will revise the manuscript to incorporate the suggested improvements.
- Referee: [Abstract / Theoretical Analysis] Abstract and theoretical analysis section: The central claim that ConMeZO matches MeZO's worst-case convergence rate requires explicit treatment of the modified direction distribution. When u is sampled from the cone rather than the full sphere, E[u u^T] is no longer (1/d)I. The analysis must derive or bound the effect on expected descent and variance to confirm the rate is unchanged; otherwise the proof does not fully support the claim. Please supply the key derivation steps, including any adjustments to step-size or constants.
  Authors: We agree that the theoretical analysis would benefit from more explicit treatment of the cone sampling distribution. In the revised manuscript, we will add a short derivation (or lemma) showing that the restricted sampling still yields a worst-case expected descent that is at least as strong as the uniform case up to a dimension-independent constant. Specifically, we bound the minimum alignment probability within the cone and show that the second-moment matrix E[u u^T] deviates from (1/d)I by a multiplicative factor that can be absorbed into the step-size choice without changing the O(1/sqrt(T)) rate. The key steps will be included in the main text or appendix.
  revision: yes
- Referee: [Experimental Results] Experimental results section: The reported up to 2X speedup is load-bearing for the practical contribution, yet the manuscript provides no details on experimental protocol, number of runs, error bars, cone aperture, momentum update rule, or exact models/tasks. This prevents assessment of robustness and reproducibility.
  Authors: We acknowledge that the current experimental section is insufficiently detailed for reproducibility. In the revised manuscript we will expand the experimental protocol to report the number of independent runs, error bars (standard deviation across seeds), the cone aperture angle employed, the precise momentum update rule, and the exact models and tasks used. These additions will allow readers to assess the robustness of the reported speedups.
  revision: yes
Circularity Check
No significant circularity in convergence claim or sampling modification
The paper's central theoretical result is a proof that ConMeZO matches MeZO's worst-case convergence rate despite cone-restricted sampling. This is presented as a direct extension of standard zeroth-order analysis rather than a reduction to a fitted quantity, self-defined term, or prior self-citation chain. The adaptive cone is motivated by momentum but the rate claim is asserted to hold under the modified distribution without the result being tautological by construction. Empirical speedups are reported separately and do not feed back into the proof. No load-bearing self-citation, ansatz smuggling, or renaming of known results is required for the stated derivation.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] Standard assumptions underlying worst-case convergence analysis for zeroth-order methods (smoothness, bounded variance, etc.)
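For orientation, the assumptions named in this ledger entry are commonly written as below. This is the usual textbook form, not a quotation from the paper; the symbols ℓ (smoothness constant), σ² (variance bound), and ξ (data sample) are chosen here for illustration.

```latex
\begin{align*}
\|\nabla f(x) - \nabla f(y)\| &\le \ell\,\|x - y\|
  && \text{($\ell$-smoothness)} \\
f(y) &\le f(x) + \nabla f(x)^\top (y - x) + \tfrac{\ell}{2}\,\|y - x\|^2
  && \text{(descent lemma, implied by smoothness)} \\
\mathbb{E}_{\xi}\,\|\nabla f(x;\xi) - \nabla f(x)\|^2 &\le \sigma^2
  && \text{(bounded variance)}
\end{align*}
```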
Reference graph
Works this paper leans on
- [1] Alabdulkareem, A. and Honorio, J. (2021). Information-theoretic lower bounds for zero-order stochastic gradient estimation. In IEEE International Symposium on Information Theory (ISIT), pages 2316–2321. IEEE.
  Balasubramanian, K. and Ghadimi, S. (2018). Zeroth-order (non)-convex stochastic optimization via conditional gradient and gradient updates. Advances ...
- [2] Bentivogli, L., Clark, P., Dagan, I., and Giampiccolo, D. (2009). The fifth PASCAL recognizing textual entailment challenge.
  Bowman, S. R., Angeli, G., Potts, C., and Manning, C. D. (2015). A large annotated corpus for learning natural language inference. In Proceedings of the Conferenc...
  (Running header: Lejs Deen Behric, Liang Zhang, Bingcong Li, Kiran Koshy Thekumparampil)
- [3] Chen, Y., Zhang, Y., Cao, L., Yuan, K., and Wen, Z. (2025). Enhancing zeroth-order fine-tuning for language models with low-rank structures. In International Conference on Learning Representations.
  Choromanski, K., Rowland, M., Sindhwani, V., Turner, R., and Weller, A. (2018). Structured evolution with compact architectures for scalable policy optimiza...
- [4] Dagan, I., Glickman, O., and Magnini, B. (2005). The PASCAL recognising textual entailment challenge.
  Dua, D., Wang, Y., Dasigi, P., Stanovsky, G., Singh, S., and Gardner, M. (2019). DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the Conference of the North American Chapter of the Association for C...
- [5] Fang, W., Yu, Z., Jiang, Y., Shi, Y., Jones, C. N., and Zhou, Y. (2022). Communication-efficient stochastic zeroth-order optimization for federated learning. IEEE Transactions on Signal Processing, 70:5058–
- [6] Flaxman, A. D., Kalai, A. T., and McMahan, H. B. (2005). Online convex optimization in the bandit setting: Gradient descent without a gradient. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, pages 385–394.
  Gautam, T., Park, Y., Zhou, H., Raman, P., and Ha, W. (2024). Variance-reduced zeroth-order methods for fine-tuning language models...
- [7] Haim, R. B., Dagan, I., Dolan, B., Ferro, L., Giampiccolo, D., Magnini, B., and Szpektor, I. (2006). The second PASCAL recognising textual entailment challenge.
  Jamieson, K. G., Nowak, R., and Recht, B. (2012). Query complexity of derivative-free optimization. Advances in Neural Information Processing Systems,
- [8] Ji, K., Wang, Z., Zhou, Y., and Liang, Y. (2019). Improved zeroth-order variance reduced algorithms and analysis for nonconvex optimization. In International Conference on Machine Learning, pages 3100–3109. PMLR.
  Khashabi, D., Chaturvedi, S., Roth, M., Upadhyay, S., and Roth, D. (2018). Looking beyond the surface: A challenge set for reading comprehensio...
- [9] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. (2019b). RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
  Liu, Y., Zhu, Z., Gong, C., Cheng, M., Hsieh, C.-J., and You, Y. (2024). Sparse MeZO: Less parameters for better performance in zeroth-order L...
- [10] Nesterov, Y. (2003). Introductory lectures on convex optimization: A basic course, volume
- [11] Springer Science & Business Media.
  Nesterov, Y. and Spokoiny, V. (2017). Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17:527–566.
  Park, S., Yun, J., Kim, S., Kundu, S., and Yang, E. (2025). Unraveling zeroth-order optimization through the lens of low-dimensional structured perturbations. arXiv preprint ...
- [12] Williams, A., Nangia, N., and Bowman, S. (2018). A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, pages 1112–1122.
  Xu, M., Cai, D., Wu, Y., Li, X., and Wang, S. (2024). FwdLLM: Efficient federated finetuning of l...
- [13] Proof excerpt (the start of the left-hand side is cut off in the extraction):
  \[
  \ldots\ \cos^2(\theta)\cos^2(\rho) + \frac{d+4}{d}\sin^2(\theta)\ \ldots\ \|a_t\|^2
  = -\frac{d\cos^2(\theta)\cos^2(\rho) + \sin^2(\theta)}{2\ell d\,(1 + 4/d)}\,\|a_t\|^2
  \approx -\frac{d\cos^2(\theta)\cos^2(\rho) + \sin^2(\theta)}{2\ell d}\,\|a_t\|^2 .
  \]
  This completes the proof of the theorem.
  B Implementation and Practical Speedups
  As an additional contribution, ConMeZO introduces an efficient implementat...
- [14] Parameter Sensitivity & Ablation Study. Understanding the sensitivity of the optimizer to its hyperparameters, particularly momentum (β) and cone angle (θ), provides critical insights into its performance across different phases of optimization. Section C.4 explores their roles in convergence acceleration and alignment with the true gradient, highlightin...
- [15] (reading comprehension with commonsense reasoning), RTE (Dagan et al., 2005; Bentivogli et al.,
- [16] proposes a variance-reduced zeroth-order optimizer that achieves strong results for finetuning LLMs without relying on task-specific prompts, i.e., in the non-prompted finetuning setting. In contrast, ConMeZO focuses on the prompt-conditioned finetuning scenario, where optimization is performed in the presence of task prompts. While MeZO-SVRG improves sta...
- [17] Conceptually, LOZO and ConMeZO target different aspects of ZO training and are largely orthogonal. LOZO changes the parameterization of the model updates by constraining them to a low-rank subspace (via adapter rank and step-interval choices), i.e., it modifies the space in which parameters are updated. ConMeZO, in contrast, changes the ZO estimator and d...