
arxiv: 2605.09034 · v2 · pith:WYOTTIM7new · submitted 2026-05-09 · 💻 cs.LG

Accelerating Zeroth-Order Spectral Optimization with Partial Orthogonalization from Power Iteration

Pith reviewed 2026-05-19 17:29 UTC · model grok-4.3

classification 💻 cs.LG
keywords zeroth-order optimization, spectral optimization, power iteration, LLM fine-tuning, partial orthogonalization, variance reduction, edge devices

The pith

Partial orthogonalization via power iteration accelerates zeroth-order fine-tuning of large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that full orthogonalization, which helps spectral optimizers like Muon in the first-order case, breaks down under the high noise of zeroth-order gradient estimates. It replaces the Newton-Schulz iteration with a faster power-iteration step that amplifies only the dominant spectral directions, and it stabilizes this step by restricting updates to a low-variance subspace obtained by projecting onto momentum directions. This combination yields 1.5x to 4x faster convergence than ZO-Muon, the prior best zeroth-order spectral method, on SuperGLUE tasks with the OPT-13B model, while remaining competitive in final accuracy against MeZO, LOZO, and ZO-Muon across several models.

Core claim

Partial spectral orthogonalization obtained by substituting power iteration for Newton-Schulz, performed inside a momentum-projected subspace, stabilizes the search and exploits weak spectral directions effectively even when gradients are estimated only through function queries.

What carries the argument

Streaming power-iteration procedure for partial orthogonalization inside a momentum-derived subspace, which concentrates amplification on dominant directions while lowering variance enough for reliable zeroth-order use.
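
To make the contrast concrete, here is a minimal NumPy sketch; it is not the authors' implementation. It compares the classical cubic Newton-Schulz iteration (Muon itself uses a tuned polynomial variant) with a rank-k orthogonalization built from block power iteration. The function names, the iteration counts, and the small-SVD polar step are all assumptions made for illustration, and the paper's actual procedure may differ.

```python
import numpy as np


def newton_schulz_orthogonalize(G, steps=40):
    """Classical cubic Newton-Schulz iteration (illustrative, not Muon's tuned variant).

    Pushes every nonzero singular value of G toward 1, so weak (possibly
    noise-dominated) directions are amplified as much as strong ones.
    """
    X = G / np.linalg.norm(G)  # Frobenius scaling puts singular values in (0, 1]
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X


def partial_orthogonalize(G, k, iters=20, seed=0):
    """Rank-k orthogonalization via block power iteration (hypothetical helper).

    The k dominant singular values are set to 1; all others are set to 0.
    """
    rng = np.random.default_rng(seed)
    V, _ = np.linalg.qr(rng.standard_normal((G.shape[1], k)))
    for _ in range(iters):
        # One power-iteration step on G^T G, re-orthonormalized by QR.
        V, _ = np.linalg.qr(G.T @ (G @ V))
    # Polar factor of the small (m x k) matrix G V, mapped back through V.
    P, _, Zt = np.linalg.svd(G @ V, full_matrices=False)
    return (P @ Zt) @ V.T


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    U, _ = np.linalg.qr(rng.standard_normal((8, 6)))
    W, _ = np.linalg.qr(rng.standard_normal((6, 6)))
    # Synthetic matrix with three strong and three weak singular values.
    G = U @ np.diag([10.0, 8.0, 6.0, 0.5, 0.4, 0.3]) @ W.T
    full = newton_schulz_orthogonalize(G)
    part = partial_orthogonalize(G, k=3)
    print(np.round(np.linalg.svd(full, compute_uv=False), 3))
    print(np.round(np.linalg.svd(part, compute_uv=False), 3))
```

On this synthetic matrix the first routine drives all six singular values toward 1, while the second keeps only the three dominant directions, which is the sense in which the orthogonalization is partial.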

If this is right

  • Convergence speed improves 1.5x to 4x over ZO-Muon on SuperGLUE datasets with OPT-13B.
  • Final accuracies remain competitive with MeZO, LOZO, and ZO-Muon while using less wall-clock time across tested models.
  • The streaming variant reduces per-step cost once the momentum subspace is formed.
  • The method works for hidden-layer training where spectral methods normally outperform AdamW-style updates.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same variance-reduction trick via momentum projection could be paired with other low-dimensional ZO estimators beyond spectral methods.
  • Testing the approach on models larger than 13B would show whether the relative speedup grows with parameter count.
  • The partial-orthogonalization idea may transfer to other noisy non-convex settings such as reinforcement learning with only reward queries.

Load-bearing premise

Projecting the search onto a momentum-derived subspace lowers gradient variance enough to stabilize the streaming power-iteration and permit useful partial orthogonalization despite noisy zeroth-order estimates.
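
As a rough illustration of what restricting the search to a subspace means for a zeroth-order estimator, the sketch below implements a generic two-point estimator whose perturbation is confined to the column space of a basis A. The estimator form, the names `zo_subspace_grad`, `A`, `Z`, `eps`, and the toy linear loss are assumptions; the excerpt does not show the paper's exact estimator. For a Gaussian direction the estimator's second moment grows with the number of perturbed coordinates, which is the usual motivation for searching in r x n rather than m x n dimensions.

```python
import numpy as np


def zo_subspace_grad(f, W, A, Z, eps=1e-3):
    """Two-point zeroth-order estimate expressed in subspace coordinates.

    W: (m, n) weights; A: (m, r) orthonormal basis; Z: (r, n) random direction.
    Returns an (r, n) estimate; the full-space update direction would be A @ g.
    """
    P = A @ Z  # perturbation confined to the column space of A
    slope = (f(W + eps * P) - f(W - eps * P)) / (2.0 * eps)
    return slope * Z


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m, n, r = 6, 4, 2
    C = rng.standard_normal((m, n))
    f = lambda W: float(np.sum(C * W))  # toy linear loss whose gradient is C
    A, _ = np.linalg.qr(rng.standard_normal((m, r)))
    W = rng.standard_normal((m, n))
    # Averaging many estimates should approach the projected gradient A^T C.
    est = np.mean(
        [zo_subspace_grad(f, W, A, rng.standard_normal((r, n))) for _ in range(20000)],
        axis=0,
    )
    print(np.abs(est - A.T @ C).max())  # Monte Carlo error; shrinks with more samples
```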

What would settle it

Remove the momentum-subspace projection step and measure whether power-iteration diverges or convergence speed drops back to ZO-Muon levels on the same OPT-13B SuperGlue fine-tuning runs.

Figures

Figures reproduced from arXiv: 2605.09034 by Jiahe Chen, Ziye Ma.

Figure 1: Comparison of FO and ZO gradient spectra on the …
Figure 2: Gemma2-2B fine-tuning comparison. Panel (a) reports test accuracy versus wall-clock time …
Figure 3: ZO-MOPI's training loss with and without momentum. The coordinate transformation is given by $M^{\text{new}}_{t-1} \leftarrow (A^{\text{new}})^\top A^{\text{old}} M^{\text{old}}_{t-1}$ (11), where $M^{\text{old}}_{t-1}, M^{\text{new}}_{t-1} \in \mathbb{R}^{r \times n}$ are the momentum representations in the old and new coordinate systems, respectively. Momentum plays a very important role in our algorithm beyond reducing variance, since the effectiveness of SPI is contingent on it. This contrasts with the a…
Figure 4: Lazy sampling interval choices. The lazy sampling strategy is equally crucial for SPI. In our implementation, we fix the subspace A and resample it every ν iterations. We need periodic updates in A to encourage exploration, but shifting it too frequently may hinder the efficacy of SPI since it requires strong continuity. If the subspace changes too frequently, the spectral directions can vary substantially acro…
Figure 5: OPT-13B fine-tuning efficiency (Accuracy vs. Wall-clock time) across four SuperGLUE …
Figure 6: OPT-13B fine-tuning efficiency (Relative time-to-same-accuracy) across four SuperGLUE …
Figure 7: Effect of the spectral rank k for fine-tuning OPT-1.3B on SST-2. The two panels show test accuracy and training loss under different choices of k. Panels: (a) accuracy vs. wall-clock time; (b) training loss vs. wall-clock time.
Figure 8: Effect of the subspace rank r for fine-tuning OPT-2.7B on SST-2. The two panels show test accuracy and training loss under different choices of r. … we follow common settings in prior work and use rank r = 8 with scaling factor α = 16. Detailed settings for both ZO and FO methods are summarized in …
Figure 9: OPT-13B training loss curves across four SuperGLUE tasks. Each panel reports training …
Figure 10: OPT-13B evaluation loss curves across four SuperGLUE tasks. Each panel reports …
Figure 11: LLaMA3-8B training loss curves across four SuperGLUE tasks. Each panel reports …
Figure 12: LLaMA3-8B accuracy curves across four SuperGLUE tasks. Each panel reports accuracy …
Figure 13: Gemma2-2B training loss curves across four SuperGLUE tasks. Each panel reports …
Figure 14: Gemma2-2B accuracy curves across four SuperGLUE tasks. Each panel reports training …
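
The coordinate transformation (11) quoted in the Figure 3 excerpt and the lazy resampling described in the Figure 4 excerpt can be sketched as follows. This is an illustration under stated assumptions: A is taken to be an m x r matrix with orthonormal columns, the excerpt does not show how the paper constructs A, so a random basis stands in for it, and the interval `nu`, the momentum coefficient, and the Gaussian stand-in for the gradient estimate are hypothetical values.

```python
import numpy as np


def sample_subspace(m, r, rng):
    """Random orthonormal basis A in R^{m x r} (placeholder for the paper's construction)."""
    A, _ = np.linalg.qr(rng.standard_normal((m, r)))
    return A


def transport_momentum(M_old, A_old, A_new):
    """Equation (11): lift M_old to full space with A_old, then project with A_new."""
    return A_new.T @ (A_old @ M_old)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m, n, r, nu = 16, 8, 4, 50  # nu: resampling interval (hypothetical value)
    A = sample_subspace(m, r, rng)
    M = np.zeros((r, n))
    for t in range(1, 201):
        if t % nu == 0:  # lazy sampling: change the basis only every nu steps
            A_new = sample_subspace(m, r, rng)
            M = transport_momentum(M, A, A_new)
            A = A_new
        G_sub = rng.standard_normal((r, n))  # stand-in for a ZO estimate in subspace coordinates
        M = 0.9 * M + 0.1 * G_sub            # momentum coefficient 0.9 is an assumption
    print(M.shape)
```

When the basis is unchanged the transformation reduces to the identity, because A has orthonormal columns; otherwise it keeps only the part of the old momentum that lies in the new subspace.
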
read the original abstract

Zeroth-order (ZO) optimization has become increasingly popular and important in fine-tuning large language models (LLMs), especially on edge devices due to its ability to adjust the model to local data without the need for memory-intensive back-propagation. Recent works try to reduce ZO variance through low-dimensional subspace search, but subspace restriction alone leaves key optimization geometry under-exploited, motivating additional acceleration. In this work, we focus on the hidden layer training problem in which spectral optimizers like Muon outperform AdamW due to their ability to exploit weak spectral directions by orthogonalization. However, we have discovered that unlike in the first-order setting, full orthogonalization works poorly in the ZO setting since the gradient estimates are highly noisy and unreliable. To address this issue, we propose applying partial spectral orthogonalization to accelerate ZO optimization. To do so, we replace the iconic Newton-Schulz procedure in Muon with the faster, more concentrated power-iteration method so that it only amplifies dominant spectral directions. Furthermore, to improve the efficiency and generalization of the algorithm, we adopt a streaming variant of power-iteration that requires low variance in gradients, which is achieved by constraining our search inside a subspace obtained through the projection of momentum, echoing recent advances. Experiments on LLM fine-tuning show that our method can achieve from 1.5x to 4x the convergence speed of ZO-Muon, the current SOTA algorithm, across SuperGLUE datasets on the OPT-13B model. Across different models, we also reach competitive final accuracies with less time in most cases compared with strong ZO baselines such as MeZO, LOZO and ZO-Muon. Code is available at https://github.com/MOFA-LAB/ZO-MOPI.git.
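
One plausible reading of the streaming variant is a warm-started power iteration that performs a single step per optimizer update and reuses the previous basis. The sketch below follows that reading; it is an assumption, not the released ZO-MOPI code, and the names `streaming_power_step` and `subspace_error`, the matrix sizes, and the noise levels are hypothetical. It also hints at why low variance matters: when the matrix changes little between steps, the carried-over basis typically stays close to the dominant subspace.

```python
import numpy as np


def streaming_power_step(M, V):
    """One warm-started block power-iteration step on M^T M (assumed form).

    M: (r, n) current momentum; V: (n, k) orthonormal basis from the previous step.
    """
    V_next, _ = np.linalg.qr(M.T @ (M @ V))
    return V_next


def subspace_error(V, V_ref):
    """Spectral-norm distance between the projectors onto span(V) and span(V_ref)."""
    return np.linalg.norm(V @ V.T - V_ref @ V_ref.T, 2)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    r, n, k = 8, 12, 2
    L, _ = np.linalg.qr(rng.standard_normal((r, r)))
    R, _ = np.linalg.qr(rng.standard_normal((n, r)))
    # Synthetic "clean" momentum with two dominant spectral directions.
    M_star = L @ np.diag([5.0, 4.0, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]) @ R.T
    V_ref = R[:, :k]
    for noise in (0.01, 1.0):
        V, _ = np.linalg.qr(rng.standard_normal((n, k)))
        for _ in range(100):
            M_t = M_star + noise * rng.standard_normal((r, n))  # noisy observation
            V = streaming_power_step(M_t, V)
        print(noise, subspace_error(V, V_ref))
```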

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces an algorithm for zeroth-order optimization in LLM fine-tuning that replaces full spectral orthogonalization (Newton-Schulz in Muon) with partial orthogonalization via power iteration to better handle noisy ZO gradient estimates. It further adopts a streaming power-iteration variant whose stability is achieved by projecting onto a momentum-derived subspace. Experiments report 1.5x–4x faster convergence than ZO-Muon on SuperGLUE tasks using OPT-13B, along with competitive final accuracies versus MeZO, LOZO, and ZO-Muon baselines. Code is released publicly.

Significance. If the reported speedups are shown to be robust, the work could meaningfully improve memory-efficient ZO fine-tuning by exploiting spectral geometry under noise. The public code release aids reproducibility.

major comments (2)
  1. Abstract and streaming-variant paragraph: the claim that momentum-subspace projection sufficiently lowers ZO gradient variance to stabilize streaming power iteration is load-bearing for attributing speedups to partial orthogonalization, yet no variance measurements, spectrum concentration analysis, or ablation isolating this projection effect are provided; because momentum is itself built from noisy ZO estimates, the reported 1.5–4× gains cannot yet be confidently linked to the proposed mechanism rather than other factors.
  2. Experimental results section: speedups are stated without error bars, run-to-run variance, or ablation on the partial-orthogonalization degree / power-iteration count, weakening the central empirical claim that the method reliably outperforms ZO-Muon across datasets and models.
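
The two diagnostics requested in the first comment could be computed along the following lines. This is a sketch with hypothetical helper names; the specific metrics (top-k energy fraction and trace of the sample covariance) are one reasonable choice, not something the manuscript or the report prescribes.

```python
import numpy as np


def spectral_concentration(G, k):
    """Fraction of the squared Frobenius norm carried by the top-k singular values."""
    s = np.linalg.svd(G, compute_uv=False)
    return float(np.sum(s[:k] ** 2) / np.sum(s ** 2))


def estimator_variance(samples):
    """Total variance (trace of the sample covariance) of repeated gradient estimates.

    samples: array of shape (num_samples, ...) collected at a fixed parameter point.
    """
    flat = np.asarray(samples, dtype=float).reshape(len(samples), -1)
    return float(np.sum(np.var(flat, axis=0, ddof=1)))
```

Evaluating `estimator_variance` on estimates drawn with and without the momentum projection, at the same parameters, would give the before/after comparison the referee asks for.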

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important gaps in empirical validation that we agree need to be addressed to strengthen the link between the proposed mechanism and the observed speedups. We respond to each major comment below and indicate the planned revisions.

read point-by-point responses
  1. Referee: Abstract and streaming-variant paragraph: the claim that momentum-subspace projection sufficiently lowers ZO gradient variance to stabilize streaming power iteration is load-bearing for attributing speedups to partial orthogonalization, yet no variance measurements, spectrum concentration analysis, or ablation isolating this projection effect are provided; because momentum is itself built from noisy ZO estimates, the reported 1.5–4× gains cannot yet be confidently linked to the proposed mechanism rather than other factors.

    Authors: We agree that the manuscript would be strengthened by direct evidence linking the momentum-subspace projection to variance reduction and stabilization of the streaming power iteration. The original submission motivates the projection by noting that it constrains the search to a lower-variance subspace derived from momentum estimates, but does not quantify this effect or isolate its contribution via ablation. In the revised version we will add (i) measurements of ZO gradient variance before and after momentum projection, (ii) spectrum concentration analysis showing how the projection affects the eigenvalue distribution seen by power iteration, and (iii) an ablation that removes the projection while keeping all other components fixed. These additions will allow a clearer attribution of the reported gains to the proposed mechanism. revision: yes

  2. Referee: Experimental results section: speedups are stated without error bars, run-to-run variance, or ablation on the partial-orthogonalization degree / power-iteration count, weakening the central empirical claim that the method reliably outperforms ZO-Muon across datasets and models.

    Authors: We acknowledge that the absence of error bars and hyperparameter ablations limits the strength of the empirical claims. The reported 1.5×–4× speedups were obtained from single runs per configuration, which is insufficient to assess run-to-run variability. In the revision we will rerun the SuperGLUE experiments on OPT-13B (and the additional model scales) with at least five independent random seeds, reporting mean convergence curves with standard-error bars. We will also add a dedicated ablation section that varies the number of power iterations (e.g., 1, 2, 4) and the degree of partial orthogonalization (truncation threshold) while measuring both wall-clock time and final accuracy, thereby demonstrating robustness across these design choices. revision: yes
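
The multi-seed reporting promised in the second response can be sketched as below: mean curves with standard-error bars, plus a relative time-to-same-accuracy ratio of the kind Figure 6 reports. The helper names are hypothetical and the numbers in the demo are synthetic placeholders, not results from the paper.

```python
import numpy as np


def mean_and_stderr(curves):
    """curves: (num_seeds, num_checkpoints) accuracies -> (mean, standard error)."""
    curves = np.asarray(curves, dtype=float)
    mean = curves.mean(axis=0)
    stderr = curves.std(axis=0, ddof=1) / np.sqrt(curves.shape[0])
    return mean, stderr


def time_to_accuracy(times, accuracy, target):
    """First wall-clock time at which accuracy >= target, or None if never reached."""
    hits = np.nonzero(np.asarray(accuracy) >= target)[0]
    return None if hits.size == 0 else float(np.asarray(times)[hits[0]])


def relative_speedup(times, acc_baseline, acc_method, target):
    """Baseline time divided by method time to reach the same target accuracy."""
    t_base = time_to_accuracy(times, acc_baseline, target)
    t_method = time_to_accuracy(times, acc_method, target)
    if t_base is None or t_method is None or t_method == 0:
        return None  # undefined when a run never reaches the target
    return t_base / t_method


if __name__ == "__main__":
    times = [0.0, 100.0, 200.0, 300.0]          # synthetic checkpoints (seconds)
    baseline = [[0.50, 0.60, 0.70, 0.80],       # synthetic accuracies, one row per seed
                [0.52, 0.58, 0.72, 0.82]]
    mean, stderr = mean_and_stderr(baseline)
    print(mean, stderr)
```

Reporting the speedup per seed and then aggregating, rather than computing it once from mean curves, would also expose how much the 1.5x to 4x range depends on the random seed.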

Circularity Check

0 steps flagged

No significant circularity; derivation is procedural and empirically validated

full rationale

The paper defines its core contribution procedurally: replacing Newton-Schulz with power iteration for partial orthogonalization in the zeroth-order regime, combined with a streaming variant that projects onto a momentum-derived subspace to reduce gradient variance. These steps are presented as algorithmic choices motivated by empirical observations about noise in ZO estimates, not as mathematical derivations that reduce the claimed 1.5x–4x speedup to a fitted constant, self-referential prediction, or self-citation chain. Performance results are reported from direct experiments on SuperGLUE with OPT-13B and comparisons to MeZO, LOZO, and ZO-Muon; no equations or uniqueness theorems are invoked that loop back to the paper's own inputs or prior unverified work by the same authors. The method is self-contained against external benchmarks via code release and empirical testing.

Axiom & Free-Parameter Ledger

1 free parameters · 0 axioms · 0 invented entities

The method relies on standard optimization assumptions that zeroth-order gradient estimates are unbiased but high-variance, and that power iteration can isolate dominant directions when variance is controlled. No new entities are postulated.

free parameters (1)
  • partial orthogonalization degree or power-iteration count
    The extent to which orthogonalization is applied partially rather than fully is a tunable choice that affects the tradeoff between noise amplification and direction exploitation.
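
A short worked relation shows why the power-iteration count acts as a dial on the degree of orthogonalization. It is the standard power-iteration identity, stated here for a single start vector lying in the span of the right singular vectors; the symbols q, c_i, and x_0 are notation introduced for this illustration.

```latex
G = \sum_i \sigma_i u_i v_i^\top, \qquad x_0 = \sum_i c_i v_i
\;\Longrightarrow\;
(G^\top G)^q x_0 = \sum_i \sigma_i^{2q} c_i v_i,
\qquad
\frac{\text{weight on } v_i}{\text{weight on } v_1}
  = \frac{c_i}{c_1}\left(\frac{\sigma_i}{\sigma_1}\right)^{2q}.
% Example: \sigma_2/\sigma_1 = 0.5 and q = 3 give (0.5)^6 = 1/64 \approx 0.016.
```

Each extra iteration therefore suppresses weaker directions geometrically, so a small q leaves more of the spectrum in play while a large q concentrates the update on the leading directions.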



