AR1-ZO: Topology-Aware Rank-1 Zeroth-Order Queries for High-Rank LoRA Fine-Tuning

Chenyu Zhang; Hongbin Lin; Xiangda Yan; Yao Shu; Yongjie Yang; Ziye Chen

arxiv: 2605.19767 · v1 · pith:GFPVKU6Rnew · submitted 2026-05-19 · 💻 cs.LG · cs.AI

Ziye Chen, Hongbin Lin, Chenyu Zhang, Xiangda Yan, Yongjie Yang, Yao Shu

Pith reviewed 2026-05-20 07:37 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords zeroth-order optimization · LoRA · large language models · rank-1 queries · finite-difference · fine-tuning · topology-aware scaling

The pith

Rank-1 atom queries with adjusted scaling restore invariant finite-difference signals for high-rank LoRA in zeroth-order fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that combining zeroth-order optimization with high-rank LoRA adapters faces a signal shrinkage problem caused by how queries are scaled and applied across ranks. By querying one rank-1 atom at a time and using a topology-aware scaling factor that grows with rank, the active signal strength stays constant regardless of the adapter rank. High-rank LoRA can then provide more capacity without increasing the query dimension and without extra forward passes or auxiliary structures. A sympathetic reader would care because the method makes memory-efficient fine-tuning of large models more effective: adapter expressivity increases while the optimization budget stays fixed at two forward passes.

Core claim

LoRA adapters decompose into matched rank-1 atoms, each spanning a complete factor-coordinate block. Querying these atoms individually removes the rank from the perturbation dimension of each query while preserving the stored rank r. Under the standard scaling α/r, the finite-difference signal shrinks in proportion to 1/r and, under a fixed noise floor, the signal-to-noise ratio shrinks as 1/r². AR1-ZO counters this by alternating atom queries with scaling γ = α r, which restores a rank-invariant active signal strength without extra mechanisms.
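The atom decomposition in this claim can be checked numerically. The sketch below assumes the usual LoRA convention ΔW = (α/r)·B·A with B of shape d_out × r and A of shape r × d_in; the sizes and variable names are illustrative and do not come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 6, 5, 4, 8.0   # illustrative sizes (assumption)

B = rng.standard_normal((d_out, r))    # output-side factor
A = rng.standard_normal((r, d_in))     # input-side factor

# Atom k pairs column k of B with row k of A: the rank-1 matrix b_k a_k^T.
atoms = [np.outer(B[:, k], A[k, :]) for k in range(r)]

# Summing the r atoms recovers the full product B @ A.
delta_w_full = (alpha / r) * (B @ A)
delta_w_atoms = (alpha / r) * sum(atoms)

# Number of coordinates touched by a single query.
full_query_dim = r * (d_out + d_in)    # perturbing all of B and A at once
atom_query_dim = d_out + d_in          # perturbing one atom (b_k, a_k)

print(np.allclose(delta_w_full, delta_w_atoms))
print(full_query_dim, atom_query_dim)
```

The stored adapter still has rank r, but the per-query perturbation dimension of an atom query is d_out + d_in whatever r is.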

What carries the argument

Alternating rank-1 atom queries combined with topology-aware scaling γ = α r to maintain finite-difference signal strength independent of LoRA rank.
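A rough sketch of what an alternating atom query could look like on a toy problem follows. It is not the authors' algorithm: the round-robin schedule `step % r`, the Gaussian directions, the toy matrix-fitting loss, and all hyperparameters are assumptions made for illustration, and `scale` stands in for whatever effective adapter multiplier is in use.

```python
import numpy as np

def loss(B, A, scale, target):
    # Toy objective (assumption): squared distance between the scaled adapter and a target.
    return 0.5 * np.sum((scale * (B @ A) - target) ** 2)

def atom_zo_step(B, A, k, scale, target, mu, lr, rng):
    """Two-point query on atom k: perturb (b_k, a_k) only, then update that atom in place."""
    u_b = rng.standard_normal(B.shape[0])   # direction for column k of B
    u_a = rng.standard_normal(A.shape[1])   # direction for row k of A

    B_plus, A_plus = B.copy(), A.copy()
    B_minus, A_minus = B.copy(), A.copy()
    B_plus[:, k] += mu * u_b
    A_plus[k, :] += mu * u_a
    B_minus[:, k] -= mu * u_b
    A_minus[k, :] -= mu * u_a

    # Two forward evaluations give one scalar directional-derivative estimate.
    fd = (loss(B_plus, A_plus, scale, target)
          - loss(B_minus, A_minus, scale, target)) / (2.0 * mu)

    B[:, k] -= lr * fd * u_b
    A[k, :] -= lr * fd * u_a
    return fd

rng = np.random.default_rng(0)
d_out, d_in, r = 8, 6, 4
target = rng.standard_normal((d_out, d_in))
B = np.zeros((d_out, r))                    # LoRA-style zero init for B
A = 0.1 * rng.standard_normal((r, d_in))
scale = 1.0                                 # effective multiplier (assumption)

initial_loss = loss(B, A, scale, target)
for step in range(4000):
    atom_zo_step(B, A, step % r, scale, target, mu=1e-3, lr=2e-3, rng=rng)
final_loss = loss(B, A, scale, target)
print(initial_loss, final_loss)
```

Each step costs two loss evaluations and touches d_out + d_in coordinates, so the per-step budget does not change with r; only the number of steps needed to visit every atom does.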

Load-bearing premise

That the performance bottleneck arises from a mismatch between query topology and scaling rather than from the inherent limitations of rank-1 updates or the need for external subspace information.

What would settle it

Measure the finite-difference gradient estimate variance or signal magnitude across increasing LoRA ranks r both with and without the γ=α r scaling; the claim holds if the signal remains constant with the adjusted scaling but drops without it.
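A noiseless toy version of this measurement is sketched below. It rests on assumptions not stated on this page: the adapter enters a single linear layer as W0 + s·B·A, B starts at zero, only atom 0 is perturbed, and the topology-aware choice is read as replacing α by γ = α r in the canonical multiplier, so that s = γ/r = α. All names are hypothetical.

```python
import numpy as np

def fd_signal(r, scale, W0, x, y, a0, u_b, u_a, mu=1e-4):
    """|two-point finite difference| when only atom 0 of a rank-r adapter is perturbed."""
    d_out, _ = W0.shape
    B = np.zeros((d_out, r))          # zero-initialised B: the adapter starts at 0 for every r
    A = np.tile(a0, (r, 1))           # all rows share a0; only row 0 is perturbed below

    def loss(B_, A_):
        return 0.5 * np.sum(((W0 + scale * (B_ @ A_)) @ x - y) ** 2)

    B_plus, A_plus, B_minus, A_minus = B.copy(), A.copy(), B.copy(), A.copy()
    B_plus[:, 0] += mu * u_b
    A_plus[0, :] += mu * u_a
    B_minus[:, 0] -= mu * u_b
    A_minus[0, :] -= mu * u_a
    return abs(loss(B_plus, A_plus) - loss(B_minus, A_minus)) / (2.0 * mu)

rng = np.random.default_rng(0)
d_out, d_in, alpha = 8, 5, 1.0
W0 = rng.standard_normal((d_out, d_in))
x, y = rng.standard_normal(d_in), rng.standard_normal(d_out)
a0, u_b, u_a = rng.standard_normal(d_in), rng.standard_normal(d_out), rng.standard_normal(d_in)

ranks = [1, 2, 4, 8, 16, 32, 64]
canonical = {r: fd_signal(r, alpha / r, W0, x, y, a0, u_b, u_a) for r in ranks}
restored = {r: fd_signal(r, (alpha * r) / r, W0, x, y, a0, u_b, u_a) for r in ranks}

for r in ranks:
    print(f"r={r:3d}  canonical={canonical[r]:.6f}  restored={restored[r]:.6f}")
```

In this toy setting the canonical column falls as 1/r while the restored column stays flat. A real test would repeat the measurement on the actual model, with evaluation noise included.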

Figures

Figures reproduced from arXiv: 2605.19767 by Chenyu Zhang, Hongbin Lin, Xiangda Yan, Yao Shu, Yongjie Yang, Ziye Chen.

**Figure 2.** Training-loss trajectories on CB, SQuAD, and DROP. Under matched two-forward budgets, …

**Figure 3.** Mechanism validation over 2×10⁵ steps. Left: Spectral concentration ρ remains robust. Middle: Alignment gain concentrates heavily in high-impact layers. Right: Cumulative alignment gain remains strictly positive, confirming a net structural pull.

[Plot text recovered from the source page: x-axis LoRA rank r (ticks 1 to 64); y-axis empirical FD-SNR (ticks 10⁰ to 10²); panel title "Empirical FD-SNR collapses toward the noise floor"; legend ZO-Alt-naive, AR1-ZO.]

**Figure 4.** Signal restoration validation. Active FD-SNR and directional fidelity collapse under naive …

**Figure 5.** Full eight-task training-loss comparison. The appendix reports the complete loss trajectories …

**Figure 6.** Side-specific cumulative alignment decomposition. The output-side increment …

**Figure 7.** Ablations and computational efficiency on SQuAD.

Original abstract

Zeroth-order (ZO) optimization enables large-language-model fine-tuning without storing backpropagation activations, while LoRA supplies compact trainable adapters. Combining them creates a rank paradox: increasing LoRA rank improves adapter capacity, but standard two-point ZO either perturbs a rank-dependent number of coordinates or, under atomwise updates, can make the finite-difference signal unobservable. This paper shows that the bottleneck is a measurement-topology problem rather than a need for an external subspace. LoRA already decomposes into matched rank-$1$ atoms, each a complete factor-coordinate block of dimension $d_\text{out}+d_\text{in}$. Querying one atom per step keeps the stored adapter rank $r$ while removing $r$ from the single-query perturbation dimension. The naive atomwise query is still miscalibrated: if it inherits canonical LoRA scaling $\alpha/r$, the active finite-difference signal shrinks as $1/r$ and the active finite-difference signal-to-noise ratio (FD-SNR) as $1/r^2$, producing directional collapse under a fixed residual evaluation-noise floor. AR1-ZO pairs alternating rank-$1$ atom queries with topology-aware scaling $\gamma=\alpha r$, restoring rank-invariant active signal without auxiliary bases, activation hooks, curvature estimates, or extra forward queries. Theory proves atom minimality, rank-independent active query dimension, directional collapse and restoration, and the remaining rank dependence as an amortized coverage cost. Experiments on OPT and Qwen3 models validate the signal mechanism and show that AR1-ZO makes high-rank LoRA effective among matched-budget ZO methods under the standard two-forward-pass query budget.
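The 1/r and 1/r² rates quoted in the abstract can be recovered with a short back-of-envelope calculation. The setup is an assumption made here for illustration, not the paper's derivation: s denotes the adapter multiplier seen by the active atom, c the directional derivative per unit multiplier (treated as independent of r), and ξ₊, ξ₋ independent evaluation noises with variance σ². The restored case reads γ = α r as replacing α in the canonical multiplier.

```latex
% Illustrative assumptions: s, c, \xi_\pm and \sigma^2 as described in the lead-in.
\begin{align*}
\widehat{\mathrm{fd}}
  &= \frac{\bigl[L(\theta + \mu u) + \xi_+\bigr] - \bigl[L(\theta - \mu u) + \xi_-\bigr]}{2\mu}
   \approx s\,c + \frac{\xi_+ - \xi_-}{2\mu}, \\
\operatorname{Var}\!\left[\frac{\xi_+ - \xi_-}{2\mu}\right]
  &= \frac{2\sigma^2}{4\mu^2} = \frac{\sigma^2}{2\mu^2}, \\
\text{FD-SNR}(s)
  &= \frac{s^2 c^2}{\sigma^2 / (2\mu^2)}. \\
s = \frac{\alpha}{r}:&\quad \text{signal} \propto \frac{1}{r}, \qquad \text{FD-SNR} \propto \frac{1}{r^2}, \\
s = \frac{\gamma}{r} = \alpha \ (\gamma = \alpha r):&\quad \text{signal and FD-SNR independent of } r.
\end{align*}
```

The noise term does not depend on s, which is why shrinking the multiplier pushes the estimate toward the noise floor instead of rescaling signal and noise together.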

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

AR1-ZO gives a direct fix for signal loss in high-rank ZO-LoRA by querying one rank-1 atom at a time and scaling the perturbation up with rank, though the effect on estimator bias and variance still needs checking.

The main point is that this paper shows you can keep high-rank LoRA adapters effective under a fixed two-forward-pass zeroth-order budget by querying rank-1 atoms alternately and using a scaling of γ = α r to stop the finite-difference signal from shrinking with rank. They treat the problem as one of measurement topology rather than missing subspace information, which leads to a clean argument that the naive atomwise approach with standard LoRA scaling collapses the active signal as 1/r and the SNR as 1/r² under fixed noise. Their fix restores rank-independent signal strength without extra bases, hooks, or queries.

The theory section claims proofs for atom minimality, directional collapse and restoration, and the remaining rank cost as just amortized coverage steps. Experiments on OPT and Qwen3 models test the mechanism and show high-rank versions now perform better among matched-budget ZO methods. That combination of the scaling rule and the topology framing is the concrete new piece. The analysis of why the signal drops and how to restore it without changing the query count is the part that feels solid and worth following up.

The soft spot is the stress-test concern on estimator quality. Increasing the perturbation size proportionally to r will affect the bias and variance terms in the standard two-point ZO estimator, and the abstract does not spell out whether the error analysis keeps the net gradient quality independent of rank after the scaling change. If the approximation degrades at the same rate the raw signal improves, the claimed rank-invariance may not hold as cleanly. The full derivations would need to show how they handle that trade-off.

This work is aimed at researchers doing memory-efficient adaptation of large models where backpropagation storage is the main constraint. Anyone already running zeroth-order fine-tuning experiments on LLMs would get practical value from the atom query pattern and the explicit scaling adjustment.

It has enough of a mechanism and some validation to deserve a serious referee rather than a desk reject.
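The estimator-quality concern in the letter can be made concrete on a toy function. The sketch below is not the paper's estimator or loss: the quartic `f`, the base step `base_mu`, and the choice to grow the step linearly with r are all assumptions, picked because the central-difference error then has a closed form.

```python
import numpy as np

def f(x):
    return 0.25 * np.sum(x ** 4)      # toy non-quadratic loss (assumption)

def grad_f(x):
    return x ** 3

rng = np.random.default_rng(0)
x = rng.standard_normal(10)
u = rng.standard_normal(10)
true_dd = grad_f(x) @ u               # exact directional derivative along u

base_mu = 1e-2
errors = {}
for r in (1, 2, 4, 8, 16):
    mu = base_mu * r                  # effective perturbation grows linearly with r
    fd = (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu)
    errors[r] = abs(fd - true_dd)     # equals mu^2 * |sum(x * u^3)| for this f

for r, err in errors.items():
    print(f"r={r:2d}  central-difference error={err:.3e}")
```

For this function the error grows as r², so a signal gain that is linear in r would be outpaced by the bias unless the base step is reduced to compensate. Whether the same happens in the paper's setting depends on its smoothness assumptions.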

Referee Report

2 major / 2 minor

Summary. The paper proposes AR1-ZO for zeroth-order fine-tuning of high-rank LoRA adapters in LLMs. It frames the rank paradox as a measurement-topology issue rather than requiring external subspaces, solves it via alternating rank-1 atom queries paired with topology-aware scaling γ=α r that restores rank-invariant active finite-difference signal and FD-SNR, proves atom minimality, rank-independent active query dimension, directional collapse/restoration, and amortized coverage cost, and validates on OPT/Qwen3 models under a fixed two-forward-pass budget without auxiliary bases, hooks, or extra queries.

Significance. If the central signal-restoration mechanism holds without offsetting bias growth, the work would enable higher-rank LoRA in memory-efficient ZO settings, improving capacity over low-rank ZO baselines while preserving the two-query budget and avoiding backprop storage. The explicit theory on directional collapse/restoration and the absence of extra overhead are strengths; reproducible experiments on standard models further support practical utility if the rank-invariance claim survives perturbation-size analysis.

major comments (2)

[Theory section] Theory section (directional collapse and restoration): the analysis derives FD-SNR shrinkage as 1/r² under canonical α/r scaling and restoration via γ=α r, but does not bound or analyze the bias and variance of the two-point ZO estimator as functions of the now rank-dependent perturbation magnitude γ; if these terms scale with γ, the net gradient quality may lose rank-invariance even if raw signal magnitude is preserved.
[Scaling derivation] § on scaling derivation: γ=α r is introduced specifically to cancel the 1/r shrinkage defined from the paper's own canonical LoRA scaling α/r; this coupling makes the restoration somewhat internal to the problem formulation rather than an independent test of the measurement-topology hypothesis, and a concrete comparison to an external-subspace baseline with matched scaling would strengthen the claim that no auxiliary bases are needed.

minor comments (2)

[Abstract] Abstract: the phrase 'amortized coverage cost' appears without a one-sentence gloss; adding a brief parenthetical would improve immediate readability for readers unfamiliar with the coverage interpretation.
[Experiments] Experiments: the signal-mechanism validation would benefit from an explicit table or plot of measured FD-SNR versus rank for both canonical and topology-aware scalings, with error bars over multiple seeds, to make the 1/r² vs. rank-invariant behavior directly inspectable.
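One way to produce the table requested in the Experiments comment is sketched below on simulated data. Everything in it is hypothetical: `empirical_fd_snr`, the unit directional derivative `c`, the noise level `sigma`, and the reading of the topology-aware multiplier as γ/r = α. A real version would replace the simulated draws with repeated two-point queries on the model.

```python
import numpy as np

def empirical_fd_snr(scale, seed, c=1.0, mu=1e-3, sigma=1.4e-4, n=2000):
    """Signal^2 / noise variance of a noisy two-point estimate, from n repeated queries."""
    rng = np.random.default_rng(seed)
    xi_plus = sigma * rng.standard_normal(n)     # evaluation noise on L(theta + mu u)
    xi_minus = sigma * rng.standard_normal(n)    # evaluation noise on L(theta - mu u)
    fd = scale * c + (xi_plus - xi_minus) / (2.0 * mu)
    return fd.mean() ** 2 / fd.var(ddof=1)

alpha, seeds, ranks = 1.0, range(5), [1, 2, 4, 8, 16]
table = {}
for r in ranks:
    canonical = np.array([empirical_fd_snr(alpha / r, s) for s in seeds])
    restored = np.array([empirical_fd_snr(alpha, s) for s in seeds])
    table[r] = (canonical.mean(), canonical.std(ddof=1),
                restored.mean(), restored.std(ddof=1))

print(f"{'rank':>4} | {'canonical FD-SNR':>22} | {'rank-invariant FD-SNR':>22}")
for r, (c_mean, c_std, t_mean, t_std) in table.items():
    print(f"{r:>4} | {c_mean:>12.3f} ± {c_std:<7.3f} | {t_mean:>12.3f} ± {t_std:<7.3f}")
```

Reporting the mean and the standard deviation across seeds side by side makes the 1/r² trend and the flat trend directly comparable.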

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thorough review and insightful comments on our manuscript. We address the major comments point by point below, providing clarifications and indicating where revisions will be made to strengthen the paper.

Referee: [Theory section] Theory section (directional collapse and restoration): the analysis derives FD-SNR shrinkage as 1/r² under canonical α/r scaling and restoration via γ=α r, but does not bound or analyze the bias and variance of the two-point ZO estimator as functions of the now rank-dependent perturbation magnitude γ; if these terms scale with γ, the net gradient quality may lose rank-invariance even if raw signal magnitude is preserved.

Authors: We thank the referee for this important observation regarding the completeness of the theoretical analysis. Our derivation focuses on the active signal and FD-SNR to highlight the directional collapse under standard scaling and its restoration via topology-aware γ=α r. We agree that bounding the bias and variance of the two-point ZO estimator with respect to the rank-dependent γ is necessary for a full picture of gradient quality. In the revised manuscript, we will add a new paragraph or subsection in the theory section that analyzes these terms under standard Lipschitz and smoothness assumptions on the loss function. This will demonstrate that the bias scales as O(γ) or better and variance contributions are controlled such that rank-invariance is preserved. Thus, we will incorporate this analysis. revision: yes
Referee: [Scaling derivation] § on scaling derivation: γ=α r is introduced specifically to cancel the 1/r shrinkage defined from the paper's own canonical LoRA scaling α/r; this coupling makes the restoration somewhat internal to the problem formulation rather than an independent test of the measurement-topology hypothesis, and a concrete comparison to an external-subspace baseline with matched scaling would strengthen the claim that no auxiliary bases are needed.

Authors: The choice of γ=α r follows directly from the measurement topology: since each rank-1 atom query operates on a fixed active dimension independent of r, the canonical LoRA scaling α/r induces a 1/r shrinkage in the finite-difference step, which we compensate to restore invariance. This is not an internal artifact but a consequence of applying standard LoRA scaling to the atomwise queries. Regarding external-subspace baselines, our work deliberately avoids them to emphasize the topology-aware approach without additional storage or queries. However, to strengthen the claim, we will add a short discussion in the experiments or related work section noting that such baselines typically incur extra costs (e.g., subspace storage or more queries), which our method does not. We will not run new experiments but clarify the comparison theoretically. revision: partial

Circularity Check

1 step flagged

Topology-aware scaling γ=α r restores rank-invariance by direct algebraic cancellation of the 1/r shrinkage defined from canonical α/r

specific steps

self-definitional [Abstract]
"The naive atomwise query is still miscalibrated: if it inherits canonical LoRA scaling α/r, the active finite-difference signal shrinks as 1/r and the active finite-difference signal-to-noise ratio (FD-SNR) as 1/r², producing directional collapse under a fixed residual evaluation-noise floor. AR1-ZO pairs alternating rank-1 atom queries with topology-aware scaling γ=α r, restoring rank-invariant active signal without auxiliary bases, activation hooks, curvature estimates, or extra forward queries."

The 1/r shrinkage is defined as a direct consequence of using the canonical scaling α/r. The proposed γ=α r is then selected precisely to multiply the perturbation by r and thereby cancel the 1/r term, so that rank-invariance of the active signal follows immediately from the definition of γ rather than from any external principle or measurement.

full rationale

The paper's central mechanism identifies shrinkage of the finite-difference signal as 1/r (and FD-SNR as 1/r²) specifically under the canonical LoRA scaling α/r, then introduces γ=α r to cancel that exact factor. The resulting rank-invariant active signal is therefore an algebraic consequence of the chosen scaling rather than an independent derivation. This matches the self-definitional pattern: the problem is posed in terms of one scaling, and the solution scaling is constructed to neutralize it. The atom-query topology and other claims do not exhibit this reduction, so the circularity is partial and localized to the scaling step.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the standard decomposition of LoRA into rank-1 factors and the finite-difference approximation properties of zeroth-order methods; no new particles or dimensions are postulated.

axioms (2)

domain assumption LoRA adapters decompose into matched rank-1 atoms each of dimension d_out + d_in
Invoked when the paper states that LoRA already decomposes into matched rank-1 atoms and that querying one atom per step keeps stored rank r while removing r from perturbation dimension.
domain assumption Finite-difference signal under canonical scaling α/r shrinks as 1/r and FD-SNR as 1/r² under fixed residual noise
This is the load-bearing premise used to motivate the topology-aware scaling γ=α r.

pith-pipeline@v0.9.0 · 5855 in / 1542 out tokens · 35755 ms · 2026-05-20T07:37:25.544045+00:00 · methodology

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages

[1] S. Chen, Y. Guo, Y. Ju, H. Dalal, Z. Zhu, and A. J. Khisti. Robust federated finetuning of LLMs via alternating optimization of LoRA. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026.

[2] Y. Chen, Y. Zhang, L. Cao, K. Yuan, and Z. Wen. Enhancing zeroth-order fine-tuning for language models with low-rank structures. In The Thirteenth International Conference on Learning Representations, 2025.

[3] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer. QLoRA: Efficient finetuning of quantized LLMs. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 10088–10115. Curran Associates, Inc., 2023.

[4] S. Hayou, N. Ghosh, and B. Yu. LoRA+: Efficient low rank adaptation of large models. In Forty-first International Conference on Machine Learning, 2024.

[5] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.

[6] D. Kalajdzievski. A rank stabilization scaling factor for fine-tuning with LoRA, 2023.

[7] W. Lin, Y. Jiang, Q. Song, Q. Xiang, and H. Xu. Agzo: Activation-guided zeroth-order optimization for LLM fine-tuning, 2026.

[8] J. Liu, Z. Kong, P. Dong, C. Yang, X. Shen, P. Zhao, H. Tang, G. Yuan, W. Niu, W. Zhang, X. Lin, D. Huang, and Y. Wang. RoRA: Efficient fine-tuning of LLM with reliability optimization for rank adaptation. In 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025.

[9] S. Liu, B. Kailkhura, P.-Y. Chen, P. Ting, S. Chang, and L. Amini. Zeroth-order stochastic variance reduction for nonconvex optimization. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.

[10] S.-Y. Liu, C.-Y. Wang, H. Yin, P. Molchanov, Y.-C. F. Wang, K.-T. Cheng, and M.-H. Chen. DoRA: Weight-decomposed low-rank adaptation. In Proceedings of the 41st International Conference on Machine Learning, ICML'24. JMLR.org, 2024.

[11] S. Malladi, T. Gao, E. Nichani, A. Damian, J. D. Lee, D. Chen, and S. Arora. Fine-tuning language models with just forward passes. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 53038–53075. Curran Associates, Inc., 2023.

[12] Y. Nesterov and V. Spokoiny. Random gradient-free minimization of convex functions. Found. Comput. Math., 17(2):527–566, Apr. 2017.

[13] S. Park, J. Yun, S. Kim, S. Kundu, and E. Yang. Elucidating subspace perturbation in zeroth-order optimization: Theory and practice at scale, 2025.

[14] H. Seung, J. Lee, and H. Ko. Low-rank curvature for zeroth-order optimization in LLM fine-tuning, 2025. Accepted to AAAI 2026.

[15] Z. Song and W. Li. RoZO: Geometry-aware zeroth-order fine-tuning on low-rank adapters for black-box large language models. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1773–1783, Rabat, Morocco, 2026. Association for Computational Linguistics.

[16] Y. Sun, T. Huang, L. Ding, L. Shen, and D. Tao. TeZO: Empowering the low-rankness on the temporal dimension in the zeroth-order optimization for fine-tuning LLMs, 2025.

[17] Z. Yu, P. Zhou, S. Wang, J. Li, M. Tian, and H. Huang. Zeroth-order fine-tuning of LLMs in random subspaces, 2025. ICCV 2025 camera-ready version.

[18] Q. Zhang, M. Chen, A. Bukharin, P. He, Y. Cheng, W. Chen, and T. Zhao. Adaptive budget allocation for parameter-efficient fine-tuning. In The Eleventh International Conference on Learning Representations, 2023.

[19] Y. Zhao, S. Dang, H. Ye, G. Dai, Y. Qian, and I. Tsang. Second-order fine-tuning without pain for LLMs: A Hessian informed zeroth-order optimizer. In The Thirteenth International Conference on Learning Representations, 2025.

A Related Work. Optimizing LLMs under strict memory constraints has driven the convergence of LoRA-style adaptation and zeroth-ord...
[20] The bound depends on $q = d_\text{out} + d_\text{in}$ but not on the total rank $r$, confirming that topology-aware scaling removes rank-induced signal degradation.

[21] The persistent bias $O(\mu^4 L^2 q^2)$ is controlled by the smoothing parameter $\mu$. Choosing $\mu = O(q^{-1/2})$ makes this $O(L^2/q^0) = O(L^2)$.

[22] The residual noise term $O(\sigma_\xi^2 q/(T^{1/2}\mu^2))$ vanishes as $T \to \infty$.

[23] As $T \to \infty$ and $\mu \to 0$ at an appropriate rate, the bound reduces to the standard non-convex ZO convergence rate $O(1/\sqrt{T})$. C.7 Proof of Corollary 4.5: Coverage Cost for Full-Adapter Stationarity. Proof. The atom factor-coordinate blocks are disjoint in the coordinates of $(B, A)$, so full-adapter stationarity is measured by the aggregate quantity $\sum_{k=1}^{r} \|\nabla_k L(\theta)\|^2$ alr...
