
arxiv: 2605.19767 · v1 · pith:GFPVKU6Rnew · submitted 2026-05-19 · 💻 cs.LG · cs.AI

AR1-ZO: Topology-Aware Rank-1 Zeroth-Order Queries for High-Rank LoRA Fine-Tuning

Pith reviewed 2026-05-20 07:37 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords zeroth-order optimization · LoRA · large language models · rank-1 queries · finite-difference · fine-tuning · topology-aware scaling

The pith

Rank-1 atom queries with adjusted scaling restore invariant finite-difference signals for high-rank LoRA in zeroth-order fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that combining zeroth-order optimization with high-rank LoRA adapters faces a signal shrinkage problem due to how queries are scaled and applied across ranks. By querying one rank-1 atom at a time and using a topology-aware scaling factor that grows with rank, the active signal strength stays constant regardless of the adapter rank. This allows high-rank LoRA to provide more capacity without increasing the query dimension or requiring extra forward passes or auxiliary structures. A sympathetic reader would care because it makes memory-efficient fine-tuning of large models more effective by increasing adapter expressivity while keeping the optimization budget fixed at two forward passes.

Core claim

LoRA adapters decompose into matched rank-1 atoms, each spanning a complete factor-coordinate block. Querying these atoms one at a time removes the rank from the perturbation dimension of each query while preserving the stored rank r. Under the standard scaling α/r, the finite-difference signal shrinks in proportion to 1/r and, with a fixed noise floor, the signal-to-noise ratio shrinks as 1/r². AR1-ZO counters this by alternating atom queries with topology-aware scaling γ = α r, which restores a rank-invariant active signal strength without auxiliary mechanisms.
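The atom decomposition can be checked on a small example. The sketch below is illustrative only: the helper names (`matmul`, `outer`, `lora_atoms`) and the shapes are assumptions, not taken from the paper. It splits B (d_out × r) and A (r × d_in) into r pairs (b_k, a_k) and confirms that the sum of their outer products reproduces B·A, with each atom holding d_out + d_in coordinates.

```python
import random


def matmul(X, Y):
    """Plain-Python product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def outer(u, v):
    """Outer product u v^T as a list-of-lists matrix."""
    return [[ui * vj for vj in v] for ui in u]


def lora_atoms(B, A):
    """Pair column k of B with row k of A, giving r atoms (b_k, a_k)."""
    return [([row[k] for row in B], A[k]) for k in range(len(A))]


rng = random.Random(0)
d_out, d_in, r = 4, 3, 5  # illustrative shapes
B = [[rng.gauss(0, 1) for _ in range(r)] for _ in range(d_out)]
A = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(r)]

atoms = lora_atoms(B, A)
full = matmul(B, A)

# Rebuild B @ A as a sum of rank-1 outer products b_k a_k^T.
recon = [[0.0] * d_in for _ in range(d_out)]
for b_k, a_k in atoms:
    term = outer(b_k, a_k)
    recon = [[x + y for x, y in zip(r_row, t_row)] for r_row, t_row in zip(recon, term)]

max_err = max(abs(x - y) for f_row, r_row in zip(full, recon) for x, y in zip(f_row, r_row))
atom_dim = len(atoms[0][0]) + len(atoms[0][1])  # coordinates touched by one atom query
print("atoms:", len(atoms), "atom dimension:", atom_dim, "max error:", max_err)
```

The point of the pairing is that one atom touches d_out + d_in coordinates no matter how large r is, which is the rank-independent query dimension the claim refers to.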

What carries the argument

Alternating rank-1 atom queries combined with topology-aware scaling γ = α r to maintain finite-difference signal strength independent of LoRA rank.
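A minimal sketch of what one such query could look like, under stated assumptions: the function names, the toy squared-error loss, and the round-robin atom schedule are all hypothetical, and the summary does not say how γ enters the forward pass. Here γ = α r is assumed to replace α in the canonical α/r, so the effective multiplier is γ/r = α. This is not the authors' implementation.

```python
import random


def two_point_atom_step(loss, B_cols, A_rows, k, mu, lr, rng):
    """Two-point ZO update restricted to atom k = (b_k, a_k).

    B_cols[k] is column k of B (length d_out); A_rows[k] is row k of A
    (length d_in). Only these d_out + d_in coordinates are perturbed.
    """
    b, a = B_cols[k], A_rows[k]
    u_b = [rng.gauss(0.0, 1.0) for _ in b]
    u_a = [rng.gauss(0.0, 1.0) for _ in a]

    def shifted_loss(sign):
        B_try, A_try = list(B_cols), list(A_rows)
        B_try[k] = [x + sign * mu * u for x, u in zip(b, u_b)]
        A_try[k] = [x + sign * mu * u for x, u in zip(a, u_a)]
        return loss(B_try, A_try)

    # Exactly two forward evaluations per step.
    fd = (shifted_loss(+1) - shifted_loss(-1)) / (2.0 * mu)
    B_cols[k] = [x - lr * fd * u for x, u in zip(b, u_b)]
    A_rows[k] = [x - lr * fd * u for x, u in zip(a, u_a)]
    return fd


def make_toy_loss(target, scale):
    """Hypothetical loss: 0.5 * ||scale * sum_k b_k a_k^T - target||_F^2."""
    calls = {"n": 0}

    def loss(B_cols, A_rows):
        calls["n"] += 1
        total = 0.0
        for i, row in enumerate(target):
            for j, t in enumerate(row):
                w = scale * sum(b[i] * a[j] for b, a in zip(B_cols, A_rows))
                total += 0.5 * (w - t) ** 2
        return total

    return loss, calls


rng = random.Random(0)
d_out, d_in, r, alpha = 4, 3, 8, 1.0
target = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
B_cols = [[rng.gauss(0, 0.1) for _ in range(d_out)] for _ in range(r)]
A_rows = [[rng.gauss(0, 0.1) for _ in range(d_in)] for _ in range(r)]

# Assumption: gamma = alpha * r replaces alpha in alpha / r,
# so the effective forward multiplier is gamma / r = alpha.
gamma = alpha * r
loss, calls = make_toy_loss(target, scale=gamma / r)

for step in range(2 * r):
    k = step % r  # assumed round-robin schedule over atoms
    two_point_atom_step(loss, B_cols, A_rows, k, mu=1e-3, lr=1e-2, rng=rng)

print("forward evaluations:", calls["n"])
```

Each step evaluates the loss twice and writes only to atom k, so the stored rank r affects how many steps are needed to visit every atom but not the cost or dimension of a single query.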

Load-bearing premise

That the performance bottleneck arises from a mismatch between query topology and scaling rather than from the inherent limitations of rank-1 updates or the need for external subspace information.

What would settle it

Measure the finite-difference gradient estimate variance or signal magnitude across increasing LoRA ranks r both with and without the γ=α r scaling; the claim holds if the signal remains constant with the adjusted scaling but drops without it.
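A toy version of this measurement is sketched below. Everything in it is an assumption made for illustration: the loss is linear in the adapter update, atom entries have rank-independent magnitude, the noise floor `sigma` is fixed by hand, and the rank-invariant multiplier is taken to be α (that is, γ/r with γ = α r). It shows the expected shape of the result on a synthetic problem and is not a reproduction of the paper's experiment.

```python
import random


def bilinear(b, G, a):
    """Compute b^T G a for list-based vectors and a list-of-lists matrix."""
    return sum(bi * sum(g * aj for g, aj in zip(row, a)) for bi, row in zip(b, G))


def fd_signal(r, scale, mu=1e-3, d_out=6, d_in=5, seed=0):
    """Central finite difference along one random direction on atom 0.

    Toy loss (an assumption): L = scale * sum_k b_k^T G a_k, which is
    linear in the adapter update scale * B @ A.
    """
    fixed = random.Random(seed)  # same G, atom 0, and direction for every r
    G = [[fixed.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
    b0 = [fixed.gauss(0, 1) for _ in range(d_out)]
    a0 = [fixed.gauss(0, 1) for _ in range(d_in)]
    u_b = [fixed.gauss(0, 1) for _ in range(d_out)]
    u_a = [fixed.gauss(0, 1) for _ in range(d_in)]

    other = random.Random(seed + 1)  # the remaining r - 1 atoms stay unperturbed
    rest = 0.0
    for _ in range(r - 1):
        b = [other.gauss(0, 1) for _ in range(d_out)]
        a = [other.gauss(0, 1) for _ in range(d_in)]
        rest += bilinear(b, G, a)

    def loss(sign):
        b = [x + sign * mu * u for x, u in zip(b0, u_b)]
        a = [x + sign * mu * u for x, u in zip(a0, u_a)]
        return scale * (bilinear(b, G, a) + rest)

    return (loss(+1) - loss(-1)) / (2.0 * mu)


alpha, sigma = 16.0, 1.0  # sigma: assumed fixed evaluation-noise floor
ranks = [1, 2, 4, 8, 16, 32, 64]
canonical = {r: abs(fd_signal(r, alpha / r)) for r in ranks}  # scaling alpha / r
invariant = {r: abs(fd_signal(r, alpha)) for r in ranks}      # assumed gamma / r = alpha

for r in ranks:
    snr_canonical = canonical[r] ** 2 / sigma ** 2
    snr_invariant = invariant[r] ** 2 / sigma ** 2
    print(r, canonical[r], snr_canonical, invariant[r], snr_invariant)
```

In this toy the canonical column falls as 1/r and its SNR as 1/r², while the rank-invariant column stays flat. A real test would replace the linear loss with model forward passes and estimate the noise floor empirically.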

Figures

Figures reproduced from arXiv: 2605.19767 by Chenyu Zhang, Hongbin Lin, Xiangda Yan, Yao Shu, Yongjie Yang, Ziye Chen.

Figure 1: The Optimization Pathology of LoRA-ZO and the AR1-ZO Solution.
Figure 2: Training-loss trajectories on CB, SQuAD, and DROP. Under matched two-forward budgets, …
Figure 3: Mechanism validation over 2×10⁵ steps. Left: Spectral concentration ρ remains robust. Middle: Alignment gain concentrates heavily in high-impact layers. Right: Cumulative alignment gain remains strictly positive, confirming a net structural pull.
[Plot: empirical FD-SNR (log scale) versus LoRA rank r from 1 to 64, titled "Empirical FD-SNR collapses toward the noise floor"; series: ZO-Alt-naive, AR1-ZO …]
Figure 4: Signal restoration validation. Active FD-SNR and directional fidelity collapse under naive …
Figure 5: Full eight-task training-loss comparison. The appendix reports the complete loss trajectories …
Figure 6: Side-specific cumulative alignment decomposition. The output-side increment …
Figure 7: Ablations and computational efficiency on SQuAD.
Original abstract

Zeroth-order (ZO) optimization enables large-language-model fine-tuning without storing backpropagation activations, while LoRA supplies compact trainable adapters. Combining them creates a rank paradox: increasing LoRA rank improves adapter capacity, but standard two-point ZO either perturbs a rank-dependent number of coordinates or, under atomwise updates, can make the finite-difference signal unobservable. This paper shows that the bottleneck is a measurement-topology problem rather than a need for an external subspace. LoRA already decomposes into matched rank-$1$ atoms, each a complete factor-coordinate block of dimension $d_\text{out}+d_\text{in}$. Querying one atom per step keeps the stored adapter rank $r$ while removing $r$ from the single-query perturbation dimension. The naive atomwise query is still miscalibrated: if it inherits canonical LoRA scaling $\alpha/r$, the active finite-difference signal shrinks as $1/r$ and the active finite-difference signal-to-noise ratio (FD-SNR) as $1/r^2$, producing directional collapse under a fixed residual evaluation-noise floor. AR1-ZO pairs alternating rank-$1$ atom queries with topology-aware scaling $\gamma=\alpha r$, restoring rank-invariant active signal without auxiliary bases, activation hooks, curvature estimates, or extra forward queries. Theory proves atom minimality, rank-independent active query dimension, directional collapse and restoration, and the remaining rank dependence as an amortized coverage cost. Experiments on OPT and Qwen3 models validate the signal mechanism and show that AR1-ZO makes high-rank LoRA effective among matched-budget ZO methods under the standard two-forward-pass query budget.
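The $1/r$ and $1/r^2$ rates quoted in the abstract can be traced with a first-order expansion. The notation below is assumed for illustration and is not the paper's: $G = \nabla_W L$, smoothing radius $\mu$, perturbation $(u_b, u_a)$ applied to atom $k$, forward multiplier $s$, and a fixed noise level $\sigma$. The sketch also assumes that atom magnitudes do not themselves depend on $r$.

```latex
% Assumed notation, not from the paper.
\Delta W = s \sum_{k=1}^{r} b_k a_k^{\top},
\qquad s_{\text{canonical}} = \frac{\alpha}{r}

% Two-point finite difference on atom k, to first order in \mu:
\mathrm{fd}_k
  = \frac{L(b_k + \mu u_b,\; a_k + \mu u_a) - L(b_k - \mu u_b,\; a_k - \mu u_a)}{2\mu}
  \approx s \left( u_b^{\top} G\, a_k + b_k^{\top} G\, u_a \right)

% With s = \alpha / r and a fixed noise floor \sigma:
|\mathrm{fd}_k| \propto \frac{1}{r},
\qquad
\text{FD-SNR} = \frac{\mathrm{fd}_k^{2}}{\sigma^{2}} \propto \frac{1}{r^{2}}
```

If the topology-aware $\gamma = \alpha r$ is read as replacing $\alpha$ in $\alpha / r$ (an interpretation, since the abstract does not spell this out), the multiplier becomes $s = \gamma / r = \alpha$ and both quantities lose their dependence on $r$.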

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes AR1-ZO for zeroth-order fine-tuning of high-rank LoRA adapters in LLMs. It frames the rank paradox as a measurement-topology issue rather than a need for external subspaces, and addresses it with alternating rank-1 atom queries paired with topology-aware scaling γ=α r, which restores a rank-invariant active finite-difference signal and FD-SNR. The theory covers atom minimality, rank-independent active query dimension, directional collapse and restoration, and an amortized coverage cost. Experiments on OPT and Qwen3 models use a fixed two-forward-pass budget, with no auxiliary bases, hooks, or extra queries.

Significance. If the central signal-restoration mechanism holds without offsetting bias growth, the work would enable higher-rank LoRA in memory-efficient ZO settings, improving capacity over low-rank ZO baselines while preserving the two-query budget and avoiding backprop storage. The explicit theory on directional collapse/restoration and the absence of extra overhead are strengths; reproducible experiments on standard models further support practical utility if the rank-invariance claim survives perturbation-size analysis.

major comments (2)
  1. [Theory section] Theory section (directional collapse and restoration): the analysis derives FD-SNR shrinkage as 1/r² under canonical α/r scaling and restoration via γ=α r, but does not bound or analyze the bias and variance of the two-point ZO estimator as functions of the now rank-dependent perturbation magnitude γ; if these terms scale with γ, the net gradient quality may lose rank-invariance even if raw signal magnitude is preserved.
  2. [Scaling derivation] § on scaling derivation: γ=α r is introduced specifically to cancel the 1/r shrinkage defined from the paper's own canonical LoRA scaling α/r; this coupling makes the restoration somewhat internal to the problem formulation rather than an independent test of the measurement-topology hypothesis, and a concrete comparison to an external-subspace baseline with matched scaling would strengthen the claim that no auxiliary bases are needed.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'amortized coverage cost' appears without a one-sentence gloss; adding a brief parenthetical would improve immediate readability for readers unfamiliar with the coverage interpretation.
  2. [Experiments] Experiments: the signal-mechanism validation would benefit from an explicit table or plot of measured FD-SNR versus rank for both canonical and topology-aware scalings, with error bars over multiple seeds, to make the 1/r² vs. rank-invariant behavior directly inspectable.
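The concern in major comment 1 can be illustrated with a one-dimensional toy. This is a sketch under assumptions, not an analysis of the paper's estimator: the loss `sin(scale * t)` and the parameter names are hypothetical, and whether the effective multiplier actually grows with r under γ = α r depends on details this summary does not give. For this toy, the relative bias of the central difference is 1 − sin(scale·μ)/(scale·μ), roughly (scale·μ)²/6 for small scale·μ.

```python
import math


def relative_fd_bias(scale, mu, t=0.3):
    """Relative error of a central difference in t for loss(t) = sin(scale * t)."""
    fd = (math.sin(scale * (t + mu)) - math.sin(scale * (t - mu))) / (2.0 * mu)
    exact = scale * math.cos(scale * t)  # true derivative with respect to t
    return abs(fd - exact) / abs(exact)


mu = 1e-2  # smoothing radius, held fixed while the multiplier grows
biases = {s: relative_fd_bias(s, mu) for s in (1.0, 2.0, 4.0, 8.0)}
for s, b in biases.items():
    print(f"scale={s:4.1f}  relative bias={b:.3e}")
```

Doubling the multiplier at fixed μ roughly quadruples the relative bias here, which is why a bound on bias and variance as a function of the effective perturbation size is a reasonable request.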

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thorough review and insightful comments on our manuscript. We address the major comments point by point below, providing clarifications and indicating where revisions will be made to strengthen the paper.

  1. Referee: [Theory section] Theory section (directional collapse and restoration): the analysis derives FD-SNR shrinkage as 1/r² under canonical α/r scaling and restoration via γ=α r, but does not bound or analyze the bias and variance of the two-point ZO estimator as functions of the now rank-dependent perturbation magnitude γ; if these terms scale with γ, the net gradient quality may lose rank-invariance even if raw signal magnitude is preserved.

    Authors: We thank the referee for this observation on the completeness of the theoretical analysis. Our derivation focuses on the active signal and FD-SNR to highlight the directional collapse under standard scaling and its restoration via topology-aware γ=α r. We agree that bounding the bias and variance of the two-point ZO estimator with respect to the rank-dependent γ is necessary for a full picture of gradient quality. In the revised manuscript, we will add a paragraph or subsection to the theory section that analyzes these terms under standard Lipschitz and smoothness assumptions on the loss function. This will demonstrate that the bias scales as O(γ) or better and that the variance contributions are controlled, so that rank-invariance is preserved. revision: yes

  2. Referee: [Scaling derivation] § on scaling derivation: γ=α r is introduced specifically to cancel the 1/r shrinkage defined from the paper's own canonical LoRA scaling α/r; this coupling makes the restoration somewhat internal to the problem formulation rather than an independent test of the measurement-topology hypothesis, and a concrete comparison to an external-subspace baseline with matched scaling would strengthen the claim that no auxiliary bases are needed.

    Authors: The choice of γ=α r follows directly from the measurement topology: since each rank-1 atom query operates on a fixed active dimension independent of r, the canonical LoRA scaling α/r induces a 1/r shrinkage in the finite-difference step, which we compensate to restore invariance. This is not an internal artifact but a consequence of applying standard LoRA scaling to the atomwise queries. Regarding external-subspace baselines, our work deliberately avoids them to emphasize the topology-aware approach without additional storage or queries. However, to strengthen the claim, we will add a short discussion in the experiments or related work section noting that such baselines typically incur extra costs (e.g., subspace storage or more queries), which our method does not. We will not run new experiments but clarify the comparison theoretically. revision: partial

Circularity Check

1 steps flagged

Topology-aware scaling γ=α r restores rank-invariance by direct algebraic cancellation of the 1/r shrinkage defined from canonical α/r

specific steps
  1. self-definitional [Abstract]
    "The naive atomwise query is still miscalibrated: if it inherits canonical LoRA scaling α/r, the active finite-difference signal shrinks as 1/r and the active finite-difference signal-to-noise ratio (FD-SNR) as 1/r², producing directional collapse under a fixed residual evaluation-noise floor. AR1-ZO pairs alternating rank-1 atom queries with topology-aware scaling γ=α r, restoring rank-invariant active signal without auxiliary bases, activation hooks, curvature estimates, or extra forward queries."

    The 1/r shrinkage is defined as a direct consequence of using the canonical scaling α/r. The proposed γ=α r is then selected precisely to multiply the perturbation by r and thereby cancel the 1/r term, so that rank-invariance of the active signal follows immediately from the definition of γ rather than from any external principle or measurement.

full rationale

The paper's central mechanism identifies shrinkage of the finite-difference signal as 1/r (and FD-SNR as 1/r²) specifically under the canonical LoRA scaling α/r, then introduces γ=α r to cancel that exact factor. The resulting rank-invariant active signal is therefore an algebraic consequence of the chosen scaling rather than an independent derivation. This matches the self-definitional pattern: the problem is posed in terms of one scaling, and the solution scaling is constructed to neutralize it. The atom-query topology and other claims do not exhibit this reduction, so the circularity is partial and localized to the scaling step.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the standard decomposition of LoRA into rank-1 factors and the finite-difference approximation properties of zeroth-order methods; no new entities are postulated.

axioms (2)
  • domain assumption LoRA adapters decompose into matched rank-1 atoms each of dimension d_out + d_in
    Invoked when the paper states that LoRA already decomposes into matched rank-1 atoms and that querying one atom per step keeps stored rank r while removing r from perturbation dimension.
  • domain assumption Finite-difference signal under canonical scaling α/r shrinks as 1/r and FD-SNR as 1/r² under fixed residual noise
    This is the load-bearing premise used to motivate the topology-aware scaling γ=α r.

pith-pipeline@v0.9.0 · 5855 in / 1542 out tokens · 35755 ms · 2026-05-20T07:37:25.544045+00:00 · methodology

