
arxiv: 2601.17261 · v4 · pith:PMZBQ5PUnew · submitted 2026-01-24 · 💻 cs.LG

AGZO: Activation-Guided Zeroth-Order Optimization for LLM Fine-Tuning

Pith reviewed 2026-05-25 07:29 UTC · model grok-4.3

classification 💻 cs.LG
keywords zeroth-order optimization · LLM fine-tuning · activation-guided perturbations · memory-efficient training · gradient estimation · low-rank subspace

The pith

AGZO restricts zeroth-order perturbations to activation subspaces to improve LLM fine-tuning performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that zeroth-order optimization for large language model fine-tuning can be strengthened by replacing random isotropic perturbations with ones confined to a subspace derived from the model's input activations. This structural choice is presented as a way to produce update directions that align better with the true gradient while preserving the memory savings of zeroth-order methods. The authors demonstrate the approach on Qwen3 and Pangu models, where it narrows the accuracy gap to standard first-order fine-tuning without increasing peak memory use. A sympathetic reader would care because memory constraints currently limit how large models can be adapted on modest hardware.

Core claim

AGZO extracts a compact, activation-informed subspace on the fly during the forward pass and restricts perturbations to this low-rank subspace. The method optimizes a subspace-smoothed objective and provably yields update directions with higher cosine similarity to the true gradient than isotropic baselines.

What carries the argument

The activation-informed low-rank subspace spanned by a layer's input activations, which is extracted during the forward pass and used to confine the direction of zeroth-order perturbations.
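The mechanism described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `activation_basis` and `subspace_zo_step`, the use of a truncated SVD to build the basis, the factored perturbation `Z = R A^T`, and the toy least-squares loss are all hypothetical choices made for the example.

```python
import numpy as np

def activation_basis(X, r):
    """Top-r right singular vectors of the activations X (n_tokens x d_in).

    Assumption: the subspace is taken from a truncated SVD of the inputs.
    """
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:r].T                      # (d_in, r), orthonormal columns

def subspace_zo_step(loss_fn, W, X, r, eps, lr, rng):
    """One two-point zeroth-order update with a perturbation confined to span(A)."""
    A = activation_basis(X, r)
    R = rng.standard_normal((W.shape[0], r))   # low-dimensional random factor
    Z = R @ A.T                                # (d_out, d_in), rows lie in span(A)
    # Two forward evaluations only; no backpropagation is used.
    scale = (loss_fn(W + eps * Z) - loss_fn(W - eps * Z)) / (2.0 * eps)
    return W - lr * scale * Z, Z

# Toy usage on a single linear layer with a least-squares loss (hypothetical data).
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 20))   # 6 tokens, d_in = 20
T = rng.standard_normal((6, 5))    # targets, d_out = 5
W0 = np.zeros((5, 20))

def loss(W):
    return 0.5 * np.sum((X @ W.T - T) ** 2)

W1, Z = subspace_zo_step(loss, W0, X, r=4, eps=1e-3, lr=1e-4, rng=rng)
```

Only the small factor `R` is random; the basis `A` comes from activations that the forward pass already produces, which is why the sketch needs no stored backward-pass state.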

If this is right

  • AGZO consistently outperforms existing zeroth-order baselines on Qwen3 and Pangu models across benchmarks.
  • The performance gap to first-order fine-tuning is significantly reduced.
  • Peak memory usage stays nearly identical to other zeroth-order methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the activation-gradient subspace link holds for additional layer types, the same guidance principle could apply to training regimes beyond linear layers.
  • The subspace smoothing view might suggest new ways to combine partial gradient information with zeroth-order steps in hybrid optimizers.
  • The on-the-fly subspace extraction could be tested for robustness under different batch sizes or sequence lengths where activation statistics vary.

Load-bearing premise

The gradient of a linear layer is confined to the subspace spanned by its input activations.

What would settle it

A measurement showing that the true gradient of a linear layer has substantial components outside the span of its input activations would disprove the central structural premise.
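A measurement of this kind is easy to prototype on a single linear layer. The sketch below is a toy check on synthetic data, assuming a least-squares loss and random inputs chosen only for illustration; it measures how much of the analytic gradient falls outside the span of the input activations and how the cosine to the rank-r projection grows with r.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out = 8, 32, 16            # fewer samples than input features
X = rng.standard_normal((n, d_in))    # rows are input activations x_i
W = rng.standard_normal((d_out, d_in))
T = rng.standard_normal((n, d_out))   # arbitrary regression targets

# Loss L = 0.5 * ||X W^T - T||_F^2, so dL/dW = (X W^T - T)^T X.
residual = X @ W.T - T
G = residual.T @ X                    # true gradient, shape (d_out, d_in)

# Orthonormal basis of the activation span (row space of X) via SVD.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T                              # shape (d_in, n)

def cosine_to_projection(G, V_r):
    """Cosine between G and its orthogonal projection onto span(V_r)."""
    P = G @ V_r @ V_r.T
    return np.sum(G * P) / (np.linalg.norm(G) * np.linalg.norm(P))

# Relative gradient energy outside the full activation span.
outside = np.linalg.norm(G - G @ V @ V.T) / np.linalg.norm(G)
full_cos = cosine_to_projection(G, V)
# Cosine to the projection onto the top-r activation directions, r = 1..n.
cos_by_rank = [cosine_to_projection(G, V[:, :r]) for r in range(1, n + 1)]
```

For an exactly linear layer the out-of-span component is zero up to floating-point error, so a meaningful falsification test would have to look at the truncated rank-r subspace or at layers where the linear form does not hold exactly.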

Figures

Figures reproduced from arXiv: 2601.17261 by Hong Xu, Qiao Xiang, Qingyu Song, Wei Lin, Yining Jiang.

Figure 1: Structural analysis of gradients and activations. (a) Cosine similarity between the true gradient and its projection onto the activation subspace. (b) & (c) Singular value spectra of gradients and activations. Across layers, the cosine similarity is typically close to 1 when r ≥ 10, indicating that almost all gradient energy lies in the subspace spanned by the forward activations.
Figure 2: Gradient alignment during fine-tuning.
[Memory plot, panel (a): fixed batch size = 4, varying sequence length. Axes: sequence length (tokens) vs. peak GPU memory (GB). Series: SGD, AGZO, MeZO, LOZO. A second panel plots peak GPU memory (GB) against batch size.]
Figure 4: Peak GPU memory usage when fine-tuning Pangu-1B on DROP.

B.2. Implementation and Hyperparameters. All zeroth-order optimization methods (AGZO, MeZO, and LOZO) are implemented using the same codebase to ensure a fair comparison. For all ZO experiments, we perform fine-tuning for a total of 20,000 steps. This fixed budget allows us to directly compare the convergence speed and final performance of different e…

Original abstract

Zeroth-Order (ZO) optimization has emerged as a promising solution for fine-tuning LLMs under strict memory constraints, as it avoids the prohibitive memory cost of storing activations for backpropagation. However, existing ZO methods typically employ isotropic perturbations, neglecting the rich structural information available during the forward pass. In this paper, we identify a crucial link between gradient formation and activation structure: the gradient of a linear layer is confined to the subspace spanned by its input activations. Leveraging this insight, we propose Activation-Guided Zeroth-Order optimization (AGZO). Unlike prior methods, AGZO extracts a compact, activation-informed subspace on the fly during the forward pass and restricts perturbations to this low-rank subspace. We provide a theoretical framework showing that AGZO optimizes a subspace-smoothed objective and provably yields update directions with higher cosine similarity to the true gradient than isotropic baselines. Empirically, we evaluate AGZO on Qwen3 and Pangu models across various benchmarks. AGZO consistently outperforms state-of-the-art ZO baselines and significantly narrows the performance gap with first-order fine-tuning, while maintaining almost the same peak memory footprint as other ZO methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Activation-Guided Zeroth-Order optimization (AGZO) for memory-efficient LLM fine-tuning. It exploits the structural fact that, for a linear layer y = Wx, the per-example parameter gradient is the rank-1 outer product g x^T, so the gradient accumulated over a batch lies in the subspace spanned by the input activations. AGZO extracts this activation-informed low-rank subspace on the fly during the forward pass and restricts ZO perturbations to it. The paper supplies a theoretical framework asserting that the method optimizes a subspace-smoothed objective and yields update directions with strictly higher cosine similarity to the true gradient than isotropic ZO baselines. Experiments on Qwen3 and Pangu models report consistent gains over prior ZO methods while preserving essentially the same peak memory footprint.

Significance. If the central derivation and empirical claims hold, AGZO supplies a clean, low-overhead way to inject forward-pass structural information into ZO updates. The explicit grounding in the exact gradient form g x^T for linear layers is a strength; the on-the-fly subspace construction avoids extra memory and is directly falsifiable via the cosine-similarity metric. This could narrow the ZO–first-order performance gap in practical LLM fine-tuning without altering the memory profile of existing ZO pipelines.

major comments (2)
  1. [§3] §3 (theoretical framework): the proof that cosine similarity is strictly higher must be checked against the precise definition of the activation-derived subspace; if the subspace is formed only from the current mini-batch activations, the guarantee may degrade under distribution shift or when the layer is not strictly linear (e.g., inside attention blocks).
  2. [§4.2] §4.2 (empirical evaluation): the reported gains over isotropic ZO baselines are load-bearing for the central claim; the manuscript should include an ablation that isolates the subspace restriction from other implementation choices (e.g., perturbation scale, number of ZO queries) to confirm the improvement is attributable to the activation guidance.
minor comments (2)
  1. [§4] The abstract states evaluation “across various benchmarks” but §4 should list the exact tasks, dataset sizes, and number of runs with standard deviations.
  2. [§2] Notation for the low-rank subspace projection operator should be introduced once and used consistently; currently the forward-pass extraction step is described in prose without an equation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and positive evaluation of the manuscript. We address each major comment below.

Point-by-point responses
  1. Referee: [§3] §3 (theoretical framework): the proof that cosine similarity is strictly higher must be checked against the precise definition of the activation-derived subspace; if the subspace is formed only from the current mini-batch activations, the guarantee may degrade under distribution shift or when the layer is not strictly linear (e.g., inside attention blocks).

    Authors: The analysis in §3 establishes the cosine-similarity guarantee exactly for linear layers y = Wx, where each per-example gradient is the rank-1 outer product g x^T and the batch gradient therefore lies in the span of the current forward-pass activations; the subspace is constructed on the fly from the current mini-batch. The strict improvement therefore holds under these conditions. Attention blocks contain linear projections (query, key, value, output) to which the same per-layer construction applies, although the guarantee then covers each projection separately rather than the attention block as a whole. We will add an explicit statement of the theorem's assumptions and a short discussion of its use inside attention in the revised manuscript. Because the subspace is recomputed each batch, the method is designed to track distribution shifts during fine-tuning rather than degrade under them. revision: partial

  2. Referee: [§4.2] §4.2 (empirical evaluation): the reported gains over isotropic ZO baselines are load-bearing for the central claim; the manuscript should include an ablation that isolates the subspace restriction from other implementation choices (e.g., perturbation scale, number of ZO queries) to confirm the improvement is attributable to the activation guidance.

    Authors: We agree that an ablation isolating the subspace restriction is necessary to confirm the source of the gains. In the revised manuscript we will add controlled experiments that match perturbation scale, number of ZO queries, and all other hyperparameters between AGZO and the isotropic baseline, thereby attributing any remaining improvement to the activation-derived subspace. revision: yes
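A toy version of such a controlled comparison can be sketched as follows. This is an illustrative setup under explicit assumptions, not the ablation the authors promise: a single linear layer with a least-squares loss, synthetic Gaussian data, and a subspace built from an SVD of the inputs. Both arms share the same perturbation scale `eps` and the same number of two-point queries, so the only difference is whether the perturbation is isotropic or confined to the activation subspace.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out, r, eps, trials = 8, 64, 16, 8, 1e-3, 200
X = rng.standard_normal((n, d_in))
T = rng.standard_normal((n, d_out))
W = rng.standard_normal((d_out, d_in))

def loss(W):
    return 0.5 * np.sum((X @ W.T - T) ** 2)

# Analytic gradient, used only to score the estimates.
G = (X @ W.T - T).T @ X
# Activation basis from the right singular vectors of X (assumed construction).
A = np.linalg.svd(X, full_matrices=False)[2][:r].T     # (d_in, r)

def zo_estimate(Z):
    """Two-point zeroth-order gradient estimate along perturbation Z."""
    return (loss(W + eps * Z) - loss(W - eps * Z)) / (2.0 * eps) * Z

def cosine(a, b):
    return np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b))

iso, sub = [], []
for _ in range(trials):
    Z_iso = rng.standard_normal((d_out, d_in))          # isotropic arm
    Z_sub = rng.standard_normal((d_out, r)) @ A.T       # subspace arm
    iso.append(cosine(zo_estimate(Z_iso), G))
    sub.append(cosine(zo_estimate(Z_sub), G))

mean_iso, mean_sub = float(np.mean(iso)), float(np.mean(sub))
print(f"isotropic: {mean_iso:.4f}  subspace: {mean_sub:.4f}")
```

In this toy setting the subspace arm draws its randomness in d_out × r dimensions instead of d_out × d_in, which is the effect the ablation is meant to isolate; whether the same margin appears in full LLM fine-tuning is exactly what the requested experiments would have to show.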

Circularity Check

0 steps flagged

No significant circularity; derivation relies on standard linear algebra

Full rationale

The paper's core structural claim, that the gradient of y = Wx lies in the subspace spanned by the input activations, is a direct, general consequence of the chain rule and the outer-product form g x^T of the per-example gradient; it is not derived from or fitted to AGZO itself. The subsequent claims (subspace-smoothed objective and provably higher cosine similarity) follow mathematically from restricting isotropic noise to this subspace without any self-referential definition, fitted parameter renamed as prediction, or load-bearing self-citation. The argument is self-contained against external benchmarks and does not reduce the claimed results to its own inputs by construction.
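The chain-rule step behind this rationale can be written out explicitly. The notation here is assumed for illustration (g for the upstream gradient, P_A for the orthogonal projector onto the activation span) and may differ from the paper's own symbols.

```latex
% Single input x \in \mathbb{R}^{d_{\mathrm{in}}}, output y = W x,
% upstream gradient g = \partial L / \partial y \in \mathbb{R}^{d_{\mathrm{out}}}:
\[
  \nabla_W L \;=\; g\, x^{\top} \qquad \text{(rank 1)}
\]
% Mini-batch with inputs x_1, \dots, x_n and upstream gradients g_1, \dots, g_n:
\[
  \nabla_W L \;=\; \sum_{i=1}^{n} g_i\, x_i^{\top}
\]
% Each row of \nabla_W L is a linear combination of x_1^{\top}, \dots, x_n^{\top}.
% With P_A the orthogonal projector onto \operatorname{span}\{x_1, \dots, x_n\}:
\[
  (\nabla_W L)\, P_A \;=\; \nabla_W L
\]
```

The identity is exact when P_A projects onto the full span of the batch activations; with a truncated rank-r basis it becomes an approximation whose quality is an empirical question.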

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Only the abstract is available, so the ledger records the single explicit structural assumption stated in the text; no free parameters or invented entities are mentioned.

axioms (1)
  • domain assumption: The gradient of a linear layer is confined to the subspace spanned by its input activations.
    This premise is identified in the abstract as the crucial link that justifies restricting perturbations to an activation-derived subspace.

pith-pipeline@v0.9.0 · 5743 in / 1349 out tokens · 24335 ms · 2026-05-25T07:29:53.811047+00:00 · methodology

