AGZO: Activation-Guided Zeroth-Order Optimization for LLM Fine-Tuning
Pith reviewed 2026-05-25 07:29 UTC · model grok-4.3
The pith
AGZO restricts zeroth-order perturbations to activation subspaces to improve LLM fine-tuning performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AGZO extracts a compact, activation-informed subspace on the fly during the forward pass and restricts perturbations to this low-rank subspace. The method optimizes a subspace-smoothed objective and provably yields update directions with higher cosine similarity to the true gradient than isotropic baselines.
What carries the argument
The activation-informed low-rank subspace spanned by a layer's input activations, which is extracted during the forward pass and used to confine the direction of zeroth-order perturbations.
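A minimal sketch of how such a perturbation could be confined to an activation subspace, on a toy linear layer. The function names `activation_basis` and `subspace_zo_grad`, the use of an SVD to obtain the basis, and the `rank` and `mu` parameters are illustrative assumptions, not the paper's actual procedure.

```python
import numpy as np

def activation_basis(X, rank):
    """Orthonormal basis (d_in x rank) for the dominant directions of the rows of X."""
    # Right singular vectors of the activation matrix span its row space.
    # Assumption: SVD is one way to get a basis; the paper may use another.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return vt[:rank].T

def subspace_zo_grad(loss_fn, W, V, mu, rng):
    """Two-point zeroth-order estimate with the perturbation confined to span(V)."""
    R = rng.standard_normal((W.shape[0], V.shape[1]))  # low-dimensional Gaussian noise
    Z = R @ V.T                                        # every row of Z lies in span(V)
    delta = (loss_fn(W + mu * Z) - loss_fn(W - mu * Z)) / (2.0 * mu)
    return delta * Z

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 32))        # toy mini-batch: 8 activations of width 32
T = rng.standard_normal((8, 16))        # toy regression targets
W = rng.standard_normal((16, 32)) * 0.1 # toy weight matrix (d_out x d_in)

def loss_fn(W_):
    # Squared-error loss of the toy linear layer y = W x.
    return 0.5 * np.sum((X @ W_.T - T) ** 2)

V = activation_basis(X, rank=4)
G_hat = subspace_zo_grad(loss_fn, W, V, mu=1e-3, rng=rng)
```

Only two forward evaluations of the loss are needed per estimate, and the random matrix `R` has `d_out * rank` entries rather than `d_out * d_in`, which is where the low-rank restriction shows up in this sketch.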
If this is right
- AGZO consistently outperforms existing zeroth-order baselines on Qwen3 and Pangu models across benchmarks.
- The performance gap to first-order fine-tuning is significantly reduced.
- Peak memory usage stays nearly identical to other zeroth-order methods.
Where Pith is reading between the lines
- If the activation-gradient subspace link holds for additional layer types, the same guidance principle could apply to training regimes beyond linear layers.
- The subspace smoothing view might suggest new ways to combine partial gradient information with zeroth-order steps in hybrid optimizers.
- The on-the-fly subspace extraction could be tested for robustness under different batch sizes or sequence lengths where activation statistics vary.
Load-bearing premise
The gradient of a linear layer is confined to the subspace spanned by its input activations.
What would settle it
A measurement showing that the true gradient of a linear layer has substantial components outside the span of its input activations would disprove the central structural premise.
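One way such a measurement could look on a toy problem is sketched below. It assumes a bias-free linear layer with a squared-error loss and uses the analytic gradient; for a real model the gradient would have to come from backpropagation, and all variable names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, d_in, d_out = 8, 32, 16
X = rng.standard_normal((batch, d_in))     # input activations, one row per sample
T = rng.standard_normal((batch, d_out))    # toy targets
W = rng.standard_normal((d_out, d_in))     # weights of the layer y = W x

# Analytic gradient of 0.5 * ||X W^T - T||_F^2 with respect to W,
# i.e. the sum over samples of the outer products g_i x_i^T.
G = (X @ W.T - T).T @ X

# Orthogonal projector onto the row space of X (the activation span).
P = np.linalg.pinv(X) @ X

# Fraction of the gradient's Frobenius norm that lies outside the activation span.
outside = np.linalg.norm(G - G @ P) / np.linalg.norm(G)
print(f"relative gradient mass outside span: {outside:.2e}")
```

In this toy setting the reported fraction should be at the level of floating-point round-off. A value well above that on a genuine linear layer would be the kind of evidence the premise cannot survive.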
Original abstract
Zeroth-Order (ZO) optimization has emerged as a promising solution for fine-tuning LLMs under strict memory constraints, as it avoids the prohibitive memory cost of storing activations for backpropagation. However, existing ZO methods typically employ isotropic perturbations, neglecting the rich structural information available during the forward pass. In this paper, we identify a crucial link between gradient formation and activation structure: the gradient of a linear layer is confined to the subspace spanned by its input activations. Leveraging this insight, we propose Activation-Guided Zeroth-Order optimization (AGZO). Unlike prior methods, AGZO extracts a compact, activation-informed subspace on the fly during the forward pass and restricts perturbations to this low-rank subspace. We provide a theoretical framework showing that AGZO optimizes a subspace-smoothed objective and provably yields update directions with higher cosine similarity to the true gradient than isotropic baselines. Empirically, we evaluate AGZO on Qwen3 and Pangu models across various benchmarks. AGZO consistently outperforms state-of-the-art ZO baselines and significantly narrows the performance gap with first-order fine-tuning, while maintaining almost the same peak memory footprint as other ZO methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Activation-Guided Zeroth-Order optimization (AGZO) for memory-efficient LLM fine-tuning. It exploits the structural fact that, for a linear layer y = Wx, the per-sample parameter gradient is the rank-1 outer product g x^T (with g the gradient of the loss with respect to y), so the mini-batch gradient is confined to the subspace spanned by the input activations. AGZO extracts this activation-informed low-rank subspace on the fly during the forward pass and restricts ZO perturbations to it. The paper supplies a theoretical framework asserting that the method optimizes a subspace-smoothed objective and yields update directions with provably higher cosine similarity to the true gradient than isotropic ZO baselines. Experiments on Qwen3 and Pangu models report consistent gains over prior ZO methods while preserving essentially the same peak memory footprint.
Significance. If the central derivation and empirical claims hold, AGZO supplies a clean, low-overhead way to inject forward-pass structural information into ZO updates. The explicit grounding in the exact gradient form g x^T for linear layers is a strength; the on-the-fly subspace construction avoids extra memory and is directly falsifiable via the cosine-similarity metric. This could narrow the ZO–first-order performance gap in practical LLM fine-tuning without altering the memory profile of existing ZO pipelines.
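The outer-product form referred to above follows from the chain rule. The short derivation below uses g for the gradient of the loss L with respect to the layer output y and N for the number of samples (or tokens) in the mini-batch; both symbols are notation chosen here and may differ from the paper's.

```latex
\begin{align}
  % Single sample: y = W x, with g = \partial L / \partial y
  \frac{\partial L}{\partial W_{ij}}
    &= \sum_{k} \frac{\partial L}{\partial y_k}\,
       \frac{\partial y_k}{\partial W_{ij}}
     = g_i\, x_j
  \quad\Longrightarrow\quad
  \nabla_W L = g\, x^{\top}, \\
  % Mini-batch of N samples
  \nabla_W L &= \sum_{n=1}^{N} g_n\, x_n^{\top}.
\end{align}
```

Each row of the mini-batch gradient is a linear combination of the vectors x_n^T, so its row space sits inside span{x_1, ..., x_N} and its rank is at most N. Terms added directly to the weights, such as an explicit weight-decay penalty, are not covered by this argument.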
major comments (2)
- [§3] §3 (theoretical framework): the proof that cosine similarity is strictly higher must be checked against the precise definition of the activation-derived subspace; if the subspace is formed only from the current mini-batch activations, the guarantee may degrade under distribution shift or when the layer is not strictly linear (e.g., inside attention blocks).
- [§4.2] §4.2 (empirical evaluation): the reported gains over isotropic ZO baselines are load-bearing for the central claim; the manuscript should include an ablation that isolates the subspace restriction from other implementation choices (e.g., perturbation scale, number of ZO queries) to confirm the improvement is attributable to the activation guidance.
minor comments (2)
- [§4] The abstract states evaluation “across various benchmarks” but §4 should list the exact tasks, dataset sizes, and number of runs with standard deviations.
- [§2] Notation for the low-rank subspace projection operator should be introduced once and used consistently; currently the forward-pass extraction step is described in prose without an equation.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and positive evaluation of the manuscript. We address each major comment below.
- Referee: [§3] §3 (theoretical framework): the proof that cosine similarity is strictly higher must be checked against the precise definition of the activation-derived subspace; if the subspace is formed only from the current mini-batch activations, the guarantee may degrade under distribution shift or when the layer is not strictly linear (e.g., inside attention blocks).
  Authors: The analysis in §3 establishes the cosine-similarity guarantee exactly for linear layers y = Wx, where the gradient lies in the rank-1 subspace spanned by the current forward-pass activation x; the subspace is constructed on the fly from the current mini-batch. The strict improvement therefore holds under these conditions. Attention blocks contain linear projections (query, key, value, output) to which the same per-layer construction applies, yielding an effective approximation. We will add an explicit statement of the theorem's assumptions and a short discussion of its use inside attention in the revised manuscript. Because the subspace is recomputed each batch, the method is designed to track distribution shifts during fine-tuning rather than degrade under them.
  revision: partial
- Referee: [§4.2] §4.2 (empirical evaluation): the reported gains over isotropic ZO baselines are load-bearing for the central claim; the manuscript should include an ablation that isolates the subspace restriction from other implementation choices (e.g., perturbation scale, number of ZO queries) to confirm the improvement is attributable to the activation guidance.
  Authors: We agree that an ablation isolating the subspace restriction is necessary to confirm the source of the gains. In the revised manuscript we will add controlled experiments that match perturbation scale, number of ZO queries, and all other hyperparameters between AGZO and the isotropic baseline, thereby attributing any remaining improvement to the activation-derived subspace.
  revision: yes
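The matched comparison described in this response can be illustrated on a toy linear layer. The sketch below is not the authors' experiment: the problem sizes, the helper names `zo_estimate`, `isotropic`, and `guided`, and the SVD-based basis are all assumptions made for illustration. Both arms share the same perturbation scale `mu` and the same number of queries, so the only difference is the direction distribution.

```python
import numpy as np

def cosine(A, B):
    """Cosine similarity between two matrices viewed as flat vectors."""
    return float(np.sum(A * B) / (np.linalg.norm(A) * np.linalg.norm(B)))

def zo_estimate(loss_fn, W, sample_dir, mu, queries, rng):
    """Average of `queries` two-point estimates along directions from `sample_dir`."""
    est = np.zeros_like(W)
    for _ in range(queries):
        Z = sample_dir(rng)
        est += (loss_fn(W + mu * Z) - loss_fn(W - mu * Z)) / (2.0 * mu) * Z
    return est / queries

rng = np.random.default_rng(0)
batch, d_in, d_out, rank = 8, 64, 16, 8
X = rng.standard_normal((batch, d_in))
T = rng.standard_normal((batch, d_out))
W = rng.standard_normal((d_out, d_in)) * 0.1

def loss_fn(W_):
    return 0.5 * np.sum((X @ W_.T - T) ** 2)

G = (X @ W.T - T).T @ X   # exact gradient, used only to score the estimates

# Assumed basis for the activation subspace: top right singular vectors of X.
V = np.linalg.svd(X, full_matrices=False)[2][:rank].T

def isotropic(r):
    return r.standard_normal((d_out, d_in))        # full-space Gaussian direction

def guided(r):
    return r.standard_normal((d_out, rank)) @ V.T  # direction confined to span(V)

mu, queries, trials = 1e-3, 4, 200                 # identical for both arms
cos_iso = np.mean([cosine(zo_estimate(loss_fn, W, isotropic, mu, queries, rng), G)
                   for _ in range(trials)])
cos_sub = np.mean([cosine(zo_estimate(loss_fn, W, guided, mu, queries, rng), G)
                   for _ in range(trials)])
print(f"isotropic: {cos_iso:.3f}  activation-guided: {cos_sub:.3f}")
```

Because every other setting is held fixed, any difference in mean cosine similarity between the two arms in this toy can be attributed to the choice of direction distribution, which is the attribution the referee asks the full-scale ablation to support.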
Circularity Check
No significant circularity; derivation relies on standard linear algebra
The paper's core structural claim—that the gradient of y = Wx lies exactly in the rank-1 subspace spanned by input activation x—is a direct, general consequence of the chain rule and outer-product form of the gradient, not derived from or fitted to AGZO itself. The subsequent claims (subspace-smoothed objective and provably higher cosine similarity) follow mathematically from restricting isotropic noise to this subspace without any self-referential definition, fitted parameter renamed as prediction, or load-bearing self-citation. The argument is self-contained against external benchmarks and does not reduce the claimed results to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The gradient of a linear layer is confined to the subspace spanned by its input activations.