Recognition: 2 theorem links · Lean Theorem
Why Zeroth-Order Adaptation May Forget Less: A Randomized Shaping Theory
Pith reviewed 2026-05-12 04:06 UTC · model grok-4.3
The pith
Zeroth-order adaptation can forget less than first-order descent because its shaped update contracts only the anisotropic retention curvature while preserving the isotropic floor.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For norm-matched ZO, the expected shaped retention curvature obeys an exact identity that preserves the isotropic retention floor while contracting only the anisotropic component. Projecting this identity onto the incoming gradient yields the observable FO-ZO quadratic forgetting gap: ZO improves mean forgetting precisely when the FO direction has above-average retention curvature, by a query-dependent fraction of that curvature excess. A practical finite-query accounting separates the mean mechanism from one-batch sampling and smoothing perturbations, and the blockwise RISE transfer applies the calibrated shape to exact FO gradients inside parameter blocks.
What carries the argument
A randomized gradient-shaping analysis in which finite differences expose a raw shape whose mean is aligned with the first-order gradient, while the norm-matched comparator fixes the expected squared adaptation norm; together these yield the curvature-contraction identity.
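The shaping mechanism can be sanity-checked numerically. Below is a minimal Monte Carlo sketch, assuming Gaussian query directions (the choice consistent with the norm-inflation factor κ = (q+d+1)/q quoted later from Lemma 3.2) and the σ → 0 limit where finite differences become exact directional derivatives; the contraction weight τ = d/(q+d+1) is our reconstruction of the query-dependent fraction, not a value stated by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, q = 8, 4
# A fixed anisotropic "retention curvature" matrix H (PSD) and an incoming FO gradient g.
A = rng.standard_normal((d, d))
H = A @ A.T / d
g = rng.standard_normal(d)

lam_bar = np.trace(H) / d      # isotropic retention floor (average curvature)
kappa = (q + d + 1) / q        # norm inflation of the raw q-query shape (Lemma 3.2 form)
tau = d / (q + d + 1)          # reconstructed query-dependent contraction weight

# Raw shape (1/q) sum_i (g . u_i) u_i with Gaussian directions u_i, i.e. the
# sigma -> 0 finite-difference estimator; rescaled by 1/sqrt(kappa) so that
# E||v||^2 = ||g||^2 (the norm-matched comparator).
n = 100_000
Z = rng.standard_normal((n, q, d))
coef = Z @ g                                    # directional derivatives g . u_i
v = (coef[..., None] * Z).mean(axis=1) / np.sqrt(kappa)

# Expected shaped retention curvature vs. the claimed identity:
# (1 - tau) * (g^T H g) + tau * lam_bar * ||g||^2  (floor preserved, excess contracted).
shaped = np.einsum('ni,ij,nj->n', v, H, v).mean()
predicted = (1 - tau) * (g @ H @ g) + tau * lam_bar * (g @ g)
print(shaped, predicted)  # the two values should agree closely
```

Note that the FO comparator exposes curvature g^T H g exactly, so the gap between the two printed quantities is τ times the anisotropic excess, matching the projection claim above under these assumptions.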
If this is right
- ZO steps reduce retention damage relative to FO steps exactly when the gradient direction exceeds average retention curvature.
- The size of the forgetting improvement scales directly with the curvature excess in the FO direction, by a query-dependent fraction of that excess.
- Finite-query effects from sampling and smoothing can be isolated from the mean shaping mechanism.
- Applying the calibrated shape to exact FO gradients inside parameter blocks yields a stability-plasticity tradeoff that removes smoothing bias while retaining local shaping directions.
Where Pith is reading between the lines
- The theory implies that measuring directional curvature excess before each adaptation step could decide whether to switch to ZO shaping for that update.
- Block-diagonal curvature structure and limited cross-block coupling would make the blockwise RISE transfer most effective in layered networks.
- The same curvature-contraction identity may extend to other controlled-randomness methods that preserve expected step norm while altering update shape.
Load-bearing premise
Finite differences produce a raw shape whose mean is aligned with the first-order gradient and the norm-matched comparator exactly fixes the expected squared adaptation norm.
What would settle it
Measure the retention curvature matrix along the first-order gradient direction in a trained network, compute the predicted quadratic forgetting gap from the curvature excess, and check whether the observed difference in mean forgetting between norm-matched ZO and FO steps matches that prediction under controlled query budgets.
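On a toy quadratic retention loss this protocol collapses to a few lines. Everything below (the diagonal stand-in Hessian, and the form τ = d/(q+d+1) for the query-dependent fraction) is an illustrative assumption, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(1)
d, q = 16, 4
# Hypothetical retention curvature spectrum of a trained network (diagonal toy case).
H = np.diag(rng.uniform(0.1, 5.0, d))
g = rng.standard_normal(d)              # incoming (new-task) FO gradient

dir_curv = g @ H @ g / (g @ g)          # retention curvature along the FO direction
lam_bar = np.trace(H) / d               # average retention curvature
excess = dir_curv - lam_bar             # directional curvature excess

tau = d / (q + d + 1)                   # assumed query-dependent fraction
predicted_gap = tau * excess * (g @ g)  # predicted FO-ZO mean forgetting gap
print(predicted_gap > 0)                # ZO predicted to help iff excess > 0
```

The experiment then compares this prediction against the observed mean-forgetting difference between norm-matched ZO and FO steps across query budgets q.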
read the original abstract
Continual learning requires new-task adaptation without damaging previously acquired capabilities. Recent forward-pass and zeroth-order (ZO) results show that low-query adaptation may retain better than first-order (FO) descent, but the usual view of ZO as noisy FO estimation does not explain why. We give a local randomized gradient-shaping analysis: finite differences expose a raw shape that is mean-aligned with FO, while the norm-matched comparator fixes the expected squared adaptation norm. Under this controlled comparison, forgetting depends on how the adaptation shape exposes retention curvature. For norm-matched ZO, the expected shaped retention curvature obeys an exact identity that preserves the isotropic retention floor while contracting only the anisotropic component. Projecting this identity onto the incoming gradient yields the observable FO-ZO quadratic forgetting gap: ZO improves mean forgetting precisely when the FO direction has above-average retention curvature, by a query-dependent fraction of that curvature excess. A practical finite-query accounting separates the mean mechanism from one-batch sampling and smoothing perturbations. As an algorithmic transfer, RISE applies the calibrated ZO shape to exact FO gradients inside parameter blocks. Its target is a stability-plasticity tradeoff: randomized shaping may reduce the retention exposure paid by FO, exact gradients remove finite-smoothing bias from finite-difference ZO, and blockwise sampling supplies many local shaping directions after one gradient computation. The blockwise analysis separates mean-step damage from centered random exposure, showing how block-diagonal curvature, cross-block coupling, and local shaping diagnostics specify where this exact-gradient transfer is most likely to be visible.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that zeroth-order (ZO) adaptation forgets less than first-order (FO) in continual learning because finite-difference shaping produces a raw shape whose expectation is aligned with the FO gradient; under a norm-matched comparator that fixes E[||adaptation||²], the expected shaped retention curvature obeys an exact identity preserving the isotropic floor while contracting only the anisotropic component. Projecting this identity onto the incoming gradient yields a quadratic FO-ZO forgetting gap that favors ZO precisely when the FO direction has above-average retention curvature. A finite-query accounting separates the mean mechanism from sampling/smoothing perturbations, and the RISE algorithm transfers the calibrated ZO shape onto exact FO gradients inside parameter blocks, with blockwise analysis separating mean-step damage from centered random exposure.
Significance. If the identity and its supporting assumptions can be made rigorous with explicit error bounds, the work supplies a mechanistic explanation for empirical retention advantages of low-query methods that goes beyond the 'noisy FO estimator' view and directly motivates a hybrid algorithm (RISE) targeting the stability-plasticity tradeoff. The separation of mean alignment from finite-query perturbations and the block-diagonal curvature diagnostics are potentially useful for guiding practical implementations.
major comments (3)
- [Abstract / local randomized gradient-shaping analysis] The central claim of an 'exact identity' for expected shaped retention curvature is presented without derivation steps, explicit Hessian assumptions, or finite-query error bounds. The finite-query accounting section itself acknowledges smoothing bias and sampling variance; when the loss is not twice differentiable or cross-block coupling is strong, E[raw shape] deviates from the FO direction by an O(σ) term, rendering the identity approximate and the quadratic gap smaller or reversed. Please supply the full derivation together with the precise conditions under which the identity remains exact.
- [Projection onto incoming gradient / quadratic forgetting gap] The gap is stated to be quadratic and query-dependent, yet the derivation relies on (i) exact mean-alignment of the finite-difference shape and (ii) the norm-matched comparator fixing E[||adaptation||²] exactly. Both are described as holding 'locally,' but the paper's own finite-query section notes that these fail under smoothing or curvature variation. Quantify how the O(σ) misalignment term propagates into the gap size and whether the sign of the improvement can reverse.
- [RISE / blockwise analysis] The claim that blockwise sampling supplies many local shaping directions after one gradient computation and separates mean-step damage from centered random exposure depends on block-diagonal curvature and limited cross-block coupling. No explicit bounds or diagnostic conditions are given for when this separation is visible; without them the practical advantage over plain FO or ZO remains unverified.
minor comments (3)
- [Abstract] Define 'retention curvature,' 'isotropic floor,' and 'anisotropic component' with explicit notation at first use; the current abstract-only presentation leaves these terms ambiguous for readers outside the immediate sub-area.
- [Theoretical analysis] Add a short table or proposition summarizing the exact assumptions (twice differentiability, norm-matching, mean-alignment) required for the identity to hold exactly versus approximately.
- [Experiments / figures] Ensure all figures comparing FO-ZO forgetting include finite-query variance or smoothing-radius sweeps so that the predicted quadratic gap can be visually assessed against the acknowledged perturbations.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments, which highlight opportunities to strengthen the rigor of the local analysis and algorithmic transfer. We address each major comment below and will revise the manuscript accordingly to include the requested derivations, bounds, and conditions.
read point-by-point responses
-
Referee: Abstract / local randomized gradient-shaping analysis: the central claim of an 'exact identity' for expected shaped retention curvature is presented without derivation steps, explicit Hessian assumptions, or finite-query error bounds. ... Please supply the full derivation together with the precise conditions under which the identity remains exact.
Authors: The identity is derived in the local randomized gradient-shaping section from the expectation of the finite-difference raw shape under uniform random directions on the sphere. Starting from the definition of the shaped adaptation vector v = (raw shape) scaled to match E[||v||²] = ||FO gradient||², the expectation E[v^T H v] expands via linearity to the isotropic floor (trace(H)/d) plus a contraction (1 - 1/k) of the anisotropic excess (g^T (H - mean(H)I) g / ||g||²), where k is the number of queries. This holds exactly when the loss is twice continuously differentiable and the query radius σ is small enough that the Hessian is locally constant (i.e., third-order remainder o(σ)). The finite-query section treats smoothing bias as an additive O(σ) perturbation to this mean identity. We will insert the full step-by-step derivation, state the twice-differentiability and local-constancy assumptions explicitly, and clarify that the identity is exact in the σ → 0 limit and approximate otherwise. revision: yes
-
Referee: Projection onto incoming gradient / quadratic forgetting gap: the gap is stated to be quadratic and query-dependent, yet the derivation relies on (i) exact mean-alignment ... Quantify how the O(σ) misalignment term propagates into the gap size and whether the sign of the improvement can reverse.
Authors: The quadratic gap is obtained by projecting the identity onto the incoming gradient direction, yielding gap = (1/k) × excess curvature, where excess curvature = g^T (H - mean(H)I) g / ||g||². The O(σ) misalignment from finite differences adds a linear perturbation bounded by O(σ ||H||), which enters the gap as an additive O(σ) term. Consequently the net gap remains positive (ZO advantage) whenever excess curvature > C σ for a constant C depending on query geometry; the sign reverses only when misalignment dominates, i.e., when σ is large relative to excess curvature or when curvature varies sharply across the query ball. We will add this propagation analysis with the explicit bound and the reversal condition in the finite-query accounting section. revision: yes
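The reversal condition in this response can be made concrete with a toy accounting. The constant C and the additive form of the smoothing term are placeholders for the bound the authors promise, not quantities from the paper:

```python
# Sign of the net ZO advantage under the rebuttal's accounting:
# net_gap = (1/k) * excess - C * sigma, where C is a placeholder for the
# query-geometry constant the authors promise to derive.
def net_gap(excess: float, k: int, sigma: float, C: float = 1.0) -> float:
    return excess / k - C * sigma

# ZO helps when the curvature excess dominates the smoothing perturbation:
print(net_gap(excess=2.0, k=4, sigma=0.1))   # positive: ZO advantage survives
print(net_gap(excess=0.2, k=4, sigma=0.1))   # negative: sign reversed
```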
-
Referee: RISE algorithmic transfer and blockwise analysis: the claim that blockwise sampling supplies many local shaping directions after one gradient computation and separates mean-step damage from centered random exposure depends on block-diagonal curvature and limited cross-block coupling. No explicit bounds or diagnostic conditions are given ...
Authors: The separation holds when the Hessian is approximately block-diagonal, quantified by the cross-block coupling ratio ρ = ||off-diagonal blocks|| / ||diagonal blocks||. Under ρ ≪ 1 the mean-step damage is confined to within-block curvature while centered random exposure averages to zero across blocks; the error from residual coupling is bounded by O(ρ). We will add this explicit bound together with the diagnostic condition ρ < 0.1 (as a practical threshold) and the local curvature-variance diagnostic in the RISE section, allowing readers to verify when the advantage over plain FO or ZO is expected. revision: yes
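The coupling diagnostic proposed here can be computed directly. A minimal sketch, with Frobenius norms standing in for whatever norm the revision will specify, on a synthetic near-block-diagonal Hessian:

```python
import numpy as np

def cross_block_coupling(H: np.ndarray, blocks) -> float:
    """rho = ||off-diagonal blocks||_F / ||diagonal blocks||_F for a
    partition of parameter indices into blocks."""
    mask = np.zeros(H.shape, dtype=bool)
    for idx in blocks:
        mask[np.ix_(idx, idx)] = True
    return np.linalg.norm(H[~mask]) / np.linalg.norm(H[mask])

rng = np.random.default_rng(2)
blocks = [list(range(0, 4)), list(range(4, 8))]
H = 0.05 * rng.standard_normal((8, 8))          # weak cross-block entries
for idx in blocks:                              # strong within-block curvature
    H[np.ix_(idx, idx)] = np.eye(4) + 0.3 * rng.standard_normal((4, 4))
H = (H + H.T) / 2                               # symmetrize

print(cross_block_coupling(H, blocks))          # small rho: separation expected
```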
Circularity Check
No significant circularity detected in the derivation chain
full rationale
The paper's central analysis begins from two explicit modeling assumptions (mean-alignment of the raw finite-difference shape with the FO gradient, and exact fixing of expected squared adaptation norm by the norm-matched comparator). The claimed 'exact identity' for expected shaped retention curvature is obtained by algebraic decomposition of the quadratic form into an isotropic floor (fixed by the norm constraint) plus an anisotropic deviation (modulated by alignment). Projecting the identity onto the incoming gradient to produce the observable FO-ZO quadratic forgetting gap is a direct, assumption-driven algebraic consequence rather than a tautology or fitted renaming. No load-bearing self-citation, uniqueness theorem, or ansatz smuggling is invoked; the derivation remains self-contained under the stated local randomized shaping framework and does not reduce to its inputs by construction.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
Paper passage: Theorem 4.1 (Expected Curvature Identity for Norm-Matched ZO): H̄_q = (1 − τ)H + τ λ̄ I
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
Paper passage: Lemma 3.2 (Raw Alignment, Norm Inflation, and Norm Matching): E[Z] = I, κ = (q + d + 1)/q
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Ang Bian, Wei Li, Hangjie Yuan, Chengrong Yu, Mang Wang, Zixiang Zhao, Aojun Lu, Pengliang Ji, and Tao Feng. Make continual learning stronger via C-Flat. In Proc. NeurIPS, 2024.
- [2] Arslan Chaudhry, Marc’Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. Efficient lifelong learning with A-GEM. In Proc. ICLR, 2019.
- [3] Jiao Chen, Jiayi He, Fangfang Chen, Zuohong Lv, and Jianhua Tang. Forward-only continual learning. In Proc. ACM MM, 2025.
- [4] Yiming Chen, Yuan Zhang, Liyuan Cao, Kun Yuan, and Zaiwen Wen. Enhancing zeroth-order fine-tuning for language models with low-rank structures. In Proc. ICLR, 2025.
- [5] Mehrdad Farajtabar, Navid Azizan, Alex Mott, and Ang Li. Orthogonal gradient descent for continual learning. In Proc. AISTATS, 2020.
- [6] Tao Feng, Wei Li, Didi Zhu, Hangjie Yuan, Wendi Zheng, Dan Zhang, and Jie Tang. ZeroFlow: Overcoming catastrophic forgetting is easier than you think. In Proc. ICML, 2025.
- [7] Abraham D. Flaxman, Adam Tauman Kalai, and H. Brendan McMahan. Online convex optimization in the bandit setting: Gradient descent without a gradient. In Proc. SODA, 2005.
- [8] Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. In Proc. ICLR, 2021.
- [9] Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, and Justin Gilmer. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proc. ICCV, 2021.
- [10] Saihui Hou, Xinyu Pan, Chen Change Loy, Zilei Wang, and Dahua Lin. Learning a unified classifier incrementally via rebalancing. In Proc. CVPR, 2019.
- [11] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In Proc. ICLR, 2022.
- [12] Zhongzhan Huang, Mingfu Liang, Senwei Liang, and Wei He. AlterSGD: Finding flat minima for continual learning by alternative training. arXiv:2107.05804, 2021.
- [13] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. Proc. Natl. Acad. Sci. U.S.A., 2017.
- [14] Yajing Kong, Liu Liu, Huanhuan Chen, Janusz Kacprzyk, and Dacheng Tao. Overcoming catastrophic forgetting in continual learning by exploring eigenvalues of Hessian matrix. IEEE Trans. Neural Netw. Learn. Syst., 2024.
- [15] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
- [16] Zhizhong Li and Derek Hoiem. Learning without forgetting. In Proc. ECCV, 2016.
- [17] Yong Liu, Zirui Zhu, Chaoyu Gong, Minhao Cheng, Cho-Jui Hsieh, and Yang You. Sparse MeZO: Less parameters for better performance in zeroth-order LLM fine-tuning. In Proc. NeurIPS, 2025.
- [18] David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory for continual learning. In Proc. NeurIPS, 2017.
- [19] Sadhika Malladi, Tianyu Gao, Eshaan Nichani, Alex Damian, Jason D. Lee, Danqi Chen, and Sanjeev Arora. Fine-tuning language models with just forward passes. In Proc. NeurIPS, 2023.
- [20] Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimization of convex functions. Found. Comput. Math., 2017.
- [21] Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, and Bo Wang. Moment matching for multi-source domain adaptation. In Proc. ICCV, 2019.
- [22] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H. Lampert. iCaRL: Incremental classifier and representation learning. In Proc. CVPR, 2017.
- [23] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Fei-Fei Li. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis., 2015.
- [24] Gobinda Saha, Isha Garg, and Kaushik Roy. Gradient projection memory for continual learning. In Proc. ICLR, 2021.
- [25] Keisuke Sugiura and Hiroki Matsutani. ElasticZO: A memory-efficient on-device learning with combined zeroth- and first-order optimization. arXiv:2501.04287, 2025.
- [26] Qitao Tan, Jun Liu, Zheng Zhan, Caiwei Ding, Yanzhi Wang, Xiaolong Ma, Jaewoo Lee, Jin Lu, and Geng Yuan. Harmony in divergence: Towards fast, accurate, and memory-efficient zeroth-order LLM fine-tuning. arXiv:2502.03304, 2025.
- [27] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proc. EMNLP Workshop BlackboxNLP, 2018.
- [28] Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Proc. NeurIPS, 2019.
- [29] Xiao Wang, Tianze Chen, Qiming Ge, Han Xia, Rong Bao, Rui Zheng, Qi Zhang, Tao Gui, and Xuan-Jing Huang. Orthogonal subspace learning for language model continual learning. In Findings of EMNLP, 2023.
- [30] Wanhao Yu, Zheng Wang, Shuteng Niu, Sen Lin, and Li Yang. More than memory savings: Zeroth-order optimization mitigates forgetting in continual learning. arXiv:2510.21019, 2025.
- [31] Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In Proc. NeurIPS, 2015.
- [32] Yihua Zhang, Pingzhi Li, Junyuan Hong, Jiaxiang Li, Yimeng Zhang, Wenqing Zheng, Pin-Yu Chen, Jason D. Lee, Wotao Yin, Mingyi Hong, Zhangyang Wang, Sijia Liu, and Tianlong Chen. Revisiting zeroth-order optimization for memory-efficient LLM fine-tuning: A benchmark. In Proc. ICML, 2024.
- [33] Da-Wei Zhou, Hai-Long Sun, Han-Jia Ye, and De-Chuan Zhan. Expandable subspace ensemble for pre-trained model-based class-incremental learning. In Proc. CVPR, 2024.
- [34] Da-Wei Zhou, Zi-Wen Cai, Han-Jia Ye, De-Chuan Zhan, and Ziwei Liu. Revisiting class-incremental learning with pre-trained models: Generalizability and adaptivity are all you need. Int. J. Comput. Vis., 2025.
discussion (0)