pith. machine review for the scientific record.

arxiv: 2605.10658 · v1 · submitted 2026-05-11 · 💻 cs.LG

Recognition: 2 theorem links


Why Zeroth-Order Adaptation May Forget Less: A Randomized Shaping Theory

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:06 UTC · model grok-4.3

classification 💻 cs.LG
keywords continual learning · zeroth-order optimization · forgetting · gradient shaping · retention curvature · stability-plasticity tradeoff

The pith

Zeroth-order adaptation forgets less than first-order descent because its shaped update contracts only the anisotropic retention curvature while preserving the isotropic floor.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a local randomized gradient-shaping analysis to explain why low-query zeroth-order adaptation can retain prior knowledge better than standard first-order descent during continual learning. Finite differences produce an adaptation shape whose mean aligns with the first-order gradient, while the norm-matched comparison fixes the expected squared step size. Under this control, the expected shaped retention curvature follows an exact identity that leaves the isotropic component unchanged and contracts only the direction-dependent part. Projecting the identity onto the incoming gradient produces a concrete quadratic gap: zeroth-order steps improve average forgetting precisely when the first-order direction has above-average retention curvature, with the size of the gain set by a query-dependent fraction of that excess.
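
Spelled out in the notation suggested by the figure captions and the simulated rebuttal below, the claim has roughly the following shape. The symbols λ̄, R, P, τ, and k are reconstructed from those passages rather than taken from the paper's own definitions, so treat this as a sketch of the claimed identity, not a verified restatement.

```latex
% Sketch of the claimed identity and gap, reconstructed from the figure caption
% (H = \bar\lambda I + R, shaped curvature \bar H_q) and the rebuttal (\tau = 1/k).
\[
  H = \bar\lambda I + R, \qquad
  \bar\lambda = \tfrac{1}{d}\operatorname{tr}(H), \qquad
  \operatorname{tr}(R) = 0,
\]
\[
  \bar H_q \;=\; \mathbb{E}\!\left[P^{\top} H P\right] \;=\; \bar\lambda I + (1-\tau)\,R .
\]
% Projecting onto the incoming gradient g gives the quadratic FO-ZO forgetting gap:
\[
  \Delta_{\text{FO-ZO}}
  \;=\; \frac{g^{\top} H g}{\lVert g\rVert^{2}} - \frac{g^{\top} \bar H_q\, g}{\lVert g\rVert^{2}}
  \;=\; \tau \left( \frac{g^{\top} H g}{\lVert g\rVert^{2}} - \bar\lambda \right),
\]
% which is positive (ZO forgets less) exactly when the FO direction sees above-average
% retention curvature, with the query-dependent fraction \tau setting the size of the gain.
```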

Core claim

For norm-matched ZO, the expected shaped retention curvature obeys an exact identity that preserves the isotropic retention floor while contracting only the anisotropic component. Projecting this identity onto the incoming gradient yields the observable FO-ZO quadratic forgetting gap: ZO improves mean forgetting precisely when the FO direction has above-average retention curvature, by a query-dependent fraction of that curvature excess. A practical finite-query accounting separates the mean mechanism from one-batch sampling and smoothing perturbations, and the blockwise RISE transfer applies the calibrated shape to exact FO gradients inside parameter blocks.

What carries the argument

The randomized gradient-shaping analysis in which finite differences expose a raw shape mean-aligned with the first-order gradient while the norm-matched comparator fixes the expected squared adaptation norm, producing the curvature-contraction identity.

If this is right

  • ZO steps reduce retention damage relative to FO steps exactly when the gradient direction exceeds average retention curvature.
  • The size of the forgetting improvement scales directly with the curvature excess in the FO direction and with the number of queries used to form the ZO estimate.
  • Finite-query effects from sampling and smoothing can be isolated from the mean shaping mechanism.
  • Applying the calibrated shape to exact FO gradients inside parameter blocks targets a stability-plasticity tradeoff: exact gradients remove the finite-smoothing bias of finite-difference ZO, while blockwise sampling retains many local shaping directions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The theory implies that measuring directional curvature excess before each adaptation step could decide whether to switch to ZO shaping for that update.
  • Block-diagonal curvature structure and limited cross-block coupling would make the blockwise RISE transfer most effective in layered networks.
  • The same curvature-contraction identity may extend to other controlled-randomness methods that preserve expected step norm while altering update shape.

Load-bearing premise

Finite differences produce a raw shape whose mean is aligned with the first-order gradient and the norm-matched comparator exactly fixes the expected squared adaptation norm.
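
This premise is cheap to probe numerically under the usual σ → 0 reading, where the finite difference along a random unit direction reduces to a directional derivative. The construction below (Gaussian directions normalized to the sphere, a k-query raw shape, a single scalar norm-matching constant) is one plausible instantiation of the premise, not the paper's exact estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, trials = 50, 4, 100_000

g = rng.normal(size=d)  # incoming first-order gradient

def raw_shape(g, k, rng):
    """k-query raw ZO shape in the sigma -> 0 limit: sum_i (u_i . g) u_i."""
    u = rng.normal(size=(k, d))
    u /= np.linalg.norm(u, axis=1, keepdims=True)  # uniform directions on the unit sphere
    return (u @ g) @ u

shapes = np.stack([raw_shape(g, k, rng) for _ in range(trials)])

# Premise 1: the mean of the raw shape is aligned with the FO gradient.
mean_shape = shapes.mean(axis=0)
cosine = mean_shape @ g / (np.linalg.norm(mean_shape) * np.linalg.norm(g))
print(f"cosine(E[raw shape], g) ~ {cosine:.4f}")  # should be close to 1.0

# Premise 2: one scalar rescaling matches the expected squared step to ||g||^2.
c = np.linalg.norm(g) / np.sqrt((shapes ** 2).sum(axis=1).mean())
matched_norm_ratio = ((c * shapes) ** 2).sum(axis=1).mean() / (g @ g)
print(f"E[||v||^2] / ||g||^2 ~ {matched_norm_ratio:.4f}")  # 1.0 by construction of c
```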

What would settle it

Measure the retention curvature matrix along the first-order gradient direction in a trained network, compute the predicted quadratic forgetting gap from the curvature excess, and check whether the observed difference in mean forgetting between norm-matched ZO and FO steps matches that prediction under controlled query budgets.
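
A toy version of this check can be run where the retention curvature matrix is known exactly, so the measured gap can be compared against the curvature excess directly. Everything below is an assumption standing in for the trained-network measurement the paragraph calls for: a synthetic positive semidefinite H plays the role of the retention curvature, forgetting is scored as the quadratic damage of a step, and the ZO step is a plausible σ → 0, sphere-sampled construction. The printed implied fraction is what would be compared against the paper's query-dependent prediction (the simulated rebuttal instantiates it as 1/k).

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, eta, trials = 50, 4, 0.1, 50_000

A = rng.normal(size=(d, d))
H = A @ A.T / d                 # synthetic retention curvature (PSD)
g = rng.normal(size=d)
g /= np.linalg.norm(g)          # incoming FO direction

lam_bar = np.trace(H) / d       # isotropic retention floor
q = g @ H @ g                   # retention curvature along the FO direction

def zo_shape(g, k, rng):
    u = rng.normal(size=(k, d))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    return (u @ g) @ u          # raw k-query shape, sigma -> 0 limit

samples = np.stack([zo_shape(g, k, rng) for _ in range(trials)])
scale = np.linalg.norm(g) / np.sqrt((samples ** 2).sum(axis=1).mean())  # norm matching

def damage(steps):
    """Quadratic forgetting proxy 0.5 * step^T H step, averaged over rows."""
    return 0.5 * np.einsum('ij,jk,ik->i', steps, H, steps).mean()

fo_damage = damage((eta * g)[None, :])
zo_damage = damage(eta * scale * samples)
observed_gap = (fo_damage - zo_damage) / (0.5 * eta ** 2)  # per unit squared step
implied_tau = observed_gap / (q - lam_bar)

print(f"curvature excess q - lam_bar : {q - lam_bar:+.4f}")
print(f"observed FO-ZO gap           : {observed_gap:+.4f}")
print(f"implied query fraction tau   : {implied_tau:.3f}")
# The sign of the gap should follow the sign of the curvature excess; comparing the
# implied tau against the paper's query-dependent fraction is the quantitative test.
```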

Figures

Figures reproduced from arXiv: 2605.10658 by Jian Mu, Yao Shu, Zhongxiang Dai.

Figure 1. Effective-curvature view of curvature-agnostic ZO shaping. All panels use the same quadratic-damage levels; darker contours indicate larger damage, and the arrow marks the shared incoming direction g. The contours show the curvature read by the shaped adaptation: FO reads H = λ̄I + R, norm-matched ZO reads H̄_q = E[P⊤HP] = λ̄I + (1 − τ)R, and the blind endpoint reads λ̄I. ZO therefore keeps the mean direc…

Figure 2. Synthetic validation of randomized shaping. In a controlled quadratic sandbox, norm-matched ZO shaping contracts the retention-curvature spectrum toward λ̄ while preserving the mean curvature (A), and the Monte Carlo operator residual decreases with sample size (B). Rotating g across eigendirections makes the direct damage gap change sign exactly as predicted by (16) (C, with R² = 0.9999 between empirical …
Original abstract

Continual learning requires new-task adaptation without damaging previously acquired capabilities. Recent forward-pass and zeroth-order (ZO) results show that low-query adaptation may retain better than first-order (FO) descent, but the usual view of ZO as noisy FO estimation does not explain why. We give a local randomized gradient-shaping analysis: finite differences expose a raw shape that is mean-aligned with FO, while the norm-matched comparator fixes the expected squared adaptation norm. Under this controlled comparison, forgetting depends on how the adaptation shape exposes retention curvature. For norm-matched ZO, the expected shaped retention curvature obeys an exact identity that preserves the isotropic retention floor while contracting only the anisotropic component. Projecting this identity onto the incoming gradient yields the observable FO-ZO quadratic forgetting gap: ZO improves mean forgetting precisely when the FO direction has above-average retention curvature, by a query-dependent fraction of that curvature excess. A practical finite-query accounting separates the mean mechanism from one-batch sampling and smoothing perturbations. As an algorithmic transfer, RISE applies the calibrated ZO shape to exact FO gradients inside parameter blocks. Its target is a stability-plasticity tradeoff: randomized shaping may reduce the retention exposure paid by FO, exact gradients remove finite-smoothing bias from finite-difference ZO, and blockwise sampling supplies many local shaping directions after one gradient computation. The blockwise analysis separates mean-step damage from centered random exposure, showing how block-diagonal curvature, cross-block coupling, and local shaping diagnostics specify where this exact-gradient transfer is most likely to be visible.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper claims that zeroth-order (ZO) adaptation forgets less than first-order (FO) in continual learning because finite-difference shaping produces a raw shape whose expectation is aligned with the FO gradient; under a norm-matched comparator that fixes E[||adaptation||²], the expected shaped retention curvature obeys an exact identity preserving the isotropic floor while contracting only the anisotropic component. Projecting this identity onto the incoming gradient yields a quadratic FO-ZO forgetting gap that favors ZO precisely when the FO direction has above-average retention curvature. A finite-query accounting separates the mean mechanism from sampling/smoothing perturbations, and the RISE algorithm transfers the calibrated ZO shape onto exact FO gradients inside parameter blocks, with blockwise analysis separating mean-step damage from centered random exposure.

Significance. If the identity and its supporting assumptions can be made rigorous with explicit error bounds, the work supplies a mechanistic explanation for empirical retention advantages of low-query methods that goes beyond the 'noisy FO estimator' view and directly motivates a hybrid algorithm (RISE) targeting the stability-plasticity tradeoff. The separation of mean alignment from finite-query perturbations and the block-diagonal curvature diagnostics are potentially useful for guiding practical implementations.

major comments (3)
  1. [Abstract / local randomized gradient-shaping analysis] Abstract and the local randomized gradient-shaping analysis: the central claim of an 'exact identity' for expected shaped retention curvature is presented without derivation steps, explicit Hessian assumptions, or finite-query error bounds. The finite-query accounting section itself acknowledges smoothing bias and sampling variance; when the loss is not twice differentiable or cross-block coupling is strong, E[raw shape] deviates from the FO direction by an O(σ) term, rendering the identity approximate and the quadratic gap smaller or reversed. Please supply the full derivation together with the precise conditions under which the identity remains exact.
  2. [Projection onto incoming gradient / quadratic forgetting gap] Projection of the identity onto the incoming gradient (quadratic forgetting gap): the gap is stated to be quadratic and query-dependent, yet the derivation relies on (i) exact mean-alignment of the finite-difference shape and (ii) the norm-matched comparator fixing E[||adaptation||²] exactly. Both are described as holding 'locally,' but the paper's own finite-query section notes that these fail under smoothing or curvature variation. Quantify how the O(σ) misalignment term propagates into the gap size and whether the sign of the improvement can reverse.
  3. [RISE / blockwise analysis] RISE algorithmic transfer and blockwise analysis: the claim that blockwise sampling supplies many local shaping directions after one gradient computation and separates mean-step damage from centered random exposure depends on block-diagonal curvature and limited cross-block coupling. No explicit bounds or diagnostic conditions are given for when this separation is visible; without them the practical advantage over plain FO or ZO remains unverified.
minor comments (3)
  1. [Abstract] Define 'retention curvature,' 'isotropic floor,' and 'anisotropic component' with explicit notation at first use; the current abstract-only presentation leaves these terms ambiguous for readers outside the immediate sub-area.
  2. [Theoretical analysis] Add a short table or proposition summarizing the exact assumptions (twice differentiability, norm-matching, mean-alignment) required for the identity to hold exactly versus approximately.
  3. [Experiments / figures] Ensure all figures comparing FO-ZO forgetting include finite-query variance or smoothing-radius sweeps so that the predicted quadratic gap can be visually assessed against the acknowledged perturbations.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments, which highlight opportunities to strengthen the rigor of the local analysis and algorithmic transfer. We address each major comment below and will revise the manuscript accordingly to include the requested derivations, bounds, and conditions.

Point-by-point responses
  1. Referee: Abstract / local randomized gradient-shaping analysis: the central claim of an 'exact identity' for expected shaped retention curvature is presented without derivation steps, explicit Hessian assumptions, or finite-query error bounds. ... Please supply the full derivation together with the precise conditions under which the identity remains exact.

    Authors: The identity is derived in the local randomized gradient-shaping section from the expectation of the finite-difference raw shape under uniform random directions on the sphere. Starting from the definition of the shaped adaptation vector v = (raw shape) scaled to match E[||v||²] = ||FO gradient||², the expectation E[v^T H v] expands via linearity to the isotropic floor (trace(H)/d) plus a contraction (1 - 1/k) of the anisotropic excess (g^T (H - mean(H)I) g / ||g||²), where k is the number of queries. This holds exactly when the loss is twice continuously differentiable and the query radius σ is small enough that the Hessian is locally constant (i.e., third-order remainder o(σ)). The finite-query section treats smoothing bias as an additive O(σ) perturbation to this mean identity. We will insert the full step-by-step derivation, state the twice-differentiability and local-constancy assumptions explicitly, and clarify that the identity is exact in the σ → 0 limit and approximate otherwise. revision: yes

  2. Referee: Projection onto incoming gradient / quadratic forgetting gap: the gap is stated to be quadratic and query-dependent, yet the derivation relies on (i) exact mean-alignment ... Quantify how the O(σ) misalignment term propagates into the gap size and whether the sign of the improvement can reverse.

    Authors: The quadratic gap is obtained by projecting the identity onto the incoming gradient direction, yielding gap = (1/k) × excess curvature, where excess curvature = g^T (H - mean(H)I) g / ||g||². The O(σ) misalignment from finite differences adds a linear perturbation bounded by O(σ ||H||), which enters the gap as an additive O(σ) term. Consequently the net gap remains positive (ZO advantage) whenever excess curvature > C σ for a constant C depending on query geometry; the sign reverses only when misalignment dominates, i.e., when σ is large relative to excess curvature or when curvature varies sharply across the query ball. We will add this propagation analysis with the explicit bound and the reversal condition in the finite-query accounting section. revision: yes

  3. Referee: RISE algorithmic transfer and blockwise analysis: the claim that blockwise sampling supplies many local shaping directions after one gradient computation and separates mean-step damage from centered random exposure depends on block-diagonal curvature and limited cross-block coupling. No explicit bounds or diagnostic conditions are given ...

    Authors: The separation holds when the Hessian is approximately block-diagonal, quantified by the cross-block coupling ratio ρ = ||off-diagonal blocks|| / ||diagonal blocks||. Under ρ ≪ 1 the mean-step damage is confined to within-block curvature while centered random exposure averages to zero across blocks; the error from residual coupling is bounded by O(ρ). We will add this explicit bound together with the diagnostic condition ρ < 0.1 (as a practical threshold) and the local curvature-variance diagnostic in the RISE section, allowing readers to verify when the advantage over plain FO or ZO is expected. revision: yes
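
Two pieces of the rebuttal above translate directly into checkable form. First, the claimed finite-σ accounting from the response to comment 2 can be written out as follows; the constants are placeholders for the query-geometry factors the authors promise to make explicit, so this is a transcription of their claim rather than a derived bound.

```latex
% Transcription of the rebuttal's claimed finite-smoothing accounting (not a derived result).
\[
  \Delta_{\text{FO-ZO}}(\sigma)
  \;\approx\; \frac{1}{k}\!\left(\frac{g^{\top} H g}{\lVert g\rVert^{2}} - \bar\lambda\right)
  \;-\; \underbrace{O\!\big(\sigma\,\lVert H\rVert\big)}_{\text{smoothing misalignment}},
  \qquad
  \Delta_{\text{FO-ZO}}(\sigma) > 0
  \;\iff\;
  \frac{g^{\top} H g}{\lVert g\rVert^{2}} - \bar\lambda \;>\; C\,\sigma
\]
% for a constant C set by the query geometry; the sign reverses when sigma is large
% relative to the curvature excess or when curvature varies sharply across the query ball.
```

Second, the block-coupling diagnostic from the response to comment 3 is easy to state in code: partition the parameters into blocks, compare the norm of the off-diagonal Hessian blocks to the diagonal ones, and treat the blockwise transfer as plausible when the ratio is small. The consecutive-block partition, the Frobenius norms, and the 0.1 threshold below follow the rebuttal's sketch and are illustrative rather than the paper's exact definitions.

```python
import numpy as np

def cross_block_coupling(H, block_sizes):
    """rho = ||off-diagonal blocks||_F / ||diagonal blocks||_F for a Hessian H
    partitioned into consecutive parameter blocks of the given sizes."""
    edges = np.cumsum([0] + list(block_sizes))
    mask = np.zeros_like(H, dtype=bool)
    for a, b in zip(edges[:-1], edges[1:]):
        mask[a:b, a:b] = True                    # mark the within-block (diagonal) entries
    return np.linalg.norm(H[~mask]) / np.linalg.norm(H[mask])

# Toy example: strong within-block curvature, weak cross-block coupling.
rng = np.random.default_rng(3)
blocks = [20, 20, 10]
d = sum(blocks)
H = 0.02 * rng.normal(size=(d, d))
H = (H + H.T) / 2                                # weak symmetric background coupling
edges = np.cumsum([0] + blocks)
for a, b in zip(edges[:-1], edges[1:]):
    B = rng.normal(size=(b - a, b - a))
    H[a:b, a:b] = B @ B.T / (b - a)              # dominant within-block curvature

rho = cross_block_coupling(H, blocks)
verdict = "plausible" if rho < 0.1 else "questionable"
print(f"rho = {rho:.3f} -> blockwise shaping transfer looks {verdict} here")
```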

Circularity Check

0 steps flagged

No significant circularity detected in the derivation chain

full rationale

The paper's central analysis begins from two explicit modeling assumptions (mean-alignment of the raw finite-difference shape with the FO gradient, and exact fixing of expected squared adaptation norm by the norm-matched comparator). The claimed 'exact identity' for expected shaped retention curvature is obtained by algebraic decomposition of the quadratic form into an isotropic floor (fixed by the norm constraint) plus an anisotropic deviation (modulated by alignment). Projecting the identity onto the incoming gradient to produce the observable FO-ZO quadratic forgetting gap is a direct, assumption-driven algebraic consequence rather than a tautology or fitted renaming. No load-bearing self-citation, uniqueness theorem, or ansatz smuggling is invoked; the derivation remains self-contained under the stated local randomized shaping framework and does not reduce to its inputs by construction.
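
The decomposition the audit refers to is the standard split of a quadratic form into its rotation-invariant mean and a trace-free remainder; a minimal LaTeX rendering, using the λ̄ and R notation from the figure captions, makes the "fixed by the norm constraint" versus "modulated by alignment" distinction concrete.

```latex
% Isotropic floor vs. anisotropic deviation for a unit direction u:
\[
  u^{\top} H u \;=\; \bar\lambda + u^{\top} R u,
  \qquad H = \bar\lambda I + R, \;\; \operatorname{tr}(R) = 0.
\]
% Averaging over uniformly random unit directions kills the trace-free part,
\[
  \mathbb{E}_{u}\!\left[u^{\top} H u\right] \;=\; \bar\lambda + \tfrac{1}{d}\operatorname{tr}(R) \;=\; \bar\lambda,
\]
% so the norm constraint alone pins the isotropic floor, while any dependence on the
% incoming direction must enter through the anisotropic term u^T R u.
```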

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; full text is required to audit the curvature assumptions and norm-matching rule.

pith-pipeline@v0.9.0 · 5575 in / 1227 out tokens · 25604 ms · 2026-05-12T04:06:52.114960+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
