Recognition: 2 theorem links · Lean Theorem
Why Zeroth-Order Adaptation May Forget Less: A Randomized Shaping Theory
Pith reviewed 2026-05-12 04:06 UTC · model grok-4.3
The pith
Zeroth-order adaptation can forget less than first-order descent because its shaped update contracts only the anisotropic retention curvature while preserving the isotropic floor.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For norm-matched ZO, the expected shaped retention curvature obeys an exact identity that preserves the isotropic retention floor while contracting only the anisotropic component. Projecting this identity onto the incoming gradient yields the observable FO-ZO quadratic forgetting gap: ZO improves mean forgetting precisely when the FO direction has above-average retention curvature, by a query-dependent fraction of that curvature excess. A practical finite-query accounting separates the mean mechanism from one-batch sampling and smoothing perturbations, and the blockwise RISE transfer applies the calibrated shape to exact FO gradients inside parameter blocks.
What carries the argument
A randomized gradient-shaping analysis in which finite differences expose a raw shape whose mean is aligned with the first-order gradient, while the norm-matched comparator fixes the expected squared adaptation norm; together these yield the curvature-contraction identity.
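The shaping mechanism can be sanity-checked numerically. Below is a minimal Monte Carlo sketch, assuming Gaussian query directions (the choice consistent with the norm-inflation factor κ = (q+d+1)/q quoted later from Lemma 3.2) and the σ → 0 limit where finite differences become exact directional derivatives; the contraction weight τ = d/(q+d+1) is our reconstruction of the query-dependent fraction, not a value stated by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, q = 8, 4
# A fixed anisotropic "retention curvature" matrix H (PSD) and an incoming FO gradient g.
A = rng.standard_normal((d, d))
H = A @ A.T / d
g = rng.standard_normal(d)

lam_bar = np.trace(H) / d      # isotropic retention floor (average curvature)
kappa = (q + d + 1) / q        # norm inflation of the raw q-query shape (Lemma 3.2 form)
tau = d / (q + d + 1)          # reconstructed query-dependent contraction weight

# Raw shape (1/q) sum_i (g . u_i) u_i with Gaussian directions u_i, i.e. the
# sigma -> 0 finite-difference estimator; rescaled by 1/sqrt(kappa) so that
# E||v||^2 = ||g||^2 (the norm-matched comparator).
n = 100_000
Z = rng.standard_normal((n, q, d))
coef = Z @ g                                    # directional derivatives g . u_i
v = (coef[..., None] * Z).mean(axis=1) / np.sqrt(kappa)

# Expected shaped retention curvature vs. the claimed identity:
# (1 - tau) * (g^T H g) + tau * lam_bar * ||g||^2  (floor preserved, excess contracted).
shaped = np.einsum('ni,ij,nj->n', v, H, v).mean()
predicted = (1 - tau) * (g @ H @ g) + tau * lam_bar * (g @ g)
print(shaped, predicted)  # the two values should agree closely
```

Note that the FO comparator exposes curvature g^T H g exactly, so the gap between the two printed quantities is τ times the anisotropic excess, matching the projection claim above under these assumptions.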
If this is right
- ZO steps reduce retention damage relative to FO steps exactly when the gradient direction exceeds average retention curvature.
- The size of the forgetting improvement scales directly with the curvature excess in the FO direction, by a query-dependent fraction of that excess.
- Finite-query effects from sampling and smoothing can be isolated from the mean shaping mechanism.
- Applying the calibrated shape to exact FO gradients inside parameter blocks yields a stability-plasticity tradeoff that removes smoothing bias while retaining local shaping directions.
Where Pith is reading between the lines
- The theory implies that measuring directional curvature excess before each adaptation step could decide whether to switch to ZO shaping for that update.
- Block-diagonal curvature structure and limited cross-block coupling would make the blockwise RISE transfer most effective in layered networks.
- The same curvature-contraction identity may extend to other controlled-randomness methods that preserve expected step norm while altering update shape.
Load-bearing premise
Finite differences produce a raw shape whose mean is aligned with the first-order gradient and the norm-matched comparator exactly fixes the expected squared adaptation norm.
What would settle it
Measure the retention curvature matrix along the first-order gradient direction in a trained network, compute the predicted quadratic forgetting gap from the curvature excess, and check whether the observed difference in mean forgetting between norm-matched ZO and FO steps matches that prediction under controlled query budgets.
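On a toy quadratic retention loss this protocol collapses to a few lines. Everything below (the diagonal stand-in Hessian, and the form τ = d/(q+d+1) for the query-dependent fraction) is an illustrative assumption, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(1)
d, q = 16, 4
# Hypothetical retention curvature spectrum of a trained network (diagonal toy case).
H = np.diag(rng.uniform(0.1, 5.0, d))
g = rng.standard_normal(d)              # incoming (new-task) FO gradient

dir_curv = g @ H @ g / (g @ g)          # retention curvature along the FO direction
lam_bar = np.trace(H) / d               # average retention curvature
excess = dir_curv - lam_bar             # directional curvature excess

tau = d / (q + d + 1)                   # assumed query-dependent fraction
predicted_gap = tau * excess * (g @ g)  # predicted FO-ZO mean forgetting gap
print(predicted_gap > 0)                # ZO predicted to help iff excess > 0
```

The experiment then compares this prediction against the observed mean-forgetting difference between norm-matched ZO and FO steps across query budgets q.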
read the original abstract
Continual learning requires new-task adaptation without damaging previously acquired capabilities. Recent forward-pass and zeroth-order (ZO) results show that low-query adaptation may retain better than first-order (FO) descent, but the usual view of ZO as noisy FO estimation does not explain why. We give a local randomized gradient-shaping analysis: finite differences expose a raw shape that is mean-aligned with FO, while the norm-matched comparator fixes the expected squared adaptation norm. Under this controlled comparison, forgetting depends on how the adaptation shape exposes retention curvature. For norm-matched ZO, the expected shaped retention curvature obeys an exact identity that preserves the isotropic retention floor while contracting only the anisotropic component. Projecting this identity onto the incoming gradient yields the observable FO-ZO quadratic forgetting gap: ZO improves mean forgetting precisely when the FO direction has above-average retention curvature, by a query-dependent fraction of that curvature excess. A practical finite-query accounting separates the mean mechanism from one-batch sampling and smoothing perturbations. As an algorithmic transfer, RISE applies the calibrated ZO shape to exact FO gradients inside parameter blocks. Its target is a stability-plasticity tradeoff: randomized shaping may reduce the retention exposure paid by FO, exact gradients remove finite-smoothing bias from finite-difference ZO, and blockwise sampling supplies many local shaping directions after one gradient computation. The blockwise analysis separates mean-step damage from centered random exposure, showing how block-diagonal curvature, cross-block coupling, and local shaping diagnostics specify where this exact-gradient transfer is most likely to be visible.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that zeroth-order (ZO) adaptation forgets less than first-order (FO) in continual learning because finite-difference shaping produces a raw shape whose expectation is aligned with the FO gradient; under a norm-matched comparator that fixes E[||adaptation||²], the expected shaped retention curvature obeys an exact identity preserving the isotropic floor while contracting only the anisotropic component. Projecting this identity onto the incoming gradient yields a quadratic FO-ZO forgetting gap that favors ZO precisely when the FO direction has above-average retention curvature. A finite-query accounting separates the mean mechanism from sampling/smoothing perturbations, and the RISE algorithm transfers the calibrated ZO shape onto exact FO gradients inside parameter blocks, with blockwise analysis separating mean-step damage from centered random exposure.
Significance. If the identity and its supporting assumptions can be made rigorous with explicit error bounds, the work supplies a mechanistic explanation for empirical retention advantages of low-query methods that goes beyond the 'noisy FO estimator' view and directly motivates a hybrid algorithm (RISE) targeting the stability-plasticity tradeoff. The separation of mean alignment from finite-query perturbations and the block-diagonal curvature diagnostics are potentially useful for guiding practical implementations.
major comments (3)
- [Abstract / local randomized gradient-shaping analysis] The central claim of an 'exact identity' for expected shaped retention curvature is presented without derivation steps, explicit Hessian assumptions, or finite-query error bounds. The finite-query accounting section itself acknowledges smoothing bias and sampling variance; when the loss is not twice differentiable or cross-block coupling is strong, E[raw shape] deviates from the FO direction by an O(σ) term, rendering the identity approximate and the quadratic gap smaller or reversed. Please supply the full derivation together with the precise conditions under which the identity remains exact.
- [Projection onto incoming gradient / quadratic forgetting gap] The gap is stated to be quadratic and query-dependent, yet the derivation relies on (i) exact mean-alignment of the finite-difference shape and (ii) the norm-matched comparator fixing E[||adaptation||²] exactly. Both are described as holding 'locally,' but the paper's own finite-query section notes that these fail under smoothing or curvature variation. Quantify how the O(σ) misalignment term propagates into the gap size and whether the sign of the improvement can reverse.
- [RISE / blockwise analysis] The claim that blockwise sampling supplies many local shaping directions after one gradient computation and separates mean-step damage from centered random exposure depends on block-diagonal curvature and limited cross-block coupling. No explicit bounds or diagnostic conditions are given for when this separation is visible; without them the practical advantage over plain FO or ZO remains unverified.
minor comments (3)
- [Abstract] Define 'retention curvature,' 'isotropic floor,' and 'anisotropic component' with explicit notation at first use; the current abstract-only presentation leaves these terms ambiguous for readers outside the immediate sub-area.
- [Theoretical analysis] Add a short table or proposition summarizing the exact assumptions (twice differentiability, norm-matching, mean-alignment) required for the identity to hold exactly versus approximately.
- [Experiments / figures] Ensure all figures comparing FO-ZO forgetting include finite-query variance or smoothing-radius sweeps so that the predicted quadratic gap can be visually assessed against the acknowledged perturbations.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments, which highlight opportunities to strengthen the rigor of the local analysis and algorithmic transfer. We address each major comment below and will revise the manuscript accordingly to include the requested derivations, bounds, and conditions.
read point-by-point responses
-
Referee: Abstract / local randomized gradient-shaping analysis: the central claim of an 'exact identity' for expected shaped retention curvature is presented without derivation steps, explicit Hessian assumptions, or finite-query error bounds. ... Please supply the full derivation together with the precise conditions under which the identity remains exact.
Authors: The identity is derived in the local randomized gradient-shaping section from the expectation of the finite-difference raw shape under uniform random directions on the sphere. Starting from the definition of the shaped adaptation vector v = (raw shape) scaled to match E[||v||²] = ||FO gradient||², the expectation E[v^T H v] expands via linearity to the isotropic floor (trace(H)/d) plus a contraction (1 - 1/k) of the anisotropic excess (g^T (H - mean(H)I) g / ||g||²), where k is the number of queries. This holds exactly when the loss is twice continuously differentiable and the query radius σ is small enough that the Hessian is locally constant (i.e., third-order remainder o(σ)). The finite-query section treats smoothing bias as an additive O(σ) perturbation to this mean identity. We will insert the full step-by-step derivation, state the twice-differentiability and local-constancy assumptions explicitly, and clarify that the identity is exact in the σ → 0 limit and approximate otherwise. revision: yes
-
Referee: Projection onto incoming gradient / quadratic forgetting gap: the gap is stated to be quadratic and query-dependent, yet the derivation relies on (i) exact mean-alignment ... Quantify how the O(σ) misalignment term propagates into the gap size and whether the sign of the improvement can reverse.
Authors: The quadratic gap is obtained by projecting the identity onto the incoming gradient direction, yielding gap = (1/k) × excess curvature, where excess curvature = g^T (H - mean(H)I) g / ||g||². The O(σ) misalignment from finite differences adds a linear perturbation bounded by O(σ ||H||), which enters the gap as an additive O(σ) term. Consequently the net gap remains positive (ZO advantage) whenever excess curvature > C σ for a constant C depending on query geometry; the sign reverses only when misalignment dominates, i.e., when σ is large relative to excess curvature or when curvature varies sharply across the query ball. We will add this propagation analysis with the explicit bound and the reversal condition in the finite-query accounting section. revision: yes
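The reversal condition in this response can be made concrete with a toy accounting. The constant C and the additive form of the smoothing term are placeholders for the bound the authors promise, not quantities from the paper:

```python
# Sign of the net ZO advantage under the rebuttal's accounting:
# net_gap = (1/k) * excess - C * sigma, where C is a placeholder for the
# query-geometry constant the authors promise to derive.
def net_gap(excess: float, k: int, sigma: float, C: float = 1.0) -> float:
    return excess / k - C * sigma

# ZO helps when the curvature excess dominates the smoothing perturbation:
print(net_gap(excess=2.0, k=4, sigma=0.1))   # positive: ZO advantage survives
print(net_gap(excess=0.2, k=4, sigma=0.1))   # negative: sign reversed
```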
-
Referee: RISE algorithmic transfer and blockwise analysis: the claim that blockwise sampling supplies many local shaping directions after one gradient computation and separates mean-step damage from centered random exposure depends on block-diagonal curvature and limited cross-block coupling. No explicit bounds or diagnostic conditions are given ...
Authors: The separation holds when the Hessian is approximately block-diagonal, quantified by the cross-block coupling ratio ρ = ||off-diagonal blocks|| / ||diagonal blocks||. Under ρ ≪ 1 the mean-step damage is confined to within-block curvature while centered random exposure averages to zero across blocks; the error from residual coupling is bounded by O(ρ). We will add this explicit bound together with the diagnostic condition ρ < 0.1 (as a practical threshold) and the local curvature-variance diagnostic in the RISE section, allowing readers to verify when the advantage over plain FO or ZO is expected. revision: yes
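The coupling diagnostic proposed here can be computed directly. A minimal sketch, with Frobenius norms standing in for whatever norm the revision will specify, on a synthetic near-block-diagonal Hessian:

```python
import numpy as np

def cross_block_coupling(H: np.ndarray, blocks) -> float:
    """rho = ||off-diagonal blocks||_F / ||diagonal blocks||_F for a
    partition of parameter indices into blocks."""
    mask = np.zeros(H.shape, dtype=bool)
    for idx in blocks:
        mask[np.ix_(idx, idx)] = True
    return np.linalg.norm(H[~mask]) / np.linalg.norm(H[mask])

rng = np.random.default_rng(2)
blocks = [list(range(0, 4)), list(range(4, 8))]
H = 0.05 * rng.standard_normal((8, 8))          # weak cross-block entries
for idx in blocks:                              # strong within-block curvature
    H[np.ix_(idx, idx)] = np.eye(4) + 0.3 * rng.standard_normal((4, 4))
H = (H + H.T) / 2                               # symmetrize

print(cross_block_coupling(H, blocks))          # small rho: separation expected
```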
Circularity Check
No significant circularity detected in the derivation chain
full rationale
The paper's central analysis begins from two explicit modeling assumptions (mean-alignment of the raw finite-difference shape with the FO gradient, and exact fixing of expected squared adaptation norm by the norm-matched comparator). The claimed 'exact identity' for expected shaped retention curvature is obtained by algebraic decomposition of the quadratic form into an isotropic floor (fixed by the norm constraint) plus an anisotropic deviation (modulated by alignment). Projecting the identity onto the incoming gradient to produce the observable FO-ZO quadratic forgetting gap is a direct, assumption-driven algebraic consequence rather than a tautology or fitted renaming. No load-bearing self-citation, uniqueness theorem, or ansatz smuggling is invoked; the derivation remains self-contained under the stated local randomized shaping framework and does not reduce to its inputs by construction.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
Paper passage: Theorem 4.1 (Expected Curvature Identity for Norm-Matched ZO): H̄_q = (1 − τ)H + τ λ̄ I
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
Paper passage: Lemma 3.2 (Raw Alignment, Norm Inflation, and Norm Matching): E[Z] = I, κ = (q + d + 1)/q
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Ang Bian, Wei Li, Hangjie Yuan, Chengrong Yu, Mang Wang, Zixiang Zhao, Aojun Lu, Pengliang Ji, and Tao Feng. Make continual learning stronger via C-Flat. In Proc. NeurIPS, 2024.
- [2] Arslan Chaudhry, Marc’Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. Efficient lifelong learning with A-GEM. In Proc. ICLR, 2019.
- [3] Jiao Chen, Jiayi He, Fangfang Chen, Zuohong Lv, and Jianhua Tang. Forward-only continual learning. In Proc. ACM MM, 2025.
- [4] Yiming Chen, Yuan Zhang, Liyuan Cao, Kun Yuan, and Zaiwen Wen. Enhancing zeroth-order fine-tuning for language models with low-rank structures. In Proc. ICLR, 2025.
- [5] Mehrdad Farajtabar, Navid Azizan, Alex Mott, and Ang Li. Orthogonal gradient descent for continual learning. In Proc. AISTATS, 2020.
- [6] Tao Feng, Wei Li, Didi Zhu, Hangjie Yuan, Wendi Zheng, Dan Zhang, and Jie Tang. ZeroFlow: Overcoming catastrophic forgetting is easier than you think. In Proc. ICML, 2025.
- [7] Abraham D. Flaxman, Adam Tauman Kalai, and H. Brendan McMahan. Online convex optimization in the bandit setting: Gradient descent without a gradient. In Proc. SODA, 2005.
- [8] Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. In Proc. ICLR, 2021.
- [9] Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, and Justin Gilmer. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proc. ICCV, 2021.
- [10] Saihui Hou, Xinyu Pan, Chen Change Loy, Zilei Wang, and Dahua Lin. Learning a unified classifier incrementally via rebalancing. In Proc. CVPR, 2019.
- [11] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In Proc. ICLR, 2022.
- [12] Zhongzhan Huang, Mingfu Liang, Senwei Liang, and Wei He. AlterSGD: Finding flat minima for continual learning by alternative training. arXiv:2107.05804, 2021.
- [13] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. Proc. Natl. Acad. Sci. U.S.A., 2017.
- [14] Yajing Kong, Liu Liu, Huanhuan Chen, Janusz Kacprzyk, and Dacheng Tao. Overcoming catastrophic forgetting in continual learning by exploring eigenvalues of Hessian matrix. IEEE Trans. Neural Netw. Learn. Syst., 2024.
- [15] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
- [16] Zhizhong Li and Derek Hoiem. Learning without forgetting. In Proc. ECCV, 2016.
- [17] Yong Liu, Zirui Zhu, Chaoyu Gong, Minhao Cheng, Cho-Jui Hsieh, and Yang You. Sparse MeZO: Less parameters for better performance in zeroth-order LLM fine-tuning. In Proc. NeurIPS, 2025.
- [18] David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory for continual learning. In Proc. NeurIPS, 2017.
- [19] Sadhika Malladi, Tianyu Gao, Eshaan Nichani, Alex Damian, Jason D. Lee, Danqi Chen, and Sanjeev Arora. Fine-tuning language models with just forward passes. In Proc. NeurIPS, 2023.
- [20] Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimization of convex functions. Found. Comput. Math., 2017.
- [21] Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, and Bo Wang. Moment matching for multi-source domain adaptation. In Proc. ICCV, 2019.
- [22] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H. Lampert. iCaRL: Incremental classifier and representation learning. In Proc. CVPR, 2017.
- [23] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Fei-Fei Li. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis., 2015.
- [24] Gobinda Saha, Isha Garg, and Kaushik Roy. Gradient projection memory for continual learning. In Proc. ICLR, 2021.
- [25] Keisuke Sugiura and Hiroki Matsutani. ElasticZO: A memory-efficient on-device learning with combined zeroth- and first-order optimization. arXiv:2501.04287, 2025.
- [26] Qitao Tan, Jun Liu, Zheng Zhan, Caiwei Ding, Yanzhi Wang, Xiaolong Ma, Jaewoo Lee, Jin Lu, and Geng Yuan. Harmony in divergence: Towards fast, accurate, and memory-efficient zeroth-order LLM fine-tuning. arXiv:2502.03304, 2025.
- [27] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proc. EMNLP Workshop BlackboxNLP, 2018.
- [28] Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Proc. NeurIPS, 2019.
- [29] Xiao Wang, Tianze Chen, Qiming Ge, Han Xia, Rong Bao, Rui Zheng, Qi Zhang, Tao Gui, and Xuan-Jing Huang. Orthogonal subspace learning for language model continual learning. In Findings of EMNLP, 2023.
- [30] Wanhao Yu, Zheng Wang, Shuteng Niu, Sen Lin, and Li Yang. More than memory savings: Zeroth-order optimization mitigates forgetting in continual learning. arXiv:2510.21019, 2025.
- [31] Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In Proc. NeurIPS, 2015.
- [32] Yihua Zhang, Pingzhi Li, Junyuan Hong, Jiaxiang Li, Yimeng Zhang, Wenqing Zheng, Pin-Yu Chen, Jason D. Lee, Wotao Yin, Mingyi Hong, Zhangyang Wang, Sijia Liu, and Tianlong Chen. Revisiting zeroth-order optimization for memory-efficient LLM fine-tuning: A benchmark. In Proc. ICML, 2024.
- [33] Da-Wei Zhou, Hai-Long Sun, Han-Jia Ye, and De-Chuan Zhan. Expandable subspace ensemble for pre-trained model-based class-incremental learning. In Proc. CVPR, 2024.
- [34] Da-Wei Zhou, Zi-Wen Cai, Han-Jia Ye, De-Chuan Zhan, and Ziwei Liu. Revisiting class-incremental learning with pre-trained models: Generalizability and adaptivity are all you need. Int. J. Comput. Vis., 2025.
discussion (0)