pith. machine review for the scientific record.

arxiv: 2604.21927 · v2 · submitted 2026-04-23 · 💻 cs.LG

Recognition: unknown

Fine-Tuning Regimes Define Distinct Continual Learning Problems

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 22:59 UTC · model grok-4.3

classification 💻 cs.LG
keywords: continual learning · fine-tuning regimes · trainable depth · method ranking · task incremental learning · catastrophic forgetting · adaptation subspaces

The pith

The relative ranking of continual learning methods is not preserved when the trainable depth of the fine-tuning regime changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that fine-tuning regimes, defined by the subspace of trainable parameters, are themselves a variable that shapes continual learning problems. It formalizes these regimes as projected optimization and shows through experiments that changing trainable depth alters the update signals for both fitting new tasks and retaining old knowledge. Testing four methods across five regimes and five datasets reveals inconsistent method rankings, with deeper regimes tied to larger updates and more forgetting. A sympathetic reader cares because this means standard fixed-regime benchmarks may produce conclusions that fail to generalize when adaptation depth varies.

Core claim

Adaptation regimes formalized as projected optimization over fixed trainable subspaces cause the relative ranking of methods to shift, as deeper regimes produce larger update magnitudes, higher forgetting rates, and a tighter link between the two.

What carries the argument

Projected optimization over fixed trainable subspaces, which changes the effective update signal for current-task fitting and knowledge preservation.
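The projected update the paper formalizes, θ⁺ = θ − η·P_S∇J_t(θ) with P_S the orthogonal projector onto the fixed trainable subset S, can be realized in practice by freezing every block below a chosen trainable depth. The sketch below is a minimal PyTorch rendering of that idea under assumed layer sizes and an assumed `apply_regime` helper; it is not the paper's architecture or code.

```python
# Minimal sketch (assumed setup, not the paper's code): a fine-tuning regime as
# the projected update theta+ = theta - eta * P_S * grad J_t(theta), realized by
# freezing all blocks below a chosen trainable depth.
import torch
import torch.nn as nn

def apply_regime(model: nn.Sequential, trainable_depth: int):
    """Freeze everything except the last `trainable_depth` blocks.
    Setting requires_grad=False implements the projector P_S implicitly:
    frozen parameters receive no gradient and hence no update."""
    blocks = list(model.children())
    cutoff = len(blocks) - trainable_depth
    trainable = []
    for i, block in enumerate(blocks):
        for p in block.parameters():
            p.requires_grad = i >= cutoff
            if p.requires_grad:
                trainable.append(p)
    return trainable

# Hypothetical backbone; the paper's actual architectures and regimes may differ.
model = nn.Sequential(
    nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU()),
    nn.Sequential(nn.Linear(256, 128), nn.ReLU()),
    nn.Sequential(nn.Linear(128, 10)),
)

# Regime with trainable depth 1: only the final block (the head) is updated.
params = apply_regime(model, trainable_depth=1)
optimizer = torch.optim.SGD(params, lr=0.01)
```

Passing only the unfrozen parameters to the optimizer zeroes exactly the gradient components that P_S projects out, which is why the chosen trainable depth directly reshapes the effective update signal.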

If this is right

  • Relative rankings of methods such as online EWC, LwF, SI, and GEM are not consistent across the five trainable depth regimes.
  • Deeper regimes produce larger update magnitudes and higher forgetting.
  • The correlation between update magnitude and forgetting strengthens in deeper regimes.
  • Comparative conclusions about continual learning methods depend on the chosen fine-tuning regime.
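On the ranking point in the first bullet: Figure 2 summarizes agreement between regimes with mean Kendall's τ. The sketch below shows one way such a consistency check can be computed from per-regime method scores; the regime labels and accuracies are invented placeholders, not results from the paper.

```python
# Illustrative ranking-consistency check across regimes using Kendall's tau,
# the statistic reported in Figure 2. All scores below are made-up placeholders,
# not numbers from the paper.
from itertools import combinations
from scipy.stats import kendalltau

# one scalar metric (e.g. average accuracy) per method, for each depth regime
scores = {
    "depth_1": {"oEWC": 0.81, "LwF": 0.79, "SI": 0.77, "GEM": 0.74},
    "depth_3": {"oEWC": 0.72, "LwF": 0.75, "SI": 0.70, "GEM": 0.73},
    "depth_5": {"oEWC": 0.61, "LwF": 0.66, "SI": 0.68, "GEM": 0.64},
}
methods = ["oEWC", "LwF", "SI", "GEM"]

taus = []
for r1, r2 in combinations(scores, 2):
    a = [scores[r1][m] for m in methods]
    b = [scores[r2][m] for m in methods]
    tau, _ = kendalltau(a, b)
    taus.append(tau)
    print(f"{r1} vs {r2}: tau = {tau:+.2f}")

print("mean pairwise tau:", round(sum(taus) / len(taus), 2))
# tau = 1 would mean rankings are preserved; lower values signal ranking shifts.
```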

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Standard CL benchmarks may need to treat trainable depth as an explicit experimental factor rather than a fixed default.
  • Method design could benefit from explicit robustness testing across multiple adaptation depths.
  • The observed link between update size and forgetting in deeper regimes suggests potential for new regularization strategies tuned to regime depth.

Load-bearing premise

That differences in method rankings across regimes are driven by the choice of trainable depth rather than other unstated implementation details.

What would settle it

Repeating the exact experimental setup across the same five regimes and finding that method rankings remain identical would falsify the central claim.

Figures

Figures reproduced from arXiv: 2604.21927 by Elena Burceanu, Paul-Tiberiu Iordache.

Figure 1. Fine-tuning regime changes comparative conclusions (Fashion MNIST). (a) Standard continual learning …
Figure 2. Ranking sensitivity persists across other benchmarks. Each panel shows the mean Kendall's τ …
Figure 3. Gradient magnitude and forgetting both increase with trainable depth, but the main effect goes beyond these …
Original abstract

Continual learning (CL) studies how models acquire tasks sequentially while retaining previously learned knowledge. Despite substantial progress in benchmarking CL methods, comparative evaluations typically keep the fine-tuning regime fixed. In this paper, we argue that the fine-tuning regime, defined by the trainable parameter subspace, is itself a key evaluation variable. We formalize adaptation regimes as projected optimization over fixed trainable subspaces, showing that changing the trainable depth alters the effective update signal through which both current task fitting and knowledge preservation operate. This analysis motivates the hypothesis that method comparisons need not be invariant across regimes. We test this hypothesis in task incremental CL, five trainable depth regimes, and four standard methods: online EWC, LwF, SI, and GEM. Across five benchmark datasets, namely MNIST, Fashion MNIST, KMNIST, QMNIST, and CIFAR-100, and across 11 task orders per dataset, we find that the relative ranking of methods is not consistently preserved across regimes. We further show that deeper adaptation regimes are associated with larger update magnitudes, higher forgetting, and a stronger relationship between the two. These results show that comparative conclusions in CL can depend strongly on the chosen fine-tuning regime, motivating regime-aware evaluation protocols that treat trainable depth as an explicit experimental factor.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that fine-tuning regimes in continual learning—defined by the trainable parameter subspace (specifically trainable depth)—constitute distinct problems. It formalizes regimes as projected optimization over fixed subspaces, which alters effective update signals for task fitting and knowledge retention. Empirically, across five datasets (MNIST, Fashion MNIST, KMNIST, QMNIST, CIFAR-100), 11 task orders per dataset, and five depth regimes, the relative rankings of online EWC, LwF, SI, and GEM are not preserved; deeper regimes also show larger update magnitudes, higher forgetting, and stronger correlation between them. This motivates regime-aware evaluation protocols.

Significance. If the central empirical pattern holds after controlling for confounds, the result is significant for the CL field: it challenges the implicit assumption that method comparisons are regime-invariant and shows that benchmark conclusions can depend on the chosen trainable subspace. The broad evaluation (five datasets, 11 orders) provides a reasonably strong empirical basis and gives credit to the authors for testing the hypothesis at scale rather than on a single benchmark. This could shift evaluation practices toward treating trainable depth as an explicit factor, affecting both method development and reproducibility.

major comments (2)
  1. Methods section: the central claim that ranking changes are driven by trainable depth regimes (rather than implementation artifacts) requires that hyperparameters (learning rates, regularization coefficients for the EWC penalty, LwF distillation loss, SI, and GEM) were either re-tuned per regime or explicitly held fixed with justification. The manuscript notes larger update magnitudes in deeper regimes; if hyperparameters were not adjusted, the observed ranking flips could arise from mismatched optimization strength rather than the projected subspace itself. An explicit statement or ablation on hyperparameter protocol is load-bearing for the hypothesis.
  2. §3 (formalization): while the projected-optimization view is a useful lens, the paper does not derive a quantitative prediction for how the projection operator changes the relative weighting of current-task loss versus preservation terms across methods. Without this or a controlled simulation isolating the projection effect, the link between the formalization and the empirical ranking changes remains interpretive rather than predictive.
minor comments (2)
  1. The five specific trainable depth regimes (e.g., which layers or parameter subsets are frozen) should be defined with a table or explicit list in the main text rather than deferred to the appendix, as this is central to reproducibility.
  2. Figure captions reporting method rankings should include the exact number of runs, error bars, and any statistical test used to support claims of 'not consistently preserved'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of our experimental design and theoretical framing. We address each major comment below and have made revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: Methods section: the central claim that ranking changes are driven by trainable depth regimes (rather than implementation artifacts) requires that hyperparameters (learning rates, regularization coefficients for the EWC penalty, LwF distillation loss, SI, and GEM) were either re-tuned per regime or explicitly held fixed with justification. The manuscript notes larger update magnitudes in deeper regimes; if hyperparameters were not adjusted, the observed ranking flips could arise from mismatched optimization strength rather than the projected subspace itself. An explicit statement or ablation on hyperparameter protocol is load-bearing for the hypothesis.

    Authors: We held all hyperparameters fixed across regimes precisely to isolate the effect of the trainable subspace projection. Re-tuning per regime would have introduced an additional confounding variable, making it impossible to attribute performance differences to the regime itself rather than to optimization strength. This protocol is now stated explicitly in the Methods section with justification, including a note that the larger update magnitudes observed in deeper regimes are a direct consequence of the projection (more parameters receive the full gradient signal) rather than a hyperparameter mismatch. We also added a brief discussion acknowledging that regime-specific tuning could be studied separately but would address a different question. revision: yes

  2. Referee: §3 (formalization): while the projected-optimization view is a useful lens, the paper does not derive a quantitative prediction for how the projection operator changes the relative weighting of current-task loss versus preservation terms across methods. Without this or a controlled simulation isolating the projection effect, the link between the formalization and the empirical ranking changes remains interpretive rather than predictive.

    Authors: The formalization in §3 is intended as a mechanistic lens rather than a closed-form predictor; deriving a general quantitative mapping from projection to loss weighting would require strong assumptions on the Hessian and loss geometry that do not hold uniformly across methods and datasets. We have revised §3 to include an expanded discussion of the expected directional effects on current-task versus preservation gradients under projection, and we added a controlled toy simulation in the appendix that isolates the projection operator on a quadratic objective to illustrate the altered effective updates. The primary support for the hypothesis remains the large-scale empirical evaluation across five datasets and 11 orders, which directly tests whether rankings are preserved. revision: partial
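For intuition about the toy simulation the rebuttal mentions, the following is a hedged sketch of projected gradient descent on a synthetic quadratic objective, with a varying number of trainable coordinates. The curvature matrix, step size, and drift-based forgetting proxy are assumptions for illustration; this is not the authors' appendix experiment.

```python
# Hedged sketch of the kind of toy simulation described in the rebuttal:
# projected gradient descent on a synthetic quadratic objective, varying how
# many coordinates the projector P_S leaves trainable. Illustrative only,
# not the authors' actual appendix experiment.
import numpy as np

rng = np.random.default_rng(0)
d = 20
A = rng.standard_normal((d, d))
H = A @ A.T + np.eye(d)               # positive-definite curvature of the new task
theta_old = rng.standard_normal(d)    # starting point ("previous-task" solution)
theta_star = rng.standard_normal(d)   # optimum of the new-task quadratic

def run(depth: int, eta: float = 0.01, steps: int = 300):
    """Projected GD: theta <- theta - eta * P_S grad J(theta),
    where P_S keeps only the last `depth` coordinates trainable."""
    mask = np.zeros(d)
    mask[d - depth:] = 1.0            # diagonal projector P_S
    theta = theta_old.copy()
    cumulative_update = 0.0
    for _ in range(steps):
        grad = H @ (theta - theta_star)
        step = eta * mask * grad
        theta -= step
        cumulative_update += np.linalg.norm(step)
    drift = np.linalg.norm(theta - theta_old)   # crude proxy for forgetting
    return cumulative_update, drift

for depth in (2, 5, 10, 20):
    upd, drift = run(depth)
    print(f"trainable coords={depth:2d}  cumulative update={upd:7.2f}  drift={drift:5.2f}")
```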

Circularity Check

0 steps flagged

No circularity: empirical hypothesis test on standard methods

Full rationale

The paper defines adaptation regimes via projected optimization over trainable subspaces and analytically notes that depth changes alter update magnitudes and forgetting. It then states a hypothesis that method rankings need not be invariant and tests this directly via controlled experiments on five datasets, eleven task orders, and four standard CL methods (online EWC, LwF, SI, GEM). No step reduces a claimed prediction or uniqueness result to a fitted quantity defined inside the paper; no self-citation chain is invoked to justify the central claim; and the reported outcome (inconsistent rankings) is an observed empirical pattern rather than a quantity forced by the formalization itself. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on standard continual learning optimization assumptions and existing benchmark datasets without introducing new free parameters, axioms beyond domain norms, or invented entities.

axioms (1)
  • domain assumption: Projected optimization over fixed trainable subspaces accurately models the effect of changing trainable depth.
    Invoked when formalizing adaptation regimes as the central modeling step.
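The paper's appendix restates a progress bound under projected updates; a hedged reconstruction in standard descent-lemma form, assuming J_t is L-smooth and P_S is an orthogonal projector, reads:

```latex
% Projected update and a descent-lemma bound, assuming J_t is L-smooth and
% P_S is an orthogonal (idempotent, symmetric) projector. Reconstructed in
% standard form; the paper's exact statement and constants may differ.
\theta^{+} = \theta - \eta\, P_S \nabla J_t(\theta),
\qquad
J_t(\theta^{+}) \;\le\; J_t(\theta)
  - \eta \Bigl(1 - \tfrac{L\eta}{2}\Bigr) \bigl\lVert P_S \nabla J_t(\theta) \bigr\rVert^{2}.
```

The constants in the paper's own statement may differ; what the axiom leans on is only that P_S determines which gradient components can contribute to new-task progress or to forgetting.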

pith-pipeline@v0.9.0 · 5522 in / 1118 out tokens · 39319 ms · 2026-05-09T22:59:51.107866+00:00 · methodology

