pith. machine review for the scientific record.

arxiv: 2604.18857 · v1 · submitted 2026-04-20 · 💻 cs.LG · cs.CV

Recognition: unknown

Task Switching Without Forgetting via Proximal Decoupling

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 04:47 UTC · model grok-4.3

classification 💻 cs.LG cs.CV
keywords continual learning · catastrophic forgetting · proximal operators · operator splitting · sparse regularization · stability-plasticity

The pith

Operator splitting decouples task learning from stability enforcement to prevent forgetting in continual learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard regularization in continual learning over-constrains models by blending learning and retention signals into a single gradient update. It proposes operator splitting instead: a learning step that minimizes only the current-task loss, followed by a proximal stability step that applies a sparse regularizer to identify and preserve task-relevant parameters while pruning redundant ones. This separation turns retention and plasticity into complementary, negotiated updates rather than conflicting forces. As a result, the approach improves both stability on prior tasks and adaptability to new ones without replay buffers, Bayesian sampling, or meta-learning.
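
For intuition, here is a minimal sketch of such a decoupled update in PyTorch: the learning step takes plain gradient steps on the current-task loss, and the stability step applies soft-thresholding (the proximal map of an ℓ1 penalty) to the per-task parameter change. The ℓ1-on-update choice, the helper names, and the hyperparameters are assumptions for illustration, not the paper's solver.

```python
import torch

def soft_threshold(u, tau):
    """Proximal map of tau * ||u||_1: shrinks entries toward zero, zeroing the small ones."""
    return torch.sign(u) * torch.clamp(u.abs() - tau, min=0.0)

def train_task(model, loader, loss_fn, lr=0.1, lam=1e-2):
    """One task under a decoupled two-step scheme (illustrative sketch).

    Learning step: gradient descent on the current-task loss only.
    Stability step: an l1 proximal update on the per-task change delta = x - x_prev,
    which prunes small, redundant updates and keeps the ones that carry the new task.
    """
    x_prev = [p.detach().clone() for p in model.parameters()]

    for inputs, targets in loader:
        # Learning step: no retention penalty appears in this loss.
        loss = loss_fn(model(inputs), targets)
        model.zero_grad()
        loss.backward()
        with torch.no_grad():
            for p in model.parameters():
                if p.grad is not None:
                    p -= lr * p.grad

        # Stability step: sparsify the update relative to the pre-task weights.
        with torch.no_grad():
            for p, p0 in zip(model.parameters(), x_prev):
                delta = p - p0
                p.copy_(p0 + soft_threshold(delta, lr * lam))
```

Updates shrunk to exactly zero leave the corresponding parameters at their pre-task values, which is how the stability step can preserve prior-task behaviour while pruning redundant changes.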

Core claim

The paper claims that applying operator splitting to the continual-learning objective separates optimization into a learning operator that minimizes the current-task loss and a proximal stability operator equipped with a sparse regularizer. The stability operator prunes unnecessary parameters and preserves those critical for previous tasks. This avoids the over-constraining that arises when the regularizer is added directly to the loss, and instead yields a negotiated update between the two operators.
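
The abstract does not spell out the update rule, but the figure captions identify the solver as Douglas-Rachford splitting (DRS). For reference, the textbook DRS iteration for an objective of the form min_x f_t(x) + λR(x), with f_t the current-task loss and R the sparse regularizer, is sketched below; this generic form is an assumption about how the paper instantiates the split, not a reproduction of its equations.

```latex
\begin{aligned}
  x^{k+1} &= \operatorname{prox}_{\gamma f_t}\bigl(y^{k}\bigr)
    && \text{learning operator: fit the current task}\\
  z^{k+1} &= \operatorname{prox}_{\gamma \lambda R}\bigl(2x^{k+1} - y^{k}\bigr)
    && \text{stability operator: sparsify and preserve}\\
  y^{k+1} &= y^{k} + z^{k+1} - x^{k+1}
    && \text{auxiliary update; at a fixed point } y^{k+1} = y^{k}
\end{aligned}
```

The "negotiated update" reading corresponds to the auxiliary sequence y^k: neither operator's output is applied alone, and convergence is reached when y^k stabilizes (the fixed-point behaviour Figure 14's caption alludes to).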

What carries the argument

Operator splitting between a task-learning operator minimizing current loss and a proximal stability operator with sparse regularization that prunes redundant parameters while retaining task-critical ones.

If this is right

  • Improves both stability on old tasks and adaptability to new tasks on standard benchmarks.
  • Removes the need for replay buffers, Bayesian sampling, or meta-learning components.
  • Supports better forward transfer and more efficient capacity use as the number of tasks increases.
  • Supplies theoretical justification for the splitting method applied to the continual-learning objective.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The operator separation could extend to other settings where optimization involves conflicting objectives, such as multi-task or federated learning.
  • The built-in sparse regularization may produce models that are naturally more compact, aiding inference on limited hardware.
  • Applying the same split to reinforcement learning continual settings could reduce interference between policy updates for successive environments.

Load-bearing premise

The proximal stability step with a sparse regularizer can reliably identify and preserve task-relevant parameters while pruning others across growing task sequences without introducing new instabilities or requiring task-specific tuning.
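
If the sparse regularizer is the ℓ1 norm (a common instantiation, assumed here since the abstract does not name it), the proximal stability step reduces to elementwise soft-thresholding, which makes explicit that the selection criterion in this premise is update magnitude:

```latex
\operatorname{prox}_{\tau\|\cdot\|_1}(v)_i \;=\; \operatorname{sign}(v_i)\,\max\bigl(|v_i| - \tau,\; 0\bigr)
```

Entries below the threshold τ are pruned exactly to zero; everything above survives, shrunk by τ. Whether magnitude is a reliable proxy for task relevance is precisely what the referee's first major comment questions.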

What would settle it

A long sequence of dissimilar tasks where the method exhibits rising forgetting rates or requires per-task hyperparameter adjustments comparable to standard regularization would refute the advantage of the decoupling.

Figures

Figures reproduced from arXiv: 2604.18857 by Eric Granger, Masoumeh Zareapoor, Pourya Shamsolmoali, William A. P. Smith, Yue Lu.

Figure 1: Stability-plasticity trade-off in continual learning. Left: Average accuracy vs. average forgetting on the label [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2: Catastrophic forgetting in CL can be understood as [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3: Illustration of the continual learner based on DRS in parameter update space (where [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4: Processing steps of the DRS continual learner. [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗
Figure 5: Alternating proximal operators between the task [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗
Figure 6: Average accuracy shows the plasticity-stability [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 7: Parameter update magnitudes (∆x) across 20 tasks. DRS produces sparse, selective updates per task, preserving most parameters (i.e., keeps many updates close to zero, which shows as dark blue regions). By contrast, EWC and SGD update a broader set of parameters per task, indicating weaker or no retention [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
Figure 8: Granular forgetting over 100 sequential tasks on CASIA-HWDB1.0. The plots illustrate the accumulated forgetting [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗
Figure 9: Evolution of parameter updates across tasks. The top row shows dense learning proposals from minimizing the [PITH_FULL_IMAGE:figures/full_fig_p009_9.png] view at source ↗
Figure 10: Comparison of parameter magnitude between DRS [PITH_FULL_IMAGE:figures/full_fig_p009_10.png] view at source ↗
Figure 11: Stability-plasticity in class-incremental learning. [PITH_FULL_IMAGE:figures/full_fig_p010_11.png] view at source ↗
Figure 12: Long-horizon plasticity analysis. Our model main [PITH_FULL_IMAGE:figures/full_fig_p010_12.png] view at source ↗
Figure 13: Selective DRS vs. Dense DRS on the fine-grained CUB-200 dataset. (Left) We report after-one-epoch accuracy across [PITH_FULL_IMAGE:figures/full_fig_p011_13.png] view at source ↗
Figure 14: Effect of sparsity constraint λ on average accuracy and forgetting on CIFAR-100. Forgetting decreases with larger λ, but with diminishing adaptation after λ > 10; learning continues with a consistent update magnitude but is restricted to a smaller subset of parameters. view at source ↗
read the original abstract

In continual learning, the primary challenge is to learn new information without forgetting old knowledge. A common solution addresses this trade-off through regularization, penalizing changes to parameters critical for previous tasks. In most cases, this regularization term is directly added to the training loss and optimized with standard gradient descent, which blends learning and retention signals into a single update and does not explicitly separate essential parameters from redundant ones. As task sequences grow, this coupling can over-constrain the model, limiting forward transfer and leading to inefficient use of capacity. We propose a different approach that separates task learning from stability enforcement via operator splitting. The learning step focuses on minimizing the current task loss, while a proximal stability step applies a sparse regularizer to prune unnecessary parameters and preserve task-relevant ones. This turns the stability-plasticity into a negotiated update between two complementary operators, rather than a conflicting gradient. We provide theoretical justification for the splitting method on the continual-learning objective, and demonstrate that our proposed solver achieves state-of-the-art results on standard benchmarks, improving both stability and adaptability without the need for replay buffers, Bayesian sampling, or meta-learning components.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces Proximal Decoupling for continual learning, using operator splitting to separate a learning step (minimizing current-task loss) from a proximal stability step that applies a sparse regularizer to prune redundant parameters while preserving those relevant to prior tasks. It claims this decouples the stability-plasticity trade-off, supplies theoretical justification for the splitting method on the continual-learning objective, and achieves SOTA results on standard benchmarks without replay buffers, Bayesian sampling, or meta-learning.

Significance. If the proximal operator reliably identifies task-relevant parameters in non-convex DNN losses, the method would provide a simple, buffer-free alternative to regularization-based continual learning that improves both stability and adaptability. The explicit separation of operators and the claimed theoretical grounding are strengths that could influence future work on scalable task sequences.

major comments (3)
  1. [Theoretical justification] Theoretical justification section: the derivation of the splitting method on the continual-learning objective must explicitly address non-convexity of the DNN loss and show why the proximal map (typically soft-thresholding for the sparse regularizer) aligns with functional task relevance rather than weight magnitude; otherwise the separation claim does not hold for growing task sequences.
  2. [Method and Experiments] Method and experimental sections: the central claim that the proximal stability step works without per-task tuning or new instabilities requires ablations demonstrating that the sparse regularizer strength can be fixed across tasks; misalignment between magnitude-based pruning and true importance would violate the no-replay, no-meta-learning SOTA result.
  3. [Experiments] Experimental evaluation: the SOTA claims on standard benchmarks need error bars, statistical significance tests, and analysis of capacity usage over long sequences to confirm that pruning does not cause unintended forgetting or wasted capacity.
minor comments (1)
  1. [Abstract and Method] Abstract and method: the description of the proximal operator and sparse regularizer would benefit from an explicit equation or pseudocode in the main text rather than relying solely on prose.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and will make targeted revisions to strengthen the theoretical and experimental sections.

read point-by-point responses
  1. Referee: [Theoretical justification] Theoretical justification section: the derivation of the splitting method on the continual-learning objective must explicitly address non-convexity of the DNN loss and show why the proximal map (typically soft-thresholding for the sparse regularizer) aligns with functional task relevance rather than weight magnitude; otherwise the separation claim does not hold for growing task sequences.

    Authors: We appreciate the referee's emphasis on rigor here. The derivation in the manuscript applies operator splitting directly to the continual-learning objective, separating the task-loss minimization from the proximal stability operator. Proximal methods are known to apply under non-convexity when the non-smooth term is handled separately, and we will revise the section to explicitly note this and reference supporting results from non-convex proximal optimization literature. On alignment with functional relevance, we will add clarification that the proximal map is magnitude-based but operates iteratively within the stability objective; this process retains parameters that support prior-task performance, as confirmed by our retention metrics. We view this as sufficient to uphold the separation claim for the evaluated task sequences. revision: partial

  2. Referee: [Method and Experiments] Method and experimental sections: the central claim that the proximal stability step works without per-task tuning or new instabilities requires ablations demonstrating that the sparse regularizer strength can be fixed across tasks; misalignment between magnitude-based pruning and true importance would violate the no-replay, no-meta-learning SOTA result.

    Authors: We agree that fixed hyperparameters are central to the method's appeal. The sparse regularizer strength was held constant across tasks in all reported experiments, with no per-task retuning or observed instabilities. We will add an explicit ablation subsection showing results under a single fixed strength value, confirming stable performance. Regarding potential misalignment, the proximal step enforces sparsity based on the stability objective rather than isolated magnitude; the achieved SOTA results without replay, Bayesian methods, or meta-learning demonstrate that this does not produce the violation described. revision: yes

  3. Referee: [Experiments] Experimental evaluation: the SOTA claims on standard benchmarks need error bars, statistical significance tests, and analysis of capacity usage over long sequences to confirm that pruning does not cause unintended forgetting or wasted capacity.

    Authors: We will update the experimental section to report error bars from multiple random seeds and include statistical significance tests (e.g., Wilcoxon signed-rank tests) against baselines to support the SOTA claims. We will also add a capacity-usage analysis tracking the fraction of active parameters after each proximal step and verifying no unintended forgetting on the standard benchmarks. While our evaluation follows the sequence lengths conventional in the literature, we will discuss scalability implications for longer sequences. revision: yes
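
As a concreteness check on what these promised additions involve, the sketch below computes the two quantities in question: the fraction of active (non-pruned) parameters after a proximal step, and a Wilcoxon signed-rank test over paired per-seed accuracies. The function name and the numbers are illustrative placeholders, not results from the paper.

```python
import numpy as np
from scipy import stats

def active_fraction(params, tol=1e-8):
    """Fraction of parameters whose magnitude exceeds tol after the proximal step."""
    flat = np.concatenate([np.ravel(p) for p in params])
    return float(np.mean(np.abs(flat) > tol))

# Paired per-seed average accuracies for the proposed method vs. one baseline
# (placeholder numbers for illustration only).
ours     = np.array([71.2, 70.8, 71.5, 70.9, 71.1])
baseline = np.array([68.9, 69.4, 69.1, 68.7, 69.0])

stat, p_value = stats.wilcoxon(ours, baseline)
print(f"Wilcoxon signed-rank: statistic={stat}, p={p_value:.4f}")
```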

Circularity Check

0 steps flagged

No significant circularity; derivation presented as independent solver

full rationale

The paper introduces operator splitting to decouple task learning (current loss minimization) from proximal stability enforcement (sparse regularizer on parameters) for continual learning. This is framed as a new solver with claimed theoretical justification on the CL objective and empirical SOTA results on benchmarks, without replay or meta-learning. No equations or steps in the abstract or description reduce a prediction or first-principles result to its own inputs by construction, nor rely on self-citation chains, ansatz smuggling, or renaming of known results as the core derivation. The approach is self-contained against external benchmarks and does not exhibit self-definitional or fitted-input patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities are stated. The method implicitly relies on standard proximal operator properties and sparse regularization assumptions common in optimization literature.

axioms (1)
  • domain assumption: Proximal operators for sparse regularization can separate stability enforcement from task loss minimization without introducing instability
    Invoked by the claim that the stability step prunes unnecessary parameters while preserving relevant ones.

pith-pipeline@v0.9.0 · 5510 in / 1129 out tokens · 24051 ms · 2026-05-10T04:47:59.069103+00:00 · methodology

discussion (0)

