pith. machine review for the scientific record.

arxiv: 2604.18857 · v1 · submitted 2026-04-20 · 💻 cs.LG · cs.CV

Recognition: unknown

Task Switching Without Forgetting via Proximal Decoupling

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 04:47 UTC · model grok-4.3

classification 💻 cs.LG cs.CV
keywords continual learning · catastrophic forgetting · proximal operators · operator splitting · sparse regularization · stability-plasticity

The pith

Operator splitting decouples task learning from stability enforcement to prevent forgetting in continual learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard regularization in continual learning over-constrains models by blending learning and retention signals into a single gradient update. It proposes operator splitting instead: a learning step that minimizes only the current-task loss, followed by a proximal stability step that applies a sparse regularizer to identify and preserve task-relevant parameters while pruning redundant ones. This separation turns retention and plasticity into complementary, negotiated updates rather than conflicting forces. As a result, the approach improves both stability on prior tasks and adaptability to new ones without replay buffers, Bayesian sampling, or meta-learning.
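
For intuition, here is a minimal sketch of such a decoupled update in PyTorch: the learning step takes plain gradient steps on the current-task loss, and the stability step applies soft-thresholding (the proximal map of an ℓ1 penalty) to the per-task parameter change. The ℓ1-on-update choice, the helper names, and the hyperparameters are assumptions for illustration, not the paper's solver.

```python
import torch

def soft_threshold(u, tau):
    """Proximal map of tau * ||u||_1: shrinks entries toward zero, zeroing the small ones."""
    return torch.sign(u) * torch.clamp(u.abs() - tau, min=0.0)

def train_task(model, loader, loss_fn, lr=0.1, lam=1e-2):
    """One task under a decoupled two-step scheme (illustrative sketch).

    Learning step: gradient descent on the current-task loss only.
    Stability step: an l1 proximal update on the per-task change delta = x - x_prev,
    which prunes small, redundant updates and keeps the ones that carry the new task.
    """
    x_prev = [p.detach().clone() for p in model.parameters()]

    for inputs, targets in loader:
        # Learning step: no retention penalty appears in this loss.
        loss = loss_fn(model(inputs), targets)
        model.zero_grad()
        loss.backward()
        with torch.no_grad():
            for p in model.parameters():
                if p.grad is not None:
                    p -= lr * p.grad

        # Stability step: sparsify the update relative to the pre-task weights.
        with torch.no_grad():
            for p, p0 in zip(model.parameters(), x_prev):
                delta = p - p0
                p.copy_(p0 + soft_threshold(delta, lr * lam))
```

Updates shrunk to exactly zero leave the corresponding parameters at their pre-task values, which is how the stability step can preserve prior-task behaviour while pruning redundant changes.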

Core claim

The paper claims that applying operator splitting to the continual-learning objective separates optimization into a learning operator that minimizes the current-task loss and a proximal stability operator equipped with a sparse regularizer. The stability operator prunes unnecessary parameters and preserves those critical for previous tasks. This avoids the over-constraining that arises when the regularizer is added directly to the loss, and instead yields a negotiated update between the two operators.
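
The abstract does not spell out the update rule, but the figure captions identify the solver as Douglas-Rachford splitting (DRS). For reference, the textbook DRS iteration for an objective of the form min_x f_t(x) + λR(x), with f_t the current-task loss and R the sparse regularizer, is sketched below; this generic form is an assumption about how the paper instantiates the split, not a reproduction of its equations.

```latex
\begin{aligned}
  x^{k+1} &= \operatorname{prox}_{\gamma f_t}\bigl(y^{k}\bigr)
    && \text{learning operator: fit the current task}\\
  z^{k+1} &= \operatorname{prox}_{\gamma \lambda R}\bigl(2x^{k+1} - y^{k}\bigr)
    && \text{stability operator: sparsify and preserve}\\
  y^{k+1} &= y^{k} + z^{k+1} - x^{k+1}
    && \text{auxiliary update; at a fixed point } y^{k+1} = y^{k}
\end{aligned}
```

The "negotiated update" reading corresponds to the auxiliary sequence y^k: neither operator's output is applied alone, and convergence is reached when y^k stabilizes (the fixed-point behaviour Figure 14's caption alludes to).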

What carries the argument

Operator splitting between a task-learning operator minimizing current loss and a proximal stability operator with sparse regularization that prunes redundant parameters while retaining task-critical ones.

If this is right

  • Improves both stability on old tasks and adaptability to new tasks on standard benchmarks.
  • Removes the need for replay buffers, Bayesian sampling, or meta-learning components.
  • Supports better forward transfer and more efficient capacity use as the number of tasks increases.
  • Supplies theoretical justification for the splitting method applied to the continual-learning objective.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The operator separation could extend to other settings where optimization involves conflicting objectives, such as multi-task or federated learning.
  • The built-in sparse regularization may produce models that are naturally more compact, aiding inference on limited hardware.
  • Applying the same split to reinforcement learning continual settings could reduce interference between policy updates for successive environments.

Load-bearing premise

The proximal stability step with a sparse regularizer can reliably identify and preserve task-relevant parameters while pruning others across growing task sequences without introducing new instabilities or requiring task-specific tuning.
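
If the sparse regularizer is the ℓ1 norm (a common instantiation, assumed here since the abstract does not name it), the proximal stability step reduces to elementwise soft-thresholding, which makes explicit that the selection criterion in this premise is update magnitude:

```latex
\operatorname{prox}_{\tau\|\cdot\|_1}(v)_i \;=\; \operatorname{sign}(v_i)\,\max\bigl(|v_i| - \tau,\; 0\bigr)
```

Entries below the threshold τ are pruned exactly to zero; everything above survives, shrunk by τ. Whether magnitude is a reliable proxy for task relevance is precisely what the referee's first major comment questions.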

What would settle it

A long sequence of dissimilar tasks where the method exhibits rising forgetting rates or requires per-task hyperparameter adjustments comparable to standard regularization would refute the advantage of the decoupling.

Figures

Figures reproduced from arXiv: 2604.18857 by Eric Granger, Masoumeh Zareapoor, Pourya Shamsolmoali, William A. P. Smith, Yue Lu.

Figure 1: Stability-plasticity trade-off in continual learning. Left: Average accuracy vs. average forgetting on the label [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2: Catastrophic forgetting in CL can be understood as [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3: Illustration of the continual learner based on DRS in parameter update space (where [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4: Processing steps of the DRS continual learner. [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗
Figure 5: Alternating proximal operators between the task [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗
Figure 6: Average accuracy shows the plasticity-stability [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 7: Parameter update magnitudes (∆x) across 20 tasks. DRS produces sparse, selective updates per task, preserving most parameters (i.e., keeps many updates close to zero, which shows as dark blue regions). By contrast, EWC and SGD update a broader set of parameters per task, indicating weaker or no retention [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
Figure 8: Granular forgetting over 100 sequential tasks on CASIA-HWDB1.0. The plots illustrate the accumulated forgetting [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗
Figure 9: Evolution of parameter updates across tasks. The top row shows dense learning proposals from minimizing the [PITH_FULL_IMAGE:figures/full_fig_p009_9.png] view at source ↗
Figure 10: Comparison of parameter magnitude between DRS [PITH_FULL_IMAGE:figures/full_fig_p009_10.png] view at source ↗
Figure 11: Stability-plasticity in class-incremental learning. [PITH_FULL_IMAGE:figures/full_fig_p010_11.png] view at source ↗
Figure 12: Long-horizon plasticity analysis. Our model main [PITH_FULL_IMAGE:figures/full_fig_p010_12.png] view at source ↗
Figure 13: Selective DRS vs. Dense DRS on the fine-grained CUB-200 dataset. (Left) We report after-one-epoch accuracy across [PITH_FULL_IMAGE:figures/full_fig_p011_13.png] view at source ↗
Figure 14: Effect of sparsity constraint λ on average accuracy and forgetting on CIFAR-100. Forgetting decreases with larger λ, but with diminishing adaptation after λ > 10; learning continues with a consistent update magnitude but is restricted to a smaller subset of parameters. view at source ↗
read the original abstract

In continual learning, the primary challenge is to learn new information without forgetting old knowledge. A common solution addresses this trade-off through regularization, penalizing changes to parameters critical for previous tasks. In most cases, this regularization term is directly added to the training loss and optimized with standard gradient descent, which blends learning and retention signals into a single update and does not explicitly separate essential parameters from redundant ones. As task sequences grow, this coupling can over-constrain the model, limiting forward transfer and leading to inefficient use of capacity. We propose a different approach that separates task learning from stability enforcement via operator splitting. The learning step focuses on minimizing the current task loss, while a proximal stability step applies a sparse regularizer to prune unnecessary parameters and preserve task-relevant ones. This turns the stability-plasticity into a negotiated update between two complementary operators, rather than a conflicting gradient. We provide theoretical justification for the splitting method on the continual-learning objective, and demonstrate that our proposed solver achieves state-of-the-art results on standard benchmarks, improving both stability and adaptability without the need for replay buffers, Bayesian sampling, or meta-learning components.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces Proximal Decoupling for continual learning, using operator splitting to separate a learning step (minimizing current-task loss) from a proximal stability step that applies a sparse regularizer to prune redundant parameters while preserving those relevant to prior tasks. It claims this decouples the stability-plasticity trade-off, supplies theoretical justification for the splitting method on the continual-learning objective, and achieves SOTA results on standard benchmarks without replay buffers, Bayesian sampling, or meta-learning.

Significance. If the proximal operator reliably identifies task-relevant parameters in non-convex DNN losses, the method would provide a simple, buffer-free alternative to regularization-based continual learning that improves both stability and adaptability. The explicit separation of operators and the claimed theoretical grounding are strengths that could influence future work on scalable task sequences.

major comments (3)
  1. [Theoretical justification] Theoretical justification section: the derivation of the splitting method on the continual-learning objective must explicitly address non-convexity of the DNN loss and show why the proximal map (typically soft-thresholding for the sparse regularizer) aligns with functional task relevance rather than weight magnitude; otherwise the separation claim does not hold for growing task sequences.
  2. [Method and Experiments] Method and experimental sections: the central claim that the proximal stability step works without per-task tuning or new instabilities requires ablations demonstrating that the sparse regularizer strength can be fixed across tasks; misalignment between magnitude-based pruning and true importance would violate the no-replay, no-meta-learning SOTA result.
  3. [Experiments] Experimental evaluation: the SOTA claims on standard benchmarks need error bars, statistical significance tests, and analysis of capacity usage over long sequences to confirm that pruning does not cause unintended forgetting or wasted capacity.
minor comments (1)
  1. [Abstract and Method] Abstract and method: the description of the proximal operator and sparse regularizer would benefit from an explicit equation or pseudocode in the main text rather than relying solely on prose.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and will make targeted revisions to strengthen the theoretical and experimental sections.

read point-by-point responses
  1. Referee: [Theoretical justification] Theoretical justification section: the derivation of the splitting method on the continual-learning objective must explicitly address non-convexity of the DNN loss and show why the proximal map (typically soft-thresholding for the sparse regularizer) aligns with functional task relevance rather than weight magnitude; otherwise the separation claim does not hold for growing task sequences.

    Authors: We appreciate the referee's emphasis on rigor here. The derivation in the manuscript applies operator splitting directly to the continual-learning objective, separating the task-loss minimization from the proximal stability operator. Proximal methods are known to apply under non-convexity when the non-smooth term is handled separately, and we will revise the section to explicitly note this and reference supporting results from non-convex proximal optimization literature. On alignment with functional relevance, we will add clarification that the proximal map is magnitude-based but operates iteratively within the stability objective; this process retains parameters that support prior-task performance, as confirmed by our retention metrics. We view this as sufficient to uphold the separation claim for the evaluated task sequences. revision: partial

  2. Referee: [Method and Experiments] Method and experimental sections: the central claim that the proximal stability step works without per-task tuning or new instabilities requires ablations demonstrating that the sparse regularizer strength can be fixed across tasks; misalignment between magnitude-based pruning and true importance would violate the no-replay, no-meta-learning SOTA result.

    Authors: We agree that fixed hyperparameters are central to the method's appeal. The sparse regularizer strength was held constant across tasks in all reported experiments, with no per-task retuning or observed instabilities. We will add an explicit ablation subsection showing results under a single fixed strength value, confirming stable performance. Regarding potential misalignment, the proximal step enforces sparsity based on the stability objective rather than isolated magnitude; the achieved SOTA results without replay, Bayesian methods, or meta-learning demonstrate that this does not produce the violation described. revision: yes

  3. Referee: [Experiments] Experimental evaluation: the SOTA claims on standard benchmarks need error bars, statistical significance tests, and analysis of capacity usage over long sequences to confirm that pruning does not cause unintended forgetting or wasted capacity.

    Authors: We will update the experimental section to report error bars from multiple random seeds and include statistical significance tests (e.g., Wilcoxon signed-rank tests) against baselines to support the SOTA claims. We will also add a capacity-usage analysis tracking the fraction of active parameters after each proximal step and verifying no unintended forgetting on the standard benchmarks. While our evaluation follows the sequence lengths conventional in the literature, we will discuss scalability implications for longer sequences. revision: yes
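
As a concreteness check on what these promised additions involve, the sketch below computes the two quantities in question: the fraction of active (non-pruned) parameters after a proximal step, and a Wilcoxon signed-rank test over paired per-seed accuracies. The function name and the numbers are illustrative placeholders, not results from the paper.

```python
import numpy as np
from scipy import stats

def active_fraction(params, tol=1e-8):
    """Fraction of parameters whose magnitude exceeds tol after the proximal step."""
    flat = np.concatenate([np.ravel(p) for p in params])
    return float(np.mean(np.abs(flat) > tol))

# Paired per-seed average accuracies for the proposed method vs. one baseline
# (placeholder numbers for illustration only).
ours     = np.array([71.2, 70.8, 71.5, 70.9, 71.1])
baseline = np.array([68.9, 69.4, 69.1, 68.7, 69.0])

stat, p_value = stats.wilcoxon(ours, baseline)
print(f"Wilcoxon signed-rank: statistic={stat}, p={p_value:.4f}")
```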

Circularity Check

0 steps flagged

No significant circularity; derivation presented as independent solver

full rationale

The paper introduces operator splitting to decouple task learning (current loss minimization) from proximal stability enforcement (sparse regularizer on parameters) for continual learning. This is framed as a new solver with claimed theoretical justification on the CL objective and empirical SOTA results on benchmarks, without replay or meta-learning. No equations or steps in the abstract or description reduce a prediction or first-principles result to its own inputs by construction, nor rely on self-citation chains, ansatz smuggling, or renaming of known results as the core derivation. The approach is self-contained against external benchmarks and does not exhibit self-definitional or fitted-input patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities are stated. The method implicitly relies on standard proximal operator properties and sparse regularization assumptions common in optimization literature.

axioms (1)
  • domain assumption: Proximal operators for sparse regularization can separate stability enforcement from task loss minimization without introducing instability
    Invoked by the claim that the stability step prunes unnecessary parameters while preserving relevant ones.

pith-pipeline@v0.9.0 · 5510 in / 1129 out tokens · 24051 ms · 2026-05-10T04:47:59.069103+00:00 · methodology

discussion (0)

