Exemplar-Free Continual Learning for State Space Models

Isaac Ning Lee; Leila Mahmoodi; Mehrtash Harandi; Trung Le

arxiv: 2505.18604 · v3 · pith:Z32473H7new · submitted 2025-05-24 · 💻 cs.LG

Pith reviewed 2026-05-19 13:22 UTC · model grok-4.3

classification 💻 cs.LG

keywords continual learning · state space models · exemplar-free · Grassmannian · Sylvester equation · catastrophic forgetting · observability subspace

The pith

Inf-SSM curbs forgetting in state-space models by regularizing infinite-horizon state evolution on the Grassmannian without storing old data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Inf-SSM as a geometry-aware regularization for state-space models adapting to new tasks in sequence without access to prior examples. It works by constraining the infinite-horizon evolution of the models through their extended observability subspace on the infinite-dimensional Grassmannian rather than directly limiting parameter changes. This regularization is enforced by solving a Sylvester equation, and the authors derive a structured O(n^2) solver that exploits SSM properties to keep the overhead low. The approach integrates with existing continual learning techniques and is tested on image benchmarks presented as task sequences. Results indicate less forgetting and higher accuracy on later tasks compared to standard updates.

Core claim

Inf-SSM is a novel regularization method that utilizes the geometry of the infinite-dimensional Grassmannian to constrain the infinite-horizon evolution of SSMs encoded in their extended observability subspace. Unlike classical continual learning methods that constrain weight updates, Inf-SSM directly regularizes state evolution during adaptation. Enforcing this regularization requires solving a matrix equation known as the Sylvester equation, which the authors solve in O(n^2) time by exploiting the structure and properties of SSMs. This produces an efficient mechanism that integrates into existing CL methods and yields reduced forgetting with improved accuracy across sequential tasks on the

What carries the argument

The geometry of the infinite-dimensional Grassmannian applied to the extended observability subspace of an SSM, enforced by an efficient structured solver for the Sylvester equation.
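
For orientation, the objects named above can be written out in standard linear-systems notation. This is a sketch, assuming a discrete-time SSM with state matrix $A$ and output matrix $C$ whose spectral radius is below one; the symbols $\mathcal{O}_\infty$ and $G$ are introduced here for illustration and may not match the paper's own notation.

```latex
\mathcal{O}_\infty(A, C) =
\begin{bmatrix} C \\ CA \\ CA^{2} \\ \vdots \end{bmatrix},
\qquad
G \;=\; \mathcal{O}_\infty(A, C)^{\top}\, \mathcal{O}_\infty(A', C')
  \;=\; \sum_{t=0}^{\infty} (A^{\top})^{t}\, C^{\top} C'\, (A')^{t},
\qquad
G - A^{\top} G A' \;=\; C^{\top} C'.
```

The last identity follows because multiplying the series by $A^{\top}$ on the left and $A'$ on the right shifts every term by one step, so only the $t = 0$ term survives the subtraction. It is a Sylvester-type (Stein) equation, which is presumably where the solver mentioned above enters.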

If this is right

The regularization can be added to existing continual learning pipelines for SSMs with minimal extra cost.
Forgetting decreases and accuracy rises on sequential image classification benchmarks such as ImageNet-R and Caltech-256.
The O(n^2) Sylvester solver makes the method scalable for larger state dimensions.
State evolution is controlled without replay buffers or stored exemplars.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same geometric idea on observability subspaces might apply to other linear recurrent architectures that admit similar infinite-horizon representations.
Controlling state geometry rather than weights could become a general principle for stable lifelong adaptation in sequence models.
The method invites direct tests on non-image sequences such as time-series forecasting or language streams to check domain breadth.

Load-bearing premise

That geometric regularization on the extended observability subspace via the infinite-dimensional Grassmannian sufficiently constrains SSM state evolution to prevent catastrophic forgetting without any exemplars.

What would settle it

Apply Inf-SSM to a sequence of tasks on a held-out benchmark and measure whether retention of earlier task performance shows no improvement over a baseline SSM updated without the Grassmannian term.
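
A test of this kind needs a concrete retention metric. The sketch below computes final average accuracy and a forgetting measure from a task-by-task accuracy matrix, using the definitions common in the continual-learning literature; the paper's exact definitions may differ, and the function names and the toy matrix are hypothetical placeholders, not results from the paper.

```python
import numpy as np

def average_accuracy(acc):
    """Mean accuracy over all tasks after training on the last task.

    acc[i][j] is the accuracy on task j after training on task i.
    """
    acc = np.asarray(acc, dtype=float)
    return float(acc[-1].mean())

def forgetting(acc):
    """Mean drop from each earlier task's best past accuracy to its final accuracy."""
    acc = np.asarray(acc, dtype=float)
    T = acc.shape[0]
    drops = []
    for j in range(T - 1):
        # Best accuracy on task j from the stage it was learned up to the
        # stage before the final one.
        best_past = acc[j:T - 1, j].max()
        drops.append(best_past - acc[-1, j])
    return float(np.mean(drops)) if drops else 0.0

# Made-up 3-task accuracy matrix, for illustration only.
toy = [[0.90, 0.00, 0.00],
       [0.80, 0.70, 0.00],
       [0.60, 0.65, 0.50]]
print(average_accuracy(toy), forgetting(toy))
```

Running the same computation for a model trained with and without the Grassmannian term, on identical task splits and seeds, gives the comparison described above.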

Figures

Figures reproduced from arXiv: 2505.18604 by Isaac Ning Lee, Leila Mahmoodi, Mehrtash Harandi, Trung Le.

**Figure 1.** Left: Average performance gain from integrating Inf-SSM into representative CIL baselines (ER [51], LUCIR [24], X-DER [5], L2P-R [59], CLFD [37]) on 5-task (5T) and 10-task (10T) benchmarks built from ImageNet-R (INR) [23], CIFAR-100 (CIF) [31], and Caltech-256 (CAL) [17]. Center: Comparison of Inf-SSM with exemplar-free CIL methods (EWC [29], SI [61], MAS [3], LwF [33]) when regularizing (A, C) on INR. Ri…

**Figure 2.** SSM at each sequence position τ is characterized by the infinite-horizon observability subspace O∞ defined by the tuple (Ã, C̃), and visualized as a trajectory to an infinite horizon. The colored plane represents the complete set of O∞. Each trajectory is mapped to a point on the Grassmannian, and the pairwise distance d_Gr is illustrated as the geodesic on the sphere representing the Grassmann manifold …

**Figure 4.** CKD analysis on Vim-small with ImageNet-R and CUB-200-2011 over 10 tasks, EFCIL

Original abstract

State-Space Models (SSMs) excel at capturing long-range dependencies with structured recurrence, making them well-suited for sequence modeling. However, their evolving internal states pose challenges in adapting them under Continual Learning (CL). This is particularly difficult in exemplar-free settings, where the absence of prior data leaves updates to the dynamic SSM states unconstrained, resulting in catastrophic forgetting. To address this, we propose Inf-SSM, a novel and simple geometry-aware regularization method that utilizes the geometry of the infinite-dimensional Grassmannian to constrain state evolution during CL. Unlike classical continual learning methods that constrain weight updates, Inf-SSM regularizes the infinite-horizon evolution of SSMs encoded in their extended observability subspace. We show that enforcing this regularization requires solving a matrix equation known as the Sylvester equation, which typically incurs $\mathcal{O}(n^3)$ complexity. We develop a $\mathcal{O}(n^2)$ solution by exploiting the structure and properties of SSMs. This leads to an efficient regularization mechanism that can be seamlessly integrated into existing CL methods. Comprehensive experiments on challenging benchmarks, including ImageNet-R and Caltech-256, demonstrate a significant reduction in forgetting while improving accuracy across sequential tasks.
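
To make the complexity claim concrete, here is a minimal sketch of how a Sylvester-type equation becomes an element-wise division when the state matrices are diagonal. It assumes the discrete-time (Stein) form G − AᵀGA′ = CᵀC′ with diagonal A and A′ and a single output row; these assumptions, and the function names `stein_diag` and `stein_truncated`, are illustrative and not taken from the paper.

```python
import numpy as np

def stein_diag(a, a_new, C, C_new):
    """Solve G - diag(a) G diag(a_new) = C^T C_new for diagonal state matrices.

    a, a_new : (n,) diagonals of the two state matrices, with |a_i * a'_j| < 1
    C, C_new : (p, n) output matrices
    """
    rhs = C.T @ C_new                  # (n, n) right-hand side
    denom = 1.0 - np.outer(a, a_new)   # entry (i, j) is 1 - a_i * a'_j
    return rhs / denom                 # element-wise division: n^2 operations

def stein_truncated(a, a_new, C, C_new, horizon=2000):
    """Reference value: partial sum of sum_t (A^T)^t C^T C' (A')^t."""
    G = np.zeros((a.size, a_new.size))
    term = C.T @ C_new
    for _ in range(horizon):
        G += term
        term = a[:, None] * term * a_new[None, :]  # A^T term A' for diagonal A, A'
    return G

rng = np.random.default_rng(0)
n, p = 6, 1
a = rng.uniform(0.1, 0.9, n)
a_new = rng.uniform(0.1, 0.9, n)
C = rng.standard_normal((p, n))
C_new = rng.standard_normal((p, n))

G_fast = stein_diag(a, a_new, C, C_new)
G_ref = stein_truncated(a, a_new, C, C_new)
print(np.allclose(G_fast, G_ref))
```

A general-purpose dense solver such as Bartels-Stewart [4] works through Schur factorizations, which is the source of the cubic cost; the diagonal structure removes that step entirely.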

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Inf-SSM regularizes SSM observability subspaces on the infinite-dimensional Grassmannian with an O(n^2) Sylvester solver for exemplar-free CL, but the link from subspace distance to actual forgetting reduction remains an assumption.

The paper's main move is to regularize the extended observability subspace of an SSM using the geometry of the infinite-dimensional Grassmannian instead of replaying data or penalizing weights. They derive a Sylvester equation from that constraint and then exploit SSM structure to solve it in quadratic rather than cubic time. That combination is the actual novelty; it does not collapse to standard weight-regularization or replay tricks cited in the abstract. The efficient solver is a practical win and could be dropped into existing training loops without much overhead.

The experiments on ImageNet-R and Caltech-256 are reported to show lower forgetting and higher accuracy, which is the intended payoff. The central assumption is that keeping the subspace close on the Grassmannian will keep the infinite-horizon state evolution from drifting enough to erase prior task performance. That step is the softest part. The abstract gives no explicit contraction or invariance argument showing that the geometric distance controls the input-driven recurrence of a finite-dimensional SSM, and exemplar-free settings leave the input sequence itself free. Without ablations that isolate the Grassmannian term or error bars that separate it from other factors, it is hard to judge how much of the reported gain comes from the geometry versus regularization strength in general.

The work is aimed at people who already use SSMs for long sequences and need to adapt them without storing past examples. A reader who follows geometric methods for linear systems or who works on replay-free CL will get the most out of the construction and the solver. It is worth sending to peer review because the idea is distinct, the complexity improvement is verifiable, and the empirical claims can be checked with standard controls.

Referee Report

2 major / 3 minor

Summary. The paper proposes Inf-SSM, a geometry-aware regularization technique for exemplar-free continual learning with State Space Models. It constrains the infinite-horizon state evolution by regularizing the extended observability subspace on the infinite-dimensional Grassmannian; the resulting regularization is realized by solving a structured Sylvester equation, for which the authors derive an O(n²) algorithm that exploits SSM structure. Experiments on ImageNet-R and Caltech-256 are reported to show reduced forgetting together with higher accuracy across sequential tasks.

Significance. If the geometric constraint on the observability subspace indeed limits forgetting without exemplars, the work would supply a principled, parameter-light alternative to weight-regularization methods for recurrent sequence models. The O(n²) Sylvester solver that exploits SSM structure is a concrete technical contribution that could be reused beyond continual learning.

major comments (2)

[§3.2] §3.2 (Geometric Regularization): The central claim that Grassmannian distance on the extended observability subspace bounds the actual finite-dimensional state trajectory under new-task inputs is not supported by an explicit invariance or contraction argument. The Sylvester-equation solution aligns subspaces but leaves the input-dependent recurrence free; without a quantitative link showing that this alignment prevents erasure of prior dynamics, the exemplar-free guarantee remains unproven.
[§4] §4 (Experiments): The reported accuracy gains and forgetting reductions on ImageNet-R and Caltech-256 lack error bars, number of random seeds, or statistical significance tests. It is also unclear whether the baselines include SSM-adapted versions of standard exemplar-free methods (EWC, SI, etc.) or only generic CL baselines; without these controls the attribution of gains to the Grassmannian regularizer cannot be verified.

minor comments (3)

[§2] The precise definition of the 'extended observability subspace' for a finite-order SSM should be given explicitly (with an equation) before the infinite-dimensional Grassmannian construction is introduced.
[§4] Figure captions and axis labels in the experimental plots should state the number of tasks, the class-incremental protocol, and whether results are averaged over multiple runs.
[§3.3] A short discussion of the assumptions on the SSM matrices (diagonal, companion, etc.) that enable the O(n²) Sylvester reduction would help readers assess the generality of the complexity claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment below with clarifications and indicate the revisions we will make to improve the manuscript.

Referee: [§3.2] §3.2 (Geometric Regularization): The central claim that Grassmannian distance on the extended observability subspace bounds the actual finite-dimensional state trajectory under new-task inputs is not supported by an explicit invariance or contraction argument. The Sylvester-equation solution aligns subspaces but leaves the input-dependent recurrence free; without a quantitative link showing that this alignment prevents erasure of prior dynamics, the exemplar-free guarantee remains unproven.

Authors: We agree that an explicit quantitative link would strengthen the theoretical foundation. In the revised manuscript we will insert a new lemma establishing that the Grassmannian distance between extended observability subspaces induces a bound on the difference of finite-horizon state trajectories under bounded inputs. The proof exploits the stability of the SSM recurrence and the fact that the observability operator determines the reachable state evolution; the O(n²) Sylvester solver is used to enforce the subspace alignment that yields the contraction. This addition directly addresses the missing invariance argument. revision: yes
Referee: [§4] §4 (Experiments): The reported accuracy gains and forgetting reductions on ImageNet-R and Caltech-256 lack error bars, number of random seeds, or statistical significance tests. It is also unclear whether the baselines include SSM-adapted versions of standard exemplar-free methods (EWC, SI, etc.) or only generic CL baselines; without these controls the attribution of gains to the Grassmannian regularizer cannot be verified.

Authors: We concur that the experimental section requires additional rigor. We will augment the results with error bars computed over five independent random seeds, report standard deviations, and include paired t-tests for statistical significance. We will also add SSM-adapted baselines in which EWC and SI regularization terms are applied directly to the SSM parameters (A, B, C, D), thereby isolating the contribution of the Grassmannian regularizer from generic continual-learning techniques. revision: yes
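
As a point of reference for the SSM-adapted baselines mentioned in this response, the sketch below shows the standard EWC quadratic penalty applied to a dictionary of named parameters. It assumes a diagonal Fisher estimate has already been computed on the previous task; the dictionary keys, the function name `ewc_penalty`, and the numbers are hypothetical and are not the authors' implementation.

```python
import numpy as np

def ewc_penalty(params, anchor, fisher, lam=1.0):
    """Return (lam / 2) * sum_k F_k * (theta_k - theta*_k)^2 over all parameters.

    params : dict of current parameter arrays, e.g. {"A": ..., "C": ...}
    anchor : dict of parameter values saved after the previous task
    fisher : dict of diagonal Fisher estimates, same shapes as params
    """
    total = 0.0
    for name, theta in params.items():
        total += np.sum(fisher[name] * (theta - anchor[name]) ** 2)
    return 0.5 * lam * float(total)

# Made-up values, for illustration only.
params = {"A": np.array([1.0, 2.0]), "C": np.array([0.5])}
anchor = {"A": np.array([0.0, 0.0]), "C": np.array([0.5])}
fisher = {"A": np.array([1.0, 0.5]), "C": np.array([2.0])}
print(ewc_penalty(params, anchor, fisher, lam=2.0))
```

A penalty of this form acts on each weight independently, which is the contrast with a regularizer defined on the observability subspace as a whole.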

Circularity Check

0 steps flagged

No significant circularity; derivation uses standard linear algebra and geometry independent of fitted outputs

The paper derives Inf-SSM regularization by mapping SSM state evolution to the extended observability subspace on the infinite-dimensional Grassmannian, then enforcing a distance constraint that reduces to solving the Sylvester equation. This equation and its O(n^2) solver are obtained by exploiting the structured matrices of SSMs (A, B, C) rather than by fitting to CL performance metrics or redefining the target forgetting measure. No load-bearing step equates the final regularization term to a quantity defined inside the paper's own training loop or self-cited uniqueness result. The central claim therefore remains a geometric construction applied to an existing model class, with empirical results on ImageNet-R and Caltech-256 serving as external validation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that Grassmannian geometry on the observability subspace controls forgetting; the Sylvester equation is invoked as standard mathematics with no free parameters or new postulated entities introduced.

axioms (1)

domain assumption: The geometry of the infinite-dimensional Grassmannian can be used to constrain the evolution of SSM states during continual learning.
This premise directly motivates the regularization and is invoked when the authors state that they regularize the infinite-horizon evolution encoded in the extended observability subspace.

pith-pipeline@v0.9.0 · 5745 in / 1285 out tokens · 54940 ms · 2026-05-19T13:22:51.013818+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

We propose Inf-SSM, a geometry-aware regularization framework … regularizes the infinite-horizon evolution of SSMs encoded in their extended observability subspace … solving a matrix equation known as the Sylvester equation … O(n²) solution by exploiting the structure … (A,C) … chordal distance d²_chord(S,S′) = ∥SS⊤ − S′S′⊤∥²_F
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

Theorem 3 (Invariance of the S∞ under P-equivalence) … S∞(A′,C′) = S∞(A,C)
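
The two quoted passages can be illustrated numerically. The sketch below uses a finite-horizon truncation of the observability matrix as a stand-in for the infinite-horizon object, and assumes that P-equivalence means a change of state basis A′ = P⁻¹AP, C′ = CP; both are assumptions made here, and the helper names are hypothetical. Under that reading the column space is unchanged, so the quoted chordal distance should vanish.

```python
import numpy as np

def truncated_observability(A, C, horizon):
    """Stack C, CA, ..., CA^(horizon-1) into a (horizon * p, n) matrix."""
    blocks, M = [], C.copy()
    for _ in range(horizon):
        blocks.append(M)
        M = M @ A
    return np.vstack(blocks)

def chordal_sq(O1, O2):
    """Squared chordal distance ||S S^T - S' S'^T||_F^2 between column spaces."""
    S1, _ = np.linalg.qr(O1)  # orthonormal basis of span(O1)
    S2, _ = np.linalg.qr(O2)  # orthonormal basis of span(O2)
    return np.linalg.norm(S1 @ S1.T - S2 @ S2.T, "fro") ** 2

rng = np.random.default_rng(0)
n, p, T = 4, 2, 30
A = np.diag([0.9, 0.6, -0.5, 0.2])
C = rng.standard_normal((p, n))

# Invertible change of basis: orthogonal factor times a positive diagonal.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
P = Q @ np.diag([1.0, 2.0, 3.0, 4.0])
A_eq = np.linalg.inv(P) @ A @ P
C_eq = C @ P

# A genuinely different system with the same output matrix.
A_other = np.diag([0.8, 0.5, -0.4, 0.3])

O = truncated_observability(A, C, T)
d_same = chordal_sq(O, truncated_observability(A_eq, C_eq, T))
d_diff = chordal_sq(O, truncated_observability(A_other, C, T))
print(d_same, d_diff)
```

The first distance is zero up to rounding because the transformed observability matrix equals the original multiplied by P on the right; the second is strictly positive because the dynamics themselves changed.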

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · 2 internal anchors

[1] B. Afsari and R. Vidal. The alignment distance on spaces of linear dynamical systems. In 52nd IEEE Conference on Decision and Control, pages 1162–1167, 2013. doi: 10.1109/CDC.2013.6760039
[2] H. Ahn, S. Cha, D. Lee, and T. Moon. Uncertainty-based continual learning with adaptive regularization. Advances in neural information processing systems, 32, 2019
[3] R. Aljundi, F. Babiloni, M. Elhoseiny, M. Rohrbach, and T. Tuytelaars. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European conference on computer vision (ECCV), pages 139–154, 2018
[4] R. H. Bartels and G. W. Stewart. Solution of the matrix equation AX + XB = C. Communications of the ACM, 15(9):820–826, 1972. doi: 10.1145/355607.362840
[5] M. Boschini, L. Bonicelli, P. Buzzega, A. Porrello, and S. Calderara. Class-incremental continual learning into the extended der-verse. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022
[6] Z. Chen and B. Liu. Lifelong machine learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 12(3):1–207, 2018
[7] D. Cheng, Y. Lu, L. He, S. Zhang, X. Yang, N. Wang, and X. Gao. Mamba-CL: Optimizing selective state space model in null space for continual learning. arXiv preprint arXiv:2411.15469, 2024
[8] T. Dao and A. Gu. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. In Forty-first International Conference on Machine Learning, 2024
[9] K. De Cock and B. De Moor. Subspace angles between ARMA models. Systems & Control Letters, 46(4):265–270, 2002
[10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009. doi: 10.1109/CVPR.2009.5206848
[11] P. Dhar, R. V. Singh, K.-C. Peng, Z. Wu, and R. Chellappa. Learning without memorizing. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5138–5146, 2019
[12] I. S. Dhillon, J. R. Heath, T. Strohmer, and J. A. Tropp. Constructing packings in grassmannian manifolds via alternating projection. Experimental mathematics, 17(1):9–35, 2008
[13] G. Doretto, A. Chiuso, Y. N. Wu, and S. Soatto. Dynamic textures. International journal of computer vision, 51:91–109, 2003
[14] G. Golub, S. Nash, and C. Van Loan. A Hessenberg-Schur method for the problem AX + XB = C. IEEE Transactions on Automatic Control, 24(6):909–913, 1979. doi: 10.1109/TAC.1979.1102170
[15] A. Gomez-Villa, D. Goswami, K. Wang, A. D. Bagdanov, B. Twardowski, and J. van de Weijer. Exemplar-free continual representation learning via learnable drift compensation. In European Conference on Computer Vision, pages 473–490. Springer, 2024
[16] A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In International conference on algorithmic learning theory, pages 63–77. Springer, 2005
[17] G. Griffin, A. Holub, and P. Perona. Caltech 256, Apr 2022
[18] A. Gu and T. Dao. Mamba: Linear-time sequence modeling with selective state spaces. In First Conference on Language Modeling, 2024
[19] A. Gu, K. Goel, A. Gupta, and C. Ré. On the parameterization and initialization of diagonal state space models. Advances in Neural Information Processing Systems, 35:35971–35983, 2022
[20] A. Gu, K. Goel, and C. Re. Efficiently modeling long sequences with structured state spaces. In International Conference on Learning Representations, 2022
[21] A. Gupta, A. Gu, and J. Berant. Diagonal state spaces are as effective as structured state spaces. Advances in Neural Information Processing Systems, 35:22982–22994, 2022
[22] A. Hatamizadeh and J. Kautz. Mambavision: A hybrid mamba-transformer vision backbone. arXiv preprint arXiv:2407.08083, 2024
[23] D. Hendrycks, S. Basart, N. Mu, S. Kadavath, F. Wang, E. Dorundo, R. Desai, T. Zhu, S. Parajuli, M. Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF international conference on computer vision, pages 8340–8349, 2021
[24] S. Hou, X. Pan, C. C. Loy, Z. Wang, and D. Lin. Learning a unified classifier incrementally via rebalancing. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 831–839, 2019
[25] Y.-C. Hsu, Y.-C. Liu, A. Ramasamy, and Z. Kira. Re-evaluating continual learning scenarios: A categorization and case for strong baselines. arXiv preprint arXiv:1810.12488, 2018
[26] W. Huang, F. Sun, L. Cao, D. Zhao, H. Liu, and M. Harandi. Sparse coding and dictionary learning with linear dynamical systems. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3938–3947, 2016
[27] W. Huang, M. Harandi, T. Zhang, L. Fan, F. Sun, and J. Huang. Efficient optimization for linear dynamical systems with applications to clustering and sparse coding. Advances in neural information processing systems, 30, 2017
[28] A. Iscen, J. Zhang, S. Lazebnik, and C. Schmid. Memory-efficient incremental learning through feature adaptation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVI 16, pages 699–715. Springer, 2020
[29] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017
[30] S. Kornblith, M. Norouzi, H. Lee, and G. Hinton. Similarity of neural network representations revisited. In International conference on machine learning, pages 3519–3529. PMLR, 2019
[31] A. Krizhevsky. Learning multiple layers of features from tiny images. Master's thesis, University of Toronto, 2009
[32] X. Li, Y. Yang, J. Wu, B. Ghanem, L. Nie, and M. Zhang. Mamba-fscil: Dynamic adaptation with selective state space model for few-shot class-incremental learning. CoRR, abs/2407.06136, 2024
[33] Z. Li and D. Hoiem. Learning without forgetting. IEEE transactions on pattern analysis and machine intelligence, 40(12):2935–2947, 2017
[34] J. Liang, B. Meyer, I. N. Lee, and T.-T. Do. Self-supervised learning for acoustic few-shot classification. In ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025
[35] Y.-S. Liang and W.-J. Li. Inflora: Interference-free low-rank adaptation for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23638–23647, 2024
[36] J. Liu, M. Liu, Z. Wang, P. An, X. Li, K. Zhou, S. Yang, R. Zhang, Y. Guo, and S. Zhang. Robomamba: Efficient vision-language-action model for robotic reasoning and manipulation. Advances in Neural Information Processing Systems, 37:40085–40110, 2024
[37] R. Liu, B. Diao, L. Huang, Z. An, Z. An, and Y. Xu. Continual learning in the frequency domain. Advances in Neural Information Processing Systems, 37:85389–85411, 2024
[38] Y. Liu, Y. Tian, Y. Zhao, H. Yu, L. Xie, Y. Wang, Q. Ye, J. Jiao, and Y. Liu. Vmamba: Visual state space model. Advances in neural information processing systems, 37:103031–103063, 2025
[39] E. S. Lubana, P. Trivedi, D. Koutra, and R. Dick. How do quadratic regularizers prevent catastrophic forgetting: The role of interpolation. In Conference on Lifelong Learning Agents, pages 819–837. PMLR, 2022
[40] S. Magistri, T. Trinci, A. Soutif-Cormerais, J. van de Weijer, and A. D. Bagdanov. Elastic feature consolidation for cold start exemplar-free incremental learning. In ICLR, 2024
[41] L. Mahmoodi, M. Harandi, and P. Moghadam. Flashback for continual learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, pages 3434–3443, October 2023
[42] L. Mahmoodi, P. Moghadam, M. Hayat, C. Simon, and M. Harandi. Flashbacks to harmonize stability and plasticity in continual learning. Neural Networks, 190:107616, 2025
[43] R. J. Martin. A metric for ARMA processes. IEEE transactions on Signal Processing, 48(4):1164–1170, 2000
[44] J. L. McClelland, B. L. McNaughton, and R. C. O'Reilly. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychological review, 102(3):419, 1995
[45] M. McCloskey and N. J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation, volume 24, pages 109–165. Elsevier, 1989
[46] Z. Meng, J. Zhang, C. Yang, Z. Zhan, P. Zhao, and Y. Wang. Diffclass: Diffusion-based class incremental learning. In European Conference on Computer Vision, pages 142–159. Springer, 2024
[47] G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter. Continual lifelong learning with neural networks: A review. Neural networks, 113:54–71, 2019
[48] H. Phung, Q. Dao, T. Dao, V. H. Phan, D. Metaxas, and A. Tran. Dimsum: Diffusion mamba-a scalable and unified spatial-frequency method for image generation. Advances in Neural Information Processing Systems, 37:32947–32979, 2024
[49] A. Ravichandran, R. Chaudhry, and R. Vidal. Categorizing dynamic textures using a bag of dynamical systems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(2):342–353, 2012
[50] S.-A. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert. icarl: Incremental classifier and representation learning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 2001–2010, 2017
[51] M. Riemer, I. Cases, R. Ajemian, M. Liu, I. Rish, Y. Tu, and G. Tesauro. Learning to learn without forgetting by maximizing transfer and minimizing interference. In International Conference on Learning Representations, 2019
[52] H. Shin, J. K. Lee, J. Kim, and J. Kim. Continual learning with deep generative replay. Advances in neural information processing systems, 30, 2017
[53] C. Simon, P. Koniusz, and M. Harandi. On learning the geodesic path for incremental learning. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 1591–1600, 2021
[54] P. Turaga, A. Veeraraghavan, A. Srivastava, and R. Chellappa. Statistical computations on grassmann and stiefel manifolds for image and video-based recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(11):2273–2286, 2011
[55] G. M. Van de Ven and A. S. Tolias. Three scenarios for continual learning. arXiv preprint arXiv:1904.07734, 2019
[56] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017
[57] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The caltech-ucsd birds-200-2011 dataset. Jul 2011
[58] L. Wang, X. Zhang, H. Su, and J. Zhu. A comprehensive survey of continual learning: Theory, method and application. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8):5362–5383, 2024. doi: 10.1109/TPAMI.2024.3367329
[59] Z. Wang, Z. Zhang, C.-Y. Lee, H. Zhang, R. Sun, X. Ren, G. Su, V. Perot, J. Dy, and T. Pfister. Learning to prompt for continual learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 139–149, 2022
[60] K. Ye and L.-H. Lim. Schubert varieties and distances between subspaces of different dimensions. SIAM Journal on Matrix Analysis and Applications, 37(3):1176–1197, 2016
[61] F. Zenke, B. Poole, and S. Ganguli. Continual learning through synaptic intelligence. In International conference on machine learning, pages 3987–3995. PMLR, 2017
[62] Z. Zhang, A. Liu, I. Reid, R. Hartley, B. Zhuang, and H. Tang. Motion mamba: Efficient and long sequence motion generation. In European Conference on Computer Vision, pages 265–282. Springer, 2024
[63] C. Zhao and D. Gong. Learning mamba as a continual learner. CoRR, abs/2412.00776, 2024
[64] K. Zhu, W. Zhai, Y. Cao, J. Luo, and Z.-J. Zha. Self-sustaining representation expansion for non-exemplar class-incremental learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9296–9305, 2022
[65] L. Zhu, B. Liao, Q. Zhang, X. Wang, W. Liu, and X. Wang. Vision mamba: Efficient visual representation learning with bidirectional state space model. In Forty-first International Conference on Machine Learning, 2024
[66] Matrix multiplication $\tilde{C}^{\top}\tilde{C}$
[67] Matrix multiplication $\tilde{A}_{\mathrm{diag}}\tilde{A}'^{\top}_{\mathrm{diag}}$
[68] Subtraction $\mathbf{1}_n - \tilde{A}_{\mathrm{diag}}\tilde{A}'^{\top}_{\mathrm{diag}}$
[69] Element-wise division of $\tilde{C}^{\top}\tilde{C}$ over $\mathbf{1}_n - \tilde{A}_{\mathrm{diag}}\tilde{A}'^{\top}_{\mathrm{diag}}$. Note that ⊙ and the element-wise reciprocal could be combined into a single step by simply dividing $\tilde{C}^{\top}\tilde{C}'$ element-wise by $\mathbf{1}_n - \tilde{A}_{\mathrm{diag}}\tilde{A}'^{\top}_{\mathrm{diag}}$. Thus, each outlined step has a FLOP count of $n^2$, and hence the total FLOP count is $4n^2$, with computational complexity of $\mathcal{O}(n^2)$. Hence, this reduces ...
[70] Note: for readability, B refers to $\tilde{B}$ derived in §D.3. H.5 Vim-tiny: To validate that Inf-SSM is adaptable to different model sizes, especially small models, we validated Inf-SSM along with the EFCIL baselines EWC, SI, MAS, and LwF on Vim-tiny. Vim-tiny has only 7M parameters, compared with 26M for Vim-small [65]. Table 10: AA (% ↑), AIA (% ↑), and FM (% ↓) of...
[71] Represent the main CL paradigms used in CIL or EFCIL
[72] Can be adapted to the Vim backbone without ad hoc, architecture-specific redesign. We therefore adopt the following baselines, all implemented on Vim-Small under a shared training protocol (datasets, task splits, and metrics are described in §5, §G.2, and §J.2). For replay-based methods: • ER [51] is a canonical replay-based baseline. It measures how much ...

[1] B. Afsari and R. Vidal. The alignment distance on spaces of linear dynamical systems. In 52nd IEEE Conference on Decision and Control, pages 1162–1167, 2013. doi: 10.1109/CDC.2013.6760039.

[2] H. Ahn, S. Cha, D. Lee, and T. Moon. Uncertainty-based continual learning with adaptive regularization. Advances in neural information processing systems, 32, 2019.

[3] R. Aljundi, F. Babiloni, M. Elhoseiny, M. Rohrbach, and T. Tuytelaars. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European conference on computer vision (ECCV), pages 139–154, 2018.

[4] R. H. Bartels and G. W. Stewart. Solution of the matrix equation AX + XB = C. Communications of the ACM, 15(9):820–826, 1972. doi: 10.1145/355607.362840.

[5] M. Boschini, L. Bonicelli, P. Buzzega, A. Porrello, and S. Calderara. Class-incremental continual learning into the extended der-verse. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.

[6] Z. Chen and B. Liu. Lifelong machine learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 12(3):1–207, 2018.

[7] D. Cheng, Y. Lu, L. He, S. Zhang, X. Yang, N. Wang, and X. Gao. Mamba-CL: Optimizing selective state space model in null space for continual learning. arXiv preprint arXiv:2411.15469, 2024.

[8] T. Dao and A. Gu. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. In Forty-first International Conference on Machine Learning, 2024.

[9] K. De Cock and B. De Moor. Subspace angles between ARMA models. Systems & Control Letters, 46(4):265–270, 2002.

[10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009. doi: 10.1109/CVPR.2009.5206848.

[11] P. Dhar, R. V. Singh, K.-C. Peng, Z. Wu, and R. Chellappa. Learning without memorizing. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5138–5146, 2019.

[12] I. S. Dhillon, J. R. Heath, T. Strohmer, and J. A. Tropp. Constructing packings in grassmannian manifolds via alternating projection. Experimental mathematics, 17(1):9–35, 2008.

[13] G. Doretto, A. Chiuso, Y. N. Wu, and S. Soatto. Dynamic textures. International journal of computer vision, 51:91–109, 2003.

[14] G. Golub, S. Nash, and C. Van Loan. A Hessenberg-Schur method for the problem AX + XB = C. IEEE Transactions on Automatic Control, 24(6):909–913, 1979. doi: 10.1109/TAC.1979.1102170.

[15] A. Gomez-Villa, D. Goswami, K. Wang, A. D. Bagdanov, B. Twardowski, and J. van de Weijer. Exemplar-free continual representation learning via learnable drift compensation. In European Conference on Computer Vision, pages 473–490. Springer, 2024.

[16] A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In International conference on algorithmic learning theory, pages 63–77. Springer, 2005.

[17] G. Griffin, A. Holub, and P. Perona. Caltech 256, Apr 2022.

[18] A. Gu and T. Dao. Mamba: Linear-time sequence modeling with selective state spaces. In First Conference on Language Modeling, 2024.

[19] A. Gu, K. Goel, A. Gupta, and C. Ré. On the parameterization and initialization of diagonal state space models. Advances in Neural Information Processing Systems, 35:35971–35983, 2022.

[20] A. Gu, K. Goel, and C. Ré. Efficiently modeling long sequences with structured state spaces. In International Conference on Learning Representations, 2022.

[21] A. Gupta, A. Gu, and J. Berant. Diagonal state spaces are as effective as structured state spaces. Advances in Neural Information Processing Systems, 35:22982–22994, 2022.

[22] A. Hatamizadeh and J. Kautz. Mambavision: A hybrid mamba-transformer vision backbone. arXiv preprint arXiv:2407.08083, 2024.

[23] D. Hendrycks, S. Basart, N. Mu, S. Kadavath, F. Wang, E. Dorundo, R. Desai, T. Zhu, S. Parajuli, M. Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF international conference on computer vision, pages 8340–8349, 2021.

[24] S. Hou, X. Pan, C. C. Loy, Z. Wang, and D. Lin. Learning a unified classifier incrementally via rebalancing. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 831–839, 2019.

[25] Y.-C. Hsu, Y.-C. Liu, A. Ramasamy, and Z. Kira. Re-evaluating continual learning scenarios: A categorization and case for strong baselines. arXiv preprint arXiv:1810.12488, 2018.

[26] W. Huang, F. Sun, L. Cao, D. Zhao, H. Liu, and M. Harandi. Sparse coding and dictionary learning with linear dynamical systems. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3938–3947, 2016.

[27] W. Huang, M. Harandi, T. Zhang, L. Fan, F. Sun, and J. Huang. Efficient optimization for linear dynamical systems with applications to clustering and sparse coding. Advances in neural information processing systems, 30, 2017.

[28] A. Iscen, J. Zhang, S. Lazebnik, and C. Schmid. Memory-efficient incremental learning through feature adaptation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVI 16, pages 699–715. Springer, 2020.

[29] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017.

[30] S. Kornblith, M. Norouzi, H. Lee, and G. Hinton. Similarity of neural network representations revisited. In International conference on machine learning, pages 3519–3529. PMLR, 2019.

[31] A. Krizhevsky. Learning multiple layers of features from tiny images. Master's thesis, University of Toronto, 2009.

[32] X. Li, Y. Yang, J. Wu, B. Ghanem, L. Nie, and M. Zhang. Mamba-fscil: Dynamic adaptation with selective state space model for few-shot class-incremental learning. CoRR, abs/2407.06136, 2024.

[33] Z. Li and D. Hoiem. Learning without forgetting. IEEE transactions on pattern analysis and machine intelligence, 40(12):2935–2947, 2017.

[34] J. Liang, B. Meyer, I. N. Lee, and T.-T. Do. Self-supervised learning for acoustic few-shot classification. In ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025.

[35] Y.-S. Liang and W.-J. Li. Inflora: Interference-free low-rank adaptation for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23638–23647, 2024.

[36] J. Liu, M. Liu, Z. Wang, P. An, X. Li, K. Zhou, S. Yang, R. Zhang, Y. Guo, and S. Zhang. Robomamba: Efficient vision-language-action model for robotic reasoning and manipulation. Advances in Neural Information Processing Systems, 37:40085–40110, 2024.

[37] R. Liu, B. Diao, L. Huang, Z. An, Z. An, and Y. Xu. Continual learning in the frequency domain. Advances in Neural Information Processing Systems, 37:85389–85411, 2024.

[38] Y. Liu, Y. Tian, Y. Zhao, H. Yu, L. Xie, Y. Wang, Q. Ye, J. Jiao, and Y. Liu. Vmamba: Visual state space model. Advances in neural information processing systems, 37:103031–103063, 2025.

[39] E. S. Lubana, P. Trivedi, D. Koutra, and R. Dick. How do quadratic regularizers prevent catastrophic forgetting: The role of interpolation. In Conference on Lifelong Learning Agents, pages 819–837. PMLR, 2022.

[40] S. Magistri, T. Trinci, A. Soutif-Cormerais, J. van de Weijer, and A. D. Bagdanov. Elastic feature consolidation for cold start exemplar-free incremental learning. In ICLR, 2024.

[41] L. Mahmoodi, M. Harandi, and P. Moghadam. Flashback for continual learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, pages 3434–3443, October 2023.

[42] L. Mahmoodi, P. Moghadam, M. Hayat, C. Simon, and M. Harandi. Flashbacks to harmonize stability and plasticity in continual learning. Neural Networks, 190:107616, 2025.

[43] R. J. Martin. A metric for ARMA processes. IEEE transactions on Signal Processing, 48(4):1164–1170, 2000.

[44] J. L. McClelland, B. L. McNaughton, and R. C. O'Reilly. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychological review, 102(3):419, 1995.

[45] M. McCloskey and N. J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation, volume 24, pages 109–165. Elsevier, 1989.

[46] Z. Meng, J. Zhang, C. Yang, Z. Zhan, P. Zhao, and Y. Wang. Diffclass: Diffusion-based class incremental learning. In European Conference on Computer Vision, pages 142–159. Springer, 2024.

[47] G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter. Continual lifelong learning with neural networks: A review. Neural networks, 113:54–71, 2019.

[48] H. Phung, Q. Dao, T. Dao, V. H. Phan, D. Metaxas, and A. Tran. Dimsum: Diffusion mamba-a scalable and unified spatial-frequency method for image generation. Advances in Neural Information Processing Systems, 37:32947–32979, 2024.

[49] A. Ravichandran, R. Chaudhry, and R. Vidal. Categorizing dynamic textures using a bag of dynamical systems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(2):342–353, 2012.

[50] S.-A. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert. icarl: Incremental classifier and representation learning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 2001–2010, 2017.

[51] M. Riemer, I. Cases, R. Ajemian, M. Liu, I. Rish, Y. Tu, and G. Tesauro. Learning to learn without forgetting by maximizing transfer and minimizing interference. In International Conference on Learning Representations, 2019.

[52] H. Shin, J. K. Lee, J. Kim, and J. Kim. Continual learning with deep generative replay. Advances in neural information processing systems, 30, 2017.

[53] C. Simon, P. Koniusz, and M. Harandi. On learning the geodesic path for incremental learning. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 1591–1600, 2021.

[54] P. Turaga, A. Veeraraghavan, A. Srivastava, and R. Chellappa. Statistical computations on grassmann and stiefel manifolds for image and video-based recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(11):2273–2286, 2011.

[55] G. M. Van de Ven and A. S. Tolias. Three scenarios for continual learning. arXiv preprint arXiv:1904.07734, 2019.

[56] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.

[57] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The caltech-ucsd birds-200-2011 dataset. Jul 2011.

[58] L. Wang, X. Zhang, H. Su, and J. Zhu. A comprehensive survey of continual learning: Theory, method and application. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8):5362–5383, 2024. doi: 10.1109/TPAMI.2024.3367329.

[59] Z. Wang, Z. Zhang, C.-Y. Lee, H. Zhang, R. Sun, X. Ren, G. Su, V. Perot, J. Dy, and T. Pfister. Learning to prompt for continual learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 139–149, 2022.

[60] K. Ye and L.-H. Lim. Schubert varieties and distances between subspaces of different dimensions. SIAM Journal on Matrix Analysis and Applications, 37(3):1176–1197, 2016.

[61] F. Zenke, B. Poole, and S. Ganguli. Continual learning through synaptic intelligence. In International conference on machine learning, pages 3987–3995. PMLR, 2017.

[62] Z. Zhang, A. Liu, I. Reid, R. Hartley, B. Zhuang, and H. Tang. Motion mamba: Efficient and long sequence motion generation. In European Conference on Computer Vision, pages 265–282. Springer, 2024.

[63] C. Zhao and D. Gong. Learning mamba as a continual learner. CoRR, abs/2412.00776, 2024.

[64] K. Zhu, W. Zhai, Y. Cao, J. Luo, and Z.-J. Zha. Self-sustaining representation expansion for non-exemplar class-incremental learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9296–9305, 2022.

[65] L. Zhu, B. Liao, Q. Zhang, X. Wang, W. Liu, and X. Wang. Vision mamba: Efficient visual representation learning with bidirectional state space model. In Forty-first International Conference on Machine Learning, 2024.

Contents
1 Introduction
2 Preliminary
2.1 Problem Statement ...

1. Matrix multiplication $\tilde{C}^{\top} \tilde{C}'$
2. Matrix multiplication $\tilde{A}_{\mathrm{diag}} \tilde{A}'^{\top}_{\mathrm{diag}}$
3. Subtraction $\mathbf{1}_n - \tilde{A}_{\mathrm{diag}} \tilde{A}'^{\top}_{\mathrm{diag}}$
4. Element-wise division of $\tilde{C}^{\top} \tilde{C}'$ over $\mathbf{1}_n - \tilde{A}_{\mathrm{diag}} \tilde{A}'^{\top}_{\mathrm{diag}}$

Note that $\odot$ and the element-wise reciprocal can be combined into a single step by dividing $\tilde{C}^{\top} \tilde{C}'$ element-wise by $\mathbf{1}_n - \tilde{A}_{\mathrm{diag}} \tilde{A}'^{\top}_{\mathrm{diag}}$. Each outlined step thus has a FLOP count of $n^2$, so the total FLOP count is $4n^2$, with computational complexity $O(n^2)$. Hence, this reduces ...
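The four steps can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: it assumes the Sylvester equation takes the discrete form $X - \mathrm{diag}(a)\, X\, \mathrm{diag}(a') = \tilde{C}^{\top} \tilde{C}'$ with diagonal state matrices, that $\tilde{C}$ and $\tilde{C}'$ are $1 \times n$ row vectors, and that every product $a_i a'_j$ differs from 1. The function name `solve_diagonal_sylvester` and all variable names are hypothetical.

```python
import numpy as np


def solve_diagonal_sylvester(a_diag, a_prime_diag, C, C_prime):
    """Solve X - diag(a) X diag(a') = C^T C' entry-wise.

    Assumes a_diag and a_prime_diag hold the diagonals of the two state
    matrices and that a_i * a'_j != 1 for all i, j.
    """
    a_diag = np.asarray(a_diag, dtype=float)
    a_prime_diag = np.asarray(a_prime_diag, dtype=float)
    gram = C.T @ C_prime                     # step 1: C^T C' (n x n)
    outer = np.outer(a_diag, a_prime_diag)   # step 2: a a'^T (n x n)
    denom = 1.0 - outer                      # step 3: all-ones matrix minus outer product
    return gram / denom                      # step 4: element-wise division


# Small check on random, stable (|a_i| < 1) diagonal systems.
rng = np.random.default_rng(0)
n = 4
a = rng.uniform(0.1, 0.9, size=n)
a_p = rng.uniform(0.1, 0.9, size=n)
C = rng.standard_normal((1, n))
C_p = rng.standard_normal((1, n))

X = solve_diagonal_sylvester(a, a_p, C, C_p)
residual = X - np.diag(a) @ X @ np.diag(a_p) - C.T @ C_p
print("max |residual|:", np.abs(residual).max())
```

Because both state matrices are diagonal, the product $\mathrm{diag}(a)\, X\, \mathrm{diag}(a')$ scales entry $X_{ij}$ by $a_i a'_j$, so each entry of $X$ is obtained independently and no general-purpose Sylvester solver is required.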

Note for readability, $B$ refers to $\tilde{B}$ derived in §D.3.

H.5 Vim-tiny

To validate that Inf-SSM adapts to different model sizes, especially small models, we evaluated Inf-SSM along with the EFCIL baselines EWC, SI, MAS, and LwF on Vim-tiny. Vim-tiny has only 7M parameters, compared with 26M for Vim-small [65].

Table 10: AA (%↑), AIA (%↑), and FM (%↓) of...

• Represent the main CL paradigms used in CIL or EFCIL.
• Can be adapted to the Vim backbone without ad hoc, architecture-specific redesign.

We therefore adopt the following baselines, all implemented on Vim-Small under a shared training protocol (datasets, task splits, and metrics are described in §5, §G.2, and §J.2).

For replay-based methods:

• ER [51] is a canonical replay-based baseline. It measures how much ...