arxiv: 2505.18604 · v3 · pith:Z32473H7new · submitted 2025-05-24 · 💻 cs.LG

Exemplar-Free Continual Learning for State Space Models

Pith reviewed 2026-05-19 13:22 UTC · model grok-4.3

classification 💻 cs.LG
keywords continual learning · state space models · exemplar-free · Grassmannian · Sylvester equation · catastrophic forgetting · observability subspace

The pith

Inf-SSM curbs forgetting in state-space models by regularizing infinite-horizon state evolution on the Grassmannian without storing old data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Inf-SSM as a geometry-aware regularization for state-space models adapting to new tasks in sequence without access to prior examples. It works by constraining the infinite-horizon evolution of the models through their extended observability subspace on the infinite-dimensional Grassmannian rather than directly limiting parameter changes. This regularization is enforced by solving a Sylvester equation, and the authors derive a structured O(n^2) solver that exploits SSM properties to keep the overhead low. The approach integrates with existing continual learning techniques and is tested on image benchmarks presented as task sequences. Results indicate less forgetting and higher accuracy on later tasks compared to standard updates.

Core claim

Inf-SSM is a novel regularization method that utilizes the geometry of the infinite-dimensional Grassmannian to constrain the infinite-horizon evolution of SSMs encoded in their extended observability subspace. Unlike classical continual learning methods that constrain weight updates, Inf-SSM directly regularizes state evolution during adaptation. Enforcing this regularization requires solving a matrix equation known as the Sylvester equation, which the authors solve in O(n^2) time by exploiting the structure and properties of SSMs. This produces an efficient mechanism that integrates into existing CL methods and yields reduced forgetting with improved accuracy across sequential tasks on the …

What carries the argument

The geometry of the infinite-dimensional Grassmannian applied to the extended observability subspace of an SSM, enforced by an efficient structured solver for the Sylvester equation.
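
For orientation, the objects named above can be written out as follows. This is a sketch in standard linear-systems notation, not a reproduction of the paper's equations: the discretized pair $(\tilde{A}, \tilde{C})$ follows the Figure 2 caption, while the symbol $G$, the primed pair $(\tilde{A}', \tilde{C}')$ for a second model, and the stability condition are assumptions introduced here.

```latex
% Extended observability matrix of the pair (\tilde{A}, \tilde{C});
% its column span is the point on the Grassmannian.
\mathcal{O}_\infty(\tilde{A}, \tilde{C}) =
\begin{bmatrix} \tilde{C} \\ \tilde{C}\tilde{A} \\ \tilde{C}\tilde{A}^{2} \\ \vdots \end{bmatrix}

% Cross-Gram matrix between two models (assumed notation):
G = \mathcal{O}_\infty(\tilde{A}, \tilde{C})^{\top}\,
    \mathcal{O}_\infty(\tilde{A}', \tilde{C}')
  = \sum_{k=0}^{\infty} (\tilde{A}^{\top})^{k}\,
    \tilde{C}^{\top}\tilde{C}'\, \tilde{A}'^{\,k}

% If the series converges (e.g. spectral radii below one), G solves a
% discrete-time Sylvester (Stein) equation:
G - \tilde{A}^{\top} G\, \tilde{A}' = \tilde{C}^{\top}\tilde{C}'
```

The last line follows by peeling the $k = 0$ term off the series, which is why comparing two infinite-horizon subspaces reduces to a finite $n \times n$ matrix equation.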

If this is right

  • The regularization can be added to existing continual learning pipelines for SSMs with minimal extra cost.
  • Forgetting decreases and accuracy rises on sequential image classification benchmarks such as ImageNet-R and Caltech-256.
  • The O(n^2) Sylvester solver makes the method scalable for larger state dimensions.
  • State evolution is controlled without replay buffers or stored exemplars.
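
To make the O(n^2) claim concrete, here is a minimal sketch of an element-wise solve, assuming both state matrices are diagonal (a common SSM parameterization, but an assumption here, as is the equation form `G - A^T G A' = C^T C'`). The function names `solve_diagonal_stein` and `gram_rhs` and the example numbers are hypothetical, and this is not the authors' implementation.

```python
def gram_rhs(c, c_prime):
    """Right-hand side Q = C^T C' for C, C' given as p x n nested lists."""
    n = len(c[0])
    return [[sum(row[i] * row_p[j] for row, row_p in zip(c, c_prime))
             for j in range(n)]
            for i in range(n)]


def solve_diagonal_stein(a, a_prime, q):
    """Solve G - diag(a) G diag(a_prime) = Q one entry at a time.

    Assumption: the state matrices are diagonal, passed as lists of their
    diagonal entries, with |a[i] * a_prime[j]| < 1 so that no denominator
    vanishes. Each entry costs O(1), so the whole solve is O(n^2).
    """
    n = len(a)
    return [[q[i][j] / (1.0 - a[i] * a_prime[j]) for j in range(n)]
            for i in range(n)]


# Hypothetical 2-state, single-output example.
a, a_prime = [0.5, -0.2], [0.4, 0.1]
c, c_prime = [[1.0, 2.0]], [[0.5, -1.0]]
q = gram_rhs(c, c_prime)
g = solve_diagonal_stein(a, a_prime, q)
```

With diagonal matrices the product `A^T G A'` only rescales entry `(i, j)` by `a[i] * a_prime[j]`, so the equation decouples entry by entry. A dense, unstructured pair would instead need a general solver such as Bartels–Stewart [4], which is where the cubic cost comes from.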

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same geometric idea on observability subspaces might apply to other linear recurrent architectures that admit similar infinite-horizon representations.
  • Controlling state geometry rather than weights could become a general principle for stable lifelong adaptation in sequence models.
  • The method invites direct tests on non-image sequences such as time-series forecasting or language streams to check domain breadth.

Load-bearing premise

That geometric regularization on the extended observability subspace via the infinite-dimensional Grassmannian sufficiently constrains SSM state evolution to prevent catastrophic forgetting without any exemplars.

What would settle it

Apply Inf-SSM to a sequence of tasks on a held-out benchmark and measure whether retention of earlier task performance shows no improvement over a baseline SSM updated without the Grassmannian term.
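
The comparison described above needs a retention metric. Below is a minimal sketch using the average-accuracy and forgetting definitions that are common in the continual learning literature; the paper's exact metric definitions may differ, and the accuracy values are hypothetical placeholders, not results from the paper.

```python
def average_accuracy(acc):
    """Mean accuracy over all tasks after training on the final task.

    acc[t][i] is the accuracy on task i measured after training on task t,
    so acc[t] has t + 1 entries (a lower-triangular table).
    """
    final = acc[-1]
    return sum(final) / len(final)


def forgetting_measure(acc):
    """Average drop from each earlier task's best accuracy to its final one."""
    last = len(acc) - 1
    if last == 0:
        return 0.0  # a single task cannot be forgotten
    drops = []
    for task in range(last):
        best_earlier = max(acc[t][task] for t in range(task, last))
        drops.append(best_earlier - acc[last][task])
    return sum(drops) / len(drops)


# Hypothetical 3-task accuracy table (percent), for illustration only.
acc = [
    [90.0],
    [80.0, 85.0],
    [70.0, 75.0, 88.0],
]
print(average_accuracy(acc), forgetting_measure(acc))
```

Running the same routine on a baseline run and on a run with the Grassmannian term gives two forgetting values to compare; the premise is in trouble if the regularized run does not come out lower.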

Figures

Figures reproduced from arXiv: 2505.18604 by Isaac Ning Lee, Leila Mahmoodi, Mehrtash Harandi, Trung Le.

Figure 1: Left: Average performance gain from integrating Inf-SSM into representative CIL baselines (ER [51], LUCIR [24], X-DER [5], L2P-R [59], CLFD [37]) on 5-task (5T) and 10-task (10T) benchmarks built from ImageNet-R (INR) [23], CIFAR-100 (CIF) [31], and Caltech-256 (CAL) [17]. Center: Comparison of Inf-SSM with exemplar-free CIL methods (EWC [29], SI [61], MAS [3], LwF [33]) when regularizing (A, C) on INR. Ri…
Figure 2: SSM at each sequence position τ is characterized by the infinite-horizon observability subspace $\mathcal{O}_\infty$ defined by the tuple $(\tilde{A}, \tilde{C})$, and visualized as a trajectory to an infinite horizon. The colored plane represents the complete set of $\mathcal{O}_\infty$. Each trajectory is mapped to a point on the Grassmannian, and the pairwise distance $d_{\mathrm{Gr}}$ is illustrated as the geodesic on the sphere representing the Grassmann manifold …
Figure 4: CKD analysis on Vim-small with ImageNet-R and CUB-200-2011 over 10 tasks, EFCIL
Original abstract

State-Space Models (SSMs) excel at capturing long-range dependencies with structured recurrence, making them well-suited for sequence modeling. However, their evolving internal states pose challenges in adapting them under Continual Learning (CL). This is particularly difficult in exemplar-free settings, where the absence of prior data leaves updates to the dynamic SSM states unconstrained, resulting in catastrophic forgetting. To address this, we propose Inf-SSM, a novel and simple geometry-aware regularization method that utilizes the geometry of the infinite-dimensional Grassmannian to constrain state evolution during CL. Unlike classical continual learning methods that constrain weight updates, Inf-SSM regularizes the infinite-horizon evolution of SSMs encoded in their extended observability subspace. We show that enforcing this regularization requires solving a matrix equation known as the Sylvester equation, which typically incurs $\mathcal{O}(n^3)$ complexity. We develop a $\mathcal{O}(n^2)$ solution by exploiting the structure and properties of SSMs. This leads to an efficient regularization mechanism that can be seamlessly integrated into existing CL methods. Comprehensive experiments on challenging benchmarks, including ImageNet-R and Caltech-256, demonstrate a significant reduction in forgetting while improving accuracy across sequential tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper proposes Inf-SSM, a geometry-aware regularization technique for exemplar-free continual learning with State Space Models. It constrains the infinite-horizon state evolution by regularizing the extended observability subspace on the infinite-dimensional Grassmannian; the resulting regularization is realized by solving a structured Sylvester equation, for which the authors derive an O(n²) algorithm that exploits SSM structure. Experiments on ImageNet-R and Caltech-256 are reported to show reduced forgetting together with higher accuracy across sequential tasks.

Significance. If the geometric constraint on the observability subspace indeed limits forgetting without exemplars, the work would supply a principled, parameter-light alternative to weight-regularization methods for recurrent sequence models. The O(n²) Sylvester solver that exploits SSM structure is a concrete technical contribution that could be reused beyond continual learning.

major comments (2)
  1. [§3.2] §3.2 (Geometric Regularization): The central claim that Grassmannian distance on the extended observability subspace bounds the actual finite-dimensional state trajectory under new-task inputs is not supported by an explicit invariance or contraction argument. The Sylvester-equation solution aligns subspaces but leaves the input-dependent recurrence free; without a quantitative link showing that this alignment prevents erasure of prior dynamics, the exemplar-free guarantee remains unproven.
  2. [§4] §4 (Experiments): The reported accuracy gains and forgetting reductions on ImageNet-R and Caltech-256 lack error bars, number of random seeds, or statistical significance tests. It is also unclear whether the baselines include SSM-adapted versions of standard exemplar-free methods (EWC, SI, etc.) or only generic CL baselines; without these controls the attribution of gains to the Grassmannian regularizer cannot be verified.
minor comments (3)
  1. [§2] The precise definition of the 'extended observability subspace' for a finite-order SSM should be given explicitly (with an equation) before the infinite-dimensional Grassmannian construction is introduced.
  2. [§4] Figure captions and axis labels in the experimental plots should state the number of tasks, the class-incremental protocol, and whether results are averaged over multiple runs.
  3. [§3.3] A short discussion of the assumptions on the SSM matrices (diagonal, companion, etc.) that enable the O(n²) Sylvester reduction would help readers assess the generality of the complexity claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment below with clarifications and indicate the revisions we will make to improve the manuscript.

Point-by-point responses
  1. Referee: [§3.2] §3.2 (Geometric Regularization): The central claim that Grassmannian distance on the extended observability subspace bounds the actual finite-dimensional state trajectory under new-task inputs is not supported by an explicit invariance or contraction argument. The Sylvester-equation solution aligns subspaces but leaves the input-dependent recurrence free; without a quantitative link showing that this alignment prevents erasure of prior dynamics, the exemplar-free guarantee remains unproven.

    Authors: We agree that an explicit quantitative link would strengthen the theoretical foundation. In the revised manuscript we will insert a new lemma establishing that the Grassmannian distance between extended observability subspaces induces a bound on the difference of finite-horizon state trajectories under bounded inputs. The proof exploits the stability of the SSM recurrence and the fact that the observability operator determines the reachable state evolution; the O(n²) Sylvester solver is used to enforce the subspace alignment that yields the contraction. This addition directly addresses the missing invariance argument. revision: yes

  2. Referee: [§4] §4 (Experiments): The reported accuracy gains and forgetting reductions on ImageNet-R and Caltech-256 lack error bars, number of random seeds, or statistical significance tests. It is also unclear whether the baselines include SSM-adapted versions of standard exemplar-free methods (EWC, SI, etc.) or only generic CL baselines; without these controls the attribution of gains to the Grassmannian regularizer cannot be verified.

    Authors: We concur that the experimental section requires additional rigor. We will augment the results with error bars computed over five independent random seeds, report standard deviations, and include paired t-tests for statistical significance. We will also add SSM-adapted baselines in which EWC and SI regularization terms are applied directly to the SSM parameters (A, B, C, D), thereby isolating the contribution of the Grassmannian regularizer from generic continual-learning techniques. revision: yes
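
The reporting promised in this response can be sketched with the standard library alone. The per-seed accuracies below are hypothetical placeholders, not numbers from the paper, and the helper names are illustrative. The critical value 2.776 is the two-sided 5% threshold of Student's t with 4 degrees of freedom, which matches five paired seeds.

```python
import math
import statistics


def mean_and_std(values):
    """Sample mean and sample standard deviation across seeds."""
    return statistics.mean(values), statistics.stdev(values)


def paired_t_statistic(method, baseline):
    """Paired t statistic for per-seed results run with matched seeds."""
    if len(method) != len(baseline) or len(method) < 2:
        raise ValueError("need two equal-length samples with at least 2 seeds")
    diffs = [m - b for m, b in zip(method, baseline)]
    spread = statistics.stdev(diffs)
    if spread == 0:
        raise ValueError("all paired differences are identical")
    return statistics.mean(diffs) / (spread / math.sqrt(len(diffs)))


# Hypothetical accuracies (percent) over five seeds, for illustration only.
with_regularizer = [62.0, 63.0, 61.0, 64.0, 62.5]
baseline = [60.0, 60.5, 60.0, 61.0, 60.5]

t_stat = paired_t_statistic(with_regularizer, baseline)
T_CRITICAL_DF4 = 2.776  # two-sided, alpha = 0.05, 4 degrees of freedom
print(mean_and_std(with_regularizer), mean_and_std(baseline))
print(t_stat, abs(t_stat) > T_CRITICAL_DF4)
```

Pairing by seed matters here: it removes seed-to-seed variance that both methods share, so a small but consistent gain is not drowned out by the spread of each method on its own.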

Circularity Check

0 steps flagged

No significant circularity; derivation uses standard linear algebra and geometry independent of fitted outputs

full rationale

The paper derives Inf-SSM regularization by mapping SSM state evolution to the extended observability subspace on the infinite-dimensional Grassmannian, then enforcing a distance constraint that reduces to solving the Sylvester equation. This equation and its O(n^2) solver are obtained by exploiting the structured matrices of SSMs (A, B, C) rather than by fitting to CL performance metrics or redefining the target forgetting measure. No load-bearing step equates the final regularization term to a quantity defined inside the paper's own training loop or self-cited uniqueness result. The central claim therefore remains a geometric construction applied to an existing model class, with empirical results on ImageNet-R and Caltech-256 serving as external validation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that Grassmannian geometry on the observability subspace controls forgetting; the Sylvester equation is invoked as standard mathematics with no free parameters or new postulated entities introduced.

axioms (1)
  • domain assumption: The geometry of the infinite-dimensional Grassmannian can be used to constrain the evolution of SSM states during continual learning.
    This premise directly motivates the regularization and is invoked when the authors state that they regularize the infinite-horizon evolution encoded in the extended observability subspace.

pith-pipeline@v0.9.0 · 5745 in / 1285 out tokens · 54940 ms · 2026-05-19T13:22:51.013818+00:00 · methodology

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · 2 internal anchors

  1. [1]

    B. Afsari and R. Vidal. The alignment distance on spaces of linear dynamical systems. In 52nd IEEE Conference on Decision and Control, pages 1162–1167, 2013. doi: 10.1109/CDC.2013.6760039

  2. [2]

    H. Ahn, S. Cha, D. Lee, and T. Moon. Uncertainty-based continual learning with adaptive regularization. Advances in neural information processing systems, 32, 2019

  3. [3]

    R. Aljundi, F. Babiloni, M. Elhoseiny, M. Rohrbach, and T. Tuytelaars. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European conference on computer vision (ECCV), pages 139–154, 2018

  4. [4]

    R. H. Bartels and G. W. Stewart. Solution of the matrix equation AX + XB = C. Communications of the ACM, 15(9):820–826, 1972. doi: 10.1145/355607.362840

  5. [5]

    M. Boschini, L. Bonicelli, P. Buzzega, A. Porrello, and S. Calderara. Class-incremental continual learning into the extended der-verse. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022

  6. [6]

    Z. Chen and B. Liu. Lifelong machine learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 12(3):1–207, 2018

  7. [7]

    D. Cheng, Y. Lu, L. He, S. Zhang, X. Yang, N. Wang, and X. Gao. Mamba-CL: Optimizing selective state space model in null space for continual learning. arXiv preprint arXiv:2411.15469, 2024

  8. [8]

    T. Dao and A. Gu. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. In Forty-first International Conference on Machine Learning, 2024

  9. [9]

    K. De Cock and B. De Moor. Subspace angles between ARMA models. Systems & Control Letters, 46(4):265–270, 2002

  10. [10]

    J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009. doi: 10.1109/CVPR.2009.5206848

  11. [11]

    P. Dhar, R. V. Singh, K.-C. Peng, Z. Wu, and R. Chellappa. Learning without memorizing. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5138–5146, 2019

  12. [12]

    I. S. Dhillon, J. R. Heath, T. Strohmer, and J. A. Tropp. Constructing packings in Grassmannian manifolds via alternating projection. Experimental mathematics, 17(1):9–35, 2008

  13. [13]

    G. Doretto, A. Chiuso, Y. N. Wu, and S. Soatto. Dynamic textures. International journal of computer vision, 51:91–109, 2003

  14. [14]

    G. Golub, S. Nash, and C. Van Loan. A Hessenberg-Schur method for the problem AX + XB = C. IEEE Transactions on Automatic Control, 24(6):909–913, 1979. doi: 10.1109/TAC.1979.1102170

  15. [15]

    A. Gomez-Villa, D. Goswami, K. Wang, A. D. Bagdanov, B. Twardowski, and J. van de Weijer. Exemplar-free continual representation learning via learnable drift compensation. In European Conference on Computer Vision, pages 473–490. Springer, 2024

  16. [16]

    A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In International conference on algorithmic learning theory, pages 63–77. Springer, 2005

  17. [17]

    G. Griffin, A. Holub, and P. Perona. Caltech 256, Apr 2022

  18. [18]

    A. Gu and T. Dao. Mamba: Linear-time sequence modeling with selective state spaces. In First Conference on Language Modeling, 2024

  19. [19]

    A. Gu, K. Goel, A. Gupta, and C. Ré. On the parameterization and initialization of diagonal state space models. Advances in Neural Information Processing Systems, 35:35971–35983, 2022

  20. [20]

    A. Gu, K. Goel, and C. Ré. Efficiently modeling long sequences with structured state spaces. In International Conference on Learning Representations, 2022

  21. [21]

    A. Gupta, A. Gu, and J. Berant. Diagonal state spaces are as effective as structured state spaces. Advances in Neural Information Processing Systems, 35:22982–22994, 2022

  22. [22]

    A. Hatamizadeh and J. Kautz. Mambavision: A hybrid mamba-transformer vision backbone. arXiv preprint arXiv:2407.08083, 2024

  23. [23]

    D. Hendrycks, S. Basart, N. Mu, S. Kadavath, F. Wang, E. Dorundo, R. Desai, T. Zhu, S. Parajuli, M. Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF international conference on computer vision, pages 8340–8349, 2021

  24. [24]

    S. Hou, X. Pan, C. C. Loy, Z. Wang, and D. Lin. Learning a unified classifier incrementally via rebalancing. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 831–839, 2019

  25. [25]

    Y.-C. Hsu, Y.-C. Liu, A. Ramasamy, and Z. Kira. Re-evaluating continual learning scenarios: A categorization and case for strong baselines. arXiv preprint arXiv:1810.12488, 2018

  26. [26]

    W. Huang, F. Sun, L. Cao, D. Zhao, H. Liu, and M. Harandi. Sparse coding and dictionary learning with linear dynamical systems. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3938–3947, 2016

  27. [27]

    W. Huang, M. Harandi, T. Zhang, L. Fan, F. Sun, and J. Huang. Efficient optimization for linear dynamical systems with applications to clustering and sparse coding. Advances in neural information processing systems, 30, 2017

  28. [28]

    A. Iscen, J. Zhang, S. Lazebnik, and C. Schmid. Memory-efficient incremental learning through feature adaptation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVI 16, pages 699–715. Springer, 2020

  29. [29]

    J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017

  30. [30]

    S. Kornblith, M. Norouzi, H. Lee, and G. Hinton. Similarity of neural network representations revisited. In International conference on machine learning, pages 3519–3529. PMLR, 2019

  31. [31]

    A. Krizhevsky. Learning multiple layers of features from tiny images. Master’s thesis, University of Toronto, 2009

  32. [32]

    X. Li, Y. Yang, J. Wu, B. Ghanem, L. Nie, and M. Zhang. Mamba-fscil: Dynamic adaptation with selective state space model for few-shot class-incremental learning. CoRR, abs/2407.06136, 2024

  33. [33]

    Z. Li and D. Hoiem. Learning without forgetting. IEEE transactions on pattern analysis and machine intelligence, 40(12):2935–2947, 2017

  34. [34]

    J. Liang, B. Meyer, I. N. Lee, and T.-T. Do. Self-supervised learning for acoustic few-shot classification. In ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025

  35. [35]

    Y.-S. Liang and W.-J. Li. Inflora: Interference-free low-rank adaptation for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23638–23647, 2024

  36. [36]

    J. Liu, M. Liu, Z. Wang, P. An, X. Li, K. Zhou, S. Yang, R. Zhang, Y. Guo, and S. Zhang. Robomamba: Efficient vision-language-action model for robotic reasoning and manipulation. Advances in Neural Information Processing Systems, 37:40085–40110, 2024

  37. [37]

    R. Liu, B. Diao, L. Huang, Z. An, Z. An, and Y. Xu. Continual learning in the frequency domain. Advances in Neural Information Processing Systems, 37:85389–85411, 2024

  38. [38]

    Y. Liu, Y. Tian, Y. Zhao, H. Yu, L. Xie, Y. Wang, Q. Ye, J. Jiao, and Y. Liu. Vmamba: Visual state space model. Advances in neural information processing systems, 37:103031–103063, 2025

  39. [39]

    E. S. Lubana, P. Trivedi, D. Koutra, and R. Dick. How do quadratic regularizers prevent catastrophic forgetting: The role of interpolation. In Conference on Lifelong Learning Agents, pages 819–837. PMLR, 2022

  40. [40]

    S. Magistri, T. Trinci, A. Soutif-Cormerais, J. van de Weijer, and A. D. Bagdanov. Elastic feature consolidation for cold start exemplar-free incremental learning. In ICLR, 2024

  41. [41]

    L. Mahmoodi, M. Harandi, and P. Moghadam. Flashback for continual learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, pages 3434–3443, October 2023

  42. [42]

    L. Mahmoodi, P. Moghadam, M. Hayat, C. Simon, and M. Harandi. Flashbacks to harmonize stability and plasticity in continual learning. Neural Networks, 190:107616, 2025

  43. [43]

    R. J. Martin. A metric for ARMA processes. IEEE transactions on Signal Processing, 48(4):1164–1170, 2000

  44. [44]

    J. L. McClelland, B. L. McNaughton, and R. C. O’Reilly. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychological review, 102(3):419, 1995

  45. [45]

    M. McCloskey and N. J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation, volume 24, pages 109–165. Elsevier, 1989

  46. [46]

    Z. Meng, J. Zhang, C. Yang, Z. Zhan, P. Zhao, and Y. Wang. Diffclass: Diffusion-based class incremental learning. In European Conference on Computer Vision, pages 142–159. Springer, 2024

  47. [47]

    G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter. Continual lifelong learning with neural networks: A review. Neural networks, 113:54–71, 2019

  48. [48]

    H. Phung, Q. Dao, T. Dao, V. H. Phan, D. Metaxas, and A. Tran. Dimsum: Diffusion mamba-a scalable and unified spatial-frequency method for image generation. Advances in Neural Information Processing Systems, 37:32947–32979, 2024

  49. [49]

    A. Ravichandran, R. Chaudhry, and R. Vidal. Categorizing dynamic textures using a bag of dynamical systems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(2):342–353, 2012

  50. [50]

    S.-A. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert. icarl: Incremental classifier and representation learning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 2001–2010, 2017

  51. [51]

    M. Riemer, I. Cases, R. Ajemian, M. Liu, I. Rish, Y. Tu, and G. Tesauro. Learning to learn without forgetting by maximizing transfer and minimizing interference. In International Conference on Learning Representations, 2019

  52. [52]

    H. Shin, J. K. Lee, J. Kim, and J. Kim. Continual learning with deep generative replay. Advances in neural information processing systems, 30, 2017

  53. [53]

    C. Simon, P. Koniusz, and M. Harandi. On learning the geodesic path for incremental learning. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 1591–1600, 2021

  54. [54]

    P. Turaga, A. Veeraraghavan, A. Srivastava, and R. Chellappa. Statistical computations on Grassmann and Stiefel manifolds for image and video-based recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(11):2273–2286, 2011

  55. [55]

    G. M. Van de Ven and A. S. Tolias. Three scenarios for continual learning. arXiv preprint arXiv:1904.07734, 2019

  56. [56]

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

  57. [57]

    C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The caltech-ucsd birds-200-2011 dataset. Jul 2011

  58. [58]

    L. Wang, X. Zhang, H. Su, and J. Zhu. A comprehensive survey of continual learning: Theory, method and application. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8):5362–5383, doi: 10.1109/TPAMI.2024.3367329

  59. [59]

    Z. Wang, Z. Zhang, C.-Y. Lee, H. Zhang, R. Sun, X. Ren, G. Su, V. Perot, J. Dy, and T. Pfister. Learning to prompt for continual learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 139–149, 2022

  60. [60]

    K. Ye and L.-H. Lim. Schubert varieties and distances between subspaces of different dimensions. SIAM Journal on Matrix Analysis and Applications, 37(3):1176–1197, 2016

  61. [61]

    F. Zenke, B. Poole, and S. Ganguli. Continual learning through synaptic intelligence. In International conference on machine learning, pages 3987–3995. PMLR, 2017

  62. [62]

    Z. Zhang, A. Liu, I. Reid, R. Hartley, B. Zhuang, and H. Tang. Motion mamba: Efficient and long sequence motion generation. In European Conference on Computer Vision, pages 265–282. Springer, 2024

  63. [63]

    C. Zhao and D. Gong. Learning mamba as a continual learner. CoRR, abs/2412.00776, 2024

  64. [64]

    K. Zhu, W. Zhai, Y. Cao, J. Luo, and Z.-J. Zha. Self-sustaining representation expansion for non-exemplar class-incremental learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9296–9305, 2022

  65. [65]

    L. Zhu, B. Liao, Q. Zhang, X. Wang, W. Liu, and X. Wang. Vision mamba: Efficient visual representation learning with bidirectional state space model. In Forty-first International Conference on Machine Learning, 2024

  66. [66]

    Matrix multiplication $\tilde{C}^{\top}\tilde{C}$

  67. [67]

    Matrix multiplication $\tilde{A}_{\mathrm{diag}}\tilde{A}'^{\top}_{\mathrm{diag}}$

  68. [68]

    Subtraction $\mathbf{1}_n - \tilde{A}_{\mathrm{diag}}\tilde{A}'^{\top}_{\mathrm{diag}}$

  69. [69]

    Element-wise division of $\tilde{C}^{\top}\tilde{C}$ over $\mathbf{1}_n - \tilde{A}_{\mathrm{diag}}\tilde{A}'^{\top}_{\mathrm{diag}}$. Note that $\odot$ and the element-wise reciprocal could be combined into a single step by simply taking the element-wise division of $\tilde{C}^{\top}\tilde{C}'$ by $\mathbf{1}_n - \tilde{A}_{\mathrm{diag}}\tilde{A}'^{\top}_{\mathrm{diag}}$. Thus, each outlined step has a FLOP count of $n^2$, and hence the total FLOP count is $4n^2$, with computational complexity $\mathcal{O}(n^2)$. Hence, this reduces ...

  70. [70]

    Note: for readability, $B$ refers to $\tilde{B}$ derived in §D.3. H.5 Vim-tiny: To validate that Inf-SSM is adaptable to different model sizes, especially small models, we validated Inf-SSM along with the EFCIL baselines EWC, SI, MAS, and LwF on Vim-tiny. Vim-tiny has only 7M parameters, compared with 26M for Vim-small [65]. Table 10: AA (%↑), AIA (%↑), and FM (%↓) of...

  71. [71]

    Represent the main CL paradigms used in CIL or EFCIL

  72. [72]

    Can be adapted to the Vim backbone without ad hoc, architecture-specific redesign. We therefore adopt the following baselines, all implemented on Vim-Small under a shared training protocol (datasets, task splits, and metrics are described in §5, §G.2, and §J.2). For replay-based methods: • ER [51] is a canonical replay-based baseline. It measures how much ...