Exemplar-Free Continual Learning for State Space Models
Pith reviewed 2026-05-19 13:22 UTC · model grok-4.3
The pith
Inf-SSM curbs forgetting in state-space models by regularizing infinite-horizon state evolution on the Grassmannian without storing old data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Inf-SSM is a novel regularization method that utilizes the geometry of the infinite-dimensional Grassmannian to constrain the infinite-horizon evolution of SSMs encoded in their extended observability subspace. Unlike classical continual learning methods that constrain weight updates, Inf-SSM directly regularizes state evolution during adaptation. Enforcing this regularization requires solving a matrix equation known as the Sylvester equation, which the authors solve in O(n^2) time by exploiting the structure and properties of SSMs. This produces an efficient mechanism that integrates into existing CL methods and yields reduced forgetting with improved accuracy across sequential tasks on the
What carries the argument
The geometry of the infinite-dimensional Grassmannian applied to the extended observability subspace of an SSM, enforced by an efficient structured solver for the Sylvester equation.
If this is right
- The regularization can be added to existing continual learning pipelines for SSMs with minimal extra cost.
- Forgetting decreases and accuracy rises on sequential image classification benchmarks such as ImageNet-R and Caltech-256.
- The O(n^2) Sylvester solver makes the method scalable for larger state dimensions.
- State evolution is controlled without replay buffers or stored exemplars.
Where Pith is reading between the lines
- The same geometric idea on observability subspaces might apply to other linear recurrent architectures that admit similar infinite-horizon representations.
- Controlling state geometry rather than weights could become a general principle for stable lifelong adaptation in sequence models.
- The method invites direct tests on non-image sequences such as time-series forecasting or language streams to check domain breadth.
Load-bearing premise
That geometric regularization on the extended observability subspace via the infinite-dimensional Grassmannian sufficiently constrains SSM state evolution to prevent catastrophic forgetting without any exemplars.
What would settle it
Apply Inf-SSM to a sequence of tasks on a held-out benchmark and measure whether retention of earlier task performance shows no improvement over a baseline SSM updated without the Grassmannian term.
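The retention comparison described above can be made concrete with the usual accuracy-matrix bookkeeping. The sketch below is illustrative only: it assumes a hypothetical matrix `acc` where `acc[i, j]` is the accuracy on task `j` after training on task `i`, and it uses one common definition of average accuracy and forgetting. The paper's exact evaluation protocol is not stated here, so the function names and the definition are assumptions.

```python
import numpy as np

def average_accuracy(acc):
    """Mean accuracy over all tasks after training on the final task."""
    return float(np.mean(acc[-1]))

def forgetting_measure(acc):
    """Mean drop from the best earlier accuracy on each old task to its final accuracy.

    Requires at least two tasks; the last task is excluded because it
    cannot have been forgotten yet.
    """
    T = acc.shape[0]
    drops = [np.max(acc[j:T - 1, j]) - acc[T - 1, j] for j in range(T - 1)]
    return float(np.mean(drops))

# Hypothetical three-task run (entries above the diagonal are unused).
acc = np.array([
    [0.90, 0.00, 0.00],
    [0.80, 0.85, 0.00],
    [0.70, 0.80, 0.90],
])
aa = average_accuracy(acc)
fm = forgetting_measure(acc)
```

Computing both numbers for a run with the Grassmannian term and for a run without it, under identical task splits and seeds, would give the comparison the test calls for.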
Original abstract
State-Space Models (SSMs) excel at capturing long-range dependencies with structured recurrence, making them well-suited for sequence modeling. However, their evolving internal states pose challenges in adapting them under Continual Learning (CL). This is particularly difficult in exemplar-free settings, where the absence of prior data leaves updates to the dynamic SSM states unconstrained, resulting in catastrophic forgetting. To address this, we propose Inf-SSM, a novel and simple geometry-aware regularization method that utilizes the geometry of the infinite-dimensional Grassmannian to constrain state evolution during CL. Unlike classical continual learning methods that constrain weight updates, Inf-SSM regularizes the infinite-horizon evolution of SSMs encoded in their extended observability subspace. We show that enforcing this regularization requires solving a matrix equation known as the Sylvester equation, which typically incurs $\mathcal{O}(n^3)$ complexity. We develop a $\mathcal{O}(n^2)$ solution by exploiting the structure and properties of SSMs. This leads to an efficient regularization mechanism that can be seamlessly integrated into existing CL methods. Comprehensive experiments on challenging benchmarks, including ImageNet-R and Caltech-256, demonstrate a significant reduction in forgetting while improving accuracy across sequential tasks.
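To illustrate why SSM structure can bring the cost down to $\mathcal{O}(n^2)$, the sketch below assumes the equation takes the discrete (Stein) form $X - A^\top X A' = C^\top C'$ with diagonal state matrices $A = \mathrm{diag}(a)$ and $A' = \mathrm{diag}(a')$. This form and the function names are assumptions made for illustration, not the authors' implementation. Under that assumption every entry decouples as $X_{ij} = (C^\top C')_{ij} / (1 - a_i a'_j)$, so no matrix factorization is needed.

```python
import numpy as np

def solve_diag_stein(a, a_prime, C, C_prime):
    """Solve X - diag(a) @ X @ diag(a_prime) = C.T @ C_prime entry by entry.

    a, a_prime: length-n diagonals with |a_i * a'_j| < 1 (stable SSMs).
    C, C_prime: (p, n) output matrices.
    """
    rhs = C.T @ C_prime                    # (n, n) right-hand side
    denom = 1.0 - np.outer(a, a_prime)     # 1 - a_i * a'_j, also (n, n)
    return rhs / denom                     # element-wise division

def solve_dense_stein(A, A_prime, Q):
    """Brute-force reference via Kronecker vectorisation; for checking only."""
    n = A.shape[0]
    # Row-major vec identity: vec(A.T @ X @ A') = kron(A.T, A'.T) @ vec(X)
    K = np.eye(n * n) - np.kron(A.T, A_prime.T)
    return np.linalg.solve(K, Q.reshape(-1)).reshape(n, n)

rng = np.random.default_rng(0)
n, p = 5, 3
a = rng.uniform(-0.9, 0.9, n)
a_prime = rng.uniform(-0.9, 0.9, n)
C = rng.normal(size=(p, n))
C_prime = rng.normal(size=(p, n))

X_fast = solve_diag_stein(a, a_prime, C, C_prime)
X_ref = solve_dense_stein(np.diag(a), np.diag(a_prime), C.T @ C_prime)
```

The outer product and the division each touch $n^2$ entries. Forming `C.T @ C_prime` costs $p n^2$ operations, which stays quadratic in $n$ when the output dimension $p$ is fixed.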
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Inf-SSM, a geometry-aware regularization technique for exemplar-free continual learning with State Space Models. It constrains the infinite-horizon state evolution by regularizing the extended observability subspace on the infinite-dimensional Grassmannian; the resulting regularization is realized by solving a structured Sylvester equation, for which the authors derive an O(n²) algorithm that exploits SSM structure. Experiments on ImageNet-R and Caltech-256 are reported to show reduced forgetting together with higher accuracy across sequential tasks.
Significance. If the geometric constraint on the observability subspace indeed limits forgetting without exemplars, the work would supply a principled, parameter-light alternative to weight-regularization methods for recurrent sequence models. The O(n²) Sylvester solver that exploits SSM structure is a concrete technical contribution that could be reused beyond continual learning.
major comments (2)
- [§3.2] §3.2 (Geometric Regularization): The central claim that Grassmannian distance on the extended observability subspace bounds the actual finite-dimensional state trajectory under new-task inputs is not supported by an explicit invariance or contraction argument. The Sylvester-equation solution aligns subspaces but leaves the input-dependent recurrence free; without a quantitative link showing that this alignment prevents erasure of prior dynamics, the exemplar-free guarantee remains unproven.
- [§4] §4 (Experiments): The reported accuracy gains and forgetting reductions on ImageNet-R and Caltech-256 lack error bars, number of random seeds, or statistical significance tests. It is also unclear whether the baselines include SSM-adapted versions of standard exemplar-free methods (EWC, SI, etc.) or only generic CL baselines; without these controls the attribution of gains to the Grassmannian regularizer cannot be verified.
minor comments (3)
- [§2] The precise definition of the 'extended observability subspace' for a finite-order SSM should be given explicitly (with an equation) before the infinite-dimensional Grassmannian construction is introduced.
- [§4] Figure captions and axis labels in the experimental plots should state the number of tasks, the class-incremental protocol, and whether results are averaged over multiple runs.
- [§3.3] A short discussion of the assumptions on the SSM matrices (diagonal, companion, etc.) that enable the O(n²) Sylvester reduction would help readers assess the generality of the complexity claim.
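For orientation, the textbook object behind the first minor comment is usually written as below for a discrete-time pair $(A, C)$ with spectral radius $\rho(A) < 1$. This is the standard control-theory definition, given as an assumption about what the paper intends; the paper's own normalization may differ.

```latex
\mathcal{O}_\infty(A, C) =
\begin{bmatrix} C \\ CA \\ CA^2 \\ \vdots \end{bmatrix},
\qquad
\mathcal{S}_\infty(A, C) = \operatorname{span}\bigl(\text{columns of } \mathcal{O}_\infty(A, C)\bigr)
```

With an $n$-dimensional state, $\mathcal{O}_\infty$ has $n$ columns, so $\mathcal{S}_\infty(A, C)$ is an at most $n$-dimensional subspace of an infinite-dimensional sequence space, which is where the infinite-dimensional Grassmannian enters.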
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our work. We address each major comment below with clarifications and indicate the revisions we will make to improve the manuscript.
- Referee: [§3.2] §3.2 (Geometric Regularization): The central claim that Grassmannian distance on the extended observability subspace bounds the actual finite-dimensional state trajectory under new-task inputs is not supported by an explicit invariance or contraction argument. The Sylvester-equation solution aligns subspaces but leaves the input-dependent recurrence free; without a quantitative link showing that this alignment prevents erasure of prior dynamics, the exemplar-free guarantee remains unproven.
Authors: We agree that an explicit quantitative link would strengthen the theoretical foundation. In the revised manuscript we will insert a new lemma establishing that the Grassmannian distance between extended observability subspaces induces a bound on the difference of finite-horizon state trajectories under bounded inputs. The proof exploits the stability of the SSM recurrence and the fact that the observability operator determines the reachable state evolution; the O(n²) Sylvester solver is used to enforce the subspace alignment that yields the contraction. This addition directly addresses the missing invariance argument. revision: yes
- Referee: [§4] §4 (Experiments): The reported accuracy gains and forgetting reductions on ImageNet-R and Caltech-256 lack error bars, number of random seeds, or statistical significance tests. It is also unclear whether the baselines include SSM-adapted versions of standard exemplar-free methods (EWC, SI, etc.) or only generic CL baselines; without these controls the attribution of gains to the Grassmannian regularizer cannot be verified.
Authors: We concur that the experimental section requires additional rigor. We will augment the results with error bars computed over five independent random seeds, report standard deviations, and include paired t-tests for statistical significance. We will also add SSM-adapted baselines in which EWC and SI regularization terms are applied directly to the SSM parameters (A, B, C, D), thereby isolating the contribution of the Grassmannian regularizer from generic continual-learning techniques. revision: yes
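As a rough picture of what an SSM-adapted EWC baseline could look like, the sketch below applies the standard EWC quadratic penalty, $\tfrac{\lambda}{2} \sum_i F_i (\theta_i - \theta_i^*)^2$, to a dictionary of SSM parameters. The dictionary layout, the function name, and the toy Fisher values are hypothetical; how the authors estimate the Fisher information or weight the penalty is not stated in the response.

```python
import numpy as np

def ewc_penalty(params, old_params, fisher, lam=1.0):
    """Standard EWC penalty summed over all named parameter arrays."""
    total = 0.0
    for name, theta in params.items():
        diff = theta - old_params[name]           # drift from the previous-task optimum
        total += float(np.sum(fisher[name] * diff ** 2))
    return 0.5 * lam * total

# Toy diagonal SSM with state size 2 (values are placeholders).
old_params = {"A": np.zeros(2), "B": np.zeros(2), "C": np.zeros(2), "D": np.zeros(1)}
params = {k: v + 1.0 for k, v in old_params.items()}       # every entry moved by 1
fisher = {k: np.full_like(v, 2.0) for k, v in old_params.items()}

penalty = ewc_penalty(params, old_params, fisher, lam=1.0)
no_drift = ewc_penalty(old_params, old_params, fisher, lam=1.0)
```

Unlike the Grassmannian term, this penalty acts on each parameter entry separately, which is the contrast the proposed baseline is meant to isolate.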
Circularity Check
No significant circularity; derivation uses standard linear algebra and geometry independent of fitted outputs
The paper derives Inf-SSM regularization by mapping SSM state evolution to the extended observability subspace on the infinite-dimensional Grassmannian, then enforcing a distance constraint that reduces to solving the Sylvester equation. This equation and its O(n^2) solver are obtained by exploiting the structured matrices of SSMs (A, B, C) rather than by fitting to CL performance metrics or redefining the target forgetting measure. No load-bearing step equates the final regularization term to a quantity defined inside the paper's own training loop or self-cited uniqueness result. The central claim therefore remains a geometric construction applied to an existing model class, with empirical results on ImageNet-R and Caltech-256 serving as external validation.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The geometry of the infinite-dimensional Grassmannian can be used to constrain the evolution of SSM states during continual learning.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Unclear relation between the paper passage and the cited Recognition theorem.
  We propose Inf-SSM, a geometry-aware regularization framework … regularizes the infinite-horizon evolution of SSMs encoded in their extended observability subspace … solving a matrix equation known as the Sylvester equation … O(n²) solution by exploiting the structure … (A,C) … chordal distance $d^2_{\mathrm{chord}}(S, S') = \lVert S S^\top - S' S'^\top \rVert_F^2$
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
  Unclear relation between the paper passage and the cited Recognition theorem.
  Theorem 3 (Invariance of the S∞ under P-equivalence) … $S_\infty(A', C') = S_\infty(A, C)$
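The chordal distance quoted in the first passage above can be tried out in finite dimensions. The sketch below is an illustration under assumptions: it treats $S$ and $S'$ as matrices with orthonormal columns (obtained here by a QR factorization) and works with ordinary finite matrices as a stand-in for the infinite-dimensional setting. The function names are hypothetical.

```python
import numpy as np

def orthonormal_basis(M):
    """Orthonormal basis for the column span of a full-column-rank matrix M."""
    Q, _ = np.linalg.qr(M)    # reduced QR: Q has the same shape as M
    return Q

def chordal_distance_sq(S, S_prime):
    """Squared chordal distance || S S^T - S' S'^T ||_F^2 between two subspaces."""
    P = S @ S.T                # orthogonal projector onto span(S)
    P_prime = S_prime @ S_prime.T
    return float(np.linalg.norm(P - P_prime, "fro") ** 2)

rng = np.random.default_rng(1)
M = rng.normal(size=(8, 3))

# Same subspace, different basis: the distance should vanish.
S = orthonormal_basis(M)
S_rebased = orthonormal_basis(M @ rng.normal(size=(3, 3)))
d_same = chordal_distance_sq(S, S_rebased)

# Mutually orthogonal 3-dimensional subspaces: the distance is 2 * 3 = 6.
E = np.eye(8)
d_orth = chordal_distance_sq(E[:, :3], E[:, 3:6])
```

Because the distance depends only on the projectors, it is unchanged by any change of basis within each subspace, which is the kind of invariance the second passage refers to.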
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] B. Afsari and R. Vidal. The alignment distance on spaces of linear dynamical systems. In 52nd IEEE Conference on Decision and Control, pages 1162–1167, 2013. doi: 10.1109/CDC.2013.6760039
- [2] H. Ahn, S. Cha, D. Lee, and T. Moon. Uncertainty-based continual learning with adaptive regularization. Advances in neural information processing systems, 32, 2019
- [3] R. Aljundi, F. Babiloni, M. Elhoseiny, M. Rohrbach, and T. Tuytelaars. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European conference on computer vision (ECCV), pages 139–154, 2018
- [4] R. H. Bartels and G. W. Stewart. Solution of the matrix equation AX + XB = C. Communications of the ACM, 15(9):820–826, 1972. doi: 10.1145/355607.362840
- [5] M. Boschini, L. Bonicelli, P. Buzzega, A. Porrello, and S. Calderara. Class-incremental continual learning into the extended der-verse. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022
- [6] Z. Chen and B. Liu. Lifelong machine learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 12(3):1–207, 2018
- [7]
- [8]
- [9] K. De Cock and B. De Moor. Subspace angles between arma models. Systems & Control Letters, 46(4):265–270, 2002
- [10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009. doi: 10.1109/CVPR.2009.5206848
- [11] P. Dhar, R. V. Singh, K.-C. Peng, Z. Wu, and R. Chellappa. Learning without memorizing. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5138–5146, 2019
- [12] I. S. Dhillon, J. R. Heath, T. Strohmer, and J. A. Tropp. Constructing packings in grassmannian manifolds via alternating projection. Experimental mathematics, 17(1):9–35, 2008
- [13] G. Doretto, A. Chiuso, Y. N. Wu, and S. Soatto. Dynamic textures. International journal of computer vision, 51:91–109, 2003
- [14] G. Golub, S. Nash, and C. Van Loan. A hessenberg-schur method for the problem ax + xb = c. IEEE Transactions on Automatic Control, 24(6):909–913, 1979. doi: 10.1109/TAC.1979.1102170
- [15] A. Gomez-Villa, D. Goswami, K. Wang, A. D. Bagdanov, B. Twardowski, and J. van de Weijer. Exemplar-free continual representation learning via learnable drift compensation. In European Conference on Computer Vision, pages 473–490. Springer, 2024
- [16] A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf. Measuring statistical dependence with hilbert-schmidt norms. In International conference on algorithmic learning theory, pages 63–77. Springer, 2005
- [17]
- [18]
- [19] A. Gu, K. Goel, A. Gupta, and C. Ré. On the parameterization and initialization of diagonal state space models. Advances in Neural Information Processing Systems, 35:35971–35983, 2022
- [20] A. Gu, K. Goel, and C. Re. Efficiently modeling long sequences with structured state spaces. In International Conference on Learning Representations, 2022
- [21]
- [22] A. Hatamizadeh and J. Kautz. Mambavision: A hybrid mamba-transformer vision backbone. arXiv preprint arXiv:2407.08083, 2024
- [23] D. Hendrycks, S. Basart, N. Mu, S. Kadavath, F. Wang, E. Dorundo, R. Desai, T. Zhu, S. Parajuli, M. Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF international conference on computer vision, pages 8340–8349, 2021
- [24] S. Hou, X. Pan, C. C. Loy, Z. Wang, and D. Lin. Learning a unified classifier incrementally via rebalancing. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 831–839, 2019
- [25] Y.-C. Hsu, Y.-C. Liu, A. Ramasamy, and Z. Kira. Re-evaluating continual learning scenarios: A categorization and case for strong baselines. arXiv preprint arXiv:1810.12488, 2018
- [26]
- [27]
- [28]
- [29] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017
- [30] S. Kornblith, M. Norouzi, H. Lee, and G. Hinton. Similarity of neural network representations revisited. In International conference on machine learning, pages 3519–3529. PMLR, 2019
- [31] A. Krizhevsky. Learning multiple layers of features from tiny images. Master's thesis, University of Toronto, 2009
- [32]
- [33]
- [34]
- [35] Y.-S. Liang and W.-J. Li. Inflora: Interference-free low-rank adaptation for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23638–23647, 2024
- [36] J. Liu, M. Liu, Z. Wang, P. An, X. Li, K. Zhou, S. Yang, R. Zhang, Y. Guo, and S. Zhang. Robomamba: Efficient vision-language-action model for robotic reasoning and manipulation. Advances in Neural Information Processing Systems, 37:40085–40110, 2024
- [37] R. Liu, B. Diao, L. Huang, Z. An, Z. An, and Y. Xu. Continual learning in the frequency domain. Advances in Neural Information Processing Systems, 37:85389–85411, 2024
- [38] Y. Liu, Y. Tian, Y. Zhao, H. Yu, L. Xie, Y. Wang, Q. Ye, J. Jiao, and Y. Liu. Vmamba: Visual state space model. Advances in neural information processing systems, 37:103031–103063, 2025
- [39] E. S. Lubana, P. Trivedi, D. Koutra, and R. Dick. How do quadratic regularizers prevent catastrophic forgetting: The role of interpolation. In Conference on Lifelong Learning Agents, pages 819–837. PMLR, 2022
- [40] S. Magistri, T. Trinci, A. Soutif-Cormerais, J. van de Weijer, and A. D. Bagdanov. Elastic feature consolidation for cold start exemplar-free incremental learning. In ICLR, 2024
- [41] L. Mahmoodi, M. Harandi, and P. Moghadam. Flashback for continual learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, pages 3434–3443, October 2023
- [42] L. Mahmoodi, P. Moghadam, M. Hayat, C. Simon, and M. Harandi. Flashbacks to harmonize stability and plasticity in continual learning. Neural Networks, 190:107616, 2025
- [43] R. J. Martin. A metric for arma processes. IEEE transactions on Signal Processing, 48(4):1164–1170, 2000
- [44] J. L. McClelland, B. L. McNaughton, and R. C. O'Reilly. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychological review, 102(3):419, 1995
- [45] M. McCloskey and N. J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation, volume 24, pages 109–165. Elsevier, 1989
- [46] Z. Meng, J. Zhang, C. Yang, Z. Zhan, P. Zhao, and Y. Wang. Diffclass: Diffusion-based class incremental learning. In European Conference on Computer Vision, pages 142–159. Springer, 2024
- [47] G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter. Continual lifelong learning with neural networks: A review. Neural networks, 113:54–71, 2019
- [48]
- [49] A. Ravichandran, R. Chaudhry, and R. Vidal. Categorizing dynamic textures using a bag of dynamical systems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(2):342–353, 2012
- [50] S.-A. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert. icarl: Incremental classifier and representation learning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 2001–2010, 2017
- [51]
- [52] H. Shin, J. K. Lee, J. Kim, and J. Kim. Continual learning with deep generative replay. Advances in neural information processing systems, 30, 2017
- [53]
- [54]
- [55] G. M. Van de Ven and A. S. Tolias. Three scenarios for continual learning. arXiv preprint arXiv:1904.07734, 2019
- [56] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017
- [57] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The caltech-ucsd birds-200-2011 dataset. Jul 2011
- [58] L. Wang, X. Zhang, H. Su, and J. Zhu. A comprehensive survey of continual learning: Theory, method and application. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8):5362–5383, 2024. doi: 10.1109/TPAMI.2024.3367329
- [59] Z. Wang, Z. Zhang, C.-Y. Lee, H. Zhang, R. Sun, X. Ren, G. Su, V. Perot, J. Dy, and T. Pfister. Learning to prompt for continual learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 139–149, 2022
- [60] K. Ye and L.-H. Lim. Schubert varieties and distances between subspaces of different dimensions. SIAM Journal on Matrix Analysis and Applications, 37(3):1176–1197, 2016
- [61]
- [62]
- [63] C. Zhao and D. Gong. Learning mamba as a continual learner. CoRR, abs/2412.00776, 2024
- [64] K. Zhu, W. Zhai, Y. Cao, J. Luo, and Z.-J. Zha. Self-sustaining representation expansion for non-exemplar class-incremental learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9296–9305, 2022
- [65] L. Zhu, B. Liao, Q. Zhang, X. Wang, W. Liu, and X. Wang. Vision mamba: Efficient visual representation learning with bidirectional state space model. In Forty-first International Conference on Machine Learning, 2024
- Matrix multiplication $\tilde{C}^\top \tilde{C}'$
- Matrix multiplication $\tilde{A}_{\mathrm{diag}} \tilde{A}'^{\top}_{\mathrm{diag}}$
- Subtraction $\mathbf{1}_n - \tilde{A}_{\mathrm{diag}} \tilde{A}'^{\top}_{\mathrm{diag}}$
- Element-wise division of $\tilde{C}^\top \tilde{C}'$ by $\mathbf{1}_n - \tilde{A}_{\mathrm{diag}} \tilde{A}'^{\top}_{\mathrm{diag}}$
  Note that $\odot$ and the element-wise reciprocal could be combined into a single step by simply taking the element-wise division of $\tilde{C}^\top \tilde{C}'$ by $\mathbf{1}_n - \tilde{A}_{\mathrm{diag}} \tilde{A}'^{\top}_{\mathrm{diag}}$. Thus, each outlined step has a FLOP count of $n^2$, and hence the total FLOP count is $4n^2$, with computational complexity $\mathcal{O}(n^2)$. Hence, this reduces ...
- Note for readability, $B$ refers to $\tilde{B}$ derived in §D.3.
  H.5 Vim-tiny
  To validate that Inf-SSM is adaptable to different model sizes, especially small models, we validated Inf-SSM along with the EFCIL baselines EWC, SI, MAS, and LwF on Vim-tiny. Vim-tiny has only 7M parameters, compared with 26M for Vim-small [65]. Table 10: AA (%↑), AIA (%↑), and FM (%↓) of...
- Represent the main CL paradigms used in CIL or EFCIL
- Can be adapted to the Vim backbone without ad hoc, architecture-specific redesign.
  We therefore adopt the following baselines, all implemented on Vim-Small under a shared training protocol (datasets, task splits, and metrics are described in §5, §G.2, and §J.2). For replay-based methods: • ER [51] is a canonical replay-based baseline. It measures how much ...