Looped SSMs: Depth-Recurrence and Input Reshaping for Time Series Classification
Pith reviewed 2026-05-20 20:55 UTC · model grok-4.3
The pith
Looping the same SSM block across layers matches or beats expanded models with independent parameters per layer on time series tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A looped SSM with k parameters iterated L times consistently matches closely or outperforms a standard SSM with k · L independent parameters across four architectures and six benchmarks, despite a strictly smaller hypothesis space. The paper attributes the advantage to parameter sharing across depth, which acts as a beneficial inductive bias that simplifies optimization, independent of the models' inherent sequence recurrence.
What carries the argument
Depth-recurrence via looping, in which the identical SSM block is applied repeatedly across layers to enforce parameter sharing.
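To make the looped-versus-standard distinction concrete, here is a minimal sketch in plain Python. It assumes a toy diagonal linear block with a residual skip; this block is a stand-in and is not an implementation of LRU, S5, LinOSS, or LrcSSM. All names (`make_block`, `apply_block`, `looped_forward`, `standard_forward`, `count_params`) are hypothetical and chosen only for illustration.

```python
import random

def make_block(dim, rng):
    """One toy diagonal SSM block: per-channel decay a, input gain b, readout c."""
    return {
        "a": [rng.uniform(0.5, 0.95) for _ in range(dim)],
        "b": [rng.uniform(-1.0, 1.0) for _ in range(dim)],
        "c": [rng.uniform(-1.0, 1.0) for _ in range(dim)],
    }

def apply_block(block, seq):
    """Sequence recurrence h_t = a*h_{t-1} + b*u_t with output y_t = c*h_t + u_t."""
    dim = len(block["a"])
    h = [0.0] * dim
    out = []
    for u in seq:
        h = [block["a"][i] * h[i] + block["b"][i] * u[i] for i in range(dim)]
        out.append([block["c"][i] * h[i] + u[i] for i in range(dim)])
    return out

def looped_forward(block, seq, depth):
    """Depth-recurrence: the same block is applied `depth` times."""
    for _ in range(depth):
        seq = apply_block(block, seq)
    return seq

def standard_forward(blocks, seq):
    """Standard stack: each layer has its own independent block."""
    for block in blocks:
        seq = apply_block(block, seq)
    return seq

def count_params(blocks):
    return sum(len(values) for block in blocks for values in block.values())

rng = random.Random(0)
dim, depth, length = 4, 3, 10
x = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(length)]

shared = make_block(dim, rng)                               # k parameters
independent = [make_block(dim, rng) for _ in range(depth)]  # k * L parameters

y_looped = looped_forward(shared, x, depth)
y_standard = standard_forward(independent, x)
print(count_params([shared]), count_params(independent))  # prints: 12 36
```

Both forward passes do the same amount of computation per input; only the number of distinct parameters differs (k versus k · L).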
If this is right
- Looped models can reach comparable accuracy with far fewer total parameters.
- Depth-recurrence supplies benefits orthogonal to the sequence recurrence already present in SSMs.
- Input reshaping by timestep concatenation or feature-time flattening produces consistent accuracy lifts.
- The two techniques can be stacked for additive gains.
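The two input-reshaping operations named in the list above could look roughly like the sketch below. The summary does not specify group size, element ordering, or how leftover timesteps are handled, so those choices (time-major ordering, dropping the remainder) are assumptions, and the function names are hypothetical.

```python
def concat_timesteps(seq, group):
    """Merge `group` consecutive timesteps into one wider timestep.

    (T, d) -> (T // group, group * d). Trailing timesteps that do not
    fill a whole group are dropped (an assumption of this sketch).
    """
    usable = len(seq) - len(seq) % group
    return [
        [value for step in seq[start:start + group] for value in step]
        for start in range(0, usable, group)
    ]

def flatten_and_rechunk(seq, chunk):
    """Flatten (T, d) in time-major order, then cut into rows of width `chunk`.

    Trailing values that do not fill a whole chunk are dropped
    (an assumption of this sketch).
    """
    flat = [value for step in seq for value in step]
    usable = len(flat) - len(flat) % chunk
    return [flat[start:start + chunk] for start in range(0, usable, chunk)]

# Low-dimensional input: T=6 timesteps, d=2 features.
low_dim = [[t, t + 0.5] for t in range(6)]
print(concat_timesteps(low_dim, 3))
# [[0, 0.5, 1, 1.5, 2, 2.5], [3, 3.5, 4, 4.5, 5, 5.5]]

# High-dimensional input: T=2 timesteps, d=6 features.
high_dim = [[10 * t + j for j in range(6)] for t in range(2)]
print(flatten_and_rechunk(high_dim, 4))
# [[0, 1, 2, 3], [4, 5, 10, 11], [12, 13, 14, 15]]
```

Concatenation shortens the sequence while widening each step; rechunking changes both the sequence length and the per-step width, so a chunk can straddle two original timesteps.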
Where Pith is reading between the lines
- The same sharing pattern could be tested in other recurrent or attention-based sequence models to reduce parameter count.
- Optimal loop depth may vary with input dimensionality or task length.
- Hybrid designs that share parameters only in early or late layers might offer further efficiency.
Load-bearing premise
The performance advantage comes specifically from the inductive bias created by sharing parameters across depth rather than from training dynamics or the particular choice of benchmarks.
What would settle it
A controlled experiment in which looped and expanded models are trained to convergence on the same data with identical random seeds and the looped version still underperforms.
Original abstract
State Space Models (SSMs) are inherently recurrent along the sequence dimension, yet depth-recurrence - reusing the same block repeatedly across layers, as recently applied in looped transformers - has not been explored in this model family. We show that a looped SSM with $k$ parameters iterated $L$ times consistently closely matches or outperforms a standard SSM with $k \cdot L$ independent parameters across four architectures (LRU, S5, LinOSS, LrcSSM) and six time series classification benchmarks, despite operating within a strictly smaller hypothesis space, as we formally establish. Since the larger model contains the looped model as a special case, this dominance cannot be explained by expressivity and instead points to parameter sharing across depth as a beneficial inductive bias that simplifies optimization. These results demonstrate that depth-recurrence is orthogonal to sequence-recurrence and independently beneficial. We further show that input reshaping is an equally neglected design axis: concatenating timesteps for low-dimensional inputs, or flattening and rechunking the joint feature-time dimension for high-dimensional ones, yields accuracy gains of 1-6% across all models, confirmed over 5 random seeds. Both techniques provide standalone improvements that compound when combined, suggesting that depth and input reshaping are two independent and underexplored design axes for SSMs on time series.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces depth-recurrence to State Space Models by reusing the same SSM block (k parameters) across L layers and demonstrates that this looped variant consistently matches or outperforms a standard SSM with k·L independent parameters on six time series classification benchmarks across four architectures (LRU, S5, LinOSS, LrcSSM). It formally notes that the larger model contains the looped case as a special case, attributing the result to parameter sharing as an inductive bias that aids optimization rather than increased expressivity. The work additionally proposes input reshaping (concatenation for low-dimensional inputs or flattening/rechunking for high-dimensional ones) that yields 1-6% accuracy gains, with both techniques compounding when combined.
Significance. If the central empirical claim holds after controlling for training dynamics, the result would establish depth-recurrence as an orthogonal and beneficial design axis for SSMs, separate from their inherent sequence-recurrence. This could inform more efficient scaling of deep SSM architectures for time series tasks by leveraging parameter sharing to simplify optimization landscapes. The multi-architecture, multi-benchmark scope and the explicit smaller-hypothesis-space argument provide a solid foundation for follow-up work on inductive biases in recurrent sequence models.
major comments (2)
- [§4] §4 (Experiments): The looped-vs-standard comparison does not report independent hyperparameter tuning (learning rate, optimizer settings, epochs, or initialization) for the k·L parameter models. If identical regimes were used for both, differences in gradient flow through L independent layers versus looped recurrence could explain the observed performance without isolating the claimed inductive bias of parameter sharing.
- [§3.2] §3.2 and Table 2: The formal argument that the standard model contains the looped model as a special case is used to rule out expressivity, yet the empirical results would be strengthened by an ablation that explicitly initializes the standard model to recover the looped weights and verifies that optimization still diverges under the same training protocol.
minor comments (2)
- [Abstract] The abstract states results for input reshaping are 'confirmed over 5 random seeds' but does not clarify whether the main looped-vs-standard tables also use multiple seeds or report variance; adding this detail would improve reproducibility.
- [§3.1] Notation for the looped iteration (e.g., how the hidden state is passed between iterations of the same block) could be clarified with a small diagram or explicit recurrence equation in §3.1.
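For illustration, one explicit recurrence of the kind the last minor comment asks for is written out below. This is an assumed reading, not the paper's notation: it supposes that the hidden state is reset at the start of every depth iteration and that only the output sequence is passed forward, which is precisely the detail the comment says is unclear. The symbols A, B, C, σ, h, and x are placeholders.

```latex
% Assumed notation, not taken from the paper.
% x^{(0)}_t is the input sequence; the same (A, B, C) is reused at every depth iteration \ell.
\begin{aligned}
h^{(\ell)}_t &= A\,h^{(\ell)}_{t-1} + B\,x^{(\ell-1)}_t, \qquad h^{(\ell)}_0 = 0,\\
x^{(\ell)}_t &= x^{(\ell-1)}_t + \sigma\!\left(C\,h^{(\ell)}_t\right), \qquad \ell = 1,\dots,L,\quad t = 1,\dots,T.
\end{aligned}
```

Under this reading the index t carries the sequence recurrence and the index ℓ carries the depth recurrence; a standard (non-looped) model would replace (A, B, C) with layer-specific parameters indexed by ℓ.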
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications on our experimental protocol and theoretical claims, and we outline the revisions we will make to the manuscript.
-
Referee: [§4] §4 (Experiments): The looped-vs-standard comparison does not report independent hyperparameter tuning (learning rate, optimizer settings, epochs, or initialization) for the k·L parameter models. If identical regimes were used for both, differences in gradient flow through L independent layers versus looped recurrence could explain the observed performance without isolating the claimed inductive bias of parameter sharing.
Authors: We appreciate this point on experimental controls. Our comparisons deliberately employed identical hyperparameter regimes (including the same learning rate, optimizer, epoch count, and initialization) for both looped and standard models to ensure a fair, controlled evaluation under the same training dynamics. This design choice allows us to attribute performance differences to the inductive bias of parameter sharing rather than to separately optimized training for the larger model. Gradient flow distinctions are a direct consequence of the architectural choice and thus part of the optimization benefit we claim. In the revision we will explicitly document this identical-regime protocol in §4 and add a brief discussion of how it supports isolating the inductive bias effect.
revision: yes
-
Referee: [§3.2] §3.2 and Table 2: The formal argument that the standard model contains the looped model as a special case is used to rule out expressivity, yet the empirical results would be strengthened by an ablation that explicitly initializes the standard model to recover the looped weights and verifies that optimization still diverges under the same training protocol.
Authors: The formal argument in §3.2 shows that any looped weight configuration is realizable inside the standard model by repeating parameters across layers, thereby ruling out greater expressivity as an explanation for the observed advantage. We agree that an initialization ablation (setting the standard model’s layers to identical looped weights and checking whether optimization diverges from that point) would provide further insight into the optimization landscape. Our current experiments use standard random initialization for both models, which is the conventional protocol and already demonstrates the practical benefit of depth-recurrence. We will add a paragraph in the revised §3.2 discussing this ablation as a valuable direction for future work while noting that the existing random-initialization results suffice to support our claims.
revision: partial
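The starting point of the initialization ablation discussed in this exchange can be sketched as follows. The sketch assumes a toy diagonal block rather than any of the paper's architectures, all function names are hypothetical, and the single hand-made perturbation is only a placeholder for an untied training update; no training is performed.

```python
import copy
import random

def make_block(dim, rng):
    """Toy diagonal block: per-channel decay a and input gain b."""
    return {"a": [rng.uniform(0.5, 0.95) for _ in range(dim)],
            "b": [rng.uniform(-1.0, 1.0) for _ in range(dim)]}

def apply_block(block, seq):
    """h_t = a*h_{t-1} + b*u_t, output y_t = h_t + u_t (residual skip)."""
    h = [0.0] * len(block["a"])
    out = []
    for u in seq:
        h = [a * hi + b * ui for a, b, hi, ui in zip(block["a"], block["b"], h, u)]
        out.append([hi + ui for hi, ui in zip(h, u)])
    return out

def forward(blocks, seq):
    for block in blocks:
        seq = apply_block(block, seq)
    return seq

def tied_standard_init(shared, depth):
    """Give each layer of the standard model its own copy of the looped weights."""
    return [copy.deepcopy(shared) for _ in range(depth)]

def layer_spread(blocks):
    """Largest absolute gap between any layer's parameters and layer 0's."""
    ref = blocks[0]
    return max(abs(p - q) for blk in blocks for name in ref
               for p, q in zip(blk[name], ref[name]))

rng = random.Random(0)
dim, depth = 3, 4
x = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(8)]
shared = make_block(dim, rng)
standard = tied_standard_init(shared, depth)

# At initialization the standard model computes exactly the looped function.
same_function = forward(standard, x) == forward([shared] * depth, x)
spread_at_init = layer_spread(standard)

# Placeholder for one untied update: only layer 2 changes, the others do not.
standard[2]["a"][0] += 0.01
spread_after = layer_spread(standard)
print(same_function, spread_at_init, spread_after > 0.0)  # prints: True 0.0 True
```

In a real ablation, `layer_spread` (or a similar measure) would be tracked over training to see whether the untied layers drift apart and whether accuracy changes as they do.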
Circularity Check
Empirical comparisons with set-inclusion argument show no circularity
The paper's core claims rest on direct experimental comparisons of looped SSMs (k parameters, L iterations) versus standard SSMs (k·L independent parameters) across four architectures and six benchmarks. The statement that the larger model contains the looped model as a special case is a standard expressivity argument establishing that performance gains cannot be attributed to greater hypothesis-space size; this is a logical inclusion, not a self-referential definition or fitted parameter renamed as prediction. No equations, derivations, or predictions are shown to reduce to their own inputs by construction. No load-bearing self-citations or ansatzes imported from prior author work appear in the provided claims. The results are therefore self-contained against external benchmarks and receive a score of 0.
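The set-inclusion step referred to here can be written out as below. The notation is assumed for illustration (f_θ for one block with parameters θ in a parameter set Θ, and a generic training loss) and is not taken from the paper. Only the inclusion is shown; the strictness that the abstract says is formally established is not derived.

```latex
% Assumed notation, not taken from the paper.
\mathcal{H}_{\text{loop}}
  = \bigl\{\, \underbrace{f_{\theta} \circ \cdots \circ f_{\theta}}_{L \text{ times}} : \theta \in \Theta \,\bigr\}
  \;\subseteq\;
\mathcal{H}_{\text{std}}
  = \bigl\{\, f_{\theta_L} \circ \cdots \circ f_{\theta_1} : \theta_1, \dots, \theta_L \in \Theta \,\bigr\},
\qquad \text{by choosing } \theta_1 = \cdots = \theta_L = \theta .

% Consequence for the best achievable training loss:
\inf_{g \in \mathcal{H}_{\text{std}}} \mathcal{L}_{\text{train}}(g)
  \;\le\;
\inf_{g \in \mathcal{H}_{\text{loop}}} \mathcal{L}_{\text{train}}(g).
```

The inequality concerns what each model class could reach in principle. It says nothing about what an optimizer actually finds or about test accuracy, which is why a looped model outperforming the standard one has to be explained by optimization or generalization rather than by expressivity.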
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean (LogicNat recovery and embed_strictMono)
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: We formally establish that the latter contains the former as a special case... this dominance cannot be explained by expressivity and instead points to parameter sharing across depth as a beneficial inductive bias
- IndisputableMonolith/Foundation/ArrowOfTime.lean (forward_accumulates / z_monotone_absolute)
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: depth-recurrence is precisely orthogonal to sequence-recurrence
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.