pith. sign in

arxiv: 2605.16048 · v1 · pith:KB6AIX3Lnew · submitted 2026-05-15 · 💻 cs.LG · cs.AI

Looped SSMs: Depth-Recurrence and Input Reshaping for Time Series Classification

Pith reviewed 2026-05-20 20:55 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords state space modelstime series classificationdepth recurrenceparameter sharinginductive biasinput reshaping
0
0 comments X

The pith

Looping the same SSM block across layers matches or beats expanded models with independent parameters per layer on time series tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a state space model block reused L times with only k parameters performs as well as or better than a version using k times L separate parameters. This holds for four different SSM architectures on six classification benchmarks. Because the expanded model can always simulate the looped one by setting all copies equal, the result cannot come from greater expressive power. Instead the shared parameters appear to act as an inductive bias that makes training easier. Input reshaping adds separate gains of one to six percent that combine with the looping effect.

Core claim

A looped SSM with k parameters iterated L times consistently closely matches or outperforms a standard SSM with k · L independent parameters across four architectures and six benchmarks, despite a strictly smaller hypothesis space. The advantage arises because parameter sharing across depth functions as a beneficial inductive bias that simplifies optimization, independent of the models' inherent sequence recurrence.

What carries the argument

Depth-recurrence via looping, in which the identical SSM block is applied repeatedly across layers to enforce parameter sharing.

If this is right

  • Looped models can reach comparable accuracy with far fewer total parameters.
  • Depth-recurrence supplies benefits orthogonal to the sequence recurrence already present in SSMs.
  • Input reshaping by timestep concatenation or feature-time flattening produces consistent accuracy lifts.
  • The two techniques can be stacked for additive gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same sharing pattern could be tested in other recurrent or attention-based sequence models to reduce parameter count.
  • Optimal loop depth may vary with input dimensionality or task length.
  • Hybrid designs that share parameters only in early or late layers might offer further efficiency.

Load-bearing premise

The performance advantage comes specifically from the inductive bias created by sharing parameters across depth rather than from training dynamics or the particular choice of benchmarks.

What would settle it

A controlled experiment in which looped and expanded models are trained to convergence on the same data with identical random seeds and the looped version still underperforms.

Figures

Figures reproduced from arXiv: 2605.16048 by Daniela Rus, M\'onika Farsang, Radu Grosu, Ramin Hasani.

Figure 1
Figure 1. Figure 1: Left: Architecture setup of six independent unique layers [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Illustration of the two input reshaping strategies. A concentration factor [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
read the original abstract

State Space Models (SSMs) are inherently recurrent along the sequence dimension, yet depth-recurrence - reusing the same block repeatedly across layers, as recently applied in looped transformers - has not been explored in this model family. We show that a looped SSM with $k$ parameters iterated $L$ times consistently closely matches or outperforms a standard SSM with $k \cdot L$ independent parameters across four architectures (LRU, S5, LinOSS, LrcSSM) and six time series classification benchmarks, despite operating within a strictly smaller hypothesis space, as we formally establish. Since the larger model contains the looped model as a special case, this dominance cannot be explained by expressivity and instead points to parameter sharing across depth as a beneficial inductive bias that simplifies optimization. These results demonstrate that depth-recurrence is orthogonal to sequence-recurrence and independently beneficial. We further show that input reshaping is an equally neglected design axis: concatenating timesteps for low-dimensional inputs, or flattening and rechunking the joint feature-time dimension for high-dimensional ones, yields accuracy gains of 1-6% across all models, confirmed over 5 random seeds. Both techniques provide standalone improvements that compound when combined, suggesting that depth and input reshaping are two independent and underexplored design axes for SSMs on time series.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces depth-recurrence to State Space Models by reusing the same SSM block (k parameters) across L layers and demonstrates that this looped variant consistently matches or outperforms a standard SSM with k·L independent parameters on six time series classification benchmarks across four architectures (LRU, S5, LinOSS, LrcSSM). It formally notes that the larger model contains the looped case as a special case, attributing the result to parameter sharing as an inductive bias that aids optimization rather than increased expressivity. The work additionally proposes input reshaping (concatenation for low-dimensional inputs or flattening/rechunking for high-dimensional ones) that yields 1-6% accuracy gains, with both techniques compounding when combined.

Significance. If the central empirical claim holds after controlling for training dynamics, the result would establish depth-recurrence as an orthogonal and beneficial design axis for SSMs, separate from their inherent sequence-recurrence. This could inform more efficient scaling of deep SSM architectures for time series tasks by leveraging parameter sharing to simplify optimization landscapes. The multi-architecture, multi-benchmark scope and the explicit smaller-hypothesis-space argument provide a solid foundation for follow-up work on inductive biases in recurrent sequence models.

major comments (2)
  1. [§4] §4 (Experiments): The looped-vs-standard comparison does not report independent hyperparameter tuning (learning rate, optimizer settings, epochs, or initialization) for the k·L parameter models. If identical regimes were used for both, differences in gradient flow through L independent layers versus looped recurrence could explain the observed performance without isolating the claimed inductive bias of parameter sharing.
  2. [§3.2] §3.2 and Table 2: The formal argument that the standard model contains the looped model as a special case is used to rule out expressivity, yet the empirical results would be strengthened by an ablation that explicitly initializes the standard model to recover the looped weights and verifies that optimization still diverges under the same training protocol.
minor comments (2)
  1. [Abstract] The abstract states results for input reshaping are 'confirmed over 5 random seeds' but does not clarify whether the main looped-vs-standard tables also use multiple seeds or report variance; adding this detail would improve reproducibility.
  2. [§3.1] Notation for the looped iteration (e.g., how the hidden state is passed between iterations of the same block) could be clarified with a small diagram or explicit recurrence equation in §3.1.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications on our experimental protocol and theoretical claims, and we outline the revisions we will make to the manuscript.

read point-by-point responses
  1. Referee: [§4] §4 (Experiments): The looped-vs-standard comparison does not report independent hyperparameter tuning (learning rate, optimizer settings, epochs, or initialization) for the k·L parameter models. If identical regimes were used for both, differences in gradient flow through L independent layers versus looped recurrence could explain the observed performance without isolating the claimed inductive bias of parameter sharing.

    Authors: We appreciate this point on experimental controls. Our comparisons deliberately employed identical hyperparameter regimes—including the same learning rate, optimizer, epoch count, and initialization—for both looped and standard models to ensure a fair, controlled evaluation under the same training dynamics. This design choice allows us to attribute performance differences to the inductive bias of parameter sharing rather than to separately optimized training for the larger model. Gradient flow distinctions are a direct consequence of the architectural choice and thus part of the optimization benefit we claim. In the revision we will explicitly document this identical-regime protocol in §4 and add a brief discussion of how it supports isolating the inductive bias effect. revision: yes

  2. Referee: [§3.2] §3.2 and Table 2: The formal argument that the standard model contains the looped model as a special case is used to rule out expressivity, yet the empirical results would be strengthened by an ablation that explicitly initializes the standard model to recover the looped weights and verifies that optimization still diverges under the same training protocol.

    Authors: The formal argument in §3.2 shows that any looped weight configuration is realizable inside the standard model by repeating parameters across layers, thereby ruling out greater expressivity as an explanation for the observed advantage. We agree that an initialization ablation—setting the standard model’s layers to identical looped weights and checking whether optimization diverges from that point—would provide further insight into the optimization landscape. Our current experiments use standard random initialization for both models, which is the conventional protocol and already demonstrates the practical benefit of depth-recurrence. We will add a paragraph in the revised §3.2 discussing this ablation as a valuable direction for future work while noting that the existing random-initialization results suffice to support our claims. revision: partial

Circularity Check

0 steps flagged

Empirical comparisons with set-inclusion argument show no circularity

full rationale

The paper's core claims rest on direct experimental comparisons of looped SSMs (k parameters, L iterations) versus standard SSMs (k·L independent parameters) across four architectures and six benchmarks. The statement that the larger model contains the looped model as a special case is a standard expressivity argument establishing that performance gains cannot be attributed to greater hypothesis-space size; this is a logical inclusion, not a self-referential definition or fitted parameter renamed as prediction. No equations, derivations, or predictions are shown to reduce to their own inputs by construction. No load-bearing self-citations or ansatzes imported from prior author work appear in the provided claims. The results are therefore self-contained against external benchmarks and receive a score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on the abstract alone, the work relies on standard machine-learning assumptions such as random initialization and gradient-based optimization; no explicit free parameters, ad-hoc axioms, or new invented entities are introduced or detailed.

pith-pipeline@v0.9.0 · 5779 in / 1044 out tokens · 40739 ms · 2026-05-20T20:55:30.353721+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 5 internal anchors

  1. [1]

    The UEA multivariate time series classification archive, 2018

    Anthony Bagnall, Hoang Anh Dau, Jason Lines, Michael Flynn, James Large, Aaron Bostrom, Paul Southam, and Eamonn Keogh. The uea multivariate time series classification archive, 2018. arXiv preprint arXiv:1811.00075, 2018

  2. [2]

    Flexivit: One model for all patch sizes

    Lucas Beyer, Pavel Izmailov, Alexander Kolesnikov, Mathilde Caron, Simon Kornblith, Xiaohua Zhai, Matthias Minderer, Michael Tschannen, Ibrahim Alabdulmohsin, and Filip Pavetic. Flexivit: One model for all patch sizes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14496--14506, 2023

  3. [3]

    A spelling device for the paralysed

    Niels Birbaumer, Nimr Ghanayim, Thilo Hinterberger, Iver Iversen, Boris Kotchoubey, Andrea K \"u bler, Juri Perelmouter, Edward Taub, and Herta Flor. A spelling device for the paralysed. Nature, 398 0 (6725): 0 297--298, 1999

  4. [4]

    A Mechanistic Analysis of Looped Reasoning Language Models

    Hugh Blayney, \'A lvaro Arroyo, Johan Obando-Ceron, Pablo Samuel Castro, Aaron Courville, Michael M Bronstein, and Xiaowen Dong. A mechanistic analysis of looped reasoning language models. arXiv preprint arXiv:2604.11791, 2026

  5. [5]

    Learning to dissipate energy in oscillatory state-space models

    Jared Boyer, T Konstantin Rusch, and Daniela Rus. Learning to dissipate energy in oscillatory state-space models. arXiv preprint arXiv:2505.12171, 2025

  6. [6]

    Universal transformers

    Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. Universal transformers. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=HyzdRiR9Y7

  7. [7]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021. URL https...

  8. [8]

    Parallelization of non-linear state-space models: Scaling up liquid-resistance liquid-capacitance networks for efficient sequence modeling

    M \'o nika Farsang and Radu Grosu. Parallelization of non-linear state-space models: Scaling up liquid-resistance liquid-capacitance networks for efficient sequence modeling. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=ENYvdnyhLl

  9. [9]

    Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein

    Jonas Geiping, Sean Michael McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R. Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein. Scaling up test-time compute with latent reasoning: A recurrent depth approach. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum...

  10. [10]

    Physiobank, physiotoolkit, and physionet: components of a new research resource for complex physiologic signals

    Ary L Goldberger, Luis AN Amaral, Leon Glass, Jeffrey M Hausdorff, Plamen Ch Ivanov, Roger G Mark, Joseph E Mietus, George B Moody, Chung-Kang Peng, and H Eugene Stanley. Physiobank, physiotoolkit, and physionet: components of a new research resource for complex physiologic signals. circulation, 101 0 (23): 0 e215--e220, 2000

  11. [11]

    Mamba: Linear-time sequence modeling with selective state spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=tEYskw1VY2

  12. [12]

    Efficiently modeling long sequences with structured state spaces

    Albert Gu, Karan Goel, and Christopher Re. Efficiently modeling long sequences with structured state spaces. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=uYLFoz1vlAC

  13. [13]

    The capacity and robustness trade-off: Revisiting the channel independent strategy for multivariate time series forecasting

    Lu Han, Han-Jia Ye, and De-Chuan Zhan. The capacity and robustness trade-off: Revisiting the channel independent strategy for multivariate time series forecasting. IEEE Transactions on Knowledge and Data Engineering, 36 0 (11): 0 7129--7142, 2024

  14. [14]

    Liquid structural state-space models

    Ramin Hasani, Mathias Lechner, Tsun-Hsuan Wang, Makram Chahine, Alexander Amini, and Daniela Rus. Liquid structural state-space models. In The Eleventh International Conference on Learning Representations, 2023

  15. [15]

    Anderson Keller, Carmen Amo Alonso, Terrence Sejnowski, and Hava T Siegelmann

    Arjun Karuvally, Franz Nowak, T. Anderson Keller, Carmen Amo Alonso, Terrence Sejnowski, and Hava T Siegelmann. Bridging expressivity and scalability with adaptive unitary SSM s. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=s4zitEu2R8

  16. [16]

    o der, N Hill, Wolfgang Rosenstiel, Christian Elger, Niels Birbaumer, and Bernhard Sch \

    Thomas Lal, Thilo Hinterberger, Guido Widman, Michael Schr \"o der, N Hill, Wolfgang Rosenstiel, Christian Elger, Niels Birbaumer, and Bernhard Sch \"o lkopf. Methods towards invasive human brain computer interfaces. Advances in neural information processing systems, 17, 2004

  17. [17]

    Detecting forged alcohol non-invasively through vibrational spectroscopy and machine learning

    James Large, E Kate Kemsley, Nikolaus Wellner, Ian Goodall, and Anthony Bagnall. Detecting forged alcohol non-invasively through vibrational spectroscopy and machine learning. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 298--309. Springer, 2018

  18. [18]

    Swin transformer: Hierarchical vision transformer using shifted windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012--10022, 2021

  19. [19]

    Rough transformers: Lightweight and continuous time series modelling through signature patching

    Fernando Moreno-Pino, \'A lvaro Arroyo, Harrison Waldon, Xiaowen Dong, and \'A lvaro Cartea. Rough transformers: Lightweight and continuous time series modelling through signature patching. Advances in Neural Information Processing Systems, 37: 0 106264--106294, 2024

  20. [20]

    A Time Series is Worth 64 Words: Long-term Forecasting with Transformers

    Yuqi Nie, Nam H Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. A time series is worth 64 words: Long-term forecasting with transformers. arXiv preprint arXiv:2211.14730, 2022

  21. [21]

    Weight-space linear recurrent neural networks

    Roussel Desmond Nzoyem, Nawid Keshtmand, Enrique Crespo Fernandez, Idriss Tsayem, Raul Santos-Rodriguez, David AW Barton, and Tom Deakin. Weight-space linear recurrent neural networks. arXiv preprint arXiv:2506.01153, 2025

  22. [22]

    Resurrecting recurrent neural networks for long sequences

    Antonio Orvieto, Samuel L Smith, Albert Gu, Anushan Fernando, Caglar Gulcehre, Razvan Pascanu, and Soham De. Resurrecting recurrent neural networks for long sequences. In International Conference on Machine Learning, pages 26670--26698. PMLR, 2023

  23. [23]

    Two-scale latent dynamics for recurrent-depth transformers.CoRR, abs/2509.23314, 2025

    Francesco Pappone, Donato Crisostomi, and Emanuele Rodol \`a . Two-scale latent dynamics for recurrent-depth transformers. arXiv preprint arXiv:2509.23314, 2025

  24. [24]

    Learning long range dependencies through time reversal symmetry breaking

    Guillaume Pourcel and Maxence Ernoult. Learning long range dependencies through time reversal symmetry breaking. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=w1ihNiIBOc

  25. [25]

    Parcae: Scaling Laws For Stable Looped Language Models

    Hayden Prairie, Zachary Novack, Taylor Berg-Kirkpatrick, and Daniel Y Fu. Parcae: Scaling laws for stable looped language models. arXiv preprint arXiv:2604.12946, 2026

  26. [26]

    Oscillatory state-space models

    T Konstantin Rusch and Daniela Rus. Oscillatory state-space models. In The Thirteenth International Conference on Learning Representations, 2025

  27. [27]

    Nikunj Saunshi, Nishanth Dikkala, Zhiyuan Li, Sanjiv Kumar, and Sashank J. Reddi. Reasoning with latent thoughts: On the power of looped transformers. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=din0lGfZFd

  28. [28]

    Simplified state space layers for sequence modeling

    Jimmy TH Smith, Andrew Warrington, and Scott W Linderman. Simplified state space layers for sequence modeling. In ICLR, 2023

  29. [29]

    Log neural controlled differential equations: The lie brackets make a difference

    Benjamin Walker, Andrew Donald McLeod, Tiexin Qin, Yichuan Cheng, Haoliang Li, and Terry Lyons. Log neural controlled differential equations: The lie brackets make a difference. In Forty-first International Conference on Machine Learning, 2024 a . URL https://openreview.net/forum?id=0tYrMtQyPT

  30. [30]

    Log neural controlled differential equations: The lie brackets make a difference

    Benjamin Walker, Andrew Donald McLeod, Tiexin Qin, Yichuan Cheng, Haoliang Li, and Terry Lyons. Log neural controlled differential equations: The lie brackets make a difference. In Forty-first International Conference on Machine Learning, 2024 b

  31. [31]

    A database of caenorhabditis elegans behavioral phenotypes

    Eviatar Yemini, Tadas Jucikas, Laura J Grundy, Andr \'e EX Brown, and William R Schafer. A database of caenorhabditis elegans behavioral phenotypes. Nature methods, 10 0 (9): 0 877--879, 2013

  32. [32]

    Scaling Latent Reasoning via Looped Language Models

    Rui-Jie Zhu, Zixuan Wang, Kai Hua, Tianyu Zhang, Ziniu Li, Haoran Que, Boyi Wei, Zixin Wen, Fan Yin, He Xing, et al. Scaling latent reasoning via looped language models. arXiv preprint arXiv:2510.25741, 2025