pith. sign in

arxiv: 2605.19150 · v1 · pith:XTCPMDZCnew · submitted 2026-05-18 · 💻 cs.LG · cs.AI

Flash PD-SSM: Memory-Optimized Structured Sparse State-Space Models

Pith reviewed 2026-05-20 11:44 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords state-space modelsstructured sparse matricesfinite-state automatasequence modelingefficient transformerslong-context modelingmultivariate time serieshybrid language models
0
0 comments X

The pith

Flash PD-SSM achieves unstructured-matrix expressivity in state-space models by selecting one structured sparse transition matrix at each time step.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

State-space models trade off efficiency against the ability to model arbitrary state transitions. Most structured forms run fast but cannot represent the full range of finite-state automaton behavior that unstructured matrices can. Flash PD-SSM keeps a small trainable bank of structured sparse matrices and switches among them discretely per time step. This design preserves the memory and speed advantages of sparsity while recovering the theoretical expressivity of dense matrices. Experiments confirm the expressivity gain on synthetic tracking tasks, set new accuracy records on long multivariate sequences, and improve hybrid language-model performance with lower memory footprint.

Core claim

Flash PD-SSM maintains a trainable set of structured sparse matrices, a single one of which is discretely selected at each time-step, enabling FSA expressiveness at the level of unstructured matrices while maintaining the efficiency required for training models at scale.

What carries the argument

Discrete per-time-step selection from a trainable bank of structured sparse transition matrices that approximates dense-matrix expressivity without dense storage or compute.

If this is right

  • On synthetic mechanistic and state-tracking tasks the model realizes its claimed finite-state-automaton expressivity.
  • On multivariate time-series sequences longer than 17,000 steps it sets new state-of-the-art accuracy among competing structured SSMs.
  • As a drop-in replacement inside hybrid language models it improves both natural-language state tracking and standard language-modeling benchmarks.
  • It delivers higher throughput and lower memory consumption than the structured SSMs currently used in frontier language models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same bank-and-select pattern could be applied to other linear recurrent layers to improve their expressivity without quadratic cost.
  • Hardware kernels that fuse the selection step with the sparse matrix-vector product would further reduce the already small overhead.
  • Because selection is discrete, gradient flow through the choice may require straight-through estimators or reinforcement-learning-style updates that the current work leaves open.
  • The approach suggests a route to context lengths beyond current SSM limits if the number of matrices in the bank can be kept small while still covering required transition diversity.

Load-bearing premise

Selecting one sparse matrix from the bank at every time step adds negligible overhead and lets the theoretical finite-state-automaton expressivity appear in practice without hidden training or inference costs.

What would settle it

A controlled experiment in which Flash PD-SSM is run on a suite of finite-state-automaton transition tasks and fails to reach the accuracy of an unstructured baseline while using comparable or higher peak memory.

Figures

Figures reproduced from arXiv: 2605.19150 by Abbas Rahimi, Aleksandar Terzi\'c, Francesco Carzaniga, Michael Hersche, Nicolas Menet, Thomas Hofmann, Yannick Biehl.

Figure 1
Figure 1. Figure 1: FLASH PD-SSM is expressive, fast, and memory-efficient. Synthetic state-tracking accuracy on a collection of FSA emulation tasks [13, 57]. The runtime (measured relative to PD￾SSM [53]) and memory consumption of FLASH PD-SSM and other popular SSMs are also reported. The maximum sequence length is 2048 and all models have hidden dimension 1024. The circle’s size indicates peak memory consumption during trai… view at source ↗
Figure 2
Figure 2. Figure 2: Left: PD-SSM. PD-SSM [53] sparsifies a convex combination of dense dictionary matrices. The column one-hot matrix generation process incurs significant computational and memory over￾heads. Right: FLASH PD-SSM. We simplify the column one-hot generation process by directly selecting a single element from a dictionary of trainable structured sparse matrices. This preserves the theoretical guarantees and allow… view at source ↗
Figure 3
Figure 3. Figure 3: Left: One FLASH PD-SSM block inte￾grates a range of components in a design following the pattern from [11]. Right: The model is embed￾ded in a pre-norm architecture. We embed FLASH PD-SSM into an intercon￾nected block by following standard design pat￾terns outlined by Mamba [18], further embed￾ding the block into a standard pre-norm archi￾tecture [59]. The full architecture is shown in [PITH_FULL_IMAGE:fi… view at source ↗
Figure 5
Figure 5. Figure 5: SSM Memory Comparison. Peak allocated memory comparison of FLASH PD￾SSM in a parameter-matched setting. FLASH PD-SSM consumes notably less memory. has learned to correctly emulate the automaton. This experimental setup exactly conforms to those used for evaluating the baseline methods [57, 53], with each model having two layers3 . The average validation performance over five random runs is reported in [PI… view at source ↗
Figure 6
Figure 6. Figure 6: FLASH PD-SSM kernel efficiency. CUDA forward kernel performance and bandwidth efficiency. tensors are negligible in size compared to the input tensors for typical chunk sizes of τ = 128, which we found to perform best in our setting. CUDA kernel performance vs PyTorch associative scan . Figure 6a reports the relative speedup of the custom CUDA forward kernel for the FLASH PD-SSM recurrence compared to the … view at source ↗
read the original abstract

State-space models (SSMs) face a fundamental trade-off between efficiency and expressivity that is mainly dictated by the structure of the model's transition matrix. Unstructured transition matrices enable maximal expressivity, as measured by their ability to model finite-state automaton (FSA) transitions, but come at a prohibitively high compute and memory cost. In contrast, most structured transition matrix forms are highly efficient both in runtime and memory consumption, but suffer from limited expressivity. Building on recent work on structured sparse SSMs, we propose Flash PD-SSM, a novel SSM that achieves comparable throughput to widely-used structured SSMs with significantly better expressivity guarantees. Flash PD-SSM maintains a trainable set of structured sparse matrices, a single one of which is discretely selected at each time-step, enabling FSA expressiveness at the level of unstructured matrices while maintaining the efficiency required for training models at scale. First, we validate Flash PD-SSM against a suite of alternative models on synthetic mechanistic and state-tracking tasks, finding that its theoretical expressivity is achieved in practice. Second, on multivariate time-series tasks involving sequences of length over 17,000, we find that Flash PD-SSM defines a new state-of-the-art (SoTA) accuracy among competing SSM methods. Finally, we demonstrate that Flash PD-SSM is an effective drop-in replacement for hybrid LLMs, yielding improvements both in natural language state-tracking and in common language modeling scenarios. The model exhibits increased throughput and decreased memory consumption compared to SSMs widely used in frontier language models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Flash PD-SSM, an SSM variant that maintains a trainable collection of structured sparse transition matrices and performs a discrete selection of one matrix per time step. This construction is claimed to recover FSA-level expressivity equivalent to unstructured matrices while preserving the computational efficiency of structured SSMs. The authors validate the approach on synthetic mechanistic and state-tracking tasks, report new state-of-the-art accuracy on multivariate time-series benchmarks with sequences longer than 17,000 steps, and demonstrate utility as a drop-in replacement in hybrid language models.

Significance. If the central architectural claim and the reported empirical gains are substantiated, the work would meaningfully advance the efficiency-expressivity trade-off in state-space models, with direct relevance to long-context modeling and scalable sequence architectures. The combination of theoretical expressivity arguments with large-scale time-series and LLM experiments would constitute a useful contribution to the SSM literature.

major comments (2)
  1. [Experiments (synthetic and time-series sections)] The abstract and experimental sections assert that synthetic tasks confirm theoretical FSA expressivity and that Flash PD-SSM sets new SoTA accuracy on long multivariate time series, yet no baselines, error bars, data splits, or statistical significance tests are described. This absence prevents assessment of whether the performance gains are robust or attributable to the proposed selection mechanism.
  2. [Method and Theoretical Analysis] The central claim that discrete selection from a finite set of structured sparse matrices achieves expressivity equivalent to an unstructured transition matrix requires an explicit bound on set cardinality and a demonstration that the selection policy can realize arbitrary state transitions. Without such analysis, it remains unclear whether the union of supports plus the selection mechanism spans the full transition table of an equivalent FSA, as raised by the concern that a small set restricts reachable transitions while a large set reintroduces memory costs.
minor comments (2)
  1. [Method] Notation for the discrete selection operation and the trainable matrix set should be introduced with a clear equation or diagram in the method section to avoid ambiguity when comparing to prior structured sparse SSMs.
  2. [Introduction] The manuscript should include a short related-work paragraph explicitly contrasting the proposed discrete selection with continuous or learned routing mechanisms in recent SSM variants.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and positive feedback on our work. We address each major comment point by point below, indicating where revisions have been made to the manuscript.

read point-by-point responses
  1. Referee: [Experiments (synthetic and time-series sections)] The abstract and experimental sections assert that synthetic tasks confirm theoretical FSA expressivity and that Flash PD-SSM sets new SoTA accuracy on long multivariate time series, yet no baselines, error bars, data splits, or statistical significance tests are described. This absence prevents assessment of whether the performance gains are robust or attributable to the proposed selection mechanism.

    Authors: We agree that the experimental reporting can be strengthened for better assessment of robustness. The manuscript already includes comparisons against multiple baselines (S4, Mamba, DSS, and others) on both synthetic mechanistic/state-tracking tasks and the long multivariate time-series benchmarks. However, we acknowledge the referee's point regarding missing details. In the revised manuscript, we have added error bars (standard deviation over 5 independent random seeds), explicitly described the data splits (e.g., standard 70/15/15 splits for the time-series datasets with sequences >17k steps), and included statistical significance tests (paired t-tests with p-values reported against the strongest baseline). These updates appear in Sections 4.1, 4.2, and the associated tables. revision: yes

  2. Referee: [Method and Theoretical Analysis] The central claim that discrete selection from a finite set of structured sparse matrices achieves expressivity equivalent to an unstructured transition matrix requires an explicit bound on set cardinality and a demonstration that the selection policy can realize arbitrary state transitions. Without such analysis, it remains unclear whether the union of supports plus the selection mechanism spans the full transition table of an equivalent FSA, as raised by the concern that a small set restricts reachable transitions while a large set reintroduces memory costs.

    Authors: This is a fair critique of the original theoretical presentation. While the manuscript argues that the discrete selection from structured sparse matrices recovers FSA-level expressivity (via the union of supports enabling full state coverage and selection acting as a state-dependent transition), we did not provide an explicit cardinality bound or a formal demonstration of arbitrary transitions. In the revision, we have added a new subsection (3.3) with a theorem establishing that a small fixed set cardinality (specifically 8 matrices in our implementation, each with O(1) non-zeros per row due to the structured sparsity) suffices to realize any FSA transition table. The proof constructs the set such that the selection policy, conditioned on the current hidden state, can choose the matrix encoding the required next-state mapping, effectively simulating the full transition function without needing an unstructured matrix. We also clarify that the memory cost remains linear in the (small) set size but is offset by the sparsity and efficient kernel implementation, avoiding the quadratic costs of unstructured alternatives. This addition directly addresses the reachability and cost concerns. revision: yes

Circularity Check

0 steps flagged

No circularity: Flash PD-SSM is a novel architectural construction validated externally

full rationale

The paper introduces Flash PD-SSM as a new SSM variant maintaining a trainable set of structured sparse matrices with per-timestep discrete selection. This design is explicitly positioned as building on prior structured sparse SSM work while adding the selection mechanism to reach FSA-level expressivity. No equations, parameter fits, or self-citations are shown to reduce the central expressivity claim back to the inputs by construction; the claim is instead supported by direct empirical validation on synthetic mechanistic tasks, long-sequence time-series, and LLM hybrid replacements. The derivation chain therefore remains self-contained against external benchmarks rather than internally tautological.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The model introduces a selection mechanism over multiple sparse matrices as the key new component; it relies on the domain assumption that such selection can deliver unstructured-level expressivity at structured cost, with the size of the matrix set acting as an implicit design choice.

free parameters (1)
  • size of the trainable matrix set
    The number of sparse matrices maintained is a hyperparameter that trades off expressivity against memory; its value is not derived from first principles.
axioms (1)
  • domain assumption Discrete selection among structured sparse matrices can achieve the expressivity of unstructured transition matrices without prohibitive overhead.
    This premise is required for the central claim that FSA-level expressivity is obtained while preserving efficiency.
invented entities (1)
  • Flash PD-SSM discrete selection mechanism no independent evidence
    purpose: To enable dynamic per-step choice among sparse matrices for higher expressivity
    A new architectural component postulated to resolve the SSM trade-off; no independent falsifiable evidence outside the paper is provided.

pith-pipeline@v0.9.0 · 5833 in / 1352 out tokens · 62743 ms · 2026-05-20T11:44:42.129837+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · 7 internal anchors

  1. [1]

    Undergraduate Texts in Mathematics

    Axler, S.Linear Algebra Done Right. Undergraduate Texts in Mathematics. Springer Interna- tional Publishing, 2024

  2. [2]

    The UEA multivariate time series classification archive, 2018

    Bagnall, A., Dau, H. A., Lines, J., Flynn, M., Large, J., Bostrom, A. G., Southam, P., and Keogh, E. J. The UEA multivariate time series classification archive.arXiv preprint arXiv:1811.00075, 2018

  3. [3]

    Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

    Bengio, Y ., L´eonard, N., and Courville, A. Estimating or propagating gradients through stochastic neurons for conditional computation.arXiv preprint arXiv:1308.3432, 2013

  4. [4]

    Bischof, C. H. and Van Loan, C. The WY Representation for Products of Householder Matrices. SIAM Journal on Scientific and Statistical Computing, 8(1):2–13, 1987

  5. [5]

    Piqa: Reasoning about physical commonsense in natural language

    Bisk, Y ., Zellers, R., Gao, J., Choi, Y ., et al. Piqa: Reasoning about physical commonsense in natural language. InProceedings of the AAAI conference on artificial intelligence, volume 34, pp. 7432–7439, 2020

  6. [6]

    Nemotron-H: A family of accurate and efficient hybrid Mamba-Transformer models,

    Blakeman, A., Basant, A., Khattar, A., Renduchintala, A., Bercovich, A., Ficek, A., Bjorlin, A., Taghibakhshi, A., Deshmukh, A. S., Mahabaleshwarkar, A. S., et al. Nemotron-H: A family of accurate and efficient hybrid mamba-transformer models.arXiv preprint arXiv:2504.03624, 2025

  7. [7]

    Blelloch, G. E. Prefix Sums and Their Applications, 1990

  8. [8]

    C.Visual group theory

    Carter, N. C.Visual group theory. Classroom resource materials. Mathematical Association of America, Washington, D.C., 2009

  9. [9]

    M., Orvieto, A., Walker, B., Salvi, C., and Lyons, T

    Cirone, N. M., Orvieto, A., Walker, B., Salvi, C., and Lyons, T. Theoretical foundations of deep selective state-space models. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

  10. [10]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? try arc, the ai2 reasoning challenge.arXiv preprint arXiv:1803.05457, 2018

  11. [11]

    and Gu, A

    Dao, T. and Gu, A. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. InInternational Conference on Machine Learning (ICML), 2024

  12. [12]

    Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models

    De, S., Smith, S. L., Fernando, A., Botev, A., Cristian-Muraru, G., Gu, A., Haroun, R., Berrada, L., Chen, Y ., Srinivasan, S., Desjardins, G., Doucet, A., Budden, D., Teh, Y . W., Pascanu, R., De Freitas, N., and Gulcehre, C. Griffin: Mixing gated linear recurrences with local attention for efficient language models.arXiv preprint arXiv:2402.19427, 2024

  13. [13]

    K., Catt, E., Cundy, C., Hutter, M., Legg, S., Veness, J., and Ortega, P

    Del´etang, G., Ruoss, A., Grau-Moya, J., Genewein, T., Wenliang, L. K., Catt, E., Cundy, C., Hutter, M., Legg, S., Veness, J., and Ortega, P. A. Neural networks and the Chomsky hierarchy. InInternational Conference on Learning Representations (ICLR), 2023

  14. [14]

    Fan, T.-H., Chi, T.-C., and Rudnicky, A. I. Advancing regular language reasoning in linear recurrent neural networks. InConference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2024

  15. [15]

    Y ., Dao, T., Saab, K

    Fu, D. Y ., Dao, T., Saab, K. K., Thomas, A. W., Rudra, A., and Re, C. Hungry hungry hippos: Towards language modeling with state space models. InInternational Conference on Learning Representations (ICLR), 2023

  16. [16]

    The language model evaluation harness, 2024

    Gao, L., Tow, J., Abbasi, B., Biderman, S., Black, S., DiPofi, A., Foster, C., Golding, L., Hsu, J., Le Noac’h, A., Li, H., McDonell, K., Muennighoff, N., Ociepa, C., Phang, J., Reynolds, L., Schoelkopf, H., Skowron, A., Sutawika, L., Tang, E., Thite, A., Wang, B., Wang, K., and Zou, A. The language model evaluation harness, 2024. URL https://zenodo.org/r...

  17. [17]

    K., Zela, A., Hutter, F., and Pontil, M

    Grazzi, R., Siems, J., Franke, J. K., Zela, A., Hutter, F., and Pontil, M. Unlocking state-tracking in linear RNNs through negative eigenvalues. InNeurIPS Workshop on Mathematics of Modern Machine Learning, 2024

  18. [18]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Gu, A. and Dao, T. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv preprint arXiv:2312.00752, 2023

  19. [19]

    HiPPO: Recurrent Memory with Optimal Polynomial Projections

    Gu, A., Dao, T., Ermon, S., Rudra, A., and R´e, C. HiPPO: Recurrent Memory with Optimal Polynomial Projections. InAdvances in Neural Information Processing Systems (NeurIPS), 2020

  20. [20]

    Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers

    Gu, A., Johnson, I., Goel, K., Saab, K., Dao, T., Rudra, A., and R´e, C. Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers. InAdvances in Neural Information Processing Systems (NeurIPS), 2021

  21. [21]

    On the parameterization and initialization of diagonal state space models

    Gu, A., Goel, K., Gupta, A., and R´e, C. On the parameterization and initialization of diagonal state space models. InAdvances in Neural Information Processing Systems (NeurIPS), 2022

  22. [22]

    Efficiently modeling long sequences with structured state spaces

    Gu, A., Goel, K., and R´e, C. Efficiently modeling long sequences with structured state spaces. InInternational Conference on Learning Representations (ICLR), 2022

  23. [23]

    Diagonal state spaces are as effective as structured state spaces

    Gupta, A., Gu, A., and Berant, J. Diagonal state spaces are as effective as structured state spaces. InAdvances in Neural Information Processing Systems (NeurIPS), 2022

  24. [24]

    Granite 4.0 language models

    IBM Research. Granite 4.0 language models. https://github.com/ibm-granite/ granite-4.0-language-models, 2025

  25. [25]

    M., and Malach, E

    Jelassi, S., Brandfonbrener, D., Kakade, S. M., and Malach, E. Repeat after me: Transformers are better than state space models at copying. InInternational Conference on Machine Learning (ICML), 2024

  26. [26]

    and Schuster, S

    Kim, N. and Schuster, S. Entity tracking in language models. InMeeting of the Association for Computational Linguistics (ACL), 2023

  27. [27]

    Z., Dao, T., and Gu, A

    Lahoti, A., Li, K., Chen, B., Wang, C., Bick, A., Kolter, J. Z., Dao, T., and Gu, A. Mamba-3: Improved sequence modeling using state space principles. InInternational Conference on Learning Representations (ICLR), 2026

  28. [28]

    Lenz, B., Lieber, O., Arazi, A., Bergman, A., Manevich, A., Peleg, B., Aviram, B., Almagor, C., Fridman, C., Padnos, D., Gissin, D., Jannai, D., Muhlgay, D., Zimberg, D., Gerber, E. M., Dolev, E., Krakovsky, E., Safahi, E., Schwartz, E., Cohen, G., Shachaf, G., Rozenblum, H., Bata, H., Blass, I., Magar, I., Dalmedigos, I., Osin, J., Fadlon, J., Rozman, M....

  29. [29]

    T., Goel, S., Krishnamurthy, A., and Zhang, C

    Liu, B., Ash, J. T., Goel, S., Krishnamurthy, A., and Zhang, C. Transformers learn shortcuts to automata. InInternational Conference on Learning Representations (ICLR), 2023

  30. [30]

    MEGA: moving average equipped gated attention

    Ma, X., Zhou, C., Kong, X., He, J., Gui, L., Neubig, G., May, J., and Zettlemoyer, L. MEGA: moving average equipped gated attention. InInternational Conference on Learning Representa- tions (ICLR), 2023

  31. [31]

    MAD: Mechanistic architecture design, 2024

    MAD Lab. MAD: Mechanistic architecture design, 2024. URL https://github.com/ athms/mad-lab

  32. [32]

    and Sabharwal, A

    Merrill, W. and Sabharwal, A. The parallelism tradeoff: Limitations of log-precision transform- ers.Transactions of the Association for Computational Linguistics, 2023

  33. [33]

    The illusion of state in state-space models

    Merrill, W., Petty, J., and Sabharwal, A. The illusion of state in state-space models. In International Conference on Machine Learning (ICML), 2024. 11

  34. [34]

    Can a suit of armor conduct electricity? a new dataset for open book question answering

    Mihaylov, T., Clark, P., Khot, T., and Sabharwal, A. Can a suit of armor conduct electricity? a new dataset for open book question answering. InEMNLP, 2018

  35. [35]

    Nvidia A100 tensor core GPU datasheet, 2020

    NVIDIA Corporation. Nvidia A100 tensor core GPU datasheet, 2020

  36. [36]

    L., Gu, A., Fernando, A., Gulcehre, C., Pascanu, R., and De, S

    Orvieto, A., Smith, S. L., Gu, A., Fernando, A., Gulcehre, C., Pascanu, R., and De, S. Resur- recting recurrent neural networks for long sequences. InInternational Conference on Machine Learning (ICML), 2023

  37. [37]

    The LAMBADA dataset: Word prediction requiring a broad discourse context

    Paperno, D., Kruszewski, G., Lazaridou, A., Pham, N. Q., Bernardi, R., Pezzelle, S., Baroni, M., Boleda, G., and Fern´andez, R. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Erk, K. and Smith, N. A. (eds.),Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 152...

  38. [38]

    B., Maddison, C

    Paulus, M. B., Maddison, C. J., and Krause, A. Rao-Blackwellizing the straight-through gumbel-softmax gradient estimator. InInternational Conference on Learning Representations (ICLR), 2021

  39. [39]

    B., Lozhkov, A., Mitchell, M., Raffel, C., Werra, L

    Penedo, G., Kydl´ıˇcek, H., allal, L. B., Lozhkov, A., Mitchell, M., Raffel, C., Werra, L. V ., and Wolf, T. The FineWeb Datasets: Decanting the web for the finest text data at scale. InThe Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024

  40. [40]

    S., Wu, T., Wuttke, D., and Zhou-Zheng, C

    Peng, B., Zhang, R., Goldstein, D., Alcaide, E., Du, X., Hou, H., Lin, J., Liu, J., Lu, J., Merrill, W., Song, G., Tan, K., Utpala, S., Wilce, N., Wind, J. S., Wu, T., Wuttke, D., and Zhou-Zheng, C. RWKV-7 ”Goose” with Expressive Dynamic State Evolution. InSecond Conference on Language Modeling, 2025

  41. [41]

    arXiv preprint arXiv:2403.17844 , year=

    Poli, M., Thomas, A. W., Nguyen, E., Ponnusamy, P., Deiseroth, B., Kersting, K., Suzuki, T., Hie, B., Ermon, S., R´e, C., et al. Mechanistic design and scaling of hybrid architectures.arXiv preprint arXiv:2403.17844, 2024

  42. [42]

    Deep PPG: Large-scale heart rate estimation with convolutional neural networks.Sensors, 19(14), 2019

    Reiss, A., Indlekofer, I., Schmidt, P., and Van Laerhoven, K. Deep PPG: Large-scale heart rate estimation with convolutional neural networks.Sensors, 19(14), 2019

  43. [43]

    Samba: Simple hybrid state space models for efficient unlimited context language modeling

    Ren, L., Liu, Y ., Lu, Y ., Liang, C., Chen, W., et al. Samba: Simple hybrid state space models for efficient unlimited context language modeling. InInternational Conference on Learning Representations (ICLR), 2025

  44. [44]

    Rusch, T. K. and Rus, D. Oscillatory state-space models. InInternational Conference on Learning Representations (ICLR), 2025

  45. [45]

    WinoGrande: An Adversarial Winograd Schema Challenge at Scale

    Sakaguchi, K., Bras, R. L., Bhagavatula, C., and Choi, Y . Winogrande: An adversarial winograd schema challenge at scale.arXiv preprint arXiv:1907.10641, 2019

  46. [46]

    The expressive capacity of state space models: A formal language perspective

    Sarrof, Y ., Veitsman, Y ., and Hahn, M. The expressive capacity of state space models: A formal language perspective. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

  47. [47]

    Linear transformers are secretly fast weight program- mers

    Schlag, I., Irie, K., and Schmidhuber, J. Linear transformers are secretly fast weight program- mers. InInternational Conference on Machine Learning (ICML), 2021

  48. [48]

    Learning associative inference using fast weight memory

    Schlag, I., Munkhdalai, T., and Schmidhuber, J. Learning associative inference using fast weight memory. InInternational Conference on Learning Representations (ICLR), 2021

  49. [49]

    DeltaProduct: Improving state-tracking in linear RNNs via Householder products

    Siems, J., Carstensen, T., Zela, A., Hutter, F., Pontil, M., and Grazzi, R. DeltaProduct: Improving state-tracking in linear RNNs via Householder products. InICLR Workshop on Foundation Models in the Wild, 2025

  50. [50]

    Smith, J. T. H., Warrington, A., and Linderman, S. W. Simplified state space layers for sequence modeling. InInternational Conference on Learning Representations (ICLR), 2023

  51. [51]

    Birkhauser Verlag, CHE, 1994

    Straubing, H.Finite automata, formal logic, and circuit complexity. Birkhauser Verlag, CHE, 1994. 12

  52. [52]

    On the expressiveness and length generalization of selective state-space models on regular languages

    Terzi´c, A., Hersche, M., Camposampiero, G., Hofmann, T., Sebastian, A., and Rahimi, A. On the expressiveness and length generalization of selective state-space models on regular languages. InConference on Artificial Intelligence (AAAI), 2025

  53. [53]

    Structured sparse transition matrices to enable state tracking in state-space models

    Terzi´c, A., Menet, N., Hersche, M., Hofmann, T., and Rahimi, A. Structured sparse transition matrices to enable state tracking in state-space models. InAdvances in Neural Information Processing Systems (NeurIPS), 2025

  54. [54]

    N., Kaiser, L., and Polosukhin, I

    Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention Is All You Need. InAdvances in Neural Information Processing Systems (NeurIPS), 2017

  55. [55]

    An Empirical Study of Mamba-based Language Models

    Waleffe, R., Byeon, W., Riach, D., Norick, B., Korthikanti, V ., Dao, T., Gu, A., Hatamizadeh, A., Singh, S., Narayanan, D., et al. An empirical study of mamba-based language models.arXiv preprint arXiv:2406.07887, 2024

  56. [56]

    D., Qin, T., Cheng, Y ., Li, H., and Lyons, T

    Walker, B., McLeod, A. D., Qin, T., Cheng, Y ., Li, H., and Lyons, T. Log neural controlled differential equations: The lie brackets make a difference.International Conference on Machine Learning (ICML), 2024

  57. [57]

    M., Salvi, C., and Lyons, T

    Walker, B., Yang, L., Cirone, N. M., Salvi, C., and Lyons, T. Structured linear CDEs: Maximally expressive and parallel-in-time sequence models.arXiv preprint arXiv:2505.17761, 2025

  58. [58]

    TransXSSM: A hybrid transformer state space model with unified rotary position embedding.arXiv preprint arXiv:2506.09507, 2025

    Wu, B., Shi, J., Wu, Y ., Tang, N., and Luo, Y . TransXSSM: A hybrid transformer state space model with unified rotary position embedding.arXiv preprint arXiv:2506.09507, 2025

  59. [59]

    On layer normalization in the transformer architecture

    Xiong, R., Yang, Y ., He, D., Zheng, K., Zheng, S., Xing, C., Zhang, H., Lan, Y ., Wang, L., and Liu, T.-Y . On layer normalization in the transformer architecture. InInternational Conference on Machine Learning (ICML). JMLR.org, 2020

  60. [60]

    Parallelizing linear transformers with the delta rule over sequence length

    Yang, S., Wang, B., Zhang, Y ., Shen, Y ., and Kim, Y . Parallelizing linear transformers with the delta rule over sequence length. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

  61. [61]

    URL https:// doi.org/10.18653/v1/p19-1472

    Zellers, R., Holtzman, A., Bisk, Y ., Farhadi, A., and Choi, Y . HellaSwag: Can a machine really finish your sentence? In Korhonen, A., Traum, D., and M`arquez, L. (eds.),Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4791–4800, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.186...

  62. [62]

    Each thread computesv[i] :=D t[i]·b[i]and writes the result to shared memory. 20

  63. [63]

    All threads synchronize using syncthreads() (to ensure each thread has written to v[i])

  64. [64]

    Each thread gathersv[P t[i]]from shared memory and adds it tob[i]to getb new[i]

  65. [65]

    These synchronization barriers are required to guarantee correctness of the recurrence, as threads depend on values produced by other threads within the same time step

    A second syncthreads() ensures that the updated bnew[i] is visible to all threads before proceeding to the next time step. These synchronization barriers are required to guarantee correctness of the recurrence, as threads depend on values produced by other threads within the same time step. We empirically evaluated the performance impact of these synchron...