Flash PD-SSM: Memory-Optimized Structured Sparse State-Space Models
Pith reviewed 2026-05-20 11:44 UTC · model grok-4.3
The pith
Flash PD-SSM achieves unstructured-matrix expressivity in state-space models by selecting one structured sparse transition matrix at each time step.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Flash PD-SSM maintains a trainable set of structured sparse matrices, a single one of which is discretely selected at each time-step, enabling FSA expressiveness at the level of unstructured matrices while maintaining the efficiency required for training models at scale.
What carries the argument
Discrete per-time-step selection from a trainable bank of structured sparse transition matrices that approximates dense-matrix expressivity without dense storage or compute.
If this is right
- On synthetic mechanistic and state-tracking tasks the model realizes its claimed finite-state-automaton expressivity.
- On multivariate time-series sequences longer than 17,000 steps it sets new state-of-the-art accuracy among competing structured SSMs.
- As a drop-in replacement inside hybrid language models it improves both natural-language state tracking and standard language-modeling benchmarks.
- It delivers higher throughput and lower memory consumption than the structured SSMs currently used in frontier language models.
Where Pith is reading between the lines
- The same bank-and-select pattern could be applied to other linear recurrent layers to improve their expressivity without quadratic cost.
- Hardware kernels that fuse the selection step with the sparse matrix-vector product would further reduce the already small overhead.
- Because selection is discrete, gradient flow through the choice may require straight-through estimators or reinforcement-learning-style updates that the current work leaves open.
- The approach suggests a route to context lengths beyond current SSM limits if the number of matrices in the bank can be kept small while still covering required transition diversity.
Load-bearing premise
Selecting one sparse matrix from the bank at every time step adds negligible overhead and lets the theoretical finite-state-automaton expressivity appear in practice without hidden training or inference costs.
What would settle it
A controlled experiment in which Flash PD-SSM is run on a suite of finite-state-automaton transition tasks and fails to reach the accuracy of an unstructured baseline while using comparable or higher peak memory.
Figures
read the original abstract
State-space models (SSMs) face a fundamental trade-off between efficiency and expressivity that is mainly dictated by the structure of the model's transition matrix. Unstructured transition matrices enable maximal expressivity, as measured by their ability to model finite-state automaton (FSA) transitions, but come at a prohibitively high compute and memory cost. In contrast, most structured transition matrix forms are highly efficient both in runtime and memory consumption, but suffer from limited expressivity. Building on recent work on structured sparse SSMs, we propose Flash PD-SSM, a novel SSM that achieves comparable throughput to widely-used structured SSMs with significantly better expressivity guarantees. Flash PD-SSM maintains a trainable set of structured sparse matrices, a single one of which is discretely selected at each time-step, enabling FSA expressiveness at the level of unstructured matrices while maintaining the efficiency required for training models at scale. First, we validate Flash PD-SSM against a suite of alternative models on synthetic mechanistic and state-tracking tasks, finding that its theoretical expressivity is achieved in practice. Second, on multivariate time-series tasks involving sequences of length over 17,000, we find that Flash PD-SSM defines a new state-of-the-art (SoTA) accuracy among competing SSM methods. Finally, we demonstrate that Flash PD-SSM is an effective drop-in replacement for hybrid LLMs, yielding improvements both in natural language state-tracking and in common language modeling scenarios. The model exhibits increased throughput and decreased memory consumption compared to SSMs widely used in frontier language models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Flash PD-SSM, an SSM variant that maintains a trainable collection of structured sparse transition matrices and performs a discrete selection of one matrix per time step. This construction is claimed to recover FSA-level expressivity equivalent to unstructured matrices while preserving the computational efficiency of structured SSMs. The authors validate the approach on synthetic mechanistic and state-tracking tasks, report new state-of-the-art accuracy on multivariate time-series benchmarks with sequences longer than 17,000 steps, and demonstrate utility as a drop-in replacement in hybrid language models.
Significance. If the central architectural claim and the reported empirical gains are substantiated, the work would meaningfully advance the efficiency-expressivity trade-off in state-space models, with direct relevance to long-context modeling and scalable sequence architectures. The combination of theoretical expressivity arguments with large-scale time-series and LLM experiments would constitute a useful contribution to the SSM literature.
major comments (2)
- [Experiments (synthetic and time-series sections)] The abstract and experimental sections assert that synthetic tasks confirm theoretical FSA expressivity and that Flash PD-SSM sets new SoTA accuracy on long multivariate time series, yet no baselines, error bars, data splits, or statistical significance tests are described. This absence prevents assessment of whether the performance gains are robust or attributable to the proposed selection mechanism.
- [Method and Theoretical Analysis] The central claim that discrete selection from a finite set of structured sparse matrices achieves expressivity equivalent to an unstructured transition matrix requires an explicit bound on set cardinality and a demonstration that the selection policy can realize arbitrary state transitions. Without such analysis, it remains unclear whether the union of supports plus the selection mechanism spans the full transition table of an equivalent FSA, as raised by the concern that a small set restricts reachable transitions while a large set reintroduces memory costs.
minor comments (2)
- [Method] Notation for the discrete selection operation and the trainable matrix set should be introduced with a clear equation or diagram in the method section to avoid ambiguity when comparing to prior structured sparse SSMs.
- [Introduction] The manuscript should include a short related-work paragraph explicitly contrasting the proposed discrete selection with continuous or learned routing mechanisms in recent SSM variants.
Simulated Author's Rebuttal
We thank the referee for the constructive and positive feedback on our work. We address each major comment point by point below, indicating where revisions have been made to the manuscript.
read point-by-point responses
-
Referee: [Experiments (synthetic and time-series sections)] The abstract and experimental sections assert that synthetic tasks confirm theoretical FSA expressivity and that Flash PD-SSM sets new SoTA accuracy on long multivariate time series, yet no baselines, error bars, data splits, or statistical significance tests are described. This absence prevents assessment of whether the performance gains are robust or attributable to the proposed selection mechanism.
Authors: We agree that the experimental reporting can be strengthened for better assessment of robustness. The manuscript already includes comparisons against multiple baselines (S4, Mamba, DSS, and others) on both synthetic mechanistic/state-tracking tasks and the long multivariate time-series benchmarks. However, we acknowledge the referee's point regarding missing details. In the revised manuscript, we have added error bars (standard deviation over 5 independent random seeds), explicitly described the data splits (e.g., standard 70/15/15 splits for the time-series datasets with sequences >17k steps), and included statistical significance tests (paired t-tests with p-values reported against the strongest baseline). These updates appear in Sections 4.1, 4.2, and the associated tables. revision: yes
-
Referee: [Method and Theoretical Analysis] The central claim that discrete selection from a finite set of structured sparse matrices achieves expressivity equivalent to an unstructured transition matrix requires an explicit bound on set cardinality and a demonstration that the selection policy can realize arbitrary state transitions. Without such analysis, it remains unclear whether the union of supports plus the selection mechanism spans the full transition table of an equivalent FSA, as raised by the concern that a small set restricts reachable transitions while a large set reintroduces memory costs.
Authors: This is a fair critique of the original theoretical presentation. While the manuscript argues that the discrete selection from structured sparse matrices recovers FSA-level expressivity (via the union of supports enabling full state coverage and selection acting as a state-dependent transition), we did not provide an explicit cardinality bound or a formal demonstration of arbitrary transitions. In the revision, we have added a new subsection (3.3) with a theorem establishing that a small fixed set cardinality (specifically 8 matrices in our implementation, each with O(1) non-zeros per row due to the structured sparsity) suffices to realize any FSA transition table. The proof constructs the set such that the selection policy, conditioned on the current hidden state, can choose the matrix encoding the required next-state mapping, effectively simulating the full transition function without needing an unstructured matrix. We also clarify that the memory cost remains linear in the (small) set size but is offset by the sparsity and efficient kernel implementation, avoiding the quadratic costs of unstructured alternatives. This addition directly addresses the reachability and cost concerns. revision: yes
Circularity Check
No circularity: Flash PD-SSM is a novel architectural construction validated externally
full rationale
The paper introduces Flash PD-SSM as a new SSM variant maintaining a trainable set of structured sparse matrices with per-timestep discrete selection. This design is explicitly positioned as building on prior structured sparse SSM work while adding the selection mechanism to reach FSA-level expressivity. No equations, parameter fits, or self-citations are shown to reduce the central expressivity claim back to the inputs by construction; the claim is instead supported by direct empirical validation on synthetic mechanistic tasks, long-sequence time-series, and LLM hybrid replacements. The derivation chain therefore remains self-contained against external benchmarks rather than internally tautological.
Axiom & Free-Parameter Ledger
free parameters (1)
- size of the trainable matrix set
axioms (1)
- domain assumption Discrete selection among structured sparse matrices can achieve the expressivity of unstructured transition matrices without prohibitive overhead.
invented entities (1)
-
Flash PD-SSM discrete selection mechanism
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
FLASHPD-SSM maintains a trainable set of structured sparse matrices, a single one of which is discretely selected at each time-step, enabling FSA expressiveness at the level of unstructured matrices
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Proposition 1 (Expressivity of Discrete PD Parametrization). Any deterministic FSA with N states can be exactly represented by a single-layer FLASHPD-SSM
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Undergraduate Texts in Mathematics
Axler, S.Linear Algebra Done Right. Undergraduate Texts in Mathematics. Springer Interna- tional Publishing, 2024
work page 2024
-
[2]
The UEA multivariate time series classification archive, 2018
Bagnall, A., Dau, H. A., Lines, J., Flynn, M., Large, J., Bostrom, A. G., Southam, P., and Keogh, E. J. The UEA multivariate time series classification archive.arXiv preprint arXiv:1811.00075, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[3]
Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation
Bengio, Y ., L´eonard, N., and Courville, A. Estimating or propagating gradients through stochastic neurons for conditional computation.arXiv preprint arXiv:1308.3432, 2013
work page internal anchor Pith review Pith/arXiv arXiv 2013
-
[4]
Bischof, C. H. and Van Loan, C. The WY Representation for Products of Householder Matrices. SIAM Journal on Scientific and Statistical Computing, 8(1):2–13, 1987
work page 1987
-
[5]
Piqa: Reasoning about physical commonsense in natural language
Bisk, Y ., Zellers, R., Gao, J., Choi, Y ., et al. Piqa: Reasoning about physical commonsense in natural language. InProceedings of the AAAI conference on artificial intelligence, volume 34, pp. 7432–7439, 2020
work page 2020
-
[6]
Nemotron-H: A family of accurate and efficient hybrid Mamba-Transformer models,
Blakeman, A., Basant, A., Khattar, A., Renduchintala, A., Bercovich, A., Ficek, A., Bjorlin, A., Taghibakhshi, A., Deshmukh, A. S., Mahabaleshwarkar, A. S., et al. Nemotron-H: A family of accurate and efficient hybrid mamba-transformer models.arXiv preprint arXiv:2504.03624, 2025
-
[7]
Blelloch, G. E. Prefix Sums and Their Applications, 1990
work page 1990
-
[8]
Carter, N. C.Visual group theory. Classroom resource materials. Mathematical Association of America, Washington, D.C., 2009
work page 2009
-
[9]
M., Orvieto, A., Walker, B., Salvi, C., and Lyons, T
Cirone, N. M., Orvieto, A., Walker, B., Salvi, C., and Lyons, T. Theoretical foundations of deep selective state-space models. InAdvances in Neural Information Processing Systems (NeurIPS), 2024
work page 2024
-
[10]
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? try arc, the ai2 reasoning challenge.arXiv preprint arXiv:1803.05457, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
- [11]
-
[12]
Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
De, S., Smith, S. L., Fernando, A., Botev, A., Cristian-Muraru, G., Gu, A., Haroun, R., Berrada, L., Chen, Y ., Srinivasan, S., Desjardins, G., Doucet, A., Budden, D., Teh, Y . W., Pascanu, R., De Freitas, N., and Gulcehre, C. Griffin: Mixing gated linear recurrences with local attention for efficient language models.arXiv preprint arXiv:2402.19427, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[13]
K., Catt, E., Cundy, C., Hutter, M., Legg, S., Veness, J., and Ortega, P
Del´etang, G., Ruoss, A., Grau-Moya, J., Genewein, T., Wenliang, L. K., Catt, E., Cundy, C., Hutter, M., Legg, S., Veness, J., and Ortega, P. A. Neural networks and the Chomsky hierarchy. InInternational Conference on Learning Representations (ICLR), 2023
work page 2023
-
[14]
Fan, T.-H., Chi, T.-C., and Rudnicky, A. I. Advancing regular language reasoning in linear recurrent neural networks. InConference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2024
work page 2024
-
[15]
Fu, D. Y ., Dao, T., Saab, K. K., Thomas, A. W., Rudra, A., and Re, C. Hungry hungry hippos: Towards language modeling with state space models. InInternational Conference on Learning Representations (ICLR), 2023
work page 2023
-
[16]
The language model evaluation harness, 2024
Gao, L., Tow, J., Abbasi, B., Biderman, S., Black, S., DiPofi, A., Foster, C., Golding, L., Hsu, J., Le Noac’h, A., Li, H., McDonell, K., Muennighoff, N., Ociepa, C., Phang, J., Reynolds, L., Schoelkopf, H., Skowron, A., Sutawika, L., Tang, E., Thite, A., Wang, B., Wang, K., and Zou, A. The language model evaluation harness, 2024. URL https://zenodo.org/r...
work page 2024
-
[17]
K., Zela, A., Hutter, F., and Pontil, M
Grazzi, R., Siems, J., Franke, J. K., Zela, A., Hutter, F., and Pontil, M. Unlocking state-tracking in linear RNNs through negative eigenvalues. InNeurIPS Workshop on Mathematics of Modern Machine Learning, 2024
work page 2024
-
[18]
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Gu, A. and Dao, T. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv preprint arXiv:2312.00752, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[19]
HiPPO: Recurrent Memory with Optimal Polynomial Projections
Gu, A., Dao, T., Ermon, S., Rudra, A., and R´e, C. HiPPO: Recurrent Memory with Optimal Polynomial Projections. InAdvances in Neural Information Processing Systems (NeurIPS), 2020
work page 2020
-
[20]
Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers
Gu, A., Johnson, I., Goel, K., Saab, K., Dao, T., Rudra, A., and R´e, C. Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers. InAdvances in Neural Information Processing Systems (NeurIPS), 2021
work page 2021
-
[21]
On the parameterization and initialization of diagonal state space models
Gu, A., Goel, K., Gupta, A., and R´e, C. On the parameterization and initialization of diagonal state space models. InAdvances in Neural Information Processing Systems (NeurIPS), 2022
work page 2022
-
[22]
Efficiently modeling long sequences with structured state spaces
Gu, A., Goel, K., and R´e, C. Efficiently modeling long sequences with structured state spaces. InInternational Conference on Learning Representations (ICLR), 2022
work page 2022
-
[23]
Diagonal state spaces are as effective as structured state spaces
Gupta, A., Gu, A., and Berant, J. Diagonal state spaces are as effective as structured state spaces. InAdvances in Neural Information Processing Systems (NeurIPS), 2022
work page 2022
-
[24]
IBM Research. Granite 4.0 language models. https://github.com/ibm-granite/ granite-4.0-language-models, 2025
work page 2025
-
[25]
Jelassi, S., Brandfonbrener, D., Kakade, S. M., and Malach, E. Repeat after me: Transformers are better than state space models at copying. InInternational Conference on Machine Learning (ICML), 2024
work page 2024
-
[26]
Kim, N. and Schuster, S. Entity tracking in language models. InMeeting of the Association for Computational Linguistics (ACL), 2023
work page 2023
-
[27]
Lahoti, A., Li, K., Chen, B., Wang, C., Bick, A., Kolter, J. Z., Dao, T., and Gu, A. Mamba-3: Improved sequence modeling using state space principles. InInternational Conference on Learning Representations (ICLR), 2026
work page 2026
-
[28]
Lenz, B., Lieber, O., Arazi, A., Bergman, A., Manevich, A., Peleg, B., Aviram, B., Almagor, C., Fridman, C., Padnos, D., Gissin, D., Jannai, D., Muhlgay, D., Zimberg, D., Gerber, E. M., Dolev, E., Krakovsky, E., Safahi, E., Schwartz, E., Cohen, G., Shachaf, G., Rozenblum, H., Bata, H., Blass, I., Magar, I., Dalmedigos, I., Osin, J., Fadlon, J., Rozman, M....
work page 2025
-
[29]
T., Goel, S., Krishnamurthy, A., and Zhang, C
Liu, B., Ash, J. T., Goel, S., Krishnamurthy, A., and Zhang, C. Transformers learn shortcuts to automata. InInternational Conference on Learning Representations (ICLR), 2023
work page 2023
-
[30]
MEGA: moving average equipped gated attention
Ma, X., Zhou, C., Kong, X., He, J., Gui, L., Neubig, G., May, J., and Zettlemoyer, L. MEGA: moving average equipped gated attention. InInternational Conference on Learning Representa- tions (ICLR), 2023
work page 2023
-
[31]
MAD: Mechanistic architecture design, 2024
MAD Lab. MAD: Mechanistic architecture design, 2024. URL https://github.com/ athms/mad-lab
work page 2024
-
[32]
Merrill, W. and Sabharwal, A. The parallelism tradeoff: Limitations of log-precision transform- ers.Transactions of the Association for Computational Linguistics, 2023
work page 2023
-
[33]
The illusion of state in state-space models
Merrill, W., Petty, J., and Sabharwal, A. The illusion of state in state-space models. In International Conference on Machine Learning (ICML), 2024. 11
work page 2024
-
[34]
Can a suit of armor conduct electricity? a new dataset for open book question answering
Mihaylov, T., Clark, P., Khot, T., and Sabharwal, A. Can a suit of armor conduct electricity? a new dataset for open book question answering. InEMNLP, 2018
work page 2018
-
[35]
Nvidia A100 tensor core GPU datasheet, 2020
NVIDIA Corporation. Nvidia A100 tensor core GPU datasheet, 2020
work page 2020
-
[36]
L., Gu, A., Fernando, A., Gulcehre, C., Pascanu, R., and De, S
Orvieto, A., Smith, S. L., Gu, A., Fernando, A., Gulcehre, C., Pascanu, R., and De, S. Resur- recting recurrent neural networks for long sequences. InInternational Conference on Machine Learning (ICML), 2023
work page 2023
-
[37]
The LAMBADA dataset: Word prediction requiring a broad discourse context
Paperno, D., Kruszewski, G., Lazaridou, A., Pham, N. Q., Bernardi, R., Pezzelle, S., Baroni, M., Boleda, G., and Fern´andez, R. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Erk, K. and Smith, N. A. (eds.),Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 152...
-
[38]
Paulus, M. B., Maddison, C. J., and Krause, A. Rao-Blackwellizing the straight-through gumbel-softmax gradient estimator. InInternational Conference on Learning Representations (ICLR), 2021
work page 2021
-
[39]
B., Lozhkov, A., Mitchell, M., Raffel, C., Werra, L
Penedo, G., Kydl´ıˇcek, H., allal, L. B., Lozhkov, A., Mitchell, M., Raffel, C., Werra, L. V ., and Wolf, T. The FineWeb Datasets: Decanting the web for the finest text data at scale. InThe Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024
work page 2024
-
[40]
S., Wu, T., Wuttke, D., and Zhou-Zheng, C
Peng, B., Zhang, R., Goldstein, D., Alcaide, E., Du, X., Hou, H., Lin, J., Liu, J., Lu, J., Merrill, W., Song, G., Tan, K., Utpala, S., Wilce, N., Wind, J. S., Wu, T., Wuttke, D., and Zhou-Zheng, C. RWKV-7 ”Goose” with Expressive Dynamic State Evolution. InSecond Conference on Language Modeling, 2025
work page 2025
-
[41]
arXiv preprint arXiv:2403.17844 , year=
Poli, M., Thomas, A. W., Nguyen, E., Ponnusamy, P., Deiseroth, B., Kersting, K., Suzuki, T., Hie, B., Ermon, S., R´e, C., et al. Mechanistic design and scaling of hybrid architectures.arXiv preprint arXiv:2403.17844, 2024
-
[42]
Deep PPG: Large-scale heart rate estimation with convolutional neural networks.Sensors, 19(14), 2019
Reiss, A., Indlekofer, I., Schmidt, P., and Van Laerhoven, K. Deep PPG: Large-scale heart rate estimation with convolutional neural networks.Sensors, 19(14), 2019
work page 2019
-
[43]
Samba: Simple hybrid state space models for efficient unlimited context language modeling
Ren, L., Liu, Y ., Lu, Y ., Liang, C., Chen, W., et al. Samba: Simple hybrid state space models for efficient unlimited context language modeling. InInternational Conference on Learning Representations (ICLR), 2025
work page 2025
-
[44]
Rusch, T. K. and Rus, D. Oscillatory state-space models. InInternational Conference on Learning Representations (ICLR), 2025
work page 2025
-
[45]
WinoGrande: An Adversarial Winograd Schema Challenge at Scale
Sakaguchi, K., Bras, R. L., Bhagavatula, C., and Choi, Y . Winogrande: An adversarial winograd schema challenge at scale.arXiv preprint arXiv:1907.10641, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1907
-
[46]
The expressive capacity of state space models: A formal language perspective
Sarrof, Y ., Veitsman, Y ., and Hahn, M. The expressive capacity of state space models: A formal language perspective. InAdvances in Neural Information Processing Systems (NeurIPS), 2024
work page 2024
-
[47]
Linear transformers are secretly fast weight program- mers
Schlag, I., Irie, K., and Schmidhuber, J. Linear transformers are secretly fast weight program- mers. InInternational Conference on Machine Learning (ICML), 2021
work page 2021
-
[48]
Learning associative inference using fast weight memory
Schlag, I., Munkhdalai, T., and Schmidhuber, J. Learning associative inference using fast weight memory. InInternational Conference on Learning Representations (ICLR), 2021
work page 2021
-
[49]
DeltaProduct: Improving state-tracking in linear RNNs via Householder products
Siems, J., Carstensen, T., Zela, A., Hutter, F., Pontil, M., and Grazzi, R. DeltaProduct: Improving state-tracking in linear RNNs via Householder products. InICLR Workshop on Foundation Models in the Wild, 2025
work page 2025
-
[50]
Smith, J. T. H., Warrington, A., and Linderman, S. W. Simplified state space layers for sequence modeling. InInternational Conference on Learning Representations (ICLR), 2023
work page 2023
-
[51]
Straubing, H.Finite automata, formal logic, and circuit complexity. Birkhauser Verlag, CHE, 1994. 12
work page 1994
-
[52]
On the expressiveness and length generalization of selective state-space models on regular languages
Terzi´c, A., Hersche, M., Camposampiero, G., Hofmann, T., Sebastian, A., and Rahimi, A. On the expressiveness and length generalization of selective state-space models on regular languages. InConference on Artificial Intelligence (AAAI), 2025
work page 2025
-
[53]
Structured sparse transition matrices to enable state tracking in state-space models
Terzi´c, A., Menet, N., Hersche, M., Hofmann, T., and Rahimi, A. Structured sparse transition matrices to enable state tracking in state-space models. InAdvances in Neural Information Processing Systems (NeurIPS), 2025
work page 2025
-
[54]
N., Kaiser, L., and Polosukhin, I
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention Is All You Need. InAdvances in Neural Information Processing Systems (NeurIPS), 2017
work page 2017
-
[55]
An Empirical Study of Mamba-based Language Models
Waleffe, R., Byeon, W., Riach, D., Norick, B., Korthikanti, V ., Dao, T., Gu, A., Hatamizadeh, A., Singh, S., Narayanan, D., et al. An empirical study of mamba-based language models.arXiv preprint arXiv:2406.07887, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[56]
D., Qin, T., Cheng, Y ., Li, H., and Lyons, T
Walker, B., McLeod, A. D., Qin, T., Cheng, Y ., Li, H., and Lyons, T. Log neural controlled differential equations: The lie brackets make a difference.International Conference on Machine Learning (ICML), 2024
work page 2024
-
[57]
Walker, B., Yang, L., Cirone, N. M., Salvi, C., and Lyons, T. Structured linear CDEs: Maximally expressive and parallel-in-time sequence models.arXiv preprint arXiv:2505.17761, 2025
-
[58]
Wu, B., Shi, J., Wu, Y ., Tang, N., and Luo, Y . TransXSSM: A hybrid transformer state space model with unified rotary position embedding.arXiv preprint arXiv:2506.09507, 2025
-
[59]
On layer normalization in the transformer architecture
Xiong, R., Yang, Y ., He, D., Zheng, K., Zheng, S., Xing, C., Zhang, H., Lan, Y ., Wang, L., and Liu, T.-Y . On layer normalization in the transformer architecture. InInternational Conference on Machine Learning (ICML). JMLR.org, 2020
work page 2020
-
[60]
Parallelizing linear transformers with the delta rule over sequence length
Yang, S., Wang, B., Zhang, Y ., Shen, Y ., and Kim, Y . Parallelizing linear transformers with the delta rule over sequence length. InAdvances in Neural Information Processing Systems (NeurIPS), 2024
work page 2024
-
[61]
URL https:// doi.org/10.18653/v1/p19-1472
Zellers, R., Holtzman, A., Bisk, Y ., Farhadi, A., and Choi, Y . HellaSwag: Can a machine really finish your sentence? In Korhonen, A., Traum, D., and M`arquez, L. (eds.),Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4791–4800, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.186...
-
[62]
Each thread computesv[i] :=D t[i]·b[i]and writes the result to shared memory. 20
-
[63]
All threads synchronize using syncthreads() (to ensure each thread has written to v[i])
-
[64]
Each thread gathersv[P t[i]]from shared memory and adds it tob[i]to getb new[i]
-
[65]
A second syncthreads() ensures that the updated bnew[i] is visible to all threads before proceeding to the next time step. These synchronization barriers are required to guarantee correctness of the recurrence, as threads depend on values produced by other threads within the same time step. We empirically evaluated the performance impact of these synchron...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.