Pith · machine review for the scientific record

arxiv: 2208.04933 · v3 · submitted 2022-08-09 · 💻 cs.LG

Recognition: 3 theorem links


Simplified State Space Layers for Sequence Modeling

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 08:13 UTC · model grok-4.3

classification 💻 cs.LG
keywords state space models · S4 · S5 · sequence modeling · long range arena · parallel scans · deep learning

The pith

S5 uses one multi-input multi-output state space model to match S4 performance and efficiency on long sequences

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the S5 layer as a streamlined state space model for long-range sequences. It replaces the collection of many independent single-input single-output models used in S4 with a single multi-input multi-output state space model. The authors establish a connection to S4 that supplies the initialization and parameterization needed for stability. This design permits the use of standard parallel scan operations for fast computation. The result delivers 87.4 percent average accuracy on the long range arena benchmark and 98.5 percent on the Path-X task while preserving S4-level efficiency.
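The core architectural move can be sketched in a few lines. Below is a minimal NumPy sketch of a single discrete-time MIMO state space layer; the function name, dimensions, and the feedthrough convention `y_k = C x_k + D u_k` are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

def run_mimo_ssm(A, B, C, D, inputs):
    """Run one discrete-time MIMO SSM over a sequence:
        x_k = A x_{k-1} + B u_k,    y_k = C x_k + D u_k
    A: (P, P) state matrix, B: (P, H), C: (H, P), D: (H, H).
    inputs: (L, H) array of H-dimensional input vectors.
    Returns an (L, H) array of outputs.
    """
    x = np.zeros(A.shape[0])
    ys = []
    for u in inputs:
        x = A @ x + B @ u          # single shared state for all H channels
        ys.append(C @ x + D @ u)
    return np.stack(ys)
```

By contrast, an S4 layer would run H independent scalar-input copies of this recurrence, one per feature channel, and mix their outputs afterward.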

Core claim

The S5 layer consists of one multi-input multi-output state space model rather than many independent single-input single-output models as in S4. By connecting S5 to S4, the initialization and parameterization enable stable performance, allowing the use of efficient parallel scans and resulting in 87.4 percent average accuracy on the long range arena benchmark and 98.5 percent on the Path-X task.

What carries the argument

The S5 layer, a single multi-input multi-output state space model whose initialization and parameterization are derived from its connection to the S4 framework to support parallel scan computation.
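The parallel-scan claim rests on the fact that a linear recurrence composes associatively. A sketch for the diagonal-state case (the form S5 works in after diagonalization); `scan_op` and `prefix_scan` are assumed names, and the scan is a Hillis–Steele pattern written serially for clarity rather than the authors' implementation.

```python
import numpy as np

def scan_op(e1, e2):
    """Associative combine for x_k = a_k * x_{k-1} + b_k with diagonal a.
    e1 is the earlier element: composing (a1, b1) then (a2, b2) gives
    x = a2*(a1*x_prev + b1) + b2 = (a2*a1)*x_prev + (a2*b1 + b2)."""
    a1, b1 = e1
    a2, b2 = e2
    return (a2 * a1, a2 * b1 + b2)

def prefix_scan(elems):
    """Inclusive prefix scan under scan_op (O(log L) parallel depth on a
    parallel machine; executed serially here)."""
    elems = list(elems)
    n, step = len(elems), 1
    while step < n:
        nxt = list(elems)
        for i in range(step, n):
            nxt[i] = scan_op(elems[i - step], elems[i])
        elems, step = nxt, step * 2
    return elems

def sequential_scan(a_seq, b_seq):
    """Reference: run the recurrence one step at a time."""
    x, out = np.zeros_like(b_seq[0]), []
    for a, b in zip(a_seq, b_seq):
        x = a * x + b
        out.append(x)
    return out
```

Because the prefixes computed by the scan agree with the step-by-step recurrence, S5 can trade the O(L) sequential loop for a logarithmic-depth parallel scan, which is the efficiency argument the pith summarizes.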

Load-bearing premise

The initialization and parameterization obtained by connecting S5 to S4 will produce stable high-performing models across tasks without per-task tuning or adjustments.

What would settle it

Training an S5 model on Path-X using random initialization instead of the S4-derived connection and observing accuracy well below 98.5 percent or training instability.
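A cheap way to see why the initialization is load-bearing, sketched under the standard zero-order-hold discretization with illustrative numbers (not the paper's experiment): eigenvalues in the left half plane give discrete poles of magnitude below one, so the state stays bounded over long sequences, while a right-half-plane eigenvalue from a careless random initialization makes the recurrence diverge.

```python
import numpy as np

def zoh_poles(lam, dt):
    """Discrete-time poles of a diagonal continuous-time SSM under
    zero-order-hold discretization: a = exp(lambda * dt)."""
    return np.exp(lam * dt)

# HiPPO-derived initializations keep Re(lambda) < 0, so |a| < 1 and the
# recurrence x_k = a * x_{k-1} + ... is stable over long horizons.
stable = zoh_poles(np.array([-0.5 + 3.0j, -0.5 - 3.0j]), dt=0.1)

# An eigenvalue with Re(lambda) > 0 yields |a| > 1: over a Path-X-length
# sequence (16384 steps) the state magnitude grows without bound.
unstable = zoh_poles(np.array([0.05 + 3.0j]), dt=0.1)
```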

Original abstract

Models using structured state space sequence (S4) layers have achieved state-of-the-art performance on long-range sequence modeling tasks. An S4 layer combines linear state space models (SSMs), the HiPPO framework, and deep learning to achieve high performance. We build on the design of the S4 layer and introduce a new state space layer, the S5 layer. Whereas an S4 layer uses many independent single-input, single-output SSMs, the S5 layer uses one multi-input, multi-output SSM. We establish a connection between S5 and S4, and use this to develop the initialization and parameterization used by the S5 model. The result is a state space layer that can leverage efficient and widely implemented parallel scans, allowing S5 to match the computational efficiency of S4, while also achieving state-of-the-art performance on several long-range sequence modeling tasks. S5 averages 87.4% on the long range arena benchmark, and 98.5% on the most difficult Path-X task.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces the S5 layer as a simplification of S4: it replaces the bank of independent SISO SSMs with a single MIMO SSM, derives the initialization and parameterization from an explicit mathematical connection to the HiPPO-based S4 discretization, and shows that the resulting model matches S4's computational efficiency via parallel scans while reporting 87.4% average accuracy on the Long Range Arena benchmark and 98.5% on Path-X.

Significance. If the central performance claims hold, the work offers a meaningful architectural simplification that preserves long-range modeling capability while reducing implementation complexity. The explicit derivation from prior S4 results is a strength, as it grounds the new parameterization in existing theory rather than benchmark-specific fitting.

major comments (1)
  1. [§4 (Experiments) and Table 1] The reported LRA average of 87.4% and Path-X accuracy of 98.5% are presented without error bars, without an ablation on the S4-derived initialization, and without details on the hyperparameter search budget or per-task stabilization steps. These omissions make it impossible to verify that the MIMO parameterization alone transfers stability without implicit per-task adjustments, which is load-bearing for the claim that S5 matches S4 performance through the architectural simplification.
minor comments (1)
  1. [§3.2 (Parameterization)] The mapping from the S4 state matrix to the single MIMO A matrix is described at a high level; an explicit equation showing how the block-diagonal HiPPO structure is collapsed would improve reproducibility.
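For concreteness, one standard form such a mapping could take, reconstructed from the abstract's description rather than quoted from the paper: diagonalize the HiPPO-derived state matrix and absorb the eigenvectors into the input and output maps, so all input channels share one diagonal system.

```latex
\dot{x} = A x + B u, \qquad A = V \Lambda V^{-1}
\;\Longrightarrow\;
\dot{\tilde{x}} = \Lambda \tilde{x} + \tilde{B} u, \qquad
y = \tilde{C} \tilde{x} + D u,
```

with $\tilde{x} = V^{-1} x$, $\tilde{B} = V^{-1} B$, and $\tilde{C} = C V$, after which the diagonal $\Lambda$ is what makes the elementwise parallel scan applicable.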

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful review and positive assessment of the S5 layer as a meaningful simplification of S4. We agree that clearer experimental reporting will strengthen the claims and are happy to incorporate the suggested improvements as a minor revision.

Point-by-point responses
  1. Referee: [§4 (Experiments) and Table 1] The reported LRA average of 87.4% and Path-X accuracy of 98.5% are presented without error bars, without an ablation on the S4-derived initialization, and without details on the hyperparameter search budget or per-task stabilization steps. These omissions make it impossible to verify that the MIMO parameterization alone transfers stability without implicit per-task adjustments, which is load-bearing for the claim that S5 matches S4 performance through the architectural simplification.

    Authors: We agree that the current experimental presentation can be improved for verifiability. In the revised manuscript we will add error bars (standard deviation over 3–5 random seeds) to the LRA average and Path-X results in Table 1 and the corresponding text. We will also include a new ablation study in §4 that compares the S4-derived HiPPO initialization against random initialization, confirming that the structured initialization is responsible for stable training and performance. Finally, we will expand the experimental details to report the hyperparameter search budget (grid ranges and number of trials per task) and the specific stabilization steps (e.g., layer-norm placement, learning-rate schedules, and gradient-clipping thresholds), and confirm that these choices are applied uniformly rather than tuned per task beyond standard practice. These additions will make explicit that the reported performance stems from the MIMO parameterization and its S4-derived initialization.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: S5 init derived from external S4 connection

Full rationale

The paper establishes a mathematical connection to prior S4 work (distinct authors) to obtain initialization and parameterization for the MIMO SSM in S5. This is an external derivation, not a self-definition or fit inside the present manuscript. Performance numbers (87.4% LRA, 98.5% Path-X) are reported from empirical evaluation on standard benchmarks rather than being forced by construction. No load-bearing step reduces to a fitted quantity or self-citation chain defined within the paper itself.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The paper relies on the HiPPO framework and the S4 design as background assumptions; the main addition is the MIMO SSM structure whose initialization is justified by the S4 connection. No new free parameters beyond standard SSM state size are introduced in the abstract, and no new entities are postulated.

free parameters (1)
  • SSM state dimension
    The internal state size of the MIMO SSM is a model hyperparameter whose value is chosen to balance capacity and compute; its specific value is not derived from first principles in the abstract.
axioms (1)
  • Domain assumption: the mathematical connection between the S5 MIMO SSM and the S4 SISO SSMs yields a valid initialization and parameterization that transfers performance.
    This link is invoked to justify the S5 design choices and is central to the method's claimed success.

pith-pipeline@v0.9.0 · 5476 in / 1370 out tokens · 33932 ms · 2026-05-16T08:13:20.624547+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation.DimensionForcing eight_tick_forces_D3 echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    We build on the design of the S4 layer and introduce a new state space layer, the S5 layer. Whereas an S4 layer uses many independent single-input, single-output SSMs, the S5 layer uses one multi-input, multi-output SSM.

  • Foundation.HierarchyEmergence hierarchy_emergence_forces_phi echoes

    We establish a connection between S5 and S4, and use this to develop the initialization and parameterization used by the S5 model.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Rotation Equivariant Mamba for Vision Tasks

    cs.CV 2026-03 unverdicted novelty 8.0

    EQ-VMamba adds rotation-equivariant cross-scan and group Mamba blocks to enforce end-to-end rotation equivariance, yielding better rotation robustness, competitive accuracy, and roughly 50% fewer parameters than non-e...

  2. Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

    cs.CL 2023-09 unverdicted novelty 8.0

    Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.

  3. A Novel Schur-Decomposition-Based Weight Projection Method for Stable State-Space Neural-Network Architectures

    cs.LG 2026-05 unverdicted novelty 7.0

    A real Schur decomposition projection maps the state matrix of discrete-time state-space layers onto its nearest stable counterpart, delivering accuracy comparable to prior stable identification methods with fewer weights.

  4. QLAM: A Quantum Long-Attention Memory Approach to Long-Sequence Token Modeling

    cs.LG 2026-05 unverdicted novelty 7.0

    QLAM extends state-space models with quantum superposition in the hidden state for linear-time long-sequence modeling and reports consistent gains over RNN and transformer baselines on sequential image tasks.

  5. TCP-SSM: Efficient Vision State Space Models with Token-Conditioned Poles

    cs.CV 2026-05 unverdicted novelty 7.0

    TCP-SSM conditions stable poles on visual tokens to explicitly control memory decay and oscillation in SSMs, cutting computation up to 44% while matching or exceeding accuracy on classification, segmentation, and detection.

  6. TIDES: Implicit Time-Awareness in Selective State Space Models

    cs.LG 2026-05 unverdicted novelty 7.0

    TIDES reconciles selective SSM expressivity with continuous-time physical discretization by moving input dependence onto the state matrix, enabling native irregular time series handling and achieving SOTA on UEA and P...

  7. Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models

    cs.LG 2024-02 unverdicted novelty 7.0

    Griffin hybrid model matches Llama-2 performance while trained on over 6 times fewer tokens and offers lower inference latency with higher throughput.

  8. Phasor Memory Networks: Stable Backpropagation Through Time for Scalable Explicit Memory

    cs.LG 2026-05 unverdicted novelty 6.0

    PMNet uses unitary phasor dynamics and hierarchical anchors to make explicit memory stable for long sequences, matching a 3x larger Mamba model on long-context robustness with a 119M parameter network.

  9. Continuity Laws for Sequential Models

    cs.LG 2026-05 unverdicted novelty 6.0

    S4 models exhibit stable time-continuity unlike sensitive S6 models, with task continuity predicting performance and enabling temporal subsampling for better efficiency.

  10. Echo: KV-Cache-Free Associative Recall with Spectral Koopman Operators

    cs.LG 2026-05 unverdicted novelty 6.0

    Spectral Koopman operators let SSMs achieve 100% accuracy on long-gap multi-query associative recall with fixed memory, where pure Mamba fails.

  11. Cubit: Token Mixer with Kernel Ridge Regression

    cs.LG 2026-05 unverdicted novelty 6.0

    Cubit replaces Transformer attention with Kernel Ridge Regression token mixing and shows potential gains on longer sequences.

  12. The Impossibility Triangle of Long-Context Modeling

    cs.CL 2026-05 unverdicted novelty 6.0

    No model can achieve efficiency, compactness, and recall capacity scaling with sequence length at once, as any two imply a strict bound of O(poly(d)/log V) on recallable facts.

  13. State Stream Transformer (SST) V2: Parallel Training of Nonlinear Recurrence for Latent Space Reasoning

    cs.LG 2026-04 unverdicted novelty 6.0

    SST V2 introduces parallel-trainable nonlinear recurrence in latent space to let transformers reason continuously across positions, delivering +15 points on GPQA-Diamond and halving remaining GSM8K errors over matched...

  14. The Recurrent Transformer: Greater Effective Depth and Efficient Decoding

    cs.LG 2026-04 unverdicted novelty 6.0

    Recurrent Transformers add per-layer recurrent memory via self-attention on own activations plus a tiling algorithm that reduces training memory traffic, yielding better C4 pretraining cross-entropy than parameter-mat...

  15. An explicit operator explains end-to-end computation in the modern neural networks used for sequence and language modeling

    cs.NE 2026-04 unverdicted novelty 6.0

    S4D state space models correspond exactly to wave propagation and nonlinear wave interactions in a one-dimensional ring oscillator network, with a closed-form operator describing the complete input-output map.

  16. Event-Adaptive State Transition and Gated Fusion for RGB-Event Object Tracking

    cs.CV 2026-04 unverdicted novelty 6.0

    MambaTrack improves RGB-Event object tracking via event-adaptive state transitions in a Dynamic State Space Model and a Gated Projection Fusion module, reporting state-of-the-art results on FE108 and FELT datasets.

  17. Latent-Space Causal Discovery from Indirect Neuroimaging Observations

    q-bio.NC 2026-01 unverdicted novelty 6.0

    INCAMA recovers directed causal graphs from indirect neuroimaging by physics-aware inversion plus delay-aware Mamba encoding, yielding 2-3x F1 gains in simulations and anatomically plausible sparse graphs on HCP motor fMRI.

  18. Physics-Guided Tiny-Mamba Transformer for Reliability-Aware Early Fault Warning

    cs.LG 2026-01 unverdicted novelty 6.0

    PG-TMT couples a physics-aligned tri-branch encoder with EVT-calibrated decision rules to achieve higher PR-AUC and shorter detection times at controlled false-alarm rates across multiple bearing datasets.

  19. Beyond ZOH: Advanced Discretization Strategies for Vision Mamba

    cs.CV 2026-04 unverdicted novelty 4.0

    Bilinear discretization improves Vision Mamba accuracy over zero-order hold on classification, segmentation, and detection benchmarks with only modest extra training cost.

  20. Deep Learning for Virtual Reality User Identification: A Benchmark

    cs.HC 2026-03 unverdicted novelty 4.0

    A benchmark study evaluates standard and emerging deep learning architectures on motion data from 71 VR users, establishing performance baselines for user identification.

  21. A Survey on Efficient Inference for Large Language Models

    cs.CL 2024-04 accept novelty 3.0

    The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.

Reference graph

Works this paper leans on

149 extracted references · 149 canonical work pages · cited by 21 Pith papers · 6 internal anchors
