Simplified State Space Layers for Sequence Modeling
Recognition: 3 theorem links · Lean theorems
Pith reviewed 2026-05-16 08:13 UTC · model grok-4.3
The pith
S5 uses one multi-input, multi-output state space model to match S4 performance and efficiency on long sequences
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The S5 layer consists of one multi-input, multi-output state space model rather than the many independent single-input, single-output models used in S4. The connection between S5 and S4 yields the initialization and parameterization that make training stable, allows the use of efficient parallel scans, and results in 87.4 percent average accuracy on the Long Range Arena benchmark and 98.5 percent on the Path-X task.
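For orientation, a hedged sketch in standard state-space notation (the paper's own symbols may differ): S4 runs H independent copies of a scalar-input, scalar-output system, while S5 runs one vector-input, vector-output system of the same general form.

```latex
% Generic linear SSM; shapes are illustrative, not the paper's notation.
% SISO (S4, per channel h): u_h(t), y_h(t) \in \mathbb{R}.
% MIMO (S5): u(t), y(t) \in \mathbb{R}^{H}.
\begin{aligned}
\dot{x}(t) &= A\,x(t) + B\,u(t), & A &\in \mathbb{R}^{P \times P},\; B \in \mathbb{R}^{P \times H},\\
y(t) &= C\,x(t) + D\,u(t), & C &\in \mathbb{R}^{H \times P},\; D \in \mathbb{R}^{H \times H}.
\end{aligned}
```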
What carries the argument
The S5 layer: a single multi-input, multi-output state space model whose initialization and parameterization are derived from its connection to the S4 framework, and whose linear recurrence can be computed with a parallel scan.
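To make "efficient parallel scans" concrete, here is a minimal sketch, not the authors' implementation: it evaluates the discretized linear recurrence x_k = A_bar x_{k-1} + B_bar u_k (with x_0 = 0) using jax.lax.associative_scan, assuming a diagonal state matrix as in S5's diagonalized parameterization. The function and shape names (ssm_scan, P, H, L) are illustrative.

```python
# Minimal sketch (assumptions noted above): parallel evaluation of
# x_k = A_bar * x_{k-1} + B_bar u_k for a diagonal A_bar, via an
# associative scan over per-step affine maps x -> a*x + b.
import jax
import jax.numpy as jnp

def combine(e1, e2):
    # Compose two affine maps: applying (a1, b1) then (a2, b2)
    # gives x -> a2*a1*x + (a2*b1 + b2). Associativity of this
    # composition is what permits an O(log L)-depth parallel scan.
    a1, b1 = e1
    a2, b2 = e2
    return a2 * a1, a2 * b1 + b2

def ssm_scan(Lambda_bar, B_bar, u):
    # Lambda_bar: (P,) diagonal of the discretized state matrix
    # B_bar: (P, H) discretized input matrix; u: (L, H) inputs
    Bu = u @ B_bar.T                                    # (L, P)
    A_elems = jnp.broadcast_to(Lambda_bar, Bu.shape)    # (L, P)
    _, xs = jax.lax.associative_scan(combine, (A_elems, Bu))
    return xs                                           # states x_1..x_L

# Toy usage on illustrative shapes:
k1, k2 = jax.random.split(jax.random.PRNGKey(0))
P, H, L = 4, 3, 8
xs = ssm_scan(jnp.full((P,), 0.9),
              jax.random.normal(k1, (P, H)),
              jax.random.normal(k2, (L, H)))
print(xs.shape)  # (8, 4)
```

Run sequentially, the same recurrence costs L dependent steps; the associative scan trades that for O(L) work at O(log L) depth, which is the efficiency the core claim leans on.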
Load-bearing premise
The initialization and parameterization obtained by connecting S5 to S4 will produce stable high-performing models across tasks without per-task tuning or adjustments.
What would settle it
Training an S5 model on Path-X using random initialization instead of the S4-derived connection and observing accuracy well below 98.5 percent or training instability.
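Such an ablation would hold everything fixed except the state-matrix initialization. For reference, the HiPPO-LegS matrix behind S4-style initializations has a simple closed form; the sketch below is reconstructed from the HiPPO line of work, not from this paper, and hippo_legs is an illustrative name.

```python
# Hedged sketch of the HiPPO-LegS matrix (from the HiPPO literature):
# the structured initialization that a "random init" ablation would replace.
import numpy as np

def hippo_legs(N: int) -> np.ndarray:
    """N x N HiPPO-LegS state matrix A (lower triangular)."""
    A = np.zeros((N, N))
    for n in range(N):
        for k in range(N):
            if n > k:
                A[n, k] = -np.sqrt((2 * n + 1) * (2 * k + 1))
            elif n == k:
                A[n, k] = -(n + 1)
    return A

# Being lower triangular, its eigenvalues are the diagonal entries
# -(n + 1): all negative, hence stable dynamics. A generic random
# matrix carries no such guarantee, which is what the proposed
# Path-X ablation would expose.
print(hippo_legs(4))
```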
Original abstract
Models using structured state space sequence (S4) layers have achieved state-of-the-art performance on long-range sequence modeling tasks. An S4 layer combines linear state space models (SSMs), the HiPPO framework, and deep learning to achieve high performance. We build on the design of the S4 layer and introduce a new state space layer, the S5 layer. Whereas an S4 layer uses many independent single-input, single-output SSMs, the S5 layer uses one multi-input, multi-output SSM. We establish a connection between S5 and S4, and use this to develop the initialization and parameterization used by the S5 model. The result is a state space layer that can leverage efficient and widely implemented parallel scans, allowing S5 to match the computational efficiency of S4, while also achieving state-of-the-art performance on several long-range sequence modeling tasks. S5 averages 87.4% on the long range arena benchmark, and 98.5% on the most difficult Path-X task.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the S5 layer as a simplification of S4: it replaces the bank of independent SISO SSMs with a single MIMO SSM, derives the initialization and parameterization from an explicit mathematical connection to the HiPPO-based S4 discretization, and shows that the resulting model matches S4's computational efficiency via parallel scans while reporting 87.4% average accuracy on the Long Range Arena benchmark and 98.5% on Path-X.
Significance. If the central performance claims hold, the work offers a meaningful architectural simplification that preserves long-range modeling capability while reducing implementation complexity. The explicit derivation from prior S4 results is a strength, as it grounds the new parameterization in existing theory rather than benchmark-specific fitting.
major comments (1)
- [§4 (Experiments) and Table 1] The reported LRA average of 87.4% and Path-X accuracy of 98.5% are presented without error bars, without an ablation on the S4-derived initialization, and without details on the hyperparameter search budget or per-task stabilization steps. These omissions make it impossible to verify that the MIMO parameterization alone transfers stability without implicit per-task adjustments, which is load-bearing for the claim that S5 matches S4 performance through the architectural simplification.
minor comments (1)
- [§3.2 (Parameterization)] The mapping from the S4 state matrices to the single MIMO A matrix is described only at a high level; an explicit equation showing how the block-diagonal HiPPO structure is collapsed would improve reproducibility (a hedged sketch of such an equation follows).
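For illustration only, and hedged because the paper's notation may differ: the stated S5-S4 connection suggests an equivalence of roughly the following shape, where H SISO systems with state matrices A^(h) and input vectors B^(h) stack into one block-diagonal MIMO system (e_h denotes the h-th standard basis vector of R^H).

```latex
% Hedged sketch: H independent SISO SSMs, each with
% A^{(h)} \in \mathbb{R}^{N \times N} and B^{(h)} \in \mathbb{R}^{N \times 1},
% viewed as one MIMO SSM whose h-th block reads only input channel u_h.
A_{\mathrm{MIMO}} = \mathrm{diag}\!\left(A^{(1)}, \dots, A^{(H)}\right)
\in \mathbb{R}^{HN \times HN},
\qquad
B_{\mathrm{MIMO}} =
\begin{pmatrix} B^{(1)} e_1^{\top} \\ \vdots \\ B^{(H)} e_H^{\top} \end{pmatrix}
\in \mathbb{R}^{HN \times H}.
```

S5 then works with a single system of comparable state size rather than the full HN-dimensional block-diagonal stack; the referee's request is for the paper to state this collapse explicitly.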
Simulated Author's Rebuttal
We thank the referee for the thoughtful review and positive assessment of the S5 layer as a meaningful simplification of S4. We agree that clearer experimental reporting will strengthen the claims and are happy to incorporate the suggested improvements as a minor revision.
Point-by-point responses
- Referee: [§4 (Experiments) and Table 1] The reported LRA average of 87.4% and Path-X accuracy of 98.5% are presented without error bars, without an ablation on the S4-derived initialization, and without details on the hyperparameter search budget or per-task stabilization steps. These omissions make it impossible to verify that the MIMO parameterization alone transfers stability without implicit per-task adjustments, which is load-bearing for the claim that S5 matches S4 performance through the architectural simplification.
Authors: We agree that the current experimental presentation can be improved for verifiability. In the revised manuscript we will add error bars (standard deviation over 3–5 random seeds) to the LRA average and Path-X results in Table 1 and the corresponding text. We will also include a new ablation study in §4 that compares the S4-derived HiPPO initialization against random initialization, confirming that the structured initialization is responsible for stable training and performance. Finally, we will expand the experimental details to report the hyperparameter search budget (grid ranges and number of trials per task) and the specific stabilization steps (e.g., layer-norm placement, learning-rate schedules, and gradient-clipping thresholds), and confirm that these choices are applied uniformly rather than tuned per task beyond standard practice. These additions will make explicit that the reported performance stems from the MIMO parameterization and its S4-derived initialization.
Revision planned: yes
Circularity Check
No circularity: S5 init derived from external S4 connection
Full rationale
The paper establishes a mathematical connection to prior S4 work (distinct authors) to obtain initialization and parameterization for the MIMO SSM in S5. This is an external derivation, not a self-definition or fit inside the present manuscript. Performance numbers (87.4% LRA, 98.5% Path-X) are reported from empirical evaluation on standard benchmarks rather than being forced by construction. No load-bearing step reduces to a fitted quantity or self-citation chain defined within the paper itself.
Axiom & Free-Parameter Ledger
free parameters (1)
- SSM state dimension
axioms (1)
- Domain assumption: The mathematical connection between the S5 MIMO SSM and the S4 SISO SSMs yields a valid initialization and parameterization that transfers performance.
Lean theorems connected to this paper
- Foundation.DimensionForcing.eight_tick_forces_D3 (tag: echoes)
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
"We build on the design of the S4 layer and introduce a new state space layer, the S5 layer. Whereas an S4 layer uses many independent single-input, single-output SSMs, the S5 layer uses one multi-input, multi-output SSM."
- Foundation.HierarchyEmergence.hierarchy_emergence_forces_phi (tag: echoes)
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
"We establish a connection between S5 and S4, and use this to develop the initialization and parameterization used by the S5 model."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 21 Pith papers
- Rotation Equivariant Mamba for Vision Tasks. EQ-VMamba adds rotation-equivariant cross-scan and group Mamba blocks to enforce end-to-end rotation equivariance, yielding better rotation robustness, competitive accuracy, and roughly 50% fewer parameters than non-e...
- Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution. Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.
- A Novel Schur-Decomposition-Based Weight Projection Method for Stable State-Space Neural-Network Architectures. A real Schur decomposition projection maps the state matrix of discrete-time state-space layers onto its nearest stable counterpart, delivering accuracy comparable to prior stable identification methods with fewer weights.
- QLAM: A Quantum Long-Attention Memory Approach to Long-Sequence Token Modeling. QLAM extends state-space models with quantum superposition in the hidden state for linear-time long-sequence modeling and reports consistent gains over RNN and transformer baselines on sequential image tasks.
- TCP-SSM: Efficient Vision State Space Models with Token-Conditioned Poles. TCP-SSM conditions stable poles on visual tokens to explicitly control memory decay and oscillation in SSMs, cutting computation up to 44% while matching or exceeding accuracy on classification, segmentation, and detection.
- TIDES: Implicit Time-Awareness in Selective State Space Models. TIDES reconciles selective SSM expressivity with continuous-time physical discretization by moving input dependence onto the state matrix, enabling native irregular time series handling and achieving SOTA on UEA and P...
- Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models. The Griffin hybrid model matches Llama-2 performance while trained on over 6 times fewer tokens and offers lower inference latency with higher throughput.
- Phasor Memory Networks: Stable Backpropagation Through Time for Scalable Explicit Memory. PMNet uses unitary phasor dynamics and hierarchical anchors to make explicit memory stable for long sequences, matching a 3x larger Mamba model on long-context robustness with a 119M parameter network.
- Continuity Laws for Sequential Models. S4 models exhibit stable time-continuity unlike sensitive S6 models, with task continuity predicting performance and enabling temporal subsampling for better efficiency.
- Echo: KV-Cache-Free Associative Recall with Spectral Koopman Operators. Spectral Koopman operators let SSMs achieve 100% accuracy on long-gap multi-query associative recall with fixed memory, where pure Mamba fails.
- Cubit: Token Mixer with Kernel Ridge Regression. Cubit replaces Transformer attention with Kernel Ridge Regression token mixing and shows potential gains on longer sequences.
- The Impossibility Triangle of Long-Context Modeling. No model can achieve efficiency, compactness, and recall capacity scaling with sequence length at once, as any two imply a strict bound of O(poly(d)/log V) on recallable facts.
- State Stream Transformer (SST) V2: Parallel Training of Nonlinear Recurrence for Latent Space Reasoning. SST V2 introduces parallel-trainable nonlinear recurrence in latent space to let transformers reason continuously across positions, delivering +15 points on GPQA-Diamond and halving remaining GSM8K errors over matched...
- The Recurrent Transformer: Greater Effective Depth and Efficient Decoding. Recurrent Transformers add per-layer recurrent memory via self-attention on own activations plus a tiling algorithm that reduces training memory traffic, yielding better C4 pretraining cross-entropy than parameter-mat...
- An explicit operator explains end-to-end computation in the modern neural networks used for sequence and language modeling. S4D state space models correspond exactly to wave propagation and nonlinear wave interactions in a one-dimensional ring oscillator network, with a closed-form operator describing the complete input-output map.
- Event-Adaptive State Transition and Gated Fusion for RGB-Event Object Tracking. MambaTrack improves RGB-Event object tracking via event-adaptive state transitions in a Dynamic State Space Model and a Gated Projection Fusion module, reporting state-of-the-art results on FE108 and FELT datasets.
- Latent-Space Causal Discovery from Indirect Neuroimaging Observations. INCAMA recovers directed causal graphs from indirect neuroimaging by physics-aware inversion plus delay-aware Mamba encoding, yielding 2-3x F1 gains in simulations and anatomically plausible sparse graphs on HCP motor fMRI.
- Physics-Guided Tiny-Mamba Transformer for Reliability-Aware Early Fault Warning. PG-TMT couples a physics-aligned tri-branch encoder with EVT-calibrated decision rules to achieve higher PR-AUC and shorter detection times at controlled false-alarm rates across multiple bearing datasets.
- Beyond ZOH: Advanced Discretization Strategies for Vision Mamba. Bilinear discretization improves Vision Mamba accuracy over zero-order hold on classification, segmentation, and detection benchmarks with only modest extra training cost.
- Deep Learning for Virtual Reality User Identification: A Benchmark. A benchmark study evaluates standard and emerging deep learning architectures on motion data from 71 VR users, establishing performance baselines for user identification.
- A Survey on Efficient Inference for Large Language Models. The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.