TIDES: Implicit Time-Awareness in Selective State Space Models
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-12 03:02 UTC · model grok-4.3
The pith
TIDES moves input dependence from the discretization step to the diagonal state matrix in selective SSMs, letting the step retain physical meaning for irregular timestamps.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TIDES is a selective SSM variant that reconciles selective and continuous architectures by moving input-dependence off the step size and onto the diagonal state matrix. As a result, the discretization step retains its physical meaning tied to state discretization, allowing the model to handle irregular timestamps natively without sacrificing the per-token expressivity that makes selective SSMs effective.
What carries the argument
Input-dependent diagonal state matrix, which carries selectivity while leaving the discretization step as a fixed physical interval.
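To make the relocation concrete, here is a minimal sketch of one diagonal state channel under zero-order-hold (ZOH) discretization, contrasting Mamba-style selectivity (input-dependent step) with the TIDES-style variant the abstract describes (input-dependent diagonal, physical step). The function names and the simplified input path are our illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Mamba-style selectivity: the decay rate lam is fixed, and the step size
# becomes a learned function of the input, so it no longer equals the
# physical sampling interval.
def mamba_style_step(h, x, lam, delta_of_x, b):
    d = delta_of_x(x)                         # input-dependent "step"
    a_bar = np.exp(lam * d)                   # ZOH transition factor
    return a_bar * h + (1.0 - a_bar) * b * x  # simplified ZOH input path

# TIDES-style selectivity (per the abstract): the step stays the physical
# timestamp gap; input dependence moves onto the diagonal decay rate.
def tides_style_step(h, x, delta_physical, lam_of_x, b):
    lam = lam_of_x(x)                         # input-dependent eigenvalue
    a_bar = np.exp(lam * delta_physical)      # Delta keeps physical meaning
    return a_bar * h + (1.0 - a_bar) * b * x

# Example: irregular timestamps enter the TIDES-style step directly.
h = 0.0
lam_of_x = lambda x: -np.log1p(np.exp(x))     # -softplus keeps lam < 0
for x, dt in [(0.5, 0.1), (1.2, 0.7), (-0.3, 0.05)]:
    h = tides_style_step(h, x, dt, lam_of_x, b=1.0)
print(h)
```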
If this is right
- Sequence models can now process data with arbitrary sampling intervals without separate preprocessing or loss of dynamic selectivity.
- The physical interpretability of the discretization step enables direct use in domains where time intervals carry physical meaning, such as sensor streams or biological signals.
- The Fading Flash benchmark provides a compact test that isolates whether a model truly separates input dependence from time discretization.
- The state-of-the-art average ranks reported on the UEA time-series classification suite and the Physiome-ODE regression benchmark can be credited to the architectural change itself.
Where Pith is reading between the lines
- The same relocation technique could be applied to other continuous-time or hybrid recurrent architectures that currently tie selectivity to the step size.
- Because the diagonal remains the carrier of dynamics, stability analysis and eigenvalue control may become simpler than when selectivity is entangled with the step.
- This separation suggests a general design principle: keep time semantics in the integrator and move all data dependence into the system matrix.
Load-bearing premise
That relocating input dependence to the diagonal state matrix preserves the full selective expressivity and stability of prior selective SSMs while maintaining a physically interpretable discretization step.
What would settle it
An experiment testing whether TIDES matches or exceeds the per-token performance of standard selective SSMs on regularly sampled sequences while also extrapolating and interpolating correctly on sequences with out-of-distribution irregular time intervals; failure on either axis would break the reconciliation claim.
Original abstract
Selective state space models (SSMs), such as Mamba, achieve strong per-token expressivity by making the time discretization step $\tilde{\Delta}$ a learned function of the input. However, in doing so, $\tilde{\Delta}$ ceases to represent a physical sampling interval, limiting its irregular time series modeling capability. Continuous-time SSMs, such as S5, preserve the physical meaning of $\tilde{\Delta}$ and handle irregular timestamps natively ($\tilde{\Delta}\equiv\Delta$), but their dynamics remain linear time-invariant (LTI), limiting per-token expressivity. We propose \textbf{TIDES}, a selective SSM variant that reconciles selective and continuous architectures by moving input-dependence off the step size and onto the diagonal state matrix. As a result, $\tilde{\Delta}$ retains its physical meaning, tied to the state discretization, allowing the model to handle irregular timestamps natively without sacrificing the per-token expressivity that makes selective SSMs effective. We show this on a novel \emph{Fading Flash} experimental benchmark, a compact controlled diagnostic for sequence models that jointly tests input-dependence and extrapolation to out-of-distribution $\Delta$ values, and isolates the distinct failure modes of current state-of-the-art architectures that TIDES avoids by construction. On large-scale benchmarks, TIDES sets the new state-of-the-art average rank on UEA time-series classification and the Physiome-ODE regression benchmark. Code available at: https://github.com/TaylanSoydan/TIDES.
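Read side by side, the two discretizations the abstract contrasts can be written as follows; this is our reconstruction from the abstract's notation (and the ZOH form quoted in the Lean-links section below), not a formula taken from the paper body.

```latex
% Mamba-style selectivity: the learned step carries input dependence,
% so \tilde{\Delta} need not equal the physical interval \Delta_k.
\bar{A}_k \;=\; \exp\!\big(\tilde{\Delta}(x_k)\, A\big)
% TIDES (per the abstract): the diagonal state matrix carries input
% dependence under ZOH discretization, and the step stays physical.
\bar{\Lambda}_k \;=\; \exp\!\big(\Lambda(x_k)\, \tilde{\Delta}_k\big),
\qquad \tilde{\Delta}_k = \Delta_k.
```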
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes TIDES, a selective SSM variant that relocates input dependence from the discretization step size to the diagonal of the state matrix. This is claimed to preserve per-token expressivity while allowing the step size to retain physical meaning, enabling native handling of irregular timestamps. The authors introduce the Fading Flash benchmark as a diagnostic for input dependence and OOD extrapolation, and report new SOTA average rank on UEA time-series classification and the Physiome-ODE regression benchmark.
Significance. If the reconciliation of selective and continuous SSMs holds, the approach could improve modeling of irregular time series in applications such as healthcare and physics without loss of expressivity. The new Fading Flash benchmark and public code release are constructive contributions for controlled evaluation and reproducibility.
major comments (4)
- [Abstract / Proposed Method] Abstract and method description: the central reconciliation claim—that input dependence on the diagonal of A with fixed physical Δ yields equivalent per-token expressivity to input-dependent Δ with fixed A—is not supported by any derivation. The discretized transitions exp(Δ · A(x)) and exp(Δ(x) · A) are not shown to span comparable sets of state-transition behaviors, which is load-bearing for the expressivity and stability assertions.
- [Method / Stability] No stability analysis is provided for the input-dependent diagonal state matrix. When A(x) varies per token, eigenvalue properties (e.g., negative real parts for continuous-time stability) may not be preserved; this requires explicit bounds or empirical verification to support practical use.
- [Experiments / Fading Flash] Fading Flash benchmark: the manuscript states that the benchmark isolates distinct failure modes that TIDES avoids by construction, yet provides no construction details, quantitative metrics, or results tables demonstrating the claimed isolation and avoidance.
- [Experiments] Large-scale results: SOTA claims on UEA and Physiome-ODE are stated without error bars, statistical tests, ablation studies, or hyperparameter details, making it impossible to assess whether the performance gains are attributable to the architectural change.
minor comments (1)
- [Notation] Notation for the physical versus learned step size (Δ vs. ~Δ) should be introduced and used consistently once the discretization is defined.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive report. We address each major comment point-by-point below, indicating where revisions will be made to strengthen the manuscript.
Point-by-point responses
-
Referee: [Abstract / Proposed Method] Abstract and method description: the central reconciliation claim—that input dependence on the diagonal of A with fixed physical Δ yields equivalent per-token expressivity to input-dependent Δ with fixed A—is not supported by any derivation. The discretized transitions exp(Δ · A(x)) and exp(Δ(x) · A) are not shown to span comparable sets of state-transition behaviors, which is load-bearing for the expressivity and stability assertions.
Authors: We agree that a formal derivation would strengthen the central claim. While the manuscript motivates the equivalence by noting that input-dependent diagonal A(x) modulates per-token eigenvalues (decay rates) under fixed physical Δ, we did not provide an explicit comparison of the spanned transition sets. In the revision we will add a dedicated subsection with a proof sketch showing that, for diagonal A, the map x → exp(Δ A(x)) can realize a comparable family of per-token scalings and rotations to the input-dependent-Δ case, supported by a low-dimensional analytic example and a brief discussion of the resulting function spaces. revision: yes
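One way such a proof sketch could go for the diagonal case (our illustration, not the authors' derivation): at the scalar level both parameterizations realize the same decay factors, while in higher state dimensions input-dependent Δ couples the modes and input-dependent Λ does not.

```latex
% Single mode (lambda < 0, Delta > 0): both parameterizations realize
% any per-token decay factor in (0, 1), so the scalar families coincide.
\exp\big(\Delta\,\lambda(x_k)\big) \in (0,1)
\quad\text{and}\quad
\exp\big(\Delta(x_k)\,\lambda\big) \in (0,1).
% With n diagonal modes the two differ: input-dependent \Delta(x_k)
% rescales all modes jointly, so the log-decay ratios stay fixed,
\frac{\log \bar{a}_{i,k}}{\log \bar{a}_{j,k}}
  \;=\; \frac{\lambda_i}{\lambda_j}
  \qquad \text{(independent of } k\text{)},
% whereas input-dependent \Lambda(x_k) can modulate each mode
% independently at a fixed physical \Delta.
```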
-
Referee: [Method / Stability] No stability analysis is provided for the input-dependent diagonal state matrix. When A(x) varies per token, eigenvalue properties (e.g., negative real parts for continuous-time stability) may not be preserved; this requires explicit bounds or empirical verification to support practical use.
Authors: We acknowledge the absence of a dedicated stability analysis. In the revised manuscript we will insert a new subsection that (i) derives sufficient conditions on the learned diagonal entries of A(x) to keep real parts negative (via a simple clipping or soft constraint during training), and (ii) reports empirical eigenvalue statistics and training stability metrics across all benchmarks, confirming that the learned A(x) remained stable in practice. revision: yes
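A minimal sketch of the kind of soft constraint the authors describe, assuming a softplus reparameterization (the specific form, module name, and shapes are our assumptions):

```python
import torch
import torch.nn as nn

class StableDiagonal(nn.Module):
    """Input-dependent diagonal with guaranteed negative real parts.

    Re(lambda(x)) = -softplus(Wx + b) - eps < 0 for every input, so each
    ZOH factor exp(lambda * delta) stays strictly inside (0, 1) for any
    positive physical step delta.
    """
    def __init__(self, d_in, d_state, eps=1e-4):
        super().__init__()
        self.proj = nn.Linear(d_in, d_state)
        self.eps = eps

    def forward(self, x, delta):
        # x: (B, L, d_in); delta: (B, L) physical timestamp gaps.
        lam = -nn.functional.softplus(self.proj(x)) - self.eps  # Re < 0
        return torch.exp(lam * delta.unsqueeze(-1))             # in (0, 1)
```

Under this parameterization no clipping is needed at inference time; stability holds by construction for every token.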
-
Referee: [Experiments / Fading Flash] Fading Flash benchmark: the manuscript states that the benchmark isolates distinct failure modes that TIDES avoids by construction, yet provides no construction details, quantitative metrics, or results tables demonstrating the claimed isolation and avoidance.
Authors: The current manuscript contains a high-level description of Fading Flash in Section 4 together with qualitative illustrations. To address the referee’s concern we will expand this section with: (a) the precise generative process and parameter ranges used to create the synthetic sequences, (b) the quantitative metrics (accuracy gap on in-distribution vs. OOD Δ, per-model failure rates), and (c) a results table that directly contrasts the failure modes of Mamba, S5, and TIDES on the benchmark. revision: yes
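For instance, the accuracy-gap metric mentioned in (b) could be computed along these lines (the metric definition is inferred from the rebuttal's description; the function name is hypothetical):

```python
import numpy as np

def accuracy_gap(preds_id, labels_id, preds_ood, labels_ood):
    """Gap between in-distribution and out-of-distribution accuracy.

    preds_*: predicted class indices; labels_*: ground-truth indices.
    A model that truly decouples input dependence from the time step
    should show a small gap when only the Delta distribution shifts.
    """
    acc_id = np.mean(np.asarray(preds_id) == np.asarray(labels_id))
    acc_ood = np.mean(np.asarray(preds_ood) == np.asarray(labels_ood))
    return acc_id - acc_ood
```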
-
Referee: [Experiments] Large-scale results: SOTA claims on UEA and Physiome-ODE are stated without error bars, statistical tests, ablation studies, or hyperparameter details, making it impossible to assess whether the performance gains are attributable to the architectural change.
Authors: We agree that the experimental section would be more convincing with these elements. In the revision we will add: error bars (standard deviation over 5 random seeds), paired statistical tests (Wilcoxon signed-rank) against the strongest baselines, a comprehensive ablation table isolating the contribution of the input-dependent diagonal A, and a hyperparameter appendix listing the exact search ranges and final values used for all models. revision: yes
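The promised paired Wilcoxon signed-rank test over per-dataset scores might look like this (the score arrays below are placeholders, not results from the paper):

```python
from scipy.stats import wilcoxon

# Per-dataset accuracies for TIDES and the strongest baseline
# (placeholder numbers; one entry per dataset, averaged over 5 seeds).
tides_scores    = [0.91, 0.84, 0.78, 0.88, 0.73, 0.95]
baseline_scores = [0.89, 0.85, 0.74, 0.86, 0.70, 0.93]

# Two-sided paired test on the per-dataset score differences.
stat, p_value = wilcoxon(tides_scores, baseline_scores)
print(f"Wilcoxon statistic={stat:.3f}, p={p_value:.4f}")
```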
Circularity Check
No significant circularity in TIDES architectural proposal or benchmarks
Full rationale
The paper proposes TIDES as an architectural variant that relocates input dependence from the discretization step size to the diagonal of the state matrix, claiming this preserves per-token expressivity while retaining physical meaning for Δ. This is presented as a design choice, not a derivation that reduces to its own inputs. The abstract and description contain no equations showing a self-definitional reduction (e.g., no fitted parameter renamed as a prediction or ansatz smuggled via self-citation). The Fading Flash benchmark and large-scale results are empirical validations presented as independent of the proposal itself. No load-bearing self-citations or uniqueness theorems from prior author work are invoked. The central claim of reconciliation has independent content and does not collapse by construction.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
The relation between this paper passage and the cited Recognition theorem is unclear.
Quoted passage: "moving input-dependence off the step size and onto the diagonal state matrix... $\tilde{\Delta}$ retains its physical meaning, tied to the state discretization"
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean: costAlphaLog_fourth_deriv_at_zero (tag: unclear)
The relation between this paper passage and the cited Recognition theorem is unclear.
Quoted passage: "$\mathrm{Re}(\Lambda_k) = W_\Lambda u_k + \mathrm{Re}(\Lambda_0)$... ZOH discretization $\bar{\Lambda}_k = \exp(\Lambda_k \tilde{\Delta}_k)$ with $\tilde{\Delta}_k = \Delta_k$"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Satya Narayan Shukla and Benjamin M Marlin. Multi-time attention networks for irregularly sampled time series. arXiv preprint arXiv:2101.10318, 2021.
- [3] Zhengping Che, Sanjay Purushotham, Kyunghyun Cho, David Sontag, and Yan Liu. Recurrent neural networks for multivariate time series with missing values. Scientific Reports, 8(1):6085, 2018.
- [4] Yulia Rubanova, Ricky TQ Chen, and David K Duvenaud. Latent ordinary differential equations for irregularly-sampled time series. Advances in Neural Information Processing Systems, 32, 2019.
- [5] Jimmy TH Smith, Andrew Warrington, and Scott W Linderman. Simplified state space layers for sequence modeling. arXiv preprint arXiv:2208.04933, 2022.
- [6] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.
- [7] Tri Dao and Albert Gu. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. arXiv preprint arXiv:2405.21060, 2024.
- [8] Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, and Christopher Ré. HiPPO: Recurrent memory with optimal polynomial projections. Advances in Neural Information Processing Systems, 33:1474–1487, 2020.
- [9] Allan L Fisher and Anwar M Ghuloum. Parallelizing complex scans and reductions. ACM SIGPLAN Notices, 29(6):135–146, 1994.
- [10] Shida Wang and Qianxiao Li. StableSSM: Alleviating the curse of memory in state-space models through stable reparameterization, 2024. URL https://arxiv.org/abs/2311.14495.
- [11] Fernando Moreno-Pino, Alvaro Arroyo, Harrison Waldon, Xiaowen Dong, and Álvaro Cartea. Rough transformers for continuous and efficient time-series modelling. arXiv preprint arXiv:2403.10288, 2024.
- [12] Antonio Orvieto, Samuel L Smith, Albert Gu, Anushan Fernando, Caglar Gulcehre, Razvan Pascanu, and Soham De. Resurrecting recurrent neural networks for long sequences. In International Conference on Machine Learning, pages 26670–26698. PMLR, 2023.
- [13] Patrick Kidger, James Morrill, James Foster, and Terry Lyons. Neural controlled differential equations for irregular time series. Advances in Neural Information Processing Systems, 33:6696–6707, 2020.
- [14] James Morrill, Cristopher Salvi, Patrick Kidger, and James Foster. Neural rough differential equations for long time series. In International Conference on Machine Learning, pages 7829–7838. PMLR, 2021.
- [15] Benjamin Walker, Andrew D McLeod, Tiexin Qin, Yichuan Cheng, Haoliang Li, and Terry Lyons. Log neural controlled differential equations: The Lie brackets make a difference. arXiv preprint arXiv:2402.18512, 2024.
- [16] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- [17] Edward De Brouwer, Jaak Simm, Adam Arany, and Yves Moreau. GRU-ODE-Bayes: Continuous modeling of sporadically-observed time series. Advances in Neural Information Processing Systems, 32, 2019.
- [18] Marin Biloš, Johanna Sommer, Syama Sundar Rangapuram, Tim Januschowski, and Stephan Günnemann. Neural Flows: Efficient alternative to neural ODEs. In Advances in Neural Information Processing Systems, volume 34, pages 21325–21337. Curran Associates, Inc., 2021.
- [19] Mona Schirmer, Mazin Eltayeb, Stefan Lessmann, and Maja Rudolph. Modeling irregular time series with continuous recurrent units. In International Conference on Machine Learning, pages 19388–19405. PMLR, 2022.
- [20] Randolf Scholz, Stefan Born, Nghia Duong-Trung, Mariano Nicolas Cruz-Bournazou, and Lars Schmidt-Thieme. Latent linear ODEs with neural Kalman filtering for irregular time series forecasting. 2023.
- [21] Vijaya Krishna Yalavarthi, Kiran Madhusudhanan, Randolf Scholz, Nourhan Ahmed, Johannes Burchert, Shayan Jawed, Stefan Born, and Lars Schmidt-Thieme. GraFITi: Graphs for forecasting irregularly sampled time series. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 16255–16263, 2024.
- [22] Anthony Bagnall, Hoang Anh Dau, Jason Lines, Michael Flynn, James Large, Aaron Bostrom, Paul Southam, and Eamonn Keogh. The UEA multivariate time series classification archive, 2018. arXiv preprint arXiv:1811.00075, 2018.
- [23] Christian Klötergens, Vijaya Krishna Yalavarthi, Randolf Scholz, Maximilian Stubbemann, Stefan Born, and Lars Schmidt-Thieme. Physiome-ODE: A benchmark for irregularly sampled multivariate time series forecasting based on biological ODEs. arXiv preprint arXiv:2502.07489, 2025.
- [24] Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396, 2021.
- [25] Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, and Yoon Kim. Gated linear attention transformers with hardware-efficient training. arXiv preprint arXiv:2312.06635, 2023.
- [26] Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. Retentive network: A successor to Transformer for large language models. arXiv preprint arXiv:2307.08621, 2023.
- [27] Yuqi Chen, Kan Ren, Yansen Wang, Yuchen Fang, Weiwei Sun, and Dongsheng Li. ContiFormer: Continuous-time Transformer for irregular time series modeling. Advances in Neural Information Processing Systems, 36:47143–47175, 2023.
- [28] Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. Advances in Neural Information Processing Systems, 31, 2018.