pith. machine review for the scientific record. sign in

arxiv: 2605.09742 · v1 · submitted 2026-05-10 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links

· Lean Theorem

TIDES: Implicit Time-Awareness in Selective State Space Models

Authors on Pith no claims yet

Pith reviewed 2026-05-12 03:02 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords selective state space modelsirregular time seriesdiscretizationMambacontinuous-time modelstime-aware SSMs
0
0 comments X

The pith

TIDES moves input dependence from the discretization step to the diagonal state matrix in selective SSMs, letting the step retain physical meaning for irregular timestamps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that selective state space models lose the ability to treat discretization as a true time interval when they make the step size input-dependent for expressivity. Continuous models keep the physical interval but stay linear and time-invariant. TIDES resolves this by shifting the input dependence onto the diagonal entries of the state matrix instead. This keeps the discretization step tied to actual sampling intervals, so the model processes irregular timestamps directly while retaining per-token selectivity. A new diagnostic benchmark isolates the failure modes this design avoids, and large-scale tests show improved ranks on time-series classification and regression tasks.

Core claim

TIDES is a selective SSM variant that reconciles selective and continuous architectures by moving input-dependence off the step size and onto the diagonal state matrix. As a result, the discretization step retains its physical meaning tied to state discretization, allowing the model to handle irregular timestamps natively without sacrificing the per-token expressivity that makes selective SSMs effective.

What carries the argument

Input-dependent diagonal state matrix, which carries selectivity while leaving the discretization step as a fixed physical interval.

If this is right

  • Sequence models can now process data with arbitrary sampling intervals without separate preprocessing or loss of dynamic selectivity.
  • The physical interpretability of the discretization step enables direct use in domains where time intervals carry physical meaning, such as sensor streams or biological signals.
  • The Fading Flash benchmark provides a compact test that isolates whether a model truly separates input dependence from time discretization.
  • State-of-the-art average rank is achieved on the UEA time-series classification suite and the Physiome-ODE regression benchmark.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same relocation technique could be applied to other continuous-time or hybrid recurrent architectures that currently tie selectivity to the step size.
  • Because the diagonal remains the carrier of dynamics, stability analysis and eigenvalue control may become simpler than when selectivity is entangled with the step.
  • This separation suggests a general design principle: keep time semantics in the integrator and move all data dependence into the system matrix.

Load-bearing premise

That relocating input dependence to the diagonal state matrix preserves the full selective expressivity and stability of prior selective SSMs while maintaining a physically interpretable discretization step.

What would settle it

An experiment where TIDES either matches or exceeds the per-token performance of standard selective SSMs on regular sequences but fails to correctly extrapolate or interpolate on sequences with out-of-distribution irregular time intervals.

Figures

Figures reproduced from arXiv: 2605.09742 by Dirk Mohr, Miguel A. Bessa, Rui Barreira, Taylan Soydan.

Figure 1
Figure 1. Figure 1: Where input-dependence lives in each architecture. S5 keeps all parameters static. Mamba [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Information flow through S5, Mamba, and TIDES architectures. [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: TIDES architecture: from sequence model (a) to TIDES block (b) to the lower-level SSM [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Task setup. 40 detectors split into three rate zones. Sparse flashes (top) produce zone-dependent decaying glows; the same flashes under different ∆ (middle, bottom) yield rescaled dynamics. The model needs to predict the correct decaying glows given the sparse flashes, zone and ∆ values. Before scaling to large benchmarks, we use a controlled toy problem to better illustrate our design argument. The setup… view at source ↗
Figure 5
Figure 5. Figure 5: Left: Effective learned decay vs. test ∆, per zone. S5 and TIDES bake ∆ into the discretization, so the learned decay stays flat across the full test range. MambaS distorts the physically meaningful ∆ through a learned gate, breaking the cancellation and causing drift outside training. Right: Relative error vs. test ∆. TIDES consistently outperforms Mamba and S5 both in- and out-of-distribution. See Append… view at source ↗
Figure 6
Figure 6. Figure 6: Test accuracy vs. rtest on Eigen￾Worms (n=3 seeds; trained at rtrain=0.5). In [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
read the original abstract

Selective state space models (SSMs), such as Mamba, achieve strong per-token expressivity by making the time discretization step $\Tilde{\Delta}$ a learned function of the input. However, in doing so, $\Tilde{\Delta}$ ceases to represent a physical sampling interval, limiting its irregular time series modeling capability. Continuous-time SSMs, such as S5, preserve the physical meaning of $\Tilde{\Delta}$ and handle irregular timestamps natively ($\Tilde{\Delta}\equiv\Delta)$, but their dynamics remain linear time-invariant (LTI), limiting per-token expressivity. We propose \textbf{TIDES}, a selective SSM variant that reconciles selective and continuous architectures by moving input-dependence off the step size and onto the diagonal state matrix. As a result, $\Tilde{\Delta}$ retains its physical meaning, tied to the state discretization, allowing the model to handle irregular timestamps natively without sacrificing the per-token expressivity that makes selective SSMs effective. We show this on a novel \emph{Fading Flash} experimental benchmark, a compact controlled diagnostic for sequence models that jointly tests input-dependence and extrapolation to out-of-distribution $\Delta$ values, and isolates the distinct failure modes of current state-of-the-art architectures that TIDES avoids by construction. On large-scale benchmarks, TIDES sets the new state-of-the-art average rank on UEA time-series classification and the Physiome-ODE regression benchmark. Code available at: https://github.com/TaylanSoydan/TIDES.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

4 major / 1 minor

Summary. The paper proposes TIDES, a selective SSM variant that relocates input dependence from the discretization step size to the diagonal of the state matrix. This is claimed to preserve per-token expressivity while allowing the step size to retain physical meaning, enabling native handling of irregular timestamps. The authors introduce the Fading Flash benchmark as a diagnostic for input dependence and OOD extrapolation, and report new SOTA average rank on UEA time-series classification and the Physiome-ODE regression benchmark.

Significance. If the reconciliation of selective and continuous SSMs holds, the approach could improve modeling of irregular time series in applications such as healthcare and physics without loss of expressivity. The new Fading Flash benchmark and public code release are constructive contributions for controlled evaluation and reproducibility.

major comments (4)
  1. [Abstract / Proposed Method] Abstract and method description: the central reconciliation claim—that input dependence on the diagonal of A with fixed physical Δ yields equivalent per-token expressivity to input-dependent Δ with fixed A—is not supported by any derivation. The discretized transitions exp(Δ · A(x)) and exp(Δ(x) · A) are not shown to span comparable sets of state-transition behaviors, which is load-bearing for the expressivity and stability assertions.
  2. [Method / Stability] No stability analysis is provided for the input-dependent diagonal state matrix. When A(x) varies per token, eigenvalue properties (e.g., negative real parts for continuous-time stability) may not be preserved; this requires explicit bounds or empirical verification to support practical use.
  3. [Experiments / Fading Flash] Fading Flash benchmark: the manuscript states that the benchmark isolates distinct failure modes that TIDES avoids by construction, yet provides no construction details, quantitative metrics, or results tables demonstrating the claimed isolation and avoidance.
  4. [Experiments] Large-scale results: SOTA claims on UEA and Physiome-ODE are stated without error bars, statistical tests, ablation studies, or hyperparameter details, making it impossible to assess whether the performance gains are attributable to the architectural change.
minor comments (1)
  1. [Notation] Notation for the physical versus learned step size (Δ vs. ~Δ) should be introduced and used consistently once the discretization is defined.

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. We address each major comment point-by-point below, indicating where revisions will be made to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract / Proposed Method] Abstract and method description: the central reconciliation claim—that input dependence on the diagonal of A with fixed physical Δ yields equivalent per-token expressivity to input-dependent Δ with fixed A—is not supported by any derivation. The discretized transitions exp(Δ · A(x)) and exp(Δ(x) · A) are not shown to span comparable sets of state-transition behaviors, which is load-bearing for the expressivity and stability assertions.

    Authors: We agree that a formal derivation would strengthen the central claim. While the manuscript motivates the equivalence by noting that input-dependent diagonal A(x) modulates per-token eigenvalues (decay rates) under fixed physical Δ, we did not provide an explicit comparison of the spanned transition sets. In the revision we will add a dedicated subsection with a proof sketch showing that, for diagonal A, the map x → exp(Δ A(x)) can realize a comparable family of per-token scalings and rotations to the input-dependent-Δ case, supported by a low-dimensional analytic example and a brief discussion of the resulting function spaces. revision: yes

  2. Referee: [Method / Stability] No stability analysis is provided for the input-dependent diagonal state matrix. When A(x) varies per token, eigenvalue properties (e.g., negative real parts for continuous-time stability) may not be preserved; this requires explicit bounds or empirical verification to support practical use.

    Authors: We acknowledge the absence of a dedicated stability analysis. In the revised manuscript we will insert a new subsection that (i) derives sufficient conditions on the learned diagonal entries of A(x) to keep real parts negative (via a simple clipping or soft constraint during training), and (ii) reports empirical eigenvalue statistics and training stability metrics across all benchmarks, confirming that the learned A(x) remained stable in practice. revision: yes

  3. Referee: [Experiments / Fading Flash] Fading Flash benchmark: the manuscript states that the benchmark isolates distinct failure modes that TIDES avoids by construction, yet provides no construction details, quantitative metrics, or results tables demonstrating the claimed isolation and avoidance.

    Authors: The current manuscript contains a high-level description of Fading Flash in Section 4 together with qualitative illustrations. To address the referee’s concern we will expand this section with: (a) the precise generative process and parameter ranges used to create the synthetic sequences, (b) the quantitative metrics (accuracy gap on in-distribution vs. OOD Δ, per-model failure rates), and (c) a results table that directly contrasts the failure modes of Mamba, S5, and TIDES on the benchmark. revision: yes

  4. Referee: [Experiments] Large-scale results: SOTA claims on UEA and Physiome-ODE are stated without error bars, statistical tests, ablation studies, or hyperparameter details, making it impossible to assess whether the performance gains are attributable to the architectural change.

    Authors: We agree that the experimental section would be more convincing with these elements. In the revision we will add: error bars (standard deviation over 5 random seeds), paired statistical tests (Wilcoxon signed-rank) against the strongest baselines, a comprehensive ablation table isolating the contribution of the input-dependent diagonal A, and a hyperparameter appendix listing the exact search ranges and final values used for all models. revision: yes

Circularity Check

0 steps flagged

No significant circularity in TIDES architectural proposal or benchmarks

full rationale

The paper proposes TIDES as an architectural variant that relocates input dependence from the discretization step size to the diagonal of the state matrix, claiming this preserves per-token expressivity while retaining physical meaning for Δ. This is presented as a design choice, not a derivation that reduces to its own inputs. The abstract and description contain no equations showing a self-definitional reduction (e.g., no fitted parameter renamed as a prediction or ansatz smuggled via self-citation). The Fading Flash benchmark and large-scale results are empirical validations presented as independent of the proposal itself. No load-bearing self-citations or uniqueness theorems from prior author work are invoked. The central claim of reconciliation has independent content and does not collapse by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities beyond standard SSM components; the model extends existing frameworks without introducing new postulated objects.

pith-pipeline@v0.9.0 · 5582 in / 986 out tokens · 38996 ms · 2026-05-12T03:02:00.298229+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · 5 internal anchors

  1. [1]

    Multi-time attention networks for irregularly sampled time series.arXiv preprint arXiv:2101.10318,

    Satya Narayan Shukla and Benjamin M Marlin. Multi-time attention networks for irregularly sampled time series.arXiv preprint arXiv:2101.10318, 2021

  2. [3]

    Recurrent neural networks for multivariate time series with missing values.Scientific reports, 8(1):6085, 2018

    Zhengping Che, Sanjay Purushotham, Kyunghyun Cho, David Sontag, and Yan Liu. Recurrent neural networks for multivariate time series with missing values.Scientific reports, 8(1):6085, 2018

  3. [4]

    Latent ordinary differential equations for irregularly-sampled time series.Advances in neural information processing systems, 32, 2019

    Yulia Rubanova, Ricky TQ Chen, and David K Duvenaud. Latent ordinary differential equations for irregularly-sampled time series.Advances in neural information processing systems, 32, 2019

  4. [5]

    Simpli- fied state space layers for sequence modeling,

    Jimmy TH Smith, Andrew Warrington, and Scott W Linderman. Simplified state space layers for sequence modeling.arXiv preprint arXiv:2208.04933, 2022

  5. [6]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces.arXiv preprint arXiv:2312.00752, 2023

  6. [7]

    Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

    Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality.arXiv preprint arXiv:2405.21060, 2024

  7. [8]

    HiPPO: Recurrent memory with optimal polynomial projections.Advances in Neural Information Processing Systems, 33:1474–1487, 2020

    Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, and Christopher Ré. HiPPO: Recurrent memory with optimal polynomial projections.Advances in Neural Information Processing Systems, 33:1474–1487, 2020

  8. [9]

    Parallelizing complex scans and reductions.ACM SIGPLAN Notices, 29(6):135–146, 1994

    Allan L Fisher and Anwar M Ghuloum. Parallelizing complex scans and reductions.ACM SIGPLAN Notices, 29(6):135–146, 1994. 10

  9. [10]

    arXiv preprint arXiv:2311.14495 , year=

    Shida Wang and Qianxiao Li. Stablessm: Alleviating the curse of memory in state-space models through stable reparameterization, 2024. URLhttps://arxiv.org/abs/2311.14495

  10. [11]

    Rough transformers for continuous and efficient time-series modelling.arXiv preprint arXiv:2403.10288, 2024

    Fernando Moreno-Pino, Alvaro Arroyo, Harrison Waldon, Xiaowen Dong, and Álvaro Cartea. Rough transformers for continuous and efficient time-series modelling.arXiv preprint arXiv:2403.10288, 2024

  11. [12]

    Resurrecting recurrent neural networks for long sequences

    Antonio Orvieto, Samuel L Smith, Albert Gu, Anushan Fernando, Caglar Gulcehre, Razvan Pascanu, and Soham De. Resurrecting recurrent neural networks for long sequences. InInternational conference on machine learning, pages 26670–26698. PMLR, 2023

  12. [13]

    Neural controlled differential equations for irregular time series.Advances in neural information processing systems, 33:6696–6707, 2020

    Patrick Kidger, James Morrill, James Foster, and Terry Lyons. Neural controlled differential equations for irregular time series.Advances in neural information processing systems, 33:6696–6707, 2020

  13. [14]

    Neural rough differential equations for long time series

    James Morrill, Cristopher Salvi, Patrick Kidger, and James Foster. Neural rough differential equations for long time series. InInternational Conference on Machine Learning, pages 7829–7838. PMLR, 2021

  14. [15]

    Log neu- ral controlled differential equations: The lie brackets make a difference.arXiv preprint arXiv:2402.18512, 2024

    Benjamin Walker, Andrew D McLeod, Tiexin Qin, Yichuan Cheng, Haoliang Li, and Terry Lyons. Log neu- ral controlled differential equations: The lie brackets make a difference.arXiv preprint arXiv:2402.18512, 2024

  15. [16]

    Attention is all you need.Advances in neural information processing systems, 30, 2017

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017

  16. [17]

    Gru-ode-bayes: Continuous modeling of sporadically-observed time series.Advances in neural information processing systems, 32, 2019

    Edward De Brouwer, Jaak Simm, Adam Arany, and Yves Moreau. Gru-ode-bayes: Continuous modeling of sporadically-observed time series.Advances in neural information processing systems, 32, 2019

  17. [18]

    Neural Flows: Efficient Alternative to Neural ODEs

    Marin Biloš, Johanna Sommer, Syama Sundar Rangapuram, Tim Januschowski, and Stephan Günnemann. Neural Flows: Efficient Alternative to Neural ODEs. InAdvances in Neu- ral Information Processing Systems, volume 34, pages 21325–21337. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/hash/ b21f9f98829dea9a48fd8aaddc1f15...

  18. [19]

    Modeling irregular time series with continuous recurrent units

    Mona Schirmer, Mazin Eltayeb, Stefan Lessmann, and Maja Rudolph. Modeling irregular time series with continuous recurrent units. InInternational conference on machine learning, pages 19388–19405. PMLR, 2022

  19. [20]

    Latent linear odes with neural kalman filtering for irregular time series forecasting

    Randolf Scholz, Stefan Born, Nghia Duong-Trung, Mariano Nicolas Cruz-Bournazou, and Lars Schmidt- Thieme. Latent linear odes with neural kalman filtering for irregular time series forecasting. 2023

  20. [21]

    Grafiti: Graphs for forecasting irregularly sampled time series

    Vijaya Krishna Yalavarthi, Kiran Madhusudhanan, Randolf Scholz, Nourhan Ahmed, Johannes Burchert, Shayan Jawed, Stefan Born, and Lars Schmidt-Thieme. Grafiti: Graphs for forecasting irregularly sampled time series. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 16255– 16263, 2024

  21. [22]

    A., Lines, J., Flynn, M., Large, J., Bostrom, A.,

    Anthony Bagnall, Hoang Anh Dau, Jason Lines, Michael Flynn, James Large, Aaron Bostrom, Paul Southam, and Eamonn Keogh. The uea multivariate time series classification archive, 2018.arXiv preprint arXiv:1811.00075, 2018

  22. [23]

    Physiome-ode: A benchmark for irregularly sampled multivariate time series forecasting based on biological odes.arXiv preprint arXiv:2502.07489, 2025

    Christian Klötergens, Vijaya Krishna Yalavarthi, Randolf Scholz, Maximilian Stubbemann, Stefan Born, and Lars Schmidt-Thieme. Physiome-ode: A benchmark for irregularly sampled multivariate time series forecasting based on biological odes.arXiv preprint arXiv:2502.07489, 2025

  23. [24]

    Efficiently Modeling Long Sequences with Structured State Spaces

    Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces.arXiv preprint arXiv:2111.00396, 2021

  24. [25]

    Gated Linear Attention Transformers with Hardware-Efficient Training

    Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, and Yoon Kim. Gated linear attention transformers with hardware-efficient training.arXiv preprint arXiv:2312.06635, 2023

  25. [26]

    Retentive Network: A Successor to Transformer for Large Language Models

    Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. Retentive network: A successor to transformer for large language models.arXiv preprint arXiv:2307.08621, 2023

  26. [27]

    Contiformer: Continuous-time transformer for irregular time series modeling.Advances in Neural Information Processing Systems, 36:47143–47175, 2023

    Yuqi Chen, Kan Ren, Yansen Wang, Yuchen Fang, Weiwei Sun, and Dongsheng Li. Contiformer: Continuous-time transformer for irregular time series modeling.Advances in Neural Information Processing Systems, 36:47143–47175, 2023

  27. [28]

    volatile

    Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations.Advances in neural information processing systems, 31, 2018. 11 A Architecture details Block execution order.Each TIDES block applies the following sequence of operations to its input x∈R B×L×H : z= Dropout GLU Dropout(GELU(SSM(BN(x)))) +x, where...