arxiv: 2601.05527 · v2 · pith:GIZQBMO4new · submitted 2026-01-09 · 💻 cs.LG · cs.AI

DeMa: Dual-Path Delay-Aware Mamba for Efficient Multivariate Time Series Analysis

Pith reviewed 2026-05-21 16:17 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords Multivariate Time Series · Mamba Architecture · Dual-Path Modeling · Delay-Aware Attention · Time Series Forecasting · Anomaly Detection · Data Imputation · Series Classification

The pith

DeMa splits multivariate time series into separate temporal and variate paths using modified Mamba modules to model delays and interactions at linear cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DeMa as a backbone that decomposes each multivariate series into one path for long-range dynamics inside each individual variable and a second path for cross-variable dependencies that include explicit time lags. It keeps the linear scaling of Mamba while adding two specialized modules: Mamba-SSD for independent series processing and Mamba-DALA for delay-aware linear attention across variables. If the separation works, the model should handle long sequences in forecasting, imputation, anomaly detection, and classification without the quadratic cost of attention-based methods. Experiments across five tasks are presented as evidence that both accuracy and speed improve over prior approaches.

Core claim

DeMa preserves Mamba's linear-complexity advantage while substantially improving its suitability for MTS settings by decomposing the input into intra-series temporal dynamics captured by a Mamba-SSD module and inter-series interactions captured by a Mamba-DALA module that integrates delay-aware linear attention.

What carries the argument

Dual-path decomposition consisting of a temporal path (Mamba-SSD for series-independent long-range dynamics) and a variate path (Mamba-DALA for delay-aware cross-variate dependencies).
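
The data flow of such a two-path split can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's modules: `temporal_path` stands in for Mamba-SSD with a fixed exponential recurrence, `variate_path` stands in for Mamba-DALA with a plain cross-variate mean (no delay modeling), and the weighted-sum fusion with `alpha` and `beta` is an assumed form. All function names are hypothetical.

```python
from typing import List

Series = List[float]


def temporal_path(x: List[Series], decay: float = 0.5) -> List[Series]:
    """Toy stand-in for the temporal path (assumption, not Mamba-SSD).

    Each variate is processed on its own with the recurrence
    h_t = decay * h_{t-1} + (1 - decay) * x_t, so no series reads another.
    """
    out = []
    for series in x:  # one independent pass per variate
        h, row = 0.0, []
        for value in series:
            h = decay * h + (1.0 - decay) * value
            row.append(h)
        out.append(row)
    return out


def variate_path(x: List[Series]) -> List[Series]:
    """Toy stand-in for the variate path (assumption, not Mamba-DALA).

    Every variate receives the mean over all variates at the same time
    step. Delays are deliberately ignored in this placeholder.
    """
    n, t = len(x), len(x[0])
    means = [sum(x[i][s] for i in range(n)) / n for s in range(t)]
    return [list(means) for _ in range(n)]


def dual_path(x: List[Series], alpha: float = 0.5, beta: float = 0.5) -> List[Series]:
    """Fuse the two paths with an assumed weighted sum."""
    temporal = temporal_path(x)
    variate = variate_path(x)
    return [
        [alpha * a + beta * b for a, b in zip(t_row, v_row)]
        for t_row, v_row in zip(temporal, variate)
    ]


x = [[1.0, 1.0], [3.0, 3.0]]  # 2 variates, 2 time steps
fused = dual_path(x)
```

Setting `alpha=1.0, beta=0.0` reduces the output to the temporal path alone, which is one simple way to ablate a path when probing whether the two are complementary.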

If this is right

  • DeMa reaches state-of-the-art accuracy on long-term and short-term forecasting while using less compute.
  • The same architecture improves data imputation, anomaly detection, and series classification over prior models.
  • Linear complexity is retained, so the method scales to longer sequences than quadratic attention allows.
  • Series-independent parallel computation in the temporal path reduces overhead compared with fully coupled models.
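
The scaling argument behind the linear-complexity bullet can be illustrated with generic kernelized linear attention. The sketch below is not DeMa's formulation; it drops softmax and normalization and only shows that reordering the matrix products, (QKᵀ)V versus Q(KᵀV), gives the same result while the multiplication count changes from quadratic to linear in the token length. The helper names and the `multiply_counts` function are hypothetical.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a):
    return [list(col) for col in zip(*a)]


def quadratic_attention(q, k, v):
    # (Q K^T) V: materializes an L x L score matrix first.
    return matmul(matmul(q, transpose(k)), v)


def linear_attention(q, k, v):
    # Q (K^T V): materializes only a d x d summary of keys and values.
    return matmul(q, matmul(transpose(k), v))


def multiply_counts(length, dim):
    """Scalar multiplications for each ordering, with value width == dim."""
    quadratic = 2 * length * length * dim
    linear = 2 * length * dim * dim
    return quadratic, linear


# Integer inputs keep the comparison exact.
q = [[1, 2], [3, 4], [5, 6]]
k = [[1, 0], [0, 1], [1, 1]]
v = [[1, 2], [3, 4], [5, 6]]
same = quadratic_attention(q, k, v) == linear_attention(q, k, v)
counts = multiply_counts(1000, 16)
```

With 1000 tokens and width 16, the two orderings differ by a factor of length / dim in multiplications, which is where the advantage on long sequences comes from.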

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same split could be tested on other linear-time sequence models to see whether the delay-aware component transfers beyond Mamba.
  • If the dual paths reduce memory usage in practice, the approach may enable deployment on edge devices for real-time multivariate monitoring.
  • Extending the delay modeling to non-stationary or irregularly sampled series would be a direct next measurement of robustness.

Load-bearing premise

The claim that explicitly separating intra-series dynamics from inter-series interactions with added delay modeling fully resolves the three listed limitations of vanilla Mamba without leaving modeling gaps.

What would settle it

A controlled comparison on the same five benchmark tasks where a plain Mamba or a standard Transformer reaches equal or better accuracy and wall-clock time would falsify the necessity of the dual-path design.
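
A comparison of this kind needs every model to see identical inputs and to be scored on both error and wall-clock time. The harness below is a minimal sketch of that setup; `evaluate`, `last_value_model`, and `mean_model` are hypothetical stand-ins, and the two toy forecasters are placeholders rather than DeMa, Mamba, or a Transformer.

```python
import time


def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def evaluate(model, inputs, targets, repeats=3):
    """Return (mean MSE, best wall-clock seconds) on fixed data."""
    best = float("inf")
    preds = []
    for _ in range(repeats):
        start = time.perf_counter()
        preds = [model(x) for x in inputs]
        best = min(best, time.perf_counter() - start)
    error = sum(mse(p, t) for p, t in zip(preds, targets)) / len(targets)
    return error, best


# Hypothetical baselines used only to exercise the harness.
def last_value_model(history, horizon=2):
    return [history[-1]] * horizon


def mean_model(history, horizon=2):
    return [sum(history) / len(history)] * horizon


inputs = [[1.0, 2.0, 3.0]]
targets = [[3.0, 3.0]]
results = {
    "last_value": evaluate(last_value_model, inputs, targets),
    "mean": evaluate(mean_model, inputs, targets),
}
```

Taking the best of several timed runs reduces noise from warm-up effects. A real study would also fix seeds, hardware, and batch sizes across models.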

Figures

Figures reproduced from arXiv: 2601.05527 by Haohao Qu, Qing Li, Rui An, Wenqi Fan, Xuequn Shang.

Figure 1: Dependency modeling strategies and computational complexity of representative MTS architectures. (a) Tokenization and induced dependency patterns (variate-mixing, variate-independent, variate-dependent). (b) Complexity comparison of representative models. Here, T is the lookback length, N the number of variates, L the token length, and d the embedding dimension; in typical long-horizon settings, T > L ≫ N,…
Figure 2: The overall framework of DeMa. The proposed DeMa (Left) comprises three key com…
Figure 3: Performance comparison on classification and anomaly detection tasks. Results are averaged…
Figure 4: Model performance comparison (left) and efficiency comparison (right). DeMa achieves…
Figure 5: Efficiency analysis of GPU memory and running time in a long-term lookback-window…
Figure 6: Sensitivity of Fusion Weights α and β Across Tasks. Across all tasks, DeMa is most reliable when both paths remain active (i.e., neither α nor β is overly small), confirming that temporal modeling and cross-variate interaction are complementary rather than substitutable. Forecasting is relatively less sensitive to α and β (…
Original abstract

Accurate and efficient multivariate time series (MTS) analysis is increasingly critical for a wide range of intelligent applications. Within this realm, Transformers have emerged as the predominant architecture due to their strong ability to capture pairwise dependencies. However, Transformer-based models suffer from quadratic computational complexity and high memory overhead, limiting their scalability and practical deployment in long-term and large-scale MTS modeling. Recently, Mamba has emerged as a promising linear-time alternative with high expressiveness. Nevertheless, directly applying vanilla Mamba to MTS remains suboptimal due to three key limitations: (i) the lack of explicit cross-variate modeling, (ii) difficulty in disentangling the entangled intra-series temporal dynamics and inter-series interactions, and (iii) insufficient modeling of latent time-lag interaction effects. These issues constrain its effectiveness across diverse MTS tasks. To address these challenges, we propose DeMa, a dual-path delay-aware Mamba backbone. DeMa preserves Mamba's linear-complexity advantage while substantially improving its suitability for MTS settings. Specifically, DeMa introduces three key innovations: (i) it decomposes the MTS into intra-series temporal dynamics and inter-series interactions; (ii) it develops a temporal path with a Mamba-SSD module to capture long-range dynamics within each individual series, enabling series-independent, parallel computation; and (iii) it designs a variate path with a Mamba-DALA module that integrates delay-aware linear attention to model cross-variate dependencies. Extensive experiments on five representative tasks, long- and short-term forecasting, data imputation, anomaly detection, and series classification, demonstrate that DeMa achieves state-of-the-art performance while delivering remarkable computational efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes DeMa, a dual-path delay-aware Mamba backbone for multivariate time series analysis. It decomposes MTS data into intra-series temporal dynamics (via a Mamba-SSD module in the temporal path) and inter-series interactions (via a Mamba-DALA module with delay-aware linear attention in the variate path). The work claims this addresses three limitations of vanilla Mamba (lack of explicit cross-variate modeling, entangled dynamics, and insufficient modeling of latent time-lag effects) while retaining linear complexity and outperforming Transformer-based models. Extensive experiments across five tasks (long- and short-term forecasting, data imputation, anomaly detection, and series classification) are reported to demonstrate state-of-the-art performance and computational efficiency.

Significance. If the central claims hold, the contribution would be significant for efficient MTS modeling. The dual-path design and explicit handling of time-lag effects via linear mechanisms could provide a scalable alternative to quadratic Transformer architectures, with potential impact on long-sequence applications. The emphasis on disentangling intra- and inter-series components while preserving Mamba's efficiency is a clear strength, particularly if the delay-aware component is shown to be both effective and complexity-preserving.

major comments (2)
  1. [Abstract] The central claim that Mamba-DALA 'integrates delay-aware linear attention' to model cross-variate dependencies and latent time-lag interaction effects lacks any supporting equation, state-transition modification, kernel definition, or pseudocode. This is load-bearing because the efficiency advantage and disentanglement rest on the mechanism preserving linear complexity for arbitrary or variable lags; without the concrete formulation it is impossible to rule out fallback to dense attention or restriction to a fixed small lag set.
  2. [§3 (Architecture description)] The description of the variate path (Mamba-DALA) does not specify how delays are injected (e.g., via modified selective state transitions, lag-specific kernels, or adjusted scanning) while keeping overall complexity linear. If the implementation either reintroduces quadratic terms for long lags or uses a small fixed lag set, the claimed modeling of entangled inter-series dynamics would be incomplete, directly undermining the SOTA and efficiency results on the five tasks.
minor comments (2)
  1. [Abstract] The abstract states 'extensive experiments' demonstrate SOTA results but does not reference specific tables, metrics (e.g., MSE, MAE), baselines, or statistical tests; these details should be summarized with pointers to the relevant result sections or tables.
  2. Notation for the two paths (temporal vs. variate) and the modules (Mamba-SSD, Mamba-DALA) should be introduced with consistent symbols or diagrams early in the architecture section to improve readability.
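
The referee's worry about a fixed small lag set can be made concrete with a classical technique: estimating a delay between two series by cross-correlation over a bounded set of candidate lags. The sketch below is not the paper's Mamba-DALA mechanism, which the text above does not specify. It only illustrates the trade-off: the cost grows with the number of candidate lags, and a true delay outside the candidate set is missed. `lagged_score` and `estimate_lag` are hypothetical names.

```python
def lagged_score(leader, follower, lag):
    """Mean product of leader[t - lag] and follower[t] over valid t."""
    pairs = [(leader[t - lag], follower[t]) for t in range(lag, len(follower))]
    return sum(a * b for a, b in pairs) / len(pairs)


def estimate_lag(leader, follower, max_lag):
    """Pick the lag in [0, max_lag] with the highest score.

    Cost is O(T * max_lag): linear in length only while max_lag is a
    small constant, which is exactly the restriction at issue.
    """
    return max(
        range(max_lag + 1),
        key=lambda lag: lagged_score(leader, follower, lag),
    )


leader = [0.0, 1.0, 0.0, 0.0, -2.0, 0.0, 3.0, 0.0, 0.0, -1.0, 0.0, 0.0]
follower = [0.0, 0.0] + leader[:-2]  # follower repeats leader 2 steps later

found = estimate_lag(leader, follower, max_lag=4)   # candidate set covers the true delay
missed = estimate_lag(leader, follower, max_lag=1)  # candidate set is too small
```

Here the true delay is 2 steps. It is recovered when the candidate set reaches lag 4 and missed when the set stops at lag 1, so any claim of handling arbitrary lags needs a mechanism that does not depend on such a bound.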

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We have revised the manuscript to address the concerns about the clarity of the Mamba-DALA formulation by adding explicit equations, a complexity analysis, and pseudocode.

Point-by-point responses
  1. Referee: [Abstract] The central claim that Mamba-DALA 'integrates delay-aware linear attention' to model cross-variate dependencies and latent time-lag interaction effects lacks any supporting equation, state-transition modification, kernel definition, or pseudocode. This is load-bearing because the efficiency advantage and disentanglement rest on the mechanism preserving linear complexity for arbitrary or variable lags; without the concrete formulation it is impossible to rule out fallback to dense attention or restriction to a fixed small lag set.

    Authors: We thank the referee for this observation. The abstract is intentionally high-level; the concrete formulation appears in Section 3.2, where delay-aware linear attention is realized by modifying the selective state transitions with lag-specific parameters inside the linear kernel. To improve clarity we have inserted the defining equations for the modified state update and the lag kernel, together with a short complexity argument showing the scan remains strictly linear in sequence length for arbitrary lags. Pseudocode is now also provided in the appendix. revision: yes

  2. Referee: [§3 (Architecture description)] The description of the variate path (Mamba-DALA) does not specify how delays are injected (e.g., via modified selective state transitions, lag-specific kernels, or adjusted scanning) while keeping overall complexity linear. If the implementation either reintroduces quadratic terms for long lags or uses a small fixed lag set, the claimed modeling of entangled inter-series dynamics would be incomplete, directly undermining the SOTA and efficiency results on the five tasks.

    Authors: We appreciate the referee drawing attention to this potential ambiguity. In the original text, delays are injected by lag-specific kernels that adjust the selective parameters of the Mamba scan; the overall procedure stays linear because the attention is computed via a single selective state-space pass rather than pairwise operations. Nevertheless, we agree the exposition can be tightened. The revised Section 3 now contains an explicit step-by-step derivation of the delay injection, the corresponding kernel definition, and a formal complexity proof confirming O(N) scaling even for variable or long lags. These additions directly support the reported efficiency and performance claims. revision: yes
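
The contrast between a single state-space pass and pairwise operations can be checked on a scalar toy recurrence. The sketch below is a generic illustration, not the authors' derivation: `scan` runs h_t = a_t · h_{t-1} + b_t · x_t once over the sequence, while `pairwise` computes the same outputs by summing an explicit weight for every (source, target) pair. Both function names and the scalar setting are assumptions made for clarity.

```python
def scan(a, b, x):
    """Single pass, O(T): h_t = a[t] * h_{t-1} + b[t] * x[t], h_{-1} = 0."""
    h, out = 0.0, []
    for a_t, b_t, x_t in zip(a, b, x):
        h = a_t * h + b_t * x_t
        out.append(h)
    return out


def pairwise(a, b, x):
    """Same outputs from explicit weights over all (s, t) pairs with s <= t.

    The weight of source s on target t is a[s+1] * ... * a[t]. Visiting
    every pair is at least quadratic in T (this naive version also
    recomputes each product from scratch).
    """
    out = []
    for t in range(len(x)):
        total = 0.0
        for s in range(t + 1):
            weight = 1.0
            for r in range(s + 1, t + 1):
                weight *= a[r]
            total += weight * b[s] * x[s]
        out.append(total)
    return out


# Per-step coefficients may differ, as in an input-dependent scan.
a = [0.5, 0.5, 2.0]
b = [1.0, 2.0, 1.0]
x = [4.0, 2.0, 1.0]
agree = scan(a, b, x) == pairwise(a, b, x)
```

The two forms agree on this input. A complexity argument for delay injection would need to show that the lag-specific terms can be folded into the per-step coefficients of the single pass rather than into the pairwise weights.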

Circularity Check

0 steps flagged

No significant circularity: DeMa defines new modules and validates empirically on external tasks

Full rationale

The paper proposes DeMa as a dual-path architecture that decomposes multivariate time series into intra-series temporal dynamics (via Mamba-SSD) and inter-series interactions (via Mamba-DALA with delay-aware linear attention). These components are introduced as explicit innovations to address stated limitations of vanilla Mamba, with the overall model evaluated through experiments on five standard external tasks. No equations, predictions, or central claims reduce by construction to fitted parameters, self-citations, or renamed inputs; the derivation remains a sequence of architectural definitions followed by independent empirical results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

Based solely on the abstract, the central claim rests on the introduction of two new modules whose internal mechanisms and any associated parameters are not detailed; no standard mathematical axioms or external benchmarks are invoked in the summary.

invented entities (2)
  • Mamba-SSD module (no independent evidence)
    purpose: Capture long-range dynamics within each individual series for series-independent parallel computation
    New component introduced in the temporal path of DeMa.
  • Mamba-DALA module (no independent evidence)
    purpose: Integrate delay-aware linear attention to model cross-variate dependencies and time-lag effects
    New component introduced in the variate path of DeMa.

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · 7 internal anchors

  1. [1]

    Deep learning for time series forecasting: a survey.International Journal of Machine Learning and Cybernetics, pages 1–34, 2025

    Xiangjie Kong, Zhenghao Chen, Weiyao Liu, Kaili Ning, Lechao Zhang, Syauqie Muhammad Marier, Yichen Liu, Yuhao Chen, and Feng Xia. Deep learning for time series forecasting: a survey.International Journal of Machine Learning and Cybernetics, pages 1–34, 2025

  2. [2]

    Guangyu Huo, Yong Zhang, Boyue Wang, Junbin Gao, Yongli Hu, and Baocai Yin. Hierarchical spatio– temporal graph convolutional networks and transformer network for traffic flow forecasting.IEEE Transactions on Intelligent Transportation Systems, 24(4):3855–3867, 2023

  3. [3]

    Damba-st: Domain-adaptive mamba for efficient urban spatio-temporal prediction.arXiv preprint arXiv:2506.18939, 2025

    Rui An, Yifeng Zhang, Ziran Liang, Wenqi Fan, Yuxuan Liang, Xuequn Shang, and Qing Li. Damba-st: Domain-adaptive mamba for efficient urban spatio-temporal prediction.arXiv preprint arXiv:2506.18939, 2025

  4. [4]

    Lara: A light and anti-overfitting retraining approach for unsupervised time series anomaly detection

    Feiyi Chen, Zhen Qin, Mengchu Zhou, Yingying Zhang, Shuiguang Deng, Lunting Fan, Guansong Pang, and Qingsong Wen. Lara: A light and anti-overfitting retraining approach for unsupervised time series anomaly detection. InProceedings of the ACM on Web Conference 2024, pages 4138–4149, 2024

  5. [5]

    A review on outlier/anomaly detection in time series data.ACM computing surveys (CSUR), 54(3):1–33, 2021

    Ane Bl´azquez-Garc´ıa, Angel Conde, Usue Mori, and Jose A Lozano. A review on outlier/anomaly detection in time series data.ACM computing surveys (CSUR), 54(3):1–33, 2021

  6. [6]

    Imaging and fusing time series for wearable sensor-based human activity recognition.Information Fusion, 53:80–87, 2020

    Zhen Qin, Yibo Zhang, Shuyu Meng, Zhiguang Qin, and Kim-Kwang Raymond Choo. Imaging and fusing time series for wearable sensor-based human activity recognition.Information Fusion, 53:80–87, 2020

  7. [7]

    Attention is all you need.Advances in Neural Information Processing Systems, 2017

    A Vaswani. Attention is all you need.Advances in Neural Information Processing Systems, 2017

  8. [8]

    A time series is worth 64 words: Long-term forecasting with transformers

    Yuqi Nie, Nam H Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. A time series is worth 64 words: Long-term forecasting with transformers. InThe Eleventh International Conference on Learning Representations, 2023

  9. [9]

    Timer: Generative pre-trained transformers are large time series models

    Yong Liu, Haoran Zhang, Chenyu Li, Xiangdong Huang, Jianmin Wang, and Mingsheng Long. Timer: Generative pre-trained transformers are large time series models. InForty-first International Conference on Machine Learning, 2024

  10. [10]

    Unified training of universal time series forecasting transformers

    Gerald Woo, Chenghao Liu, Akshat Kumar, Caiming Xiong, Silvio Savarese, and Doyen Sahoo. Unified training of universal time series forecasting transformers. InForty-first International Conference on Machine Learning, 2024

  11. [11]

    itrans- former: Inverted transformers are effective for time series forecasting

    Yong Liu, Tengge Hu, Haoran Zhang, Haixu Wu, Shiyu Wang, Lintao Ma, and Mingsheng Long. itrans- former: Inverted transformers are effective for time series forecasting. InThe Twelfth International Conference on Learning Representations, 2024

  12. [12]

    Timesnet: Temporal 2d-variation modeling for general time series analysis

    Haixu Wu, Tengge Hu, Yong Liu, Hang Zhou, Jianmin Wang, and Mingsheng Long. Timesnet: Temporal 2d-variation modeling for general time series analysis. InThe Eleventh International Conference on Learning Representations, 2023

  13. [13]

    itfkan: Interpretable time series forecasting with kolmogorov-arnold network.arXiv preprint arXiv:2504.16432, 2025

    Ziran Liang, Rui An, Wenqi Fan, Yanghui Rao, and Yuxuan Liang. itfkan: Interpretable time series forecasting with kolmogorov-arnold network.arXiv preprint arXiv:2504.16432, 2025

  14. [14]

    Recurrent neural networks for time series forecasting: Current status and future directions.International Journal of Forecasting, 37(1):388–427, 2021

    Hansika Hewamalage, Christoph Bergmeir, and Kasun Bandara. Recurrent neural networks for time series forecasting: Current status and future directions.International Journal of Forecasting, 37(1):388–427, 2021. 22

  15. [15]

    Recurrent neural networks for time series classification.Neurocomputing, 50:223–235, 2003

    Michael H¨usken and Peter Stagge. Recurrent neural networks for time series classification.Neurocomputing, 50:223–235, 2003

  16. [16]

    Scinet: Time series modeling and forecasting with sample convolution and interaction.Advances in Neural Information Processing Systems, 35:5816–5828, 2022

    Minhao Liu, Ailing Zeng, Muxi Chen, Zhijian Xu, Qiuxia Lai, Lingna Ma, and Qiang Xu. Scinet: Time series modeling and forecasting with sample convolution and interaction.Advances in Neural Information Processing Systems, 35:5816–5828, 2022

  17. [17]

    Moderntcn: A modern pure convolution structure for general time series analysis

    Donghao Luo and Xue Wang. Moderntcn: A modern pure convolution structure for general time series analysis. InICLR, 2024

  18. [18]

    Are transformers effective for time series forecasting? InProceedings of the AAAI conference on artificial intelligence, volume 37, pages 11121–11128, 2023

    Ailing Zeng, Muxi Chen, Lei Zhang, and Qiang Xu. Are transformers effective for time series forecasting? InProceedings of the AAAI conference on artificial intelligence, volume 37, pages 11121–11128, 2023

  19. [19]

    Timemixer: Decomposable multiscale mixing for time series forecasting

    Shiyu Wang, Haixu Wu, Xiaoming Shi, Tengge Hu, Huakun Luo, Lintao Ma, James Y Zhang, and JUN ZHOU. Timemixer: Decomposable multiscale mixing for time series forecasting. InInternational Conference on Learning Representations (ICLR), 2024

  20. [20]

    Informer: Beyond efficient transformer for long sequence time-series forecasting

    Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. Informer: Beyond efficient transformer for long sequence time-series forecasting. InProceedings of the AAAI conference on artificial intelligence, volume 35, pages 11106–11115, 2021

  21. [21]

    Fedformer: Frequency enhanced decomposed transformer for long-term series forecasting

    Tian Zhou, Ziqing Ma, Qingsong Wen, Xue Wang, Liang Sun, and Rong Jin. Fedformer: Frequency enhanced decomposed transformer for long-term series forecasting. InInternational conference on machine learning, pages 27268–27286. PMLR, 2022

  22. [22]

    Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting.Advances in neural information processing systems, 34:22419–22430, 2021

    Haixu Wu, Jiehui Xu, Jianmin Wang, and Mingsheng Long. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting.Advances in neural information processing systems, 34:22419–22430, 2021

  23. [23]

    Pyraformer: Low-complexity pyramidal attention for long-range time series modeling and forecasting

    Shizhan Liu, Hang Yu, Cong Liao, Jianguo Li, Weiyao Lin, Alex X Liu, and Schahram Dustdar. Pyraformer: Low-complexity pyramidal attention for long-range time series modeling and forecasting. InInternational conference on learning representations, 2021

  24. [24]

    Crossformer: Transformer utilizing cross-dimension dependency for multivariate time series forecasting

    Yunhao Zhang and Junchi Yan. Crossformer: Transformer utilizing cross-dimension dependency for multivariate time series forecasting. InThe eleventh international conference on learning representations, 2023

  25. [25]

    Tsmixer: Lightweight mlp-mixer model for multivariate time series forecasting

    Vijay Ekambaram, Arindam Jati, Nam Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. Tsmixer: Lightweight mlp-mixer model for multivariate time series forecasting. InProceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 459–469, 2023

  26. [26]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces.arXiv preprint arXiv:2312.00752, 2023

  27. [27]

    Transformers are ssms: generalized models and efficient algorithms through structured state space duality

    Tri Dao and Albert Gu. Transformers are ssms: generalized models and efficient algorithms through structured state space duality. InProceedings of the 41st International Conference on Machine Learning, pages 10041–10071, 2024

  28. [28]

    A Survey of Mamba

    Haohao Qu, Liangbo Ning, Rui An, Wenqi Fan, Tyler Derr, Xin Xu, and Qing Li. A survey of mamba. arXiv preprint arXiv:2408.01129, 2024

  29. [29]

    Jamba: A Hybrid Transformer-Mamba Language Model

    Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, et al. Jamba: A hybrid transformer-mamba language model.arXiv preprint arXiv:2403.19887, 2024

  30. [30]

    Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model

    Lianghui Zhu, Bencheng Liao, Qian Zhang, Xinlong Wang, Wenyu Liu, and Xinggang Wang. Vision mamba: Efficient visual representation learning with bidirectional state space model.arXiv preprint arXiv:2401.09417, 2024

  31. [31]

    Caduceus: Bi-directional equivariant long-range dna sequence modeling

    Yair Schiff, Chia Hsiang Kao, Aaron Gokaslan, Tri Dao, Albert Gu, and V olodymyr Kuleshov. Caduceus: Bi-directional equivariant long-range dna sequence modeling. InInternational Conference on Machine Learning, pages 43632–43648. PMLR, 2024

  32. [32]

    Ssd4rec: A structured state space duality model for efficient sequential recommendation.arXiv preprint arXiv:2409.01192, 2024

    Haohao Qu, Yifeng Zhang, Liangbo Ning, Wenqi Fan, and Qing Li. Ssd4rec: A structured state space duality model for efficient sequential recommendation.arXiv preprint arXiv:2409.01192, 2024

  33. [33]

    Deep Time Series Models: A Comprehensive Survey and Benchmark

    Yuxuan Wang, Haixu Wu, Jiaxiang Dong, Yong Liu, Mingsheng Long, and Jianmin Wang. Deep time series models: A comprehensive survey and benchmark.arXiv preprint arXiv:2407.13278, 2024. 23

  34. [34]

    Rethinking channel dependence for multivariate time series forecasting: Learning from leading indicators

    Lifan Zhao and Yanyan Shen. Rethinking channel dependence for multivariate time series forecasting: Learning from leading indicators. InThe Twelfth International Conference on Learning Representations, 2024

  35. [35]

    Unveiling delay effects in traffic forecasting: A perspective from spatial-temporal delay differential equations

    Qingqing Long, Zheng Fang, Chen Fang, Chong Chen, Pengfei Wang, and Yuanchun Zhou. Unveiling delay effects in traffic forecasting: A perspective from spatial-temporal delay differential equations. In Proceedings of the ACM on Web Conference 2024, pages 1035–1044, 2024

  36. [36]

    Pdformer: Propagation delay-aware dynamic long-range transformer for traffic flow prediction

    Jiawei Jiang, Chengkai Han, Wayne Xin Zhao, and Jingyuan Wang. Pdformer: Propagation delay-aware dynamic long-range transformer for traffic flow prediction. InProceedings of the AAAI conference on artificial intelligence, volume 37, pages 4365–4373, 2023

  37. [37]

    Hippo: Recurrent memory with optimal polynomial projections.Advances in neural information processing systems, 33:1474–1487, 2020

    Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, and Christopher R ´e. Hippo: Recurrent memory with optimal polynomial projections.Advances in neural information processing systems, 33:1474–1487, 2020

  38. [38]

    Parallel prefix sum (scan) with cuda.GPU gems, 3(39):851–876, 2007

    Mark Harris, Shubhabrata Sengupta, and John D Owens. Parallel prefix sum (scan) with cuda.GPU gems, 3(39):851–876, 2007

  39. [39]

    Frequency-domain mlps are more effective learners in time series forecasting

    Kun Yi, Qi Zhang, Wei Fan, Shoujin Wang, Pengyang Wang, Hui He, Ning An, Defu Lian, Longbing Cao, and Zhendong Niu. Frequency-domain mlps are more effective learners in time series forecasting. Advances in Neural Information Processing Systems, 36:76656–76679, 2023

  40. [40]

    Koopa: Learning non-stationary time series dynamics with koopman predictors.Advances in neural information processing systems, 36:12271–12290, 2023

    Yong Liu, Chenyu Li, Jianmin Wang, and Mingsheng Long. Koopa: Learning non-stationary time series dynamics with koopman predictors.Advances in neural information processing systems, 36:12271–12290, 2023

  41. [41]

    Ts2vec: Towards universal representation of time series

    Zhihan Yue, Yujing Wang, Juanyong Duan, Tianmeng Yang, Congrui Huang, Yunhai Tong, and Bixiong Xu. Ts2vec: Towards universal representation of time series. InProceedings of the AAAI conference on artificial intelligence, volume 36, pages 8980–8987, 2022

  42. [42]

    How to train your hippo: State space models with generalized orthogonal basis projections

    Albert Gu, Isys Johnson, Aman Timalsina, Atri Rudra, and Christopher R´e. How to train your hippo: State space models with generalized orthogonal basis projections.arXiv preprint arXiv:2206.12037, 2022

  43. [43]

    Reversible instance normalization for accurate time-series forecasting against distribution shift

    Taesung Kim, Jinhee Kim, Yunwon Tae, Cheonbok Park, Jang-Ho Choi, and Jaegul Choo. Reversible instance normalization for accurate time-series forecasting against distribution shift. InInternational Conference on Learning Representations, 2021

  44. [44]

    Demystify mamba in vision: A linear attention perspective.arXiv preprint arXiv:2405.16605, 2024

    Dongchen Han, Ziyi Wang, Zhuofan Xia, Yizeng Han, Yifan Pu, Chunjiang Ge, Jun Song, Shiji Song, Bo Zheng, and Gao Huang. Demystify mamba in vision: A linear attention perspective.arXiv preprint arXiv:2405.16605, 2024

  45. [45]

    Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

  46. [46]

    Time delay estimation by generalized cross correlation methods.IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):280–285, 1984

    Mordechai Azaria and David Hertz. Time delay estimation by generalized cross correlation methods.IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):280–285, 1984

  47. [47]

    Flatten transformer: Vision transformer using focused linear attention

    Dongchen Han, Xuran Pan, Yizeng Han, Shiji Song, and Gao Huang. Flatten transformer: Vision transformer using focused linear attention. InProceedings of the IEEE/CVF international conference on computer vision, pages 5961–5971, 2023

  48. [48]

    Trindade, ElectricityLoadDiagrams20112014, UCI Machine Learning Repository, DOI: https://doi.org/10.24432/C58C86 (2015)

    Artur Trindade. ElectricityLoadDiagrams20112014. UCI Machine Learning Repository, 2015. DOI: https://doi.org/10.24432/C58C86

  49. [49]

    Modeling long-and short-term temporal patterns with deep neural networks

    Guokun Lai, Wei-Cheng Chang, Yiming Yang, and Hanxiao Liu. Modeling long-and short-term temporal patterns with deep neural networks. InThe 41st international ACM SIGIR conference on research & development in information retrieval, pages 95–104, 2018

  50. [50]

    Freeway performance measurement system: mining loop detector data.Transportation research record, 1748(1):96–102, 2001

    Chao Chen, Karl Petty, Alexander Skabardonis, Pravin Varaiya, and Zhanfeng Jia. Freeway performance measurement system: mining loop detector data.Transportation research record, 1748(1):96–102, 2001

  51. [51]

    Robust anomaly detection for multivariate time series through stochastic recurrent neural network

    Ya Su, Youjian Zhao, Chenhao Niu, Rong Liu, Wei Sun, and Dan Pei. Robust anomaly detection for multivariate time series through stochastic recurrent neural network. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pages 2828–2837, 2019

  52. [52]

    Detecting spacecraft anomalies using lstms and nonparametric dynamic thresholding

    Kyle Hundman, Valentino Constantinou, Christopher Laporte, Ian Colwell, and Tom Soderstrom. Detecting spacecraft anomalies using lstms and nonparametric dynamic thresholding. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, pages 387–395, 2018

  53. [53]

    Swat: A water treatment testbed for research and training on ics security

    Aditya P Mathur and Nils Ole Tippenhauer. Swat: A water treatment testbed for research and training on ics security. In CySWater, 2016

  54. [54]

    Practical approach to asynchronous multivariate time series anomaly detection and localization. KDD, 2021

    Ahmed Abdulaal, Zhuanghua Liu, and Tomer Lancewicki. Practical approach to asynchronous multivariate time series anomaly detection and localization. KDD, 2021

  55. [56]

    Affirm: Interactive mamba with adaptive fourier filters for long-term time series forecasting

    Yuhan Wu, Xiyu Meng, Huajin Hu, Junru Zhang, Yabo Dong, and Dongming Lu. Affirm: Interactive mamba with adaptive fourier filters for long-term time series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 21599–21607, 2025

  56. [57]

    Is mamba effective for time series forecasting? Neurocomputing, 619:129178, 2025

    Zihan Wang, Fanheng Kong, Shi Feng, Ming Wang, Xiaocui Yang, Han Zhao, Daling Wang, and Yifei Zhang. Is mamba effective for time series forecasting? Neurocomputing, 619:129178, 2025

  57. [59]

    Simplified mamba with disentangled dependency encoding for long-term time series forecasting. arXiv preprint arXiv:2408.12068, 2024

    Zixuan Weng, Jindong Han, Wenzhao Jiang, and Hao Liu. Simplified mamba with disentangled dependency encoding for long-term time series forecasting. arXiv preprint arXiv:2408.12068, 2024

  58. [60]

    Revisiting Long-term Time Series Forecasting: An Investigation on Linear Mapping

    Zhe Li, Shiyi Qi, Yiduo Li, and Zenglin Xu. Revisiting long-term time series forecasting: An investigation on linear mapping. arXiv preprint arXiv:2305.10721, 2023

  59. [61]

    Adam: A method for stochastic optimization, 2014

    Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

  60. [62]

    The UEA multivariate time series classification archive, 2018

    Anthony Bagnall, Hoang Anh Dau, Jason Lines, Michael Flynn, James Large, Aaron Bostrom, Paul Southam, and Eamonn Keogh. The UEA multivariate time series classification archive, 2018. arXiv preprint arXiv:1811.00075, 2018

  61. [63]

    Unitime: A language-empowered unified model for cross-domain time series forecasting

    Xu Liu, Junfeng Hu, Yuan Li, Shizhe Diao, Yuxuan Liang, Bryan Hooi, and Roger Zimmermann. Unitime: A language-empowered unified model for cross-domain time series forecasting. In Proceedings of the ACM on Web Conference 2024, pages 4095–4106, 2024

  62. [64]

    Reformer: The efficient transformer

    Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. In International Conference on Learning Representations, 2019

  63. [65]

    Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting

    Shiyang Li, Xiaoyong Jin, Yao Xuan, Xiyou Zhou, Wenhu Chen, Yu-Xiang Wang, and Xifeng Yan. Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting. Advances in neural information processing systems, 32, 2019

  64. [66]

    C-mamba: Channel correlation enhanced state space models for multivariate time series forecasting. arXiv preprint arXiv:2406.05316, 2024

    Chaolv Zeng, Zhanyu Liu, Guanjie Zheng, and Linghe Kong. C-mamba: Channel correlation enhanced state space models for multivariate time series forecasting. arXiv preprint arXiv:2406.05316, 2024

  65. [67]

    Mambamixer: Efficient selective state space models with dual token and channel selection. arXiv preprint arXiv:2403.19888, 2024

    Ali Behrouz, Michele Santacatterina, and Ramin Zabih. Mambamixer: Efficient selective state space models with dual token and channel selection. arXiv preprint arXiv:2403.19888, 2024

  66. [68]

    Mambats: improved selective state space models for long-term time series forecasting

    Xiuding Cai, Yaoyao Zhu, Xueyao Wang, and Yu Yao. Mambats: Improved selective state space models for long-term time series forecasting. arXiv preprint arXiv:2405.16440, 2024