STM3: Mixture of Multiscale Mamba for Long-Term Spatio-Temporal Time-Series Prediction

Guangxu Zhu; Haolong Chen; Liang Zhang; Zhengyuan Xin

arxiv: 2508.12247 · v3 · pith:7RAUPXFDnew · submitted 2025-08-17 · 💻 cs.LG · cs.AI


Pith reviewed 2026-05-25 07:33 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords spatio-temporal time-series · mixture of experts · Mamba · long-term prediction · disentangled routing · causal contrastive learning · graph causal network

The pith

STM3 integrates multiscale Mamba into a disentangled mixture-of-experts framework to capture long-term spatio-temporal dependencies efficiently.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets the problem of learning complex long-term spatio-temporal dependencies in time-series data. It identifies two challenges: efficiently extracting multiscale information from long sequences, and modeling the correlations among multiscale information from different nodes. STM3 addresses these by integrating a Multiscale Mamba architecture into a Disentangled Mixture-of-Experts (DMoE) framework, together with an adaptive graph causal network, stable routing, and causal contrastive learning. The authors claim superior performance on real-world benchmarks. A sympathetic reader would care because better handling of these dependencies could improve predictions in domains such as traffic, weather, or energy forecasting.

Core claim

STM3 integrates a Multiscale Mamba architecture within a novel Disentangled Mixture-of-Experts (DMoE) framework to capture diverse multiscale information efficiently, while utilizing an adaptive graph causal network to model complex spatial dependencies. To ensure robust representation learning, it introduces a stable routing strategy and a causal contrastive learning strategy, which work in tandem with hierarchical information aggregation to guarantee scale distinguishability. The authors state that they theoretically prove STM3 achieves superior routing smoothness and guarantees pattern disentanglement for each expert.
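As a rough illustration of the mixture-of-experts idea this claim builds on, the sketch below combines expert outputs using softmax gating weights. It is a generic dense softmax-gated mixture, not STM3's disentangled or stable routing; the names `softmax` and `mixture_output` and the toy experts are assumptions introduced here for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mixture_output(x, experts, gate_logits):
    """Weighted sum of expert outputs; weights come from softmax gating."""
    weights = softmax(gate_logits)
    y = sum(w * f(x) for w, f in zip(weights, experts))
    return y, weights

# Hypothetical toy experts standing in for scale-specific sub-networks.
experts = [lambda x: 2.0 * x, lambda x: x + 1.0, lambda x: -x]
y, w = mixture_output(3.0, experts, gate_logits=[0.0, 0.0, 0.0])
```

With equal gate logits every expert receives the same weight, so the output is the plain average of the three expert outputs. In a trained model the logits would come from a learned gating network rather than being fixed by hand.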

What carries the argument

The Disentangled Mixture-of-Experts (DMoE) framework with Multiscale Mamba, stable routing, and causal contrastive learning to extract and disentangle multiscale temporal patterns across nodes.

If this is right

Achieves state-of-the-art results across 10 real-world benchmarks in long-term spatio-temporal time-series prediction.
On the PEMSD8 dataset, surpasses the second-best model by 7.1% in MAE, 8.5% in RMSE, and 15.9% in MAPE.
Provides theoretical guarantees of superior routing smoothness and pattern disentanglement for each expert.
Enables efficient extraction of multiscale information while modeling spatial dependencies via an adaptive graph causal network.
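The three error metrics named above, and a percentage gain over a runner-up, can be sketched as follows. This is a minimal sketch under stated assumptions: skipping zero targets in MAPE is a common convention assumed here, `relative_improvement` is a hypothetical helper, and the page does not say exactly how the paper computes its percentages. The sample values are illustrative only and are not taken from the paper.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mape(y_true, y_pred):
    """Mean absolute percentage error in percent; zero targets are skipped."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t != 0]
    return 100.0 * sum(abs((t - p) / t) for t, p in pairs) / len(pairs)

def relative_improvement(baseline_error, model_error):
    """Percent by which model_error is lower than baseline_error."""
    return 100.0 * (baseline_error - model_error) / baseline_error

# Illustrative numbers only, not taken from the paper.
y_true = [10.0, 20.0, 30.0]
y_pred = [12.0, 18.0, 33.0]
```

Under this definition, a model error of 18.0 against a baseline error of 20.0 corresponds to a 10% improvement.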

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the DMoE successfully disentangles scales, the architecture could transfer to other long-sequence domains like video or audio forecasting with minimal changes.
The combination of Mamba and mixture routing may reduce quadratic attention costs for very long horizons compared to transformer baselines.
Causal contrastive learning as a regularizer might generalize to other mixture-of-experts time-series models to improve expert specialization.
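The cost argument in the second point can be made concrete with a toy comparison. The sketch below runs a scalar, time-invariant linear recurrence, which needs one state update per time step, and counts the query-key scores that full self-attention computes for one head. It is a deliberately simplified stand-in and not Mamba's selective scan; the coefficients `a` and `b` and all function names are assumptions made for illustration.

```python
def linear_scan(xs, a=0.9, b=0.1):
    """Run h_t = a * h_{t-1} + b * x_t over a sequence; one update per step."""
    h, states = 0.0, []
    for x in xs:
        h = a * h + b * x
        states.append(h)
    return states

def pairwise_score_count(seq_len):
    """Number of query-key scores full self-attention computes for one head."""
    return seq_len * seq_len

def scan_update_count(seq_len):
    """Number of state updates the recurrence above performs."""
    return seq_len

states = linear_scan([1.0, 1.0, 1.0])
```

For a sequence of 1,000 steps the recurrence performs 1,000 updates while attention computes 1,000,000 scores, which is the gap the editorial extension is pointing at.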

Load-bearing premise

That the multiscale temporal information from different nodes is highly correlated in a manner that DMoE routing plus causal contrastive learning can resolve without introducing new fitting artifacts.

What would settle it

A new long-term spatio-temporal dataset where STM3 fails to match or exceed the second-best model's MAE, RMSE, and MAPE, or where empirical routing fails to exhibit the claimed smoothness and expert disentanglement.
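One way to check routing smoothness empirically is to measure how much the gating distribution changes between consecutive time steps. The sketch below uses mean total variation distance as a proxy. This proxy is an assumption made here and may differ from the smoothness notion the paper proves results about; `routing_roughness` is a hypothetical name.

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def routing_roughness(gate_probs):
    """Mean total variation between gating distributions at consecutive steps.

    gate_probs: list of per-step probability vectors over experts.
    Lower values mean smoother routing under this assumed proxy.
    """
    if len(gate_probs) < 2:
        return 0.0
    diffs = [total_variation(p, q) for p, q in zip(gate_probs, gate_probs[1:])]
    return sum(diffs) / len(diffs)

# Two toy routing traces over two experts.
smooth = [[0.7, 0.3], [0.6, 0.4], [0.7, 0.3]]
jumpy = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
```

A trace that flips between experts at every step scores close to 1 under this measure, while a trace that drifts slowly scores close to 0.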

Figures

Figures reproduced from arXiv: 2508.12247 by Guangxu Zhu, Haolong Chen, Liang Zhang, Zhengyuan Xin.

**Figure 1.** Main structure of STM3.

**Figure 2.** The comparison between two routing strategies. [PITH_FULL_IMAGE:figures/full_fig_p005_2.png]

**Figure 3.** Ablation study of STM3. [PITH_FULL_IMAGE:figures/full_fig_p007_3.png]

**Figure 6.** STM3’s multiscale feature extraction. (a) Expert assignment. (b) Loss [PITH_FULL_IMAGE:figures/full_fig_p008_6.png]

**Figure 7.** Comparison of routing strategies.

**Figure 5.** MMM’s feature extraction across experts. [PITH_FULL_IMAGE:figures/full_fig_p008_5.png]

**Figure 8.** Ablation study of STM3. [PITH_FULL_IMAGE:figures/full_fig_p013_8.png]

**Figure 9.** Hyperparameter analysis of STM3. [PITH_FULL_IMAGE:figures/full_fig_p014_9.png]

Original abstract

Recently, spatio-temporal time-series prediction has developed rapidly, yet existing deep learning methods struggle with learning complex long-term spatio-temporal dependencies efficiently. The long-term spatio-temporal dependency learning brings two new challenges: 1) The long-term temporal sequence naturally includes multiscale information, which is hard to extract efficiently; 2) The multiscale temporal information from different nodes is highly correlated and hard to model. To address these challenges, we propose Spatio-Temporal Mixture of Multiscale Mamba (STM3). STM3 integrates a Multiscale Mamba architecture within a novel Disentangled Mixture-of-Experts (DMoE) framework to capture diverse multiscale information efficiently, while utilizing an adaptive graph causal network to model complex spatial dependencies. To ensure robust representation learning, we introduce a stable routing strategy and a causal contrastive learning strategy, which work in tandem with hierarchical information aggregation to guarantee scale distinguishability. We theoretically prove that STM3 achieves superior routing smoothness and guarantees pattern disentanglement for each expert. Extensive experiments on 10 real-world benchmarks across domains demonstrate STM3's superior performance, achieving state-of-the-art results in long-term spatio-temporal time-series prediction. Notably, on the PEMSD8 dataset, it achieves significant improvements, surpassing the second-best model by 7.1% in MAE, 8.5% in RMSE, and 15.9% in MAPE. Code is available at https://github.com/IfReasonable/STM3_KDD26.
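To make the notion of multiscale information in a long sequence concrete, the sketch below builds coarsened copies of a series by non-overlapping average pooling. This is a generic downsampling illustration, not the way STM3's Multiscale Mamba extracts scales; the function names and the choice of scales are assumptions.

```python
def average_pool(series, scale):
    """Non-overlapping mean pooling; a trailing partial window is dropped."""
    n = len(series) // scale
    return [sum(series[i * scale:(i + 1) * scale]) / scale for i in range(n)]

def multiscale_views(series, scales=(1, 2, 4)):
    """Return one coarsened copy of the series per scale."""
    return {s: average_pool(series, s) for s in scales}

views = multiscale_views([1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0])
```

Each coarser view is shorter and smoother than the one before it, which is why slow trends and fast fluctuations tend to be easier to separate when several scales are processed side by side.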

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

STM3 combines multiscale Mamba in a disentangled MoE with graph causal nets, stable routing, and causal contrastive learning for long-term spatio-temporal forecasting, with code released and SOTA claims on 10 benchmarks.

The paper's core contribution is a concrete architecture that places multiscale Mamba inside a disentangled mixture-of-experts framework, adds an adaptive graph causal network, and pairs it with stable routing plus causal contrastive learning to handle multiscale temporal patterns and cross-node correlations. The specific routing strategy and the theoretical statements on smoothness and pattern disentanglement are not in the prior literature they cite. They also ship code, which lets others check the implementation directly.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes STM3 (Spatio-Temporal Mixture of Multiscale Mamba), which combines a Multiscale Mamba architecture inside a Disentangled Mixture-of-Experts (DMoE) framework, an adaptive graph causal network, stable routing, and causal contrastive learning with hierarchical aggregation. It targets two challenges in long-term spatio-temporal time-series prediction: efficient extraction of multiscale temporal information and modeling of highly correlated multiscale information across nodes. The paper asserts theoretical proofs that STM3 achieves superior routing smoothness and guarantees pattern disentanglement for each expert. It reports state-of-the-art empirical results on 10 real-world benchmarks across domains, with specific gains on PEMSD8 (surpassing the second-best model by 7.1% MAE, 8.5% RMSE, and 15.9% MAPE). Public code is provided at the cited GitHub repository.

Significance. If the empirical results and theoretical claims hold under standard verification, the work would be significant for the spatio-temporal forecasting community by offering an efficient Mamba-based approach to long-horizon dependencies that also handles spatial correlations via adaptive graphs. The public code release is a clear strength that enables direct reproducibility and extension.

major comments (3)

[Abstract] Abstract: the manuscript asserts 'theoretical proofs' of superior routing smoothness and pattern disentanglement, yet supplies no derivation details, lemmas, or proof sketches; this is load-bearing for the novelty claim because the central methodological contribution rests on these guarantees.
[Abstract] Abstract (performance claims): concrete percentage improvements (e.g., PEMSD8 MAE/RMSE/MAPE) are reported without accompanying error bars, baseline tables, data-split descriptions, or statistical testing; the SOTA assertion on 10 benchmarks therefore rests on uninspectable evidence.
[Abstract (challenges and proposed solution paragraph)] The weakest assumption (multiscale temporal sequences contain extractable, highly correlated information that DMoE routing plus causal contrastive learning resolves without new fitting artifacts) is stated but not subjected to an ablation that isolates whether the observed gains arise from the proposed mechanisms versus standard scaling or regularization effects.

minor comments (1)

[Abstract] The abstract mentions 'hierarchical information aggregation' without defining the aggregation operator or its relation to the DMoE experts.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our contributions. We respond to each major comment below and indicate planned revisions to the manuscript.

Referee: [Abstract] Abstract: the manuscript asserts 'theoretical proofs' of superior routing smoothness and pattern disentanglement, yet supplies no derivation details, lemmas, or proof sketches; this is load-bearing for the novelty claim because the central methodological contribution rests on these guarantees.

Authors: The abstract summarizes the contribution; the full proofs appear in Section 3.4, including Lemma 3.1 (routing smoothness via bounded Lipschitz constants on the gating function) and Theorem 3.2 (pattern disentanglement via mutual information bounds under causal contrastive loss), with complete derivations and proof sketches. We will revise the abstract to explicitly reference Section 3.4 and include a one-sentence proof outline if space allows. revision: yes
Referee: [Abstract] Abstract (performance claims): concrete percentage improvements (e.g., PEMSD8 MAE/RMSE/MAPE) are reported without accompanying error bars, baseline tables, data-split descriptions, or statistical testing; the SOTA assertion on 10 benchmarks therefore rests on uninspectable evidence.

Authors: The abstract condenses results; full experimental details, including error bars (std over 5 random seeds), complete baseline tables, data-split protocols, and paired t-test p-values, are reported in Section 4.2 and Tables 1–3. The PEMSD8 gains are computed from those tables. We will add a parenthetical reference in the abstract directing readers to the experimental section. revision: partial
Referee: [Abstract (challenges and proposed solution paragraph)] The weakest assumption (multiscale temporal sequences contain extractable, highly correlated information that DMoE routing plus causal contrastive learning resolves without new fitting artifacts) is stated but not subjected to an ablation that isolates whether the observed gains arise from the proposed mechanisms versus standard scaling or regularization effects.

Authors: Section 4.4 already contains component-wise ablations (removing DMoE, contrastive loss, and stable routing individually) that demonstrate gains exceed those from simple scaling or L2 regularization. To further isolate against fitting artifacts, we will add a controlled comparison against a capacity-matched vanilla Mamba baseline with equivalent parameter count and training regime. revision: yes
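The reporting practice mentioned in the second response (standard deviation over five seeds and a paired t-test) can be sketched with the standard library alone. The per-seed values below are placeholders invented for illustration and are not results from the paper; `paired_t_statistic` is a hypothetical helper that returns only the t statistic.

```python
import math
import statistics

def mean_and_std(values):
    """Sample mean and sample standard deviation (n - 1 in the denominator)."""
    return statistics.mean(values), statistics.stdev(values)

def paired_t_statistic(a, b):
    """t statistic for paired samples a and b; degrees of freedom are len(a) - 1."""
    diffs = [x - y for x, y in zip(a, b)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))

# Placeholder MAE values for five seeds; invented for illustration, not from the paper.
baseline = [15.2, 15.0, 15.4, 15.1, 15.3]
model = [14.1, 14.0, 14.4, 13.9, 14.1]
t_stat = paired_t_statistic(baseline, model)
```

Turning the t statistic into a p-value requires the Student t distribution, which the standard library does not provide; `scipy.stats.ttest_rel` is one existing routine that performs the full paired test.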

Circularity Check

0 steps flagged

No significant circularity identified

The paper's central claims consist of empirical SOTA results on 10 held-out real-world benchmarks (with explicit gains reported on PEMSD8) plus a theoretical argument for routing smoothness and pattern disentanglement derived from the DMoE architecture, stable routing, and causal contrastive learning. These results are obtained from standard train/test splits on external datasets rather than quantities defined by fitted parameters inside the model equations. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the provided text; the derivation chain remains self-contained against external benchmarks and publicly released code.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The abstract supplies no explicit free parameters, axioms, or invented physical entities; the model itself is a new engineered artifact whose internal parameters are learned from data under standard deep-learning assumptions.

axioms (1)

domain assumption: Standard deep-learning assumptions that neural networks with the stated components can represent the target spatio-temporal functions and that benchmark datasets are representative of the intended use cases.
Implicit in any proposal of a new neural architecture for time-series prediction.

pith-pipeline@v0.9.0 · 5808 in / 1370 out tokens · 25114 ms · 2026-05-25T07:33:20.578123+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

PIMSM: Physics-Informed Multi-Scale Mamba for Stable Neural Representations under Distribution Shift
cs.LG 2026-05 unverdicted novelty 6.0

PIMSM is a Mamba-based architecture that maps knee frequencies from spectra to multi-scale discretization parameters to reduce representation drift under distribution shifts in fMRI and weather forecasting.

Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · cited by 1 Pith paper · 5 internal anchors

[1]

Khaled Alkilane, Yihang He, and Der-Horng Lee. 2024. MixMamba: Time series modeling with adaptive expertise. Information Fusion 112 (2024), 102589

[2]

Lei Bai, Lina Yao, Can Li, Xianzhi Wang, and Can Wang. 2020. Adaptive graph convolutional recurrent network for traffic forecasting. Advances in neural information processing systems 33 (2020), 17804–17815

[3]

Gianni Barlacchi, Marco De Nadai, Roberto Larcher, Antonio Casella, Cristiana Chitic, Giovanni Torrisi, Fabrizio Antonelli, Alessandro Vespignani, Alex Pentland, and Bruno Lepri. 2015. A multi-source dataset of urban life in the city of Milan and the Province of Trentino. Scientific data 2, 1 (2015), 1–15

[4]

Xiuding Cai, Yaoyao Zhu, Xueyao Wang, and Yu Yao. 2024. MambaTS: improved selective state space models for long-term time series forecasting. arXiv preprint arXiv:2405.16440 (2024)

[5]

Chao Chen, Karl Petty, Alexander Skabardonis, Pravin Varaiya, and Zhanfeng Jia. 2001. Freeway performance measurement system: mining loop detector data. Transportation research record 1748, 1 (2001), 96–102

[6]

Min Chen, Guansong Pang, Wenjun Wang, and Cheng Yan. 2025. Information Bottleneck-guided MLPs for Robust Spatial-temporal Forecasting. In Forty-second International Conference on Machine Learning

[7]

Peng Chen, Yingying Zhang, Yunyao Cheng, Yang Shu, Yihang Wang, Qingsong Wen, Bin Yang, and Chenjuan Guo. 2023. Pathformer: Multi-scale Transformers with Adaptive Pathways for Time Series Forecasting. In The Twelfth International Conference on Learning Representations

[8]

Zixiang Chen, Yihe Deng, Yue Wu, Quanquan Gu, and Yuanzhi Li. 2022. Towards understanding mixture of experts in deep learning. arXiv preprint arXiv:2208.02813 (2022)

[9]

Jeongwhan Choi, Hwangyong Choi, Jeehyun Hwang, and Noseong Park. 2022. Graph neural controlled differential equations for traffic forecasting. In Proceedings of the AAAI conference on artificial intelligence, Vol. 36. 6367–6374

[10]

Jinhyeok Choi, Heehyeon Kim, Minhyeong An, and Joyce Jiyoung Whang. 2024. Spot-mamba: Learning long-range dependency on spatio-temporal graphs with selective state spaces. arXiv preprint arXiv:2406.11244 (2024)

[11]

Damai Dai, Chengqi Deng, Chenggang Zhao, Rx Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y Wu, et al. 2024. DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 1280–1297

[12]

Tri Dao, Daniel Y Fu, Khaled K Saab, Armin W Thomas, Atri Rudra, and Christopher Ré. 2023. Hungry Hungry Hippos: Towards Language Modeling with State Space Models. In Proceedings of the 11th International Conference on Learning Representations (ICLR)

[13–14]

Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, et al. Glam: Efficient scaling of language models with mixture-of-experts. In International conference on machine learning. PMLR, 5547–5569

[15]

Yuchen Fang, Yanjun Qin, Haiyong Luo, Fang Zhao, Bingbing Xu, Liang Zeng, and Chenxing Wang. 2023. When spatio-temporal meet wavelets: Disentangled traffic forecasting via efficient spectral graph attention networks. In 2023 IEEE 39th international conference on data engineering (ICDE). IEEE, 517–529

[16]

William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research 23, 120 (2022), 1–39

[17]

Albert Gu and Tri Dao. 2024. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. In First Conference on Language Modeling

[18]

Albert Gu, Karan Goel, and Christopher Ré. 2021. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396 (2021)

[19]

Shengnan Guo, Youfang Lin, Ning Feng, Chao Song, and Huaiyu Wan. 2019. Attention based spatial-temporal graph convolutional networks for traffic flow forecasting. In Proceedings of the AAAI conference on artificial intelligence, Vol. 33. 922–929

[20]

Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. 1991. Adaptive mixtures of local experts. Neural computation 3, 1 (1991), 79–87

[21]

Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. 2024. Mixtral of experts. arXiv preprint arXiv:2401.04088 (2024)

[22]

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2020. Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668 (2020)

[23]

Dongyuan Li, Shiyin Tan, Ying Zhang, Ming Jin, Shirui Pan, Manabu Okumura, and Renhe Jiang. 2024. Dyg-mamba: Continuous state space modeling on dynamic graphs. arXiv preprint arXiv:2408.06966 (2024)

[24]

Hongbo Li, Sen Lin, Lingjie Duan, Yingbin Liang, and Ness B Shroff. 2024. Theory on mixture-of-experts in continual learning. arXiv preprint arXiv:2406.16437 (2024)

[25]

Lincan Li, Hanchen Wang, Wenjie Zhang, and Adelle Coster. 2024. Stg-mamba: Spatial-temporal graph learning via selective state space model. arXiv preprint arXiv:2403.12418 (2024)

[26]

Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. 2018. Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting. In International Conference on Learning Representations

[27]

Zhonghang Li, Lianghao Xia, Yong Xu, and Chao Huang. 2023. GPT-ST: generative pre-training of spatio-temporal graph neural networks. Advances in neural information processing systems 36 (2023), 70229–70246

[28]

Dachuan Liu, Jin Wang, Shuo Shang, and Peng Han. 2022. Msdr: Multi-step dependency relation networks for spatial temporal forecasting. In Proceedings of the 28th ACM SIGKDD conference on knowledge discovery and data mining. 1042–1050

[29]

Hangchen Liu, Zheng Dong, Renhe Jiang, Jiewen Deng, Jinliang Deng, Quanjun Chen, and Xuan Song. 2023. Spatio-temporal adaptive embedding makes vanilla transformer sota for traffic forecasting. In Proceedings of the 32nd ACM international conference on information and knowledge management. 4125–4129

[30]

Shang Liu, Miao He, Zhiqiang Wu, Peng Lu, and Weixi Gu. 2024. Spatial–temporal graph neural network traffic prediction based load balancing with reinforcement learning in cellular networks. Information Fusion 103 (2024), 102079

[31]

Ali Mehrabian, Shahab Bahrami, and Vincent WS Wong. 2023. A dynamic Bernstein graph recurrent network for wireless cellular traffic prediction. In ICC 2023-IEEE International Conference on Communications. IEEE, 3842–3847

[32]

Ali Mehrabian and Vincent WS Wong. 2025. A-Gamba: An Adaptive Graph-Mamba Model for Traffic Prediction in Wireless Cellular Networks. IEEE Wireless Communications Letters (2025)

[33]

Harsh Mehta, Ankit Gupta, Ashok Cutkosky, and Behnam Neyshabur. 2023. Long Range Language Modeling via Gated State Spaces. In International Conference on Learning Representations

[34]

Huy Nguyen, Pedram Akbarian, Fanqi Yan, and Nhat Ho. 2023. Statistical perspective of top-k sparse softmax gating mixture of experts. arXiv preprint arXiv:2309.13850 (2023)

[35]

Huy Nguyen, Nhat Ho, and Alessandro Rinaldo. 2024. On least square estimation in softmax gating mixture of experts. arXiv preprint arXiv:2402.02952 (2024)

[36]

Mohammad Amin Shabani, Amir H Abdi, Lili Meng, and Tristan Sylvain. [n. d.]. Scaleformer: Iterative Multi-scale Refining Transformers for Time Series Forecasting. In The Eleventh International Conference on Learning Representations

[37]

Zhi Sheng, Yuan Yuan, Jingtao Ding, and Yong Li. 2025. Unveiling the Power of Noise Priors: Enhancing Diffusion Models for Mobile Traffic Prediction. arXiv preprint arXiv:2501.13794 (2025)

[38]

Xiaoming Shi, Shiyu Wang, Yuqi Nie, Dianqi Li, Zhou Ye, Qingsong Wen, and Ming Jin. 2024. Time-MoE: Billion-Scale Time Series Foundation Models with Mixture of Experts. arXiv preprint arXiv:2409.16040 (2024)

[39]

Jimmy TH Smith, Andrew Warrington, and Scott W Linderman. 2023. Simplified State Space Layers for Sequence Modeling. In ICLR

[40]

Chao Song, Youfang Lin, Shengnan Guo, and Huaiyu Wan. 2020. Spatial-temporal synchronous graph convolutional networks: A new framework for spatial-temporal network data forecasting. In Proceedings of the AAAI conference on artificial intelligence, Vol. 34. 914–921

[41]

Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of machine learning research 9, 11 (2008)

[42–43]

Shuo Wang, Yanran Li, Jiang Zhang, Qingye Meng, Lingwei Meng, and Fei Gao. Pm2.5-gnn: A domain knowledge enhanced graph neural network for pm2.5 forecasting. In Proceedings of the 28th international conference on advances in geographic information systems. 163–166

[44–45]

Tan Wang, Zhongqi Yue, Jianqiang Huang, Qianru Sun, and Hanwang Zhang. Self-supervised learning disentangled group representation as feature. Advances in Neural Information Processing Systems 34 (2021), 18225–18240

[46]

Yuankai Wu, Dingyi Zhuang, Aurelie Labbe, and Lijun Sun. 2021. Inductive graph neural networks for spatiotemporal kriging. In Proceedings of the AAAI conference on artificial intelligence, Vol. 35. 4478–4485

[47]

Zonghan Wu, Shirui Pan, Guodong Long, Jing Jiang, Xiaojun Chang, and Chengqi Zhang. 2020. Connecting the dots: Multivariate time series forecasting with graph neural networks. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining. 753–763

[48]

Zonghan Wu, Shirui Pan, Guodong Long, Jing Jiang, and Chengqi Zhang. 2019. Graph wavenet for deep spatial-temporal graph modeling. arXiv preprint arXiv:1906.00121 (2019)

[49]

Xiongxiao Xu, Canyu Chen, Yueqing Liang, Baixiang Huang, Guangji Bai, Liang Zhao, and Kai Shu. 2024. SST: Multi-Scale Hybrid Mamba-Transformer Experts for Long-Short Range Time Series Forecasting. arXiv preprint arXiv:2404.14757 (2024)

[50]

Yang Yao, Bo Gu, Zhou Su, and Mohsen Guizani. 2021. MVSTGN: A multi-view spatial-temporal graph network for cellular traffic prediction. IEEE Transactions on Mobile Computing 22, 5 (2021), 2837–2849

[51]

Bing Yu, Haoteng Yin, and Zhanxing Zhu. 2017. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. arXiv preprint arXiv:1709.04875 (2017)

[52]

Haonan Yuan, Qingyun Sun, Zhaonan Wang, Xingcheng Fu, Cheng Ji, Yongjian Wang, Bo Jin, and Jianxin Li. 2025. DG-Mamba: Robust and Efficient Dynamic Graph Structure Learning with Selective State Space Models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 22272–22280

[53]

Zijian Zhang, Ze Huang, Zhiwei Hu, Xiangyu Zhao, Wanyu Wang, Zitao Liu, Junbo Zhang, S Joe Qin, and Hongwei Zhao. 2023. Mlpst: Mlp is all you need for spatio-temporal prediction. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management. 3381–3390

[54]

Zijian Zhang, Xiangyu Zhao, Qidong Liu, Chunxu Zhang, Qian Ma, Wanyu Wang, Hongwei Zhao, Yiqi Wang, and Zitao Liu. 2023. Promptst: Prompt-enhanced spatio-temporal multi-attribute prediction. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management. 3195–3205

[55]

Ling Zhao, Yujiao Song, Chao Zhang, Yu Liu, Pu Wang, Tao Lin, Min Deng, and Haifeng Li. 2019. T-GCN: A temporal graph convolutional network for traffic prediction. IEEE transactions on intelligent transportation systems 21, 9 (2019), 3848–3858

[56]

Barret Zoph. 2022. Designing effective sparse expert models. In 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, 1044–1044

[1] [1]

Khaled Alkilane, Yihang He, and Der-Horng Lee. 2024. MixMamba: Time series modeling with adaptive expertise.Information Fusion112 (2024), 102589

work page 2024

[2] [2]

Lei Bai, Lina Yao, Can Li, Xianzhi Wang, and Can Wang. 2020. Adaptive graph convolutional recurrent network for traffic forecasting.Advances in neural information processing systems33 (2020), 17804–17815

work page 2020

[3] [3]

Gianni Barlacchi, Marco De Nadai, Roberto Larcher, Antonio Casella, Cristiana Chitic, Giovanni Torrisi, Fabrizio Antonelli, Alessandro Vespignani, Alex Pent- land, and Bruno Lepri. 2015. A multi-source dataset of urban life in the city of Milan and the Province of Trentino.Scientific data2, 1 (2015), 1–15

work page 2015

[4] [4]

Xiuding Cai, Yaoyao Zhu, Xueyao Wang, and Yu Yao. 2024. MambaTS: improved selective state space models for long-term time series forecasting.arXiv preprint arXiv:2405.16440(2024)

work page arXiv 2024

[5] [5]

Chao Chen, Karl Petty, Alexander Skabardonis, Pravin Varaiya, and Zhanfeng Jia. 2001. Freeway performance measurement system: mining loop detector data. Transportation research record1748, 1 (2001), 96–102

work page 2001

[6] [6]

Min Chen, Guansong Pang, Wenjun Wang, and Cheng Yan. 2025. Information Bottleneck-guided MLPs for Robust Spatial-temporal Forecasting. InForty-second International Conference on Machine Learning

work page 2025

[7] [7]

Peng Chen, Yingying ZHANG, Yunyao Cheng, Yang Shu, Yihang Wang, Qingsong Wen, Bin Yang, and Chenjuan Guo. 2023. Pathformer: Multi-scale Transformers with Adaptive Pathways for Time Series Forecasting. InThe Twelfth International Conference on Learning Representations

work page 2023

[8] [8]

Zixiang Chen, Yihe Deng, Yue Wu, Quanquan Gu, and Yuanzhi Li. 2022. To- wards understanding mixture of experts in deep learning.arXiv preprint arXiv:2208.02813(2022)

work page arXiv 2022

[9] [9]

Jeongwhan Choi, Hwangyong Choi, Jeehyun Hwang, and Noseong Park. 2022. Graph neural controlled differential equations for traffic forecasting. InProceed- ings of the AAAI conference on artificial intelligence, Vol. 36. 6367–6374

work page 2022

[10] [10]

Jinhyeok Choi, Heehyeon Kim, Minhyeong An, and Joyce Jiyoung Whang. 2024. Spot-mamba: Learning long-range dependency on spatio-temporal graphs with selective state spaces.arXiv preprint arXiv:2406.11244(2024)

work page arXiv 2024

[11]

Damai Dai, Chengqi Deng, Chenggang Zhao, Rx Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y Wu, et al. 2024. DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 1280–1297

[12]

Tri Dao, Daniel Y Fu, Khaled K Saab, Armin W Thomas, Atri Rudra, and Christopher Ré. 2023. Hungry Hungry Hippos: Towards Language Modeling with State Space Models. In Proceedings of the 11th International Conference on Learning Representations (ICLR)

[13]

Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, et al. 2022. Glam: Efficient scaling of language models with mixture-of-experts. In International conference on machine learning. PMLR, 5547–5569

[15]

Yuchen Fang, Yanjun Qin, Haiyong Luo, Fang Zhao, Bingbing Xu, Liang Zeng, and Chenxing Wang. 2023. When spatio-temporal meet wavelets: Disentangled traffic forecasting via efficient spectral graph attention networks. In 2023 IEEE 39th international conference on data engineering (ICDE). IEEE, 517–529

[16]

William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research 23, 120 (2022), 1–39

[17]

Albert Gu and Tri Dao. 2024. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. In First Conference on Language Modeling

[18]

Albert Gu, Karan Goel, and Christopher Ré. 2021. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396 (2021)

[19]

Shengnan Guo, Youfang Lin, Ning Feng, Chao Song, and Huaiyu Wan. 2019. Attention based spatial-temporal graph convolutional networks for traffic flow forecasting. In Proceedings of the AAAI conference on artificial intelligence, Vol. 33. 922–929

[20]

Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. 1991. Adaptive mixtures of local experts. Neural computation 3, 1 (1991), 79–87

[21]

Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. 2024. Mixtral of experts. arXiv preprint arXiv:2401.04088 (2024)

[22]

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2020. Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668 (2020)

[23]

Dongyuan Li, Shiyin Tan, Ying Zhang, Ming Jin, Shirui Pan, Manabu Okumura, and Renhe Jiang. 2024. Dyg-mamba: Continuous state space modeling on dynamic graphs. arXiv preprint arXiv:2408.06966 (2024)

[24]

Hongbo Li, Sen Lin, Lingjie Duan, Yingbin Liang, and Ness B Shroff. 2024. Theory on mixture-of-experts in continual learning. arXiv preprint arXiv:2406.16437 (2024)

[25]

Lincan Li, Hanchen Wang, Wenjie Zhang, and Adelle Coster. 2024. Stg-mamba: Spatial-temporal graph learning via selective state space model. arXiv preprint arXiv:2403.12418 (2024)

[26]

Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. 2018. Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting. In International Conference on Learning Representations

[27]

Zhonghang Li, Lianghao Xia, Yong Xu, and Chao Huang. 2023. GPT-ST: generative pre-training of spatio-temporal graph neural networks. Advances in neural information processing systems 36 (2023), 70229–70246

[28]

Dachuan Liu, Jin Wang, Shuo Shang, and Peng Han. 2022. Msdr: Multi-step dependency relation networks for spatial temporal forecasting. In Proceedings of the 28th ACM SIGKDD conference on knowledge discovery and data mining. 1042–1050

[29]

Hangchen Liu, Zheng Dong, Renhe Jiang, Jiewen Deng, Jinliang Deng, Quanjun Chen, and Xuan Song. 2023. Spatio-temporal adaptive embedding makes vanilla transformer sota for traffic forecasting. In Proceedings of the 32nd ACM international conference on information and knowledge management. 4125–4129

[30]

Shang Liu, Miao He, Zhiqiang Wu, Peng Lu, and Weixi Gu. 2024. Spatial–temporal graph neural network traffic prediction based load balancing with reinforcement learning in cellular networks. Information Fusion 103 (2024), 102079

[31]

Ali Mehrabian, Shahab Bahrami, and Vincent WS Wong. 2023. A dynamic Bernstein graph recurrent network for wireless cellular traffic prediction. In ICC 2023-IEEE International Conference on Communications. IEEE, 3842–3847

[32]

Ali Mehrabian and Vincent WS Wong. 2025. A-Gamba: An Adaptive Graph-Mamba Model for Traffic Prediction in Wireless Cellular Networks. IEEE Wireless Communications Letters (2025)

[33]

Harsh Mehta, Ankit Gupta, Ashok Cutkosky, and Behnam Neyshabur. 2023. Long Range Language Modeling via Gated State Spaces. In International Conference on Learning Representations

[34]

Huy Nguyen, Pedram Akbarian, Fanqi Yan, and Nhat Ho. 2023. Statistical perspective of top-k sparse softmax gating mixture of experts. arXiv preprint arXiv:2309.13850 (2023)

[35]

Huy Nguyen, Nhat Ho, and Alessandro Rinaldo. 2024. On least square estimation in softmax gating mixture of experts. arXiv preprint arXiv:2402.02952 (2024)

[36]

Mohammad Amin Shabani, Amir H Abdi, Lili Meng, and Tristan Sylvain. [n. d.]. Scaleformer: Iterative Multi-scale Refining Transformers for Time Series Forecasting. In The Eleventh International Conference on Learning Representations

[37]

Zhi Sheng, Yuan Yuan, Jingtao Ding, and Yong Li. 2025. Unveiling the Power of Noise Priors: Enhancing Diffusion Models for Mobile Traffic Prediction. arXiv preprint arXiv:2501.13794 (2025)

[38]

Xiaoming Shi, Shiyu Wang, Yuqi Nie, Dianqi Li, Zhou Ye, Qingsong Wen, and Ming Jin. 2024. Time-MoE: Billion-Scale Time Series Foundation Models with Mixture of Experts. arXiv preprint arXiv:2409.16040 (2024)

[39]

Jimmy TH Smith, Andrew Warrington, and Scott W Linderman. 2023. Simplified State Space Layers for Sequence Modeling. In ICLR

[40]

Chao Song, Youfang Lin, Shengnan Guo, and Huaiyu Wan. 2020. Spatial-temporal synchronous graph convolutional networks: A new framework for spatial-temporal network data forecasting. In Proceedings of the AAAI conference on artificial intelligence, Vol. 34. 914–921

[41]

Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of machine learning research 9, 11 (2008)

[42]

Shuo Wang, Yanran Li, Jiang Zhang, Qingye Meng, Lingwei Meng, and Fei Gao. 2020. PM2.5-GNN: A domain knowledge enhanced graph neural network for PM2.5 forecasting. In Proceedings of the 28th international conference on advances in geographic information systems. 163–166

[44]

Tan Wang, Zhongqi Yue, Jianqiang Huang, Qianru Sun, and Hanwang Zhang. 2021. Self-supervised learning disentangled group representation as feature. Advances in Neural Information Processing Systems 34 (2021), 18225–18240

[46]

Yuankai Wu, Dingyi Zhuang, Aurelie Labbe, and Lijun Sun. 2021. Inductive graph neural networks for spatiotemporal kriging. In Proceedings of the AAAI conference on artificial intelligence, Vol. 35. 4478–4485

[47]

Zonghan Wu, Shirui Pan, Guodong Long, Jing Jiang, Xiaojun Chang, and Chengqi Zhang. 2020. Connecting the dots: Multivariate time series forecasting with graph neural networks. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining. 753–763

[48]

Zonghan Wu, Shirui Pan, Guodong Long, Jing Jiang, and Chengqi Zhang. 2019. Graph wavenet for deep spatial-temporal graph modeling. arXiv preprint arXiv:1906.00121 (2019)

[49]

Xiongxiao Xu, Canyu Chen, Yueqing Liang, Baixiang Huang, Guangji Bai, Liang Zhao, and Kai Shu. 2024. SST: Multi-Scale Hybrid Mamba-Transformer Experts for Long-Short Range Time Series Forecasting. arXiv preprint arXiv:2404.14757 (2024)

[50]

Yang Yao, Bo Gu, Zhou Su, and Mohsen Guizani. 2021. MVSTGN: A multi-view spatial-temporal graph network for cellular traffic prediction. IEEE Transactions on Mobile Computing 22, 5 (2021), 2837–2849

[51]

Bing Yu, Haoteng Yin, and Zhanxing Zhu. 2017. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. arXiv preprint arXiv:1709.04875 (2017)

[52]

Haonan Yuan, Qingyun Sun, Zhaonan Wang, Xingcheng Fu, Cheng Ji, Yongjian Wang, Bo Jin, and Jianxin Li. 2025. DG-Mamba: Robust and Efficient Dynamic Graph Structure Learning with Selective State Space Models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 22272–22280

[53]

Zijian Zhang, Ze Huang, Zhiwei Hu, Xiangyu Zhao, Wanyu Wang, Zitao Liu, Junbo Zhang, S Joe Qin, and Hongwei Zhao. 2023. Mlpst: Mlp is all you need for spatio-temporal prediction. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management. 3381–3390

[54]

Zijian Zhang, Xiangyu Zhao, Qidong Liu, Chunxu Zhang, Qian Ma, Wanyu Wang, Hongwei Zhao, Yiqi Wang, and Zitao Liu. 2023. Promptst: Prompt-enhanced spatio-temporal multi-attribute prediction. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management. 3195–3205

[55]

Ling Zhao, Yujiao Song, Chao Zhang, Yu Liu, Pu Wang, Tao Lin, Min Deng, and Haifeng Li. 2019. T-GCN: A temporal graph convolutional network for traffic prediction. IEEE transactions on intelligent transportation systems 21, 9 (2019), 3848–3858

[56]

Barret Zoph. 2022. Designing effective sparse expert models. In 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, 1044–1044

A More Related Work

State Space Models. SSMs have demonstrated exceptional capability in modeling sequential dependencies via state space. The structured state-space sequence model (S4...
