
arxiv: 2505.14202 · v3 · submitted 2025-05-20 · 💻 cs.LG

MSDformer: Multi-scale Discrete Transformer For Time Series Generation

Pith reviewed 2026-05-22 13:40 UTC · model grok-4.3

classification 💻 cs.LG
keywords time series generation · discrete token modeling · multi-scale tokenizer · autoregressive modeling · vector quantization · rate-distortion theorem · transformer

The pith

MSDformer generates higher-quality time series by tokenizing data at multiple scales and modeling those patterns autoregressively in discrete space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to fix two gaps in discrete token modeling for time series: the inability to represent patterns that unfold at different time resolutions and the lack of theory to justify design choices. It introduces a tokenizer that produces discrete tokens at several scales simultaneously and then runs autoregressive prediction across those scales inside the quantized latent space. The rate-distortion theorem is used to argue that this multi-scale structure is rational and effective. Experiments show clear gains over prior state-of-the-art DTM methods on standard generation benchmarks. Readers interested in synthetic data for forecasting, simulation, or anomaly detection would find the improvement practically relevant if the gains hold.

Core claim

MSDformer employs a multi-scale time series tokenizer to learn discrete token representations at multiple scales, which jointly characterize the complex nature of time series data. It then applies a multi-scale autoregressive token modeling technique to capture the multi-scale patterns of time series within the discrete latent space. The effectiveness of the DTM method and the rationality of MSDformer are validated through the rate-distortion theorem, with comprehensive experiments demonstrating that MSDformer significantly outperforms state-of-the-art methods.
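
To make the idea of tokenizing at several scales concrete, here is a minimal sketch; it is not the authors' tokenizer. It assumes fixed average pooling as the downsampling step and hand-picked scalar codebooks, whereas the paper learns its discrete representations. The function names, downsampling rates, and codebook values are all hypothetical.

```python
def average_pool(series, rate):
    """Downsample a 1-D series by averaging non-overlapping windows of `rate` samples."""
    usable = (len(series) // rate) * rate  # drop a ragged tail, if any
    return [sum(series[i:i + rate]) / rate for i in range(0, usable, rate)]


def quantize(values, codebook):
    """Replace each value with the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: (v - codebook[k]) ** 2)
        for v in values
    ]


def multi_scale_tokenize(series, rates, codebooks):
    """Return one token sequence per scale; coarser scales yield fewer tokens."""
    return [quantize(average_pool(series, r), cb) for r, cb in zip(rates, codebooks)]


# Hypothetical toy input: an 8-step series, two scales, one scalar codebook per scale.
series = [0.0, 0.2, 0.9, 1.1, 2.0, 2.2, 2.9, 3.1]
rates = [4, 2]                                      # coarse scale first (assumption)
codebooks = [[0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 3.0]]
tokens = multi_scale_tokenize(series, rates, codebooks)
print(tokens)  # [[1, 3], [0, 1, 2, 3]]
```

The coarse scale summarizes the series in two tokens while the fine scale keeps four, which is the sense in which the scales "jointly characterize" the same signal at different resolutions.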

What carries the argument

The multi-scale time series tokenizer paired with multi-scale autoregressive token modeling inside the discrete latent space.
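
One way to picture autoregressive modeling over multi-scale tokens is sketched below. It assumes the per-scale token sequences are concatenated coarse to fine and that each scale gets its own id range; the summary does not state the ordering MSDformer actually uses, so treat both choices, along with every name here, as hypothetical.

```python
def flatten_coarse_to_fine(tokens_per_scale, codebook_sizes):
    """Concatenate per-scale token lists, shifting each scale into its own id range."""
    flat, offset = [], 0
    for tokens, size in zip(tokens_per_scale, codebook_sizes):
        flat.extend(t + offset for t in tokens)
        offset += size
    return flat


def next_token_pairs(flat, bos_id):
    """Build (context, target) pairs for teacher-forced next-token prediction."""
    seq = [bos_id] + flat
    return [(seq[:i], seq[i]) for i in range(1, len(seq))]


# Hypothetical tokens for two scales, each with a codebook of size 4.
tokens_per_scale = [[1, 3], [0, 1, 2, 3]]
codebook_sizes = [4, 4]
flat = flatten_coarse_to_fine(tokens_per_scale, codebook_sizes)
pairs = next_token_pairs(flat, bos_id=sum(codebook_sizes))
print(flat)       # [1, 3, 4, 5, 6, 7]
print(pairs[0])   # ([8], 1)
```

Under this ordering every fine-scale token is predicted with all coarse-scale tokens already in its context, which is one plausible reading of "capturing multi-scale patterns within the discrete latent space".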

If this is right

  • Incorporating multi-scale information substantially enhances the quality of generated time series in DTM-based approaches.
  • Modeling multi-scale patterns within the discrete latent space improves generation performance.
  • The rate-distortion theorem supplies a theoretical foundation that justifies the multi-scale design choices.
  • Both theoretical analysis and experimental results together support the use of multi-scale structures for time series generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same multi-scale tokenization idea could be tested on other sequential modalities such as audio waveforms or sensor streams that also exhibit structure at several resolutions.
  • Allowing the model to select or weight scales dynamically during training might reduce unnecessary computation while preserving gains.
  • The interaction between the number of quantization levels and the choice of scales remains open for further empirical mapping.

Load-bearing premise

The rate-distortion theorem directly validates the specific design choices of the multi-scale tokenizer and autoregressive modeling without additional assumptions about how the theorem maps onto the discrete latent space of time series.
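
For reference, the standard rate-distortion function that this premise leans on can be written as below. The second line is a counting bound added here for illustration only: the symbols K (number of scales), N_k (tokens emitted at scale k), and V_k (codebook size at scale k) are assumptions of this sketch, not notation taken from the paper.

```latex
% Rate-distortion function of a source X with reconstruction \hat{X}
% under a distortion measure d (standard definition).
R(D) \;=\; \min_{p(\hat{x}\mid x)\,:\ \mathbb{E}[d(X,\hat{X})] \le D} I(X;\hat{X})

% Illustrative assumption: a tokenizer emitting N_k tokens from a codebook of
% size V_k at each of K scales spends at most this many bits per series.
R_{\mathrm{tok}} \;\le\; \sum_{k=1}^{K} N_k \log_2 V_k
```

The theorem fixes only the lowest rate achievable at distortion D. Whether a particular multi-scale tokenizer operates closer to that bound than a single-scale one is exactly the mapping the premise takes for granted.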

What would settle it

The central claim would be falsified by an ablation in which the multi-scale tokenizer and the multi-scale autoregressive components are replaced by single-scale equivalents and generation quality on the same benchmarks does not drop.

Figures

Figures reproduced from arXiv: 2505.14202 by Peilin Zhao, Qing Li, Shibo Feng, Xingyu Gao, Xi Xiao, Zhicheng Chen, Zhong Zhang.

Figure 1: Workflow of the multi-scale time-series tokenizer. The tokenizer consists of (K) modules, each implemented as a …
Figure 2: Stage-2 workflow of MSDformer (multi-scale autore…
Figure 3: Multi-scale time-series visualization for ( …
Figure 4: Details of the TS2Vec encoder.

TABLE 7: Detailed hyperparameters of SDformer.

Dataset   Sines   Stocks   ETTh   MuJoCo   Energy   fMRI
V         1024    512      512    512      512      512
dc        512     256      512    512      512      512
λ         0.5     2.0      0.5    0.5      0.001    0.01
r         4       4        4      4        4        2
Nenc      2       2        2      2        2        1
Ntrans    2       2        6      2        2        2
ϵ         0.3     0.3      0.3    0.1      0.1      0.1

Nenc: Number of encoder layers. Ntrans: Number of transformer layers.

TABLE 8: Detailed hyperparameters of MSDformer. Dataset Sin…
Figure 5: t-SNE visualizations of the time series synthesized by MSDformer, SDformer and Diffusion-TS.
Figure 6: Kernel density estimation visualizations of the time series synthesized by MSDformer, SDformer and Diffusion-TS.
Original abstract

Discrete Token Modeling (DTM), which employs vector quantization techniques, has demonstrated remarkable success in modeling non-natural language modalities, particularly in time series generation. While our prior work SDformer established the first DTM-based framework to achieve state-of-the-art performance in this domain, two critical limitations persist in existing DTM approaches: 1) their inability to capture multi-scale temporal patterns inherent to complex time series data, and 2) the absence of theoretical foundations to guide model optimization. To address these challenges, we proposes a novel multi-scale DTM-based time series generation method, called Multi-Scale Discrete Transformer (MSDformer). MSDformer employs a multi-scale time series tokenizer to learn discrete token representations at multiple scales, which jointly characterize the complex nature of time series data. Subsequently, MSDformer applies a multi-scale autoregressive token modeling technique to capture the multi-scale patterns of time series within the discrete latent space. Theoretically, we validate the effectiveness of the DTM method and the rationality of MSDformer through the rate-distortion theorem. Comprehensive experiments demonstrate that MSDformer significantly outperforms state-of-the-art methods. Both theoretical analysis and experimental results demonstrate that incorporating multi-scale information and modeling multi-scale patterns can substantially enhance the quality of generated time series in DTM-based approaches. Code is available at this repository:https://github.com/kkking-kk/MSDformer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MSDformer, a multi-scale discrete transformer for time series generation. It addresses limitations in existing DTM approaches by proposing a multi-scale time series tokenizer that learns discrete token representations at multiple scales and a multi-scale autoregressive token modeling technique to capture multi-scale patterns in the discrete latent space. The work claims theoretical validation of both the DTM framework and the specific MSDformer design choices via the rate-distortion theorem, along with experimental results showing significant outperformance over state-of-the-art methods in time series generation.

Significance. If the central claims hold, this work would advance discrete token modeling for time series by providing a principled way to incorporate multi-scale temporal structure, potentially improving generation quality in domains with complex, hierarchical patterns. The combination of multi-scale discretization and autoregressive modeling, if rigorously linked to rate-distortion bounds, could offer a template for future DTM architectures beyond time series.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (Theoretical Analysis): The claim that the rate-distortion theorem validates both the DTM method and the specific multi-scale tokenizer plus multi-scale AR modeling choices lacks a derivation. Rate-distortion theory provides the minimal rate R(D) for a given distortion but is agnostic to discretization granularity, scale hierarchy, and autoregressive factorization; no section derives how the proposed multi-scale VQ objective or joint token prediction reduces R(D) relative to single-scale baselines.
  2. [§4] §4 (Experiments): The abstract asserts both experimental outperformance and rate-distortion validation, yet the provided summary supplies no quantitative results, error bars, ablation details, or explicit mapping from the theorem to observed metrics. Without these, the central claim that multi-scale information 'substantially enhance[s] the quality' rests on unshown evidence.
minor comments (2)
  1. [Abstract] Abstract: 'we proposes' should be 'we propose'.
  2. [Abstract] Abstract: The GitHub link is given as 'this repository:https://github.com/kkking-kk/MSDformer' without proper formatting or description of what is released (e.g., code for tokenizer, training scripts, or evaluation).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major point below and indicate the revisions we will make to strengthen the manuscript.

  1. Referee: [Abstract and §3] Abstract and §3 (Theoretical Analysis): The claim that the rate-distortion theorem validates both the DTM method and the specific multi-scale tokenizer plus multi-scale AR modeling choices lacks a derivation. Rate-distortion theory provides the minimal rate R(D) for a given distortion but is agnostic to discretization granularity, scale hierarchy, and autoregressive factorization; no section derives how the proposed multi-scale VQ objective or joint token prediction reduces R(D) relative to single-scale baselines.

    Authors: We appreciate the referee identifying this gap in the theoretical presentation. Section 3 invokes the rate-distortion theorem to argue that multi-scale discretization permits a more efficient representation of hierarchical temporal dependencies, thereby supporting a favorable rate-distortion operating point that single-scale DTM cannot achieve as readily. We acknowledge, however, that the current text does not contain an explicit derivation or inequality showing how the multi-scale VQ loss and joint autoregressive prediction strictly lower the achievable R(D) relative to single-scale baselines. In the revised manuscript we will expand §3 with a concise derivation sketch that relates the multi-scale codebook objective to a tighter upper bound on distortion for a fixed rate, using the standard rate-distortion function as reference. revision: yes

  2. Referee: [§4] §4 (Experiments): The abstract asserts both experimental outperformance and rate-distortion validation, yet the provided summary supplies no quantitative results, error bars, ablation details, or explicit mapping from the theorem to observed metrics. Without these, the central claim that multi-scale information 'substantially enhance[s] the quality' rests on unshown evidence.

    Authors: The full manuscript already contains the requested elements in §4: quantitative tables comparing MSDformer against prior DTM and non-DTM baselines on standard time-series generation benchmarks, error bars computed over multiple random seeds, and ablation studies that isolate the contribution of the multi-scale tokenizer and multi-scale autoregressive modeling. To directly address the referee’s concern we will add a short subsection that explicitly links the rate-distortion arguments of §3 to the observed metric improvements (e.g., lower FID and higher precision at comparable or lower token rates). revision: yes
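
The "token rate" invoked in the response above can be made concrete with a small calculation. The sketch assumes that V in the SDformer hyperparameter table is the codebook size and r the temporal downsampling factor (the page does not define either symbol), so that each token covers r time steps and carries at most log2(V) bits. The two-scale configuration is purely hypothetical and is not taken from the paper.

```python
import math


def bits_per_step(codebook_size, downsample_rate):
    """Upper bound on bits spent per time step by one token stream."""
    return math.log2(codebook_size) / downsample_rate


# Single scale, using V = 1024 and r = 4 as listed for Sines in the SDformer table.
single_scale = bits_per_step(1024, 4)

# Hypothetical two-scale setup: (V, r) = (512, 8) and (512, 4); rates add across scales.
multi_scale = sum(bits_per_step(v, r) for v, r in [(512, 8), (512, 4)])

print(single_scale)  # 2.5
print(multi_scale)   # 3.375
```

A comparison "at comparable or lower token rates" would need numbers of this kind reported next to each generation metric, since adding scales raises the rate unless codebooks or resolutions shrink to compensate.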

Circularity Check

0 steps flagged

No significant circularity; rate-distortion theorem is external and derivation remains independent


The paper invokes the rate-distortion theorem as external theoretical validation for DTM effectiveness and MSDformer rationality. This is a standard information-theoretic result (Shannon) independent of the authors' prior work or current data. Self-citation to SDformer only establishes the baseline DTM framework; the multi-scale tokenizer, AR modeling, and performance claims are supported by new experiments rather than reducing to that citation or to fitted parameters renamed as predictions. No equations or steps equate the claimed theoretical validation to the paper's own inputs by construction. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the applicability of the rate-distortion theorem to discrete token time series models and on the empirical superiority of multi-scale representations; no free parameters or invented entities are explicitly named in the abstract.

axioms (1)
  • domain assumption: The rate-distortion theorem can be used to validate the effectiveness and rationality of DTM and MSDformer.
    Invoked in the abstract to provide a theoretical foundation; no derivation or mapping details are given.

pith-pipeline@v0.9.0 · 5782 in / 1239 out tokens · 27239 ms · 2026-05-22T13:40:53.859588+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

87 extracted references · 87 canonical work pages · 12 internal anchors

  1. [1]

    Online arima algorithms for time series prediction,

    C. Liu, S. C. Hoi, P . Zhao, and J. Sun, “Online arima algorithms for time series prediction,” inProceedings of the AAAI conference on artificial intelligence, vol. 30, no. 1, 2016

  2. [2]

    Hierarchical multi- scale gaussian transformer for stock movement prediction

    Q. Ding, S. Wu, H. Sun, J. Guo, and J. Guo, “Hierarchical multi- scale gaussian transformer for stock movement prediction.” in IJCAI, 2020, pp. 4640–4646

  3. [3]

    Accurate multivariate stock movement prediction via data-axis transformer with multi- level contexts,

    J. Yoo, Y. Soun, Y.-c. Park, and U. Kang, “Accurate multivariate stock movement prediction via data-axis transformer with multi- level contexts,” inProceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2021, pp. 2037–2045

  4. [4]

    Relation-aware transformer for portfolio policy learning,

    K. Xu, Y. Zhang, D. Ye, P . Zhao, and M. Tan, “Relation-aware transformer for portfolio policy learning,” inProceedings of the twenty-ninth international conference on international joint conferences on artificial intelligence, 2021, pp. 4647–4653

  5. [5]

    Financial time series frequent pattern mining algo- rithm based on time series arima model,

    M. Zhang, “Financial time series frequent pattern mining algo- rithm based on time series arima model,” in2023 International Conference on Networking, Informatics and Computing (ICNETIC), 2023, pp. 244–247

  6. [6]

    Use of interrupted time series anal- ysis in evaluating health care quality improvements,

    R. B. Penfold and F. Zhang, “Use of interrupted time series anal- ysis in evaluating health care quality improvements,”Academic pediatrics, vol. 13, no. 6, pp. S38–S44, 2013

  7. [7]

    A short-term rainfall prediction model using multi- task convolutional neural networks,

    M. Qiu, P . Zhao, K. Zhang, J. Huang, X. Shi, X. Wang, and W. Chu, “A short-term rainfall prediction model using multi- task convolutional neural networks,” in2017 IEEE international conference on data mining (ICDM). IEEE, 2017, pp. 395–404

  8. [8]

    Time-series forecasting with deep learning: a survey,

    B. Lim and S. Zohren, “Time-series forecasting with deep learning: a survey,”Philosophical Transactions of the Royal Society A, vol. 379, no. 2194, p. 20200209, 2021

  9. [9]

    Multi-scale attention flow for probabilistic time series forecast- ing,

    S. Feng, C. Miao, K. Xu, J. Wu, P . Wu, Y. Zhang, and P . Zhao, “Multi-scale attention flow for probabilistic time series forecast- ing,”IEEE Transactions on Knowledge and Data Engineering, vol. 36, no. 5, pp. 2056–2068, 2023

  10. [10]

    Latent diffusion trans- former for probabilistic time series forecasting,

    S. Feng, C. Miao, Z. Zhang, and P . Zhao, “Latent diffusion trans- former for probabilistic time series forecasting,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 11, 2024, pp. 11 979–11 987

  11. [11]

    Temporal fusion trans- formers for interpretable multi-horizon time series forecasting,

    B. Lim, S. ¨O. Arık, N. Loeff, and T. Pfister, “Temporal fusion trans- formers for interpretable multi-horizon time series forecasting,” International Journal of Forecasting, vol. 37, no. 4, pp. 1748–1764, 2021

  12. [12]

    One fits all: Power general time series analysis by pretrained lm,

    T. Zhou, P . Niu, L. Sun, R. Jinet al., “One fits all: Power general time series analysis by pretrained lm,”Advances in neural informa- tion processing systems, vol. 36, pp. 43 322–43 355, 2023

  13. [13]

    Ad- versarial sparse transformer for time series forecasting,

    S. Wu, X. Xiao, Q. Ding, P . Zhao, Y. Wei, and J. Huang, “Ad- versarial sparse transformer for time series forecasting,”Advances in neural information processing systems, vol. 33, pp. 17 105–17 115, 2020

  14. [14]

    Generative time- series modeling with fourier flows,

    A. Alaa, A. J. Chan, and M. van der Schaar, “Generative time- series modeling with fourier flows,” inInternational Conference on Learning Representations, 2020

  15. [15]

    C-RNN-GAN: Continuous recurrent neural networks with adversarial training

    O. Mogren, “C-rnn-gan: Continuous recurrent neural networks with adversarial training,”arXiv preprint arXiv:1611.09904, 2016

  16. [16]

    Real-valued (Medical) Time Series Generation with Recurrent Conditional GANs

    C. Esteban, S. L. Hyland, and G. R ¨atsch, “Real-valued (medical) time series generation with recurrent conditional gans,”arXiv preprint arXiv:1706.02633, 2017

  17. [17]

    Time-series generative adversarial networks,

    J. Yoon, D. Jarrett, and M. Van der Schaar, “Time-series generative adversarial networks,”Advances in neural information processing systems, vol. 32, 2019

  18. [18]

    Cot-gan: Gen- erating sequential data via causal optimal transport,

    T. Xu, L. K. Wenliang, M. Munn, and B. Acciaio, “Cot-gan: Gen- erating sequential data via causal optimal transport,”Advances in neural information processing systems, vol. 33, pp. 8798–8809, 2020

  19. [19]

    Towards generating real-world time series data,

    H. Pei, K. Ren, Y. Yang, C. Liu, T. Qin, and D. Li, “Towards generating real-world time series data,” in2021 IEEE International Conference on Data Mining (ICDM). IEEE, 2021, pp. 469–478

  20. [20]

    Psa-gan: Progres- sive self attention gans for synthetic time series,

    P . Jeha, M. Bohlke-Schneider, P . Mercado, S. Kapoor, R. S. Nirwan, V . Flunkert, J. Gasthaus, and T. Januschowski, “Psa-gan: Progres- sive self attention gans for synthetic time series,” inThe tenth international conference on learning representations, 2022

  21. [21]

    Gt-gan: General purpose time series synthesis with generative adversarial net- works,

    J. Jeon, J. Kim, H. Song, S. Cho, and N. Park, “Gt-gan: General purpose time series synthesis with generative adversarial net- works,”Advances in Neural Information Processing Systems, vol. 35, pp. 36 999–37 010, 2022

  22. [22]

    Timevae: A variational auto-encoder for multivariate time series generation,

    A. Desai, C. Freeman, Z. Wang, and I. Beaver, “Timevae: A variational auto-encoder for multivariate time series generation,” arXiv preprint arXiv:2111.08095, 2021

  23. [23]

    Generative modeling of regular and irregular time series data via koopman vaes,

    I. Naiman, N. B. Erichson, P . Ren, M. W. Mahoney, and O. Azencot, “Generative modeling of regular and irregular time series data via koopman vaes,”arXiv preprint arXiv:2310.02619, 2023

  24. [24]

    DiffWave: A Versatile Diffusion Model for Audio Synthesis

    Z. Kong, W. Ping, J. Huang, K. Zhao, and B. Catanzaro, “Diffwave: A versatile diffusion model for audio synthesis,”arXiv preprint arXiv:2009.09761, 2020

  25. [25]

    On the constrained time-series generation problem,

    A. Coletta, S. Gopalakrishnan, D. Borrajo, and S. Vyetrenko, “On the constrained time-series generation problem,”Advances in Neu- ral Information Processing Systems, vol. 36, 2024

  26. [26]

    arXiv preprint arXiv:2403.01742 , year=

    X. Yuan and Y. Qiao, “Diffusion-ts: Interpretable diffusion for general time series generation,”arXiv preprint arXiv:2403.01742, 2024

  27. [27]

    GPT-4 Technical Report

    J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkatet al., “Gpt-4 technical report,”arXiv preprint arXiv:2303.08774, 2023

  28. [28]

    PaLM 2 Technical Report

    R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos, S. Shakeri, E. Taropa, P . Bailey, Z. Chenet al., “Palm 2 technical report,”arXiv preprint arXiv:2305.10403, 2023

  29. [29]

    Zero-shot text-to-image generation,

    A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever, “Zero-shot text-to-image generation,” inInternational conference on machine learning. Pmlr, 2021, pp. 8821– 8831

  30. [30]

    Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

    J. Yu, Y. Xu, J. Y. Koh, T. Luong, G. Baid, Z. Wang, V . Vasude- van, A. Ku, Y. Yang, B. K. Ayanet al., “Scaling autoregressive models for content-rich text-to-image generation,”arXiv preprint arXiv:2206.10789, vol. 2, no. 3, p. 5, 2022

  31. [31]

    Maskgit: Masked generative image transformer,

    H. Chang, H. Zhang, L. Jiang, C. Liu, and W. T. Freeman, “Maskgit: Masked generative image transformer,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 11 315–11 325

  32. [32]

    Muse: Text-to- image generation via masked generative transformers.arXiv preprint arXiv:2301.00704, 2023

    H. Chang, H. Zhang, J. Barber, A. Maschinot, J. Lezama, L. Jiang, M.-H. Yang, K. Murphy, W. T. Freeman, M. Rubinsteinet al., “Muse: Text-to-image generation via masked generative trans- formers,”arXiv preprint arXiv:2301.00704, 2023

  33. [33]

    Magvit: Masked generative video transformer,

    L. Yu, Y. Cheng, K. Sohn, J. Lezama, H. Zhang, H. Chang, A. G. Hauptmann, M.-H. Yang, Y. Hao, I. Essaet al., “Magvit: Masked generative video transformer,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 10 459–10 469

  34. [34]

    Phenaki: Variable Length Video Generation From Open Domain Textual Description

    R. Villegas, M. Babaeizadeh, P .-J. Kindermans, H. Moraldo, H. Zhang, M. T. Saffar, S. Castro, J. Kunze, and D. Erhan, “Phenaki: Variable length video generation from open domain textual de- scription,”arXiv preprint arXiv:2210.02399, 2022

  35. [35]

    VideoPoet: A Large Language Model for Zero-Shot Video Generation

    D. Kondratyuk, L. Yu, X. Gu, J. Lezama, J. Huang, G. Schindler, R. Hornung, V . Birodkar, J. Yan, M.-C. Chiuet al., “Videopoet: A large language model for zero-shot video generation,”arXiv preprint arXiv:2312.14125, 2023

  36. [36]

    Soundstream: An end-to-end neural audio codec,

    N. Zeghidour, A. Luebs, A. Omran, J. Skoglund, and M. Tagliasacchi, “Soundstream: An end-to-end neural audio codec,”IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 495–507, 2021

  37. [37]

    Simple and controllable music generation,

    J. Copet, F. Kreuk, I. Gat, T. Remez, D. Kant, G. Synnaeve, Y. Adi, and A. D ´efossez, “Simple and controllable music generation,” Advances in Neural Information Processing Systems, vol. 36, pp. 47 704–47 720, 2023

  38. [38]

    Neural codec language models are zero-shot text to speech synthesizers,

    S. Chen, C. Wang, Y. Wu, Z. Zhang, L. Zhou, S. Liu, Z. Chen, Y. Liu, H. Wang, J. Liet al., “Neural codec language models are zero-shot text to speech synthesizers,”IEEE Transactions on Audio, Speech and Language Processing, 2025

  39. [39]

    Sdformer: Similarity-driven discrete transformer for time series generation,

    C. Zhicheng, F. SHIBO, Z. Zhang, X. Xiao, X. Gao, and P . Zhao, “Sdformer: Similarity-driven discrete transformer for time series generation,” inThe Thirty-eighth Annual Conference on Neural Infor- mation Processing Systems, 2024

  40. [40]

    Hdt: Hierarchical discrete transformer for multivariate time series forecasting,

    F. Shibo, P . Zhao, L. Liu, P . Wu, and Z. Shen, “Hdt: Hierarchical discrete transformer for multivariate time series forecasting,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 1, 2025, pp. 746–754

  41. [41]

    Cascade residual learning: A two-stage convolutional neural network for stereo matching,

    J. Pang, W. Sun, J. S. Ren, C. Yang, and Q. Yan, “Cascade residual learning: A two-stage convolutional neural network for stereo matching,” inProceedings of the IEEE international conference on computer vision workshops, 2017, pp. 887–895

  42. [42]

    Neural discrete representation learning,

    A. Van Den Oord, O. Vinyalset al., “Neural discrete representation learning,”Advances in neural information processing systems, vol. 30, 2017

  43. [43]

    Which training meth- ods for gans do actually converge?

    L. Mescheder, A. Geiger, and S. Nowozin, “Which training meth- ods for gans do actually converge?” inInternational conference on machine learning. PMLR, 2018, pp. 3481–3490. JOURNAL OF LATEX CLASS FILES, 2026 14

  44. [44]

    On Convergence and Stability of GANs

    N. Kodali, J. Abernethy, J. Hays, and Z. Kira, “On convergence and stability of gans,”arXiv preprint arXiv:1705.07215, 2017

  45. [45]

    Improving gener- alization and stability of generative adversarial networks,

    H. Thanh-Tung, T. Tran, and S. Venkatesh, “Improving gener- alization and stability of generative adversarial networks,” in International Conference on Learning Representations, 2018

  46. [46]

    Catastrophic forgetting and mode collapse in gans,

    H. Thanh-Tung and T. Tran, “Catastrophic forgetting and mode collapse in gans,” in2020 international joint conference on neural networks (ijcnn). IEEE, 2020, pp. 1–10

  47. [47]

    Generative adversarial networks in time series: A survey and taxonomy,

    E. Brophy, Z. Wang, Q. She, and T. Ward, “Generative adversarial networks in time series: A survey and taxonomy,”arXiv preprint arXiv:2107.11098, 2021

  48. [48]

    Deep Time Series Models: A Comprehensive Survey and Benchmark

    Y. Wang, H. Wu, J. Dong, Y. Liu, C. Wang, M. Long, and J. Wang, “Deep time series models: A comprehensive survey and bench- mark,”arXiv preprint arXiv:2407.13278, 2024

  49. [49]

    Denoising diffusion probabilistic models,

    J. Ho, A. Jain, and P . Abbeel, “Denoising diffusion probabilistic models,”Advances in neural information processing systems, vol. 33, pp. 6840–6851, 2020

  50. [50]

    Bert: Pre- training of deep bidirectional transformers for language under- standing,

    J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre- training of deep bidirectional transformers for language under- standing,” inProceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), 2019, pp. 4171–4186

  51. [51]

    VideoGPT: Video Generation using VQ-VAE and Transformers

    W. Yan, Y. Zhang, P . Abbeel, and A. Srinivas, “Videogpt: Video generation using vq-vae and transformers,”arXiv preprint arXiv:2104.10157, 2021

  52. [52]

    Vqtts: High-fidelity text-to- speech synthesis with self-supervised vq acoustic feature,

    C. Du, Y. Guo, X. Chen, and K. Yu, “Vqtts: High-fidelity text-to- speech synthesis with self-supervised vq acoustic feature,”arXiv preprint arXiv:2204.00768, 2022

  53. [53]

    Taming transformers for high-resolution image synthesis,

    P . Esser, R. Rombach, and B. Ommer, “Taming transformers for high-resolution image synthesis,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 12 873–12 883

  54. [54]

    T2m-gpt: Generating human motion from textual descriptions with discrete representations,

    J. Zhang, Y. Zhang, X. Cun, S. Huang, Y. Zhang, H. Zhao, H. Lu, and X. Shen, “T2m-gpt: Generating human motion from textual descriptions with discrete representations,”arXiv preprint arXiv:2301.06052, 2023

  55. [55]

    Motiongpt: Human motion as a foreign language,

    B. Jiang, X. Chen, W. Liu, J. Yu, G. Yu, and T. Chen, “Motiongpt: Human motion as a foreign language,”Advances in Neural Infor- mation Processing Systems, vol. 36, 2024

  56. [56]

    BERT: pre-training of deep bidirectional transformers for language understanding,

    J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: pre-training of deep bidirectional transformers for language understanding,” inProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short P...

  57. [57]

    Visual autore- gressive modeling: Scalable image generation via next-scale pre- diction,

    K. Tian, Y. Jiang, Z. Yuan, B. Peng, and L. Wang, “Visual autore- gressive modeling: Scalable image generation via next-scale pre- diction,”Advances in neural information processing systems, vol. 37, pp. 84 839–84 865, 2024

  58. [58]

    Multi-scale adaptive graph neural network for mul- tivariate time series forecasting,

    L. Chen, D. Chen, Z. Shang, B. Wu, C. Zheng, B. Wen, and W. Zhang, “Multi-scale adaptive graph neural network for mul- tivariate time series forecasting,”IEEE Transactions on Knowledge and Data Engineering, vol. 35, no. 10, pp. 10 748–10 761, 2023

  59. [59]

    Scientific reports 12, 16327

    P . Chen, Y. Zhang, Y. Cheng, Y. Shu, Y. Wang, Q. Wen, B. Yang, and C. Guo, “Pathformer: Multi-scale transformers with adaptive pathways for time series forecasting,”arXiv preprint arXiv:2402.05956, 2024

  60. [60]

    Hierarchical quantized autoencoders,

    W. Williams, S. Ringer, T. Ash, D. MacLeod, J. Dougherty, and J. Hughes, “Hierarchical quantized autoencoders,”Advances in Neural Information Processing Systems, vol. 33, pp. 4524–4535, 2020

  61. [61]

    Attention is all you need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017

  62. [62]

    Coding theorems for a discrete source with a fidelity criterion,

    C. E. Shannon, “Coding theorems for a discrete source with a fidelity criterion,” IRE International Convention Record, vol. 7, no. 4, pp. 142–163, 1959

  63. [63]

    Visualizing data using t-sne

    L. Van der Maaten and G. Hinton, “Visualizing data using t-sne.” Journal of machine learning research, vol. 9, no. 11, 2008

  64. [64]

    Decoupled Weight Decay Regularization

    I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017

  65. [65]

    Informer: Beyond efficient transformer for long sequence time-series forecasting,

    H. Zhou, S. Zhang, J. Peng, S. Zhang, J. Li, H. Xiong, and W. Zhang, “Informer: Beyond efficient transformer for long sequence time-series forecasting,” in Proceedings of the AAAI conference on artificial intelligence, vol. 35, no. 12, 2021, pp. 11106–11115

  66. [66]

    Data driven prediction models of energy use of appliances in a low-energy house,

    L. M. Candanedo, V. Feldheim, and D. Deramaix, “Data driven prediction models of energy use of appliances in a low-energy house,” Energy and buildings, vol. 140, pp. 81–97, 2017

  67. [67]

    Mujoco: A physics engine for model-based control,

    E. Todorov, T. Erez, and Y. Tassa, “Mujoco: A physics engine for model-based control,” in 2012 IEEE/RSJ international conference on intelligent robots and systems. IEEE, 2012, pp. 5026–5033

  68. [68]

    Latent ordinary differential equations for irregularly-sampled time series,

    Y. Rubanova, R. T. Chen, and D. K. Duvenaud, “Latent ordinary differential equations for irregularly-sampled time series,” Advances in neural information processing systems, vol. 32, 2019

  69. [69]

    Ts2vec: Towards universal representation of time series,

    Z. Yue, Y. Wang, J. Duan, T. Yang, C. Huang, Y. Tong, and B. Xu, “Ts2vec: Towards universal representation of time series,” in Proceedings of the AAAI conference on artificial intelligence, vol. 36, no. 8, 2022, pp. 8980–8987.

    JOURNAL OF LATEX CLASS FILES, 2026

    SUPPLEMENTARY MATERIALS FOR MSDFORMER

    In the supplementary, we provide more implementation det...

  70. [70]

    dataset is a multivariate physics simulation time series dataset, which collects a total of 10,000 simulations of the “Hopper” model from the DeepMind Control Suite and MuJoCo simulator. The position of the body in 2D space is uniformly sampled from the interval [0, 0.5]. The relative position of the limbs is sampled from the range [−2, 2], and initial veloci...

  71. [71]

    https://finance.yahoo.com/quote/GOOG/history/?p=GOOG

  72. [72]

    https://github.com/zhouhaoyi/ETDataset

  73. [73]

    https://archive.ics.uci.edu/dataset/374/appliances+energy+prediction

  74. [74]

    https://www.fmrib.ox.ac.uk/datasets/netsim/

  75. [75]

    https://github.com/jsyoon0823/TimeGAN

  76. [76]

    real”) from synthesized data (labeled as “synthetic

    https://github.com/google-deepmind/dm_control

    all, there are 10000 sequences of 100 regularly sampled time points with a feature dimension of 14. Quantitative Metrics. To be specific, 1) Discriminative Score measures the distributional similarity between original and synthesized time series data. A binary classifier (e.g., RNN-based) is trained to distin- ...

  77. [77]

    time steps (all but the last step of each window), and the prediction target is also (T-1) steps, corresponding to a one-step-shifted forecasting objective. Thus, the predictive score does not evaluate long-horizon forecasting; it evaluates one-step-ahead prediction across the entire window following TimeGAN’s post-hoc predictive score protocol. The pre...

  78. [78]

    https://github.com/Y-debug-sys/Diffusion-TS/tree/main

    one) on synthetic data and reports the MAE on real data over all time steps in the generation window. 3) Context-FID Score evaluates the distributional fidelity of synthesized time series by comparing contextual feature embeddings between original and synthetic datasets...

  79. [79]

    0.006±.004 0.249±.000 0.003±.000

  80. [80]

    0.005±.004 0.249±.000 0.003±.000 [512,512] 0.005±.003 0.249±.000 0.003±.000 64
