MSDformer: Multi-scale Discrete Transformer For Time Series Generation

Peilin Zhao; Qing Li; Shibo Feng; Xingyu Gao; Xi Xiao; Zhicheng Chen; Zhong Zhang

arxiv: 2505.14202 · v3 · submitted 2025-05-20 · 💻 cs.LG

Shibo Feng, Zhicheng Chen, Xi Xiao, Zhong Zhang, Qing Li, Xingyu Gao, Peilin Zhao

Pith reviewed 2026-05-22 13:40 UTC · model grok-4.3

classification 💻 cs.LG

keywords time series generation · discrete token modeling · multi-scale tokenizer · autoregressive modeling · vector quantization · rate-distortion theorem · transformer

The pith

MSDformer generates higher-quality time series by tokenizing data at multiple scales and modeling those patterns autoregressively in discrete space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to fix two gaps in discrete token modeling for time series: the inability to represent patterns that unfold at different time resolutions and the lack of theory to justify design choices. It introduces a tokenizer that produces discrete tokens at several scales simultaneously and then runs autoregressive prediction across those scales inside the quantized latent space. The rate-distortion theorem is used to argue that this multi-scale structure is rational and effective. Experiments show clear gains over prior state-of-the-art DTM methods on standard generation benchmarks. Readers interested in synthetic data for forecasting, simulation, or anomaly detection would find the improvement practically relevant if the gains hold.

Core claim

MSDformer employs a multi-scale time series tokenizer to learn discrete token representations at multiple scales, which jointly characterize the complex nature of time series data. It then applies a multi-scale autoregressive token modeling technique to capture the multi-scale patterns of time series within the discrete latent space. The effectiveness of the DTM method and the rationality of MSDformer are validated through the rate-distortion theorem, with comprehensive experiments demonstrating that MSDformer significantly outperforms state-of-the-art methods.
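
Read as a probability model, "multi-scale autoregressive token modeling" amounts to a chain-rule factorization over scales and positions. The form below is a generic sketch, not the paper's stated objective: the coarse-to-fine ordering and the symbols $z^{(k)}_t$ (token at position $t$ of scale $k$) and $T_k$ (number of tokens at scale $k$) are assumptions introduced here for illustration, with $K$ the number of scales.

```latex
p\big(z^{(1)}, \dots, z^{(K)}\big)
  \;=\; \prod_{k=1}^{K} \prod_{t=1}^{T_k}
        p\big(z^{(k)}_t \,\big|\, z^{(k)}_{<t},\; z^{(<k)}\big)
```

The chain rule holds for any ordering of the tokens; what a particular model chooses is which ordering to use and how the conditioning on the other scales is implemented.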

What carries the argument

The multi-scale time series tokenizer paired with multi-scale autoregressive token modeling inside the discrete latent space.
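
To make the idea of tokenizing at several scales concrete, here is a minimal NumPy sketch. It is not the paper's tokenizer, which is learned: average pooling stands in for the encoder, a fixed scalar codebook stands in for the learned vector-quantization codebook, and the names `downsample`, `quantize`, and `multiscale_tokenize`, the pooling factors, and the codebook size are all hypothetical.

```python
import numpy as np

def downsample(x, factor):
    """Average-pool a 1-D series by `factor` (a trailing remainder is dropped)."""
    n = (len(x) // factor) * factor
    return x[:n].reshape(-1, factor).mean(axis=1)

def quantize(values, codebook):
    """Map each scalar to the index of its nearest codebook entry."""
    dists = np.abs(values[:, None] - codebook[None, :])
    return dists.argmin(axis=1)

def multiscale_tokenize(x, factors, codebooks):
    """Return one discrete token sequence per scale, in the order of `factors`."""
    return [quantize(downsample(x, f), cb) for f, cb in zip(factors, codebooks)]

# Toy usage: a sine wave tokenized at three hypothetical scales.
x = np.sin(np.linspace(0, 4 * np.pi, 64))
factors = [8, 4, 2]                       # hypothetical pooling factors, coarse to fine
codebooks = [np.linspace(-1, 1, 16)] * 3  # hypothetical fixed scalar codebooks
tokens = multiscale_tokenize(x, factors, codebooks)
print([len(t) for t in tokens])  # [8, 16, 32]
```

Coarser scales yield fewer tokens that summarize slow trends, while finer scales yield more tokens that track local variation; a learned tokenizer would replace both stand-ins with trained components.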

If this is right

Incorporating multi-scale information substantially enhances the quality of generated time series in DTM-based approaches.
Modeling multi-scale patterns within the discrete latent space improves generation performance.
The rate-distortion theorem supplies a theoretical foundation that justifies the multi-scale design choices.
Both theoretical analysis and experimental results together support the use of multi-scale structures for time series generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same multi-scale tokenization idea could be tested on other sequential modalities such as audio waveforms or sensor streams that also exhibit structure at several resolutions.
Allowing the model to select or weight scales dynamically during training might reduce unnecessary computation while preserving gains.
The interaction between the number of quantization levels and the choice of scales remains open for further empirical mapping.
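
The trade-off between quantization levels and scales can be framed as simple bit accounting. The sketch below counts the fixed-length coding rate of a token stream, which is only an upper bound on the rate and says nothing about distortion. The function name `bits_per_step` and every number in it are hypothetical and chosen so the two configurations spend the same budget.

```python
import math

def bits_per_step(codebook_size, downsample_factor):
    """Fixed-length rate: log2(codebook_size) bits per token,
    with one token emitted every `downsample_factor` time steps."""
    return math.log2(codebook_size) / downsample_factor

# Hypothetical single-scale setup: 1024 codes, one token every 4 steps.
single = bits_per_step(1024, 4)

# Hypothetical two-scale setup spending the same budget:
# a coarse stream (256 codes, every 8 steps) plus a fine stream (64 codes, every 4 steps).
multi = bits_per_step(256, 8) + bits_per_step(64, 4)

print(single, multi)  # 2.5 2.5
```

Whether the two-scale allocation achieves lower distortion at this shared rate is the empirical question; the arithmetic only shows that such comparisons can be set up at a matched budget.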

Load-bearing premise

The rate-distortion theorem directly validates the specific design choices of the multi-scale tokenizer and autoregressive modeling without additional assumptions about how the theorem maps onto the discrete latent space of time series.
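
For reference, the quantity this premise leans on is the standard rate-distortion function. The statement below is the textbook definition, with $d$ a distortion measure and $I$ mutual information, followed by the classic Gaussian case under squared error. Neither line is taken from the paper, and how the paper maps the definition onto multi-scale tokens is exactly what the premise leaves open.

```latex
% Rate-distortion function of a source X with reconstruction \hat{X}
R(D) \;=\; \min_{p(\hat{x}\mid x)\,:\ \mathbb{E}[d(X,\hat{X})]\,\le\, D} I(X;\hat{X})

% Worked case: Gaussian source with variance \sigma^2, squared-error distortion
R(D) \;=\; \tfrac{1}{2}\log_2\frac{\sigma^2}{D}, \qquad 0 < D \le \sigma^2
```

The minimization ranges over all conditional distributions, so the definition by itself carries no notion of scale or autoregressive order.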

What would settle it

The central claim would be falsified by an ablation in which the multi-scale tokenizer and multi-scale autoregressive components are replaced by single-scale equivalents and generation quality on the same benchmarks does not drop.

Figures

Figures reproduced from arXiv: 2505.14202 by Peilin Zhao, Qing Li, Shibo Feng, Xingyu Gao, Xi Xiao, Zhicheng Chen, Zhong Zhang.

**Figure 1.** Workflow of the multi-scale time-series tokenizer. The tokenizer consists of K modules, each implemented as a …

**Figure 2.** Stage-2 workflow of MSDformer (multi-scale autore…

**Figure 3.** Multi-scale time-series visualization for (…

**Figure 4.** Details of the TS2Vec encoder.

TABLE 7: Detailed hyperparameters of SDformer.

| Dataset | Sines | Stocks | ETTh | MuJoCo | Energy | fMRI |
|---|---|---|---|---|---|---|
| V | 1024 | 512 | 512 | 512 | 512 | 512 |
| dc | 512 | 256 | 512 | 512 | 512 | 512 |
| λ | 0.5 | 2.0 | 0.5 | 0.5 | 0.001 | 0.01 |
| r | 4 | 4 | 4 | 4 | 4 | 2 |
| Nenc | 2 | 2 | 2 | 2 | 2 | 1 |
| Ntrans | 2 | 2 | 6 | 2 | 2 | 2 |
| ϵ | 0.3 | 0.3 | 0.3 | 0.1 | 0.1 | 0.1 |

Nenc: number of encoder layers. Ntrans: number of transformer layers.

TABLE 8: Detailed hyperparameters of MSDformer. Dataset Sin…

**Figure 5.** t-SNE visualizations of the time series synthesized by MSDformer, SDformer and Diffusion-TS.

**Figure 6.** Kernel density estimation visualizations of the time series synthesized by MSDformer, SDformer and Diffusion-TS.

Original abstract

Discrete Token Modeling (DTM), which employs vector quantization techniques, has demonstrated remarkable success in modeling non-natural language modalities, particularly in time series generation. While our prior work SDformer established the first DTM-based framework to achieve state-of-the-art performance in this domain, two critical limitations persist in existing DTM approaches: 1) their inability to capture multi-scale temporal patterns inherent to complex time series data, and 2) the absence of theoretical foundations to guide model optimization. To address these challenges, we proposes a novel multi-scale DTM-based time series generation method, called Multi-Scale Discrete Transformer (MSDformer). MSDformer employs a multi-scale time series tokenizer to learn discrete token representations at multiple scales, which jointly characterize the complex nature of time series data. Subsequently, MSDformer applies a multi-scale autoregressive token modeling technique to capture the multi-scale patterns of time series within the discrete latent space. Theoretically, we validate the effectiveness of the DTM method and the rationality of MSDformer through the rate-distortion theorem. Comprehensive experiments demonstrate that MSDformer significantly outperforms state-of-the-art methods. Both theoretical analysis and experimental results demonstrate that incorporating multi-scale information and modeling multi-scale patterns can substantially enhance the quality of generated time series in DTM-based approaches. Code is available at this repository:https://github.com/kkking-kk/MSDformer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

MSDformer layers multi-scale tokenization and autoregressive modeling onto the authors' prior SDformer for discrete time series generation, but the rate-distortion link stays high-level without a derivation.

The main takeaway is that this paper takes the discrete token modeling setup from the authors' earlier SDformer and adds explicit multi-scale handling in both the tokenizer and the autoregressive transformer. The new pieces are a multi-scale time series tokenizer that produces discrete tokens at different resolutions and a multi-scale autoregressive model that predicts across those scales in the latent space. This directly targets the limitation they flag in prior DTM work around capturing complex temporal patterns at multiple scales.

Releasing the code on GitHub is a practical plus for anyone who wants to test or extend it. The attempt to tie the design back to the rate-distortion theorem is also a step in the right direction for grounding the choices beyond pure experiments.

The central claim of significant outperformance over state-of-the-art methods rests on the experiments they ran, which the abstract says are comprehensive. If the full paper includes solid ablations, error bars, and dataset details, that would make the empirical case clearer.

The softer spot is the theoretical part. Rate-distortion gives a general bound but does not automatically explain why the specific multi-scale VQ objective or joint token prediction across scales reduces rate for a given distortion. Without a derivation or parameter-free mapping shown in the paper, the validation stays at the level of citing the theorem rather than demonstrating how it justifies these exact design choices. This does not sink the work, but it means the theoretical support is more suggestive than conclusive.

This paper is for researchers already working on generative models for time series using vector quantization and transformers. Someone looking for an incremental but targeted upgrade to discrete-token approaches could get value from the architecture and the released code. It is not reshaping the broader field, but it is a focused extension worth checking.

I would send it to peer review. The idea is straightforward, the code is available, and referees can evaluate the experiments and any missing derivation steps in detail.

Referee Report

2 major / 2 minor

Summary. The paper introduces MSDformer, a multi-scale discrete transformer for time series generation. It addresses limitations in existing DTM approaches by proposing a multi-scale time series tokenizer that learns discrete token representations at multiple scales and a multi-scale autoregressive token modeling technique to capture multi-scale patterns in the discrete latent space. The work claims theoretical validation of both the DTM framework and the specific MSDformer design choices via the rate-distortion theorem, along with experimental results showing significant outperformance over state-of-the-art methods in time series generation.

Significance. If the central claims hold, this work would advance discrete token modeling for time series by providing a principled way to incorporate multi-scale temporal structure, potentially improving generation quality in domains with complex, hierarchical patterns. The combination of multi-scale discretization and autoregressive modeling, if rigorously linked to rate-distortion bounds, could offer a template for future DTM architectures beyond time series.

major comments (2)

[Abstract and §3] Abstract and §3 (Theoretical Analysis): The claim that the rate-distortion theorem validates both the DTM method and the specific multi-scale tokenizer plus multi-scale AR modeling choices lacks a derivation. Rate-distortion theory provides the minimal rate R(D) for a given distortion but is agnostic to discretization granularity, scale hierarchy, and autoregressive factorization; no section derives how the proposed multi-scale VQ objective or joint token prediction reduces R(D) relative to single-scale baselines.
[§4] §4 (Experiments): The abstract asserts both experimental outperformance and rate-distortion validation, yet the provided summary supplies no quantitative results, error bars, ablation details, or explicit mapping from the theorem to observed metrics. Without these, the central claim that multi-scale information 'substantially enhance[s] the quality' rests on unshown evidence.

minor comments (2)

[Abstract] Abstract: 'we proposes' should be 'we propose'.
[Abstract] Abstract: The GitHub link is given as 'this repository:https://github.com/kkking-kk/MSDformer' without proper formatting or description of what is released (e.g., code for tokenizer, training scripts, or evaluation).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major point below and indicate the revisions we will make to strengthen the manuscript.

Referee: [Abstract and §3] Abstract and §3 (Theoretical Analysis): The claim that the rate-distortion theorem validates both the DTM method and the specific multi-scale tokenizer plus multi-scale AR modeling choices lacks a derivation. Rate-distortion theory provides the minimal rate R(D) for a given distortion but is agnostic to discretization granularity, scale hierarchy, and autoregressive factorization; no section derives how the proposed multi-scale VQ objective or joint token prediction reduces R(D) relative to single-scale baselines.

Authors: We appreciate the referee identifying this gap in the theoretical presentation. Section 3 invokes the rate-distortion theorem to argue that multi-scale discretization permits a more efficient representation of hierarchical temporal dependencies, thereby supporting a favorable rate-distortion operating point that single-scale DTM cannot achieve as readily. We acknowledge, however, that the current text does not contain an explicit derivation or inequality showing how the multi-scale VQ loss and joint autoregressive prediction strictly lower the achievable R(D) relative to single-scale baselines. In the revised manuscript we will expand §3 with a concise derivation sketch that relates the multi-scale codebook objective to a tighter upper bound on distortion for a fixed rate, using the standard rate-distortion function as reference.
revision: yes

Referee: [§4] §4 (Experiments): The abstract asserts both experimental outperformance and rate-distortion validation, yet the provided summary supplies no quantitative results, error bars, ablation details, or explicit mapping from the theorem to observed metrics. Without these, the central claim that multi-scale information 'substantially enhance[s] the quality' rests on unshown evidence.

Authors: The full manuscript already contains the requested elements in §4: quantitative tables comparing MSDformer against prior DTM and non-DTM baselines on standard time-series generation benchmarks, error bars computed over multiple random seeds, and ablation studies that isolate the contribution of the multi-scale tokenizer and multi-scale autoregressive modeling. To directly address the referee's concern we will add a short subsection that explicitly links the rate-distortion arguments of §3 to the observed metric improvements (e.g., lower FID and higher precision at comparable or lower token rates).
revision: yes

Circularity Check

0 steps flagged

No significant circularity; rate-distortion theorem is external and derivation remains independent

The paper invokes the rate-distortion theorem as external theoretical validation for DTM effectiveness and MSDformer rationality. This is a standard information-theoretic result (Shannon) independent of the authors' prior work or current data. Self-citation to SDformer only establishes the baseline DTM framework; the multi-scale tokenizer, AR modeling, and performance claims are supported by new experiments rather than reducing to that citation or to fitted parameters renamed as predictions. No equations or steps equate the claimed theoretical validation to the paper's own inputs by construction. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the applicability of the rate-distortion theorem to discrete token time series models and on the empirical superiority of multi-scale representations; no free parameters or invented entities are explicitly named in the abstract.

axioms (1)

domain assumption: The rate-distortion theorem can be used to validate the effectiveness and rationality of DTM and MSDformer.
Invoked in abstract to provide theoretical foundation; no derivation or mapping details given.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Theoretically, we validate the effectiveness of the DTM method and the rationality of MSDformer through the rate-distortion theorem... under the same overall rate budget, allocating capacity across scales can reduce distortion more effectively than increasing the codebook size in a single-scale model.
IndisputableMonolith/Foundation/DimensionForcing.lean reality_from_one_distinction unclear

multi-scale time series tokenizer to learn discrete token representations at multiple scales... multi-scale autoregressive token modeling

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

87 extracted references · 87 canonical work pages · 12 internal anchors

[1]

Online arima algorithms for time series prediction,

C. Liu, S. C. Hoi, P . Zhao, and J. Sun, “Online arima algorithms for time series prediction,” inProceedings of the AAAI conference on artificial intelligence, vol. 30, no. 1, 2016

work page 2016
[2]

Hierarchical multi- scale gaussian transformer for stock movement prediction

Q. Ding, S. Wu, H. Sun, J. Guo, and J. Guo, “Hierarchical multi- scale gaussian transformer for stock movement prediction.” in IJCAI, 2020, pp. 4640–4646

work page 2020
[3]

Accurate multivariate stock movement prediction via data-axis transformer with multi- level contexts,

J. Yoo, Y. Soun, Y.-c. Park, and U. Kang, “Accurate multivariate stock movement prediction via data-axis transformer with multi- level contexts,” inProceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2021, pp. 2037–2045

work page 2021
[4]

Relation-aware transformer for portfolio policy learning,

K. Xu, Y. Zhang, D. Ye, P . Zhao, and M. Tan, “Relation-aware transformer for portfolio policy learning,” inProceedings of the twenty-ninth international conference on international joint conferences on artificial intelligence, 2021, pp. 4647–4653

work page 2021
[5]

Financial time series frequent pattern mining algo- rithm based on time series arima model,

M. Zhang, “Financial time series frequent pattern mining algo- rithm based on time series arima model,” in2023 International Conference on Networking, Informatics and Computing (ICNETIC), 2023, pp. 244–247

work page 2023
[6]

Use of interrupted time series anal- ysis in evaluating health care quality improvements,

R. B. Penfold and F. Zhang, “Use of interrupted time series anal- ysis in evaluating health care quality improvements,”Academic pediatrics, vol. 13, no. 6, pp. S38–S44, 2013

work page 2013
[7]

A short-term rainfall prediction model using multi- task convolutional neural networks,

M. Qiu, P . Zhao, K. Zhang, J. Huang, X. Shi, X. Wang, and W. Chu, “A short-term rainfall prediction model using multi- task convolutional neural networks,” in2017 IEEE international conference on data mining (ICDM). IEEE, 2017, pp. 395–404

work page 2017
[8]

Time-series forecasting with deep learning: a survey,

B. Lim and S. Zohren, “Time-series forecasting with deep learning: a survey,”Philosophical Transactions of the Royal Society A, vol. 379, no. 2194, p. 20200209, 2021

work page 2021
[9]

Multi-scale attention flow for probabilistic time series forecast- ing,

S. Feng, C. Miao, K. Xu, J. Wu, P . Wu, Y. Zhang, and P . Zhao, “Multi-scale attention flow for probabilistic time series forecast- ing,”IEEE Transactions on Knowledge and Data Engineering, vol. 36, no. 5, pp. 2056–2068, 2023

work page 2056
[10]

Latent diffusion trans- former for probabilistic time series forecasting,

S. Feng, C. Miao, Z. Zhang, and P . Zhao, “Latent diffusion trans- former for probabilistic time series forecasting,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 11, 2024, pp. 11 979–11 987

work page 2024
[11]

Temporal fusion trans- formers for interpretable multi-horizon time series forecasting,

B. Lim, S. ¨O. Arık, N. Loeff, and T. Pfister, “Temporal fusion trans- formers for interpretable multi-horizon time series forecasting,” International Journal of Forecasting, vol. 37, no. 4, pp. 1748–1764, 2021

work page 2021
[12]

One fits all: Power general time series analysis by pretrained lm,

T. Zhou, P . Niu, L. Sun, R. Jinet al., “One fits all: Power general time series analysis by pretrained lm,”Advances in neural informa- tion processing systems, vol. 36, pp. 43 322–43 355, 2023

work page 2023
[13]

Ad- versarial sparse transformer for time series forecasting,

S. Wu, X. Xiao, Q. Ding, P . Zhao, Y. Wei, and J. Huang, “Ad- versarial sparse transformer for time series forecasting,”Advances in neural information processing systems, vol. 33, pp. 17 105–17 115, 2020

work page 2020
[14]

Generative time- series modeling with fourier flows,

A. Alaa, A. J. Chan, and M. van der Schaar, “Generative time- series modeling with fourier flows,” inInternational Conference on Learning Representations, 2020

work page 2020
[15]

C-RNN-GAN: Continuous recurrent neural networks with adversarial training

O. Mogren, “C-rnn-gan: Continuous recurrent neural networks with adversarial training,”arXiv preprint arXiv:1611.09904, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[16]

Real-valued (Medical) Time Series Generation with Recurrent Conditional GANs

C. Esteban, S. L. Hyland, and G. R ¨atsch, “Real-valued (medical) time series generation with recurrent conditional gans,”arXiv preprint arXiv:1706.02633, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[17]

Time-series generative adversarial networks,

J. Yoon, D. Jarrett, and M. Van der Schaar, “Time-series generative adversarial networks,”Advances in neural information processing systems, vol. 32, 2019

work page 2019
[18]

Cot-gan: Gen- erating sequential data via causal optimal transport,

T. Xu, L. K. Wenliang, M. Munn, and B. Acciaio, “Cot-gan: Gen- erating sequential data via causal optimal transport,”Advances in neural information processing systems, vol. 33, pp. 8798–8809, 2020

work page 2020
[19]

Towards generating real-world time series data,

H. Pei, K. Ren, Y. Yang, C. Liu, T. Qin, and D. Li, “Towards generating real-world time series data,” in2021 IEEE International Conference on Data Mining (ICDM). IEEE, 2021, pp. 469–478

work page 2021
[20]

Psa-gan: Progres- sive self attention gans for synthetic time series,

P . Jeha, M. Bohlke-Schneider, P . Mercado, S. Kapoor, R. S. Nirwan, V . Flunkert, J. Gasthaus, and T. Januschowski, “Psa-gan: Progres- sive self attention gans for synthetic time series,” inThe tenth international conference on learning representations, 2022

work page 2022
[21]

Gt-gan: General purpose time series synthesis with generative adversarial net- works,

J. Jeon, J. Kim, H. Song, S. Cho, and N. Park, “Gt-gan: General purpose time series synthesis with generative adversarial net- works,”Advances in Neural Information Processing Systems, vol. 35, pp. 36 999–37 010, 2022

work page 2022
[22]

Timevae: A variational auto-encoder for multivariate time series generation,

A. Desai, C. Freeman, Z. Wang, and I. Beaver, “Timevae: A variational auto-encoder for multivariate time series generation,” arXiv preprint arXiv:2111.08095, 2021

work page arXiv 2021
[23]

Generative modeling of regular and irregular time series data via koopman vaes,

I. Naiman, N. B. Erichson, P . Ren, M. W. Mahoney, and O. Azencot, “Generative modeling of regular and irregular time series data via koopman vaes,”arXiv preprint arXiv:2310.02619, 2023

work page arXiv 2023
[24]

DiffWave: A Versatile Diffusion Model for Audio Synthesis

Z. Kong, W. Ping, J. Huang, K. Zhao, and B. Catanzaro, “Diffwave: A versatile diffusion model for audio synthesis,”arXiv preprint arXiv:2009.09761, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2009
[25]

On the constrained time-series generation problem,

A. Coletta, S. Gopalakrishnan, D. Borrajo, and S. Vyetrenko, “On the constrained time-series generation problem,”Advances in Neu- ral Information Processing Systems, vol. 36, 2024

work page 2024
[26]

arXiv preprint arXiv:2403.01742 , year=

X. Yuan and Y. Qiao, “Diffusion-ts: Interpretable diffusion for general time series generation,”arXiv preprint arXiv:2403.01742, 2024

work page arXiv 2024
[27]

GPT-4 Technical Report

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkatet al., “Gpt-4 technical report,”arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[28]

PaLM 2 Technical Report

R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos, S. Shakeri, E. Taropa, P . Bailey, Z. Chenet al., “Palm 2 technical report,”arXiv preprint arXiv:2305.10403, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[29]

Zero-shot text-to-image generation,

A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever, “Zero-shot text-to-image generation,” inInternational conference on machine learning. Pmlr, 2021, pp. 8821– 8831

work page 2021
[30]

Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

J. Yu, Y. Xu, J. Y. Koh, T. Luong, G. Baid, Z. Wang, V . Vasude- van, A. Ku, Y. Yang, B. K. Ayanet al., “Scaling autoregressive models for content-rich text-to-image generation,”arXiv preprint arXiv:2206.10789, vol. 2, no. 3, p. 5, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[31]

Maskgit: Masked generative image transformer,

H. Chang, H. Zhang, L. Jiang, C. Liu, and W. T. Freeman, “Maskgit: Masked generative image transformer,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 11 315–11 325

work page 2022
[32]

Muse: Text-to- image generation via masked generative transformers.arXiv preprint arXiv:2301.00704, 2023

H. Chang, H. Zhang, J. Barber, A. Maschinot, J. Lezama, L. Jiang, M.-H. Yang, K. Murphy, W. T. Freeman, M. Rubinsteinet al., “Muse: Text-to-image generation via masked generative trans- formers,”arXiv preprint arXiv:2301.00704, 2023

work page arXiv 2023
[33]

Magvit: Masked generative video transformer,

L. Yu, Y. Cheng, K. Sohn, J. Lezama, H. Zhang, H. Chang, A. G. Hauptmann, M.-H. Yang, Y. Hao, I. Essaet al., “Magvit: Masked generative video transformer,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 10 459–10 469

work page 2023
[34]

Phenaki: Variable Length Video Generation From Open Domain Textual Description

R. Villegas, M. Babaeizadeh, P .-J. Kindermans, H. Moraldo, H. Zhang, M. T. Saffar, S. Castro, J. Kunze, and D. Erhan, “Phenaki: Variable length video generation from open domain textual de- scription,”arXiv preprint arXiv:2210.02399, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[35]

VideoPoet: A Large Language Model for Zero-Shot Video Generation

D. Kondratyuk, L. Yu, X. Gu, J. Lezama, J. Huang, G. Schindler, R. Hornung, V . Birodkar, J. Yan, M.-C. Chiuet al., “Videopoet: A large language model for zero-shot video generation,”arXiv preprint arXiv:2312.14125, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[36]

Soundstream: An end-to-end neural audio codec,

N. Zeghidour, A. Luebs, A. Omran, J. Skoglund, and M. Tagliasacchi, “Soundstream: An end-to-end neural audio codec,”IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 495–507, 2021

work page 2021
[37]

Simple and controllable music generation,

J. Copet, F. Kreuk, I. Gat, T. Remez, D. Kant, G. Synnaeve, Y. Adi, and A. D ´efossez, “Simple and controllable music generation,” Advances in Neural Information Processing Systems, vol. 36, pp. 47 704–47 720, 2023

work page 2023
[38]

Neural codec language models are zero-shot text to speech synthesizers,

S. Chen, C. Wang, Y. Wu, Z. Zhang, L. Zhou, S. Liu, Z. Chen, Y. Liu, H. Wang, J. Liet al., “Neural codec language models are zero-shot text to speech synthesizers,”IEEE Transactions on Audio, Speech and Language Processing, 2025

work page 2025
[39]

Sdformer: Similarity-driven discrete transformer for time series generation,

C. Zhicheng, F. SHIBO, Z. Zhang, X. Xiao, X. Gao, and P . Zhao, “Sdformer: Similarity-driven discrete transformer for time series generation,” inThe Thirty-eighth Annual Conference on Neural Infor- mation Processing Systems, 2024

work page 2024
[40]

Hdt: Hierarchical discrete transformer for multivariate time series forecasting,

F. Shibo, P . Zhao, L. Liu, P . Wu, and Z. Shen, “Hdt: Hierarchical discrete transformer for multivariate time series forecasting,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 1, 2025, pp. 746–754

work page 2025
[41]

Cascade residual learning: A two-stage convolutional neural network for stereo matching,

J. Pang, W. Sun, J. S. Ren, C. Yang, and Q. Yan, “Cascade residual learning: A two-stage convolutional neural network for stereo matching,” inProceedings of the IEEE international conference on computer vision workshops, 2017, pp. 887–895

work page 2017
[42]

Neural discrete representation learning,

A. Van Den Oord, O. Vinyalset al., “Neural discrete representation learning,”Advances in neural information processing systems, vol. 30, 2017

work page 2017
[43]

Which training meth- ods for gans do actually converge?

L. Mescheder, A. Geiger, and S. Nowozin, “Which training meth- ods for gans do actually converge?” inInternational conference on machine learning. PMLR, 2018, pp. 3481–3490. JOURNAL OF LATEX CLASS FILES, 2026 14

work page 2018
[44]

On Convergence and Stability of GANs

N. Kodali, J. Abernethy, J. Hays, and Z. Kira, “On convergence and stability of gans,”arXiv preprint arXiv:1705.07215, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[45]

Improving gener- alization and stability of generative adversarial networks,

H. Thanh-Tung, T. Tran, and S. Venkatesh, “Improving gener- alization and stability of generative adversarial networks,” in International Conference on Learning Representations, 2018

work page 2018
[46]

Catastrophic forgetting and mode collapse in gans,

H. Thanh-Tung and T. Tran, “Catastrophic forgetting and mode collapse in gans,” in2020 international joint conference on neural networks (ijcnn). IEEE, 2020, pp. 1–10

work page 2020
[47]

Generative adversarial networks in time series: A survey and taxonomy,

E. Brophy, Z. Wang, Q. She, and T. Ward, “Generative adversarial networks in time series: A survey and taxonomy,”arXiv preprint arXiv:2107.11098, 2021

work page arXiv 2021
[48]

Deep Time Series Models: A Comprehensive Survey and Benchmark

Y. Wang, H. Wu, J. Dong, Y. Liu, C. Wang, M. Long, and J. Wang, “Deep time series models: A comprehensive survey and bench- mark,”arXiv preprint arXiv:2407.13278, 2024

[49]

Denoising diffusion probabilistic models,

J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in neural information processing systems, vol. 33, pp. 6840–6851, 2020

[50]

Bert: Pre-training of deep bidirectional transformers for language understanding,

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), 2019, pp. 4171–4186

[51]

VideoGPT: Video Generation using VQ-VAE and Transformers

W. Yan, Y. Zhang, P. Abbeel, and A. Srinivas, “Videogpt: Video generation using vq-vae and transformers,” arXiv preprint arXiv:2104.10157, 2021

[52]

Vqtts: High-fidelity text-to-speech synthesis with self-supervised vq acoustic feature,

C. Du, Y. Guo, X. Chen, and K. Yu, “Vqtts: High-fidelity text-to-speech synthesis with self-supervised vq acoustic feature,” arXiv preprint arXiv:2204.00768, 2022

[53]

Taming transformers for high-resolution image synthesis,

P. Esser, R. Rombach, and B. Ommer, “Taming transformers for high-resolution image synthesis,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 12 873–12 883

[54]

T2m-gpt: Generating human motion from textual descriptions with discrete representations,

J. Zhang, Y. Zhang, X. Cun, S. Huang, Y. Zhang, H. Zhao, H. Lu, and X. Shen, “T2m-gpt: Generating human motion from textual descriptions with discrete representations,” arXiv preprint arXiv:2301.06052, 2023

[55]

Motiongpt: Human motion as a foreign language,

B. Jiang, X. Chen, W. Liu, J. Yu, G. Yu, and T. Chen, “Motiongpt: Human motion as a foreign language,” Advances in Neural Information Processing Systems, vol. 36, 2024

[56]

BERT: pre-training of deep bidirectional transformers for language understanding,

J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short P...

[57]

Visual autoregressive modeling: Scalable image generation via next-scale prediction,

K. Tian, Y. Jiang, Z. Yuan, B. Peng, and L. Wang, “Visual autoregressive modeling: Scalable image generation via next-scale prediction,” Advances in neural information processing systems, vol. 37, pp. 84 839–84 865, 2024

[58]

Multi-scale adaptive graph neural network for multivariate time series forecasting,

L. Chen, D. Chen, Z. Shang, B. Wu, C. Zheng, B. Wen, and W. Zhang, “Multi-scale adaptive graph neural network for multivariate time series forecasting,” IEEE Transactions on Knowledge and Data Engineering, vol. 35, no. 10, pp. 10 748–10 761, 2023

[59]

Pathformer: Multi-scale transformers with adaptive pathways for time series forecasting,

P. Chen, Y. Zhang, Y. Cheng, Y. Shu, Y. Wang, Q. Wen, B. Yang, and C. Guo, “Pathformer: Multi-scale transformers with adaptive pathways for time series forecasting,” arXiv preprint arXiv:2402.05956, 2024

[60]

Hierarchical quantized autoencoders,

W. Williams, S. Ringer, T. Ash, D. MacLeod, J. Dougherty, and J. Hughes, “Hierarchical quantized autoencoders,” Advances in Neural Information Processing Systems, vol. 33, pp. 4524–4535, 2020

[61]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017

[62]

Coding theorems for a discrete source with a fidelity criterion,

C. E. Shannon, “Coding theorems for a discrete source with a fidelity criterion,” IRE International Convention Record, vol. 7, no. 4, pp. 142–163, 1959

[63]

Visualizing data using t-sne

L. Van der Maaten and G. Hinton, “Visualizing data using t-sne.” Journal of machine learning research, vol. 9, no. 11, 2008

[64]

Decoupled Weight Decay Regularization

I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017

[65]

Informer: Beyond efficient transformer for long sequence time-series forecasting,

H. Zhou, S. Zhang, J. Peng, S. Zhang, J. Li, H. Xiong, and W. Zhang, “Informer: Beyond efficient transformer for long sequence time-series forecasting,” in Proceedings of the AAAI conference on artificial intelligence, vol. 35, no. 12, 2021, pp. 11 106–11 115

[66]

Data driven prediction models of energy use of appliances in a low-energy house,

L. M. Candanedo, V. Feldheim, and D. Deramaix, “Data driven prediction models of energy use of appliances in a low-energy house,” Energy and buildings, vol. 140, pp. 81–97, 2017

[67]

Mujoco: A physics engine for model-based control,

E. Todorov, T. Erez, and Y. Tassa, “Mujoco: A physics engine for model-based control,” in 2012 IEEE/RSJ international conference on intelligent robots and systems. IEEE, 2012, pp. 5026–5033

[68]

Latent ordinary differential equations for irregularly-sampled time series,

Y. Rubanova, R. T. Chen, and D. K. Duvenaud, “Latent ordinary differential equations for irregularly-sampled time series,” Advances in neural information processing systems, vol. 32, 2019

[69]

Ts2vec: Towards universal representation of time series,

Z. Yue, Y. Wang, J. Duan, T. Yang, C. Huang, Y. Tong, and B. Xu, “Ts2vec: Towards universal representation of time series,” in Proceedings of the AAAI conference on artificial intelligence, vol. 36, no. 8, 2022, pp. 8980–8987.

SUPPLEMENTARY MATERIALS FOR MSDFORMER

In the supplementary, we provide more implementation det...

[70]

The position of the body in 2D space is uniformly sampled from the interval [0, 0.5]

dataset is a multivariate physics simulation time series dataset, which collects a total of 10,000 simulations of the “Hopper” model from the DeepMind Control Suite and MuJoCo simulator. The position of the body in 2D space is uniformly sampled from the interval [0, 0.5]. The relative position of the limbs is sampled from the range [−2, 2], and initial veloci...

[71]

https://finance.yahoo.com/quote/GOOG/history/?p=GOOG

[72]

https://github.com/zhouhaoyi/ETDataset

[73]

https://archive.ics.uci.edu/dataset/374/appliances+energy+prediction

[74]

https://www.fmrib.ox.ac.uk/datasets/netsim/

[75]

https://github.com/jsyoon0823/TimeGAN

[76]

real”) from synthesized data (labeled as “synthetic

https://github.com/google-deepmind/dm_control

all, there are 10000 sequences of 100 regularly sampled time points with a feature dimension of 14.

Quantitative Metrics. To be specific, 1) Discriminative Score measures the distributional similarity between original and synthesized time series data. A binary classifier (e.g., RNN-based) is trained to distin- ...

[77]

time steps (all but the last step of each window), and the prediction target is also (T-1) steps, corresponding to a one-step-shifted forecasting objective. Thus, the predictive score does not evaluate long-horizon forecasting; it evaluates one-step-ahead prediction across the entire window, following TimeGAN’s post-hoc predictive score protocol. The pre...

[78]

3) Context-FID Score evaluates the distributional fidelity of synthesized time series by comparing contextual feature embeddings between original and synthetic datasets

https://github.com/Y-debug-sys/Diffusion-TS/tree/main

one) on synthetic data and reports the MAE on real data over all time steps in the generation window. 3) Context-FID Score evaluates the distributional fidelity of synthesized time series by comparing contextual feature embeddings between original and syn- theti...

[79]

0.006±.004 0.249±.000 0.003±.000

[80]

0.005±.004 0.249±.000 0.003±.000 [512,512] 0.005±.003 0.249±.000 0.003±.000 64


