MSDformer: Multi-scale Discrete Transformer For Time Series Generation
Pith reviewed 2026-05-22 13:40 UTC · model grok-4.3
The pith
MSDformer generates higher-quality time series by tokenizing data at multiple scales and modeling those patterns autoregressively in discrete space.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MSDformer employs a multi-scale time series tokenizer to learn discrete token representations at multiple scales, which jointly characterize the complex nature of time series data. It then applies multi-scale autoregressive token modeling to capture multi-scale patterns within the discrete latent space. The paper validates the effectiveness of Discrete Token Modeling (DTM) and the rationality of MSDformer through the rate-distortion theorem, and reports comprehensive experiments in which MSDformer significantly outperforms state-of-the-art methods.
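The autoregressive step can be written down as a chain-rule factorization over the tokens of all scales. The sketch below is an illustration only: the symbols ($z^{(s)}_t$ for the token at position $t$ of scale $s$, $S$ scales, $T_s$ tokens at scale $s$) and the coarse-to-fine ordering are assumptions, since this summary does not state the paper's exact factorization.

```latex
p\bigl(z^{(1)}, \dots, z^{(S)}\bigr)
  = \prod_{s=1}^{S} \prod_{t=1}^{T_s}
    p\bigl(z^{(s)}_t \,\big|\, z^{(s)}_{<t},\; z^{(<s)}\bigr)
```

Here $z^{(s)}_{<t}$ stands for the earlier tokens at the same scale and $z^{(<s)}$ for all tokens at the scales that precede $s$ in the chosen ordering. Any ordering of scales yields a valid factorization; which one works best is an empirical question.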
What carries the argument
The multi-scale time series tokenizer paired with multi-scale autoregressive token modeling inside the discrete latent space.
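To make the tokenizer idea concrete, here is a minimal sketch of turning one series into token sequences at several resolutions. It is not the authors' implementation: it assumes average pooling for downsampling, scalar nearest-neighbour lookup in place of learned vector quantization over encoder features, and fixed evenly spaced codebooks. The names `average_pool`, `quantize`, `multi_scale_tokenize`, and the pooling factors are hypothetical.

```python
import numpy as np

def average_pool(series: np.ndarray, factor: int) -> np.ndarray:
    """Downsample a 1-D series by averaging non-overlapping windows."""
    usable = (len(series) // factor) * factor  # drop the ragged tail, if any
    return series[:usable].reshape(-1, factor).mean(axis=1)

def quantize(values: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Return the index of the nearest codebook entry for each value."""
    distances = np.abs(values[:, None] - codebook[None, :])
    return distances.argmin(axis=1)

def multi_scale_tokenize(series, factors, codebooks):
    """Tokenize one series at several temporal resolutions."""
    return [quantize(average_pool(series, f), cb)
            for f, cb in zip(factors, codebooks)]

# Toy input: a slow sine plus a small fast component.
t = np.linspace(0.0, 1.0, 64)
series = np.sin(2 * np.pi * t) + 0.1 * np.sin(2 * np.pi * 16 * t)

factors = [8, 2]  # hypothetical scales: coarse first, then fine
codebooks = [np.linspace(-1.5, 1.5, 16) for _ in factors]  # fixed, not learned
tokens = multi_scale_tokenize(series, factors, codebooks)

for f, tok in zip(factors, tokens):
    print(f"pooling factor {f}: {len(tok)} tokens")
```

The coarse sequence is short and mostly tracks the slow component, while the fine sequence is longer and retains more of the fast one. A learned tokenizer would replace both the pooling and the fixed codebooks.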
If this is right
- Incorporating multi-scale information substantially enhances the quality of generated time series in DTM-based approaches.
- Modeling multi-scale patterns within the discrete latent space improves generation performance.
- The rate-distortion theorem supplies a theoretical foundation that justifies the multi-scale design choices.
- Both theoretical analysis and experimental results together support the use of multi-scale structures for time series generation.
Where Pith is reading between the lines
- The same multi-scale tokenization idea could be tested on other sequential modalities such as audio waveforms or sensor streams that also exhibit structure at several resolutions.
- Allowing the model to select or weight scales dynamically during training might reduce unnecessary computation while preserving gains.
- The interaction between the number of quantization levels and the choice of scales remains open for further empirical mapping.
Load-bearing premise
The rate-distortion theorem directly validates the specific design choices of the multi-scale tokenizer and autoregressive modeling without additional assumptions about how the theorem maps onto the discrete latent space of time series.
What would settle it
An ablation that replaces the multi-scale tokenizer and the multi-scale autoregressive components with single-scale equivalents. If generation quality does not drop on the same benchmarks, the central claim is falsified.
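The shape of such an ablation can be sketched on toy data: hold the input fixed, swap a two-scale quantizer for a single-scale one, and report bits spent alongside an error metric. Everything here is an assumption made for illustration. It uses uniform scalar quantizers instead of learned codebooks, reconstruction MSE as a stand-in for generation quality, and hypothetical function names and settings. It does not reproduce the paper's experiments.

```python
import numpy as np

def uniform_quantize(x, levels, lo, hi):
    """Snap each value to the nearest of `levels` evenly spaced points in [lo, hi]."""
    grid = np.linspace(lo, hi, levels)
    idx = np.abs(x[:, None] - grid[None, :]).argmin(axis=1)
    return grid[idx]

def single_scale(x, levels):
    """One token per time step. Bits ignore side information such as the range."""
    recon = uniform_quantize(x, levels, x.min(), x.max())
    return recon, len(x) * np.log2(levels)

def two_scale(x, factor, coarse_levels, fine_levels):
    """Coarse tokens for window means plus fine tokens for the residual."""
    means = x.reshape(-1, factor).mean(axis=1)
    coarse_q = uniform_quantize(means, coarse_levels, means.min(), means.max())
    coarse = np.repeat(coarse_q, factor)  # hold each coarse value over its window
    residual = x - coarse
    fine = uniform_quantize(residual, fine_levels, residual.min(), residual.max())
    bits = len(means) * np.log2(coarse_levels) + len(x) * np.log2(fine_levels)
    return coarse + fine, bits

def mse(a, b):
    return float(np.mean((a - b) ** 2))

rng = np.random.default_rng(0)
t = np.arange(256)
x = np.sin(2 * np.pi * t / 128) + 0.05 * rng.standard_normal(256)

recon_single, bits_single = single_scale(x, levels=16)
recon_multi, bits_multi = two_scale(x, factor=8, coarse_levels=16, fine_levels=8)
results = {
    "single": (bits_single, mse(x, recon_single)),
    "multi": (bits_multi, mse(x, recon_multi)),
}
for name, (bits, err) in results.items():
    print(f"{name}: {bits:.0f} bits, MSE {err:.6f}")
```

With these settings the single-scale variant spends 1024 bits and the two-scale variant 896, so the comparison is made at a slightly lower rate for the multi-scale side. Whichever error comes out lower on this toy says nothing about the paper's benchmarks; the real test needs the trained models and the paper's own generation metrics.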
Original abstract
Discrete Token Modeling (DTM), which employs vector quantization techniques, has demonstrated remarkable success in modeling non-natural language modalities, particularly in time series generation. While our prior work SDformer established the first DTM-based framework to achieve state-of-the-art performance in this domain, two critical limitations persist in existing DTM approaches: 1) their inability to capture multi-scale temporal patterns inherent to complex time series data, and 2) the absence of theoretical foundations to guide model optimization. To address these challenges, we proposes a novel multi-scale DTM-based time series generation method, called Multi-Scale Discrete Transformer (MSDformer). MSDformer employs a multi-scale time series tokenizer to learn discrete token representations at multiple scales, which jointly characterize the complex nature of time series data. Subsequently, MSDformer applies a multi-scale autoregressive token modeling technique to capture the multi-scale patterns of time series within the discrete latent space. Theoretically, we validate the effectiveness of the DTM method and the rationality of MSDformer through the rate-distortion theorem. Comprehensive experiments demonstrate that MSDformer significantly outperforms state-of-the-art methods. Both theoretical analysis and experimental results demonstrate that incorporating multi-scale information and modeling multi-scale patterns can substantially enhance the quality of generated time series in DTM-based approaches. Code is available at this repository:https://github.com/kkking-kk/MSDformer.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MSDformer, a multi-scale discrete transformer for time series generation. It addresses limitations in existing DTM approaches by proposing a multi-scale time series tokenizer that learns discrete token representations at multiple scales and a multi-scale autoregressive token modeling technique to capture multi-scale patterns in the discrete latent space. The work claims theoretical validation of both the DTM framework and the specific MSDformer design choices via the rate-distortion theorem, along with experimental results showing significant outperformance over state-of-the-art methods in time series generation.
Significance. If the central claims hold, this work would advance discrete token modeling for time series by providing a principled way to incorporate multi-scale temporal structure, potentially improving generation quality in domains with complex, hierarchical patterns. The combination of multi-scale discretization and autoregressive modeling, if rigorously linked to rate-distortion bounds, could offer a template for future DTM architectures beyond time series.
major comments (2)
- [Abstract and §3] Abstract and §3 (Theoretical Analysis): The claim that the rate-distortion theorem validates both the DTM method and the specific multi-scale tokenizer plus multi-scale AR modeling choices lacks a derivation. Rate-distortion theory provides the minimal rate R(D) for a given distortion but is agnostic to discretization granularity, scale hierarchy, and autoregressive factorization; no section derives how the proposed multi-scale VQ objective or joint token prediction reduces R(D) relative to single-scale baselines.
- [§4] §4 (Experiments): The abstract asserts both experimental outperformance and rate-distortion validation, yet the provided summary supplies no quantitative results, error bars, ablation details, or explicit mapping from the theorem to observed metrics. Without these, the central claim that multi-scale information 'substantially enhance[s] the quality' rests on unshown evidence.
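For reference, the quantity at issue in the first major comment is the standard rate-distortion function. The notation below (source $X$, reconstruction $\hat{X}$, distortion measure $d$, variance $\sigma^2$) is the textbook one and is not taken from the paper.

```latex
R(D) = \min_{p(\hat{x} \mid x)\,:\; \mathbb{E}[d(X, \hat{X})] \le D} I(X; \hat{X})

% Worked case: memoryless Gaussian source with variance \sigma^2, squared-error distortion
R(D) = \max\Bigl\{ 0,\; \tfrac{1}{2} \log_2 \frac{\sigma^2}{D} \Bigr\}
```

The minimization ranges over all conditional distributions that meet the distortion constraint. Nothing in the definition refers to codebook size, number of scales, or autoregressive factorization, which is why a claim that a particular multi-scale design is validated by the theorem needs an explicit bridging derivation.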
minor comments (2)
- [Abstract] Abstract: 'we proposes' should be 'we propose'.
- [Abstract] Abstract: The GitHub link is given as 'this repository:https://github.com/kkking-kk/MSDformer' without proper formatting or description of what is released (e.g., code for tokenizer, training scripts, or evaluation).
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We address each major point below and indicate the revisions we will make to strengthen the manuscript.
-
Referee: [Abstract and §3] Abstract and §3 (Theoretical Analysis): The claim that the rate-distortion theorem validates both the DTM method and the specific multi-scale tokenizer plus multi-scale AR modeling choices lacks a derivation. Rate-distortion theory provides the minimal rate R(D) for a given distortion but is agnostic to discretization granularity, scale hierarchy, and autoregressive factorization; no section derives how the proposed multi-scale VQ objective or joint token prediction reduces R(D) relative to single-scale baselines.
Authors: We appreciate the referee identifying this gap in the theoretical presentation. Section 3 invokes the rate-distortion theorem to argue that multi-scale discretization permits a more efficient representation of hierarchical temporal dependencies, thereby supporting a favorable rate-distortion operating point that single-scale DTM cannot achieve as readily. We acknowledge, however, that the current text does not contain an explicit derivation or inequality showing how the multi-scale VQ loss and joint autoregressive prediction strictly lower the achievable R(D) relative to single-scale baselines. In the revised manuscript we will expand §3 with a concise derivation sketch that relates the multi-scale codebook objective to a tighter upper bound on distortion for a fixed rate, using the standard rate-distortion function as reference. revision: yes
-
Referee: [§4] §4 (Experiments): The abstract asserts both experimental outperformance and rate-distortion validation, yet the provided summary supplies no quantitative results, error bars, ablation details, or explicit mapping from the theorem to observed metrics. Without these, the central claim that multi-scale information 'substantially enhance[s] the quality' rests on unshown evidence.
Authors: The full manuscript already contains the requested elements in §4: quantitative tables comparing MSDformer against prior DTM and non-DTM baselines on standard time-series generation benchmarks, error bars computed over multiple random seeds, and ablation studies that isolate the contribution of the multi-scale tokenizer and multi-scale autoregressive modeling. To directly address the referee’s concern we will add a short subsection that explicitly links the rate-distortion arguments of §3 to the observed metric improvements (e.g., lower FID and higher precision at comparable or lower token rates). revision: yes
Circularity Check
No significant circularity; rate-distortion theorem is external and derivation remains independent
The paper invokes the rate-distortion theorem as external theoretical validation for DTM effectiveness and MSDformer rationality. This is a standard information-theoretic result (Shannon) independent of the authors' prior work or current data. Self-citation to SDformer only establishes the baseline DTM framework; the multi-scale tokenizer, AR modeling, and performance claims are supported by new experiments rather than reducing to that citation or to fitted parameters renamed as predictions. No equations or steps equate the claimed theoretical validation to the paper's own inputs by construction. The derivation chain is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The rate-distortion theorem can be used to validate the effectiveness and rationality of DTM and MSDformer.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel [unclear]
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "Theoretically, we validate the effectiveness of the DTM method and the rationality of MSDformer through the rate-distortion theorem... under the same overall rate budget, allocating capacity across scales can reduce distortion more effectively than increasing the codebook size in a single-scale model."
- IndisputableMonolith/Foundation/DimensionForcing.lean: reality_from_one_distinction [unclear]
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "multi-scale time series tokenizer to learn discrete token representations at multiple scales... multi-scale autoregressive token modeling"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.