pith. sign in

arxiv: 2606.09861 · v1 · pith:Q4F3D4ANnew · submitted 2026-05-31 · 💻 cs.LG · cs.AI

Time Series as Language: A Universal Tokenizer for General-Purpose Time Series Foundation Models

Pith reviewed 2026-06-28 17:54 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords time seriestokenizerfoundation modelnext token predictionzero-shot forecastingin-context learningvector quantization
0
0 comments X

The pith

A universal tokenizer converts continuous time series into discrete tokens so that an unmodified large language model can perform zero-shot forecasting, generation, and classification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces UniTok, a tokenizer that turns time series data into discrete tokens using a vector-quantized autoencoder. It then pretrains UniTok-FM, based on a standard LLM, using next-token prediction on groups of similar time series. This setup allows the model to handle multiple tasks like forecasting and classification in zero-shot or few-shot in-context ways without additional training. The approach aims to unify time series modeling under the language model paradigm by capturing shared dynamics across series.

Core claim

UniTok is a vector-quantized autoencoder with prefix normalization for scale stabilization, a progressive-resolution causal architecture, and a structure-preserving reconstruction loss. UniTok-FM uses an off-the-shelf LLM architecture pretrained via next-token prediction on context windows of multiple series with similar patterns, enabling it to support zero-shot and prompt-boosted forecasting as well as training-free in-context inference for generation and classification.

What carries the argument

UniTok, the universal tokenizer that is a vector-quantized autoencoder transforming time series into discrete tokens while preserving structure through specific normalization and loss terms.

If this is right

  • A single model can outperform statistical and supervised baselines on forecasting, generation, and classification tasks.
  • The model achieves competitive performance with task-specific foundation models.
  • Training-free in-context inference becomes possible across different time series tasks.
  • Pretraining on grouped similar series captures shared dynamics for general-purpose use.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the tokenizer works across domains, it could allow foundation models to handle mixed data types like text and time series in one model.
  • Extending the context window grouping to more diverse patterns might improve robustness to distribution shifts.
  • Testing on very long time series or high-frequency data could reveal limits of the progressive-resolution architecture.

Load-bearing premise

Pretraining an unmodified LLM via next-token prediction on context windows of multiple similar time series will capture enough shared dynamics to enable general-purpose performance across forecasting, generation, and classification without task-specific fine-tuning.

What would settle it

A benchmark where UniTok-FM fails to outperform simple statistical baselines in zero-shot forecasting on a held-out dataset with different patterns from the pretraining groups.

Figures

Figures reproduced from arXiv: 2606.09861 by Jiale Zheng, Jianfeng Zhang, Junchi Yan, Lujia Pan, Ruiying Qi, Yunhao Zhang.

Figure 1
Figure 1. Figure 1: (a) Incremental vs. non-incremental tokenization. Incremental tokenization makes prefix tokens independent of future observations, so appending data extends the token sequence, aligning with the NTP paradigm. Otherwise, incompatible tokens for a prefix and its extension limit generalization from long to short series. (b) Overview of UniTok. The raw TS is decomposed into scale statistics and a normalized se… view at source ↗
Figure 2
Figure 2. Figure 2: Progressive-Resolution Causal Autoen￾coder. Each block applies causal convolution and attention, allowing each latent vector to attend only to the past. At block s, the first 2 s − 1 vectors are preserved, while the remaining are downsam￾pled/upsampled, yielding a progressive-resolution architecture in which earlier tokens with smaller receptive fields receive finer representations. Progressive-Resolution … view at source ↗
Figure 3
Figure 3. Figure 3: Token arrangement for in-context NTP pretraining and training-free in-context inference. In pretraining, multiple series with similar patterns are concatenated into a context window. In zero-shot forecasting, lookback tokens condition AR generation of future tokens. In prompt￾boosted forecasting and few-shot generation, similar-pattern series are prepended as contextual prompts. In few-shot classification,… view at source ↗
Figure 4
Figure 4. Figure 4: All prompt examples (red) and sampled generations (blue) of UniTok-FM on Stocks. [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Scaling behavior across LLM backbone sizes. Qwen3 backbones of three sizes are evaluated: Small (14M), Medium (26M), and Base (129M). (a) Training loss. (b) Forecasting MASE. (c) Generation discriminative score. (d) Classification accuracy. 5.2 Main Results Zero-Shot&Prompt-Boosted Forecasting We evaluate forecasting on GIFT-Eval [1], which com￾prises 97 tasks with different datasets and prediction horizon… view at source ↗
Figure 6
Figure 6. Figure 6: Qualitative comparison between the full UniTok and ablated variants. (a) Zero-shot forecasting: full v.s. prefix normalization ablated. (b) Series reconstruction: full v.s. progressive￾resolution causal autoencoder ablated. (c) Series generation: full v.s. structure-preserving reconstruc￾tion loss ablated, using same prompts as [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Zero-shot forecasting efficiency com￾parison between Chronos and UniTok-FM on Jena Weather. (a) Inference time per instance w.r.t number of sampling trajectories. (b) Memory oc￾cupancy. (c) Forecasting performance (MASE). Inference Efficiency UniTok produces much shorter token sequences than point-wise bin￾ning of Chronos, resulting in improved LLM inference efficiency. We compare Chronos (Base) and UniTok… view at source ↗
read the original abstract

While Next-Token Prediction (NTP) has unified LLM pretraining, its adaptation to unbounded, continuous time series (TS) remains open. To bridge the gap, we introduce UniTok, a universal tokenizer that transforms TS into discrete tokens, and UniTok-FM, a foundation model pretrained via NTP on these tokens. UniTok-FM is a general-purpose foundation model that supports zero-shot and prompt-boosted forecasting, as well as few-shot generation and classification via training-free in-context inference--a capability not achieved by prior works. Technically, UniTok is a vector-quantized autoencoder incorporating prefix normalization for scale stabilization, a progressive-resolution causal architecture for encoding and decoding, and a structure-preserving reconstruction loss for training. UniTok-FM adopts an off-the-shelf LLM architecture without TS-specific modifications. Instead of pretraining on isolated TS, it performs NTP on context windows formed by multiple series with similar patterns, aiming to capture their shared dynamics. Experiments on forecasting, generation, and classification show that a single unified UniTok-FM consistently outperforms statistical and supervised baselines, achieves competitive performance with task-specific foundation models, and uniquely enables training-free in-context inference across tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper introduces UniTok, a VQ-VAE tokenizer for continuous time series that incorporates prefix normalization for scale stabilization, a progressive-resolution causal encoder/decoder architecture, and a structure-preserving reconstruction loss. It then pretrains UniTok-FM, an unmodified off-the-shelf LLM, via next-token prediction on context windows formed by grouping multiple time series that exhibit similar patterns. The central claim is that this yields a single general-purpose foundation model supporting zero-shot and prompt-boosted forecasting as well as training-free in-context few-shot generation and classification, with experiments showing consistent outperformance over statistical and supervised baselines and competitiveness with task-specific foundation models.

Significance. If the empirical results hold, the work would be significant for demonstrating that an unmodified LLM architecture, when paired with a carefully designed universal tokenizer and grouped-pattern pretraining, can deliver cross-task generalization including training-free in-context inference—a capability not previously achieved for time series. Credit is due for the tokenizer components that directly target scale invariance, causality, and structure preservation, and for the pretraining strategy that aims to capture shared dynamics without task-specific modifications.

minor comments (3)
  1. [Abstract] Abstract: performance claims are stated without any quantitative metrics, baseline names, or dataset identifiers, which reduces immediate readability even though the full experiments section presumably supplies them.
  2. [§3.2] §3.2: the structure-preserving loss is described in prose; adding an explicit equation would clarify its distinction from standard VQ-VAE reconstruction terms and aid reproducibility.
  3. [Experiments] Table captions and axis labels in the experimental figures should explicitly state the number of series, context lengths, and whether results are averaged over multiple random seeds.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our work, the recognition of its potential significance, and the recommendation for minor revision. We appreciate the credit given to the tokenizer design and the pretraining strategy.

Circularity Check

0 steps flagged

No significant circularity; empirical pipeline with no derivation chain

full rationale

The paper describes an empirical construction (VQ-VAE tokenizer with prefix norm, progressive causal encoder, structure-preserving loss, followed by unmodified LLM NTP on grouped similar-pattern windows) and reports experimental results on forecasting/generation/classification. No mathematical derivation, equations, or 'first-principles' claims are present in the provided text. All performance claims are benchmark-driven rather than derived from inputs by construction. No self-citations, fitted parameters renamed as predictions, or uniqueness theorems are invoked in a load-bearing way. The design choices are presented as direct engineering responses to stated requirements, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities beyond the high-level architectural choices; the tokenizer and foundation model are the primary contributions but rest on standard VQ-VAE and LLM assumptions not detailed here.

pith-pipeline@v0.9.1-grok · 5750 in / 1163 out tokens · 20934 ms · 2026-06-28T17:54:22.558976+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

65 extracted references · 14 canonical work pages · 6 internal anchors

  1. [1]

    Gift-eval: A benchmark for general time series forecasting model evaluation

    Taha Aksu, Gerald Woo, Juncheng Liu, Xu Liu, Chenghao Liu, Silvio Savarese, Caiming Xiong, and Doyen Sahoo. Gift-eval: A benchmark for general time series forecasting model evaluation. arXiv preprint arXiv:2410.10393, 2024

  2. [2]

    Chronos-2: From Univariate to Universal Forecasting

    Abdul Fatir Ansari, Oleksandr Shchur, Jaris Küken, Andreas Auer, Boran Han, Pedro Mercado, Syama Sundar Rangapuram, Huibin Shen, Lorenzo Stella, Xiyuan Zhang, Mononito Goswami, Shubham Kapoor, Danielle C. Maddix, Pablo Guerron, Tony Hu, Junming Yin, Nick Erickson, Prateek Mutalik Desai, Hao Wang, Huzefa Rangwala, George Karypis, Yuyang Wang, and Michael B...

  3. [3]

    Chronos: Learning the language of time series.Transactions on Machine Learning Research (TMLR), 2024

    Abdul Fatir Ansari, Lorenzo Stella, Caner Turkmen, Xiyuan Zhang, Pedro Mercado, Huibin Shen, Oleksandr Shchur, Syama Sundar Rangapuram, Sebastian Pineda Arango, Shubham Kapoor, et al. Chronos: Learning the language of time series.Transactions on Machine Learning Research (TMLR), 2024

  4. [4]

    Fast and accurate zero-shot forecasting with chronos-bolt and autogluon

    Abdul Fatir Ansari, Caner Turkmen, Oleksandr Shchur, and Lorenzo Stella. Fast and accurate zero-shot forecasting with chronos-bolt and autogluon. ht tp s: // aw s. am az on .c om /b lo gs /m ac hi ne -l ea rn in g/ fa st -a nd -a cc ur at e-z er o-s ho t-f or ec as ti ng -w it h-c hr on os -b ol t-a nd -a ut og lu on, 2024

  5. [5]

    Tirex: Zero-shot forecasting across long and short horizons with enhanced in-context learning

    Andreas Auer, Patrick Podest, Daniel Klotz, Sebastian Böck, Günter Klambauer, and Sepp Hochreiter. Tirex: Zero-shot forecasting across long and short horizons with enhanced in-context learning. InConference on Neural Information Processing Systems (NeurIPS), 2025

  6. [6]

    VisionTS: Visual masked autoencoders are free-lunch zero-shot time series forecasters

    Mouxiang Chen, Lefei Shen, Zhuo Li, Xiaoyun Joy Wang, Jianling Sun, and Chenghao Liu. VisionTS: Visual masked autoencoders are free-lunch zero-shot time series forecasters. In International Conference on Machine Learning (ICML), 2025. 10

  7. [7]

    Sdformer: Similarity-driven discrete transformer for time series generation

    Zhicheng Chen, FENG SHIBO, Zhong Zhang, Xi Xiao, Xingyu Gao, and Peilin Zhao. Sdformer: Similarity-driven discrete transformer for time series generation. InConference on Neural Information Processing Systems (NeurIPS), 2024

  8. [8]

    This time is different: An observability perspective on time series foundation models.arXiv preprint arXiv:2505.14766, 2025

    Ben Cohen, Emaad Khwaja, Youssef Doubli, Salahidine Lemaachi, Chris Lettieri, Charles Masson, Hugo Miccinilli, Elise Ramé, Qiqi Ren, Afshin Rostamizadeh, et al. This time is different: An observability perspective on time series foundation models.arXiv preprint arXiv:2505.14766, 2025

  9. [9]

    The ucr time series classification archive

    Hoang Anh Dau, Eamonn Keogh, Kaveh Kamgar, Chin-Chia Michael Yeh, Yan Zhu, Shaghayegh Gharghabi, Chotirat Ann Ratanamahatana, Yanping, Bing Hu, Nurjahan Begum, Anthony Bagnall, Abdullah Mueen, Gustavo Batista, and Hexagon-ML. The ucr time series classification archive. ht tp s: // ww w. cs .u cr .e du /~e am on n/ ti me _s er ie s_ da ta _2 01 8, 2018

  10. [10]

    A time series forest for classification and feature extraction.Information Sciences, 2013

    Houtao Deng, George Runger, Eugene Tuv, and Martyanov Vladimir. A time series forest for classification and feature extraction.Information Sciences, 2013

  11. [11]

    Timevae: A variational auto- encoder for multivariate time series generation.arXiv preprint arXiv:2111.08095,

    Abhyuday Desai, Cynthia Freeman, Zuhui Wang, and Ian Beaver. Timevae: A variational auto-encoder for multivariate time series generation.arXiv preprint arXiv:2111.08095, 2021

  12. [12]

    Ideal spatial adaptation by wavelet shrinkage

    David L Donoho and Iain M Johnstone. Ideal spatial adaptation by wavelet shrinkage. biometrika, 1994

  13. [13]

    Taming transformers for high-resolution image synthesis

    Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. InIEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR), 2021

  14. [14]

    Hdt: Hierarchical dis- crete transformer for multivariate time series forecasting

    Shibo Feng, Peilin Zhao, Liu Liu, Pengcheng Wu, and Zhiqi Shen. Hdt: Hierarchical dis- crete transformer for multivariate time series forecasting. InAAAI Conference on Artificial Intelligence (AAAI), 2025

  15. [15]

    Mantis: Lightweight calibrated foundation model for user-friendly time series classification.1st ICML Workshop on Foundation Models for Structured Data, 2025

    Vasilii Feofanov, Songkang Wen, Marius Alonso, Romain Ilbert, Hongbo Guo, Malik Tiomoko, Lujia Pan, Jianfeng Zhang, and Ievgen Redko. Mantis: Lightweight calibrated foundation model for user-friendly time series classification.1st ICML Workshop on Foundation Models for Structured Data, 2025

  16. [16]

    Units: A unified multi-task time series model

    Shanghua Gao, Teddy Koker, Owen Queen, Tom Hartvigsen, Theodoros Tsiligkaridis, and Marinka Zitnik. Units: A unified multi-task time series model. InConference on Neural Information Processing Systems (NeurIPS), 2024

  17. [17]

    Moment: a family of open time-series foundation models

    Mononito Goswami, Konrad Szafer, Arjun Choudhry, Yifu Cai, Shuo Li, and Artur Dubrawski. Moment: a family of open time-series foundation models. InInternational Conference on Machine Learning (ICML), 2024

  18. [18]

    Random dilated shapelet transform: A new approach for time series shapelets

    Antoine Guillaume, Christel Vrain, and Wael Elloumi. Random dilated shapelet transform: A new approach for time series shapelets. InInternational Conference on Pattern Recognition and Artificial Intelligence (ICPRAI), 2022

  19. [19]

    Look into the lite in deep learning for time series classification.International Journal of Data Science and Analytics, 2025

    Ali Ismail-Fawaz, Maxime Devanne, Stefano Berretti, Jonathan Weber, and Germain Forestier. Look into the lite in deep learning for time series classification.International Journal of Data Science and Analytics, 2025

  20. [20]

    Inceptiontime: Finding alexnet for time series classification.Data Mining and Knowledge Discovery, 2020

    Hassan Ismail Fawaz, Benjamin Lucas, Germain Forestier, Charlotte Pelletier, Daniel F Schmidt, Jonathan Weber, Geoffrey I Webb, Lhassane Idoumghar, Pierre-Alain Muller, and François Petitjean. Inceptiontime: Finding alexnet for time series classification.Data Mining and Knowledge Discovery, 2020

  21. [21]

    Jian Jia, Jingtong Gao, Ben Xue, Junhao Wang, Qingpeng Cai, Quan Chen, Xiangyu Zhao, Peng Jiang, and Kun Gai. From principles to applications: A comprehensive survey of discrete tokenizers in generation, comprehension, recommendation, and information retrieval.arXiv preprint arXiv:2502.12448, 2025. 11

  22. [22]

    Time-llm: Time series forecasting by repro- gramming large language models

    Ming Jin, Shiyu Wang, Lintao Ma, Zhixuan Chu, James Y Zhang, Xiaoming Shi, Pin-Yu Chen, Yuxuan Liang, Yuan-Fang Li, Shirui Pan, et al. Time-llm: Time series forecasting by repro- gramming large language models. InInternational Conference on Learning Representations (ICLR), 2024

  23. [23]

    Photo-realistic single image super-resolution using a generative adversarial network

    Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. InIEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR), 2017

  24. [24]

    Vector quantized time series generation with a bidirectional prior model

    Daesoo Lee, Sara Malacarne, and Erlend Aune. Vector quantized time series generation with a bidirectional prior model. InInternational Conference on Artificial Intelligence and Statistics (AISTATS), 2023

  25. [25]

    Autoregressive image generation using residual quantization

    Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, and Wook-Shin Han. Autoregressive image generation using residual quantization. InIEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR), 2022

  26. [26]

    Your diffusion model is secretly a zero-shot classifier

    Alexander Cong Li, Mihir Prabhudesai, Shivam Duggal, Ellis Langham Brown, and Deepak Pathak. Your diffusion model is secretly a zero-shot classifier. InICML 2023 Workshop on Structured Probabilistic Inference & Generative Modeling, 2023

  27. [27]

    Autoregressive image generation without vector quantization

    Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization. InConference on Neural Information Processing Systems (NeurIPS), 2024

  28. [28]

    itransformer: Inverted transformers are effective for time series forecasting

    Yong Liu, Tengge Hu, Haoran Zhang, Haixu Wu, Shiyu Wang, Lintao Ma, and Mingsheng Long. itransformer: Inverted transformers are effective for time series forecasting. InInternational Conference on Learning Representations (ICLR), 2024

  29. [29]

    Sundial: A family of highly capable time series foundation models

    Yong Liu, Guo Qin, Zhiyuan Shi, Zhi Chen, Caiyin Yang, Xiangdong Huang, Jianmin Wang, and Mingsheng Long. Sundial: A family of highly capable time series foundation models. In International Conference on Machine Learning (ICML), 2025

  30. [30]

    Timer: Generative pre-trained transformers are large time series models

    Yong Liu, Haoran Zhang, Chenyu Li, Xiangdong Huang, Jianmin Wang, and Mingsheng Long. Timer: Generative pre-trained transformers are large time series models. InInternational Conference on Machine Learning (ICML), 2024

  31. [31]

    Finite scalar quantization: VQ-V AE made simple

    Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen. Finite scalar quantization: VQ-V AE made simple. InInternational Conference on Learning Representations (ICLR), 2024

  32. [32]

    Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam

    Yuqi Nie, Nam H. Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. A time series is worth 64 words: Long-term forecasting with transformers. InInternational Conference on Learning Representations (ICLR), 2023

  33. [33]

    Language models are unsupervised multitask learners.OpenAI blog, 2019

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners.OpenAI blog, 2019

  34. [34]

    Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of Machine Learning Research (JMLR), 2020

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of Machine Learning Research (JMLR), 2020

  35. [35]

    Lag-llama: To- wards foundation models for time series forecasting

    Kashif Rasul, Arjun Ashok, Andrew Robert Williams, Arian Khorasani, George Adamopoulos, Rishika Bhagwatkar, Marin Biloš, Hena Ghonia, Nadhir Hassen, Anderson Schneider, Sahil Garg, Alexandre Drouin, Nicolas Chapados, Yuriy Nevmyvaka, and Irina Rish. Lag-llama: To- wards foundation models for time series forecasting. InNeurIPS Workshop R0-FoMo:Robustness o...

  36. [36]

    Generating diverse high-fidelity images with vq-vae-2

    Ali Razavi, Aaron Van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with vq-vae-2. InConference on Neural Information Processing Systems (NeurIPS), 2019. 12

  37. [37]

    Time-moe: Billion-scale time series foundation models with mixture of experts

    Xiaoming Shi, Shiyu Wang, Yuqi Nie, Dianqi Li, Zhou Ye, Qingsong Wen, and Ming Jin. Time-moe: Billion-scale time series foundation models with mixture of experts. InInternational Conference on Learning Representations (ICLR), 2025

  38. [38]

    Kronos: A foundation model for the language of financial markets.arXiv preprint arXiv:2508.02739, 2025

    Yu Shi, Zongliang Fu, Shuo Chen, Bohan Zhao, Wei Xu, Changshui Zhang, and Jian Li. Kronos: A foundation model for the language of financial markets.arXiv preprint arXiv:2508.02739, 2025

  39. [39]

    Xihe: Scalable zero-shot time series learner via hierarchical interleaved block attention.arXiv preprint arXiv:2510.21795, 2025

    Yinbo Sun, Yuchen Fang, Zhibo Zhu, Jia Li, Yu Liu, Qiwen Deng, Jun Zhou, Hang Yu, Xingyu Lu, and Lintao Ma. Xihe: Scalable zero-shot time series learner via hierarchical interleaved block attention.arXiv preprint arXiv:2510.21795, 2025

  40. [40]

    Totem: Tokenized time series embeddings for general time series analysis.Transactions on Machine Learning Research (TMLR), 2024

    Sabera Talukder, Yisong Yue, and Georgia Gkioxari. Totem: Tokenized time series embeddings for general time series analysis.Transactions on Machine Learning Research (TMLR), 2024

  41. [41]

    From Values to Tokens: An LLM-Driven Framework for Context-aware Time Series Forecasting via Symbolic Discretization

    Xiaoyu Tao, Shilong Zhang, Mingyue Cheng, Daoyu Wang, Tingyue Pan, Bokai Pan, Changqing Zhang, and Shijin Wang. From values to tokens: An llm-driven framework for context-aware time series forecasting via symbolic discretization.arXiv preprint arXiv:2508.09191, 2025

  42. [42]

    Gemma 2: Improving Open Language Models at a Practical Size

    Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size.arXiv preprint arXiv:2408.00118, 2024

  43. [43]

    Visual autoregressive modeling: Scalable image generation via next-scale prediction

    Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction. InConference on Neural Information Processing Systems (NeurIPS), 2024

  44. [44]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023

  45. [45]

    Conditional image generation with pixelcnn decoders

    Aaron Van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with pixelcnn decoders. InConference on Neural Information Processing Systems (NeurIPS), 2016

  46. [46]

    Neural discrete representation learning

    Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. InConference on Neural Information Processing Systems (NeurIPS), 2017

  47. [47]

    Output scaling: Yinglong- delayed chain of thought in a large pretrained time series forecasting model.arXiv preprint arXiv:2506.11029, 2025

    Xue Wang, Tian Zhou, Jinyang Gao, Bolin Ding, and Jingren Zhou. Output scaling: Yinglong- delayed chain of thought in a large pretrained time series forecasting model.arXiv preprint arXiv:2506.11029, 2025

  48. [48]

    Time series classification from scratch with deep neural networks: A strong baseline

    Zhiguang Wang, Weizhong Yan, and Tim Oates. Time series classification from scratch with deep neural networks: A strong baseline. InInternational Joint Conference on Neural Networks (IJCNN), 2017

  49. [49]

    Abstracted shapes as tokens-a generalizable and interpretable model for time-series classification

    Yunshi Wen, Tengfei Ma, Lily Weng, Lam Nguyen, and Anak Agung Julius. Abstracted shapes as tokens-a generalizable and interpretable model for time-series classification. InConference on Neural Information Processing Systems (NeurIPS), 2024

  50. [50]

    Unified training of universal time series forecasting transformers

    Gerald Woo, Chenghao Liu, Akshat Kumar, Caiming Xiong, Silvio Savarese, and Doyen Sahoo. Unified training of universal time series forecasting transformers. InInternational Conference on Machine Learning (ICML), 2024

  51. [51]

    Cot-gan: Generating sequential data via causal optimal transport

    Tianlin Xu, Li Kevin Wenliang, Michael Munn, and Beatrice Acciaio. Cot-gan: Generating sequential data via causal optimal transport. InConference on Neural Information Processing Systems (NeurIPS), 2020

  52. [52]

    FITS: Modeling time series with $10k$ parameters

    Zhijian Xu, Ailing Zeng, and Qiang Xu. FITS: Modeling time series with $10k$ parameters. In International Conference on Learning Representations (ICLR), 2024. 13

  53. [53]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  54. [54]

    Time-series generative adversarial networks

    Jinsung Yoon, Daniel Jarrett, and Mihaela Van der Schaar. Time-series generative adversarial networks. InConference on Neural Information Processing Systems (NeurIPS), 2019

  55. [55]

    Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

    Lijun Yu, José Lezama, Nitesh B Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh Birodkar, Agrim Gupta, Xiuye Gu, et al. Language model beats diffusion–tokenizer is key to visual generation.arXiv preprint arXiv:2310.05737, 2023

  56. [56]

    An image is worth 32 tokens for reconstruction and generation

    Qihang Yu, Mark Weber, Xueqing Deng, Xiaohui Shen, Daniel Cremers, and Liang-Chieh Chen. An image is worth 32 tokens for reconstruction and generation. InConference on Neural Information Processing Systems (NeurIPS), 2024

  57. [57]

    Diffusion-TS: Interpretable diffusion for general time series generation

    Xinyu Yuan and Yan Qiao. Diffusion-TS: Interpretable diffusion for general time series generation. InInternational Conference on Learning Representations (ICLR), 2024

  58. [58]

    Are transformers effective for time series forecasting? InAAAI Conference on Artificial Intelligence (AAAI), 2023

    Ailing Zeng, Muxi Chen, Lei Zhang, and Qiang Xu. Are transformers effective for time series forecasting? InAAAI Conference on Artificial Intelligence (AAAI), 2023

  59. [59]

    The unreason- able effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreason- able effectiveness of deep features as a perceptual metric. InIEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR), 2018

  60. [60]

    MMPD: Diverse time series forecasting via multi-mode patch diffusion loss

    Yunhao Zhang, Wenyao Hu, Jiale Zheng, Lujia Pan, and Junchi Yan. MMPD: Diverse time series forecasting via multi-mode patch diffusion loss. InInternational Conference on Learning Representations (ICLR), 2026

  61. [61]

    Crossformer: Transformer utilizing cross-dimension dependency for multivariate time series forecasting

    Yunhao Zhang and Junchi Yan. Crossformer: Transformer utilizing cross-dimension dependency for multivariate time series forecasting. InInternational Conference on Learning Representa- tions (ICLR), 2023. 14 A Design Details and Referenced Methods in UniTok A.1 Length Mapping Functions Accounting for the 4 special tokens, 2×8 scale statistic tokens and the...

  62. [62]

    GIFT-Pretrain[ 1]: This is large-scale corpus released alongside the GIFT-Eval benchmark. A strict split-checking procedure is applied to ensure that no test data from GIFT-Eval appears in the pretraining set, guaranteeing a fully zero-shot evaluation for TSFMs trained on it. The dataset consists of 88 sub-datasets spanning 7 domains and 13 sampling frequ...

  63. [63]

    Dataset/Frequency/Prediction Term

    Chronos-Dataset[ 3]: This is the dataset for training and evaluation of Chronos. The original dataset contains 67 subsets spanning 8 domains and is publicly available at https://huggingface. co/datasets/autogluon/chronos_datasets. These two datasets partially overlap. We carefully construct their union and avoid test-set leak- age by adding the following ...

  64. [64]

    Normalization by Seasonal-Naive:For each task i, the raw score si is normalized by the corre- sponding score of Seasonal-Naives (season) i : esi = si s(season) i (19) This normalization reflects the relative performance of the evaluated model compared to the Seasonal- Naive baseline

  65. [65]

    Dataset / Frequency / Prediction Term

    Geometric Mean Aggregation:The final aggregated score is computed as the geometric mean of the normalized scores across allN= 97tasks: sagg = ( NY i=1 esi)1/N (20) C.2 Few-Shot Generation DatasetsWe adopt four real-world datasets from Diffusion-TS [ 57] (i.e., Stocks, ETTh, Energy, fMRI) for generation evaluation. For each dataset, only the first channel ...