Chronicle: A Multimodal Foundation Model for Joint Language and Time Series Understanding

Jeremy Levasseur; Paul Quinlan; Qingguo Li; Xiaodan Zhu

arXiv: 2605.20268 · v1 · pith:MPKQHTAUnew · submitted 2026-05-18 · 💻 cs.LG · cs.AI · cs.CL


Pith reviewed 2026-05-21 07:36 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL

keywords: multimodal foundation model · time series · joint pretraining · decoder-only transformer · language understanding · frozen embeddings · multimodal forecasting


The pith

A compact decoder-only transformer trained jointly on text and time series from scratch matches a dedicated language model on NLU tasks while topping time series classification and multimodal forecasting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Chronicle, a 324M-parameter model that uses the same transformer blocks, attention, and residual stream for both natural language and time series. It trains mostly on unimodal batches so that cross-modal understanding arises from parameter sharing alone, followed by a brief alignment stage. A sympathetic reader would care because this setup questions whether separate foundation models or post-hoc adaptations are necessary, and shows that one backbone can compete with the strongest unimodal models in each domain while also handling mixed text-and-numbers tasks.

Core claim

Chronicle is, to the authors' knowledge, the first model jointly pretrained on text and time series from scratch in a unified architecture. It matches Gemma-3-270M-PT on 19 NLU tasks, sets a new bar for frozen-embedding time series classification on 24 UCR/UEA datasets, and produces multimodal forecasts on Time-MMD that beat every supervised fusion baseline, all from a single backbone.

What carries the argument

Shared transformer blocks, attention mechanism, and residual stream across language and time series, with the bulk of pretraining on unimodal batches and only a short alignment stage for interleaving modalities.
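The schedule described above can be sketched as a simple batch plan. This is only an illustration under stated assumptions, not the authors' training code: the function name `batch_plan`, the step counts, and the strict text/time-series alternation in the first stage are hypothetical, since this summary gives no proportions.

```python
from itertools import cycle, islice


def batch_plan(stage1_steps, stage2_steps):
    """Return the kind of batch used at each step of a two-stage schedule.

    Stage 1: every batch is unimodal (text only or time series only).
    Stage 2: every batch interleaves both modalities in one sequence.
    The strict text/ts alternation in stage 1 is an assumption.
    """
    unimodal = list(islice(cycle(["text", "ts"]), stage1_steps))
    aligned = ["interleaved"] * stage2_steps
    return unimodal + aligned


# Hypothetical step counts: a long unimodal stage, then a short alignment stage.
plan = batch_plan(stage1_steps=8, stage2_steps=2)
print(plan)
```

The point of the sketch is the ordering: no step in the first stage ever sees both modalities, so any cross-modal ability present before the alignment stage would have to come from the shared parameters.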

If this is right

Joint pretraining on shared parameters produces a single backbone competitive with the strongest unimodal foundation models in both language and time series.
Frozen embeddings from the joint model achieve state-of-the-art results on time series classification without task-specific fine-tuning.
Multimodal forecasting improves when modalities are trained together rather than fused after separate pretraining.
Evaluation against dedicated models in each domain, rather than only multimodal baselines, is required to establish the value of joint training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same shared-parameter approach could be tested on additional modalities such as images or audio that also contain temporal structure.
Varying the length or schedule of the alignment stage might reveal how much interleaving is needed for stronger cross-modal transfer.
Domains that naturally pair text reports with numerical series, such as finance or sensor networks, could adopt this single-model pattern instead of maintaining separate pipelines.

Load-bearing premise

Cross-modal capability emerges from shared parameters when the bulk of pretraining uses unimodal batches, with only a short alignment stage needed to interleave the modalities.

What would settle it

A controlled experiment that trains an identical architecture on time series data alone and compares its frozen-embedding classification accuracy on the 24 UCR/UEA datasets against Chronicle's reported results.
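One way to run the frozen-embedding side of such a comparison is sketched below. It assumes embeddings have already been extracted from a frozen backbone and fits a nearest-centroid classifier on top. The classifier choice, the function names, and the toy vectors are hypothetical stand-ins, because this summary does not say which probe the paper uses.

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def fit_centroids(embeddings, labels):
    """Group frozen embeddings by class and store one centroid per class."""
    by_class = {}
    for vec, lab in zip(embeddings, labels):
        by_class.setdefault(lab, []).append(vec)
    return {lab: centroid(vecs) for lab, vecs in by_class.items()}


def predict(centroids, vec):
    """Return the label whose centroid is closest in squared distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda lab: sq_dist(centroids[lab], vec))


def accuracy(centroids, embeddings, labels):
    hits = sum(predict(centroids, v) == y for v, y in zip(embeddings, labels))
    return hits / len(labels)


# Toy 2-d "embeddings"; real ones would come from the frozen backbone.
train_x = [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.1]]
train_y = [0, 0, 1, 1]
test_x = [[0.1, 0.1], [1.0, 1.0]]
test_y = [0, 1]

probe = fit_centroids(train_x, train_y)
acc = accuracy(probe, test_x, test_y)
print(acc)
```

Running the same probe on embeddings from a time-series-only model and from the joint model, on identical splits, would give the two accuracy figures the proposed experiment compares.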

Figures

Figures reproduced from arXiv: 2605.20268 by Jeremy Levasseur, Paul Quinlan, Qingguo Li, Xiaodan Zhu.

**Figure 1.** The Chronicle architecture. Text tokens and time series patches share a 16-layer decoder-only transformer; modality-specific components are limited to the input and output interfaces. Modality-specific output heads produce quantile forecasts (L_QL) and next-token predictions (L_CE); the same backbone produces frozen embeddings for downstream classification.
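As a rough illustration of the quantile objective mentioned in the caption, the sketch below computes the standard pinball loss for a single quantile level. It assumes the caption's quantile-forecast loss is of this family; the function name and the toy values are hypothetical, and the paper's exact loss and quantile levels are not given here.

```python
def pinball_loss(y_true, y_pred, q):
    """Mean pinball (quantile) loss for one quantile level q in (0, 1)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        # Under-prediction (diff > 0) costs q * diff;
        # over-prediction (diff < 0) costs (1 - q) * |diff|.
        total += max(q * diff, (q - 1.0) * diff)
    return total / len(y_true)


# Toy targets and a forecast that under-predicts both points by 1.0.
targets = [10.0, 12.0]
forecast = [9.0, 11.0]

high = pinball_loss(targets, forecast, q=0.9)
low = pinball_loss(targets, forecast, q=0.1)
print(high, low)
```

The same under-prediction is penalized more heavily at the 0.9 level than at the 0.1 level, which is what pushes each output toward its own quantile.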

**Figure 2.** GIFT-Eval leaderboard (97 tasks; lower is better). MASE (left) and CRPS (right) for comparative models, plus Chronicle Stage 1 and Stage 2 (highlighted). Stage 1 is the stronger pure forecaster, while Stage 2 is the aligned checkpoint used for multimodal transfer.
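For readers unfamiliar with the left-panel metric, MASE divides a forecast's mean absolute error by the in-sample error of a naive forecast. A minimal sketch follows, assuming a seasonal-naive scale with period `m`. The period used for each GIFT-Eval task is not stated here, and the numbers are toy values rather than results from the paper.

```python
def mase(train, actual, forecast, m=1):
    """Mean absolute scaled error with a seasonal-naive scale of period m."""
    # In-sample error of the naive forecast "repeat the value from m steps ago".
    naive_errors = [abs(train[i] - train[i - m]) for i in range(m, len(train))]
    scale = sum(naive_errors) / len(naive_errors)
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    return mae / scale


history = [1.0, 2.0, 4.0, 7.0]   # naive one-step errors: 1, 2, 3 -> scale 2.0
future = [8.0, 10.0]
prediction = [9.0, 9.0]          # absolute errors: 1, 1 -> MAE 1.0

score = mase(history, future, prediction)
print(score)  # 0.5
```

A value below 1 means the forecast beats the naive baseline on this scale, which is why lower is better on the leaderboard.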

**Figure 3.** Effect of TS-token repetition on multimodal classification. Accuracy (left), AUC (middle), and macro-F1 (right) as a function of TS-token repeats r, evaluated on the three TimeCAP domains and averaged (dashed black). Repetition rebalances the text–TS token ratio in the shared sequence; performance peaks near r=64, then degrades as attention dilutes across identical copies.
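The rebalancing effect described in the caption can be illustrated with a small sketch. It assumes each time series token is repeated `r` times in place and that text precedes the series in the shared sequence. Both choices, the function names, and the token counts are hypothetical, since the caption does not specify how the copies are laid out.

```python
def build_sequence(text_tokens, ts_tokens, r):
    """Concatenate text tokens with each TS token repeated r times."""
    repeated = [tok for tok in ts_tokens for _ in range(r)]
    return text_tokens + repeated


def ts_fraction(text_tokens, ts_tokens, r):
    """Share of the shared sequence occupied by TS tokens."""
    seq = build_sequence(text_tokens, ts_tokens, r)
    return (len(ts_tokens) * r) / len(seq)


# Toy case: a 128-token text prompt paired with a series of only 2 patches.
text = [f"w{i}" for i in range(128)]
series = ["p0", "p1"]

for r in (1, 64):
    print(r, ts_fraction(text, series, r))
```

With these toy counts, the series occupies under 2% of the sequence at r=1 and exactly half at r=64. Larger r only adds identical copies, which matches the caption's explanation of why accuracy eventually degrades.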

**Figure 4.** Effect of channel-aware multivariate handling on UEA classification. Bars show the per-dataset delta between joint multivariate handling and mean-channel pooling (joint minus mean) for accuracy, macro-F1, and AUC. Positive values favor joint channel-aware handling. Averaged across the 10 multivariate UEA datasets, joint handling improves accuracy by +0.039, macro-F1 by +0.035, and AUC by +0.020.

Original abstract

Real-world time series come with text: metadata, descriptions, news, reports. Yet time series foundation models process numerical sequences in isolation, and the multimodal text-and-time-series models that attempt to bridge the two all adapt a pretrained language model post hoc, inheriting representations shaped without ever seeing temporal data. These models are also evaluated almost exclusively against other multimodal baselines, not against the strongest unimodal foundation models in either domain, leaving open whether joint training is needed at all. We present Chronicle, a compact 324M-parameter decoder-only transformer trained from scratch on natural language and time series within a single unified architecture. Both modalities share the same transformer blocks, attention mechanism, and residual stream; the bulk of pretraining uses unimodal batches so cross-modal capability emerges purely from shared parameters, with a short alignment stage that interleaves the two. To our knowledge, Chronicle is the first model jointly pretrained on text and time series from scratch, and the first multimodal model evaluated against dedicated foundation models in both domains. It matches Gemma-3-270M-PT on 19 NLU tasks, sets a new bar for frozen-embedding time series classification on 24 UCR/UEA datasets, and produces multimodal forecasts on Time-MMD that beat every supervised fusion baseline, all from a single backbone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Chronicle's from-scratch joint pretraining on text and time series in one decoder-only model is the real novelty, but the abstract's lack of ablations makes the cross-modal emergence claim hard to evaluate.


The main takeaway is that Chronicle trains a 324M-parameter decoder-only transformer from scratch on both natural language and time series, using mostly unimodal batches plus a short alignment stage. It then shows that the same backbone can match a similar-sized Gemma model on language tasks while also advancing time series classification and multimodal forecasting. This setup and the direct comparison to strong unimodal models are new relative to prior post-hoc adaptation work.

The paper does a solid job framing the problem: real-world time series usually arrive with text metadata, yet existing models either ignore one modality or inherit mismatched representations. The reported results (holding even with Gemma-3-270M-PT on 19 NLU tasks, raising the bar on frozen-embedding classification across 24 UCR/UEA datasets, and beating supervised fusion baselines on Time-MMD) suggest the shared architecture can deliver practical value without separate models.

The soft spot is the missing support for how the cross-modal capability actually arises. The abstract gives no data sources, hyperparameter details, statistical tests, or ablations that test whether the short alignment stage is responsible or whether scale, tokenizer choices, or the 324M architecture alone explain the gains. Without a comparison to a fully interleaved baseline or to models with modality-specific components, it remains unclear whether the claimed mechanism holds or whether the numbers could be replicated by simpler means.

This paper is for researchers working on compact multimodal systems for domains like finance, healthcare, or monitoring where text and temporal signals mix. A reader looking for unified backbones would find the evaluation framing useful even if the current evidence is preliminary. It deserves a serious referee because the core idea targets a genuine gap, though revisions would need to add the missing controls and reproducibility details to make the claims stick.

Referee Report

2 major / 0 minor

Summary. The paper introduces Chronicle, a 324M-parameter decoder-only transformer pretrained from scratch on natural language and time series within a single unified architecture. Both modalities share the same transformer blocks, attention, and residual stream; the bulk of pretraining uses unimodal batches, with a short alignment stage to interleave modalities. The model is claimed to match Gemma-3-270M-PT on 19 NLU tasks, set a new state-of-the-art for frozen-embedding time series classification on 24 UCR/UEA datasets, and outperform all supervised fusion baselines on multimodal forecasting in Time-MMD, positioning it as the first jointly pretrained from-scratch multimodal model evaluated against dedicated unimodal foundation models in both domains.

Significance. If the empirical results hold under rigorous verification, the work would be significant as the first demonstration of a compact from-scratch multimodal foundation model for text and time series that competes with strong unimodal models without post-hoc adaptation. It would provide evidence that shared parameters can enable cross-modal capability with minimal interleaved data, potentially reducing the need for separate modality-specific pretraining pipelines and offering a practical backbone for real-world applications involving metadata, reports, and temporal data.

major comments (2)

Abstract: The central claim that cross-modal capability emerges purely from shared parameters when the bulk of pretraining uses unimodal batches plus a short alignment stage lacks supporting ablation evidence. No direct comparison is described to a fully interleaved pretraining baseline or to a model using modality-specific adapters, so it remains unclear whether the reported gains on Time-MMD multimodal forecasting and competitive NLU/TS performance arise from the claimed emergence mechanism rather than data scale, tokenizer design, or the 324M decoder-only architecture.
Abstract and results sections: Strong performance claims (matching Gemma-3-270M-PT on 19 NLU tasks, new bar on 24 UCR/UEA datasets, beating supervised fusion baselines on Time-MMD) are stated without details on training data sources, exact hyperparameters, statistical significance tests, or ablation studies, rendering it impossible to verify whether the results support the soundness of the central claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major comment below and have revised the manuscript to incorporate additional evidence and details where this strengthens the presentation of our claims.


Referee: Abstract: The central claim that cross-modal capability emerges purely from shared parameters when the bulk of pretraining uses unimodal batches plus a short alignment stage lacks supporting ablation evidence. No direct comparison is described to a fully interleaved pretraining baseline or to a model using modality-specific adapters, so it remains unclear whether the reported gains on Time-MMD multimodal forecasting and competitive NLU/TS performance arise from the claimed emergence mechanism rather than data scale, tokenizer design, or the 324M decoder-only architecture.

Authors: We agree that direct ablation evidence would more rigorously isolate the contribution of shared parameters under predominantly unimodal pretraining. In the revised manuscript we have added a new Ablation Studies subsection that reports results from a controlled comparison (at reduced scale) between our unimodal-batch-plus-short-alignment schedule and a fully interleaved pretraining baseline using the same data mixture and tokenizer. We also include a brief discussion of the design choice against modality-specific adapters, noting that adapters would reintroduce separate modality pipelines and defeat the goal of a single unified backbone. While a full-scale 324M adapter ablation remains computationally prohibitive, the smaller-scale results and architectural analysis support that the observed cross-modal gains are not solely attributable to data scale or architecture size. revision: yes
Referee: Abstract and results sections: Strong performance claims (matching Gemma-3-270M-PT on 19 NLU tasks, new bar on 24 UCR/UEA datasets, beating supervised fusion baselines on Time-MMD) are stated without details on training data sources, exact hyperparameters, statistical significance tests, or ablation studies, rendering it impossible to verify whether the results support the soundness of the central claims.

Authors: We acknowledge that the original submission omitted several implementation details required for full reproducibility and verification. The revised manuscript now contains an expanded Experimental Setup section that specifies the exact language and time-series pretraining corpora (including sizes and sources), a comprehensive hyperparameter table in the appendix, and statistical significance testing (paired t-tests with p-values) for all headline comparisons against Gemma-3-270M-PT, the UCR/UEA baselines, and the Time-MMD fusion models. The new Ablation Studies subsection further addresses the request for supporting analyses. revision: yes
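The paired t-tests mentioned in this response can be sketched as follows. This is a minimal illustration, not the authors' analysis: it computes only the t statistic and degrees of freedom from per-task score pairs, and the scores are toy values. The Python standard library has no Student-t CDF, so the p-value would need a statistics package or a table.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired samples of equal length."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # stdev() is the sample standard deviation (n - 1 in the denominator).
    standard_error = stdev(diffs) / math.sqrt(n)
    return mean(diffs) / standard_error, n - 1


# Toy per-task scores for two hypothetical models on the same four tasks.
model_a = [3.0, 5.0, 7.0, 9.0]
model_b = [1.0, 2.0, 6.0, 7.0]

t_stat, dof = paired_t_statistic(model_a, model_b)
print(t_stat, dof)
```

Pairing by task matters because benchmark difficulty varies far more across tasks than the gap between two similar-sized models does. The test therefore looks at per-task differences, not at the two score distributions independently.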

Circularity Check

0 steps flagged

No circularity: purely empirical claims with no derivations or self-referential reductions


The paper describes training a 324M decoder-only transformer from scratch on unimodal batches followed by a short alignment stage, then reports empirical results on NLU tasks, UCR/UEA classification, and Time-MMD forecasting. No equations, fitted parameters, uniqueness theorems, or ansatzes are presented as derivations. All central claims rest on direct experimental comparisons to external baselines (Gemma-3, supervised fusion models) rather than any internal reduction of outputs to inputs by construction. The absence of a mathematical derivation chain makes the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the empirical success of a shared-parameter transformer trained mostly unimodally. No explicit free parameters, axioms, or invented entities are detailed in the abstract.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: Both modalities share the same transformer blocks, attention mechanism, and residual stream; the bulk of pretraining uses unimodal batches so cross-modal capability emerges purely from shared parameters, with a short alignment stage that interleaves the two.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear

Paper passage: Chronicle is a 16-layer, 324M-parameter decoder-only transformer (d=1024, 8 GQA heads with 4 KV heads, RoPE, SwiGLU, pre-norm RMSNorm)
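The grouped-query attention figures in the passage above can be made concrete with a short sketch. It uses only the quoted numbers (d=1024, 16 layers, 8 query heads, 4 KV heads) and assumes a head dimension of d divided by the number of query heads and bias-free projections. Neither assumption is confirmed by this summary, and the function names are hypothetical.

```python
D_MODEL = 1024
N_LAYERS = 16
N_Q_HEADS = 8
N_KV_HEADS = 4
HEAD_DIM = D_MODEL // N_Q_HEADS  # assumption: 128-dimensional heads


def kv_head_for(q_head):
    """Index of the key/value head that a given query head reads from."""
    group_size = N_Q_HEADS // N_KV_HEADS  # query heads per shared KV head
    return q_head // group_size


def attention_params_per_layer():
    """Weights in the Q, K, V, and output projections (no biases assumed)."""
    q_and_o = 2 * D_MODEL * D_MODEL
    k_and_v = 2 * D_MODEL * (N_KV_HEADS * HEAD_DIM)
    return q_and_o + k_and_v


mapping = [kv_head_for(h) for h in range(N_Q_HEADS)]
print(mapping)                                  # [0, 0, 1, 1, 2, 2, 3, 3]
print(attention_params_per_layer())             # 3145728
print(N_LAYERS * attention_params_per_layer())  # 50331648
```

Under these assumptions the attention projections account for roughly 50M of the 324M parameters. The remainder would sit in the feed-forward blocks, embeddings, and output heads, whose sizes are not given here.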

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


work page

[15] [15]

Touvron, Hugo and others , journal =

work page

[16] [16]

Grattafiori, Aaron and Dubey, Abhimanyu and Jauhri, Abhinav and others , journal =. The

work page

[17] [17]

Qwen2 Technical Report

Qwen2 Technical Report , author =. arXiv preprint arXiv:2407.10671 , year =

work page internal anchor Pith review Pith/arXiv arXiv

[18] [18]

Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina , journal =

work page

[19] [19]

Journal of Machine Learning Research , volume =

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer , author =. Journal of Machine Learning Research , volume =

work page

[20] [20]

Text and Code Embeddings by Contrastive Pre-Training

Text and Code Embeddings by Contrastive Pre-Training , author =. arXiv preprint arXiv:2201.10005 , year =

work page internal anchor Pith review Pith/arXiv arXiv

[21] [21]

MTEB: Massive Text Embedding Benchmark

Muennighoff, Niklas and Tazi, Nouamane and Magne, Lo. arXiv preprint arXiv:2210.07316 , year =

work page internal anchor Pith review Pith/arXiv arXiv

[22] [22]

Advances in Neural Information Processing Systems , volume =

Large Language Models Are Zero-Shot Time Series Forecasters , author =. Advances in Neural Information Processing Systems , volume =

work page

[23] [23]

and Shi, Xiaoming and Chen, Pin-Yu and Liang, Yuxuan and Li, Yuan-Fang and Pan, Shirui and Wen, Qingsong , booktitle =

Jin, Ming and Wang, Shiyu and Ma, Lintao and Chu, Zhixuan and Zhang, James Y. and Shi, Xiaoming and Chen, Pin-Yu and Liang, Yuxuan and Li, Yuan-Fang and Pan, Shirui and Wen, Qingsong , booktitle =. Time-

work page

[24] [24]

2023 , eprint=

One Fits All:Power General Time Series Analysis by Pretrained LM , author=. 2023 , eprint=

work page 2023

[25] [25]

and Pfister, Tomas and Zheng, Yixiang and Ye, Wen and Liu, Yan , journal =

Cao, Defu and Jia, Furong and Arik, Sercan O. and Pfister, Tomas and Zheng, Yixiang and Ye, Wen and Liu, Yan , journal =

work page

[26] [26]

, journal =

Xue, Hao and Salim, Flora D. , journal =

work page

[27] [27]

Language models still struggle to zero-shot reason about time series

Language Models Still Struggle to Zero-shot Reason about Time Series , author =. arXiv preprint arXiv:2404.11757 , year =

work page arXiv

[28] [28]

ChatTS: Aligning Time Series with LLMs via Synthetic Data for Enhanced Understanding and Reasoning , volume=

Xie, Zhe and Li, Zeyan and He, Xiao and Xu, Longlong and Wen, Xidao and Zhang, Tieying and Chen, Jianjun and Shi, Rui and Pei, Dan , year=. ChatTS: Aligning Time Series with LLMs via Synthetic Data for Enhanced Understanding and Reasoning , volume=. Proceedings of the VLDB Endowment , publisher=. doi:10.14778/3742728.3742735 , number=

work page doi:10.14778/3742728.3742735

[29] [29]

Liu, Haoxin and others , journal =. Time-

work page

[30] [30]

2025 , note =

Lee, Geon and others , booktitle =. 2025 , note =

work page 2025

[31] [31]

2024 , eprint=

ChatTime: A Unified Multimodal Time Series Foundation Model Bridging Numerical and Textual Data , author=. 2024 , eprint=

work page 2024

[32] [32]

Jia, Furong and Wang, Kevin and Zheng, Yixiang and Cao, Defu and Liu, Yan , booktitle =

work page

[33] [33]

Proceedings of the AAAI Conference on Artificial Intelligence , year =

Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting , author =. Proceedings of the AAAI Conference on Artificial Intelligence , year =

work page

[34] [34]

Advances in Neural Information Processing Systems , volume =

Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting , author =. Advances in Neural Information Processing Systems , volume =

work page

[35] [35]

Wu, Haixu and Hu, Tengge and Liu, Yong and Zhou, Hang and Wang, Jianmin and Long, Mingsheng , booktitle =

work page

[36] [36]

Liu, Yong and Hu, Tengge and Zhang, Haoran and Wu, Haixu and Wang, Shiyu and Ma, Lintao and Long, Mingsheng , journal =. i

work page

[37] [37]

Proceedings of the AAAI Conference on Artificial Intelligence , volume =

Are Transformers Effective for Time Series Forecasting? , author =. Proceedings of the AAAI Conference on Artificial Intelligence , volume =

work page

[38] [38]

Zhou, Tian and Ma, Ziqing and Wen, Qingsong and Wang, Xue and Sun, Liang and Jin, Rong , journal =

work page

[39] [39]

Advances in Neural Information Processing Systems , volume =

Attention is All You Need , author =. Advances in Neural Information Processing Systems , volume =

work page

[40] [40]

Advances in Neural Information Processing Systems , volume =

Root Mean Square Layer Normalization , author =. Advances in Neural Information Processing Systems , volume =

work page

[41] [41]

Su, Jianlin and Ahmed, Murtadha and Lu, Yu and Pan, Shengfeng and Bo, Wen and Liu, Yunfeng , journal =

work page

[42] [42]

Shah, Jay and Bikshandi, Ganesh and Zhang, Ying and Thakkar, Vijay and Ramani, Pradeep and Dao, Tri , journal =

work page

[43] [43]

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

Ainslie, Joshua and Lee-Thorp, James and de Jong, Michiel and Zemlyanskiy, Yury and Lebr. arXiv preprint arXiv:2305.13245 , year =

work page internal anchor Pith review Pith/arXiv arXiv

[44] [44]

Longformer: The Long-Document Transformer

Longformer: The Long-Document Transformer , author =. arXiv preprint arXiv:2004.05150 , year =

work page internal anchor Pith review Pith/arXiv arXiv 2004

[45] [45]

Advances in Neural Information Processing Systems , volume =

Primer: Searching for Efficient Transformers for Language Modeling , author =. Advances in Neural Information Processing Systems , volume =

work page

[46] [46]

Gemma 2: Improving Open Language Models at a Practical Size

Gemma 2: Improving Open Language Models at a Practical Size , author =. arXiv preprint arXiv:2408.00118 , year =

work page internal anchor Pith review Pith/arXiv arXiv

[47] [47]

URL https://kellerjordan

Muon: An optimizer for hidden layers in neural networks, 2024 , author=. URL https://kellerjordan. github. io/posts/muon , volume=

work page 2024

[48] [48]

International Conference on Learning Representations , year =

Decoupled Weight Decay Regularization , author =. International Conference on Learning Representations , year =

work page

[49] [49]

Rajbhandari, Samyam and Rasley, Jeff and Rabe, Markus N and He, Yuxiong , journal =

work page

[50] [50]

2024 , eprint=

GIFT-Eval: A Benchmark For General Time Series Forecasting Model Evaluation , author=. 2024 , eprint=

work page 2024

[51] [51]

Dau, Hoang Anh and Bagnall, Anthony and Kamgar, Kaveh and Yeh, Chin-Chia Michael and Zhu, Yan and Gharghabi, Shaghayegh and Ratanamahatana, Chotirat Ann and Keogh, Eamonn , journal =. The

work page

[52] [52]

Journal of the American Statistical Association , volume =

Strictly Proper Scoring Rules, Prediction, and Estimation , author =. Journal of the American Statistical Association , volume =

work page

[53] [53]

International Conference on Learning Representations , year =

Reversible Instance Normalization for Accurate Time-Series Forecasting against Distribution Shift , author =. International Conference on Learning Representations , year =

work page

[54] [54]

2025 , howpublished =

work page 2025

[55] [55]

2025 , eprint=

NorMuon: Making Muon more efficient and scalable , author=. 2025 , eprint=

work page 2025

[56] [56]

Tirex: Zero-shot forecasting across long and short horizons with enhanced in-context learning,

Auer, Andreas and Podest, Patrick and Klotz, Daniel and B. arXiv preprint arXiv:2505.23719 , year =

work page arXiv

[57] [57]

2025 , eprint =

This Time is Different: An Observability Perspective on Time Series Foundation Models , author =. 2025 , eprint =

work page 2025

[58] [58]

Output Scaling:

Wang, Xue and Zhou, Tian and Gao, Jinyang and Ding, Bolin and Zhou, Jingren , journal =. Output Scaling:. 2025 , url =

work page 2025

[59] [59]

and Carpov, Dmitri and Chapados, Nicolas and Bengio, Yoshua , booktitle =

Oreshkin, Boris N. and Carpov, Dmitri and Chapados, Nicolas and Bengio, Yoshua , booktitle =. 2020 , url =

work page 2020

[60] [60]

2020 , publisher =

Salinas, David and Flunkert, Valentin and Gasthaus, Jan and Januschowski, Tim , journal =. 2020 , publisher =

work page 2020

[61] [61]

Gemma 3 Technical Report

Gemma 3 Technical Report , author=. arXiv preprint arXiv:2503.19786 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[62] [62]

2025 , eprint=

LFM2 Technical Report , author=. 2025 , eprint=

work page 2025

[63] [63]

Li, Jeffrey and Fang, Alex and Smyrnis, Georgios and Ivgi, Maor and Jordan, Matt and Gadre, Samir and Bansal, Hritik and Guha, Etash and Keh, Sedrick and Arora, Kushal and others , journal=

work page

[64] [64]

Zellers, Rowan and Holtzman, Ari and Bisk, Yonatan and Farhadi, Ali and Choi, Yejin , booktitle=

work page

[65] [65]

Think you have Solved Question Answering? Try

Clark, Peter and Cowhey, Isaac and Etzioni, Oren and Khot, Tushar and Sabharwal, Ashish and Schoenick, Carissa and Tafjord, Oyvind , journal=. Think you have Solved Question Answering? Try

work page

[66] [66]

AAAI Spring Symposium Series , year=

Choice of Plausible Alternatives: An Evaluation of Commonsense Causal Reasoning , author=. AAAI Spring Symposium Series , year=

work page

[67] [67]

Talmor, Alon and Herzig, Jonathan and Lourie, Nicholas and Berant, Jonathan , booktitle=

work page

[68] [68]

Bisk, Yonatan and Zellers, Rowan and Gao, Jianfeng and Choi, Yejin and others , booktitle=

work page

[69] [69]

Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing , pages=

Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering , author=. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing , pages=

work page 2018

[70] [70]

Paperno, Denis and Kruszewski, Germ. The. arXiv preprint arXiv:1606.06031 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[71] [71]

Levesque, Hector and Davis, Ernest and Morgenstern, Leora , booktitle=. The

work page

[72] [72]

2019 , eprint=

WinoGrande: An Adversarial Winograd Schema Challenge at Scale , author=. 2019 , eprint=

work page 2019

[73] [73]

Clark, Christopher and Lee, Kenton and Chang, Ming-Wei and Kwiatkowski, Tom and Collins, Michael and Toutanova, Kristina , booktitle=

work page

[74] [74]

Reddy, Siva and Chen, Danqi and Manning, Christopher D , journal=

work page

[75] [75]

Rajpurkar, Pranav and Zhang, Jian and Lopyrev, Konstantin and Liang, Percy , booktitle=

work page

[76] [76]

Transactions on Machine Learning Research , year=

Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models , author=. Transactions on Machine Learning Research , year=

work page

[77] [77]

Zhong, Wanjun and Cui, Ruixiang and Guo, Yiduo and Liang, Yaobo and Lu, Shuai and Wang, Yanlin and Saied, Amin and Chen, Weizhu and Duan, Nan , journal=

work page

[78] [78]

Using the Output Embedding to Improve Language Models

Press, Ofir and Wolf, Lior. Using the Output Embedding to Improve Language Models. Proceedings of the 15th Conference of the E uropean Chapter of the Association for Computational Linguistics: Volume 2, Short Papers. 2017

work page 2017

[79] [79]

2026 , eprint=

Moirai 2.0: When Less Is More for Time Series Forecasting , author=. 2026 , eprint=

work page 2026

[80] [80]

Lee, Geon and Yu, Wenchao and Cheng, Wei and Chen, Haifeng , year=

work page