Chronicle: A Multimodal Foundation Model for Joint Language and Time Series Understanding
Pith reviewed 2026-05-21 07:36 UTC · model grok-4.3
The pith
A compact decoder-only transformer trained jointly on text and time series from scratch matches a dedicated language model on NLU tasks, while setting the best reported results on frozen-embedding time series classification and beating supervised fusion baselines on multimodal forecasting.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Chronicle is the first model jointly pretrained on text and time series from scratch in a unified architecture; it matches Gemma-3-270M-PT on 19 NLU tasks, sets a new bar for frozen-embedding time series classification on 24 UCR/UEA datasets, and produces multimodal forecasts on Time-MMD that beat every supervised fusion baseline, all from a single backbone.
What carries the argument
Shared transformer blocks, attention mechanism, and residual stream across language and time series, with the bulk of pretraining on unimodal batches and only a short alignment stage for interleaving modalities.
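To make the training schedule concrete, here is a minimal sketch of a step-by-step batch plan: mostly unimodal batches, followed by a short interleaved alignment stage. The function name `modality_schedule`, the 5% alignment fraction, and the strict text/time-series alternation are illustrative assumptions; the source does not state the actual mixture or stage length.

```python
from typing import Iterator


def modality_schedule(total_steps: int, align_fraction: float = 0.05) -> Iterator[str]:
    """Yield the batch type for each training step.

    Assumption (not from the paper): unimodal steps simply alternate between
    text and time series, and the final `align_fraction` of steps are interleaved.
    """
    align_steps = round(total_steps * align_fraction)
    unimodal_steps = total_steps - align_steps

    # Bulk of pretraining: every batch contains a single modality.
    for step in range(unimodal_steps):
        yield "text" if step % 2 == 0 else "time_series"

    # Short alignment stage: batches interleave both modalities.
    for _ in range(align_steps):
        yield "interleaved"


schedule = list(modality_schedule(total_steps=100, align_fraction=0.05))
print(schedule[:4], schedule[-3:])
```

Every batch type would pass through the same backbone; only the composition of the batch changes between the two stages.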
If this is right
- Joint pretraining on shared parameters produces a single backbone competitive with the strongest unimodal foundation models in both language and time series.
- Frozen embeddings from the joint model achieve state-of-the-art results on time series classification without task-specific fine-tuning.
- Multimodal forecasting improves when modalities are trained together rather than fused after separate pretraining.
- Evaluation against dedicated models in each domain, rather than only multimodal baselines, is required to establish the value of joint training.
Where Pith is reading between the lines
- The same shared-parameter approach could be tested on additional modalities such as images or audio that also contain temporal structure.
- Varying the length or schedule of the alignment stage might reveal how much interleaving is needed for stronger cross-modal transfer.
- Domains that naturally pair text reports with numerical series, such as finance or sensor networks, could adopt this single-model pattern instead of maintaining separate pipelines.
Load-bearing premise
Cross-modal capability emerges from shared parameters when the bulk of pretraining uses unimodal batches, with only a short alignment stage needed to interleave the modalities.
What would settle it
A controlled experiment that trains an identical architecture on time series data alone and compares its frozen-embedding classification accuracy on the 24 UCR/UEA datasets against Chronicle's reported results.
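The comparison described above amounts to scoring two frozen encoders with the same lightweight classifier. The sketch below shows that protocol on toy data. Everything in it is a stand-in: `frozen_embed` is a hypothetical fixed feature extractor rather than the Chronicle backbone, the nearest-centroid classifier is a simple substitute for whatever probe the paper uses, and the series are invented for illustration.

```python
import math
from statistics import mean, pstdev


def frozen_embed(series):
    """Hypothetical stand-in for a frozen backbone: fixed, untrained features."""
    diffs = [b - a for a, b in zip(series, series[1:])]
    return [mean(series), pstdev(series), mean([abs(d) for d in diffs])]


def fit_centroids(embeddings, labels):
    """Average the embeddings of each class; the encoder itself is never updated."""
    grouped = {}
    for emb, lab in zip(embeddings, labels):
        grouped.setdefault(lab, []).append(emb)
    return {lab: [mean(col) for col in zip(*embs)] for lab, embs in grouped.items()}


def predict(centroids, emb):
    """Assign the class whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda lab: math.dist(centroids[lab], emb))


def evaluate(embed_fn, train, test):
    """Score any frozen encoder with the same probe so encoders are comparable."""
    centroids = fit_centroids([embed_fn(x) for x, _ in train], [y for _, y in train])
    hits = sum(predict(centroids, embed_fn(x)) == y for x, y in test)
    return hits / len(test)


# Toy data, invented for illustration only.
train = [([1, 1, 1, 1, 1, 1], "flat"), ([2, 2, 2, 2, 2, 2], "flat"),
         ([0, 1, 0, 1, 0, 1], "wave"), ([0, 2, 0, 2, 0, 2], "wave")]
test = [([3, 3, 3, 3, 3, 3], "flat"), ([0, 3, 0, 3, 0, 3], "wave")]

acc = evaluate(frozen_embed, train, test)
print(f"frozen-embedding accuracy: {acc:.2f}")
```

In the proposed control, `evaluate` would be called once with the jointly trained encoder and once with the time-series-only encoder, on identical train/test splits.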
Original abstract
Real-world time series come with text: metadata, descriptions, news, reports. Yet time series foundation models process numerical sequences in isolation, and the multimodal text-and-time-series models that attempt to bridge the two all adapt a pretrained language model post hoc, inheriting representations shaped without ever seeing temporal data. These models are also evaluated almost exclusively against other multimodal baselines, not against the strongest unimodal foundation models in either domain, leaving open whether joint training is needed at all. We present Chronicle, a compact 324M-parameter decoder-only transformer trained from scratch on natural language and time series within a single unified architecture. Both modalities share the same transformer blocks, attention mechanism, and residual stream; the bulk of pretraining uses unimodal batches so cross-modal capability emerges purely from shared parameters, with a short alignment stage that interleaves the two. To our knowledge, Chronicle is the first model jointly pretrained on text and time series from scratch, and the first multimodal model evaluated against dedicated foundation models in both domains. It matches Gemma-3-270M-PT on 19 NLU tasks, sets a new bar for frozen-embedding time series classification on 24 UCR/UEA datasets, and produces multimodal forecasts on Time-MMD that beat every supervised fusion baseline, all from a single backbone.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Chronicle, a 324M-parameter decoder-only transformer pretrained from scratch on natural language and time series within a single unified architecture. Both modalities share the same transformer blocks, attention, and residual stream; the bulk of pretraining uses unimodal batches, with a short alignment stage to interleave modalities. The model is claimed to match Gemma-3-270M-PT on 19 NLU tasks, set a new state-of-the-art for frozen-embedding time series classification on 24 UCR/UEA datasets, and outperform all supervised fusion baselines on multimodal forecasting in Time-MMD, positioning it as the first jointly pretrained from-scratch multimodal model evaluated against dedicated unimodal foundation models in both domains.
Significance. If the empirical results hold under rigorous verification, the work would be significant as the first demonstration of a compact from-scratch multimodal foundation model for text and time series that competes with strong unimodal models without post-hoc adaptation. It would provide evidence that shared parameters can enable cross-modal capability with minimal interleaved data, potentially reducing the need for separate modality-specific pretraining pipelines and offering a practical backbone for real-world applications involving metadata, reports, and temporal data.
major comments (2)
- Abstract: The central claim that cross-modal capability emerges purely from shared parameters when the bulk of pretraining uses unimodal batches plus a short alignment stage lacks supporting ablation evidence. No direct comparison is described to a fully interleaved pretraining baseline or to a model using modality-specific adapters, so it remains unclear whether the reported gains on Time-MMD multimodal forecasting and competitive NLU/TS performance arise from the claimed emergence mechanism rather than data scale, tokenizer design, or the 324M decoder-only architecture.
- Abstract and results sections: Strong performance claims (matching Gemma-3-270M-PT on 19 NLU tasks, new bar on 24 UCR/UEA datasets, beating supervised fusion baselines on Time-MMD) are stated without details on training data sources, exact hyperparameters, statistical significance tests, or ablation studies, rendering it impossible to verify whether the results support the soundness of the central claims.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments. We address each major comment below and have revised the manuscript to incorporate additional evidence and details where this strengthens the presentation of our claims.
- Referee: Abstract: The central claim that cross-modal capability emerges purely from shared parameters when the bulk of pretraining uses unimodal batches plus a short alignment stage lacks supporting ablation evidence. No direct comparison is described to a fully interleaved pretraining baseline or to a model using modality-specific adapters, so it remains unclear whether the reported gains on Time-MMD multimodal forecasting and competitive NLU/TS performance arise from the claimed emergence mechanism rather than data scale, tokenizer design, or the 324M decoder-only architecture.
  Authors: We agree that direct ablation evidence would more rigorously isolate the contribution of shared parameters under predominantly unimodal pretraining. In the revised manuscript we have added a new Ablation Studies subsection that reports results from a controlled comparison (at reduced scale) between our unimodal-batch-plus-short-alignment schedule and a fully interleaved pretraining baseline using the same data mixture and tokenizer. We also include a brief discussion of the design choice against modality-specific adapters, noting that adapters would reintroduce separate modality pipelines and defeat the goal of a single unified backbone. While a full-scale 324M adapter ablation remains computationally prohibitive, the smaller-scale results and architectural analysis support that the observed cross-modal gains are not solely attributable to data scale or architecture size.
  Revision: yes
- Referee: Abstract and results sections: Strong performance claims (matching Gemma-3-270M-PT on 19 NLU tasks, new bar on 24 UCR/UEA datasets, beating supervised fusion baselines on Time-MMD) are stated without details on training data sources, exact hyperparameters, statistical significance tests, or ablation studies, rendering it impossible to verify whether the results support the soundness of the central claims.
  Authors: We acknowledge that the original submission omitted several implementation details required for full reproducibility and verification. The revised manuscript now contains an expanded Experimental Setup section that specifies the exact language and time-series pretraining corpora (including sizes and sources), a comprehensive hyperparameter table in the appendix, and statistical significance testing (paired t-tests with p-values) for all headline comparisons against Gemma-3-270M-PT, the UCR/UEA baselines, and the Time-MMD fusion models. The new Ablation Studies subsection further addresses the request for supporting analyses.
  Revision: yes
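For readers unfamiliar with the paired t-test mentioned in the rebuttal, the sketch below computes the test statistic from per-dataset scores of two models evaluated on the same datasets. The function name `paired_t_statistic` and the score lists are hypothetical; none of the numbers come from the paper.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired per-dataset scores."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length lists with at least 2 pairs")

    # The test works on the per-dataset differences, not on the raw scores.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")

    n = len(diffs)
    t = mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


# Hypothetical accuracies on four datasets, for illustration only.
model_a = [0.80, 0.75, 0.90, 0.85]
model_b = [0.78, 0.70, 0.88, 0.80]

t_stat, dof = paired_t_statistic(model_a, model_b)
print(f"t = {t_stat:.3f}, degrees of freedom = {dof}")
```

In practice a library routine such as `scipy.stats.ttest_rel` returns the same statistic along with a p-value.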
Circularity Check
No circularity: purely empirical claims with no derivations or self-referential reductions
The paper describes training a 324M decoder-only transformer from scratch on unimodal batches followed by a short alignment stage, then reports empirical results on NLU tasks, UCR/UEA classification, and Time-MMD forecasting. No equations, fitted parameters, uniqueness theorems, or ansatzes are presented as derivations. All central claims rest on direct experimental comparisons to external baselines (Gemma-3, supervised fusion models) rather than any internal reduction of outputs to inputs by construction. The absence of a mathematical derivation chain makes the work self-contained against external benchmarks.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Unclear relation between the paper passage and the cited Recognition theorem.
  Paper passage: Both modalities share the same transformer blocks, attention mechanism, and residual stream; the bulk of pretraining uses unimodal batches so cross-modal capability emerges purely from shared parameters, with a short alignment stage that interleaves the two.
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, reality_from_one_distinction (tag: unclear)
  Unclear relation between the paper passage and the cited Recognition theorem.
  Paper passage: Chronicle is a 16-layer, 324M-parameter decoder-only transformer (d=1024, 8 GQA heads with 4 KV heads, RoPE, SwiGLU, pre-norm RMSNorm)
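The architecture line quoted above implies some simple attention-shape arithmetic, sketched here. The class `BackboneConfig` is hypothetical, and the per-head width assumes the model dimension is split evenly across the 8 query heads, which the quoted passage does not confirm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackboneConfig:
    """Values taken from the quoted passage; derived sizes are assumptions."""
    n_layers: int = 16
    d_model: int = 1024
    n_query_heads: int = 8
    n_kv_heads: int = 4

    @property
    def head_dim(self) -> int:
        # Assumption: d_model is divided evenly among the query heads.
        return self.d_model // self.n_query_heads

    @property
    def queries_per_kv_head(self) -> int:
        # Grouped-query attention: several query heads share one KV head.
        return self.n_query_heads // self.n_kv_heads

    @property
    def kv_proj_dim(self) -> int:
        # Width of the key (or value) projection under the head_dim assumption.
        return self.n_kv_heads * self.head_dim


cfg = BackboneConfig()
print(cfg.head_dim, cfg.queries_per_kv_head, cfg.kv_proj_dim)
```

Under that assumption, each pair of query heads shares one key/value head, so the key and value projections are half the width of the query projection.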
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.