arxiv: 2510.15821 · v1 · submitted 2025-10-17 · 💻 cs.LG · cs.AI · stat.ML


Chronos-2: From Univariate to Universal Forecasting

Abdul Fatir Ansari, Oleksandr Shchur, Jaris Küken, Andreas Auer, Boran Han, Pedro Mercado, Syama Sundar Rangapuram, Huibin Shen, Lorenzo Stella, Xiyuan Zhang, Mononito Goswami, Shubham Kapoor, Danielle C. Maddix, Pablo Guerron, Tony Hu, Junming Yin, Nick Erickson, Prateek Mutalik Desai, Hao Wang, Huzefa Rangwala, George Karypis, Yuyang Wang, Michael Bohlke-Schneider


Pith reviewed 2026-05-15 01:16 UTC · model grok-4.3


keywords: time series forecasting, pretrained models, in-context learning, zero-shot forecasting, multivariate forecasting, covariate forecasting, synthetic data training


The pith

Chronos-2 is a pretrained model that performs zero-shot forecasting on univariate, multivariate, and covariate-informed tasks via group attention for in-context learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Chronos-2 as a model pretrained solely on synthetic data to handle forecasting tasks that previously required separate specialized systems. It uses a group attention mechanism so that multiple related series can share information during inference, whether those series are variates within one dataset or a target paired with external covariates. This design allows the model to produce accurate predictions without any task-specific training or fine-tuning. A reader would care because real-world forecasting often involves multivariate data and covariates, yet most existing pretrained models remain limited to single series. The authors report state-of-the-art results on three large benchmarks, with particularly large gains on covariate-heavy tasks.

Core claim

Chronos-2 employs a group attention mechanism that facilitates in-context learning through efficient information sharing across multiple time series within a group, which may represent sets of related series, variates of a multivariate series, or targets and covariates in a forecasting task. These general capabilities are achieved through training on synthetic datasets that impose diverse multivariate structures on univariate series. Chronos-2 delivers state-of-the-art performance across three comprehensive benchmarks: fev-bench, GIFT-Eval, and Chronos Benchmark II.

What carries the argument

group attention mechanism that enables in-context learning by sharing information across time series grouped as related series, variates, or target-covariate pairs
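The idea of sharing information only among series in the same group can be sketched as attention with a group mask. This is a minimal illustration, not the paper's implementation: the function name `group_attention`, the precomputed `scores` matrix, and the integer `group_ids` are assumptions made for the example. A real model would derive the scores from learned query and key projections.

```python
import math

def group_attention(values, scores, group_ids):
    """Mix per-series vectors with softmax attention restricted to each group.

    values:    list of N feature vectors, one per series (hypothetical inputs)
    scores:    N x N list of raw attention logits between series
    group_ids: list of N ints; a series attends only to series sharing its id
    """
    n = len(values)
    out = []
    for i in range(n):
        # Keep only the series that share a group with series i.
        members = [j for j in range(n) if group_ids[j] == group_ids[i]]
        # Numerically stable softmax over the allowed logits.
        peak = max(scores[i][j] for j in members)
        weights = [math.exp(scores[i][j] - peak) for j in members]
        total = sum(weights)
        mixed = [0.0] * len(values[i])
        for w, j in zip(weights, members):
            for d, v in enumerate(values[j]):
                mixed[d] += (w / total) * v
        out.append(mixed)
    return out

# Series 0 and 1 form one group (e.g. a target and its covariate); series 2 is alone.
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
scores = [[0.0] * 3 for _ in range(3)]  # uniform logits for illustration
group_ids = [0, 0, 1]
mixed = group_attention(values, scores, group_ids)
```

With uniform logits, series 0 and 1 each receive the average of their two vectors, while series 2 is left unchanged because no other series shares its group.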

Load-bearing premise

Training exclusively on synthetic datasets that impose diverse multivariate structures on univariate series will produce a model whose in-context learning generalizes to real-world multivariate and covariate distributions without domain-specific fine-tuning.
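One simple way to impose multivariate structure on univariate series is to build each variate as a random linear mixture of independent base series. The sketch below assumes that approach purely for illustration; this page does not describe the paper's actual generators, and the name `impose_multivariate_structure` and the Gaussian mixing weights are hypothetical.

```python
import random

def impose_multivariate_structure(base_series, n_variates, rng):
    """Build n_variates series as random linear mixtures of univariate base series.

    Linear mixing is an assumed stand-in for whatever structures the paper uses.
    """
    length = len(base_series[0])
    # One row of mixing weights per output variate.
    weights = [[rng.gauss(0.0, 1.0) for _ in base_series] for _ in range(n_variates)]
    variates = [
        [sum(w * s[t] for w, s in zip(row, base_series)) for t in range(length)]
        for row in weights
    ]
    return variates, weights

rng = random.Random(0)
# Two independent univariate base series (white noise, for illustration only).
base = [[rng.gauss(0.0, 1.0) for _ in range(200)] for _ in range(2)]
variates, weights = impose_multivariate_structure(base, n_variates=3, rng=rng)
```

Because all three variates draw on the same two base series, they are correlated by construction. The premise above is that structure of this kind, in sufficient variety, resembles the dependencies found in real data.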

What would settle it

A new benchmark of real-world multivariate series with covariate relationships absent from the synthetic training distribution where Chronos-2 fails to match or exceed the accuracy of models trained on domain data.

Original abstract

Pretrained time series models have enabled inference-only forecasting systems that produce accurate predictions without task-specific training. However, existing approaches largely focus on univariate forecasting, limiting their applicability in real-world scenarios where multivariate data and covariates play a crucial role. We present Chronos-2, a pretrained model capable of handling univariate, multivariate, and covariate-informed forecasting tasks in a zero-shot manner. Chronos-2 employs a group attention mechanism that facilitates in-context learning (ICL) through efficient information sharing across multiple time series within a group, which may represent sets of related series, variates of a multivariate series, or targets and covariates in a forecasting task. These general capabilities are achieved through training on synthetic datasets that impose diverse multivariate structures on univariate series. Chronos-2 delivers state-of-the-art performance across three comprehensive benchmarks: fev-bench, GIFT-Eval, and Chronos Benchmark II. On fev-bench, which emphasizes multivariate and covariate-informed forecasting, Chronos-2's universal ICL capabilities lead to substantial improvements over existing models. On tasks involving covariates, it consistently outperforms baselines by a wide margin. Case studies in the energy and retail domains further highlight its practical advantages. The in-context learning capabilities of Chronos-2 establish it as a general-purpose forecasting model that can be used "as is" in real-world forecasting pipelines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Chronos-2 adds group attention to enable zero-shot multivariate and covariate forecasting after synthetic training, with solid benchmark wins but thin validation that the synthetic structures actually transfer.


Chronos-2 takes the existing Chronos univariate pretrained model and adds a group attention mechanism so the model can share information across related series, variates, or targets plus covariates during in-context learning. The authors train exclusively on synthetic data that layers artificial multivariate structures onto univariate series, then evaluate zero-shot on fev-bench, GIFT-Eval, and Chronos Benchmark II. The reported gains are largest on covariate tasks, and the energy and retail case studies show the model working in practical settings without fine-tuning. That combination is the real contribution: a single model that handles more input types than prior pretrained forecasters while staying inference-only. The engineering looks clean, and the synthetic-data recipe lets the authors scale training without needing huge real multivariate corpora.

The soft spot is the missing link between synthetic training and real distributions. The abstract gives no ablations that measure how well the generated correlation structures match the benchmarks, no error bars, and no statistical tests. Without those, the wide margins on covariate tasks could partly reflect benchmark construction rather than true universality. The citation pattern is straightforward and builds on the right prior work.

This paper is for engineers who need a drop-in forecaster across many domains and for researchers studying how to extend in-context learning in time series. A practitioner would get immediate value from the benchmark tables and case studies. It deserves peer review because the practical framing and results are strong enough to justify detailed feedback on the experimental gaps, even if the generalization story needs tightening.

Referee Report

3 major / 1 minor

Summary. The paper introduces Chronos-2, a pretrained time series model extending univariate forecasting to universal zero-shot capabilities for multivariate and covariate-informed tasks. It employs a group attention mechanism to enable in-context learning across groups of series (representing related variates or targets/covariates), trained exclusively on synthetic datasets that impose diverse multivariate structures on univariate series. The central claims are state-of-the-art performance on fev-bench, GIFT-Eval, and Chronos Benchmark II, with wide margins on covariate tasks, plus practical advantages shown in energy and retail case studies.

Significance. If the generalization from synthetic training to real-world distributions holds, this would mark a substantial advance toward general-purpose, inference-only forecasting models that eliminate the need for task-specific fine-tuning. The synthetic-data approach and group attention for ICL could reduce reliance on domain-specific datasets, with potential broad impact in applied domains like energy and retail if the benchmark gains prove robust.

major comments (3)

[Abstract] The SOTA performance claims on fev-bench, GIFT-Eval, and Chronos Benchmark II are reported without error bars, ablation studies, or statistical significance tests. This omission is load-bearing because the wide margins on covariate tasks rest entirely on these external benchmarks, and without such controls the improvements cannot be confidently attributed to the universal ICL capabilities rather than benchmark artifacts.
[Training and evaluation sections] The central claim that training exclusively on synthetic datasets (imposing multivariate structures on univariate series) produces ICL that generalizes to real-world multivariate and covariate distributions lacks any ablation or analysis measuring distribution shift between the synthetic generator and the target/covariate relationships in fev-bench or GIFT-Eval. This is load-bearing for the zero-shot universality assertion.
[Group attention mechanism description] The mechanism is presented at a high level for information sharing across groups, but the training objective contains no explicit regularization term or analysis for real-world correlation structures. Without this, it is unclear whether the reported gains on covariate tasks arise from the mechanism itself or from incidental overlap with the synthetic training distribution.
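The statistical significance tests requested in the first comment could take the form of a paired test on per-task errors. The sketch below uses a sign-flip permutation test, which is one common choice and not necessarily what the authors would adopt. The function name is hypothetical and the error values are made-up placeholders, not benchmark results.

```python
import random

def paired_sign_flip_test(errors_a, errors_b, n_resamples=10000, seed=0):
    """Two-sided p-value for the mean paired difference under random sign flips."""
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    observed = abs(sum(diffs) / len(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis, each paired difference is equally likely
        # to have either sign, so flip signs at random and recompute the mean.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / len(diffs)) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_resamples + 1)

# Placeholder per-task errors for a baseline and a candidate model (not real data).
baseline_errors = [1.2, 0.9, 1.5, 1.1, 1.3, 1.0, 1.4, 1.6, 0.8, 1.7]
candidate_errors = [1.0, 0.7, 1.2, 0.9, 1.0, 0.8, 1.1, 1.2, 0.6, 1.3]
p_value = paired_sign_flip_test(baseline_errors, candidate_errors)
```

A small p-value here would indicate that the candidate's advantage is consistent across tasks and is not driven by a few outliers, which is the kind of control the comment asks for.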

minor comments (1)

[Abstract] The abstract would be strengthened by including a brief statement of model scale (parameter count) and a high-level architectural diagram reference to aid readers in assessing the practicality of the zero-shot approach.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We are grateful to the referee for their constructive and detailed feedback. The comments highlight important areas where additional evidence and analysis would strengthen the claims regarding zero-shot universality. We address each major comment below and outline the revisions we will make to the manuscript.


Referee: [Abstract] The SOTA performance claims on fev-bench, GIFT-Eval, and Chronos Benchmark II are reported without error bars, ablation studies, or statistical significance tests. This omission is load-bearing because the wide margins on covariate tasks rest entirely on these external benchmarks, and without such controls the improvements cannot be confidently attributed to the universal ICL capabilities rather than benchmark artifacts.

Authors: We agree that the lack of error bars, statistical significance tests, and expanded ablations limits the strength of the SOTA claims, particularly for the covariate tasks. In the revised manuscript we will report standard deviations from multiple evaluation runs (where computationally feasible given the scale of the benchmarks), include paired statistical tests for the reported improvements, and expand the ablation studies section with additional controls on model components and data variations.
revision: yes

Referee: [Training and evaluation sections] The central claim that training exclusively on synthetic datasets (imposing multivariate structures on univariate series) produces ICL that generalizes to real-world multivariate and covariate distributions lacks any ablation or analysis measuring distribution shift between the synthetic generator and the target/covariate relationships in fev-bench or GIFT-Eval. This is load-bearing for the zero-shot universality assertion.

Authors: The referee correctly notes the absence of explicit distribution-shift analysis. While the synthetic data generator is described in detail and constructed to impose diverse multivariate structures, we did not quantify shifts relative to the evaluation benchmarks. We will add a new subsection with quantitative comparisons of key statistics (cross-series correlations, covariate-target dependencies, and other distributional properties) between the synthetic training distribution and the fev-bench/GIFT-Eval datasets, together with any feasible ablations on the effect of synthetic data diversity.
revision: yes

Referee: [Group attention mechanism description] The mechanism is presented at a high level for information sharing across groups, but the training objective contains no explicit regularization term or analysis for real-world correlation structures. Without this, it is unclear whether the reported gains on covariate tasks arise from the mechanism itself or from incidental overlap with the synthetic training distribution.

Authors: We acknowledge that the group attention description remains high-level and that the training objective lacks an explicit regularization term targeting real-world correlations. The mechanism relies on the diversity of synthetic structures to learn in-context sharing. In the revision we will provide a more detailed description of the attention implementation, add an ablation comparing performance with and without group attention, and include an analysis of attention patterns on real-world examples to illustrate that relevant correlation structures are captured.
revision: partial
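The quantitative comparison of cross-series correlations promised in the second response could start from a summary statistic such as the mean absolute pairwise Pearson correlation within a group. The sketch below is an assumption about how such a comparison might look. The helper names are hypothetical and the two small groups are toy values, not data from the synthetic generator or from any benchmark.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length series (assumes nonzero variance)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def mean_abs_cross_correlation(group):
    """Average |correlation| over all pairs of series in a group."""
    pairs = list(combinations(group, 2))
    return sum(abs(pearson(x, y)) for x, y in pairs) / len(pairs)

# Toy stand-ins: a strongly coupled "synthetic" group and an uncoupled "real" group.
synthetic_group = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], [4.0, 3.0, 2.0, 1.0]]
real_group = [[1.0, 2.0, 3.0, 4.0], [1.0, -1.0, -1.0, 1.0]]

gap = abs(
    mean_abs_cross_correlation(synthetic_group)
    - mean_abs_cross_correlation(real_group)
)
```

A large gap between the training and evaluation distributions on statistics like this one would signal the distribution shift the referee is concerned about. A full analysis would compare distributions over many groups and not just a single number.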

Circularity Check

0 steps flagged

No significant circularity; claims rest on external benchmarks and synthetic training without self-referential reduction


The paper's central claims concern empirical performance of a pretrained model on independent benchmarks (fev-bench, GIFT-Eval, Chronos Benchmark II) after training exclusively on synthetic data that imposes multivariate structure on univariate series. No equations, derivations, or self-citations are presented that reduce reported improvements, ICL capabilities, or generalization to fitted parameters or prior results by construction. The training procedure and group attention mechanism are described as design choices whose effectiveness is measured externally rather than defined tautologically. This is the most common honest non-finding for papers whose primary output is benchmark numbers rather than a closed mathematical derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that synthetic data can stand in for real multivariate distributions and that group attention will produce useful in-context learning without further supervision. No free parameters are explicitly named in the abstract, but the synthetic data generation process implicitly contains choices about how multivariate structure is imposed.

axioms (1)

domain assumption: Synthetic datasets that impose diverse multivariate structures on univariate series are sufficient to train generalizable in-context learning for real-world multivariate and covariate forecasting.
Invoked in the description of how Chronos-2 achieves its universal capabilities.

invented entities (1)

Group attention mechanism (no independent evidence)
purpose: To enable efficient information sharing across multiple time series within a group for in-context learning.
New architectural component introduced to handle multivariate and covariate tasks.

pith-pipeline@v0.9.0 · 5646 in / 1505 out tokens · 51700 ms · 2026-05-15T01:16:12.885791+00:00 · methodology


Forward citations

Cited by 25 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

EpiCastBench: Datasets and Benchmarks for Multivariate Epidemic Forecasting
cs.LG 2026-05 unverdicted novelty 7.0

EpiCastBench supplies 40 curated multivariate epidemic datasets and evaluates 15 forecasting models under unified preprocessing, horizons, metrics, and significance tests.

Benchmarking Sensor-Fault Robustness in Forecasting
cs.LG 2026-05 conditional novelty 7.0

SensorFault-Bench is a new CPS-grounded benchmark showing that clean-MSE rankings of forecasting models often disagree with their robustness under standardized sensor-fault scenarios across four real datasets.

Explainable Load Forecasting with Covariate-Informed Time Series Foundation Models
cs.LG 2026-04 unverdicted novelty 7.0

Time series foundation models match the performance of specialized models for day-ahead load forecasting while providing explanations that match domain knowledge on weather and calendar effects.

TempusBench: An Evaluation Framework for Time-Series Forecasting
cs.LG 2026-04 unverdicted novelty 7.0

TempusBench is a new evaluation framework for time-series forecasting models that supplies fresh non-overlapping datasets, tasks beyond horizon and domain, consistent tuning across models, and visualization tools.

TabPFN-3: Technical Report
cs.LG 2026-05 unverdicted novelty 6.0

TabPFN-3 delivers state-of-the-art tabular prediction performance on benchmarks up to 1M rows, is up to 20x faster than prior versions, and introduces test-time scaling that beats non-TabPFN models by hundreds of Elo points.

MILM: Large Language Models for Multimodal Irregular Time Series with Informative Sampling
cs.LG 2026-05 unverdicted novelty 6.0

MILM fine-tunes LLMs on XML-encoded multimodal irregular time series via a two-stage process that exploits informative sampling patterns to achieve top performance on EHR classification datasets.

FETS Benchmark: Foundation Models Outperform Dataset-specific Machine Learning in Energy Time Series Forecasting
cs.LG 2026-04 unverdicted novelty 6.0

Foundation models outperform dataset-specific machine learning in energy time series forecasting across 54 datasets in 9 categories.

Temporal Taskification in Streaming Continual Learning: A Source of Evaluation Instability
cs.LG 2026-04 conditional novelty 6.0

Different valid temporal partitions of the same streaming dataset can produce materially different rankings and performance numbers for continual learning methods.

WaveMoE: A Wavelet-Enhanced Mixture-of-Experts Foundation Model for Time Series Forecasting
cs.LG 2026-04 unverdicted novelty 6.0

WaveMoE uses a dual-path architecture with aligned time-series and wavelet tokens routed through shared experts to improve forecasting performance on diverse benchmarks.

Zero-shot Multivariate Time Series Forecasting Using Tabular Prior Fitted Networks
cs.LG 2026-04 unverdicted novelty 6.0

A framework recasts multivariate time series forecasting as scalar regression problems that tabular prior-fitted networks can solve zero-shot while addressing inter-channel interactions.

MICA: Multivariate Infini Compressive Attention for Time Series Forecasting
cs.LG 2026-04 unverdicted novelty 6.0

MICA adds linearly scaling compressive cross-channel attention to Transformers, cutting average forecast error by 5.4% and ranking first among multivariate baselines.

MICA: Multivariate Infini Compressive Attention for Time Series Forecasting
cs.LG 2026-04 unverdicted novelty 6.0

MICA adapts infini compressive attention to the channel dimension, enabling scalable cross-channel dependencies in Transformers and cutting forecast error by 5.4% on average versus channel-independent baselines.

Dynamic Linear Coregionalization for Realistic Synthetic Multivariate Time Series
cs.LG 2026-04 unverdicted novelty 6.0

DynLMC creates synthetic time series data with dynamic inter-channel correlations that improve zero-shot forecasting in foundation models across multiple benchmarks.

Dynamic Linear Coregionalization for Realistic Synthetic Multivariate Time Series
cs.LG 2026-04 unverdicted novelty 6.0

DynLMC creates synthetic multivariate time series with dynamic inter-channel correlations that improve zero-shot forecasting performance when used to fine-tune foundation models across nine benchmarks.

A Foundation Model for Instruction-Conditioned In-Context Time Series Tasks
cs.LG 2026-03 unverdicted novelty 6.0

iAmTime is a time-series foundation model that uses instruction-conditioned in-context learning from demonstrations to perform zero-shot adaptation on forecasting, imputation, classification, and related tasks.

Investigating simple target-covariate relationships for Chronos-2 and TabPFN-TS
cs.LG 2026-05 unverdicted novelty 5.0

TabPFN-TS captures simple target-covariate relationships more effectively than Chronos-2 in controlled experiments, especially for short horizons.

Heterogeneous Scientific Foundation Model Collaboration
cs.AI 2026-04 unverdicted novelty 5.0

Eywa enables language-based agentic AI systems to collaborate with specialized scientific foundation models for improved performance on structured data tasks.

Assessing the Performance-Efficiency Trade-off of Foundation Models in Probabilistic Electricity Price Forecasting
cs.LG 2026-04 unverdicted novelty 5.0

Foundation models slightly outperform task-specific models on probabilistic electricity price forecasts but the gap narrows or reverses with extra features or few-shot adaptation, showing that efficiency often outweig...

Thermal-GEMs: Generalized Models for Building Thermal Dynamics
eess.SY 2026-04 unverdicted novelty 5.0

Multi-source transfer learning for building thermal dynamics yields up to 63% lower forecasting errors than single-source models and outperforms time series foundation models when pretrained on 16-32 buildings over one year.

Non-Stationarity in the Embedding Space of Time Series Foundation Models
cs.LG 2026-04 unverdicted novelty 5.0

Embedding spaces of time series foundation models make mean shifts, variance changes, and trends linearly detectable, but detection degrades smoothly with shift strength and shows model-specific failure modes.

A Foundation Model for Instruction-Conditioned In-Context Time Series Tasks
cs.LG 2026-03 unverdicted novelty 5.0

iAmTime is a hierarchical transformer-based time series foundation model that uses semantic tokens and instruction-conditioned prompts to infer tasks from demonstrations, achieving improved zero-shot performance on fo...

Don't Learn the Shape: Forecasting Periodic Time Series by Rank-1 Decomposition
cs.LG 2026-05 unverdicted novelty 4.0

A frozen average of the last two cycles matches or exceeds eight shape-learning alternatives on 97 GIFT-Eval configurations for periodic time series forecasting.

RACF: A Resilient Autonomous Car Framework with Object Distance Correction
cs.RO 2026-04 unverdicted novelty 4.0

RACF corrects inconsistent depth camera distance estimates in autonomous vehicles using LiDAR and kinematic redundancy, achieving up to 35% RMSE reduction and better braking in tests on a Quanser QCar 2 platform.

Preliminary Insights in Chronos Frequency Data Understanding and Reconstruction
cs.LG 2026-05 unverdicted novelty 3.0

Chronos encodes frequency content in decoder representations with quality that varies across the spectrum, as revealed by minimum description length probes on sinusoid inputs.

Challenges and opportunities for AI to help deliver fusion energy
physics.plasm-ph 2026-03 unverdicted novelty 2.0

AI offers opportunities to advance fusion energy R&D but requires responsible practices and expert collaborations to overcome its inherent challenges.