Toto 2.0: Time Series Forecasting Enters the Scaling Era

Ameet Talwalkar; Chenghao Liu; Chris Lettieri; David Asker; Eden Belouadah; Emaad Khwaja; Enguerrand Paquin; Gerald Woo; Guillaume Jarry; Marc Cenac

arxiv: 2605.20119 · v2 · pith:T3KMXW7Pnew · submitted 2026-05-19 · 💻 cs.LG · cs.AI

Emaad Khwaja, Chris Lettieri, Gerald Woo, Eden Belouadah, Marc Cenac, Guillaume Jarry, Enguerrand Paquin, Xunyi Zhao, Viktoriya Zhukov, Othmane Abou-Amal, Chenghao Liu, Ameet Talwalkar, David Asker

Pith reviewed 2026-05-20 06:37 UTC · model grok-4.3

keywords: time series forecasting · foundation models · model scaling · forecasting benchmarks · open weights · observability

The pith

Time series foundation models improve forecast quality as they scale from 4 million to 2.5 billion parameters under one fixed training recipe.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that time series forecasting benefits from scaling in the same way other foundation model domains do. A consistent training approach yields steady accuracy gains across five model sizes without needing adjustments for each scale. This matters because it opens the door to larger models delivering better predictions on real-world tasks like observability and general forecasting. The authors back the claim with new state-of-the-art results on three benchmarks, including one built to resist data contamination, and release the models openly.

Core claim

A single training recipe produces reliable forecast-quality improvements from 4M to 2.5B parameters in time series foundation models, as shown by the Toto 2.0 family setting new state of the art on BOOM, GIFT-Eval, and the contamination-resistant TIME benchmark.

What carries the argument

The u-muP hyperparameter transfer pipeline that lets the same architecture, data handling, and training steps work across model sizes without per-size retuning.
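The general idea of hyperparameter transfer can be illustrated with a short sketch. It shows the standard muP rule for Adam, in which a hidden-layer learning rate tuned on a narrow proxy model is scaled inversely with width. This is an assumption made for illustration, not the u-muP pipeline the paper describes, and the function name, base width, and base learning rate are hypothetical.

```python
def transfer_hidden_lr(base_lr: float, base_width: int, width: int) -> float:
    """Scale a hidden-layer Adam learning rate tuned at base_width to a new width.

    Assumption: standard muP rule (learning rate proportional to 1 / width).
    u-muP and the Toto 2.0 recipe may use different scaling rules.
    """
    if base_width <= 0 or width <= 0:
        raise ValueError("widths must be positive")
    return base_lr * base_width / width


# Hypothetical values: tune once on a narrow proxy, reuse on wider models.
base_lr, base_width = 3e-3, 256
for width in (256, 1024, 4096):
    print(width, transfer_hidden_lr(base_lr, base_width, width))
```

The point of such a rule is that the expensive hyperparameter search happens only at the smallest width; larger models inherit their settings by formula rather than by retuning.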

If this is right

Forecast accuracy keeps rising with added parameters when the recipe stays fixed.
The same design choices support deployment at any scale from small to very large models.
Open release of five checkpoints lets users test scaling directly on their own data.
New benchmarks confirm gains hold even when contamination is controlled for.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Practitioners could tune hyperparameters once at small scale and then train larger models for harder problems without retuning at each size.
The approach might extend to related tasks such as time series classification or anomaly detection.
Larger models could reduce the need for task-specific fine-tuning in production forecasting systems.
Testing on longer horizons or multivariate series would show whether the scaling pattern continues.

Load-bearing premise

The observed forecast gains come from the models learning general patterns rather than from test data leaking into training or from overfitting to the specific benchmarks.

What would settle it

Results on a fresh, held-out time series forecasting dataset where the 2.5B model performs no better than the 4M model or falls short of prior non-scaled baselines.
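A check of this kind could be sketched as below. The choice of MASE (mean absolute scaled error) as the metric is an assumption made for illustration, and the function names and toy numbers are hypothetical; no real model outputs are involved.

```python
from statistics import mean


def mase(actual, forecast, history, season=1):
    """Forecast MAE divided by the in-sample MAE of a seasonal-naive forecast."""
    naive_errors = [
        abs(history[i] - history[i - season]) for i in range(season, len(history))
    ]
    scale = mean(naive_errors)
    if scale == 0:
        raise ValueError("history is constant; MASE is undefined")
    return mean(abs(a - f) for a, f in zip(actual, forecast)) / scale


def larger_model_wins(small_scores, large_scores):
    """True if the large model has the lower mean MASE across held-out series."""
    return mean(large_scores) < mean(small_scores)


# Toy held-out series with made-up forecasts from a "small" and a "large" model.
history, actual = [1, 2, 3, 4, 5], [6, 7]
small = mase(actual, [5, 5], history)
large = mase(actual, [6, 6.5], history)
print(small, large, larger_model_wins([small], [large]))
```

If `larger_model_wins` came back false across a sufficiently broad set of fresh series, that would count against the scaling claim.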

Original abstract

We show that time series foundation models scale: a single training recipe produces reliable forecast-quality improvements from 4M to 2.5B parameters. We release Toto 2.0, a family of five open-weights forecasting models trained under this recipe. The Toto 2.0 family sets a new state of the art on three forecasting benchmarks: BOOM, our observability benchmark; GIFT-Eval, the standard general-purpose benchmark; and the recent contamination-resistant TIME benchmark. This report describes our experimental results and details the design decisions behind Toto 2.0: its architecture and training recipe, training data, and the u-muP hyperparameter transfer pipeline. All five base checkpoints are released under Apache 2.0.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that time series foundation models scale: a single fixed training recipe (including u-muP hyperparameter transfer) produces reliable forecast-quality gains as parameters increase from 4M to 2.5B. The authors release the Toto 2.0 family of five open-weights models, which set new state-of-the-art results on the BOOM observability benchmark, GIFT-Eval general-purpose benchmark, and contamination-resistant TIME benchmark. The paper details the architecture, training recipe, data handling, and experimental results.

Significance. If the scaling holds after isolating parameter count from data volume, this would provide concrete evidence that time series forecasting can enter a scaling regime comparable to language models, with practical value from the open model releases. The emphasis on the TIME benchmark for contamination resistance is a methodological strength that supports the generalization claim.

major comments (2)

[Training recipe and data] Training recipe and data section: the central claim is that a single training recipe yields parameter-only scaling improvements from 4M to 2.5B. However, the manuscript does not report total training tokens or effective dataset size after sampling for each scale point. Without this, gains on BOOM/GIFT-Eval/TIME cannot be isolated from possible joint scaling with data exposure, which directly undermines the 'parameter scaling' interpretation asserted in the abstract.
[Results] Results section: while the abstract claims 'reliable forecast-quality improvements' across scales, no quantitative scaling curves, per-scale token counts, or ablations isolating capacity from data are provided. This leaves the load-bearing claim of reliable parameter-driven improvement only partially supported by the reported benchmarks.
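The kind of scaling curve requested here could be produced with a log-log least-squares fit, sketched below. The data are synthetic placeholders generated from a known power law: only the 4M and 2.5B endpoints come from the paper, while the intermediate sizes, the coefficients, and the function name are hypothetical.

```python
import math


def fit_power_law(params, errors):
    """Least-squares fit of log(error) = log(a) - b * log(params); returns (a, b)."""
    xs = [math.log(p) for p in params]
    ys = [math.log(e) for e in errors]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return math.exp(intercept), -slope


# Synthetic placeholder data, NOT results from the paper.
# Intermediate model sizes are hypothetical; errors follow 2.0 * N ** -0.05.
sizes = [4e6, 2.5e7, 1.5e8, 6e8, 2.5e9]
errors = [2.0 * n ** -0.05 for n in sizes]
a, b = fit_power_law(sizes, errors)
print(f"error ~ {a:.3f} * N^(-{b:.3f})")
```

A fitted exponent `b` reported per benchmark, alongside per-scale token counts, would let readers judge how much of the gain tracks parameter count.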

minor comments (2)

[Abstract] Abstract: the phrase 'reliable forecast-quality improvements' would benefit from a brief quantification (e.g., average MAE reduction or rank improvement) to make the claim more precise.
[Architecture and u-muP pipeline] Notation: the u-muP pipeline is referenced but its exact hyperparameter transfer rules for time-series-specific components (e.g., patch size or horizon) could be clarified with a short equation or table.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive comments on strengthening the evidence for parameter scaling. We respond to each major point below and indicate planned revisions to improve transparency around data exposure and scaling curves.

Point-by-point responses

Referee: Training recipe and data section: the central claim is that a single training recipe yields parameter-only scaling improvements from 4M to 2.5B. However, the manuscript does not report total training tokens or effective dataset size after sampling for each scale point. Without this, gains on BOOM/GIFT-Eval/TIME cannot be isolated from possible joint scaling with data exposure, which directly undermines the 'parameter scaling' interpretation asserted in the abstract.

Authors: We agree that explicit per-scale token counts are required to isolate parameter effects. The training pipeline employed a single fixed data mixture and sampling strategy for all five models, so effective data exposure was held constant while only model capacity varied. In the revised manuscript we will add a table in the Training recipe and data section reporting approximate total tokens processed by each scale (4M through 2.5B), confirming that data volume did not increase with parameter count.
revision: yes

Referee: Results section: while the abstract claims 'reliable forecast-quality improvements' across scales, no quantitative scaling curves, per-scale token counts, or ablations isolating capacity from data are provided. This leaves the load-bearing claim of reliable parameter-driven improvement only partially supported by the reported benchmarks.

Authors: We acknowledge that a scaling curve figure would make the gains more visible. We will add such a plot in the Results section, with performance on BOOM, GIFT-Eval, and TIME shown against parameter count and annotated with the token counts from the new table. A full ablation that independently varies data volume at the 2.5B scale is not feasible given computational limits; the fixed-recipe design with u-muP already holds all non-capacity factors constant, providing the strongest evidence obtainable under our constraints.
revision: partial

standing simulated objections not resolved

A complete ablation study that decouples data volume from parameter count at the largest scale, which is computationally prohibitive.

Circularity Check

0 steps flagged

No circularity: empirical scaling results are self-contained

full rationale

The paper reports direct training and evaluation of models from 4M to 2.5B parameters under a fixed recipe, with results measured on external benchmarks (BOOM, GIFT-Eval, TIME). No derivation chain, equation, or prediction reduces by construction to a fitted quantity defined in terms of itself; the scaling claim rests on observed performance deltas rather than any self-referential ansatz, uniqueness theorem, or renamed empirical pattern. The u-muP pipeline and architecture choices are described as design decisions, not as load-bearing self-citations that collapse the central result. This is the standard non-circular outcome for an empirical scaling study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on standard assumptions from transformer-based sequence modeling and scaling observations established in other domains; the main addition is empirical validation on time series data.

axioms (1)

domain assumption: Transformer-style architectures can be effectively adapted to time series data.
This underpins the model family design described in the abstract.

pith-pipeline@v0.9.0 · 5709 in / 1163 out tokens · 79930 ms · 2026-05-20T06:37:54.480344+00:00 · methodology
