Toto 2.0: Time Series Forecasting Enters the Scaling Era
Pith reviewed 2026-05-20 06:37 UTC · model grok-4.3
The pith
Time series foundation models improve forecast quality as they scale from 4 million to 2.5 billion parameters under one fixed training recipe.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A single training recipe produces reliable forecast-quality improvements from 4M to 2.5B parameters in time series foundation models, as shown by the Toto 2.0 family setting new state of the art on BOOM, GIFT-Eval, and the contamination-resistant TIME benchmark.
What carries the argument
The u-muP hyperparameter transfer pipeline that lets the same architecture, data handling, and training steps work across model sizes without per-size retuning.
If this is right
- Forecast accuracy keeps rising with added parameters when the recipe stays fixed.
- The same design choices support deployment at any scale in the tested range, from 4M to 2.5B parameters.
- Open release of five checkpoints lets users test scaling directly on their own data.
- Results on the contamination-resistant TIME benchmark suggest the gains hold when test-set leakage is controlled for.
Where Pith is reading between the lines
- Practitioners could tune hyperparameters once at small scale and then train larger models for harder problems without retuning at each size.
- The approach might extend to related tasks such as time series classification or anomaly detection.
- Larger models could reduce the need for task-specific fine-tuning in production forecasting systems.
- Testing on longer horizons or multivariate series would show whether the scaling pattern continues.
Load-bearing premise
The observed forecast gains come from the models learning general patterns rather than from test data leaking into training or from overfitting to the specific benchmarks.
What would settle it
Results on a fresh, held-out time series forecasting dataset where the 2.5B model performs no better than the 4M model or falls short of prior non-scaled baselines.
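As a rough sketch of how that check could be run, assume per-series mean absolute errors (MAE) have already been computed on a fresh held-out dataset for the 4M model, the 2.5B model, and a prior baseline. The function name and the toy numbers are hypothetical placeholders, not results from the paper.

```python
from statistics import mean


def scaling_claim_fails(mae_small, mae_large, mae_baseline):
    """Return True if the held-out result would count against the scaling claim.

    Each argument is a list of per-series MAE values on the same held-out
    series (hypothetical inputs; lower is better).
    """
    small = mean(mae_small)
    large = mean(mae_large)
    baseline = mean(mae_baseline)
    no_gain_from_scale = large >= small      # largest model no better than smallest
    worse_than_baseline = large > baseline   # largest model falls short of baseline
    return no_gain_from_scale or worse_than_baseline


# Toy placeholder numbers, for illustration only.
print(scaling_claim_fails([1.0, 2.0], [0.8, 1.5], [0.9, 1.8]))  # False
print(scaling_claim_fails([1.0, 2.0], [1.1, 2.0], [0.9, 1.8]))  # True
```

A real test would likely use a scale-free metric and a significance check across series; plain mean MAE is used here only to keep the decision rule visible.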
Original abstract
We show that time series foundation models scale: a single training recipe produces reliable forecast-quality improvements from 4M to 2.5B parameters. We release Toto 2.0, a family of five open-weights forecasting models trained under this recipe. The Toto 2.0 family sets a new state of the art on three forecasting benchmarks: BOOM, our observability benchmark; GIFT-Eval, the standard general-purpose benchmark; and the recent contamination-resistant TIME benchmark. This report describes our experimental results and details the design decisions behind Toto 2.0: its architecture and training recipe, training data, and the u-muP hyperparameter transfer pipeline. All five base checkpoints are released under Apache 2.0.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that time series foundation models scale: a single fixed training recipe (including u-muP hyperparameter transfer) produces reliable forecast-quality gains as parameters increase from 4M to 2.5B. The authors release the Toto 2.0 family of five open-weights models, which set new state-of-the-art results on the BOOM observability benchmark, GIFT-Eval general-purpose benchmark, and contamination-resistant TIME benchmark. The paper details the architecture, training recipe, data handling, and experimental results.
Significance. If the scaling holds after isolating parameter count from data volume, this would provide concrete evidence that time series forecasting can enter a scaling regime comparable to language models, with practical value from the open model releases. The emphasis on the TIME benchmark for contamination resistance is a methodological strength that supports the generalization claim.
major comments (2)
- [Training recipe and data] Training recipe and data section: the central claim is that a single training recipe yields parameter-only scaling improvements from 4M to 2.5B. However, the manuscript does not report total training tokens or effective dataset size after sampling for each scale point. Without this, gains on BOOM/GIFT-Eval/TIME cannot be isolated from possible joint scaling with data exposure, which directly undermines the 'parameter scaling' interpretation asserted in the abstract.
- [Results] Results section: while the abstract states that the recipe yields 'reliable forecast-quality improvements,' no quantitative scaling curves, per-scale token counts, or ablation isolating capacity from data are provided. This leaves the load-bearing claim of reliable parameter-driven improvement only partially supported by the reported benchmarks.
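One way the requested scaling curve could be summarized is a straight-line fit in log-log space, which corresponds to a power law in parameter count. The sketch below assumes that functional form, which the paper does not confirm; `fit_power_law`, the intermediate model sizes, and the error values are all hypothetical.

```python
import math


def fit_power_law(param_counts, errors):
    """Least-squares fit of log(error) = log(a) + b * log(N).

    Returns (a, b); b < 0 means error falls as parameter count N grows.
    """
    xs = [math.log(n) for n in param_counts]
    ys = [math.log(e) for e in errors]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(y_bar - b * x_bar)
    return a, b


# Illustrative sizes: only the 4M and 2.5B endpoints come from the text.
sizes = [4e6, 4e7, 4e8, 2.5e9]
# Synthetic errors generated from a known power law, not paper results.
synthetic_errors = [2.0 * n ** -0.1 for n in sizes]

a, b = fit_power_law(sizes, synthetic_errors)
print(f"a={a:.3f}, exponent b={b:.3f}")
```

With real per-scale benchmark scores in place of the synthetic values, a clearly negative exponent with small residuals would support the "reliable improvement" wording, while a flat or noisy fit would not.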
minor comments (2)
- [Abstract] Abstract: the phrase 'reliable forecast-quality improvements' would benefit from a brief quantification (e.g., average MAE reduction or rank improvement) to make the claim more precise.
- [Architecture and u-muP pipeline] Notation: the u-muP pipeline is referenced but its exact hyperparameter transfer rules for time-series-specific components (e.g., patch size or horizon) could be clarified with a short equation or table.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on strengthening the evidence for parameter scaling. We respond to each major point below and indicate planned revisions to improve transparency around data exposure and scaling curves.
Point-by-point responses
- Referee: Training recipe and data section: the central claim is that a single training recipe yields parameter-only scaling improvements from 4M to 2.5B. However, the manuscript does not report total training tokens or effective dataset size after sampling for each scale point. Without this, gains on BOOM/GIFT-Eval/TIME cannot be isolated from possible joint scaling with data exposure, which directly undermines the 'parameter scaling' interpretation asserted in the abstract.
  Authors: We agree that explicit per-scale token counts are required to isolate parameter effects. The training pipeline employed a single fixed data mixture and sampling strategy for all five models, so effective data exposure was held constant while only model capacity varied. In the revised manuscript we will add a table in the Training recipe and data section reporting approximate total tokens processed by each scale (4M through 2.5B), confirming that data volume did not increase with parameter count.
  Revision: yes
- Referee: Results section: while the abstract states that the recipe yields 'reliable forecast-quality improvements,' no quantitative scaling curves, per-scale token counts, or ablation isolating capacity from data are provided. This leaves the load-bearing claim of reliable parameter-driven improvement only partially supported by the reported benchmarks.
  Authors: We acknowledge that a scaling curve figure would make the gains more visible. We will add such a plot in the Results section, with performance on BOOM, GIFT-Eval, and TIME shown against parameter count and annotated with the token counts from the new table. A full ablation that independently varies data volume at the 2.5B scale is not feasible given computational limits; the fixed-recipe design with u-muP already holds all non-capacity factors constant, providing the strongest evidence obtainable under our constraints.
  Revision: partial
- A complete ablation study that decouples data volume from parameter count at the largest scale, which is computationally prohibitive.
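The per-scale token table discussed in the rebuttal amounts to simple bookkeeping. A minimal sketch follows, assuming a "token" means one input patch and that every sample has a fixed length; the config keys and all numbers are hypothetical placeholders rather than values from the paper.

```python
def total_tokens(steps, batch_size, tokens_per_sample):
    """Tokens processed during training under a fixed-length sampling scheme."""
    return steps * batch_size * tokens_per_sample


# Hypothetical per-scale training configs; every value is a placeholder.
configs = {
    "4M": {"steps": 100_000, "batch_size": 256, "tokens_per_sample": 512},
    "2.5B": {"steps": 100_000, "batch_size": 256, "tokens_per_sample": 512},
}

exposure = {name: total_tokens(**cfg) for name, cfg in configs.items()}

# Data exposure is "held constant" only if every scale saw the same total.
data_held_constant = len(set(exposure.values())) == 1
print(exposure, data_held_constant)
```

If the real totals differ across scales, the comparison would mix capacity with data exposure, which is the confound the referee raises.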
Circularity Check
No circularity: empirical scaling results are self-contained
Full rationale
The paper reports direct training and evaluation of models from 4M to 2.5B parameters under a fixed recipe, with results measured on external benchmarks (BOOM, GIFT-Eval, TIME). No derivation chain, equation, or prediction reduces by construction to a fitted quantity defined in terms of itself; the scaling claim rests on observed performance deltas rather than any self-referential ansatz, uniqueness theorem, or renamed empirical pattern. The u-muP pipeline and architecture choices are described as design decisions, not as load-bearing self-citations that collapse the central result. This is the standard non-circular outcome for an empirical scaling study.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Transformer-style architectures can be effectively adapted to time series data
Forward citations
Cited by 1 Pith paper
- Falcon-X: A Time Series Foundation Model for Heterogeneous Multivariate Modeling
  Falcon-X introduces a latent prototype space with Unified Prototype Diff-Attention and Latent Entity Attention for heterogeneous multivariate time series forecasting.