Exploring Accuracy Law for Deep Time Series Forecasters: An Empirical Study

Haixu Wu; Jianmin Wang; Mingsheng Long; Shiyu Wang; Yang Xiang; Yong Liu; Yuchen Fang; Yuezhou Ma; Yuxuan Wang; Zhou Ye

arxiv: 2510.02729 · v2 · submitted 2025-10-03 · 💻 cs.LG

Yuxuan Wang, Haixu Wu, Yuezhou Ma, Yuchen Fang, Ziyi Zhang, Yong Liu, Shiyu Wang, Zhou Ye, Yang Xiang, Jianmin Wang, Mingsheng Long

Pith reviewed 2026-05-18 10:30 UTC · model grok-4.3

classification 💻 cs.LG

keywords: deep time series forecasting; accuracy law; window-wise pattern complexity; empirical relationship; minimum forecasting error; univariate time series; time series foundation models; performance upper bound

The pith

Deep time series forecasters exhibit an accuracy law linking minimum error to window-wise pattern complexity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how to estimate the performance upper bound for deep time series forecasters, noting that classical series-wide predictability measures fall short. It shifts focus to window-wise properties because deep models operate in a sequence-to-sequence manner. A quantitative measure of pattern complexity within each window is introduced to capture forecasting difficulty. Statistical analysis across more than 4700 trained models reveals a consistent empirical link between this complexity and the lowest achievable error, which the authors call the accuracy law. If valid, the law clarifies why benchmark gains remain marginal and supplies practical ways to detect saturated tasks or refine training for foundation models.
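To make "window-wise" concrete, the sketch below shows how a sequence-to-sequence forecaster typically sees a univariate series: as overlapping (input, target) window pairs. It is a minimal illustration only. The function name `make_windows` and the parameters `lookback`, `horizon`, and `stride` are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def make_windows(
    series: List[float], lookback: int, horizon: int, stride: int = 1
) -> List[Tuple[List[float], List[float]]]:
    """Split a univariate series into (input window, target window) pairs."""
    if lookback <= 0 or horizon <= 0 or stride <= 0:
        raise ValueError("lookback, horizon and stride must be positive")
    pairs = []
    # Last start index that still leaves room for a full input and target.
    last_start = len(series) - lookback - horizon
    for start in range(0, last_start + 1, stride):
        x = series[start : start + lookback]
        y = series[start + lookback : start + lookback + horizon]
        pairs.append((x, y))
    return pairs


# Toy series 0.0 .. 9.0 (illustrative values only).
series = [float(t) for t in range(10)]
pairs = make_windows(series, lookback=4, horizon=2)
print(len(pairs))   # 5
print(pairs[0])     # ([0.0, 1.0, 2.0, 3.0], [4.0, 5.0])
```

Any window-wise statistic, including a complexity score, would then be computed per pair rather than once for the whole series.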

Core claim

Through rigorous statistical analyses over more than 4700 newly trained deep forecasting models, the authors discover a consistent empirical relationship between the minimum attainable forecasting error of deep models and the complexity of window-wise series patterns, which is termed the accuracy law.

What carries the argument

The accuracy law, an empirical relationship connecting minimum forecasting error to a quantitative measure of window-wise pattern complexity under the sequence-to-sequence paradigm.

If this is right

The law identifies saturated tasks from widely used benchmarks.
It derives an effective training strategy for time series foundation models.
It supplies a concrete method to estimate the performance upper bound of deep time series forecasters.
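One way these consequences could be operationalized is sketched below, assuming the functional form quoted further down this page, MSE≈exp(α·Complexity(x))−1. The value of `ALPHA`, the tolerance `rel_tol`, the task names, and all numbers are placeholders chosen for illustration; none of them come from the paper, and the paper's actual saturation criterion may differ.

```python
import math


def predicted_floor(complexity: float, alpha: float) -> float:
    """Error floor under the assumed form MSE ~ exp(alpha * complexity) - 1."""
    return math.expm1(alpha * complexity)


def is_saturated(
    best_mse: float, complexity: float, alpha: float, rel_tol: float = 0.05
) -> bool:
    """Flag a task whose best observed MSE is within rel_tol of the floor."""
    return best_mse <= predicted_floor(complexity, alpha) * (1.0 + rel_tol)


ALPHA = 0.5  # placeholder coefficient, NOT a value reported by the authors

# Hypothetical tasks: (window-wise complexity, best MSE seen so far).
tasks = {"task_a": (0.8, 0.50), "task_b": (0.8, 0.90)}

flags = {
    name: is_saturated(best, complexity, ALPHA)
    for name, (complexity, best) in tasks.items()
}
print(flags)  # {'task_a': True, 'task_b': False}
```

Under these placeholder numbers the floor is about 0.49, so `task_a` sits within 5% of it while `task_b` still has headroom.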

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Architectures could be tuned to target specific ranges of window-wise complexity to approach the error floor more closely.
The relationship might inform choices of window length or preprocessing to reduce effective complexity on hard series.
Applying the law across new datasets could help select benchmarks that remain challenging rather than already saturated.

Load-bearing premise

Forecasting performance depends on window-wise properties and the introduced quantitative measure of pattern complexity accurately reflects the relevant aspects of forecasting difficulty.

What would settle it

Training thousands of additional deep models on fresh univariate series and checking whether the observed minimum errors still align with the predicted relationship to the window-wise complexity measure.

Original abstract

Deep time series forecasting has emerged as a rapidly growing field in recent years. Despite the exponential growth of community interests, progress on standard benchmarks is often limited to marginal improvements. A common consensus of the community is that time series forecasting inherently faces a non-zero error lower bound due to its partially observable and uncertain nature. However, a fundamental question arises: how to estimate the performance upper bound of deep time series forecasters? We delve into univariate time series forecasting, a prevalent forecasting paradigm spanning traditional statistical models to advanced time series foundation models. Going beyond classical series-wise predictability metrics, we realize that the forecasting performance is highly related to window-wise properties due to the sequence-to-sequence forecasting paradigm of deep time series models and introduce a quantitative measurement of window-wise pattern complexity. Through rigorous statistical analyses over more than 4700 newly trained deep forecasting models, we discover a consistent empirical relationship between the minimum attainable forecasting error of deep models and the complexity of window-wise series patterns, which is termed the accuracy law. We further demonstrate that this empirical finding successfully guides us to identify saturated tasks from widely used benchmarks and derive an effective training strategy for time series foundation models, offering valuable insights for future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

They trained thousands of deep forecasters and report an empirical accuracy law tying minimum error to a new window-wise complexity score.

The authors trained over 4700 deep time series models and observed a repeatable link between the lowest error achievable on a window and their measure of window-wise pattern complexity, which they label the accuracy law. This shifts focus from whole-series predictability to local window properties, which matches how seq2seq deep models actually operate. They then apply the finding to flag saturated tasks in standard benchmarks and to suggest a training tweak for time series foundation models.

The scale of the runs is the clearest strength, and the downstream uses show they tried to make the result actionable rather than purely descriptive. The experiments give a concrete way to think about performance ceilings on specific windows.

The soft spot is whether the new complexity metric actually isolates something beyond simpler window statistics. If it largely tracks local variance, lag-1 autocorrelation, or spectral entropy, then the law may not generalize as a distinct lower bound and could be a post-hoc fit to known effects. Direct comparisons to those baselines, plus checks on how windows were selected and whether the relation holds across model families, would clarify this.

The paper is aimed at researchers building or evaluating deep forecasters and foundation models. Anyone working on benchmark design or scaling training for time series would find the empirical angle useful. The model count and the practical extensions give it enough substance to merit a serious referee, even with the open questions on the metric's incremental value. I would send it for peer review and ask specifically for those baseline comparisons and robustness details.

Referee Report

3 major / 2 minor

Summary. The paper claims that through training and analyzing over 4700 deep time series forecasting models, the authors discover a consistent empirical 'accuracy law' relating the minimum attainable forecasting error to a newly introduced quantitative measure of window-wise pattern complexity. This goes beyond classical series-wise predictability metrics, leveraging the seq2seq nature of deep models, and is shown to help identify saturated benchmark tasks and inform training strategies for time series foundation models.

Significance. If the accuracy law is robustly supported and the window-wise complexity metric proves independent of simpler statistics, the work offers a practical empirical tool for bounding deep forecaster performance and improving benchmark curation. The scale of the empirical study (4700+ models) is a positive feature, though the absence of detailed validation against confounding factors reduces its immediate field impact.

major comments (3)

§3.2 (or equivalent methods section introducing the metric): The quantitative measurement of window-wise pattern complexity is presented without explicit comparison or ablation against standard alternatives such as window variance, lag-1 autocorrelation, or spectral entropy. If the metric largely correlates with these proxies, the claimed relationship with minimum attainable error may not isolate pattern complexity as asserted and risks being a post-hoc fit rather than a general law.
§4 (statistical analyses): The support for the accuracy law over 4700 models lacks reported error bars, confidence intervals on the fitted relationship, or robustness checks (e.g., controlling for local variance or testing on held-out datasets). This leaves open whether the observed consistency survives alternative model selections or data splits.
§5 (applications to saturated tasks and foundation model training): The guidance derived from the accuracy law assumes the window-wise complexity captures seq2seq-relevant difficulty independently of simpler statistics; without such controls or out-of-sample validation, the practical utility claims rest on an untested assumption highlighted in the skeptic note.
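The three baseline statistics named in the first major comment are cheap to compute per window. The sketch below gives one plain-Python version of each, as a rough picture of what such an ablation would compare against. The spectral entropy here uses a one-sided periodogram with the mean removed and is normalized to [0, 1]; that is one common convention and an assumption of this sketch, not a definition taken from the paper.

```python
import cmath
import math
from typing import List


def window_variance(x: List[float]) -> float:
    """Population variance of the values in one window."""
    mean = sum(x) / len(x)
    return sum((v - mean) ** 2 for v in x) / len(x)


def lag1_autocorr(x: List[float]) -> float:
    """Lag-1 autocorrelation; returns 0.0 for a constant window."""
    mean = sum(x) / len(x)
    denom = sum((v - mean) ** 2 for v in x)
    if denom == 0.0:
        return 0.0
    num = sum((x[i] - mean) * (x[i + 1] - mean) for i in range(len(x) - 1))
    return num / denom


def spectral_entropy(x: List[float]) -> float:
    """Normalized Shannon entropy of the one-sided power spectrum."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    power = []
    for k in range(1, n // 2 + 1):  # naive DFT, skipping the DC bin
        coef = sum(
            c * cmath.exp(-2j * math.pi * k * t / n)
            for t, c in enumerate(centered)
        )
        power.append(abs(coef) ** 2)
    total = sum(power)
    if total == 0.0 or len(power) < 2:
        return 0.0
    probs = [p / total for p in power if p > 0.0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(power))


alternating = [1.0, -1.0] * 4
print(window_variance(alternating))  # 1.0
print(lag1_autocorr(alternating))    # -0.875
```

A pure sinusoid gives a spectral entropy near 0 and a single impulse gives a value near 1, which is the range a proposed complexity score would need to add information beyond.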

minor comments (2)

Abstract: The phrase 'rigorous statistical analyses' should be accompanied by a brief mention of the specific tests or controls performed to allow readers to assess the claim immediately.
Notation: Ensure the complexity measure is given a clear symbol and formula early in the text, with all parameters defined, to avoid ambiguity when discussing the accuracy law.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major comment below and indicate the revisions we will make to strengthen the empirical support and clarity of the accuracy law.

Referee: §3.2 (or equivalent methods section introducing the metric): The quantitative measurement of window-wise pattern complexity is presented without explicit comparison or ablation against standard alternatives such as window variance, lag-1 autocorrelation, or spectral entropy. If the metric largely correlates with these proxies, the claimed relationship with minimum attainable error may not isolate pattern complexity as asserted and risks being a post-hoc fit rather than a general law.

Authors: We agree that explicit validation against simpler statistics is necessary. Our window-wise pattern complexity metric was motivated by the seq2seq nature of deep forecasters and aims to quantify intra-window pattern variability beyond aggregate series properties. In the revision we will add a dedicated ablation subsection in §3.2 that reports Pearson and partial correlations with window variance, lag-1 autocorrelation, and spectral entropy across the same 4700-model corpus. We will also show that the accuracy-law fit remains significant after controlling for these proxies, thereby demonstrating incremental explanatory power. revision: yes
Referee: §4 (statistical analyses): The support for the accuracy law over 4700 models lacks reported error bars, confidence intervals on the fitted relationship, or robustness checks (e.g., controlling for local variance or testing on held-out datasets). This leaves open whether the observed consistency survives alternative model selections or data splits.

Authors: We accept this criticism. The revised §4 will include bootstrap-derived confidence intervals and standard errors for all fitted accuracy-law parameters. We will further report results after (i) regressing out local window variance, (ii) repeating the analysis on three random 70/30 data splits, and (iii) restricting the model pool to transformer-only architectures. These checks will be presented as supplementary figures and tables. revision: yes
Referee: §5 (applications to saturated tasks and foundation model training): The guidance derived from the accuracy law assumes the window-wise complexity captures seq2seq-relevant difficulty independently of simpler statistics; without such controls or out-of-sample validation, the practical utility claims rest on an untested assumption highlighted in the skeptic note.

Authors: We will leverage the new ablations and controls added to §3.2 and §4 to substantiate the independence claim within §5. For out-of-sample validation we will apply the accuracy law to two additional public benchmarks (not used in the original 4700-model study) and show that the predicted saturation ordering matches observed performance ceilings. These results will be added to the revised Section 5. revision: partial
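The controls promised in these responses (partial correlations and bootstrap confidence intervals) can be pictured with the sketch below. It uses first-order partial correlation and a percentile bootstrap, both standard techniques, on synthetic data built so that two variables are related only through a shared confounder. The variable names and every number are illustrative assumptions; nothing here reproduces the authors' data or their planned analysis.

```python
import math
import random
from typing import List, Tuple


def pearson(a: List[float], b: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb)


def partial_corr(x: List[float], y: List[float], z: List[float]) -> float:
    """Correlation of x and y after linearly controlling for z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz**2) * (1 - ryz**2))


def bootstrap_ci(
    x: List[float], y: List[float], z: List[float],
    n_boot: int = 200, level: float = 0.95, seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the partial correlation."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        stats.append(partial_corr(
            [x[i] for i in idx], [y[i] for i in idx], [z[i] for i in idx]
        ))
    stats.sort()
    lo = stats[int((1 - level) / 2 * n_boot)]
    hi = stats[int((1 + level) / 2 * n_boot) - 1]
    return lo, hi


# Synthetic stand-ins: a standardized confounder drives both other variables.
rng = random.Random(0)
confounder = [rng.gauss(0, 1) for _ in range(400)]
complexity = [c + rng.gauss(0, 1) for c in confounder]
log_error = [c + rng.gauss(0, 1) for c in confounder]

raw = pearson(complexity, log_error)
controlled = partial_corr(complexity, log_error, confounder)
lo, hi = bootstrap_ci(complexity, log_error, confounder)
print(raw, controlled, (lo, hi))
```

In this constructed case the raw correlation is clearly positive while the partial correlation is close to zero, which is the pattern the referee is worried about. A metric with genuine incremental value would keep a clearly nonzero partial correlation whose interval excludes zero.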

Circularity Check

0 steps flagged

No circularity: empirical discovery from new model trainings

The paper's central claim is an observed empirical relationship between minimum attainable forecasting error and window-wise pattern complexity, discovered via statistical analyses on more than 4700 newly trained deep forecasting models. This is presented as a data-driven finding rather than a mathematical derivation or prediction that reduces to its own inputs by construction. The window-wise complexity metric is introduced as a quantitative measurement tied to the seq2seq paradigm, but the accuracy law itself is validated through extensive independent experiments on standard benchmarks, with no self-definitional equations, fitted-input predictions, or load-bearing self-citations that collapse the result. The derivation chain is self-contained as an empirical observation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the newly introduced quantitative measurement of window-wise pattern complexity as the key explanatory variable for forecasting error bounds. This measurement is postulated without independent evidence outside the empirical fits. The domain assumption that seq2seq architecture makes window-wise properties dominant is invoked to justify the approach.

axioms (1)

domain assumption: The forecasting performance is highly related to window-wise properties due to the sequence-to-sequence forecasting paradigm of deep time series models.
basis: Directly stated in the abstract as the motivation for shifting from series-wise to window-wise analysis.

invented entities (1)

window-wise pattern complexity (no independent evidence)
purpose: Quantitative measurement to capture pattern properties in data windows that determine minimum forecasting error.
basis: Introduced as a new metric going beyond classical predictability measures.

pith-pipeline@v0.9.0 · 5771 in / 1532 out tokens · 60632 ms · 2026-05-18T10:30:01.601005+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel

Relation between the paper passage and the cited Recognition theorem: unclear

we discover a consistent empirical relationship between the minimum attainable forecasting error of deep models and the complexity of window-wise series patterns, which is termed the accuracy law... MSE≈exp(α·Complexity(x))−1
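Taking the quoted form MSE≈exp(α·Complexity(x))−1 at face value, it rearranges to log(1 + MSE) ≈ α·Complexity(x), a line through the origin, so α can be estimated by ordinary least squares without an intercept. The sketch below does that on noise-free synthetic points. The names `fit_alpha` and `predict_mse` and the value 0.7 are hypothetical and serve only to check that the fit recovers a known coefficient; the authors' actual fitting procedure is not described on this page.

```python
import math
from typing import List


def predict_mse(complexity: float, alpha: float) -> float:
    """Minimum MSE under the quoted form exp(alpha * complexity) - 1."""
    return math.expm1(alpha * complexity)


def fit_alpha(complexities: List[float], min_mses: List[float]) -> float:
    """Least-squares slope through the origin for log(1 + MSE) vs complexity."""
    targets = [math.log1p(m) for m in min_mses]
    num = sum(c * t for c, t in zip(complexities, targets))
    den = sum(c * c for c in complexities)
    if den == 0.0:
        raise ValueError("need at least one nonzero complexity value")
    return num / den


# Synthetic check: generate points from a known alpha, then recover it.
true_alpha = 0.7  # arbitrary illustrative value
cs = [0.2, 0.5, 1.0, 1.5]
mses = [predict_mse(c, true_alpha) for c in cs]
alpha_hat = fit_alpha(cs, mses)
print(round(alpha_hat, 6))  # 0.7
```

With real measurements the points would scatter around the curve, and the residuals of this fit are what a held-out validation would inspect.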

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LeapTS: Rethinking Time Series Forecasting as Adaptive Multi-Horizon Scheduling
cs.LG · 2026-05 · unverdicted · novelty 7.0

LeapTS reformulates forecasting as adaptive multi-horizon scheduling via hierarchical control and NCDEs, delivering at least 7.4% better performance and 2.6-5.3x faster inference than Transformer baselines while adapt...