Fidel-TS: A High-Fidelity Multimodal Benchmark for Time Series Forecasting
Pith reviewed 2026-05-18 11:43 UTC · model grok-4.3
The pith
Fidel-TS introduces a large-scale benchmark that removes contamination and leakage problems distorting time series forecasting evaluations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce Fidel-TS, a new large-scale benchmark built from these principles. Our experiments reveal the limitations of prior benchmarks and the potential discrepancies in model evaluation, providing new insights into multiple existing unimodal and multimodal forecasting models and LLMs across various evaluation tasks.
What carries the argument
Fidel-TS benchmark, constructed to enforce data sourcing integrity, leak-free design, and structural clarity for multimodal time series data.
If this is right
- Prior benchmarks have produced inflated performance estimates for forecasting models.
- Unimodal and multimodal approaches plus LLMs display different relative strengths under leak-free conditions.
- Evaluation discrepancies appear across multiple forecasting tasks when contamination is removed.
- Progress in the field can be tracked more reliably with benchmarks that avoid temporal and descriptive leaks.
Where Pith is reading between the lines
- Rankings of current forecasting models may shift once they are re-tested on uncontaminated data.
- The same sourcing and leak-prevention rules could be used to rebuild benchmarks in adjacent areas such as time-series classification.
- Model designers may need to develop methods that do not depend on information that would be unavailable in a truly leak-free setting.
Load-bearing premise
The principles of data sourcing integrity, leak-free design, and structural clarity can be realized in Fidel-TS in a way that genuinely removes pre-training contamination, temporal leakage, and description leakage.
What would settle it
Discovery of pre-training data overlap or future-information leakage inside the Fidel-TS collection would show the benchmark fails to meet its own fidelity standards.
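The two failure modes named above can each be probed with a simple check. The sketch below is illustrative only and is not the benchmark's verification procedure: the function names, the fixed-width window fingerprinting, and the `(timestamp, text)` record layout are all assumptions made for this example.

```python
from datetime import datetime
from typing import Iterable, List, Sequence, Set, Tuple


def window_fingerprints(values: Sequence[float], width: int, decimals: int = 3) -> Set[tuple]:
    """Return the set of rounded, fixed-width sliding windows of a series."""
    rounded = [round(v, decimals) for v in values]
    return {tuple(rounded[i:i + width]) for i in range(len(rounded) - width + 1)}


def overlap_ratio(benchmark: Sequence[float], reference: Sequence[float], width: int = 4) -> float:
    """Fraction of benchmark windows that also occur in a reference (e.g. pre-training) series."""
    bench = window_fingerprints(benchmark, width)
    if not bench:
        return 0.0  # series shorter than one window: nothing to compare
    ref = window_fingerprints(reference, width)
    return len(bench & ref) / len(bench)


def leaking_texts(
    texts: Iterable[Tuple[datetime, str]], forecast_start: datetime
) -> List[Tuple[datetime, str]]:
    """Return text records timestamped at or after the start of the forecast horizon."""
    return [(ts, text) for ts, text in texts if ts >= forecast_start]
```

A nonzero `overlap_ratio` against a known pre-training corpus, or a non-empty result from `leaking_texts`, would be the kind of evidence described above. Exact window matching is a deliberately crude choice; it misses rescaled or resampled copies of a series.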
Original abstract
The evaluation of time series forecasting models is hindered by a lack of high-quality benchmarks, leading to overestimated assessments of progress. Existing datasets suffer from issues ranging from small-scale, low-frequency, pre-training data contamination in unimodal designs to the temporal and description leakage prevalent in early multimodal designs. To address this, we formalize the core principles of high-fidelity benchmarking, focusing on data sourcing integrity, leak-free design, and structural clarity. We introduce Fidel-TS, a new large-scale benchmark built from these principles. Our experiments reveal the limitations of prior benchmarks and the potential discrepancies in model evaluation, providing new insights into multiple existing unimodal and multimodal forecasting models and LLMs across various evaluation tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper identifies limitations in existing time series forecasting benchmarks, including small scale, low frequency, pre-training contamination in unimodal designs, and temporal and description leakage in multimodal ones. It formalizes principles for high-fidelity benchmarking focused on data sourcing integrity, leak-free design, and structural clarity. The authors introduce Fidel-TS, a new large-scale benchmark constructed according to these principles, and report experiments demonstrating the shortcomings of prior benchmarks along with new insights into the performance of various unimodal and multimodal forecasting models and LLMs.
Significance. If Fidel-TS successfully implements leak-free and high-fidelity properties as claimed, the benchmark could substantially advance the field by providing a more trustworthy standard for evaluating time series forecasting models, reducing the risk of inflated performance estimates and offering clearer comparisons across different modeling paradigms including LLMs.
Major comments (1)
- Abstract: The assertion that Fidel-TS is 'built from these principles' and eliminates pre-training contamination, temporal leakage, and description leakage lacks any supporting details on data collection, temporal partitioning, overlap detection, or verification steps. This is a load-bearing issue for the central claim, as the abstract supplies no evidence that the formalized principles have been realized in a way that genuinely addresses the problems identified in prior benchmarks.
Simulated Author's Rebuttal
We thank the referee for highlighting this critical aspect of our abstract. The concern is well-taken, as the abstract must more clearly evidence that the claimed principles have been operationalized. We address the point below and will revise accordingly.
Point-by-point responses
- Referee: Abstract: The assertion that Fidel-TS is 'built from these principles' and eliminates pre-training contamination, temporal leakage, and description leakage lacks any supporting details on data collection, temporal partitioning, overlap detection, or verification steps. This is a load-bearing issue for the central claim, as the abstract supplies no evidence that the formalized principles have been realized in a way that genuinely addresses the problems identified in prior benchmarks.
Authors: We agree that the current abstract is too concise to convey the concrete realization steps. The manuscript body details the construction process, including sourcing from verified public repositories, explicit temporal hold-out splits with no future leakage, automated and manual overlap detection against common pre-training corpora, and cross-checks for textual description leakage in the multimodal subset. To directly address the referee's concern, we will revise the abstract to incorporate a brief but explicit clause summarizing these measures (e.g., 'constructed via contamination-checked sourcing, strict temporal partitioning, and leakage-verification protocols').
Revision: yes
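As an illustration of what an explicit temporal hold-out split with no future leakage could look like, here is a minimal sketch. It is not the authors' pipeline: the `Record` layout, the single `cutoff` timestamp, and both function names are hypothetical.

```python
from datetime import datetime
from typing import List, Tuple

Record = Tuple[datetime, float]  # (timestamp, observed value); an assumed layout


def temporal_split(records: List[Record], cutoff: datetime) -> Tuple[List[Record], List[Record]]:
    """Split chronologically: everything before `cutoff` trains, the rest is held out."""
    ordered = sorted(records, key=lambda r: r[0])
    train = [r for r in ordered if r[0] < cutoff]
    test = [r for r in ordered if r[0] >= cutoff]
    return train, test


def check_no_temporal_leak(train: List[Record], test: List[Record]) -> None:
    """Raise ValueError if any training timestamp is not strictly before every test timestamp."""
    if train and test:
        latest_train = max(ts for ts, _ in train)
        earliest_test = min(ts for ts, _ in test)
        if latest_train >= earliest_test:
            raise ValueError(
                f"temporal leak: train ends {latest_train}, test starts {earliest_test}"
            )
```

Running `check_no_temporal_leak` on the output of any split, including one produced by other tooling, makes the no-future-leakage property an enforced invariant rather than a stated intention.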
Circularity Check
No circularity: benchmark introduction has no derivation chain
Full rationale
The paper's abstract presents no equations, predictions, fitted parameters, or first-principles derivations. It simply formalizes principles and states that Fidel-TS is built from them, without any reduction of results to inputs by construction, self-citations, or renamings. No load-bearing steps exist that could be circular; the contribution is the benchmark creation itself, which is self-contained and independent of any internal fitting or self-referential logic.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Existing datasets suffer from issues ranging from small-scale, low-frequency, pre-training data contamination in unimodal designs to the temporal and description leakage prevalent in early multimodal designs.
Forward citations
Cited by 2 Pith papers
- TS-Arena -- A Live Forecast Pre-Registration Platform
TS-Arena is a live pre-registration platform that evaluates time series forecasts on future data streams to eliminate information leakage.
- Toto 2.0: Time Series Forecasting Enters the Scaling Era
Toto 2.0 is a family of open time series foundation models that demonstrates reliable scaling and sets new state-of-the-art results on three forecasting benchmarks.