Fidel-TS: A High-Fidelity Multimodal Benchmark for Time Series Forecasting

Qiang Xu; Wanxu Cai; Xilin Dai; Zhaorong Deng; Zhijian Xu

arxiv: 2509.24789 · v4 · submitted 2025-09-29 · 💻 cs.LG · stat.ML

Zhijian Xu, Wanxu Cai, Xilin Dai, Zhaorong Deng, Qiang Xu

Pith reviewed 2026-05-18 11:43 UTC · model grok-4.3

keywords: time series forecasting, multimodal benchmark, data leakage, model evaluation, high-fidelity data, forecasting models, LLMs

The pith

Fidel-TS introduces a large-scale benchmark that removes contamination and leakage problems distorting time series forecasting evaluations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing benchmarks for time series forecasting are compromised by pre-training data contamination in unimodal designs and by temporal plus description leakage in multimodal ones. The paper formalizes three core principles for high-fidelity work: data sourcing integrity, leak-free design, and structural clarity. Fidel-TS is presented as a new large-scale benchmark constructed according to these principles. Experiments on the benchmark expose limitations in earlier datasets and produce different performance patterns for unimodal models, multimodal models, and LLMs. A sympathetic reader would care because cleaner evaluations can change which techniques appear effective and how progress is measured.

Core claim

We introduce Fidel-TS, a new large-scale benchmark built from these principles. Our experiments reveal the limitations of prior benchmarks and the potential discrepancies in model evaluation, providing new insights into multiple existing unimodal and multimodal forecasting models and LLMs across various evaluation tasks.

What carries the argument

Fidel-TS benchmark, constructed to enforce data sourcing integrity, leak-free design, and structural clarity for multimodal time series data.

If this is right

Prior benchmarks have produced inflated performance estimates for forecasting models.
Unimodal and multimodal approaches plus LLMs display different relative strengths under leak-free conditions.
Evaluation discrepancies appear across multiple forecasting tasks when contamination is removed.
Progress in the field can be tracked more reliably with benchmarks that avoid temporal and descriptive leaks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Rankings of current forecasting models may shift once they are re-tested on uncontaminated data.
The same sourcing and leak-prevention rules could be used to rebuild benchmarks in adjacent areas such as time-series classification.
Model designers may need to develop methods that do not depend on information that would be unavailable in a truly leak-free setting.

Load-bearing premise

The principles of data sourcing integrity, leak-free design, and structural clarity can be realized in Fidel-TS in a way that genuinely removes pre-training contamination, temporal leakage, and description leakage.

What would settle it

Discovery of pre-training data overlap or future-information leakage inside the Fidel-TS collection would show the benchmark fails to meet its own fidelity standards.
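One way such a check for future-information leakage could look is sketched below. This is a minimal illustration, not the authors' procedure: the `TextRecord` type, the `find_future_leaks` helper, and the rule that a text must be published strictly before the forecast origin are all assumptions introduced here.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TextRecord:
    """Hypothetical container for one textual item paired with a series."""
    published_at: datetime  # when the text became available
    body: str


def find_future_leaks(records, forecast_origin):
    """Return records that would not yet exist at forecast time.

    Assumption: a text is usable only if published strictly before
    the forecast origin.
    """
    return [r for r in records if r.published_at >= forecast_origin]


origin = datetime(2024, 1, 10)
records = [
    TextRecord(datetime(2024, 1, 8), "Maintenance scheduled for next week."),
    TextRecord(datetime(2024, 1, 12), "Load dropped sharply after the outage."),
]
leaks = find_future_leaks(records, origin)
for leak in leaks:
    print(leak.published_at.date(), leak.body)
```

A non-empty result for any forecast window would be the kind of evidence described above. A timestamp check alone would not catch descriptions that were written earlier but still summarize future values.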

Original abstract

The evaluation of time series forecasting models is hindered by a lack of high-quality benchmarks, leading to overestimated assessments of progress. Existing datasets suffer from issues ranging from small-scale, low-frequency, pre-training data contamination in unimodal designs to the temporal and description leakage prevalent in early multimodal designs. To address this, we formalize the core principles of high-fidelity benchmarking, focusing on data sourcing integrity, leak-free design, and structural clarity. We introduce Fidel-TS, a new large-scale benchmark built from these principles. Our experiments reveal the limitations of prior benchmarks and the potential discrepancies in model evaluation, providing new insights into multiple existing unimodal and multimodal forecasting models and LLMs across various evaluation tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Fidel-TS claims to fix leakage in time series benchmarks but the abstract supplies no construction details to support that.

The main thing to know about this paper is that it introduces a new benchmark called Fidel-TS meant to provide more reliable evaluations for time series forecasting models and multimodal setups by fixing leakage and contamination problems in existing datasets. The abstract makes a case for it but gives almost no supporting details.

What the paper does is formalize three principles for high-fidelity benchmarking: data sourcing integrity, leak-free design, and structural clarity. It then presents Fidel-TS as a large-scale benchmark constructed according to those rules. The authors also describe experiments that compare various unimodal and multimodal forecasting models along with LLMs, claiming these reveal limitations in how prior benchmarks assess performance.

This approach has some merit because leakage issues are a known headache in the area. Pre-training contamination can make models look better than they are, and temporal or description leaks in multimodal data can create unfair advantages. Pointing that out and trying to build something cleaner is useful work on its own.

The weak part is the complete absence of construction specifics. There is no information on data sources, how temporal splits were made to avoid leakage, what methods were used to check for overlaps or contamination, or how the structural clarity was ensured. The experiments are referenced but not outlined, so it's impossible to evaluate whether the discrepancies found are meaningful or artifacts. This matches the stress-test concern exactly.

For readers working on time series models or LLM-based forecasting, this could be worth a look if the full paper fills in the gaps. It might spark useful conversations about evaluation standards.

I would send this to peer review. The idea addresses a real problem in the subfield, and with proper methods and verification steps added, it could be a solid contribution. Right now the evidence is too thin for a strong verdict, but the topic warrants referee attention.

Referee Report

1 major / 0 minor

Summary. The paper identifies limitations in existing time series forecasting benchmarks, including small scale, low frequency, pre-training contamination in unimodal designs, and temporal and description leakage in multimodal ones. It formalizes principles for high-fidelity benchmarking focused on data sourcing integrity, leak-free design, and structural clarity. The authors introduce Fidel-TS, a new large-scale benchmark constructed according to these principles, and report experiments demonstrating the shortcomings of prior benchmarks along with new insights into the performance of various unimodal and multimodal forecasting models and LLMs.

Significance. If Fidel-TS successfully implements leak-free and high-fidelity properties as claimed, the benchmark could substantially advance the field by providing a more trustworthy standard for evaluating time series forecasting models, reducing the risk of inflated performance estimates and offering clearer comparisons across different modeling paradigms including LLMs.

major comments (1)

Abstract: The assertion that Fidel-TS is 'built from these principles' and eliminates pre-training contamination, temporal leakage, and description leakage lacks any supporting details on data collection, temporal partitioning, overlap detection, or verification steps. This is a load-bearing issue for the central claim, as the abstract supplies no evidence that the formalized principles have been realized in a way that genuinely addresses the problems identified in prior benchmarks.
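To make the "overlap detection" item concrete, here is a rough sketch of one possible approach: hashing fixed-length windows of a candidate series and counting how many also occur in a reference corpus series. The function names, the window length, and the rounding precision are hypothetical choices for illustration. The paper's actual verification method is not described in the abstract.

```python
import hashlib


def window_hashes(values, window, decimals=3):
    """Hash every length-`window` subsequence after rounding the values."""
    hashes = set()
    for i in range(len(values) - window + 1):
        chunk = ",".join(f"{v:.{decimals}f}" for v in values[i:i + window])
        hashes.add(hashlib.sha256(chunk.encode()).hexdigest())
    return hashes


def overlap_fraction(candidate, reference, window=4):
    """Fraction of candidate windows that also appear in the reference."""
    cand = window_hashes(candidate, window)
    if not cand:
        return 0.0  # series shorter than one window
    ref = window_hashes(reference, window)
    return len(cand & ref) / len(cand)


reference = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
candidate = [3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
print(overlap_fraction(candidate, reference))
```

Exact-match hashing of this kind would only flag verbatim copies. Rescaled, resampled, or noise-perturbed duplicates would pass unnoticed, so a real audit would presumably need something more tolerant.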

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting this critical aspect of our abstract. The concern is well-taken, as the abstract must more clearly evidence that the claimed principles have been operationalized. We address the point below and will revise accordingly.

Referee: Abstract: The assertion that Fidel-TS is 'built from these principles' and eliminates pre-training contamination, temporal leakage, and description leakage lacks any supporting details on data collection, temporal partitioning, overlap detection, or verification steps. This is a load-bearing issue for the central claim, as the abstract supplies no evidence that the formalized principles have been realized in a way that genuinely addresses the problems identified in prior benchmarks.

Authors: We agree that the current abstract is too concise to convey the concrete realization steps. The manuscript body details the construction process, including sourcing from verified public repositories, explicit temporal hold-out splits with no future leakage, automated and manual overlap detection against common pre-training corpora, and cross-checks for textual description leakage in the multimodal subset. To directly address the referee's concern, we will revise the abstract to incorporate a brief but explicit clause summarizing these measures (e.g., 'constructed via contamination-checked sourcing, strict temporal partitioning, and leakage-verification protocols').

revision: yes
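For readers unfamiliar with the term, a strict temporal hold-out split of the kind the simulated rebuttal mentions might be sketched as follows. The helper names and the single-cutoff rule are assumptions made for illustration and are not taken from the paper.

```python
from datetime import datetime


def temporal_split(rows, cutoff):
    """Split (timestamp, value) rows: train before the cutoff, test from it on."""
    ordered = sorted(rows, key=lambda r: r[0])
    train = [r for r in ordered if r[0] < cutoff]
    test = [r for r in ordered if r[0] >= cutoff]
    return train, test


def check_no_temporal_overlap(train, test):
    """Raise if any training timestamp is not strictly earlier than all test ones."""
    if train and test:
        latest_train = max(t for t, _ in train)
        earliest_test = min(t for t, _ in test)
        if latest_train >= earliest_test:
            raise ValueError("train and test periods overlap")


rows = [
    (datetime(2024, 3, 1), 12.5),
    (datetime(2024, 1, 1), 10.0),
    (datetime(2024, 2, 1), 11.2),
]
train, test = temporal_split(rows, cutoff=datetime(2024, 2, 15))
check_no_temporal_overlap(train, test)
print(len(train), len(test))
```

Splitting on a timestamp rather than shuffling rows is what keeps future observations out of the training period. Any paired text would need the same cutoff applied to its publication time.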

Circularity Check

0 steps flagged

No circularity: benchmark introduction has no derivation chain

The paper's abstract presents no equations, predictions, fitted parameters, or first-principles derivations. It simply formalizes principles and states that Fidel-TS is built from them, without any reduction of results to inputs by construction, self-citations, or renamings. No load-bearing steps exist that could be circular; the contribution is the benchmark creation itself, which is self-contained and independent of any internal fitting or self-referential logic.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on the domain assumption that existing benchmarks contain the listed contamination and leakage problems; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

domain assumption: Existing datasets suffer from issues ranging from small-scale, low-frequency, pre-training data contamination in unimodal designs to the temporal and description leakage prevalent in early multimodal designs.
Directly stated in the abstract as the motivation for the new benchmark.

pith-pipeline@v0.9.0 · 5626 in / 1283 out tokens · 47108 ms · 2026-05-18T11:43:16.674415+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

TS-Arena -- A Live Forecast Pre-Registration Platform
cs.LG 2025-12 conditional novelty 7.0

TS-Arena is a live pre-registration platform that evaluates time series forecasts on future data streams to eliminate information leakage.

Toto 2.0: Time Series Forecasting Enters the Scaling Era
cs.LG 2026-05 unverdicted novelty 6.0

Toto 2.0 is a family of open time series foundation models that demonstrates reliable scaling and sets new state-of-the-art results on three forecasting benchmarks.