DistDF: Time-Series Forecasting Needs Joint-Distribution Wasserstein Alignment
Pith reviewed 2026-05-18 02:54 UTC · model grok-4.3
The pith
A joint-distribution Wasserstein discrepancy upper bounds the conditional discrepancy and replaces mean squared error for aligning forecasts with labels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By replacing direct minimization of conditional negative log-likelihood with minimization of a joint-distribution Wasserstein discrepancy, time-series forecasting models can achieve better alignment between forecast and label distributions. The joint discrepancy provably upper-bounds the conditional one and is tractable for gradient descent.
What carries the argument
The joint-distribution Wasserstein discrepancy, which serves as a provable upper bound on the conditional distribution discrepancy and enables gradient-based optimization for distribution alignment in forecasting.
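To make the quantity concrete, the following is a minimal sketch of an empirical joint discrepancy between the point clouds {(x_i, y_i)} and {(x_i, ŷ_i)}. The function names, the toy data, and the brute-force matching solver are assumptions chosen for illustration; this is not the paper's estimator, and unlike the discrepancy described above it is neither differentiable nor usable beyond a handful of samples.

```python
import itertools
import math


def joint_points(histories, sequences):
    """Concatenate each history x_i with its sequence (label or forecast)."""
    return [tuple(x) + tuple(y) for x, y in zip(histories, sequences)]


def empirical_wasserstein(a, b, p=2):
    """Exact W_p between two equal-size, uniformly weighted point clouds.

    Brute force over all one-to-one matchings, so only viable for tiny n.
    """
    assert len(a) == len(b)
    n = len(a)
    best = math.inf
    for perm in itertools.permutations(range(n)):
        cost = sum(math.dist(a[i], b[j]) ** p for i, j in enumerate(perm))
        best = min(best, cost)
    return (best / n) ** (1.0 / p)


# Hypothetical toy data: 4 windows, history length 2, forecast horizon T = 3.
histories = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0), (3.0, 4.0)]
labels = [(2.0, 3.0, 4.0), (3.0, 4.0, 5.0), (4.0, 5.0, 6.0), (5.0, 6.0, 7.0)]
shift = 0.5  # constant forecast error added to every horizon step
forecasts = [tuple(v + shift for v in y) for y in labels]

label_cloud = joint_points(histories, labels)
d_same = empirical_wasserstein(label_cloud, joint_points(histories, labels))
d_shift = empirical_wasserstein(label_cloud, joint_points(histories, forecasts))
print(d_same, d_shift)
```

Because the forecast cloud here is a pure translation of the label cloud, the W_2 value equals the length of the translation vector, shift times the square root of T. A practical training loss would need a differentiable and scalable solver instead of enumeration.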
If this is right
- Forecasting models trained with DistDF exhibit reduced bias from autocorrelation in labels.
- The method integrates seamlessly with existing architectures since it is differentiable.
- It achieves leading performance on standard time-series forecasting benchmarks.
- Diverse models benefit from this alignment approach.
Where Pith is reading between the lines
- This bounding technique could apply to other tasks involving conditional distribution matching with dependent sequences.
- Tighter bounds or alternative discrepancies might be explored for even better estimation from finite samples.
Load-bearing premise
The joint Wasserstein discrepancy estimated from finite observations remains a sufficiently tight and useful upper bound on the true conditional discrepancy even when the time series exhibit strong autocorrelation.
What would settle it
Compare the forecast accuracy of DistDF against the standard direct forecast on time-series datasets with strong autocorrelation; if DistDF shows no improvement there, the practical value of the upper bound would be in question.
Original abstract
Training time-series forecasting models requires aligning the conditional distribution of model forecasts with that of the label sequence. The standard direct forecast (DF) approach resorts to minimizing the conditional negative log-likelihood, typically estimated by the mean squared error. However, this estimation proves biased when the label sequence exhibits autocorrelation. In this paper, we propose DistDF, which achieves alignment by minimizing a distributional discrepancy between the conditional distributions of forecast and label sequences. Since such conditional discrepancies are difficult to estimate from finite time-series observations, we introduce a joint-distribution Wasserstein discrepancy for time-series forecasting, which provably upper bounds the conditional discrepancy of interest. The proposed discrepancy is tractable, differentiable, and readily compatible with gradient-based optimization. Extensive experiments show that DistDF improves diverse forecasting models and achieves leading performance. Code is available at https://anonymous.4open.science/r/DistDF-F66B.
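A small worked example can illustrate why MSE may misjudge errors when label steps are correlated. The sketch below assumes a Gaussian conditional distribution with a known 2×2 covariance (horizon T = 2, unit variances, correlation rho); that Gaussian setup, the value of rho, and the function names are assumptions for illustration and do not reproduce the paper's bias expression.

```python
def mse(residual):
    """Mean squared error of a residual vector y - y_hat."""
    return sum(r * r for r in residual) / len(residual)


def gaussian_quadratic_form(residual, rho):
    """r^T Sigma^{-1} r for Sigma = [[1, rho], [rho, 1]].

    This is the data-dependent part of a Gaussian negative log-likelihood.
    """
    r1, r2 = residual
    return (r1 * r1 - 2.0 * rho * r1 * r2 + r2 * r2) / (1.0 - rho * rho)


rho = 0.8  # assumed correlation between the two label steps
same_sign = (1.0, 1.0)       # errors that move together, like the labels do
opposite_sign = (1.0, -1.0)  # errors that move against the label correlation

mse_same, mse_opp = mse(same_sign), mse(opposite_sign)
q_same = gaussian_quadratic_form(same_sign, rho)
q_opp = gaussian_quadratic_form(opposite_sign, rho)
print("MSE:", mse_same, mse_opp)
print("Likelihood term:", q_same, q_opp)
```

Both residuals have the same MSE, yet the likelihood term is 2/(1 + rho) for the first and 2/(1 - rho) for the second, so the two objectives rank these errors differently once rho is nonzero.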
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DistDF, a method for time-series forecasting that aligns the conditional distributions of model forecasts and label sequences by minimizing a joint-distribution Wasserstein discrepancy instead of the standard conditional negative log-likelihood (typically MSE). It claims this joint discrepancy provably upper-bounds the conditional discrepancy of interest, remains tractable and differentiable for gradient-based optimization, and yields empirical improvements over direct forecasting baselines across diverse models.
Significance. A valid and tight upper bound would address a known bias in MSE-based training under autocorrelation, offering a principled distributional alignment objective for time-series models. The reported empirical gains and open code are strengths if the theoretical justification holds under realistic dependence structures.
major comments (2)
- [§3] §3 (theoretical development of the bound): The proof that the joint Wasserstein discrepancy upper-bounds the conditional discrepancy does not explicitly incorporate mixing rates, effective sample size, or dependence structure. Standard Wasserstein concentration inequalities assume i.i.d. or weakly dependent observations; without such controls, the finite-sample estimate can be biased and the bound arbitrarily loose for strongly autocorrelated series, directly undermining the proxy justification for optimization.
- [Experiments] Experimental section (results tables): The reported performance gains are not accompanied by ablations that vary autocorrelation strength or effective sample size while measuring bound tightness (e.g., via estimated conditional vs. joint discrepancy gap). This leaves open whether improvements stem from the claimed alignment or from other optimization effects.
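For a rough sense of how autocorrelation shrinks the information in a sample, the sketch below uses the textbook approximation n_eff ≈ n(1 − φ)/(1 + φ) for the sample mean of a stationary AR(1) series. Applying it to Wasserstein estimation is only a heuristic assumption made here for illustration; the function name is hypothetical and nothing in it comes from the paper.

```python
def effective_sample_size_ar1(n, phi):
    """Approximate effective sample size for the mean of a stationary AR(1) series."""
    if not -1.0 < phi < 1.0:
        raise ValueError("phi must lie in (-1, 1) for a stationary AR(1) process")
    return n * (1.0 - phi) / (1.0 + phi)


# How many "independent" observations 1000 correlated ones are worth.
for phi in (0.0, 0.5, 0.9, 0.99):
    print(phi, round(effective_sample_size_ar1(1000, phi), 1))
```

Under this approximation, 1000 observations with φ = 0.9 carry roughly as much information about the mean as about 53 independent ones, which is the regime the first major comment is concerned with.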
minor comments (2)
- [§2] Notation in §2: Distinguish more clearly between the joint distribution P(X,Y) and the conditional P(Y|X) in the discrepancy definitions to prevent reader confusion.
- [Related work] Related work: Add citations to prior uses of Wasserstein distances for dependent data or time-series distribution matching.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback on our manuscript. We have carefully considered the major comments and provide point-by-point responses below. Where appropriate, we will revise the manuscript to address the concerns raised.
- Referee: [§3] §3 (theoretical development of the bound): The proof that the joint Wasserstein discrepancy upper-bounds the conditional discrepancy does not explicitly incorporate mixing rates, effective sample size, or dependence structure. Standard Wasserstein concentration inequalities assume i.i.d. or weakly dependent observations; without such controls, the finite-sample estimate can be biased and the bound arbitrarily loose for strongly autocorrelated series, directly undermining the proxy justification for optimization.
  Authors: We appreciate this observation. Our theoretical result establishes that the Wasserstein distance between the joint distributions of (history, label) and (history, forecast) pairs upper-bounds the expected Wasserstein distance between the corresponding conditional distributions. This inequality holds at the population level by the properties of optimal transport and does not rely on any independence assumptions. The proof is deterministic and applies to any joint distribution, including those with strong temporal dependence. However, we agree that the concentration of the empirical estimator around the population value may be affected by dependence, potentially making the bound looser in finite samples for highly autocorrelated data. In the revised manuscript, we will add a remark clarifying the population-level nature of the bound and discuss the implications for strongly dependent time series, including references to relevant concentration results for dependent data. We believe this addresses the concern without altering the core contribution.
  Revision: partial
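For reference, the population-level inequality at issue can be written as follows, using the notation of the paper's appendix proof (Lemma 3.3), with X the history, Y the label sequence, and Ŷ the forecast. The final remark is a standard consequence of the non-negativity of W_p and is added here only as a reading aid.

```latex
% Joint discrepancy upper-bounds the expected conditional discrepancy:
\int W_p\bigl(P_{Y\mid X},\, P_{\hat{Y}\mid X}\bigr)\, dP(X)
  \;\le\; W_p\bigl(P_{X,Y},\, P_{X,\hat{Y}}\bigr).
% Since W_p \ge 0, a right-hand side of 0 forces
% W_p\bigl(P_{Y\mid X}, P_{\hat{Y}\mid X}\bigr) = 0 for P(X)-almost every X.
```

The inequality concerns the true distributions; how closely an estimate from finite, dependent samples tracks the right-hand side is a separate question, which is the point the referee raises.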
- Referee: [Experiments] Experimental section (results tables): The reported performance gains are not accompanied by ablations that vary autocorrelation strength or effective sample size while measuring bound tightness (e.g., via estimated conditional vs. joint discrepancy gap). This leaves open whether improvements stem from the claimed alignment or from other optimization effects.
  Authors: We acknowledge that such ablations would provide stronger evidence for the mechanism behind the improvements. To address this, we will include additional experiments in the revised version where we generate synthetic time series with controlled autocorrelation levels (e.g., AR(1) processes with varying coefficients) and report both the forecasting performance and the estimated gap between the joint and conditional discrepancies. This will help demonstrate that the gains correlate with tighter alignment under higher dependence. We have already begun preparing these results.
  Revision: yes
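The kind of synthetic series the authors describe could be generated along the following lines. This is a minimal sketch assuming Gaussian noise and a stationary start; the function names, series length, and coefficient grid are hypothetical and not taken from the paper or its code.

```python
import random


def simulate_ar1(n, phi, sigma=1.0, seed=0):
    """Simulate x_t = phi * x_{t-1} + eps_t with eps_t ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    # Draw the first value from the stationary distribution, so no burn-in is needed.
    x = rng.gauss(0.0, sigma / (1.0 - phi * phi) ** 0.5)
    series = [x]
    for _ in range(n - 1):
        x = phi * x + rng.gauss(0.0, sigma)
        series.append(x)
    return series


def lag1_autocorrelation(series):
    """Sample lag-1 autocorrelation, used to check the intended dependence level."""
    mean = sum(series) / len(series)
    num = sum((a - mean) * (b - mean) for a, b in zip(series, series[1:]))
    den = sum((a - mean) ** 2 for a in series)
    return num / den


# Sweep the AR coefficient to control autocorrelation strength.
for phi in (0.1, 0.5, 0.9):
    series = simulate_ar1(20000, phi, seed=0)
    print(phi, round(lag1_autocorrelation(series), 3))
```

Each generated series can then be windowed into (history, label) pairs and fed to the same model under both training objectives, so that any performance gap can be read against the coefficient φ.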
Circularity Check
No circularity: joint Wasserstein upper bound derived mathematically from definitions
The paper's central derivation introduces a joint-distribution Wasserstein discrepancy and states that it provably upper-bounds the conditional discrepancy of interest for time-series forecasting. This upper-bound relation is presented as a mathematical property established from the definitions of the discrepancies rather than by fitting parameters to target metrics, renaming known results, or relying on self-citation chains for the core claim. The abstract and description contain no equations or steps that reduce the bound to its inputs by construction, nor any load-bearing self-citations or ansatzes smuggled via prior work. The approach remains self-contained as an independent proxy for alignment that is tractable for gradient optimization, with no evidence of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the Wasserstein distance between the joint distributions of forecast and label sequences upper-bounds the conditional distributional discrepancy.
Forward citations
Cited by 1 Pith paper
- Optimal Transport for LLM Reward Modeling from Noisy Preference
SelectiveRM applies optimal transport with a joint consistency discrepancy and partial mass relaxation to produce reward models that optimize a tighter upper bound on clean risk while autonomously dropping noisy prefe...
Reference graph
Works this paper leans on
- [1] Jason M. Altschuler, Jonathan Weed, and Philippe Rigollet. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. In Proc. Adv. Neural Inf. Process. Syst., pages 1964–1974,
- [2] Jiajun Fan, Chaoran Cheng, Shuaike Shen, Xiangxin Zhou, and Ge Liu. Fine-tuning flow matching generative models with intermediate feedback. arXiv preprint arXiv:2510.18072, 2025a.
  Jiajun Fan, Shuaike Shen, Chaoran Cheng, Yuxin Chen, Chumeng Liang, and Ge Liu. Online reward-weighted fine-tuning of flow matching with Wasserstein regularization. In Proc. Int. ...
- [3] Deep Time Series Models: A Comprehensive Survey and Benchmark
  Hao Wang, Zhichao Chen, Jiajun Fan, Haoxuan Li, Tianqiao Liu, Weiming Liu, Quanyu Dai, Yichao Wang, Zhenhua Dong, and Ruiming Tang. Optimal transport for treatment effect estimation. In Proc. Adv. Neural Inf. Process. Syst., volume 36, pages 5404–5418, 2023a.
  Hao Wang, Zhichao Chen, Zhaoran Liu, Xu Chen, Haoxuan Li, and Zhouchen Lin. Proximity matters: Lo...
- [4] Preprint. A Theoretical Justification. Theorem A.1 (Autocorrelation bias, Theorem 3.1 in the main text). Suppose $Y|X \in \mathbb{R}^{T}$ is the label sequence given the historical sequence $X$, $\hat{Y}|X \in \mathbb{R}^{T}$ is the forecast sequence, and $\Sigma|X \in \mathbb{R}^{T \times T}$ is the conditional covariance of $Y|X$. The bias of MSE from the negative log-likelihood of the label sequence given $X$ is expressed as: Bias = ...
- [5] Footnote 4: The pioneering work (Wang et al., 2025g) identifies the bias under the first-order Markov assumption on the label sequence. This study generalizes this bias without the first-order Markov assumption.
  Proof. By Lemma 3.3, we have $\int W_p(P_{Y|X}, P_{\hat{Y}|X})\, dP(X) \le W_p(P_{X,Y}, P_{X,\hat{Y}})$. Thus, if RHS = 0, we have $\int W_p(P_{Y|X}, P_{\hat{Y}|X})\, dP(X) = 0$. Since $W_p$ is ...
- [6] and sliced OT, which reduces the problem to one-dimensional computations and achieves near-linear complexity. The second path involves adapting the OT framework to address specific challenges across various domains, such as domain adaptation (Chizat et al., 2018), causal inference (Wang et al., 2025a; 2023a), generative modeling (Marino and Gerolin, 2020;...
- [7] • ETT (Li et al., 2021): Contains seven metrics related to electricity transformers, recorded from July 2016 to July
- [8] It is divided into four subsets based on sampling frequency: ETTh1 and ETTh2 (hourly), and ETTm1 and ETTm2 (every 15 minutes).
  • Weather (Wu et al., 2021): Comprises 21 meteorological variables from the Max Planck Biogeochemistry Institute’s weather station, captured every 10 minutes throughout
- [9] • ECL (Wu et al., 2021): Features the hourly electricity consumption of 321 clients.
  • Traffic (Wu et al., 2021): Documents the hourly occupancy rates of 862 sensors on San Francisco Bay Area freeways, spanning from 2015 to
- [10] • PEMS (Liu et al., 2022): Consists of public traffic data from the California highway system, aggregated in 5-minute intervals. We utilize two common subsets, PEMS03 and PEMS08.
  Following established protocols (Qiu et al., 2024; Liu et al., 2024), all datasets are chronologically partitioned into training, validation, and test sets. For the ETT, Weather, ...
- [11] repositories. The reproducibility of these baseline results was verified prior to our experiments. All models were trained to minimize the MSE loss function using the Adam optimizer (Kingma and Ba, 2015). The learning rate for each baseline was selected from the set $\{10^{-3}, 5 \times 10^{-4}, 10^{-4}, 5 \times 10^{-5}\}$ based on the best performance on the validation set. To preve...
- [12] Footnote 6: Implementation is available at https://www.mathworks.com/help/stats/partialcorr
  Models: DistDF (Ours), TimeBridge (2025), Fredformer (2024), iTransformer (2024), FreTS (2023), TimesNet (2023), MICN (2023), TiDE (2023), PatchTST (2023), DLinear (2023)
  Metrics: MSE and MAE for each model, in the model order above
  ETTm1, 96: 0.316 0.357 | 0.323 0.361 | 0.326 0.361 | 0.338 0.372 | 0.342 0.375 | 0.368 0.394 | 0.319 0.366 | 0.353 0.374 | 0.3...
- [13] which is known to require large historical lengths. The results demonstrate that DistDF consistently improves both forecast models across different historical sequence lengths.
  Table 11: Experimental results (mean±std) with varying seeds (2021-2025). Datasets: ECL, Weather. Models per dataset: DistDF, DF. Metrics per model: MSE, MAE. 96: 0.138±...