DistDF: Time-Series Forecasting Needs Joint-Distribution Wasserstein Alignment

Hao Wang; Haoxuan Li; Licheng Pan; Qingsong Wen; Shuting He; Xiaoxi Li; Yuan Lu; Zhichao Chen; Zhixuan Chu; Zhouchen Lin

arXiv: 2510.24574 · v2 · submitted 2025-10-28 · cs.LG · cs.AI

Pith reviewed 2026-05-18 02:54 UTC · model grok-4.3

keywords: time-series forecasting · Wasserstein discrepancy · distribution alignment · conditional distribution · autocorrelation bias · gradient optimization

The pith

A joint-distribution Wasserstein discrepancy upper bounds the conditional discrepancy and replaces mean squared error for aligning forecasts with labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that mean squared error, used as the standard estimate of the conditional negative log-likelihood in time-series forecasting, is biased when the label sequence is autocorrelated. It introduces DistDF, which instead minimizes a joint-distribution Wasserstein discrepancy to align the conditional distributions of forecasts and labels. This discrepancy is proven to upper-bound the conditional discrepancy of interest while remaining tractable and differentiable. The result matters because many real-world time series are autocorrelated, so a biased training objective can yield suboptimal models and inaccurate forecasts.
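The bias claim can be made concrete with a small numerical sketch. It follows the form quoted from the paper's Theorem A.1, where the bias is the gap between a covariance-weighted squared error and the plain squared error. The AR(1) covariance, the helper names, and the toy data are assumptions chosen for illustration, not the paper's setup.

```python
import numpy as np

def ar1_covariance(T, rho, sigma2=1.0):
    """Assumed label covariance: Sigma[i, j] = sigma2 * rho**|i - j| (stationary AR(1))."""
    idx = np.arange(T)
    return sigma2 * rho ** np.abs(idx[:, None] - idx[None, :])

def mse_term(y, y_hat):
    """Plain squared Euclidean error ||y - y_hat||^2."""
    r = y - y_hat
    return float(r @ r)

def nll_term(y, y_hat, cov):
    """Covariance-weighted error (y - y_hat)^T cov^{-1} (y - y_hat)."""
    r = y - y_hat
    return float(r @ np.linalg.solve(cov, r))

rng = np.random.default_rng(0)
T = 8
y = rng.normal(size=T)                      # toy label sequence
y_hat = y + rng.normal(scale=0.5, size=T)   # toy forecast

for rho in (0.0, 0.5, 0.9):
    cov = ar1_covariance(T, rho)
    bias = nll_term(y, y_hat, cov) - mse_term(y, y_hat)
    print(f"rho={rho:.1f}  weighted-minus-plain error = {bias:+.4f}")
```

With `rho = 0` the covariance is the identity and the two terms coincide; any nonzero `rho` introduces off-diagonal structure that the plain squared error ignores.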

Core claim

By replacing direct minimization of conditional negative log-likelihood with minimization of a joint-distribution Wasserstein discrepancy, time-series forecasting models can achieve better alignment between forecast and label distributions. The joint discrepancy provably upper-bounds the conditional one and is tractable for gradient descent.

What carries the argument

The joint-distribution Wasserstein discrepancy, which serves as a provable upper bound on the conditional distribution discrepancy and enables gradient-based optimization for distribution alignment in forecasting.
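For reference, the paper's proof excerpt cites the bound (Lemma 3.3) in roughly the following form, transcribed here into LaTeX. The reading of the symbols is an assumption based on that excerpt: $X$ is the historical sequence, $Y$ the label sequence, $\hat{Y}$ the forecast, and $W_p$ the $p$-Wasserstein distance.

```latex
\int W_p\bigl(P_{Y|X},\, P_{\hat{Y}|X}\bigr)\, dP(X)
\;\le\;
W_p\bigl(P_{X,Y},\, P_{X,\hat{Y}}\bigr)
```

Driving the right-hand side to zero therefore forces the averaged conditional discrepancy on the left to zero, which is the step the excerpted proof relies on.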

If this is right

Forecasting models trained with DistDF exhibit reduced bias from autocorrelation in labels.
Because the discrepancy is differentiable, it can serve as a training objective for existing architectures under gradient-based optimization.
It achieves leading performance on the forecasting benchmarks reported in the paper.
The improvement holds across diverse forecasting models rather than a single architecture.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This bounding technique could apply to other tasks involving conditional distribution matching with dependent sequences.
Tighter bounds or alternative discrepancies might be explored for even better estimation from finite samples.

Load-bearing premise

The joint Wasserstein discrepancy estimated from finite observations remains a sufficiently tight and useful upper bound on the true conditional discrepancy even when the time series exhibit strong autocorrelation.

What would settle it

Compare the forecast accuracy of DistDF against standard direct forecasting (DF) on time-series datasets with strong autocorrelation. If DistDF shows no improvement there, the practical value of the upper bound would be in question.

Figures

Figures reproduced from arXiv: 2510.24574 by Hao Wang, Haoxuan Li, Licheng Pan, Qingsong Wen, Shuting He, Xiaoxi Li, Yuan Lu, Zhichao Chen, Zhixuan Chu, Zhouchen Lin.

**Figure 2.** The forecast sequence of DF (in blue) and DistDF (in red), with historical length ...

**Figure 3.** Improvement of DistDF applied to different forecast models, shown with colored bars for ...

**Figure 4.** The forecast sequences generated with DF and DistDF. The forecast length is set to 336 ...

**Figure 5.** The forecast sequences generated with DF and DistDF. The forecast length is set to 192 ...

**Figure 6.** Performance of different forecast models with and without DistDF. The forecast errors are ...

**Figure 7.** Running time (ms) with varying forecast horizons.

Original abstract

Training time-series forecasting models requires aligning the conditional distribution of model forecasts with that of the label sequence. The standard direct forecast (DF) approach resorts to minimizing the conditional negative log-likelihood, typically estimated by the mean squared error. However, this estimation proves biased when the label sequence exhibits autocorrelation. In this paper, we propose DistDF, which achieves alignment by minimizing a distributional discrepancy between the conditional distributions of forecast and label sequences. Since such conditional discrepancies are difficult to estimate from finite time-series observations, we introduce a joint-distribution Wasserstein discrepancy for time-series forecasting, which provably upper bounds the conditional discrepancy of interest. The proposed discrepancy is tractable, differentiable, and readily compatible with gradient-based optimization. Extensive experiments show that DistDF improves diverse forecasting models and achieves leading performance. Code is available at https://anonymous.4open.science/r/DistDF-F66B.
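To show what a tractable joint-distribution discrepancy can look like in practice, here is a minimal NumPy sketch. It is not the authors' implementation: it assumes uniform empirical measures over a batch, a squared Euclidean cost on the concatenated (history, target) vectors, and an entropic (Sinkhorn) approximation of optimal transport. The names `sinkhorn_cost` and `joint_discrepancy` and the `rel_eps` parameter are hypothetical.

```python
import numpy as np

def logsumexp(z, axis):
    """Numerically stable log(sum(exp(z))) along one axis."""
    z_max = np.max(z, axis=axis, keepdims=True)
    return np.squeeze(z_max, axis=axis) + np.log(np.sum(np.exp(z - z_max), axis=axis))

def sinkhorn_cost(cost, rel_eps=0.05, n_iters=200):
    """Transport cost of an entropic-OT plan between two uniform empirical measures."""
    n, m = cost.shape
    eps = rel_eps * max(float(cost.mean()), 1e-12)  # regularization scaled to the cost
    log_a = np.full(n, -np.log(n))
    log_b = np.full(m, -np.log(m))
    f, g = np.zeros(n), np.zeros(m)
    for _ in range(n_iters):  # log-domain Sinkhorn updates of the dual potentials
        f = -eps * logsumexp((g[None, :] - cost) / eps + log_b[None, :], axis=1)
        g = -eps * logsumexp((f[:, None] - cost) / eps + log_a[:, None], axis=0)
    log_plan = (f[:, None] + g[None, :] - cost) / eps + log_a[:, None] + log_b[None, :]
    return float(np.sum(np.exp(log_plan) * cost))

def joint_discrepancy(x, y, y_hat):
    """Approximate squared 2-Wasserstein distance between samples of (X, Y) and (X, Y_hat)."""
    z = np.concatenate([x, y], axis=1)          # joint samples (history, label)
    z_hat = np.concatenate([x, y_hat], axis=1)  # joint samples (history, forecast)
    diff = z[:, None, :] - z_hat[None, :, :]
    return sinkhorn_cost(np.sum(diff ** 2, axis=-1))

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 4))   # batch of histories (length 4)
y = rng.normal(size=(32, 2))   # batch of labels (horizon 2)
print("perfect forecast:", joint_discrepancy(x, y, y))
print("shifted forecast:", joint_discrepancy(x, y, y + 3.0))
```

Every operation here is differentiable in `y_hat`, so porting the same computation to an autodiff framework would give a usable training loss. The entropic term makes the value an approximation, which is why the perfect forecast scores small but not exactly zero.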

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces DistDF, a method for time-series forecasting that aligns the conditional distributions of model forecasts and label sequences by minimizing a joint-distribution Wasserstein discrepancy instead of the standard conditional negative log-likelihood (typically MSE). It claims this joint discrepancy provably upper-bounds the conditional discrepancy of interest, remains tractable and differentiable for gradient-based optimization, and yields empirical improvements over direct forecasting baselines across diverse models.

Significance. A valid and tight upper bound would address a known bias in MSE-based training under autocorrelation, offering a principled distributional alignment objective for time-series models. The reported empirical gains and open code are strengths if the theoretical justification holds under realistic dependence structures.

major comments (2)

[§3] §3 (theoretical development of the bound): The proof that the joint Wasserstein discrepancy upper-bounds the conditional discrepancy does not explicitly incorporate mixing rates, effective sample size, or dependence structure. Standard Wasserstein concentration inequalities assume i.i.d. or weakly dependent observations; without such controls, the finite-sample estimate can be biased and the bound arbitrarily loose for strongly autocorrelated series, directly undermining the proxy justification for optimization.
[Experiments] Experimental section (results tables): The reported performance gains are not accompanied by ablations that vary autocorrelation strength or effective sample size while measuring bound tightness (e.g., via estimated conditional vs. joint discrepancy gap). This leaves open whether improvements stem from the claimed alignment or from other optimization effects.
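The referee's reference to effective sample size can be illustrated with a common rule of thumb. For a stationary AR(1) process with coefficient `phi`, the sample mean of `n` observations carries roughly as much information as `n * (1 - phi) / (1 + phi)` independent draws. This approximation concerns the sample mean only; applying it to Wasserstein estimators is an assumption made here for intuition, not a result from the paper.

```python
def ar1_effective_sample_size(n, phi):
    """Rule-of-thumb effective sample size for the mean of a stationary AR(1) series."""
    if not -1.0 < phi < 1.0:
        raise ValueError("phi must satisfy |phi| < 1 for a stationary AR(1) process")
    return n * (1.0 - phi) / (1.0 + phi)

for phi in (0.0, 0.5, 0.9):
    n_eff = ar1_effective_sample_size(1000, phi)
    print(f"phi={phi:.1f}  effective sample size of 1000 observations: {n_eff:.1f}")
```

Under this heuristic, strong positive autocorrelation shrinks the usable sample considerably, which is the regime where the referee expects finite-sample estimates to be least reliable.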

minor comments (2)

[§2] Notation in §2: Distinguish more clearly between the joint distribution P(X,Y) and the conditional P(Y|X) in the discrepancy definitions to prevent reader confusion.
[Related work] Related work: Add citations to prior uses of Wasserstein distances for dependent data or time-series distribution matching.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We have carefully considered the major comments and provide point-by-point responses below. Where appropriate, we will revise the manuscript to address the concerns raised.

Referee: [§3] §3 (theoretical development of the bound): The proof that the joint Wasserstein discrepancy upper-bounds the conditional discrepancy does not explicitly incorporate mixing rates, effective sample size, or dependence structure. Standard Wasserstein concentration inequalities assume i.i.d. or weakly dependent observations; without such controls, the finite-sample estimate can be biased and the bound arbitrarily loose for strongly autocorrelated series, directly undermining the proxy justification for optimization.

Authors: We appreciate this observation. Our theoretical result establishes that the joint Wasserstein distance between the joint distributions of (forecast, label) pairs upper-bounds the conditional Wasserstein distance between the conditional distributions. This inequality holds at the population level by the properties of optimal transport and does not rely on any independence assumption: the proof is deterministic and applies to any joint distribution, including those with strong temporal dependence. We agree, however, that dependence may affect how the empirical estimator concentrates around the population value, which can make the bound looser in finite samples for highly autocorrelated data. In the revised manuscript, we will add a remark clarifying the population-level nature of the bound and discuss the implications for strongly dependent time series, with references to relevant concentration results for dependent data. We believe this addresses the concern without altering the core contribution.

revision: partial

Referee: [Experiments] Experimental section (results tables): The reported performance gains are not accompanied by ablations that vary autocorrelation strength or effective sample size while measuring bound tightness (e.g., via estimated conditional vs. joint discrepancy gap). This leaves open whether improvements stem from the claimed alignment or from other optimization effects.

Authors: We acknowledge that such ablations would provide stronger evidence for the mechanism behind the improvements. To address this, we will include additional experiments in the revised version where we generate synthetic time series with controlled autocorrelation levels (e.g., AR(1) processes with varying coefficients) and report both the forecasting performance and the estimated gap between the joint and conditional discrepancies. This will help demonstrate that the gains correlate with tighter alignment under higher dependence. We have already begun preparing these results.

revision: yes
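The synthetic setup proposed in this response could be sketched as follows. This is an illustrative assumption about how such data might be generated, not the authors' experimental code; the function names, the Gaussian noise, and the window lengths are hypothetical.

```python
import numpy as np

def simulate_ar1(n, phi, rng, noise_std=1.0):
    """Simulate y_t = phi * y_{t-1} + e_t, started from the stationary distribution."""
    if not -1.0 < phi < 1.0:
        raise ValueError("phi must satisfy |phi| < 1 for stationarity")
    y = np.empty(n)
    y[0] = rng.normal(scale=noise_std / np.sqrt(1.0 - phi ** 2))
    for t in range(1, n):
        y[t] = phi * y[t - 1] + rng.normal(scale=noise_std)
    return y

def lag1_autocorr(y):
    """Sample lag-1 autocorrelation, used to check the controlled dependence level."""
    c = y - y.mean()
    return float(c[1:] @ c[:-1] / (c @ c))

def make_windows(y, hist_len, horizon):
    """Slice a series into (history, label) pairs for direct forecasting."""
    n = len(y) - hist_len - horizon + 1
    X = np.stack([y[i:i + hist_len] for i in range(n)])
    Y = np.stack([y[i + hist_len:i + hist_len + horizon] for i in range(n)])
    return X, Y

rng = np.random.default_rng(0)
for phi in (0.1, 0.5, 0.9):
    series = simulate_ar1(5000, phi, rng)
    X, Y = make_windows(series, hist_len=24, horizon=8)
    print(f"phi={phi:.1f}  lag-1 autocorr={lag1_autocorr(series):.3f}  windows={X.shape[0]}")
```

Sweeping `phi` toward 1 gives label windows with progressively stronger autocorrelation, which is the axis along which the referee asks for the bound's tightness to be measured.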

Circularity Check

0 steps flagged

No circularity: joint Wasserstein upper bound derived mathematically from definitions

The paper's central derivation introduces a joint-distribution Wasserstein discrepancy and states that it provably upper-bounds the conditional discrepancy of interest for time-series forecasting. This upper-bound relation is presented as a mathematical property established from the definitions of the discrepancies rather than by fitting parameters to target metrics, renaming known results, or relying on self-citation chains for the core claim. The abstract and description contain no equations or steps that reduce the bound to its inputs by construction, nor any load-bearing self-citations or ansatzes smuggled via prior work. The approach remains self-contained as an independent proxy for alignment that is tractable for gradient optimization, with no evidence of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard properties of Wasserstein distance and the existence of a tractable joint formulation that upper-bounds the conditional quantity; no new free parameters or invented entities are mentioned in the abstract.

axioms (1)

domain assumption: The Wasserstein distance between joint distributions of forecast and label sequences upper-bounds the conditional distributional discrepancy.
Invoked to justify replacing direct conditional alignment with the joint proxy.

pith-pipeline@v0.9.0 · 5710 in / 1106 out tokens · 22239 ms · 2026-05-18T02:54:15.773588+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Optimal Transport for LLM Reward Modeling from Noisy Preference
cs.LG · 2026-05 · unverdicted · novelty 6.0

SelectiveRM applies optimal transport with a joint consistency discrepancy and partial mass relaxation to produce reward models that optimize a tighter upper bound on clean risk while autonomously dropping noisy prefe...

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1]

Altschuler, Jonathan Weed, and Philippe Rigollet

Jason M. Altschuler, Jonathan Weed, and Philippe Rigollet. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. In Proc. Adv. Neural Inf. Process. Syst., pages 1964–1974,

work page 1964
[2]

Fine-tuning flow matching generative models with intermediate feedback. arXiv preprint arXiv:2510.18072, 2025a

Jiajun Fan, Chaoran Cheng, Shuaike Shen, Xiangxin Zhou, and Ge Liu. Fine-tuning flow matching generative models with intermediate feedback. arXiv preprint arXiv:2510.18072, 2025a. Jiajun Fan, Shuaike Shen, Chaoran Cheng, Yuxin Chen, Chumeng Liang, and Ge Liu. Online reward-weighted fine-tuning of flow matching with Wasserstein regularization. In Proc. Int. ...

work page arXiv
[3]

Deep Time Series Models: A Comprehensive Survey and Benchmark

Hao Wang, Zhichao Chen, Jiajun Fan, Haoxuan Li, Tianqiao Liu, Weiming Liu, Quanyu Dai, Yichao Wang, Zhenhua Dong, and Ruiming Tang. Optimal transport for treatment effect estimation. In Proc. Adv. Neural Inf. Process. Syst., volume 36, pages 5404–5418, 2023a. Hao Wang, Zhichao Chen, Zhaoran Liu, Xu Chen, Haoxuan Li, and Zhouchen Lin. Proximity matters: Lo...

work page internal anchor Pith review Pith/arXiv arXiv
[4]

The bias of MSE from the negative log-likelihood of the label sequence given $X$ is expressed as: $\mathrm{Bias} = \|Y_{|X} - \hat{Y}_{|X}\|^2_{\Sigma_{|X}^{-1}} - \|Y_{|X} - \hat{Y}_{|X}\|^2$ (7), where $\|v\|^2_{\Sigma_{|X}^{-1}} = v^\top \Sigma_{|X}^{-1} v$

A THEORETICAL JUSTIFICATION. Theorem A.1 (Autocorrelation bias, Theorem 3.1 in the main text). Suppose $Y_{|X} \in \mathbb{R}^{T}$ is the label sequence given historical sequence $X$, $\hat{Y}_{|X} \in \mathbb{R}^{T}$ is the forecast sequence, and $\Sigma_{|X} \in \mathbb{R}^{T \times T}$ is the conditional covariance of $Y_{|X}$. The bias of MSE from the negative log-likelihood of the label sequence given $X$ is expressed as: Bias = ...

work page 2022
[5]

This study generalizes this bias without the first-order Markov assumption

Footnote 4: The pioneering work (Wang et al., 2025g) identifies the bias under the first-order Markov assumption on the label sequence. This study generalizes this bias without the first-order Markov assumption.

Proof. By Lemma 3.3, we have $\int W_p(P_{Y|X}, P_{\hat{Y}|X})\, dP(X) \le W_p(P_{X,Y}, P_{X,\hat{Y}})$. Thus, if RHS = 0, we have $\int W_p(P_{Y|X}, P_{\hat{Y}|X})\, dP(X) = 0$. Since $W_p$ is ...

work page 2019
[6]

and sliced OT, which reduces the problem to one-dimensional computations and achieves near-linear complexity. The second path involves adapting the OT framework to address specific challenges across various domains, such as domain adaptation (Chizat et al., 2018), causal inference (Wang et al., 2025a; 2023a), generative modeling (Marino and Gerolin, 2020;...

work page 2018
[7]

• ETT (Li et al., 2021): Contains seven metrics related to electricity transformers, recorded from July 2016 to July

work page 2021
[8]

• Weather (Wu et al., 2021): Comprises 21 meteorological variables from the Max Planck Biogeochemistry Institute’s weather station, captured every 10 minutes throughout

It is divided into four subsets based on sampling frequency: ETTh1 and ETTh2 (hourly), and ETTm1 and ETTm2 (every 15 minutes).
• Weather (Wu et al., 2021): Comprises 21 meteorological variables from the Max Planck Biogeochemistry Institute’s weather station, captured every 10 minutes throughout

work page 2021
[9]

• Traffic (Wu et al., 2021): Documents the hourly occupancy rates of 862 sensors on San Francisco Bay Area freeways, spanning from 2015 to

• ECL (Wu et al., 2021): Features the hourly electricity consumption of 321 clients.
• Traffic (Wu et al., 2021): Documents the hourly occupancy rates of 862 sensors on San Francisco Bay Area freeways, spanning from 2015 to

work page 2021
[10]

We utilize two common subsets, PEMS03 and PEMS08

• PEMS (Liu et al., 2022): Consists of public traffic data from the California highway system, aggregated in 5-minute intervals. We utilize two common subsets, PEMS03 and PEMS08.
Following established protocols (Qiu et al., 2024; Liu et al., 2024), all datasets are chronologically partitioned into training, validation, and test sets. For the ETT, Weather, ...

work page 2022
[11]

The reproducibility of these baseline results was verified prior to our experiments

repositories. The reproducibility of these baseline results was verified prior to our experiments. All models were trained to minimize the MSE loss function using the Adam optimizer (Kingma and Ba, 2015). The learning rate for each baseline was selected from the set $\{10^{-3}, 5\times10^{-4}, 10^{-4}, 5\times10^{-5}\}$ based on the best performance on the validation set. To preve...

work page 2015
[12]

Footnote 6: Implementation is available at https://www.mathworks.com/help/stats/partialcorr

| Model | ETTm1 96 MSE | ETTm1 96 MAE |
| --- | --- | --- |
| DistDF (Ours) | 0.316 | 0.357 |
| TimeBridge (2025) | 0.323 | 0.361 |
| Fredformer (2024) | 0.326 | 0.361 |
| iTransformer (2024) | 0.338 | 0.372 |
| FreTS (2023) | 0.342 | 0.375 |
| TimesNet (2023) | 0.368 | 0.394 |
| MICN (2023) | 0.319 | 0.366 |
| TiDE (2023) | 0.353 | 0.374 |
| PatchTST (2023) | 0.3... | ... |
| DLinear (2023) | ... | ... |

work page 2025
[13]

The results demonstrate that DistDF consistently improves both forecast models across different historical sequence lengths

which is known to require large historical lengths. The results demonstrate that DistDF consistently improves both forecast models across different historical sequence lengths.

Table 11: Experimental results (mean±std) with varying seeds (2021-2025). Datasets: ECL, Weather. Models: DistDF, DF. Metrics: MSE, MAE. 96 0.138±...

work page 2021
