
arxiv: 2512.18610 · v3 · submitted 2025-12-21 · 💻 cs.LG

The Procrustean Bed of Time Series: The Optimization Bias in Point-wise Loss Functions

Pith reviewed 2026-05-16 20:49 UTC · model grok-4.3

classification 💻 cs.LG
keywords: time series forecasting, optimization bias, point-wise loss, structural signal-to-noise ratio, expectation of optimization bias, debiasing, forecasting, imputation

The pith

Point-wise loss functions create an optimization bias in time series that is governed by sequence length and structural signal-to-noise ratio (SSNR).

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Point-wise losses such as MSE evaluate each timestamp independently and therefore ignore the temporal correlations present in the data. This mismatch produces a systematic optimization bias, which the paper formalizes as the Expectation of Optimization Bias (EOB): the KL divergence between the true joint distribution and the factorized surrogate induced by the loss. Under covariance-stationary Gaussian assumptions the authors obtain a closed-form expression for the stochastic component of EOB that depends only on sequence length and structural signal-to-noise ratio (SSNR), and they extend the result to nonlinear settings via a Gaussian-mixture lower bound. The bias therefore survives any improvement in model capacity or optimizer choice. A reader cares because the finding implies that forecasting accuracy on long or strongly structured (high-SSNR) series is limited by the loss itself rather than by the network, and it points to concrete debiasing steps (sequence length reduction and structural orthogonalization) that reduce the bias without changing the model.

Core claim

Under covariance-stationary Gaussian assumptions, the stochastic component of the expectation of optimization bias admits a closed-form expression that serves as an irreducible lower bound on total bias for linear systems. This bias is governed by sequence length and structural signal-to-noise ratio. The result extends to nonlinear regimes via a Gaussian mixture model lower bound. The bias is independent of model architecture, optimizer choice, and the specific point-wise loss.

What carries the argument

Expectation of Optimization Bias (EOB), defined as the KL divergence between the true joint distribution of the time series and the product of independent marginals induced by point-wise loss evaluation.
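
For a zero-mean Gaussian vector with covariance Σ, the KL divergence between the joint distribution and the product of its marginals has the standard closed form ½(Σ_t log Σ_tt − log det Σ). The sketch below evaluates that quantity for a stationary AR(1) process. The AR(1) choice, the function names, the parameters `phi` and `sigma_eps`, and the use of the marginal-to-innovation variance ratio as a stand-in for SSNR are assumptions made for illustration; the paper's own definitions are not reproduced on this page.

```python
import numpy as np

def ar1_covariance(T, phi, sigma_eps=1.0):
    """Covariance of T consecutive samples of a stationary AR(1) process."""
    var_z = sigma_eps**2 / (1.0 - phi**2)  # marginal (stationary) variance
    lags = np.abs(np.subtract.outer(np.arange(T), np.arange(T)))
    return var_z * phi**lags

def gaussian_kl_joint_vs_marginals(cov):
    """KL( N(0, cov) || product of its marginals ), in nats."""
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (np.sum(np.log(np.diag(cov))) - logdet)

T, phi = 96, 0.9
kl = gaussian_kl_joint_vs_marginals(ar1_covariance(T, phi))

# Assumed SSNR stand-in: marginal variance divided by innovation variance.
ratio = 1.0 / (1.0 - phi**2)
# For AR(1) the divergence collapses to (T - 1) / 2 * log(ratio).
closed_form = 0.5 * (T - 1) * np.log(ratio)

print(kl, closed_form)
```

In this toy case the divergence grows linearly with the sequence length and logarithmically with the variance ratio, which is the kind of length-and-SSNR dependence the paper describes.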

If this is right

  • Reducing sequence length lowers the optimization bias.
  • Frequency or wavelet transforms that orthogonalize structure reduce the bias.
  • A harmonized ℓ_p norm loss combined with these transforms produces measurable accuracy gains on forecasting and imputation.
  • The bias explains the classic trigonometric fitting failure as an objective-induced pathology.
  • Plug-and-play use of the modified objective on iTransformer reduces average MSE by 5.2% in forecasting across 11 datasets and by 27.4% in imputation across 9 datasets.
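
The claim that an orthogonalizing transform reduces the bias can be illustrated on a toy Gaussian example. The sketch below measures the dependence left between coordinates (the Gaussian KL divergence between the joint distribution and its marginals) before and after an orthonormal transform. It uses a hand-built DCT-II basis as a real-valued stand-in for the DFT/DWT mentioned above, and an AR(1) covariance as test data; both are assumptions for illustration, and the paper's harmonized ℓ_p norm is not implemented here.

```python
import numpy as np

def ar1_covariance(T, phi, sigma_eps=1.0):
    """Covariance of T consecutive samples of a stationary AR(1) process."""
    var_z = sigma_eps**2 / (1.0 - phi**2)
    lags = np.abs(np.subtract.outer(np.arange(T), np.arange(T)))
    return var_z * phi**lags

def orthonormal_dct_matrix(T):
    """Orthonormal DCT-II basis; each row is one basis vector."""
    n = np.arange(T)
    k = n.reshape(-1, 1)
    U = np.sqrt(2.0 / T) * np.cos(np.pi * (2 * n + 1) * k / (2 * T))
    U[0, :] = np.sqrt(1.0 / T)  # constant basis vector needs its own scale
    return U

def dependence_nats(cov):
    """KL between N(0, cov) and the product of its marginals, in nats."""
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (np.sum(np.log(np.diag(cov))) - logdet)

T, phi = 64, 0.9
cov_time = ar1_covariance(T, phi)
U = orthonormal_dct_matrix(T)
cov_freq = U @ cov_time @ U.T  # covariance of the transformed coefficients

dep_time = dependence_nats(cov_time)
dep_freq = dependence_nats(cov_freq)
print(dep_time, dep_freq)
```

Because the transform is orthonormal, the determinant of the covariance is unchanged; only the diagonal entries move, so any drop in the printed value comes from the coefficients being closer to uncorrelated than the raw time steps.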

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same length-and-SSNR dependence may limit point-wise objectives in other sequential domains such as language modeling.
  • Practitioners could compute SSNR on new datasets to set realistic expectations for achievable forecast error.
  • The debiasing transforms could be tested on deliberately non-stationary series to determine how far the predicted dynamics extend.
  • Stacking the proposed objective on top of existing high-capacity models may produce additive gains beyond what either achieves separately.

Load-bearing premise

The time series must obey covariance-stationary Gaussian statistics for the closed-form EOB expressions to hold exactly.

What would settle it

Generate synthetic covariance-stationary Gaussian series, vary only sequence length and SSNR while holding model and optimizer fixed, and check whether observed forecasting error tracks the predicted EOB curve.
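
The data-generation half of this experiment can be sketched as follows. The code draws stationary Gaussian AR(1) paths at a chosen length and variance ratio and prints the divergence that the toy AR(1) closed form would predict for each setting. The AR(1) family, the helper name `simulate_ar1`, and the treatment of the marginal-to-innovation variance ratio as the SSNR knob are assumptions; training a forecaster on these paths and comparing its error to the prediction is left out.

```python
import numpy as np

def simulate_ar1(n_series, T, ratio, sigma_eps=1.0, seed=0):
    """Draw stationary Gaussian AR(1) paths whose marginal variance is
    `ratio` times the innovation variance (assumed SSNR stand-in)."""
    if ratio < 1.0:
        raise ValueError("ratio must be at least 1")
    rng = np.random.default_rng(seed)
    phi = np.sqrt(1.0 - 1.0 / ratio)   # positive root, an arbitrary choice
    var_z = ratio * sigma_eps**2       # stationary marginal variance
    z = np.empty((n_series, T))
    # Start from the stationary distribution so there is no burn-in phase.
    z[:, 0] = rng.normal(0.0, np.sqrt(var_z), size=n_series)
    for t in range(1, T):
        z[:, t] = phi * z[:, t - 1] + rng.normal(0.0, sigma_eps, size=n_series)
    return z

# Vary only length and variance ratio; everything else stays fixed.
for T in (24, 96):
    for ratio in (2.0, 8.0):
        paths = simulate_ar1(n_series=2000, T=T, ratio=ratio)
        predicted = 0.5 * (T - 1) * np.log(ratio)  # toy AR(1) closed form
        print(T, ratio, paths.var(), predicted)
```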

Figures

Figures reproduced from arXiv: 2512.18610 by Daoyi Dong, Hang Yu, Kexin Zhang, Ming Jin, Qingsong Wen, Rongyao Cai, Yong Liu, Yuxi Wan, Zhiqiang Ge.

Figure 1. Motivation of our work. Point-wise losses implicitly treat each time step as i.i.d., approximating the true joint distribution with a fully factorized surrogate. This mismatch induces a systematic optimization bias.
Figure 2. Error surfaces of Transformer with Gaussian distribution innovation. The blue and red arrows indicate the surface variation trend along horizon (h) and the total SSNR (SSNR_x), respectively.
Figure 3. Taxonomies of Time Series Analysis. The conventional task-oriented taxonomy categorizes time series analysis into five distinct silos: forecasting, classification/clustering, anomaly detection, imputation, and emerging conditional synthesis. This categorization, focusing primarily on the downstream applications rather than modeling methodologies, lacks the theoretical granularity to reconcile with emergi…
Figure 4. Empirical verification of EOB Theory via CNN model. The blue and red arrows indicate the surface variation trend along horizon (h) and the total SSNR (SSNR_x), respectively.
Figure 5. Empirical verification of EOB Theory via LSTM model. The blue and red arrows indicate the surface variation trend along horizon (h) and the total SSNR (SSNR_x), respectively.
Figure 6. Empirical verification of EOB Theory via MLP model. The blue and red arrows indicate the surface variation trend along horizon (h) and the total SSNR (SSNR_x), respectively.
Figure 7. Empirical verification of EOB Theory via ModernTCN model. The blue and red arrows indicate the surface variation trend along horizon (h) and the total SSNR (SSNR_x), respectively.
Figure 8. Empirical verification of EOB Theory via Transformer model. The blue and red arrows indicate the surface variation trend along horizon (h) and the total SSNR (SSNR_x), respectively.
Figure 9. Insight experiments on trigonometric series.

Original abstract

Intuitively, a more deterministic time series should be easier to forecast. However, point-wise loss functions (e.g., MSE and MAE), serving as differentiable surrogates for the ideal optimization target, score each timestamp independently and therefore disregard temporal dependence. This mismatch induces a systematic optimization bias that cannot be eliminated merely by improving model expressiveness or optimizer. To formalize this issue, we define the Expectation of Optimization Bias (EOB) as the Kullback--Leibler divergence between the true joint distribution and the factorized i.i.d. surrogate induced by the point-wise paradigm. Under covariance-stationary Gaussian assumptions, we derive closed-form expressions for the stochastic component of EOB, establishing it as an irreducible lower bound on the total bias in linear systems, and further extend it to nonlinear regimes through a Gaussian mixture model lower bound. Crucially, we prove this bias is governed intrinsically by two data properties, i.e., sequence length and Structural Signal-to-Noise Ratio (SSNR), regardless of specific model architecture, optimizer, or point-wise loss forms. This theory motivates a principled debiasing program based on sequence length reduction and structural orthogonalization, which we instantiate through DFT/DWT combined with a novel harmonized $\ell_p$ norm. Extensive experiments validate the predicted SSNR--horizon dynamics, resolve the classic trigonometric fitting failure as an objective-induced pathology, and demonstrate substantial plug-and-play gains. Notably, on iTransformer, our proposed objective reduces average MSE/MAE by 5.2%/5.0% in forecasting across 11 datasets and by 27.4%/19.4% in imputation across 9 datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that point-wise losses (MSE/MAE) induce an irreducible optimization bias in time series tasks, formalized as the Expectation of Optimization Bias (EOB) defined as the KL divergence between the true joint distribution and the factorized i.i.d. surrogate. Under covariance-stationary Gaussian assumptions, closed-form expressions are derived for the stochastic component of EOB, proving it is governed intrinsically by sequence length and Structural Signal-to-Noise Ratio (SSNR) independent of architecture, optimizer, or loss form. The result is extended to nonlinear regimes via a Gaussian mixture model lower bound; this motivates a debiasing approach using DFT/DWT with a harmonized ℓ_p norm. Experiments on 11 forecasting and 9 imputation datasets validate the SSNR-horizon predictions and report plug-and-play gains (e.g., 5.2%/5.0% MSE/MAE reduction on iTransformer).

Significance. If the central claims hold, the work supplies a first-principles account of why point-wise surrogates systematically underperform on temporally dependent data, together with a data-property-driven debiasing recipe that yields concrete improvements. The closed-form derivations under the stated Gaussian assumptions constitute a clear technical contribution; the empirical validation of the predicted dynamics across datasets strengthens the practical relevance.

major comments (2)
  1. [§4] §4 (nonlinear extension): the claim that EOB remains an intrinsic lower bound governed solely by length and SSNR in nonlinear regimes rests on the Gaussian mixture model lower bound to the true KL; the tightness of this bound is never quantified or compared to the actual KL on nonlinear data, leaving open the possibility that additional bias terms dominate outside the linear-Gaussian case.
  2. [§5] §5 (experiments): the reported gains (e.g., 27.4% MAE reduction in imputation) are presented without code, data, or hyper-parameter release, which prevents independent verification of the plug-and-play improvements and of the SSNR-horizon dynamics.
minor comments (2)
  1. [§3] The definition of SSNR (Structural Signal-to-Noise Ratio) is introduced via data statistics but would benefit from an explicit equation reference in the main text when first used.
  2. [§5] Figure captions in the experimental section could more explicitly contrast the theoretically predicted SSNR-horizon curves against the observed error curves.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and describe the planned revisions.

Point-by-point responses
  1. Referee: [§4] §4 (nonlinear extension): the claim that EOB remains an intrinsic lower bound governed solely by length and SSNR in nonlinear regimes rests on the Gaussian mixture model lower bound to the true KL; the tightness of this bound is never quantified or compared to the actual KL on nonlinear data, leaving open the possibility that additional bias terms dominate outside the linear-Gaussian case.

    Authors: We agree that the tightness of the GMM lower bound is not quantified in the current version. In the revision we will add a new subsection to §4 containing Monte Carlo comparisons of the GMM-approximated EOB against the true KL on synthetic nonlinear processes (NAR, bilinear, and threshold models). These experiments will report the relative gap as a function of sequence length and SSNR, confirming that the dominant scaling remains governed by those two quantities. The added material will be presented as an empirical validation of the bound rather than a change to the theoretical statement. revision: yes

  2. Referee: [§5] §5 (experiments): the reported gains (e.g., 27.4% MAE reduction in imputation) are presented without code, data, or hyper-parameter release, which prevents independent verification of the plug-and-play improvements and of the SSNR-horizon dynamics.

    Authors: We accept that the absence of released artifacts hinders verification. Upon acceptance we will publish the full codebase (including DFT/DWT orthogonalization, harmonized ℓ_p implementation, SSNR estimator, and all training scripts) on GitHub together with the exact hyper-parameter grids and dataset preprocessing pipelines used for the 11 forecasting and 9 imputation benchmarks. In the revised manuscript we will also expand the experimental appendix with additional tables listing per-dataset SSNR values and the precise configurations that produced the reported 5.2%/5.0% and 27.4%/19.4% improvements. revision: yes

Circularity Check

0 steps flagged

No circularity: EOB closed-form and length/SSNR governance derived from first principles under external Gaussian assumptions

Full rationale

The paper defines EOB as the KL divergence between the true joint distribution and the factorized i.i.d. surrogate from point-wise losses. It then derives closed-form expressions for the stochastic component explicitly under covariance-stationary Gaussian assumptions on the time series, establishing the dependence on sequence length and SSNR through direct mathematical expansion rather than fitting or redefinition. SSNR is introduced as a data-derived statistic (structural signal-to-noise ratio) and used to characterize the bias, not tuned to match observed errors. The GMM lower-bound extension for nonlinear regimes is presented separately without claiming tightness. No load-bearing self-citations, self-definitional loops, fitted inputs renamed as predictions, or ansatzes smuggled via citation appear in the derivation chain. The central result is therefore self-contained against the stated external assumptions and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claim rests on covariance-stationary Gaussian assumptions for the derivation of EOB and on a Gaussian mixture model for the nonlinear extension; SSNR is introduced as a data property without external validation beyond the experiments; no free parameters are fitted to the target bias quantity itself.

axioms (2)
  • [domain assumption] Time series are covariance-stationary Gaussian processes.
    Invoked to obtain closed-form EOB expressions and to establish the irreducible lower bound in linear systems.
  • [domain assumption] The factorized i.i.d. surrogate induced by point-wise loss is the correct comparison distribution for measuring optimization bias.
    Used to define EOB as KL divergence; this is the modeling choice that makes the bias appear irreducible.
invented entities (2)
  • Expectation of Optimization Bias (EOB) [no independent evidence]
    Purpose: quantify the systematic mismatch between point-wise loss and true joint distribution.
    New quantity defined as KL divergence; no independent falsifiable handle outside the paper is provided.
  • Structural Signal-to-Noise Ratio (SSNR) [no independent evidence]
    Purpose: data property claimed to govern the bias together with sequence length.
    Introduced to organize the theoretical predictions; measured on data but not shown to be independently predictive outside the reported experiments.

pith-pipeline@v0.9.0 · 5630 in / 1771 out tokens · 21564 ms · 2026-05-16T20:49:42.109470+00:00 · methodology


