pith. sign in

arxiv: 2606.18539 · v1 · pith:H5JN7EEPnew · submitted 2026-06-16 · 💻 cs.LG · stat.ML

TS-Fault: Benchmarking Time Series Forecasters Against Structural Faults

Pith reviewed 2026-06-27 00:35 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords time series forecastingrobustnessstructural faultsmechanism-level faultsbenchmarkfoundation modelsclean data correlation
0
0 comments X

The pith

Clean-data accuracy rankings for time series forecasters fail to predict robustness under structured faults.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that ranking time series forecasting models by average error on clean held-out data does not indicate how they will behave when real faults occur. It introduces a benchmark that injects four parameterized fault modes—split along observation versus mechanism level and univariate versus multivariate—into the most prediction-critical time windows using a unified importance score. Evaluation of 21 models on 6 datasets shows clean accuracy anti-correlates with robustness, that rankings stay stable only under observation-level faults, and that mechanism-level faults cause all catastrophic failures, with the clean-data leaders including foundation models proving most fragile. A reader would care because forecasting supports decisions in energy, transportation, finance, and healthcare where faults arrive as structured events rather than random noise.

Core claim

TS-Fault demonstrates that clean-data accuracy anti-correlates with robustness; clean rankings are preserved under observation-level faults but reshuffled under mechanism-level faults; and all catastrophic failures occur under mechanism-level faults, with foundation models achieving the highest clean-data accuracy yet exhibiting the greatest fragility.

What carries the argument

The TS-Fault benchmark that organizes recurring failures into four modes along observation-versus-mechanism and univariate-versus-multivariate axes and places each fault into the most prediction-critical window via a unified importance score.

If this is right

  • Models must be tested under paired clean and faulted conditions rather than clean data alone.
  • Observation-level faults leave existing clean rankings intact while mechanism-level faults reorder them.
  • Foundation models require separate robustness evaluation because they show the largest drop from clean to faulted performance.
  • Benchmarking should prioritize mechanism-level faults since they alone produce catastrophic failures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Selection of deployed forecasters may need separate robustness criteria instead of relying on public leaderboards.
  • The benchmark protocol could be applied to new fault types if they can be expressed in the same four-mode structure.
  • Training procedures that improve clean accuracy may need explicit penalties for mechanism-level fragility.

Load-bearing premise

The four defined fault modes and the importance score for placing faults actually capture the structures that deployed forecasting models rely on rather than arbitrary synthetic corruptions.

What would settle it

A dataset or set of deployed logs in which the ranking of models by clean accuracy matches their ranking by robustness under the four fault modes, or real faults that produce model behavior outside the four modes.

Figures

Figures reproduced from arXiv: 2606.18539 by Chenxi Liu, Hao Miao, Hao Xue, Lian Xu, Yuyang Zhao.

Figure 1
Figure 1. Figure 1: The TS-Fault pipeline. TABLE II FOUR QUADRANTS OF TSF ROBUSTNESS RESEARCH. Sampled Constructed Unstructured Noise & missingness (Gaussian, masking) [15], [16], [35] Adversarial attacks (FGSM/PGD in ϵ-balls) [17], [18] Structured Distribution shift (Wild-Time [36], drift) [37], [38] Scenario-grounded (TS-Fault, this work) and variables) or structured (carrying temporal shape, cross￾variable coupling, or cau… view at source ↗
Figure 2
Figure 2. Figure 2: Illustrative window-importance scoring. Among candidate windows [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 4
Figure 4. Figure 4: Visual signatures of the four fault modes at three difficulties (clean [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 3
Figure 3. Figure 3: The 2 × 2 fault taxonomy. Scope (observation- vs. mechanism-level) × variate scope (uni- vs. multivariate) yields four modes [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 5
Figure 5. Figure 5: Each line links a model’s clean-accuracy rank (left) to its robustness [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Clean rank vs. faulted rank, per mode. The dashed diagonal [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗
Figure 8
Figure 8. Figure 8: Clean MSE and robustness ratio across the [PITH_FULL_IMAGE:figures/full_fig_p011_8.png] view at source ↗
Figure 7
Figure 7. Figure 7: Per-mode degradation distributions (log y; box = IQR, points = single configuration). Modes I/II are low and tightly concentrated; Modes III/IV are higher by several orders of magnitude and far more dispersed. D. RQ4: Are Pretrained Foundation Models More Robust? They are the opposite, strong but fragile. On clean data, the three foundation models are top-tier: TimesFM attains MSE 0.516 (second overall) an… view at source ↗
Figure 9
Figure 9. Figure 9: Degradation grows monotonically with difficulty for all four modes [PITH_FULL_IMAGE:figures/full_fig_p012_9.png] view at source ↗
read the original abstract

Time series forecasting (TSF) underpins consequential decisions in energy, transportation, finance, and healthcare, yet TSF models are almost universally ranked by a single number (e.g., average error) on clean held-out data, under the implicit assumption that it predicts deployed reliability. However, real faults are not i.i.d noise but structured events with temporal shape, broken cross-variable dependencies, regime change coupled with missingness, and causal propagation across a sensing pipeline. Treating TSF robustness as a data-quality problem, we present TS-Fault, a benchmark that evaluates forecasting models under explicit, parameterized fault scenarios with controllable semantic difficulty. TS-Fault organizes recurring failures into four modes along two orthogonal axes (observation- vs mechanism-level; univariate vs multivariate) and injects each fault into the most prediction-critical window via a unified importance score. This design enables robustness to be tested against the structures models actually rely on, rather than reduced to generic noise sensitivity. We evaluate 21 models across 6 datasets, 4 modes, and 5 difficulty levels under a paired clean/corrupt protocol. The results reveal three findings that contradict common leaderboard intuition: (i) clean-data accuracy anti-correlates with robustness; (ii) clean rankings are preserved under observation-level faults but reshuffled under mechanism-level faults; and (iii) all catastrophic failures occur under mechanism-level faults, with foundation models achieving the highest clean-data accuracy yet exhibiting the greatest fragility. The code is publicly available at https://github.com/Ray-zyy/TS-Fault.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces TS-Fault, a benchmark that organizes recurring TSF failures into four fault modes (observation- vs. mechanism-level crossed with univariate vs. multivariate) and injects each at high-importance windows via a unified importance score. It evaluates 21 models on 6 datasets across 4 modes and 5 difficulty levels under a paired clean/corrupt protocol, reporting that clean-data accuracy anti-correlates with robustness, that clean rankings are preserved under observation-level faults but reshuffled under mechanism-level faults, and that all catastrophic failures occur under mechanism-level faults (with foundation models most fragile).

Significance. If the four parameterized modes and importance-based placement are shown to be representative of real structural faults, the benchmark would provide concrete evidence against the assumption that clean-data leaderboards predict deployed reliability and would highlight mechanism-level faults as particularly diagnostic. Public code release is a clear strength for reproducibility.

major comments (2)
  1. [Abstract] Abstract: the assertion that the benchmark 'enables robustness to be tested against the structures models actually rely on' is load-bearing for interpreting the three headline findings, yet no validation is supplied that the chosen temporal shapes, dependency breaks, regime changes, or causal propagations match the distribution of observed faults in the six domains.
  2. [§4] §4 (Experiments) and the fault-mode definitions: no equation, algorithm, or pseudocode is given for computing the unified importance score used to place faults, nor for parameterizing semantic difficulty; without these, it is impossible to verify that placement is independent of the evaluated models or that the five difficulty levels are comparable across modes.
minor comments (2)
  1. A table enumerating the 21 models, their categories (e.g., statistical, deep, foundation), and the 6 datasets would improve readability of the experimental setup.
  2. The paired clean/corrupt protocol is described at a high level; explicit pseudocode or a small worked example of how a single fault injection affects a forecast would clarify the evaluation pipeline.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below, indicating planned revisions where appropriate. The benchmark is positioned as a controlled, parameterized testbed motivated by recurring failure patterns rather than a statistical replica of real fault distributions.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the assertion that the benchmark 'enables robustness to be tested against the structures models actually rely on' is load-bearing for interpreting the three headline findings, yet no validation is supplied that the chosen temporal shapes, dependency breaks, regime changes, or causal propagations match the distribution of observed faults in the six domains.

    Authors: We acknowledge that the manuscript provides no direct empirical validation (e.g., via fault logs or domain studies) that the injected shapes exactly match the empirical distribution of faults in the six domains. The four modes are instead synthesized from documented recurring failure patterns reported in the TSF literature for energy, transportation, finance, and healthcare. We will revise the abstract and add a dedicated paragraph in §2 (Related Work) and §5 (Discussion) that (a) explicitly cites domain-specific studies on structural faults and (b) clarifies that the benchmark tests robustness against representative structures rather than claiming distributional equivalence. This addresses the load-bearing claim without requiring new data collection. revision: partial

  2. Referee: [§4] §4 (Experiments) and the fault-mode definitions: no equation, algorithm, or pseudocode is given for computing the unified importance score used to place faults, nor for parameterizing semantic difficulty; without these, it is impossible to verify that placement is independent of the evaluated models or that the five difficulty levels are comparable across modes.

    Authors: This observation is correct; the current manuscript describes the importance score at a high level but omits the formal definition, computation steps, and parameterization. In the revision we will insert (i) the exact equation for the unified importance score (a model-agnostic weighted combination of gradient saliency, temporal variance, and cross-variable dependency strength, averaged over a small held-out set of models), (ii) the algorithm for selecting the top-k windows, and (iii) the parameterization table that maps semantic difficulty levels to concrete fault parameters (e.g., missingness rate, regime-shift magnitude) for each mode. Pseudocode will be added to §4 to demonstrate that placement is independent of any single evaluated model and that difficulty levels are defined comparably across modes. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical benchmark independent of findings

full rationale

The paper defines four synthetic fault modes (observation/mechanism, uni/multi) and an importance score for injection, then runs a paired clean/corrupt evaluation on 21 models across 6 datasets. The headline results (anti-correlation of clean accuracy with robustness, ranking preservation under observation faults, all catastrophes under mechanism faults) are direct empirical outputs of this protocol. No equations, fitted parameters, or self-citations reduce the robustness scores to quantities defined by the clean-data rankings or by construction. The benchmark construction stands as an independent experimental design rather than a self-referential derivation.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The benchmark rests on the premise that real faults are structured events with temporal shape and broken dependencies rather than i.i.d. noise; the importance score and semantic difficulty levels are introduced without external validation data shown in the abstract.

free parameters (1)
  • importance score parameters
    Used to select the prediction-critical window for fault injection; controllable semantic difficulty levels are also parameterized.
axioms (1)
  • domain assumption Real faults in deployed time series pipelines exhibit observation-level versus mechanism-level structure and univariate versus multivariate patterns.
    Invoked in the abstract to justify the four-mode organization.

pith-pipeline@v0.9.1-grok · 5819 in / 1356 out tokens · 30911 ms · 2026-06-27T00:35:54.415351+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

85 extracted references · 5 linked inside Pith

  1. [1]

    Modelardb: Modu- lar model-based time series management with spark and cassandra,

    S. K. Jensen, T. B. Pedersen, and C. Thomsen, “Modelardb: Modu- lar model-based time series management with spark and cassandra,” PVLDB, vol. 11, no. 11, pp. 1688–1701, 2018

  2. [2]

    Camel: Efficient compression of floating-point time series,

    Y . Yao, L. Chen, Z. Fang, Y . Gao, C. S. Jensen, and T. Li, “Camel: Efficient compression of floating-point time series,”SIGMOD, vol. 2, no. 6, pp. 1–26, 2025

  3. [3]

    Short-term traffic forecasting: Where we are and where we’re going,

    E. I. Vlahogianni, M. G. Karlaftis, and J. C. Golias, “Short-term traffic forecasting: Where we are and where we’re going,”TRANSPORT RES C-EMER, vol. 43, no. 1, pp. 3–19, 2014

  4. [4]

    Fedformer: Frequency enhanced decomposed transformer for long-term series fore- casting,

    T. Zhou, Z. Ma, Q. Wen, X. Wang, L. Sun, and R. Jin, “Fedformer: Frequency enhanced decomposed transformer for long-term series fore- casting,” inICML, 2022, pp. 27 268–27 286

  5. [5]

    Tfb: Towards comprehensive and fair benchmarking of time series forecasting methods,

    X. Qiu, J. Hu, L. Zhou, X. Wu, J. Du, B. Zhang, C. Guo, A. Zhou, C. S. Jensen, Z. Sheng, and B. Yang, “Tfb: Towards comprehensive and fair benchmarking of time series forecasting methods,”PVLDB, vol. 17, no. 9, pp. 2363–2377, 2024

  6. [6]

    Probabilistic electric load forecasting: A tutorial review,

    T. Hong and S. Fan, “Probabilistic electric load forecasting: A tutorial review,”Int. J. Forecast., vol. 32, no. 3, pp. 914–938, 2016

  7. [7]

    Recent advances in electricity price forecasting: A review of probabilistic forecasting,

    J. Nowotarski and R. Weron, “Recent advances in electricity price forecasting: A review of probabilistic forecasting,”RSER, vol. 81, no. 1, pp. 1548–1568, 2018

  8. [8]

    Deep ehr: a survey of recent advances in deep learning techniques for electronic health record (ehr) analysis,

    B. Shickel, P. J. Tighe, A. Bihorac, and P. Rashidi, “Deep ehr: a survey of recent advances in deep learning techniques for electronic health record (ehr) analysis,”IEEE J-BHI, vol. 22, no. 5, pp. 1589–1604, 2017

  9. [9]

    Time series prediction using deep learning methods in healthcare,

    M. A. Morid, O. R. L. Sheng, and J. Dunbar, “Time series prediction using deep learning methods in healthcare,”TMIS, vol. 14, no. 1, pp. 1–29, 2023

  10. [10]

    Financial time series forecasting with deep learning: A systematic literature review: 2005–2019,

    O. B. Sezer, M. U. Gudelek, and A. M. Ozbayoglu, “Financial time series forecasting with deep learning: A systematic literature review: 2005–2019,”ASOC, vol. 90, p. 106181, 2020

  11. [11]

    The m3-competition: results, conclusions and implications,

    S. Makridakis and M. Hibon, “The m3-competition: results, conclusions and implications,”Int. J. Forecast., vol. 16, no. 4, pp. 451–476, 2000

  12. [12]

    The m4 competi- tion: 100,000 time series and 61 forecasting methods,

    S. Makridakis, E. Spiliotis, and V . Assimakopoulos, “The m4 competi- tion: 100,000 time series and 61 forecasting methods,”Int. J. Forecast., vol. 36, no. 1, pp. 54–74, 2020

  13. [13]

    Monash time series forecasting archive,

    R. W. Godahewa, C. Bergmeir, G. I. Webb, R. Hyndman, and P. Montero-Manso, “Monash time series forecasting archive,” in NeurIPS, 2021

  14. [14]

    Gift-eval: A benchmark for general time series forecasting model evaluation,

    T. Aksu, G. Woo, J. Liu, X. Liu, C. Liu, S. Savarese, C. Xiong, and D. Sahoo, “Gift-eval: A benchmark for general time series forecasting model evaluation,”NeurIPS Workshop, 2024

  15. [15]

    Robusttsf: Towards theory and design of robust time series forecasting with anomalies,

    H. Cheng, Q. Wen, Y . Liu, and L. Sun, “Robusttsf: Towards theory and design of robust time series forecasting with anomalies,” inICLR, 2024, pp. 5787–5813

  16. [16]

    Saits: Self-attention-based imputation for time series,

    W. Du, D. C ˆot´e, and Y . Liu, “Saits: Self-attention-based imputation for time series,”ESWA, vol. 219, p. 119619, 2023

  17. [17]

    Explaining and harnessing adversarial examples,

    I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,”ICLR, 2015

  18. [18]

    Towards deep learning models resistant to adversarial attacks,

    A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,”arXiv preprint arXiv:1706.06083, 2017

  19. [19]

    No contagion, only interdependence: measuring stock market comovements,

    K. J. Forbes and R. Rigobon, “No contagion, only interdependence: measuring stock market comovements,”J Finance, vol. 57, no. 5, pp. 2223–2261, 2002

  20. [20]

    Cascading risks: Understanding the 2021 winter blackout in texas,

    J. W. Busby, K. Baker, M. D. Bazilian, A. Q. Gilbert, E. Grubert, V . Rai, J. D. Rhodes, S. Shidore, C. A. Smith, and M. E. Webber, “Cascading risks: Understanding the 2021 winter blackout in texas,”Energy Res. Social Sci., vol. 77, no. 1, p. 102106, 2021

  21. [21]

    Informer: Beyond efficient transformer for long sequence time-series forecasting,

    H. Zhou, S. Zhang, J. Peng, S. Zhang, J. Li, H. Xiong, and W. Zhang, “Informer: Beyond efficient transformer for long sequence time-series forecasting,” inAAAI, vol. 35, no. 12, 2021, pp. 11 106–11 115

  22. [22]

    Exploring progress in multivariate time series forecasting: Comprehensive benchmarking and heterogeneity analysis,

    Z. Shao, F. Wang, Y . Xu, W. Wei, C. Yu, Z. Zhang, D. Yao, T. Sun, G. Jin, X. Caoet al., “Exploring progress in multivariate time series forecasting: Comprehensive benchmarking and heterogeneity analysis,” TKDE, vol. 37, no. 1, pp. 291–305, 2024

  23. [23]

    fev-bench: A realistic benchmark for time series forecasting,

    O. Shchur, A. F. Ansari, C. Turkmen, L. Stella, N. Erickson, P. Guerron, M. Bohlke-Schneider, and Y . Wang, “fev-bench: A realistic benchmark for time series forecasting,”arXiv preprint arXiv:2509.26468, 2025

  24. [24]

    Tsfm-bench: A comprehensive and unified benchmark of foundation models for time series forecasting,

    Z. Li, X. Qiu, P. Chen, Y . Wang, H. Cheng, Y . Shu, J. Hu, C. Guo, A. Zhou, C. S. Jensenet al., “Tsfm-bench: A comprehensive and unified benchmark of foundation models for time series forecasting,” inSIGKDD, 2025, pp. 5595–5606

  25. [25]

    Probts: Benchmarking point and distributional forecasting across diverse pre- diction horizons,

    J. Zhang, X. Wen, Z. Zhang, S. Zheng, J. Li, and J. Bian, “Probts: Benchmarking point and distributional forecasting across diverse pre- diction horizons,” inNeurIPS, vol. 37, 2024, pp. 48 045–48 082

  26. [26]

    Scaling laws for neural language models,

    J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling laws for neural language models,”arXiv preprint arXiv:2001.08361, 2020

  27. [27]

    Training compute-optimal large language models,

    J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. Casas, L. A. Hendricks, J. Welbl, A. Clarket al., “Training compute-optimal large language models,”NeurIPS, 2022

  28. [28]

    Chronos: Learning the language of time series,

    A. F. Ansari, L. Stella, C. Turkmen, X. Zhang, P. Mercado, H. Shen, O. Shchur, S. S. Rangapuram, S. P. Arango, S. Kapooret al., “Chronos: Learning the language of time series,”TMLR, 2024

  29. [29]

    A decoder-only foundation model for time-series forecasting,

    A. Das, W. Kong, R. Sen, and Y . Zhou, “A decoder-only foundation model for time-series forecasting,”ICML, 2024

  30. [30]

    Unified training of universal time series forecasting transformers,

    G. Woo, C. Liu, A. Kumar, C. Xiong, S. Savarese, and D. Sahoo, “Unified training of universal time series forecasting transformers,” in ICML, 2024

  31. [31]

    Foundation models for time series analysis: A tutorial and survey,

    Y . Liang, H. Wen, Y . Nie, Y . Jiang, M. Jin, D. Song, S. Pan, and Q. Wen, “Foundation models for time series analysis: A tutorial and survey,” in SIGKDD, 2024, pp. 6555–6565

  32. [32]

    Nlp evaluation in trouble: On the need to measure llm data contamination for each benchmark,

    O. Sainz, J. Campos, I. Garc ´ıa-Ferrero, J. Etxaniz, O. L. de Lacalle, and E. Agirre, “Nlp evaluation in trouble: On the need to measure llm data contamination for each benchmark,” inEMNLP Findings, 2023, pp. 10 776–10 787

  33. [33]

    Time travel in llms: Tracing data contamination in large language models,

    S. Golchin and M. Surdeanu, “Time travel in llms: Tracing data contamination in large language models,”ICLR, 2024

  34. [34]

    Benchmark data contamination of large language models: A survey,

    C. Xu, S. Guan, D. Greene, M. Kechadiet al., “Benchmark data contamination of large language models: A survey,”arXiv preprint arXiv:2406.04244, 2024

  35. [35]

    Brits: Bidirectional recurrent imputation for time series,

    W. Cao, D. Wang, J. Li, H. Zhou, L. Li, and Y . Li, “Brits: Bidirectional recurrent imputation for time series,”NeurIPS, vol. 31, pp. 6776–6786, 2018

  36. [36]

    Wild-time: A benchmark of in-the-wild distribution shift over time,

    H. Yao, C. Choi, B. Cao, Y . Lee, P. W. W. Koh, and C. Finn, “Wild-time: A benchmark of in-the-wild distribution shift over time,” inNeurIPS, vol. 35, 2022, pp. 10 309–10 324

  37. [37]

    Woods: Benchmarks for out-of-distribution generalization in time series,

    J.-C. Gagnon-Audet, K. Ahuja, M.-J. Darvishi-Bayazi, P. Mousavi, G. Dumas, and I. Rish, “Woods: Benchmarks for out-of-distribution generalization in time series,”TMLR, 2023

  38. [38]

    Wilds: A benchmark of in-the-wild distribution shifts,

    P. W. Koh, S. Sagawa, H. Marklund, S. M. Xie, M. Zhang, A. Balsub- ramani, W. Hu, M. Yasunaga, R. L. Phillips, I. Gaoet al., “Wilds: A benchmark of in-the-wild distribution shifts,” inICML, 2021, pp. 5637– 5664

  39. [39]

    Benchmarking neural network robust- ness to common corruptions and perturbations,

    D. Hendrycks and T. Dietterich, “Benchmarking neural network robust- ness to common corruptions and perturbations,”ICLR, 2019

  40. [40]

    Toward causal representation learning,

    B. Sch ¨olkopf, F. Locatello, S. Bauer, N. R. Ke, N. Kalchbrenner, A. Goyal, and Y . Bengio, “Toward causal representation learning,”Proc. IEEE, vol. 109, no. 5, pp. 612–634, 2021

  41. [41]

    Peters, D

    J. Peters, D. Janzing, and B. Scholkopf,Elements of causal inference: foundations and learning algorithms. MIT press, 2017

  42. [42]

    A survey of methods for time series change point detection,

    S. Aminikhanghahi and D. J. Cook, “A survey of methods for time series change point detection,”KAIS, vol. 51, no. 2, pp. 339–367, 2017

  43. [43]

    Hot sax: Efficiently finding the most unusual time series subsequence,

    E. Keogh, J. Lin, and A. Fu, “Hot sax: Efficiently finding the most unusual time series subsequence,” inICDM, 2005, pp. 8–pp

  44. [44]

    Causes of the 2003 major grid blackouts in north america and europe, and recommended means to improve system dynamic performance,

    G. Andersson, P. Donalek, R. Farmer, N. Hatziargyriou, I. Kamwa, P. Kundur, N. Martins, J. Paserba, P. Pourbeik, J. Sanchez-Gascaet al., “Causes of the 2003 major grid blackouts in north america and europe, and recommended means to improve system dynamic performance,”T- PWRS, vol. 20, no. 4, pp. 1922–1928, 2005

  45. [45]

    Early prediction of sepsis in the icu using machine learning: a systematic review,

    M. Moor, B. Rieck, M. Horn, C. R. Jutzeler, and K. Borgwardt, “Early prediction of sepsis in the icu using machine learning: a systematic review,”Front. Med., vol. 8, p. 607952, 2021

  46. [46]

    Time to treatment and mortality during mandated emergency care for sepsis,

    C. W. Seymour, F. Gesten, H. C. Prescott, M. E. Friedrich, T. J. Iwashyna, G. S. Phillips, S. Lemeshow, T. Osborn, K. M. Terry, and M. M. Levy, “Time to treatment and mortality during mandated emergency care for sepsis,”NEJM, vol. 376, no. 23, pp. 2235–2244, 2017

  47. [47]

    R. J. Little and D. B. Rubin,Statistical analysis with missing data. John Wiley & Sons, 2019

  48. [48]

    A cross-domain approach to analyzing the short-run impact of covid-19 on the us electricity sector,

    G. Ruan, D. Wu, X. Zheng, H. Zhong, C. Kang, M. A. Dahleh, S. Sivaranjani, and L. Xie, “A cross-domain approach to analyzing the short-run impact of covid-19 on the us electricity sector,”Joule, vol. 4, no. 11, pp. 2322–2337, 2020

  49. [49]

    Complex systems analysis of series of blackouts: Cascading failure, critical points, and self-organization,

    I. Dobson, B. A. Carreras, V . E. Lynch, and D. E. Newman, “Complex systems analysis of series of blackouts: Cascading failure, critical points, and self-organization,”Chaos, vol. 17, no. 2, 2007

  50. [50]

    Catastrophic cascade of failures in interdependent networks,

    S. V . Buldyrev, R. Parshani, G. Paul, H. E. Stanley, and S. Havlin, “Catastrophic cascade of failures in interdependent networks,”Nature, vol. 464, no. 7291, pp. 1025–1028, 2010

  51. [51]

    Reliability standards for the bulk electric systems of north america,

    N. A. E. R. Corporation, “Reliability standards for the bulk electric systems of north america,” 2018

  52. [52]

    Analysis of the blackout in europe on november 4, 2006,

    C. Li, Y . Sun, and X. Chen, “Analysis of the blackout in europe on november 4, 2006,”IPEC, pp. 939–944, 2007

  53. [53]

    Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting,

    H. Wu, J. Xu, J. Wang, and M. Long, “Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting,” in NeurIPS, vol. 34, 2021, pp. 22 419–22 430

  54. [54]

    Modeling long-and short- term temporal patterns with deep neural networks,

    G. Lai, W.-C. Chang, Y . Yang, and H. Liu, “Modeling long-and short- term temporal patterns with deep neural networks,” inSIGIR, 2018, pp. 95–104

  55. [55]

    TSMixer: An all-MLP architecture for time series forecast-ing,

    S.-A. Chen, C.-L. Li, S. O. Arik, N. C. Yoder, and T. Pfister, “TSMixer: An all-MLP architecture for time series forecast-ing,”TMLR, 2023

  56. [56]

    A time series is worth 64 words: Long-term forecasting with transformers,

    Y . Nie, N. H. Nguyen, P. Sinthong, and J. Kalagnanam, “A time series is worth 64 words: Long-term forecasting with transformers,” inICLR, 2023

  57. [57]

    Multi-resolution time-series transformer for long-term forecasting,

    Y . Zhang, L. Ma, S. Pal, Y . Zhang, and M. Coates, “Multi-resolution time-series transformer for long-term forecasting,” inAISTATS, vol. 238, 2024, pp. 4222–4230

  58. [58]

    R. J. Hyndman and G. Athanasopoulos,Forecasting: principles and practice. OTexts, 2018

  59. [59]

    G. E. Box, G. M. Jenkins, G. C. Reinsel, and G. M. Ljung,Time series analysis: forecasting and control. John Wiley & Sons, 2015

  60. [60]

    Hyndman, A

    R. Hyndman, A. Koehler, K. Ord, and R. Snyder,Forecasting with exponential smoothing: the state space approach. Springer, 2008

  61. [61]

    Are transformers effective for time series forecasting?

    A. Zeng, M. Chen, L. Zhang, and Q. Xu, “Are transformers effective for time series forecasting?” inAAAI, vol. 37, no. 9, 2023, pp. 11 121– 11 128

  62. [62]

    N-beats: Neural basis expansion analysis for interpretable time series forecasting,

    B. N. Oreshkin, D. Carpov, N. Chapados, and Y . Bengio, “N-beats: Neural basis expansion analysis for interpretable time series forecasting,” inICLR, 2020

  63. [63]

    Long short-term memory,

    S. Hochreiter and J. Schmidhuber, “Long short-term memory,”Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997

  64. [64]

    Learning phrase representations using rnn encoder–decoder for statistical machine translation,

    K. Cho, B. Van Merri ¨enboer, C ¸ . Gulc ¸ehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y . Bengio, “Learning phrase representations using rnn encoder–decoder for statistical machine translation,” inEMNLP, 2014, pp. 1724–1734

  65. [65]

    An empirical evaluation of generic convolutional and recurrent networks for sequence modeling,

    S. Bai, J. Z. Kolter, and V . Koltun, “An empirical evaluation of generic convolutional and recurrent networks for sequence modeling,”arXiv preprint arXiv:1803.01271, 2018

  66. [66]

    itrans- former: Inverted transformers are effective for time series forecasting,

    Y . Liu, T. Hu, H. Zhang, H. Wu, S. Wang, L. Ma, and M. Long, “itrans- former: Inverted transformers are effective for time series forecasting,” inICLR, vol. 2024, 2024, pp. 11 116–11 140

  67. [67]

    Timexer: Empowering transformers for time series forecast- ing with exogenous variables,

    Y . Wang, H. Wu, J. Dong, Y . Liu, Y . Qiu, H. Zhang, J. Wang, and M. Long, “Timexer: Empowering transformers for time series forecast- ing with exogenous variables,” inNeurIPS, vol. 37, 2024, pp. 469–498

  68. [68]

    Timemixer: Decomposable multiscale mixing for time series forecasting,

    S. Wang, H. Wu, X. Shi, T. Hu, H. Luo, L. Ma, J. Zhang, and J. Zhou, “Timemixer: Decomposable multiscale mixing for time series forecasting,” inICLR, vol. 2024, 2024, pp. 38 626–38 652

  69. [69]

    Timesnet: Temporal 2d-variation modeling for general time series analysis,

    H. Wu, T. Hu, Y . Liu, H. Zhou, J. Wang, and M. Long, “Timesnet: Temporal 2d-variation modeling for general time series analysis,” in ICLR, 2023

  70. [70]

    Non-stationary transformers: Exploring the stationarity in time series forecasting,

    Y . Liu, H. Wu, J. Wang, and M. Long, “Non-stationary transformers: Exploring the stationarity in time series forecasting,” inNeurIPS, vol. 35, 2022, pp. 9881–9893

  71. [71]

    Another look at measures of forecast accuracy,

    R. J. Hyndman and A. B. Koehler, “Another look at measures of forecast accuracy,”Int. J. Forecast., vol. 22, no. 4, pp. 679–688, 2006

  72. [72]

    Strictly proper scoring rules, prediction, and estimation,

    T. Gneiting and A. E. Raftery, “Strictly proper scoring rules, prediction, and estimation,”J Am Stat Assoc., vol. 102, no. 477, pp. 359–378, 2007

  73. [73]

    Shortcut learning in deep neural networks,

    R. Geirhos, J.-H. Jacobsen, C. Michaelis, R. Zemel, W. Brendel, M. Bethge, and F. A. Wichmann, “Shortcut learning in deep neural networks,”Nat. Mach. Intell., vol. 2, no. 11, pp. 665–673, 2020

  74. [74]

    Underspec- ification presents challenges for credibility in modern machine learning,

    A. D’Amour, K. Heller, D. Moldovan, B. Adlam, B. Alipanahi, A. Beu- tel, C. Chen, J. Deaton, J. Eisenstein, M. D. Hoffmanet al., “Underspec- ification presents challenges for credibility in modern machine learning,” JMLR, vol. 23, no. 226, pp. 1–61, 2022

  75. [75]

    Reversible instance normalization for accurate time-series forecasting against dis- tribution shift,

    T. Kim, J. Kim, Y . Tae, C. Park, J.-H. Choi, and J. Choo, “Reversible instance normalization for accurate time-series forecasting against dis- tribution shift,” inICLR, 2021

  76. [76]

    Beyond accuracy: Behavioral testing of nlp models with checklist,

    M. T. Ribeiro, T. Wu, C. Guestrin, and S. Singh, “Beyond accuracy: Behavioral testing of nlp models with checklist,” inACL, 2020, pp. 4902–4912

  77. [77]

    Dynabench: Rethinking benchmarking in nlp,

    D. Kiela, M. Bartolo, Y . Nie, D. Kaushik, A. Geiger, Z. Wu, B. Vid- gen, G. Prasad, A. Singh, P. Ringshiaet al., “Dynabench: Rethinking benchmarking in nlp,” inNAACL, 2021, pp. 4110–4124

  78. [78]

    Categorizing variants of goodhart’s law,

    D. Manheim and S. Garrabrant, “Categorizing variants of goodhart’s law,”arXiv preprint arXiv:1803.04585, 2018

  79. [79]

    On interaction between augmen- tations and corruptions in natural corruption robustness,

    E. Mintun, A. Kirillov, and S. Xie, “On interaction between augmen- tations and corruptions in natural corruption robustness,” inNeurIPS, vol. 34, 2021, pp. 3571–3583

  80. [80]

    Anomaly detection: A survey,

    V . Chandola, A. Banerjee, and V . Kumar, “Anomaly detection: A survey,” CSUR, vol. 41, no. 3, pp. 1–58, 2009

Showing first 80 references.