pith. machine review for the scientific record.

arxiv: 2605.11598 · v1 · submitted 2026-05-12 · 💻 cs.LG · cs.AI · cs.DB · q-bio.QM

Recognition: 1 theorem link · Lean Theorem

EpiCastBench: Datasets and Benchmarks for Multivariate Epidemic Forecasting

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 01:35 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.DB · q-bio.QM
keywords epidemic forecasting · multivariate time series · benchmarking framework · public health · deep learning · time series forecasting · infectious diseases · model evaluation

The pith

EpiCastBench supplies 40 multivariate epidemic datasets with standardized settings to evaluate 15 forecasting models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to address the shortage of suitable benchmarks for developing multivariate epidemic forecasting techniques, which, unlike traditional univariate methods, can account for interactions between multiple time series. It introduces EpiCastBench, a collection of 40 diverse datasets covering a range of infectious diseases and locations, along with protocols for consistent testing. This setup lets researchers compare models, from basic statistical baselines to advanced neural networks and foundation models, in a reproducible way. If successful, it would accelerate progress toward more accurate tools for predicting disease spread, supporting better public health planning.

Core claim

The central contribution is EpiCastBench, a large-scale framework of 40 curated multivariate epidemic datasets with varied granularity, length, and sparsity. These datasets are analyzed for global features and structural patterns. The framework fixes forecasting horizons, preprocessing steps, multiple metrics, and statistical tests to enable unbiased evaluations, which are then applied to assess 15 multivariate forecasting models.
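One of the metrics the framework reports is MASE (it appears in Figures 3 and 4). As a concrete illustration of how such a scale-free metric works, here is a minimal sketch following Hyndman and Koehler [39]; the toy case counts below are invented for illustration, not taken from the benchmark.

```python
import numpy as np

def mase(y_true, y_pred, y_train, m=1):
    """Mean Absolute Scaled Error (Hyndman & Koehler, 2006).

    Scales the forecast MAE by the in-sample MAE of a seasonal naive
    forecast with period m (m=1 gives the plain naive forecast), so
    values below 1 beat the naive baseline.
    """
    y_true, y_pred, y_train = map(np.asarray, (y_true, y_pred, y_train))
    scale = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return np.mean(np.abs(y_true - y_pred)) / scale

# Toy weekly case counts: a naive forecast scored against the truth.
train = [10, 12, 15, 20, 18, 22]
truth = [25, 27]
naive = [22, 22]  # repeat the last observed value
print(round(mase(truth, naive, train), 3))  # prints 1.25
```

Because the error is scaled per dataset, MASE values can be compared and ranked across the benchmark's heterogeneous case-count series, which absolute metrics like RMSE cannot do directly.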

What carries the argument

EpiCastBench, the benchmarking framework that curates correlated multivariate datasets and enforces standardized evaluation conditions for epidemic forecasting models.

If this is right

  • Researchers can now perform fair comparisons across a wide range of models without custom data preparation.
  • The diversity of datasets helps identify model performance under different epidemic conditions.
  • Identification of dataset patterns can guide future data collection efforts in epidemiology.
  • Comprehensive evaluation reveals relative strengths of statistical versus deep learning approaches in this domain.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The benchmark might reveal that certain foundation models outperform others in handling sparse epidemic data.
  • Extending the framework to include causal inference or intervention modeling could further enhance its utility for policy decisions.
  • If widely adopted, it could standardize practices similar to how benchmarks in other fields have driven progress.

Load-bearing premise

The 40 chosen datasets represent typical epidemic behaviors and the testing rules do not secretly favor some models over others.

What would settle it

A re-evaluation on independently sourced epidemic datasets: if it produced different top-performing models from those identified in the benchmark, the framework's claim to representativeness would be undermined; if the rankings held, it would be strengthened.

Figures

Figures reproduced from arXiv: 2605.11598 by Danny D'Agostino, Huitao Li, Madhurima Panja, Nan Liu, Tanujit Chakraborty.

Figure 1. Overview of the EpiCastBench framework. Existing studies focus on a specific disease and univariate formulations, limiting generalization across datasets and epidemiological contexts. This study builds on prior epidemic forecasting efforts while adopting the design principles of large-scale time series benchmarks [31]. We introduce EpiCastBench, a curated collection of multivariate epidemic datasets with s… view at source ↗
Figure 2. Radar plot comparing global features of epidemic datasets across transmission channels. view at source ↗
Figure 3. Boxplots comparing model performance using MASE (top) and RMSE (bottom) across … view at source ↗
Figure 4. Win-count heatmaps based on the MASE metric for long (left), medium (middle), and short-term (right) forecasting horizons. view at source ↗
Figure 5. Training and inference times of all models for the short-term forecasting task on the … view at source ↗
Figure 6. MCB test result for long (left), medium (middle), and short-term (right) forecasting horizons. view at source ↗
Figure 7. Boxplots comparing the performance of different models based on SMAPE (upper panel) … view at source ↗
Figure 8. Multiple comparisons with the best (MCB) test based on the (a) RMSE, (b) MAE, and (c) … view at source ↗
read the original abstract

The increasing adoption of data-driven decision-making in public health has established epidemic forecasting as a critical area of research. Recent advances in multivariate forecasting models better capture complex temporal dependencies than conventional univariate approaches, which model individual series independently. Despite this potential, the development of robust epidemic forecasting methods is constrained by the lack of high-quality benchmarks comprising diverse multivariate datasets across infectious diseases and geographical regions. To address this gap, we present EpiCastBench, a large-scale benchmarking framework featuring 40 curated (correlated) multivariate epidemic datasets. These publicly available datasets span a wide range of infectious diseases and exhibit diverse characteristics in terms of temporal granularity, series length, and sparsity. We analyze these datasets to identify their global features and structural patterns. To ensure reproducibility and fair comparison, we establish standardized evaluation settings, including a unified forecasting horizon, consistent preprocessing pipelines, diverse performance metrics, and statistical significance testing. By leveraging this framework, we conduct a comprehensive evaluation of 15 multivariate forecasting models spanning statistical baselines to state-of-the-art deep learning and foundation models. All datasets and code are publicly available on Kaggle (https://www.kaggle.com/datasets/aimltsf/epicastbench) and GitHub (https://github.com/aimltsf/EpiCastBench).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces EpiCastBench, a large-scale benchmarking framework consisting of 40 curated multivariate epidemic datasets spanning multiple infectious diseases, geographical regions, temporal granularities, series lengths, and sparsity levels. It provides standardized evaluation protocols (unified forecasting horizon, preprocessing pipelines, performance metrics, and statistical significance testing) and evaluates 15 multivariate forecasting models ranging from statistical baselines to state-of-the-art deep learning and foundation models. All datasets and code are released publicly on Kaggle and GitHub.

Significance. If the curation process proves representative and free of systematic biases, EpiCastBench could become a useful public resource for the epidemic forecasting community by enabling reproducible, fair comparisons of multivariate models that capture inter-series dependencies. The public data and code release supports reproducibility, which is a clear strength.

major comments (2)
  1. [Dataset curation and analysis sections (referenced in abstract and § on datasets)] The central claim that the 40 datasets enable 'fair comparison' rests on the curation process being unbiased and representative. However, the manuscript does not provide an explicit, reproducible selection protocol (e.g., precise inclusion/exclusion rules for sparsity thresholds, minimum series length, geographic/disease coverage, or how 'correlated' multivariate structure was ensured). This makes it impossible to rule out hidden selection effects that could favor particular model families (e.g., deep learning models benefiting from longer, denser series).
  2. [Evaluation settings and experimental results sections] No sensitivity analyses or robustness checks are reported against alternative dataset pools, preprocessing variants, or horizon choices. Without these, it is unclear whether reported model rankings and statistical significance results are stable or artifacts of the specific 40-dataset collection.
minor comments (2)
  1. [Abstract] Clarify the meaning of '(correlated)' in the abstract and dataset description; specify whether it refers to within-dataset inter-series correlations or another property.
  2. [Model evaluation section] Include a summary table listing all 15 models with their categories (statistical, DL, foundation), key references, and hyperparameters used in the benchmark.
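The first minor comment asks what "(correlated)" means. One simple, concrete reading of within-dataset inter-series correlation can be sketched as below; the (T, N) array is synthetic and stands in for a hypothetical EpiCastBench dataset, so the numbers are illustrative only.

```python
import numpy as np

# Synthetic stand-in for one multivariate epidemic dataset: case counts
# for N=4 regions over T=200 time steps, sharing a common driving signal.
rng = np.random.default_rng(0)
base = rng.poisson(50, size=200).astype(float)
data = np.stack([base + rng.normal(0, 5, 200) for _ in range(4)], axis=1)

# Mean absolute pairwise Pearson correlation between the series: one way
# to quantify whether a dataset is "correlated" in the within-dataset sense.
corr = np.corrcoef(data.T)
off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]
print(round(float(np.mean(np.abs(off_diag))), 2))
```

A curation protocol could publish such a statistic per dataset, letting readers verify the multivariate structure the benchmark claims.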

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript introducing EpiCastBench. We address each major comment below and indicate the revisions we will make to improve clarity and robustness.

read point-by-point responses
  1. Referee: [Dataset curation and analysis sections (referenced in abstract and § on datasets)] The central claim that the 40 datasets enable 'fair comparison' rests on the curation process being unbiased and representative. However, the manuscript does not provide an explicit, reproducible selection protocol (e.g., precise inclusion/exclusion rules for sparsity thresholds, minimum series length, geographic/disease coverage, or how 'correlated' multivariate structure was ensured). This makes it impossible to rule out hidden selection effects that could favor particular model families (e.g., deep learning models benefiting from longer, denser series).

    Authors: We agree that greater transparency in the curation protocol is needed to substantiate the claim of fair and unbiased comparison. The manuscript currently emphasizes the resulting diversity in disease types, regions, granularities, lengths, and sparsity but does not enumerate the precise selection rules. In the revised version we will add a dedicated subsection in the Datasets section that documents the full curation protocol, including the public data sources consulted, quantitative thresholds applied for series length and sparsity, the method used to verify multivariate correlation structure, and the stratification steps taken to achieve coverage across diseases and geographies. This addition will make the process fully reproducible and allow readers to evaluate potential selection effects directly. revision: yes

  2. Referee: [Evaluation settings and experimental results sections] No sensitivity analyses or robustness checks are reported against alternative dataset pools, preprocessing variants, or horizon choices. Without these, it is unclear whether reported model rankings and statistical significance results are stable or artifacts of the specific 40-dataset collection.

    Authors: We recognize that additional sensitivity checks would further demonstrate the stability of the reported rankings. Our existing evaluation already employs statistical significance testing (Friedman test followed by post-hoc Nemenyi) across all 40 datasets to assess whether performance differences are reliable. Nevertheless, we did not conduct exhaustive sensitivity experiments on alternative dataset collections or preprocessing variants. In the revision we will add a short robustness subsection that reports limited sensitivity results (re-evaluation under two alternative horizon lengths on a stratified 10-dataset subset) and will expand the Limitations section to note that broader sensitivity to entirely different dataset pools is left for future work. We believe the combination of dataset diversity, unified protocols, and significance testing already provides substantial evidence that the rankings are not artifacts of the chosen collection. revision: partial
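The Friedman-then-post-hoc procedure the authors describe can be sketched as follows. The MASE scores here are invented for illustration, and only the Friedman step uses SciPy; the mean ranks computed at the end are the input a Nemenyi post-hoc comparison would consume.

```python
import numpy as np
from scipy.stats import friedmanchisquare

# Hypothetical MASE scores for 3 models on 6 datasets (rows = datasets,
# columns = models). In the benchmark this would be 40 rows by 15 columns.
scores = np.array([
    [0.9, 1.1, 1.4],
    [0.8, 1.0, 1.3],
    [1.0, 0.9, 1.5],
    [0.7, 1.2, 1.1],
    [0.9, 1.0, 1.6],
    [0.8, 1.1, 1.2],
])

# Friedman test on per-dataset ranks: rejects if mean ranks differ.
stat, p = friedmanchisquare(*scores.T)
print(f"Friedman chi2={stat:.2f}, p={p:.3f}")

# Per-dataset rank of each model (1 = best), then the mean rank per model,
# which is what a Nemenyi-style post-hoc comparison operates on.
ranks = np.argsort(np.argsort(scores, axis=1), axis=1) + 1
print(ranks.mean(axis=0))
```

The double-argsort trick assumes no ties within a row; a production implementation would use an average-rank method (e.g. `scipy.stats.rankdata`) to handle ties.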

Circularity Check

0 steps flagged

Empirical benchmark paper with no derivation chain or fitted predictions

full rationale

The paper introduces EpiCastBench as a curated collection of 40 external multivariate epidemic datasets, standardized preprocessing, and off-the-shelf model evaluations. No equations, parameter fits, or predictions are defined inside the work that later get re-used as outputs. All claims rest on publicly available external data sources and standard statistical/deep-learning baselines rather than any self-referential construction. This matches the default expectation for benchmark papers and yields no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmarking paper. It introduces no new mathematical axioms, free parameters, or invented entities; it relies on existing public data sources and previously published forecasting models.

pith-pipeline@v0.9.0 · 5546 in / 1093 out tokens · 42846 ms · 2026-05-13T01:35:33.112713+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tagged: unclear

    Relation between the paper passage and the cited Recognition theorem.

    EpiCastBench comprises 40 publicly available multivariate time series datasets of incidence cases... standardized evaluation settings, including a unified forecasting horizon, consistent preprocessing pipelines, diverse performance metrics, and statistical significance testing.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages · 3 internal anchors

  1. [1] Corona virus brazil dataset. https://www.kaggle.com/datasets/unanimad/corona-virus-brazil
  2. [2] A. Adiga, J. Chou, A. Chiranth, B. Lewis, A. I. Bento, S. Truelove, G. Fox, M. Marathe, H. Hochheiser, and S. Venkatramanan. IDOBE: Infectious disease outbreak forecasting benchmark ecosystem. arXiv preprint arXiv:2604.18521, 2026.
  3. [3] A. S. Allende. Covid-19 data repository. https://github.com/alonsosilvaallende/COVID-19/tree/master/data
  4. [4] A. F. Ansari, O. Shchur, J. Küken, A. Auer, B. Han, P. Mercado, S. S. Rangapuram, H. Shen, L. Stella, X. Zhang, et al. Chronos-2: From univariate to universal forecasting. arXiv preprint arXiv:2510.15821, 2025.
  5. [5] M. Barman, M. Panja, N. Mishra, and T. Chakraborty. Epidemic-guided deep learning for spatiotemporal forecasting of tuberculosis outbreak. Machine Learning, 114(10):213, 2025.
  6. [6] M. Biggerstaff, D. Alper, M. Dredze, S. Fox, I. C.-H. Fung, K. S. Hickmann, B. Lewis, R. Rosenfeld, J. Shaman, M.-H. Tsou, et al. Results from the Centers for Disease Control and Prevention's Predict the 2013–2014 Influenza Season Challenge. BMC Infectious Diseases, 16(1):357, 2016.
  7. [7] L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
  8. [8] L. C. Brooks, D. C. Farrow, S. Hyun, R. J. Tibshirani, and R. Rosenfeld. Nonmechanistic forecasts of seasonal influenza with iterative one-week-ahead distributions. PLoS Computational Biology, 14(6):e1006134, 2018.
  9. [9] C. Challu, K. G. Olivares, B. N. Oreshkin, F. G. Ramirez, M. M. Canseco, and A. Dubrawski. NHITS: Neural hierarchical interpolation for time series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 6989–6997, 2023.
  10. [10] K. Charniga, Z. M. Cucunubá, M. Mercado, F. Prieto, M. Ospina, P. Nouvellet, and C. A. Donnelly. Spatial and temporal invasion dynamics of the 2014–2017 Zika and chikungunya epidemics in Colombia. PLOS Computational Biology, 17(7):e1009174, 2021.
  11. [11] S.-A. Chen, C.-L. Li, S. O. Arik, N. C. Yoder, and T. Pfister. TSMixer: An all-MLP architecture for time series forecasting. Transactions on Machine Learning Research, 2023.
  12. [12] T. Chen and C. Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794, 2016.
  13. [13] Y. Chen, Y. Kang, Y. Chen, and Z. Wang. Probabilistic forecasting with temporal convolutional neural network. Neurocomputing, 399:491–501, 2020.
  14. [14] V. K. R. Chimmula and L. Zhang. Time series forecasting of Covid-19 transmission in Canada using LSTM networks. Chaos, Solitons & Fractals, 135:109864, 2020.
  15. [15] J. Clarke, A. Lim, P. Gupte, D. Pigott, W. van Panhuis, and O. Brady. A global dataset of publicly available dengue case count data. Scientific Data, 11(1):296, 2024.
  16. [16] F. J. Colón-González, R. Gibb, K. Khan, A. Watts, R. Lowe, and O. J. Brady. Projecting the future incidence and burden of dengue in Southeast Asia. Nature Communications, 14(1):5439, 2023.
  17. [17] CONACYT. Covid-19 data for Mexico. https://datos.covid-19.conacyt.mx
  18. [18] CovidDataProject. Covid-19 data for Colombia. https://github.com/CovidDataProject/DataCovid19Colombia
  19. [19] E. Y. Cramer, E. L. Ray, V. K. Lopez, J. Bracher, A. Brennen, A. J. Castro Rivadeneira, A. Gerding, T. Gneiting, K. H. House, Y. Huang, et al. Evaluation of individual and ensemble probabilistic forecasts of Covid-19 mortality in the United States. Proceedings of the National Academy of Sciences, 119(15):e2113561119, 2022.
  20. [20] S. da Silva Neto, T. Tabosa de Oliveira, I. Teixiera, et al. Arboviral disease record data - dengue and chikungunya, Brazil, 2013–2020. Scientific Data, 9:198, 2022.
  21. [21] A. Das, W. Kong, A. Leach, S. K. Mathur, R. Sen, and R. Yu. Long-term forecasting with TiDE: Time-series dense encoder. Transactions on Machine Learning Research, 2023.
  22. [22] A. Das, W. Kong, R. Sen, and Y. Zhou. A decoder-only foundation model for time-series forecasting. In Forty-first International Conference on Machine Learning, 2024.
  23. [23] Dipartimento della Protezione Civile. Covid-19 data for Italian provinces. https://github.com/pcm-dpc/COVID-19/tree/master/dati-province
  24. [24] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
  25. [25] E. Dong, H. Du, and L. Gardner. An interactive web-based dashboard to track Covid-19 in real time. The Lancet Infectious Diseases, 20(5):533–534, 2020.
  26. [26] Centers for Disease Control and Prevention (CDC). Zika virus data repository. https://github.com/cdcepi/zika
  27. [27] European Centre for Disease Prevention and Control (ECDC). Data on daily new cases of Covid-19 in EU/EEA by country. https://www.ecdc.europa.eu/en/publications-data/data-daily-new-cases-covid-19-eueea-country
  28. [28] M. Friedman. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association, 32(200):675–701, 1937.
  29. [29] M. Friedman. A comparison of alternative tests of significance for the problem of m rankings. The Annals of Mathematical Statistics, 11(1):86–92, 1940.
  30. [30] J. Gehrcke. Covid-19 data for Germany. https://github.com/jgehrcke/covid-19-germany-gae
  31. [31] R. Godahewa, C. Bergmeir, G. I. Webb, R. J. Hyndman, and P. Montero-Manso. Monash time series forecasting archive. In 35th Conference on Neural Information Processing Systems (NeurIPS 2021) Track on Datasets and Benchmarks, 2021.
  32. [32] B. Goehry, H. Yan, Y. Goude, P. Massart, and J.-M. Poggi. Random forests for time series. REVSTAT-Statistical Journal, 21(2):283–302, 2023.
  33. [33] C. Gong, R. Sun, Y. Zheng, J. Zhang, T. Gu, L. Pan, and L. Lv. EpiLLM: Unlocking the potential of large language models in epidemic forecasting. arXiv preprint arXiv:2505.12738, 2025.
  34. [34] Government of Canada. Covid-19 epidemiological updates in Canada. https://health-infobase.canada.ca/covid-19/
  35. [35] S. Hassan. R dengue analysis. https://github.com/shakhassan/r-dengue-analysis
  36. [36] J. Herzen, F. Lässig, S. G. Piazzetta, T. Neuer, L. Tafti, G. Raille, T. Van Pottelbergh, M. Pasieka, A. Skrodzki, N. Huguenin, et al. Darts: User-friendly modern machine learning for time series. Journal of Machine Learning Research, 23(124):1–6, 2022.
  37. [37] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
  38. [38] R. Hyndman, Y. Kang, P. Montero-Manso, M. O'Hara-Wild, T. Talagala, E. Wang, Y. Yang, S. B. Taieb, C. Hanqing, D. K. Lake, N. Laptev, J. R. Moorman, and B. Zhang. tsfeatures: Time series feature extraction, 2023.
  39. [39] R. J. Hyndman and A. B. Koehler. Another look at measures of forecast accuracy. International Journal of Forecasting, 22(4):679–688, 2006.
  40. [40] M. A. Johansson, K. M. Apfeldorf, S. Dobson, J. Devita, A. L. Buczak, B. Baugher, L. J. Moniz, T. Bagley, S. M. Babin, E. Guven, et al. An open challenge to advance probabilistic forecasting for dengue epidemics. Proceedings of the National Academy of Sciences, 116(48):24268–24274, 2019.
  41. [41] K. E. Jones, N. G. Patel, M. A. Levy, A. Storeygard, D. Balk, J. L. Gittleman, and P. Daszak. Global trends in emerging infectious diseases. Nature, 451(7181):990–993, 2008.
  42. [42] A. J. Koning, P. H. Franses, M. Hibon, and H. O. Stekler. The M3 competition: Statistical tests of the results. International Journal of Forecasting, 21(3):397–409, 2005.
  43. [43] SFU Data Lab. Covid-19 data for Japan. https://github.com/sfu-db/covid19-datasets/blob/master/datasets-details/Japan.md
  44. [44] Z. Liu, Y. Wang, S. Vaidya, F. Ruehle, J. Halverson, M. Soljacic, T. Y. Hou, and M. Tegmark. KAN: Kolmogorov–Arnold networks. In The Thirteenth International Conference on Learning Representations, 2025.
  45. [45] R. Lowe, S. Lee, K. O'Reilly, O. Brady, L. Bastos, G. Carrasco-Escobar, R. de Castro Catão, F. Colón-González, C. Barcellos, M. Carvalho, M. Blangiardo, H. Rue, and A. Gasparrini. Combined effects of hydrometeorological hazards and urbanisation on dengue risk in Brazil: a spatiotemporal modelling study. The Lancet Planetary Health, 5(4):e209–e219, 2021.
  46. [46] C.-X. Lv, S.-Y. An, B.-J. Qiao, and W. Wu. Time series analysis of hemorrhagic fever with renal syndrome in mainland China by using an XGBoost forecasting model. BMC Infectious Diseases, 21(1):839, 2021.
  47. [47] M. Morris, P. Hayes, I. Cox, and V. Lampos. Neural network models for influenza forecasting with associated uncertainty using web search activity trends. PLoS Computational Biology, 19(8):e1011392, 2023.
  48. [48] RIVM (Netherlands). Covid-19 data for the Netherlands. https://data.rivm.nl/covid-19
  49. [49] New South Wales Government. Covid-19 cases by location. https://data.nsw.gov.au/data/dataset/covid-19-cases-by-location
  50. [50] Czech Ministry of Health. Covid-19 API for the Czech Republic. https://onemocneni-aktualne.mzcr.cz/api/v2/covid-19
  51. [51] Swiss Federal Office of Public Health (FOPH). Covid-19 data for Switzerland. https://idd.bag.admin.ch/diseases/covid/data
  52. [52] B. N. Oreshkin, D. Carpov, N. Chapados, and Y. Bengio. N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. In International Conference on Learning Representations, 2020.
  53. [53] V. V. Palacios. Covid-19 data for Spain. https://github.com/victorvicpal
  54. [54] M. Panja, T. Chakraborty, U. Kumar, and N. Liu. Epicasting: An ensemble wavelet neural network for forecasting epidemics. Neural Networks, 165:185–212, 2023.
  55. [55] F. Ping. Pahang WDF: Weather-dengue forecasting. https://github.com/ping543f/pahang-wdf/tree/main
  56. [56] I. C.-. D. Project. Covid-19 data for India. https://data.incovid19.org
  57. [57] Project Tycho, University of Pittsburgh. Project Tycho data repository. https://www.tycho.pitt.edu/
  58. [58] I. D. Repository. Covid-19 data for Ireland. https://respiratoryvirus.hpsc.ie/pages/covid-19
  59. [59] A. Rodriguez, H. Kamarthi, P. Agarwal, J. Ho, M. Patel, S. Sapre, and B. A. Prakash. Machine learning for data-centric epidemic forecasting. Nature Machine Intelligence, 6(10):1122–1131, 2024.
  60. [60] R. Rosenfeld and R. J. Tibshirani. Epidemic tracking and forecasting: Lessons learned from a tumultuous year. Proceedings of the National Academy of Sciences, 118(51):e2111456118, 2021.
  61. [61] B. Rozemberczki, P. Scherer, O. Kiss, R. Sarkar, and T. Ferenci. Chickenpox cases in Hungary: A benchmark dataset for spatiotemporal signal processing with graph neural networks, 2021.
  62. [62] D. Salinas, V. Flunkert, J. Gasthaus, and T. Januschowski. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting, 36(3):1181–1191, 2020.
  63. [63] Sciensano. Covid-19 epidemiological dashboard for Belgium. https://epistat.sciensano.be/covid/
  64. [64] The New York Times. Covid-19 data for the United States. https://github.com/nytimes/covid-19-data
  65. [65] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
  66. [66] A. Wang. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018.
  67. [67] T. White. Covid-19 data for the United Kingdom. https://github.com/tomwhite/covid-19-uk-data
  68. [68] N. Wu, B. Green, X. Ben, and S. O'Banion. Deep transformer models for time series forecasting: The influenza prevalence case. arXiv preprint arXiv:2001.08317, 2020.
  69. [69] A. Zeng, M. Chen, L. Zhang, and Q. Xu. Are transformers effective for time series forecasting? In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 11121–11128, 2023.