
arxiv: 2606.05692 · v2 · pith:JC3HOMYKnew · submitted 2026-06-04 · 💻 cs.LG · cs.AI

Benchmarking Counterfactual Prediction in Epidemic Time Series with Time-Varying Interventions

Pith reviewed 2026-06-28 02:12 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords counterfactual prediction, epidemic time series, causal inference, time-varying interventions, agent-based model, benchmark dataset, dynamic policies, multi-policy settings

The pith

A benchmark built from agent-based epidemic simulations supplies ground-truth counterfactuals to test causal methods under time-varying policies across more than 150 U.S. counties.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to fill a gap in time-series causal inference: the lack of benchmarks with observable counterfactual outcomes for epidemic data. It runs a calibrated agent-based model on real demographic, mobility, epidemiological, and policy inputs to produce trajectories under static and time-varying interventions, in both single-policy and multi-policy settings. A sympathetic reader would care because existing datasets either lack known counterfactuals or simplify the dynamics too far for realistic testing, which hides performance gaps among methods. The resulting evaluations show substantial differences in how well current methods handle these scenarios.

Core claim

We develop a large-scale benchmark for counterfactual prediction in epidemic time series under dynamic interventions that supports static and time-varying treatments as well as single-policy and multi-policy settings; the benchmark is generated from a calibrated agent-based model grounded in real-world data for more than 150 U.S. counties and reveals substantial performance differences among causal inference methods.

What carries the argument

The benchmark dataset of epidemic time series with ground-truth counterfactual trajectories generated by the calibrated agent-based model.
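
As a rough illustration of what a ground-truth counterfactual means here, the sketch below swaps the paper's calibrated agent-based model for a toy discrete-time SIR model. That substitution is an assumption made purely for brevity. The function `simulate_infections`, its parameters, and the policy schedules are hypothetical and do not come from the paper. Because the simulator is rerun from identical initial conditions under a different policy sequence, both the factual and the counterfactual trajectory are observable.

```python
from typing import List, Sequence


def simulate_infections(
    policy: Sequence[float],
    beta: float = 0.4,          # hypothetical baseline transmission rate
    gamma: float = 0.1,         # hypothetical recovery rate
    population: float = 10_000.0,
    initial_infected: float = 10.0,
) -> List[float]:
    """Toy discrete-time SIR model (a stand-in, not the paper's ABM).

    policy[t] in [0, 1] is the fractional reduction in transmission on day t.
    Returns the number of currently infected individuals after each day.
    """
    susceptible = population - initial_infected
    infected = initial_infected
    trajectory = []
    for strength in policy:
        new_infections = beta * (1.0 - strength) * susceptible * infected / population
        new_recoveries = gamma * infected
        susceptible -= new_infections
        infected += new_infections - new_recoveries
        trajectory.append(infected)
    return trajectory


horizon = 60
no_policy = [0.0] * horizon                           # never intervene
time_varying = [0.0] * 20 + [0.6] * 20 + [0.2] * 20   # intervene, then relax

# Same initial conditions, different policy sequence: both outcomes are known.
factual = simulate_infections(time_varying)
counterfactual = simulate_infections(no_policy)
effect = [f - c for f, c in zip(factual, counterfactual)]
```

The two trajectories coincide until the policy first switches on, after which their difference is the known treatment effect that a causal method would be asked to recover.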

If this is right

  • Causal inference methods can be evaluated across static, time-varying, single-policy, and multi-policy intervention settings within the same realistic epidemic framework.
  • Performance gaps among methods become measurable in scenarios that include changing policies and multiple simultaneous interventions.
  • The benchmark supplies trajectories for more than 150 U.S. counties (the Figure 2 caption gives 158), grounded in demographic, mobility, epidemiological, and policy data.
  • Progress in time-series causal reasoning can now be tracked against observable counterfactuals rather than simplified simulations or real data lacking ground truth.
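
A minimal sketch of how a method could be scored against known counterfactuals in each setting. The metric (root-mean-square error), the setting labels, and every number below are assumptions for illustration; this page does not state which metrics the paper uses.

```python
import math
from typing import Dict, List, Sequence, Tuple

Trajectory = Sequence[float]


def rmse(predicted: Trajectory, truth: Trajectory) -> float:
    """Root-mean-square error between a predicted and a true trajectory."""
    if len(predicted) != len(truth) or not truth:
        raise ValueError("trajectories must be non-empty and of equal length")
    squared = sum((p - y) ** 2 for p, y in zip(predicted, truth))
    return math.sqrt(squared / len(truth))


def mean_rmse_by_setting(
    cases: Dict[str, List[Tuple[Trajectory, Trajectory]]]
) -> Dict[str, float]:
    """Average RMSE over all (prediction, ground truth) pairs in each setting."""
    return {
        setting: sum(rmse(p, y) for p, y in pairs) / len(pairs)
        for setting, pairs in cases.items()
    }


# Hypothetical predictions and ground-truth counterfactuals, one pair per setting.
cases = {
    "static, single-policy": [([10.0, 12.0, 15.0], [10.0, 13.0, 15.0])],
    "time-varying, multi-policy": [([10.0, 12.0, 15.0], [13.0, 16.0, 15.0])],
}
scores = mean_rmse_by_setting(cases)
```

Reporting one score per setting, rather than a single pooled number, is what makes gaps between static and time-varying or single-policy and multi-policy scenarios visible.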

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The benchmark could support development of methods that handle simultaneous policy changes, which are common in real epidemic responses.
  • Similar simulation-based benchmarks might be constructed for other time-varying intervention domains such as economic or environmental policy.
  • Users of the benchmark could test whether method rankings remain stable when the underlying agent-based model is replaced with an alternative calibrated simulator.

Load-bearing premise

The agent-based model, once calibrated to real data, produces counterfactual trajectories that faithfully capture the complex causal dynamics of actual epidemics.

What would settle it

A direct comparison in which real observed epidemic outcomes after an actual policy change deviate systematically from the benchmark counterfactuals generated under identical starting conditions and the same policy sequence.
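
One simple way to quantify a systematic deviation is sketched below, under the assumption that an observed series and a benchmark counterfactual are aligned day by day. The function name and the numbers are hypothetical. A mean signed deviation far from zero, with almost all deviations sharing one sign, would point to bias rather than noise.

```python
from typing import Sequence, Tuple


def systematic_deviation(
    observed: Sequence[float], benchmark: Sequence[float]
) -> Tuple[float, float]:
    """Return (mean signed deviation, share of time steps where observed > benchmark)."""
    if len(observed) != len(benchmark) or not observed:
        raise ValueError("series must be non-empty and of equal length")
    deviations = [o - b for o, b in zip(observed, benchmark)]
    mean_deviation = sum(deviations) / len(deviations)
    share_above = sum(d > 0 for d in deviations) / len(deviations)
    return mean_deviation, share_above


# Hypothetical case counts after a policy change.
mean_dev, share_above = systematic_deviation([12.0, 15.0, 20.0], [10.0, 10.0, 10.0])
```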

Figures

Figures reproduced from arXiv: 2606.05692 by Alexander Rodríguez, Anik Mumssen, Facundo Yan, Marisa Eisenberg, Wenhao Mu.

Figure 1: Overview of the counterfactual simulation pipeline.
Figure 2: Distribution of treatment effects across our 158 U.S. counties.
Figure 3: Side-by-side comparison of intervention effect in dynamics
Figure 4: Calibration validation for four representative counties.
Figure 5: Comparison of "No Policy" vs. "Combined Treatment" scenarios for two counties where intervention policies effectively delay and reduce peak infections while resulting in a larger total volume of cases over the simulation.

Original abstract

Deep learning has enabled significant advances in time-series causal inference, yet progress remains constrained by the lack of realistic benchmarks with observable counterfactual outcomes. Existing datasets either rely on real-world observations without ground-truth counterfactuals or on simplified simulations that fail to capture complex causal dynamics. To address this gap, we develop a large-scale benchmark for counterfactual prediction in epidemic time series under dynamic interventions. Unlike existing benchmarks, it supports static and time-varying treatments, as well as both single-policy and multi-policy intervention settings, enabling evaluation of causal inference methods across a broad range of causal inference scenarios. Leveraging a calibrated agent-based model grounded in real-world demographic, mobility, epidemiological, and policy data, we generate realistic counterfactual trajectories across more than 150 U.S. counties. Using this benchmark, we evaluate widely used and state-of-the-art causal inference methods, revealing substantial performance differences and highlighting the challenges of realistic time-series causal reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims to address the lack of realistic benchmarks for counterfactual prediction in epidemic time series by constructing a large-scale dataset from a calibrated agent-based model (ABM) grounded in U.S. county-level demographic, mobility, epidemiological, and policy data. This benchmark supports static/time-varying treatments and single/multi-policy interventions across >150 counties, and is used to evaluate causal inference methods, with the abstract asserting that it reveals substantial performance differences among methods.

Significance. If the ABM-generated counterfactuals are shown to faithfully represent intervention effects, the benchmark would be a significant contribution by providing observable ground-truth trajectories in a complex, real-data-grounded epidemiological setting that existing simplified simulations lack. It could enable more rigorous evaluation of time-series causal methods across diverse intervention scenarios.

major comments (2)
  1. [Methods (ABM calibration and counterfactual generation)] The central claim that the benchmark provides faithful counterfactual trajectories under time-varying interventions rests on the ABM's internal causal mechanisms, yet the manuscript supplies no quantitative validation (e.g., hold-out tests or sensitivity analyses) of how interventions alter transmission parameters. Calibration to observed trajectories ensures only marginal fit and does not confirm the intervention response functions used to generate counterfactuals.
  2. [Evaluation and Results] The abstract and results claim 'substantial performance differences' among methods, but the evaluation reports no error bars, standard errors, or statistical significance tests on the performance metrics across counties or runs. This weakens the ability to interpret the benchmark's comparative findings as robust.
minor comments (2)
  1. [Abstract and Methods] The abstract and methods should include explicit details on data exclusion criteria, exact ABM fitting procedures, and the number of methods/metrics evaluated to improve reproducibility.
  2. [Introduction and Setup] Notation for time-varying interventions and multi-policy settings could be clarified with a dedicated table or diagram early in the paper.
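
The kind of sensitivity analysis raised in major comment 1 could look roughly like the sketch below. It uses a toy discrete-time SIR model in place of the paper's agent-based model, which is an assumption, since the paper's intervention response functions are not given on this page. The function `peak_infected`, its default parameters, and the swept values are all hypothetical. The sweep varies the assumed transmission reduction of a policy and records how the peak of the simulated trajectory shifts relative to no intervention.

```python
def peak_infected(
    reduction: float,
    beta: float = 0.4,          # hypothetical baseline transmission rate
    gamma: float = 0.1,         # hypothetical recovery rate
    population: float = 10_000.0,
    initial_infected: float = 10.0,
    start_day: int = 10,        # day on which the policy switches on
    horizon: int = 120,
) -> float:
    """Peak number of infected individuals in a toy SIR run (not the paper's ABM).

    `reduction` is the assumed fractional drop in transmission once the
    policy is active; this is the parameter whose influence is examined.
    """
    susceptible = population - initial_infected
    infected = initial_infected
    peak = infected
    for day in range(horizon):
        strength = reduction if day >= start_day else 0.0
        new_infections = beta * (1.0 - strength) * susceptible * infected / population
        susceptible -= new_infections
        infected += new_infections - gamma * infected
        peak = max(peak, infected)
    return peak


baseline_peak = peak_infected(0.0)
# Change in peak infections for each assumed policy effect size.
sweep = {r: peak_infected(r) - baseline_peak for r in (0.2, 0.4, 0.6)}
```

If the counterfactual summaries move sharply across plausible values of the response parameter, the benchmark's ground truth is only as reliable as that parameter choice, which is the referee's concern.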

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important aspects of validation and statistical rigor. We respond to each major comment below and indicate planned revisions to the manuscript.

Point-by-point responses
  1. Referee: [Methods (ABM calibration and counterfactual generation)] The central claim that the benchmark provides faithful counterfactual trajectories under time-varying interventions rests on the ABM's internal causal mechanisms, yet the manuscript supplies no quantitative validation (e.g., hold-out tests or sensitivity analyses) of how interventions alter transmission parameters. Calibration to observed trajectories ensures only marginal fit and does not confirm the intervention response functions used to generate counterfactuals.

    Authors: We agree that the manuscript does not include dedicated quantitative validation such as hold-out tests or sensitivity analyses specifically targeting the intervention response functions. The ABM calibration matches observed trajectories, with intervention effects parameterized from established epidemiological models in the literature. To strengthen this aspect, the revised manuscript will add sensitivity analyses on key intervention parameters (e.g., transmission rate adjustments under policies) and report their impact on generated counterfactuals. revision: yes

  2. Referee: [Evaluation and Results] The abstract and results claim 'substantial performance differences' among methods, but the evaluation reports no error bars, standard errors, or statistical significance tests on the performance metrics across counties or runs. This weakens the ability to interpret the benchmark's comparative findings as robust.

    Authors: We concur that the absence of variability measures and statistical tests limits the robustness of the comparative claims. The current results present aggregate metrics without error bars or significance testing. In the revision, we will include standard errors across counties and simulation runs, along with appropriate statistical tests (e.g., paired t-tests or Wilcoxon tests) to support statements about performance differences. revision: yes
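
A minimal sketch of the variability reporting promised in the second response, using only the Python standard library. The per-county errors are invented for illustration, and `paired_comparison` is a hypothetical helper. It returns the mean paired difference, its standard error across counties, and the paired t statistic.

```python
import math
import statistics
from typing import Sequence, Tuple


def paired_comparison(
    errors_a: Sequence[float], errors_b: Sequence[float]
) -> Tuple[float, float, float]:
    """Compare two methods on the same counties.

    Returns (mean difference, standard error of the mean difference,
    paired t statistic). Requires at least two counties and differences
    that are not all identical, otherwise the standard error is zero.
    """
    if len(errors_a) != len(errors_b) or len(errors_a) < 2:
        raise ValueError("need two equal-length samples with at least 2 counties")
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    if std_err == 0:
        raise ValueError("all paired differences are identical")
    return mean_diff, std_err, mean_diff / std_err


# Hypothetical per-county errors for two methods on the same four counties.
method_a = [1.0, 2.0, 3.0, 4.0]
method_b = [0.5, 1.0, 2.5, 3.0]
mean_diff, std_err, t_stat = paired_comparison(method_a, method_b)
```

For p-values, `scipy.stats.ttest_rel` and `scipy.stats.wilcoxon` cover the paired t-test and the Wilcoxon signed-rank test named in the response. Pairing by county matters because counties differ widely in difficulty, so unpaired comparisons would mix method differences with county differences.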

Circularity Check

0 steps flagged

No circularity: benchmark generated from external ABM with no self-referential reductions

Full rationale

The paper constructs a benchmark by running a calibrated agent-based model on real-world demographic, mobility, epidemiological, and policy data to produce counterfactual trajectories. No equations, predictions, or central claims reduce by construction to parameters fitted within the paper itself, nor do they rely on self-citation chains or imported uniqueness theorems. The generation process is independent of the causal inference methods being evaluated, and the assumption that the ABM faithfully captures dynamics is an external modeling choice rather than a definitional loop. This is a standard non-circular benchmarking setup.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the unverified fidelity of the agent-based model; no free parameters, axioms, or invented entities are enumerated in the abstract.

pith-pipeline@v0.9.1-grok · 5696 in / 1010 out tokens · 26526 ms · 2026-06-28T02:12:07.319082+00:00 · methodology


Reference graph

Works this paper leans on

49 extracted references · 1 linked inside Pith

  1. Matthew Abueg, Robert Hinch, Neo Wu, Luyang Liu, William Probert, Austin Wu, Paul Eastham, Yusef Shafi, Matt Rosencrantz, Michael Dikovsky, et al. 2021. Modeling the effect of exposure notification and non-pharmaceutical interventions on COVID-19 transmission in Washington state. NPJ digital medicine 4, 1 (2021), 49.

  2. Joseph Aylett-Bullock, Carolina Cuesta-Lazaro, Arnau Quera-Bofarull, Miguel Icaza-Lizaola, Aidan Sedgewick, Henry Truong, Aoife Curran, Edward Elliott, Tristan Caulfield, Kevin Fong, et al. 2021. June: open-source individual-based epidemiology simulation. Royal Society open science 8, 7 (2021), 210506.

  3. Ioana Bica, Ahmed M Alaa, James Jordon, and Mihaela van der Schaar. 2020. Estimating counterfactual treatment outcomes over time through adversarially balanced representations. In International Conference on Learning Representations.

  4. Keith R Bisset, Jiangzhuo Chen, Xizhou Feng, VS Anil Kumar, and Madhav V Marathe. 2009. EpiFast: a fast algorithm for large scale realistic epidemic simulations on distributed memory systems. In Proceedings of the 23rd international conference on Supercomputing. 430–439.

  5. Eric Bonabeau. 2002. Agent-based modeling: Methods and techniques for simulating human systems. Proceedings of the national academy of sciences 99, suppl_3 (2002), 7280–7287.

  6. Jeanne Brooks-Gunn, Fong-ruey Liaw, and Pamela Kato Klebanov. 1992. Effects of early intervention on cognitive function of low birth weight preterm infants. The Journal of pediatrics 120, 3 (1992), 350–359.

  7. Ayush Chopra, Alexander Rodriguez, B Aditya Prakash, Ramesh Raskar, and Thomas Kingsley. 2023. Using neural networks to calibrate agent based models enables improved regional evidence for vaccine strategy and policy. Vaccine 41, 48 (2023), 7067–7071.

  8. Ayush Chopra, Alexander Rodríguez, Jayakumar Subramanian, Arnau Quera-Bofarull, Balaji Krishnamurthy, B Aditya Prakash, and Ramesh Raskar. 2023. Differentiable Agent-based Epidemiology. In Proceedings of the 2023 International Conference on Autonomous Agents and Multiagent Systems. 1848–1857.

  9. Ayush Chopra, Jayakumar Subramanian, Balaji Krishnamurthy, and Ramesh Raskar. 2024. flame: A Framework for Learning in Agent-based ModEls. In Proceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems. 391–399.

  10. Richard A Davis, Keh-Shin Lii, and Dimitris N Politis. 2011. Remarks on some nonparametric estimates of a density function. In Selected Works of Murray Rosenblatt. Springer, 95–100.

  11. Ken Eames, Shweta Bansal, Simon Frost, and Steven Riley. 2015. Six challenges in measuring contact networks for use in modelling. Epidemics 10 (2015), 72–77.

  12. Muhammad Hasan Ferdous, Emam Hossain, and Md Osman Gani. 2025. Timegraph: Synthetic benchmark datasets for robust time-series causal discovery. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2. 5425–5435.

  13. Dennis Frauen, Tobias Hatt, Valentyn Melnychuk, and Stefan Feuerriegel. 2023. Estimating average causal effects from patient trajectories. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37. 7586–7594.

  14. Chad M Glen, Melissa L Kemp, and Eberhard O Voit. 2019. Agent-based modeling of morphogenetic systems: Advantages and challenges. PLoS computational biology 15, 3 (2019), e1006577.

  15. Nicolò Gozzi, Matteo Chinazzi, Jessica T Davis, Corrado Gioannini, Luca Rossi, Marco Ajelli, Nicola Perra, and Alessandro Vespignani. 2025. Epydemix: An open-source Python package for epidemic modeling with integrated approximate Bayesian calibration. PLOS Computational Biology 21, 11 (2025), e1013735.

  16. Clive WJ Granger. 1969. Investigating causal relations by econometric models and cross-spectral methods. Econometrica: journal of the Econometric Society (1969), 424–438.

  17. Donald Hedeker and Robert D Gibbons. 2006. Longitudinal data analysis. John Wiley & Sons.

  18. Robert Hinch, William JM Probert, Anel Nurtay, Michelle Kendall, Chris Wymant, Matthew Hall, Katrina Lythgoe, Ana Bulas Cruz, Lele Zhao, Andrea Stewart, et al. 2021. OpenABM-Covid19—An agent-based model for non-pharmaceutical interventions against COVID-19 including contact tracing. PLoS computational biology 17, 7 (2021), e1009146.

  19. Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. Advances in neural information processing systems 33 (2020), 6840–6851.

  20. John H Holland and John H Miller. 1991. Artificial adaptive agents in economic theory. The American economic review 81, 2 (1991), 365–370.

  21. Emily Howerton, Lucie Contamin, Luke C Mullany, Michelle Qin, Nicholas G Reich, Samantha Bents, Rebecca K Borchering, Sung-mok Jung, Sara L Loo, Claire P Smith, et al. 2023. Evaluation of the US COVID-19 Scenario Modeling Hub for informing pandemic response under uncertainty. Nature communications 14, 1 (2023), 7260.

  22. Alistair EW Johnson, Tom J Pollard, Lu Shen, Li-wei H Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. 2016. MIMIC-III, a freely accessible critical care database. Scientific data 3, 1 (2016), 1–9.

  23. Diederik P Kingma and Max Welling. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013).

  24. Sören R Künzel, Jasjeet S Sekhon, Peter J Bickel, and Bin Yu. 2019. Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the national academy of sciences 116, 10 (2019), 4156–4165.

  25. Rui Li, Stephanie Hu, Mingyu Lu, Yuria Utsumi, Prithwish Chakraborty, Daby M Sow, Piyush Madan, Jun Li, Mohamed Ghalwash, Zach Shahn, et al. 2021. G-net: a recurrent network approach to g-computation for counterfactual prediction under a dynamic treatment regime. In Machine Learning for Health. PMLR, 282–299.

  26. Ruipu Li and Alexander Rodríguez. 2025. Neural Conformal Control for Time Series Forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 18439–18447.

  27. Bryan Lim. 2018. Forecasting treatment responses over time using recurrent marginal structural networks. Advances in neural information processing systems 31 (2018).

  28. Yuchen Ma, Valentyn Melnychuk, Jonas Schweisthal, and Stefan Feuerriegel. 2024. DiffPO: A causal diffusion model for learning distributions of potential outcomes. Advances in Neural Information Processing Systems 37 (2024), 43663–43692.

  29. Madhav Marathe and Anil Kumar S Vullikanti. 2013. Computational epidemiology. Commun. ACM 56, 7 (2013), 88–96.

  30. Valentyn Melnychuk, Dennis Frauen, and Stefan Feuerriegel. 2022. Causal transformer for estimating counterfactual outcomes. In International conference on machine learning. PMLR, 15293–15329.

  31. Valentyn Melnychuk, Dennis Frauen, and Stefan Feuerriegel. 2023. Normalizing flows for interventional density estimation. In International Conference on Machine Learning. PMLR, 24361–24397.

  32. Erica EM Moodie, Thomas S Richardson, and David A Stephens. 2007. Demystifying optimal dynamic treatment regimes. Biometrics 63, 2 (2007), 447–455.

  33. Raha Moraffah, Paras Sheth, Mansooreh Karami, Anchit Bhattacharya, Qianru Wang, Anique Tahir, Adrienne Raglin, and Huan Liu. 2021. Causal inference for time series analysis: Problems, methods and evaluation. Knowledge and Information Systems 63, 12 (2021), 3041–3085.

  34. Wenhao Mu, Zhi Cao, Mehmed Uludag, and Alexander Rodríguez. 2025. Counterfactual probabilistic diffusion with expert models. arXiv preprint arXiv:2508.13355 (2025).

  35. Judea Pearl. 2009. Causal inference in statistics: An overview. (2009).

  36. Lorenzo Pellis, Frank Ball, Shweta Bansal, Ken Eames, Thomas House, Valerie Isham, and Pieter Trapman. 2015. Eight challenges for network epidemic models. Epidemics 10 (2015), 58–62.

  37. James Robins. 1986. A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect. Mathematical modelling 7, 9-12 (1986), 1393–1512.

  38. James M Robins. 1994. Correcting for non-compliance in randomized trials using structural nested mean models. Communications in Statistics-Theory and methods 23, 8 (1994), 2379–2412.

  39. Alexander Rodríguez, Harshavardhan Kamarthi, Pulak Agarwal, Javen Ho, Mira Patel, Suchet Sapre, and B Aditya Prakash. 2024. Machine learning for data-centric epidemic forecasting. Nature Machine Intelligence 6, 10 (2024), 1122–1131.

  40. Alvaro Ruiz-Martinez, Chang Gong, Hanwen Wang, Richard J Sové, Haoyang Mi, Holly Kimko, and Aleksander S Popel. 2022. Simulations of tumor growth and response to immunotherapy by coupling a spatial agent-based model with a whole-patient quantitative systems pharmacology model. PLoS computational biology 18, 7 (2022), e1010254.

  41. Jakob Runge, Andreas Gerhardus, Gherardo Varando, Veronika Eyring, and Gustau Camps-Valls. 2023. Causal inference for time series. Nature Reviews Earth & Environment 4, 7 (2023), 487–505.

  42. Mohammed Saeed, Christine Lieu, Greg Raber, and Roger G Mark. 2002. MIMIC II: a massive temporal ICU patient database to support research in intelligent patient monitoring. In Computers in cardiology. IEEE, 641–644.

  43. Nabeel Seedat, Fergus Imrie, Alexis Bellot, Zhaozhi Qian, and Mihaela van der Schaar. 2022. Continuous-Time Modeling of Counterfactual Outcomes Using Neural Controlled Differential Equations. In International Conference on Machine Learning. PMLR, 19497–19521.

  44. Alex Sherstinsky. 2020. Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network. Physica D: Nonlinear Phenomena 404 (2020), 132306.

  45. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).

  46. Joseph T Wu, Kathy Leung, and Gabriel M Leung. 2020. Nowcasting and forecasting the potential domestic and international spread of the 2019-nCoV outbreak originating in Wuhan, China: a modelling study. The lancet 395, 10225 (2020), 689–697.

  47. Shenghao Wu, Wenbin Zhou, Minshuo Chen, and Shixiang Zhu. 2024. Counterfactual generative models for time-varying treatments. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 3402–3413.

    Benchmarking Counterfactual Prediction in Epidemic Time Series with Time-Varying Interventions KDD ’26, August 09–13, 2026, Jeju I...

  48. Jinsung Yoon, James Jordon, and Mihaela Van Der Schaar. 2018. GANITE: Estimation of individualized treatment effects using generative adversarial nets. In International conference on learning representations.

  49. Stephan Zheng, Alexander Trott, Sunil Srinivasa, David C Parkes, and Richard Socher. 2022. The AI Economist: Taxation policy design via two-level deep multiagent reinforcement learning. Science advances 8, 18 (2022), eabk2607.