Benchmarking Counterfactual Prediction in Epidemic Time Series with Time-Varying Interventions
Pith reviewed 2026-06-28 02:12 UTC · model grok-4.3
The pith
A benchmark built from agent-based epidemic simulations supplies ground-truth counterfactuals to test causal methods under time-varying policies across more than 150 U.S. counties.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We develop a large-scale benchmark for counterfactual prediction in epidemic time series under dynamic interventions. It supports static and time-varying treatments in both single-policy and multi-policy settings, is generated from a calibrated agent-based model grounded in real-world data for more than 150 U.S. counties, and reveals substantial performance differences among causal inference methods.
What carries the argument
The benchmark dataset of epidemic time series with ground-truth counterfactual trajectories generated by the calibrated agent-based model.
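To make the idea of a simulator-generated ground-truth counterfactual concrete, here is a minimal sketch. It is not the paper's agent-based model: it is a toy compartmental (SIR-style) simulation, and the function name `simulate`, the parameters `beta0` and `gamma`, and the linear intervention response `beta0 * (1 - strength)` are all assumptions made for illustration. The point it shows is that replaying the same seed under a different policy sequence yields a counterfactual trajectory that shares the exogenous noise of the factual one.

```python
import random


def simulate(policy, seed, population=10_000.0, initial_infected=10.0,
             beta0=0.30, gamma=0.10):
    """Toy SIR-style run; policy[t] in [0, 1] scales down transmission on day t.

    Hypothetical stand-in for a calibrated simulator, for illustration only.
    """
    rng = random.Random(seed)
    # Exogenous daily noise is drawn up front, so it is identical for every
    # policy sequence simulated with the same seed.
    noise = [rng.lognormvariate(0.0, 0.1) for _ in policy]
    s, i = population - initial_infected, initial_infected
    new_cases = []
    for strength, eps in zip(policy, noise):
        beta_t = beta0 * (1.0 - strength) * eps  # assumed intervention response
        infections = min(s, beta_t * s * i / population)
        recoveries = gamma * i
        s -= infections
        i += infections - recoveries
        new_cases.append(infections)
    return new_cases


horizon = 60
factual_policy = [0.0] * 20 + [0.5] * 40   # time-varying: policy starts on day 20
counterfactual_policy = [0.0] * horizon    # same county, intervention never applied

factual = simulate(factual_policy, seed=0)
counterfactual = simulate(counterfactual_policy, seed=0)
```

Because both runs share the seed, the two trajectories coincide until the policies diverge on day 20, which is what makes the difference after that point attributable to the intervention within the simulator.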
If this is right
- Causal inference methods can be evaluated across static, time-varying, single-policy, and multi-policy intervention settings within the same realistic epidemic framework.
- Performance gaps among methods become measurable in scenarios that include changing policies and multiple simultaneous interventions.
- The benchmark supplies counterfactual trajectories for more than 150 U.S. counties, grounded in demographic, mobility, epidemiological, and policy data.
- Progress in time-series causal reasoning can now be tracked against ground-truth counterfactuals, rather than against simplified simulations or real-world data that lack them.
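As a rough illustration of what evaluating a method against ground-truth counterfactuals can look like, the sketch below scores predictions with root-mean-square error and averages the score per intervention setting. The metric choice, the record layout, the county and setting labels, and all numbers are hypothetical; the source does not state which metrics the paper reports.

```python
import math


def rmse(predicted, truth):
    """Root-mean-square error between two equally long trajectories."""
    if len(predicted) != len(truth):
        raise ValueError("trajectories must have the same length")
    return math.sqrt(sum((p - y) ** 2 for p, y in zip(predicted, truth)) / len(truth))


def rmse_by_setting(records):
    """Average per-county RMSE within each intervention setting."""
    per_setting = {}
    for _county, setting, predicted, truth in records:
        per_setting.setdefault(setting, []).append(rmse(predicted, truth))
    return {setting: sum(v) / len(v) for setting, v in per_setting.items()}


# Hypothetical records: (county, setting, predicted, ground-truth counterfactual).
records = [
    ("county_a", "static", [10.0, 12.0, 15.0], [11.0, 12.0, 14.0]),
    ("county_b", "static", [5.0, 6.0, 8.0], [5.0, 7.0, 8.0]),
    ("county_a", "time_varying", [10.0, 9.0, 7.0], [10.0, 11.0, 10.0]),
]

scores = rmse_by_setting(records)
```

Grouping by setting keeps static and time-varying results separate, so a method that does well only under static treatments is not hidden by an overall average.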
Where Pith is reading between the lines
- The benchmark could support development of methods that handle simultaneous policy changes, which are common in real epidemic responses.
- Similar simulation-based benchmarks might be constructed for other time-varying intervention domains such as economic or environmental policy.
- Users of the benchmark could test whether method rankings remain stable when the underlying agent-based model is replaced with an alternative calibrated simulator.
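One way to check the ranking-stability idea in the last bullet is a Spearman rank correlation between method errors obtained under two simulators. The sketch below is a plain-Python implementation under stated assumptions: the method names and error values are placeholders, not results from the paper, and "simulator A" and "simulator B" are hypothetical.

```python
def ranks(values):
    """Ranks starting at 1 for the smallest value; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in order[pos:end + 1]:
            result[k] = average_rank
        pos = end + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Placeholder errors for four hypothetical methods under two simulators.
methods = ["method_1", "method_2", "method_3", "method_4"]
errors_simulator_a = [0.9, 0.5, 0.7, 0.3]
errors_simulator_b = [1.1, 0.4, 0.6, 0.5]

rho = spearman(errors_simulator_a, errors_simulator_b)
```

A value of `rho` near 1 would suggest that method rankings survive the change of simulator; a low or negative value would suggest the rankings depend on the particular simulator.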
Load-bearing premise
The agent-based model, once calibrated to real data, produces counterfactual trajectories that faithfully capture the complex causal dynamics of actual epidemics.
What would settle it
A direct comparison in which real observed epidemic outcomes after an actual policy change deviate systematically from the benchmark counterfactuals generated under identical starting conditions and the same policy sequence.
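A simple way to operationalize "deviate systematically" is to look at signed deviations rather than absolute errors: random disagreement averages out across counties, while a consistent over- or under-shoot does not. The sketch below is only one possible check. The two-standard-error threshold, the function names, and the toy numbers are assumptions, not anything specified in the source.

```python
import math
import statistics


def mean_signed_deviation(observed, benchmark):
    """Average of (observed - benchmark) over one county's trajectory."""
    return sum(o - b for o, b in zip(observed, benchmark)) / len(observed)


def systematic_bias(per_county_deviation, z=2.0):
    """Return (mean, standard error, flag) for per-county signed deviations.

    The flag is True when the mean deviation exceeds z standard errors,
    a rough heuristic rather than a formal test.
    """
    n = len(per_county_deviation)
    mean = statistics.fmean(per_county_deviation)
    se = statistics.stdev(per_county_deviation) / math.sqrt(n)
    return mean, se, abs(mean) > z * se


# Hypothetical (observed, benchmark counterfactual) pairs for four counties.
pairs = [
    ([12.0, 14.0, 16.0], [10.0, 11.0, 12.0]),
    ([8.0, 9.0, 10.0], [7.0, 7.0, 8.0]),
    ([20.0, 22.0, 25.0], [18.0, 19.0, 21.0]),
    ([5.0, 6.0, 6.0], [5.0, 5.0, 4.0]),
]

deviations = [mean_signed_deviation(obs, bench) for obs, bench in pairs]
mean_dev, se_dev, is_systematic = systematic_bias(deviations)
```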
Original abstract
Deep learning has enabled significant advances in time-series causal inference, yet progress remains constrained by the lack of realistic benchmarks with observable counterfactual outcomes. Existing datasets either rely on real-world observations without ground-truth counterfactuals or on simplified simulations that fail to capture complex causal dynamics. To address this gap, we develop a large-scale benchmark for counterfactual prediction in epidemic time series under dynamic interventions. Unlike existing benchmarks, it supports static and time-varying treatments, as well as both single-policy and multi-policy intervention settings, enabling evaluation of causal inference methods across a broad range of causal inference scenarios. Leveraging a calibrated agent-based model grounded in real-world demographic, mobility, epidemiological, and policy data, we generate realistic counterfactual trajectories across more than 150 U.S. counties. Using this benchmark, we evaluate widely used and state-of-the-art causal inference methods, revealing substantial performance differences and highlighting the challenges of realistic time-series causal reasoning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to address the lack of realistic benchmarks for counterfactual prediction in epidemic time series by constructing a large-scale dataset from a calibrated agent-based model (ABM) grounded in U.S. county-level demographic, mobility, epidemiological, and policy data. This benchmark supports static/time-varying treatments and single/multi-policy interventions across >150 counties, and is used to evaluate causal inference methods, with the abstract asserting that it reveals substantial performance differences among methods.
Significance. If the ABM-generated counterfactuals are shown to faithfully represent intervention effects, the benchmark would be a significant contribution by providing observable ground-truth trajectories in a complex, real-data-grounded epidemiological setting that existing simplified simulations lack. It could enable more rigorous evaluation of time-series causal methods across diverse intervention scenarios.
Major comments (2)
- [Methods (ABM calibration and counterfactual generation)] The central claim that the benchmark provides faithful counterfactual trajectories under time-varying interventions rests on the ABM's internal causal mechanisms, yet the manuscript supplies no quantitative validation (e.g., hold-out tests or sensitivity analyses) of how interventions alter transmission parameters. Calibration to observed trajectories ensures only marginal fit and does not confirm the intervention response functions used to generate counterfactuals.
- [Evaluation and Results] The abstract and results claim 'substantial performance differences' among methods, but the evaluation reports no error bars, standard errors, or statistical significance tests on the performance metrics across counties or runs. This weakens the ability to interpret the benchmark's comparative findings as robust.
Minor comments (2)
- [Abstract and Methods] The abstract and methods should include explicit details on data exclusion criteria, exact ABM fitting procedures, and the number of methods/metrics evaluated to improve reproducibility.
- [Introduction and Setup] Notation for time-varying interventions and multi-policy settings could be clarified with a dedicated table or diagram early in the paper.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which highlight important aspects of validation and statistical rigor. We respond to each major comment below and indicate planned revisions to the manuscript.
- Referee: [Methods (ABM calibration and counterfactual generation)] The central claim that the benchmark provides faithful counterfactual trajectories under time-varying interventions rests on the ABM's internal causal mechanisms, yet the manuscript supplies no quantitative validation (e.g., hold-out tests or sensitivity analyses) of how interventions alter transmission parameters. Calibration to observed trajectories ensures only marginal fit and does not confirm the intervention response functions used to generate counterfactuals.
  Authors: We agree that the manuscript does not include dedicated quantitative validation such as hold-out tests or sensitivity analyses specifically targeting the intervention response functions. The ABM calibration matches observed trajectories, with intervention effects parameterized from established epidemiological models in the literature. To strengthen this aspect, the revised manuscript will add sensitivity analyses on key intervention parameters (e.g., transmission rate adjustments under policies) and report their impact on generated counterfactuals.
  Revision: yes
- Referee: [Evaluation and Results] The abstract and results claim 'substantial performance differences' among methods, but the evaluation reports no error bars, standard errors, or statistical significance tests on the performance metrics across counties or runs. This weakens the ability to interpret the benchmark's comparative findings as robust.
  Authors: We concur that the absence of variability measures and statistical tests limits the robustness of the comparative claims. The current results present aggregate metrics without error bars or significance testing. In the revision, we will include standard errors across counties and simulation runs, along with appropriate statistical tests (e.g., paired t-tests or Wilcoxon tests) to support statements about performance differences.
  Revision: yes
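The variability measures discussed in the second response can be sketched as follows. This is an illustration only: it reports a mean with its standard error across counties, and for the paired comparison it swaps in an exact sign-flip permutation test as a dependency-free stand-in for the paired t-test or Wilcoxon test named in the rebuttal. The per-county error values and the two method labels are hypothetical.

```python
import itertools
import math
import statistics


def mean_and_se(values):
    """Mean and standard error of the mean across counties (or runs)."""
    return statistics.fmean(values), statistics.stdev(values) / math.sqrt(len(values))


def paired_sign_flip_p_value(errors_a, errors_b):
    """Exact two-sided sign-flip permutation test on paired differences.

    Enumerates all 2**n sign assignments, so it is only practical for small n.
    """
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    observed = abs(sum(diffs))
    count = total = 0
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed - 1e-12:
            count += 1
    return count / total


# Hypothetical per-county errors for two methods on the same eight counties.
errors_method_a = [0.92, 0.85, 1.10, 0.78, 0.95, 1.02, 0.88, 0.99]
errors_method_b = [0.80, 0.79, 0.95, 0.70, 0.90, 0.91, 0.85, 0.87]

mean_a, se_a = mean_and_se(errors_method_a)
mean_b, se_b = mean_and_se(errors_method_b)
p_value = paired_sign_flip_p_value(errors_method_a, errors_method_b)
```

Pairing by county matters because counties differ widely in difficulty; comparing the two methods on the same county removes that shared variation. With more than 150 counties, exact enumeration is infeasible, so a random sample of sign assignments or a library implementation of the Wilcoxon signed-rank test would be the practical choice.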
Circularity Check
No circularity: benchmark generated from external ABM with no self-referential reductions
The paper constructs a benchmark by running a calibrated agent-based model on real-world demographic, mobility, epidemiological, and policy data to produce counterfactual trajectories. No equations, predictions, or central claims reduce by construction to parameters fitted within the paper itself, nor do they rely on self-citation chains or imported uniqueness theorems. The generation process is independent of the causal inference methods being evaluated, and the assumption that the ABM faithfully captures dynamics is an external modeling choice rather than a definitional loop. This is a standard non-circular benchmarking setup.
Venue: KDD ’26, August 09–13, 2026, Jeju I...