Benchmarking Counterfactual Prediction in Epidemic Time Series with Time-Varying Interventions

Alexander Rodríguez; Anik Mumssen; Facundo Yan; Marisa Eisenberg; Wenhao Mu

arxiv: 2606.05692 · v2 · pith:JC3HOMYKnew · submitted 2026-06-04 · 💻 cs.LG · cs.AI

Wenhao Mu, Facundo Yan, Anik Mumssen, Marisa Eisenberg, Alexander Rodríguez

Pith reviewed 2026-06-28 02:12 UTC · model grok-4.3

keywords counterfactual prediction · epidemic time series · causal inference · time-varying interventions · agent-based model · benchmark dataset · dynamic policies · multi-policy settings

The pith

A benchmark built from agent-based epidemic simulations supplies ground-truth counterfactuals to test causal methods under time-varying policies across more than 150 U.S. counties.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to fill a gap in time-series causal inference by creating a benchmark that supplies observable counterfactual outcomes for epidemic data. It does so by running a calibrated agent-based model on real demographic, mobility, epidemiological, and policy inputs to produce trajectories under static and time-varying interventions, in both single-policy and multi-policy settings. A sympathetic reader would care because existing datasets either lack known counterfactuals or simplify the dynamics too far for realistic testing, which hides performance gaps among methods. The resulting evaluations show clear differences in how well current methods handle these complex scenarios.

Core claim

We develop a large-scale benchmark for counterfactual prediction in epidemic time series under dynamic interventions that supports static and time-varying treatments as well as single-policy and multi-policy settings; the benchmark is generated from a calibrated agent-based model grounded in real-world data for more than 150 U.S. counties and reveals substantial performance differences among causal inference methods.
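The four intervention settings named in the claim can be pictured as one data structure. The following is a minimal sketch, assuming a treatment is stored as a time-by-policy grid of 0/1 flags; the policy names, function names, and the encoding itself are illustrative assumptions, not the benchmark's actual format.

```python
# Hypothetical policy names; the paper's actual policy set is not listed here.
POLICIES = ["masking", "school_closure"]

def static_schedule(policy_flags, horizon):
    """Static treatment: the same policy vector at every time step."""
    return [list(policy_flags) for _ in range(horizon)]

def time_varying_schedule(windows, num_policies, horizon):
    """Time-varying treatment built from (policy_index, start, end) windows.

    Each window is half-open: the policy is active for start <= t < end.
    """
    schedule = [[0] * num_policies for _ in range(horizon)]
    for policy_index, start, end in windows:
        for t in range(max(start, 0), min(end, horizon)):
            schedule[t][policy_index] = 1
    return schedule

def is_time_varying(schedule):
    """True if the policy vector changes at any point in time."""
    return any(row != schedule[0] for row in schedule)

def has_concurrent_policies(schedule):
    """True if two or more policies are active at the same time step."""
    return any(sum(row) > 1 for row in schedule)

# Single policy, static: masking on for all four steps.
static = static_schedule([1, 0], horizon=4)

# Multi-policy, time-varying: masking on for t=1..2, school closure on for t=2..3.
varying = time_varying_schedule([(0, 1, 3), (1, 2, 4)], len(POLICIES), horizon=4)
```

Under this encoding, a causal method receives the schedule as its treatment input and is asked to predict the outcome trajectory that the simulator produced for that same schedule.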

What carries the argument

The benchmark dataset of epidemic time series with ground-truth counterfactual trajectories generated by the calibrated agent-based model.

If this is right

Causal inference methods can be evaluated across static, time-varying, single-policy, and multi-policy intervention settings within the same realistic epidemic framework.
Performance gaps among methods become measurable in scenarios that include changing policies and multiple simultaneous interventions.
The benchmark supplies trajectories for more than 150 U.S. counties (158 according to the Figure 2 caption), grounded in demographic, mobility, epidemiological, and policy data.
Progress in time-series causal reasoning can now be tracked against observable counterfactuals rather than simplified simulations or real data lacking ground truth.
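To make "tracked against observable counterfactuals" concrete, here is a minimal sketch of how one prediction might be scored when the simulator supplies the true counterfactual trajectory. The numbers are toy values, and the choice of RMSE and the function names are assumptions for illustration; the text above does not state which metrics the paper uses.

```python
import math

def rmse(predicted, truth):
    """Root-mean-square error between two equal-length trajectories."""
    if len(predicted) != len(truth):
        raise ValueError("trajectories must have the same length")
    squared = [(p - t) ** 2 for p, t in zip(predicted, truth)]
    return math.sqrt(sum(squared) / len(squared))

def treatment_effect(treated, untreated):
    """Per-step effect: outcome under the policy minus outcome without it."""
    return [a - b for a, b in zip(treated, untreated)]

# Toy daily infections for one hypothetical county (not benchmark data).
no_policy = [10, 20, 40, 60, 50, 30]     # simulated trajectory without the policy
with_policy = [10, 15, 25, 35, 40, 38]   # simulated ground truth under the policy
predicted = [10, 16, 28, 33, 37, 35]     # a method's counterfactual prediction

effect = treatment_effect(with_policy, no_policy)
error = rmse(predicted, with_policy)
```

With real observational data only one of `no_policy` and `with_policy` would exist for a given county and period, which is why neither `effect` nor `error` could be computed without a simulator.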

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The benchmark could support development of methods that handle simultaneous policy changes, which are common in real epidemic responses.
Similar simulation-based benchmarks might be constructed for other time-varying intervention domains such as economic or environmental policy.
Users of the benchmark could test whether method rankings remain stable when the underlying agent-based model is replaced with an alternative calibrated simulator.

Load-bearing premise

The agent-based model, once calibrated to real data, produces counterfactual trajectories that faithfully capture the complex causal dynamics of actual epidemics.

What would settle it

A direct comparison in which real observed epidemic outcomes after an actual policy change deviate systematically from the benchmark counterfactuals generated under identical starting conditions and the same policy sequence.

Figures

Figures reproduced from arXiv: 2606.05692 by Alexander Rodríguez, Anik Mumssen, Facundo Yan, Marisa Eisenberg, Wenhao Mu.

**Figure 2.** Distribution of treatment effects across our 158 U.S. counties.

**Figure 3.** Side-by-side comparison of intervention effect in dynamics.

**Figure 4.** Calibration validation for four representative counties.

**Figure 5.** Comparison of "No Policy" vs. "Combined Treatment" scenarios for two counties where intervention policies effectively delay and reduce peak infections while resulting in a larger total volume of cases over the simulation.

Original abstract

Deep learning has enabled significant advances in time-series causal inference, yet progress remains constrained by the lack of realistic benchmarks with observable counterfactual outcomes. Existing datasets either rely on real-world observations without ground-truth counterfactuals or on simplified simulations that fail to capture complex causal dynamics. To address this gap, we develop a large-scale benchmark for counterfactual prediction in epidemic time series under dynamic interventions. Unlike existing benchmarks, it supports static and time-varying treatments, as well as both single-policy and multi-policy intervention settings, enabling evaluation of causal inference methods across a broad range of causal inference scenarios. Leveraging a calibrated agent-based model grounded in real-world demographic, mobility, epidemiological, and policy data, we generate realistic counterfactual trajectories across more than 150 U.S. counties. Using this benchmark, we evaluate widely used and state-of-the-art causal inference methods, revealing substantial performance differences and highlighting the challenges of realistic time-series causal reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

New benchmark for time-varying and multi-policy epidemic counterfactuals, but its value rests on unverified ABM causal accuracy.

This paper's core offering is a benchmark dataset for counterfactual prediction in epidemic time series that explicitly handles time-varying interventions and multi-policy settings, generated from a calibrated agent-based model across more than 150 U.S. counties.

It fills a practical gap. Existing options are either real observations without ground-truth counterfactuals or overly simple simulations. The new setup supports static and dynamic treatments in both single- and multi-policy scenarios, which lets researchers test causal methods on scenarios closer to actual epidemic policy changes. Running several standard and newer methods on the data and noting performance gaps is a straightforward way to show where current approaches fall short.

The main soft spot is the ABM ground truth. Calibration to demographic, mobility, and policy data helps match observed trajectories, but that does not confirm the model's internal rules for how interventions change transmission are realistic. If those response functions are off, the benchmark may highlight simulator-specific artifacts rather than general causal challenges. The abstract supplies no quantitative checks on counterfactual fidelity, no error bars, and no details on fitting or exclusion procedures.
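The concern about response functions can be illustrated with a deliberately simple stand-in. The sketch below uses a discrete-time SIR compartmental model, not the paper's agent-based model, and assumes a policy scales the transmission rate by `1 - effect` while it is active. All parameter values and names are hypothetical.

```python
def simulate_sir(beta, gamma, population, initial_infected, schedule, effect):
    """Discrete-time SIR model returning new infections per day.

    schedule[t] is 1 when the policy is active on day t. While active, the
    transmission rate beta is scaled by (1 - effect); this scaling is the
    assumed "response function" whose strength is swept below.
    """
    susceptible = float(population - initial_infected)
    infected = float(initial_infected)
    new_cases = []
    for active in schedule:
        beta_t = beta * (1.0 - effect) if active else beta
        # Cap infections so the susceptible pool never goes negative.
        infections = min(susceptible, beta_t * susceptible * infected / population)
        recoveries = gamma * infected
        susceptible -= infections
        infected += infections - recoveries
        new_cases.append(infections)
    return new_cases

# Policy is off for 20 days, on for 40 days, then lifted for 40 days.
schedule = [0] * 20 + [1] * 40 + [0] * 40
baseline = simulate_sir(0.3, 0.1, 100_000, 10, [0] * len(schedule), 0.0)

# Sweep the assumed policy strength and summarize each counterfactual.
sweep = {}
for effect in (0.2, 0.4, 0.6):
    cases = simulate_sir(0.3, 0.1, 100_000, 10, schedule, effect)
    sweep[effect] = {
        "peak": max(cases),
        "total": sum(cases),
        "averted": sum(baseline) - sum(cases),
    }
```

Every entry in `sweep` is a valid "ground truth" for its own value of `effect`, and observed data alone cannot pick between them if the policy was only ever applied one way. That is the sense in which calibration to observed trajectories does not pin down the counterfactuals.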

This is aimed at causal inference researchers working on time series and at epidemic modelers who need test cases with dynamic policies. A reader building or evaluating methods would get concrete data to try.

It deserves peer review. The benchmark construction targets a real limitation in the literature, and the scale plus intervention variety are concrete advances even if the ABM justification needs closer examination.

Referee Report

2 major / 2 minor

Summary. The paper claims to address the lack of realistic benchmarks for counterfactual prediction in epidemic time series by constructing a large-scale dataset from a calibrated agent-based model (ABM) grounded in U.S. county-level demographic, mobility, epidemiological, and policy data. This benchmark supports static/time-varying treatments and single/multi-policy interventions across >150 counties, and is used to evaluate causal inference methods, with the abstract asserting that it reveals substantial performance differences among methods.

Significance. If the ABM-generated counterfactuals are shown to faithfully represent intervention effects, the benchmark would be a significant contribution by providing observable ground-truth trajectories in a complex, real-data-grounded epidemiological setting that existing simplified simulations lack. It could enable more rigorous evaluation of time-series causal methods across diverse intervention scenarios.

major comments (2)

[Methods (ABM calibration and counterfactual generation)] The central claim that the benchmark provides faithful counterfactual trajectories under time-varying interventions rests on the ABM's internal causal mechanisms, yet the manuscript supplies no quantitative validation (e.g., hold-out tests or sensitivity analyses) of how interventions alter transmission parameters. Calibration to observed trajectories ensures only marginal fit and does not confirm the intervention response functions used to generate counterfactuals.
[Evaluation and Results] The abstract and results claim 'substantial performance differences' among methods, but the evaluation reports no error bars, standard errors, or statistical significance tests on the performance metrics across counties or runs. This weakens the ability to interpret the benchmark's comparative findings as robust.

minor comments (2)

[Abstract and Methods] The abstract and methods should include explicit details on data exclusion criteria, exact ABM fitting procedures, and the number of methods/metrics evaluated to improve reproducibility.
[Introduction and Setup] Notation for time-varying interventions and multi-policy settings could be clarified with a dedicated table or diagram early in the paper.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important aspects of validation and statistical rigor. We respond to each major comment below and indicate planned revisions to the manuscript.

Referee: [Methods (ABM calibration and counterfactual generation)] The central claim that the benchmark provides faithful counterfactual trajectories under time-varying interventions rests on the ABM's internal causal mechanisms, yet the manuscript supplies no quantitative validation (e.g., hold-out tests or sensitivity analyses) of how interventions alter transmission parameters. Calibration to observed trajectories ensures only marginal fit and does not confirm the intervention response functions used to generate counterfactuals.

Authors: We agree that the manuscript does not include dedicated quantitative validation such as hold-out tests or sensitivity analyses specifically targeting the intervention response functions. The ABM calibration matches observed trajectories, with intervention effects parameterized from established epidemiological models in the literature. To strengthen this aspect, the revised manuscript will add sensitivity analyses on key intervention parameters (e.g., transmission rate adjustments under policies) and report their impact on generated counterfactuals. revision: yes

Referee: [Evaluation and Results] The abstract and results claim 'substantial performance differences' among methods, but the evaluation reports no error bars, standard errors, or statistical significance tests on the performance metrics across counties or runs. This weakens the ability to interpret the benchmark's comparative findings as robust.

Authors: We concur that the absence of variability measures and statistical tests limits the robustness of the comparative claims. The current results present aggregate metrics without error bars or significance testing. In the revision, we will include standard errors across counties and simulation runs, along with appropriate statistical tests (e.g., paired t-tests or Wilcoxon tests) to support statements about performance differences. revision: yes
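As a rough illustration of the kind of reporting promised in the second response, the sketch below computes a standard error over per-county paired differences and a two-sided p-value. It uses a sign-flip permutation test rather than the paired t-test or Wilcoxon test named above, because that test needs only the Python standard library. The error values are made up, and the function names are assumptions.

```python
import math
import random
import statistics

def paired_summary(errors_a, errors_b):
    """Per-county differences (A minus B), their mean, and its standard error."""
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return diffs, mean_diff, std_err

def sign_flip_p_value(diffs, num_resamples=10_000, seed=0):
    """Two-sided Monte Carlo sign-flip permutation test on paired differences.

    Under the null hypothesis that the two methods perform equally, each
    difference is as likely to be positive as negative, so signs are flipped
    at random and the resulting mean is compared with the observed mean.
    """
    rng = random.Random(seed)
    count = len(diffs)
    observed = abs(sum(diffs) / count)
    hits = 0
    for _ in range(num_resamples):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / count) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (num_resamples + 1)

# Made-up per-county errors for two hypothetical methods (lower is better).
errors_a = [2.1, 3.0, 2.8, 3.5, 2.2, 4.1, 3.3, 2.9]
errors_b = [1.8, 2.5, 2.9, 3.0, 2.0, 3.6, 3.1, 2.4]

diffs, mean_diff, std_err = paired_summary(errors_a, errors_b)
p_value = sign_flip_p_value(diffs)
```

Pairing by county matters here because counties differ widely in how hard they are to predict; comparing the two methods within each county removes that shared variation before the test is applied.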

Circularity Check

0 steps flagged

No circularity: benchmark generated from external ABM with no self-referential reductions

full rationale

The paper constructs a benchmark by running a calibrated agent-based model on real-world demographic, mobility, epidemiological, and policy data to produce counterfactual trajectories. No equations, predictions, or central claims reduce by construction to parameters fitted within the paper itself, nor do they rely on self-citation chains or imported uniqueness theorems. The generation process is independent of the causal inference methods being evaluated, and the assumption that the ABM faithfully captures dynamics is an external modeling choice rather than a definitional loop. This is a standard non-circular benchmarking setup.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the unverified fidelity of the agent-based model; no free parameters, axioms, or invented entities are enumerated in the abstract.

[1] [1]

Matthew Abueg, Robert Hinch, Neo Wu, Luyang Liu, William Probert, Austin Wu, Paul Eastham, Yusef Shafi, Matt Rosencrantz, Michael Dikovsky, et al. 2021. Modeling the effect of exposure notification and non-pharmaceutical interven- tions on COVID-19 transmission in Washington state.NPJ digital medicine4, 1 (2021), 49

2021

[2] [2]

Joseph Aylett-Bullock, Carolina Cuesta-Lazaro, Arnau Quera-Bofarull, Miguel Icaza-Lizaola, Aidan Sedgewick, Henry Truong, Aoife Curran, Edward Elliott, Tristan Caulfield, Kevin Fong, et al. 2021. June: open-source individual-based epidemiology simulation.Royal Society open science8, 7 (2021), 210506

2021

[3] [3]

Ioana Bica, Ahmed M Alaa, James Jordon, and Mihaela van der Schaar. 2020. Estimating counterfactual treatment outcomes over time through adversarially balanced representations. InInternational Conference on Learning Representations

2020

[4] [4]

Keith R Bisset, Jiangzhuo Chen, Xizhou Feng, VS Anil Kumar, and Madhav V Marathe. 2009. EpiFast: a fast algorithm for large scale realistic epidemic simu- lations on distributed memory systems. InProceedings of the 23rd international conference on Supercomputing. 430–439

2009

[5] [5]

Eric Bonabeau. 2002. Agent-based modeling: Methods and techniques for simu- lating human systems.Proceedings of the national academy of sciences99, suppl_3 (2002), 7280–7287

2002

[6] [6]

Jeanne Brooks-Gunn, Fong-ruey Liaw, and Pamela Kato Klebanov. 1992. Effects of early intervention on cognitive function of low birth weight preterm infants. The Journal of pediatrics120, 3 (1992), 350–359

1992

[7] [7]

Ayush Chopra, Alexander Rodriguez, B Aditya Prakash, Ramesh Raskar, and Thomas Kingsley. 2023. Using neural networks to calibrate agent based models enables improved regional evidence for vaccine strategy and policy.Vaccine41, 48 (2023), 7067–7071

2023

[8] [8]

Ayush Chopra, Alexander Rodríguez, Jayakumar Subramanian, Arnau Quera- Bofarull, Balaji Krishnamurthy, B Aditya Prakash, and Ramesh Raskar. 2023. Differentiable Agent-based Epidemiology. InProceedings of the 2023 International Conference on Autonomous Agents and Multiagent Systems. 1848–1857

2023

[9] [9]

Ayush Chopra, Jayakumar Subramanian, Balaji Krishnamurthy, and Ramesh Raskar. 2024. flame: A Framework for Learning in Agent-based ModEls. InPro- ceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems. 391–399

2024

[10] [10]

Richard A Davis, Keh-Shin Lii, and Dimitris N Politis. 2011. Remarks on some nonparametric estimates of a density function. InSelected Works of Murray Rosenblatt. Springer, 95–100

2011

[11] [11]

Ken Eames, Shweta Bansal, Simon Frost, and Steven Riley. 2015. Six challenges in measuring contact networks for use in modelling.Epidemics10 (2015), 72–77

2015

[12] [12]

Muhammad Hasan Ferdous, Emam Hossain, and Md Osman Gani. 2025. Time- graph: Synthetic benchmark datasets for robust time-series causal discovery. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2. 5425–5435

2025

[13] [13]

Dennis Frauen, Tobias Hatt, Valentyn Melnychuk, and Stefan Feuerriegel. 2023. Estimating average causal effects from patient trajectories. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 37. 7586–7594

2023

[14] [14]

Chad M Glen, Melissa L Kemp, and Eberhard O Voit. 2019. Agent-based modeling of morphogenetic systems: Advantages and challenges.PLoS computational biology15, 3 (2019), e1006577

2019

[15] [15]

Nicolò Gozzi, Matteo Chinazzi, Jessica T Davis, Corrado Gioannini, Luca Rossi, Marco Ajelli, Nicola Perra, and Alessandro Vespignani. 2025. Epydemix: An open-source Python package for epidemic modeling with integrated approximate Bayesian calibration.PLOS Computational Biology21, 11 (2025), e1013735

2025

[16] [16]

Clive WJ Granger. 1969. Investigating causal relations by econometric models and cross-spectral methods.Econometrica: journal of the Econometric Society (1969), 424–438

1969

[17] [17]

2006.Longitudinal data analysis

Donald Hedeker and Robert D Gibbons. 2006.Longitudinal data analysis. John Wiley & Sons

2006

[18] [18]

Robert Hinch, William JM Probert, Anel Nurtay, Michelle Kendall, Chris Wymant, Matthew Hall, Katrina Lythgoe, Ana Bulas Cruz, Lele Zhao, Andrea Stewart, et al. 2021. OpenABM-Covid19—An agent-based model for non-pharmaceutical interventions against COVID-19 including contact tracing.PLoS computational biology17, 7 (2021), e1009146

2021

[19] [19]

Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models.Advances in neural information processing systems33 (2020), 6840–6851

2020

[20] [20]

John H Holland and John H Miller. 1991. Artificial adaptive agents in economic theory.The American economic review81, 2 (1991), 365–370

1991

[21] [21]

Emily Howerton, Lucie Contamin, Luke C Mullany, Michelle Qin, Nicholas G Reich, Samantha Bents, Rebecca K Borchering, Sung-mok Jung, Sara L Loo, Claire P Smith, et al. 2023. Evaluation of the US COVID-19 Scenario Modeling Hub for informing pandemic response under uncertainty.Nature communications 14, 1 (2023), 7260

2023

[22] [22]

Alistair EW Johnson, Tom J Pollard, Lu Shen, Li-wei H Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. 2016. MIMIC-III, a freely accessible critical care database.Scientific data3, 1 (2016), 1–9

2016

[23] [23]

Diederik P Kingma and Max Welling. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114(2013)

Pith/arXiv arXiv 2013

[24] [24]

Sören R Künzel, Jasjeet S Sekhon, Peter J Bickel, and Bin Yu. 2019. Metalearners for estimating heterogeneous treatment effects using machine learning.Proceedings of the national academy of sciences116, 10 (2019), 4156–4165

2019

[25] [25]

Rui Li, Stephanie Hu, Mingyu Lu, Yuria Utsumi, Prithwish Chakraborty, Daby M Sow, Piyush Madan, Jun Li, Mohamed Ghalwash, Zach Shahn, et al. 2021. G-net: a recurrent network approach to g-computation for counterfactual prediction under a dynamic treatment regime. InMachine Learning for Health. PMLR, 282– 299

2021

[26] [26]

Ruipu Li and Alexander Rodríguez. 2025. Neural Conformal Control for Time Series Forecasting. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 18439–18447

2025

[27] [27]

Bryan Lim. 2018. Forecasting treatment responses over time using recurrent marginal structural networks.Advances in neural information processing systems 31 (2018)

2018

[28] [28]

Yuchen Ma, Valentyn Melnychuk, Jonas Schweisthal, and Stefan Feuerriegel. 2024. DiffPO: A causal diffusion model for learning distributions of potential outcomes. Advances in Neural Information Processing Systems 37 (2024), 43663–43692

2024

[29] [29]

Madhav Marathe and Anil Kumar S Vullikanti. 2013. Computational epidemiology. Commun. ACM 56, 7 (2013), 88–96

2013

[30] [30]

Valentyn Melnychuk, Dennis Frauen, and Stefan Feuerriegel. 2022. Causal transformer for estimating counterfactual outcomes. In International conference on machine learning. PMLR, 15293–15329

2022

[31] [31]

Valentyn Melnychuk, Dennis Frauen, and Stefan Feuerriegel. 2023. Normalizing flows for interventional density estimation. In International Conference on Machine Learning. PMLR, 24361–24397

2023

[32] [32]

Erica EM Moodie, Thomas S Richardson, and David A Stephens. 2007. Demystifying optimal dynamic treatment regimes. Biometrics 63, 2 (2007), 447–455

2007

[33] [33]

Raha Moraffah, Paras Sheth, Mansooreh Karami, Anchit Bhattacharya, Qianru Wang, Anique Tahir, Adrienne Raglin, and Huan Liu. 2021. Causal inference for time series analysis: Problems, methods and evaluation. Knowledge and Information Systems 63, 12 (2021), 3041–3085

2021

[34] [34]

Wenhao Mu, Zhi Cao, Mehmed Uludag, and Alexander Rodríguez. 2025. Counterfactual probabilistic diffusion with expert models. arXiv preprint arXiv:2508.13355 (2025)

2025

[35] [35]

Judea Pearl. 2009. Causal inference in statistics: An overview. (2009)

2009

[36] [36]

Lorenzo Pellis, Frank Ball, Shweta Bansal, Ken Eames, Thomas House, Valerie Isham, and Pieter Trapman. 2015. Eight challenges for network epidemic models. Epidemics 10 (2015), 58–62

2015

[37] [37]

James Robins. 1986. A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect. Mathematical modelling 7, 9-12 (1986), 1393–1512

1986

[38] [38]

James M Robins. 1994. Correcting for non-compliance in randomized trials using structural nested mean models. Communications in Statistics-Theory and methods 23, 8 (1994), 2379–2412

1994

[39] [39]

Alexander Rodríguez, Harshavardhan Kamarthi, Pulak Agarwal, Javen Ho, Mira Patel, Suchet Sapre, and B Aditya Prakash. 2024. Machine learning for data-centric epidemic forecasting. Nature Machine Intelligence 6, 10 (2024), 1122–1131

2024

[40] [40]

Alvaro Ruiz-Martinez, Chang Gong, Hanwen Wang, Richard J Sové, Haoyang Mi, Holly Kimko, and Aleksander S Popel. 2022. Simulations of tumor growth and response to immunotherapy by coupling a spatial agent-based model with a whole-patient quantitative systems pharmacology model. PLoS computational biology 18, 7 (2022), e1010254

2022

[41] [41]

Jakob Runge, Andreas Gerhardus, Gherardo Varando, Veronika Eyring, and Gustau Camps-Valls. 2023. Causal inference for time series. Nature Reviews Earth & Environment 4, 7 (2023), 487–505

2023

[42] [42]

Mohammed Saeed, Christine Lieu, Greg Raber, and Roger G Mark. 2002. MIMIC II: a massive temporal ICU patient database to support research in intelligent patient monitoring. In Computers in cardiology. IEEE, 641–644

2002

[43] [43]

Nabeel Seedat, Fergus Imrie, Alexis Bellot, Zhaozhi Qian, and Mihaela van der Schaar. 2022. Continuous-Time Modeling of Counterfactual Outcomes Using Neural Controlled Differential Equations. In International Conference on Machine Learning. PMLR, 19497–19521

2022

[44] [44]

Alex Sherstinsky. 2020. Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network. Physica D: Nonlinear Phenomena 404 (2020), 132306

2020

[45] [45]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017)

2017

[46] [46]

Joseph T Wu, Kathy Leung, and Gabriel M Leung. 2020. Nowcasting and forecasting the potential domestic and international spread of the 2019-nCoV outbreak originating in Wuhan, China: a modelling study. The lancet 395, 10225 (2020), 689–697

2020

[47] [47]

Shenghao Wu, Wenbin Zhou, Minshuo Chen, and Shixiang Zhu. 2024. Counterfactual generative models for time-varying treatments. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 3402–3413

2024

Benchmarking Counterfactual Prediction in Epidemic Time Series with Time-Varying Interventions KDD ’26, August 09–13, 2026, Jeju I...

[48] [48]

Jinsung Yoon, James Jordon, and Mihaela Van Der Schaar. 2018. GANITE: Estimation of individualized treatment effects using generative adversarial nets. In International conference on learning representations

2018

[49] [49]

Stephan Zheng, Alexander Trott, Sunil Srinivasa, David C Parkes, and Richard Socher. 2022. The AI Economist: Taxation policy design via two-level deep multiagent reinforcement learning. Science advances 8, 18 (2022), eabk2607

2022

Appendix A: Temporal Dynamics of Negative Treatment Effects

This appendix presents county-level infection trajectories for specifi...