arxiv: 2508.12260 · v5 · submitted 2025-08-17 · 💻 cs.AI · q-bio.QM

Mantis: A Foundation Model for Mechanistic Disease Forecasting

Pith reviewed 2026-05-18 22:22 UTC · model grok-4.3

keywords: disease forecasting · foundation model · mechanistic simulation · infectious disease · generalization · epidemiology · COVID-19 · out-of-distribution

The pith

A model trained only on disease simulations outperforms real-data forecasters on COVID-19 and generalizes to many other diseases without seeing any actual records.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Mantis as a foundation model for infectious disease forecasting that is trained exclusively on mechanistic simulations of contagion rather than real epidemiological data. This design aims to overcome the data scarcity, custom training, and expert tuning that limit forecasts for new outbreaks or low-resource settings. Mantis is tested against dozens of existing models across sixteen diseases with varied transmission modes, using metrics for both point accuracy and probabilistic skill. It records lower error than every model in the CDC COVID-19 Forecast Hub on early-pandemic backtests and places in the top two for nearly all other diseases evaluated. The model also succeeds on diseases whose transmission mechanisms were absent from its training simulations, indicating it learns core dynamics instead of memorizing specific patterns.

Core claim

Mantis shows that a single foundation model trained entirely on mechanistic simulations can deliver accurate forecasts for real infectious diseases across regions, outcomes, and transmission types, even when the target disease or mechanism was never present in the training simulations and no real-world data was used at any stage.

What carries the argument

Mantis, a foundation model trained on large-scale mechanistic simulations of contagion dynamics to learn generalizable forecasting behavior.
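
To make "mechanistic simulations of contagion dynamics" concrete, here is a minimal sketch of how one synthetic outbreak and one training example might be generated. It uses a discrete-time stochastic SIR model, which is an assumption chosen for illustration: this page does not specify the Mantis simulator beyond describing it as modular and covering several transmission modes. The names `simulate_sir` and `make_training_pair`, and every parameter value, are hypothetical.

```python
import math
import random


def simulate_sir(population, beta, gamma, initial_infected, weeks, seed=0):
    """Discrete-time stochastic SIR model (illustrative, not the Mantis simulator).

    Runs daily steps and returns a list of weekly new-infection counts.
    beta is the daily transmission rate, gamma the daily recovery rate.
    """
    rng = random.Random(seed)
    susceptible = population - initial_infected
    infected = initial_infected
    weekly_cases = []
    for _ in range(weeks):
        cases_this_week = 0
        for _ in range(7):
            # Per-person daily probabilities of infection and recovery.
            p_infect = 1.0 - math.exp(-beta * infected / population)
            p_recover = 1.0 - math.exp(-gamma)
            # Binomial draws implemented as sums of Bernoulli trials.
            new_infections = sum(1 for _ in range(susceptible) if rng.random() < p_infect)
            new_recoveries = sum(1 for _ in range(infected) if rng.random() < p_recover)
            susceptible -= new_infections
            infected += new_infections - new_recoveries
            cases_this_week += new_infections
        weekly_cases.append(cases_this_week)
    return weekly_cases


def make_training_pair(series, context_len, horizon):
    """Split one simulated series into (observed context, future target)."""
    return series[:context_len], series[context_len:context_len + horizon]


# Hypothetical parameter values, chosen only so the example runs quickly.
outbreak = simulate_sir(population=2000, beta=0.4, gamma=0.2,
                        initial_infected=5, weeks=12, seed=1)
context, target = make_training_pair(outbreak, context_len=8, horizon=4)
```

Repeating this over many seeds while sampling `beta`, `gamma`, and the population size from broad ranges would yield a library of synthetic outbreaks of the general kind the page describes; the actual pipeline is presumably far richer.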

If this is right

  • Forecasts become available immediately for new pathogens or regions without first collecting years of case data.
  • One model can serve many diseases instead of requiring separate tuned models for each.
  • Performance remains high in settings where historical records are sparse or unreliable.
  • Models can be updated by expanding the simulation library rather than retraining on new observations.
  • Forecasts for low-resource areas become feasible without local data collection infrastructure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same simulation-first approach might extend to forecasting other complex dynamical systems such as economic shocks or ecological invasions.
  • Hybrid systems could start with Mantis-style pretraining on simulations and then fine-tune lightly on limited real observations when they become available.
  • If the core dynamics are truly captured, the framework could reduce the need for disease-specific expert knowledge in model construction.

Load-bearing premise

Mechanistic simulations of disease spread contain enough of the real dynamics that a model trained only on them can accurately predict actual historical outbreaks and handle transmission mechanisms it never encountered in training.

What would settle it

The claim would be undermined if Mantis produced higher mean absolute error than standard models when backtested on early data from a future outbreak whose transmission mechanism differs substantially from every simulation family used in its training.

Figures

Figures reproduced from arXiv: 2508.12260 by Ananya Sharma, Carson Dudley, Christopher Harding, Emily Martin, Marisa Eisenberg, Reiden Magdaleno.

Figure 1: Conceptual overview of Mantis. Mantis is a simulation-grounded foundation model trained entirely on synthetic outbreaks generated by mechanistic epidemiological models. The training pipeline begins with a modular simulator that encodes diverse outbreak mechanisms, including multiple transmission modes (human-to-human, vectorborne, environmental), progression dynamics, intervention strategies, and populatio…
Figure 2: Covariate integration improves accuracy. Mantis maintains calibrated uncertainty across forecast horizons. (a) Including covariates (e.g., using cases to predict hospitalizations) consistently improves Mantis’s accuracy across all forecast horizons. Relative MAE shown for COVID-19 mortality forecasts with (blue) and without (orange) hospitalization covariates across 2, 4, 6, and 8-week horizons. (b) Manti…
Figure 3: Mantis Produces Accurate and Generalizable Forecasts Across Diseases and Geographies. (a) Four-week-ahead forecasts (blue dashed line and shaded 90% CI) compared to observed outcomes (black) for COVID-19 mortality in Minnesota and influenza-like illness (ILI) in Michigan. In the latter, Mantis demonstrates its foundation model capacity by accurately forecasting syndromic inputs despite never being traine…
Figure 4: Mantis delivers consistent performance across population scales. Relative MAE versus state population for COVID-19 mortality forecasts across 51 U.S. states and territories (Vermont excluded as an outlier). Each point represents the mean relative MAE for a jurisdiction across all forecast dates from April 2020 through November 2021. Population is shown on a logarithmic scale (2020 Census). A weak negative …
Original abstract

Infectious disease forecasting in novel outbreaks or low-resource settings is hampered by the need for large disease and covariate data sets, bespoke training, and expert tuning, all of which can hinder rapid generation of forecasts for new settings. To help address these challenges, we developed Mantis, a foundation model trained entirely on mechanistic simulations, which enables out-of-the-box forecasting across diseases, regions, and outcomes, even in settings with limited historical data. We evaluated Mantis against 78 forecasting models across sixteen diseases with diverse modes of transmission, assessing both point forecast accuracy (mean absolute error) and probabilistic performance (weighted interval score and coverage). Despite using no real-world data during training, Mantis achieved lower mean absolute error than all models in the CDC's COVID-19 Forecast Hub when backtested on early pandemic forecasts which it had not previously seen. Across all other diseases tested, Mantis consistently ranked in the top two models across evaluation metrics. Mantis further generalized to diseases with transmission mechanisms not represented in its training data, demonstrating that it can capture fundamental contagion dynamics rather than memorizing disease-specific patterns. These capabilities illustrate that purely simulation-based foundation models such as Mantis can provide a practical foundation for disease forecasting: general-purpose, accurate, and deployable where traditional models struggle.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Mantis, a foundation model for infectious disease forecasting trained exclusively on mechanistic simulations without any real-world data. It evaluates the model against 78 forecasting models across 16 diseases with diverse transmission modes, claiming lower mean absolute error than all CDC COVID-19 Forecast Hub models on early-pandemic backtests, top-two rankings on other diseases for MAE, weighted interval score, and coverage, and generalization to transmission mechanisms absent from the training simulations.

Significance. If the results hold, the work has substantial significance for mechanistic and AI-driven epidemiology. A simulation-only foundation model that delivers zero-shot superiority on real historical data and generalizes across unseen transmission modes would address a core limitation of data-intensive or disease-specific forecasters, enabling rapid deployment in novel outbreaks or low-resource settings. The approach of learning transferable contagion dynamics from simulations rather than empirical fitting is a clear strength and could shift the field toward more general-purpose tools.

major comments (2)
  1. [Methods] Simulation design (Methods section): The central generalization claim—that Mantis captures fundamental dynamics rather than simulation-specific artifacts—requires explicit evidence that the mechanistic simulator ensemble spans real-world nuisance parameters at sufficient scale. The abstract asserts zero-shot MAE superiority and cross-mechanism generalization, yet no quantitative coverage analysis is provided for reporting delays, testing rates, mobility changes, or superspreading heterogeneity; if these factors are under-represented or fixed in the training distribution, the performance gap on real data may not transfer as claimed.
  2. [Results] COVID-19 backtest evaluation (Results section): The headline result that Mantis outperforms every model in the CDC Forecast Hub on early-pandemic forecasts is load-bearing for the zero-shot claim. The manuscript must specify the exact backtest periods, the precise subset of hub models included in the comparison, data exclusion rules, and any statistical significance tests; without these details, it remains possible that evaluation choices (e.g., period selection or model filtering) drive the reported MAE advantage.
minor comments (2)
  1. [Evaluation Metrics] Clarify the exact definitions and computation of the weighted interval score and coverage metrics in the evaluation protocol to support reproducibility.
  2. [Figures] Ensure all performance comparison figures include clear axis labels, legends, and error bars or confidence intervals for the reported rankings.
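
The definitions requested in minor comment 1 can be sketched as follows. The interval score and weighted interval score follow the standard formulation in Bracher, Ray, Gneiting, and Reich, "Evaluating epidemic forecasts in an interval format" (2021). Whether the Mantis evaluation uses exactly this set of intervals or this weighting is not stated on this page, so the function names, the input layout, and the choice of intervals are assumptions.

```python
def mean_absolute_error(forecasts, observations):
    """Mean of |forecast - observation| over paired point forecasts."""
    return sum(abs(f - y) for f, y in zip(forecasts, observations)) / len(observations)


def interval_score(lower, upper, y, alpha):
    """Score for a central (1 - alpha) prediction interval; lower is better.

    Interval width, plus a penalty scaled by 2/alpha when y falls outside.
    """
    score = upper - lower
    if y < lower:
        score += (2.0 / alpha) * (lower - y)
    elif y > upper:
        score += (2.0 / alpha) * (y - upper)
    return score


def weighted_interval_score(median, intervals, y):
    """WIS for one forecast.

    intervals maps alpha -> (lower, upper) for each central interval.
    Weights: 1/2 on the median's absolute error, alpha/2 on each interval,
    normalised by (number of intervals + 1/2).
    """
    total = 0.5 * abs(y - median)
    for alpha, (lower, upper) in intervals.items():
        total += (alpha / 2.0) * interval_score(lower, upper, y, alpha)
    return total / (len(intervals) + 0.5)


def empirical_coverage(intervals, observations):
    """Fraction of observations that fall inside their prediction interval."""
    hits = sum(1 for (lower, upper), y in zip(intervals, observations)
               if lower <= y <= upper)
    return hits / len(observations)


# Toy numbers, for illustration only (not results from the paper).
wis = weighted_interval_score(median=15, intervals={0.5: (12, 18)}, y=20)
cov = empirical_coverage([(0, 10), (5, 6), (1, 2)], [5, 7, 2])
```

A well-calibrated 90% interval should give an empirical coverage near 0.9 over many forecasts; WIS rewards intervals that are both narrow and rarely missed, which is why it is commonly reported alongside MAE.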

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments and positive assessment of the significance of our work. We address each major comment below and outline revisions that will strengthen the manuscript's clarity and rigor.

Point-by-point responses
  1. Referee: [Methods] Simulation design (Methods section): The central generalization claim—that Mantis captures fundamental dynamics rather than simulation-specific artifacts—requires explicit evidence that the mechanistic simulator ensemble spans real-world nuisance parameters at sufficient scale. The abstract asserts zero-shot MAE superiority and cross-mechanism generalization, yet no quantitative coverage analysis is provided for reporting delays, testing rates, mobility changes, or superspreading heterogeneity; if these factors are under-represented or fixed in the training distribution, the performance gap on real data may not transfer as claimed.

    Authors: We agree that a quantitative coverage analysis would provide stronger support for the generalization claims. In the revised manuscript, we will add a new subsection to the Methods that reports the parameter ranges and sampling distributions used in the simulation ensemble for reporting delays, testing rates, mobility changes, and superspreading heterogeneity. We will include summary statistics and visualizations comparing these ranges to empirical values drawn from the literature for the diseases under study, thereby demonstrating that the training distribution is sufficiently broad to support zero-shot transfer and cross-mechanism generalization.
    Revision: yes

  2. Referee: [Results] COVID-19 backtest evaluation (Results section): The headline result that Mantis outperforms every model in the CDC Forecast Hub on early-pandemic forecasts is load-bearing for the zero-shot claim. The manuscript must specify the exact backtest periods, the precise subset of hub models included in the comparison, data exclusion rules, and any statistical significance tests; without these details, it remains possible that evaluation choices (e.g., period selection or model filtering) drive the reported MAE advantage.

    Authors: We appreciate the need for full transparency on the evaluation protocol. The revised Results section will explicitly list: (i) the precise calendar periods used for the early-pandemic backtests, (ii) the complete set of CDC Forecast Hub models included in the comparison along with inclusion criteria, (iii) any data exclusion rules (e.g., handling of missing observations or specific locations), and (iv) statistical significance tests (paired Wilcoxon or t-tests on MAE differences with p-values) confirming that the reported advantage is robust to evaluation choices. These details will be added without altering the underlying experimental outcomes.
    Revision: yes
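
As a rough illustration of what a paired significance check on MAE differences involves, the sketch below runs an exact sign-flip permutation test on per-date absolute errors from two models. This is a swapped-in technique, not the paired Wilcoxon or t-tests the rebuttal names: it tests the same null hypothesis of no systematic paired difference but needs only the standard library. The function name and the toy error values are hypothetical.

```python
import itertools


def paired_sign_flip_test(errors_a, errors_b):
    """Exact two-sided sign-flip permutation test on paired error differences.

    Under the null hypothesis that neither model is systematically better,
    each paired difference is equally likely to carry either sign. The
    p-value is the share of all sign assignments whose absolute summed
    difference is at least as large as the observed one.
    Enumerates 2**n assignments, so it suits small n only (roughly n <= 20).
    """
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        # Small tolerance guards against floating-point ties.
        if stat >= observed - 1e-12:
            extreme += 1
        total += 1
    return extreme / total


# Toy absolute errors on six forecast dates (not data from the paper).
model_errors = [1, 2, 3, 4, 5, 6]
baseline_errors = [2, 3, 4, 5, 6, 7]
p_value = paired_sign_flip_test(model_errors, baseline_errors)
```

With longer backtests the full enumeration becomes impractical, and a random sample of sign assignments or a library implementation of the Wilcoxon signed-rank test would be the usual substitute.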

Circularity Check

0 steps flagged

No circularity: training on simulations with independent real-world evaluation

The paper trains Mantis exclusively on mechanistic simulations and evaluates point and probabilistic forecasts on separate real-world data (e.g., CDC COVID-19 Forecast Hub backtests and other disease datasets) that were never seen during training. The abstract and description present these MAE, WIS, and coverage results as empirical outcomes against external models rather than quantities obtained by fitting parameters to the target evaluation data or by self-referential definitions. No equations, self-citations, or ansatzes are shown that would reduce the claimed generalization or performance to inputs by construction. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that mechanistic simulations encode transferable contagion dynamics; no free parameters or invented entities are described in the abstract.

axioms (1)
  • Domain assumption: Mechanistic simulations capture fundamental contagion dynamics sufficiently for generalization to real data and unseen diseases.
    Invoked to support out-of-the-box performance and generalization claims.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Latent Chain-of-Thought Improves Structured-Data Transformers

    cs.LG 2026-05 unverdicted novelty 6.0

    Latent chain-of-thought via recurrent feedback tokens improves average performance of structured-data transformers on time-series forecasting and tabular prediction.

  2. In-Context Learning Under Regime Change

    cs.LG 2026-04 unverdicted novelty 6.0

    Transformers can solve in-context change-point detection with model size scaling by knowledge of the shift timing, matching optimal baselines on synthetic data and improving pretrained models on disease and financial ...

  3. Prediction Markets Underperform Simple Baselines For Infectious Disease Forecasting

    stat.AP 2026-05 conditional novelty 4.0

    Prediction markets fail to outperform standard benchmarks for forecasting influenza hospitalizations and measles cases.
