Mantis: A Foundation Model for Mechanistic Disease Forecasting
Pith reviewed 2026-05-18 22:22 UTC · model grok-4.3
The pith
A model trained only on disease simulations outperforms real-data forecasters on COVID-19 and generalizes to many other diseases without seeing any actual records.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Mantis shows that a single foundation model trained entirely on mechanistic simulations can deliver accurate forecasts for real infectious diseases across regions, outcomes, and transmission types, even when the target disease or mechanism was never present in the training simulations and no real-world data was used at any stage.
What carries the argument
Mantis, a foundation model trained on large-scale mechanistic simulations of contagion dynamics to learn generalizable forecasting behavior.
If this is right
- Forecasts become available immediately for new pathogens or regions without first collecting years of case data.
- One model can serve many diseases instead of requiring separate tuned models for each.
- Performance remains high in settings where historical records are sparse or unreliable.
- Models can be updated by expanding the simulation library rather than retraining on new observations.
- Forecasts for low-resource areas become feasible without local data collection infrastructure.
Where Pith is reading between the lines
- The same simulation-first approach might extend to forecasting other complex dynamical systems such as economic shocks or ecological invasions.
- Hybrid systems could start with Mantis-style pretraining on simulations and then fine-tune lightly on limited real observations when they become available.
- If the core dynamics are truly captured, the framework could reduce the need for disease-specific expert knowledge in model construction.
Load-bearing premise
Mechanistic simulations of disease spread contain enough of the real dynamics that a model trained only on them can accurately predict actual historical outbreaks and handle transmission mechanisms it never encountered in training.
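To make the premise concrete, here is a minimal sketch of the kind of synthetic outbreak a mechanistic simulator might produce. The paper passage quoted further down this page mentions an SEAIR structure and a stochastic compartmental model, but it does not give equations, so everything specific here is an assumption for illustration: the chain-binomial daily step, the split of exposed individuals into asymptomatic (A) or symptomatic (I) compartments, and all parameter names and values (`beta`, `kappa`, `sigma`, `asym_fraction`, `gamma`). It covers human-to-human transmission only and is not the authors' simulator.

```python
import math
import random


def binomial(rng, n, p):
    """Draw Binomial(n, p) as a sum of Bernoulli trials (adequate for small n)."""
    return sum(rng.random() < p for _ in range(n))


def simulate_seair(population=1000, initial_infected=5, beta=0.4, kappa=0.5,
                   sigma=0.25, asym_fraction=0.4, gamma=0.2, days=100, seed=0):
    """Hypothetical stochastic SEAIR outbreak with a daily chain-binomial step.

    Returns the daily count of new symptomatic cases and the final
    (S, E, A, I, R) state. All parameters are illustrative assumptions.
    """
    rng = random.Random(seed)
    s, e, a, i, r = population - initial_infected, 0, 0, initial_infected, 0
    daily_cases = []
    for _ in range(days):
        # Force of infection; asymptomatic carriers transmit at relative rate kappa.
        force = beta * (i + kappa * a) / population
        new_e = binomial(rng, s, 1.0 - math.exp(-force))
        # Exposed individuals become infectious, then split into A or I.
        leaving_e = binomial(rng, e, 1.0 - math.exp(-sigma))
        new_a = binomial(rng, leaving_e, asym_fraction)
        new_i = leaving_e - new_a
        # Recoveries from both infectious compartments.
        rec_a = binomial(rng, a, 1.0 - math.exp(-gamma))
        rec_i = binomial(rng, i, 1.0 - math.exp(-gamma))
        s -= new_e
        e += new_e - leaving_e
        a += new_a - rec_a
        i += new_i - rec_i
        r += rec_a + rec_i
        daily_cases.append(new_i)
    return daily_cases, (s, e, a, i, r)


if __name__ == "__main__":
    cases, final_state = simulate_seair(seed=1)
    print(len(cases), final_state)
```

A training corpus in this spirit would come from repeating such runs over many sampled parameter sets and seeds. Whether the resulting distribution covers real reporting and transmission behavior is exactly what the premise above depends on.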
What would settle it
The claim would be undermined if Mantis produced higher mean absolute error than standard models when backtested on early data from a future outbreak whose transmission mechanism differs substantially from every simulation family used in its training.
Original abstract
Infectious disease forecasting in novel outbreaks or low-resource settings is hampered by the need for large disease and covariate data sets, bespoke training, and expert tuning, all of which can hinder rapid generation of forecasts for new settings. To help address these challenges, we developed Mantis, a foundation model trained entirely on mechanistic simulations, which enables out-of-the-box forecasting across diseases, regions, and outcomes, even in settings with limited historical data. We evaluated Mantis against 78 forecasting models across sixteen diseases with diverse modes of transmission, assessing both point forecast accuracy (mean absolute error) and probabilistic performance (weighted interval score and coverage). Despite using no real-world data during training, Mantis achieved lower mean absolute error than all models in the CDC's COVID-19 Forecast Hub when backtested on early pandemic forecasts which it had not previously seen. Across all other diseases tested, Mantis consistently ranked in the top two models across evaluation metrics. Mantis further generalized to diseases with transmission mechanisms not represented in its training data, demonstrating that it can capture fundamental contagion dynamics rather than memorizing disease-specific patterns. These capabilities illustrate that purely simulation-based foundation models such as Mantis can provide a practical foundation for disease forecasting: general-purpose, accurate, and deployable where traditional models struggle.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Mantis, a foundation model for infectious disease forecasting trained exclusively on mechanistic simulations without any real-world data. It evaluates the model against 78 forecasting models across 16 diseases with diverse transmission modes, claiming lower mean absolute error than all CDC COVID-19 Forecast Hub models on early-pandemic backtests, top-two rankings on other diseases for MAE, weighted interval score, and coverage, and generalization to transmission mechanisms absent from the training simulations.
Significance. If the results hold, the work has substantial significance for mechanistic and AI-driven epidemiology. A simulation-only foundation model that delivers zero-shot superiority on real historical data and generalizes across unseen transmission modes would address a core limitation of data-intensive or disease-specific forecasters, enabling rapid deployment in novel outbreaks or low-resource settings. The approach of learning transferable contagion dynamics from simulations rather than empirical fitting is a clear strength and could shift the field toward more general-purpose tools.
major comments (2)
- [Methods] Simulation design (Methods section): The central generalization claim—that Mantis captures fundamental dynamics rather than simulation-specific artifacts—requires explicit evidence that the mechanistic simulator ensemble spans real-world nuisance parameters at sufficient scale. The abstract asserts zero-shot MAE superiority and cross-mechanism generalization, yet no quantitative coverage analysis is provided for reporting delays, testing rates, mobility changes, or superspreading heterogeneity; if these factors are under-represented or fixed in the training distribution, the performance gap on real data may not transfer as claimed.
- [Results] COVID-19 backtest evaluation (Results section): The headline result that Mantis outperforms every model in the CDC Forecast Hub on early-pandemic forecasts is load-bearing for the zero-shot claim. The manuscript must specify the exact backtest periods, the precise subset of hub models included in the comparison, data exclusion rules, and any statistical significance tests; without these details, it remains possible that evaluation choices (e.g., period selection or model filtering) drive the reported MAE advantage.
minor comments (2)
- [Evaluation Metrics] Clarify the exact definitions and computation of the weighted interval score and coverage metrics in the evaluation protocol to support reproducibility.
- [Figures] Ensure all performance comparison figures include clear axis labels, legends, and error bars or confidence intervals for the reported rankings.
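For reference, the three metrics named in the abstract can be sketched as follows. The weighted interval score follows the standard interval-format definition (Bracher et al., 2021): for a median m and K central prediction intervals with levels alpha_k, WIS = (0.5 * |y - m| + sum_k (alpha_k / 2) * IS_alpha_k(y)) / (K + 0.5), where IS_alpha is the interval width plus a 2/alpha penalty per unit the observation falls outside the interval. Which interval levels the paper actually uses is not stated on this page, so the function names and the example levels below are assumptions.

```python
from statistics import mean


def mean_absolute_error(observed, predicted):
    """Average absolute difference between observations and point forecasts."""
    return mean(abs(y - p) for y, p in zip(observed, predicted))


def interval_score(y, lower, upper, alpha):
    """Interval score for a central (1 - alpha) prediction interval."""
    width = upper - lower
    below = (2.0 / alpha) * max(lower - y, 0.0)  # penalty if y falls below the interval
    above = (2.0 / alpha) * max(y - upper, 0.0)  # penalty if y falls above the interval
    return width + below + above


def weighted_interval_score(y, median, intervals):
    """WIS for one observation.

    `intervals` maps alpha -> (lower, upper) for each central interval.
    """
    total = 0.5 * abs(y - median)
    for alpha, (lower, upper) in intervals.items():
        total += (alpha / 2.0) * interval_score(y, lower, upper, alpha)
    return total / (len(intervals) + 0.5)


def empirical_coverage(observed, lowers, uppers):
    """Fraction of observations that fall inside their prediction intervals."""
    hits = [lo <= y <= up for y, lo, up in zip(observed, lowers, uppers)]
    return sum(hits) / len(hits)


if __name__ == "__main__":
    # Hypothetical forecast: median 10 with 50% and 90% central intervals.
    forecast = {0.5: (8.0, 12.0), 0.1: (5.0, 16.0)}
    print(weighted_interval_score(11.0, 10.0, forecast))
```

Lower WIS is better. Coverage is judged against the nominal level, so a 90% interval should contain roughly 90% of observations.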
Simulated Authors' Rebuttal
We thank the referee for their constructive comments and positive assessment of the significance of our work. We address each major comment below and outline revisions that will strengthen the manuscript's clarity and rigor.
point-by-point responses
Referee: [Methods] Simulation design (Methods section): The central generalization claim—that Mantis captures fundamental dynamics rather than simulation-specific artifacts—requires explicit evidence that the mechanistic simulator ensemble spans real-world nuisance parameters at sufficient scale. The abstract asserts zero-shot MAE superiority and cross-mechanism generalization, yet no quantitative coverage analysis is provided for reporting delays, testing rates, mobility changes, or superspreading heterogeneity; if these factors are under-represented or fixed in the training distribution, the performance gap on real data may not transfer as claimed.
Authors: We agree that a quantitative coverage analysis would provide stronger support for the generalization claims. In the revised manuscript, we will add a new subsection to the Methods that reports the parameter ranges and sampling distributions used in the simulation ensemble for reporting delays, testing rates, mobility changes, and superspreading heterogeneity. We will include summary statistics and visualizations comparing these ranges to empirical values drawn from the literature for the diseases under study, thereby demonstrating that the training distribution is sufficiently broad to support zero-shot transfer and cross-mechanism generalization.
Revision: yes
Referee: [Results] COVID-19 backtest evaluation (Results section): The headline result that Mantis outperforms every model in the CDC Forecast Hub on early-pandemic forecasts is load-bearing for the zero-shot claim. The manuscript must specify the exact backtest periods, the precise subset of hub models included in the comparison, data exclusion rules, and any statistical significance tests; without these details, it remains possible that evaluation choices (e.g., period selection or model filtering) drive the reported MAE advantage.
Authors: We appreciate the need for full transparency on the evaluation protocol. The revised Results section will explicitly list: (i) the precise calendar periods used for the early-pandemic backtests, (ii) the complete set of CDC Forecast Hub models included in the comparison along with inclusion criteria, (iii) any data exclusion rules (e.g., handling of missing observations or specific locations), and (iv) statistical significance tests (paired Wilcoxon or t-tests on MAE differences with p-values) confirming that the reported advantage is robust to evaluation choices. These details will be added without altering the underlying experimental outcomes.
Revision: yes
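As an illustration of the paired test mentioned in the response above, the following is a small sketch of an exact Wilcoxon signed-rank test on paired absolute errors. It uses only the standard library and enumerates all sign assignments, so it suits only small samples (roughly 20 pairs or fewer). The function names and inputs are hypothetical, zero differences are dropped (one common convention), and nothing here reflects the authors' actual analysis code.

```python
from itertools import product


def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def wilcoxon_signed_rank_exact(errors_a, errors_b):
    """Return (W_plus, two-sided exact p-value) for paired error samples."""
    diffs = [a - b for a, b in zip(errors_a, errors_b) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    # Under the null, each rank carries a positive sign with probability 1/2.
    le = ge = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, sgn in zip(ranks, signs) if sgn)
        le += w <= w_plus
        ge += w >= w_plus
    p_value = min(1.0, 2.0 * min(le, ge) / 2 ** n)
    return w_plus, p_value


if __name__ == "__main__":
    # Hypothetical per-forecast absolute errors for two models on the same targets.
    baseline = [4.0, 6.5, 3.0, 8.0, 5.5, 7.0]
    candidate = [3.0, 4.5, 2.5, 5.0, 4.0, 6.8]
    print(wilcoxon_signed_rank_exact(baseline, candidate))
```

With six pairs in which one model always has the larger error, the smallest attainable two-sided p-value is 2/64 = 0.03125, which shows why the number of backtest targets matters for any significance claim.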
Circularity Check
No circularity: training on simulations with independent real-world evaluation
full rationale
The paper trains Mantis exclusively on mechanistic simulations and evaluates point and probabilistic forecasts on separate real-world data (e.g., CDC COVID-19 Forecast Hub backtests and other disease datasets) that were never seen during training. The abstract and description present these MAE, WIS, and coverage results as empirical outcomes against external models rather than quantities obtained by fitting parameters to the target evaluation data or by self-referential definitions. No equations, self-citations, or ansatzes are shown that would reduce the claimed generalization or performance to inputs by construction. The derivation chain therefore remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Mechanistic simulations capture fundamental contagion dynamics sufficiently for generalization to real data and unseen diseases.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Mantis was trained entirely on synthetic outbreaks generated using mechanistic epidemiological models... SEAIR structure... stochastic compartmental model... human-to-human, vector-borne, environmental transmission
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  trained on over 400 million simulated days... no real-world data during training... generalized to diseases with transmission mechanisms not represented in its training data
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 3 Pith papers
- Latent Chain-of-Thought Improves Structured-Data Transformers
  Latent chain-of-thought via recurrent feedback tokens improves average performance of structured-data transformers on time-series forecasting and tabular prediction.
- In-Context Learning Under Regime Change
  Transformers can solve in-context change-point detection with model size scaling by knowledge of the shift timing, matching optimal baselines on synthetic data and improving pretrained models on disease and financial ...
- Prediction Markets Underperform Simple Baselines For Infectious Disease Forecasting
  Prediction markets fail to outperform standard benchmarks for forecasting influenza hospitalizations and measles cases.