Prospective multi-pathogen disease forecasting using autonomous LLM-guided tree search
Pith reviewed 2026-05-20 18:37 UTC · model grok-4.3
The pith
Autonomous LLM-guided tree search produces disease forecasting models whose ensemble matches or exceeds CDC human-curated performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In a fully prospective evaluation during the 2025-2026 US respiratory season, the autonomous system discovered methodologically diverse models for influenza, COVID-19, and RSV. Aggregating these yielded an ensemble that consistently matched or outperformed the gold-standard human-curated CDC hub ensembles out-of-sample. The system handled data-scarce cold-start scenarios for RSV. Ablations showed that log-scale distance metrics prevent reward hacking and that an automated judge ensures fidelity to epidemiological theory.
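One plausible reading of the log-scale finding is that a natural-scale error lets a candidate score well by fitting only high-count jurisdictions, while a log-scale error penalizes relative misses everywhere. The sketch below illustrates that mechanism under stated assumptions: the counts, the `log1p` transform, and the function names are hypothetical, and the paper's actual metric is not specified in this summary.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error on the natural scale."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def log_mae(y_true, y_pred):
    """Mean absolute error after a log1p transform (assumed transform)."""
    return sum(abs(math.log1p(t) - math.log1p(p))
               for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical weekly admissions for one large and one small jurisdiction.
truth = [10000.0, 10.0]
cand_a = [10000.0, 40.0]   # exact in the large jurisdiction, 4x too high in the small one
cand_b = [10500.0, 10.0]   # 5% too high in the large jurisdiction, exact in the small one

print("natural scale:", mae(truth, cand_a), mae(truth, cand_b))
print("log scale:    ", log_mae(truth, cand_a), log_mae(truth, cand_b))
```

On the natural scale candidate A looks better even though it badly misses the small jurisdiction; on the log scale the ranking flips.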
What carries the argument
LLM-guided tree search that iteratively generates, evaluates, and optimizes executable forecasting code, using an automated judge-in-the-loop to enforce fidelity to epidemiological theory.
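The generate, judge, evaluate, and refine loop described above can be sketched as a small tree search. This is a toy illustration, not the paper's system: a single growth multiplier stands in for a generated program, `propose` stands in for the LLM, `judge` stands in for the automated judge, and the history, step sizes, and UCB constant are all hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical weekly counts used only to score candidates.
HISTORY = [12.0, 18.0, 27.0, 40.0, 58.0]

@dataclass
class Node:
    growth: float                       # stand-in for a generated program
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    total_reward: float = 0.0

def depth(node: Node) -> int:
    d = 0
    while node.parent is not None:
        node, d = node.parent, d + 1
    return d

def propose(node: Node) -> List[float]:
    """Stand-in for the LLM: two perturbations that shrink with depth."""
    step = 0.4 / (1 + depth(node))
    return [node.growth - step, node.growth + step]

def judge(growth: float) -> bool:
    """Stand-in for the automated judge: reject non-positive multipliers."""
    return growth > 0.0

def reward(growth: float) -> float:
    """Negative mean log-scale error of one-step-ahead forecasts on HISTORY."""
    errors = [abs(math.log1p(prev * growth) - math.log1p(actual))
              for prev, actual in zip(HISTORY, HISTORY[1:])]
    return -sum(errors) / len(errors)

def ucb(child: Node, parent_visits: int, c: float = 0.5) -> float:
    """Upper confidence bound: mean reward plus an exploration bonus."""
    mean = child.total_reward / child.visits
    return mean + c * math.sqrt(math.log(parent_visits) / child.visits)

def search(root_growth: float = 1.0, iterations: int = 30):
    root = Node(root_growth, visits=1, total_reward=reward(root_growth))
    best = (root.total_reward, root_growth)
    for _ in range(iterations):
        # Selection: descend to a leaf by maximizing UCB at each level.
        node = root
        while node.children:
            parent = node
            node = max(parent.children, key=lambda ch: ucb(ch, parent.visits))
        # Expansion and evaluation: keep only candidates the judge accepts.
        for candidate in propose(node):
            if not judge(candidate):
                continue
            child = Node(candidate, parent=node)
            node.children.append(child)
            r = reward(candidate)
            best = max(best, (r, candidate))
            # Backpropagation: update statistics up to the root.
            cur: Optional[Node] = child
            while cur is not None:
                cur.visits += 1
                cur.total_reward += r
                cur = cur.parent
    return best

best_reward, best_growth = search()
print(f"best growth multiplier: {best_growth:.3f}, reward: {best_reward:.4f}")
```

In a real system each node would hold executable forecasting code and the judge would inspect its structure; the control flow of selection, judged expansion, scoring, and backpropagation is the part this sketch is meant to convey.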
If this is right
- Forecasting scales to finer geographic resolutions and more pathogens without proportional expert labor.
- Rapid model creation becomes possible for emerging pathogens even with sparse initial data.
- Generated models remain executable and transparent for inspection and reuse.
- Automated maintenance of ensembles reduces the ongoing curation burden across seasons.
Where Pith is reading between the lines
- The tree search could surface modeling strategies not previously explored by human teams.
- Similar automation may reduce model-building effort in climate or economic forecasting domains.
- Coupling with continuous data streams could enable more frequent rolling forecast updates.
Load-bearing premise
The automated judge-in-the-loop correctly enforces structural fidelity to complex scientific theories when selecting or rejecting generated code, without introducing its own biases or missing subtle epidemiological inconsistencies.
What would settle it
A future prospective season test in which the aggregated machine-generated ensemble performs substantially worse than the CDC ensemble on out-of-sample incidence data for the same pathogens.
Original abstract
Probabilistic forecasting of infectious diseases is crucial for public health but relies on labor-intensive manual model curation by expert modeling teams. This bespoke development bottlenecks scalability to granular geographic resolutions or emerging pathogens. Here, we present an autonomous system using Large Language Model (LLM)-guided tree search to iteratively generate, evaluate, and optimize executable forecasting software. In a fully prospective, real-time evaluation during the 2025-2026 US respiratory season, the system autonomously discovered methodologically diverse models for influenza, COVID-19, and respiratory syncytial virus (RSV). Aggregating these machine-generated models yielded an ensemble that consistently matched or outperformed the gold-standard, human-curated Centers for Disease Control and Prevention (CDC) hub ensembles out-of-sample. The system successfully navigated data-scarce "cold start" scenarios for RSV. Moreover, controlled retrospective ablations revealed that optimizing log-scale distance metrics prevents reward hacking, while an automated judge-in-the-loop ensures structural fidelity to complex scientific theories. By autonomously translating epidemiological theory into accurate, transparent code, this framework overcomes the modeling labor bottleneck, enabling rapid deployment of expert-level disease forecasting at unprecedented scales.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents an autonomous system that uses LLM-guided tree search to iteratively generate, evaluate, and optimize executable forecasting code for multi-pathogen respiratory diseases (influenza, COVID-19, RSV). In a fully prospective real-time evaluation over the 2025-2026 US season, the aggregated machine-generated models formed an ensemble that matched or outperformed the human-curated CDC hub ensembles out-of-sample; the system also handled RSV cold-start scenarios. Retrospective ablations are reported to show that log-scale distance metrics mitigate reward hacking and that an automated judge-in-the-loop enforces structural fidelity to epidemiological theory.
Significance. If the central claims are substantiated, the work is significant for epidemiological forecasting and AI-assisted scientific discovery. It demonstrates a scalable, labor-reducing approach to model development that could extend to finer geographic resolutions and emerging pathogens. The prospective temporal separation from training data and the use of controlled ablations to address reward hacking and fidelity are notable strengths that make the evidence more convincing than in purely retrospective studies.
major comments (3)
- [Abstract] Abstract: The central claim that the machine-generated ensemble 'consistently matched or outperformed' the CDC hub ensembles is stated without any quantitative performance numbers (e.g., WIS, MAE, or coverage), error bars, exact evaluation windows within 2025-2026, or details on data exclusion/cold-start protocols. This information is load-bearing for assessing the magnitude and statistical reliability of the reported advantage.
- [Methods] Methods (automated judge-in-the-loop): The description of the judge that enforces 'structural fidelity to complex scientific theories' lacks concrete validation rules, handling of cross-pathogen effects, uncertainty calibration checks, or safeguards against subtle epidemiological inconsistencies. Because the outperformance claim rests on the generated code being mechanistically sound rather than exploiting metric shortcuts, this component requires explicit specification.
- [Results] Results (prospective evaluation): No details are provided on how cold-start handling for RSV was implemented, what data were excluded from the prospective window, or the precise definition of the 2025-2026 evaluation period. These omissions directly affect the ability to verify that the comparison with CDC ensembles was fair and that the system truly generalized in data-scarce regimes.
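For readers unfamiliar with the WIS metric requested in the first major comment, the sketch below computes the weighted interval score for a single forecast in its standard interval form: a weighted sum of the absolute error of the median and the interval scores of K central prediction intervals, divided by K + 1/2. The forecast values and function names are hypothetical and are not taken from the paper.

```python
def interval_score(y, lower, upper, alpha):
    """Interval score of a central (1 - alpha) prediction interval."""
    width = upper - lower
    under = (2.0 / alpha) * max(lower - y, 0.0)   # penalty when y falls below the interval
    over = (2.0 / alpha) * max(y - upper, 0.0)    # penalty when y falls above the interval
    return width + under + over

def weighted_interval_score(y, median, intervals):
    """WIS with weights w0 = 1/2 for the median and alpha/2 per interval.

    intervals maps alpha -> (lower, upper) for each central interval.
    """
    k = len(intervals)
    total = 0.5 * abs(y - median)
    for alpha, (lower, upper) in intervals.items():
        total += (alpha / 2.0) * interval_score(y, lower, upper, alpha)
    return total / (k + 0.5)

# Hypothetical forecast: median 90, a 50% interval and a 90% interval.
wis = weighted_interval_score(
    y=100.0,
    median=90.0,
    intervals={0.5: (80.0, 105.0), 0.1: (60.0, 140.0)},
)
print(wis)
```

Lower values are better; reporting the mean of this score over locations, horizons, and forecast dates, with standard errors, is the kind of quantitative detail the comment asks for.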
minor comments (2)
- [Figure 1] Figure 1 (system overview): The diagram of the tree-search loop would benefit from explicit annotation of the judge component and the reward function to improve readability.
- [Throughout] Notation: Ensure consistent expansion of acronyms (LLM, RSV, CDC) on first use in each major section.
Simulated Author's Rebuttal
Thank you for the opportunity to respond to the referee's constructive report. We have addressed each major comment point by point below. Revisions have been incorporated to provide the requested quantitative details, methodological specifications, and evaluation clarifications while preserving the manuscript's core contributions.
Point-by-point responses
- Referee: [Abstract] Abstract: The central claim that the machine-generated ensemble 'consistently matched or outperformed' the CDC hub ensembles is stated without any quantitative performance numbers (e.g., WIS, MAE, or coverage), error bars, exact evaluation windows within 2025-2026, or details on data exclusion/cold-start protocols. This information is load-bearing for assessing the magnitude and statistical reliability of the reported advantage.
  Authors: We agree that quantitative context strengthens the abstract's central claim. In the revised manuscript, we have added specific performance metrics, including mean WIS scores with standard errors for the machine-generated ensemble versus CDC hub ensembles. The evaluation window is now specified as October 2025–May 2026, with a brief note on RSV cold-start protocols using only real-time data and general epidemiological structures. These additions fit within abstract length limits and directly address the concern for statistical reliability.
  Revision: yes
- Referee: [Methods] Methods (automated judge-in-the-loop): The description of the judge that enforces 'structural fidelity to complex scientific theories' lacks concrete validation rules, handling of cross-pathogen effects, uncertainty calibration checks, or safeguards against subtle epidemiological inconsistencies. Because the outperformance claim rests on the generated code being mechanistically sound rather than exploiting metric shortcuts, this component requires explicit specification.
  Authors: We acknowledge the need for greater explicitness here. The revised Methods section now specifies the judge's concrete rules: checks for cross-pathogen co-circulation effects on transmission parameters, uncertainty calibration via CRPS consistency tests, and safeguards rejecting models that violate non-negativity or conservation principles in incidence curves. We include pseudocode for the judge loop and examples of rejected structures to demonstrate prevention of metric exploitation.
  Revision: yes
- Referee: [Results] Results (prospective evaluation): No details are provided on how cold-start handling for RSV was implemented, what data were excluded from the prospective window, or the precise definition of the 2025-2026 evaluation period. These omissions directly affect the ability to verify that the comparison with CDC ensembles was fair and that the system truly generalized in data-scarce regimes.
  Authors: We thank the referee for this observation. The updated Results section details RSV cold-start handling as model generation from general compartmental frameworks using only prospective real-time observations, with no pre-2025 RSV data. Data exclusion criteria are clarified to prevent any leakage, and the evaluation period is defined precisely as forecasts issued weeks 40/2025 through 20/2026, evaluated against ground truth with identical targets to CDC ensembles for fair comparison.
  Revision: yes
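The safeguards named in the second response (rejecting forecasts that violate non-negativity, for instance) can be pictured as simple output-level checks. The sketch below is a hypothetical illustration only: the simulated rebuttal does not give the judge's actual rules, the quantile-crossing check is an added assumption, and a judge enforcing structural fidelity would also need to inspect the model code itself.

```python
import math

def judge_forecast(quantile_forecast):
    """Return a list of rule violations for one quantile forecast.

    quantile_forecast maps a quantile level (0..1) to a predicted incidence.
    The rules here are illustrative assumptions, not the paper's judge.
    """
    violations = []
    levels = sorted(quantile_forecast)
    values = [quantile_forecast[q] for q in levels]
    if any(not math.isfinite(v) for v in values):
        violations.append("non-finite value")
    if any(v < 0 for v in values):
        violations.append("negative incidence")
    # Predicted values must not decrease as the quantile level increases.
    if any(b < a for a, b in zip(values, values[1:])):
        violations.append("quantile crossing")
    return violations

accepted = judge_forecast({0.1: 5.0, 0.5: 8.0, 0.9: 14.0})
rejected = judge_forecast({0.1: -2.0, 0.5: 8.0, 0.9: 6.0})
print(accepted)
print(rejected)
```

A candidate would pass this stage only when the returned list is empty; rejected candidates could be fed back to the generator together with the violation messages.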
Circularity Check
No significant circularity; prospective evaluation is externally benchmarked
Full rationale
The paper's core derivation chain generates forecasting code via LLM tree search, applies an automated judge for structural fidelity, optimizes on retrospective log-scale metrics, aggregates into an ensemble, and evaluates performance in a fully prospective real-time setting during the 2025-2026 season against external CDC hub ensembles. This temporal out-of-sample benchmark is independent of the fitted models and optimization loop, preventing any reduction of the main claim to a self-defined or fitted input. Retrospective ablations serve only to justify design choices (e.g., log-scale to avoid reward hacking) without making the prospective result tautological. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps in the provided text. The system remains self-contained against external benchmarks.
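The aggregation step in this chain can be illustrated with a quantile-wise median, a common choice for forecast hub ensembles. This is an assumption made for illustration: the summary does not say which combination rule the paper uses, and the model forecasts below are invented.

```python
from statistics import median

def median_ensemble(model_forecasts):
    """Combine quantile forecasts by taking the median at each quantile level.

    model_forecasts is a list of dicts mapping quantile level -> predicted value;
    every model must report the same set of levels.
    """
    levels = sorted(model_forecasts[0])
    for forecast in model_forecasts:
        if sorted(forecast) != levels:
            raise ValueError("all models must report the same quantile levels")
    return {q: median(f[q] for f in model_forecasts) for q in levels}

# Three hypothetical component models for one location and horizon.
components = [
    {0.1: 5.0, 0.5: 10.0, 0.9: 20.0},
    {0.1: 7.0, 0.5: 12.0, 0.9: 18.0},
    {0.1: 6.0, 0.5: 9.0, 0.9: 30.0},
]
ensemble = median_ensemble(components)
print(ensemble)
```

Because the combination uses no outcome data from the evaluation season, scoring such an ensemble against later observations stays an out-of-sample test, which is the point the circularity check relies on.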
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
ERA uses an agentic harness based on Monte Carlo Tree search... iteratively generated, evaluated, and refined Python code to minimize historical forecasting error... automated judge-in-the-loop ensures structural fidelity to complex scientific theories.
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative
Relation between the paper passage and the cited Recognition theorem: unclear
optimizing log-scale distance metrics prevents reward hacking
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.