
arxiv: 2606.05513 · v1 · pith:6ZASHCQLnew · submitted 2026-06-03 · 💻 cs.AI · cs.CL

EpiEvolve: Self-Evolving Agents for Streaming Pandemic Forecasting under Regime Shifts

Pith reviewed 2026-06-28 05:38 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords streaming forecasting · pandemic prediction · LLM agents · regime shifts · episodic memory · COVID-19 hospitalization · self-evolving agents

The pith

A self-evolving agent adapts a fixed LLM forecaster to shifting pandemic regimes by storing past outcomes and reflecting on delayed labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines the gap between static LLM training and real-world streaming pandemic forecasting, where disease regimes change and labels arrive after predictions. It introduces EpiEvolve, which keeps the underlying model weights fixed after a warm-start period and instead builds an evolving context through hierarchical episodic memory, reflection on outcomes, and regime-aware retrieval of similar past cases. This context is assembled under a strict chronological order that blocks access to any future data. On weekly COVID-19 hospitalization trends across five variant regimes, the agent reaches 0.629 average accuracy, compared with 0.561 for the static backbone and 0.325 for an external CDC ensemble, and shortens recovery lag after regime shifts from five weeks to two.

Core claim

EpiEvolve wraps an LLM forecaster with fixed weights and adapts it in a streaming setting by storing forecast outcomes in hierarchical episodic memory, reflecting on delayed labels, retrieving regime-relevant cases, and distilling recurring errors into strategic rules, all while following a chronological protocol that prevents future leakage.

What carries the argument

Hierarchical episodic memory with a reflection step and regime-aware retrieval that distills errors into reusable rules.
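
As a rough illustration of the mechanism named above, the sketch below shows one way a memory could hold forecasts whose labels arrive late, exposing an entry for retrieval only once its label is in. It is not the paper's implementation: the names `Episode`, `EpisodicMemory`, `record_label`, and `visible_at` are hypothetical, the lesson string is a template standing in for an LLM-written reflection, and the state/regional/national hierarchy is left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Episode:
    state: str                         # location the forecast is for
    week: int                          # week t at which the forecast was issued
    forecast: str                      # predicted trend class
    label: Optional[str] = None        # ground truth, unknown at forecast time
    label_week: Optional[int] = None   # week at which the truth arrived
    lesson: Optional[str] = None       # reflection text, written on arrival


class EpisodicMemory:
    """Hypothetical store: entries become retrievable only once labelled."""

    def __init__(self) -> None:
        self.episodes: List[Episode] = []

    def record_forecast(self, state: str, week: int, forecast: str) -> Episode:
        ep = Episode(state, week, forecast)
        self.episodes.append(ep)
        return ep

    def record_label(self, ep: Episode, label: str, arrival_week: int) -> None:
        # Assumption: truth for week t can only arrive strictly after week t.
        if arrival_week <= ep.week:
            raise ValueError("label must arrive after the forecast week")
        ep.label, ep.label_week = label, arrival_week
        verdict = "correct" if label == ep.forecast else "wrong"
        # Template stand-in for the reflection step (an LLM call in the paper).
        ep.lesson = f"{ep.state} week {ep.week}: predicted {ep.forecast}, saw {label} ({verdict})"

    def visible_at(self, week: int) -> List[Episode]:
        """Episodes usable at `week`: issued earlier and already labelled."""
        return [
            e for e in self.episodes
            if e.week < week and e.label_week is not None and e.label_week <= week
        ]


mem = EpisodicMemory()
ep = mem.record_forecast("CA", 10, "increase")
mem.record_label(ep, "decrease", arrival_week=12)
```

With this layout, a forecast issued in week 10 whose label lands in week 12 stays invisible to the week-11 forecast and first appears in the week-12 context.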

If this is right

  • Fixed-weight LLM forecasters can handle regime shifts in streaming data without retraining.
  • Reflection on delayed labels and retrieval of similar past cases each measurably improve adaptation speed.
  • Ablation results confirm that memory, reflection, and regime-aware retrieval each add to the observed gains.
  • The chronological protocol allows reuse of the model's own past predictions while maintaining temporal integrity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same memory-and-reflection structure could extend to other streaming tasks that face delayed feedback and pattern shifts.
  • Keeping model weights fixed while updating context only may lower the cost of maintaining operational forecasting systems over long periods.
  • The approach implies that explicit storage of past errors can substitute for frequent model updates when regimes change.

Load-bearing premise

The memory storage, reflection, and retrieval steps can be executed under a strict time-ordered protocol that blocks all future information yet still generalizes across the five regimes without any post-hoc adjustments that change the reported numbers.

What would settle it

If disabling the reflection step or regime-aware retrieval in an otherwise identical run leaves recovery lag at two weeks and accuracy near 0.629, the claimed contribution of those components would be falsified. A return to a five-week lag and 0.561 accuracy would instead support it.
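
The test above depends on how recovery lag is measured, and the exact definition is not given on this page. The sketch below assumes one plausible definition: the number of weeks after a regime boundary until weekly accuracy stays at or above a target level for `hold` consecutive weeks. The function name, the `target` and `hold` parameters, and the accuracy series are all made up for illustration.

```python
from typing import Optional, Sequence


def recovery_lag(
    weekly_acc: Sequence[float], boundary: int, target: float, hold: int = 2
) -> Optional[int]:
    """Weeks after `boundary` until accuracy is >= target for `hold` weeks in a row.

    Returns None if the series never recovers. This is an assumed definition,
    not necessarily the one used in the paper.
    """
    for start in range(boundary, len(weekly_acc) - hold + 1):
        window = weekly_acc[start:start + hold]
        if all(a >= target for a in window):
            return start - boundary
    return None


# Synthetic series; the regime boundary falls at index 2 in both.
full_run = [0.70, 0.70, 0.30, 0.40, 0.65, 0.70, 0.70]
ablated_run = [0.70, 0.70, 0.30, 0.30, 0.40, 0.50, 0.50, 0.65, 0.70]

lag_full = recovery_lag(full_run, boundary=2, target=0.6)
lag_ablated = recovery_lag(ablated_run, boundary=2, target=0.6)
```

Under this definition, an ablation that leaves the lag unchanged relative to the full system would count against the component, while a longer lag would count in its favor.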

Figures

Figures reproduced from arXiv: 2606.05513 by Fei Liu, Max Lau, Sihang Zeng, Wei Jin, Yiming Lu, Zhengxu Tang.

Figure 1: Overview of EpiEvolve Pipeline. Each week t, (1) state-level features (epi trends, policy, genomic surveillance) drive (2) regime-conditioned retrieval over episodic memory (state, regional, national), (3) feeding the retrieved cases, distilled lessons, and current context to a pre-trained LLM backbone for (4) hospitalization trend class forecasts. (5) When delayed truth y_t arrives, (6) reflection writes a…
Figure 2: Adaptation behavior across the variant-era stream. (a) Rolling 4-week accuracy across the evaluation period for four representative methods; the static backbone collapses at BA.5 and stays depressed through BQ.1, while EpiEvolve dips less at every boundary and returns to its new-regime steady-state level fastest. (b) Boundary-centered recovery: weekly accuracy aligned by weeks since each transition and ave…
Figure 3: EpiEvolve’s internal state over the variant regime stream. Panel (a) shows the memory tier of each top N retrieved entry over time. Within a stable regime, entries from the same state gradually account for more of the retrieved context. At variant transitions, the current regime has few local entries, so retrieval shifts toward regional and national cases with similar features. Panel (b) shows drift events…
Figure 4: One forecasting cycle of EpiEvolve walked end to end. Top block: the model’s actual input slots (variant text, recent trend, dynamic features) together with the <MEMORY> and <RULES> populated by hierarchical retrieval and rule matching. Middle block: the model’s prediction and a counterfactual from the backbone with empty memory and rules. Bottom block: the agent’s writeback for this week, comprising a one…
Figure 5: Per-state EpiEvolve gain over the backbone. Each cell is the accuracy of EpiEvolve minus the backbone for one state in one regime; states are grouped by HHS region. Gains concentrate in the BA.5 and BQ.1 rows where the backbone is furthest from its training distribution, and they correlate within HHS regions, since states that share federal coordination structure also share evidence available to EpiEvolve’…
Figure 6: EpiEvolve confusion matrix (normalized). Rows are truth classes, columns are predictions.
…reflection module emits a one-sentence lesson and a candidate rule per (state, week), and the strategic distiller consolidates the most recent reflections into new strategic rules every K=4 weeks and on each drift event. Bracketed fields are substituted from the agent state at call time. The lesson is appended to the…
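
The excerpt attached to Figure 6 says rules are distilled every K=4 weeks and on each drift event. A minimal sketch of such a trigger follows. The function names are hypothetical, and the choice to restart the K-week counter after a drift-triggered distillation is an assumption; a fixed calendar schedule would be an equally valid reading.

```python
from typing import Iterable, List, Set


def should_distill(week: int, last_distilled: int, drift: bool, k: int = 4) -> bool:
    """True when a drift event fires or at least k weeks have passed."""
    return drift or (week - last_distilled) >= k


def distillation_weeks(weeks: Iterable[int], drift_weeks: Set[int], k: int = 4) -> List[int]:
    """Replay the trigger over a stream of week indices (illustrative only)."""
    fired: List[int] = []
    last = 0  # assumption: the stream starts right after a distillation at week 0
    for week in weeks:
        if should_distill(week, last, week in drift_weeks, k):
            fired.append(week)
            last = week  # assumption: drift-triggered runs reset the counter
    return fired


schedule = distillation_weeks(range(1, 11), drift_weeks={6})
```

With a drift event at week 6, this replay fires at weeks 4, 6, and 10.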
original abstract

Epidemic LLM forecasters are usually trained and evaluated as static supervised models, whereas operational pandemic forecasting is a streaming process in which labels arrive after predictions and disease regimes shift over time. We study this mismatch in weekly COVID-19 hospitalization trend forecasting across five variant regimes. We introduce EpiEvolve, a self-evolving agent that wraps an LLM forecaster trained on the warm-start period and keeps its weights fixed during streaming. EpiEvolve adapts by storing forecast outcomes in a hierarchical episodic memory, reflecting on delayed labels, retrieving cases relevant to the current regime, and distilling recurring errors into strategic rules. The resulting context lets the forecaster reuse its own past predictions and outcomes in later weeks while following a chronological protocol that prevents future leakage. On the streaming dataset, EpiEvolve reaches $0.629$ average accuracy, compared with $0.561$ for the static backbone and $0.325$ for the external CDC ensemble, and reduces recovery lag after regime shifts from $5$ to $2$ weeks. Ablations show that reflection, strategic memory, and regime-aware retrieval each contribute to the gains.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces EpiEvolve, a self-evolving LLM agent for weekly COVID-19 hospitalization trend forecasting in a streaming setting across five variant regimes. The LLM backbone is trained only on a warm-start period and kept fixed; adaptation occurs via storage of outcomes in hierarchical episodic memory, reflection on delayed labels, regime-aware retrieval, and distillation of recurring errors into rules, all under a claimed chronological protocol that prevents future leakage. On the streaming dataset the method reports 0.629 average accuracy (vs. 0.561 static backbone and 0.325 CDC ensemble) and reduces post-regime-shift recovery lag from 5 to 2 weeks; ablations attribute gains to reflection, strategic memory, and regime-aware retrieval.

Significance. If the no-leakage protocol can be shown to be strictly chronological and the empirical gains survive proper statistical controls, the framework would supply a concrete, reusable template for self-evolving agents in non-stationary streaming prediction tasks. The explicit ablation of the three adaptation components is a methodological strength that isolates their individual contributions.

major comments (3)
  1. [Abstract] Abstract: the headline figures (0.629 accuracy, 5-to-2-week lag reduction) are supplied without error bars, statistical significance tests, dataset sizes, number of weeks per regime, or any description of how the five regimes were delineated, rendering it impossible to judge whether the reported improvement over the static backbone is robust.
  2. [Abstract] Abstract (description of hierarchical episodic memory and regime-aware retrieval): the paper asserts that regime indexing, past-episode querying, and distillation from delayed labels are performed without future leakage, yet supplies no pseudocode, equations, or concrete implementation steps showing how regimes are identified from past data only, how retrieval is restricted to timesteps t' < t, or how labels arriving weeks later are incorporated without conditioning on post-t outcomes; this mechanism is load-bearing for the central streaming-adaptation claim.
  3. [Ablation study] Ablation study (mentioned in abstract): the claim that each of reflection, strategic memory, and regime-aware retrieval contributes to the gains cannot be evaluated because the ablation protocol itself is not described (e.g., whether ablations preserve the chronological constraint or whether they inadvertently allow information from later regimes).
minor comments (1)
  1. [Abstract] The term 'strategic memory' appears in the ablation sentence but is not defined or distinguished from 'hierarchical episodic memory' in the main description; a short clarifying sentence would remove ambiguity.
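
Major comment 1 asks for error bars and significance on the accuracy gap. One common way to supply them, sketched here under assumptions, is a paired bootstrap over per-forecast correctness indicators for two methods scored on the same items. The function name and inputs are hypothetical, and resampling items independently ignores week-to-week autocorrelation, so a block bootstrap may be more appropriate for a real stream.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_ci(
    correct_a: Sequence[int],
    correct_b: Sequence[int],
    n_boot: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Accuracy difference (a minus b) with a percentile bootstrap interval.

    Inputs are 0/1 correctness flags for the same forecasts, in the same order.
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("inputs must be non-empty and the same length")
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = [a - b for a, b in zip(correct_a, correct_b)]
    observed = sum(diffs) / n
    means = []
    for _ in range(n_boot):
        # Resample paired differences with replacement.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return observed, lower, upper
```

An interval that excludes zero would suggest the gap between the two methods is not an artifact of which forecasts happened to be sampled.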

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments emphasizing the need for greater detail in the abstract and ablation descriptions. We agree that these elements require expansion to allow proper assessment of the reported results and the no-leakage claims, and we will revise the manuscript accordingly.

point-by-point responses
  1. Referee: [Abstract] Abstract: the headline figures (0.629 accuracy, 5-to-2-week lag reduction) are supplied without error bars, statistical significance tests, dataset sizes, number of weeks per regime, or any description of how the five regimes were delineated, rendering it impossible to judge whether the reported improvement over the static backbone is robust.

    Authors: We agree that the abstract should convey more context for the metrics. In revision we will add error bars to the headline figures, report statistical significance (paired tests against the static baseline), note the total number of weeks (~150 across regimes), and briefly describe regime delineation by documented variant emergence dates, with fuller details moved to the main text. revision: yes

  2. Referee: [Abstract] Abstract (description of hierarchical episodic memory and regime-aware retrieval): the paper asserts that regime indexing, past-episode querying, and distillation from delayed labels are performed without future leakage, yet supplies no pseudocode, equations, or concrete implementation steps showing how regimes are identified from past data only, how retrieval is restricted to timesteps t' < t, or how labels arriving weeks later are incorporated without conditioning on post-t outcomes; this mechanism is load-bearing for the central streaming-adaptation claim.

    Authors: We will add a dedicated subsection in Methods containing pseudocode and equations that formalize the chronological constraints: regime detection uses only data available at t, retrieval queries are masked to t' < t, and delayed labels are stored and reflected upon only after their arrival without reference to any post-t information. The abstract will be updated to reference this subsection. revision: yes

  3. Referee: [Ablation study] Ablation study (mentioned in abstract): the claim that each of reflection, strategic memory, and regime-aware retrieval contributes to the gains cannot be evaluated because the ablation protocol itself is not described (e.g., whether ablations preserve the chronological constraint or whether they inadvertently allow information from later regimes).

    Authors: The ablations were performed under exactly the same chronological protocol as the main experiments. We will expand the ablation subsection to state this explicitly and to describe, for each removed component, how regime indexing, retrieval, and label incorporation remain restricted to past data only. revision: yes
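
The masking promised in response 2 (retrieval restricted to t' < t, labels used only after arrival) can be sketched as follows, combined with the state, regional, and national tiers described in the Figure 1 and Figure 3 captions. This is an illustration under assumptions, not the authors' method: `Case`, `retrieve`, and the ranking key are hypothetical, the region labels are placeholders, and the paper reportedly ranks by feature similarity rather than the simple regime-then-tier-then-recency order used here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Case:
    state: str
    region: str      # placeholder grouping label
    week: int        # week t' the forecast was issued
    label_week: int  # week its ground-truth label arrived
    regime: str      # regime tag, assumed to be assigned from data up to `week`


def retrieve(
    memory: List[Case], state: str, region: str, regime: str, now: int, top_n: int = 3
) -> List[Case]:
    # Chronological mask: only cases issued before `now` whose label has arrived.
    eligible = [c for c in memory if c.week < now and c.label_week <= now]

    def tier(c: Case) -> int:
        # 0 = same state, 1 = same region, 2 = national.
        if c.state == state:
            return 0
        if c.region == region:
            return 1
        return 2

    # Prefer same-regime cases, then the closest tier, then the most recent.
    ranked = sorted(eligible, key=lambda c: (c.regime != regime, tier(c), -c.week))
    return ranked[:top_n]


memory = [
    Case("CA", "West", 1, 2, "BA.5"),
    Case("OR", "West", 4, 5, "BQ.1"),
    Case("NY", "Northeast", 3, 4, "BQ.1"),
    Case("CA", "West", 5, 6, "BQ.1"),
    Case("CA", "West", 6, 8, "BQ.1"),  # label arrives after now=7, so masked
    Case("CA", "West", 7, 9, "BQ.1"),  # issued at now=7, so masked
]
context = retrieve(memory, "CA", "West", "BQ.1", now=7)
```

Because the mask is applied before ranking, no ordering rule can surface a case from the future. When the current regime has few local cases, the ranking falls through to regional and national entries, in line with the behavior the Figure 3 caption describes.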

Circularity Check

0 steps flagged

No circularity; results are direct empirical comparisons on streaming data.

full rationale

The paper describes an agent architecture (hierarchical episodic memory, reflection, regime-aware retrieval) and reports measured accuracies (0.629 vs. 0.561 static backbone) and lag reductions under a claimed chronological protocol. No equations, fitted parameters, or self-citations are invoked to derive these figures; the metrics are external performance numbers obtained by running the system on held-out streaming data. The leakage-prevention claim is an implementation assumption whose validity affects replicability but does not create a definitional or self-referential reduction in the reported results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach rests on the untested assumption that memory retrieval and rule distillation generalize across unseen regimes without introducing selection bias or leakage; no independent evidence is supplied for the memory components beyond the reported performance numbers.

axioms (1)
  • [domain assumption] A chronological protocol can be enforced that prevents any future leakage while still allowing delayed labels to inform reflection and retrieval.
    Invoked to justify the streaming evaluation setup.
invented entities (1)
  • hierarchical episodic memory (no independent evidence)
    purpose: Store forecast outcomes and enable regime-aware retrieval and rule distillation
    New component introduced to enable adaptation without weight updates.

pith-pipeline@v0.9.1-grok · 5742 in / 1312 out tokens · 31086 ms · 2026-06-28T05:38:18.241986+00:00 · methodology

