Better with Experience: Self-Evolving LLM Agents for Evidence-Grounded Health Community Notes
Pith reviewed 2026-06-28 14:18 UTC · model grok-4.3
The pith
EvoNote lets LLM agents for health Community Notes build and reuse an experience memory across posts instead of resetting each time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EvoNote is an agentic framework that enables health Community Notes generation to self-evolve through an evolving experience memory of prior misinformation correction episodes. Its core is fine-grained credit assignment that grounds trajectory-level feedback in health-specific note qualities and distills it into action-level memory for claim analysis, evidence acquisition, and note writing.
What carries the argument
EvoNote's evolving experience memory with fine-grained credit assignment that converts trajectory feedback into action-level memories for claim analysis, evidence acquisition, and note writing.
If this is right
- EvoNote notes are preferred over corresponding human-written notes in 89.6 percent of cases under the hierarchical utility judge.
- On Needs More Ratings posts, EvoNote produces helpful notes in 82.0 percent of cases.
- Median time to produce a candidate note drops from over 13 hours in the human pipeline to under 2 minutes.
- Performance gains trace to stronger evidence use and reusable correction strategies stored in memory.
Where Pith is reading between the lines
- The memory-distillation approach could be tested on non-health misinformation domains if comparable benchmarks are built.
- Repeated self-evolution might allow the system to handle emerging misinformation patterns without new human examples each time.
- The credit-assignment mechanism could be adapted to other agent tasks that require accumulating domain-specific correction knowledge.
Load-bearing premise
The human-validated hierarchical utility judge and the MM-HealthCN benchmark of 1.2K posts with crowd-derived labels accurately measure real-world helpfulness and generalizability.
What would settle it
Independent human raters on a fresh set of health posts show EvoNote notes preferred over human notes in fewer than half the direct comparisons.
Figures
read the original abstract
Large Language Model (LLM)-augmented Community Notes offer a scalable path for timely, evidence-grounded correction of health misinformation on social platforms. However, they still reset at every post, leaving useful correction experience from prior cases unused. We introduce EvoNote, an agentic framework that enables health Community Notes generation to self-evolve through an evolving experience memory of prior misinformation correction episodes. Its core is fine-grained credit assignment: EvoNote grounds trajectory-level feedback in health-specific note qualities and distills it into action-level memory for claim analysis, evidence acquisition, and note writing. We evaluate EvoNote on MM-HealthCN, a 1.2K-instance multimodal benchmark of user-flagged health posts with human-written Community Notes and crowd-derived helpfulness labels. Under a human-validated hierarchical utility judge, EvoNote-generated notes are preferred over corresponding human-written notes in 89.6% of cases; on a separate set of Needs More Ratings posts without a crowd helpfulness verdict, EvoNote produces helpful notes for 82.0% of cases. It also reduces the median time needed to produce a candidate correction from over 13 hours in the human-note pipeline to under 2 minutes. Analyses link these gains to stronger evidence use and reusable correction strategies, positioning self-evolving note generation as a promising paradigm for health misinformation governance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces EvoNote, a self-evolving LLM agent framework that maintains an experience memory of prior health misinformation corrections and uses fine-grained credit assignment to distill trajectory-level feedback into action-level updates for claim analysis, evidence acquisition, and note writing. Evaluated on the new MM-HealthCN benchmark of 1.2K user-flagged multimodal health posts with human notes and crowd labels, it reports that EvoNote notes are preferred over human-written notes in 89.6% of cases under a human-validated hierarchical utility judge, produces helpful notes in 82.0% of Needs More Ratings cases, and reduces median generation time from over 13 hours to under 2 minutes, attributing gains to stronger evidence use and reusable strategies.
Significance. If the central empirical claims hold under a robust judge, the work demonstrates that self-evolving agentic systems can outperform static human pipelines in evidence-grounded health misinformation correction while introducing the MM-HealthCN benchmark as a reusable resource. This positions experience-based evolution as a promising direction for scalable community notes, with potential implications for automated governance of health content on social platforms.
major comments (3)
- [§4 (Evaluation)] §4 (Evaluation) and abstract: The headline 89.6% preference rate over human notes is obtained exclusively via the human-validated hierarchical utility judge, yet the manuscript reports neither inter-annotator agreement for the judge nor its correlation with the existing crowd-derived helpfulness labels already present in MM-HealthCN. Without these checks, it is impossible to rule out that the judge systematically favors longer or more evidence-dense outputs irrespective of factual accuracy.
- [§3.2 (Credit Assignment)] §3.2 (Credit Assignment) and §5.1 (Ablations): The core mechanism of fine-grained credit assignment that grounds trajectory feedback into action-level memory is presented as load-bearing for the self-evolution claim, but no ablation isolates its contribution relative to a non-evolving baseline or to simpler memory mechanisms; the reported gains could therefore be driven primarily by the base LLM rather than the evolving experience component.
- [Table 2 / §4.2] Table 2 / §4.2 (Needs More Ratings results): The 82.0% helpfulness rate on posts without crowd verdicts is reported without an accompanying error analysis or breakdown by post type (e.g., multimodal vs. text-only), making it difficult to assess whether the self-evolution generalizes or merely exploits patterns already captured by the 1.2K training episodes.
minor comments (2)
- [Figure 1] Figure 1: The agent architecture diagram would benefit from explicit arrows or labels indicating the memory-update step after each episode, as the current rendering leaves the flow from credit assignment to action-level memory implicit.
- [§2 (Related Work)] §2 (Related Work): The discussion of prior Community Notes systems could usefully cite the original Twitter Community Notes paper and recent empirical studies on note helpfulness to better situate the contribution.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate additional validation and analysis as outlined.
read point-by-point responses
-
Referee: §4 (Evaluation) and abstract: The headline 89.6% preference rate over human notes is obtained exclusively via the human-validated hierarchical utility judge, yet the manuscript reports neither inter-annotator agreement for the judge nor its correlation with the existing crowd-derived helpfulness labels already present in MM-HealthCN. Without these checks, it is impossible to rule out that the judge systematically favors longer or more evidence-dense outputs irrespective of factual accuracy.
Authors: We agree that reporting inter-annotator agreement and correlation with crowd labels would strengthen the judge's validation. In the revised manuscript, we will add IAA scores for the hierarchical utility judge, its correlation with the existing crowd-derived helpfulness labels, and an analysis controlling for note length to address potential length or density biases. revision: yes
-
Referee: §3.2 (Credit Assignment) and §5.1 (Ablations): The core mechanism of fine-grained credit assignment that grounds trajectory feedback into action-level memory is presented as load-bearing for the self-evolution claim, but no ablation isolates its contribution relative to a non-evolving baseline or to simpler memory mechanisms; the reported gains could therefore be driven primarily by the base LLM rather than the evolving experience component.
Authors: While §5.1 presents ablations on system components, we acknowledge the value of a direct isolation against a non-evolving baseline with the same base LLM and against simpler memory without fine-grained credit assignment. We will add these comparisons in the revised manuscript to better demonstrate the contribution of the credit assignment mechanism. revision: yes
-
Referee: Table 2 / §4.2 (Needs More Ratings results): The 82.0% helpfulness rate on posts without crowd verdicts is reported without an accompanying error analysis or breakdown by post type (e.g., multimodal vs. text-only), making it difficult to assess whether the self-evolution generalizes or merely exploits patterns already captured by the 1.2K training episodes.
Authors: We agree that an error analysis and breakdown by post type would better demonstrate generalization. In the revision, we will include a breakdown of the 82.0% rate by multimodal versus text-only posts and add an error analysis categorizing cases where EvoNote notes were not helpful. revision: yes
Circularity Check
No significant circularity; empirical results on external benchmark.
full rationale
The paper's central claims rest on direct empirical comparisons of EvoNote outputs against human-written notes and crowd-derived labels on the external MM-HealthCN benchmark of 1.2K instances. No equations, derivations, or fitted parameters are described that reduce to inputs by construction. The hierarchical utility judge is presented as an evaluation tool with human validation, but its use does not create a self-definitional loop or rename a fitted quantity as a prediction. No self-citation chains or ansatzes are invoked as load-bearing for the reported 89.6% preference or 82.0% helpfulness rates. The derivation chain is self-contained against external signals.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
MemGPT: Towards LLMs as Operating Systems
Memgpt: Towards llms as operating systems. Preprint, arXiv:2310.08560. Davide Paglieri, Bartłomiej Cupiał, Samuel Coward, Ulyana Piterbarg, Maciej Wolczyk, Akbir Khan, Ed- uardo Pignatelli, Łukasz Kuci ´nski, Lerrel Pinto, Rob Fergus, Jakob Nicolaus Foerster, Jack Parker-Holder, and Tim Rocktäschel. 2025. BALROG: Benchmark- ing agentic LLM and VLM reasoni...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[2]
RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning
Ragen: Understanding self-evolution in llm agents via multi-turn reinforcement learning.arXiv preprint arXiv:2504.20073. Fiona Warde, Janet Papadakos, Tina Papadakos, Danielle Rodin, Mohammad Salhia, and Meredith Giuliani. 2018. Plain language communication as a priority competency for medical professionals in a globalized world.Canadian Medical Education...
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[3]
Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning
Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. Advances in Neural Information Processing Systems, 37:52040–52094. Rui Xing, Preslav Nakov, Timothy Baldwin, and Jey Han Lau. 2026. COMMUNITYNOTES: A dataset for exploring the helpfulness of fact-checking explanations. InFindings of the Association for Computationa...
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[4]
Understandable: explains key distinctions or necessary jargon in accessible language
-
[5]
Meaningful: explains why the correction matters for health interpretation, risk perception, or behavior
-
[6]
Usable: provides a safe and proportionate next step when the claim could affect user action
-
[7]
Instructions:
Trustworthy: avoids overclaiming and acknowledges uncertainty or evidence limits where appropriate. Instructions:
-
[8]
Compare the two notes criterion by criterion using the four criteria above
-
[9]
For each criterion, choose exactly one of: Note 1, Note 2, or Tie
-
[10]
A criterion should be marked Tie when neither note is clearly better on that criterion
-
[11]
Then choose the better note overall, or choose Tie if neither note is clearly better overall
-
[12]
Understandable: CHOICE Meaningful: CHOICE Usable: CHOICE Trustworthy: CHOICE
Use the following exact format for the criterion-level decisions, replacing CHOICE with exactly one of: Note 1, Note 2, or Tie. Understandable: CHOICE Meaningful: CHOICE Usable: CHOICE Trustworthy: CHOICE
-
[13]
Model Macro-F1 (%) Macro-Acc
End your response with exactly one of the following lines: Final decision: Note 1 Final decision: Note 2 Final decision: Tie Figure 9:Prompt for pairwise utility judgmentbetween two Community Notes candidates. Model Macro-F1 (%) Macro-Acc. (%) GPT-4.1 74.30 71.67 Gemini-2.5-flash 67.86 64.35 Claude-Sonnet-4 74.58 72.40 HEALTHJUDGE89.21 88.46 Table 5:Effec...
2025
-
[14]
have been verified. We refer readers to Wu et al. (2025) for the orig- inal human evaluation setup, prompt details, and HEALTHJUDGEtraining procedure. D.3.2 Reliability of Pairwise Utility Comparison To validate the GPT-4.1-based pairwise utility judge, we conduct a human alignment study on 100 randomly sampled note pairs. For each pair, three annotators ...
-
[15]
search",
decision_rationale How to use retrieved memory: - Retrieved memory is not expected to answer the current factual claim directly. - Retrieved memory is strategy guidance for this type of post. - Use it to decide what kind of evidence to seek next, what kind of overclaim to watch for, and whether it is too early to finalize. - Judge memory by whether it hel...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.