Better with Experience: Self-Evolving LLM Agents for Evidence-Grounded Health Community Notes

Bryan Hooi; Fanxiao Li; Haonan Wang; Jianyang Gu; Jiaying Wu; Min-Yen Kan; Preslav Nakov; Zihang Fu

arXiv: 2606.02215 · v1 · pith:25DLH4FPnew · submitted 2026-06-01 · 💻 cs.CL · cs.SI

Zihang Fu, Fanxiao Li, Jianyang Gu, Haonan Wang, Preslav Nakov, Bryan Hooi, Min-Yen Kan, Jiaying Wu

Pith reviewed 2026-06-28 14:18 UTC · model grok-4.3

keywords: LLM agents · Community Notes · health misinformation · self-evolving agents · evidence-grounded correction · agentic framework · misinformation governance · experience memory

The pith

EvoNote lets LLM agents for health Community Notes build and reuse an experience memory across posts instead of resetting each time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents EvoNote, an agentic system that stores prior correction episodes in a memory and uses fine-grained credit assignment to distill them into reusable strategies for analyzing claims, gathering evidence, and writing notes. On the MM-HealthCN benchmark of 1.2K user-flagged health posts, a human-validated judge prefers notes from the evolved agents over the corresponding human-written notes in 89.6 percent of direct comparisons. On a separate set of posts still marked Needs More Ratings, the system produces helpful notes in 82.0 percent of cases. The median time to produce a candidate note falls from over 13 hours in the human pipeline to under 2 minutes. The paper's analyses link the gains to stronger evidence use and to reusable correction patterns learned from past episodes, and frame self-evolving note generation as a route to scalable, evidence-based responses to health misinformation.

Core claim

EvoNote is an agentic framework that enables health Community Notes generation to self-evolve through an evolving experience memory of prior misinformation correction episodes. Its core is fine-grained credit assignment that grounds trajectory-level feedback in health-specific note qualities and distills it into action-level memory for claim analysis, evidence acquisition, and note writing.

What carries the argument

EvoNote's evolving experience memory with fine-grained credit assignment that converts trajectory feedback into action-level memories for claim analysis, evidence acquisition, and note writing.
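To make the idea of action-level memory concrete, here is a minimal sketch of a store whose items are tied to one of the three action phases named above. The class names, fields, and the keyword-overlap retrieval are hypothetical illustrations and are not taken from the paper, which does not describe its data structures on this page.

```python
from dataclasses import dataclass, field

# The three phases come from the text; all other names are assumptions.
PHASES = ("claim_analysis", "evidence_acquisition", "note_writing")


@dataclass
class MemoryItem:
    phase: str            # action phase the lesson applies to
    strategy: str         # reusable lesson distilled from a past episode
    keywords: frozenset   # crude retrieval index (assumption, not the paper's)


@dataclass
class ExperienceMemory:
    items: list = field(default_factory=list)

    def add(self, phase, strategy, keywords):
        if phase not in PHASES:
            raise ValueError(f"unknown phase: {phase}")
        index = frozenset(k.lower() for k in keywords)
        self.items.append(MemoryItem(phase, strategy, index))

    def retrieve(self, phase, post_text, top_k=2):
        """Return up to top_k strategies for a phase, ranked by keyword overlap."""
        words = set(post_text.lower().split())
        scored = [(len(item.keywords & words), item)
                  for item in self.items if item.phase == phase]
        scored = [(score, item) for score, item in scored if score > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [item.strategy for _, item in scored[:top_k]]


memory = ExperienceMemory()
memory.add("evidence_acquisition",
           "Check the primary trial report before citing news coverage.",
           ["vaccine", "trial"])
memory.add("note_writing",
           "State what the evidence does not show.",
           ["cure", "causes"])
print(memory.retrieve("evidence_acquisition", "New vaccine trial proves harm"))
```

Keying items by phase means a lesson about evidence search is only offered while the agent is searching, which is one plausible reading of "action-level" memory.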

If this is right

EvoNote notes are preferred over corresponding human-written notes in 89.6 percent of cases under the hierarchical utility judge.
On Needs More Ratings posts, EvoNote produces helpful notes in 82.0 percent of cases.
Median time to produce a candidate note drops from over 13 hours in the human pipeline to under 2 minutes.
Performance gains trace to stronger evidence use and reusable correction strategies stored in memory.
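The time claim above gives only bounds ("over 13 hours", "under 2 minutes"), so the implied speedup can be stated only as a lower bound. A small worked calculation, using nothing beyond those two figures:

```python
# Bounds taken from the text: "over 13 hours" and "under 2 minutes".
human_median_minutes = 13 * 60   # lower bound on the human-pipeline median
agent_median_minutes = 2         # upper bound on the agent median

# A lower bound divided by an upper bound is a lower bound on the ratio.
min_speedup = human_median_minutes / agent_median_minutes
print(f"speedup is at least {min_speedup:.0f}x")
```

This compares medians of two different pipelines and says nothing about how either time was measured.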

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The memory-distillation approach could be tested on non-health misinformation domains if comparable benchmarks are built.
Repeated self-evolution might allow the system to handle emerging misinformation patterns without new human examples each time.
The credit-assignment mechanism could be adapted to other agent tasks that require accumulating domain-specific correction knowledge.

Load-bearing premise

The human-validated hierarchical utility judge and the MM-HealthCN benchmark of 1.2K posts with crowd-derived labels accurately measure real-world helpfulness and generalizability.

What would settle it

The claim would be undercut if independent human raters, judging a fresh set of health posts, preferred EvoNote notes over human notes in fewer than half of direct comparisons.
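A sketch of how such a test could be scored, assuming each comparison is labelled "system", "human", or "tie". Counting a tie as half a win follows the convention stated in the Figure 4 caption; whether the 89.6 percent headline uses the same convention is not stated on this page. The labels and the sample data are made up for illustration.

```python
from collections import Counter


def preference_rate(outcomes):
    """Share of comparisons won by the system, counting each tie as half a win."""
    counts = Counter(outcomes)
    unknown = set(counts) - {"system", "human", "tie"}
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no comparisons")
    return (counts["system"] + 0.5 * counts["tie"]) / total


# Made-up outcomes: 6 system wins, 3 human wins, 1 tie.
outcomes = ["system"] * 6 + ["human"] * 3 + ["tie"]
rate = preference_rate(outcomes)
print(rate, "claim survives" if rate >= 0.5 else "claim fails")
```

A real replication would also report an uncertainty interval, since a rate just above or below one half on a small sample settles little.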

Figures

Figures reproduced from arXiv: 2606.02215 by Bryan Hooi, Fanxiao Li, Haonan Wang, Jianyang Gu, Jiaying Wu, Min-Yen Kan, Preslav Nakov, Zihang Fu.

**Figure 1.** Paradigms for LLM-augmented Community Notes generation. (a) Existing pipelines treat each episode independently. (b) EvoNote distills prior episodes into evolving memory to improve future notes.

**Figure 2.** Reusable misinformation correction experience can remain episode-local. (a) Distribution of major correction-relevant claim patterns inferred from helpful-note cases; detailed categories are provided in …

**Figure 3.** Overview of EvoNote for self-evolving health Community Notes generation. EvoNote turns trajectory-level feedback into evolving experience memory to inform future note-generation episodes.

**Figure 4.** Pairwise utility comparison across health-specific utility dimensions on 403 cases where both human-written notes and EvoNote are judged Helpful. Ties count as 0.5 for both sides.

**Figure 6.** VLM and caption-based LLM instantiations of EvoNote on 400 IMAGE cases.

**Figure 7.** Topic distribution of MM-HealthCN. Instances are grouped by post modality and assigned to representative health topics following prior work (Wu et al., 2025).

**Figure 8.** Human validation protocol for multimodal caption quality (§C.2).

**Figure 9.** Prompt for pairwise utility judgment between two Community Notes candidates.

**Figure 10.** Human evaluation instruction for validating GPT-4.1-based pairwise utility judgments (§D.2).

**Figure 11.** Case study: memory-guided regulatory verification. Retrieved experience helps EvoNote verify clinical-trial, EUA, and safety-review details for a stronger correction. See detailed analysis in §F.5.

**Figure 12.** Prompt for the Social Utility Judge agent (§4.2, §B.1).

**Figure 13.** Prompt for the Memory Evolver agent (§4.3, §B.2).

**Figure 14.** Prompt for the Claim Analyzer agent (§4.4, §B.3.1).

**Figure 15.** Prompt for the Note Writer agent (§4.4, §B.3.2).

**Figure 16.** Prompt for the optional note refinement step (§4.4) in EvoNote when the generated note violates platform length compliance.

**Figure 17.** Demonstration of key steps from the EvoNote workflow, Part I (Episode ID: 455).

**Figure 18.** Demonstration of key steps from the EvoNote workflow, Part II (Episode ID: 455).

Original abstract

Large Language Model (LLM)-augmented Community Notes offer a scalable path for timely, evidence-grounded correction of health misinformation on social platforms. However, they still reset at every post, leaving useful correction experience from prior cases unused. We introduce EvoNote, an agentic framework that enables health Community Notes generation to self-evolve through an evolving experience memory of prior misinformation correction episodes. Its core is fine-grained credit assignment: EvoNote grounds trajectory-level feedback in health-specific note qualities and distills it into action-level memory for claim analysis, evidence acquisition, and note writing. We evaluate EvoNote on MM-HealthCN, a 1.2K-instance multimodal benchmark of user-flagged health posts with human-written Community Notes and crowd-derived helpfulness labels. Under a human-validated hierarchical utility judge, EvoNote-generated notes are preferred over corresponding human-written notes in 89.6% of cases; on a separate set of Needs More Ratings posts without a crowd helpfulness verdict, EvoNote produces helpful notes for 82.0% of cases. It also reduces the median time needed to produce a candidate correction from over 13 hours in the human-note pipeline to under 2 minutes. Analyses link these gains to stronger evidence use and reusable correction strategies, positioning self-evolving note generation as a promising paradigm for health misinformation governance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

EvoNote adds a self-evolving memory loop with action-level credit assignment for health notes, but the 89.6% preference claim rests on a judge whose validation details are missing from the abstract.

The main thing to know is that EvoNote turns past correction episodes into reusable memory for LLM agents by grounding trajectory feedback in health-specific qualities and distilling it down to actions for claim analysis, evidence search, and writing. This is a step past one-shot retrieval or prompting in the cited work.

The paper does a clean job defining the MM-HealthCN benchmark of 1.2K multimodal posts with crowd labels and showing large time reductions from the human pipeline. The 82% helpful rate on Needs More Ratings cases is a practical data point worth noting.

The soft spot is the central evaluation. The 89.6% preference over human notes comes from their human-validated hierarchical utility judge, yet the abstract supplies no inter-annotator agreement, no correlation numbers with the existing crowd helpfulness labels, and no ablation on the judge dimensions. If the judge rewards length or evidence density independently of accuracy, the margin could be an artifact. The time comparison also needs the exact human measurement protocol to be convincing.
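The length concern in the previous paragraph suggests a quick diagnostic: among decided comparisons, how often does the longer note win? The sketch below assumes each record holds the two note texts and a decision in the judge's own vocabulary ("Note 1", "Note 2", "Tie"); the function name and record layout are hypothetical.

```python
def longer_note_win_share(pairs):
    """pairs: iterable of (note_1_text, note_2_text, decision).

    Returns the share of decided, unequal-length comparisons in which the
    longer note was preferred, or NaN if no such comparison exists.
    """
    longer_wins = 0
    decided = 0
    for note_1, note_2, decision in pairs:
        if decision == "Tie" or len(note_1) == len(note_2):
            continue  # ties and equal lengths carry no length signal
        decided += 1
        longer = "Note 1" if len(note_1) > len(note_2) else "Note 2"
        longer_wins += decision == longer
    return longer_wins / decided if decided else float("nan")


# Made-up comparisons for illustration.
pairs = [
    ("aaaa", "a", "Note 1"),   # longer note wins
    ("a", "aaaa", "Note 1"),   # shorter note wins
    ("aa", "a", "Tie"),        # skipped
    ("abc", "ab", "Note 1"),   # longer note wins
]
print(longer_note_win_share(pairs))
```

A share well above one half is only suggestive, because longer notes may be better for legitimate reasons; it flags where a length-controlled comparison is worth running.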

This is for researchers building agent systems for social media or health misinformation work. A reader in that area gets a concrete framework and benchmark to examine.

Send it to peer review so the full methods, judge construction, and any memory ablations can be checked directly.

Referee Report

3 major / 2 minor

Summary. The paper introduces EvoNote, a self-evolving LLM agent framework that maintains an experience memory of prior health misinformation corrections and uses fine-grained credit assignment to distill trajectory-level feedback into action-level updates for claim analysis, evidence acquisition, and note writing. Evaluated on the new MM-HealthCN benchmark of 1.2K user-flagged multimodal health posts with human notes and crowd labels, it reports that EvoNote notes are preferred over human-written notes in 89.6% of cases under a human-validated hierarchical utility judge, produces helpful notes in 82.0% of Needs More Ratings cases, and reduces median generation time from over 13 hours to under 2 minutes, attributing gains to stronger evidence use and reusable strategies.

Significance. If the central empirical claims hold under a robust judge, the work demonstrates that self-evolving agentic systems can outperform static human pipelines in evidence-grounded health misinformation correction while introducing the MM-HealthCN benchmark as a reusable resource. This positions experience-based evolution as a promising direction for scalable community notes, with potential implications for automated governance of health content on social platforms.

major comments (3)

§4 (Evaluation) and abstract: The headline 89.6% preference rate over human notes is obtained exclusively via the human-validated hierarchical utility judge, yet the manuscript reports neither inter-annotator agreement for the judge nor its correlation with the existing crowd-derived helpfulness labels already present in MM-HealthCN. Without these checks, it is impossible to rule out that the judge systematically favors longer or more evidence-dense outputs irrespective of factual accuracy.
§3.2 (Credit Assignment) and §5.1 (Ablations): The core mechanism of fine-grained credit assignment that grounds trajectory feedback into action-level memory is presented as load-bearing for the self-evolution claim, but no ablation isolates its contribution relative to a non-evolving baseline or to simpler memory mechanisms; the reported gains could therefore be driven primarily by the base LLM rather than the evolving experience component.
Table 2 / §4.2 (Needs More Ratings results): The 82.0% helpfulness rate on posts without crowd verdicts is reported without an accompanying error analysis or breakdown by post type (e.g., multimodal vs. text-only), making it difficult to assess whether the self-evolution generalizes or merely exploits patterns already captured by the 1.2K training episodes.
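The breakdown requested in the last comment is simple to compute once each post carries a type label and a helpfulness verdict. A minimal sketch, with hypothetical group labels and made-up records:

```python
from collections import defaultdict


def helpful_rate_by_group(records):
    """records: iterable of (group_label, is_helpful).

    Returns {group_label: (helpful_rate, n_posts)}.
    """
    tally = defaultdict(lambda: [0, 0])  # group -> [helpful_count, total]
    for group, is_helpful in records:
        tally[group][0] += bool(is_helpful)
        tally[group][1] += 1
    return {group: (helpful / total, total)
            for group, (helpful, total) in tally.items()}


# Made-up records for illustration only.
records = [
    ("image", True), ("image", False),
    ("video", True),
    ("text", True), ("text", True),
]
for group, (rate, n) in helpful_rate_by_group(records).items():
    print(f"{group}: {rate:.2f} (n={n})")
```

Reporting the group size next to each rate matters here, since a high rate on a handful of posts says little about generalization.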

minor comments (2)

Figure 1: The agent architecture diagram would benefit from explicit arrows or labels indicating the memory-update step after each episode, as the current rendering leaves the flow from credit assignment to action-level memory implicit.
§2 (Related Work): The discussion of prior Community Notes systems could usefully cite the original Twitter Community Notes paper and recent empirical studies on note helpfulness to better situate the contribution.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate additional validation and analysis as outlined.

Referee: §4 (Evaluation) and abstract: The headline 89.6% preference rate over human notes is obtained exclusively via the human-validated hierarchical utility judge, yet the manuscript reports neither inter-annotator agreement for the judge nor its correlation with the existing crowd-derived helpfulness labels already present in MM-HealthCN. Without these checks, it is impossible to rule out that the judge systematically favors longer or more evidence-dense outputs irrespective of factual accuracy.

Authors: We agree that reporting inter-annotator agreement and correlation with crowd labels would strengthen the judge's validation. In the revised manuscript, we will add IAA scores for the hierarchical utility judge, its correlation with the existing crowd-derived helpfulness labels, and an analysis controlling for note length to address potential length or density biases.

revision: yes

Referee: §3.2 (Credit Assignment) and §5.1 (Ablations): The core mechanism of fine-grained credit assignment that grounds trajectory feedback into action-level memory is presented as load-bearing for the self-evolution claim, but no ablation isolates its contribution relative to a non-evolving baseline or to simpler memory mechanisms; the reported gains could therefore be driven primarily by the base LLM rather than the evolving experience component.

Authors: While §5.1 presents ablations on system components, we acknowledge the value of a direct isolation against a non-evolving baseline with the same base LLM and against simpler memory without fine-grained credit assignment. We will add these comparisons in the revised manuscript to better demonstrate the contribution of the credit assignment mechanism.

revision: yes

Referee: Table 2 / §4.2 (Needs More Ratings results): The 82.0% helpfulness rate on posts without crowd verdicts is reported without an accompanying error analysis or breakdown by post type (e.g., multimodal vs. text-only), making it difficult to assess whether the self-evolution generalizes or merely exploits patterns already captured by the 1.2K training episodes.

Authors: We agree that an error analysis and breakdown by post type would better demonstrate generalization. In the revision, we will include a breakdown of the 82.0% rate by multimodal versus text-only posts and add an error analysis categorizing cases where EvoNote notes were not helpful.

revision: yes
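The first response promises inter-annotator agreement scores without naming a statistic. One standard choice for several raters assigning categorical labels is Fleiss' kappa; it is shown here as an assumption about what such a report might contain, not as the paper's method. Columns stand for the three possible decisions (Note 1, Note 2, Tie), and the ratings are made up.

```python
def fleiss_kappa(table):
    """table[i][j] = number of raters who put item i in category j.

    Every row must sum to the same number of raters (at least 2).
    """
    n_items = len(table)
    n_raters = sum(table[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in table):
        raise ValueError("each item needs the same number of raters (>= 2)")

    # Observed agreement: mean over items of pairwise rater agreement.
    per_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
                for row in table]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall category proportions.
    n_categories = len(table[0])
    proportions = [sum(row[j] for row in table) / (n_items * n_raters)
                   for j in range(n_categories)]
    p_expected = sum(p * p for p in proportions)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return (p_observed - p_expected) / (1 - p_expected)


# Three raters per note pair; columns are (Note 1, Note 2, Tie). Made-up data.
ratings = [
    [3, 0, 0],
    [0, 3, 0],
    [0, 0, 3],
]
print(fleiss_kappa(ratings))
```

Values near 1 indicate agreement well above chance, values near 0 indicate chance-level agreement, and negative values indicate systematic disagreement.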

Circularity Check

0 steps flagged

No significant circularity; empirical results on external benchmark.

The paper's central claims rest on direct empirical comparisons of EvoNote outputs against human-written notes and crowd-derived labels on the external MM-HealthCN benchmark of 1.2K instances. No equations, derivations, or fitted parameters are described that reduce to inputs by construction. The hierarchical utility judge is presented as an evaluation tool with human validation, but its use does not create a self-definitional loop or rename a fitted quantity as a prediction. No self-citation chains or ansatzes are invoked as load-bearing for the reported 89.6% preference or 82.0% helpfulness rates. The derivation chain is self-contained against external signals.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The framework relies on standard LLM capabilities and an external benchmark; no free parameters, axioms, or invented entities are described in the abstract.

Reference graph

Works this paper leans on

15 extracted references · 4 canonical work pages · 3 internal anchors

[1] MemGPT: Towards LLMs as Operating Systems

MemGPT: Towards LLMs as operating systems. Preprint, arXiv:2310.08560. Davide Paglieri, Bartłomiej Cupiał, Samuel Coward, Ulyana Piterbarg, Maciej Wolczyk, Akbir Khan, Eduardo Pignatelli, Łukasz Kuciński, Lerrel Pinto, Rob Fergus, Jakob Nicolaus Foerster, Jack Parker-Holder, and Tim Rocktäschel. 2025. BALROG: Benchmarking agentic LLM and VLM reasoni...

[2] RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning

RAGEN: Understanding self-evolution in LLM agents via multi-turn reinforcement learning. arXiv preprint arXiv:2504.20073. Fiona Warde, Janet Papadakos, Tina Papadakos, Danielle Rodin, Mohammad Salhia, and Meredith Giuliani. 2018. Plain language communication as a priority competency for medical professionals in a globalized world. Canadian Medical Education...

[3] Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning

OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments. Advances in Neural Information Processing Systems, 37:52040–52094. Rui Xing, Preslav Nakov, Timothy Baldwin, and Jey Han Lau. 2026. COMMUNITYNOTES: A dataset for exploring the helpfulness of fact-checking explanations. In Findings of the Association for Computationa...

[4] Understandable: explains key distinctions or necessary jargon in accessible language

[5] Meaningful: explains why the correction matters for health interpretation, risk perception, or behavior

[6] Usable: provides a safe and proportionate next step when the claim could affect user action

[7] Trustworthy: avoids overclaiming and acknowledges uncertainty or evidence limits where appropriate. Instructions:

[8] Compare the two notes criterion by criterion using the four criteria above

[9] For each criterion, choose exactly one of: Note 1, Note 2, or Tie

[10] A criterion should be marked Tie when neither note is clearly better on that criterion

[11] Then choose the better note overall, or choose Tie if neither note is clearly better overall

[12] Use the following exact format for the criterion-level decisions, replacing CHOICE with exactly one of: Note 1, Note 2, or Tie. Understandable: CHOICE Meaningful: CHOICE Usable: CHOICE Trustworthy: CHOICE

[13] End your response with exactly one of the following lines: Final decision: Note 1 Final decision: Note 2 Final decision: Tie

Figure 9: Prompt for pairwise utility judgment between two Community Notes candidates.

| Model | Macro-F1 (%) | Macro-Acc. (%) |
| --- | --- | --- |
| GPT-4.1 | 74.30 | 71.67 |
| Gemini-2.5-flash | 67.86 | 64.35 |
| Claude-Sonnet-4 | 74.58 | 72.40 |
| HealthJudge | 89.21 | 88.46 |

Table 5: Effec...

[14] Needs More Ratings

have been verified. We refer readers to Wu et al. (2025) for the original human evaluation setup, prompt details, and HealthJudge training procedure. D.3.2 Reliability of Pairwise Utility Comparison. To validate the GPT-4.1-based pairwise utility judge, we conduct a human alignment study on 100 randomly sampled note pairs. For each pair, three annotators ...

[15] search",

decision_rationale How to use retrieved memory: - Retrieved memory is not expected to answer the current factual claim directly. - Retrieved memory is strategy guidance for this type of post. - Use it to decide what kind of evidence to seek next, what kind of overclaim to watch for, and whether it is too early to finalize. - Judge memory by whether it hel...
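The output format quoted in entries [12] and [13] above (four criterion lines, then a closing "Final decision:" line) lends itself to a strict parser. The sketch below is an illustration of how such responses could be read; the function name is hypothetical and the paper's own parsing code is not shown on this page.

```python
import re

# Criterion and choice names are taken from the quoted prompt text.
CRITERIA = ("Understandable", "Meaningful", "Usable", "Trustworthy")
CHOICES = ("Note 1", "Note 2", "Tie")
_CHOICE = "|".join(re.escape(choice) for choice in CHOICES)


def parse_judgment(response):
    """Extract per-criterion choices and the final decision from a judge response.

    Raises ValueError if a criterion is missing or the response does not end
    with a final decision line.
    """
    result = {}
    for name in CRITERIA:
        match = re.search(rf"{name}:\s*({_CHOICE})\b", response)
        if match is None:
            raise ValueError(f"missing criterion: {name}")
        result[name] = match.group(1)
    final = re.search(rf"Final decision:\s*({_CHOICE})\s*$", response.strip())
    if final is None:
        raise ValueError("response does not end with a final decision line")
    result["final"] = final.group(1)
    return result


example = (
    "Understandable: Note 1\n"
    "Meaningful: Tie\n"
    "Usable: Note 2\n"
    "Trustworthy: Note 1\n"
    "Final decision: Note 1\n"
)
print(parse_judgment(example))
```

Failing loudly on a malformed response, instead of guessing a decision, keeps unparseable outputs from silently counting toward either side of a comparison.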
