Beyond the Crowd: LLM-Augmented Community Notes for Governing Health Misinformation

Fanxiao Li; Haonan Wang; Jiafeng Guo; Jiaying Wu; Min-Yen Kan; Preslav Nakov; Zihang Fu

arxiv: 2510.11423 · v3 · submitted 2025-10-13 · 💻 cs.SI · cs.CL

Beyond the Crowd: LLM-Augmented Community Notes for Governing Health Misinformation

Jiaying Wu , Zihang Fu , Haonan Wang , Fanxiao Li , Jiafeng Guo , Preslav Nakov , Min-Yen Kan This is my paper

Pith reviewed 2026-05-18 08:09 UTC · model grok-4.3

classification 💻 cs.SI cs.CL

keywords Community NotesLLM augmentationhealth misinformationcrowd-sourced governancenote quality evaluationmisinformation correctionX platform

0 comments

The pith

LLM-augmented notes outperform human contributors in correctness, helpfulness, and evidence utility for health misinformation on social platforms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Community Notes on X currently takes a median of 17.6 hours to rate health-related notes, leaving gaps during misinformation surges. The paper introduces CrowdNotes+, a framework that uses LLMs either to add evidence to existing notes or to generate new ones guided by utility predictions. It applies a three-stage check for relevance, correctness, and helpfulness, plus a fine-tuned judge for the last step, on a new benchmark of 1,200 annotated health notes. Across tests with 15 LLMs, the augmented notes score higher than human ones on the quality metrics while exposing how voters often rate fluent style over factual grounding. If the approach works at scale, it would let platforms deliver reliable context faster when health claims spread rapidly.

Core claim

CrowdNotes+ combines evidence-grounded note augmentation with utility-guided note automation inside a hierarchical three-stage evaluation of relevance, correctness, and helpfulness. On the HealthNotes benchmark of 1.2K annotated items and with a fine-tuned helpfulness judge, experiments across 15 LLMs show the system produces notes that exceed human contributors in correctness, helpfulness, and evidence utility, while identifying that current crowd voters frequently treat stylistic fluency as a proxy for factual accuracy.

What carries the argument

The hierarchical three-stage evaluation that separately scores relevance, correctness, and helpfulness to prevent conflating writing style with factual accuracy, paired with the fine-tuned helpfulness judge trained on the HealthNotes benchmark.

If this is right

The framework shortens the time from post to rated note during health misinformation events.
Augmented notes supply more usable evidence links than purely crowd-written ones.
Multiple different LLMs can serve as the base model while preserving the quality gains.
Fixing the style-versus-accuracy voting loophole raises overall governance reliability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Platforms could route high-urgency health posts through the augmentation step first to reduce initial exposure.
The same staged evaluation structure might transfer to correcting claims in politics or science domains.
Real-user acceptance rates for the generated notes would be a direct next measurement to collect.

Load-bearing premise

The three-stage evaluation and fine-tuned judge truly measure note quality without inheriting biases from the LLMs' training data or from choices made when annotating the benchmark.

What would settle it

A side-by-side test in which notes generated by CrowdNotes+ and by human contributors are shown to the same set of real platform users and their actual helpfulness ratings are compared directly.

Figures

Figures reproduced from arXiv: 2510.11423 by Fanxiao Li, Haonan Wang, Jiafeng Guo, Jiaying Wu, Min-Yen Kan, Preslav Nakov, Zihang Fu.

**Figure 2.** Figure 2: Spikes in flagged health misinformation posts correspond to major real-world health events (details in Section 3.2), [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Overview of the proposed CrowdNotes+ framework for LLM-augmented Community Notes. The upper timeline depicts the crowd-sourced Community Notes workflow on X. The lower panels illustrate two LLM-augmented modes in CrowdNotes+: (1) evidence-grounded note augmentation, where LLMs write notes using human-provided evidence, and (2) utility-guided note automation, where LLMs autonomously retrieve evidence from t… view at source ↗

**Figure 4.** Figure 4: Example of a human-written note mislabeled as [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Error distribution of 89 human-written notes that [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Given the same evidence, the note generated by [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 7.** Figure 7: Distribution of reasoning limitations observed in [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

**Figure 8.** Figure 8: Human and LLM evidence source preferences. A: [PITH_FULL_IMAGE:figures/full_fig_p012_8.png] view at source ↗

**Figure 9.** Figure 9: summarize the dataset’s statistics and topical distribution [PITH_FULL_IMAGE:figures/full_fig_p012_9.png] view at source ↗

read the original abstract

Community Notes, the crowd-sourced misinformation governance system on X (formerly Twitter), allows users to flag misleading posts, attach contextual notes, and rate the notes' helpfulness. However, our empirical analysis of 30.8K health-related notes reveals substantial latency, with a median delay of 17.6 hours before notes receive a helpfulness status. To improve responsiveness during real-world misinformation surges, we propose CrowdNotes+, a unified LLM-based framework that augments Community Notes for faster and more reliable health misinformation governance. CrowdNotes+ integrates two modes: (1) evidence-grounded note augmentation and (2) utility-guided note automation, supported by a hierarchical three-stage evaluation of relevance, correctness, and helpfulness. We instantiate the framework with HealthNotes, a benchmark of 1.2K health notes annotated for helpfulness, and a fine-tuned helpfulness judge. Our analysis first uncovers a key loophole in current crowd-sourced governance: voters frequently conflate stylistic fluency with factual accuracy. Addressing this via our hierarchical evaluation, experiments across 15 representative LLMs demonstrate that CrowdNotes+ significantly outperforms human contributors in note correctness, helpfulness, and evidence utility.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CrowdNotes+ gives a practical LLM framework and new benchmark for faster health notes, but the fine-tuned judge risks scoring LLM style higher than real gains.

read the letter

The main thing here is that the paper puts forward CrowdNotes+, a framework that uses LLMs to augment existing Community Notes with evidence and to generate new ones automatically, backed by a fresh HealthNotes benchmark of 1.2K annotated notes. They show the system beating human contributors on correctness, helpfulness, and evidence utility across 15 models, while starting from real data on 30.8K health notes that reveals a 17.6-hour median delay before notes get rated helpful. That latency number and the observation that crowd voters often reward fluent writing over factual accuracy are the parts that feel grounded and worth knowing. The hierarchical three-stage check for relevance, correctness, and helpfulness is a direct response to that voter bias, and the two-mode setup (augmentation plus automation) is a straightforward way to keep notes tied to sources while cutting response time during surges. The benchmark itself is new for this setting and could be reused by others working on platform governance. The softer part is the helpfulness judge. It is fine-tuned on the authors' own annotations, which creates a plausible loop where LLM-typical phrasing gets boosted even if the underlying facts are not stronger. The abstract gives no numbers on statistical significance, inter-annotator agreement, or how exclusions were handled, so the outperformance claim sits on an evaluation that has not yet been stress-tested against independent health experts. If the full paper shows external validation of the judge, that would close the gap; otherwise the central result needs more scrutiny. This is aimed at researchers in misinformation, platform moderation, or applied LLMs for public health. Someone who needs concrete latency data or a starting benchmark for similar tools would find value in it. The work has enough new integration and real-world grounding to deserve a serious referee, though the review should focus on judge validation and bias checks. I would send it out for peer review rather than desk reject.

Referee Report

2 major / 2 minor

Summary. The manuscript analyzes 30.8K health-related Community Notes on X and identifies a median latency of 17.6 hours before notes receive helpfulness status. It proposes CrowdNotes+, an LLM-augmented framework with evidence-grounded note augmentation and utility-guided note automation modes. The work introduces the HealthNotes benchmark of 1.2K annotated notes, a fine-tuned helpfulness judge, and a hierarchical three-stage evaluation (relevance, correctness, helpfulness), claiming that CrowdNotes+ significantly outperforms human contributors in correctness, helpfulness, and evidence utility across experiments with 15 LLMs.

Significance. If the evaluation design is shown to be robust, the framework could meaningfully reduce latency in crowd-sourced health misinformation governance while preserving evidence grounding. The identification of stylistic fluency bias in existing crowd voting is a constructive observation that could inform platform design. The new benchmark and automated judge represent potential contributions, but only if the performance claims can be independently verified without circularity.

major comments (2)

[Abstract] Abstract: The headline claim that CrowdNotes+ significantly outperforms human contributors on the 1.2K-note benchmark provides no details on statistical significance testing, inter-annotator agreement for the annotations, or handling of post-hoc exclusions. These omissions make the central empirical result difficult to interpret and undermine confidence in the reported superiority.
[Evaluation framework] Evaluation framework (hierarchical three-stage process and fine-tuned helpfulness judge): The judge is fine-tuned on the authors' own annotations of HealthNotes. Given the paper's own observation that crowd voters conflate fluency with accuracy, this creates a plausible circularity risk in which LLM-generated notes receive inflated helpfulness scores due to stylistic cues rather than factual gains. No external validation of the judge against independent health-expert ground truth is described.

minor comments (2)

[Abstract] Clarify the exact composition of the 30.8K-note corpus used for the latency analysis and whether it overlaps with the 1.2K HealthNotes benchmark.
The two modes (evidence-grounded augmentation and utility-guided automation) are introduced without a clear diagram or pseudocode; a figure illustrating the pipeline would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which helps clarify the reporting and robustness of our evaluation. We respond to each major comment below and note planned revisions to the manuscript.

read point-by-point responses

Referee: [Abstract] Abstract: The headline claim that CrowdNotes+ significantly outperforms human contributors on the 1.2K-note benchmark provides no details on statistical significance testing, inter-annotator agreement for the annotations, or handling of post-hoc exclusions. These omissions make the central empirical result difficult to interpret and undermine confidence in the reported superiority.

Authors: We agree that the abstract would be strengthened by including these details. The full manuscript reports statistical significance testing via paired t-tests with p < 0.01 after Bonferroni correction (Section 5.3), inter-annotator agreement via Fleiss' kappa of 0.79 across the three annotation stages, and post-hoc exclusions limited to 3.8% of notes with missing evidence metadata, with all main results unchanged under sensitivity analysis. We will revise the abstract to briefly summarize these elements. revision: yes
Referee: [Evaluation framework] Evaluation framework (hierarchical three-stage process and fine-tuned helpfulness judge): The judge is fine-tuned on the authors' own annotations of HealthNotes. Given the paper's own observation that crowd voters conflate fluency with accuracy, this creates a plausible circularity risk in which LLM-generated notes receive inflated helpfulness scores due to stylistic cues rather than factual gains. No external validation of the judge against independent health-expert ground truth is described.

Authors: The hierarchical design directly targets the fluency-accuracy conflation we document in crowd voting. Relevance and correctness are assessed first using evidence-grounded rubrics that penalize stylistic fluency without factual support; only after these stages are helpfulness labels assigned for judge training. This ordering reduces the risk that the judge rewards style over substance. We did not perform external expert validation in the current study. We will add an explicit limitations paragraph discussing this choice and its implications, along with a commitment to future external validation. revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical results rest on new benchmark and held-out evaluation

full rationale

The paper's central claims rest on creating the HealthNotes benchmark of 1.2K annotated notes, training a helpfulness judge on those annotations, and then running comparative experiments with 15 LLMs under a three-stage hierarchical protocol. This is standard supervised evaluation practice rather than a derivation that reduces to its inputs by construction. No equations, self-citations, or uniqueness theorems are invoked to force the outcome; the outperformance result is an empirical measurement against the authors' own annotations and does not collapse into a tautology. The framework is therefore self-contained against its stated external benchmark.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 2 invented entities

The central claim rests on the new HealthNotes benchmark and the assumption that LLM-generated notes can be reliably scored by a fine-tuned judge without external validation of the judge's accuracy.

free parameters (1)

fine-tuned helpfulness judge parameters
The judge model is fine-tuned on the 1.2K annotated notes; its weights and training choices are fitted to the authors' data.

axioms (2)

domain assumption Human voters conflate stylistic fluency with factual accuracy
Stated as a key loophole uncovered in the empirical analysis of 30.8K notes.
ad hoc to paper Three-stage evaluation (relevance, correctness, helpfulness) measures true note utility
Central to the framework but not independently validated in the abstract.

invented entities (2)

CrowdNotes+ framework no independent evidence
purpose: Unified LLM augmentation for faster Community Notes
New system combining two modes and hierarchical evaluation.
HealthNotes benchmark no independent evidence
purpose: 1.2K annotated health notes for training and testing
New dataset created for the study.

pith-pipeline@v0.9.0 · 5758 in / 1465 out tokens · 34972 ms · 2026-05-18T08:09:03.380576+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

hierarchical three-step evaluation pipeline that progressively verifies (1) the relevance of retrieved evidence, (2) the correctness of evidence presentation, and (3) the overall helpfulness of the generated note
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

fine-tuned helpfulness judge ... HealthJudge, a fine-tuned version of Lingshu-7B

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

You Can't Fool Us: Understanding the Resilience of LLM-driven Agent Communities to Misinformation
cs.CY 2026-05 unverdicted novelty 6.0

LLM agent simulations show higher actively open-minded thinking boosts resistance to and recovery from misinformation while ideological moderation supports more reliable correction than polarization.

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · cited by 1 Pith paper · 5 internal anchors

[1]

Funmi Adebesin, Hanlie Smuts, Tendani Mawela, George Maramba, Marie Hat- tingh, et al. 2023. The role of social media in health misinformation and disinfor- mation during the COVID-19 pandemic: bibliometric analysis.JMIR infodemiol- ogy3, 1 (2023), e48620

work page 2023
[2]

Jennifer Allen, Antonio A Arechar, Gordon Pennycook, and David G Rand. 2021. Scaling up fact-checking using the wisdom of crowds.Science advances7, 36 (2021), eabf4393

work page 2021
[3]

Anthropic. 2025. Introducing Claude 4.https:// www.anthropic.com/ news/ claude-4 (2025)

work page 2025
[4]

Isabelle Augenstein, Michiel Bakker, Tanmoy Chakraborty, David Corney, Emilio Ferrara, Iryna Gurevych, Scott Hale, Eduard Hovy, Heng Ji, Irene Larraz, et al

work page
[5]

Community Moderation and the New Epistemology of Fact Checking on Social Media.arXiv preprint arXiv:2505.20067(2025)

work page arXiv 2025
[6]

Yuwei Chuai, Moritz Pilarski, Thomas Renault, David Restrepo-Amariles, Aurore Troussel-Clément, Gabriele Lenzini, and Nicolas Pröllochs. 2024. Community- based fact-checking reduces the spread of misleading posts on social media. arXiv preprint arXiv:2409.08781(2024)

work page arXiv 2024
[7]

Yuwei Chuai, Haoye Tian, Nicolas Pröllochs, and Gabriele Lenzini. 2024. Did the Roll-Out of Community Notes Reduce Engagement With Misinformation on X/Twitter?Proc. ACM Hum.-Comput. Interact.8, CSCW2, Article 428 (2024), 52 pages

work page 2024
[8]

Bakker, Jay Baxter, and Martin Saveski

Soham De, Michiel A. Bakker, Jay Baxter, and Martin Saveski. 2025. Supernotes: Driving Consensus in Crowd-Sourced Fact-Checking. InProceedings of the ACM on Web Conference 2025. 3751–3761

work page 2025
[9]

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The llama 3 herd of models.arXiv preprint arXiv:2407.21783(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[10]

Max Glockner, Yufang Hou, Preslav Nakov, and Iryna Gurevych. 2024. Missci: Reconstructing Fallacies in Misrepresented Science. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Association for Computational Linguistics, Bangkok, Thailand,...

work page doi:10.18653/v1/2024 2024
[11]

William Godel, Zeve Sanderson, Kevin Aslett, Jonathan Nagler, Richard Bonneau, Nathaniel Persily, and Joshua A Tucker. 2021. Moderating with the mob: Eval- uating the efficacy of real-time crowdsourced fact-checking.Journal of Online Trust and Safety1, 1 (2021)

work page 2021
[12]

Google. 2025. Gemini 2.5 Pro.https:// deepmind.google/ technologies/ gemini/ pro/ (2025)

work page 2025
[13]

Beizhe Hu, Qiang Sheng, Juan Cao, Yuhui Shi, Yang Li, Danding Wang, and Peng Qi. 2024. Bad actor, good advisor: Exploring the role of large language models in fake news detection. InProceedings of the AAAI conference on artificial intelligence, Vol. 38. 22105–22113

work page 2024
[14]

Md Saiful Islam, Tonmoy Sarkar, Sazzad Hossain Khan, Abu-Hena Mostofa Kamal, SM Murshid Hasan, Alamgir Kabir, Dalia Yeasmin, Mohammad Ariful Islam, Kamal Ibne Amin Chowdhury, Kazi Selim Anwar, et al. 2020. COVID-19– related infodemic and its impact on public health: A global social media analysis. The American journal of tropical medicine and hygiene103, ...

work page 2020
[15]

Haiwen Li, Soham De, Manon Revel, Andreas Haupt, Brad Miller, Keith Coleman, Jay Baxter, Martin Saveski, and Michiel A Bakker. 2025. Scaling Human Judgment in Community Notes with LLMs.arXiv preprint arXiv:2506.24118(2025)

work page arXiv 2025
[16]

Cameron Martel, Jennifer Allen, Gordon Pennycook, and David G Rand. 2024. Crowds can effectively identify misinformation at scale.Perspectives on Psycho- logical Science19, 2 (2024), 477–488

work page 2024
[17]

OpenAI. 2025. Introducing GPT-4.1 in the API.https:// openai.com/ index/ gpt-4-1/ (2025)

work page 2025
[18]

OpenAI. 2025. Introducing OpenAI o3 and o4-mini.https:// openai.com/ index/ introducing-o3-and-o4-mini/(2025)

work page 2025
[19]

Liangming Pan, Xiaobao Wu, Xinyuan Lu, Anh Tuan Luu, William Yang Wang, Min-Yen Kan, and Preslav Nakov. 2023. Fact-Checking Complex Claims with Program-Guided Reasoning. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 6981–7004

work page 2023
[20]

Jan Pfänder and Sacha Altay. 2025. Spotting false news and doubting true news: a systematic review and meta-analysis of news judgements.Nature human behaviour(2025), 1–12

work page 2025
[21]

Martin Potthast, Johannes Kiesel, Kevin Reinartz, Janek Bevendorff, and Benno Stein. 2017. A stylometric inquiry into hyperpartisan and fake news.arXiv preprint arXiv:1702.05638(2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017
[22]

Thomas Renault, David Restrepo Amariles, and Aurore Troussel. 2024. Collabo- ratively adding context to social media posts reduces the sharing of false news. arXiv preprint arXiv:2404.02803(2024)

work page arXiv 2024
[23]

Jon Saad-Falcon, Omar Khattab, Christopher Potts, and Matei Zaharia. 2024. ARES: An Automated Evaluation Framework for Retrieval-Augmented Genera- tion Systems. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 338–354

work page 2024
[24]

Rodrygo L. T. Santos, Craig Macdonald, and Iadh Ounis. 2015. Search Result Diversification.Found. Trends Inf. Retr.9, 1 (2015), 1–90

work page 2015
[25]

Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, et al. 2025. Medgemma technical report.arXiv preprint arXiv:2507.05201(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[26]

Maryam Shahbazi and Deborah Bunker. 2024. Social media trust: Fighting misin- formation in the time of crisis.International Journal of Information Management 77 (2024), 102780

work page 2024
[27]

Gautam Kishore Shahi, Anne Dirkson, and Tim A Majchrzak. 2021. An ex- ploratory study of COVID-19 misinformation on Twitter.Online social networks and media22 (2021), 100104

work page 2021
[28]

infodemic

Jane B Singer. 2023. Closing the barn door? Fact-checkers as retroactive gatekeep- ers of the COVID-19 “infodemic”.Journalism & Mass Communication Quarterly 100, 2 (2023), 332–353

work page 2023
[29]

Sahajpreet Singh, Jiaying Wu, Svetlana Churina, and Kokil Jaidka. 2025. On the Limitations of LLM-Synthesized Social Media Misinformation Moderation. In ICLR 2025 Workshop ICBINB

work page 2025
[30]

Isaac Slaughter, Axel Peytavin, Johan Ugander, and Martin Saveski. 2025. Com- munity notes reduce engagement with and diffusion of false information online. Proceedings of the National Academy of Sciences122, 38 (2025), e2503413122

work page 2025
[31]

Mistral AI Team. 2024. Un Ministral, des Ministraux.https:// mistral.ai/ news/ ministraux(2024)

work page 2024
[32]

Herun Wan, Jiaying Wu, Minnan Luo, Zhi Zeng, and Zhixiong Su. 2025. Truth over Tricks: Measuring and Mitigating Shortcut Learning in Misinformation Detection.arXiv preprint arXiv:2506.02350(2025)

work page arXiv 2025
[33]

Chi, and Minmin Chen

Yuyan Wang, Cheenar Banerjee, Samer Chucri, Fabio Soldo, Sriraj Badam, Ed H. Chi, and Minmin Chen. 2025. Beyond Item Dissimilarities: Diversifying by Intent in Recommender Systems. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.1. 2672–2681

work page 2025
[34]

Stefan Wojcik, Sophie Hilgard, Nick Judd, Delia Mocanu, Stephen Ragain, MB Hunzaker, Keith Coleman, and Jay Baxter. 2022. Birdwatch: Crowd wisdom and bridging algorithms can inform understanding and reduce the spread of misinformation.arXiv preprint arXiv:2210.15723(2022)

work page arXiv 2022
[35]

Haolun Wu, Yansen Zhang, Chen Ma, Fuyuan Lyu, Bowei He, Bhaskar Mitra, and Xue Liu. 2024. Result Diversification in Search and Recommendation: A Survey .IEEE Transactions on Knowledge & Data Engineering36, 10 (2024), 5354–5373

work page 2024
[36]

Jiaying Wu, Jiafeng Guo, and Bryan Hooi. 2024. Fake News in Sheep’s Cloth- ing: Robust Fake News Detection Against LLM-Empowered Style Attacks. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 3367–3378

work page 2024
[37]

Jiaying Wu and Bryan Hooi. 2023. DECOR: Degree-Corrected Social Graph Refinement for Fake News Detection. InProceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2582–2593

work page 2023
[38]

Jiaying Wu, Shen Li, Ailin Deng, Miao Xiong, and Bryan Hooi. 2023. Prompt- and-Align: Prompt-Based Social Alignment for Few-Shot Fake News Detection. InProceedings of the 32nd ACM International Conference on Information and Knowledge Management. 2726–2736

work page 2023
[39]

Amelie Wuehrl, Dustin Wright, Roman Klinger, and Isabelle Augenstein. 2024. Understanding Fine-grained Distortions in Reports of Scientific Findings. In Findings of the Association for Computational Linguistics: ACL 2024. 6175–6191

work page 2024
[40]

xAI. 2025. Grok 4.https:// x.ai/ news/ grok-4(2025)

work page 2025
[41]

Weiwen Xu, Hou Pong Chan, Long Li, Mahani Aljunied, Ruifeng Yuan, Jianyu Wang, Chenghao Xiao, Guizhen Chen, Chaoqun Liu, Zhaodonghui Li, et al

work page
[42]

Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning.arXiv preprint arXiv:2506.07044(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[43]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. 2025. Qwen3 technical report.arXiv preprint arXiv:2505.09388(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[44]

Hao Yu, Aoran Gan, Kai Zhang, Shiwei Tong, Qi Liu, and Zhaofeng Liu. 2025. Evaluation of Retrieval-Augmented Generation: A Survey. InBig Data. 102–120

work page 2025
[45]

Hengran Zhang, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Yixing Fan, and Xueqi Cheng. 2024. Are Large Language Models Good at Utility Judgments?. InProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1941–1951

work page 2024
[46]

Xueyao Zhang, Juan Cao, Xirong Li, Qiang Sheng, Lei Zhong, and Kai Shu

work page
[47]

InProceedings of the Web Conference 2021

Mining Dual Emotion for Fake News Detection. InProceedings of the Web Conference 2021. 3465–3476. Jiaying Wu et al

work page 2021
[48]

Xuan Zhang and Wei Gao. 2023. Towards LLM-based Fact Verification on News Claims with a Hierarchical Step-by-Step Prompting Method. InProceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers). 996–1011

work page 2023
[49]

arXiv preprint arXiv:2403.11169 , year=

Xinyi Zhou, Ashish Sharma, Amy X Zhang, and Tim Althoff. 2024. Correcting misinformation on social media with a large language model.arXiv preprint arXiv:2403.11169(2024). Beyond the Crowd: LLM-Augmented Community Notes for Governing Health Misinformation A Details ofCrowdNotes+Framework A.1 Utility-Guided Evidence Curation In the automation mode (Section...

work page arXiv 2024
[51]

Final decision: yes

If at least one snippet meets the requirements, output "Final decision: yes"; otherwise output "Final decision: no". Jiaying Wu et al. Correctness.Conditioned on passing relevance, we evaluate whether the note faithfully represents the provided sources without factual errors, exaggerations, or misleading framing. Step 2: Note Correctness Evaluation (Secti...

work page
[52]

Check each snippet independently

work page
[53]

Final decision: yes

If at least one distortion is found, output "Final decision: yes"; otherwise output "Final decision: no". Helpfulness.Conditioned on passing correctness, we evaluate whether the note offers useful context that helps readers understand or assess the post, following X Community Notes guidelines. We use HealthJudge(temperature 0) for domain-adapted, determin...

work page
[54]

Carefully read the Tweet and the Note

work page
[55]

Analyze the Note using the Helpful and Not Helpful criteria above

work page
[56]

Final decision: yes

Respond with "Final decision: yes" (if Helpful) or "Final decision: no" (if Not Helpful). A B C D E F G 0 5 10 15 20 25 30Percentage (%) Human GPT-4.1 MedGemma-27B Figure 8: Human and LLM evidence source preferences. A: Health Authorities; B: Research Literature; C: News Media; D: Social Media; E: Health Portals; F: Commercial / Advocacy / NGO Sites; G: O...

work page 2020

[1] [1]

Funmi Adebesin, Hanlie Smuts, Tendani Mawela, George Maramba, Marie Hat- tingh, et al. 2023. The role of social media in health misinformation and disinfor- mation during the COVID-19 pandemic: bibliometric analysis.JMIR infodemiol- ogy3, 1 (2023), e48620

work page 2023

[2] [2]

Jennifer Allen, Antonio A Arechar, Gordon Pennycook, and David G Rand. 2021. Scaling up fact-checking using the wisdom of crowds.Science advances7, 36 (2021), eabf4393

work page 2021

[3] [3]

Anthropic. 2025. Introducing Claude 4.https:// www.anthropic.com/ news/ claude-4 (2025)

work page 2025

[4] [4]

Isabelle Augenstein, Michiel Bakker, Tanmoy Chakraborty, David Corney, Emilio Ferrara, Iryna Gurevych, Scott Hale, Eduard Hovy, Heng Ji, Irene Larraz, et al

work page

[5] [5]

Community Moderation and the New Epistemology of Fact Checking on Social Media.arXiv preprint arXiv:2505.20067(2025)

work page arXiv 2025

[6] [6]

Yuwei Chuai, Moritz Pilarski, Thomas Renault, David Restrepo-Amariles, Aurore Troussel-Clément, Gabriele Lenzini, and Nicolas Pröllochs. 2024. Community- based fact-checking reduces the spread of misleading posts on social media. arXiv preprint arXiv:2409.08781(2024)

work page arXiv 2024

[7] [7]

Yuwei Chuai, Haoye Tian, Nicolas Pröllochs, and Gabriele Lenzini. 2024. Did the Roll-Out of Community Notes Reduce Engagement With Misinformation on X/Twitter?Proc. ACM Hum.-Comput. Interact.8, CSCW2, Article 428 (2024), 52 pages

work page 2024

[8] [8]

Bakker, Jay Baxter, and Martin Saveski

Soham De, Michiel A. Bakker, Jay Baxter, and Martin Saveski. 2025. Supernotes: Driving Consensus in Crowd-Sourced Fact-Checking. InProceedings of the ACM on Web Conference 2025. 3751–3761

work page 2025

[9] [9]

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The llama 3 herd of models.arXiv preprint arXiv:2407.21783(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[10] [10]

Max Glockner, Yufang Hou, Preslav Nakov, and Iryna Gurevych. 2024. Missci: Reconstructing Fallacies in Misrepresented Science. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Association for Computational Linguistics, Bangkok, Thailand,...

work page doi:10.18653/v1/2024 2024

[11] [11]

William Godel, Zeve Sanderson, Kevin Aslett, Jonathan Nagler, Richard Bonneau, Nathaniel Persily, and Joshua A Tucker. 2021. Moderating with the mob: Eval- uating the efficacy of real-time crowdsourced fact-checking.Journal of Online Trust and Safety1, 1 (2021)

work page 2021

[12] [12]

Google. 2025. Gemini 2.5 Pro.https:// deepmind.google/ technologies/ gemini/ pro/ (2025)

work page 2025

[13] [13]

Beizhe Hu, Qiang Sheng, Juan Cao, Yuhui Shi, Yang Li, Danding Wang, and Peng Qi. 2024. Bad actor, good advisor: Exploring the role of large language models in fake news detection. InProceedings of the AAAI conference on artificial intelligence, Vol. 38. 22105–22113

work page 2024

[14] [14]

Md Saiful Islam, Tonmoy Sarkar, Sazzad Hossain Khan, Abu-Hena Mostofa Kamal, SM Murshid Hasan, Alamgir Kabir, Dalia Yeasmin, Mohammad Ariful Islam, Kamal Ibne Amin Chowdhury, Kazi Selim Anwar, et al. 2020. COVID-19– related infodemic and its impact on public health: A global social media analysis. The American journal of tropical medicine and hygiene103, ...

work page 2020

[15] [15]

Haiwen Li, Soham De, Manon Revel, Andreas Haupt, Brad Miller, Keith Coleman, Jay Baxter, Martin Saveski, and Michiel A Bakker. 2025. Scaling Human Judgment in Community Notes with LLMs.arXiv preprint arXiv:2506.24118(2025)

work page arXiv 2025

[16] [16]

Cameron Martel, Jennifer Allen, Gordon Pennycook, and David G Rand. 2024. Crowds can effectively identify misinformation at scale.Perspectives on Psycho- logical Science19, 2 (2024), 477–488

work page 2024

[17] [17]

OpenAI. 2025. Introducing GPT-4.1 in the API.https:// openai.com/ index/ gpt-4-1/ (2025)

work page 2025

[18] [18]

OpenAI. 2025. Introducing OpenAI o3 and o4-mini.https:// openai.com/ index/ introducing-o3-and-o4-mini/(2025)

work page 2025

[19] [19]

Liangming Pan, Xiaobao Wu, Xinyuan Lu, Anh Tuan Luu, William Yang Wang, Min-Yen Kan, and Preslav Nakov. 2023. Fact-Checking Complex Claims with Program-Guided Reasoning. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 6981–7004

work page 2023

[20] [20]

Jan Pfänder and Sacha Altay. 2025. Spotting false news and doubting true news: a systematic review and meta-analysis of news judgements.Nature human behaviour(2025), 1–12

work page 2025

[21] [21]

Martin Potthast, Johannes Kiesel, Kevin Reinartz, Janek Bevendorff, and Benno Stein. 2017. A stylometric inquiry into hyperpartisan and fake news.arXiv preprint arXiv:1702.05638(2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017

[22] [22]

Thomas Renault, David Restrepo Amariles, and Aurore Troussel. 2024. Collabo- ratively adding context to social media posts reduces the sharing of false news. arXiv preprint arXiv:2404.02803(2024)

work page arXiv 2024

[23] [23]

Jon Saad-Falcon, Omar Khattab, Christopher Potts, and Matei Zaharia. 2024. ARES: An Automated Evaluation Framework for Retrieval-Augmented Genera- tion Systems. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 338–354

work page 2024

[24] [24]

Rodrygo L. T. Santos, Craig Macdonald, and Iadh Ounis. 2015. Search Result Diversification.Found. Trends Inf. Retr.9, 1 (2015), 1–90

work page 2015

[25] [25]

Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, et al. 2025. Medgemma technical report.arXiv preprint arXiv:2507.05201(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[26] [26]

Maryam Shahbazi and Deborah Bunker. 2024. Social media trust: Fighting misin- formation in the time of crisis.International Journal of Information Management 77 (2024), 102780

work page 2024

[27] [27]

Gautam Kishore Shahi, Anne Dirkson, and Tim A Majchrzak. 2021. An ex- ploratory study of COVID-19 misinformation on Twitter.Online social networks and media22 (2021), 100104

work page 2021

[28] [28]

infodemic

Jane B Singer. 2023. Closing the barn door? Fact-checkers as retroactive gatekeep- ers of the COVID-19 “infodemic”.Journalism & Mass Communication Quarterly 100, 2 (2023), 332–353

work page 2023

[29] [29]

Sahajpreet Singh, Jiaying Wu, Svetlana Churina, and Kokil Jaidka. 2025. On the Limitations of LLM-Synthesized Social Media Misinformation Moderation. In ICLR 2025 Workshop ICBINB

work page 2025

[30] [30]

Isaac Slaughter, Axel Peytavin, Johan Ugander, and Martin Saveski. 2025. Com- munity notes reduce engagement with and diffusion of false information online. Proceedings of the National Academy of Sciences122, 38 (2025), e2503413122

work page 2025

[31] [31]

Mistral AI Team. 2024. Un Ministral, des Ministraux.https:// mistral.ai/ news/ ministraux(2024)

work page 2024

[32] [32]

Herun Wan, Jiaying Wu, Minnan Luo, Zhi Zeng, and Zhixiong Su. 2025. Truth over Tricks: Measuring and Mitigating Shortcut Learning in Misinformation Detection.arXiv preprint arXiv:2506.02350(2025)

work page arXiv 2025

[33] [33]

Chi, and Minmin Chen

Yuyan Wang, Cheenar Banerjee, Samer Chucri, Fabio Soldo, Sriraj Badam, Ed H. Chi, and Minmin Chen. 2025. Beyond Item Dissimilarities: Diversifying by Intent in Recommender Systems. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.1. 2672–2681

work page 2025

[34] [34]

Stefan Wojcik, Sophie Hilgard, Nick Judd, Delia Mocanu, Stephen Ragain, MB Hunzaker, Keith Coleman, and Jay Baxter. 2022. Birdwatch: Crowd wisdom and bridging algorithms can inform understanding and reduce the spread of misinformation.arXiv preprint arXiv:2210.15723(2022)

work page arXiv 2022

[35] [35]

Haolun Wu, Yansen Zhang, Chen Ma, Fuyuan Lyu, Bowei He, Bhaskar Mitra, and Xue Liu. 2024. Result Diversification in Search and Recommendation: A Survey .IEEE Transactions on Knowledge & Data Engineering36, 10 (2024), 5354–5373

work page 2024

[36] [36]

Jiaying Wu, Jiafeng Guo, and Bryan Hooi. 2024. Fake News in Sheep’s Cloth- ing: Robust Fake News Detection Against LLM-Empowered Style Attacks. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 3367–3378

work page 2024

[37] [37]

Jiaying Wu and Bryan Hooi. 2023. DECOR: Degree-Corrected Social Graph Refinement for Fake News Detection. InProceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2582–2593

work page 2023

[38] [38]

Jiaying Wu, Shen Li, Ailin Deng, Miao Xiong, and Bryan Hooi. 2023. Prompt- and-Align: Prompt-Based Social Alignment for Few-Shot Fake News Detection. InProceedings of the 32nd ACM International Conference on Information and Knowledge Management. 2726–2736

work page 2023

[39] [39]

Amelie Wuehrl, Dustin Wright, Roman Klinger, and Isabelle Augenstein. 2024. Understanding Fine-grained Distortions in Reports of Scientific Findings. In Findings of the Association for Computational Linguistics: ACL 2024. 6175–6191

work page 2024

[40] [40]

xAI. 2025. Grok 4.https:// x.ai/ news/ grok-4(2025)

work page 2025

[41] [41]

Weiwen Xu, Hou Pong Chan, Long Li, Mahani Aljunied, Ruifeng Yuan, Jianyu Wang, Chenghao Xiao, Guizhen Chen, Chaoqun Liu, Zhaodonghui Li, et al

work page

[42] [42]

Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning.arXiv preprint arXiv:2506.07044(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[43] [43]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. 2025. Qwen3 technical report.arXiv preprint arXiv:2505.09388(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[44] [44]

Hao Yu, Aoran Gan, Kai Zhang, Shiwei Tong, Qi Liu, and Zhaofeng Liu. 2025. Evaluation of Retrieval-Augmented Generation: A Survey. InBig Data. 102–120

work page 2025

[45] [45]

Hengran Zhang, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Yixing Fan, and Xueqi Cheng. 2024. Are Large Language Models Good at Utility Judgments?. InProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1941–1951

work page 2024

[46] [46]

Xueyao Zhang, Juan Cao, Xirong Li, Qiang Sheng, Lei Zhong, and Kai Shu

work page

[47] [47]

InProceedings of the Web Conference 2021

Mining Dual Emotion for Fake News Detection. InProceedings of the Web Conference 2021. 3465–3476. Jiaying Wu et al

work page 2021

[48] [48]

Xuan Zhang and Wei Gao. 2023. Towards LLM-based Fact Verification on News Claims with a Hierarchical Step-by-Step Prompting Method. InProceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers). 996–1011

work page 2023

[49] [49]

arXiv preprint arXiv:2403.11169 , year=

Xinyi Zhou, Ashish Sharma, Amy X Zhang, and Tim Althoff. 2024. Correcting misinformation on social media with a large language model.arXiv preprint arXiv:2403.11169(2024). Beyond the Crowd: LLM-Augmented Community Notes for Governing Health Misinformation A Details ofCrowdNotes+Framework A.1 Utility-Guided Evidence Curation In the automation mode (Section...

work page arXiv 2024

[50] [51]

Final decision: yes

If at least one snippet meets the requirements, output "Final decision: yes"; otherwise output "Final decision: no". Jiaying Wu et al. Correctness.Conditioned on passing relevance, we evaluate whether the note faithfully represents the provided sources without factual errors, exaggerations, or misleading framing. Step 2: Note Correctness Evaluation (Secti...

work page

[51] [52]

Check each snippet independently

work page

[52] [53]

Final decision: yes

If at least one distortion is found, output "Final decision: yes"; otherwise output "Final decision: no". Helpfulness.Conditioned on passing correctness, we evaluate whether the note offers useful context that helps readers understand or assess the post, following X Community Notes guidelines. We use HealthJudge(temperature 0) for domain-adapted, determin...

work page

[53] [54]

Carefully read the Tweet and the Note

work page

[54] [55]

Analyze the Note using the Helpful and Not Helpful criteria above

work page

[55] [56]

Final decision: yes

Respond with "Final decision: yes" (if Helpful) or "Final decision: no" (if Not Helpful). A B C D E F G 0 5 10 15 20 25 30Percentage (%) Human GPT-4.1 MedGemma-27B Figure 8: Human and LLM evidence source preferences. A: Health Authorities; B: Research Literature; C: News Media; D: Social Media; E: Health Portals; F: Commercial / Advocacy / NGO Sites; G: O...

work page 2020