AI Fact-Checking in the Wild: A Field Evaluation of LLM-Written Community Notes on X

Haiwen Li; Michiel A. Bakker

arxiv: 2604.02592 · v2 · submitted 2026-04-03 · 💻 cs.CY

Pith reviewed 2026-05-13 19:17 UTC · model grok-4.3

classification 💻 cs.CY

keywords LLM fact-checking · Community Notes · X platform · AI evaluation · social media · helpfulness ratings · field study

The pith

LLM-written notes on X receive significantly higher helpfulness scores than human notes when rater exposure is equalized.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether large language models can write effective fact-checking notes in the authentic environment of X's Community Notes system rather than in isolated benchmarks. An LLM pipeline was deployed live for three months to produce 1,614 notes on 1,597 tweets, which were then compared to 1,332 human notes on the same posts through 108,169 real-user ratings. Two complementary analyses were applied to handle uneven timing and visibility: one modeling individual rater decisions and one restricting comparison to raters who saw every note on a given post. Both approaches show LLM notes receiving more positive evaluations, with the note-level analysis confirming significantly higher helpfulness scores for the AI versions.

Core claim

LLM-written notes achieve significantly higher helpfulness scores than human-written notes among raters who evaluated all notes on the same post, after applying controls for submission timing and rating exposure differences.

What carries the argument

A multi-step LLM pipeline that processes multimodal tweet content, performs web and platform searches, and generates contextual notes, evaluated through rating-level modeling and note-level equalization of rater exposure.
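
The note-level equalization of rater exposure can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `complete_rater_scores`, the tuple layout, and the use of a simple share of helpful ratings as the score are all assumptions made here, and the toy data is invented.

```python
from collections import defaultdict


def complete_rater_scores(ratings, notes_by_post):
    """Share of helpful ratings per note, using only 'complete' raters.

    ratings: iterable of (rater_id, post_id, note_id, is_helpful) tuples.
    notes_by_post: dict mapping post_id -> set of note_ids on that post.
    A rater counts as complete for a post if they rated every note on it.
    (Hypothetical data layout, chosen for illustration.)
    """
    # (rater, post) -> {note_id: is_helpful}
    seen = defaultdict(dict)
    for rater_id, post_id, note_id, is_helpful in ratings:
        seen[(rater_id, post_id)][note_id] = bool(is_helpful)

    helpful = defaultdict(int)
    total = defaultdict(int)
    for (rater_id, post_id), rated in seen.items():
        # Drop raters who did not rate all notes on this post.
        if set(rated) != notes_by_post.get(post_id):
            continue
        for note_id, is_helpful in rated.items():
            total[note_id] += 1
            helpful[note_id] += is_helpful

    return {note_id: helpful[note_id] / total[note_id] for note_id in total}


# Toy data, invented for illustration only (not from the paper).
notes_by_post = {"p1": {"llm_1", "human_1"}}
ratings = [
    ("r1", "p1", "llm_1", True),
    ("r1", "p1", "human_1", False),
    ("r2", "p1", "llm_1", True),
    ("r2", "p1", "human_1", True),
    ("r3", "p1", "llm_1", False),  # r3 never rated human_1, so r3 is dropped
]
scores = complete_rater_scores(ratings, notes_by_post)
```

Because every retained rater saw both notes, any difference in the two scores cannot come from different rater pools, although it can still reflect when each note was submitted.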

If this is right

LLM notes receive more positive ratings across raters with different political viewpoints.
Equalizing rater exposure reveals a clear helpfulness advantage for LLM notes on the same posts.
Real-world platform timing and visibility must be modeled explicitly when evaluating AI contributions.
AI systems can generate fact-checking content that supports cross-partisan consensus at scale.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Other platforms with community moderation could test similar LLM writers to measure comparable performance gains.
The higher scores may arise from more consistent evidence synthesis or neutral phrasing that appeals across viewpoints.
Longer-term deployment could reveal whether LLM notes sustain their advantage as users adapt to AI-generated content.

Load-bearing premise

The two analysis strategies sufficiently control for differences in submission timing and rating exposure between LLM and human notes.

What would settle it

A controlled test in which LLM and human notes on identical posts are submitted at the same time and rated by the exact same set of users, showing no significant difference in helpfulness scores.

Figures

Figures reproduced from arXiv: 2604.02592 by Haiwen Li, Michiel A. Bakker.

**Figure 1.** Mean % helpful and % unhelpful ratings per note for LLM and human notes, stratified by rater ideology group (left, neutral, right). Error bars show 95% confidence intervals across notes.

**Figure 2.** LLM vs. human note rating advantage (AI main-effect coefficient with 95% CI) by tweet modality (text-only, image, video).

**Figure 3.** LLM vs. human note rating advantage (AI main-effect coefficient with 95% CI) by tweet topic category.
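
For readers who want to reproduce the kind of summary shown in Figure 1, here is a rough sketch of a mean with a 95% confidence interval across notes. It assumes a normal approximation (mean plus or minus 1.96 standard errors); the page does not say how the paper computed its intervals, and `mean_with_ci95` and the sample values are hypothetical.

```python
import math
import statistics


def mean_with_ci95(values):
    """Return (mean, lower, upper) using a normal-approximation 95% CI.

    values: per-note percentages (e.g. % helpful) within one rater group.
    Assumption: notes are treated as independent observations.
    """
    n = len(values)
    mean = statistics.fmean(values)
    if n < 2:
        # A single observation gives no spread estimate.
        return mean, mean, mean
    std_err = statistics.stdev(values) / math.sqrt(n)
    half_width = 1.96 * std_err
    return mean, mean - half_width, mean + half_width


# Invented per-note % helpful values for one group (illustration only).
mean, lower, upper = mean_with_ci95([50.0, 60.0, 70.0])
```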

Original abstract

Large language models show promising capabilities for contextual fact-checking on social media: they can verify contested claims through deep research, synthesize evidence from multiple sources, and draft explanations at scale. However, prior work evaluates LLM fact-checking only in controlled settings using benchmarks or crowdworker judgments, leaving open how these systems perform in authentic platform environments. We present the first field evaluation of LLM-based fact-checking deployed on a live social media platform, testing performance directly through X Community Notes' AI writer feature over a three-month period. Our LLM writer, a multi-step pipeline that handles multimodal content (text, images, and videos), conducts web and platform-native search, and writes contextual notes, was deployed to write 1,614 notes on 1,597 tweets and compared against 1,332 human-written notes on the same tweets using 108,169 ratings from 42,521 raters. Direct comparison of note-level platform outcomes is complicated by differences in submission timing and rating exposure between LLM and human notes; we therefore pursue two complementary strategies: a rating-level analysis modeling individual rater evaluations, and a note-level analysis that equalizes rater exposure across note types. Rating-level analysis shows that LLM notes receive more positive ratings than human notes across raters with different political viewpoints, suggesting the potential for LLM-written notes to achieve the cross-partisan consensus. Note-level analysis confirms this advantage: among raters who evaluated all notes on the same post, LLM notes achieve significantly higher helpfulness scores. Our findings demonstrate that LLMs can contribute high-quality, broadly helpful fact-checking at scale, while highlighting that real-world evaluation requires careful attention to platform dynamics absent from controlled settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

First live deployment of LLM notes on X finds higher helpfulness ratings than human notes, but the timing controls look incomplete.

This paper runs the first actual field test of an LLM pipeline writing Community Notes on X. They deployed it for three months, produced 1,614 notes on 1,597 posts, and compared them to 1,332 human notes using 108k real-user ratings. The setup moves past lab benchmarks to platform data, which is the real advance here.

Their multi-step system handles text, images, and video with web and platform search, and the results point to LLM notes getting more positive ratings across raters with different politics. The note-level check, limited to raters who scored every note on the same post, also shows an edge for the LLM versions. That equalizes exposure count and gives some reassurance the difference is not just from who saw what.

The timing problem is the main soft spot. LLM notes arrived later over the three-month window while human notes were already live, so later notes could pick up extra context from replies or platform signals. The rating-level model uses rater fixed effects, but the abstract does not mention note-age or time-since-post covariates. The restricted-rater subset balances how many ratings each note gets but not submission order. Without those details it is hard to rule out residual bias.

This work is aimed at people studying scalable fact-checking and platform moderation. The scale and live setting make it worth referee time even with the control questions, which a review can tighten.

Referee Report

1 major / 1 minor

Summary. The paper reports the first field deployment of an LLM pipeline for writing X Community Notes. Over three months the system produced 1,614 notes on 1,597 tweets that were compared with 1,332 pre-existing human notes on the same tweets, using 108,169 ratings from 42,521 raters. Two complementary analyses are presented: a rating-level regression with rater fixed effects and a note-level analysis restricted to raters who evaluated every note on a given post. Both analyses indicate that LLM notes receive more positive ratings and higher helpfulness scores than human notes, including across political viewpoints, supporting the claim that LLMs can deliver high-quality, cross-partisan fact-checking at platform scale.

Significance. If the central comparison survives tighter temporal controls, the study supplies the first large-scale, real-user evidence on LLM fact-checking performance inside an operational social-media moderation system. The scale of the rating corpus and the dual analytic strategy (rating-level modeling plus exposure-equalized note-level comparison) are genuine strengths that move the literature beyond crowdsourced or benchmark-only evaluations.

major comments (1)

[Note-level analysis] Restricting to raters who scored every note on a post equalizes exposure count but leaves submission timing uncontrolled. LLM notes were generated across a three-month window while human notes pre-existed; later notes can therefore accumulate additional context, replies, or platform signals before receiving ratings. The rating-level model includes only rater fixed effects and does not report time-since-post or note-age covariates, so residual temporal confounding cannot be ruled out. This directly affects the load-bearing claim that LLM notes achieve significantly higher helpfulness scores.

minor comments (1)

[Abstract] The abstract states that timing differences are acknowledged yet provides no quantitative check (e.g., distribution of submission delays or balance tests within the restricted rater subset) that the note-level sample is balanced on submission order.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive feedback. The concern regarding uncontrolled submission timing in the note-level analysis is well-taken and points to a genuine limitation in the original design. We have revised the manuscript to incorporate explicit temporal controls (note age and time-since-post) in both analytic strategies. These additions preserve the core findings while directly addressing the potential confounding.

Referee: [Note-level analysis] Restricting to raters who scored every note on a post equalizes exposure count but leaves submission timing uncontrolled. LLM notes were generated across a three-month window while human notes pre-existed; later notes can therefore accumulate additional context, replies, or platform signals before receiving ratings. The rating-level model includes only rater fixed effects and does not report time-since-post or note-age covariates, so residual temporal confounding cannot be ruled out. This directly affects the load-bearing claim that LLM notes achieve significantly higher helpfulness scores.

Authors: We agree that equalizing rater exposure does not fully eliminate temporal differences, as LLM notes were submitted later in the observation window. In the revised manuscript we have added two sets of controls: (1) in the rating-level regression we now include note-age (days since the tweet was posted) and days-since-note-submission as covariates alongside rater fixed effects; (2) for the note-level analysis we re-estimate the comparison after matching on post age at the moment each rater evaluated the notes. Both robustness checks leave the main result unchanged: LLM notes continue to receive significantly higher helpfulness ratings. We report the updated models in Section 4.2 and new Appendix C, and we have tempered the language around causal interpretation to reflect the observational nature of the deployment.

revision: yes
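
To make the exchange above concrete, here is a minimal sketch of a rating-level regression with rater fixed effects plus one time covariate. It is an illustration under stated assumptions, not the paper's model: it uses a linear specification, absorbs the fixed effects by demeaning within rater, includes a single age covariate rather than the two named in the reply, and every name (`within_rater_ols`, the row layout, the synthetic data) is hypothetical.

```python
from collections import defaultdict


def within_rater_ols(rows):
    """OLS of y on (is_llm, note_age) with rater fixed effects.

    rows: list of (rater_id, is_llm, note_age_days, y), where y is the
    rating outcome (for example 1 = helpful, 0 = not helpful).
    Rater fixed effects are absorbed by demeaning each variable within
    rater. Returns (coef_is_llm, coef_note_age).
    """
    by_rater = defaultdict(list)
    for rater_id, is_llm, age, y in rows:
        by_rater[rater_id].append((float(is_llm), float(age), float(y)))

    # Accumulate the 2x2 normal equations on the demeaned data.
    s11 = s12 = s22 = s1y = s2y = 0.0
    for obs in by_rater.values():
        n = len(obs)
        m1 = sum(o[0] for o in obs) / n
        m2 = sum(o[1] for o in obs) / n
        my = sum(o[2] for o in obs) / n
        for x1, x2, y in obs:
            d1, d2, dy = x1 - m1, x2 - m2, y - my
            s11 += d1 * d1
            s12 += d1 * d2
            s22 += d2 * d2
            s1y += d1 * dy
            s2y += d2 * dy

    det = s11 * s22 - s12 * s12
    if abs(det) < 1e-12:
        raise ValueError("regressors are collinear after demeaning")
    coef_is_llm = (s22 * s1y - s12 * s2y) / det
    coef_note_age = (s11 * s2y - s12 * s1y) / det
    return coef_is_llm, coef_note_age


# Synthetic data with known coefficients (invented, not from the paper):
# y = rater_effect + 0.2 * is_llm + 0.01 * note_age
effects = {"r1": 0.1, "r2": 0.5}
design = [
    ("r1", 1, 1), ("r1", 0, 2), ("r1", 1, 4), ("r1", 0, 3),
    ("r2", 1, 5), ("r2", 0, 1), ("r2", 1, 2), ("r2", 0, 6),
]
rows = [(r, llm, age, effects[r] + 0.2 * llm + 0.01 * age)
        for r, llm, age in design]
coef_llm, coef_age = within_rater_ols(rows)
```

The point of the age covariate is the one the referee raises: if LLM notes are systematically rated at different ages than human notes, leaving age out lets its effect leak into the LLM coefficient.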

Circularity Check

0 steps flagged

No circularity: direct empirical measurement of platform ratings

This is a field deployment study that records and statistically compares observed helpfulness ratings from real X users. The two analysis strategies (rating-level modeling with rater fixed effects and note-level restriction to raters who saw every note on a post) are standard empirical controls applied to the collected data; neither reduces to a fitted parameter renamed as a prediction nor to a self-citation that supplies the result. No equations, ansatzes, uniqueness theorems, or self-referential derivations appear. The central claim is therefore an independent empirical outcome rather than a restatement of inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that user ratings accurately measure note helpfulness and that the two analysis strategies adequately equalize exposure differences; no free parameters or invented entities are introduced.

axioms (1)

Domain assumption: user ratings on the platform reflect genuine perceptions of note helpfulness.
The entire comparison depends on treating aggregated rater scores as a valid quality metric.

pith-pipeline@v0.9.0 · 5608 in / 1087 out tokens · 41357 ms · 2026-05-13T19:17:55.789859+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: rating-level analysis modeling individual rater evaluations, and a note-level analysis that equalizes rater exposure across note types

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: LLM notes receive more positive ratings than human notes across raters with different political viewpoints

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Characterizing AI Fact-Checkers and Their Contributions on Community Notes
cs.CY 2026-05 unverdicted novelty 7.0

AI writers account for 14.2% of Community Notes submissions with high responsiveness and coverage but lower helpfulness classification rates than human experts.

Reference graph

Works this paper leans on

10 extracted references · 10 canonical work pages · cited by 1 Pith paper

[1] complete raters
doi: 10.54501/jots.v3i1.255. Jonathan Mummolo and Erik Peterson. Demand effects in survey experiments: An empirical assessment. American Political Science Review, 113(2):517–529, 2019. Sahajpreet Singh, Kokil Jaidka, and Min-Yen Kan. GitSearch: Enhancing community notes generation with gap-informed targeted search. arXiv preprint arXiv:2602.08945, 2026. I...

[2] Search both the web and X for factual sources that refute or confirm the post’s claims. Use the post context to guide your search if it could provide potential fact-check directions

[3] Aim for {target_url_count} pieces of evidence / URLs if possible

[4] For each source, include the URL and a brief note describing how it verifies or challenges the post. Include the publication date of the source if available

[5] Cover outlets across the ideological spectrum (left, center, right). Overlapping reasoning is acceptable when it comes from different publishers

[6] Prioritize evidence that is relevant, solid, and up to date. Target post (ID: {post_id}): {post} Your response should be returned as a JSON object with the following structure: ‘‘‘ {{ "post_context": "one/two-sentence summary of the post context", "research": [ {{"url": "url1", "description": "how the content of the URL fact-checks the post "}}, ... ] }} ...

[7] The note is written to explain why the post is misleading and add additional context to the post. Focus on primary claim(s) of the post rather than trivial details

[8] The note must be grounded in the provided evidence and should cite the URL of the evidence it uses. At least one URL must be cited

[9] Keep the note strictly under 280 characters. Stay neutral and clear

[10] No hashtags, emojis, unnecessary words. No markdown, brackets, or parentheses around URLs. Do not mention "this note" or "the prompt." Target post: ‘‘‘ {post} ‘‘‘ Additional context about the post: ‘‘‘ {post_context} ‘‘‘ Allowed evidence sources: ‘‘‘ {evidence} ‘‘‘ Out...
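
The note-writing constraints quoted in entries [7] to [10] lend themselves to an automatic check. The sketch below is one possible reading of them, not part of the paper's pipeline: `check_note` is a hypothetical helper, the length test uses a plain character count (the platform may count URLs differently), the emoji test covers only a rough set of Unicode ranges, and "grounded in the provided evidence" is approximated by requiring cited URLs to appear in the evidence set.

```python
import re

URL_RE = re.compile(r"https?://\S+")
# Rough emoji ranges; an assumption, not an exhaustive Unicode emoji test.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def check_note(note, allowed_urls):
    """Return a list of constraint violations (empty list means OK)."""
    problems = []
    # Assumption: plain character count, not platform-specific counting.
    if len(note) >= 280:
        problems.append("not strictly under 280 characters")
    urls = URL_RE.findall(note)
    if not urls:
        problems.append("no URL cited")
    for url in urls:
        # Ignore sentence punctuation that trails the URL.
        if url.rstrip(".,;") not in allowed_urls:
            problems.append(f"URL not in provided evidence: {url}")
    # A hashtag starts a token; URL fragments like /page#top do not match.
    if re.search(r"(?<!\S)#\w+", note):
        problems.append("contains a hashtag")
    if EMOJI_RE.search(note):
        problems.append("contains an emoji")
    if re.search(r"[\[\(<]https?://", note) or re.search(r"\*\*|\]\(", note):
        problems.append("markdown or brackets around a URL")
    if re.search(r"this note|the prompt", note, re.IGNORECASE):
        problems.append('mentions "this note" or "the prompt"')
    return problems


# Invented examples for illustration.
allowed = {"https://example.org/report"}
ok_note = "The photo is from 2019, not the 2024 event. https://example.org/report"
bad_note = "This note says it's fake #hoax (https://example.org/report)"
ok_problems = check_note(ok_note, allowed)
bad_problems = check_note(bad_note, allowed)
```

A check like this only enforces form. Whether the note actually addresses the post's primary claim, as entry [7] asks, still needs human or model judgment.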
