arxiv: 2604.02592 · v2 · submitted 2026-04-03 · 💻 cs.CY

AI Fact-Checking in the Wild: A Field Evaluation of LLM-Written Community Notes on X

Pith reviewed 2026-05-13 19:17 UTC · model grok-4.3

classification 💻 cs.CY
keywords LLM fact-checking · Community Notes · X platform · AI evaluation · social media · helpfulness ratings · field study

The pith

LLM-written notes on X receive significantly higher helpfulness scores than human notes when rater exposure is equalized.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether large language models can write effective fact-checking notes in the authentic environment of X's Community Notes system rather than in isolated benchmarks. An LLM pipeline was deployed live for three months to produce 1,614 notes on 1,597 tweets, which were then compared to 1,332 human notes on the same posts through 108,169 real-user ratings. Two complementary analyses were applied to handle uneven timing and visibility: one modeling individual rater decisions and one restricting comparison to raters who saw every note on a given post. Both approaches show LLM notes receiving more positive evaluations, with the note-level analysis confirming significantly higher helpfulness scores for the AI versions.

Core claim

LLM-written notes achieve significantly higher helpfulness scores than human-written notes among raters who evaluated all notes on the same post, after applying controls for submission timing and rating exposure differences.

What carries the argument

A multi-step LLM pipeline that processes multimodal tweet content, performs web and platform searches, and generates contextual notes, evaluated through rating-level modeling and note-level equalization of rater exposure.
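
The note-level equalization of rater exposure can be illustrated with a minimal sketch. It assumes ratings arrive as `(rater_id, post_id, note_id, helpful)` tuples and uses the plain share of helpful ratings as a stand-in score; the function name, the data layout, and the scoring rule are illustrative assumptions, and the paper's helpfulness score may be computed differently.

```python
from collections import defaultdict


def complete_rater_scores(ratings, notes_by_post):
    """Share of helpful ratings per note, using only complete raters.

    ratings: list of (rater_id, post_id, note_id, helpful) tuples (hypothetical layout).
    notes_by_post: dict mapping post_id to the set of note_ids on that post.
    """
    # Record which notes each rater rated on each post.
    seen = defaultdict(set)
    for rater, post, note, _ in ratings:
        seen[(rater, post)].add(note)

    # A rater is "complete" on a post if they rated every note on it.
    complete = {key for key, notes in seen.items() if notes == notes_by_post[key[1]]}

    helpful = defaultdict(int)
    total = defaultdict(int)
    for rater, post, note, is_helpful in ratings:
        if (rater, post) in complete:
            total[note] += 1
            helpful[note] += int(is_helpful)
    return {note: helpful[note] / total[note] for note in total}


# Hypothetical toy data: rater r2 skipped the human note, so r2 is dropped.
ratings = [
    ("r1", "p1", "llm_1", True),
    ("r1", "p1", "human_1", False),
    ("r2", "p1", "llm_1", True),
    ("r3", "p1", "llm_1", True),
    ("r3", "p1", "human_1", True),
]
notes_by_post = {"p1": {"llm_1", "human_1"}}
scores = complete_rater_scores(ratings, notes_by_post)
```

Because every retained rater saw both notes, the two scores are computed over the same set of people, which is the point of the restriction.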

If this is right

  • LLM notes receive more positive ratings across raters with different political viewpoints.
  • Equalizing rater exposure reveals a clear helpfulness advantage for LLM notes on the same posts.
  • Real-world platform timing and visibility must be modeled explicitly when evaluating AI contributions.
  • AI systems can generate fact-checking content that supports cross-partisan consensus at scale.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Other platforms with community moderation could test similar LLM writers to measure comparable performance gains.
  • The higher scores may arise from more consistent evidence synthesis or neutral phrasing that appeals across viewpoints.
  • Longer-term deployment could reveal whether LLM notes sustain their advantage as users adapt to AI-generated content.

Load-bearing premise

The two analysis strategies sufficiently control for differences in submission timing and rating exposure between LLM and human notes.

What would settle it

A controlled test in which LLM and human notes on identical posts are submitted at the same time and rated by the exact same set of users, showing no significant difference in helpfulness scores.
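
One way such a paired comparison could be tested is an exact sign-flip permutation test on per-post score differences. The sketch below is an assumption for illustration, not the paper's procedure; `diffs` holds hypothetical LLM-minus-human differences for posts rated by the same users.

```python
from itertools import product


def paired_sign_flip_pvalue(diffs):
    """Exact two-sided p-value for the null that paired differences are symmetric around 0.

    Enumerates all 2**n sign assignments, so it is only practical for small n.
    """
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # tolerance for floating-point ties
            extreme += 1
    return extreme / total


# Hypothetical per-post differences in percentage points (LLM minus human).
diffs = [10.0, 20.0, 30.0]
p_value = paired_sign_flip_pvalue(diffs)
```

With only three posts, the smallest attainable two-sided p-value is 2/8, so a real test would need many more paired posts to detect or rule out a difference.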

Figures

Figures reproduced from arXiv: 2604.02592 by Haiwen Li, Michiel A. Bakker.

Figure 1: Mean % helpful and % unhelpful ratings per note for LLM and human notes, stratified by rater ideology group (left, neutral, right). Error bars show 95% confidence intervals across notes.
Figure 2: LLM vs. human note rating advantage (AI main-effect coefficient with 95% CI) by tweet modality (text-only, image, video).
Figure 3: LLM vs. human note rating advantage (AI main-effect coefficient with 95% CI) by tweet topic category.
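
The per-group summary in Figure 1 can be sketched as a mean with a confidence interval across notes. This is a minimal illustration that assumes a normal-approximation interval (1.96 standard errors); the data layout and values are hypothetical, and the paper may construct its intervals differently.

```python
from math import sqrt
from statistics import mean, stdev


def mean_with_ci(values, z=1.96):
    """Return (mean, half-width) of a normal-approximation 95% CI across notes."""
    m = mean(values)
    half = z * stdev(values) / sqrt(len(values)) if len(values) > 1 else 0.0
    return m, half


def summarize(per_note_pct):
    """per_note_pct: {(writer, ideology_group): [% helpful per note, ...]}."""
    return {key: mean_with_ci(vals) for key, vals in per_note_pct.items()}


# Hypothetical per-note % helpful values for one ideology group.
per_note_pct = {
    ("llm", "left"): [60.0, 70.0, 80.0],
    ("human", "left"): [40.0, 50.0, 60.0],
}
summary = summarize(per_note_pct)
llm_mean, llm_half = summary[("llm", "left")]
```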

Original abstract

Large language models show promising capabilities for contextual fact-checking on social media: they can verify contested claims through deep research, synthesize evidence from multiple sources, and draft explanations at scale. However, prior work evaluates LLM fact-checking only in controlled settings using benchmarks or crowdworker judgments, leaving open how these systems perform in authentic platform environments. We present the first field evaluation of LLM-based fact-checking deployed on a live social media platform, testing performance directly through X Community Notes' AI writer feature over a three-month period. Our LLM writer, a multi-step pipeline that handles multimodal content (text, images, and videos), conducts web and platform-native search, and writes contextual notes, was deployed to write 1,614 notes on 1,597 tweets and compared against 1,332 human-written notes on the same tweets using 108,169 ratings from 42,521 raters. Direct comparison of note-level platform outcomes is complicated by differences in submission timing and rating exposure between LLM and human notes; we therefore pursue two complementary strategies: a rating-level analysis modeling individual rater evaluations, and a note-level analysis that equalizes rater exposure across note types. Rating-level analysis shows that LLM notes receive more positive ratings than human notes across raters with different political viewpoints, suggesting the potential for LLM-written notes to achieve cross-partisan consensus. Note-level analysis confirms this advantage: among raters who evaluated all notes on the same post, LLM notes achieve significantly higher helpfulness scores. Our findings demonstrate that LLMs can contribute high-quality, broadly helpful fact-checking at scale, while highlighting that real-world evaluation requires careful attention to platform dynamics absent from controlled settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper reports the first field deployment of an LLM pipeline for writing X Community Notes. Over three months the system produced 1,614 notes on 1,597 tweets that were compared with 1,332 pre-existing human notes on the same tweets, using 108,169 ratings from 42,521 raters. Two complementary analyses are presented: a rating-level regression with rater fixed effects and a note-level analysis restricted to raters who evaluated every note on a given post. Both analyses indicate that LLM notes receive more positive ratings and higher helpfulness scores than human notes, including across political viewpoints, supporting the claim that LLMs can deliver high-quality, cross-partisan fact-checking at platform scale.

Significance. If the central comparison survives tighter temporal controls, the study supplies the first large-scale, real-user evidence on LLM fact-checking performance inside an operational social-media moderation system. The scale of the rating corpus and the dual analytic strategy (rating-level modeling plus exposure-equalized note-level comparison) are genuine strengths that move the literature beyond crowdsourced or benchmark-only evaluations.

major comments (1)
  1. [Note-level analysis] Note-level analysis: restricting to raters who scored every note on a post equalizes exposure count but leaves submission timing uncontrolled. LLM notes were generated across a three-month window while human notes pre-existed; later notes can therefore accumulate additional context, replies, or platform signals before receiving ratings. The rating-level model includes only rater fixed effects and does not report time-since-post or note-age covariates, so residual temporal confounding cannot be ruled out. This directly affects the load-bearing claim that LLM notes achieve significantly higher helpfulness scores.
minor comments (1)
  1. [Abstract] The abstract acknowledges timing differences between LLM and human notes but reports no quantitative check (e.g., the distribution of submission delays, or balance tests within the restricted rater subset) showing that the note-level sample is balanced on submission order.
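
A balance test of the kind the minor comment asks for could be as simple as a standardized mean difference (SMD) in submission delay between the two note types. The sketch below is illustrative only: the delay values are hypothetical, and the SMD is one common balance diagnostic, not a check the paper is known to report.

```python
from math import sqrt
from statistics import mean, variance


def standardized_mean_difference(a, b):
    """(mean(a) - mean(b)) divided by the pooled standard deviation.

    Values near 0 suggest the two groups are balanced on this covariate.
    """
    pooled_sd = sqrt((variance(a) + variance(b)) / 2)
    return (mean(a) - mean(b)) / pooled_sd


# Hypothetical submission delays in hours after the tweet was posted.
llm_delay_hours = [5.0, 7.0, 9.0]
human_delay_hours = [2.0, 4.0, 6.0]
smd = standardized_mean_difference(llm_delay_hours, human_delay_hours)
```

A frequently used rule of thumb treats an absolute SMD below roughly 0.1 as adequate balance; the toy data above is deliberately far from that.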

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive feedback. The concern regarding uncontrolled submission timing in the note-level analysis is well-taken and points to a genuine limitation in the original design. We have revised the manuscript to incorporate explicit temporal controls (note age and time-since-post) in both analytic strategies. These additions preserve the core findings while directly addressing the potential confounding.

Point-by-point responses
  1. Referee: [Note-level analysis] Note-level analysis: restricting to raters who scored every note on a post equalizes exposure count but leaves submission timing uncontrolled. LLM notes were generated across a three-month window while human notes pre-existed; later notes can therefore accumulate additional context, replies, or platform signals before receiving ratings. The rating-level model includes only rater fixed effects and does not report time-since-post or note-age covariates, so residual temporal confounding cannot be ruled out. This directly affects the load-bearing claim that LLM notes achieve significantly higher helpfulness scores.

    Authors: We agree that equalizing rater exposure does not fully eliminate temporal differences, as LLM notes were submitted later in the observation window. In the revised manuscript we have added two sets of controls: (1) in the rating-level regression we now include post age (days since the tweet was posted) and note age (days since note submission) as covariates alongside rater fixed effects; (2) for the note-level analysis we re-estimate the comparison after matching on post age at the moment each rater evaluated the notes. Both robustness checks leave the main result unchanged: LLM notes continue to receive significantly higher helpfulness ratings. We report the updated models in Section 4.2 and new Appendix C, and we have tempered the language around causal interpretation to reflect the observational nature of the deployment. revision: yes
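
The rating-level model with rater fixed effects and temporal covariates can be sketched as a within-rater (demeaned) least-squares fit. This is a minimal illustration under stated assumptions: a linear model, hypothetical covariates (an LLM indicator and days since the tweet was posted), and synthetic noise-free outcomes built so the true coefficients are known. It is not the authors' specification.

```python
from collections import defaultdict


def demean_by_group(rows, groups):
    """Subtract each group's column means from its rows (within transformation)."""
    sums = defaultdict(lambda: [0.0] * len(rows[0]))
    counts = defaultdict(int)
    for row, g in zip(rows, groups):
        counts[g] += 1
        for j, v in enumerate(row):
            sums[g][j] += v
    return [[v - sums[g][j] / counts[g] for j, v in enumerate(row)]
            for row, g in zip(rows, groups)]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fixed_effects_ols(y, x, raters):
    """Regress y on x after removing rater means; one coefficient per column of x."""
    data = demean_by_group([[yi] + xi for yi, xi in zip(y, x)], raters)
    yd = [row[0] for row in data]
    xd = [row[1:] for row in data]
    k = len(xd[0])
    xtx = [[sum(r[i] * r[j] for r in xd) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(xd, yd)) for i in range(k)]
    return solve(xtx, xty)


# Synthetic data: columns are [is_llm, post_age_days]; rater effects are arbitrary.
raters = ["r1"] * 4 + ["r2"] * 4
x = [[1, 1], [0, 2], [1, 3], [0, 5], [1, 2], [0, 1], [1, 6], [0, 3]]
rater_effect = {"r1": 0.2, "r2": 0.5}
y = [rater_effect[r] + 0.3 * xi[0] - 0.01 * xi[1] for r, xi in zip(raters, x)]
coefs = fixed_effects_ols(y, x, raters)
```

Demeaning within rater absorbs each rater's baseline leniency, so the LLM coefficient is identified only from raters who rated both note types, and the age covariate captures the timing channel the referee raised.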

Circularity Check

0 steps flagged

No circularity: direct empirical measurement of platform ratings

Full rationale

This is a field deployment study that records and statistically compares observed helpfulness ratings from real X users. The two analysis strategies (rating-level modeling with rater fixed effects and note-level restriction to raters who saw every note on a post) are standard empirical controls applied to the collected data; neither reduces to a fitted parameter renamed as a prediction nor to a self-citation that supplies the result. No equations, ansatzes, uniqueness theorems, or self-referential derivations appear. The central claim is therefore an independent empirical outcome rather than a restatement of inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that user ratings accurately measure note helpfulness and that the two analysis strategies adequately equalize exposure differences; no free parameters or invented entities are introduced.

axioms (1)
  • Domain assumption: user ratings on the platform reflect genuine perceptions of note helpfulness.
    The entire comparison depends on treating aggregated rater scores as a valid quality metric.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Characterizing AI Fact-Checkers and Their Contributions on Community Notes

    cs.CY 2026-05 unverdicted novelty 7.0

    AI writers account for 14.2% of Community Notes submissions with high responsiveness and coverage but lower helpfulness classification rates than human experts.
