AI Fact-Checking in the Wild: A Field Evaluation of LLM-Written Community Notes on X
Pith reviewed 2026-05-13 19:17 UTC · model grok-4.3
The pith
LLM-written notes on X receive significantly higher helpfulness scores than human notes when rater exposure is equalized.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLM-written notes achieve significantly higher helpfulness scores than human-written notes among raters who evaluated all notes on the same post, after applying controls for submission timing and rating exposure differences.
What carries the argument
A multi-step LLM pipeline that processes multimodal tweet content, performs web and platform searches, and generates contextual notes, evaluated through rating-level modeling and note-level equalization of rater exposure.
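The note-level equalization of rater exposure can be illustrated with a minimal sketch. It assumes a hypothetical log of `(rater_id, post_id, note_id, helpful)` tuples and a mapping from each post to its notes. The field names, the binary helpfulness coding, and the plain mean used as a score are illustrative assumptions, not the paper's data schema or the platform's scoring algorithm.

```python
from collections import defaultdict

def equalized_note_scores(ratings, notes_by_post):
    """Mean helpfulness per note, using only raters who rated every note on the post.

    ratings: iterable of (rater_id, post_id, note_id, helpful), helpful in {0, 1}.
    notes_by_post: dict mapping post_id -> set of note_ids on that post.
    Both structures are hypothetical stand-ins for real rating logs.
    """
    # Collect each rater's ratings per post: (post, rater) -> {note: helpful}.
    seen = defaultdict(dict)
    for rater, post, note, helpful in ratings:
        seen[(post, rater)][note] = helpful

    totals = defaultdict(lambda: [0, 0])  # note -> [sum of helpful, count]
    for (post, _rater), by_note in seen.items():
        # Keep the rater only if they rated all notes on this post.
        if set(by_note) != notes_by_post[post]:
            continue
        for note, helpful in by_note.items():
            totals[note][0] += helpful
            totals[note][1] += 1

    return {note: s / n for note, (s, n) in totals.items()}


ratings = [
    ("r1", "p1", "llm", 1), ("r1", "p1", "human", 0),
    ("r2", "p1", "llm", 1), ("r2", "p1", "human", 1),
    ("r3", "p1", "human", 0),  # r3 never rated the LLM note, so r3 is dropped
]
scores = equalized_note_scores(ratings, {"p1": {"llm", "human"}})
```

In this toy input, only raters `r1` and `r2` contribute, so both notes are scored by the same set of people.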
If this is right
- LLM notes receive more positive ratings across raters with different political viewpoints.
- Equalizing rater exposure reveals a clear helpfulness advantage for LLM notes on the same posts.
- Real-world platform timing and visibility must be modeled explicitly when evaluating AI contributions.
- AI systems can generate fact-checking content that supports cross-partisan consensus at scale.
Where Pith is reading between the lines
- Other platforms with community moderation could test similar LLM writers to measure comparable performance gains.
- The higher scores may arise from more consistent evidence synthesis or neutral phrasing that appeals across viewpoints.
- Longer-term deployment could reveal whether LLM notes sustain their advantage as users adapt to AI-generated content.
Load-bearing premise
The two analysis strategies sufficiently control for differences in submission timing and rating exposure between LLM and human notes.
What would settle it
A controlled test in which LLM and human notes on identical posts are submitted at the same time and rated by the exact same set of users, showing no significant difference in helpfulness scores.
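One way such a controlled test could be analyzed is a paired sign-flip permutation test on per-post score differences (LLM minus human). The sketch below is an assumption about how the comparison might be run, not the paper's method, and the difference values are made up for illustration.

```python
import random

def sign_flip_p_value(diffs, n_perm=10_000, seed=0):
    """Two-sided p-value for the mean paired difference under random sign flips."""
    rng = random.Random(seed)
    observed = abs(sum(diffs) / len(diffs))
    hits = 0
    for _ in range(n_perm):
        # Under the null, each paired difference is equally likely to have either sign.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


# Hypothetical per-post differences (LLM score minus human score).
diffs = [0.12, 0.05, 0.20, -0.03, 0.08, 0.15, 0.02, 0.10]
p = sign_flip_p_value(diffs)
```

A large p-value in such a design would be consistent with no difference between note types. A small one would indicate that the advantage persists when timing and raters are held fixed.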
Original abstract
Large language models show promising capabilities for contextual fact-checking on social media: they can verify contested claims through deep research, synthesize evidence from multiple sources, and draft explanations at scale. However, prior work evaluates LLM fact-checking only in controlled settings using benchmarks or crowdworker judgments, leaving open how these systems perform in authentic platform environments. We present the first field evaluation of LLM-based fact-checking deployed on a live social media platform, testing performance directly through X Community Notes' AI writer feature over a three-month period. Our LLM writer, a multi-step pipeline that handles multimodal content (text, images, and videos), conducts web and platform-native search, and writes contextual notes, was deployed to write 1,614 notes on 1,597 tweets and compared against 1,332 human-written notes on the same tweets using 108,169 ratings from 42,521 raters. Direct comparison of note-level platform outcomes is complicated by differences in submission timing and rating exposure between LLM and human notes; we therefore pursue two complementary strategies: a rating-level analysis modeling individual rater evaluations, and a note-level analysis that equalizes rater exposure across note types. Rating-level analysis shows that LLM notes receive more positive ratings than human notes across raters with different political viewpoints, suggesting the potential for LLM-written notes to achieve the cross-partisan consensus. Note-level analysis confirms this advantage: among raters who evaluated all notes on the same post, LLM notes achieve significantly higher helpfulness scores. Our findings demonstrate that LLMs can contribute high-quality, broadly helpful fact-checking at scale, while highlighting that real-world evaluation requires careful attention to platform dynamics absent from controlled settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reports the first field deployment of an LLM pipeline for writing X Community Notes. Over three months the system produced 1,614 notes on 1,597 tweets that were compared with 1,332 human-written notes on the same tweets, using 108,169 ratings from 42,521 raters. Two complementary analyses are presented: a rating-level regression with rater fixed effects and a note-level analysis restricted to raters who evaluated every note on a given post. The rating-level analysis indicates that LLM notes receive more positive ratings than human notes across political viewpoints, and the note-level analysis indicates higher helpfulness scores, supporting the claim that LLMs can deliver high-quality, cross-partisan fact-checking at platform scale.
Significance. If the central comparison survives tighter temporal controls, the study supplies the first large-scale, real-user evidence on LLM fact-checking performance inside an operational social-media moderation system. The scale of the rating corpus and the dual analytic strategy (rating-level modeling plus exposure-equalized note-level comparison) are genuine strengths that move the literature beyond crowdsourced or benchmark-only evaluations.
major comments (1)
- [Note-level analysis] Restricting to raters who scored every note on a post equalizes exposure count but leaves submission timing uncontrolled. LLM notes were generated across a three-month window while human notes pre-existed; later notes can therefore accumulate additional context, replies, or platform signals before receiving ratings. The rating-level model includes only rater fixed effects and does not report time-since-post or note-age covariates, so residual temporal confounding cannot be ruled out. This directly affects the load-bearing claim that LLM notes achieve significantly higher helpfulness scores.
minor comments (1)
- [Abstract] The abstract acknowledges timing differences yet provides no quantitative check (e.g., the distribution of submission delays, or balance tests within the restricted rater subset) showing that the note-level sample is balanced on submission order.
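A balance check of the kind requested here could be as simple as a standardized mean difference (SMD) on submission delay between the two note types. The sketch below uses made-up delay values and hypothetical variable names. A commonly used rule of thumb treats an absolute SMD below roughly 0.1 as balanced, though that threshold is a convention rather than anything stated in the paper.

```python
from statistics import mean, variance

def standardized_mean_difference(a, b):
    """(mean(a) - mean(b)) divided by the pooled standard deviation (sample variances)."""
    pooled_sd = ((variance(a) + variance(b)) / 2) ** 0.5
    return (mean(a) - mean(b)) / pooled_sd


# Hypothetical submission delays, in hours since the post was published.
llm_delay_hours = [2.0, 3.0, 4.0, 5.0]
human_delay_hours = [6.0, 7.0, 8.0, 9.0]
smd = standardized_mean_difference(llm_delay_hours, human_delay_hours)
```

In this toy example the two groups differ sharply on delay, which is the kind of imbalance the check is meant to expose.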
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. The concern regarding uncontrolled submission timing in the note-level analysis is well-taken and points to a genuine limitation in the original design. We have revised the manuscript to incorporate explicit temporal controls (note age and time-since-post) in both analytic strategies. These additions preserve the core findings while directly addressing the potential confounding.
Point-by-point responses
- Referee: [Note-level analysis] Restricting to raters who scored every note on a post equalizes exposure count but leaves submission timing uncontrolled. LLM notes were generated across a three-month window while human notes pre-existed; later notes can therefore accumulate additional context, replies, or platform signals before receiving ratings. The rating-level model includes only rater fixed effects and does not report time-since-post or note-age covariates, so residual temporal confounding cannot be ruled out. This directly affects the load-bearing claim that LLM notes achieve significantly higher helpfulness scores.
Authors: We agree that equalizing rater exposure does not fully eliminate temporal differences, as LLM notes were submitted later in the observation window. In the revised manuscript we have added two sets of controls: (1) in the rating-level regression we now include post age (days since the tweet was posted) and note age (days since the note was submitted) as covariates alongside rater fixed effects; (2) for the note-level analysis we re-estimate the comparison after matching on post age at the moment each rater evaluated the notes. Both robustness checks leave the main result unchanged: LLM notes continue to receive significantly higher helpfulness ratings. We report the updated models in Section 4.2 and new Appendix C, and we have tempered the language around causal interpretation to reflect the observational nature of the deployment.
Revision: yes
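The kind of control described in this response can be sketched as a within-rater (fixed-effects) regression with a timing covariate. The example is a minimal illustration under stated assumptions: a linear model, a single age covariate for brevity, and synthetic noise-free data. The function name and row layout are hypothetical, and the actual model specification is not given in the text.

```python
from collections import defaultdict

def within_rater_ols(rows):
    """Fit y ~ b_llm * is_llm + b_age * note_age with rater fixed effects.

    rows: list of (rater_id, is_llm, note_age, y). Rater fixed effects are
    absorbed by demeaning every variable within rater (within transformation).
    """
    groups = defaultdict(list)
    for rater, is_llm, age, y in rows:
        groups[rater].append((float(is_llm), float(age), float(y)))

    # Demean each variable within rater.
    demeaned = []
    for obs in groups.values():
        n = len(obs)
        means = [sum(o[k] for o in obs) / n for k in range(3)]
        demeaned += [tuple(o[k] - means[k] for k in range(3)) for o in obs]

    # Solve the 2x2 normal equations by Cramer's rule.
    s11 = sum(x1 * x1 for x1, _, _ in demeaned)
    s22 = sum(x2 * x2 for _, x2, _ in demeaned)
    s12 = sum(x1 * x2 for x1, x2, _ in demeaned)
    s1y = sum(x1 * y for x1, _, y in demeaned)
    s2y = sum(x2 * y for _, x2, y in demeaned)
    det = s11 * s22 - s12 * s12
    b_llm = (s1y * s22 - s2y * s12) / det
    b_age = (s2y * s11 - s1y * s12) / det
    return b_llm, b_age


# Synthetic data: y = rater effect + 0.3 * is_llm - 0.1 * note_age (no noise).
effects = {"a": 1.0, "b": -0.5}
design = [("a", 1, 1.0), ("a", 0, 3.0), ("a", 1, 4.0),
          ("b", 0, 2.0), ("b", 1, 5.0), ("b", 0, 1.0)]
rows = [(r, llm, age, effects[r] + 0.3 * llm - 0.1 * age) for r, llm, age in design]
b_llm, b_age = within_rater_ols(rows)
```

Because the synthetic outcome is built from known coefficients, the fit recovers them. This shows that the LLM coefficient is estimated net of both rater-level differences and the age covariate.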
Circularity Check
No circularity: direct empirical measurement of platform ratings
This is a field deployment study that records and statistically compares observed helpfulness ratings from real X users. The two analysis strategies (rating-level modeling with rater fixed effects and note-level restriction to raters who saw every note on a post) are standard empirical controls applied to the collected data; neither reduces to a fitted parameter renamed as a prediction nor to a self-citation that supplies the result. No equations, ansatzes, uniqueness theorems, or self-referential derivations appear. The central claim is therefore an independent empirical outcome rather than a restatement of inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: User ratings on the platform reflect genuine perceptions of note helpfulness.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Paper passage: "rating-level analysis modeling individual rater evaluations, and a note-level analysis that equalizes rater exposure across note types"
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Paper passage: "LLM notes receive more positive ratings than human notes across raters with different political viewpoints"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Characterizing AI Fact-Checkers and Their Contributions on Community Notes: AI writers account for 14.2% of Community Notes submissions with high responsiveness and coverage but lower helpfulness classification rates than human experts.