pith. sign in

arxiv: 2605.30241 · v1 · pith:J2574KWTnew · submitted 2026-05-28 · 💻 cs.CL · cs.CY· cs.SI

CommunityFact: A Dynamic, Multilingual, Multi-domain Benchmark for Misinformation Detection in the Wild

Pith reviewed 2026-06-29 07:30 UTC · model grok-4.3

classification 💻 cs.CL cs.CYcs.SI
keywords misinformation detectionbenchmarklarge language modelsCommunity Notesmultilingual evaluationweb searchsource selection
0
0 comments X

The pith

Web-enabled LLMs select sources for verifying claims that differ systematically from the sources human Community Notes raters converge on.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CommunityFact as a refreshable benchmark containing 15,992 claims in five languages and two domains to test how well models handle real-world misinformation. It runs ten LLMs through closed-input, thinking, and web-search conditions and tracks both accuracy and the specific sources each model retrieves. The main result is that web access produces the largest accuracy gains, yet the sources chosen by these models diverge from the sources that human raters settle on in Community Notes. The divergence shrinks when models apply their own retrieval-expansion or pruning rules. The benchmark is also positioned to supply training data for systems that suggest sources conditioned on a new claim.

Core claim

CommunityFact supplies a dynamic collection of 15,992 standalone claims across five languages and two domains. Evaluation of ten LLMs shows closed-input verification remains hard, web access yields the largest gains, and web-enabled models' source-selection policies are systematically misaligned with the sources that human Community Notes raters converge upon; this gap narrows through model-specific retrieval expansion or pruning. Substantial variation appears across language-domain slices and across the evidence ecosystems used by different web-enabled systems. The resource further frames Community Notes data as a training signal for claim-conditioned source suggesters.

What carries the argument

The CommunityFact benchmark, which pairs each claim with the sources converged upon by human Community Notes raters to quantify misalignment in LLM source-selection policies.

If this is right

  • Closed-input verification stays difficult across the tested models.
  • Web access produces larger accuracy gains than internal thinking alone.
  • Source-selection misalignment varies by language and domain slice.
  • Different web-enabled systems draw from distinct evidence ecosystems.
  • Community Notes data can train claim-conditioned source suggesters for novel claims.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models trained on these alignments could generalize source suggestions to claims outside the current benchmark.
  • Dynamic updates to the benchmark would let researchers track whether retrieval policies improve or drift over time.
  • The observed language-domain variation suggests targeted retrieval tuning per slice may be needed before broad deployment.

Load-bearing premise

Agreement among human Community Notes raters supplies a reliable external reference standard for judging whether an LLM has selected appropriate sources.

What would settle it

A fresh collection of claims on which web-enabled LLMs that apply retrieval expansion or pruning still select sources that diverge from the sources human raters converge on.

Figures

Figures reproduced from arXiv: 2605.30241 by Insyirah Mujtahid, Kokil Jaidka, Min-Yen Kan, Sahajpreet Singh.

Figure 1
Figure 1. Figure 1: Language distribution of helpful Community [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Effects of evidence-guided web verification. Panel [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Observed source-domain ecosystems for crowd-sourced verification (via Community Notes). Here, we plot the most frequently emitted source do￾mains for the English subset of COMMUNITYFACT across domains. The x-axis is logarithmic. the largest gains on Portuguese politics (+5.24) and finance (+4.86) for GPT-5-nano. Slice-level reversals persist — Grok-4.3 declines on Spanish finance (−4.47), and Gemini-2.5-Fl… view at source ↗
Figure 4
Figure 4. Figure 4: Observed source-domain ecosystems for web-search verification. For each web-enabled model, we [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Observed top-10 source-domain ecosystems for crowd-sourced verification ( [PITH_FULL_IMAGE:figures/full_fig_p016_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Observed top-10 source-domain ecosystems for web-search verification across domain-language slices. [PITH_FULL_IMAGE:figures/full_fig_p017_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Observed top-10 source-domain ecosystems for evidence-guided web-search verification across domain [PITH_FULL_IMAGE:figures/full_fig_p018_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Domain classification prompt. This classifies each helpful Community Note into [PITH_FULL_IMAGE:figures/full_fig_p019_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Claim-label extraction prompt used to construct C [PITH_FULL_IMAGE:figures/full_fig_p022_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Self-refinement prompt for claim-label pair auditing. [PITH_FULL_IMAGE:figures/full_fig_p022_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Prompting template for zero-shot classification. [PITH_FULL_IMAGE:figures/full_fig_p022_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Prompting template for zero-shot classification with access to web-search tool. [PITH_FULL_IMAGE:figures/full_fig_p022_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Prompting template for zero-shot classification with evidence-guided web-search. [PITH_FULL_IMAGE:figures/full_fig_p023_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Human validation guidelines for COMMUNITYFACT. 25 [PITH_FULL_IMAGE:figures/full_fig_p025_14.png] view at source ↗
read the original abstract

Misinformation verification increasingly occurs in public, fast-moving, and multilingual online settings, where static benchmarks provide an incomplete measure of model reliability. We introduce CommunityFact, a refreshable benchmark for misinformation detection in the wild, with three major goals: coverage, granularity, and redistributability. This release contains 15,992 standalone claims across five languages and two domains. We evaluate ten LLMs under varying inference-time capabilities, including thinking and web-search. Our results show that closed-input verification remains challenging, web access yields the largest gains, and web-enabled LLMs' source-selection policies are systematically misaligned with the sources human Community Notes raters converge on -- a gap that closes through model-specific mechanisms of retrieval expansion or pruning. We further find substantial variation across language-domain slices and across the evidence ecosystems used by web-enabled systems. Beyond evaluation, CommunityFact positions Community Notes as a training signal for claim-conditioned source suggesters that could improve factual verification on novel claims.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents CommunityFact, a refreshable benchmark with 15,992 standalone claims across five languages and two domains, designed to evaluate LLMs on misinformation detection in dynamic, multilingual, real-world settings. It reports evaluations of ten LLMs under closed-input, thinking, and web-search conditions, claiming that closed-input verification is challenging, web access yields the largest performance gains, web-enabled LLMs exhibit systematic misalignment in source-selection policies relative to sources converged upon by human Community Notes raters (with the gap closable via model-specific retrieval expansion or pruning), and that substantial variation exists across language-domain slices and evidence ecosystems. The work further positions the benchmark as a potential training signal for claim-conditioned source suggesters.

Significance. If the core empirical findings on web-access gains and source misalignment hold after addressing validation gaps, the benchmark could meaningfully advance evaluation practices for LLMs in fast-moving, multilingual misinformation contexts and open avenues for using Community Notes data in training. The dynamic and redistributable design addresses limitations of static benchmarks, and the multilingual/multi-domain coverage is a strength. However, the interpretive weight placed on human rater convergence as a reference standard without external accuracy validation limits the immediate significance of the misalignment claims.

major comments (2)
  1. [Abstract] Abstract (and implied results section): the claim that web-enabled LLMs' source-selection policies are 'systematically misaligned' with sources on which human Community Notes raters converge treats rater convergence as a reliable external reference standard for measuring misalignment. No independent validation (e.g., against external fact-checking accuracy, expert adjudication, or ground-truth labels) is described to establish that converged sources are more accurate, complete, or authoritative than LLM-retrieved alternatives; without this, the misalignment metric risks conflating deviation from crowd consensus with deviation from truth and is load-bearing for the performance-gap interpretation.
  2. [Abstract] Abstract: the reported results on closed-input verification difficulty, web-access gains, and cross-slice variation lack any description of the underlying statistical methods, confidence intervals, error analysis, or controls for claim difficulty, making it impossible to assess whether the stated patterns are robust or driven by the specific 15,992-claim sample.
minor comments (2)
  1. The manuscript provides no details on claim sourcing process, annotation guidelines, exclusion criteria, inter-rater agreement for the human Community Notes convergence, or how the benchmark ensures redistributability while remaining dynamic.
  2. No information is given on the ten LLMs evaluated (specific models, sizes, or versions), the exact prompting or retrieval setups for web-enabled conditions, or how 'thinking' and 'web-search' capabilities were operationalized.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major comment below, indicating planned revisions where appropriate.

read point-by-point responses
  1. Referee: [Abstract] Abstract (and implied results section): the claim that web-enabled LLMs' source-selection policies are 'systematically misaligned' with sources on which human Community Notes raters converge treats rater convergence as a reliable external reference standard for measuring misalignment. No independent validation (e.g., against external fact-checking accuracy, expert adjudication, or ground-truth labels) is described to establish that converged sources are more accurate, complete, or authoritative than LLM-retrieved alternatives; without this, the misalignment metric risks conflating deviation from crowd consensus with deviation from truth and is load-bearing for the performance-gap interpretation.

    Authors: The manuscript frames the source-selection analysis as a comparison against the specific sources on which human Community Notes raters converge, consistent with CommunityFact's construction from Community Notes data as a real-world human signal. We do not claim or provide evidence that these converged sources are more accurate, complete, or authoritative than LLM-selected alternatives, nor do we equate human consensus with ground truth. The reported misalignment describes a policy difference, and the performance gap refers to deviation from observed human rater behavior. We will revise the abstract and relevant sections to explicitly qualify the reference standard as human rater convergence without external accuracy validation, thereby avoiding any implication that deviation equals deviation from truth. revision: partial

  2. Referee: [Abstract] Abstract: the reported results on closed-input verification difficulty, web-access gains, and cross-slice variation lack any description of the underlying statistical methods, confidence intervals, error analysis, or controls for claim difficulty, making it impossible to assess whether the stated patterns are robust or driven by the specific 15,992-claim sample.

    Authors: We agree that the abstract does not describe the statistical methods, confidence intervals, error analysis, or controls. The full manuscript reports the core quantitative results, but to improve transparency we will revise the abstract to include a concise description of the statistical approaches (including confidence intervals and any controls for claim difficulty, language, and domain) and will ensure the results section provides explicit error analysis and robustness details. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical benchmark evaluation

full rationale

The paper introduces CommunityFact as a new dataset and reports empirical results from evaluating LLMs under different inference settings against human Community Notes annotations. No equations, derivations, fitted parameters, or self-referential definitions appear in the provided abstract or described content. The comparison of LLM source selection to human rater convergence uses an external reference standard and does not reduce to a construction from the models' own outputs or prior self-citations. The work is self-contained as a benchmark study with independent empirical content.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical model, free parameters, axioms, or invented entities are described in the abstract; the contribution is dataset construction and empirical evaluation.

pith-pipeline@v0.9.1-grok · 5718 in / 1101 out tokens · 29526 ms · 2026-06-29T07:30:51.690889+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

21 extracted references · 5 canonical work pages · 2 internal anchors

  1. [1]

    Nature Machine Intelligence, 6(8):852–863

    Factuality challenges in the era of large lan- guage models and opportunities for fact-checking. Nature Machine Intelligence, 6(8):852–863. Alberto Barrón-Cedeño, Firoj Alam, Tommaso Caselli, Giovanni Da San Martino, Tamer Elsayed, An- drea Galassi, Fatima Haouari, Federico Ruggeri, Ju- lia Maria Struß, Rabindra Nath Nandi, Gullal S. Cheema, Dilshod Azizo...

  2. [2]

    InProceedings of the 63rd Annual Meeting of the Association for Computational Lin- guistics (Volume 1: Long Papers), pages 1610–1630, Vienna, Austria

    Real-time factuality assessment from adver- sarial feedback. InProceedings of the 63rd Annual Meeting of the Association for Computational Lin- guistics (Volume 1: Long Papers), pages 1610–1630, Vienna, Austria. Association for Computational Lin- guistics. Yuwei Chuai, Gabriele Lenzini, and Nicolas Pröllochs

  3. [3]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Consensus stability of community notes on x. InProceedings of the ACM Web Conference 2026, WWW ’26, page 8885–8896, New York, NY , USA. Association for Computing Machinery. Yuwei Chuai, Haoye Tian, Nicolas Pröllochs, and Gabriele Lenzini. 2024. Did the roll-out of commu- nity notes reduce engagement with misinformation on x/twitter?Proc. ACM Hum.-Comput. ...

  4. [4]

    Pedro Henrique Martins, João Alves, Patrick Fernan- des, Nuno M

    Fact-audit: An adaptive multi-agent frame- work for dynamic fact-checking evaluation of large language models.Preprint, arXiv:2502.17924. Pedro Henrique Martins, João Alves, Patrick Fernan- des, Nuno M. Guerreiro, Ricardo Rei, Amin Fara- jian, Mateusz Klimaszewski, Duarte M. Alves, José Pombal, Nicolas Boizard, Manuel Faysse, Pierre Colombo, François Yvon...

  5. [5]

    OpenAI GPT-5 System Card

    NLP evaluation in trouble: On the need to mea- sure LLM data contamination for each benchmark. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 10776–10787, Sin- gapore. Association for Computational Linguistics. Mohammadamin Shafiei, Hamidreza Saffari, and Nafise Sadat Moosavi. 2025. MultiHoax: A dataset of multi-hop false-p...

  6. [6]

    liar, liar pants on fire

    Motivations, methods and metrics of misin- formation detection: An nlp perspective.Natural Language Processing Research, 1:1–13. Yifan Sun, Han Wang, Dongbai Li, Gang Wang, and Huan Zhang. 2025. The emperor’s new clothes in benchmarking? a rigorous examination of mitiga- tion strategies for llm benchmark data contamination. Preprint, arXiv:2503.16402. Jam...

  7. [7]

    false" when the Note refutes the Tweet claim; label as

    Tweet Claims: Extract Tweet claims only when the Note clearly addresses them. Label as "false" when the Note refutes the Tweet claim; label as "true" only when the Note clearly supports or confirms the Tweet claim. Otherwise drop

  8. [8]

    Note Claims: Label as "true" only when the Note itself makes a factual correction or factual assertion that is directly checkworthy

  9. [9]

    Do not use exact quotes

    Paraphrase: Rewrite all claims in your own words. Do not use exact quotes. Claims must assert verifiable facts about the world — never about the tweet, note, or source itself. Drop any claim whose entire content is about the credibility, accuracy, sourcing, or wording of the tweet/note rather 19 than the underlying facts

  10. [10]

    Checkworthiness Scope: Keep only substantive high quality real-world facts/events. Drop metadata/status claims whose main point is who posted something, whether an account/source is official, verified, affiliated, fake, or impersonating, or whether a link/livestream/post/page is available, ended, deleted, or reposted. Do not extract these claims even when...

  11. [11]

    Text-Only: The downstream dataset will not include attached images, videos, screenshots, audio, or other media. Drop claims that require viewing, identifying, or interpreting media, including what media shows, who appears in it, where/when it was recorded, or whether it is edited, cropped, staged, AI-generated, or mislabeled

  12. [12]

    false" = the Tweet asserted it, and the Note refutes it

    Label Precision: "false" = the Tweet asserted it, and the Note refutes it. "true" = the Note asserts it as fact. For substantive attributions, label the attribution itself: if a tweet says "X claims Y", label it "true" if the Note establishes that X made that substantive claim, even if Y is false. Do not apply this attribution rule to account/source authe...

  13. [13]

    U.S. visas

    Specificity: Every claim must name the specific people, organizations, countries, jurisdictions, dates, places, figures, metrics, and issuers needed to identify the fact. Do not drop key qualifiers; for example, write "U.S. visas" rather than "visas" when the country matters. Do not add extra details unless they are necessary to disambiguate the claim

  14. [14]

    ## Density & Balancing Rules

    Language Match: Write each claim in the language of the input text segment. ## Density & Balancing Rules

  15. [15]

    Merge closely related contextual details such as who, what, when, and where into a single, comprehensive claim

    Information Density: Do NOT hyper-fragment the text. Merge closely related contextual details such as who, what, when, and where into a single, comprehensive claim

  16. [16]

    Do not add background, explanation, evidence, rhetorical framing, or provenance unless it is essential to the fact being checked

    Compactness: Keep each claim short while still being standalone and unambiguous. Do not add background, explanation, evidence, rhetorical framing, or provenance unless it is essential to the fact being checked

  17. [17]

    Do not make true claims longer by including extra evidence, caveats, or Note explanations

    Length Neutrality: Write true and false claims at the same level of detail. Do not make true claims longer by including extra evidence, caveats, or Note explanations

  18. [18]

    false" Tweet claim and one strongest

    Claim Balance: Extract one strongest "false" Tweet claim and one strongest "true" Note claim when both are valid

  19. [19]

    No Forced Claims: If no claim satisfies all rules, return []

  20. [20]

    A true corrective claim may contradict a false Tweet claim when it adds distinct factual content

    No Duplicative Opposites: Do not output duplicate claims that merely restate the same fact with opposite labels. A true corrective claim may contradict a false Tweet claim when it adds distinct factual content

  21. [21]

    Zero-Context Formula

    Output Limit: Strictly limit your output to a few high-value, information-dense claims. ## The "Zero-Context Formula" (CRITICAL) Every claim must be 100% standalone. A reader with zero knowledge of the tweet, note, media, URL, or other claims must fully understand it and fact-check it. - No back-references to other claims: If two claims share a subject, r...