
arxiv: 2603.22453 · v2 · pith:DBLGS64Cnew · submitted 2026-03-23 · 💻 cs.CL · cs.SI

XNote: Benchmarking Automated Community Notes Generation for Image-based Contextual Deception

Pith reviewed 2026-05-21 10:34 UTC · model grok-4.3

classification 💻 cs.CL cs.SI
keywords community notes · contextual deception · image deception · automated note generation · benchmarking · large vision language models · social media

The pith

Researchers create the XNote dataset to benchmark automated generation of Community Notes for posts with authentic images but misleading contexts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to fill the gap in datasets for training and testing systems that automatically write Community Notes to counter image-based contextual deception on social media. Community Notes work by providing missing context to users, but relying on humans limits their speed and scale. By assembling XNote from real X posts that have human notes, adding annotations on topics and deception types, and then testing large vision language models along with other systems on generating similar notes, the authors demonstrate current performance levels and difficulties. This matters to readers because effective automation could allow corrections to reach users much sooner and on a larger number of deceptive posts.

Core claim

The authors curate the XNote dataset from X posts with associated Community Notes and external contexts, annotate it with topics and deceptive factors, and benchmark a range of frontier large vision language models on both deception detection and note generation. The results show how difficult it is to produce concise, grounded notes that help users recover the missing or corrected context, and they point to the need for improved methods and metrics.

What carries the argument

The XNote dataset of real-world X posts paired with human Community Notes and new annotations for topics and deceptive factors, which enables evaluation of automated systems on generating helpful corrective notes rather than binary deception labels.
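
As a rough illustration of what one such paired record might look like, here is a minimal Python sketch. The class name `NoteRecord`, every field name, and the example values are hypothetical, chosen only to mirror the components listed above (post, image, Community Note, external context, topic, deceptive factors). The dataset's actual schema is not given in this summary.

```python
from dataclasses import dataclass, field
from typing import List

# Factor names follow the examples in the abstract (time, entity, event).
# The dataset's full label set is an unknown here, so this set is an assumption.
KNOWN_FACTORS = {"time", "entity", "event"}


@dataclass
class NoteRecord:
    """Hypothetical container for one post and its human-written note."""
    post_text: str
    image_path: str
    post_date: str
    community_note: str
    external_context: List[str] = field(default_factory=list)
    topic: str = ""
    deceptive_factors: List[str] = field(default_factory=list)

    def unknown_factors(self) -> List[str]:
        """Return any factor labels outside the assumed label set."""
        return [f for f in self.deceptive_factors if f not in KNOWN_FACTORS]


# Invented example values, not taken from the dataset.
record = NoteRecord(
    post_text="Flooding downtown right now!",
    image_path="images/example.jpg",
    post_date="2024-06-01",
    community_note="This photo is from an earlier flood, not the current event.",
    external_context=["Reverse image search result describing the earlier flood."],
    topic="weather",
    deceptive_factors=["time"],
)
```

Keeping the human note and the external context in the same record is what would let a benchmark score a generated note against both the reference wording and the supporting evidence.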

If this is right

  • Evaluation moves beyond binary true or false detection to assess whether generated notes recover the specific missing context.
  • Frontier models exhibit limitations in creating concise, grounded Community Notes for these cases.
  • Both specialized systems and commercial tools require advancements to handle this task effectively.
  • New metrics and methods tailored to note generation will be necessary to make progress.
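
To see why overlap-style scoring may fall short for the first and last points, consider a minimal sketch of a unigram-overlap F1 score, similar in spirit to ROUGE-1. This is an illustrative assumption, not the metric the paper reports; the function name and both example notes are made up.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1 between a generated note and a reference note."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    # Clipped overlap: each reference token can be matched at most once.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical notes, written for illustration only.
reference = "this photo was taken in 2015 not during the 2024 flood"
wrong_note = "this photo was taken in 2024 not during the 2015 flood"
score = unigram_f1(wrong_note, reference)
```

Here the candidate swaps the two years, which reverses the correction, yet it contains exactly the same words as the reference, so the score is a perfect 1.0. A metric tailored to note generation would need to penalize this kind of error.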

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the benchmark proves useful, social platforms could deploy similar AI systems to assist or scale up Community Notes production.
  • Collecting more data in this format could allow training models specifically for context recovery in deceptive posts.
  • Connections to other misinformation correction tasks, such as fact-checking, may benefit from similar grounded generation approaches.

Load-bearing premise

The selected X posts and the added annotations for topics and deceptive factors accurately reflect typical cases of image-based contextual deception and provide dependable ground truth for assessing automated note generation.

What would settle it

If future models achieve high agreement with human Community Notes on a diverse set of new posts, as judged by independent raters on helpfulness and accuracy in correcting the context, this would indicate that the challenges highlighted can be overcome with current or near-term techniques.

Figures

Figures reproduced from arXiv: 2603.22453 by Ethan Anderson, Feng Luo, Jingwen Yan, Jinkyung Katie Park, Jin Ma, Long Cheng, Mohammed Aldeen, Taran Kavuru.

Figure 1. (a) Workflow by which a Community Note becomes … [caption truncated in source]
Figure 2. XCHECK dataset collection pipeline.
Figure 3. Example data entry in XCHECK, with image, structured post metadata and external context.
Figure 5. Number of posts trend over time for top-5 topics.
Figure 6. Source URLs analysis in XCHECK.
Figure 7. System design of the proposed ACCNOTE framework.
Figure 8. The original post for Example 2, with post text … [caption truncated in source]
Figure 9. One example post with three different notes used in user study. Method names are anonymized in the survey.
Figure 10. Statistical results from the user study. Bars indicate … [caption truncated in source]
Figure 11. The original post for Example 3.
Original abstract

Community Notes have emerged as an effective crowd-sourced mechanism for combating online deception on social media platforms. However, its reliance on human contributors limits both the timeliness and scalability. In this work, we study the automated Community Notes generation task for image-based contextual deception, where an authentic image is paired with misleading context (e.g., time, entity, and event). Unlike prior work that primarily focuses on deception detection (i.e., judging whether a post is true or false in a binary manner), automated Community Notes generation requires producing concise and grounded notes that help users recover the missing or corrected context. This problem remains underexplored due to the scarcity of datasets that support this task. To address this gap, we curate a real-world dataset, XNote, comprising X posts with associated Community Notes and external contexts, along with annotations of topics and deceptive factors. We further benchmark a range of frontier large vision language models (LVLMs) on XNote, evaluating their performance on both deception detection and note generation tasks. We also compare against an end-to-end approach, SNIFFER, and a commercial tool, GPT-5. Our results highlight the challenges in automated Community Notes generation, underscoring the need for improved methods and metrics tailored for this task.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces the XNote dataset, curated from real X posts with associated Community Notes and external contexts, augmented with annotations for topics and deceptive factors (time/entity/event mismatches). It benchmarks frontier LVLMs on deception detection and automated generation of concise, grounded Community Notes to help users recover missing context, with comparisons to SNIFFER and GPT-5, highlighting challenges in this underexplored task.

Significance. If the annotations are shown to be reliable, the dataset and benchmark could meaningfully advance research on scalable, automated support for Community Notes by shifting focus from binary deception detection to contextual correction. The real-world sourcing from X posts and inclusion of existing notes provide a practical foundation for evaluating LVLM performance on image-based misinformation.

major comments (2)
  1. [Abstract] Abstract and setup: no details are provided on evaluation metrics for note generation, inter-annotator agreement, data splits, or the protocol used to judge note quality. These omissions are load-bearing for the central claim that XNote enables reliable benchmarking of 'concise and grounded' notes.
  2. [Dataset Curation] Dataset curation: the new annotations of deceptive factors (time/entity/event mismatches) lack any reported validation, consistency checks, or external corroboration. Without this, model performance numbers on note generation risk being driven by annotation noise rather than genuine ability to recover context.
minor comments (1)
  1. [Experiments] Clarify whether note quality evaluation relies on automated metrics, human raters, or both, and report any agreement statistics.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The comments correctly identify areas where additional transparency is needed to support the reliability of the XNote benchmark. We address each major comment below and will incorporate the requested clarifications in the revised manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract and setup: no details are provided on evaluation metrics for note generation, inter-annotator agreement, data splits, or the protocol used to judge note quality. These omissions are load-bearing for the central claim that XNote enables reliable benchmarking of 'concise and grounded' notes.

    Authors: We agree that these elements were not described in the abstract or setup and that their absence weakens the central benchmarking claim. In the revision we will expand the abstract with a concise statement of the evaluation approach and add an explicit subsection (new Section 3.5) that reports the metrics used for note generation, inter-annotator agreement statistics for the annotations, the train/validation/test splits, and the human evaluation protocol for assessing conciseness and groundedness.
    Revision: yes

  2. Referee: [Dataset Curation] Dataset curation: the new annotations of deceptive factors (time/entity/event mismatches) lack any reported validation, consistency checks, or external corroboration. Without this, model performance numbers on note generation risk being driven by annotation noise rather than genuine ability to recover context.

    Authors: We acknowledge that the current manuscript does not report validation or consistency checks for the deceptive-factor annotations. We will add a dedicated paragraph in Section 3 describing the annotation guidelines, the number of annotators, the procedure for resolving disagreements, and the resulting inter-annotator agreement. We will also include a small set of annotated examples to permit external scrutiny. If time permits, we will attempt a limited external corroboration step; otherwise, we will provide the internal validation details to reduce the risk that annotation noise affects the reported results.
    Revision: yes
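
For readers unfamiliar with agreement statistics, the sketch below computes Cohen's kappa for two annotators. Which statistic the authors will report is not stated, so treating it as Cohen's kappa is an assumption, and the labels are invented for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical deceptive-factor labels for six posts (not from the dataset).
annotator_1 = ["time", "entity", "event", "time", "entity", "time"]
annotator_2 = ["time", "entity", "time", "time", "event", "time"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

On these six invented items the annotators agree on four, and kappa works out to 3/7 (about 0.43) once chance agreement is subtracted.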

Circularity Check

0 steps flagged

No circularity: empirical benchmark on externally sourced dataset

Full rationale

The paper curates the XNote dataset from real X posts paired with existing Community Notes and external contexts, then adds topic and deceptive-factor annotations to benchmark LVLM performance on detection and note generation. No equations, fitted parameters, or model predictions appear in the described chain. Evaluation metrics are computed against human-provided annotations and compared to independent baselines (SNIFFER, GPT-5), so results do not reduce to quantities defined by the authors' own prior fits or self-citations. The work is therefore self-contained against external benchmarks and exhibits no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that existing human-written Community Notes and the authors' added annotations of topics and deceptive factors provide a faithful proxy for real-world deception cases; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption: Community Notes have emerged as an effective crowd-sourced mechanism for combating online deception
    Opening sentence of the abstract; used to motivate the automation task.
  • domain assumption: The curated XNote dataset accurately captures real-world image-based contextual deception
    Implicit in the decision to benchmark on this dataset as ground truth.

pith-pipeline@v0.9.0 · 5776 in / 1426 out tokens · 46175 ms · 2026-05-21T10:34:32.168185+00:00 · methodology

