arxiv: 2512.22933 · v4 · submitted 2025-12-28 · 💻 cs.AI · cs.CL

RW-Post: Auditable Evidence-Grounded Multimodal Fact-Checking in the Wild

Pith reviewed 2026-05-16 19:25 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords multimodal fact-checking, evidence grounding, benchmark dataset, misinformation detection, vision-language models, auditable annotations, social media verification

The pith

RW-Post benchmark shows evidence-bounded evaluation improves accuracy and faithfulness in multimodal fact-checking.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents RW-Post, a dataset that pairs real social-media posts with reasoning traces and explicitly linked evidence extracted from human fact-check articles. An LLM-assisted pipeline creates auditable annotations that support three evaluation regimes: closed-book, evidence-bounded, and open-web. Experiments on this benchmark reveal that current large vision-language models frequently fail to ground their outputs faithfully in the supplied evidence. When evaluation is restricted to the provided evidence, both accuracy and faithfulness rise. The work supplies AgentFact as a baseline and demonstrates measurable headroom for future systems.
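
The three regimes can be read as three different views over the same annotated instance. The sketch below is illustrative only: the `Instance` fields, the `Regime` names, and `build_model_input` are hypothetical and are not taken from the paper or from any released RW-Post data format.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Regime(Enum):
    CLOSED_BOOK = "closed-book"
    EVIDENCE_BOUNDED = "evidence-bounded"
    OPEN_WEB = "open-web"


@dataclass
class Instance:
    # Hypothetical schema; the real RW-Post fields may differ.
    post_text: str
    image_path: str
    evidence: List[str] = field(default_factory=list)
    reasoning_trace: str = ""
    label: str = ""


def build_model_input(inst: Instance, regime: Regime) -> dict:
    """Return what a model is allowed to see under the given regime.

    The gold label and reasoning trace are never exposed.
    """
    visible = {
        "post_text": inst.post_text,
        "image_path": inst.image_path,
        "evidence": [],
        "allow_retrieval": False,
    }
    if regime is Regime.EVIDENCE_BOUNDED:
        # Only the annotated evidence items are supplied; no retrieval.
        visible["evidence"] = list(inst.evidence)
    elif regime is Regime.OPEN_WEB:
        # No evidence is supplied; the system must retrieve its own.
        visible["allow_retrieval"] = True
    return visible
```

Keeping all three views on one record is what would make cross-regime comparisons controlled: the post and image stay fixed and only the evidence access changes.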

Core claim

RW-Post supplies post-aligned instances with auditable evidence links drawn from human fact-check articles; under unified protocols, strong open-source LVLMs exhibit low faithfulness in evidence grounding, yet evidence-bounded evaluation measurably raises both accuracy and adherence to the supplied facts.

What carries the argument

RW-Post benchmark of post-aligned text-image instances whose annotations are produced by an LLM-assisted extraction-and-auditing pipeline that converts human fact-check articles into explicit reasoning traces and evidence items.

If this is right

  • Models can be diagnosed separately for visual grounding failures versus reasoning failures.
  • Evidence-bounded protocols become a practical way to measure and improve faithfulness.
  • AgentFact and similar agent baselines can be compared directly against LVLMs under the same three regimes.
  • Development of new multimodal systems can target the identified gaps in evidence utilization.
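
One way to make "faithfulness" concrete under an evidence-bounded protocol is to check whether the evidence a model cites was actually supplied to it. This is a minimal sketch under that assumption; the paper's own faithfulness metric is not specified in this summary, and `citation_faithfulness` and the evidence-ID format are hypothetical.

```python
def citation_faithfulness(cited_ids, supplied_ids):
    """Fraction of cited evidence IDs that appear in the supplied set.

    Returns None when the model cites nothing, so that an empty citation
    list is not silently scored as perfect or as zero.
    """
    cited = list(cited_ids)
    if not cited:
        return None
    supplied = set(supplied_ids)
    grounded = sum(1 for cid in cited if cid in supplied)
    return grounded / len(cited)
```

A proxy like this only measures whether citations stay inside the supplied set; it says nothing about whether the cited item actually supports the claim, which would need a separate entailment or human check.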

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same extraction pipeline could be reused to create comparable benchmarks for text-only or video-based misinformation.
  • Auditable traces may support downstream training of models that learn to cite evidence explicitly.
  • The observed headroom implies that hybrid systems combining retrieval and generation could close much of the gap.

Load-bearing premise

The LLM pipeline produces annotations that faithfully match the original human fact-check articles without introducing systematic errors or biases.

What would settle it

A manual audit of a random sample of RW-Post instances that finds frequent mismatches between the extracted evidence links and the content of the original fact-check articles would undermine the benchmark.
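
Such an audit could be sketched as follows: draw a reproducible random sample, have reviewers mark each sampled instance as matching or mismatching the source article, then report the mismatch rate with a confidence interval. The function names, the sample size, and the use of a Wilson score interval are choices made for illustration, not a procedure described in the paper.

```python
import math
import random


def draw_audit_sample(instance_ids, k, seed=0):
    """Draw up to k instance IDs uniformly at random, reproducibly."""
    ids = list(instance_ids)
    rng = random.Random(seed)
    return rng.sample(ids, min(k, len(ids)))


def wilson_interval(mismatches, n, z=1.96):
    """Wilson score interval for a mismatch proportion (z=1.96 ~ 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = mismatches / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)
```

The Wilson interval is used here because it behaves sensibly when the observed mismatch count is zero or very small, which is the outcome a healthy pipeline would be expected to produce.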

Figures

Figures reproduced from arXiv: 2512.22933 by Danni Xu, Harry Cheng, Mohan Kankanhalli, Shaojing Fan.

Figure 1: RW-Post Dataset: Use Context (purple highlight) helps LLM determine whether the link (pink highlight) is post or …
Figure 2: Examples of image annotations in the fact-checking …
Figure 3: Statistics of RW-Post Dataset. IV. METHOD a) Overview: We decompose the fact-checking problem into five independent sub-tasks and design a dedicated agent for each of them. Building on these components, we develop a fact-checking pipeline, termed the AgentFact framework, which integrates the five agents into a multi-round evidence retrieval, filtering and reasoning process to achieve high-quality fact check…
Figure 4: Proposed Multimodal Fact-checking Agents.
Figure 5: Proposed Multimodal Fact-checking workflow with …
Figure 7: From these two examples, we observe that AgentFact is capable of producing coherent reasoning and structured key points supported by multimodal evidence from generally reliable sources. However, a closer comparison with the ground truth reasoning reveals several notable shortcomings. A common failure pattern highlighted by these cases is the model’s inability to retrieve accurate image contextual evidenc…
Figure 7: Case study of a correctly classified claim.
Original abstract

Multimodal misinformation increasingly leverages visual persuasion, where repurposed or manipulated images strengthen misleading text. We introduce RW-Post, a post-aligned text-image benchmark for real-world multimodal fact-checking with auditable annotations: each instance links the original social-media post with reasoning traces and explicitly linked evidence items derived from human fact-check articles via an LLM-assisted extraction-and-auditing pipeline. RW-Post supports controlled evaluation across closed-book, evidence-bounded, and open-web regimes, enabling systematic diagnosis of visual grounding and evidence utilization. We provide AgentFact as a reference verification baseline and benchmark strong open-source LVLMs under unified protocols. Experiments show substantial headroom: current models struggle with faithful evidence grounding, while evidence-bounded evaluation improves both accuracy and faithfulness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces RW-Post, a post-aligned text-image benchmark for real-world multimodal fact-checking. Each instance links original social-media posts to reasoning traces and explicitly linked evidence items extracted from human fact-check articles via an LLM-assisted extraction-and-auditing pipeline. The benchmark supports controlled evaluation in closed-book, evidence-bounded, and open-web regimes. It provides AgentFact as a reference baseline and evaluates strong open-source LVLMs, claiming that current models struggle with faithful evidence grounding while evidence-bounded evaluation improves both accuracy and faithfulness.

Significance. If the pipeline produces faithful annotations, RW-Post would offer a valuable, auditable resource for diagnosing visual grounding failures and evidence utilization in multimodal models. The controlled regimes and post-alignment are strengths that could enable reproducible progress on a timely problem. However, the absence of reported validation for the core annotation process limits the immediate impact of the experimental claims.

major comments (2)
  1. [Benchmark Construction] Benchmark construction section: the LLM-assisted extraction-and-auditing pipeline is presented as producing accurate, auditable annotations, yet no quantitative validation (human agreement rates, error analysis, or bias checks) is reported. This is load-bearing for the central claim that evidence-bounded evaluation improves faithfulness, because any systematic extraction errors would make measured improvements reflect pipeline artifacts rather than model capability.
  2. [Experiments] Experiments section: the abstract and high-level findings assert substantial headroom and regime-specific improvements, but the provided text supplies no quantitative results, dataset statistics, or error breakdowns. Without these, the diagnosis of model struggles with evidence grounding cannot be directly assessed or reproduced.
minor comments (1)
  1. [Abstract] The abstract states high-level experimental findings without any numerical values or dataset sizes; adding a brief summary table of key metrics would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for recognizing the potential value of RW-Post for diagnosing visual grounding and evidence utilization issues. We address each major comment below and will incorporate the suggested additions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Benchmark Construction] Benchmark construction section: the LLM-assisted extraction-and-auditing pipeline is presented as producing accurate, auditable annotations, yet no quantitative validation (human agreement rates, error analysis, or bias checks) is reported. This is load-bearing for the central claim that evidence-bounded evaluation improves faithfulness, because any systematic extraction errors would make measured improvements reflect pipeline artifacts rather than model capability.

    Authors: We agree that quantitative validation of the annotation pipeline is essential to substantiate the claims. The current manuscript describes the LLM-assisted extraction-and-auditing pipeline and its auditable design but does not report human agreement rates, error analysis, or bias checks. In the revised version we will add a dedicated subsection reporting results from a human audit of a sampled subset of annotations, including inter-annotator agreement rates, categorized error types, and bias checks. This will directly address the concern that measured improvements could reflect pipeline artifacts. revision: yes

  2. Referee: [Experiments] Experiments section: the abstract and high-level findings assert substantial headroom and regime-specific improvements, but the provided text supplies no quantitative results, dataset statistics, or error breakdowns. Without these, the diagnosis of model struggles with evidence grounding cannot be directly assessed or reproduced.

    Authors: We agree that the experiments section must supply explicit quantitative results, dataset statistics, and error breakdowns to support the claims and enable reproduction. The manuscript currently presents high-level findings without sufficient detail in the main text. In the revision we will expand the experiments section to include dataset statistics (instance counts, regime distributions), full quantitative accuracy and faithfulness metrics across models and regimes, and error breakdowns. These additions will make the diagnosis of evidence-grounding struggles directly assessable and reproducible. revision: yes
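
The inter-annotator agreement promised in the first response is commonly reported as a chance-corrected statistic. The following is a minimal sketch of Cohen's kappa for two annotators; the choice of kappa and the label names in the usage note are assumptions for illustration, since the rebuttal does not say which agreement measure will be used.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both annotators labelled independently
    # according to their own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)
```

For example, with hypothetical audit labels `["ok", "ok", "err", "err"]` and `["ok", "ok", "err", "ok"]`, observed agreement is 0.75 and expected agreement is 0.5, giving a kappa of 0.5.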

Circularity Check

0 steps flagged

No circularity detected in derivation chain

Full rationale

The paper constructs the RW-Post benchmark directly from external human fact-check articles via an LLM-assisted extraction pipeline, then reports model performance across evaluation regimes. No equations, parameters, or central claims reduce by construction to fitted inputs, self-definitions, or self-citation chains; the annotations are presented as derived from independent sources, and the experimental diagnosis of headroom follows from direct measurement on this externally sourced dataset rather than any internal renaming or forced prediction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution rests on the assumption that the extraction pipeline yields faithful annotations; no free parameters or invented entities are described.

axioms (1)
  • domain assumption: LLM-assisted extraction from human fact-check articles produces accurate and auditable reasoning traces and evidence links.
    This assumption underpins the validity of the RW-Post annotations and the evaluation regimes.

pith-pipeline@v0.9.0 · 5432 in / 1118 out tokens · 23652 ms · 2026-05-16T19:25:45.050078+00:00 · methodology


