The Warrant Gap: Claim-Conditioned Re-scoring for Fact-Checking

Arka Ujjal Dey; John Collomosse

arxiv: 2606.24627 · v1 · pith:CF3NAZNMnew · submitted 2026-06-23 · 💻 cs.CL

Pith reviewed 2026-06-25 23:40 UTC · model grok-4.3

classification 💻 cs.CL

keywords fact-checking · LLM · evidence re-scoring · warrant gap · NLI · decomposition · SIFT · WSP

The pith

Claim-conditioned re-scoring of evidence recovers accuracy lost in structured fact-checking.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Fact-checking systems built on large language models frequently output Supports labels even when the cited evidence does not entail the claim. The paper introduces SIFT, which re-scores extracted evidence spans against the complete claim rather than against the isolated facets produced by decomposition. It also defines WSP (Warranted Supports Proportion), an automatic natural language inference check on whether the cited warrant entails the claim. In evaluations on FEVER, SciFact, 5PILS, and DP with four open-source models, SIFT recovers accuracy on cells where naive decomposition costs up to 27.6 points, and it raises WSP above direct prompting. WSP itself calibrates against human gold evidence at AUC 0.92 and precision 0.98.

Core claim

SIFT recovers accuracy on cells where naive decomposition costs up to 27.6 points, while raising WSP above direct prompting; WSP itself calibrates against human gold evidence at AUC 0.92 and precision 0.98.

What carries the argument

SIFT, claim-conditioned re-scoring of extracted evidence spans against the full claim, paired with WSP, an automatic NLI check that the cited warrant entails the claim.
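
To make the metric concrete, here is a minimal sketch of how a proportion like WSP could be computed. It is not the authors' implementation: the `Prediction` record, the `toy_entails` token-containment check, and the choice to return 0.0 when no Supports verdicts exist are all assumptions made for illustration. A real version would replace `toy_entails` with a call to an NLI classifier.

```python
from typing import Callable, Iterable, NamedTuple


class Prediction(NamedTuple):
    claim: str    # the full claim being checked
    verdict: str  # e.g. "Supports", "Refutes", "NEI"
    warrant: str  # evidence span the system cited for its verdict


def toy_entails(premise: str, hypothesis: str) -> bool:
    """Hypothetical stand-in for an NLI model: every hypothesis token
    must appear in the premise."""
    premise_tokens = set(premise.lower().split())
    return all(tok in premise_tokens for tok in hypothesis.lower().split())


def warranted_supports_proportion(
    predictions: Iterable[Prediction],
    entails: Callable[[str, str], bool] = toy_entails,
) -> float:
    """Share of Supports verdicts whose cited warrant entails the full claim."""
    supports = [p for p in predictions if p.verdict == "Supports"]
    if not supports:
        return 0.0  # assumption: treat the empty case as 0
    warranted = sum(entails(p.warrant, p.claim) for p in supports)
    return warranted / len(supports)


preds = [
    Prediction("paris is in france", "Supports", "paris is a city in france"),
    Prediction("paris is the capital of france", "Supports", "paris is in france"),
    Prediction("berlin is in spain", "Refutes", "berlin is in germany"),
]
print(warranted_supports_proportion(preds))  # 0.5
```

The second prediction is the warrant-gap pattern in miniature: the verdict is Supports, but the cited span says nothing about Paris being the capital, so it does not count toward the proportion.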

If this is right

Structured decomposition protocols can preserve full-claim context through targeted re-scoring without sacrificing inspection of individual warrants.
WSP offers an automatic, high-precision substitute for human review of whether cited evidence supports a claim.
Performance improvements from SIFT hold across multiple fact-checking benchmarks and different open-source LLM backbones.
Direct prompting yields lower warrant quality than the re-scoring approach on the same models and datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar conditioning on the original claim could reduce grounding failures in other LLM tasks that rely on step-by-step decomposition.
Evidence extraction pipelines may benefit from always retaining the full claim as context rather than discarding it after initial parsing.
The approach suggests a general pattern for closing warrant gaps in any system that breaks complex claims into sub-questions.

Load-bearing premise

That the NLI model used to compute WSP is a faithful proxy for whether extracted evidence actually entails the full claim, and that performance on the chosen benchmarks generalizes to real-world fact-checking distributions.

What would settle it

A new human evaluation set of evidence-claim pairs where WSP scores show low correlation with entailment judgments, or a decomposition-based fact-checking task where SIFT fails to recover the reported accuracy gains.

Figures

Figures reproduced from arXiv: 2606.24627 by Arka Ujjal Dey, John Collomosse.

**Figure 2.** SIFT with claim-conditioned re-scoring.

**Figure 3.** Residual error shift by model. Each dataset contains four model columns (Q4=Qwen3-4B, L8=Llama3.1-8B, G9=Gemma2-9B, Q14=Qwen3-14B). Within each column, S (left) is the SIFT baseline and B (right) is Design B. Segments encode the five residual error types (bottom to top): IDENTITY, PREDICATE, AGGREGATION, UNDERCUTTER, and UNABLE. Aggregation reductions are strongest for the smaller backbones an…

Original abstract

Fact-checking systems built on LLMs achieve high verdict accuracy on standard benchmarks, yet routinely output Supports labels whose cited evidence does not license the claim. Structured decomposition is the natural way to inspect those warrants, but rigid extraction protocols strip the full-claim context that facets need. We introduce SIFT -- claim-conditioned re-scoring of extracted evidence spans against the full claim -- paired with WSP (Warranted Supports Proportion), an automatic NLI check that the cited warrant entails the claim. We evaluate on FEVER, SciFact, 5PILS, and DP across four open-source backbones. SIFT recovers accuracy on cells where naive decomposition costs up to 27.6 points, while raising WSP above direct prompting; WSP itself calibrates against human gold evidence at AUC 0.92 and precision 0.98.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SIFT and WSP target the evidence-claim mismatch in LLM fact-checking with clear reported gains, though the NLI proxy for WSP remains the main untested piece.

The core idea here is straightforward: LLMs often output a Supports verdict whose cited evidence does not actually entail the full claim. SIFT re-scores extracted spans against the complete claim instead of using rigid decomposition, and WSP adds an automatic NLI check to measure how much of the evidence is warranted. That pairing is what the paper contributes.
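
The contrast between facet-level and claim-level scoring can be sketched as follows. This is an illustration under stated assumptions, not the paper's method: `toy_score` is a hypothetical token-coverage scorer standing in for whatever model SIFT actually uses, and the `(facet, span)` pair format is invented for the example.

```python
from typing import Callable, List, Tuple


def toy_score(text: str, span: str) -> float:
    """Hypothetical scorer: fraction of `text` tokens that occur in `span`."""
    text_tokens = text.lower().split()
    span_tokens = set(span.lower().split())
    return sum(tok in span_tokens for tok in text_tokens) / len(text_tokens)


def rescore_against_claim(
    claim: str,
    facet_spans: List[Tuple[str, str]],  # (facet, extracted span) pairs
    score: Callable[[str, str], float] = toy_score,
) -> List[Tuple[str, float, float]]:
    """Return (span, facet_score, claim_score) rows, best claim_score first."""
    rows = [(span, score(facet, span), score(claim, span)) for facet, span in facet_spans]
    return sorted(rows, key=lambda row: row[2], reverse=True)


claim = "marie curie won two nobel prizes"
facet_spans = [
    ("marie curie won", "marie curie won a nobel prize in 1903"),
    ("two nobel prizes", "marie curie won two nobel prizes in physics and chemistry"),
]
for span, facet_score, claim_score in rescore_against_claim(claim, facet_spans):
    print(f"claim={claim_score:.2f} facet={facet_score:.2f}  {span}")
```

Both spans score 1.00 against their own facet, yet only the second covers the full claim. Scoring against the facet alone hides that difference, which is the loss of full-claim context the letter describes.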

The results look useful on the numbers given. SIFT recovers up to 27.6 accuracy points lost to naive decomposition across the four backbones and four benchmarks, and WSP exceeds direct prompting while hitting AUC 0.92 and precision 0.98 against human gold labels. The evaluation uses external benchmarks and human labels rather than self-referential fitting, so there is no obvious circularity.

The soft spot is the NLI model that powers WSP. The calibration numbers are reported, but the abstract supplies no architecture, training data, or check on whether the NLI was exposed to the same claim distributions as the fact-checkers. If NLI errors line up with the warrant-gap cases the method is meant to fix, both the accuracy recovery and the automatic metric become less reliable. That assumption is load-bearing and not yet stress-tested in the provided details.

This is for NLP researchers who build or audit LLM fact-checkers and want a concrete way to tighten evidence quality. A reader working on misinformation pipelines would get practical value from the re-scoring approach and the benchmark numbers. The central argument holds up on what is shown, even if more ablations on the NLI component would help.

I would send it to peer review. The problem is real and the fix is simple enough to be worth referee scrutiny.

Referee Report

2 major / 1 minor

Summary. The paper claims that LLM-based fact-checking systems suffer from a 'warrant gap' in which cited evidence spans fail to entail the full claim. It introduces SIFT (claim-conditioned re-scoring of extracted spans) paired with WSP (Warranted Supports Proportion), an automatic NLI-based check that the warrant entails the claim. On FEVER, SciFact, 5PILS and DP with four open-source backbones, SIFT recovers up to 27.6 accuracy points lost under naive decomposition while raising WSP above direct prompting; WSP itself is reported to calibrate against human gold at AUC 0.92 and precision 0.98.

Significance. If the central results hold, the work supplies a practical mechanism for closing the warrant gap and an automatic metric that could reduce reliance on human evidence annotation. The multi-benchmark, multi-backbone evaluation is a positive feature.

major comments (2)

[Abstract and §4] Abstract and evaluation sections: the reported accuracy recovery (up to 27.6 points) and WSP improvements are presented without ablation studies, error analysis, or per-backbone breakdowns. It is therefore impossible to determine whether the gains are attributable to the claim-conditioned re-scoring itself or to other experimental choices.
[WSP definition] WSP definition and calibration paragraph: the claim that WSP provides a faithful proxy for full-claim entailment rests on an NLI model whose architecture, training regime, and exposure to the target claim distributions are not described. Although aggregate calibration (AUC 0.92, precision 0.98) against human gold is stated, no analysis is given of whether NLI errors correlate with the warrant-gap cases the method targets.
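
For readers who want to see what the two calibration figures measure, here is a small self-contained sketch of ROC AUC and precision computed against binary human labels. The scores, labels, and threshold are made up for illustration and have no connection to the paper's data; the function names are hypothetical.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random positive outscores a random negative
    (ties count one half): the rank-based form of ROC AUC."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_at(scores: Sequence[float], labels: Sequence[int], threshold: float) -> float:
    """Fraction of items scored at or above `threshold` that are true positives."""
    predicted = [y for s, y in zip(scores, labels) if s >= threshold]
    if not predicted:
        return 0.0  # assumption: treat "nothing predicted positive" as 0
    return sum(predicted) / len(predicted)


# Made-up entailment scores and human gold labels (1 = warrant entails claim).
nli_scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.20]
human_gold = [1, 1, 0, 1, 0, 0]
print(round(roc_auc(nli_scores, human_gold), 3))           # 0.889
print(precision_at(nli_scores, human_gold, threshold=0.85))  # 1.0
```

Both numbers are aggregates over the whole labeled set. A high AUC and high precision can coexist with errors concentrated in one subgroup, which is why the comment asks for an error-correlation analysis on top of them.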

minor comments (1)

The four open-source backbones are referenced but not named in the abstract or evaluation summary.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.

Referee: [Abstract and §4] Abstract and evaluation sections: the reported accuracy recovery (up to 27.6 points) and WSP improvements are presented without ablation studies, error analysis, or per-backbone breakdowns. It is therefore impossible to determine whether the gains are attributable to the claim-conditioned re-scoring itself or to other experimental choices.

Authors: We agree that the current presentation reports aggregate results across backbones without isolating the contribution of SIFT via ablations or providing per-backbone breakdowns and error analysis. In the revised manuscript we will expand §4 with (i) an ablation removing the claim-conditioned re-scoring step, (ii) per-backbone accuracy and WSP tables, and (iii) a qualitative error analysis of cases where SIFT recovers or fails to recover accuracy. These additions will make the source of the reported gains transparent.
Revision: yes
Referee: [WSP definition] WSP definition and calibration paragraph: the claim that WSP provides a faithful proxy for full-claim entailment rests on an NLI model whose architecture, training regime, and exposure to the target claim distributions are not described. Although aggregate calibration (AUC 0.92, precision 0.98) against human gold is stated, no analysis is given of whether NLI errors correlate with the warrant-gap cases the method targets.

Authors: We acknowledge that the NLI model underlying WSP is not fully specified and that no error-correlation analysis is provided. In the revision we will add a dedicated paragraph describing the NLI model (architecture, training data, and any overlap with the evaluation claim distributions) together with a breakdown of NLI errors on the human-gold subset, explicitly checking whether misclassifications align with warrant-gap instances. This will strengthen the justification for using WSP as a proxy.
Revision: yes
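
The error breakdown promised in the second response could take a shape like the sketch below. Everything in it is hypothetical: the `Judged` record, the `warrant_gap` flag (assumed here to be supplied by an annotator rather than derived from the other fields), and the toy data. It only shows the comparison being asked for, namely the NLI disagreement rate inside versus outside the warrant-gap group.

```python
from typing import Dict, List, NamedTuple


class Judged(NamedTuple):
    nli_entails: bool    # NLI decision on (warrant, claim)
    human_entails: bool  # human gold decision on the same pair
    warrant_gap: bool    # assumed annotation: case flagged as a warrant-gap instance


def nli_error_rate_by_group(items: List[Judged]) -> Dict[bool, float]:
    """NLI disagreement rate with human gold, split by the warrant-gap flag."""
    rates: Dict[bool, float] = {}
    for flag in (True, False):
        group = [it for it in items if it.warrant_gap == flag]
        if group:  # skip empty groups instead of dividing by zero
            errors = sum(it.nli_entails != it.human_entails for it in group)
            rates[flag] = errors / len(group)
    return rates


items = [
    Judged(True, True, False),
    Judged(True, True, False),
    Judged(False, True, False),  # NLI miss outside the gap group
    Judged(True, True, False),
    Judged(True, False, True),   # NLI false positive on a gap case
    Judged(False, False, True),
]
print(nli_error_rate_by_group(items))  # {True: 0.5, False: 0.25}
```

If the rate in the flagged group were markedly higher on real data, that would point to the failure mode the referee describes; with samples this small a significance test would also be needed before drawing any conclusion.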

Circularity Check

0 steps flagged

No circularity: claims rest on external benchmarks and human gold labels

The provided abstract and context describe SIFT and WSP evaluated on FEVER, SciFact, 5PILS, and DP benchmarks, with WSP calibrated directly against human gold evidence (AUC 0.92, precision 0.98). No equations, self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the text. The central results are presented as empirical outcomes on independent test sets rather than reductions to the method's own inputs or prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review yields minimal ledger; relies on standard NLI assumptions and benchmark validity.

axioms (1)

Domain assumption: NLI models reliably detect entailment between evidence spans and full claims.
WSP metric depends on this for automatic warrant checking.

Showing first 80 references.

[1] [1]

arXiv preprint arXiv:2505.09388 , year=

Qwen3 technical report , author=. arXiv preprint arXiv:2505.09388 , year=

Pith/arXiv arXiv

[2] [4]

Psychometrika , volume=

Note on the sampling error of the difference between correlated proportions or percentages , author=. Psychometrika , volume=. 1947 , publisher=

1947

[3] [5]

Journal of the Royal statistical society: series B (Methodological) , volume=

Controlling the false discovery rate: a practical and powerful approach to multiple testing , author=. Journal of the Royal statistical society: series B (Methodological) , volume=. 1995 , publisher=

1995

[4] [7]

Spearman , journal =

C. Spearman , journal =. The Proof and Measurement of Association between Two Things , urldate =

[5] [8]

Mathematical contributions to the theory of evolution.—VII

I. Mathematical contributions to the theory of evolution.—VII. On the correlation of characters not quantitatively measurable , author=. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character , volume=. 1900 , publisher=

1900

[6] [9]

Towards Debiasing Fact Verification Models

Schuster, Tal and Shah, Darsh and Yeo, Yun Jie Serene and Roberto Filizzola Ortiz, Daniel and Santus, Enrico and Barzilay, Regina. Towards Debiasing Fact Verification Models. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019....

work page doi:10.18653/v1/d19-1341 2019

[7] [16]

arXiv preprint arXiv:2311.01453 , year=

Ppi++: Efficient prediction-powered inference , author=. arXiv preprint arXiv:2311.01453 , year=

Pith/arXiv arXiv

[8] [26]

2026 , eprint=

Debating the Unspoken: Role-Anchored Multi-Agent Reasoning for Half-Truth Detection , author=. 2026 , eprint=

2026

[9] [27]

Advances in Neural Information Processing Systems , volume=

Averitec: A dataset for real-world claim verification with evidence from the web , author=. Advances in Neural Information Processing Systems , volume=

[10] [28]

Proceedings of the Fifth Workshop on Scholarly Document Processing (SDP 2025) , pages=

From RAG to Reality: Coarse-Grained Hallucination Detection via NLI Fine-Tuning , author=. Proceedings of the Fifth Workshop on Scholarly Document Processing (SDP 2025) , pages=

2025

[11] [29]

arXiv preprint arXiv:2506.13342 , year=

Verifying the verifiers: Unveiling pitfalls and potentials in fact verifiers , author=. arXiv preprint arXiv:2506.13342 , year=

arXiv

[12] [30]

Towards Understanding Sycophancy in Language Models

Sharma, Mrinank and Tong, Meg and Korbak, Tomek and Duvenaud, David and Askell, Amanda and Bowman, Sam and Durmus, Esin and Hatfield-Dodds, Zac and Johnston, Scott and Kravec, Shauna and Maxwell, Timothy and McCandlish, Sam and Ndousse, Kamal and Rausch, Oliver and Schiefer, Nicholas and Yan, Da and Zhang, Miranda and Perez, Ethan. Towards U...

[13] [34]

The uses of argument

The uses of argument. 2003.

2003

[14] [36]

Wei, Jerry and Yang, Chengrun and Song, Xinying and Lu, Yifeng and Hu, Nathan and Huang, Jie and Tran, Dustin and Peng, Daiyi and Liu, Ruibo and Huang, Da and Du, Cosmo and Le, Quoc V. Proceedings of the 38th International Conference on Neural Information Processing Systems, 2024.

2024

[15] [37]

Fact-Checking With Contextual Narratives: Leveraging Retrieval-Augmented LLMs for Social Media Analysis

Dey, Arka Ujjal and Awan, Muhammad Junaid and Channing, Georgia and Witt, Christian Schroeder de and Collomosse, John. Fact-Checking With Contextual Narratives: Leveraging Retrieval-Augmented LLMs for Social Media Analysis.

[16] [38]

TrumorGPT: Graph-Based Retrieval-Augmented Large Language Model for Fact-Checking

TrumorGPT: Graph-Based Retrieval-Augmented Large Language Model for Fact-Checking. IEEE Transactions on Artificial Intelligence.

[17] [39]

GraphCheck: Multipath Fact-Checking with Entity-Relationship Graphs

GraphCheck: Multipath Fact-Checking with Entity-Relationship Graphs. Findings of the Association for Computational Linguistics: EMNLP 2025.

2025

[18] [40]

Fact Verification on Knowledge Graph via Programmatic Graph Reasoning

Fact Verification on Knowledge Graph via Programmatic Graph Reasoning. Findings of the Association for Computational Linguistics: EMNLP 2025.

2025

[19] [41]

AMIR: An Automated Misinformation Rebuttal System--A COVID-19 Vaccination Datasets-Based Exposition

AMIR: An Automated Misinformation Rebuttal System--A COVID-19 Vaccination Datasets-Based Exposition. IEEE Transactions on Computational Social Systems.

[20] [42]

Believe in artificial intelligence? A user study on the ChatGPT’s fake information impact

Believe in artificial intelligence? A user study on the ChatGPT’s fake information impact. IEEE Transactions on Computational Social Systems.

[21] [43]

Integrating social explanations into explainable artificial intelligence (XAI) for combating misinformation: Vision and challenges

Integrating social explanations into explainable artificial intelligence (XAI) for combating misinformation: Vision and challenges. IEEE Transactions on Computational Social Systems, 2024.

2024

[22] [44]

Detecting and mitigating the dissemination of fake news: Challenges and future research opportunities

Detecting and mitigating the dissemination of fake news: Challenges and future research opportunities. IEEE Transactions on Computational Social Systems, 2022.

2022

[23] [45]

The spread of true and false news online

The spread of true and false news online. Science, 2018.

2018

[24] [46]

The economics of “fake news”

The economics of “fake news”. IT Professional, 2017.

2017

[25] [47]

Social media and fake news in the 2016 election

Social media and fake news in the 2016 election. Journal of Economic Perspectives, 2017.

2017

[26] [48]

Wikidata: a free collaborative knowledgebase

Wikidata: a free collaborative knowledgebase. Communications of the ACM, 2014.

2014

[27] [49]

WikiChat: Stopping the hallucination of large language model chatbots by few-shot grounding on Wikipedia

WikiChat: Stopping the hallucination of large language model chatbots by few-shot grounding on Wikipedia. arXiv preprint arXiv:2305.14292.

arXiv

[28] [50]

Visual news: Benchmark and challenges in news image captioning

Visual news: Benchmark and challenges in news image captioning. arXiv preprint arXiv:2010.03743.

arXiv 2010

[29] [51]

The impact of misinformation on the COVID-19 pandemic

The impact of misinformation on the COVID-19 pandemic. AIMS Public Health.

[30] [52]

Fake news, disinformation and misinformation in social media: a review

Fake news, disinformation and misinformation in social media: a review. Social Network Analysis and Mining, 2023.

2023

[31] [53]

The automated verification of textual claims (averitec) shared task

The automated verification of textual claims (averitec) shared task. arXiv preprint arXiv:2410.23850.

arXiv

[32] [54]

RAG-Fusion Based Information Retrieval for Fact-Checking

RAG-Fusion Based Information Retrieval for Fact-Checking. Proceedings of the Seventh Fact Extraction and VERification Workshop (FEVER).

[33] [55]

Understanding Inequality of LLM Fact-Checking over Geographic Regions with Agent and Retrieval models

Understanding Inequality of LLM Fact-Checking over Geographic Regions with Agent and Retrieval models. arXiv preprint arXiv:2503.22877.

arXiv

[34] [56]

Retrieval-augmented generation for knowledge-intensive nlp tasks

Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems.

[35] [57]

Justilm: Few-shot justification generation for explainable fact-checking of real-world claims

Justilm: Few-shot justification generation for explainable fact-checking of real-world claims. Transactions of the Association for Computational Linguistics, 2024.

2024

[36] [58]

Verifying cross-modal entity consistency in news using vision-language models

Verifying cross-modal entity consistency in news using vision-language models. arXiv preprint arXiv:2501.11403.

arXiv

[37] [59]

Sniffer: Multimodal large language model for explainable out-of-context misinformation detection

Sniffer: Multimodal large language model for explainable out-of-context misinformation detection. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[38] [60]

Posttruth, truthiness, and alternative facts: Information behavior and critical information consumption for a new age

Posttruth, truthiness, and alternative facts: Information behavior and critical information consumption for a new age. The Library Quarterly, 2017.

2017

[39] [61]

A picture paints a thousand lies? The effects and mechanisms of multimodal disinformation and rebuttals disseminated via social media

A picture paints a thousand lies? The effects and mechanisms of multimodal disinformation and rebuttals disseminated via social media. Political Communication, 2020.

2020

[40] [62]

Zero-Shot Warning Generation for Misinformative Multimodal Content

Zero-Shot Warning Generation for Misinformative Multimodal Content. arXiv preprint arXiv:2502.00752.

arXiv

[41] [63]

Newsclippings: Automatic generation of out-of-context multimodal media

Newsclippings: Automatic generation of out-of-context multimodal media. arXiv preprint arXiv:2104.05893.

arXiv

[42] [64]

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning. 2023.

2023

[43] [65]

Silhouettes: a graphical aid to the interpretation and validation of cluster analysis

Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 1987.

1987

[44] [66]

A cluster separation measure

A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1979.

1979

[45] [67]

Unchecked vs. uncheckable: How opinion-based claims can impede corrections of misinformation

Unchecked vs. uncheckable: How opinion-based claims can impede corrections of misinformation. Mass Communication and Society, 2021.

2021

[46] [68]

Improved Baselines with Visual Instruction Tuning

Improved Baselines with Visual Instruction Tuning. 2024.

2024

[47] [69]

Judging LLM-as-a-judge with MT-Bench and Chatbot Arena

Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. 2023.

2023

[48] [70]

Debertav3: Improving deberta using electra-style pre-training with gradient-disentangled embedding sharing

Debertav3: Improving deberta using electra-style pre-training with gradient-disentangled embedding sharing. arXiv preprint arXiv:2111.09543.

Pith/arXiv arXiv

[49] [71]

Joint face detection and alignment using multitask cascaded convolutional networks

Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 2016.

2016

[50] [72]

An image is worth 16x16 words: Transformers for image recognition at scale

An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.

Pith/arXiv arXiv 2010

[51] [73]

Facenet: A unified embedding for face recognition and clustering

Facenet: A unified embedding for face recognition and clustering. CVPR.

[52] [74]

Places: A 10 million Image Database for Scene Recognition

Places: A 10 million Image Database for Scene Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.

[53] [75]

Debating with more persuasive llms leads to more truthful answers

Debating with more persuasive llms leads to more truthful answers. arXiv preprint arXiv:2402.06782.

arXiv

[54] [76]

Supernotes: Driving Consensus in Crowd-Sourced Fact-Checking

Supernotes: Driving Consensus in Crowd-Sourced Fact-Checking. arXiv preprint arXiv:2411.06116.

arXiv

[55] [77]

Verite: a robust benchmark for multimodal misinformation detection accounting for unimodal bias

Verite: a robust benchmark for multimodal misinformation detection accounting for unimodal bias. International Journal of Multimedia Information Retrieval, 2024.

2024

[56] [78]

Urbani, S. Essential Guides.

[57] [79]

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

Reimers, Nils and Gurevych, Iryna. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. 2019

2019

[58] [80]

Retrieval Augmented Verification for Zero-Shot Detection of Multimodal Disinformation

Retrieval Augmented Verification for Zero-Shot Detection of Multimodal Disinformation. arXiv preprint arXiv:2404.10702.

arXiv

[59] [81]

RED-DOT: Multimodal Fact-checking via Relevant Evidence Detection

RED-DOT: Multimodal Fact-checking via Relevant Evidence Detection. 2024.

2024

[60] [82]

DEFAME: Dynamic Evidence-based FAct-checking with Multimodal Experts

DEFAME: Dynamic Evidence-based FAct-checking with Multimodal Experts. 2025.

2025

[61] [83]

Mmfakebench: A mixed-source multimodal misinformation detection benchmark for lvlms

Mmfakebench: A mixed-source multimodal misinformation detection benchmark for lvlms. arXiv preprint arXiv:2406.08772.

arXiv

[62] [84]

Similarity over Factuality: Are we making progress on multimodal out-of-context misinformation detection?

Similarity over Factuality: Are we making progress on multimodal out-of-context misinformation detection? arXiv preprint arXiv:2407.13488.

arXiv

[63] [86]

Understanding the promise and limits of automated fact-checking

Understanding the promise and limits of automated fact-checking. Reuters Institute for the Study of Journalism.

[64] [87]

“Fact-checking” fact checkers: A data-driven approach

“Fact-checking” fact checkers: A data-driven approach. Harvard Kennedy School Misinformation Review.

[65] [88]

Detecting and grounding multi-modal media manipulation

Detecting and grounding multi-modal media manipulation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[66] [89]

LLM-Consensus: Multi-Agent Debate for Visual Misinformation Detection

LLM-Consensus: Multi-Agent Debate for Visual Misinformation Detection. 2025.

2025

[67] [90]

Cosmos: Catching out-of-context misinformation with self-supervised learning

Cosmos: Catching out-of-context misinformation with self-supervised learning. arXiv preprint arXiv:2101.06278.

arXiv

[68] [91]

Open-domain, content-based, multi-modal fact-checking of out-of-context images via online resources

Open-domain, content-based, multi-modal fact-checking of out-of-context images via online resources. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[69] [93]

Journalism, media, and technology trends and predictions 2025

Journalism, media, and technology trends and predictions 2025. 2025.

2025

[70] [94]

Unveiling the hidden agenda: Biases in news reporting and consumption

Unveiling the hidden agenda: Biases in news reporting and consumption. PNAS Nexus, 2024.

2024

[71] [95]

Did the Roll-Out of Community Notes Reduce Engagement With Misinformation on X/Twitter?

Did the Roll-Out of Community Notes Reduce Engagement With Misinformation on X/Twitter? Proceedings of the ACM on Human-Computer Interaction, 2024.

2024

[72] [96]

Can Community Notes Replace Professional Fact-Checkers?

Can Community Notes Replace Professional Fact-Checkers? arXiv preprint arXiv:2502.14132.

arXiv

[73] [97]

Sahar Abdelnabi, Rakibul Hasan, and Mario Fritz. 2022. Open-domain, content-based, multi-modal fact-checking of out-of-context images via online resources. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14940--14949

2022

[74] [98]

Mubashara Akhtar, Michael Schlichtkrull, and Andreas Vlachos. 2026. Ev2r: Evaluating evidence retrieval in automated fact-checking. Transactions of the Association for Computational Linguistics, 14:530--561. https://doi.org/10.1162/TACL.a.647

work page doi:10.1162/tacl.a.647 2026

[75] [99]

Anastasios N Angelopoulos, John C Duchi, and Tijana Zrnic. 2023. Ppi++: Efficient prediction-powered inference

2023

[76] [100]

Yoav Benjamini and Yosef Hochberg. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal statistical society: series B (Methodological), 57(1):289--300

1995

[77] [101]

Eunsol Choi, Jennimaria Palomaki, Matthew Lamm, Tom Kwiatkowski, Dipanjan Das, and Michael Collins. 2021. Decontextualization: Making sentences stand-alone. Transactions of the Association for Computational Linguistics, 9:447--461. https://doi.org/10.1162/tacl_a_00377

work page doi:10.1162/tacl_a_00377 2021

[78] [102]

Arka Ujjal Dey, Muhammad Junaid Awan, Georgia Channing, Christian Schroeder de Witt, and John Collomosse. 2026. Fact-checking with contextual narratives: Leveraging retrieval-augmented llms for social media analysis. IEEE Transactions on Computational Social Systems, pages 1--12. https://doi.org/10.1109/TCSS.2026.3669799

work page doi:10.1109/tcss.2026.3669799 2026

[79] [103]

Luyu Gao, Zhuyun Dai, Panupong Pasupat, Anthony Chen, Arun Tejasvi Chaganty, Yicheng Fan, Vincent Zhao, Ni Lao, Hongrae Lee, Da-Cheng Juan, and Kelvin Guu. 2023. https://doi.org/10.18653/v1/2023.acl-long.910 RARR: Researching and revising what language models say, using language models. In Proceedings of the 61st Annual Meeting of the Association for Co...

work page doi:10.18653/v1/2023.acl-long.910 2023

[80] [104]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, and 1 others. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783

Pith/arXiv arXiv 2024