VeriTaS: The First Dynamic Benchmark for Multimodal Automated Fact-Checking
Pith reviewed 2026-05-16 14:59 UTC · model grok-4.3
The pith
VeriTaS creates the first dynamic multimodal benchmark for automated fact-checking that updates quarterly to resist data leakage into model pretraining.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VeriTaS is presented as the first dynamic benchmark for multimodal automated fact-checking, consisting of 25,000 claims sourced from 104 organizations in 54 languages with textual and audiovisual content, maintained through a fully automated seven-stage pipeline that normalizes claim formulation, retrieves original media, and maps heterogeneous expert verdicts to a novel standardized and disentangled scoring scheme accompanied by textual justifications.
What carries the argument
The fully automated seven-stage pipeline that normalizes claim formulation, retrieves original media, and maps heterogeneous expert verdicts to a standardized, disentangled scoring scheme with textual justifications.
If this is right
- AFC system evaluations will measure actual verification ability instead of recall of training data.
- Standardized scoring enables direct comparison of systems across claims from many different fact-checking organizations.
- Quarterly updates allow tracking of model progress on emerging misinformation without repeated use of the same fixed set.
- Multimodal coverage supports testing of systems that process both text and audiovisual evidence together.
Where Pith is reading between the lines
- The dynamic update mechanism could serve as a template for other evaluation suites in areas where rapid model pretraining risks contamination.
- Long-term tracking with VeriTaS may show whether gains in AFC performance come from better reasoning or from broader coverage of the same claim types.
- Public release of the pipeline invites external groups to extend the benchmark to additional languages or new misinformation formats.
Load-bearing premise
The automated pipeline will keep producing annotations that match human judgments on new quarterly claims without accumulating systematic errors.
What would settle it
A future quarterly batch of claims where independent human raters disagree with the pipeline's standardized verdicts on a substantial fraction of cases.
Figures
read the original abstract
The growing scale of online misinformation urgently demands Automated Fact-Checking (AFC). Existing benchmarks for evaluating AFC systems, however, are largely limited in terms of task scope, modalities, domain, language diversity, realism, or coverage of misinformation types. Critically, they are static, thus subject to data leakage as their claims enter the pretraining corpora of LLMs. As a result, benchmark performance no longer reliably reflects the actual ability to verify claims. We introduce Verified Theses and Statements (VeriTaS), the first dynamic benchmark for multimodal AFC, designed to remain robust under ongoing large-scale pretraining of foundation models. VeriTaS currently comprises 25,000 real-world claims from 104 professional fact-checking organizations across 54 languages, covering textual and audiovisual content. Claims are added quarterly via a fully automated seven-stage pipeline that normalizes claim formulation, retrieves original media, and maps heterogeneous expert verdicts to a novel, standardized, and disentangled scoring scheme with textual justifications. Through human evaluation, we demonstrate that the automated annotations closely match human judgments. We commit to updating VeriTaS in the future, establishing a leakage-resistant benchmark, supporting meaningful AFC evaluation in the era of rapidly evolving foundation models. The code and data are publicly available under https://veritas.mai.informatik.tu-darmstadt.de .
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces VeriTaS, the first dynamic benchmark for multimodal automated fact-checking (AFC). It comprises 25,000 real-world claims sourced from 104 professional fact-checking organizations across 54 languages, covering textual and audiovisual content. Claims are ingested quarterly via a fully automated seven-stage pipeline that normalizes claim formulation, retrieves original media, and maps heterogeneous expert verdicts to a novel standardized and disentangled scoring scheme with textual justifications. The authors report that automated annotations closely match human judgments in a one-time evaluation and commit to ongoing quarterly updates to create a leakage-resistant benchmark for evaluating AFC systems amid large-scale pretraining of foundation models. Code and data are released publicly.
Significance. If the seven-stage pipeline maintains alignment with human judgments across quarterly updates without systematic drift, VeriTaS would provide a valuable contribution as a multimodal, multilingual, and dynamic benchmark that mitigates data leakage issues plaguing static AFC datasets. The public data release and commitment to updates strengthen its utility for reproducible evaluation of AFC systems in the era of evolving foundation models.
major comments (2)
- [Pipeline and Human Evaluation] The central claim of long-term robustness for the dynamic benchmark rests on the seven-stage pipeline continuing to produce annotations that closely track human judgments indefinitely. However, the manuscript describes only a single human evaluation on current data with no protocol for ongoing sampling, discrepancy auditing, or re-calibration when ingesting new claims from 54 languages and audiovisual sources quarterly (see Pipeline and Human Evaluation sections).
- [Verdict Mapping] The standardized and disentangled scoring scheme is presented as a key innovation for mapping heterogeneous verdicts, but the manuscript provides insufficient detail on its exact definition, how it handles cross-organization and cross-modality variations, or mechanisms to detect accumulating mapping errors over time (see Verdict Mapping stage description).
minor comments (2)
- [Abstract] The abstract states that annotations 'closely match' human judgments but does not report specific quantitative metrics such as agreement rates, Cohen's kappa, or per-modality breakdowns; adding these would improve clarity.
- [Related Work] The claim of being the 'first' dynamic benchmark would benefit from a more explicit comparison table against prior AFC benchmarks to highlight differences in dynamism, multimodality, and language coverage.
Simulated Author's Rebuttal
We thank the referee for their thoughtful review and constructive feedback on our manuscript. We appreciate the recognition of VeriTaS as a potentially valuable contribution to the field of automated fact-checking. We address each major comment point by point below, outlining the revisions we will make to strengthen the paper.
read point-by-point responses
-
Referee: [Pipeline and Human Evaluation] The central claim of long-term robustness for the dynamic benchmark rests on the seven-stage pipeline continuing to produce annotations that closely track human judgments indefinitely. However, the manuscript describes only a single human evaluation on current data with no protocol for ongoing sampling, discrepancy auditing, or re-calibration when ingesting new claims from 54 languages and audiovisual sources quarterly (see Pipeline and Human Evaluation sections).
Authors: We agree that ensuring sustained alignment with human judgments is critical for the benchmark's credibility over time. The current manuscript reports results from a single human evaluation performed on the initial data release to validate the pipeline. In the revised version, we will add a new subsection to the Human Evaluation section that specifies an ongoing protocol. This will include quarterly sampling of a representative subset of new claims (stratified by language and modality), procedures for discrepancy auditing between automated and human annotations, and explicit re-calibration steps if drift is detected. We will also commit to publicly releasing the outcomes of these checks with each quarterly update. revision: yes
-
Referee: [Verdict Mapping] The standardized and disentangled scoring scheme is presented as a key innovation for mapping heterogeneous verdicts, but the manuscript provides insufficient detail on its exact definition, how it handles cross-organization and cross-modality variations, or mechanisms to detect accumulating mapping errors over time (see Verdict Mapping stage description).
Authors: We acknowledge that additional detail on the scoring scheme would improve clarity and reproducibility. In the revised manuscript, we will expand the Verdict Mapping stage description with a formal definition of the standardized, disentangled scoring scheme, including its core components (veracity category, justification text, and confidence indicators). We will include concrete mapping examples from multiple organizations and both textual and audiovisual claims. We will also describe mechanisms for ongoing error detection, such as automated consistency checks across updates and periodic human spot-checks, to identify and correct any accumulating mapping errors. revision: yes
Circularity Check
No circularity: benchmark built from external claims with independent validation
full rationale
The paper constructs VeriTaS from real-world claims sourced directly from 104 independent professional fact-checking organizations across 54 languages. The seven-stage automated pipeline normalizes, retrieves media, and maps verdicts to a standardized scheme, but these steps operate on external inputs rather than fitted parameters or self-referential definitions. Human evaluation is reported as an independent check that annotations match judgments, with no equations, self-citations, or uniqueness theorems invoked to force the result. The derivation chain remains self-contained against external benchmarks and does not reduce by construction to its own outputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The automated seven-stage pipeline accurately normalizes claim formulation, retrieves original media, and maps heterogeneous expert verdicts to a standardized scoring scheme.
invented entities (1)
-
Standardized and disentangled scoring scheme
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Claims are added quarterly via a fully automated seven-stage pipeline that normalizes claim formulation, retrieves original media, and maps heterogeneous expert verdicts to a novel, standardized, and disentangled scoring scheme with textual justifications.
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanabsolute_floor_iff_bare_distinguishability unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
we model each property on a scale from −1 to +1, where 0 denotes full uncertainty (NEI)
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
COSMOS: Catching Out-of-Context Misin- formation with Self-Supervised Learning.Preprint, arXiv:2101.06278. Anthropic. 2025. Claude Sonnet 4.5 System Card. Sys- tem Card. Accessed on Jan 5, 2026. Alessandro Bondielli, Pietro Dell’Oglio, Alessandro Lenci, Francesco Marcelloni, and Lucia Passaro
-
[2]
Tobias Braun, Mark Rothermel, Marcus Rohrbach, and Anna Rohrbach
Dataset for multimodal fake news detection and verification tasks.Data in Brief, 54:110440. Tobias Braun, Mark Rothermel, Marcus Rohrbach, and Anna Rohrbach. 2025. DEFAME: Dynamic Evidence-based FAct-checking with Multimodal Ex- perts. InProceedings of the 42nd International Conference on Machine Learning, pages 5383–5417. PMLR. Grégoire Burel, Martino Me...
work page 2025
-
[3]
InThe Semantic Web – ISWC 2024, pages 97–114, Cham
CimpleKG: A Continuously Updated Knowl- edge Graph on Misinformation, Factors and Fact- Checks. InThe Semantic Web – ISWC 2024, pages 97–114, Cham. Springer Nature Switzerland. Rui Cao, Zifeng Ding, Zhijiang Guo, Michael Schlichtkrull, and Andreas Vlachos. 2025. A Ver- ImaTeC: A Dataset for Automatic Verification of Image-Text Claims with Evidence from th...
-
[4]
M4FC: A Multimodal, Multilingual, Multicul- tural, Multitask Real-World Fact-Checking Dataset. Preprint, arXiv:2510.23508. Max Glockner, Yufang Hou, and Iryna Gurevych
-
[5]
Missing Counter-Evidence Renders NLP Fact- Checking Unrealistic for Misinformation. InPro- ceedings of the 2022 Conference on Empirical Meth- ods in Natural Language Processing, pages 5916– 5936, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics. Max Glockner, Ieva Stali¯unait˙e, James Thorne, Gisela Vallejo, Andreas Vlachos, and ...
-
[6]
Fakeddit: A New Multimodal Benchmark Dataset for Fine-grained Fake News Detection. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 6149–6157, Marseille, France. European Language Resources Association. Eryn J. Newman, Maryanne Garry, Daniel M. Bernstein, Justin Kantner, and D. Stephen Lindsay. 2012. Non- probative photogr...
work page 2012
-
[7]
Multimodal and Multilingual Fact-Checked Article Retrieval. InProceedings of the 2025 Inter- national Conference on Multimedia Retrieval, ICMR ’25, pages 1063–1071, New York, NY , USA. Associ- ation for Computing Machinery. Stefanos-Iordanis Papadopoulos, Christos Koutlis, Symeon Papadopoulos, and Panagiotis C. Petran- tonakis. 2024. VERITE: A Robust benc...
work page 2025
-
[8]
Fin-Fact: A Benchmark Dataset for Multi- modal Financial Fact-Checking and Explanation Gen- eration. InCompanion Proceedings of the ACM on Web Conference 2025, WWW ’25, pages 785–788, New York, NY , USA. Association for Computing Machinery. Shaina Raza, Ashmal Vayani, Aditya Jain, Aravind Narayanan, Vahid Reza Khazaie, S. Bashir, Elham Dolatabadi, Gias Ud...
-
[9]
ClaimsKG: A Knowledge Graph of Fact- Checked Claims. InThe Semantic Web – ISWC 2019: 18th International Semantic Web Conference, Auckland, New Zealand, October 26–30, 2019, Pro- ceedings, Part II, pages 309–324, Berlin, Heidelberg. Springer-Verlag. James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: A Large-scale Dat...
work page 2019
-
[10]
Michiel van der Meer, Pavel Korshunov, Sébastien Mar- cel, and Lonneke van der Plas
COVE: COntext and VEracity prediction for out-of-context images.Preprint, arXiv:2502.01194. Michiel van der Meer, Pavel Korshunov, Sébastien Mar- cel, and Lonneke van der Plas. 2025. HintsOfT- ruth: A Multimodal Checkworthiness Detection Dataset with Real and Synthetic Claims.Preprint, arXiv:2502.11753. Haoran Wang, Aman Rangapur, Xiongxiao Xu, Yue- qing ...
-
[11]
Piecing It All Together: Verifying Multi-Hop Multimodal Claims. InProceedings of the 31st Inter- national Conference on Computational Linguistics, pages 7453–7469, Abu Dhabi, UAE. Association for Computational Linguistics. William Yang Wang. 2017. “Liar, Liar Pants on Fire”: A New Benchmark Dataset for Fake News Detection. InProceedings of the 55th Annual...
-
[12]
Fact-Checking Meets Fauxtography: Verify- ing Claims About Images. InProceedings of the 2019 Conference on Empirical Methods in Natu- ral Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2099–2108, Hong Kong, China. Association for Computational Linguistics. 13 A LLM Glossary Table 5 summa...
work page 2019
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.