arxiv: 2601.08611 · v2 · submitted 2026-01-13 · 💻 cs.IR · cs.AI · cs.CV · cs.MM

VeriTaS: The First Dynamic Benchmark for Multimodal Automated Fact-Checking

Mark Rothermel, Marcus Kornmann, Marcus Rohrbach, Anna Rohrbach

Pith reviewed 2026-05-16 14:59 UTC · model grok-4.3

classification 💻 cs.IR · cs.AI · cs.CV · cs.MM

keywords automated fact-checking · dynamic benchmark · multimodal · data leakage · misinformation · fact-checking organizations · standardized scoring · quarterly updates


The pith

VeriTaS creates the first dynamic multimodal benchmark for automated fact-checking that updates quarterly to resist data leakage into model pretraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Static benchmarks for automated fact-checking lose reliability once their claims appear in the training data of large models, so performance no longer measures genuine verification skill. The paper introduces VeriTaS as a dynamic alternative built from 25,000 real-world claims drawn from 104 professional organizations across 54 languages and both textual and audiovisual formats. A fully automated seven-stage pipeline normalizes each claim, retrieves its original media, and converts diverse expert verdicts into a standardized, disentangled scoring scheme with textual justifications. Human evaluation confirms the pipeline's outputs align closely with manual judgments. The benchmark commits to quarterly additions, creating a leakage-resistant resource for ongoing evaluation of fact-checking systems.

Core claim

VeriTaS is presented as the first dynamic benchmark for multimodal automated fact-checking, consisting of 25,000 claims sourced from 104 organizations in 54 languages with textual and audiovisual content, maintained through a fully automated seven-stage pipeline that normalizes claim formulation, retrieves original media, and maps heterogeneous expert verdicts to a novel standardized and disentangled scoring scheme accompanied by textual justifications.

What carries the argument

The fully automated seven-stage pipeline that normalizes claim formulation, retrieves original media, and maps heterogeneous expert verdicts to a standardized, disentangled scoring scheme with textual justifications.
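
As a rough illustration of the verdict-mapping step, the sketch below converts free-text ratings into a score on the −1 to +1 scale the paper describes, where 0 denotes full uncertainty (NEI). The rating strings, their numeric values, the assumption that +1 means "true", and the helper name `to_score` are all hypothetical. The paper's actual scheme is disentangled across several properties and is produced by its pipeline, not by a lookup table.

```python
from typing import Optional

# Hypothetical lookup: labels and values are illustrative assumptions,
# not the mapping used by VeriTaS.
RATING_TO_SCORE = {
    "true": 1.0,
    "mostly true": 0.5,
    "unproven": 0.0,      # 0 denotes full uncertainty (NEI)
    "mostly false": -0.5,
    "false": -1.0,
}


def to_score(raw_rating: str) -> Optional[float]:
    """Map a raw rating string to [-1, 1]; return None for unknown labels."""
    # Normalize case and collapse repeated whitespace before the lookup.
    key = " ".join(raw_rating.lower().split())
    return RATING_TO_SCORE.get(key)
```

Returning `None` for unknown labels, rather than guessing a score, keeps unmapped verdicts visible for manual review.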

If this is right

AFC system evaluations will measure actual verification ability instead of recall of training data.
Standardized scoring enables direct comparison of systems across claims from many different fact-checking organizations.
Quarterly updates allow tracking of model progress on emerging misinformation without repeated use of the same fixed set.
Multimodal coverage supports testing of systems that process both text and audiovisual evidence together.
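
One way to act on the first point above is to score a model only on claims dated after its knowledge cutoff. The following is a minimal sketch of that filter; the `Claim` fields and the function name `post_cutoff` are assumptions for illustration and do not reflect the released data format.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Claim:
    claim_id: str      # hypothetical field name
    claim_date: date   # hypothetical field name


def post_cutoff(claims, knowledge_cutoff):
    """Keep only claims dated strictly after the model's knowledge cutoff."""
    return [c for c in claims if c.claim_date > knowledge_cutoff]


claims = [
    Claim("a", date(2025, 3, 1)),
    Claim("b", date(2025, 9, 15)),
    Claim("c", date(2025, 12, 2)),
]
fresh = post_cutoff(claims, knowledge_cutoff=date(2025, 6, 30))
print([c.claim_id for c in fresh])  # ['b', 'c']
```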

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The dynamic update mechanism could serve as a template for other evaluation suites in areas where rapid model pretraining risks contamination.
Long-term tracking with VeriTaS may show whether gains in AFC performance come from better reasoning or from broader coverage of the same claim types.
Public release of the pipeline invites external groups to extend the benchmark to additional languages or new misinformation formats.

Load-bearing premise

The automated pipeline will keep producing annotations that match human judgments on new quarterly claims without accumulating systematic errors.

What would settle it

A future quarterly batch of claims where independent human raters disagree with the pipeline's standardized verdicts on a substantial fraction of cases.

Figures

Figures reproduced from arXiv: 2601.08611 by Anna Rohrbach, Marcus Kornmann, Marcus Rohrbach, Mark Rothermel.

**Figure 1.** Example claims from the VERITAS benchmark, including media, claim date, and claim integrity score. The two claims on the right showcase lower-level annotations of media/claim properties used to infer the overall integrity. Each annotation contains a justification. Claims are added quarterly via an automated pipeline.

**Figure 2.** The seven stages of VERITAS repeated on a quarterly basis.

3.2 Stage 2: Publisher Identification. Goal: Ensure review credibility. We identified 832 distinct publishers via review URLs, and retain reviews only if they originate from credible fact-checkers (cf. App. B), dismissing about 55 K (15%) of the reviews not meeting this criterion. Output: Reviews from professional fact-checking organizations meetin…
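
The publisher filter described for Stage 2 can be sketched as an allowlist check on the host of each review URL. This is a sketch under stated assumptions: the allowlist entries, the `url` key, and both function names are hypothetical, and the paper's real credibility criteria are given in its App. B.

```python
from urllib.parse import urlparse

# Hypothetical allowlist; the actual criteria are in the paper's App. B.
CREDIBLE_PUBLISHERS = {"factcheck.example.org", "verify.example.com"}


def publisher_of(review_url: str) -> str:
    """Return the lowercased host of a review URL, without a leading 'www.'."""
    host = (urlparse(review_url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def keep_credible(reviews, allowlist=CREDIBLE_PUBLISHERS):
    """Retain only reviews whose URL host is on the allowlist."""
    return [r for r in reviews if publisher_of(r["url"]) in allowlist]
```

Strings that do not parse as URLs yield an empty host and are dropped, which matches the conservative "retain only if credible" rule.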

**Figure 3.** Verdict derivation (Stage 6), assessing proper …

**Figure 4.** Baseline performance without web search on the longitudinal split using a 200-claim moving average window, single runs. Lower is better. Vertical lines indicate knowledge cutoff dates.

… over time, exceeding 25% of claims by Q4-2025, while the proportion of pristine media correspondingly declines (…

**Figure 5.** Origins of ClaimReviews.

D VERITAS Statistics. D.1 ClaimReview. We obtained a total of 371, 071 K ClaimReviews starting from January 1, 2016. Three sources yielded the data: Google Fact-Check Explorer, Data Commons, and the fact-checking organizations themselves who added ClaimReview data to their published article webpages. The same ClaimReview can be obtained from multiple sources, where the ClaimReview d…

**Figure 6.** ClaimReview statistics, as obtained by stages 1 and 2, showing …

**Figure 7.** Raw ratings as provided by ClaimReview for …

**Figure 10.** Appearances per archiving service.

| Integrity | Search | MSE (↓) | MAE (↓) | Acc. (↑) |
|---|---|---|---|---|
| Gemini 2.0 Flash | - | 0.63 | 0.59 | 46.72 |
| Gemini 2.5 Flash | - | 0.56 | 0.39 | 79.65 |
| Gemini 3 Pro | - | 0.36 | 0.26 | 89.52 |
| GPT-4o | - | 0.53 | 0.51 | 54.25 |
| GPT-5.2 | - | 0.59 | 0.57 | 47.75 |
| LLama 4 Maverick | - | 0.81 | 0.61 | 55.63 |
| Gemini 2.0 Flash | ✓ | 0.60 | 0.51 | 59.13 |
| Gemini 2.5 Flash | ✓ | 0.50 | 0.41 | 71.73 |
| Gemini 3 Pro | ✓ | 0.34 | 0.33 | 75.13 |
| GPT-4o | ✓ | 0.54 | 0.47 | 61.83 |
| GPT-5.2 | ✓ | 0.47 | 0.4… | |

**Figure 8.** Appearances that have the original source …

**Figure 9.** Appearances per platform, showing the top …

**Figure 11.** Statistics in the final VERITAS benchmark for all quarter splits.

**Figure 12.** Statistics in the final VERITAS benchmark for all quarter splits.

**Figure 13.** Common words occurring in VERITAS’ claims for the ten most frequent languages.

**Figure 14.** Natural data statistics as obtained before stage 7, i.e., before balancing and sampling.

**Figure 15.** Natural data statistics as obtained before stage 7, i.e., before balancing and sampling.

**Figure 16.** Baseline results on the longitudinal split for all three metrics Mean Squared Error (MSE), Mean Absolute Error (MAE), and Accuracy (by 3-bin discretization). All plots use a 200-claim moving average window. Vertical lines indicate knowledge cutoff dates.
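
The three metrics and the smoothing named in this caption can be sketched as follows for scores on the −1 to +1 scale. Several details are assumptions rather than the paper's definitions: the bin threshold of 1/3, the choice of a trailing (not centered) window, and all function names. Accuracy is returned as a fraction here, whereas the paper reports percentages.

```python
def mse(pred, gold):
    """Mean squared error between predicted and reference scores."""
    return sum((p - g) ** 2 for p, g in zip(pred, gold)) / len(gold)


def mae(pred, gold):
    """Mean absolute error between predicted and reference scores."""
    return sum(abs(p - g) for p, g in zip(pred, gold)) / len(gold)


def to_bin(score, threshold=1 / 3):
    """Discretize a score in [-1, 1] into -1, 0, or +1 (threshold is assumed)."""
    if score > threshold:
        return 1
    if score < -threshold:
        return -1
    return 0


def accuracy_3bin(pred, gold, threshold=1 / 3):
    """Fraction of items whose predicted and reference bins agree."""
    hits = sum(to_bin(p, threshold) == to_bin(g, threshold)
               for p, g in zip(pred, gold))
    return hits / len(gold)


def moving_average(values, window=200):
    """Trailing mean over the most recent `window` values at each position."""
    out, running = [], 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the value leaving the window
        out.append(running / min(i + 1, window))
    return out
```

Applying `moving_average` to per-claim absolute errors, ordered by claim date, would give a curve of the kind the longitudinal plots show.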

**Figure 17.** Baseline results on the longitudinal split for modality specific Mean Absolute Error (MAE). All plots use a 200-claim moving average window. Vertical lines indicate knowledge cutoff dates.

**Figure 18.** Overview of the human annotation process: After validating quality requirements, annotators assess all …

**Figure 19.** Confusion matrices comparing human annotations with V…

Original abstract

The growing scale of online misinformation urgently demands Automated Fact-Checking (AFC). Existing benchmarks for evaluating AFC systems, however, are largely limited in terms of task scope, modalities, domain, language diversity, realism, or coverage of misinformation types. Critically, they are static, thus subject to data leakage as their claims enter the pretraining corpora of LLMs. As a result, benchmark performance no longer reliably reflects the actual ability to verify claims. We introduce Verified Theses and Statements (VeriTaS), the first dynamic benchmark for multimodal AFC, designed to remain robust under ongoing large-scale pretraining of foundation models. VeriTaS currently comprises 25,000 real-world claims from 104 professional fact-checking organizations across 54 languages, covering textual and audiovisual content. Claims are added quarterly via a fully automated seven-stage pipeline that normalizes claim formulation, retrieves original media, and maps heterogeneous expert verdicts to a novel, standardized, and disentangled scoring scheme with textual justifications. Through human evaluation, we demonstrate that the automated annotations closely match human judgments. We commit to updating VeriTaS in the future, establishing a leakage-resistant benchmark, supporting meaningful AFC evaluation in the era of rapidly evolving foundation models. The code and data are publicly available under https://veritas.mai.informatik.tu-darmstadt.de.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

VeriTaS brings a dynamic, multilingual multimodal benchmark with quarterly updates and public data, but the automated pipeline's ongoing alignment with humans is only shown once and lacks described checks for drift.


VeriTaS is a new dynamic benchmark for multimodal automated fact-checking that pulls real claims from professional organizations and updates quarterly, but its automated annotation pipeline lacks described safeguards against gradual divergence from human judgments.

The paper's main contribution is a benchmark that avoids the data leakage plaguing static ones, since claims are added over time. It covers text and audiovisual content in 54 languages from 104 fact-checkers, which is a broad scope. The authors release the data publicly and show, in an initial evaluation, that their seven-stage pipeline produces annotations that align with human judgments. The standardized, disentangled scoring scheme is a nice touch for making verdicts comparable across sources.

Where it falls short is in the details and the long-term plan. The abstract says the pipeline normalizes claims, retrieves media, and maps verdicts, but gives no specifics on how it handles edge cases or maintains consistency across updates. The human evaluation is a one-off, so no mechanism is outlined for detecting whether quality slips as new claims come in quarterly. If the automation introduces systematic biases in normalization or mapping, they could go unnoticed.

This work is aimed at researchers evaluating AFC systems, especially those concerned with real-world applicability and model contamination. Anyone building or testing fact-checkers would find the dataset useful, even if the benchmark's robustness needs more proof. I would recommend sending it for peer review. The core idea is solid and the public artifacts are valuable, but referees can push for more on the pipeline mechanics and ongoing validation.

Referee Report

2 major / 2 minor

Summary. The paper introduces VeriTaS, the first dynamic benchmark for multimodal automated fact-checking (AFC). It comprises 25,000 real-world claims sourced from 104 professional fact-checking organizations across 54 languages, covering textual and audiovisual content. Claims are ingested quarterly via a fully automated seven-stage pipeline that normalizes claim formulation, retrieves original media, and maps heterogeneous expert verdicts to a novel standardized and disentangled scoring scheme with textual justifications. The authors report that automated annotations closely match human judgments in a one-time evaluation and commit to ongoing quarterly updates to create a leakage-resistant benchmark for evaluating AFC systems amid large-scale pretraining of foundation models. Code and data are released publicly.

Significance. If the seven-stage pipeline maintains alignment with human judgments across quarterly updates without systematic drift, VeriTaS would provide a valuable contribution as a multimodal, multilingual, and dynamic benchmark that mitigates data leakage issues plaguing static AFC datasets. The public data release and commitment to updates strengthen its utility for reproducible evaluation of AFC systems in the era of evolving foundation models.

major comments (2)

[Pipeline and Human Evaluation] The central claim of long-term robustness for the dynamic benchmark rests on the seven-stage pipeline continuing to produce annotations that closely track human judgments indefinitely. However, the manuscript describes only a single human evaluation on current data with no protocol for ongoing sampling, discrepancy auditing, or re-calibration when ingesting new claims from 54 languages and audiovisual sources quarterly (see Pipeline and Human Evaluation sections).
[Verdict Mapping] The standardized and disentangled scoring scheme is presented as a key innovation for mapping heterogeneous verdicts, but the manuscript provides insufficient detail on its exact definition, how it handles cross-organization and cross-modality variations, or mechanisms to detect accumulating mapping errors over time (see Verdict Mapping stage description).

minor comments (2)

[Abstract] The abstract states that annotations 'closely match' human judgments but does not report specific quantitative metrics such as agreement rates, Cohen's kappa, or per-modality breakdowns; adding these would improve clarity.
[Related Work] The claim of being the 'first' dynamic benchmark would benefit from a more explicit comparison table against prior AFC benchmarks to highlight differences in dynamism, multimodality, and language coverage.
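
The first minor comment asks for agreement statistics such as Cohen's kappa. As a minimal sketch of what that computation involves, the function below compares two label sequences over the same items, for example pipeline verdicts against human verdicts. The function name and the example labels are illustrative assumptions; the paper's own agreement figures are not reproduced here.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)
```

Unlike raw agreement, kappa discounts matches that would occur by chance, which matters when one verdict class dominates the data.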

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive feedback on our manuscript. We appreciate the recognition of VeriTaS as a potentially valuable contribution to the field of automated fact-checking. We address each major comment point by point below, outlining the revisions we will make to strengthen the paper.


Referee: [Pipeline and Human Evaluation] The central claim of long-term robustness for the dynamic benchmark rests on the seven-stage pipeline continuing to produce annotations that closely track human judgments indefinitely. However, the manuscript describes only a single human evaluation on current data with no protocol for ongoing sampling, discrepancy auditing, or re-calibration when ingesting new claims from 54 languages and audiovisual sources quarterly (see Pipeline and Human Evaluation sections).

Authors: We agree that ensuring sustained alignment with human judgments is critical for the benchmark's credibility over time. The current manuscript reports results from a single human evaluation performed on the initial data release to validate the pipeline. In the revised version, we will add a new subsection to the Human Evaluation section that specifies an ongoing protocol. This will include quarterly sampling of a representative subset of new claims (stratified by language and modality), procedures for discrepancy auditing between automated and human annotations, and explicit re-calibration steps if drift is detected. We will also commit to publicly releasing the outcomes of these checks with each quarterly update.

Revision: yes

Referee: [Verdict Mapping] The standardized and disentangled scoring scheme is presented as a key innovation for mapping heterogeneous verdicts, but the manuscript provides insufficient detail on its exact definition, how it handles cross-organization and cross-modality variations, or mechanisms to detect accumulating mapping errors over time (see Verdict Mapping stage description).

Authors: We acknowledge that additional detail on the scoring scheme would improve clarity and reproducibility. In the revised manuscript, we will expand the Verdict Mapping stage description with a formal definition of the standardized, disentangled scoring scheme, including its core components (veracity category, justification text, and confidence indicators). We will include concrete mapping examples from multiple organizations and both textual and audiovisual claims. We will also describe mechanisms for ongoing error detection, such as automated consistency checks across updates and periodic human spot-checks, to identify and correct any accumulating mapping errors.

Revision: yes
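
The quarterly audit proposed in these simulated responses (a sample stratified by language and modality, followed by a discrepancy check) might look roughly like the sketch below. Nothing here comes from the paper: the dictionary keys, the per-stratum sample size, the tolerance of 0.5, and both function names are assumptions chosen for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(claims, per_stratum, seed=0):
    """Draw up to `per_stratum` claims from each (language, modality) group."""
    rng = random.Random(seed)  # fixed seed keeps the audit sample reproducible
    groups = defaultdict(list)
    for claim in claims:
        groups[(claim["language"], claim["modality"])].append(claim)
    sample = []
    for key in sorted(groups):
        members = groups[key]
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample


def disagreement_rate(pairs, tolerance=0.5):
    """Share of (pipeline_score, human_score) pairs differing by more than tolerance."""
    if not pairs:
        raise ValueError("need at least one score pair")
    return sum(abs(p - h) > tolerance for p, h in pairs) / len(pairs)
```

Tracking `disagreement_rate` per quarter, and per stratum, would be one simple way to notice the drift the referee worries about.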

Circularity Check

0 steps flagged

No circularity: benchmark built from external claims with independent validation


The paper constructs VeriTaS from real-world claims sourced directly from 104 independent professional fact-checking organizations across 54 languages. The seven-stage automated pipeline normalizes, retrieves media, and maps verdicts to a standardized scheme, but these steps operate on external inputs rather than fitted parameters or self-referential definitions. Human evaluation is reported as an independent check that annotations match judgments, with no equations, self-citations, or uniqueness theorems invoked to force the result. The derivation chain remains self-contained against external benchmarks and does not reduce by construction to its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that the automated pipeline produces reliable, human-aligned annotations and that quarterly updates will maintain leakage resistance without introducing new biases.

axioms (1)

Domain assumption: The automated seven-stage pipeline accurately normalizes claim formulation, retrieves original media, and maps heterogeneous expert verdicts to a standardized scoring scheme.
Invoked in the description of how claims are processed and added quarterly.

invented entities (1)

Standardized and disentangled scoring scheme (no independent evidence)
Purpose: To convert heterogeneous expert verdicts into a consistent format with textual justifications.
New scheme introduced to handle diverse fact-checking organization outputs.

pith-pipeline@v0.9.0 · 5552 in / 1259 out tokens · 67793 ms · 2026-05-16T14:59:51.548099+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Claims are added quarterly via a fully automated seven-stage pipeline that normalizes claim formulation, retrieves original media, and maps heterogeneous expert verdicts to a novel, standardized, and disentangled scoring scheme with textual justifications."

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "we model each property on a scale from −1 to +1, where 0 denotes full uncertainty (NEI)"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
