DualFact+: A Multimodal Fact Verification Framework for Procedural Video Understanding

Cennet Oguz; Josef van Genabith; Simon Ostermann; Yasser Hamidullah

arxiv: 2604.25584 · v1 · submitted 2026-04-28 · 💻 cs.AI


Pith reviewed 2026-05-07 16:25 UTC · model grok-4.3


keywords fact verification · procedural video understanding · video captioning · multimodal evaluation · semantic roles · factual grounding · hallucination detection


The pith

DualFact evaluates factual correctness in procedural video captions by separating abstract semantic roles from their video-grounded realizations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DualFact as a dual-layer framework to check factual accuracy in captions of procedural videos. It divides facts into conceptual ones that name abstract roles such as actions, ingredients, tools and locations, and contextual ones that describe how those roles actually occur in the video. This approach reveals that current multimodal models often generate fluent captions that still omit key details or mix up roles. DualFact aligns better with human judgments of factuality than existing metrics, especially when verification uses video evidence instead of text alone. The work shows that caption-only checks tend to count more hallucinations than video-grounded ones.

Core claim

DualFact is a dual-layer, multimodal factuality evaluation framework for procedural video captioning that separates factual correctness into conceptual facts capturing abstract semantic roles such as Action, Ingredient, Tool and Location, and contextual facts capturing their grounded predicate-argument realizations in video. The framework supports complete and role-consistent evaluation through implicit argument augmentation and contrastive fact sets. It runs in two modes: DualFact-T against textual evidence and DualFact-V against video-grounded visual evidence. Experiments on YouCook3-Fact and CraftBench-Fact show that state-of-the-art multimodal language models produce fluent but often factually incomplete captions, with systematic omissions and role-level inconsistencies.

What carries the argument

DualFact, the dual-layer fact verification framework that decomposes factual correctness into conceptual semantic roles and contextual predicate-argument realizations, supported by implicit argument augmentation and contrastive fact sets, with separate text-based and video-based verification modes.
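The two layers can be pictured as two small record types. The following is a minimal sketch, not the authors' implementation: the class and field names are hypothetical, the role inventory comes from the abstract, and the rendered string shapes ("<Category> is <Value>." for conceptual facts, "verb + object" or "verb + with TOOL" for contextual facts) follow the prompt fragments quoted further down this page.

```python
from dataclasses import dataclass
from typing import Optional

# Role inventory named in the abstract; everything else here is illustrative.
ROLES = ("Action", "Ingredient", "Tool", "Location")


@dataclass(frozen=True)
class ConceptualFact:
    """Abstract semantic role filled by a value, e.g. Tool = knife."""
    role: str
    value: str

    def __post_init__(self):
        if self.role not in ROLES:
            raise ValueError(f"unknown role: {self.role}")

    def render(self) -> str:
        # Same surface form as the "<Category> is <Value>." pattern.
        return f"{self.role} is {self.value}."


@dataclass(frozen=True)
class ContextualFact:
    """Grounded predicate-argument realization, e.g. 'cut with knife'."""
    verb: str
    argument: str
    preposition: Optional[str] = None  # None means a direct object

    def render(self) -> str:
        if self.preposition is None:
            return f"{self.verb} {self.argument}"
        return f"{self.verb} {self.preposition} {self.argument}"


# Hypothetical facts for a single cooking step.
conceptual = [ConceptualFact("Action", "cut"), ConceptualFact("Tool", "knife")]
contextual = [ContextualFact("cut", "onion"), ContextualFact("cut", "knife", "with")]
print([f.render() for f in conceptual])
print([f.render() for f in contextual])
```

Keeping the layers as separate types makes it possible to score role coverage and grounded realizations independently, which is the separation the framework description relies on.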

Load-bearing premise

Factual correctness can be reliably decomposed into conceptual semantic roles and contextual predicate-argument realizations such that implicit argument augmentation and contrastive fact sets deliver complete, unbiased coverage.

What would settle it

Human factuality ratings collected on a fresh set of procedural video captions; if DualFact scores do not correlate more strongly with those ratings than standard metrics or text-only checks, the framework's claimed advantage would not hold.
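One way to run this check is to compare rank correlations with the human ratings and put a bootstrap interval on their difference. The sketch below is a generic percentile bootstrap over captions, written with the standard library only; the function names and the toy scores are hypothetical, and it is not a description of the paper's own statistical procedure.

```python
import random


def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # degenerate resample: treat as no correlation
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of ranks."""
    return pearson(ranks(x), ranks(y))


def bootstrap_diff_ci(human, metric_a, metric_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for spearman(human, a) - spearman(human, b).

    Captions are resampled jointly so both metrics see the same items.
    """
    rng = random.Random(seed)
    n = len(human)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        h = [human[i] for i in idx]
        a = [metric_a[i] for i in idx]
        b = [metric_b[i] for i in idx]
        diffs.append(spearman(h, a) - spearman(h, b))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Toy, made-up scores for eight captions.
human = [1, 2, 3, 4, 5, 6, 7, 8]
candidate = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
baseline = [2, 1, 4, 3, 6, 5, 8, 7]
print(spearman(human, candidate), spearman(human, baseline))
print(bootstrap_diff_ci(human, candidate, baseline))
```

If the interval for the difference sits clearly above zero on fresh data, the claimed advantage survives; an interval straddling zero would leave it unsettled.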

Figures

Figures reproduced from arXiv: 2604.25584 by Cennet Oguz, Josef van Genabith, Simon Ostermann, Yasser Hamidullah.

**Figure 1.** Overview of the MULTIFACTSCORE pipeline. The captioning model generates a description, from which positive facts are extracted and verified using multimodal and textual NLI models. Error decomposition identifies hallucination, salience, and omission errors for fine-grained factuality analysis. Evaluation. At inference time, verification depends on the available evidence: $\hat{y}_i = M_{\mathrm{nli}}(V, f_i)$, visual ve…

Abstract

We introduce DualFact, a dual-layer, multimodal factuality evaluation framework for procedural video captioning. DualFact separates factual correctness into conceptual facts, capturing abstract semantic roles (e.g., Action, Ingredient, Tool, Location), and contextual facts, capturing their grounded predicate-argument realizations in video. To support complete and role-consistent evaluation, DualFact incorporates implicit argument augmentation (VIA) and contrastive fact sets. We instantiate DualFact in two modes: DualFact-T, which verifies facts against textual evidence, and DualFact-V, which verifies facts against video-grounded visual evidence. Experiments on YouCook3-Fact and CraftBench-Fact show that state-of-the-art multimodal language models produce fluent but often factually incomplete captions, with systematic omissions and role-level inconsistencies. DualFact correlates more strongly with human factuality judgments than standard metrics, particularly for contextual facts, and reveals that caption-only evaluation overestimates hallucinations compared to video-grounded verification. Overall, DualFact offers an interpretable and human-aligned evaluation protocol that highlights persistent challenges in multimodal factual grounding, extending beyond surface-level fluency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

DualFact splits factuality into conceptual roles and contextual realizations for procedural video captions, with video verification exposing more omissions than text-only checks.


The main point is that DualFact decomposes factual correctness in procedural video captions into abstract semantic roles like actions and ingredients, then checks their specific grounded realizations against either text or video evidence. This dual-layer setup plus implicit argument augmentation and contrastive sets is the core new piece, and the experiments on YouCook3-Fact and CraftBench-Fact show models produce fluent output but systematically drop details and role consistency. The video-grounded mode correlates better with human judgments than standard metrics and suggests caption-only evaluation overestimates hallucinations.

That separation and the two verification modes are a clear step past existing factuality scores in video captioning work, and the new datasets give others something concrete to build on. The framework is interpretable by design, which helps when diagnosing where models fail to ground facts.

The soft spot is whether the role decomposition and augmentation actually deliver complete, unbiased coverage. If implicit arguments or contrastive negatives systematically under-represent certain contextual elements like locations or tools, the reported correlation gains and the hallucination-overestimation finding could partly reflect annotation choices rather than a true improvement in measurement. The abstract gives the high-level results but leaves the exact construction details and statistical breakdowns thin, so the strength of the human alignment claim is still hard to gauge fully.

This is aimed at people working on multimodal evaluation, instructional video understanding, or captioning systems. A reader testing factuality protocols or building procedural AI would find usable ideas and datasets here. It deserves peer review because the evaluation gap it targets is real and the proposed structure is thoughtful, even if the current evidence needs more detail to confirm it holds up.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces DualFact, a dual-layer multimodal fact verification framework for procedural video captioning. It decomposes factual correctness into conceptual facts (abstract semantic roles such as Action, Ingredient, Tool, Location) and contextual facts (grounded predicate-argument realizations). The approach incorporates implicit argument augmentation (VIA) and contrastive fact sets to support complete, role-consistent evaluation, with two verification modes: DualFact-T (textual evidence) and DualFact-V (video-grounded visual evidence). Experiments on the introduced YouCook3-Fact and CraftBench-Fact datasets indicate that state-of-the-art multimodal language models produce fluent but factually incomplete captions exhibiting systematic omissions and role-level inconsistencies. DualFact shows stronger correlation with human factuality judgments than standard metrics (particularly for contextual facts) and suggests that caption-only evaluation overestimates hallucinations relative to video-grounded verification.

Significance. If the results hold, DualFact provides a structured, interpretable protocol for diagnosing factual grounding failures in multimodal models on procedural tasks, moving beyond surface fluency metrics. The introduction of two new fact-annotated datasets, the explicit separation of conceptual and contextual layers, and the dual verification modes are concrete strengths that could support more targeted model improvements. The reported finding that video-grounded verification yields different hallucination estimates than caption-only approaches is a useful empirical observation for the video understanding community.

major comments (2)

[§3.2] §3.2 (Implicit Argument Augmentation and Contrastive Fact Sets): The central premise that VIA augmentation plus contrastive sets guarantee complete and role-consistent coverage is load-bearing for all downstream claims about correlation advantages and systematic omissions, yet the manuscript provides no quantitative coverage analysis (e.g., percentage of implicit arguments successfully augmented or inter-role consistency rates across the datasets). Without such validation, it remains possible that under-representation of certain contextual facts (locations, tools) biases the human correlation results.
[§4.3] §4.3 (Human Correlation Experiments): The claim that DualFact correlates more strongly with human judgments than baselines is presented without reported p-values, confidence intervals on the correlation differences, or inter-annotator agreement statistics for the human factuality labels. These details are necessary to establish that the observed advantage is robust rather than an artifact of annotation variance or small sample effects.
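The coverage analysis requested in the first major comment amounts to a per-role tally. The sketch below assumes a hypothetical input format (a list of implicit arguments with their roles, plus the set of argument ids the augmentation step recovered); neither the format nor the numbers come from the paper.

```python
from collections import Counter


def coverage_by_role(implicit_args, augmented_ids):
    """Percentage of implicit arguments recovered, broken down by role.

    implicit_args: list of (arg_id, role) pairs from a gold annotation.
    augmented_ids: set of arg_ids that the augmentation step produced.
    """
    total = Counter(role for _, role in implicit_args)
    hit = Counter(role for arg_id, role in implicit_args if arg_id in augmented_ids)
    return {role: 100.0 * hit[role] / total[role] for role in total}


# Made-up annotation for illustration only.
gold = [(1, "Tool"), (2, "Tool"), (3, "Location"), (4, "Ingredient")]
recovered = {1, 3, 4}
print(coverage_by_role(gold, recovered))
```

A role whose percentage is far below the others would be the under-representation the referee is worried about.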

minor comments (3)

[Title/Abstract] Title vs. Abstract: The title refers to DualFact+ while the abstract and body consistently use DualFact; a brief clarification of the naming convention would avoid reader confusion.
[Figure 1] Figure 1 (Framework Overview): The diagram would be clearer with explicit callouts distinguishing the VIA augmentation step from the contrastive fact set construction and the two verification paths (T vs. V).
[§2] Related Work: The section would benefit from citing at least one or two 2024 works on multimodal fact verification to better situate the contribution relative to concurrent efforts.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback on our manuscript introducing DualFact. The comments help clarify areas where additional evidence and statistical details can strengthen the presentation of our dual-layer fact verification framework. We address each major comment below and will incorporate revisions to improve the robustness of the claims.

read point-by-point responses

Referee: [§3.2] §3.2 (Implicit Argument Augmentation and Contrastive Fact Sets): The central premise that VIA augmentation plus contrastive sets guarantee complete and role-consistent coverage is load-bearing for all downstream claims about correlation advantages and systematic omissions, yet the manuscript provides no quantitative coverage analysis (e.g., percentage of implicit arguments successfully augmented or inter-role consistency rates across the datasets). Without such validation, it remains possible that under-representation of certain contextual facts (locations, tools) biases the human correlation results.

Authors: We agree that explicit quantitative validation of coverage would strengthen the load-bearing claims in §3.2. The manuscript describes the VIA augmentation procedure and contrastive set construction in detail but does not report aggregate coverage metrics. In the revised version, we will add a quantitative analysis subsection to §3.2 that reports (1) the percentage of implicit arguments successfully augmented per semantic role (Action, Ingredient, Tool, Location) on both YouCook3-Fact and CraftBench-Fact, and (2) inter-role consistency rates measured via automated checks and sampled manual verification. These additions will directly address the possibility of under-representation bias and support the downstream correlation results.
revision: yes
Referee: [§4.3] §4.3 (Human Correlation Experiments): The claim that DualFact correlates more strongly with human judgments than baselines is presented without reported p-values, confidence intervals on the correlation differences, or inter-annotator agreement statistics for the human factuality labels. These details are necessary to establish that the observed advantage is robust rather than an artifact of annotation variance or small sample effects.

Authors: We acknowledge that the human correlation experiments in §4.3 would benefit from additional statistical reporting. The manuscript presents Pearson and Spearman correlations but omits p-values for the differences, confidence intervals, and inter-annotator agreement. In the revised manuscript, we will augment §4.3 with (1) p-values testing the significance of DualFact's correlation advantage over baselines, (2) 95% bootstrap confidence intervals on the correlation coefficients and their differences, and (3) inter-annotator agreement statistics (e.g., Fleiss' kappa) for the factuality labels collected on the two datasets. These additions will provide clearer evidence that the observed advantages are robust.
revision: yes
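For the inter-annotator agreement statistic mentioned in the second response, Fleiss' kappa can be computed directly from a table of label counts. This is a minimal sketch of the standard formula; the input layout and the toy counts are assumptions for illustration, not data from the paper.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a fixed number of raters per item.

    counts[i][j] is the number of raters who assigned item i to category j.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")
    n_cats = len(counts[0])

    # Share of all assignments that went to each category.
    p_j = [sum(row[j] for row in counts) / (n_items * n_raters) for j in range(n_cats)]
    # Observed agreement for each item.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]

    p_bar = sum(p_i) / n_items       # mean observed agreement
    p_e = sum(p * p for p in p_j)    # agreement expected by chance
    if p_e == 1.0:
        return 1.0  # only one category was ever used
    return (p_bar - p_e) / (1 - p_e)


# Toy table: 4 facts, 3 raters, categories = (supported, not supported).
print(fleiss_kappa([[3, 0], [0, 3], [2, 1], [1, 2]]))
```

Values near 1 indicate that raters agree far more than chance would predict; values near or below 0 would undercut any correlation computed against those labels.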

Circularity Check

0 steps flagged

Empirical evaluation framework with no self-referential derivation chain


The paper presents DualFact as an introduced framework that decomposes facts into conceptual and contextual layers, augmented by VIA and contrastive sets, then validates it empirically on new datasets (YouCook3-Fact, CraftBench-Fact) against human judgments and model outputs. No equations, fitted parameters, or predictions are shown that reduce by construction to the inputs; the central claims rest on external human correlations and dataset construction rather than on self-definition or load-bearing self-citation. The work is self-contained as an evaluation protocol without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper introduces a new evaluation framework relying primarily on domain assumptions about semantic role decomposition in procedural videos rather than fitted parameters or new physical entities.

axioms (2)

domain assumption: Factual correctness in procedural video captions can be meaningfully separated into abstract conceptual semantic roles and their grounded contextual realizations.
This decomposition is the foundational structure of the DualFact framework.
ad hoc to paper: Implicit argument augmentation and contrastive fact sets provide complete and role-consistent coverage for evaluation.
These components are introduced to support complete evaluation as stated in the abstract.

pith-pipeline@v0.9.0 · 5496 in / 1483 out tokens · 111520 ms · 2026-05-07T16:25:31.035766+00:00 · methodology


Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages

[1] Faithscore: Fine-grained evaluations of hallucinations in large vision-language models. arXiv preprint arXiv:2311.01477. Chloé Kiddon, Ganesa Thandavam Ponnuraj, Luke Zettlemoyer, and Yejin Choi. 2015. Mise en place: Unsupervised interpretation of instructional recipes. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Proce...

[2] Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2630–2640. Cennet Oguz, Pascal Denis, Emmanuel Vincent, Simon Ostermann, and Josef van Genabith. 2023. Find-2-find: Multitask learning for anaphora resolution and object localizati...

[3] Coin: A large-scale dataset for comprehensive instructional video analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1207–1216. Jun Xu, Tao Mei, Ting Yao, and Yong Rui. 2016. Msr-vtt: A large video description dataset for bridging video and language. In Proceedings of the IEEE Conference on Computer Vi...

[4] Preserve structure - Keep the exact format: "<Category> is <Value>." - Use the same category as the corresponding positive fact

[5] Ensure falsity - Each generated fact must contradict the positive facts. - Changing only surface form, plurality, or numeric values is NOT sufficient

[6] Maintain plausibility - The value must be realistic within the task domain (e.g., cooking, crafting, assembly, medical procedures), but incorrect in the given context

[7] Avoid overlap - Do NOT reuse or partially reuse any word, stem, or substring from the positive fact values. - Do NOT use synonyms, hypernyms, or morphological variants. ### Error Types Action - Replace the action with a verb from a different functional category. - The new action must not naturally co-occur with the original one. Object / Ingredient / Mate...

[8] A list of true positive contextual facts (short predicate–argument statements)

[9] A target negative action verb. ### Task Generate a list of false but plausible contextual facts that contradict the positive facts while remaining linguistically natural. Each fact should follow one of these patterns: - verb + object - verb + with TOOL - verb + in/to LOCATION ### Error Types A. Negative action verb + original object B. Positive action ver...

[10] Preserve structure - Keep the same syntactic pattern as the positive facts

[11] Ensure falsity - Each generated fact must be false relative to the positives

[12] Maintain plausibility - The verb–argument combination must be physically and logically possible within the task domain, even though incorrect in context

[13] Avoid overlap - Do NOT reuse any word, stem, or substring from the positive facts

[14] Avoid trivial negatives - Do NOT generate nonsensical or impossible actions. - Do NOT change only quantities, attributes, or minor properties. ### Output Format Return a comma-separated list only. Example: cut metal, assemble with brush, place on floor, measure wood add tomato, add with spoon, add on tray, peel onion
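Some of the rules in the prompt fragments above (entries [4] and [7]) are mechanical enough to check automatically. The sketch below validates a generated negative conceptual fact against three of them: the "<Category> is <Value>." format, the same-category requirement, and the ban on reusing words or substrings from positive values. The function names are hypothetical, and the stem, synonym, and plausibility rules are deliberately not covered because they need linguistic resources this sketch does not assume.

```python
import re

# Matches the "<Category> is <Value>." surface form from the prompt.
FACT_RE = re.compile(r"^(?P<category>[A-Za-z /]+?) is (?P<value>.+)\.$")


def parse_fact(text):
    """Return (category, lowercased value), or None if the format is wrong."""
    m = FACT_RE.match(text.strip())
    if m is None:
        return None
    return m.group("category").strip(), m.group("value").strip().lower()


def overlaps(negative_value, positive_values):
    """True if the negative shares a word or a substring with any positive."""
    neg_words = set(negative_value.split())
    for pos in positive_values:
        if neg_words & set(pos.split()):
            return True
        if pos in negative_value or negative_value in pos:
            return True
    return False


def check_negative(negative, positive, all_positives):
    """List the rule violations found for one generated negative fact."""
    neg, pos = parse_fact(negative), parse_fact(positive)
    if neg is None or pos is None:
        return ["not in '<Category> is <Value>.' format"]
    problems = []
    if neg[0] != pos[0]:
        problems.append("category differs from the positive fact")
    pos_values = [p[1] for p in map(parse_fact, all_positives) if p is not None]
    if overlaps(neg[1], pos_values):
        problems.append("value overlaps a positive value")
    return problems


positives = ["Tool is knife.", "Action is cut."]
print(check_negative("Tool is spoon.", "Tool is knife.", positives))
print(check_negative("Tool is sharp knife.", "Tool is knife.", positives))
```

A filter like this only catches surface violations; whether a negative is actually false yet plausible in the video still needs a human or a model in the loop.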
