arxiv: 2606.03614 · v1 · pith:GLWCHMQYnew · submitted 2026-06-02 · 💻 cs.MM

OmniHalluc-L: Counterfactual Benchmarking and Modality-Perturbation Reliability Calibration for Long-Form Omni Hallucination

Pith reviewed 2026-06-28 07:29 UTC · model grok-4.3

classification 💻 cs.MM
keywords long-form omni hallucination · counterfactual benchmarking · modality perturbation · event binding · reliability calibration · audio-visual misbinding · long-video QA

The pith

Long-video omni models misbind correct audio-visual evidence to wrong speakers or moments, as shown by strict-pair accuracy on paired counterfactual claims, and a frozen calibration using modality perturbations raises their scores without r

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that omni assistants often fail long videos not by inventing facts but by attaching valid evidence to the wrong speaker, moment, or modality. Standard item-level QA can miss these almost-true errors because local facts remain valid. The authors therefore build paired supported and counterfactual claims from the same audio-visual events and score models on whether both members of each pair are answered correctly. They release a 3600-item benchmark drawn from 638 videos averaging 24 minutes each and demonstrate that open-weight models score 32 to 41 percent under this measure while a closed-source reference reaches 76 percent. They also present a calibration procedure that selects audio-negative probes and fuses their response shifts with native confidence to produce improved per-claim support estimates.

Core claim

Under the counterfactual event-binding protocol, open-weight Omni models remain weak at pair-level binding: Qwen2.5-Omni-7B reaches 32.06% and Qwen3-Omni-Instruct reaches 41.55%, versus 76.54% for a closed-source reference. Modality-Perturbation Reliability Calibration lifts Qwen2.5-Omni-7B to 36.22% and Qwen3 to 51.09% on the benchmark, and improves target-adapted MCQ accuracy on OmniVideoBench (+2.20) and WorldSense (+1.51) with Qwen3.
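
The size of the reported lifts, and the gap that remains, can be read off the numbers above with a few subtractions. The sketch below only restates the quoted scores; the function name `summarize` and the dictionary layout are illustrative, and it assumes that "Qwen3" in the calibrated results refers to the same Qwen3-Omni-Instruct model as the raw score.

```python
# Strict-pair accuracy (percent) as quoted in the core claim.
raw = {"Qwen2.5-Omni-7B": 32.06, "Qwen3-Omni-Instruct": 41.55}
calibrated = {"Qwen2.5-Omni-7B": 36.22, "Qwen3-Omni-Instruct": 51.09}
closed_source = 76.54  # closed-source reference score


def summarize(raw, calibrated, reference):
    """Absolute gain, relative gain, and remaining gap for each model."""
    rows = {}
    for name, before in raw.items():
        after = calibrated[name]
        rows[name] = {
            "gain_pp": round(after - before, 2),                # percentage points
            "relative_gain_pct": round(100 * (after - before) / before, 1),
            "remaining_gap_pp": round(reference - after, 2),    # gap to reference
        }
    return rows


summary = summarize(raw, calibrated, closed_source)
for name, row in summary.items():
    print(name, row)
```

On these figures the calibrated Qwen3 model is still more than 25 points below the closed-source reference, which is the sense in which the gap is narrowed but not closed.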

What carries the argument

Counterfactual event-binding protocol that constructs matched supported and counterfactual claims from identical audio-visual event evidence and evaluates models by strict-pair accuracy to isolate misbinding.
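
To make the difference between item-level and strict-pair scoring concrete, here is a minimal sketch on toy data. The record fields (`pair_id`, `label`, `pred`) are hypothetical and not taken from the benchmark's release format; the only rule borrowed from the paper is that a pair counts only when both of its claims are answered correctly.

```python
from collections import defaultdict


def item_accuracy(items):
    """Fraction of individual claims answered correctly."""
    return sum(it["pred"] == it["label"] for it in items) / len(items)


def strict_pair_accuracy(items):
    """Fraction of pairs in which every member claim is answered correctly."""
    by_pair = defaultdict(list)
    for it in items:
        by_pair[it["pair_id"]].append(it["pred"] == it["label"])
    return sum(all(flags) for flags in by_pair.values()) / len(by_pair)


# Toy model that accepts almost everything as "supported" (label True).
items = [
    {"pair_id": 0, "label": True,  "pred": True},   # supported claim, accepted
    {"pair_id": 0, "label": False, "pred": True},   # counterfactual, wrongly accepted
    {"pair_id": 1, "label": True,  "pred": True},
    {"pair_id": 1, "label": False, "pred": True},
    {"pair_id": 2, "label": True,  "pred": True},
    {"pair_id": 2, "label": False, "pred": False},  # counterfactual correctly rejected
]

acc = item_accuracy(items)
spa = strict_pair_accuracy(items)
print(f"item accuracy = {acc:.3f}, strict-pair accuracy = {spa:.3f}")
```

In this toy case four of six items are correct but only one of three pairs is, which is the kind of gap the strict-pair metric is meant to expose.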

If this is right

  • Strict-pair accuracy penalizes acceptance of both a claim and its near-counterfactual even when local evidence is correct.
  • The calibration operates on a frozen backbone by selecting audio-negative probes inside video-level folds and combining response shifts with native confidence.
  • Gains from the calibration appear on the main benchmark and transfer to adapted MCQ versions of OmniVideoBench and WorldSense.
  • The method narrows but does not close the gap between open-weight and closed-source systems on binding tasks.
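
The calibration idea in the second bullet can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' implementation: the page does not describe the probe types, the selection criterion, or the calibrator, so the correlation-based probe selection, the gradient-descent logistic regression, and every function name here are stand-ins. What the sketch does preserve is the structure described above: the backbone is never updated, probes are chosen on training videos only, folds are split at the video level, and held-out claims are scored from native confidence plus probe-induced response shifts.

```python
import numpy as np


def fit_logistic(X, y, steps=500, lr=0.5, l2=1e-2):
    """Plain gradient-descent logistic regression (a stand-in calibrator)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * (Xb.T @ (p - y) / len(y) + l2 * w)
    return w


def predict_logistic(w, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return 1.0 / (1.0 + np.exp(-Xb @ w))


def select_probes(shifts, y, k):
    """Keep the k probes whose response shift correlates most with the label."""
    scores = [abs(np.corrcoef(shifts[:, j], y)[0, 1]) for j in range(shifts.shape[1])]
    return np.argsort(scores)[::-1][:k]


def calibrate_cv(native_conf, probe_conf, y, video_ids, n_folds=3, k=2, seed=0):
    """Out-of-fold support estimates; all claims from one video share a fold."""
    rng = np.random.default_rng(seed)
    videos = rng.permutation(np.unique(video_ids))
    fold_of = {v: i % n_folds for i, v in enumerate(videos)}
    folds = np.array([fold_of[v] for v in video_ids])
    shifts = native_conf[:, None] - probe_conf  # response shift per probe
    out = np.zeros(len(y))
    for f in range(n_folds):
        tr, te = folds != f, folds == f
        keep = select_probes(shifts[tr], y[tr], k)  # chosen on training videos only
        X_tr = np.column_stack([native_conf[tr], shifts[tr][:, keep]])
        X_te = np.column_stack([native_conf[te], shifts[te][:, keep]])
        out[te] = predict_logistic(fit_logistic(X_tr, y[tr]), X_te)
    return out, folds


# Synthetic stand-in for frozen-backbone outputs (no real model is queried).
rng = np.random.default_rng(1)
n, n_probes = 240, 4
video_ids = rng.integers(0, 30, size=n)
y = rng.integers(0, 2, size=n).astype(float)             # 1 = supported claim
native_conf = np.clip(0.6 + 0.05 * rng.standard_normal(n), 0, 1)
informative = (np.arange(n_probes) < 2)                   # only probes 0 and 1 react
probe_conf = np.clip(
    native_conf[:, None] - 0.2 * y[:, None] * informative
    + 0.05 * rng.standard_normal((n, n_probes)), 0, 1)

support, folds = calibrate_cv(native_conf, probe_conf, y, video_ids)
print(support[:5])
```

Splitting folds by video rather than by claim keeps the two members of a pair, and any other claims about the same footage, out of both the probe selection and the calibrator fit that scores them.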

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same perturbation-based reliability estimation could be tested on binding problems in other multimodal settings such as audio-only or image-text tasks.
  • Extending the protocol to videos longer than the current average of 24 minutes might reveal additional binding failures not captured here.
  • The persistent gap to closed-source performance suggests that explicit binding objectives during pretraining or fine-tuning may be needed beyond post-hoc calibration.

Load-bearing premise

Paired supported and counterfactual claims can be constructed from the same evidence so that strict-pair accuracy measures only binding mistakes rather than artifacts introduced during claim creation.

What would settle it

A new model that scores above 70 percent strict-pair accuracy on the 3600-item benchmark while showing comparable item-level performance, or the calibration producing no measurable lift on the benchmark or the two transfer tasks, would falsify the reported weakness and improvement.

Figures

Figures reproduced from arXiv: 2606.03614 by Baoyun Peng, Jiafu Tang, Jiaheng Liu, Shihao Li, Xiaodong Wang, Yanghai Wang, Zhe Cao, Zhide Lei, Zijie Zhang, Zixuan Dong.

Figure 1: Strict-pair scoring: paired claims share real evidence but differ in binding.
We address this gap with a counterfactual event-binding protocol. Starting from real audio-visual event evidence, we author a supported claim and a minimally edited counterfactual that preserves the evidence …
Figure 2: OmniHalluc-L construction pipeline. Long-form videos are temporally segmented, annotated through isolated audio and visual views, authored into supported/counterfactual claim pairs, and validated by human verification so that each pair differs only in the evidence-binding relation.
A pair is credited only when both sides are answered correctly. The primary metric is strict-pair accuracy: $\mathrm{SPA} = \frac{1}{|P|} \sum_{p \in P}$ …
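
The formula above is cut off in the extracted text. One reading consistent with the sentence "A pair is credited only when both sides are answered correctly" is given below; the symbols $\hat{y}$ (model answer), $y$ (gold answer), and the superscripts $s$ (supported) and $c$ (counterfactual) are assumptions introduced here, not notation confirmed by the paper.

```latex
\mathrm{SPA} \;=\; \frac{1}{|P|} \sum_{p \in P}
  \mathbb{1}\!\left[\hat{y}_{p}^{\,s} = y_{p}^{\,s}\right] \cdot
  \mathbb{1}\!\left[\hat{y}_{p}^{\,c} = y_{p}^{\,c}\right]
```

Under this reading, a model that accepts every claim scores 0 on SPA even though its item-level accuracy would be 50% on balanced pairs.
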
Figure 3: OmniHalluc-L dataset statistics.
Figure 4: MPRC overview. A frozen Omni backbone is queried under native audio-visual input and structured audio-negative probes. Branch subsets and a lightweight calibrator are selected within video-level folds; held-out claims are predicted independently from native confidence and branch-induced response shifts.
MPRC is a frozen-backbone calibration framework that keeps the visual stream fixed, queries native audio …
Figure 5: Frame-count and efficiency diagnostics. Panels compare Raw AV, MAD, and MPRC under 64, 128, and 256 input frames and summarize latency versus SPA.
Original abstract

Long-video Omni assistants often fail not by inventing content, but by misbinding real evidence: they hear the right utterance and see the right event, yet attach it to the wrong speaker, moment, or modality. These almost-true errors evade standard video QA because local evidence remains valid, so item-level scoring can reward both a supported claim and its near-counterfactual. We introduce a counterfactual event-binding protocol that constructs paired supported/counterfactual claims from the same audio-visual event evidence and evaluates them by strict-pair accuracy. We instantiate it as OmniHalluc-L, a benchmark for long-video Omni hallucination, with 3,600 single-claim QA items from 638 long-form videos averaging 24.16 minutes and covering 256.87 hours. Under this protocol, open-weight Omni models remain weak at pair-level binding: Qwen2.5-Omni-7B reaches 32.06% and Qwen3-Omni-Instruct reaches 41.55%, versus 76.54% for a closed-source reference. To narrow this gap without updating the backbone, we propose MPRC, Modality-Perturbation Reliability Calibration, a frozen-backbone framework that selects audio-negative probes within video-level folds and fuses their response shifts with native audio-visual confidence into per-claim support estimates. MPRC lifts Qwen2.5-Omni-7B to 36.22% and Qwen3 to 51.09% on OmniHalluc-L, and improves target-adapted MCQ accuracy on OmniVideoBench (+2.20) and WorldSense (+1.51) with Qwen3.
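
The dataset-scale figures in the abstract can be cross-checked against each other. The short script below uses only the numbers quoted above; the variable names are illustrative.

```python
n_videos = 638
mean_minutes = 24.16      # reported average video length
reported_hours = 256.87   # reported total duration
n_items = 3600

# Total duration implied by the (rounded) mean length.
implied_hours = n_videos * mean_minutes / 60

# Mean length implied by the reported total duration.
implied_mean = reported_hours * 60 / n_videos

items_per_video = n_items / n_videos

print(f"implied total: {implied_hours:.2f} h (reported {reported_hours} h)")
print(f"implied mean:  {implied_mean:.3f} min (reported {mean_minutes} min)")
print(f"items per video: {items_per_video:.2f}")
```

The implied total differs from the reported 256.87 hours by a few hundredths of an hour, which is what rounding the mean length to two decimals would produce, so the figures are mutually consistent. They also work out to between five and six claims per video on average.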

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces OmniHalluc-L, a benchmark for long-form omni hallucination using a counterfactual event-binding protocol that constructs paired supported/counterfactual claims from identical audio-visual evidence and scores via strict-pair accuracy to isolate misbinding errors. It reports low pair-level performance for open-weight models (Qwen2.5-Omni-7B at 32.06%, Qwen3-Omni-Instruct at 41.55%) versus 76.54% for a closed-source reference on 3,600 items from 638 videos (avg. 24.16 min). It also proposes Modality-Perturbation Reliability Calibration (MPRC), which fuses audio-negative probe shifts with native confidence to lift scores to 36.22% and 51.09% without backbone updates, and reports gains on OmniVideoBench (+2.20) and WorldSense (+1.51).

Significance. If the protocol successfully isolates binding errors, the benchmark fills a gap in multimodal hallucination evaluation by targeting almost-true misbindings that standard item-level QA misses; the calibration method offers a practical, frozen-backbone improvement with cross-benchmark transfer.

major comments (2)
  1. [Abstract] Abstract: the claim that strict-pair accuracy isolates misbinding without confounding factors from the claim construction process (reader's weakest assumption) lacks any description of validation steps, inter-annotator agreement, or controls for pair construction; this is load-bearing for interpreting the reported accuracy gaps and lifts.
  2. [Abstract] Abstract: no details are supplied on selection criteria for audio-negative probes within video-level folds or error analysis of the 3,600 items, preventing assessment of whether the +4.16 and +9.54 percentage-point gains under MPRC are supported.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for greater transparency in our benchmark construction and calibration details. We address each major comment below.

  1. Referee: [Abstract] Abstract: the claim that strict-pair accuracy isolates misbinding without confounding factors from the claim construction process (reader's weakest assumption) lacks any description of validation steps, inter-annotator agreement, or controls for pair construction; this is load-bearing for interpreting the reported accuracy gaps and lifts.

    Authors: We agree that the abstract provides insufficient detail on validation steps, inter-annotator agreement, and controls for pair construction, which are necessary to substantiate that strict-pair accuracy isolates misbinding errors. The full manuscript (Section 3) outlines the counterfactual event-binding protocol but does not explicitly report these elements. In revision we will add a dedicated validation subsection reporting inter-annotator agreement statistics, controls for pair construction (e.g., evidence-matching checks and counterfactual plausibility filters), and any sensitivity analyses performed. revision: yes

  2. Referee: [Abstract] Abstract: no details are supplied on selection criteria for audio-negative probes within video-level folds or error analysis of the 3,600 items, preventing assessment of whether the +4.16 and +9.54 percentage-point gains under MPRC are supported.

    Authors: We concur that the abstract omits selection criteria for audio-negative probes and error analysis of the 3,600 items, limiting evaluation of the reported gains under Modality-Perturbation Reliability Calibration. Section 4 describes probe selection within video-level folds and the fusion procedure, yet lacks explicit criteria and item-level error breakdowns. We will revise to include the precise selection criteria (e.g., temporal and semantic mismatch thresholds) and a summary error analysis of the dataset to better justify the observed improvements. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper's core contributions are an empirical benchmark protocol (counterfactual event-binding with strict-pair accuracy on paired supported/counterfactual claims) and a calibration method (Modality-Perturbation Reliability Calibration, which fuses response shifts from audio-negative probes with native confidence). Both are presented as constructed and evaluated on external video data, with performance gains reported as measured outcomes on OmniHalluc-L and the transfer benchmarks (OmniVideoBench, WorldSense). No equations, self-referential definitions, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or high-level description that would reduce any claimed result to its inputs by construction. The protocol's isolation of binding errors is an assumption about data construction, not a circular derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review yields minimal ledger entries; the central claim rests on the unverified assumption that the constructed counterfactual pairs isolate binding errors and that audio-negative probes provide unbiased calibration signals.

axioms (1)
  • domain assumption Counterfactual claims constructed from the same audio-visual evidence correctly test misbinding without introducing extraneous errors.
    Invoked in the description of the counterfactual event-binding protocol.

pith-pipeline@v0.9.1-grok · 5876 in / 1342 out tokens · 32733 ms · 2026-06-28T07:29:35.160942+00:00 · methodology
