arxiv: 2606.30220 · v1 · pith:7ITWCBY2new · submitted 2026-06-29 · 💻 cs.CV

From Accuracy to Visual Dependence: Auditing and Filtering Modality Collapse in Traffic VideoQA

Pith reviewed 2026-06-30 06:11 UTC · model grok-4.3

classification 💻 cs.CV
keywords: VideoQA, Vision-Language Models, Modality Collapse, Shortcut Learning, Visual Grounding, Traffic Accident, Benchmark Audit

The pith

Vision-language models often solve traffic video questions without needing the video.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper audits four public benchmarks for traffic accident VideoQA and shows that recent open-weight VLMs achieve competitive or superior accuracy when video input is removed. On the MM-AU benchmark, accuracy rises without video and falls when more frames are added. The authors define Blind Gap to capture above-chance text-only performance and Visual Gain to measure the added value of video, then introduce an instance-level Shortcut Score that combines text-only confidence with visual necessity signals. This score enables training-free filtering of shortcut questions, producing subsets with lower bias and stronger visual grounding. The results indicate large differences in grounding quality across benchmarks and that accuracy alone is not sufficient for safety-critical VideoQA.

Core claim

Several recent open-weight Vision-Language Models perform competitively, and sometimes better, without video input on traffic VideoQA benchmarks. On MM-AU, removing video consistently improves accuracy while adding more frames degrades performance. The work introduces Blind Gap and Visual Gain as dataset-level diagnostics of visual dependence and a Shortcut Score for instance-level filtering of shortcut-prone questions, yielding filtered subsets that reduce textual bias and improve visual grounding.

What carries the argument

The Shortcut Score, which combines text-only model confidence with signals of visual necessity to enable continuous, training-free filtering of shortcut-prone questions.
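
The filtering step can be sketched as follows. This is a minimal illustration, not the paper's formula: the summary only says the score combines text-only confidence (called T here) with a visual necessity signal (called V here), so the product `T * (1 - V)`, the assumption that both signals lie in [0, 1], the `Question` type, and the rule "retain when the score is at or below τ" are all hypothetical choices.

```python
from typing import Dict, Iterable, List, NamedTuple


class Question(NamedTuple):
    qid: str
    text_conf: float    # T: confidence of the text-only (blind) model, assumed in [0, 1]
    visual_need: float  # V: visual necessity signal, assumed in [0, 1]


def shortcut_score(q: Question) -> float:
    # Hypothetical combination: high when the blind model is confident
    # and the video appears to add little. The paper's actual formula may differ.
    return q.text_conf * (1.0 - q.visual_need)


def filter_questions(questions: Iterable[Question], tau: float) -> List[Question]:
    # Training-free filtering: keep only questions whose score does not exceed tau
    # (the retention direction is an assumption).
    return [q for q in questions if shortcut_score(q) <= tau]


def sweep(questions: List[Question], taus: Iterable[float]) -> Dict[float, int]:
    # Number of retained questions per threshold, to expose the size tradeoff.
    return {tau: len(filter_questions(questions, tau)) for tau in taus}
```

Because the score is continuous, the threshold can be swept to trade subset size against shortcut severity, rather than making a single binary keep-or-drop decision per question.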

If this is right

  • Benchmarks exhibit large differences in the degree of visual grounding required.
  • Applying the Shortcut Score produces subsets with reduced shortcut bias and improved visual grounding.
  • High accuracy on existing benchmarks does not ensure that models use visual evidence.
  • Safety-critical VideoQA evaluation must prioritize measures of visual dependence beyond accuracy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar modality collapse may exist in other VideoQA domains where text shortcuts are possible.
  • Benchmark design should explicitly test whether questions remain solvable after removing visual content.
  • The filtering method could be applied at training time to encourage genuine multimodal learning.

Load-bearing premise

That text-only performance above chance on these benchmarks means the questions can be solved without scene-specific visual evidence.

What would settle it

Rewrite the questions to remove all textual patterns that allow correct answers from text alone, then measure whether text-only accuracy falls to chance while video input becomes necessary for high accuracy.

Figures

Figures reproduced from arXiv: 2606.30220 by María Alejandra Bravo Sarmiento, Sanghwan Kim, Sena Korkut, Zeynep Akata.

Figure 1: Diagnostic space of benchmark-level modality collapse. Each benchmark is positioned by its Blind Gap and Visual Gain, with circle size proportional to multimodal accuracy (Acc_V). Our shortcut-aware filtering shifts MM-AU and VRU-Accident toward lower Blind Gap and higher Visual Gain, producing more visually grounded evaluation subsets (G-MM-AU and G-VRU-Accident).
Figure 2: Representative questions from MM-AU and VRU-Accident. Masked-eye and open-eye icons show blind and video ensemble predictions (✓ all-correct, ✗ all-incorrect, ? mixed). For filtering methods, ✓/✗ indicate retained/removed. The Shortcut Score bar reflects shortcut severity.
Figure 3: Sweeping the Shortcut Score threshold τ reveals the tradeoff between BG, VG, and the number of retained samples. Choosing τ = 0.1 reduces text-only solvability while improving visual grounding.
Adjacent text: … positive VG on MM-AU; filtering on −V improves VG but raises BG and collapses sample size. The S sweep provides the most balanced outcome on both datasets, and we apply τ = 0.1 to both MM-AU and VRU-Accident. Full s…
Figure 4: Threshold sweep on T, −V, and S for MM-AU and VRU-Accident. Blue and orange curves show Blind Gap and Visual Gain on the filtered subset; dashed lines mark the full-dataset values. The gray dotted curve shows the number of remaining samples on the right axis.
Adjacent text: SUTD-TrafficQA (Xu et al., 2021) provides human-annotated QA pairs across six reasoning types. Answer candidates are sampled to reduce repetition, a…
Original abstract

High benchmark accuracy does not guarantee genuine use of visual evidence. We study this problem in traffic accident Video Question Answering (VideoQA), where correct answers should depend on scene-specific visual evidence but may instead be inferred from textual shortcuts. Through an audit of four public benchmarks, we find that several recent open-weight Vision-Language Models (VLMs) perform competitively, and sometimes better, without video input. On the MM-AU benchmark, removing video consistently improves accuracy, and adding more frames further degrades performance. To quantify visual dependence, we introduce two dataset-level diagnostics: Blind Gap, measuring above-chance text-only performance, and Visual Gain, measuring the marginal benefit of adding video. We further propose an instance-level Shortcut Score that combines text-only confidence with visual necessity signals, enabling continuous, training-free filtering of shortcut-prone questions. The resulting subsets reduce shortcut bias and improve visual grounding. Our findings reveal large differences in grounding quality across benchmarks and show that visually grounded evaluation, not just high accuracy, is essential in safety-critical VideoQA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper audits four public traffic accident VideoQA benchmarks and reports that recent open-weight VLMs often achieve competitive or superior accuracy without video input, with removing video improving performance and adding frames degrading it on MM-AU. It introduces dataset-level metrics Blind Gap (above-chance text-only performance) and Visual Gain (marginal benefit of video), plus an instance-level Shortcut Score combining text-only confidence and visual necessity signals, to filter shortcut-prone questions and produce more visually grounded subsets. The central claim is that high accuracy does not ensure visual dependence and that these diagnostics are needed for safety-critical evaluation.

Significance. If the audit findings and metrics hold after addressing verification gaps, the work would be significant for exposing limitations in current VideoQA benchmarks and providing training-free tools to improve visual grounding. It directly addresses a practical issue in deploying VLMs for traffic safety applications, where reliance on textual shortcuts could lead to unreliable systems, and highlights benchmark-specific differences in grounding quality.

major comments (2)
  1. [Metrics definitions and audit interpretation] The definitions of Blind Gap and Visual Gain (introduced after the audit description) rest on the assumption that benchmark questions require scene-specific visual evidence rather than being solvable via textual patterns or common sense. Without independent verification—such as question-type analysis, human studies on visual necessity, or dataset construction details—this risks interpreting dataset artifacts as modality collapse. This assumption is load-bearing for all claims about auditing and filtering visual dependence.
  2. [Audit results section] The audit results, including the MM-AU finding that removing video improves accuracy while adding frames degrades performance, are stated without quantitative values, error bars, dataset statistics, per-model breakdowns, or verification steps. This absence undermines assessment of effect sizes and reliability, as noted in the abstract's reporting of results.
minor comments (1)
  1. [Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., accuracy delta on MM-AU) to convey the magnitude of the findings.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need to strengthen the foundational assumptions of our metrics and to improve the quantitative reporting of our audit results. We address each major comment below and outline planned revisions.

  1. Referee: The definitions of Blind Gap and Visual Gain (introduced after the audit description) rest on the assumption that benchmark questions require scene-specific visual evidence rather than being solvable via textual patterns or common sense. Without independent verification—such as question-type analysis, human studies on visual necessity, or dataset construction details—this risks interpreting dataset artifacts as modality collapse. This assumption is load-bearing for all claims about auditing and filtering visual dependence.

    Authors: We agree that the metrics are grounded in the premise that traffic accident VideoQA questions are intended to require scene-specific visual evidence, as stated in the original benchmark papers and motivated by safety-critical applications. While we did not perform new human annotation studies, the empirical results (text-only accuracy matching or exceeding video-input accuracy on multiple benchmarks) provide direct evidence of shortcut reliance independent of that assumption. In revision we will add an expanded discussion of benchmark construction details from the source papers, include a question-type breakdown where available, and explicitly note the assumption as a scope limitation rather than claiming universal visual necessity. revision: partial

  2. Referee: The audit results, including the MM-AU finding that removing video improves accuracy while adding frames degrades performance, are stated without quantitative values, error bars, dataset statistics, per-model breakdowns, or verification steps. This absence undermines assessment of effect sizes and reliability, as noted in the abstract's reporting of results.

    Authors: We acknowledge that the current manuscript version under-reports the numerical details of the audit. The full paper contains per-model tables, but these lack error bars, explicit dataset sizes, and step-by-step verification of the text-only and multi-frame protocols. In the revised manuscript we will insert the missing quantitative values, standard deviations across runs where applicable, benchmark statistics (question counts, video lengths), per-model breakdowns, and a verification subsection describing the exact zero-shot prompting and frame-sampling procedures used. revision: yes
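
One way the requested error bars could be produced is a paired bootstrap over questions. The sketch below is illustrative only: the function name, the percentile interval, and the default resample count are assumptions, and the source does not say which procedure the authors will use (their response mentions standard deviations across runs).

```python
import random
from typing import Sequence, Tuple


def bootstrap_delta_ci(
    blind_correct: Sequence[bool],
    video_correct: Sequence[bool],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Point estimate and percentile CI for (video accuracy - blind accuracy).

    Both sequences hold per-question correctness for the same questions,
    in the same order, so resampling keeps each pair together.
    """
    n = len(blind_correct)
    if n == 0 or n != len(video_correct):
        raise ValueError("inputs must be non-empty and of equal length")

    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    per_item = [int(v) - int(b) for v, b in zip(video_correct, blind_correct)]
    point = sum(per_item) / n

    deltas = []
    for _ in range(n_resamples):
        # Resample questions with replacement and recompute the mean delta.
        deltas.append(sum(per_item[rng.randrange(n)] for _ in range(n)) / n)
    deltas.sort()

    lo = deltas[int((alpha / 2) * n_resamples)]
    hi = deltas[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return point, lo, hi
```

An interval that lies entirely below zero would support the claim that adding video lowers accuracy on a given benchmark and model; an interval straddling zero would not.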

Circularity Check

0 steps flagged

No significant circularity; empirical measurements are self-contained

The paper performs an empirical audit by running VLMs on public benchmarks under text-only, single-frame, and multi-frame conditions and directly reports the resulting accuracies. Blind Gap is defined as above-chance text-only accuracy and Visual Gain as the accuracy delta when video is added; these are transparent arithmetic differences with no fitted parameters, no equations that equate a claimed output to its own input by construction, and no load-bearing self-citations or uniqueness theorems. The Shortcut Score is likewise a composite of the same measured signals. All central claims are therefore observations from external benchmark runs rather than derivations that collapse to their own definitions.
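
To make the "transparent arithmetic" concrete, here is a minimal sketch. It assumes Blind Gap is text-only accuracy minus a uniform chance rate (1 / number of answer choices) and Visual Gain is video accuracy minus text-only accuracy; the function names and the uniform-chance baseline are illustrative assumptions rather than details confirmed by the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Diagnostics:
    blind_gap: float    # text-only accuracy above chance
    visual_gain: float  # accuracy change when video is added


def accuracy(preds: Sequence[str], labels: Sequence[str]) -> float:
    if not labels or len(preds) != len(labels):
        raise ValueError("preds and labels must be non-empty and of equal length")
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def dataset_diagnostics(
    blind_preds: Sequence[str],
    video_preds: Sequence[str],
    labels: Sequence[str],
    num_choices: int,
) -> Diagnostics:
    chance = 1.0 / num_choices  # assumption: uniform guessing over answer choices
    acc_blind = accuracy(blind_preds, labels)
    acc_video = accuracy(video_preds, labels)
    return Diagnostics(
        blind_gap=acc_blind - chance,
        visual_gain=acc_video - acc_blind,
    )
```

Under these definitions, a benchmark on which removing video improves accuracy shows a negative Visual Gain.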

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The audit and new metrics rest on the domain assumption that visual input is required for the tasks and on the postulation of three new measurement constructs without external validation.

axioms (1)
  • Domain assumption: Correct answers in the studied traffic VideoQA tasks require scene-specific visual evidence.
    Invoked to interpret text-only performance as evidence of shortcut use.
invented entities (3)
  • Blind Gap (no independent evidence)
    purpose: Quantify above-chance text-only performance as a measure of shortcut reliance.
    Newly defined metric introduced in the paper.
  • Visual Gain (no independent evidence)
    purpose: Quantify marginal benefit of video input over text-only baseline.
    Newly defined metric introduced in the paper.
  • Shortcut Score (no independent evidence)
    purpose: Combine text-only confidence and visual necessity signals to filter questions at instance level.
    Newly defined score introduced in the paper.

pith-pipeline@v0.9.1-grok · 5724 in / 1339 out tokens · 43759 ms · 2026-06-30T06:11:49.189386+00:00 · methodology



    and MLVU (Zhou et al., 2025), which are widely used benchmarks for video understanding, Blind Gap ranges from roughly 14 to 22 points, comparable to VRU-Accident. A substantial share of what these benchmarks measure is answerable without any video. Visual Gain is positive across all models, which confirms that the visual signal contributes, but that contr...