arxiv: 2606.08894 · v1 · pith:NKL37K2Rnew · submitted 2026-06-08 · 💻 cs.CV · cs.CL

Are Reasoning Vision-Language Models Robust to Semantic Visual Distractions?

Pith reviewed 2026-06-27 17:35 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords: vision-language models · robustness evaluation · semantic distractions · reasoning models · benchmark · multimodal reasoning

The pith

Reasoning vision-language models integrate task-irrelevant visual cues into their reasoning chains more readily than base models do.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates Distract-Bench to isolate a failure mode where models receive clear visual evidence but are shown additional meaningful objects or details that do not alter the correct answer. It evaluates multiple open- and closed-source VLMs on both standard image corruptions and this new benchmark. Results indicate that reasoning models largely match their non-reasoning bases under blur, noise, or weather effects, yet drop more sharply when semantic distractions are present. Analysis of model outputs shows the distractions frequently appear inside the generated reasoning steps and are treated as supporting evidence for the wrong final answer.

Core claim

Reasoning VLMs largely track their non-reasoning base models under perceptual degradation, but show consistently lower robustness to semantic distractions; these distractions often enter the reasoning process, are treated as evidence, and lead to incorrect answers.
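The "treated as evidence" part of the claim can be made concrete with a distractor-reference count. The following is a minimal sketch, not the paper's method: it assumes a distractor mention is detected by case-insensitive keyword matching, and that the harmful reference ratio is the share of incorrectly answered samples whose output mentions the distractor. The `Sample` class, its field names, and the toy responses are hypothetical and invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    response: str                 # full model output, including the reasoning trace
    correct: bool                 # whether the final answer matched the ground truth
    distractor_terms: list[str]   # hypothetical keywords describing the injected cue


def mentions_distractor(sample: Sample) -> bool:
    # Assumption: a mention is any case-insensitive keyword hit in the output.
    text = sample.response.lower()
    return any(term.lower() in text for term in sample.distractor_terms)


def reference_ratio(samples: list[Sample], correct: bool) -> float:
    # Share of samples in the chosen group (correct or incorrect) that mention the distractor.
    group = [s for s in samples if s.correct == correct]
    if not group:
        return 0.0
    return sum(mentions_distractor(s) for s in group) / len(group)


# Toy outputs, invented for illustration only (not data from the paper).
terms = ["protractor", "30°"]
samples = [
    Sample("The 30° label on the protractor suggests the angle is 30.", False, terms),
    Sample("The triangle's angles sum to 180, so x = 60.", True, terms),
    Sample("Ignoring the protractor, x = 60.", True, terms),
    Sample("I miscounted the sides, so x = 45.", False, terms),
]

harmful = reference_ratio(samples, correct=False)   # mentions among wrong answers
harmless = reference_ratio(samples, correct=True)   # mentions among right answers
```

Keyword matching is a crude proxy: it cannot tell whether a mentioned cue was actually used as evidence or merely noted and set aside, so a real analysis would likely need human or model-based judging of each trace.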

What carries the argument

Distract-Bench, a benchmark that inserts meaningful but task-irrelevant visual cues into inputs while keeping the ground-truth answer unchanged.

If this is right

  • Robustness testing for reasoning VLMs must separately measure resistance to perceptual noise and resistance to semantic irrelevancies.
  • Chain-of-thought outputs in VLMs can incorporate and propagate distracting visual evidence even when the core task content remains visible.
  • Real-world deployment of reasoning VLMs requires safeguards against plausible but irrelevant scene elements rather than only image-quality degradation.
  • The performance gap between reasoning and non-reasoning models widens specifically under semantic distraction rather than under vision corruptions.
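The implications above depend on measuring the two robustness regimes separately. A minimal sketch of such a comparison follows, assuming relative robustness (RR) is perturbed accuracy divided by clean accuracy; the paper's figure captions mention RR but this page does not give its exact definition, so the formula, the function names, and the toy accuracies are all assumptions.

```python
def relative_robustness(clean_acc: float, perturbed_acc: float) -> float:
    """Fraction of clean accuracy retained under a perturbation (assumed definition)."""
    if clean_acc <= 0:
        raise ValueError("clean accuracy must be positive")
    return perturbed_acc / clean_acc


def robustness_gap(base: dict, reasoning: dict, condition: str) -> float:
    """RR(reasoning) minus RR(base); negative means the reasoning model loses more."""
    rr_base = relative_robustness(base["clean"], base[condition])
    rr_reasoning = relative_robustness(reasoning["clean"], reasoning[condition])
    return rr_reasoning - rr_base


# Toy accuracies, invented for illustration only (not results from the paper).
base = {"clean": 0.50, "corruption": 0.40, "distraction": 0.45}
reasoning = {"clean": 0.60, "corruption": 0.48, "distraction": 0.42}

gap_corruption = robustness_gap(base, reasoning, "corruption")    # about 0.0
gap_distraction = robustness_gap(base, reasoning, "distraction")  # about -0.2
```

Normalizing by clean accuracy keeps a stronger reasoning model from looking more robust simply because it starts from a higher baseline; the quantity of interest is the gap per condition, not the raw accuracy.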

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training objectives that explicitly penalize the use of extraneous visual details in reasoning traces could reduce the observed failure mode.
  • Benchmarks that combine multiple simultaneous distractions might reveal whether models can selectively attend to task-relevant evidence.
  • The same distraction mechanism may affect other multimodal reasoning tasks such as visual question answering on cluttered scenes or document understanding with marginal annotations.

Load-bearing premise

The added visual cues are meaningful yet task-irrelevant and leave the ground-truth answer unchanged.

What would settle it

A replication in which reasoning VLMs retain their clean-image accuracy on Distract-Bench as well as their non-reasoning base models do, while both still degrade under standard corruptions, would falsify the claimed distinction between the two robustness regimes.

Figures

Figures reproduced from arXiv: 2606.08894 by Chenghua Lin, Hao Li, Jia Tong See, Jingyuan Sun, Jingyu Sun, Mochuan Zhan, Riza Batista-Navarro, Wenhao Cai, Yanan Ma, Yang Cui, Yifan Wang, Yizheng Sun, Ziyi Wang.

Figure 1. From perceptual robustness to semantic-distraction robustness. Traditional robustness evaluation tests whether VLMs can still perceive evidence from visually corrupted inputs, where failures mainly reflect degraded perception. Distract-Bench targets a different failure mode: the image remains clear and the ground-truth answer is unchanged, but a factually correct, query-irrelevant semantic distractor is in…
Figure 2. Distract-Bench dataset curation pipeline. We collect raw image-question samples from widely used VLM benchmarks (Yue et al., 2024; Lu et al., 2024; Wang et al., 2024), construct expert-designed seed distractors, scale the process with retrieval-guided synthetic editing, and finally conduct human expert verification to retain high-quality samples whose distractors are factually correct, query-irrelevant, an…
Figure 3. Category distribution of Distract-Bench sam…
Figure 4. Per-benchmark relative robustness (RR) un…
Figure 5. Distract-Bench relative robustness for base…
Figure 6. VLMs Distraction-reference behavior on Distract-Bench. (A) Distractor-reference ratios among correctly and incorrectly answered samples, grouped by model family. (B) Harmful reference ratio (HFR), measuring distractor mentions in incorrect answers, for reasoning-tuned models relative to the corresponding base model within each family.
Figure 7. Qualitative example of distraction-induced reasoning failure. The injected irrelevant cue may be ignored or mentioned harmlessly, but in failure cases it is treated as evidence and propagated to an incorrect answer.
…ants are higher. More importantly, as shown in Figure 6B, harmful reference ratios are consistently higher for reasoning models than for their base models. This suggests that failures under se…
Figure 8. Full prompt to GPT-5.5 for generating edit instructions.
Figure 10. Example of GPT-Image2 image edits gener…
Figure 11. Qualitative example (original vs. edited image): model responses, including Vision R1, to the question of whether the displayed function is an even function.
[Figure on page 18, number garbled in extraction] Qualitative example: responses from Qwen2.5-VL and MM-Eureka-Qwen to a question about the number of complete sleep cycles shown in a sleep-stage figure.
Figure 13. Qualitative example (original vs. edited image) for the question "Which wave has the highest amplitude?", with responses from Vision R1, R1-OneVision, Qwen2.5-VL, MM-Eureka-Qwen, MMR1, Qwen3-VL, Qwen3-VL-Thinking, and OpenVLThinker.
Original abstract

Reasoning Vision-Language Models (VLMs) achieve strong performance on complex multimodal tasks, but reliable real-world application requires handling visual inputs that are messier than clean, curated benchmarks. Existing works mainly evaluate such reliability of VLMs through input corruptions, such as noise, blur and weather effects, which make visual evidence harder to perceive. This leaves a critical reliability failure mode underexplored: a model may perceive the evidence correctly, yet reason from plausible but irrelevant and distracting evidence and propagate this mistake to its final answer. To address this gap, we introduce Distract-Bench, a benchmark for evaluating VLM robustness to semantic visual distractions, defined as meaningful but task-irrelevant visual cues added to inputs while preserving the ground-truth answer. We comprehensively evaluate eight leading open-source and two closed-source VLMs across conventional vision corruptions and Distract-Bench. Our results show that Distract-Bench exposes a robustness failure distinct from vision corruptions: reasoning VLMs largely track their non-reasoning base models under perceptual degradation, but show consistently lower robustness to semantic distractions. Further analysis shows that these distractions often enter the reasoning process of VLMs, are treated as evidence, and lead to incorrect answers. Together, these findings reframe robustness evaluation for reasoning VLMs, shifting the focus from degraded perception to distractions for reliable real-world visual reasoning. Our data and code are available at https://github.com/Yizheng-Sun/Distract-Bench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces Distract-Bench, a benchmark of semantic visual distractions (meaningful but task-irrelevant cues added while preserving ground-truth), and evaluates eight open-source plus two closed-source VLMs. It claims reasoning VLMs largely match their non-reasoning base models under conventional perceptual corruptions (noise, blur, weather) but show consistently lower robustness to these semantic distractions; further analysis indicates the distractions enter the reasoning trace, are treated as evidence, and produce incorrect answers. The work concludes that robustness evaluation for reasoning VLMs should shift focus from degraded perception to distraction handling.

Significance. If the benchmark isolates semantic distraction from perception failure, the result identifies a distinct robustness gap for chain-of-thought VLMs that is relevant to real-world deployment. The public release of data and code at the cited GitHub repository is a clear strength that supports reproducibility and follow-up work.

major comments (1)
  1. [Distract-Bench construction] Distract-Bench definition and construction (abstract and § on benchmark): the central claim that reasoning VLMs are less robust specifically to semantic distractions (rather than to any change in visual evidence) rests on the guarantee that each added cue is task-irrelevant and leaves the ground-truth answer unchanged for the evaluated models. No explicit verification procedure (human annotation protocol, model-agnostic checks, or per-example evidence that the cue cannot be re-interpreted as supporting evidence by a reasoning chain) is described; without it the reported gap versus vision corruptions is potentially confounded.
minor comments (1)
  1. [Abstract] Abstract states results across “eight leading open-source and two closed-source VLMs” but supplies no model names or sizes; these should appear in a table or §4 for immediate reference.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address the major comment on Distract-Bench construction below and will incorporate the requested clarifications in the revised version.

  1. Referee: [Distract-Bench construction] Distract-Bench definition and construction (abstract and § on benchmark): the central claim that reasoning VLMs are less robust specifically to semantic distractions (rather than to any change in visual evidence) rests on the guarantee that each added cue is task-irrelevant and leaves the ground-truth answer unchanged for the evaluated models. No explicit verification procedure (human annotation protocol, model-agnostic checks, or per-example evidence that the cue cannot be re-interpreted as supporting evidence by a reasoning chain) is described; without it the reported gap versus vision corruptions is potentially confounded.

    Authors: We agree that an explicit verification procedure is necessary to rigorously support the claim that added cues are task-irrelevant and preserve ground-truth. The current manuscript defines semantic visual distractions as meaningful but task-irrelevant cues that leave the ground-truth unchanged, but does not detail the verification steps. In the revised manuscript, we will add a dedicated subsection under benchmark construction describing the human annotation protocol: multiple independent annotators review each example to confirm (1) the cue cannot be reasonably interpreted as supporting evidence for a different answer and (2) the ground-truth label remains identical before and after cue insertion. We will also report inter-annotator agreement and provide per-example verification summaries in the supplementary material. This addition directly addresses the potential confounding concern and strengthens the distinction from perceptual corruptions.

    revision: yes
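The inter-annotator agreement promised in the rebuttal could be reported with a chance-corrected statistic. Below is a minimal sketch using Cohen's kappa for two annotators; the rebuttal mentions multiple annotators without naming a statistic, so the choice of kappa, the "keep"/"reject" labels, and the toy annotations are all assumptions made for illustration (with more than two annotators, a measure such as Fleiss' kappa would be the usual alternative).

```python
from collections import Counter


def cohen_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of examples on which both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled independently at their own rates.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy verdicts on whether a distractor leaves the ground truth unchanged
# ("keep") or not ("reject"); invented for illustration only.
annotator_1 = ["keep", "keep", "reject", "keep", "reject", "keep"]
annotator_2 = ["keep", "keep", "reject", "reject", "reject", "keep"]

kappa = cohen_kappa(annotator_1, annotator_2)  # observed 5/6, expected 1/2
```

Raw percent agreement would overstate reliability here because most samples are expected to pass verification; kappa discounts the agreement that two annotators would reach by chance given their label frequencies.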

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark and model evaluation

The paper introduces Distract-Bench, defines semantic distractions as meaningful but task-irrelevant cues that preserve ground-truth, and reports direct empirical results comparing reasoning VLMs to base models on this benchmark versus standard corruptions. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. All claims rest on observable model outputs on the new dataset, with no reduction of results to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that the benchmark's added cues qualify as semantic distractions and that observed model behavior reflects genuine robustness differences rather than benchmark artifacts.

axioms (1)
  • domain assumption: Semantic visual distractions are meaningful but task-irrelevant visual cues added while preserving the ground-truth answer.
    This definition is used to construct Distract-Bench and separate it from perceptual corruptions.
invented entities (1)
  • Distract-Bench (no independent evidence)
    purpose: Benchmark dataset for measuring VLM robustness to semantic visual distractions
    Newly introduced evaluation resource without external validation or prior independent use mentioned.

Reference graph

Works this paper leans on

44 extracted references · 15 canonical work pages · 11 internal anchors

  1. Visual instruction tuning. Advances in Neural Information Processing Systems.
  2. Qwen3-VL Technical Report. arXiv preprint arXiv:2511.21631.
  3. GPT-4o System Card. arXiv preprint arXiv:2410.21276.
  4. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv preprint arXiv:2501.12948.
  5. R1-Onevision: Advancing generalized multimodal reasoning through cross-modal formalization. Proceedings of the IEEE/CVF International Conference on Computer Vision.
  6. Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models. arXiv preprint arXiv:2503.06749.
  7. MMR1: Enhancing multimodal reasoning with variance-aware sampling and open resources. arXiv preprint arXiv:2509.21268.
  8. MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning. arXiv preprint arXiv:2503.07365.
  9. Benchmarking Neural Network Robustness to Common Corruptions and Perturbations. arXiv preprint arXiv:1903.12261.
  10. VLM-RobustBench: A Comprehensive Benchmark for Robustness of Vision-Language Models. arXiv preprint arXiv:2603.06148.
  11. Benchmarking robustness of adaptation methods on pre-trained vision-language models. Advances in Neural Information Processing Systems.
  12. MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. International Conference on Learning Representations.
  13. Measuring multimodal mathematical reasoning with Math-Vision dataset. Advances in Neural Information Processing Systems.
  14. OpenVLThinker: Complex vision-language reasoning via iterative SFT-RL cycles. Advances in Neural Information Processing Systems.
  15. Adversarial VQA: A new benchmark for evaluating the robustness of VQA models. Proceedings of the IEEE/CVF International Conference on Computer Vision.
  16. Does Acceleration Cause Hidden Instability in Vision Language Models? Uncovering Instance-Level Divergence Through a Large-Scale Empirical Study. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing.
  17. Synthetic and Natural Noise Both Break Neural Machine Translation. arXiv preprint arXiv:1711.02173.
  18. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. Findings of the Association for Computational Linguistics: ACL 2022.
  19. HallusionBench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  20. InfographicVQA. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision.
  21. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  22. Are we on the right way for evaluating large vision-language models? Advances in Neural Information Processing Systems.
  23. Towards VQA models that can read. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  24. Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Ming. Qwen2.5-VL Technical Report.
  25. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2020.
  26. WILDS: A benchmark of in-the-wild distribution shifts. International Conference on Machine Learning, 2021.
  27. Multimodal Chain-of-Thought Reasoning in Language Models. arXiv preprint arXiv:2302.00923.
  28. Visual CoT: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning. Advances in Neural Information Processing Systems.
  29. OpenAI GPT-5 System Card. arXiv preprint arXiv:2601.03267.
  30. Image Generation. 2026.
  31. Robustness analysis of video-language models against visual and language perturbations. Advances in Neural Information Processing Systems.
  32. Analysing the robustness of vision-language-models to common corruptions. arXiv preprint arXiv:2504.13690.
  33. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. International Conference on Machine Learning, 2023.
  34. Kimi k1.5: Scaling Reinforcement Learning with LLMs. arXiv preprint arXiv:2501.12599.
  35. Qwen3 Technical Report. arXiv preprint arXiv:2505.09388.
  36. Self-Instruct: Aligning language models with self-generated instructions. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
  37. On LLMs-driven synthetic data generation, curation, and evaluation: A survey. Findings of the Association for Computational Linguistics: ACL 2024.
  38. InstructPix2Pix: Learning to follow image editing instructions. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  39. Taking a hint: Leveraging explanations to make vision and language models more grounded. Proceedings of the IEEE/CVF International Conference on Computer Vision.
  40. Evaluating object hallucination in large vision-language models. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.
  41. Large language models can be easily distracted by irrelevant context. International Conference on Machine Learning, 2023.
  42. Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting. Advances in Neural Information Processing Systems.
  43. Benchmarking robustness of multimodal image-text models under distribution shift. arXiv preprint arXiv:2212.08044.
  44. Measuring robustness to natural distribution shifts in image classification. Advances in Neural Information Processing Systems.