Analysing the Robustness of Vision-Language-Models to Common Corruptions
3 Pith papers cite this work. Polarity classification is still indexing.
citation summary
fields: cs.CV (3)
years: 2026 (3)
verdicts: UNVERDICTED (3)
roles: background (1)
polarities: background (1)
citing papers explorer
- Are Reasoning Vision-Language Models Robust to Semantic Visual Distractions?
  Reasoning VLMs show lower robustness to semantic visual distractions than to perceptual corruptions, with distractions entering their reasoning chains and causing errors.
- DUALVISION: RGB-Infrared Multimodal Large Language Models for Robust Visual Reasoning
  DUALVISION is a new lightweight fusion module using localized cross-attention to integrate infrared with RGB data in MLLMs, improving robustness to degradations and supported by the new DV-204K training dataset and DV-500 benchmark.
- RemoteShield: Enable Robust Multimodal Large Language Models for Earth Observation
  RemoteShield improves robustness of Earth observation MLLMs by training on semantic equivalence clusters of clean and perturbed inputs via preference learning to maintain consistent reasoning under noise.