VISTA uses prefix resampling and a vision-aware attention score to address data imbalance and language prior bias in self-improvement training of MLLMs, yielding up to 13.66% gains on reasoning tasks.
Illusionbench: A large-scale and comprehensive benchmark for visual illusion understanding in vision-language models
3 Pith papers cite this work. Polarity classification is still indexing.
3
Pith papers citing it
citation-role summary
background 1
citation-polarity summary
fields
cs.CV 3roles
background 1polarities
background 1representative citing papers
A combination of illusion-specific image transformations, anti-illusion prompts, and majority voting lets VLMs reach 90.48% accuracy on a 630-image illusion benchmark without any model training.
citing papers explorer
-
Learn to Think: Improving Multimodal Reasoning through Vision-Aware Self-Improvement Training
VISTA uses prefix resampling and a vision-aware attention score to address data imbalance and language prior bias in self-improvement training of MLLMs, yielding up to 13.66% gains on reasoning tasks.
-
Illusion-Aware Visual Preprocessing and Anti-Illusion Prompting for Classic Illusion Understanding in Vision-Language Models
A combination of illusion-specific image transformations, anti-illusion prompts, and majority voting lets VLMs reach 90.48% accuracy on a 630-image illusion benchmark without any model training.
- MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs