VIBE benchmark evaluates visual instruction following in image editing models across deictic, morphological, and causal levels, finding proprietary models lead but all degrade on harder tasks.
Edit this image following the instructions annotated on this picture
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
method 1
citation-polarity summary
fields
cs.CV 3
years
2026 3
roles
method 1
polarities
use method 1
representative citing papers
Fine-tuning VLMs to output action sequences for puzzles causes emergent internal visual representations that improve performance when integrated into reasoning.
Multimodal LLMs significantly underperform humans at spotting objects that break 3D consistency in multi-view image pairs.
citing papers explorer
- How Well Do Models Follow Visual Instructions? VIBE: A Systematic Benchmark for Visual Instruction-Driven Image Editing
  VIBE benchmark evaluates visual instruction following in image editing models across deictic, morphological, and causal levels, finding proprietary models lead but all degrade on harder tasks.
- Multimodal Language Models Cannot Spot Spatial Inconsistencies
  Multimodal LLMs significantly underperform humans at spotting objects that break 3D consistency in multi-view image pairs.