11plus-bench: Demystifying multimodal LLM spatial reasoning with cognitive-inspired analysis
3 Pith papers cite this work. Polarity classification is still indexing.
citation summary
fields: cs.CV (3)
years: 2026 (3)
roles: method (1)
polarities: use method (1)
citing papers explorer
- How Well Do Models Follow Visual Instructions? VIBE: A Systematic Benchmark for Visual Instruction-Driven Image Editing
  VIBE benchmark evaluates visual instruction following in image editing models across deictic, morphological, and causal levels, finding proprietary models lead but all degrade on harder tasks.
- Do multimodal models imagine electric sheep?
  Fine-tuning VLMs to output action sequences for puzzles causes emergent internal visual representations that improve performance when integrated into reasoning.
- Multimodal Language Models Cannot Spot Spatial Inconsistencies
  Multimodal LLMs significantly underperform humans at spotting objects that break 3D consistency in multi-view image pairs.