Crepe: Can vision-language foundation models reason compositionally? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 10910--10921

Zixian Ma, Jerry Hong, Mustafa Omer Gul, Mona Gandhi, Irena Gao, Ranjay Krishna · 2023

1 Pith paper cite this work. Polarity classification is still indexing.

1 Pith paper citing it

browse 1 citing papers

representative citing papers

cs.CV · 2026-04-01 · unverdicted · novelty 6.0

Multimodal LLMs significantly underperform humans at spotting objects that break 3D consistency in multi-view image pairs.

Showing 1 of 1 citing paper.

Multimodal Language Models Cannot Spot Spatial Inconsistencies cs.CV · 2026-04-01 · unverdicted · none · ref 28
Multimodal LLMs significantly underperform humans at spotting objects that break 3D consistency in multi-view image pairs.