MLLMs achieve competitive but subhuman performance on the new VSI-Bench for visual-spatial intelligence from videos, with spatial reasoning as the main bottleneck and explicit cognitive map generation improving distance estimation.
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CV (1)
Years: 2024 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces