TRACE prompting induces MLLMs to produce textual allocentric 3D representations from video, yielding consistent gains on spatial QA benchmarks across multiple model backbones.
2 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.CV 2
verdicts
UNVERDICTED 2
representative citing papers
MLLMs achieve competitive but subhuman performance on the new VSI-Bench for visual-spatial intelligence from videos, with spatial reasoning as the main bottleneck and explicit cognitive map generation improving distance estimation.
citing papers explorer
- Unleashing Spatial Reasoning in Multimodal Large Language Models via Textual Representation Guided Reasoning
TRACE prompting induces MLLMs to produce textual allocentric 3D representations from video, yielding consistent gains on spatial QA benchmarks across multiple model backbones.
- Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
MLLMs achieve competitive but subhuman performance on the new VSI-Bench for visual-spatial intelligence from videos, with spatial reasoning as the main bottleneck and explicit cognitive map generation improving distance estimation.