Thinking in space: How multimodal large language models see, remember, and recall spaces

Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, Saining Xie · 2025

3 Pith papers cite this work. Polarity classification is still indexing.


representative citing papers

Abstract 3D Perception for Spatial Intelligence in Vision-Language Models

cs.CV · 2025-11-14 · unverdicted · novelty 7.0

SandboxVLM enhances VLMs' spatial intelligence by encoding 3D geometry with abstract bounding boxes in a four-stage zero-shot pipeline, yielding an 8.3% improvement on the SAT Real benchmark.

GraphThinker: Reinforcing Temporally Grounded Video Reasoning with Event Graph Thinking

cs.CV · 2026-02-19 · unverdicted · novelty 6.0

GraphThinker reduces temporal hallucinations in video reasoning by constructing event-based scene graphs and applying visual attention rewards in reinforcement finetuning.

LASAR: Towards Spatio-temporal Reasoning with Latent Cognitive Map

cs.CV · 2026-05-16 · unverdicted · novelty 4.0

LASAR pairs a dual-memory system with spatio-temporal contrastive learning to induce latent cognitive maps, reporting 2-3.5% zero-shot gains on VLN-CE and VSI-Bench plus high map self-consistency.

citing papers explorer

Showing 3 of 3 citing papers.

Abstract 3D Perception for Spatial Intelligence in Vision-Language Models
cs.CV · 2025-11-14 · unverdicted · none · ref 38
GraphThinker: Reinforcing Temporally Grounded Video Reasoning with Event Graph Thinking
cs.CV · 2026-02-19 · unverdicted · none · ref 68
LASAR: Towards Spatio-temporal Reasoning with Latent Cognitive Map
cs.CV · 2026-05-16 · unverdicted · none · ref 50
