Spatialvlm: Endowing vision-language models with spatial reasoning capabilities

Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, Fei Xia · 2024

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

browse 3 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

DRIVESPATIAL: A Benchmark for Spatiotemporal Intelligence in VLMs for Autonomous Driving

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

DriveSpatial benchmark shows the best of 15 VLMs trails humans by 28.4 points on spatiotemporal driving tasks, with cognitive scene construction as the main failure mode.

Unlocking Dense Metric Depth Estimation in VLMs

cs.CV · 2026-05-15 · unverdicted · novelty 6.0 · 2 refs

DepthVLM converts a standard VLM into a dense metric depth predictor by attaching a lightweight head and training under unified vision-text supervision, outperforming prior VLMs and some pure vision models on a new indoor-outdoor benchmark.

AssemLM: Spatial Reasoning Multimodal Large Language Models for Robotic Assembly

cs.RO · 2026-04-10 · unverdicted · novelty 6.0

AssemLM uses a specialized point cloud encoder inside a multimodal LLM to reach state-of-the-art 6D pose prediction for assembly tasks, backed by a new 900K-sample benchmark called AssemBench.

citing papers explorer

Showing 3 of 3 citing papers.

DRIVESPATIAL: A Benchmark for Spatiotemporal Intelligence in VLMs for Autonomous Driving cs.CV · 2026-05-22 · unverdicted · none · ref 53
DriveSpatial benchmark shows the best of 15 VLMs trails humans by 28.4 points on spatiotemporal driving tasks, with cognitive scene construction as the main failure mode.
Unlocking Dense Metric Depth Estimation in VLMs cs.CV · 2026-05-15 · unverdicted · none · ref 11 · 2 links
DepthVLM converts a standard VLM into a dense metric depth predictor by attaching a lightweight head and training under unified vision-text supervision, outperforming prior VLMs and some pure vision models on a new indoor-outdoor benchmark.
AssemLM: Spatial Reasoning Multimodal Large Language Models for Robotic Assembly cs.RO · 2026-04-10 · unverdicted · none · ref 6
AssemLM uses a specialized point cloud encoder inside a multimodal LLM to reach state-of-the-art 6D pose prediction for assembly tasks, backed by a new 900K-sample benchmark called AssemBench.

Spatialvlm: Endowing vision-language models with spatial reasoning capabilities

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer