Srinivasan, Matthew Tancik, Jonathan T
2 Pith papers cite this work. Polarity classification is still indexing.
Citation-role summary: method (1)
Fields: cs.CV (2)
Verdicts: unverdicted (2)
Roles: method (1)
Polarities: use method (1)
Citing papers
- Abstract 3D Perception for Spatial Intelligence in Vision-Language Models
  SandboxVLM enhances the spatial intelligence of VLMs by encoding 3D geometry as abstract bounding boxes in a four-stage zero-shot pipeline, yielding an 8.3% improvement on the SAT Real benchmark.
- Lifting Unlabeled Internet-level Data for 3D Scene Understanding
  Data engines process unlabeled web videos into training data that yields strong zero-shot and fine-tuned performance on 3D detection, segmentation, VQA, and navigation.