Uses VLMs to detect instance concepts and LLMs to infer abstract relationships, assembling them into 3D scene graph forests that are evaluated on uHumans2 and ScanNet and tested in open-vocabulary retrieval on a Spot robot.
Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Niessner
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
fields
cs.RO 4verdicts
UNVERDICTED 4roles
background 1polarities
background 1representative citing papers
Introduces LSM that outputs calibrated multimodal spatial distributions from language plus scene graph, fused via VL-Map to improve 3D target localization on VLA-3D benchmark and real robot.
Embodied-R1.5 is an 8B EFM achieving SOTA on 16 of 24 embodied VLM benchmarks, fine-tunable to outperform leading VLAs, with claimed zero-shot real-robot generalization.
The survey frames VLA models as pipelines that generate progressively grounded action tokens and classifies those tokens into eight types to guide future development.
citing papers explorer
-
A Survey on Vision-Language-Action Models: An Action Tokenization Perspective
The survey frames VLA models as pipelines that generate progressively grounded action tokens and classifies those tokens into eight types to guide future development.