Uses VLMs to detect instance concepts and LLMs to infer abstract relationships, assembling them into 3D scene graph forests that are evaluated on uHumans2 and ScanNet and tested in open-vocabulary retrieval on a Spot robot.
Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Niessner
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
fields
cs.RO 4verdicts
UNVERDICTED 4roles
background 1polarities
background 1representative citing papers
Introduces LSM that outputs calibrated multimodal spatial distributions from language plus scene graph, fused via VL-Map to improve 3D target localization on VLA-3D benchmark and real robot.
Embodied-R1.5 is an 8B EFM achieving SOTA on 16 of 24 embodied VLM benchmarks, fine-tunable to outperform leading VLAs, with claimed zero-shot real-robot generalization.
The survey frames VLA models as pipelines that generate progressively grounded action tokens and classifies those tokens into eight types to guide future development.
citing papers explorer
-
From Pixels to Concepts: Growing Rich 3D Semantic Scene Graph Forests utilizing Foundation Models
Uses VLMs to detect instance concepts and LLMs to infer abstract relationships, assembling them into 3D scene graph forests that are evaluated on uHumans2 and ScanNet and tested in open-vocabulary retrieval on a Spot robot.
-
Language as a Sensor: Calibrated Spatial Belief Estimation in 3D Scenes from Natural Language
Introduces LSM that outputs calibrated multimodal spatial distributions from language plus scene graph, fused via VL-Map to improve 3D target localization on VLA-3D benchmark and real robot.
-
Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models
Embodied-R1.5 is an 8B EFM achieving SOTA on 16 of 24 embodied VLM benchmarks, fine-tunable to outperform leading VLAs, with claimed zero-shot real-robot generalization.