ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning
3 Pith papers cite this work. Polarity classification is still indexing.
Verdicts: UNVERDICTED 3
Roles: background 1
Polarities: background 1
Citing papers
- A Survey on Vision-Language-Action Models for Embodied AI
  This is the first survey on vision-language-action models. It provides a taxonomy organized along three lines and summarizes datasets, simulators, benchmarks, challenges, and future directions in embodied AI.
- SceneGraphGrounder: Zero-Shot 3D Visual Grounding via Structured Scene Graph Matching
  SceneGraphGrounder builds a persistent 3D scene graph from VLM-inferred relations in 2D views and solves grounding via constrained graph alignment, achieving competitive zero-shot results on ScanRefer with only RGB-D input.
- FUS3DMaps: Scalable and Accurate Open-Vocabulary Semantic Mapping by 3D Fusion of Voxel- and Instance-Level Layers
  FUS3DMaps fuses voxel- and instance-level open-vocabulary layers inside a shared 3D voxel map to improve both layers and enable scalable, accurate semantic mapping.