Defines 3D Instruction Ambiguity Detection as a new task, releases the Ambi3D benchmark, shows state-of-the-art 3D LLMs struggle with it, and proposes the AmbiVer framework that gathers multi-view visual evidence to guide VLMs in judging ambiguity.
3d-llm: In- jecting the 3d world into large language models.Advances in Neural Information Processing Systems, 36:20482–20494
5 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 5representative citing papers
POMA-3D learns self-supervised 3D scene representations from point maps and improves performance on geometric 3D tasks including navigation and scene retrieval.
GLMap combines explicit 3D Gaussians with multi-scale language semantics in a dual-modality structure and uses an analytical Gaussian Estimator for incremental map building, improving zero-shot performance on navigation and reasoning tasks.
SpatialStack improves 3D spatial reasoning in vision-language models by stacking and synchronizing multi-level geometric features with the language backbone.
3D awareness emerges implicitly in MLLMs via self-supervised geometric constraints that create an information bottleneck, removing depth and pose dependencies at inference and cutting latency by 55%.
citing papers explorer
-
3D Instruction Ambiguity Detection
Defines 3D Instruction Ambiguity Detection as a new task, releases the Ambi3D benchmark, shows state-of-the-art 3D LLMs struggle with it, and proposes the AmbiVer framework that gathers multi-view visual evidence to guide VLMs in judging ambiguity.
-
POMA-3D: The Point Map Way to 3D Scene Understanding
POMA-3D learns self-supervised 3D scene representations from point maps and improves performance on geometric 3D tasks including navigation and scene retrieval.
-
Multi-Scale Gaussian-Language Map for Zero-shot Embodied Navigation and Reasoning
GLMap combines explicit 3D Gaussians with multi-scale language semantics in a dual-modality structure and uses an analytical Gaussian Estimator for incremental map building, improving zero-shot performance on navigation and reasoning tasks.
-
SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning
SpatialStack improves 3D spatial reasoning in vision-language models by stacking and synchronizing multi-level geometric features with the language backbone.
-
3D-IDE: 3D Implicit Depth Emergent
3D awareness emerges implicitly in MLLMs via self-supervised geometric constraints that create an information bottleneck, removing depth and pose dependencies at inference and cutting latency by 55%.