3D-LLM: Injecting the 3D World into Large Language Models

Yining Hong et al. · 2023 · arXiv 2307.12981

6 Pith papers cite this work. Polarity classification is still indexing.


representative citing papers

Abstract 3D Perception for Spatial Intelligence in Vision-Language Models

cs.CV · 2025-11-14 · unverdicted · novelty 7.0

SandboxVLM enhances VLMs' spatial intelligence by encoding 3D geometry with abstract bounding boxes in a four-stage zero-shot pipeline, yielding an 8.3% improvement on the SAT Real benchmark.

3D-VLA: A 3D Vision-Language-Action Generative World Model

cs.CV · 2024-03-14 · unverdicted · novelty 7.0

3D-VLA is a new embodied foundation model that uses a 3D LLM plus aligned diffusion models to generate future images and point clouds for improved reasoning and action planning in 3D environments.

Chat-Scene++: Exploiting Context-Rich Object Identification for 3D LLM

cs.CV · 2026-03-29 · unverdicted · novelty 6.0

Chat-Scene++ improves 3D scene understanding in multimodal LLMs by representing scenes as context-rich object sequences with identifier tokens and grounded chain-of-thought reasoning, reaching state-of-the-art on five benchmarks using pre-trained encoders.

Boosting MLLM Spatial Reasoning with Geometrically Referenced 3D Scene Representations

cs.CV · 2026-03-09 · unverdicted · novelty 6.0

GR3D turns 3D scene geometry into ID-indexed text references, enabling zero-shot MLLM spatial reasoning gains of 9% on VSI-Bench and 12% on MindCube.

CG-MLLM: Captioning and Generating 3D content via Multi-modal Large Language Models

cs.CV · 2026-01-29 · unverdicted · novelty 5.0

CG-MLLM is a multimodal LLM using a Mixture-of-Transformer architecture with separate TokenAR and BlockAR components integrated with a pre-trained vision-language backbone and 3D VAE to enable 3D captioning and high-fidelity generation.

SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models

cs.CV · 2023-11-13 · unverdicted · novelty 5.0

SPHINX improves multi-modal LLMs through joint mixing of weights, tasks, and visual embeddings from varied sources to achieve stronger alignment and multi-purpose capabilities.

citing papers explorer

Showing 6 of 6 citing papers.

Abstract 3D Perception for Spatial Intelligence in Vision-Language Models · cs.CV · 2025-11-14 · unverdicted · none · ref 32
3D-VLA: A 3D Vision-Language-Action Generative World Model · cs.CV · 2024-03-14 · unverdicted · none · ref 20
Chat-Scene++: Exploiting Context-Rich Object Identification for 3D LLM · cs.CV · 2026-03-29 · unverdicted · none · ref 23
Boosting MLLM Spatial Reasoning with Geometrically Referenced 3D Scene Representations · cs.CV · 2026-03-09 · unverdicted · none · ref 11
CG-MLLM: Captioning and Generating 3D content via Multi-modal Large Language Models · cs.CV · 2026-01-29 · unverdicted · none · ref 74
SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models · cs.CV · 2023-11-13 · unverdicted · none · ref 12
