Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding

Dingkang Liang; Kui Xia; Tianrui Feng; Xiang Bai; Xianjin Wu; Xiaofan Li; Xiao Tan; Yumeng Zhang

arxiv: 2603.19235 · v2 · pith:TV7NHZHLnew · submitted 2026-03-19 · 💻 cs.CV · cs.RO

Xianjin Wu, Dingkang Liang, Tianrui Feng, Kui Xia, Yumeng Zhang, Xiaofan Li, Xiao Tan, Xiang Bai

classification 💻 cs.CV cs.RO

keywords models, geometric, priors, spatial, understanding, video, demonstrate, explicit

While Multimodal Large Language Models demonstrate impressive semantic capabilities, they often suffer from spatial blindness, struggling with fine-grained geometric reasoning and physical dynamics. Existing solutions typically rely on explicit 3D modalities or complex geometric scaffolding, which are limited by data scarcity and generalization challenges. In this work, we propose a paradigm shift by leveraging the implicit spatial prior within large-scale video generation models. We posit that to synthesize temporally coherent videos, these models inherently learn robust 3D structural priors and physical laws. We introduce VEGA-3D (Video Extracted Generative Awareness), a plug-and-play framework that repurposes a pre-trained video diffusion model as a Latent World Simulator. By extracting spatiotemporal features from intermediate noise levels and integrating them with semantic representations via a token-level adaptive gated fusion mechanism, we enrich MLLMs with dense geometric cues without explicit 3D supervision. Extensive experiments across 3D scene understanding, spatial reasoning, and embodied manipulation benchmarks demonstrate that our method outperforms state-of-the-art baselines, validating that generative priors provide a scalable foundation for physical-world understanding. Code is publicly available at https://github.com/H-EmbodVis/VEGA-3D.
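
The token-level adaptive gated fusion named in the abstract can be illustrated with a small sketch. The abstract does not give the actual formulation, so everything below is an assumption made for illustration: a scalar gate per token, computed as a sigmoid over a linear projection of the concatenated semantic and generative features, followed by a convex combination of the two. The names `gated_fusion`, `gate_weights`, and `gate_bias` are hypothetical and do not come from the VEGA-3D code base.

```python
import math
from typing import List, Sequence

Vector = List[float]


def sigmoid(x: float) -> float:
    """Logistic function mapping a real-valued logit to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(
    semantic: Sequence[Sequence[float]],
    generative: Sequence[Sequence[float]],
    gate_weights: Sequence[float],
    gate_bias: float = 0.0,
) -> List[Vector]:
    """Fuse two token sequences with one scalar gate per token.

    Assumed form (not taken from the paper):
        g_i     = sigmoid(w . [s_i ; v_i] + b)
        fused_i = g_i * s_i + (1 - g_i) * v_i
    where s_i is the semantic token and v_i the generative (video) token.
    """
    if len(semantic) != len(generative):
        raise ValueError("both inputs must have the same number of tokens")

    fused: List[Vector] = []
    for sem_tok, gen_tok in zip(semantic, generative):
        if len(sem_tok) != len(gen_tok):
            raise ValueError("paired tokens must share the same dimension")
        concat = list(sem_tok) + list(gen_tok)
        if len(concat) != len(gate_weights):
            raise ValueError("gate_weights must match the concatenated size")

        # Per-token gate: how much of the semantic feature to keep.
        logit = sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias
        gate = sigmoid(logit)

        # Convex combination of the two feature vectors.
        fused.append([gate * s + (1.0 - gate) * v
                      for s, v in zip(sem_tok, gen_tok)])
    return fused


if __name__ == "__main__":
    semantic_tokens = [[1.0, 3.0], [0.0, 2.0]]
    generative_tokens = [[3.0, 5.0], [4.0, 6.0]]
    # Zero weights and zero bias give a gate of 0.5 for every token.
    print(gated_fusion(semantic_tokens, generative_tokens, [0.0] * 4))
```

With zero gate weights and bias the gate is 0.5 for every token, so the output is the element-wise average of the two streams. A learned gate would instead let each token lean toward the semantic or the generative feature depending on its content.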

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Unlocking Dense Metric Depth Estimation in VLMs
cs.CV 2026-05 unverdicted novelty 6.0

DepthVLM attaches a depth head to VLMs for native dense metric depth prediction alongside language outputs using a two-stage unified training schedule and a new indoor-outdoor benchmark.

Unlocking Dense Metric Depth Estimation in VLMs
cs.CV 2026-05 unverdicted novelty 6.0

DepthVLM converts a standard VLM into a dense metric depth predictor by attaching a lightweight head and training under unified vision-text supervision, outperforming prior VLMs and some pure vision models on a new in...

Do multimodal models imagine electric sheep?
cs.CV 2026-05 conditional novelty 6.0

Fine-tuning VLMs to output action sequences for puzzles causes emergent internal visual representations that improve performance when integrated into reasoning.

WALL-WM: Carving World Action Modeling at the Event Joints
cs.RO 2026-06 unverdicted novelty 4.0

WALL-WM introduces event-grounded Vision-Language-Action pretraining that uses semantic events as the atomic unit to address granularity mismatch in world action models and reports state-of-the-art generalization.