DVGT: Driving Visual Geometry Transformer

Fang Li; Jiwen Lu; Long Chen; Shaoqing Xu; Shengyin Jiang; Sicheng Zuo; Wenzhao Zheng; Zhi-Xin Yang; Zixun Xie

arxiv: 2512.16919 · v2 · pith:BLWFHEJLnew · submitted 2025-12-18 · 💻 cs.CV · cs.AI· cs.RO

DVGT: Driving Visual Geometry Transformer

Sicheng Zuo , Zixun Xie , Wenzhao Zheng , Shaoqing Xu , Fang Li , Shengyin Jiang , Long Chen , Zhi-Xin Yang

show 1 more author

Jiwen Lu

This is my paper

classification 💻 cs.CV cs.AIcs.RO

keywords dvgtgeometryvisualdrivingattentioncameraconfigurationsdense

0 comments

read the original abstract

Perceiving and reconstructing 3D scene geometry from visual inputs is crucial for autonomous driving. However, there still lacks a driving-targeted dense geometry perception model that can adapt to different scenarios and camera configurations. To bridge this gap, we propose a Driving Visual Geometry Transformer (DVGT), which reconstructs a global dense 3D point map from a sequence of unposed multi-view visual inputs. We first extract visual features for each image using a DINO backbone, and employ alternating intra-view local attention, cross-view spatial attention, and cross-frame temporal attention to infer geometric relations across images. We then use multiple heads to decode a global point map in the ego coordinate of the first frame and the ego poses for each frame. Unlike conventional methods that rely on precise camera parameters, DVGT is free of explicit 3D geometric priors, enabling flexible processing of arbitrary camera configurations. DVGT directly predicts metric-scaled geometry from image sequences, eliminating the need for post-alignment with external sensors. Trained on a large mixture of driving datasets including nuScenes, OpenScene, Waymo, KITTI, and DDAD, DVGT significantly outperforms existing models on various scenarios. Code is available at https://github.com/wzzheng/DVGT.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
cs.CV 2026-04 unverdicted novelty 6.0

OneVL is the first latent CoT method to exceed explicit CoT accuracy on four driving benchmarks while running at answer-only speed, by supervising latent tokens with a visual world model decoder.
Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
cs.CV 2026-04 unverdicted novelty 6.0

OneVL achieves superior accuracy to explicit chain-of-thought reasoning at answer-only latency by supervising latent tokens with a visual world model decoder that predicts future frames.
DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale
cs.CV 2026-04 unverdicted novelty 6.0

DVGT-2 is a streaming vision-geometry-action model that jointly reconstructs dense 3D geometry and plans trajectories online, achieving better reconstruction than prior batch methods while transferring directly to pla...