PointDiT: Pixel-Space Diffusion for Monocular Geometry Estimation

Andreas Geiger; Fabian Manhardt; Federico Tombari; Haofei Xu; Marc Pollefeys; Michael Niemeyer; Michael Oechsle; Nikolai Kalischek; Philipp Henzler; Rundi Wu

arxiv: 2607.02515 · v1 · pith:AZQN6YDDnew · submitted 2026-07-02 · 💻 cs.CV

Haofei Xu, Rundi Wu, Philipp Henzler, Nikolai Kalischek, Michael Oechsle, Fabian Manhardt, Marc Pollefeys, Andreas Geiger, Federico Tombari, Michael Niemeyer

keywords diffusion, latent, complex, geometry, hybrid, loss, models, pixel-space

State-of-the-art single-image 3D reconstruction methods often rely on complex hybrid architectures and loss functions, or compress geometry into latent spaces in order to leverage pre-trained latent diffusion models. In this work, we show that such architectural overhead and intricate loss formulations are unnecessary. We introduce a minimalist pixel-space Diffusion Transformer, built on a plain ViT, that operates directly on raw 3D point map patches and is conditioned on image tokens from a pre-trained DINOv3. Unlike existing latent diffusion approaches, we train our diffusion backbone entirely from scratch, eliminating the need for point map tokenizers. Despite its simplicity, our approach surpasses complex latent-based diffusion models while remaining significantly simpler than hybrid alternatives. Notably, it produces sharper geometric structure and is more robust in highly ambiguous regions, such as transparent objects.
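To make "operates directly on raw 3D point map patches" concrete, the following is a minimal sketch of how a per-pixel point map could be split into non-overlapping patch tokens and reassembled, in the way a plain ViT typically patchifies an image. It is an illustration only: the function names `patchify` and `unpatchify`, the patch size of 16, and the 32x48 resolution are assumptions made here and are not taken from the paper, which may tokenize point maps differently.

```python
import numpy as np


def patchify(point_map: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) point map into (num_patches, patch*patch*C) tokens.

    Hypothetical helper: each token holds the raw XYZ values of one
    non-overlapping patch, with no learned tokenizer involved.
    """
    h, w, c = point_map.shape
    if h % patch or w % patch:
        raise ValueError("H and W must be divisible by the patch size")
    gh, gw = h // patch, w // patch
    x = point_map.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, C)
    return x.reshape(gh * gw, patch * patch * c)


def unpatchify(tokens: np.ndarray, h: int, w: int, patch: int) -> np.ndarray:
    """Inverse of patchify: rebuild the (H, W, C) point map from tokens."""
    gh, gw = h // patch, w // patch
    c = tokens.shape[1] // (patch * patch)
    x = tokens.reshape(gh, gw, patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, patch, gw, patch, C)
    return x.reshape(h, w, c)


# Assumed toy input: a random 32x48 point map with 3 coordinates per pixel.
point_map = np.random.default_rng(0).normal(size=(32, 48, 3)).astype(np.float32)
tokens = patchify(point_map, patch=16)
restored = unpatchify(tokens, 32, 48, patch=16)
```

Because the mapping is a pure reshape, it is lossless and invertible, which is the property that lets a pixel-space model skip a learned point map tokenizer. In a diffusion setting, noise would be added to and predicted for these raw tokens, while the image tokens serve as conditioning.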
