PointDiT: Pixel-Space Diffusion for Monocular Geometry Estimation

Andreas Geiger; Fabian Manhardt; Federico Tombari; Haofei Xu; Marc Pollefeys; Michael Niemeyer; Michael Oechsle; Nikolai Kalischek; Philipp Henzler; Rundi Wu

arxiv: 2607.02515 · v1 · pith:AZQN6YDDnew · submitted 2026-07-02 · 💻 cs.CV

Pith reviewed 2026-07-03 14:32 UTC · model grok-4.3

classification 💻 cs.CV

keywords monocular geometry estimation · pixel-space diffusion · diffusion transformer · 3D point maps · single-image 3D reconstruction · ViT backbone

The pith

A plain ViT diffusion model operating directly on raw point map patches outperforms latent diffusion models for single-image 3D geometry.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that complex hybrid architectures, latent compression, and intricate losses are unnecessary for state-of-the-art monocular 3D reconstruction. A minimalist Diffusion Transformer built on a plain ViT works directly on raw 3D point map patches and conditions only on pre-trained DINOv3 image tokens. The model trains entirely from scratch with no point-map tokenizer. This simpler setup produces sharper geometry and handles ambiguous regions such as transparent objects better than existing latent-based or hybrid methods.

Core claim

A pixel-space Diffusion Transformer using a plain ViT backbone, trained from scratch on raw 3D point map patches and conditioned solely on DINOv3 image tokens, surpasses complex latent-based diffusion models while remaining simpler than hybrid alternatives. It also yields sharper geometric structure and is more robust in ambiguous regions.

What carries the argument

PointDiT: a pixel-space Diffusion Transformer that operates directly on raw 3D point map patches without any tokenizer, conditioned on DINOv3 image tokens.
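
A minimal sketch of what "operating directly on raw point map patches" can look like, assuming non-overlapping square patches flattened in row-major order as in a standard ViT. The function names `patchify` and `unpatchify` are illustrative and are not taken from the paper, which may lay out its patches differently.

```python
import numpy as np


def patchify(point_map: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, 3) point map into (N, patch * patch * 3) flat patches."""
    h, w, c = point_map.shape
    if h % patch or w % patch:
        raise ValueError("H and W must be divisible by the patch size")
    gh, gw = h // patch, w // patch
    # Expose the patch grid: (gh, patch, gw, patch, c).
    x = point_map.reshape(gh, patch, gw, patch, c)
    # Bring the two grid axes to the front: (gh, gw, patch, patch, c).
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * c)


def unpatchify(tokens: np.ndarray, h: int, w: int, patch: int) -> np.ndarray:
    """Inverse of patchify: rebuild the (H, W, C) map from flat patches."""
    gh, gw = h // patch, w // patch
    x = tokens.reshape(gh, gw, patch, patch, -1)
    x = x.transpose(0, 2, 1, 3, 4)  # back to (gh, patch, gw, patch, c)
    return x.reshape(h, w, -1)
```

Because the mapping is a pure reshape, `unpatchify(patchify(p, k), H, W, k)` returns the input unchanged. That losslessness is the property a learned tokenizer or VAE does not guarantee.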

If this is right

Sharper geometric structure emerges without specialized loss formulations.
Robustness improves in regions with transparency or depth ambiguity.
Deployment simplifies by removing the need for point-map tokenizers and latent encoders.
Training remains feasible entirely from scratch on the target geometry representation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method may extend naturally to other dense prediction tasks where direct pixel-space modeling avoids quantization artifacts.
Removing reliance on pre-trained latent spaces could allow faster adaptation when new sensor modalities appear.
The approach suggests that diffusion on explicit geometric outputs can serve as a baseline for comparing future architectural additions.

Load-bearing premise

That training a standard ViT diffusion model from scratch directly on point-map patches without tokenizers or pre-trained latents suffices to exceed the performance of more elaborate latent and hybrid systems.

What would settle it

A controlled experiment in which a latent diffusion model with equivalent training compute and data produces equal or superior results on standard monocular geometry benchmarks.

Figures

Figures reproduced from arXiv: 2607.02515 by Andreas Geiger, Fabian Manhardt, Federico Tombari, Haofei Xu, Marc Pollefeys, Michael Niemeyer, Michael Oechsle, Nikolai Kalischek, Philipp Henzler, Rundi Wu.

**Figure 1.** PointDiT. A minimalist pixel-space Diffusion Transformer operating directly on raw point map patches, conditioned on image tokens from a pre-trained DINOv3. The 3D point map (H × W × 3) is visualized as an RGB image, with color encoding the spatial (X, Y, Z) coordinates. 3D representation remains ill-posed, owing to the inherent scale and depth ambiguities of perspective projection. Existing approaches to…
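
To make the H × W × 3 point map and the scale ambiguity of perspective projection concrete, here is a small sketch under an assumed pinhole camera model. The intrinsics `fx`, `fy`, `cx`, `cy` and the helper names `backproject` and `project` are hypothetical and chosen for illustration only.

```python
import numpy as np


def backproject(depth, fx, fy, cx, cy):
    """Lift an (H, W) depth map to an (H, W, 3) point map (pinhole model)."""
    h, w = depth.shape
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1)


def project(points, fx, fy, cx, cy):
    """Project (..., 3) camera-frame points back to pixel coordinates."""
    u = fx * points[..., 0] / points[..., 2] + cx
    v = fy * points[..., 1] / points[..., 2] + cy
    return np.stack([u, v], axis=-1)
```

Scaling every 3D point by the same positive factor leaves `project` unchanged, since the factor cancels in the division by Z. A single image therefore cannot distinguish a scene from a uniformly rescaled copy of it.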

**Figure 2.** Comparison with latent diffusion and regression. The two dominant paradigms each have an inherent limitation: (a) the VAE in latent diffusion models introduces reconstruction noise that caps the attainable quality, while (b) deterministic regression over-smooths fine geometric structures. PointDiT avoids both. resolving depth in highly ambiguous scenarios, such as transparent objects. PointDiT achieves hig…

**Figure 3.** Different diffusion sampling steps. Our single-step diffusion already significantly outperforms prior works, and increasing the sampling steps further enhances reconstruction details (see the zoomed-in region). The improvement is most pronounced on BF1: PointDiT raises boundary sharpness from 9.41 (the best baseline) to 10.50, reflecting markedly sharper geometry (…
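
The difference between single-step and multi-step inference can be sketched with a plain Euler integrator over a learned velocity field. This assumes a flow-matching convention in which t = 0 is noise and t = 1 is data. The paper's actual sampler and time schedule are not given in the text above, so `euler_sample` and `velocity_fn` are illustrative names only.

```python
import numpy as np


def euler_sample(velocity_fn, noise, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps.

    velocity_fn stands in for the trained network; num_steps = 1 corresponds
    to single-step inference.
    """
    x = np.array(noise, dtype=float)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t)
    return x
```

With `num_steps=1` the network is evaluated once at t = 0. More steps re-evaluate the field along the trajectory, which is where a multi-step sampler can pick up extra detail.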

**Figure 4.** Point map comparisons. Our PointDiT is significantly better in terms of reconstructing thin structures (1st row), transparent objects (2nd row), and maintaining a more accurate relative scale across the global scene (3rd and 4th rows). We show additional depth comparisons in the appendix (…

**Figure 5.** Generative flow matching vs. deterministic regression. (a) The deterministic regressor converges faster at first but soon overfits, while the generative model trains stably and reaches lower error. (b) The generative model recovers sharper boundaries, thin structures, and transparent objects than the deterministic regressor. Overall, the generative formulation improves the boundary metric BF1 from 10.90 to…
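
A rough sketch of how the two training objectives in this comparison differ, assuming the common linear-interpolation form of flow matching (noise at t = 0, data at t = 1, velocity target `x1 - noise`). The paper's exact parameterization and loss weighting are not stated in the text above. `model_fn` is a placeholder for the network.

```python
import numpy as np


def flow_matching_pair(x1, noise, t):
    """Return the noisy input x_t and the velocity target for one sample."""
    x_t = (1.0 - t) * noise + t * x1  # straight path from noise to data
    v_target = x1 - noise             # constant velocity along that path
    return x_t, v_target


def flow_matching_loss(model_fn, x1, cond, rng):
    """Generative objective: regress the velocity at a random time t."""
    noise = rng.standard_normal(x1.shape)
    t = rng.uniform()
    x_t, v_target = flow_matching_pair(x1, noise, t)
    return float(np.mean((model_fn(x_t, t, cond) - v_target) ** 2))


def regression_loss(model_fn, x1, cond):
    """Deterministic objective: predict the point map directly from the image."""
    return float(np.mean((model_fn(cond) - x1) ** 2))
```

The regressor sees only the conditioning, so where several geometries are plausible the L2-optimal output is their average. That is one common explanation for over-smoothed boundaries. The generative model also receives `x_t` and can commit to one sample.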

**Figure 6.** Effect of patch size. At 512 × 512 resolution, a patch size of 16 recovers sharper boundaries and finer local structures than a patch size of 32. with notably sharper boundaries and supports both single-step and multi-step inference. By showing that dense geometry can be modeled effectively in pixel space, we bridge the gap between standard image generation and 3D reconstruction, paving the way for VAE-f…
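
The patch-size trade-off is easy to quantify. The helper below (a hypothetical name, assuming square inputs and non-overlapping patches over three coordinate channels) computes how many tokens the transformer sees and how many raw values each token must encode.

```python
def token_stats(resolution: int, patch: int, channels: int = 3):
    """Return (number of tokens, raw values per token) for a square input."""
    if resolution % patch:
        raise ValueError("resolution must be divisible by the patch size")
    grid = resolution // patch
    return grid * grid, patch * patch * channels


print(token_stats(512, 16))  # (1024, 768)
print(token_stats(512, 32))  # (256, 3072)
```

At 512 × 512, a patch size of 16 gives four times as many tokens as a patch size of 32, each covering a quarter of the area. Assuming full self-attention, that is 16 times as many pairwise token interactions, so the sharper boundaries come at a higher compute cost.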

**Figure 7.** Depth comparisons. Our PointDiT is significantly better in terms of reconstructing thin structures (1st row), transparent objects (2nd row), and maintaining a more accurate relative scale across the global scene (3rd and 4th rows). The corresponding point map comparisons are provided in the main paper (…

Original abstract

State-of-the-art single-image 3D reconstruction methods often rely on complex hybrid architectures and loss functions, or compress geometry into latent spaces in order to leverage pre-trained latent diffusion models. In this work, we show that such architectural overhead and intricate loss formulations are unnecessary. We introduce a minimalist pixel-space Diffusion Transformer, built on a plain ViT, that operates directly on raw 3D point map patches and is conditioned on image tokens from a pre-trained DINOv3. Unlike existing latent diffusion approaches, we train our diffusion backbone entirely from scratch, eliminating the need for point map tokenizers. Despite its simplicity, our approach surpasses complex latent-based diffusion models while remaining significantly simpler than hybrid alternatives. Notably, it produces sharper geometric structure and is more robust in highly ambiguous regions, such as transparent objects.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

PointDiT claims that a plain ViT diffusion model on raw point-map patches beats latent diffusion methods for monocular geometry while staying simpler than hybrid ones, but the abstract supplies no numbers or controls to check whether that holds.

The new element is the stripped-down setup: a from-scratch Diffusion Transformer on point-map patches in pixel space, conditioned only on frozen DINOv3 tokens, with no tokenizer or hybrid losses. That design choice is distinct from the latent-diffusion and hybrid pipelines the abstract cites.

If the experiments back the claim, the simplicity itself is useful. Avoiding latent compression and extra scaffolding could matter for systems that need direct, accurate depth and surface output without extra stages.

The soft spot is obvious and central: the abstract asserts superiority and better robustness on transparent objects but gives no quantitative results, baselines, or ablations. Without those, the main argument stays untested. The stress-test concern lands here: the premise that direct patch diffusion needs no auxiliary representation learning only stands if the full runs control for training compute, data scale, and evaluation protocol. If those controls are missing or unequal, the gains could disappear.

The paper is aimed at people working on single-image 3D reconstruction who are looking for simpler diffusion alternatives. A reader who wants to test whether minimal pixel-space models can close the gap would find it worth reading once the numbers are in.

It deserves peer review. The question is practical and the architecture is clean enough to evaluate properly; the current presentation is too thin to judge on its own.

Referee Report

2 major / 0 minor

Summary. The paper introduces PointDiT, a minimalist pixel-space Diffusion Transformer built on a plain ViT that operates directly on raw 3D point map patches and is conditioned solely on frozen DINOv3 image tokens. It is trained entirely from scratch, without any point-map tokenizer or latent compression. For monocular geometry estimation, it claims to surpass complex latent-based diffusion models while remaining simpler than hybrid alternatives, and to produce sharper geometric structure with greater robustness in ambiguous regions such as transparent objects.

Significance. If the performance claims are substantiated by rigorous experiments, the result would indicate that direct pixel-space diffusion on raw point maps can eliminate the need for latent spaces, tokenizers, or hybrid architectural scaffolding in single-image 3D reconstruction, offering a simpler and potentially more scalable alternative to current state-of-the-art methods.

major comments (2)

[Abstract] The central claim that the approach 'surpasses complex latent-based diffusion models' and is 'more robust in highly ambiguous regions' is asserted without any quantitative results, baseline comparisons, ablation studies, or error metrics. This absence makes it impossible to evaluate whether the minimalist ViT diffusion on raw patches actually delivers the stated superiority.
[Abstract] Method description (inferred from abstract): The premise that a plain ViT diffusion process on raw 3D point-map patches requires 'no auxiliary representation learning' is load-bearing for the simplicity argument, yet the text provides no details on point-map normalization, patch embedding, or handling of scale/ambiguity that might implicitly reintroduce complexity equivalent to a tokenizer.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their review and the opportunity to clarify points raised about the abstract. The manuscript provides supporting experiments and method details in later sections; we address the specific concerns below and indicate planned revisions.

Referee: [Abstract] The central claim that the approach 'surpasses complex latent-based diffusion models' and is 'more robust in highly ambiguous regions' is asserted without any quantitative results, baseline comparisons, ablation studies, or error metrics. This absence makes it impossible to evaluate whether the minimalist ViT diffusion on raw patches actually delivers the stated superiority.

Authors: We agree the abstract presents claims at a high level without numbers. The full manuscript contains quantitative results in the Experiments section, including tables with RMSE, accuracy@threshold metrics, and direct comparisons against latent diffusion baselines, plus ablations on ambiguous regions. To make the abstract self-contained, we will revise it to include one or two key quantitative highlights from the main results. revision: yes

Referee: [Abstract] Method description (inferred from abstract): The premise that a plain ViT diffusion process on raw 3D point-map patches requires 'no auxiliary representation learning' is load-bearing for the simplicity argument, yet the text provides no details on point-map normalization, patch embedding, or handling of scale/ambiguity that might implicitly reintroduce complexity equivalent to a tokenizer.

Authors: The abstract summarizes the approach; the method section details the processing: point maps are normalized to a fixed range per scene, embedded via a simple linear projection on the raw (x,y,z) patches with no learned tokenizer or VQ, and scale/ambiguity is handled through the diffusion objective and DINO conditioning alone. We will add a short clarifying sentence in the abstract to explicitly note the direct raw-patch embedding and absence of auxiliary representation learning. revision: yes
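
The preprocessing described in this response can be sketched as follows. The response only says "a fixed range per scene" and "a simple linear projection", so the details here are assumptions: centering on the mean, dividing by the maximum absolute coordinate to land in [-1, 1], and a single weight matrix with bias. All function and parameter names are illustrative.

```python
import numpy as np


def normalize_scene(point_map, eps=1e-8):
    """Center one scene's (H, W, 3) point map and scale it into [-1, 1].

    A single scalar scale is used for all three axes so that the scene's
    shape is preserved (an assumption, not a detail from the paper).
    """
    center = point_map.reshape(-1, 3).mean(axis=0)
    centered = point_map - center
    scale = np.abs(centered).max() + eps
    return centered / scale, center, scale


def linear_patch_embed(patches, weight, bias):
    """Map (N, patch * patch * 3) raw patches to (N, D) tokens.

    One matrix multiply and a bias: no codebook, no encoder network.
    """
    return patches @ weight + bias
```

Both steps are cheap and invertible or near-invertible: the normalization is undone by `normalized * scale + center`, and the embedding has no quantization stage. That is the sense in which this pipeline differs from a learned tokenizer.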

Circularity Check

0 steps flagged

No circularity: method trained from scratch on raw patches with external conditioning

full rationale

The paper presents an empirical method: a plain ViT diffusion model trained entirely from scratch directly on raw 3D point-map patches, conditioned only on frozen pre-trained DINOv3 tokens, with no point-map tokenizer. No equations, predictions, or first-principles derivations are shown that reduce the claimed superiority to a fitted quantity or self-citation chain by construction. The abstract explicitly contrasts the approach against latent/hybrid methods without invoking authors' prior uniqueness theorems or ansatzes. The central claim rests on experimental comparison rather than definitional equivalence, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are specified in the provided text.

