pith. sign in

arxiv: 2607.02515 · v1 · pith:AZQN6YDDnew · submitted 2026-07-02 · 💻 cs.CV

PointDiT: Pixel-Space Diffusion for Monocular Geometry Estimation

Pith reviewed 2026-07-03 14:32 UTC · model grok-4.3

classification 💻 cs.CV
keywords monocular geometry estimationpixel-space diffusiondiffusion transformer3D point mapssingle-image 3D reconstructionViT backbone
0
0 comments X

The pith

A plain ViT diffusion model operating directly on raw point map patches outperforms latent diffusion models for single-image 3D geometry.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that complex hybrid architectures, latent compression, and intricate losses are unnecessary for state-of-the-art monocular 3D reconstruction. A minimalist Diffusion Transformer built on a plain ViT works directly on raw 3D point map patches and conditions only on pre-trained DINOv3 image tokens. The model trains entirely from scratch with no point-map tokenizer. This simpler setup produces sharper geometry and handles ambiguous regions such as transparent objects better than existing latent-based or hybrid methods.

Core claim

A pixel-space Diffusion Transformer using a plain ViT backbone, trained from scratch on raw 3D point map patches and conditioned solely on DINOv3 image tokens, surpasses complex latent-based diffusion models while remaining simpler than hybrid alternatives and yields sharper geometric structure with greater robustness in ambiguous regions.

What carries the argument

PointDiT: a pixel-space Diffusion Transformer that operates directly on raw 3D point map patches without any tokenizer, conditioned on DINOv3 image tokens.

If this is right

  • Sharper geometric structure emerges without specialized loss formulations.
  • Robustness improves in regions with transparency or depth ambiguity.
  • Deployment simplifies by removing the need for point-map tokenizers and latent encoders.
  • Training remains feasible entirely from scratch on the target geometry representation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method may extend naturally to other dense prediction tasks where direct pixel-space modeling avoids quantization artifacts.
  • Removing reliance on pre-trained latent spaces could allow faster adaptation when new sensor modalities appear.
  • The approach suggests that diffusion on explicit geometric outputs can serve as a baseline for comparing future architectural additions.

Load-bearing premise

That training a standard ViT diffusion model from scratch directly on point-map patches without tokenizers or pre-trained latents suffices to exceed the performance of more elaborate latent and hybrid systems.

What would settle it

A controlled experiment in which a latent diffusion model with equivalent training compute and data produces equal or superior results on standard monocular geometry benchmarks.

Figures

Figures reproduced from arXiv: 2607.02515 by Andreas Geiger, Fabian Manhardt, Federico Tombari, Haofei Xu, Marc Pollefeys, Michael Niemeyer, Michael Oechsle, Nikolai Kalischek, Philipp Henzler, Rundi Wu.

Figure 1
Figure 1. Figure 1: PointDiT. A minimalist pixel-space Diffusion Trans￾former operating directly on raw point map patches, conditioned on image tokens from a pre-trained DINOv3. The 3D point map (H × W × 3) is visualized as an RGB image, with color encoding the spatial (X, Y, Z) coordinates. 3D representation remains ill-posed, owing to the inherent scale and depth ambiguities of perspective projection. Existing approaches to… view at source ↗
Figure 2
Figure 2. Figure 2: Comparison with latent diffusion and regression. The two dominant paradigms each have an inherent limitation: (a) the VAE in latent diffusion models introduces reconstruction noise that caps the attainable quality, while (b) deterministic regression over-smooths fine geometric structures. PointDiT avoids both. resolving depth in highly ambiguous scenarios, such as transparent objects. PointDiT achieves hig… view at source ↗
Figure 3
Figure 3. Figure 3: Different diffusion sampling steps. Our single-step diffusion already significantly outperforms prior works, and in￾creasing the sampling steps further enhances reconstruction details (see the zoomed-in region). The improvement is most pronounced on BF1: PointDiT raises boundary sharpness from 9.41 (the best baseline) to 10.50, reflecting markedly sharper geometry ( [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Point map comparisons. Our PointDiT is significantly better in terms of reconstructing thin structures (1st row), transparent objects (2nd rows), and maintaining a more accurate relative scale across the global scene (3rd and 4th rows). We show additional depth comparisons in the appendix ( [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Generative flow matching vs. deterministic regression. (a) The deterministic regressor converges faster at first but soon overfits, while the generative model trains stably and reaches lower error. (b) The generative model recovers sharper boundaries, thin structures, and transparent objects than the deterministic regressor. Overall, the generative formulation improves the boundary metric BF1 from 10.90 to… view at source ↗
Figure 6
Figure 6. Figure 6: Effect of patch size. At 512 × 512 resolution, a patch size of 16 recovers sharper boundaries and finer local structures than a patch size of 32. with notably sharper boundaries and supports both single￾step and multi-step inference. By showing that dense geom￾etry can be modeled effectively in pixel space, we bridge the gap between standard image generation and 3D recon￾struction, paving the way for VAE-f… view at source ↗
Figure 7
Figure 7. Figure 7: Depth comparisons. Our PointDiT is significantly better in terms of reconstructing thin structures (1st row), transparent objects (2nd rows), and maintaining a more accurate relative scale across the global scene (3rd and 4th rows). The corresponding point map comparisons are provided in the main paper ( [PITH_FULL_IMAGE:figures/full_fig_p016_7.png] view at source ↗
read the original abstract

State-of-the-art single-image 3D reconstruction methods often rely on complex hybrid architectures and loss functions, or compress geometry into latent spaces in order to leverage pre-trained latent diffusion models. In this work, we show that such architectural overhead and intricate loss formulations are unnecessary. We introduce a minimalist pixel-space Diffusion Transformer, built on a plain ViT, that operates directly on raw 3D point map patches and is conditioned on image tokens from a pre-trained DINOv3. Unlike existing latent diffusion approaches, we train our diffusion backbone entirely from scratch, eliminating the need for point map tokenizers. Despite its simplicity, our approach surpasses complex latent-based diffusion models while remaining significantly simpler than hybrid alternatives. Notably, it produces sharper geometric structure and is more robust in highly ambiguous regions, such as transparent objects.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces PointDiT, a minimalist pixel-space Diffusion Transformer built on a plain ViT that operates directly on raw 3D point map patches conditioned solely on frozen DINOv3 image tokens. It is trained entirely from scratch without any point-map tokenizer or latent compression, and claims to surpass complex latent-based diffusion models and hybrid alternatives in accuracy while producing sharper geometric structure and greater robustness in ambiguous regions such as transparent objects for monocular geometry estimation.

Significance. If the performance claims are substantiated by rigorous experiments, the result would indicate that direct pixel-space diffusion on raw point maps can eliminate the need for latent spaces, tokenizers, or hybrid architectural scaffolding in single-image 3D reconstruction, offering a simpler and potentially more scalable alternative to current state-of-the-art methods.

major comments (2)
  1. [Abstract] Abstract: The central claim that the approach 'surpasses complex latent-based diffusion models' and is 'more robust in highly ambiguous regions' is asserted without any quantitative results, baseline comparisons, ablation studies, or error metrics. This absence makes it impossible to evaluate whether the minimalist ViT diffusion on raw patches actually delivers the stated superiority.
  2. [Abstract] Method description (inferred from abstract): The premise that a plain ViT diffusion process on raw 3D point-map patches requires 'no auxiliary representation learning' is load-bearing for the simplicity argument, yet the text provides no details on point-map normalization, patch embedding, or handling of scale/ambiguity that might implicitly reintroduce complexity equivalent to a tokenizer.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their review and the opportunity to clarify points raised about the abstract. The manuscript provides supporting experiments and method details in later sections; we address the specific concerns below and indicate planned revisions.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that the approach 'surpasses complex latent-based diffusion models' and is 'more robust in highly ambiguous regions' is asserted without any quantitative results, baseline comparisons, ablation studies, or error metrics. This absence makes it impossible to evaluate whether the minimalist ViT diffusion on raw patches actually delivers the stated superiority.

    Authors: We agree the abstract presents claims at a high level without numbers. The full manuscript contains quantitative results in the Experiments section, including tables with RMSE, accuracy@threshold metrics, and direct comparisons against latent diffusion baselines, plus ablations on ambiguous regions. To make the abstract self-contained, we will revise it to include one or two key quantitative highlights from the main results. revision: yes

  2. Referee: [Abstract] Method description (inferred from abstract): The premise that a plain ViT diffusion process on raw 3D point-map patches requires 'no auxiliary representation learning' is load-bearing for the simplicity argument, yet the text provides no details on point-map normalization, patch embedding, or handling of scale/ambiguity that might implicitly reintroduce complexity equivalent to a tokenizer.

    Authors: The abstract summarizes the approach; the method section details the processing: point maps are normalized to a fixed range per scene, embedded via a simple linear projection on the raw (x,y,z) patches with no learned tokenizer or VQ, and scale/ambiguity is handled through the diffusion objective and DINO conditioning alone. We will add a short clarifying sentence in the abstract to explicitly note the direct raw-patch embedding and absence of auxiliary representation learning. revision: yes

Circularity Check

0 steps flagged

No circularity: method trained from scratch on raw patches with external conditioning

full rationale

The paper presents an empirical method: a plain ViT diffusion model trained entirely from scratch directly on raw 3D point-map patches, conditioned only on frozen pre-trained DINOv3 tokens, with no point-map tokenizer. No equations, predictions, or first-principles derivations are shown that reduce the claimed superiority to a fitted quantity or self-citation chain by construction. The abstract explicitly contrasts the approach against latent/hybrid methods without invoking authors' prior uniqueness theorems or ansatzes. The central claim rests on experimental comparison rather than definitional equivalence, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are specified in the provided text.

pith-pipeline@v0.9.1-grok · 5697 in / 1142 out tokens · 26762 ms · 2026-07-03T14:32:07.443631+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

14 extracted references · 13 canonical work pages · 6 internal anchors

  1. [1]

    arXiv preprint arXiv:2001.10773,

  2. [2]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Dosovitskiy, A. An image is worth 16x16 words: Trans- formers for image recognition at scale.arXiv preprint arXiv:2010.11929,

  3. [3]

    Lotus: Diffusion-based visual foundation model for high-quality dense prediction, 2025

    He, J., Li, H., Yin, W., Liang, Y ., Li, L., Zhou, K., Zhang, H., Liu, B., and Chen, Y .-C. Lotus: Diffusion-based visual foundation model for high-quality dense prediction.arXiv preprint arXiv:2409.18124,

  4. [4]

    Is my depth ground-truth good enough? hammer – highly ac- curate multi-modal dataset for dense 3d scene regression

    Jung, H., Ruhkamp, P., Zhai, G., Brasch, N., Li, Y ., Verdie, Y ., Song, J., Zhou, Y ., Armagan, A., Ilic, S., et al. Is my depth ground-truth good enough? hammer – highly ac- curate multi-modal dataset for dense 3d scene regression. arXiv preprint arXiv:2205.04565,

  5. [5]

    Progressive Distillation for Fast Sampling of Diffusion Models

    Salimans, T. and Ho, J. Progressive distillation for fast sampling of diffusion models.arXiv preprint arXiv:2202.00512,

  6. [6]

    DINOv3

    Sim´eoni, O., V o, H. V ., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khalidov, V ., Szafraniec, M., Yi, S., Ramamonjisoa, M., et al. Dinov3.arXiv preprint arXiv:2508.10104,

  7. [7]

    Z., Daniele, A

    Vasiljevic, I., Kolkin, N., Zhang, S., Luo, R., Wang, H., Dai, F. Z., Daniele, A. F., Mostajabi, M., Basart, S., Walter, M. R., and Shakhnarovich, G. Diode: A dense indoor and outdoor depth dataset.arXiv preprint arXiv:1908.00463,

  8. [8]

    Vggt: Visual geometry grounded transformer

    Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., and Novotny, D. Vggt: Visual geometry grounded transformer. InCVPR, 2025a. Wang, Q., Zheng, S., Yan, Q., Deng, F., Zhao, K., and Chu, X. Irs: A large synthetic indoor robotics stereo dataset for disparity and surface normal estimation.arXiv preprint arXiv:1912.09632,

  9. [9]

    MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details

    Wang, R., Xu, S., Dai, C., Xiang, J., Deng, Y ., Tong, X., and Yang, J. Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision. InCVPR, 2025b. Wang, R., Xu, S., Dong, Y ., Deng, Y ., Xiang, J., Lv, Z., Sun, G., Tong, X., and Yang, J. Moge-2: Accurate monocu- lar geometry with metric scale and sharp ...

  10. [10]

    Wang, W. et al. Tartanground: A large-scale dataset for ground robot perception and navigation.arXiv preprint arXiv:2505.10696, 2025d. Wrenninge, M. and Unger, J. Synscapes: A photorealistic synthetic dataset for street scene parsing.arXiv preprint arXiv:1810.08705,

  11. [11]

    Diffusion Transformers with Representation Autoencoders

    Zheng, B., Ma, N., Tong, S., and Xie, S. Diffusion trans- formers with representation autoencoders.arXiv preprint arXiv:2510.11690,

  12. [12]

    Zhou, Y . et al. Omniworld: A multi-domain and multi- modal dataset for 4d world modeling.arXiv preprint arXiv:2509.12201,

  13. [13]

    12 PointDiT: Pixel-Space Diffusion for Monocular Geometry Estimation Appendix A. Experimental Details We train our PointDiT on synthetic datasets that provide dense, accurate ground-truth depth together with known camera intrinsics, and we evaluate zero-shot on unseen real-world benchmarks. From each image we back-project the depth map through the intrins...

  14. [14]

    PointDiT variants use 4 sampling steps, and the Avg column is the sample-weighted mean over all evaluation samples. Overall, PointDiT-H attains the best average depth accuracy, PointDiT achieves the sharpest 14 PointDiT: Pixel-Space Diffusion for Monocular Geometry Estimation Table 7.Per-dataset point map results.Rel p ↓ and δp 1 ↑ for each of the seven e...