pith. machine review for the scientific record. sign in

arxiv: 2605.10922 · v1 · submitted 2026-05-11 · 💻 cs.CV

Recognition: no theorem link

Pixal3D: Pixel-Aligned 3D Generation from Images

Authors on Pith no claims yet

Pith reviewed 2026-05-12 03:46 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D generationpixel alignmentimage to 3Dfeature projection3D reconstructionmulti-view 3Dscene synthesis
0
0 comments X

The pith

Pixal3D generates 3D assets from images by aligning each generated point directly to input pixels via feature back-projection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Pixal3D to address low pixel-level fidelity in current image-to-3D models. It identifies that generating shapes in canonical space creates ambiguous pixel-to-3D links. Instead, Pixal3D uses a back-projection method to lift image features straight into 3D space, matching the input view. This leads to higher fidelity outputs that approach reconstruction quality and works for both objects and scenes from single or multiple views. Readers would care if it makes turning photos into accurate 3D models more reliable without extra steps.

Core claim

Pixal3D demonstrates that generating 3D in a pixel-aligned manner, rather than in canonical space, resolves the correspondence issue by explicitly lifting multi-scale image features into a 3D feature volume through pixel back-projection. This establishes unambiguous direct pixel-to-3D associations, enabling scalable high-fidelity 3D generation from images that approaches the fidelity of traditional reconstruction techniques.

What carries the argument

The pixel back-projection conditioning scheme that lifts multi-scale 2D image features into a 3D feature volume to create explicit pixel-to-3D mappings.

If this is right

  • High-quality 3D assets can be generated with substantially improved fidelity.
  • The method scales to produce 3D models at higher resolutions with realistic appearance.
  • Multi-view inputs can be handled by aggregating back-projected feature volumes from different views.
  • A modular pipeline extends the approach to synthesize high-fidelity 3D scenes with separated objects.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This pixel-aligned approach may generalize to other 3D tasks like novel view synthesis without additional training.
  • It could simplify pipelines by reducing reliance on attention mechanisms for conditioning.
  • Applications in AR or robotics might benefit from more accurate 3D models derived from casual photos.

Load-bearing premise

That mapping image features back to 3D coordinates creates a direct and unambiguous link between each pixel and its corresponding 3D location without creating alignment errors or requiring further adjustments.

What would settle it

A test set of generated 3D models where rendered views from the input angle show pixel-level differences from the original image exceeding those of standard reconstruction methods, or where the model fails to maintain consistency at higher resolutions.

Figures

Figures reproduced from arXiv: 2605.10922 by Dong-Yang Li, Fang-Lue Zhang, Meng-Hao Guo, Shi-Min Hu, Wang Zhao, Wenbo Hu, Ying Shan, Yuxin Chen.

Figure 1
Figure 1. Figure 1: Pixel-aligned meshes generated by Pixal3D. The foreground displays our results with their corresponding input images in the background. Our [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the Pixal3D framework. The framework consists of three key components: (1) Pixel-Aligned Structured Latent Representation Learning [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Illustration of the Back-projection Conditioning Scheme. [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative comparisons of single view 3D generation. [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative comparison of single-view 3D generation on in-the-wild images. [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Qualitative comparison of multi-view 3D generation on Toys4K. [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Qualitative comparison on 3D scene generation. [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Ablation study on key components. decreases while reconstruction cues become stronger, a trend con￾sistently observed in our results. This behavior is also a fundamental principle and objective of 3D generative reconstruction. 4.3 3D Scene Generation We extend Pixal3D to scene generation, as discussed in Sec. 3.3 [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗
read the original abstract

Recent advances in 3D generative models have rapidly improved image-to-3D synthesis quality, enabling higher-resolution geometry and more realistic appearance. Yet fidelity, which measures pixel-level faithfulness of the generated 3D asset to the input image, still remains a central bottleneck. We argue this stems from an implicit 2D-3D correspondence issue: most 3D-native generators synthesize shape in canonical space and inject image cues via attention, leaving pixel-to-3D associations ambiguous. To tackle this issue, we draw inspiration from 3D reconstruction and propose Pixal3D, a pixel-aligned 3D generation paradigm for high-fidelity 3D asset creation from images. Instead of generating in a canonical pose, Pixal3D directly generates 3D in a pixel-aligned way, consistent with the input view. To enable this, we introduce a pixel back-projection conditioning scheme that explicitly lifts multi-scale image features into a 3D feature volume, establishing direct pixel-to-3D correspondence without ambiguity. We show that Pixal3D is not only scalable and capable of producing high-quality 3D assets, but also substantially improves fidelity, approaching the fidelity level of reconstruction. Furthermore, Pixal3D naturally extends to multi-view generation by aggregating back-projected feature volumes across views. Finally, we show pixel-aligned generation benefits scene synthesis, and present a modular pipeline that produces high-fidelity, object-separated 3D scenes from images. Pixal3D for the first time demonstrates 3D-native pixel-aligned generation at scale, and provides a new inspiring way towards high-fidelity 3D generation of object or scene from single or multi-view images. Project page: https://ldyang694.github.io/projects/pixal3d/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Pixal3D, a 3D-native generative model for high-fidelity image-to-3D synthesis. It identifies an implicit 2D-3D correspondence problem in prior canonical-space generators and introduces a pixel-aligned generation paradigm that operates directly in the input view. The core technical contribution is a pixel back-projection conditioning scheme that lifts multi-scale 2D image features into a 3D feature volume to create explicit pixel-to-3D associations. The authors claim this yields substantially higher fidelity (approaching reconstruction quality), scales to high-resolution assets, extends naturally to multi-view inputs via volume aggregation, and supports modular scene-level synthesis with object separation.

Significance. If the back-projection mechanism demonstrably resolves correspondence without introducing ray-smearing artifacts or requiring compensatory regularization that limits generalization, the work would represent a meaningful advance in scalable, high-fidelity 3D generation. The explicit pixel-alignment approach and its extension to scenes are conceptually attractive and could influence future architectures that prioritize faithfulness over canonical-pose generation. The paper's emphasis on being 3D-native rather than 2D-lifted is a strength worth highlighting if supported by rigorous experiments.

major comments (2)
  1. [Abstract / Method] Abstract and method description: The claim that pixel back-projection 'establishes direct pixel-to-3D correspondence without ambiguity' is load-bearing for all fidelity and 'approaching reconstruction' assertions. For monocular inputs, back-projecting each pixel feature along its camera ray (or discrete depth samples) into a 3D volume inherently produces a one-to-many mapping; the resulting feature volume encodes smeared information along rays. The manuscript must explicitly show how depth ambiguity is resolved (e.g., via learned depth prediction, multi-view constraints, or architectural inductive biases) rather than asserted to be removed by construction. Without this, the central correspondence advantage reduces to a reparameterization that still requires the network to disambiguate.
  2. [Experiments] Experiments section: The abstract asserts 'qualitative and comparative improvements' and 'substantially improves fidelity' yet supplies no quantitative metrics (PSNR, LPIPS, Chamfer distance, IoU, etc.), ablation tables, or error analysis. This absence prevents verification of whether the pixel-aligned scheme actually outperforms strong baselines on faithfulness while maintaining 3D consistency. Load-bearing claims require at least one table reporting these metrics across single-view, multi-view, and scene settings.
minor comments (2)
  1. [Abstract] Abstract: 'Pixal3D' spelling should be confirmed for consistency; the project page link is useful but the manuscript should include a brief statement on reproducibility (code release, model weights).
  2. [Method] The multi-view aggregation and scene pipeline are described at high level; a diagram or pseudocode in the method section would clarify how back-projected volumes are combined without introducing view-inconsistency artifacts.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to strengthen the presentation of our contributions.

read point-by-point responses
  1. Referee: [Abstract / Method] Abstract and method description: The claim that pixel back-projection 'establishes direct pixel-to-3D correspondence without ambiguity' is load-bearing for all fidelity and 'approaching reconstruction' assertions. For monocular inputs, back-projecting each pixel feature along its camera ray (or discrete depth samples) into a 3D volume inherently produces a one-to-many mapping; the resulting feature volume encodes smeared information along rays. The manuscript must explicitly show how depth ambiguity is resolved (e.g., via learned depth prediction, multi-view constraints, or architectural inductive biases) rather than asserted to be removed by construction. Without this, the central correspondence advantage reduces to a reparameterization that still requires the network to disambiguate.

    Authors: We agree that the original wording overstated the automatic removal of ambiguity. Back-projection creates explicit pixel-to-ray associations in the 3D volume, but monocular depth disambiguation is performed by the generative network through learned 3D priors, multi-scale feature aggregation, and training objectives that encourage geometric consistency. We have revised the abstract and method sections to explicitly describe this mechanism, including the role of the 3D U-Net architecture and any inductive biases that help resolve ray-smearing. We have also added a short discussion of limitations for highly ambiguous cases. revision: yes

  2. Referee: [Experiments] Experiments section: The abstract asserts 'qualitative and comparative improvements' and 'substantially improves fidelity' yet supplies no quantitative metrics (PSNR, LPIPS, Chamfer distance, IoU, etc.), ablation tables, or error analysis. This absence prevents verification of whether the pixel-aligned scheme actually outperforms strong baselines on faithfulness while maintaining 3D consistency. Load-bearing claims require at least one table reporting these metrics across single-view, multi-view, and scene settings.

    Authors: We acknowledge that quantitative metrics are necessary to substantiate the fidelity claims. The original submission emphasized qualitative results and visual comparisons because defining precise 3D ground truth for generative tasks is non-trivial; however, this is insufficient for the load-bearing assertions. In the revised manuscript we have added a new quantitative evaluation section and table reporting PSNR, LPIPS, Chamfer distance, and IoU (where applicable) for single-view, multi-view, and scene-level synthesis, together with ablations isolating the back-projection component. revision: yes

Circularity Check

0 steps flagged

No significant circularity: method is a novel architectural proposal without reduction to inputs or self-citations.

full rationale

The paper introduces Pixal3D as a new pixel-aligned 3D generation paradigm that uses a pixel back-projection conditioning scheme to lift multi-scale image features into a 3D feature volume. This is framed as an explicit design choice inspired by 3D reconstruction to address correspondence ambiguity, rather than a mathematical derivation or prediction that reduces by construction to fitted parameters or prior self-cited results. No equations, uniqueness theorems, or ansatzes are quoted that would indicate self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations. The central claims about scalability, fidelity gains, and extension to multi-view/scene synthesis rest on the proposed architecture's empirical performance, which remains independently verifiable and does not tautologically restate its inputs. This is a standard case of a self-contained method paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the assumption that back-projection creates unambiguous correspondence and that the resulting 3D feature volume can be used directly by a generative model without further unspecified constraints. No explicit free parameters or invented physical entities are named in the abstract.

axioms (1)
  • domain assumption Neural networks can learn to generate consistent 3D structure from lifted 2D features when trained on appropriate data.
    Implicit in any learned 3D generative model; invoked when claiming scalability and high-quality output.
invented entities (1)
  • Pixel back-projection conditioning scheme no independent evidence
    purpose: To lift multi-scale image features into a 3D feature volume establishing direct pixel-to-3D correspondence.
    New mechanism introduced to solve the implicit 2D-3D correspondence issue; no independent evidence outside the method itself is provided in the abstract.

pith-pipeline@v0.9.0 · 5650 in / 1436 out tokens · 31671 ms · 2026-05-12T03:46:53.484266+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · 7 internal anchors

  1. [1]

    Objaverse-XL:

    Matt Deitke and Ruoshi Liu and Matthew Wallingford and Huong Ngo and Oscar Michel and Aditya Kusupati and Alan Fan and Christian Laforte and Vikram Voleti and Samir Yitzhak Gadre and Eli VanderBilt and Aniruddha Kembhavi and Carl Vondrick and Georgia Gkioxari and Kiana Ehsani and Ludwig Schmidt and Ali Farhadi , editor =. Objaverse-XL:. Advances in Neural...

  2. [2]

    Point-E: A System for Generating 3D Point Clouds from Complex Prompts

    Alex Nichol and Heewoo Jun and Prafulla Dhariwal and Pamela Mishkin and Mark Chen , title =. CoRR , volume =. 2022 , url =. doi:10.48550/ARXIV.2212.08751 , eprinttype =. 2212.08751 , timestamp =

  3. [3]

    3DShape2VecSet:

    Biao Zhang and Jiapeng Tang and Matthias Nie. 3DShape2VecSet:. ACM Transactions on Graphics (TOG) , volume =. 2023 , url =. doi:10.1145/3592442 , timestamp =

  4. [4]

    ACM Transactions on Graphics (TOG) , volume =

    Longwen Zhang and Ziyu Wang and Qixuan Zhang and Qiwei Qiu and Anqi Pang and Haoran Jiang and Wei Yang and Lan Xu and Jingyi Yu , title =. ACM Transactions on Graphics (TOG) , volume =. 2024 , url =. doi:10.1145/3658146 , timestamp =

  5. [5]

    What’s in the image? a deep-dive into the vision of vision language models

    Weiyu Li and Jiarui Liu and Hongyu Yan and Rui Chen and Yixun Liang and Xuelin Chen and Ping Tan and Xiaoxiao Long , title =. 2025 , url =. doi:10.1109/CVPR52734.2025.00500 , timestamp =

  6. [6]

    Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation

    Zibo Zhao and Zeqiang Lai and Qingxiang Lin and Yunfei Zhao and Haolin Liu and Shuhui Yang and Yifei Feng and Mingxin Yang and Sheng Zhang and Xianghui Yang and Huiwen Shi and Sicong Liu and Junta Wu and Yihang Lian and Fan Yang and Ruining Tang and Zebin He and Xinzhou Wang and Jian Liu and Xuhui Zuo and Zhuo Chen and Biwen Lei and Haohan Weng and Jing X...

  7. [7]

    Triposg: High-fidelity 3d shape synthesis using large-scale rectified flow models.arXiv preprint arXiv:2502.06608, 2025

    Yangguang Li and Zi. TripoSG: High-Fidelity 3D Shape Synthesis using Large-Scale Rectified Flow Models , journal =. 2025 , url =. doi:10.48550/ARXIV.2502.06608 , eprinttype =. 2502.06608 , timestamp =

  8. [8]

    What’s in the image? a deep-dive into the vision of vision language models

    Jianfeng Xiang and Zelong Lv and Sicheng Xu and Yu Deng and Ruicheng Wang and Bowen Zhang and Dong Chen and Xin Tong and Jiaolong Yang , title =. 2025 , url =. doi:10.1109/CVPR52734.2025.02000 , timestamp =

  9. [9]

    Direct3d-s2: Gigascale 3d generation made easy with spatial sparse attention.arXiv preprint arXiv:2505.17412, 2025

    Shuang Wu and Youtian Lin and Feihu Zhang and Yifei Zeng and Yikang Yang and Yajie Bao and Jiachen Qian and Siyu Zhu and Xun Cao and Philip Torr and Yao Yao , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2505.17412 , eprinttype =. 2505.17412 , timestamp =

  10. [10]

    arXiv preprint arXiv:2505.14521 , year=

    Zhihao Li and Yufei Wang and Heliang Zheng and Yihao Luo and Bihan Wen , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2505.14521 , eprinttype =. 2505.14521 , timestamp =

  11. [11]

    Sparseflex: High-resolution and arbitrary-topology 3d shape modeling,

    Xianglong He and Zi. SparseFlex: High-Resolution and Arbitrary-Topology 3D Shape Modeling , journal =. 2025 , url =. doi:10.48550/ARXIV.2503.21732 , eprinttype =. 2503.21732 , timestamp =

  12. [12]

    Native and compact structured latents for 3d generation.arXiv preprint arXiv:2512.14692, 2025

    Jianfeng Xiang and Xiaoxue Chen and Sicheng Xu and Ruicheng Wang and Zelong Lv and Yu Deng and Hongyuan Zhu and Yue Dong and Hao Zhao and Nicholas Jing Yuan and Jiaolong Yang , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2512.14692 , eprinttype =. 2512.14692 , bibsource =

  13. [13]

    CoRR , volume =

    Yihao Luo and Xianglong He and Chuanyu Pan and Yiwen Chen and Jiaqi Wu and Yangguang Li and Wanli Ouyang and Yuanming Hu and Guang Yang and Choon Hwai Yap , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2511.04029 , eprinttype =. 2511.04029 , timestamp =

  14. [14]

    Lattice: Democratize high-fidelity 3d generation at scale.arXiv preprint arXiv:2512.03052, 2025

    Zeqiang Lai and Yunfei Zhao and Zibo Zhao and Haolin Liu and Qingxiang Lin and Jingwei Huang and Chunchao Guo and Xiangyu Yue , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2512.03052 , eprinttype =. 2512.03052 , bibsource =

  15. [15]

    Warren , title =

    Tao Ju and Frank Losasso and Scott Schaefer and Joe D. Warren , title =. ACM Transactions on Graphics (TOG) , volume =. 2002 , url =. doi:10.1145/566654.566586 , timestamp =

  16. [16]

    The Thirteenth International Conference on Learning Representations,

    Yushi Lan and Shangchen Zhou and Zhaoyang Lyu and Fangzhou Hong and Shuai Yang and Bo Dai and Xingang Pan and Chen Change Loy , title =. The Thirteenth International Conference on Learning Representations,. 2025 , url =

  17. [17]

    Direct3D: Scalable Image-to-3D Generation via 3D Latent Diffusion Transformer , booktitle =

    Shuang Wu and Youtian Lin and Yifei Zeng and Feihu Zhang and Jingxi Xu and Philip Torr and Xun Cao and Yao Yao , editor =. Direct3D: Scalable Image-to-3D Generation via 3D Latent Diffusion Transformer , booktitle =. 2024 , url =

  18. [18]

    Black and Derek Nowrouzezahrai and Liam Paull and Weiyang Liu , title =

    Zhen Liu and Yao Feng and Michael J. Black and Derek Nowrouzezahrai and Liam Paull and Weiyang Liu , title =. The Eleventh International Conference on Learning Representations,. 2023 , url =

  19. [19]

    OctFusion: Octree-based Diffusion Models for 3D Shape Generation , journal =

    Bojun Xiong and Si. OctFusion: Octree-based Diffusion Models for 3D Shape Generation , journal =. 2025 , url =. doi:10.1111/CGF.70198 , timestamp =

  20. [20]

    Barron and Ben Mildenhall , title =

    Ben Poole and Ajay Jain and Jonathan T. Barron and Ben Mildenhall , title =. The Eleventh International Conference on Learning Representations,. 2023 , url =

  21. [21]

    ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation , booktitle =

    Zhengyi Wang and Cheng Lu and Yikai Wang and Fan Bao and Chongxuan Li and Hang Su and Jun Zhu , editor =. ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation , booktitle =. 2023 , url =

  22. [22]

    Johannes L. Sch. Structure-from-Motion Revisited , booktitle =. 2016 , url =. doi:10.1109/CVPR.2016.445 , timestamp =

  23. [23]

    Johannes L. Sch. Pixelwise View Selection for Unstructured Multi-View Stereo , booktitle =. 2016 , url =. doi:10.1007/978-3-319-46487-9\_31 , timestamp =

  24. [24]

    DeepMVS: Learning Multi-View Stereopsis , booktitle =

    Po. DeepMVS: Learning Multi-View Stereopsis , booktitle =. 2018 , url =. doi:10.1109/CVPR.2018.00298 , timestamp =

  25. [25]

    MVSNet: Depth Inference for Unstructured Multi-view Stereo , booktitle =

    Yao Yao and Zixin Luo and Shiwei Li and Tian Fang and Long Quan , editor =. MVSNet: Depth Inference for Unstructured Multi-view Stereo , booktitle =. 2018 , url =. doi:10.1007/978-3-030-01237-3\_47 , timestamp =

  26. [26]

    DPSNet: End-to-end Deep Plane Sweep Stereo , booktitle =

    Sunghoon Im and Hae. DPSNet: End-to-end Deep Plane Sweep Stereo , booktitle =. 2019 , url =

  27. [27]

    Atlas: End-to-End 3D Scene Reconstruction from Posed Images , booktitle =

    Zak Murez and Tarrence van As and James Bartolozzi and Ayan Sinha and Vijay Badrinarayanan and Andrew Rabinovich , editor =. Atlas: End-to-End 3D Scene Reconstruction from Posed Images , booktitle =. 2020 , url =. doi:10.1007/978-3-030-58571-6\_25 , timestamp =

  28. [28]

    2021 , url =

    Jiaming Sun and Yiming Xie and Linghao Chen and Xiaowei Zhou and Hujun Bao , title =. 2021 , url =. doi:10.1109/CVPR46437.2021.01534 , timestamp =

  29. [29]

    In: CVPR

    Shuzhe Wang and Vincent Leroy and Yohann Cabon and Boris Chidlovskii and J. DUSt3R: Geometric 3D Vision Made Easy , booktitle =. 2024 , url =. doi:10.1109/CVPR52733.2024.01956 , timestamp =

  30. [30]

    What’s in the image? a deep-dive into the vision of vision language models

    Jianyuan Wang and Minghao Chen and Nikita Karaev and Andrea Vedaldi and Christian Rupprecht and David Novotn. 2025 , url =. doi:10.1109/CVPR52734.2025.00499 , timestamp =

  31. [31]

    What’s in the image? a deep-dive into the vision of vision language models

    Zhenggang Tang and Yuchen Fan and Dilin Wang and Hongyu Xu and Rakesh Ranjan and Alexander G. Schwing and Zhicheng Yan , title =. 2025 , url =. doi:10.1109/CVPR52734.2025.00498 , timestamp =

  32. [32]

    What’s in the image? a deep-dive into the vision of vision language models

    Ruicheng Wang and Sicheng Xu and Cassie Dai and Jianfeng Xiang and Yu Deng and Xin Tong and Jiaolong Yang , title =. 2025 , url =. doi:10.1109/CVPR52734.2025.00496 , timestamp =

  33. [33]

    What’s in the image? a deep-dive into the vision of vision language models

    Jianing Yang and Alexander Sax and Kevin J. Liang and Mikael Henaff and Hao Tang and Ang Cao and Joyce Chai and Franziska Meier and Matt Feiszli , title =. 2025 , url =. doi:10.1109/CVPR52734.2025.02042 , timestamp =

  34. [34]

    In: CVPR

    Lihe Yang and Bingyi Kang and Zilong Huang and Xiaogang Xu and Jiashi Feng and Hengshuang Zhao , title =. 2024 , url =. doi:10.1109/CVPR52733.2024.00987 , timestamp =

  35. [35]

    2023 , url =

    Wei Yin and Chi Zhang and Hao Chen and Zhipeng Cai and Gang Yu and Kaixuan Wang and Xiaozhi Chen and Chunhua Shen , title =. 2023 , url =. doi:10.1109/ICCV51070.2023.00830 , timestamp =

  36. [36]

    Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-Shot Cross-Dataset Transfer , journal =

    Ren. Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-Shot Cross-Dataset Transfer , journal =. 2022 , url =. doi:10.1109/TPAMI.2020.3019967 , timestamp =

  37. [37]

    Depth Anything 3: Recovering the Visual Space from Any Views

    Haotong Lin and Sili Chen and Junhao Liew and Donny Y. Chen and Zhenyu Li and Guang Shi and Jiashi Feng and Bingyi Kang , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2511.10647 , eprinttype =. 2511.10647 , timestamp =

  38. [38]

    IEEE Transactions on PatternAnalysisandMachineIntelligence46(12),10579–10596(Dec2024).https: //doi.org/10.1109/tpami.2024.3444912,http://dx.doi.org/10.1109/TPAMI

    Mu Hu and Wei Yin and Chi Zhang and Zhipeng Cai and Xiaoxiao Long and Hao Chen and Kaixuan Wang and Gang Yu and Chunhua Shen and Shaojie Shen , title =. 2024 , url =. doi:10.1109/TPAMI.2024.3444912 , timestamp =

  39. [39]

    In: CVPR

    Bingxin Ke and Anton Obukhov and Shengyu Huang and Nando Metzger and Rodrigo Caye Daudt and Konrad Schindler , title =. 2024 , url =. doi:10.1109/CVPR52733.2024.00907 , timestamp =

  40. [40]

    Stablenormal: Reducing diffusion variance for stable and sharp normal.ACM Transactions on Graphics (TOG), 43(6):1–18, 2024

    Chongjie Ye and Lingteng Qiu and Xiaodong Gu and Qi Zuo and Yushuang Wu and Zilong Dong and Liefeng Bo and Yuliang Xiu and Xiaoguang Han , title =. ACM Transactions on Graphics (TOG) , volume =. 2024 , url =. doi:10.1145/3687971 , timestamp =

  41. [41]

    GeoWizard: Unleashing the Diffusion Priors for 3D Geometry Estimation from a Single Image , booktitle =

    Xiao Fu and Wei Yin and Mu Hu and Kaixuan Wang and Yuexin Ma and Ping Tan and Shaojie Shen and Dahua Lin and Xiaoxiao Long , editor =. GeoWizard: Unleashing the Diffusion Priors for 3D Geometry Estimation from a Single Image , booktitle =. 2024 , url =. doi:10.1007/978-3-031-72670-5\_14 , timestamp =

  42. [42]

    In: 2025 International Conference on 3D Vision (3DV)

    Stanislaw Szymanowicz and Eldar Insafutdinov and Chuanxia Zheng and Dylan Campbell and Jo. Flash3D: Feed-Forward Generalisable 3D Scene Reconstruction from a Single Image , booktitle =. 2025 , url =. doi:10.1109/3DV66043.2025.00067 , timestamp =

  43. [43]

    Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

    Bolt3d: Generating 3d scenes in seconds , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

  44. [44]

    MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details

    Ruicheng Wang and Sicheng Xu and Yue Dong and Yu Deng and Jianfeng Xiang and Zelong Lv and Guangzhong Sun and Xin Tong and Jiaolong Yang , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2507.02546 , eprinttype =. 2507.02546 , timestamp =

  45. [45]

    The Twelfth International Conference on Learning Representations,

    Yicong Hong and Kai Zhang and Jiuxiang Gu and Sai Bi and Yang Zhou and Difan Liu and Feng Liu and Kalyan Sunkavalli and Trung Bui and Hao Tan , title =. The Twelfth International Conference on Learning Representations,. 2024 , url =

  46. [46]

    The Twelfth International Conference on Learning Representations,

    Jiahao Li and Hao Tan and Kai Zhang and Zexiang Xu and Fujun Luan and Yinghao Xu and Yicong Hong and Kalyan Sunkavalli and Greg Shakhnarovich and Sai Bi , title =. The Twelfth International Conference on Learning Representations,. 2024 , url =

  47. [47]

    2023 , url =

    Ruoshi Liu and Rundi Wu and Basile Van Hoorick and Pavel Tokmakov and Sergey Zakharov and Carl Vondrick , title =. 2023 , url =. doi:10.1109/ICCV51070.2023.00853 , timestamp =

  48. [48]

    The Twelfth International Conference on Learning Representations,

    Yichun Shi and Peng Wang and Jianglong Ye and Long Mai and Kejie Li and Xiao Yang , title =. The Twelfth International Conference on Learning Representations,. 2024 , url =

  49. [49]

    CoRR , volume =

    Bardienus Pieter Duisterhof and Jan Oberst and Bowen Wen and Stan Birchfield and Deva Ramanan and Jeffrey Ichnowski , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2506.05285 , eprinttype =. 2506.05285 , timestamp =

  50. [50]

    Gen3r: 3d scene generation meets feed-forward reconstruction,

    Jiaxin Huang and Yuanbo Yang and Bangbang Yang and Lin Ma and Yuewen Ma and Yiyi Liao , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2601.04090 , eprinttype =. 2601.04090 , bibsource =

  51. [51]

    CoRR , volume =

    Rui Li and Biao Zhang and Zhenyu Li and Federico Tombari and Peter Wonka , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2504.18424 , eprinttype =. 2504.18424 , timestamp =

  52. [52]

    Reconviagen: Towards accurate multi- view 3d object reconstruction via generation.arXiv preprint arXiv:2510.23306, 2025

    Jiahao Chang and Chongjie Ye and Yushuang Wu and Yuantao Chen and Yidan Zhang and Zhongjin Luo and Chenghong Li and Yihao Zhi and Xiaoguang Han , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2510.23306 , eprinttype =. 2510.23306 , timestamp =

  53. [53]

    Cupid: Pose-grounded genera- tive 3d reconstruction from a single image.arXiv preprint arXiv:2510.20776, 2025

    Binbin Huang and Haobin Duan and Yiqun Zhao and Zibo Zhao and Yi Ma and Shenghua Gao , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2510.20776 , eprinttype =. 2510.20776 , timestamp =

  54. [54]

    arXiv2506.15442(2025) 10

    Team Hunyuan3D and Shuhui Yang and Mingxin Yang and Yifei Feng and Xin Huang and Sheng Zhang and Zebin He and Di Luo and Haolin Liu and Yunfei Zhao and Qingxiang Lin and Zeqiang Lai and Xianghui Yang and Huiwen Shi and Zibo Zhao and Bowen Zhang and Hongyu Yan and Lifu Wang and Sicong Liu and Jihong Zhang and Meng Chen and Liang Dong and Yiwen Jia and Yuli...

  55. [55]

    2021 , url =

    Stefan Stojanov and Anh Thai and James M. Rehg , title =. 2021 , url =. doi:10.1109/CVPR46437.2021.00184 , timestamp =

  56. [56]

    In: CVPR

    Le Xue and Ning Yu and Shu Zhang and Artemis Panagopoulou and Junnan Li and Roberto Mart. 2024 , url =. doi:10.1109/CVPR52733.2024.02558 , timestamp =

  57. [57]

    Uni3D: Exploring Unified 3D Representation at Scale , booktitle =

    Junsheng Zhou and Jinsheng Wang and Baorui Ma and Yu. Uni3D: Exploring Unified 3D Representation at Scale , booktitle =. 2024 , url =

  58. [58]

    SAM 3D: 3Dfy Anything in Images

    SAM and Xingyu Chen and Fu. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2511.16624 , eprinttype =. 2511.16624 , timestamp =

  59. [59]

    DINOv2: Learning Robust Visual Features without Supervision , journal =

    Maxime Oquab and Timoth. DINOv2: Learning Robust Visual Features without Supervision , journal =. 2024 , url =

  60. [60]

    CoRR , volume =

    Lo. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2511.18452 , eprinttype =. 2511.18452 , timestamp =

  61. [61]

    CoRR , volume =

    Zeqiang Lai and Yunfei Zhao and Zibo Zhao and Xin Yang and Xin Huang and Jingwei Huang and Xiangyu Yue and Chunchao Guo , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2511.16317 , eprinttype =. 2511.16317 , timestamp =

  62. [62]

    Partcrafter: Structured 3d mesh generation via compositional latent diffusion transformers.ArXiv, abs/2506.05573, 2025

    Yuchen Lin and Chenguo Lin and Panwang Pan and Honglei Yan and Yiqiang Feng and Yadong Mu and Katerina Fragkiadaki , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2506.05573 , eprinttype =. 2506.05573 , timestamp =

  63. [63]

    OmniPart: Part-Aware 3D Generation with Semantic Decoupling and Structural Cohesion , journal =

    Yunhan Yang and Yufan Zhou and Yuan. OmniPart: Part-Aware 3D Generation with Semantic Decoupling and Structural Cohesion , journal =. 2025 , url =. doi:10.48550/ARXIV.2507.06165 , eprinttype =. 2507.06165 , timestamp =

  64. [64]

    TEXGen: a Generative Diffusion Model for Mesh Textures , journal =

    Xin Yu and Ze Yuan and Yuan. TEXGen: a Generative Diffusion Model for Mesh Textures , journal =. 2024 , url =. doi:10.1145/3687909 , timestamp =

  65. [65]

    Hartley, A

    Richard Hartley and Andrew Zisserman , title =. 2004 , url =. doi:10.1017/CBO9780511811685 , isbn =

  66. [66]

    SAM 3: Segment Anything with Concepts

    Nicolas Carion and Laura Gustafson and Yuan. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2511.16719 , eprinttype =. 2511.16719 , timestamp =

  67. [67]

    Hi3dgen: High-fidelity 3d geometry generation from im- ages via normal bridging.arXiv preprint arXiv:2503.22236,

    Chongjie Ye and Yushuang Wu and Ziteng Lu and Jiahao Chang and Xiaoyang Guo and Jiaqing Zhou and Hao Zhao and Xiaoguang Han , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2503.22236 , eprinttype =. 2503.22236 , timestamp =

  68. [68]

    Qwen-Image Technical Report

    Chenfei Wu and Jiahao Li and Jingren Zhou and Junyang Lin and Kaiyuan Gao and Kun Yan and Shengming Yin and Shuai Bai and Xiao Xu and Yilei Chen and Yuxiang Chen and Zecheng Tang and Zekai Zhang and Zhengyi Wang and An Yang and Bowen Yu and Chen Cheng and Dayiheng Liu and Deqing Li and Hang Zhang and Hao Meng and Hu Wei and Jingyuan Ni and Kai Chen and Ku...

  69. [69]

    Diffusion Models for 3D Generation:

    Chen Wang and Hao. Diffusion Models for 3D Generation:. Computational Visual Media , volume =. 2025 , url =. doi:10.26599/CVM.2025.9450452 , timestamp =

  70. [70]

    In: CVPR

    Shunyuan Zheng and Boyao Zhou and Ruizhi Shao and Boning Liu and Shengping Zhang and Liqiang Nie and Yebin Liu , title =. 2024 , url =. doi:10.1109/CVPR52733.2024.01861 , timestamp =

  71. [71]

    Computational Visual Media , volume =

    Ming Meng and Yonggui Zhu and Yufei Zhao and Zhaoxin Li and Zhe Zhu , title =. Computational Visual Media , volume =. 2025 , url =. doi:10.26599/CVM.2025.9450438 , timestamp =

  72. [72]

    Computational Visual Media , volume =

    Zhexi Peng and Kun Zhou and Tianjia Shao , title =. Computational Visual Media , volume =. 2025 , url =. doi:10.26599/CVM.2025.9450513 , timestamp =

  73. [73]

    Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference,

    Xiao. Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference,. 2025 , url =. doi:10.1145/3721238.3730648 , timestamp =