arxiv: 2605.10922 · v1 · submitted 2026-05-11 · 💻 cs.CV

Recognition: no theorem link

Pixal3D: Pixel-Aligned 3D Generation from Images

Dong-Yang Li , Wang Zhao , Yuxin Chen , Wenbo Hu , Meng-Hao Guo , Fang-Lue Zhang , Ying Shan , Shi-Min Hu

Authors on Pith no claims yet

Pith reviewed 2026-05-12 03:46 UTC · model grok-4.3

classification 💻 cs.CV

keywords 3D generationpixel alignmentimage to 3Dfeature projection3D reconstructionmulti-view 3Dscene synthesis

0 comments

The pith

Pixal3D generates 3D assets from images by aligning each generated point directly to input pixels via feature back-projection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Pixal3D to address low pixel-level fidelity in current image-to-3D models. It identifies that generating shapes in canonical space creates ambiguous pixel-to-3D links. Instead, Pixal3D uses a back-projection method to lift image features straight into 3D space, matching the input view. This leads to higher fidelity outputs that approach reconstruction quality and works for both objects and scenes from single or multiple views. Readers would care if it makes turning photos into accurate 3D models more reliable without extra steps.

Core claim

Pixal3D demonstrates that generating 3D in a pixel-aligned manner, rather than in canonical space, resolves the correspondence issue by explicitly lifting multi-scale image features into a 3D feature volume through pixel back-projection. This establishes unambiguous direct pixel-to-3D associations, enabling scalable high-fidelity 3D generation from images that approaches the fidelity of traditional reconstruction techniques.

What carries the argument

The pixel back-projection conditioning scheme that lifts multi-scale 2D image features into a 3D feature volume to create explicit pixel-to-3D mappings.

If this is right

High-quality 3D assets can be generated with substantially improved fidelity.
The method scales to produce 3D models at higher resolutions with realistic appearance.
Multi-view inputs can be handled by aggregating back-projected feature volumes from different views.
A modular pipeline extends the approach to synthesize high-fidelity 3D scenes with separated objects.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This pixel-aligned approach may generalize to other 3D tasks like novel view synthesis without additional training.
It could simplify pipelines by reducing reliance on attention mechanisms for conditioning.
Applications in AR or robotics might benefit from more accurate 3D models derived from casual photos.

Load-bearing premise

That mapping image features back to 3D coordinates creates a direct and unambiguous link between each pixel and its corresponding 3D location without creating alignment errors or requiring further adjustments.

What would settle it

A test set of generated 3D models where rendered views from the input angle show pixel-level differences from the original image exceeding those of standard reconstruction methods, or where the model fails to maintain consistency at higher resolutions.

Figures

Figures reproduced from arXiv: 2605.10922 by Dong-Yang Li, Fang-Lue Zhang, Meng-Hao Guo, Shi-Min Hu, Wang Zhao, Wenbo Hu, Ying Shan, Yuxin Chen.

**Figure 1.** Figure 1: Pixel-aligned meshes generated by Pixal3D. The foreground displays our results with their corresponding input images in the background. Our [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗

**Figure 2.** Figure 2: Overview of the Pixal3D framework. The framework consists of three key components: (1) Pixel-Aligned Structured Latent Representation Learning [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Illustration of the Back-projection Conditioning Scheme. [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Qualitative comparisons of single view 3D generation. [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 5.** Figure 5: Qualitative comparison of single-view 3D generation on in-the-wild images. [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: Qualitative comparison of multi-view 3D generation on Toys4K. [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 7.** Figure 7: Qualitative comparison on 3D scene generation. [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗

**Figure 8.** Figure 8: Ablation study on key components. decreases while reconstruction cues become stronger, a trend consistently observed in our results. This behavior is also a fundamental principle and objective of 3D generative reconstruction. 4.3 3D Scene Generation We extend Pixal3D to scene generation, as discussed in Sec. 3.3 [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗

read the original abstract

Recent advances in 3D generative models have rapidly improved image-to-3D synthesis quality, enabling higher-resolution geometry and more realistic appearance. Yet fidelity, which measures pixel-level faithfulness of the generated 3D asset to the input image, still remains a central bottleneck. We argue this stems from an implicit 2D-3D correspondence issue: most 3D-native generators synthesize shape in canonical space and inject image cues via attention, leaving pixel-to-3D associations ambiguous. To tackle this issue, we draw inspiration from 3D reconstruction and propose Pixal3D, a pixel-aligned 3D generation paradigm for high-fidelity 3D asset creation from images. Instead of generating in a canonical pose, Pixal3D directly generates 3D in a pixel-aligned way, consistent with the input view. To enable this, we introduce a pixel back-projection conditioning scheme that explicitly lifts multi-scale image features into a 3D feature volume, establishing direct pixel-to-3D correspondence without ambiguity. We show that Pixal3D is not only scalable and capable of producing high-quality 3D assets, but also substantially improves fidelity, approaching the fidelity level of reconstruction. Furthermore, Pixal3D naturally extends to multi-view generation by aggregating back-projected feature volumes across views. Finally, we show pixel-aligned generation benefits scene synthesis, and present a modular pipeline that produces high-fidelity, object-separated 3D scenes from images. Pixal3D for the first time demonstrates 3D-native pixel-aligned generation at scale, and provides a new inspiring way towards high-fidelity 3D generation of object or scene from single or multi-view images. Project page: https://ldyang694.github.io/projects/pixal3d/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Pixal3D's back-projection of image features into a 3D volume is a straightforward attempt to tighten pixel-to-3D correspondence in generative models, but the single-view depth ambiguity still looks unresolved by construction.

read the letter

Pixal3D shifts 3D generation out of canonical space and into a view-aligned setup by lifting multi-scale image features via pixel back-projection into a 3D feature volume. The abstract frames this as removing the implicit correspondence problem that attention-based methods leave behind, and it claims the result approaches reconstruction-level fidelity while scaling to multi-view inputs and full scenes with object separation.

Referee Report

2 major / 2 minor

Summary. The paper proposes Pixal3D, a 3D-native generative model for high-fidelity image-to-3D synthesis. It identifies an implicit 2D-3D correspondence problem in prior canonical-space generators and introduces a pixel-aligned generation paradigm that operates directly in the input view. The core technical contribution is a pixel back-projection conditioning scheme that lifts multi-scale 2D image features into a 3D feature volume to create explicit pixel-to-3D associations. The authors claim this yields substantially higher fidelity (approaching reconstruction quality), scales to high-resolution assets, extends naturally to multi-view inputs via volume aggregation, and supports modular scene-level synthesis with object separation.

Significance. If the back-projection mechanism demonstrably resolves correspondence without introducing ray-smearing artifacts or requiring compensatory regularization that limits generalization, the work would represent a meaningful advance in scalable, high-fidelity 3D generation. The explicit pixel-alignment approach and its extension to scenes are conceptually attractive and could influence future architectures that prioritize faithfulness over canonical-pose generation. The paper's emphasis on being 3D-native rather than 2D-lifted is a strength worth highlighting if supported by rigorous experiments.

major comments (2)

[Abstract / Method] Abstract and method description: The claim that pixel back-projection 'establishes direct pixel-to-3D correspondence without ambiguity' is load-bearing for all fidelity and 'approaching reconstruction' assertions. For monocular inputs, back-projecting each pixel feature along its camera ray (or discrete depth samples) into a 3D volume inherently produces a one-to-many mapping; the resulting feature volume encodes smeared information along rays. The manuscript must explicitly show how depth ambiguity is resolved (e.g., via learned depth prediction, multi-view constraints, or architectural inductive biases) rather than asserted to be removed by construction. Without this, the central correspondence advantage reduces to a reparameterization that still requires the network to disambiguate.
[Experiments] Experiments section: The abstract asserts 'qualitative and comparative improvements' and 'substantially improves fidelity' yet supplies no quantitative metrics (PSNR, LPIPS, Chamfer distance, IoU, etc.), ablation tables, or error analysis. This absence prevents verification of whether the pixel-aligned scheme actually outperforms strong baselines on faithfulness while maintaining 3D consistency. Load-bearing claims require at least one table reporting these metrics across single-view, multi-view, and scene settings.

minor comments (2)

[Abstract] Abstract: 'Pixal3D' spelling should be confirmed for consistency; the project page link is useful but the manuscript should include a brief statement on reproducibility (code release, model weights).
[Method] The multi-view aggregation and scene pipeline are described at high level; a diagram or pseudocode in the method section would clarify how back-projected volumes are combined without introducing view-inconsistency artifacts.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to strengthen the presentation of our contributions.

read point-by-point responses

Referee: [Abstract / Method] Abstract and method description: The claim that pixel back-projection 'establishes direct pixel-to-3D correspondence without ambiguity' is load-bearing for all fidelity and 'approaching reconstruction' assertions. For monocular inputs, back-projecting each pixel feature along its camera ray (or discrete depth samples) into a 3D volume inherently produces a one-to-many mapping; the resulting feature volume encodes smeared information along rays. The manuscript must explicitly show how depth ambiguity is resolved (e.g., via learned depth prediction, multi-view constraints, or architectural inductive biases) rather than asserted to be removed by construction. Without this, the central correspondence advantage reduces to a reparameterization that still requires the network to disambiguate.

Authors: We agree that the original wording overstated the automatic removal of ambiguity. Back-projection creates explicit pixel-to-ray associations in the 3D volume, but monocular depth disambiguation is performed by the generative network through learned 3D priors, multi-scale feature aggregation, and training objectives that encourage geometric consistency. We have revised the abstract and method sections to explicitly describe this mechanism, including the role of the 3D U-Net architecture and any inductive biases that help resolve ray-smearing. We have also added a short discussion of limitations for highly ambiguous cases. revision: yes
Referee: [Experiments] Experiments section: The abstract asserts 'qualitative and comparative improvements' and 'substantially improves fidelity' yet supplies no quantitative metrics (PSNR, LPIPS, Chamfer distance, IoU, etc.), ablation tables, or error analysis. This absence prevents verification of whether the pixel-aligned scheme actually outperforms strong baselines on faithfulness while maintaining 3D consistency. Load-bearing claims require at least one table reporting these metrics across single-view, multi-view, and scene settings.

Authors: We acknowledge that quantitative metrics are necessary to substantiate the fidelity claims. The original submission emphasized qualitative results and visual comparisons because defining precise 3D ground truth for generative tasks is non-trivial; however, this is insufficient for the load-bearing assertions. In the revised manuscript we have added a new quantitative evaluation section and table reporting PSNR, LPIPS, Chamfer distance, and IoU (where applicable) for single-view, multi-view, and scene-level synthesis, together with ablations isolating the back-projection component. revision: yes

Circularity Check

0 steps flagged

No significant circularity: method is a novel architectural proposal without reduction to inputs or self-citations.

full rationale

The paper introduces Pixal3D as a new pixel-aligned 3D generation paradigm that uses a pixel back-projection conditioning scheme to lift multi-scale image features into a 3D feature volume. This is framed as an explicit design choice inspired by 3D reconstruction to address correspondence ambiguity, rather than a mathematical derivation or prediction that reduces by construction to fitted parameters or prior self-cited results. No equations, uniqueness theorems, or ansatzes are quoted that would indicate self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations. The central claims about scalability, fidelity gains, and extension to multi-view/scene synthesis rest on the proposed architecture's empirical performance, which remains independently verifiable and does not tautologically restate its inputs. This is a standard case of a self-contained method paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the assumption that back-projection creates unambiguous correspondence and that the resulting 3D feature volume can be used directly by a generative model without further unspecified constraints. No explicit free parameters or invented physical entities are named in the abstract.

axioms (1)

domain assumption Neural networks can learn to generate consistent 3D structure from lifted 2D features when trained on appropriate data.
Implicit in any learned 3D generative model; invoked when claiming scalability and high-quality output.

invented entities (1)

Pixel back-projection conditioning scheme no independent evidence
purpose: To lift multi-scale image features into a 3D feature volume establishing direct pixel-to-3D correspondence.
New mechanism introduced to solve the implicit 2D-3D correspondence issue; no independent evidence outside the method itself is provided in the abstract.

pith-pipeline@v0.9.0 · 5650 in / 1436 out tokens · 31671 ms · 2026-05-12T03:46:53.484266+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · 7 internal anchors

[1]

Objaverse-XL:

Matt Deitke and Ruoshi Liu and Matthew Wallingford and Huong Ngo and Oscar Michel and Aditya Kusupati and Alan Fan and Christian Laforte and Vikram Voleti and Samir Yitzhak Gadre and Eli VanderBilt and Aniruddha Kembhavi and Carl Vondrick and Georgia Gkioxari and Kiana Ehsani and Ludwig Schmidt and Ali Farhadi , editor =. Objaverse-XL:. Advances in Neural...

work page 2023
[2]

Point-E: A System for Generating 3D Point Clouds from Complex Prompts

Alex Nichol and Heewoo Jun and Prafulla Dhariwal and Pamela Mishkin and Mark Chen , title =. CoRR , volume =. 2022 , url =. doi:10.48550/ARXIV.2212.08751 , eprinttype =. 2212.08751 , timestamp =

work page internal anchor Pith review doi:10.48550/arxiv.2212.08751 2022
[3]

3DShape2VecSet:

Biao Zhang and Jiapeng Tang and Matthias Nie. 3DShape2VecSet:. ACM Transactions on Graphics (TOG) , volume =. 2023 , url =. doi:10.1145/3592442 , timestamp =

work page doi:10.1145/3592442 2023
[4]

ACM Transactions on Graphics (TOG) , volume =

Longwen Zhang and Ziyu Wang and Qixuan Zhang and Qiwei Qiu and Anqi Pang and Haoran Jiang and Wei Yang and Lan Xu and Jingyi Yu , title =. ACM Transactions on Graphics (TOG) , volume =. 2024 , url =. doi:10.1145/3658146 , timestamp =

work page doi:10.1145/3658146 2024
[5]

What’s in the image? a deep-dive into the vision of vision language models

Weiyu Li and Jiarui Liu and Hongyu Yan and Rui Chen and Yixun Liang and Xuelin Chen and Ping Tan and Xiaoxiao Long , title =. 2025 , url =. doi:10.1109/CVPR52734.2025.00500 , timestamp =

work page doi:10.1109/cvpr52734.2025.00500 2025
[6]

Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation

Zibo Zhao and Zeqiang Lai and Qingxiang Lin and Yunfei Zhao and Haolin Liu and Shuhui Yang and Yifei Feng and Mingxin Yang and Sheng Zhang and Xianghui Yang and Huiwen Shi and Sicong Liu and Junta Wu and Yihang Lian and Fan Yang and Ruining Tang and Zebin He and Xinzhou Wang and Jian Liu and Xuhui Zuo and Zhuo Chen and Biwen Lei and Haohan Weng and Jing X...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2501.12202 2025
[7]

Triposg: High-fidelity 3d shape synthesis using large-scale rectified flow models.arXiv preprint arXiv:2502.06608, 2025

Yangguang Li and Zi. TripoSG: High-Fidelity 3D Shape Synthesis using Large-Scale Rectified Flow Models , journal =. 2025 , url =. doi:10.48550/ARXIV.2502.06608 , eprinttype =. 2502.06608 , timestamp =

work page doi:10.48550/arxiv.2502.06608 2025
[8]

What’s in the image? a deep-dive into the vision of vision language models

Jianfeng Xiang and Zelong Lv and Sicheng Xu and Yu Deng and Ruicheng Wang and Bowen Zhang and Dong Chen and Xin Tong and Jiaolong Yang , title =. 2025 , url =. doi:10.1109/CVPR52734.2025.02000 , timestamp =

work page doi:10.1109/cvpr52734.2025.02000 2025
[9]

Direct3d-s2: Gigascale 3d generation made easy with spatial sparse attention.arXiv preprint arXiv:2505.17412, 2025

Shuang Wu and Youtian Lin and Feihu Zhang and Yifei Zeng and Yikang Yang and Yajie Bao and Jiachen Qian and Siyu Zhu and Xun Cao and Philip Torr and Yao Yao , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2505.17412 , eprinttype =. 2505.17412 , timestamp =

work page doi:10.48550/arxiv.2505.17412 2025
[10]

arXiv preprint arXiv:2505.14521 , year=

Zhihao Li and Yufei Wang and Heliang Zheng and Yihao Luo and Bihan Wen , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2505.14521 , eprinttype =. 2505.14521 , timestamp =

work page doi:10.48550/arxiv.2505.14521 2025
[11]

Sparseflex: High-resolution and arbitrary-topology 3d shape modeling,

Xianglong He and Zi. SparseFlex: High-Resolution and Arbitrary-Topology 3D Shape Modeling , journal =. 2025 , url =. doi:10.48550/ARXIV.2503.21732 , eprinttype =. 2503.21732 , timestamp =

work page doi:10.48550/arxiv.2503.21732 2025
[12]

Native and compact structured latents for 3d generation.arXiv preprint arXiv:2512.14692, 2025

Jianfeng Xiang and Xiaoxue Chen and Sicheng Xu and Ruicheng Wang and Zelong Lv and Yu Deng and Hongyuan Zhu and Yue Dong and Hao Zhao and Nicholas Jing Yuan and Jiaolong Yang , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2512.14692 , eprinttype =. 2512.14692 , bibsource =

work page doi:10.48550/arxiv.2512.14692 2025
[13]

CoRR , volume =

Yihao Luo and Xianglong He and Chuanyu Pan and Yiwen Chen and Jiaqi Wu and Yangguang Li and Wanli Ouyang and Yuanming Hu and Guang Yang and Choon Hwai Yap , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2511.04029 , eprinttype =. 2511.04029 , timestamp =

work page doi:10.48550/arxiv.2511.04029 2025
[14]

Lattice: Democratize high-fidelity 3d generation at scale.arXiv preprint arXiv:2512.03052, 2025

Zeqiang Lai and Yunfei Zhao and Zibo Zhao and Haolin Liu and Qingxiang Lin and Jingwei Huang and Chunchao Guo and Xiangyu Yue , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2512.03052 , eprinttype =. 2512.03052 , bibsource =

work page doi:10.48550/arxiv.2512.03052 2025
[15]

Warren , title =

Tao Ju and Frank Losasso and Scott Schaefer and Joe D. Warren , title =. ACM Transactions on Graphics (TOG) , volume =. 2002 , url =. doi:10.1145/566654.566586 , timestamp =

work page doi:10.1145/566654.566586 2002
[16]

The Thirteenth International Conference on Learning Representations,

Yushi Lan and Shangchen Zhou and Zhaoyang Lyu and Fangzhou Hong and Shuai Yang and Bo Dai and Xingang Pan and Chen Change Loy , title =. The Thirteenth International Conference on Learning Representations,. 2025 , url =

work page 2025
[17]

Direct3D: Scalable Image-to-3D Generation via 3D Latent Diffusion Transformer , booktitle =

Shuang Wu and Youtian Lin and Yifei Zeng and Feihu Zhang and Jingxi Xu and Philip Torr and Xun Cao and Yao Yao , editor =. Direct3D: Scalable Image-to-3D Generation via 3D Latent Diffusion Transformer , booktitle =. 2024 , url =

work page 2024
[18]

Black and Derek Nowrouzezahrai and Liam Paull and Weiyang Liu , title =

Zhen Liu and Yao Feng and Michael J. Black and Derek Nowrouzezahrai and Liam Paull and Weiyang Liu , title =. The Eleventh International Conference on Learning Representations,. 2023 , url =

work page 2023
[19]

OctFusion: Octree-based Diffusion Models for 3D Shape Generation , journal =

Bojun Xiong and Si. OctFusion: Octree-based Diffusion Models for 3D Shape Generation , journal =. 2025 , url =. doi:10.1111/CGF.70198 , timestamp =

work page doi:10.1111/cgf.70198 2025
[20]

Barron and Ben Mildenhall , title =

Ben Poole and Ajay Jain and Jonathan T. Barron and Ben Mildenhall , title =. The Eleventh International Conference on Learning Representations,. 2023 , url =

work page 2023
[21]

ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation , booktitle =

Zhengyi Wang and Cheng Lu and Yikai Wang and Fan Bao and Chongxuan Li and Hang Su and Jun Zhu , editor =. ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation , booktitle =. 2023 , url =

work page 2023
[22]

Johannes L. Sch. Structure-from-Motion Revisited , booktitle =. 2016 , url =. doi:10.1109/CVPR.2016.445 , timestamp =

work page doi:10.1109/cvpr.2016.445 2016
[23]

Johannes L. Sch. Pixelwise View Selection for Unstructured Multi-View Stereo , booktitle =. 2016 , url =. doi:10.1007/978-3-319-46487-9\_31 , timestamp =

work page doi:10.1007/978-3-319-46487-9 2016
[24]

DeepMVS: Learning Multi-View Stereopsis , booktitle =

Po. DeepMVS: Learning Multi-View Stereopsis , booktitle =. 2018 , url =. doi:10.1109/CVPR.2018.00298 , timestamp =

work page doi:10.1109/cvpr.2018.00298 2018
[25]

MVSNet: Depth Inference for Unstructured Multi-view Stereo , booktitle =

Yao Yao and Zixin Luo and Shiwei Li and Tian Fang and Long Quan , editor =. MVSNet: Depth Inference for Unstructured Multi-view Stereo , booktitle =. 2018 , url =. doi:10.1007/978-3-030-01237-3\_47 , timestamp =

work page doi:10.1007/978-3-030-01237-3 2018
[26]

DPSNet: End-to-end Deep Plane Sweep Stereo , booktitle =

Sunghoon Im and Hae. DPSNet: End-to-end Deep Plane Sweep Stereo , booktitle =. 2019 , url =

work page 2019
[27]

Atlas: End-to-End 3D Scene Reconstruction from Posed Images , booktitle =

Zak Murez and Tarrence van As and James Bartolozzi and Ayan Sinha and Vijay Badrinarayanan and Andrew Rabinovich , editor =. Atlas: End-to-End 3D Scene Reconstruction from Posed Images , booktitle =. 2020 , url =. doi:10.1007/978-3-030-58571-6\_25 , timestamp =

work page doi:10.1007/978-3-030-58571-6 2020
[28]

2021 , url =

Jiaming Sun and Yiming Xie and Linghao Chen and Xiaowei Zhou and Hujun Bao , title =. 2021 , url =. doi:10.1109/CVPR46437.2021.01534 , timestamp =

work page doi:10.1109/cvpr46437.2021.01534 2021
[29]

In: CVPR

Shuzhe Wang and Vincent Leroy and Yohann Cabon and Boris Chidlovskii and J. DUSt3R: Geometric 3D Vision Made Easy , booktitle =. 2024 , url =. doi:10.1109/CVPR52733.2024.01956 , timestamp =

work page doi:10.1109/cvpr52733.2024.01956 2024
[30]

What’s in the image? a deep-dive into the vision of vision language models

Jianyuan Wang and Minghao Chen and Nikita Karaev and Andrea Vedaldi and Christian Rupprecht and David Novotn. 2025 , url =. doi:10.1109/CVPR52734.2025.00499 , timestamp =

work page doi:10.1109/cvpr52734.2025.00499 2025
[31]

What’s in the image? a deep-dive into the vision of vision language models

Zhenggang Tang and Yuchen Fan and Dilin Wang and Hongyu Xu and Rakesh Ranjan and Alexander G. Schwing and Zhicheng Yan , title =. 2025 , url =. doi:10.1109/CVPR52734.2025.00498 , timestamp =

work page doi:10.1109/cvpr52734.2025.00498 2025
[32]

What’s in the image? a deep-dive into the vision of vision language models

Ruicheng Wang and Sicheng Xu and Cassie Dai and Jianfeng Xiang and Yu Deng and Xin Tong and Jiaolong Yang , title =. 2025 , url =. doi:10.1109/CVPR52734.2025.00496 , timestamp =

work page doi:10.1109/cvpr52734.2025.00496 2025
[33]

What’s in the image? a deep-dive into the vision of vision language models

Jianing Yang and Alexander Sax and Kevin J. Liang and Mikael Henaff and Hao Tang and Ang Cao and Joyce Chai and Franziska Meier and Matt Feiszli , title =. 2025 , url =. doi:10.1109/CVPR52734.2025.02042 , timestamp =

work page doi:10.1109/cvpr52734.2025.02042 2025
[34]

In: CVPR

Lihe Yang and Bingyi Kang and Zilong Huang and Xiaogang Xu and Jiashi Feng and Hengshuang Zhao , title =. 2024 , url =. doi:10.1109/CVPR52733.2024.00987 , timestamp =

work page doi:10.1109/cvpr52733.2024.00987 2024
[35]

2023 , url =

Wei Yin and Chi Zhang and Hao Chen and Zhipeng Cai and Gang Yu and Kaixuan Wang and Xiaozhi Chen and Chunhua Shen , title =. 2023 , url =. doi:10.1109/ICCV51070.2023.00830 , timestamp =

work page doi:10.1109/iccv51070.2023.00830 2023
[36]

Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-Shot Cross-Dataset Transfer , journal =

Ren. Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-Shot Cross-Dataset Transfer , journal =. 2022 , url =. doi:10.1109/TPAMI.2020.3019967 , timestamp =

work page doi:10.1109/tpami.2020.3019967 2022
[37]

Depth Anything 3: Recovering the Visual Space from Any Views

Haotong Lin and Sili Chen and Junhao Liew and Donny Y. Chen and Zhenyu Li and Guang Shi and Jiashi Feng and Bingyi Kang , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2511.10647 , eprinttype =. 2511.10647 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2511.10647 2025
[38]

IEEE Transactions on PatternAnalysisandMachineIntelligence46(12),10579–10596(Dec2024).https: //doi.org/10.1109/tpami.2024.3444912,http://dx.doi.org/10.1109/TPAMI

Mu Hu and Wei Yin and Chi Zhang and Zhipeng Cai and Xiaoxiao Long and Hao Chen and Kaixuan Wang and Gang Yu and Chunhua Shen and Shaojie Shen , title =. 2024 , url =. doi:10.1109/TPAMI.2024.3444912 , timestamp =

work page doi:10.1109/tpami.2024.3444912 2024
[39]

In: CVPR

Bingxin Ke and Anton Obukhov and Shengyu Huang and Nando Metzger and Rodrigo Caye Daudt and Konrad Schindler , title =. 2024 , url =. doi:10.1109/CVPR52733.2024.00907 , timestamp =

work page doi:10.1109/cvpr52733.2024.00907 2024
[40]

Stablenormal: Reducing diffusion variance for stable and sharp normal.ACM Transactions on Graphics (TOG), 43(6):1–18, 2024

Chongjie Ye and Lingteng Qiu and Xiaodong Gu and Qi Zuo and Yushuang Wu and Zilong Dong and Liefeng Bo and Yuliang Xiu and Xiaoguang Han , title =. ACM Transactions on Graphics (TOG) , volume =. 2024 , url =. doi:10.1145/3687971 , timestamp =

work page doi:10.1145/3687971 2024
[41]

GeoWizard: Unleashing the Diffusion Priors for 3D Geometry Estimation from a Single Image , booktitle =

Xiao Fu and Wei Yin and Mu Hu and Kaixuan Wang and Yuexin Ma and Ping Tan and Shaojie Shen and Dahua Lin and Xiaoxiao Long , editor =. GeoWizard: Unleashing the Diffusion Priors for 3D Geometry Estimation from a Single Image , booktitle =. 2024 , url =. doi:10.1007/978-3-031-72670-5\_14 , timestamp =

work page doi:10.1007/978-3-031-72670-5 2024
[42]

In: 2025 International Conference on 3D Vision (3DV)

Stanislaw Szymanowicz and Eldar Insafutdinov and Chuanxia Zheng and Dylan Campbell and Jo. Flash3D: Feed-Forward Generalisable 3D Scene Reconstruction from a Single Image , booktitle =. 2025 , url =. doi:10.1109/3DV66043.2025.00067 , timestamp =

work page doi:10.1109/3dv66043.2025.00067 2025
[43]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

Bolt3d: Generating 3d scenes in seconds , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

work page
[44]

MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details

Ruicheng Wang and Sicheng Xu and Yue Dong and Yu Deng and Jianfeng Xiang and Zelong Lv and Guangzhong Sun and Xin Tong and Jiaolong Yang , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2507.02546 , eprinttype =. 2507.02546 , timestamp =

work page internal anchor Pith review doi:10.48550/arxiv.2507.02546 2025
[45]

The Twelfth International Conference on Learning Representations,

Yicong Hong and Kai Zhang and Jiuxiang Gu and Sai Bi and Yang Zhou and Difan Liu and Feng Liu and Kalyan Sunkavalli and Trung Bui and Hao Tan , title =. The Twelfth International Conference on Learning Representations,. 2024 , url =

work page 2024
[46]

The Twelfth International Conference on Learning Representations,

Jiahao Li and Hao Tan and Kai Zhang and Zexiang Xu and Fujun Luan and Yinghao Xu and Yicong Hong and Kalyan Sunkavalli and Greg Shakhnarovich and Sai Bi , title =. The Twelfth International Conference on Learning Representations,. 2024 , url =

work page 2024
[47]

2023 , url =

Ruoshi Liu and Rundi Wu and Basile Van Hoorick and Pavel Tokmakov and Sergey Zakharov and Carl Vondrick , title =. 2023 , url =. doi:10.1109/ICCV51070.2023.00853 , timestamp =

work page doi:10.1109/iccv51070.2023.00853 2023
[48]

The Twelfth International Conference on Learning Representations,

Yichun Shi and Peng Wang and Jianglong Ye and Long Mai and Kejie Li and Xiao Yang , title =. The Twelfth International Conference on Learning Representations,. 2024 , url =

work page 2024
[49]

CoRR , volume =

Bardienus Pieter Duisterhof and Jan Oberst and Bowen Wen and Stan Birchfield and Deva Ramanan and Jeffrey Ichnowski , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2506.05285 , eprinttype =. 2506.05285 , timestamp =

work page doi:10.48550/arxiv.2506.05285 2025
[50]

Gen3r: 3d scene generation meets feed-forward reconstruction,

Jiaxin Huang and Yuanbo Yang and Bangbang Yang and Lin Ma and Yuewen Ma and Yiyi Liao , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2601.04090 , eprinttype =. 2601.04090 , bibsource =

work page doi:10.48550/arxiv.2601.04090 2025
[51]

CoRR , volume =

Rui Li and Biao Zhang and Zhenyu Li and Federico Tombari and Peter Wonka , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2504.18424 , eprinttype =. 2504.18424 , timestamp =

work page doi:10.48550/arxiv.2504.18424 2025
[52]

Reconviagen: Towards accurate multi- view 3d object reconstruction via generation.arXiv preprint arXiv:2510.23306, 2025

Jiahao Chang and Chongjie Ye and Yushuang Wu and Yuantao Chen and Yidan Zhang and Zhongjin Luo and Chenghong Li and Yihao Zhi and Xiaoguang Han , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2510.23306 , eprinttype =. 2510.23306 , timestamp =

work page doi:10.48550/arxiv.2510.23306 2025
[53]

Cupid: Pose-grounded genera- tive 3d reconstruction from a single image.arXiv preprint arXiv:2510.20776, 2025

Binbin Huang and Haobin Duan and Yiqun Zhao and Zibo Zhao and Yi Ma and Shenghua Gao , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2510.20776 , eprinttype =. 2510.20776 , timestamp =

work page doi:10.48550/arxiv.2510.20776 2025
[54]

arXiv2506.15442(2025) 10

Team Hunyuan3D and Shuhui Yang and Mingxin Yang and Yifei Feng and Xin Huang and Sheng Zhang and Zebin He and Di Luo and Haolin Liu and Yunfei Zhao and Qingxiang Lin and Zeqiang Lai and Xianghui Yang and Huiwen Shi and Zibo Zhao and Bowen Zhang and Hongyu Yan and Lifu Wang and Sicong Liu and Jihong Zhang and Meng Chen and Liang Dong and Yiwen Jia and Yuli...

work page doi:10.48550/arxiv.2506.15442 2025
[55]

2021 , url =

Stefan Stojanov and Anh Thai and James M. Rehg , title =. 2021 , url =. doi:10.1109/CVPR46437.2021.00184 , timestamp =

work page doi:10.1109/cvpr46437.2021.00184 2021
[56]

In: CVPR

Le Xue and Ning Yu and Shu Zhang and Artemis Panagopoulou and Junnan Li and Roberto Mart. 2024 , url =. doi:10.1109/CVPR52733.2024.02558 , timestamp =

work page doi:10.1109/cvpr52733.2024.02558 2024
[57]

Uni3D: Exploring Unified 3D Representation at Scale , booktitle =

Junsheng Zhou and Jinsheng Wang and Baorui Ma and Yu. Uni3D: Exploring Unified 3D Representation at Scale , booktitle =. 2024 , url =

work page 2024
[58]

SAM 3D: 3Dfy Anything in Images

SAM and Xingyu Chen and Fu. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2511.16624 , eprinttype =. 2511.16624 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2511.16624 2025
[59]

DINOv2: Learning Robust Visual Features without Supervision , journal =

Maxime Oquab and Timoth. DINOv2: Learning Robust Visual Features without Supervision , journal =. 2024 , url =

work page 2024
[60]

CoRR , volume =

Lo. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2511.18452 , eprinttype =. 2511.18452 , timestamp =

work page doi:10.48550/arxiv.2511.18452 2025
[61]

CoRR , volume =

Zeqiang Lai and Yunfei Zhao and Zibo Zhao and Xin Yang and Xin Huang and Jingwei Huang and Xiangyu Yue and Chunchao Guo , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2511.16317 , eprinttype =. 2511.16317 , timestamp =

work page doi:10.48550/arxiv.2511.16317 2025
[62]

Partcrafter: Structured 3d mesh generation via compositional latent diffusion transformers.ArXiv, abs/2506.05573, 2025

Yuchen Lin and Chenguo Lin and Panwang Pan and Honglei Yan and Yiqiang Feng and Yadong Mu and Katerina Fragkiadaki , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2506.05573 , eprinttype =. 2506.05573 , timestamp =

work page doi:10.48550/arxiv.2506.05573 2025
[63]

OmniPart: Part-Aware 3D Generation with Semantic Decoupling and Structural Cohesion , journal =

Yunhan Yang and Yufan Zhou and Yuan. OmniPart: Part-Aware 3D Generation with Semantic Decoupling and Structural Cohesion , journal =. 2025 , url =. doi:10.48550/ARXIV.2507.06165 , eprinttype =. 2507.06165 , timestamp =

work page doi:10.48550/arxiv.2507.06165 2025
[64]

TEXGen: a Generative Diffusion Model for Mesh Textures , journal =

Xin Yu and Ze Yuan and Yuan. TEXGen: a Generative Diffusion Model for Mesh Textures , journal =. 2024 , url =. doi:10.1145/3687909 , timestamp =

work page doi:10.1145/3687909 2024
[65]

Hartley, A

Richard Hartley and Andrew Zisserman , title =. 2004 , url =. doi:10.1017/CBO9780511811685 , isbn =

work page doi:10.1017/cbo9780511811685 2004
[66]

SAM 3: Segment Anything with Concepts

Nicolas Carion and Laura Gustafson and Yuan. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2511.16719 , eprinttype =. 2511.16719 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2511.16719 2025
[67]

Hi3dgen: High-fidelity 3d geometry generation from im- ages via normal bridging.arXiv preprint arXiv:2503.22236,

Chongjie Ye and Yushuang Wu and Ziteng Lu and Jiahao Chang and Xiaoyang Guo and Jiaqing Zhou and Hao Zhao and Xiaoguang Han , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2503.22236 , eprinttype =. 2503.22236 , timestamp =

work page doi:10.48550/arxiv.2503.22236 2025
[68]

Qwen-Image Technical Report

Chenfei Wu and Jiahao Li and Jingren Zhou and Junyang Lin and Kaiyuan Gao and Kun Yan and Shengming Yin and Shuai Bai and Xiao Xu and Yilei Chen and Yuxiang Chen and Zecheng Tang and Zekai Zhang and Zhengyi Wang and An Yang and Bowen Yu and Chen Cheng and Dayiheng Liu and Deqing Li and Hang Zhang and Hao Meng and Hu Wei and Jingyuan Ni and Kai Chen and Ku...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2508.02324 2025
[69]

Diffusion Models for 3D Generation:

Chen Wang and Hao. Diffusion Models for 3D Generation:. Computational Visual Media , volume =. 2025 , url =. doi:10.26599/CVM.2025.9450452 , timestamp =

work page doi:10.26599/cvm.2025.9450452 2025
[70]

In: CVPR

Shunyuan Zheng and Boyao Zhou and Ruizhi Shao and Boning Liu and Shengping Zhang and Liqiang Nie and Yebin Liu , title =. 2024 , url =. doi:10.1109/CVPR52733.2024.01861 , timestamp =

work page doi:10.1109/cvpr52733.2024.01861 2024
[71]

Computational Visual Media , volume =

Ming Meng and Yonggui Zhu and Yufei Zhao and Zhaoxin Li and Zhe Zhu , title =. Computational Visual Media , volume =. 2025 , url =. doi:10.26599/CVM.2025.9450438 , timestamp =

work page doi:10.26599/cvm.2025.9450438 2025
[72]

Computational Visual Media , volume =

Zhexi Peng and Kun Zhou and Tianjia Shao , title =. Computational Visual Media , volume =. 2025 , url =. doi:10.26599/CVM.2025.9450513 , timestamp =

work page doi:10.26599/cvm.2025.9450513 2025
[73]

Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference,

Xiao. Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference,. 2025 , url =. doi:10.1145/3721238.3730648 , timestamp =

work page doi:10.1145/3721238.3730648 2025