pith. machine review for the scientific record. sign in

arxiv: 2605.03359 · v1 · submitted 2026-05-05 · 💻 cs.CV

Recognition: unknown

Mix3R: Mixing Feed-forward Reconstruction and Generative 3D Priors for Joint Multi-view Aligned 3D Reconstruction and Pose Estimation

Authors on Pith no claims yet

Pith reviewed 2026-05-08 01:24 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D reconstructionsparse viewpose estimationgenerative modelsfeed-forward networksmixture of transformersmulti-view alignmenttexture generation
0
0 comments X

The pith

Mix3R mixes feed-forward reconstruction with generative 3D priors to produce better-aligned shapes and more accurate pose estimates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a hybrid approach to sparse-view 3D reconstruction that unites the pixel-level alignment strengths of feed-forward methods with the complete geometry produced by generative models. It achieves this by running a sparse voxel stage that outputs aligned 3D structure, point maps, and camera parameters, followed by a texture stage that transfers input appearance without additional training. The design uses a Mixture-of-Transformers to let pretrained models exchange information while keeping their original strengths. A sympathetic reader would care because the result addresses the common gap between geometrically faithful but incomplete reconstructions and complete but misaligned generated shapes, offering a practical path to usable 3D outputs from limited input images.

Core claim

Mix3R generates a 3D shape in two stages: a sparse voxel generation stage that jointly produces a coarse 3D structure, per-view point maps, and camera parameters aligned to that structure, and a textured geometry generation stage that correctly places input textures onto the generated shape. This is enabled by a Mixture-of-Transformers architecture that inserts global self-attentions into pretrained feed-forward and generative models, plus an overlap-based attention bias added directly to a pretrained textured model. The mutual conditioning lets the feed-forward branch ground predictions in a generative prior and lets the generative branch receive geometrically informative features, yielding

What carries the argument

Mixture-of-Transformers architecture that inserts global self-attentions between a pretrained feed-forward reconstruction model and a pretrained 3D generative model, together with an overlap-based attention bias for training-free texture placement.

If this is right

  • The feed-forward branch learns to ground its predictions to a generative 3D prior.
  • The 3D generation branch receives geometrically informative features from the feed-forward branch.
  • Resulting 3D shapes show better input alignment than those from pure generative methods.
  • Camera pose estimates are more accurate than those produced by previous feed-forward reconstruction methods.
  • Textures are placed correctly onto generated shapes in a training-free manner.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar cross-attention insertions between pretrained models could be tested on related tasks such as novel-view synthesis or 3D-aware image editing.
  • The training-free texture transfer step suggests the method could adapt quickly to new object categories or lighting conditions without full retraining.
  • If the architecture scales to higher resolutions, it may support applications requiring both geometric fidelity and visual detail from very few input views.

Load-bearing premise

Inserting global self-attentions into the pretrained models preserves their individual priors while creating the 2D-3D alignment needed for joint generation and correct texture transfer.

What would settle it

Quantitative evaluation on standard sparse-view benchmarks showing no statistically significant improvement in alignment metrics such as point-map consistency or pose accuracy measured by rotation and translation error compared with the separate feed-forward and generative baselines would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.03359 by Dongping Li, Hongwen Zhang, Liang An, Shaohui Jiao, Siyou Lin, Yebin Liu, Zhou Xue.

Figure 1
Figure 1. Figure 1: The overall architecture of our two-stage framework. Given multi-view unposed input images, we first employ a mixture view at source ↗
Figure 2
Figure 2. Figure 2: The block matching configuration of our MoT architecture. According to different matching types, our network has three different view at source ↗
Figure 3
Figure 3. Figure 3: Illustrations of different block mixture architectures. view at source ↗
Figure 4
Figure 4. Figure 4: We exhibit the reprojection alignment. Each rendering result is obtained using the decoded 3D Gaussians and the predicted view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative results of novel-view rendering evaluation. We show input images and novel-view GT images. Our method more view at source ↗
Figure 6
Figure 6. Figure 6: More qualitative results of novel-view rendering evaluation. view at source ↗
Figure 7
Figure 7. Figure 7: More qualitative results of novel-view rendering evaluation. view at source ↗
Figure 8
Figure 8. Figure 8: Qualitative results for real-world cellphone captures. view at source ↗
read the original abstract

Recent trends in sparse-view 3D reconstruction have taken two different paths: feed-forward reconstruction that predicts pixel-aligned point maps without a complete geometry, and generative 3D reconstruction that generates complete geometry but often with poor input-alignment. We present Mix3R, a novel generative 3D reconstruction method which mixes feed-forward reconstruction and 3D generation into a single framework in an aligned manner. Mix3R generates a 3D shape in two stages: a sparse voxel generation stage and a textured geometry generation stage. Unlike pure generative methods, our first-stage generation jointly produces a coarse 3D structure (sparse voxels), per-view point maps and camera parameters aligned to that 3D structure. This is made possible by introducing a Mixture-of-Transformers architecture that inserts global self-attentions to a feed-forward reconstruction model and a 3D generative model, both pretrained on large-scale data. This design effectively retains the pretrained priors but enables better 2D-3D alignment. Based on the initial aligned generations of sparse 3D voxels and point maps, we compute an overlap-based attention bias that is directly added to another pretrained textured geometry generation model, enabling it to correctly place input textures onto generated shapes in a training-free manner. Our design brings mutual benefits to both feed-forward reconstruction and 3D generation: The feed-forward branch learns to ground its predictions to a generative 3D prior, and conversely, the 3D generation branch is conditioned on geometrically informative features from the feed-forward branch. As a result, our method produces 3D shapes with better input alignment compared with pure 3D generative methods, together with camera pose estimations more accurate than previous feed-forward reconstruction methods. Our project page is at https://jsnln.github.io/mix3r/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper introduces Mix3R, a method for joint multi-view aligned 3D reconstruction and pose estimation that mixes feed-forward reconstruction and generative 3D priors. It operates in two stages: a sparse voxel generation stage that jointly produces coarse 3D structure (sparse voxels), per-view point maps, and aligned camera parameters via a Mixture-of-Transformers architecture inserting global self-attentions into pretrained feed-forward and generative models; and a textured geometry generation stage that adds an overlap-based attention bias to another pretrained model for training-free texture placement. The design claims mutual benefits, yielding 3D shapes with better input alignment than pure generative methods and more accurate camera poses than prior feed-forward methods.

Significance. If the reported comparisons and ablations hold, the work is significant for bridging alignment limitations in feed-forward methods with completeness in generative 3D reconstruction from sparse views. The retention of large-scale pretrained priors through architectural mixing, combined with the training-free overlap bias, offers an efficient path to aligned outputs without full retraining. This could impact applications in robotics and AR/VR by providing geometrically grounded and textured models.

minor comments (3)
  1. The description of the Mixture-of-Transformers would benefit from a diagram or explicit pseudocode showing how global self-attentions are inserted and fused across the feed-forward and generative branches to retain priors while enabling 2D-3D alignment.
  2. The overlap-based attention bias in the second stage is described conceptually; including the precise formula or algorithm for its computation from the initial sparse voxels and point maps would enhance reproducibility.
  3. The abstract states improvements in alignment and pose accuracy but does not reference specific metrics or datasets; adding a brief quantitative summary would better support the central claims for readers.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of Mix3R, recognition of its significance in bridging feed-forward and generative approaches, and recommendation for minor revision. The description accurately captures the two-stage architecture, Mixture-of-Transformers design, and overlap-based attention bias. As the report lists no specific major comments, we have no points requiring rebuttal or clarification at this stage. We remain ready to incorporate any minor revisions requested by the editor.

Circularity Check

0 steps flagged

No significant circularity; architecture combines external pretrained models with novel insertions validated empirically

full rationale

The paper's core contribution is a Mixture-of-Transformers design that inserts global self-attention between a pretrained feed-forward reconstruction model and a pretrained 3D generative model, plus an overlap-based attention bias added to another pretrained model for training-free texture placement. These are presented as new architectural choices whose benefits are shown via direct comparisons (alignment metrics vs. pure generative baselines; pose accuracy vs. feed-forward baselines). No equations, parameters, or central claims reduce by construction to inputs defined within the paper itself; no self-citation chains, uniqueness theorems, or fitted quantities renamed as predictions appear in the derivation. The method remains self-contained against external benchmarks and pretrained priors.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the method builds on unspecified pretrained models and assumes the new attention mechanisms function as described.

pith-pipeline@v0.9.0 · 5664 in / 1192 out tokens · 48216 ms · 2026-05-08T01:24:21.290757+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

85 extracted references · 20 canonical work pages · 5 internal anchors

  1. [1]

    Multidiffusion: Fusing diffusion paths for controlled image generation,

    Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel. Multidiffusion: Fusing diffusion paths for controlled image generation.arXiv preprint arXiv:2302.08113, 2023. 2

  2. [2]

    Chan, Connor Z

    Eric R. Chan, Connor Z. Lin, Matthew A. Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas Guibas, Jonathan Tremblay, Sameh Khamis, Tero Karras, and Gordon Wetzstein. Efficient geometry-aware 3D generative adversarial networks. InCVPR, 2022. 2

  3. [3]

    Reconviagen: Towards accurate multi- view 3d object reconstruction via generation.arXiv preprint arXiv:2510.23306, 2025

    Jiahao Chang, Chongjie Ye, Yushuang Wu, Yuantao Chen, Yidan Zhang, Zhongjin Luo, Chenghong Li, Yihao Zhi, and Xiaoguang Han. Reconviagen: Towards accurate multi- view 3d object reconstruction via generation.arXiv preprint arXiv:2510.23306, 2025. 2, 3, 7, 14, 15

  4. [4]

    Single-stage diffusion nerf: A unified approach to 3d generation and reconstruction

    Hansheng Chen, Jiatao Gu, Anpei Chen, Wei Tian, Zhuowen Tu, Lingjie Liu, and Hao Su. Single-stage diffusion nerf: A unified approach to 3d generation and reconstruction. In ICCV, 2023. 2

  5. [5]

    TTT3r: 3d reconstruction as test-time train- ing

    Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, and Anpei Chen. TTT3r: 3d reconstruction as test-time train- ing. InThe Fourteenth International Conference on Learn- ing Representations, 2026. 3

  6. [6]

    Ultra3d: Efficient and high- fidelity 3d generation with part attention, 2025

    Yiwen Chen, Zhihao Li, Yikai Wang, Hu Zhang, Qin Li, Chi Zhang, and Guosheng Lin. Ultra3d: Efficient and high- fidelity 3d generation with part attention, 2025. 2, 3

  7. [7]

    Yago Vicente, Thomas Dideriksen, Himanshu Arora, Matthieu Guillaumin, and Jitendra Malik

    Jasmine Collins, Shubham Goel, Kenan Deng, Achlesh- war Luthra, Leon Xu, Erhan Gundogdu, Xi Zhang, Tomas F. Yago Vicente, Thomas Dideriksen, Himanshu Arora, Matthieu Guillaumin, and Jitendra Malik. Abo: Dataset and benchmarks for real-world 3d object understanding. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR...

  8. [8]

    Vision transformers need registers, 2023

    Timoth ´ee Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers, 2023. 4, 6

  9. [9]

    CoRR , volume =

    Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Chris- tian Laforte, Vikram V oleti, Samir Yitzhak Gadre, Eli VanderBilt, Aniruddha Kembhavi, Carl V ondrick, Georgia Gkioxari, Kiana Ehsani, Ludwig Schmidt, and Ali Farhadi. Objaverse-xl: A universe of 10m+ 3d objects.arXiv preprint arXiv:2307.05663, 2023. 15

  10. [10]

    Gram: Generative radiance manifolds for 3d-aware image generation

    Yu Deng, Jiaolong Yang, Jianfeng Xiang, and Xin Tong. Gram: Generative radiance manifolds for 3d-aware image generation. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022. 2

  11. [11]

    Google scanned objects: A high-quality dataset of 3d scanned house- hold items.2022 International Conference on Robotics and Automation (ICRA), pages 2553–2560, 2022

    Laura Downs, Anthony Francis, Nate Koenig, Bran- don Kinman, Ryan Michael Hickman, Krista Reymann, Thomas Barlow McHugh, and Vincent Vanhoucke. Google scanned objects: A high-quality dataset of 3d scanned house- hold items.2022 International Conference on Robotics and Automation (ICRA), pages 2553–2560, 2022. 7, 16

  12. [12]

    Get3d: A generative model of high quality 3d tex- tured shapes learned from images

    Jun Gao, Tianchang Shen, Zian Wang, Wenzheng Chen, Kangxue Yin, Daiqing Li, Or Litany, Zan Gojcic, and Sanja Fidler. Get3d: A generative model of high quality 3d tex- tured shapes learned from images. InAdvances In Neural Information Processing Systems, 2022. 2

  13. [13]

    ReconViaGen.https:// github.com/estheryang11/ReconViaGen, 2025

    Github user estheryang11. ReconViaGen.https:// github.com/estheryang11/ReconViaGen, 2025. GitHub repository, accessed 2026-01-22. 15

  14. [14]

    Gvgen: Text-to-3d generation with volumetric rep- resentation

    Xianglong He, Junyi Chen, Sida Peng, Di Huang, Yang- guang Li, Xiaoshui Huang, Chun Yuan, Wanli Ouyang, and Tong He. Gvgen: Text-to-3d generation with volumetric rep- resentation. InComputer Vision – ECCV 2024: 18th Eu- ropean Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part VIII, page 463–479, Berlin, Heidel- berg, 2024. Springer...

  15. [15]

    LRM: Large reconstruction model for single image to 3d

    Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. LRM: Large reconstruction model for single image to 3d. InThe Twelfth International Conference on Learning Representations, 2024. 3

  16. [16]

    Neural wavelet-domain diffusion for 3d shape gen- eration, inversion, and manipulation.ACM Trans

    Jingyu Hu, Ka-Hei Hui, Zhengzhe Liu, Ruihui Li, and Chi- Wing Fu. Neural wavelet-domain diffusion for 3d shape gen- eration, inversion, and manipulation.ACM Trans. Graph., 43 (2), 2024. 2

  17. [17]

    Cupid: Pose-grounded genera- tive 3d reconstruction from a single image.arXiv preprint arXiv:2510.20776, 2025

    Binbin Huang, Haobin Duan, Yiqun Zhao, Zibo Zhao, Yi Ma, and Shenghua Gao. Cupid: Pose-grounded genera- tive 3d reconstruction from a single image.arXiv preprint arXiv:2510.20776, 2025. 2, 3

  18. [18]

    Gen3r: 3d scene generation meets feed-forward reconstruction,

    Jiaxin Huang, Yuanbo Yang, Bangbang Yang, Lin Ma, Yuewen Ma, and Yiyi Liao. Gen3r: 3d scene generation meets feed-forward reconstruction.ArXiv, abs/2601.04090,

  19. [19]

    Few-view object reconstruction with unknown cate- gories and camera poses.International Conference on 3D Vision (3DV), 2024

    Hanwen Jiang, Zhenyu Jiang, Kristen Grauman, and Yuke Zhu. Few-view object reconstruction with unknown cate- gories and camera poses.International Conference on 3D Vision (3DV), 2024. 3

  20. [20]

    LEAP: Liberate sparse-view 3d modeling from camera poses

    Hanwen Jiang, Zhenyu Jiang, Yue Zhao, and Qixing Huang. LEAP: Liberate sparse-view 3d modeling from camera poses. InThe Twelfth International Conference on Learn- ing Representations, 2024. 3

  21. [21]

    Anysplat: Feed-forward 3d gaussian splatting from unconstrained views.arXiv preprint arXiv:2505.23716,

    Lihan Jiang, Yucheng Mao, Linning Xu, Tao Lu, Kerui Ren, Yichen Jin, Xudong Xu, Mulin Yu, Jiangmiao Pang, Feng Zhao, et al. Anysplat: Feed-forward 3d gaussian splatting from unconstrained views.arXiv preprint arXiv:2505.23716,

  22. [22]

    V o, Patrick Labatut, and Piotr Bojanowski

    Cijo Jose, Th ´eo Moutakanni, Dahyun Kang, Federico Baldassarre, Timoth ´ee Darcet, Hu Xu, Daniel Li, Marc Szafraniec, Micha¨el Ramamonjisoa, Maxime Oquab, Oriane Sim´eoni, Huy V . V o, Patrick Labatut, and Piotr Bojanowski. Dinov2 meets text: A unified framework for image- and pixel-level vision-language alignment, 2024. 4, 6

  23. [23]

    MapAnything: Universal Feed-Forward Metric 3D Reconstruction

    Nikhil Keetha, Norman M ¨uller, Johannes Sch ¨onberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno 11 Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, Jonathon Luiten, Manuel Lopez-Antequera, Samuel Rota Bul`o, Christian Richardt, Deva Ramanan, Sebastian Scherer, and Peter Kontschieder. MapAnything: Universal feed- forward metric 3D reconstruc...

  24. [24]

    3d gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics, 42 (4), 2023

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimk ¨uhler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics, 42 (4), 2023. 3

  25. [25]

    Chang, and Manolis Savva

    Mukul Khanna*, Yongsen Mao*, Hanxiao Jiang, Sanjay Haresh, Brennan Shacklett, Dhruv Batra, Alexander Clegg, Eric Undersander, Angel X. Chang, and Manolis Savva. Habitat Synthetic Scenes Dataset (HSSD-200): An Analy- sis of 3D Scene Scale and Realism Tradeoffs for ObjectGoal Navigation.arXiv preprint, 2023. 15

  26. [26]

    Adam: A Method for Stochastic Optimization

    Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization.CoRR, abs/1412.6980, 2014. 15

  27. [27]

    Epnp: An accurate o(n) solution to the pnp problem.Int

    Vincent Lepetit, Francesc Moreno-Noguer, and Pascal Fua. Epnp: An accurate o(n) solution to the pnp problem.Int. J. Comput. Vision, 81(2):155–166, 2009. 2

  28. [28]

    MV-SAM3D: Adaptive Multi-View Fusion for Layout-Aware 3D Generation

    Baicheng Li, Dong Wu, Jun Li, Shunkai Zhou, Zecui Zeng, Lusong Li, and Hongbin Zha. Mv-sam3d: Adaptive multi- view fusion for layout-aware 3d generation.arXiv preprint arXiv:2603.11633, 2026. 7, 15

  29. [29]

    Crafts- man3d: High-fidelity mesh generation with 3d native dif- fusion and interactive geometry refiner

    Weiyu Li, Jiarui Liu, Hongyu Yan, Rui Chen, Yixun Liang, Xuelin Chen, Ping Tan, and Xiaoxiao Long. Crafts- man3d: High-fidelity mesh generation with 3d native dif- fusion and interactive geometry refiner. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5307–5317, 2025. 2, 3

  30. [30]

    arXiv preprint arXiv:2505.14521 , year=

    Zhihao Li, Yufei Wang, Heliang Zheng, Yihao Luo, and Bihan Wen. Sparc3d: Sparse representation and construc- tion for high-resolution 3d shapes modeling.arXiv preprint arXiv:2505.14521, 2025. 2, 3

  31. [31]

    Mixture-of- transformers: A sparse and scalable architecture for multi- modal foundation models.Transactions on Machine Learn- ing Research, 2025

    Weixin Liang, Lili Yu, Liang Luo, Srini Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen tau Yih, Luke Zettlemoyer, and Xi Victoria Lin. Mixture-of- transformers: A sparse and scalable architecture for multi- modal foundation models.Transactions on Machine Learn- ing Research, 2025. 2, 4, 5, 8

  32. [32]

    Depth Anything 3: Recovering the Visual Space from Any Views

    Haotong Lin, Sili Chen, Jun Hao Liew, Donny Y . Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: recovering the visual space from any views. arXiv preprint arXiv:2511.10647, 2025. 1

  33. [33]

    Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maxim- ilian Nickel, and Matthew Le. Flow matching for genera- tive modeling. InThe Eleventh International Conference on Learning Representations, 2023. 3, 4

  34. [34]

    Diffusion probabilistic models for 3d point cloud generation

    Shitong Luo and Wei Hu. Diffusion probabilistic models for 3d point cloud generation. In2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2836–2844, 2021. 2

  35. [35]

    Srinivasan, Matthew Tancik, Jonathan T

    Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view syn- thesis. InECCV, 2020. 3

  36. [36]

    Diffrf: Rendering-guided 3d radiance field diffusion

    Norman M ¨uller, Yawar Siddiqui, Lorenzo Porzi, Samuel Rota Bulo, Peter Kontschieder, and Matthias Nießner. Diffrf: Rendering-guided 3d radiance field diffusion. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4328–4338, 2023. 2

  37. [37]

    Maxime Oquab, Timoth ´ee Darcet, Theo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Rus- sell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang- Wen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nico- las Ballas, Gabriel Synnaeve, Ishan Misra, Herve Jegou, Julien Mairal, Patri...

  38. [38]

    Rasterized edge gradients: Handling discontinuities differentiably.arXiv preprint arXiv:2405.02508, 2024

    Stanislav Pidhorskyi, Tomas Simon, Gabriel Schwartz, He Wen, Yaser Sheikh, and Jason Saragih. Rasterized edge gradients: Handling discontinuities differentiably.arXiv preprint arXiv:2405.02508, 2024. 6, 14

  39. [39]

    PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space

    Charles R Qi, Li Yi, Hao Su, and Leonidas J Guibas. Point- net++: Deep hierarchical feature learning on point sets in a metric space.arXiv preprint arXiv:1706.02413, 2017. 15

  40. [40]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 2022. 2, 3

  41. [41]

    Pifu: Pixel-aligned implicit function for high-resolution clothed human digitiza- tion

    Shunsuke Saito, Zeng Huang, Ryota Natsume, Shigeo Mor- ishima, Angjoo Kanazawa, and Hao Li. Pifu: Pixel-aligned implicit function for high-resolution clothed human digitiza- tion. InProceedings of the IEEE/CVF International Confer- ence on Computer Vision (ICCV), 2019. 3

  42. [42]

    Pifuhd: Multi-level pixel-aligned implicit function for high-resolution 3d human digitization

    Shunsuke Saito, Tomas Simon, Jason Saragih, and Hanbyul Joo. Pifuhd: Multi-level pixel-aligned implicit function for high-resolution 3d human digitization. InCVPR, 2020. 3

  43. [43]

    Sam 3d: 3dfy anything in images

    SAM3DTeam, Xingyu Chen, Fu-Jen Chu, Pierre Gleize, Kevin J Liang, Alexander Sax, Hao Tang, Weiyao Wang, Michelle Guo, Thibaut Hardin, Xiang Li, Aohan Lin, Jiawei Liu, Ziqi Ma, Anushka Sagar, Bowen Song, Xiaodong Wang, Jianing Yang, Bowen Zhang, Piotr Doll´ar, Georgia Gkioxari, Matt Feiszli, and Jitendra Malik. Sam 3d: 3dfy anything in images. 2025. 3, 15

  44. [44]

    Structure-from-motion revisited

    Johannes Lutz Sch ¨onberger and Jan-Michael Frahm. Structure-from-motion revisited. InConference on Com- puter Vision and Pattern Recognition (CVPR), 2016. 1

  45. [45]

    Pixelwise view selection for un- structured multi-view stereo

    Johannes Lutz Sch ¨onberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for un- structured multi-view stereo. InEuropean Conference on Computer Vision (ECCV), 2016. 1

  46. [46]

    Neural point cloud diffusion for disentangled 3d shape and appearance generation

    Philipp Schr ¨oppel, Christopher Wewer, Jan Eric Lenssen, Eddy Ilg, and Thomas Brox. Neural point cloud diffusion for disentangled 3d shape and appearance generation. InCVPR,

  47. [47]

    Ryan Shue, Eric Ryan Chan, Ryan Po, Zachary Ankner, Jiajun Wu, and Gordon Wetzstein

    J. Ryan Shue, Eric Ryan Chan, Ryan Po, Zachary Ankner, Jiajun Wu, and Gordon Wetzstein. 3d neural field generation using triplane diffusion. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20875–20886, 2023. 2 12

  48. [48]

    3d generation on imagenet

    Ivan Skorokhodov, Aliaksandr Siarohin, Yinghao Xu, Jian Ren, Hsin-Ying Lee, Peter Wonka, and Sergey Tulyakov. 3d generation on imagenet. InInternational Conference on Learning Representations, 2023. 2

  49. [49]

    Stefan Stojanov, Anh Thai, and James M. Rehg. Using shape to categorize: Low-shot learning with an explicit shape bias

  50. [50]

    V olumediffu- sion: Flexible text-to-3d generation with efficient volumetric encoder, 2023

    Zhicong Tang, Shuyang Gu, Chunyu Wang, Ting Zhang, Jianmin Bao, Dong Chen, and Baining Guo. V olumediffu- sion: Flexible text-to-3d generation with efficient volumetric encoder, 2023. 2

  51. [51]

    Hunyuan3d 1.0: A unified framework for text-to-3d and image-to-3d generation, 2024

    Tencent Hunyuan3D Team. Hunyuan3d 1.0: A unified framework for text-to-3d and image-to-3d generation, 2024. 2, 3

  52. [52]

    Hunyuan3d 2.0: Scaling diffu- sion models for high resolution textured 3d assets generation,

    Tencent Hunyuan3D Team. Hunyuan3d 2.0: Scaling diffu- sion models for high resolution textured 3d assets generation,

  53. [53]

    Hunyuan3d 2.5: Towards high- fidelity 3d assets generation with ultimate details, 2025

    Tencent Hunyuan3D Team. Hunyuan3d 2.5: Towards high- fidelity 3d assets generation with ultimate details, 2025. 2, 3, 18

  54. [54]

    PoseDiffusion: Solving pose estimation via diffusion-aided bundle adjustment

    Jianyuan Wang, Christian Rupprecht, and David Novotny. PoseDiffusion: Solving pose estimation via diffusion-aided bundle adjustment. 2023. 8

  55. [55]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. 1, 2, 3, 4, 5, 14

  56. [56]

    PF-LRM: Pose-free large reconstruction model for joint pose and shape prediction

    Peng Wang, Hao Tan, Sai Bi, Yinghao Xu, Fujun Luan, Kalyan Sunkavalli, Wenping Wang, Zexiang Xu, and Kai Zhang. PF-LRM: Pose-free large reconstruction model for joint pose and shape prediction. InThe Twelfth International Conference on Learning Representations, 2024. 3

  57. [57]

    Efros, and Angjoo Kanazawa

    Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A. Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10510–10522, 2025. 3

  58. [58]

    Dust3r: Geometric 3d vi- sion made easy

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vi- sion made easy. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20697–20709, 2024. 3

  59. [59]

    Rodin: A generative model for sculpting 3d digital avatars using diffusion

    Tengfei Wang, Bo Zhang, Ting Zhang, Shuyang Gu, Jianmin Bao, Tadas Baltrusaitis, Jingjing Shen, Dong Chen, Fang Wen, Qifeng Chen, and Baining Guo. Rodin: A generative model for sculpting 3d digital avatars using diffusion. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4563–4573, 2023. 2

  60. [60]

    Chen, and Bohan Zhuang

    Weijie Wang, Yeqing Chen, Zeyu Zhang, Hengyu Liu, Haoxiao Wang, Zhiyuan Feng, Wenkang Qin, Zheng Zhu, Donny Y . Chen, and Bohan Zhuang. V olsplat: Rethinking feed-forward 3d gaussian splatting with voxel-aligned pre- diction.arXiv preprint arXiv:2509.19297, 2025. 3

  61. [61]

    1, 2, 3, 5, 14

    Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He.π 3: Scalable permutation-equivariant visual geometry learning, 2025. 1, 2, 3, 5, 14

  62. [62]

    Unilat3d: Geometry-appearance unified latents for single-stage 3d generation.arXiv preprint arXiv:2509.25079, 2025

    Guanjun Wu, Jiemin Fang, Chen Yang, Sikuang Li, Tao- ran Yi, Jia Lu, Zanwei Zhou, Jiazhong Cen, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Xinggang Wang, and Qi Tian. Unilat3d: Geometry-appearance unified latents for single-stage 3d generation.arXiv preprint arXiv:2509.25079, 2025. 2, 3, 8, 15

  63. [63]

    Free- man, and Joshua B

    Jiajun Wu, Chengkai Zhang, Tianfan Xue, William T. Free- man, and Joshua B. Tenenbaum. Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. InProceedings of the 30th International Con- ference on Neural Information Processing Systems, page 82–90, Red Hook, NY , USA, 2016. Curran Associates Inc. 2

  64. [64]

    Direct3d: scalable image-to-3d generation via 3d latent diffusion transformer

    Shuang Wu, Youtian Lin, Feihu Zhang, Yifei Zeng, Jingxi Xu, Philip Torr, Xun Cao, and Yao Yao. Direct3d: scalable image-to-3d generation via 3d latent diffusion transformer. InProceedings of the 38th International Conference on Neu- ral Information Processing Systems, Red Hook, NY , USA,

  65. [65]

    Curran Associates Inc. 2, 3

  66. [66]

    Direct3d-s2: Gigascale 3d generation made easy with spatial sparse attention.arXiv preprint arXiv:2505.17412, 2025

    Shuang Wu, Youtian Lin, Feihu Zhang, Yifei Zeng, Yikang Yang, Yajie Bao, Jiachen Qian, Siyu Zhu, Philip Torr, Xun Cao, and Yao Yao. Direct3d-s2: Gigascale 3d generation made easy with spatial sparse attention.arXiv preprint arXiv:2505.17412, 2025. 2, 3

  67. [67]

    Omniobject3d: Large-vocabulary 3d object dataset for realistic perception, reconstruction and generation

    Tong Wu, Jiarui Zhang, Xiao Fu, Yuxin Wang, Liang Pan Jiawei Ren, Wayne Wu, Lei Yang, Jiaqi Wang, Chen Qian, Dahua Lin, and Ziwei Liu. Omniobject3d: Large-vocabulary 3d object dataset for realistic perception, reconstruction and generation. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 17

  68. [68]

    Native and compact structured latents for 3d generation.Tech report,

    Jianfeng Xiang, Xiaoxue Chen, Sicheng Xu, Ruicheng Wang, Zelong Lv, Yu Deng, Hongyuan Zhu, Yue Dong, Hao Zhao, Nicholas Jing Yuan, and Jiaolong Yang. Native and compact structured latents for 3d generation.Tech report,

  69. [69]

    Structured 3d latents for scalable and versatile 3d generation

    Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong, and Jiaolong Yang. Structured 3d latents for scalable and versatile 3d generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21469–21480, 2025. 2, 3, 5, 6, 8, 15

  70. [70]

    Yuliang Xiu, Jinlong Yang, Dimitrios Tzionas, and Michael J. Black. ICON: Implicit Clothed humans Obtained from Normals. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13296–13306, 2022. 3

  71. [71]

    Yuliang Xiu, Jinlong Yang, Xu Cao, Dimitrios Tzionas, and Michael J. Black. ECON: Explicit Clothed humans Optimized via Normal integration. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 3

  72. [72]

    Sparp: Fast 3d object reconstruction and pose estimation from sparse views

    Chao Xu, Ang Li, Linghao Chen, Yulin Liu, Ruoxi Shi, Hao Su, and Minghua Liu. Sparp: Fast 3d object reconstruction and pose estimation from sparse views. InComputer Vision – ECCV 2024, pages 143–163, Cham, 2024. Springer Nature Switzerland. 3

  73. [73]

    InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models

    Jiale Xu, Weihao Cheng, Yiming Gao, Xintao Wang, Shenghua Gao, and Ying Shan. InstantMesh: Efficient 3d 13 mesh generation from a single image with sparse-view large reconstruction models.arXiv preprint arXiv:2404.07191,

  74. [74]

    Freesplatter: Pose- free gaussian splatting for sparse-view 3d reconstruction

    Jiale Xu, Shenghua Gao, and Ying Shan. Freesplatter: Pose- free gaussian splatting for sparse-view 3d reconstruction. arXiv preprint arXiv:2412.09573, 2024. 3

  75. [75]

    Cast: Component-aligned 3d scene reconstruction from an rgb image.ACM Trans

    Kaixin Yao, Longwen Zhang, Xinhao Yan, Yan Zeng, Qix- uan Zhang, Lan Xu, Wei Yang, Jiayuan Gu, and Jingyi Yu. Cast: Component-aligned 3d scene reconstruction from an rgb image.ACM Trans. Graph., 44(4), 2025. 3

  76. [76]

    Hi3dgen: High-fidelity 3d geometry generation from im- ages via normal bridging.arXiv preprint arXiv:2503.22236,

    Chongjie Ye, Yushuang Wu, Ziteng Lu, Jiahao Chang, Xi- aoyang Guo, Jiaqing Zhou, Hao Zhao, and Xiaoguang Han. Hi3dgen: High-fidelity 3d geometry generation from im- ages via normal bridging.arXiv preprint arXiv:2503.22236,

  77. [77]

    3dshape2vecset: A 3d shape representation for neural fields and generative diffusion models.ACM Trans

    Biao Zhang, Jiapeng Tang, Matthias Nießner, and Peter Wonka. 3dshape2vecset: A 3d shape representation for neural fields and generative diffusion models.ACM Trans. Graph., 42(4), 2023. 3

  78. [78]

    Rodinhd: High-fidelity 3d avatar generation with diffusion models

    Bowen Zhang, Yiji Cheng, Chunyu Wang, Ting Zhang, Jiao- long Yang, Yansong Tang, Feng Zhao, Dong Chen, and Bain- ing Guo. Rodinhd: High-fidelity 3d avatar generation with diffusion models. InComputer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part XIV, page 465–483, Berlin, Hei- delberg, 2024. Sp...

  79. [79]

    Gaussiancube: Structuring gaussian splatting using optimal transport for 3d generative modeling

    Bowen Zhang, Yiji Cheng, Jiaolong Yang, Chunyu Wang, Feng Zhao, Yansong Tang, Dong Chen, and Baining Guo. Gaussiancube: Structuring gaussian splatting using optimal transport for 3d generative modeling. 2024. 2

  80. [80]

    MonST3r: A simple approach for estimating geometry in the presence of motion

    Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jam- pani, Trevor Darrell, Forrester Cole, Deqing Sun, and Ming- Hsuan Yang. MonST3r: A simple approach for estimating geometry in the presence of motion. InThe Thirteenth In- ternational Conference on Learning Representations, 2025. 3

Showing first 80 references.