Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors
Pith reviewed 2026-05-10 14:46 UTC · model grok-4.3
The pith
3D foundation model latents injected via adapter enforce shape consistency in single-image orbital video generation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Conditioning a video diffusion model on a denoised global latent vector and on view-dependent latent images decoded from 3D volumetric features supplies complete object shape information that pixel-wise attention cannot provide. The features are delivered by a multi-scale 3D adapter that inserts them as tokens through cross-attention layers, allowing simple fine-tuning while preserving general video generation behavior. Experiments confirm that the added guidance improves visual quality, shape realism, and multi-view consistency on standard benchmarks and on complex trajectories with in-the-wild inputs.
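A minimal PyTorch sketch of how the two conditioning scales could be assembled into a single token sequence. This is an illustrative reading of the claim, not the authors' code: the names `z_global` and `view_latents` (the denoised global latent and the latent images projected from the volumetric features) and their shapes are assumptions about what the 3D foundation model exposes.

```python
import torch
import torch.nn as nn

class ShapeConditionTokens(nn.Module):
    """Map (i) a global shape latent and (ii) per-view latent images into one
    token sequence that a video diffusion model can attend to."""

    def __init__(self, d_global: int, d_view: int, d_model: int = 1024):
        super().__init__()
        self.proj_global = nn.Linear(d_global, d_model)  # (i) global latent -> 1 token
        self.proj_view = nn.Linear(d_view, d_model)      # (ii) latent-image pixels -> tokens

    def forward(self, z_global: torch.Tensor, view_latents: torch.Tensor) -> torch.Tensor:
        # z_global:     [B, d_global]    denoised global latent from the 3D model
        # view_latents: [B, V, C, h, w]  volumetric features projected into each
        #                                target camera view as small latent images
        B, V, C, h, w = view_latents.shape
        view_tokens = view_latents.flatten(3).transpose(2, 3).reshape(B, V * h * w, C)
        g = self.proj_global(z_global).unsqueeze(1)   # [B, 1, d_model]
        v = self.proj_view(view_tokens)               # [B, V*h*w, d_model]
        return torch.cat([g, v], dim=1)               # [B, 1 + V*h*w, d_model]
```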
What carries the argument
Multi-scale 3D adapter that injects a global latent vector and projected volumetric latent images from the 3D foundation model into the video model's cross-attention layers as conditioning tokens.
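One plausible form of such an adapter layer, sketched in PyTorch under the assumption that the conditioning tokens have already been built (for example by a module like `ShapeConditionTokens` above). The zero-initialized output gate is a common adapter trick for preserving pretrained behavior at the start of fine-tuning; it is an assumption here, not a confirmed detail of the paper.

```python
import torch
import torch.nn as nn

class Adapter3DCrossAttention(nn.Module):
    """Residual cross-attention from video tokens (queries) to 3D shape
    tokens (keys/values), inserted alongside a pretrained video block."""

    def __init__(self, d_model: int = 1024, n_heads: int = 16):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Zero-initialized gate: the adapter starts as a no-op, so the base
        # model's general video-generation behavior is retained initially.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, video_tokens: torch.Tensor, shape_tokens: torch.Tensor) -> torch.Tensor:
        # video_tokens: [B, T*H*W, d_model] latent video tokens of the base model
        # shape_tokens: [B, N, d_model]     global + view-dependent 3D tokens
        q = self.norm(video_tokens)
        out, _ = self.attn(q, shape_tokens, shape_tokens, need_weights=False)
        return video_tokens + self.gate * out
```

In a fine-tuning setup of this kind, only modules like these (plus the token projections) would be trained while the base video model stays frozen, which is consistent with the paper's claim of a simple, model-agnostic adaptation.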
If this is right
- Long-range extrapolation such as full rear-view synthesis becomes reliable because shape information is supplied independently of pixel overlap.
- No explicit mesh extraction is required, keeping inference faster than methods that reconstruct and render 3D geometry.
- The same adapter fine-tuning works across different base video models without retraining them from scratch.
- Performance holds for arbitrary camera paths and real photographs, not only simple orbits on clean data.
Where Pith is reading between the lines
- The same latent conditioning could be tested on non-video tasks such as single-image novel-view synthesis to see if consistency gains transfer.
- Replacing pixel attention with 3D latent guidance might allow shorter fine-tuning or smaller training sets for other motion types.
- If the 3D priors prove robust, they could reduce the need for multi-view capture in applications like product visualization or AR content.
Load-bearing premise
Latent codes extracted from the 3D foundation model will match the true geometry of any real-world object without creating new shape errors or domain mismatches.
What would settle it
Run the method on a collection of objects whose shapes lie far outside the 3D training distribution, such as highly concave or thin-part objects, and measure whether generated rear views retain correct 3D structure or exhibit collapses and distortions.
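A minimal sketch of that test, assuming synthetic out-of-distribution objects with known geometry so ground-truth orbital frames can be rendered. The hooks `generate_orbit` and `render_ground_truth` are hypothetical stand-ins for the method under test and a renderer, and PSNR is used only as an illustrative proxy for the quality and consistency metrics the paper actually reports.

```python
import torch

def psnr(a: torch.Tensor, b: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Peak signal-to-noise ratio for images scaled to [0, 1]."""
    return -10.0 * torch.log10(torch.mean((a - b) ** 2) + eps)

@torch.no_grad()
def front_vs_rear_quality(generate_orbit, render_ground_truth, objects, n_frames: int = 24):
    """Average PSNR on front-facing vs. rear-facing frames of a 360-degree orbit.
    A collapse in the rear score on out-of-distribution shapes would indicate
    the 3D prior is not supplying correct structure for unseen surfaces."""
    azimuth = torch.linspace(0.0, 360.0, n_frames)
    rear = (azimuth - 180.0).abs() < 90.0                  # frames facing the unseen side
    front_scores, rear_scores = [], []
    for obj in objects:
        gt = render_ground_truth(obj, n_frames)            # [T, 3, H, W] ground-truth orbit
        pred = generate_orbit(obj.input_image, n_frames)   # [T, 3, H, W] generated orbit
        front_scores.append(psnr(pred[~rear], gt[~rear]))
        rear_scores.append(psnr(pred[rear], gt[rear]))
    return torch.stack(front_scores).mean(), torch.stack(rear_scores).mean()
```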
Original abstract
We present a novel method for generating geometrically realistic and consistent orbital videos from a single image of an object. Existing video generation works mostly rely on pixel-wise attention to enforce view consistency across frames. However, such mechanism does not impose sufficient constraints for long-range extrapolation, e.g. rear-view synthesis, in which pixel correspondences to the input image are limited. Consequently, these works often fail to produce results with a plausible and coherent structure. To tackle this issue, we propose to leverage rich shape priors from a 3D foundational generative model as an auxiliary constraint, motivated by its capability of modeling realistic object shape distributions learned from large 3D asset corpora. Specifically, we prompt the video generation with two scales of latent features encoded by the 3D foundation model: (i) a denoised global latent vector as an overall structural guidance, and (ii) a set of latent images projected from volumetric features to provide view-dependent and fine-grained geometry details. In contrast to commonly used 2.5D representations such as depth or normal maps, these compact features can model complete object shapes, and help to improve inference efficiency by avoiding explicit mesh extraction. To achieve effective shape conditioning, we introduce a multi-scale 3D adapter to inject feature tokens to the base video model via cross-attention, which retains its capabilities from general video pretraining and enables a simple and model-agonistic fine-tuning process. Extensive experiments on multiple benchmarks show that our method achieves superior visual quality, shape realism and multi-view consistency compared to state-of-the-art methods, and robustly generalizes to complex camera trajectories and in-the-wild images.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a method for generating geometrically realistic and consistent orbital videos from a single input image by conditioning a video diffusion model with multi-scale latent features extracted from a 3D foundation model. These features consist of a denoised global latent vector for overall structure and projected latent images from volumetric representations for view-dependent geometry details. A multi-scale 3D adapter injects these tokens into the base model via cross-attention, enabling fine-tuning while preserving general video capabilities. The approach is motivated by limitations of pixel-wise attention for long-range view extrapolation (e.g., rear views) and claims to outperform state-of-the-art methods in visual quality, shape realism, and multi-view consistency while generalizing to complex camera trajectories and in-the-wild images.
Significance. If the central claims hold, the work could meaningfully advance 3D-aware video generation by demonstrating that compact latent priors from 3D foundation models can enforce plausible object structure more effectively than 2.5D cues or pure attention mechanisms, with efficiency gains from avoiding mesh extraction. This would be particularly relevant for applications requiring consistent novel-view video synthesis.
major comments (2)
- [Abstract] Abstract: The central claim of experimental superiority in visual quality, shape realism, and multi-view consistency is stated without any quantitative metrics, specific baseline names, ablation results, or error analysis. This directly affects the ability to assess whether the 3D priors deliver the asserted gains over pixel-attention baselines, especially for long-range consistency.
- [Method] Method (description of 3D feature encoding and adapter): The load-bearing assumption that latent features from the 3D foundation model (global vector plus projected volumetric latents) supply accurate, unbiased shape guidance for arbitrary in-the-wild images is not accompanied by targeted validation against domain shift. Since 3D foundation models are typically trained on synthetic asset corpora, any mismatch in geometry or appearance distributions could produce incomplete or biased conditioning signals that undermine the reported improvements in rear-view consistency and generalization; concrete tests (e.g., synthetic-to-real transfer ablations or failure-case analysis on real images) are required.
minor comments (1)
- [Abstract] Abstract: 'model-agonistic' appears to be a typographical error for 'model-agnostic'.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight opportunities to strengthen the presentation of our experimental claims and the validation of our 3D priors. We respond to each major comment below and indicate the revisions we will make.
Point-by-point responses
Referee: [Abstract] Abstract: The central claim of experimental superiority in visual quality, shape realism, and multi-view consistency is stated without any quantitative metrics, specific baseline names, ablation results, or error analysis. This directly affects the ability to assess whether the 3D priors deliver the asserted gains over pixel-attention baselines, especially for long-range consistency.
Authors: We agree that the abstract would be more informative if it referenced key quantitative results supporting the superiority claims. The full manuscript presents these details, including metric comparisons and baseline evaluations, in the Experiments section. We will revise the abstract to concisely include references to the primary quantitative findings and baseline names, enabling readers to better evaluate the contributions of the 3D priors upfront while retaining the detailed tables and ablations in the main body. revision: yes
Referee: [Method] Method (description of 3D feature encoding and adapter): The load-bearing assumption that latent features from the 3D foundation model (global vector plus projected volumetric latents) supply accurate, unbiased shape guidance for arbitrary in-the-wild images is not accompanied by targeted validation against domain shift. Since 3D foundation models are typically trained on synthetic asset corpora, any mismatch in geometry or appearance distributions could produce incomplete or biased conditioning signals that undermine the reported improvements in rear-view consistency and generalization; concrete tests (e.g., synthetic-to-real transfer ablations or failure-case analysis on real images) are required.
Authors: We recognize the value of explicit validation for domain shift, given that 3D foundation models are pretrained on synthetic data. Our experiments already include qualitative results demonstrating robust generalization to in-the-wild images and complex trajectories, where the multi-scale latents help maintain structural consistency. To directly address the concern, we will add a dedicated discussion and failure-case analysis subsection in the revised Experiments section, examining performance on challenging real images and noting limitations arising from distribution mismatches. This will provide the targeted evidence requested without requiring new model training. revision: yes
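For in-the-wild images no ground-truth geometry exists, so one hedged way to seed the promised failure-case analysis is a loop-closure check: a full 360-degree orbit should return to the input view, and a large perceptual gap flags candidates for inspection. The names `generate_orbit` and `lpips_model` are assumed interfaces (the latter standing for any perceptual distance, e.g. an LPIPS network), not part of the paper.

```python
import torch

@torch.no_grad()
def loop_closure_error(generate_orbit, lpips_model, image: torch.Tensor, n_frames: int = 25) -> float:
    """Perceptual distance between the input view and the final frame of a
    full 360-degree generated orbit; larger values suggest shape drift."""
    frames = generate_orbit(image, n_frames=n_frames, total_azimuth=360.0)  # [T, 3, H, W]
    first, last = image.unsqueeze(0), frames[-1:]
    return lpips_model(last, first).item()

# Usage idea: rank a set of real test images by loop_closure_error and inspect
# the worst cases for the rear-view collapses or distortions the referee asks about.
```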
Circularity Check
No circularity: external 3D priors and independent adapter yield self-contained claims
full rationale
The paper's core derivation extracts latents from a pre-trained 3D foundation model (global denoised vector plus projected volumetric features) and injects them via a newly proposed multi-scale 3D adapter using cross-attention into a base video diffusion backbone. This conditioning scheme and the reported gains in shape realism and multi-view consistency are supported by comparative experiments on benchmarks rather than by any self-referential definition, fitted input renamed as prediction, or load-bearing self-citation chain. The 3D priors are treated as external input; no equations reduce the output to the input by construction, and the adapter is presented as an independent architectural contribution.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: 3D foundation models trained on large asset corpora capture realistic object shape distributions that transfer to real images
invented entities (1)
- multi-scale 3D adapter (no independent evidence)