pith. sign in

arxiv: 2504.02316 · v4 · submitted 2025-04-03 · 💻 cs.CV · cs.AI

ConsDreamer: Advancing Multi-View Consistency for Zero-Shot Text-to-3D Generation

Pith reviewed 2026-05-22 21:31 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords zero-shot text-to-3D generationmulti-view consistencyJanus problemView Disentanglement Modulescore distillationpartial order loss3D Gaussian Splatting
0
0 comments X

The pith

ConsDreamer removes viewpoint biases from text-to-image models to produce consistent 3D objects from text prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets the multi-face Janus problem in zero-shot text-to-3D generation, where objects display conflicting features from different viewpoints. This stems from biases in pre-trained text-to-image models during score distillation. The method refines both conditional and unconditional terms using a View Disentanglement Module to separate view components and add precise control, plus a partial order loss that aligns similarities with geometric angles. If the approach works, generated 3D models would render coherently across views without extra post-processing. Readers would value this because it makes direct text-to-3D synthesis more reliable for practical use.

Core claim

ConsDreamer mitigates view bias by refining the score distillation process: a View Disentanglement Module eliminates irrelevant viewpoint elements in conditional prompts and injects precise view control, while a similarity-based partial order loss enforces geometric consistency in the unconditional term by aligning cosine similarities with azimuth relationships, allowing seamless integration into various 3D representations and score distillation paradigms.

What carries the argument

The View Disentanglement Module (VDM) paired with a similarity-based partial order loss, which decouples view biases from conditional prompts and enforces geometric alignment in the unconditional score.

If this is right

  • Seamless integration into 3D Gaussian Splatting and other representations for multi-view rendering.
  • Mitigation of the multi-face Janus problem across multiple score distillation paradigms.
  • Enforced alignment of generated views with actual azimuth angle relationships.
  • Improved geometric consistency in zero-shot text-to-3D outputs without changing the base 3D pipeline.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same disentanglement idea could apply to reducing biases in text-to-video or multi-modal generation tasks.
  • Testing on newer text-to-image models might reveal whether bias removal scales as those models improve.
  • The partial order loss could extend to other consistency metrics beyond azimuth angles, such as elevation or distance.

Load-bearing premise

The main source of multi-view inconsistency is viewpoint bias in the pre-trained text-to-image model, and the proposed modules remove this bias without creating new inconsistencies or lowering image quality.

What would settle it

A set of generated 3D models that still exhibit the multi-face Janus problem or show clear drops in image quality after applying ConsDreamer would disprove the central claim.

Figures

Figures reproduced from arXiv: 2504.02316 by Haoran Duan, Jungong Han, Litao Hua, Shilong Jin, Wanjun Lv, Yuan Zhou.

Figure 1
Figure 1. Figure 1: Examples of the Multi-Face Janus Problem: (a) generated by [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Examples of text-to-3D content creation using our method. We introduce [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Comparison of PerpNeg and VDM in handling prior view biases. (a) [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: An overview of ConsDreamer. Our method is built upon the main flow of 3D content distilled from the T2I model. ConsDreamer introduces two [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Application of the VDM in 2D Generation. (a) Without the VDM, [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 7
Figure 7. Figure 7: Impact of LP on 3D content over iterations. Using “A squirrel eating a burger” as an example, LP demonstrates a significant suppression effect on the multi-face Janus problem encountered during early training stages. from rare views like the “back view,” prior view preferences must be eliminated and target view information injected, as described in Eq.(16): V prompt key ←V prompt key − XΩ P rojδ view key … view at source ↗
Figure 8
Figure 8. Figure 8: Qualitative ablation study results (prompt: “A photo of a Pikachu, yellow, electric”). Incrementally integrating our proposed method’s components [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Ablation on weight coefficient κ of LP. κ = 0.6 yields the highest visual quality: excessively large values lead to over-smoothing and misalignment with textual prompts (e.g., “white hair”), while overly small values cause view inconsistency in the generated results, resulting in poor fitting of the target object. Multi-Face Janus Problem Additional Examples of Inconsistency Component Redundancy Component … view at source ↗
Figure 10
Figure 10. Figure 10: Examples of inconsistencies in generated 3D content. The top [PITH_FULL_IMAGE:figures/full_fig_p009_10.png] view at source ↗
Figure 12
Figure 12. Figure 12: Qualitative text-to-3D generation results of our method. We integrate ConsDreamer into six SOTA baselines spanning NeRF/3DGS representations [PITH_FULL_IMAGE:figures/full_fig_p010_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Application of the VDM to T2I Tasks. The results demonstrate the outputs of Stable Diffusion with explicit view descriptions across six prompts. [PITH_FULL_IMAGE:figures/full_fig_p011_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Generation failure case examples. For complex or multi-object [PITH_FULL_IMAGE:figures/full_fig_p012_14.png] view at source ↗
read the original abstract

Recent advances in zero-shot text-to-3D generation have revolutionized 3D content creation by enabling direct synthesis from textual descriptions. While state-of-the-art methods leverage 3D Gaussian Splatting with score distillation to enhance multi-view rendering through pre-trained text-to-image (T2I) models, they suffer from inherent prior view biases in T2I priors. These biases lead to inconsistent 3D generation, particularly manifesting as the multi-face Janus problem, where objects exhibit conflicting features across views. To address this fundamental challenge, we propose ConsDreamer, a novel method that mitigates view bias by refining both the conditional and unconditional terms in the score distillation process: (1) a View Disentanglement Module (VDM) that eliminates viewpoint biases in conditional prompts by decoupling irrelevant view components and injecting precise view control; and (2) a similarity-based partial order loss that enforces geometric consistency in the unconditional term by aligning cosine similarities with azimuth relationships. Extensive experiments demonstrate that ConsDreamer can be seamlessly integrated into various 3D representations and score distillation paradigms, effectively mitigating the multi-face Janus problem.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper claims that ConsDreamer mitigates the multi-face Janus problem in zero-shot text-to-3D generation by refining score distillation: a View Disentanglement Module (VDM) decouples view-specific biases from conditional prompts while injecting precise view control, and a cosine-similarity partial-order loss enforces geometric consistency on the unconditional term by aligning similarities with azimuth relationships. The method is presented as integrable into various 3D representations and distillation paradigms without altering image quality or convergence.

Significance. If the VDM and partial-order loss demonstrably remove T2I viewpoint bias without introducing new inconsistencies or quality degradation, the approach would supply a lightweight, additive correction to existing score-distillation pipelines, potentially reducing reliance on post-hoc consistency fixes or view-specific fine-tuning in text-to-3D pipelines.

major comments (2)
  1. [Abstract] Abstract: the central claim that ConsDreamer 'effectively mitigating the multi-face Janus problem' and 'can be seamlessly integrated' is asserted without any quantitative multi-view consistency metrics, ablation tables, or experimental details; the causal chain that VDM removes only view-specific components while the partial-order loss produces geometrically consistent 3D without gradient conflict therefore remains unverified.
  2. [Abstract] The modeling assumption that viewpoint bias in the pre-trained T2I model is the dominant source of the Janus problem, and that the two proposed modules correct it without new inconsistencies, is load-bearing for the entire contribution yet receives no supporting evidence in the form of consistency scores or controlled ablations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed comments. The abstract is a concise summary, while the full manuscript provides the requested quantitative evidence, ablations, and analysis. We respond point-by-point below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that ConsDreamer 'effectively mitigating the multi-face Janus problem' and 'can be seamlessly integrated' is asserted without any quantitative multi-view consistency metrics, ablation tables, or experimental details; the causal chain that VDM removes only view-specific components while the partial-order loss produces geometrically consistent 3D without gradient conflict therefore remains unverified.

    Authors: The abstract provides a high-level overview of the method and its outcomes. The manuscript contains quantitative multi-view consistency metrics (e.g., CLIP similarity across views, user studies), ablation tables isolating VDM and the partial-order loss, and experimental details in the Experiments section. These results verify that VDM decouples view biases and the partial-order loss enforces geometric consistency without introducing gradient conflicts or quality degradation. revision: no

  2. Referee: [Abstract] The modeling assumption that viewpoint bias in the pre-trained T2I model is the dominant source of the Janus problem, and that the two proposed modules correct it without new inconsistencies, is load-bearing for the entire contribution yet receives no supporting evidence in the form of consistency scores or controlled ablations.

    Authors: The manuscript supports this assumption with controlled ablations and consistency scores in the Experiments and Ablation Study sections. These show that removing view-specific biases via VDM and enforcing azimuth-aligned similarities in the unconditional term reduces Janus artifacts relative to baselines, with no measurable increase in other inconsistencies or degradation in convergence/image quality. revision: no

Circularity Check

0 steps flagged

No circularity: additive modules on existing distillation without self-referential derivation

full rationale

The paper proposes ConsDreamer as two new modules (VDM for conditional prompts and a cosine-similarity partial-order loss for the unconditional term) to mitigate viewpoint bias in pre-trained T2I models during score distillation. No equations, uniqueness theorems, or fitted parameters are presented that reduce by construction to the inputs or to prior self-citations. The central claims rest on the modeling assumption that these modules correct bias without new inconsistencies, supported by integration experiments rather than any closed derivation loop. This is a standard additive refinement paper with no load-bearing self-citation chains or renamings of known results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on free parameters, axioms, or invented entities; all technical details are absent.

pith-pipeline@v0.9.0 · 5745 in / 952 out tokens · 64205 ms · 2026-05-22T21:31:29.102162+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CAdam: Context-Adaptive Moment Estimation for 3D Gaussian Densification in Generative Distillation

    cs.LG 2026-05 unverdicted novelty 7.0

    CAdam reinterprets densification in generative 3DGS as signal verification via gradient-moment interference, quantile context, and SNR gating to achieve large reductions in primitive count with comparable quality.

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · cited by 1 Pith paper · 7 internal anchors

  1. [1]

    Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation,

    R. Chen, Y . Chen, N. Jiao, and K. Jia, “Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation,” inProc. Int. Conf. Comput. Vis., 2023, pp. 22 246–22 256

  2. [2]

    Avatarclip: Zero-shot text-driven generation and animation of 3d avatars,

    F. Hong, M. Zhang, L. Pan, Z. Cai, L. Yang, and Z. Liu, “Avatarclip: Zero-shot text-driven generation and animation of 3d avatars,” 2022, arXiv:2205.08535

  3. [3]

    Magic3d: High-resolution text-to-3d content creation,

    C.-H. Linet al., “Magic3d: High-resolution text-to-3d content creation,” inProc. IEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 300–309. IEEE TRANSACTIONS ON IMAGE PROCESSING 13

  4. [4]

    Latent-nerf for shape-guided generation of 3d shapes and textures,

    G. Metzer, E. Richardson, O. Patashnik, R. Giryes, and D. Cohen-Or, “Latent-nerf for shape-guided generation of 3d shapes and textures,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 12 663–12 673

  5. [5]

    Text2mesh: Text-driven neural stylization for meshes,

    O. Michel, R. Bar-On, R. Liu, S. Benaim, and R. Hanocka, “Text2mesh: Text-driven neural stylization for meshes,” inProc. IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 13 492–13 502

  6. [6]

    Text2shape: Generating shapes from natural language by learning joint embeddings,

    K. Chen, C. B. Choy, M. Savva, A. X. Chang, T. Funkhouser, and S. Savarese, “Text2shape: Generating shapes from natural language by learning joint embeddings,” inProc. Asian Conf. Comput. Vis.Springer, 2019, pp. 100–116

  7. [7]

    Dynamic unary convolution in transformers,

    H. Duan, Y . Long, S. Wang, H. Zhang, C. G. Willcocks, and L. Shao, “Dynamic unary convolution in transformers,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 11, pp. 12 747–12 759, 2023

  8. [8]

    Sofa-net: Second-order and first-order attention network for crowd counting,

    H. Duan, S. Wang, and Y . Guan, “Sofa-net: Second-order and first-order attention network for crowd counting,” 2020,arXiv:2008.03723

  9. [9]

    Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling,

    J. Wu, C. Zhang, T. Xue, B. Freeman, and J. Tenenbaum, “Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling,”Adv. Neural Inform. Process. Syst., vol. 29, 2016

  10. [10]

    Dreambooth: Fine tuning text-to-image diffusion models for subject- driven generation,

    N. Ruiz, Y . Li, V . Jampani, Y . Pritch, M. Rubinstein, and K. Aberman, “Dreambooth: Fine tuning text-to-image diffusion models for subject- driven generation,” inProc. IEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 22 500–22 510

  11. [11]

    Photorealistic text-to-image diffusion models with deep language understanding,

    C. Sahariaet al., “Photorealistic text-to-image diffusion models with deep language understanding,”Adv. Neural Inform. Process. Syst., vol. 35, pp. 36 479–36 494, 2022

  12. [12]

    DreamFusion: Text-to-3D using 2D Diffusion

    B. Poole, A. Jain, J. T. Barron, and B. Mildenhall, “Dreamfusion: Text- to-3d using 2d diffusion,” 2022,arXiv:2209.14988

  13. [13]

    Score distillation via reparametrized ddim,

    A. Lukoianovet al., “Score distillation via reparametrized ddim,”Ad- vances in Neural Information Processing Systems, vol. 37, pp. 26 011– 26 044, 2024

  14. [14]

    Luciddreamer: Towards high-fidelity text-to-3d generation via interval score matching,

    Y . Liang, X. Yang, J. Lin, H. Li, X. Xu, and Y . Chen, “Luciddreamer: Towards high-fidelity text-to-3d generation via interval score matching,” inProc. IEEE Conf. Comput. Vis. Pattern Recog., 2024, pp. 6517–6526

  15. [15]

    Dreamscene: 3d gaussian-based text-to-3d scene generation via formation pattern sampling,

    H. Liet al., “Dreamscene: 3d gaussian-based text-to-3d scene generation via formation pattern sampling,” inProc. Eur . Conf. Comput. Vis. Springer, 2024, pp. 214–230

  16. [16]

    Text-to-3d generation by 2d editing,

    Li, Haoran and Tian, Yuli and Wang, Yonghui and Liao, Yong and Wang, Lin and Wang, Yuyang and Zhou, Peng Yuan, “Text-to-3d generation by 2d editing,”arXiv preprint arXiv:2412.05929, 2024

  17. [17]

    Nerf: Representing scenes as neural radiance fields for view synthesis,

    B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “Nerf: Representing scenes as neural radiance fields for view synthesis,”Commun. ACM, vol. 65, no. 1, pp. 99–106, 2021

  18. [18]

    MVDream: Multi-view Diffusion for 3D Generation

    Y . Shi, P. Wang, J. Ye, M. Long, K. Li, and X. Yang, “Mvdream: Multi- view diffusion for 3d generation,” 2023,arXiv:2308.16512

  19. [19]

    Debiasing scores and prompts of 2d diffusion for view-consistent text-to-3d generation,

    S. Hong, D. Ahn, and S. Kim, “Debiasing scores and prompts of 2d diffusion for view-consistent text-to-3d generation,”Adv. Neural Inform. Process. Syst., vol. 36, pp. 11 970–11 987, 2023

  20. [20]

    Laser: Efficient language-guided segmentation in neural radiance fields,

    X. Miaoet al., “Laser: Efficient language-guided segmentation in neural radiance fields,” 2025,arXiv:2501.19084

  21. [21]

    3d gaussian splatting for real-time radiance field rendering

    B. Kerbl, G. Kopanas, T. Leimk ¨uhler, and G. Drettakis, “3d gaussian splatting for real-time radiance field rendering.”ACM Trans. Graph., vol. 42, no. 4, pp. 139–1, 2023

  22. [22]

    Gseditpro: 3d gaussian splatting editing with attention-based progressive localization,

    Y . Sun, R. Tian, X. Han, X. Liu, Y . Zhang, and K. Xu, “Gseditpro: 3d gaussian splatting editing with attention-based progressive localization,” inComput. Graph. F orum, vol. 43, no. 7. Wiley Online Library, 2024, p. e15215

  23. [23]

    Hyper-3dg: Text-to-3d gaussian generation via hyper- graph,

    D. Diet al., “Hyper-3dg: Text-to-3d gaussian generation via hyper- graph,”Int. J. Comput. Vis., pp. 1–24, 2024

  24. [24]

    Connecting consistency distillation to score distillation for text-to-3d generation,

    Z. Li, M. Hu, Q. Zheng, and X. Jiang, “Connecting consistency distillation to score distillation for text-to-3d generation,” inProc. Eur . Conf. Comput. Vis.Springer, 2024, pp. 274–291

  25. [25]

    Re-imagine the negative prompt algorithm: Transform 2d diffusion into 3d, alleviate janus problem and beyond,

    M. Armandpour, A. Sadeghian, H. Zheng, A. Sadeghian, and M. Zhou, “Re-imagine the negative prompt algorithm: Transform 2d diffusion into 3d, alleviate janus problem and beyond,” 2023,arXiv:2304.04968

  26. [26]

    Mip-nerf: A multiscale representation for anti- aliasing neural radiance fields,

    J. T. Barron, B. Mildenhall, M. Tancik, P. Hedman, R. Martin-Brualla, and P. P. Srinivasan, “Mip-nerf: A multiscale representation for anti- aliasing neural radiance fields,” inProc. Int. Conf. Comput. Vis., 2021, pp. 5855–5864

  27. [27]

    Ref-neus: Ambiguity- reduced neural implicit surface learning for multi-view reconstruction with reflection,

    W. Ge, T. Hu, H. Zhao, S. Liu, and Y .-C. Chen, “Ref-neus: Ambiguity- reduced neural implicit surface learning for multi-view reconstruction with reflection,” inProc. Int. Conf. Comput. Vis., 2023, pp. 4251–4260

  28. [28]

    Deep marching tetra- hedra: a hybrid representation for high-resolution 3d shape synthesis,

    T. Shen, J. Gao, K. Yin, M.-Y . Liu, and S. Fidler, “Deep marching tetra- hedra: a hybrid representation for high-resolution 3d shape synthesis,” Adv. Neural Inform. Process. Syst., vol. 34, pp. 6087–6101, 2021

  29. [29]

    3d neural field generation using triplane diffusion,

    J. R. Shue, E. R. Chan, R. Po, Z. Ankner, J. Wu, and G. Wetzstein, “3d neural field generation using triplane diffusion,” inProc. IEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 20 875–20 886

  30. [30]

    DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation

    J. Tang, J. Ren, H. Zhou, Z. Liu, and G. Zeng, “Dreamgaussian: Generative gaussian splatting for efficient 3d content creation,” 2023, arXiv:2309.16653

  31. [31]

    Sherpa3d: Boosting high- fidelity text-to-3d generation via coarse 3d prior,

    F. Liu, D. Wu, Y . Wei, Y . Rao, and Y . Duan, “Sherpa3d: Boosting high- fidelity text-to-3d generation via coarse 3d prior,” inProc. IEEE Conf. Comput. Vis. Pattern Recog., 2024, pp. 20 763–20 774

  32. [32]

    Point-E: A System for Generating 3D Point Clouds from Complex Prompts

    A. Nichol, H. Jun, P. Dhariwal, P. Mishkin, and M. Chen, “Point-e: A system for generating 3d point clouds from complex prompts,” 2022, arXiv:2212.08751

  33. [33]

    Shap-E: Generating Conditional 3D Implicit Functions

    H. Jun and A. Nichol, “Shap-e: Generating conditional 3d implicit functions,” 2023,arXiv:2305.02463

  34. [34]

    Zero-shot text-guided object generation with dream fields,

    A. Jain, B. Mildenhall, J. T. Barron, P. Abbeel, and B. Poole, “Zero-shot text-guided object generation with dream fields,” inProc. IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 867–876

  35. [35]

    Zero-shot text-to-image generation,

    A. Rameshet al., “Zero-shot text-to-image generation,” inProc. Int. Conf. Mach. Learn.Pmlr, 2021, pp. 8821–8831

  36. [36]

    Noise-free score distillation,

    O. Katzir, O. Patashnik, D. Cohen-Or, and D. Lischinski, “Noise-free score distillation,” 2023,arXiv:2310.17590

  37. [37]

    Prolificdreamer: High-fidelity and diverse text-to- 3d generation with variational score distillation,

    Z. Wanget al., “Prolificdreamer: High-fidelity and diverse text-to- 3d generation with variational score distillation,”Adv. Neural Inform. Process. Syst., vol. 36, pp. 8406–8441, 2023

  38. [38]

    Text-to-3d with classifier score distillation,

    X. Yu, Y .-C. Guo, Y . Li, D. Liang, S.-H. Zhang, and X. Qi, “Text-to-3d with classifier score distillation,” 2023,arXiv:2310.19415

  39. [39]

    Hifa: High-fidelity text-to-3d generation with advanced diffusion guidance,

    J. Zhu, P. Zhuang, and S. Koyejo, “Hifa: High-fidelity text-to-3d generation with advanced diffusion guidance,” 2023,arXiv:2305.18766

  40. [40]

    Scaledreamer: Scalable text-to-3d synthesis with asynchronous score distillation,

    Z. Ma, Y . Wei, Y . Zhang, X. Zhu, Z. Lei, and L. Zhang, “Scaledreamer: Scalable text-to-3d synthesis with asynchronous score distillation,” in Proc. Eur . Conf. Comput. Vis.Springer, 2024, pp. 1–19

  41. [41]

    Dreamer xl: Towards high-resolution text-to-3d gener- ation via trajectory score matching,

    X. Miaoet al., “Dreamer xl: Towards high-resolution text-to-3d gener- ation via trajectory score matching,” 2024,arXiv:2405.11252

  42. [42]

    Exactdreamer: High-fidelity text-to-3d content creation via exact score matching,

    Y . Zhanget al., “Exactdreamer: High-fidelity text-to-3d content creation via exact score matching,” 2024,arXiv:2405.15914

  43. [43]

    Dmv3d: Denoising multi-view diffusion using 3d large reconstruction model,

    Y . Xuet al., “Dmv3d: Denoising multi-view diffusion using 3d large reconstruction model,”arXiv preprint arXiv:2311.09217, 2023

  44. [44]

    Pi3d: Efficient text-to-3d generation with pseudo-image diffusion,

    Y .-T. Liu, Y .-C. Guo, G. Luo, H. Sun, W. Yin, and S.-H. Zhang, “Pi3d: Efficient text-to-3d generation with pseudo-image diffusion,” inProc. IEEE Conf. Comput. Vis. Pattern Recog., 2024, pp. 19 915–19 924

  45. [45]

    Large-vocabulary 3d diffusion model with transformer,

    Z. Cao, F. Hong, T. Wu, L. Pan, and Z. Liu, “Large-vocabulary 3d diffusion model with transformer,”arXiv preprint arXiv:2309.07920, 2023

  46. [46]

    Difftf++: 3d-aware diffusion transformer for large-vocabulary 3d generation,

    Z. Cao, F. Hong, T. Wu, L. Pan, and Z. Liu, “Difftf++: 3d-aware diffusion transformer for large-vocabulary 3d generation,”IEEE Trans. Pattern Anal. Mach. Intell., 2025

  47. [47]

    T2td: Text-3d generation model based on prior knowledge guidance,

    W. Nie, R. Chen, W. Wang, B. Lepri, and N. Sebe, “T2td: Text-3d generation model based on prior knowledge guidance,”IEEE Trans. Pattern Anal. Mach. Intell., 2024

  48. [48]

    Zero-1-to-3: Zero-shot one image to 3d object,

    R. Liu, R. Wu, B. V . Hoorick, P. Tokmakov, S. Zakharov, and C. V on- drick, “Zero-1-to-3: Zero-shot one image to 3d object,” 2023

  49. [49]

    SV3D: Novel multi-view synthesis and 3D generation from a single image using latent video diffusion,

    V . V oletiet al., “SV3D: Novel multi-view synthesis and 3D generation from a single image using latent video diffusion,” inEuropean Confer- ence on Computer Vision (ECCV), 2024

  50. [50]

    Denoising diffusion probabilistic models,

    J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Adv. Neural Inform. Process. Syst., vol. 33, pp. 6840–6851, 2020

  51. [51]

    Score-Based Generative Modeling through Stochastic Differential Equations

    Y . Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-based generative modeling through stochastic differ- ential equations,” 2020,arXiv:2011.13456

  52. [52]

    Representation learning: A review and new perspectives,

    Y . Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” 2014

  53. [53]

    Imagereward: Learning and evaluating human preferences for text-to-image generation,

    J. Xuet al., “Imagereward: Learning and evaluating human preferences for text-to-image generation,”Adv. Neural Inform. Process. Syst., vol. 36, pp. 15 903–15 935, 2023

  54. [54]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchical text-conditional image generation with clip latents,” vol. 1, no. 2, p. 3, 2022,arXiv:2204.06125

  55. [55]

    The unreasonable effectiveness of deep features as a perceptual metric,

    R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 586–595