ConsDreamer: Advancing Multi-View Consistency for Zero-Shot Text-to-3D Generation

Haoran Duan; Jungong Han; Litao Hua; Shilong Jin; Wanjun Lv; Yuan Zhou

arxiv: 2504.02316 · v4 · submitted 2025-04-03 · 💻 cs.CV · cs.AI

ConsDreamer: Advancing Multi-View Consistency for Zero-Shot Text-to-3D Generation

Yuan Zhou , Shilong Jin , Litao Hua , Wanjun Lv , Haoran Duan , Jungong Han This is my paper

Pith reviewed 2026-05-22 21:31 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords zero-shot text-to-3D generationmulti-view consistencyJanus problemView Disentanglement Modulescore distillationpartial order loss3D Gaussian Splatting

0 comments

The pith

ConsDreamer removes viewpoint biases from text-to-image models to produce consistent 3D objects from text prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets the multi-face Janus problem in zero-shot text-to-3D generation, where objects display conflicting features from different viewpoints. This stems from biases in pre-trained text-to-image models during score distillation. The method refines both conditional and unconditional terms using a View Disentanglement Module to separate view components and add precise control, plus a partial order loss that aligns similarities with geometric angles. If the approach works, generated 3D models would render coherently across views without extra post-processing. Readers would value this because it makes direct text-to-3D synthesis more reliable for practical use.

Core claim

ConsDreamer mitigates view bias by refining the score distillation process: a View Disentanglement Module eliminates irrelevant viewpoint elements in conditional prompts and injects precise view control, while a similarity-based partial order loss enforces geometric consistency in the unconditional term by aligning cosine similarities with azimuth relationships, allowing seamless integration into various 3D representations and score distillation paradigms.

What carries the argument

The View Disentanglement Module (VDM) paired with a similarity-based partial order loss, which decouples view biases from conditional prompts and enforces geometric alignment in the unconditional score.

If this is right

Seamless integration into 3D Gaussian Splatting and other representations for multi-view rendering.
Mitigation of the multi-face Janus problem across multiple score distillation paradigms.
Enforced alignment of generated views with actual azimuth angle relationships.
Improved geometric consistency in zero-shot text-to-3D outputs without changing the base 3D pipeline.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same disentanglement idea could apply to reducing biases in text-to-video or multi-modal generation tasks.
Testing on newer text-to-image models might reveal whether bias removal scales as those models improve.
The partial order loss could extend to other consistency metrics beyond azimuth angles, such as elevation or distance.

Load-bearing premise

The main source of multi-view inconsistency is viewpoint bias in the pre-trained text-to-image model, and the proposed modules remove this bias without creating new inconsistencies or lowering image quality.

What would settle it

A set of generated 3D models that still exhibit the multi-face Janus problem or show clear drops in image quality after applying ConsDreamer would disprove the central claim.

Figures

Figures reproduced from arXiv: 2504.02316 by Haoran Duan, Jungong Han, Litao Hua, Shilong Jin, Wanjun Lv, Yuan Zhou.

**Figure 2.** Figure 2: Examples of text-to-3D content creation using our method. We introduce [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: Comparison of PerpNeg and VDM in handling prior view biases. (a) [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗

**Figure 4.** Figure 4: An overview of ConsDreamer. Our method is built upon the main flow of 3D content distilled from the T2I model. ConsDreamer introduces two [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 5.** Figure 5: Application of the VDM in 2D Generation. (a) Without the VDM, [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 7.** Figure 7: Impact of LP on 3D content over iterations. Using “A squirrel eating a burger” as an example, LP demonstrates a significant suppression effect on the multi-face Janus problem encountered during early training stages. from rare views like the “back view,” prior view preferences must be eliminated and target view information injected, as described in Eq.(16): V prompt key ←V prompt key − XΩ P rojδ view key … view at source ↗

**Figure 8.** Figure 8: Qualitative ablation study results (prompt: “A photo of a Pikachu, yellow, electric”). Incrementally integrating our proposed method’s components [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗

**Figure 9.** Figure 9: Ablation on weight coefficient κ of LP. κ = 0.6 yields the highest visual quality: excessively large values lead to over-smoothing and misalignment with textual prompts (e.g., “white hair”), while overly small values cause view inconsistency in the generated results, resulting in poor fitting of the target object. Multi-Face Janus Problem Additional Examples of Inconsistency Component Redundancy Component … view at source ↗

**Figure 10.** Figure 10: Examples of inconsistencies in generated 3D content. The top [PITH_FULL_IMAGE:figures/full_fig_p009_10.png] view at source ↗

**Figure 12.** Figure 12: Qualitative text-to-3D generation results of our method. We integrate ConsDreamer into six SOTA baselines spanning NeRF/3DGS representations [PITH_FULL_IMAGE:figures/full_fig_p010_12.png] view at source ↗

**Figure 13.** Figure 13: Application of the VDM to T2I Tasks. The results demonstrate the outputs of Stable Diffusion with explicit view descriptions across six prompts. [PITH_FULL_IMAGE:figures/full_fig_p011_13.png] view at source ↗

**Figure 14.** Figure 14: Generation failure case examples. For complex or multi-object [PITH_FULL_IMAGE:figures/full_fig_p012_14.png] view at source ↗

read the original abstract

Recent advances in zero-shot text-to-3D generation have revolutionized 3D content creation by enabling direct synthesis from textual descriptions. While state-of-the-art methods leverage 3D Gaussian Splatting with score distillation to enhance multi-view rendering through pre-trained text-to-image (T2I) models, they suffer from inherent prior view biases in T2I priors. These biases lead to inconsistent 3D generation, particularly manifesting as the multi-face Janus problem, where objects exhibit conflicting features across views. To address this fundamental challenge, we propose ConsDreamer, a novel method that mitigates view bias by refining both the conditional and unconditional terms in the score distillation process: (1) a View Disentanglement Module (VDM) that eliminates viewpoint biases in conditional prompts by decoupling irrelevant view components and injecting precise view control; and (2) a similarity-based partial order loss that enforces geometric consistency in the unconditional term by aligning cosine similarities with azimuth relationships. Extensive experiments demonstrate that ConsDreamer can be seamlessly integrated into various 3D representations and score distillation paradigms, effectively mitigating the multi-face Janus problem.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ConsDreamer adds a view disentanglement module and partial-order loss to score distillation to target the Janus problem, but the abstract supplies no metrics or ablations to show the fixes actually work.

read the letter

The punchline is that ConsDreamer tries to fix the multi-face Janus problem in text-to-3D generation by adding a View Disentanglement Module to clean up conditional prompts and a similarity-based partial order loss to make the unconditional term more geometrically consistent. This is an incremental step on top of score distillation methods. What the paper does is combine these two modules to refine the score distillation process. The VDM is meant to eliminate viewpoint biases by decoupling irrelevant view components and adding precise view control to the prompts. The loss aligns cosine similarities with azimuth relationships to enforce consistency. The authors claim this can be plugged into different 3D representations like Gaussian Splatting and various distillation paradigms without much trouble. It does well at focusing on a known issue that affects usability of these generative models. Many people have seen the Janus problem where objects have multiple faces or inconsistent features from different views, and addressing the bias in the pre-trained T2I model is a logical place to start. Presenting it as an additive change rather than a complete redesign is practical. Where the soft spots are is the complete lack of supporting data in the abstract. It says extensive experiments demonstrate the effectiveness, but there are no numbers for consistency metrics, no ablation tables showing what each module contributes, and no comparisons to baselines on how much the Janus problem is reduced. This means the claim that it mitigates the problem without introducing new inconsistencies or hurting image quality is not backed up here. The assumption that these modules will remove the bias cleanly is still just an assumption at this point. This kind of paper is useful for people actively working in zero-shot text-to-3D generation, especially those using score distillation. They might get value from the specific design of the modules and could test them in their own setups. Someone outside the subfield probably won't get much from it. It deserves a serious referee because the problem it targets is important for the field and the method is described in enough detail to be reviewed and potentially reproduced if the full paper has the experiments. I would recommend sending this to peer review rather than desk rejecting it, so the experimental claims can be properly assessed.

Referee Report

2 major / 0 minor

Summary. The paper claims that ConsDreamer mitigates the multi-face Janus problem in zero-shot text-to-3D generation by refining score distillation: a View Disentanglement Module (VDM) decouples view-specific biases from conditional prompts while injecting precise view control, and a cosine-similarity partial-order loss enforces geometric consistency on the unconditional term by aligning similarities with azimuth relationships. The method is presented as integrable into various 3D representations and distillation paradigms without altering image quality or convergence.

Significance. If the VDM and partial-order loss demonstrably remove T2I viewpoint bias without introducing new inconsistencies or quality degradation, the approach would supply a lightweight, additive correction to existing score-distillation pipelines, potentially reducing reliance on post-hoc consistency fixes or view-specific fine-tuning in text-to-3D pipelines.

major comments (2)

[Abstract] Abstract: the central claim that ConsDreamer 'effectively mitigating the multi-face Janus problem' and 'can be seamlessly integrated' is asserted without any quantitative multi-view consistency metrics, ablation tables, or experimental details; the causal chain that VDM removes only view-specific components while the partial-order loss produces geometrically consistent 3D without gradient conflict therefore remains unverified.
[Abstract] The modeling assumption that viewpoint bias in the pre-trained T2I model is the dominant source of the Janus problem, and that the two proposed modules correct it without new inconsistencies, is load-bearing for the entire contribution yet receives no supporting evidence in the form of consistency scores or controlled ablations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed comments. The abstract is a concise summary, while the full manuscript provides the requested quantitative evidence, ablations, and analysis. We respond point-by-point below.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim that ConsDreamer 'effectively mitigating the multi-face Janus problem' and 'can be seamlessly integrated' is asserted without any quantitative multi-view consistency metrics, ablation tables, or experimental details; the causal chain that VDM removes only view-specific components while the partial-order loss produces geometrically consistent 3D without gradient conflict therefore remains unverified.

Authors: The abstract provides a high-level overview of the method and its outcomes. The manuscript contains quantitative multi-view consistency metrics (e.g., CLIP similarity across views, user studies), ablation tables isolating VDM and the partial-order loss, and experimental details in the Experiments section. These results verify that VDM decouples view biases and the partial-order loss enforces geometric consistency without introducing gradient conflicts or quality degradation. revision: no
Referee: [Abstract] The modeling assumption that viewpoint bias in the pre-trained T2I model is the dominant source of the Janus problem, and that the two proposed modules correct it without new inconsistencies, is load-bearing for the entire contribution yet receives no supporting evidence in the form of consistency scores or controlled ablations.

Authors: The manuscript supports this assumption with controlled ablations and consistency scores in the Experiments and Ablation Study sections. These show that removing view-specific biases via VDM and enforcing azimuth-aligned similarities in the unconditional term reduces Janus artifacts relative to baselines, with no measurable increase in other inconsistencies or degradation in convergence/image quality. revision: no

Circularity Check

0 steps flagged

No circularity: additive modules on existing distillation without self-referential derivation

full rationale

The paper proposes ConsDreamer as two new modules (VDM for conditional prompts and a cosine-similarity partial-order loss for the unconditional term) to mitigate viewpoint bias in pre-trained T2I models during score distillation. No equations, uniqueness theorems, or fitted parameters are presented that reduce by construction to the inputs or to prior self-citations. The central claims rest on the modeling assumption that these modules correct bias without new inconsistencies, supported by integration experiments rather than any closed derivation loop. This is a standard additive refinement paper with no load-bearing self-citation chains or renamings of known results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on free parameters, axioms, or invented entities; all technical details are absent.

pith-pipeline@v0.9.0 · 5745 in / 952 out tokens · 64205 ms · 2026-05-22T21:31:29.102162+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CAdam: Context-Adaptive Moment Estimation for 3D Gaussian Densification in Generative Distillation
cs.LG 2026-05 unverdicted novelty 7.0

CAdam reinterprets densification in generative 3DGS as signal verification via gradient-moment interference, quantile context, and SNR gating to achieve large reductions in primitive count with comparable quality.

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · cited by 1 Pith paper · 7 internal anchors

[1]

Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation,

R. Chen, Y . Chen, N. Jiao, and K. Jia, “Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation,” inProc. Int. Conf. Comput. Vis., 2023, pp. 22 246–22 256

work page 2023
[2]

Avatarclip: Zero-shot text-driven generation and animation of 3d avatars,

F. Hong, M. Zhang, L. Pan, Z. Cai, L. Yang, and Z. Liu, “Avatarclip: Zero-shot text-driven generation and animation of 3d avatars,” 2022, arXiv:2205.08535

work page arXiv 2022
[3]

Magic3d: High-resolution text-to-3d content creation,

C.-H. Linet al., “Magic3d: High-resolution text-to-3d content creation,” inProc. IEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 300–309. IEEE TRANSACTIONS ON IMAGE PROCESSING 13

work page 2023
[4]

Latent-nerf for shape-guided generation of 3d shapes and textures,

G. Metzer, E. Richardson, O. Patashnik, R. Giryes, and D. Cohen-Or, “Latent-nerf for shape-guided generation of 3d shapes and textures,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 12 663–12 673

work page 2023
[5]

Text2mesh: Text-driven neural stylization for meshes,

O. Michel, R. Bar-On, R. Liu, S. Benaim, and R. Hanocka, “Text2mesh: Text-driven neural stylization for meshes,” inProc. IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 13 492–13 502

work page 2022
[6]

Text2shape: Generating shapes from natural language by learning joint embeddings,

K. Chen, C. B. Choy, M. Savva, A. X. Chang, T. Funkhouser, and S. Savarese, “Text2shape: Generating shapes from natural language by learning joint embeddings,” inProc. Asian Conf. Comput. Vis.Springer, 2019, pp. 100–116

work page 2019
[7]

Dynamic unary convolution in transformers,

H. Duan, Y . Long, S. Wang, H. Zhang, C. G. Willcocks, and L. Shao, “Dynamic unary convolution in transformers,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 11, pp. 12 747–12 759, 2023

work page 2023
[8]

Sofa-net: Second-order and first-order attention network for crowd counting,

H. Duan, S. Wang, and Y . Guan, “Sofa-net: Second-order and first-order attention network for crowd counting,” 2020,arXiv:2008.03723

work page arXiv 2020
[9]

Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling,

J. Wu, C. Zhang, T. Xue, B. Freeman, and J. Tenenbaum, “Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling,”Adv. Neural Inform. Process. Syst., vol. 29, 2016

work page 2016
[10]

Dreambooth: Fine tuning text-to-image diffusion models for subject- driven generation,

N. Ruiz, Y . Li, V . Jampani, Y . Pritch, M. Rubinstein, and K. Aberman, “Dreambooth: Fine tuning text-to-image diffusion models for subject- driven generation,” inProc. IEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 22 500–22 510

work page 2023
[11]

Photorealistic text-to-image diffusion models with deep language understanding,

C. Sahariaet al., “Photorealistic text-to-image diffusion models with deep language understanding,”Adv. Neural Inform. Process. Syst., vol. 35, pp. 36 479–36 494, 2022

work page 2022
[12]

DreamFusion: Text-to-3D using 2D Diffusion

B. Poole, A. Jain, J. T. Barron, and B. Mildenhall, “Dreamfusion: Text- to-3d using 2d diffusion,” 2022,arXiv:2209.14988

work page internal anchor Pith review Pith/arXiv arXiv 2022
[13]

Score distillation via reparametrized ddim,

A. Lukoianovet al., “Score distillation via reparametrized ddim,”Ad- vances in Neural Information Processing Systems, vol. 37, pp. 26 011– 26 044, 2024

work page 2024
[14]

Luciddreamer: Towards high-fidelity text-to-3d generation via interval score matching,

Y . Liang, X. Yang, J. Lin, H. Li, X. Xu, and Y . Chen, “Luciddreamer: Towards high-fidelity text-to-3d generation via interval score matching,” inProc. IEEE Conf. Comput. Vis. Pattern Recog., 2024, pp. 6517–6526

work page 2024
[15]

Dreamscene: 3d gaussian-based text-to-3d scene generation via formation pattern sampling,

H. Liet al., “Dreamscene: 3d gaussian-based text-to-3d scene generation via formation pattern sampling,” inProc. Eur . Conf. Comput. Vis. Springer, 2024, pp. 214–230

work page 2024
[16]

Text-to-3d generation by 2d editing,

Li, Haoran and Tian, Yuli and Wang, Yonghui and Liao, Yong and Wang, Lin and Wang, Yuyang and Zhou, Peng Yuan, “Text-to-3d generation by 2d editing,”arXiv preprint arXiv:2412.05929, 2024

work page arXiv 2024
[17]

Nerf: Representing scenes as neural radiance fields for view synthesis,

B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “Nerf: Representing scenes as neural radiance fields for view synthesis,”Commun. ACM, vol. 65, no. 1, pp. 99–106, 2021

work page 2021
[18]

MVDream: Multi-view Diffusion for 3D Generation

Y . Shi, P. Wang, J. Ye, M. Long, K. Li, and X. Yang, “Mvdream: Multi- view diffusion for 3d generation,” 2023,arXiv:2308.16512

work page internal anchor Pith review Pith/arXiv arXiv 2023
[19]

Debiasing scores and prompts of 2d diffusion for view-consistent text-to-3d generation,

S. Hong, D. Ahn, and S. Kim, “Debiasing scores and prompts of 2d diffusion for view-consistent text-to-3d generation,”Adv. Neural Inform. Process. Syst., vol. 36, pp. 11 970–11 987, 2023

work page 2023
[20]

Laser: Efficient language-guided segmentation in neural radiance fields,

X. Miaoet al., “Laser: Efficient language-guided segmentation in neural radiance fields,” 2025,arXiv:2501.19084

work page arXiv 2025
[21]

3d gaussian splatting for real-time radiance field rendering

B. Kerbl, G. Kopanas, T. Leimk ¨uhler, and G. Drettakis, “3d gaussian splatting for real-time radiance field rendering.”ACM Trans. Graph., vol. 42, no. 4, pp. 139–1, 2023

work page 2023
[22]

Gseditpro: 3d gaussian splatting editing with attention-based progressive localization,

Y . Sun, R. Tian, X. Han, X. Liu, Y . Zhang, and K. Xu, “Gseditpro: 3d gaussian splatting editing with attention-based progressive localization,” inComput. Graph. F orum, vol. 43, no. 7. Wiley Online Library, 2024, p. e15215

work page 2024
[23]

Hyper-3dg: Text-to-3d gaussian generation via hyper- graph,

D. Diet al., “Hyper-3dg: Text-to-3d gaussian generation via hyper- graph,”Int. J. Comput. Vis., pp. 1–24, 2024

work page 2024
[24]

Connecting consistency distillation to score distillation for text-to-3d generation,

Z. Li, M. Hu, Q. Zheng, and X. Jiang, “Connecting consistency distillation to score distillation for text-to-3d generation,” inProc. Eur . Conf. Comput. Vis.Springer, 2024, pp. 274–291

work page 2024
[25]

Re-imagine the negative prompt algorithm: Transform 2d diffusion into 3d, alleviate janus problem and beyond,

M. Armandpour, A. Sadeghian, H. Zheng, A. Sadeghian, and M. Zhou, “Re-imagine the negative prompt algorithm: Transform 2d diffusion into 3d, alleviate janus problem and beyond,” 2023,arXiv:2304.04968

work page arXiv 2023
[26]

Mip-nerf: A multiscale representation for anti- aliasing neural radiance fields,

J. T. Barron, B. Mildenhall, M. Tancik, P. Hedman, R. Martin-Brualla, and P. P. Srinivasan, “Mip-nerf: A multiscale representation for anti- aliasing neural radiance fields,” inProc. Int. Conf. Comput. Vis., 2021, pp. 5855–5864

work page 2021
[27]

Ref-neus: Ambiguity- reduced neural implicit surface learning for multi-view reconstruction with reflection,

W. Ge, T. Hu, H. Zhao, S. Liu, and Y .-C. Chen, “Ref-neus: Ambiguity- reduced neural implicit surface learning for multi-view reconstruction with reflection,” inProc. Int. Conf. Comput. Vis., 2023, pp. 4251–4260

work page 2023
[28]

Deep marching tetra- hedra: a hybrid representation for high-resolution 3d shape synthesis,

T. Shen, J. Gao, K. Yin, M.-Y . Liu, and S. Fidler, “Deep marching tetra- hedra: a hybrid representation for high-resolution 3d shape synthesis,” Adv. Neural Inform. Process. Syst., vol. 34, pp. 6087–6101, 2021

work page 2021
[29]

3d neural field generation using triplane diffusion,

J. R. Shue, E. R. Chan, R. Po, Z. Ankner, J. Wu, and G. Wetzstein, “3d neural field generation using triplane diffusion,” inProc. IEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 20 875–20 886

work page 2023
[30]

DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation

J. Tang, J. Ren, H. Zhou, Z. Liu, and G. Zeng, “Dreamgaussian: Generative gaussian splatting for efficient 3d content creation,” 2023, arXiv:2309.16653

work page internal anchor Pith review Pith/arXiv arXiv 2023
[31]

Sherpa3d: Boosting high- fidelity text-to-3d generation via coarse 3d prior,

F. Liu, D. Wu, Y . Wei, Y . Rao, and Y . Duan, “Sherpa3d: Boosting high- fidelity text-to-3d generation via coarse 3d prior,” inProc. IEEE Conf. Comput. Vis. Pattern Recog., 2024, pp. 20 763–20 774

work page 2024
[32]

Point-E: A System for Generating 3D Point Clouds from Complex Prompts

A. Nichol, H. Jun, P. Dhariwal, P. Mishkin, and M. Chen, “Point-e: A system for generating 3d point clouds from complex prompts,” 2022, arXiv:2212.08751

work page internal anchor Pith review Pith/arXiv arXiv 2022
[33]

Shap-E: Generating Conditional 3D Implicit Functions

H. Jun and A. Nichol, “Shap-e: Generating conditional 3d implicit functions,” 2023,arXiv:2305.02463

work page internal anchor Pith review Pith/arXiv arXiv 2023
[34]

Zero-shot text-guided object generation with dream fields,

A. Jain, B. Mildenhall, J. T. Barron, P. Abbeel, and B. Poole, “Zero-shot text-guided object generation with dream fields,” inProc. IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 867–876

work page 2022
[35]

Zero-shot text-to-image generation,

A. Rameshet al., “Zero-shot text-to-image generation,” inProc. Int. Conf. Mach. Learn.Pmlr, 2021, pp. 8821–8831

work page 2021
[36]

Noise-free score distillation,

O. Katzir, O. Patashnik, D. Cohen-Or, and D. Lischinski, “Noise-free score distillation,” 2023,arXiv:2310.17590

work page arXiv 2023
[37]

Prolificdreamer: High-fidelity and diverse text-to- 3d generation with variational score distillation,

Z. Wanget al., “Prolificdreamer: High-fidelity and diverse text-to- 3d generation with variational score distillation,”Adv. Neural Inform. Process. Syst., vol. 36, pp. 8406–8441, 2023

work page 2023
[38]

Text-to-3d with classifier score distillation,

X. Yu, Y .-C. Guo, Y . Li, D. Liang, S.-H. Zhang, and X. Qi, “Text-to-3d with classifier score distillation,” 2023,arXiv:2310.19415

work page arXiv 2023
[39]

Hifa: High-fidelity text-to-3d generation with advanced diffusion guidance,

J. Zhu, P. Zhuang, and S. Koyejo, “Hifa: High-fidelity text-to-3d generation with advanced diffusion guidance,” 2023,arXiv:2305.18766

work page arXiv 2023
[40]

Scaledreamer: Scalable text-to-3d synthesis with asynchronous score distillation,

Z. Ma, Y . Wei, Y . Zhang, X. Zhu, Z. Lei, and L. Zhang, “Scaledreamer: Scalable text-to-3d synthesis with asynchronous score distillation,” in Proc. Eur . Conf. Comput. Vis.Springer, 2024, pp. 1–19

work page 2024
[41]

Dreamer xl: Towards high-resolution text-to-3d gener- ation via trajectory score matching,

X. Miaoet al., “Dreamer xl: Towards high-resolution text-to-3d gener- ation via trajectory score matching,” 2024,arXiv:2405.11252

work page arXiv 2024
[42]

Exactdreamer: High-fidelity text-to-3d content creation via exact score matching,

Y . Zhanget al., “Exactdreamer: High-fidelity text-to-3d content creation via exact score matching,” 2024,arXiv:2405.15914

work page arXiv 2024
[43]

Dmv3d: Denoising multi-view diffusion using 3d large reconstruction model,

Y . Xuet al., “Dmv3d: Denoising multi-view diffusion using 3d large reconstruction model,”arXiv preprint arXiv:2311.09217, 2023

work page arXiv 2023
[44]

Pi3d: Efficient text-to-3d generation with pseudo-image diffusion,

Y .-T. Liu, Y .-C. Guo, G. Luo, H. Sun, W. Yin, and S.-H. Zhang, “Pi3d: Efficient text-to-3d generation with pseudo-image diffusion,” inProc. IEEE Conf. Comput. Vis. Pattern Recog., 2024, pp. 19 915–19 924

work page 2024
[45]

Large-vocabulary 3d diffusion model with transformer,

Z. Cao, F. Hong, T. Wu, L. Pan, and Z. Liu, “Large-vocabulary 3d diffusion model with transformer,”arXiv preprint arXiv:2309.07920, 2023

work page arXiv 2023
[46]

Difftf++: 3d-aware diffusion transformer for large-vocabulary 3d generation,

Z. Cao, F. Hong, T. Wu, L. Pan, and Z. Liu, “Difftf++: 3d-aware diffusion transformer for large-vocabulary 3d generation,”IEEE Trans. Pattern Anal. Mach. Intell., 2025

work page 2025
[47]

T2td: Text-3d generation model based on prior knowledge guidance,

W. Nie, R. Chen, W. Wang, B. Lepri, and N. Sebe, “T2td: Text-3d generation model based on prior knowledge guidance,”IEEE Trans. Pattern Anal. Mach. Intell., 2024

work page 2024
[48]

Zero-1-to-3: Zero-shot one image to 3d object,

R. Liu, R. Wu, B. V . Hoorick, P. Tokmakov, S. Zakharov, and C. V on- drick, “Zero-1-to-3: Zero-shot one image to 3d object,” 2023

work page 2023
[49]

SV3D: Novel multi-view synthesis and 3D generation from a single image using latent video diffusion,

V . V oletiet al., “SV3D: Novel multi-view synthesis and 3D generation from a single image using latent video diffusion,” inEuropean Confer- ence on Computer Vision (ECCV), 2024

work page 2024
[50]

Denoising diffusion probabilistic models,

J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Adv. Neural Inform. Process. Syst., vol. 33, pp. 6840–6851, 2020

work page 2020
[51]

Score-Based Generative Modeling through Stochastic Differential Equations

Y . Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-based generative modeling through stochastic differ- ential equations,” 2020,arXiv:2011.13456

work page internal anchor Pith review Pith/arXiv arXiv 2020
[52]

Representation learning: A review and new perspectives,

Y . Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” 2014

work page 2014
[53]

Imagereward: Learning and evaluating human preferences for text-to-image generation,

J. Xuet al., “Imagereward: Learning and evaluating human preferences for text-to-image generation,”Adv. Neural Inform. Process. Syst., vol. 36, pp. 15 903–15 935, 2023

work page 2023
[54]

Hierarchical Text-Conditional Image Generation with CLIP Latents

A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchical text-conditional image generation with clip latents,” vol. 1, no. 2, p. 3, 2022,arXiv:2204.06125

work page internal anchor Pith review Pith/arXiv arXiv 2022
[55]

The unreasonable effectiveness of deep features as a perceptual metric,

R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 586–595

work page 2018

[1] [1]

Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation,

R. Chen, Y . Chen, N. Jiao, and K. Jia, “Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation,” inProc. Int. Conf. Comput. Vis., 2023, pp. 22 246–22 256

work page 2023

[2] [2]

Avatarclip: Zero-shot text-driven generation and animation of 3d avatars,

F. Hong, M. Zhang, L. Pan, Z. Cai, L. Yang, and Z. Liu, “Avatarclip: Zero-shot text-driven generation and animation of 3d avatars,” 2022, arXiv:2205.08535

work page arXiv 2022

[3] [3]

Magic3d: High-resolution text-to-3d content creation,

C.-H. Linet al., “Magic3d: High-resolution text-to-3d content creation,” inProc. IEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 300–309. IEEE TRANSACTIONS ON IMAGE PROCESSING 13

work page 2023

[4] [4]

Latent-nerf for shape-guided generation of 3d shapes and textures,

G. Metzer, E. Richardson, O. Patashnik, R. Giryes, and D. Cohen-Or, “Latent-nerf for shape-guided generation of 3d shapes and textures,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 12 663–12 673

work page 2023

[5] [5]

Text2mesh: Text-driven neural stylization for meshes,

O. Michel, R. Bar-On, R. Liu, S. Benaim, and R. Hanocka, “Text2mesh: Text-driven neural stylization for meshes,” inProc. IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 13 492–13 502

work page 2022

[6] [6]

Text2shape: Generating shapes from natural language by learning joint embeddings,

K. Chen, C. B. Choy, M. Savva, A. X. Chang, T. Funkhouser, and S. Savarese, “Text2shape: Generating shapes from natural language by learning joint embeddings,” inProc. Asian Conf. Comput. Vis.Springer, 2019, pp. 100–116

work page 2019

[7] [7]

Dynamic unary convolution in transformers,

H. Duan, Y . Long, S. Wang, H. Zhang, C. G. Willcocks, and L. Shao, “Dynamic unary convolution in transformers,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 11, pp. 12 747–12 759, 2023

work page 2023

[8] [8]

Sofa-net: Second-order and first-order attention network for crowd counting,

H. Duan, S. Wang, and Y . Guan, “Sofa-net: Second-order and first-order attention network for crowd counting,” 2020,arXiv:2008.03723

work page arXiv 2020

[9] [9]

Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling,

J. Wu, C. Zhang, T. Xue, B. Freeman, and J. Tenenbaum, “Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling,”Adv. Neural Inform. Process. Syst., vol. 29, 2016

work page 2016

[10] [10]

Dreambooth: Fine tuning text-to-image diffusion models for subject- driven generation,

N. Ruiz, Y . Li, V . Jampani, Y . Pritch, M. Rubinstein, and K. Aberman, “Dreambooth: Fine tuning text-to-image diffusion models for subject- driven generation,” inProc. IEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 22 500–22 510

work page 2023

[11] [11]

Photorealistic text-to-image diffusion models with deep language understanding,

C. Sahariaet al., “Photorealistic text-to-image diffusion models with deep language understanding,”Adv. Neural Inform. Process. Syst., vol. 35, pp. 36 479–36 494, 2022

work page 2022

[12] [12]

DreamFusion: Text-to-3D using 2D Diffusion

B. Poole, A. Jain, J. T. Barron, and B. Mildenhall, “Dreamfusion: Text- to-3d using 2d diffusion,” 2022,arXiv:2209.14988

work page internal anchor Pith review Pith/arXiv arXiv 2022

[13] [13]

Score distillation via reparametrized ddim,

A. Lukoianovet al., “Score distillation via reparametrized ddim,”Ad- vances in Neural Information Processing Systems, vol. 37, pp. 26 011– 26 044, 2024

work page 2024

[14] [14]

Luciddreamer: Towards high-fidelity text-to-3d generation via interval score matching,

Y . Liang, X. Yang, J. Lin, H. Li, X. Xu, and Y . Chen, “Luciddreamer: Towards high-fidelity text-to-3d generation via interval score matching,” inProc. IEEE Conf. Comput. Vis. Pattern Recog., 2024, pp. 6517–6526

work page 2024

[15] [15]

Dreamscene: 3d gaussian-based text-to-3d scene generation via formation pattern sampling,

H. Liet al., “Dreamscene: 3d gaussian-based text-to-3d scene generation via formation pattern sampling,” inProc. Eur . Conf. Comput. Vis. Springer, 2024, pp. 214–230

work page 2024

[16] [16]

Text-to-3d generation by 2d editing,

Li, Haoran and Tian, Yuli and Wang, Yonghui and Liao, Yong and Wang, Lin and Wang, Yuyang and Zhou, Peng Yuan, “Text-to-3d generation by 2d editing,”arXiv preprint arXiv:2412.05929, 2024

work page arXiv 2024

[17] [17]

Nerf: Representing scenes as neural radiance fields for view synthesis,

B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “Nerf: Representing scenes as neural radiance fields for view synthesis,”Commun. ACM, vol. 65, no. 1, pp. 99–106, 2021

work page 2021

[18] [18]

MVDream: Multi-view Diffusion for 3D Generation

Y . Shi, P. Wang, J. Ye, M. Long, K. Li, and X. Yang, “Mvdream: Multi- view diffusion for 3d generation,” 2023,arXiv:2308.16512

work page internal anchor Pith review Pith/arXiv arXiv 2023

[19] [19]

Debiasing scores and prompts of 2d diffusion for view-consistent text-to-3d generation,

S. Hong, D. Ahn, and S. Kim, “Debiasing scores and prompts of 2d diffusion for view-consistent text-to-3d generation,”Adv. Neural Inform. Process. Syst., vol. 36, pp. 11 970–11 987, 2023

work page 2023

[20] [20]

Laser: Efficient language-guided segmentation in neural radiance fields,

X. Miaoet al., “Laser: Efficient language-guided segmentation in neural radiance fields,” 2025,arXiv:2501.19084

work page arXiv 2025

[21] [21]

3d gaussian splatting for real-time radiance field rendering

B. Kerbl, G. Kopanas, T. Leimk ¨uhler, and G. Drettakis, “3d gaussian splatting for real-time radiance field rendering.”ACM Trans. Graph., vol. 42, no. 4, pp. 139–1, 2023

work page 2023

[22] [22]

Gseditpro: 3d gaussian splatting editing with attention-based progressive localization,

Y . Sun, R. Tian, X. Han, X. Liu, Y . Zhang, and K. Xu, “Gseditpro: 3d gaussian splatting editing with attention-based progressive localization,” inComput. Graph. F orum, vol. 43, no. 7. Wiley Online Library, 2024, p. e15215

work page 2024

[23] [23]

Hyper-3dg: Text-to-3d gaussian generation via hyper- graph,

D. Diet al., “Hyper-3dg: Text-to-3d gaussian generation via hyper- graph,”Int. J. Comput. Vis., pp. 1–24, 2024

work page 2024

[24] [24]

Connecting consistency distillation to score distillation for text-to-3d generation,

Z. Li, M. Hu, Q. Zheng, and X. Jiang, “Connecting consistency distillation to score distillation for text-to-3d generation,” inProc. Eur . Conf. Comput. Vis.Springer, 2024, pp. 274–291

work page 2024

[25] [25]

Re-imagine the negative prompt algorithm: Transform 2d diffusion into 3d, alleviate janus problem and beyond,

M. Armandpour, A. Sadeghian, H. Zheng, A. Sadeghian, and M. Zhou, “Re-imagine the negative prompt algorithm: Transform 2d diffusion into 3d, alleviate janus problem and beyond,” 2023,arXiv:2304.04968

work page arXiv 2023

[26] [26]

Mip-nerf: A multiscale representation for anti- aliasing neural radiance fields,

J. T. Barron, B. Mildenhall, M. Tancik, P. Hedman, R. Martin-Brualla, and P. P. Srinivasan, “Mip-nerf: A multiscale representation for anti- aliasing neural radiance fields,” inProc. Int. Conf. Comput. Vis., 2021, pp. 5855–5864

work page 2021

[27] [27]

Ref-neus: Ambiguity- reduced neural implicit surface learning for multi-view reconstruction with reflection,

W. Ge, T. Hu, H. Zhao, S. Liu, and Y .-C. Chen, “Ref-neus: Ambiguity- reduced neural implicit surface learning for multi-view reconstruction with reflection,” inProc. Int. Conf. Comput. Vis., 2023, pp. 4251–4260

work page 2023

[28] [28]

Deep marching tetra- hedra: a hybrid representation for high-resolution 3d shape synthesis,

T. Shen, J. Gao, K. Yin, M.-Y . Liu, and S. Fidler, “Deep marching tetra- hedra: a hybrid representation for high-resolution 3d shape synthesis,” Adv. Neural Inform. Process. Syst., vol. 34, pp. 6087–6101, 2021

work page 2021

[29] [29]

3d neural field generation using triplane diffusion,

J. R. Shue, E. R. Chan, R. Po, Z. Ankner, J. Wu, and G. Wetzstein, “3d neural field generation using triplane diffusion,” inProc. IEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 20 875–20 886

work page 2023

[30] [30]

DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation

J. Tang, J. Ren, H. Zhou, Z. Liu, and G. Zeng, “Dreamgaussian: Generative gaussian splatting for efficient 3d content creation,” 2023, arXiv:2309.16653

work page internal anchor Pith review Pith/arXiv arXiv 2023

[31] [31]

Sherpa3d: Boosting high- fidelity text-to-3d generation via coarse 3d prior,

F. Liu, D. Wu, Y . Wei, Y . Rao, and Y . Duan, “Sherpa3d: Boosting high- fidelity text-to-3d generation via coarse 3d prior,” inProc. IEEE Conf. Comput. Vis. Pattern Recog., 2024, pp. 20 763–20 774

work page 2024

[32] [32]

Point-E: A System for Generating 3D Point Clouds from Complex Prompts

A. Nichol, H. Jun, P. Dhariwal, P. Mishkin, and M. Chen, “Point-e: A system for generating 3d point clouds from complex prompts,” 2022, arXiv:2212.08751

work page internal anchor Pith review Pith/arXiv arXiv 2022

[33] [33]

Shap-E: Generating Conditional 3D Implicit Functions

H. Jun and A. Nichol, “Shap-e: Generating conditional 3d implicit functions,” 2023,arXiv:2305.02463

work page internal anchor Pith review Pith/arXiv arXiv 2023

[34] [34]

Zero-shot text-guided object generation with dream fields,

A. Jain, B. Mildenhall, J. T. Barron, P. Abbeel, and B. Poole, “Zero-shot text-guided object generation with dream fields,” inProc. IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 867–876

work page 2022

[35] [35]

Zero-shot text-to-image generation,

A. Rameshet al., “Zero-shot text-to-image generation,” inProc. Int. Conf. Mach. Learn.Pmlr, 2021, pp. 8821–8831

work page 2021

[36] [36]

Noise-free score distillation,

O. Katzir, O. Patashnik, D. Cohen-Or, and D. Lischinski, “Noise-free score distillation,” 2023,arXiv:2310.17590

work page arXiv 2023

[37] [37]

Prolificdreamer: High-fidelity and diverse text-to- 3d generation with variational score distillation,

Z. Wanget al., “Prolificdreamer: High-fidelity and diverse text-to- 3d generation with variational score distillation,”Adv. Neural Inform. Process. Syst., vol. 36, pp. 8406–8441, 2023

work page 2023

[38] [38]

Text-to-3d with classifier score distillation,

X. Yu, Y .-C. Guo, Y . Li, D. Liang, S.-H. Zhang, and X. Qi, “Text-to-3d with classifier score distillation,” 2023,arXiv:2310.19415

work page arXiv 2023

[39] [39]

Hifa: High-fidelity text-to-3d generation with advanced diffusion guidance,

J. Zhu, P. Zhuang, and S. Koyejo, “Hifa: High-fidelity text-to-3d generation with advanced diffusion guidance,” 2023,arXiv:2305.18766

work page arXiv 2023

[40] [40]

Scaledreamer: Scalable text-to-3d synthesis with asynchronous score distillation,

Z. Ma, Y . Wei, Y . Zhang, X. Zhu, Z. Lei, and L. Zhang, “Scaledreamer: Scalable text-to-3d synthesis with asynchronous score distillation,” in Proc. Eur . Conf. Comput. Vis.Springer, 2024, pp. 1–19

work page 2024

[41] [41]

Dreamer xl: Towards high-resolution text-to-3d gener- ation via trajectory score matching,

X. Miaoet al., “Dreamer xl: Towards high-resolution text-to-3d gener- ation via trajectory score matching,” 2024,arXiv:2405.11252

work page arXiv 2024

[42] [42]

Exactdreamer: High-fidelity text-to-3d content creation via exact score matching,

Y . Zhanget al., “Exactdreamer: High-fidelity text-to-3d content creation via exact score matching,” 2024,arXiv:2405.15914

work page arXiv 2024

[43] [43]

Dmv3d: Denoising multi-view diffusion using 3d large reconstruction model,

Y . Xuet al., “Dmv3d: Denoising multi-view diffusion using 3d large reconstruction model,”arXiv preprint arXiv:2311.09217, 2023

work page arXiv 2023

[44] [44]

Pi3d: Efficient text-to-3d generation with pseudo-image diffusion,

Y .-T. Liu, Y .-C. Guo, G. Luo, H. Sun, W. Yin, and S.-H. Zhang, “Pi3d: Efficient text-to-3d generation with pseudo-image diffusion,” inProc. IEEE Conf. Comput. Vis. Pattern Recog., 2024, pp. 19 915–19 924

work page 2024

[45] [45]

Large-vocabulary 3d diffusion model with transformer,

Z. Cao, F. Hong, T. Wu, L. Pan, and Z. Liu, “Large-vocabulary 3d diffusion model with transformer,”arXiv preprint arXiv:2309.07920, 2023

work page arXiv 2023

[46] [46]

Difftf++: 3d-aware diffusion transformer for large-vocabulary 3d generation,

Z. Cao, F. Hong, T. Wu, L. Pan, and Z. Liu, “Difftf++: 3d-aware diffusion transformer for large-vocabulary 3d generation,”IEEE Trans. Pattern Anal. Mach. Intell., 2025

work page 2025

[47] [47]

T2td: Text-3d generation model based on prior knowledge guidance,

W. Nie, R. Chen, W. Wang, B. Lepri, and N. Sebe, “T2td: Text-3d generation model based on prior knowledge guidance,”IEEE Trans. Pattern Anal. Mach. Intell., 2024

work page 2024

[48] [48]

Zero-1-to-3: Zero-shot one image to 3d object,

R. Liu, R. Wu, B. V . Hoorick, P. Tokmakov, S. Zakharov, and C. V on- drick, “Zero-1-to-3: Zero-shot one image to 3d object,” 2023

work page 2023

[49] [49]

SV3D: Novel multi-view synthesis and 3D generation from a single image using latent video diffusion,

V . V oletiet al., “SV3D: Novel multi-view synthesis and 3D generation from a single image using latent video diffusion,” inEuropean Confer- ence on Computer Vision (ECCV), 2024

work page 2024

[50] [50]

Denoising diffusion probabilistic models,

J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Adv. Neural Inform. Process. Syst., vol. 33, pp. 6840–6851, 2020

work page 2020

[51] [51]

Score-Based Generative Modeling through Stochastic Differential Equations

Y . Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-based generative modeling through stochastic differ- ential equations,” 2020,arXiv:2011.13456

work page internal anchor Pith review Pith/arXiv arXiv 2020

[52] [52]

Representation learning: A review and new perspectives,

Y . Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” 2014

work page 2014

[53] [53]

Imagereward: Learning and evaluating human preferences for text-to-image generation,

J. Xuet al., “Imagereward: Learning and evaluating human preferences for text-to-image generation,”Adv. Neural Inform. Process. Syst., vol. 36, pp. 15 903–15 935, 2023

work page 2023

[54] [54]

Hierarchical Text-Conditional Image Generation with CLIP Latents

A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchical text-conditional image generation with clip latents,” vol. 1, no. 2, p. 3, 2022,arXiv:2204.06125

work page internal anchor Pith review Pith/arXiv arXiv 2022

[55] [55]

The unreasonable effectiveness of deep features as a perceptual metric,

R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 586–595

work page 2018