DreamEdit3D: Personalization of Multi-View Diffusion Models for 3D Editing

arxiv: 2605.16990 · v1 · submitted 2026-05-16 · 💻 cs.CV

Jinxin Ai, Matthias Nießner, Ziya Erkoç

Pith reviewed 2026-05-19 20:24 UTC · model grok-4.3

classification 💻 cs.CV

keywords 3D editing · diffusion models · personalization · multi-view consistency · text-guided editing · token embeddings · 3D mesh generation · identity preservation

The pith

Personalizing multi-view diffusion models enables text-guided 3D editing with object-level control and preserved consistency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a method to extend 2D identity-preserving personalization techniques to 3D assets. It renders orthogonal views of a 3D input, extracts segmentation masks for semantic components, and learns distinct token embeddings through a two-phase optimization. This allows composing tokens with editing prompts to generate consistent multi-view images that lift to edited 3D meshes. If successful, this would make 3D editing as flexible as 2D diffusion personalization, improving applications in content creation and design.

Core claim

The central claim is that by learning disentangled token embeddings for isolated semantic components in orthogonal views using multi-view textual inversion with attention alignment followed by full fine-tuning, the method achieves compositional text-guided 3D editing while maintaining multi-view consistency and identity preservation, outperforming baselines in faithfulness and preservation.
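The summary does not give the form of the attention alignment term. A minimal sketch, assuming it resembles the mask-supervised cross-attention loss used in 2D work such as Break-a-Scene [1]: the token's attention map in each view is normalized and penalized for deviating from that view's binary mask. The function name, the peak normalization, and the nested-list inputs are illustrative assumptions, not the authors' implementation.

```python
def attention_alignment_loss(attn_maps, masks):
    """Mean squared error between normalized attention maps and binary masks.

    attn_maps, masks: one 2D list per view, same shape within each pair.
    Hypothetical helper; the paper's exact loss is not given in the summary.
    """
    total, count = 0.0, 0
    for attn, mask in zip(attn_maps, masks):
        # Scale each view's map so its peak is 1 (assumed normalization).
        peak = max(max(row) for row in attn) or 1.0
        for a_row, m_row in zip(attn, mask):
            for a, m in zip(a_row, m_row):
                total += (a / peak - m) ** 2
                count += 1
    return total / count


mask = [[1, 0], [0, 0]]
aligned = [[0.8, 0.0], [0.0, 0.0]]   # attention stays inside the mask
leaked = [[0.4, 0.4], [0.0, 0.0]]    # attention spills onto a neighbor pixel

loss_aligned = attention_alignment_loss([aligned] * 4, [mask] * 4)
loss_leaked = attention_alignment_loss([leaked] * 4, [mask] * 4)
```

Averaging over all four views is one way to tie the token to the same component from every viewpoint; a per-view weighting would be an equally plausible choice.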

What carries the argument

Disentangled token embeddings for each object component, learned via two-phase optimization of multi-view diffusion models.
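The two-phase schedule can be illustrated with a deliberately tiny toy. The sketch below replaces the diffusion model with a scalar "network weight" `w` and a scalar "token embedding" `e`, and fits `w * e` to a target. Phase 1 updates only `e` with `w` frozen (the textual-inversion stage); Phase 2 updates both (the full fine-tuning stage). All names and hyperparameters are assumptions for illustration, and the prior-preservation and attention terms are omitted.

```python
def two_phase_fit(target, w=0.5, e=0.1, lr=0.1, steps_phase1=200, steps_phase2=200):
    """Toy two-phase optimization of the loss 0.5 * (w * e - target) ** 2."""

    def grads(w, e):
        err = w * e - target
        return err * e, err * w  # dL/dw, dL/de

    # Phase 1: the "model" weight is frozen; only the embedding is trained.
    w_frozen = w
    for _ in range(steps_phase1):
        _, g_e = grads(w, e)
        e -= lr * g_e
    e_after_phase1 = e

    # Phase 2: weight and embedding are trained jointly.
    for _ in range(steps_phase2):
        g_w, g_e = grads(w, e)
        w -= lr * g_w
        e -= lr * g_e

    return w_frozen, e_after_phase1, w, e


w0, e1, w2, e2 = two_phase_fit(target=1.0)
```

In this toy, Phase 1 alone already brings the product close to the target, and Phase 2 only refines it. Whether the real method behaves analogously is exactly what the ablation in Figure 4(c) is meant to probe.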

If this is right

Edited 3D models maintain consistency across multiple views when generated from composed prompts.
The approach supports object-level control through natural language without manual 3D manipulation.
It achieves state-of-the-art performance in edit faithfulness and identity preservation.
High-fidelity textured meshes can be produced from the generated consistent images.
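Composing a learned token with an edit instruction amounts to string templating before the prompt reaches the text encoder. A minimal sketch, using the "a photo of robot" to "a photo of robot sitting" example from Figure 4; the placeholder string `<robot*>` and the helper `compose_prompt` are hypothetical and stand in for whatever identifier the learned embedding is registered under.

```python
def compose_prompt(template, tokens, edit=""):
    """Fill component placeholders in a template, then append an optional edit phrase."""
    prompt = template.format(**tokens)
    return f"{prompt} {edit}".strip()


# Hypothetical placeholder string bound to a learned component embedding.
tokens = {"robot": "<robot*>"}

source_prompt = compose_prompt("a photo of {robot}", tokens)
edited_prompt = compose_prompt("a photo of {robot}", tokens, edit="sitting")
```

With several components, the same template mechanism would accept one placeholder per segmented part, which is where the disentanglement assumption starts to matter.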

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Future work could test this on more complex scenes with interacting objects.
Integration with real-time 3D rendering pipelines might enable interactive editing applications.
Similar token-based approaches could apply to video or 4D content for temporal consistency.

Load-bearing premise

Rendering orthogonal views and extracting object-level segmentation masks will allow learning of distinct, composable token embeddings that preserve multi-view consistency.
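To make "orthogonal views" concrete, the sketch below places four cameras around the object at azimuths 90 degrees apart and checks that neighboring viewing directions are perpendicular. The radius, the zero elevation, and the function names are assumptions; the summary does not state the paper's camera parameters.

```python
import math


def orthogonal_cameras(radius=2.0, elevation_deg=0.0, n_views=4):
    """Camera positions on a circle around the origin, evenly spaced in azimuth."""
    elev = math.radians(elevation_deg)
    cams = []
    for k in range(n_views):
        azim = 2.0 * math.pi * k / n_views  # 0, 90, 180, 270 degrees for 4 views
        x = radius * math.cos(elev) * math.cos(azim)
        y = radius * math.cos(elev) * math.sin(azim)
        z = radius * math.sin(elev)
        cams.append((x, y, z))
    return cams


def view_direction(cam):
    """Unit vector pointing from the camera toward the origin (object center)."""
    norm = math.sqrt(sum(c * c for c in cam))
    return tuple(-c / norm for c in cam)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


cams = orthogonal_cameras()
dirs = [view_direction(c) for c in cams]
adjacent_dots = [dot(dirs[i], dirs[(i + 1) % 4]) for i in range(4)]
```

With zero elevation every adjacent pair has a dot product of (numerically) zero; a nonzero elevation would keep the azimuths orthogonal but tilt the viewing directions toward each other.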

What would settle it

If the generated multi-view images for edited prompts show visible inconsistencies or artifacts when viewed from angles not used in training, or if the lifted 3D meshes fail to preserve the original object's identity under new edits, the claim would be challenged.

Figures

Figures reproduced from arXiv: 2605.16990 by Jinxin Ai, Matthias Nießner, Ziya Erkoç.

**Figure 1.** DreamEdit3D produces multi-view consistent edits guided by natural language, given a source 3D object. We apply personalization to multi-view diffusion models to preserve the identity of the input shapes. We show that multiple diverse edits can be generated from one source by preserving the input.

**Figure 2.** Method overview. Top: Given a 3D mesh, we render four orthogonal views and obtain object masks via SAM. Middle: In Phase 1 (TI), a token embedding s* is learned for the object through textual inversion on 4 views with a frozen UNet and attention alignment loss. In Phase 2 (DB), the full UNet is fine-tuned jointly across all 4 views with prior preservation. Bottom: At inference, tokens are composed with ed…

**Figure 3.** Qualitative comparison.

**Figure 4.** Qualitative ablation study on the Robot Sitting case (“a photo of robot” → “a photo of robot sitting”). (a) Training with only a single view (front, back, or side) reduces 3D consistency. (b) Removing mask-based losses degrades edit localization. (c) Without TI, identity is partially lost; TI-only (no DreamBooth) fails to preserve the object.

**Figure 5.** Trade-off between editing quality (measured by CLIPdir-cos) and computational cost across all compared methods. Our method achieves the highest editing fidelity while requiring only ∼5 minutes per edit, comparable to MVEdit [6] and over an order of magnitude faster than Vox-E [36], which demands ∼70 minutes due to its iterative SDS-based voxel optimization. PrEditor3D [8] is the fastest at …

Original abstract

While 2D diffusion models have achieved remarkable success in identity-preserving personalization, extending this capability to 3D assets remains a significant challenge due to the complexities of multi-view consistency and spatial control. Inspired by these 2D advancements, we present a novel personalization method for text-guided 3D editing that enables compositional, object-level control through natural language. Given a 3D input, we render orthogonal views and extract object-level segmentation masks to isolate semantic components. We then learn distinct token embeddings for each component through a tailored two-phase optimization strategy: multi-view textual inversion with attention alignment, followed by full fine-tuning of multi-view diffusion model. During inference, these disentangled tokens seamlessly compose with editing prompts to generate multi-view consistent images, which are subsequently lifted into high-fidelity textured 3D meshes. Extensive evaluations across diverse editing scenarios demonstrate that our method successfully transfers the flexibility of 2D personalization to 3D, achieving state-of-the-art edit faithfulness and identity preservation compared to existing baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DreamEdit3D adapts 2D textual inversion to 3D component editing via two-phase optimization, but the disentanglement claim rests on an assumption without reported checks.

This paper's main contribution is a two-phase optimization that learns separate token embeddings for 3D object components from orthogonal views, allowing text prompts to edit specific parts while maintaining multi-view consistency for later 3D reconstruction. The approach builds on 2D textual inversion but adapts it with attention alignment for better control in 3D. It does well in describing a pipeline that isolates components via segmentation masks and then composes the learned tokens during inference. This could help with practical editing tasks where you want to change one part of a model without affecting others.

The soft spot is the lack of concrete evidence for the disentanglement. The method assumes that the multi-view inversion plus attention alignment produces tokens that stay independent when recombined with editing prompts. If there's leakage between components, the generated views could become inconsistent, and the lift to meshes would suffer. The abstract claims state-of-the-art results but does not include the quantitative comparisons or ablation studies that would confirm this holds up.

A reader working on diffusion models for 3D graphics or design tools would get the most from the specific optimization strategy and how it handles the consistency problem. It is an incremental step rather than a complete shift in the field. I would send this to peer review. The core idea is clear enough that referees could help strengthen the evaluation section and check the assumptions with additional metrics.

Referee Report

2 major / 1 minor

Summary. The manuscript presents DreamEdit3D, a method for text-guided 3D editing by personalizing multi-view diffusion models. Given a 3D input, orthogonal views are rendered and object-level segmentation masks are extracted to isolate semantic components. Distinct token embeddings are learned for each component using a two-phase optimization: multi-view textual inversion with attention alignment, followed by full fine-tuning. These tokens are composed with editing prompts to generate multi-view consistent images, which are then lifted to textured 3D meshes. The authors claim this achieves state-of-the-art edit faithfulness and identity preservation compared to baselines across diverse scenarios.

Significance. If the disentanglement of component tokens and multi-view consistency hold under quantitative scrutiny, the work would meaningfully advance 3D editing by transferring compositional 2D personalization techniques to 3D assets with natural-language control.

major comments (2)

§3.2 (two-phase optimization): The claim that multi-view textual inversion with attention alignment followed by fine-tuning yields distinct, composable token embeddings from segmented orthogonal renders is load-bearing for the composability assertion, yet no quantitative verification (e.g., token-swap consistency scores or attention-map isolation metrics) is supplied to rule out cross-component leakage or view-dependent entanglement.
§4 (Experiments): The manuscript asserts state-of-the-art edit faithfulness and identity preservation, but the evaluations lack reported quantitative tables, specific baseline comparisons, error analysis, or metrics that directly test the separability of the learned tokens under editing prompts.

minor comments (1)

Abstract: The phrase 'object-level segmentation masks' would benefit from a short clarification on how masks are obtained and aligned across orthogonal views to ensure consistent component isolation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and outline the revisions we will make to strengthen the quantitative support for our claims.

Referee: §3.2 (two-phase optimization): The claim that multi-view textual inversion with attention alignment followed by fine-tuning yields distinct, composable token embeddings from segmented orthogonal renders is load-bearing for the composability assertion, yet no quantitative verification (e.g., token-swap consistency scores or attention-map isolation metrics) is supplied to rule out cross-component leakage or view-dependent entanglement.

Authors: We agree that explicit quantitative verification of token disentanglement would better support the composability claims. In the revised manuscript we will add token-swap consistency scores computed by exchanging learned embeddings across components and measuring multi-view image consistency, together with attention-map isolation metrics that quantify the fraction of attention mass remaining within the intended segmentation mask. These will be reported on a held-out test set of 20 objects to demonstrate reduced cross-component leakage relative to single-phase baselines. revision: yes
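The attention-map isolation metric described in this response, the fraction of attention mass remaining within the intended segmentation mask, can be sketched directly. The function names and the nested-list representation are assumptions; how attention maps would be extracted from the UNet and averaged over layers and timesteps is not specified in the rebuttal.

```python
def attention_isolation(attn, mask):
    """Fraction of total attention mass that falls inside the binary mask."""
    inside = sum(
        a
        for a_row, m_row in zip(attn, mask)
        for a, m in zip(a_row, m_row)
        if m
    )
    total = sum(a for row in attn for a in row)
    return inside / total if total > 0 else 0.0


def mean_isolation(attn_per_view, mask_per_view):
    """Average the isolation score over all rendered views."""
    scores = [attention_isolation(a, m) for a, m in zip(attn_per_view, mask_per_view)]
    return sum(scores) / len(scores)


mask = [[1, 0], [0, 0]]
attn = [[3.0, 1.0], [0.0, 0.0]]  # three quarters of the mass lies inside the mask
score = mean_isolation([attn] * 4, [mask] * 4)
```

A score near 1 indicates a token that attends almost exclusively to its own component; lower values would point to the cross-component leakage the referee asks about.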

Referee: §4 (Experiments): The manuscript asserts state-of-the-art edit faithfulness and identity preservation, but the evaluations lack reported quantitative tables, specific baseline comparisons, error analysis, or metrics that directly test the separability of the learned tokens under editing prompts.

Authors: We acknowledge the need for more comprehensive quantitative reporting. The revised experimental section will include tables with numerical results for edit faithfulness (CLIP similarity to target prompt) and identity preservation (DINO feature distance to input) against the cited baselines, accompanied by per-scenario error analysis. We will also add a separability test that measures editing success when tokens are deliberately swapped or omitted, directly quantifying the benefit of the two-phase optimization. revision: yes
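The metrics named in this response reduce to cosine computations on precomputed feature vectors. The sketch below shows a directional CLIP similarity in the style popularized by StyleGAN-NADA [11] (the cosine between the image-space edit direction and the text-space edit direction) and a cosine-based identity distance. It assumes the embeddings have already been extracted by CLIP or DINO; the function names are hypothetical, and whether the paper's CLIPdir-cos follows this exact definition is an assumption.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def directional_similarity(img_src, img_edit, txt_src, txt_tgt):
    """Cosine between the image edit direction and the text edit direction."""
    d_img = [b - a for a, b in zip(img_src, img_edit)]
    d_txt = [b - a for a, b in zip(txt_src, txt_tgt)]
    return cosine(d_img, d_txt)


def identity_distance(feat_src, feat_edit):
    """One minus cosine similarity between source and edited image features."""
    return 1.0 - cosine(feat_src, feat_edit)


# Toy 2D vectors standing in for real CLIP / DINO embeddings.
dir_score = directional_similarity([1.0, 0.0], [1.0, 1.0], [1.0, 0.0], [1.0, 2.0])
id_dist = identity_distance([1.0, 0.0], [1.0, 0.0])
```

For multi-view outputs, both quantities would typically be averaged over the rendered views of the source and edited objects.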

Circularity Check

0 steps flagged

No equations or self-referential reductions; method builds on external 2D techniques

The provided abstract and description contain no equations, derivations, or fitted-parameter predictions that reduce claims to inputs by construction. The two-phase optimization (multi-view textual inversion with attention alignment followed by fine-tuning) is presented as a procedural strategy for learning composable tokens, with success asserted via evaluations rather than tautological definitions. No load-bearing self-citations or uniqueness theorems from the same authors are invoked in the given text to force the central claims. This qualifies as a normal non-finding of significant circularity.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The central claim rests on the effectiveness of learned token embeddings and the assumption that segmentation plus multi-view fine-tuning produces disentangled, consistent edits, but the abstract gives no explicit free parameters, axioms, or new entities beyond standard diffusion model components.

free parameters (1)

component token embeddings
Learned during the two-phase optimization for each segmented object part; values are fitted to the input 3D asset.

pith-pipeline@v0.9.0 · 5717 in / 1275 out tokens · 57381 ms · 2026-05-19T20:24:04.921343+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

we render four orthogonal views... extract object-level segmentation masks... two-phase optimization strategy: multi-view textual inversion with attention alignment, followed by full fine-tuning

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · 14 internal anchors

[1] Avrahami, O., Aberman, K., Fried, O., Cohen-Or, D., Lischinski, D.: Break-a-scene: Extracting multiple concepts from a single image. In: SIGGRAPH Asia 2023 Conference Papers. SA ’23, Association for Computing Machinery, New York, NY, USA (2023). https://doi.org/10.1145/3610548.3618154
[2] Baldridge, J., Bauer, J., Bhutani, M., Brichtova, N., Bunner, A., Castrejon, L., Chan, K., Chen, Y., Dieleman, S., Du, Y., et al.: Imagen 3. arXiv preprint arXiv:2408.07009 (2024)
[3] Barda, A., Gadelha, M., Kim, V.G., Aigerman, N., Bermano, A.H., Groueix, T.: Instant3dit: Multiview inpainting for fast editing of 3d objects (2024), https://arxiv.org/abs/2412.00518
[4] Betker, J., Goh, G., Jing, L., Brooks, T., Wang, J., Li, L., Ouyang, L., Zhuang, J., Lee, J., Guo, Y., Manassra, W., Dhariwal, P., Chu, C., Jiao, Y., Ramesh, A.: Improving image generation with better captions. https://api.semanticscholar.org/CorpusID:264403242
[5] Chan, E.R., Lin, C.Z., Chan, M.A., Nagano, K., Pan, B., De Mello, S., Gallo, O., Guibas, L.J., Tremblay, J., Khamis, S., et al.: Efficient geometry-aware 3d generative adversarial networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 16123–16133 (2022)
[6] Chen, H., Shi, R., Liu, Y., Shen, B., Gu, J., Wetzstein, G., Su, H., Guibas, L.: Generic 3d diffusion adapter using controlled multi-view editing (2024)
[7] Chen, M., Shapovalov, R., Laina, I., Monnier, T., Wang, J., Novotny, D., Vedaldi, A.: Partgen: Part-level 3d generation and reconstruction with multi-view diffusion models (2024), https://arxiv.org/abs/2412.18608
[8] Erkoç, Z., Gümeli, C., Wang, C., Nießner, M., Dai, A., Wonka, P., Lee, H.Y., Zhuang, P.: Preditor3d: Fast and precise 3d shape editing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 640–649 (2025)
[9] Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., Podell, D., Dockhorn, T., English, Z., Lacey, K., Goodwin, A., Marek, Y., Rombach, R.: Scaling rectified flow transformers for high-resolution image synthesis (2024), https://arxiv.org/abs/2403.03206
[10] Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A.H., Chechik, G., Cohen-Or, D.: An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618 (2022)
[11] Gal, R., Patashnik, O., Maron, H., Bermano, A.H., Chechik, G., Cohen-Or, D.: StyleGAN-NADA: CLIP-guided domain adaptation of image generators. In: ACM Transactions on Graphics (TOG). vol. 41, pp. 1–13 (2022)
[12] Gao, R., Holynski, A., Henzler, P., Brussee, A., Martin-Brualla, R., Srinivasan, P., Barron, J.T., Poole, B.: Cat3d: Create anything in 3d with multi-view diffusion models (2024), https://arxiv.org/abs/2405.10314
[13] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020)
[14] Guo, Z., Wu, Y., Chen, Z., Chen, L., Zhang, P., He, Q.: Pulid: Pure and lightning id customization via contrastive alignment (2024), https://arxiv.org/abs/2404.16022
[15] Haque, A., Tancik, M., Efros, A.A., Holynski, A., Kanazawa, A.: Instruct-nerf2nerf: Editing 3d scenes with instructions (2023), https://arxiv.org/abs/2303.12789
[16] Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., Cohen-Or, D.: Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626 (2022)
[17] Ho, J., Salimans, T.: Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598 (2022)
[18] Hong, Y., Zhang, K., Gu, J., Bi, S., Zhou, Y., Liu, D., Liu, F., Sunkavalli, K., Bui, T., Tan, H.: Lrm: Large reconstruction model for single image to 3d. arXiv preprint arXiv:2311.04400 (2023)
[19] Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4401–4410 (2019)
[20] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., Dollár, P., Girshick, R.: Segment anything. arXiv:2304.02643 (2023)
[21] Kumari, N., Zhang, B., Zhang, R., Shechtman, E., Zhu, J.Y.: Multi-concept customization of text-to-image diffusion (2023)
[22] Lin, C.H., Gao, J., Tang, L., Takikawa, T., Zeng, X., Huang, X., Kreis, K., Fidler, S., Liu, M.Y., Lin, T.Y.: Magic3d: High-resolution text-to-3d content creation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 300–309 (2023)
[23] Liu, R., Wu, R., Van Hoorick, B., Tokmakov, P., Zakharov, S., Vondrick, C.: Zero-1-to-3: Zero-shot one image to 3d object. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 9298–9309 (2023)
[24] Liu, Y., Lin, C., Zeng, Z., Long, X., Liu, L., Komura, T., Wang, W.: Syncdreamer: Generating multiview-consistent images from a single-view image. arXiv preprint arXiv:2309.03453 (2023)
[25] Long, X., Guo, Y.C., Lin, C., Liu, Y., Dou, Z., Liu, L., Ma, Y., Zhang, S.H., Habermann, M., Theobalt, C., Wang, W.: Wonder3d: Single image to 3d using cross-domain diffusion (2023), https://arxiv.org/abs/2310.15008
[26] Lugmayr, A., Danelljan, M., Romero, A., Yu, F., Timofte, R., Van Gool, L.: RePaint: Inpainting using denoising diffusion probabilistic models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11461–11471 (2022)
[27] Michel, O., Bar-On, R., Liu, R., Benaim, S., Hanocka, R.: Text2mesh: Text-driven neural stylization for meshes (2021), https://arxiv.org/abs/2112.03221
[28] Mikaeili, A., Perel, O., Safaee, M., Cohen-Or, D., Mahdavi-Amiri, A.: Sked: Sketch-guided text-based 3d editing (2023), https://arxiv.org/abs/2303.10735
[29] Ng, K.W., Zhu, X., Song, Y.Z., Xiang, T.: Partcraft: Crafting creative objects by parts (2024), https://arxiv.org/abs/2407.04604
[30] Poole, B., Jain, A., Barron, J.T., Mildenhall, B.: Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988 (2022)
[31] Raj, A., Kaza, S., Poole, B., Niemeyer, M., Ruiz, N., Mildenhall, B., Zada, S., Aberman, K., Rubinstein, M., Barron, J., et al.: Dreambooth3d: Subject-driven text-to-3d generation. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 2349–2359 (2023)
[32] Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125 (2022)
[33] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10684–10695 (2022)
[34] Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., Aberman, K.: Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 22500–22510 (2023)
[35] Ruiz, N., Li, Y., Jampani, V., Wei, W., Hou, T., Pritch, Y., Wadhwa, N., Rubinstein, M., Aberman, K.: Hyperdreambooth: Hypernetworks for fast personalization of text-to-image models (2024), https://arxiv.org/abs/2307.06949
[36] Sella, E., Fiebelman, G., Hedman, P., Averbuch-Elor, H.: Vox-e: Text-guided voxel editing of 3d objects. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 430–440 (2023)
[37] Shah, V., Ruiz, N., Cole, F., Lu, E., Lazebnik, S., Li, Y., Jampani, V.: Ziplora: Any subject in any style by effectively merging loras (2026), https://arxiv.org/abs/2311.13600
[38] Shen, T., Gao, J., Yin, K., Liu, M.Y., Fidler, S.: Deep marching tetrahedra: a hybrid representation for high-resolution 3d shape synthesis. Advances in Neural Information Processing Systems 34, 6087–6101 (2021)
[39] Shi, Y., Wang, P., Ye, J., Long, M., Li, K., Yang, X.: Mvdream: Multi-view diffusion for 3d generation. arXiv preprint arXiv:2308.16512 (2023)
[40] Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020)
[41] Voynov, A., Chu, Q., Cohen-Or, D., Aberman, K.: p+: Extended textual conditioning in text-to-image generation. arXiv preprint arXiv:2303.09522 (2023)
[42] Wang, J., Fang, J., Zhang, X., Xie, L., Tian, Q.: Gaussianeditor: Editing 3d gaussians delicately with text instructions (2024), https://arxiv.org/abs/2311.16037
[43] Wang, P., Shi, Y.: Imagedream: Image-prompt multi-view diffusion for 3d generation (2023), https://arxiv.org/abs/2312.02201
[44] Wang, Q., Bai, X., Wang, H., Qin, Z., Chen, A., Li, H., Tang, X., Hu, Y.: Instantid: Zero-shot identity-preserving generation in seconds (2024), https://arxiv.org/abs/2401.07519
[45] Wang, Z., Lu, C., Wang, Y., Bao, F., Li, C., Su, H., Zhu, J.: Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. Advances in neural information processing systems 36, 8406–8441 (2023)
[46] Wu, J., Bian, J.W., Li, X., Wang, G., Reid, I., Torr, P., Prisacariu, V.A.: Gaussctrl: Multi-view consistent text-driven 3d gaussian splatting editing. In: European conference on computer vision. pp. 55–71. Springer (2024)
[47] Yang, Y., Long, X.X., Dou, Z., Lin, C., Liu, Y., Yan, Q., Ma, Y., Wang, H., Wu, Z., Yin, W.: Wonder3d++: Cross-domain diffusion for high-fidelity 3d generation from a single image. IEEE Transactions on Pattern Analysis and Machine Intelligence 48(2), 1674–1688 (Feb 2026). https://doi.org/10.1109/tpami.2025.3618675, http://dx.doi.org/10.1109/TPAMI.2025.3618...
[48] Ye, H., Zhang, J., Liu, S., Han, X., Yang, W.: Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721 (2023)
[49] Ye, J., Xie, S., Zhao, R., Wang, Z., Yan, H., Zu, W., Ma, L., Zhu, J.: Nano3d: A training-free approach for efficient 3d editing without masks. arXiv preprint arXiv:2510.15019 (2025)
[50] Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 3836–3847 (2023)
[51] Zheng, Y., Huang, M., Chen, N., Mao, Z.: Pro3d-editor: A progressive-views perspective for consistent and precise 3d editing (2025), https://arxiv.org/abs/2506.00512
[52] Zhuang, P., Han, S., Wang, C., Siarohin, A., Zou, J., Vasilkovsky, M., Shakhrai, V., Korolev, S., Tulyakov, S., Lee, H.Y.: Gtr: Improving large 3d reconstruction models through geometry and texture refinement. arXiv preprint arXiv:2406.05649 (2024)

[1] [1]

In: SIGGRAPH Asia 2023 Conference Papers

Avrahami, O., Aberman, K., Fried, O., Cohen-Or, D., Lischinski, D.: Break-a- scene: Extracting multiple concepts from a single image. In: SIGGRAPH Asia 2023 Conference Papers. SA ’23, Association for Computing Machinery, New York, NY, USA (2023).https://doi.org/10.1145/3610548.3618154,https://doi.org/ 10.1145/3610548.3618154

work page doi:10.1145/3610548.3618154 2023

[2] [2]

arXiv preprint arXiv:2408.07009 (2024)

Baldridge, J., Bauer, J., Bhutani, M., Brichtova, N., Bunner, A., Castrejon, L., Chan, K., Chen, Y., Dieleman, S., Du, Y., et al.: Imagen 3. arXiv preprint arXiv:2408.07009 (2024)

work page arXiv 2024

[3] [3]

Barda, A., Gadelha, M., Kim, V.G., Aigerman, N., Bermano, A.H., Groueix, T.: Instant3dit: Multiview inpainting for fast editing of 3d objects (2024),https: //arxiv.org/abs/2412.00518

work page arXiv 2024

[4] [4]

Betker, J., Goh, G., Jing, L., TimBrooks, Wang, J., Li, L., LongOuyang, Jun- tangZhuang, JoyceLee, YufeiGuo, WesamManassra, PrafullaDhariwal, CaseyChu, YunxinJiao, Ramesh, A.: Improving image generation with better captions.https: //api.semanticscholar.org/CorpusID:264403242

work page

[5] [5]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Chan, E.R., Lin, C.Z., Chan, M.A., Nagano, K., Pan, B., De Mello, S., Gallo, O., Guibas, L.J., Tremblay, J., Khamis, S., et al.: Efficient geometry-aware 3d generative adversarial networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 16123–16133 (2022)

work page 2022

[6] [6]

Chen, H., Shi, R., Liu, Y., Shen, B., Gu, J., Wetzstein, G., Su, H., Guibas, L.: Generic 3d diffusion adapter using controlled multi-view editing (2024)

work page 2024

[7] [7]

Chen, M., Shapovalov, R., Laina, I., Monnier, T., Wang, J., Novotny, D., Vedaldi, A.: Partgen: Part-level 3d generation and reconstruction with multi-view diffusion models (2024),https://arxiv.org/abs/2412.18608

work page arXiv 2024

[8] [8]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Erkoç, Z., Gümeli, C., Wang, C., Nießner, M., Dai, A., Wonka, P., Lee, H.Y., Zhuang, P.: Preditor3d: Fast and precise 3d shape editing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 640–649 (2025)

[9]

Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., Podell, D., Dockhorn, T., English, Z., Lacey, K., Goodwin, A., Marek, Y., Rombach, R.: Scaling rectified flow transformers for high-resolution image synthesis (2024), https://arxiv.org/abs/2403.03206

[10]

Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A.H., Chechik, G., Cohen-Or, D.: An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618 (2022)

[11]

Gal, R., Patashnik, O., Maron, H., Bermano, A.H., Chechik, G., Cohen-Or, D.: StyleGAN-NADA: CLIP-guided domain adaptation of image generators. In: ACM Transactions on Graphics (TOG). vol. 41, pp. 1–13 (2022)

[12]

Gao, R., Holynski, A., Henzler, P., Brussee, A., Martin-Brualla, R., Srinivasan, P., Barron, J.T., Poole, B.: Cat3d: Create anything in 3d with multi-view diffusion models (2024), https://arxiv.org/abs/2405.10314

[13]

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020)

[14]

Guo, Z., Wu, Y., Chen, Z., Chen, L., Zhang, P., He, Q.: Pulid: Pure and lightning id customization via contrastive alignment (2024), https://arxiv.org/abs/2404.16022

[15]

Haque, A., Tancik, M., Efros, A.A., Holynski, A., Kanazawa, A.: Instruct-nerf2nerf: Editing 3d scenes with instructions (2023), https://arxiv.org/abs/2303.12789

[16]

Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., Cohen-Or, D.: Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626 (2022)

[17]

Ho, J., Salimans, T.: Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598 (2022)

[18]

Hong, Y., Zhang, K., Gu, J., Bi, S., Zhou, Y., Liu, D., Liu, F., Sunkavalli, K., Bui, T., Tan, H.: Lrm: Large reconstruction model for single image to 3d. arXiv preprint arXiv:2311.04400 (2023)

[19]

Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4401–4410 (2019)

[20]

Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., Dollár, P., Girshick, R.: Segment anything. arXiv:2304.02643 (2023)

[21]

Kumari, N., Zhang, B., Zhang, R., Shechtman, E., Zhu, J.Y.: Multi-concept customization of text-to-image diffusion (2023)

[22]

Lin, C.H., Gao, J., Tang, L., Takikawa, T., Zeng, X., Huang, X., Kreis, K., Fidler, S., Liu, M.Y., Lin, T.Y.: Magic3d: High-resolution text-to-3d content creation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 300–309 (2023)

[23]

Liu, R., Wu, R., Van Hoorick, B., Tokmakov, P., Zakharov, S., Vondrick, C.: Zero-1-to-3: Zero-shot one image to 3d object. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 9298–9309 (2023)

[24]

Liu, Y., Lin, C., Zeng, Z., Long, X., Liu, L., Komura, T., Wang, W.: Syncdreamer: Generating multiview-consistent images from a single-view image. arXiv preprint arXiv:2309.03453 (2023)

[25]

Long, X., Guo, Y.C., Lin, C., Liu, Y., Dou, Z., Liu, L., Ma, Y., Zhang, S.H., Habermann, M., Theobalt, C., Wang, W.: Wonder3d: Single image to 3d using cross-domain diffusion (2023), https://arxiv.org/abs/2310.15008

[26]

Lugmayr, A., Danelljan, M., Romero, A., Yu, F., Timofte, R., Van Gool, L.: RePaint: Inpainting using denoising diffusion probabilistic models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11461–11471 (2022)

[27]

Michel, O., Bar-On, R., Liu, R., Benaim, S., Hanocka, R.: Text2mesh: Text-driven neural stylization for meshes (2021), https://arxiv.org/abs/2112.03221

[28]

Mikaeili, A., Perel, O., Safaee, M., Cohen-Or, D., Mahdavi-Amiri, A.: Sked: Sketch-guided text-based 3d editing (2023), https://arxiv.org/abs/2303.10735

[29]

Ng, K.W., Zhu, X., Song, Y.Z., Xiang, T.: Partcraft: Crafting creative objects by parts (2024), https://arxiv.org/abs/2407.04604

[30]

Poole, B., Jain, A., Barron, J.T., Mildenhall, B.: Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988 (2022)

[31]

Raj, A., Kaza, S., Poole, B., Niemeyer, M., Ruiz, N., Mildenhall, B., Zada, S., Aberman, K., Rubinstein, M., Barron, J., et al.: Dreambooth3d: Subject-driven text-to-3d generation. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 2349–2359 (2023)

[32]

Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with CLIP latents. In: arXiv preprint arXiv:2204.06125 (2022)

[33]

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10684–10695 (2022)

[34]

Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., Aberman, K.: Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 22500–22510 (2023)

[35]

Ruiz, N., Li, Y., Jampani, V., Wei, W., Hou, T., Pritch, Y., Wadhwa, N., Rubinstein, M., Aberman, K.: Hyperdreambooth: Hypernetworks for fast personalization of text-to-image models (2024), https://arxiv.org/abs/2307.06949

[36]

Sella, E., Fiebelman, G., Hedman, P., Averbuch-Elor, H.: Vox-e: Text-guided voxel editing of 3d objects. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 430–440 (2023)

[37]

Shah, V., Ruiz, N., Cole, F., Lu, E., Lazebnik, S., Li, Y., Jampani, V.: Ziplora: Any subject in any style by effectively merging loras (2026), https://arxiv.org/abs/2311.13600

[38]

Shen, T., Gao, J., Yin, K., Liu, M.Y., Fidler, S.: Deep marching tetrahedra: a hybrid representation for high-resolution 3d shape synthesis. Advances in Neural Information Processing Systems 34, 6087–6101 (2021)

[39]

Shi, Y., Wang, P., Ye, J., Long, M., Li, K., Yang, X.: Mvdream: Multi-view diffusion for 3d generation. arXiv preprint arXiv:2308.16512 (2023)

[40]

Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020)

[41]

Voynov, A., Chu, Q., Cohen-Or, D., Aberman, K.: p+: Extended textual conditioning in text-to-image generation. arXiv preprint arXiv:2303.09522 (2023)

[42]

Wang, J., Fang, J., Zhang, X., Xie, L., Tian, Q.: Gaussianeditor: Editing 3d gaussians delicately with text instructions (2024), https://arxiv.org/abs/2311.16037

[43]

Wang, P., Shi, Y.: Imagedream: Image-prompt multi-view diffusion for 3d generation (2023), https://arxiv.org/abs/2312.02201

[44]

Wang, Q., Bai, X., Wang, H., Qin, Z., Chen, A., Li, H., Tang, X., Hu, Y.: Instantid: Zero-shot identity-preserving generation in seconds (2024), https://arxiv.org/abs/2401.07519

[45]

Wang, Z., Lu, C., Wang, Y., Bao, F., Li, C., Su, H., Zhu, J.: Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. Advances in neural information processing systems 36, 8406–8441 (2023)

[46]

Wu, J., Bian, J.W., Li, X., Wang, G., Reid, I., Torr, P., Prisacariu, V.A.: Gaussctrl: Multi-view consistent text-driven 3d gaussian splatting editing. In: European conference on computer vision. pp. 55–71. Springer (2024)

[47]

Yang, Y., Long, X.X., Dou, Z., Lin, C., Liu, Y., Yan, Q., Ma, Y., Wang, H., Wu, Z., Yin, W.: Wonder3d++: Cross-domain diffusion for high-fidelity 3d generation from a single image. IEEE Transactions on Pattern Analysis and Machine Intelligence 48(2), 1674–1688 (Feb 2026). https://doi.org/10.1109/tpami.2025.3618675, http://dx.doi.org/10.1109/TPAMI.2025.3618...

[48]

Ye, H., Zhang, J., Liu, S., Han, X., Yang, W.: Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721 (2023)

[49]

Ye, J., Xie, S., Zhao, R., Wang, Z., Yan, H., Zu, W., Ma, L., Zhu, J.: Nano3d: A training-free approach for efficient 3d editing without masks. arXiv preprint arXiv:2510.15019 (2025)

[50]

Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 3836–3847 (2023)

[51]

Zheng, Y., Huang, M., Chen, N., Mao, Z.: Pro3d-editor: A progressive-views perspective for consistent and precise 3d editing (2025), https://arxiv.org/abs/2506.00512

[52]

Zhuang, P., Han, S., Wang, C., Siarohin, A., Zou, J., Vasilkovsky, M., Shakhrai, V., Korolev, S., Tulyakov, S., Lee, H.Y.: Gtr: Improving large 3d reconstruction models through geometry and texture refinement. arXiv preprint arXiv:2406.05649 (2024)

6 Appendix

We organize the supplementary material as follows: Sec. 6.1 analyzes the editing q...
