
arxiv: 2605.16990 · v1 · submitted 2026-05-16 · 💻 cs.CV

DreamEdit3D: Personalization of Multi-View Diffusion Models for 3D Editing

Pith reviewed 2026-05-19 20:24 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D editing · diffusion models · personalization · multi-view consistency · text-guided editing · token embeddings · 3D mesh generation · identity preservation

The pith

Personalizing multi-view diffusion models enables text-guided 3D editing with object-level control and preserved consistency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a method to extend 2D identity-preserving personalization techniques to 3D assets. It renders orthogonal views of a 3D input, extracts segmentation masks for semantic components, and learns distinct token embeddings through a two-phase optimization. This allows composing tokens with editing prompts to generate consistent multi-view images that lift to edited 3D meshes. If successful, this would make 3D editing as flexible as 2D diffusion personalization, improving applications in content creation and design.

Core claim

The central claim is that learning disentangled token embeddings for isolated semantic components in orthogonal views enables compositional, text-guided 3D editing while maintaining multi-view consistency and identity preservation. The embeddings are learned by multi-view textual inversion with attention alignment, followed by full fine-tuning, and the method is reported to outperform baselines in edit faithfulness and identity preservation.

What carries the argument

Disentangled token embeddings for each object component, learned via two-phase optimization of multi-view diffusion models.
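
The attention alignment term is not spelled out in this summary. As a minimal sketch, assuming it penalizes the squared difference between a token's cross-attention map and that component's segmentation mask in each rendered view, it could look like the following. The function name, the nested-list layout, and the assumption that attention values are already scaled to [0, 1] are illustrative and not taken from the paper.

```python
def attention_alignment_loss(attn_maps, masks):
    """Mean squared error between per-view attention maps and binary masks.

    attn_maps, masks: lists with one entry per view; each entry is a 2D list
    (rows of floats for attention, rows of 0/1 for the mask) of equal shape.
    Assumption: attention values are already normalized to [0, 1].
    """
    total = 0.0
    count = 0
    for attn, mask in zip(attn_maps, masks):
        for attn_row, mask_row in zip(attn, mask):
            for a, m in zip(attn_row, mask_row):
                total += (a - m) ** 2
                count += 1
    if count == 0:
        raise ValueError("attention maps and masks must not be empty")
    return total / count


# Two toy 1x2 "views": the first is diffuse, the second matches its mask.
views_attn = [[[0.5, 0.5]], [[1.0, 0.0]]]
views_mask = [[[1, 0]], [[1, 0]]]
print(attention_alignment_loss(views_attn, views_mask))
```

A loss of this shape is zero only when every view's attention sits exactly on the mask, which is one way the per-component tokens could be discouraged from attending to other parts of the object.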

If this is right

  • Edited 3D models maintain consistency across multiple views when generated from composed prompts.
  • The approach supports object-level control through natural language without manual 3D manipulation.
  • It achieves state-of-the-art performance in edit faithfulness and identity preservation.
  • High-fidelity textured meshes can be produced from the generated consistent images.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future work could test this on more complex scenes with interacting objects.
  • Integration with real-time 3D rendering pipelines might enable interactive editing applications.
  • Similar token-based approaches could apply to video or 4D content for temporal consistency.

Load-bearing premise

Rendering orthogonal views and extracting object-level segmentation masks will allow learning of distinct, composable token embeddings that preserve multi-view consistency.
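
To make the premise concrete, here is a small sketch of where four orthogonal cameras might be placed around an object centered at the origin. The radius, the zero elevation, and the y-up convention are assumptions chosen for illustration; the summary only states that four orthogonal views are rendered.

```python
import math


def orthogonal_camera_positions(radius=2.0, elevation_deg=0.0, n_views=4):
    """Camera centers evenly spaced in azimuth around the origin (y-up)."""
    elev = math.radians(elevation_deg)
    positions = []
    for k in range(n_views):
        azim = 2.0 * math.pi * k / n_views  # 0, 90, 180, 270 degrees for 4 views
        x = radius * math.cos(elev) * math.sin(azim)
        y = radius * math.sin(elev)
        z = radius * math.cos(elev) * math.cos(azim)
        positions.append((x, y, z))
    return positions


def view_directions(positions):
    """Unit vectors pointing from each camera toward the origin."""
    directions = []
    for x, y, z in positions:
        norm = math.sqrt(x * x + y * y + z * z)
        directions.append((-x / norm, -y / norm, -z / norm))
    return directions


cams = orthogonal_camera_positions()
for cam, direction in zip(cams, view_directions(cams)):
    print(cam, direction)
```

With zero elevation, adjacent viewing directions are perpendicular and opposite ones are antiparallel, so the four renders cover front, side, back, and the other side of the asset.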

What would settle it

If the generated multi-view images for edited prompts show visible inconsistencies or artifacts when viewed from angles not used in training, or if the lifted 3D meshes fail to preserve the original object's identity under new edits, the claim would be challenged.

Figures

Figures reproduced from arXiv: 2605.16990 by Jinxin Ai, Matthias Nießner, Ziya Erkoç.

Figure 1: DreamEdit3D produces multi-view consistent edits guided by natural language, given a source 3D object. We apply personalization to multi-view diffusion models to preserve the identity of the input shapes. We show that multiple diverse edits can be generated from one source by preserving the input.
Figure 2: Method overview. Top: Given a 3D mesh, we render four orthogonal views and obtain object masks via SAM. Middle: In Phase 1 (TI), a token embedding s∗ is learned for the object through textual inversion on 4 views with a frozen UNet and attention alignment loss. In Phase 2 (DB), the full UNet is fine-tuned jointly across all 4 views with prior preservation. Bottom: At inference, tokens are composed with ed…
Figure 3: Qualitative comparison.
Figure 4: Qualitative ablation study on the Robot Sitting case (“a photo of robot” → “a photo of robot sitting”). (a) Training with only a single view (front, back, or side) reduces 3D consistency. (b) Removing mask-based losses degrades edit localization. (c) Without TI, identity is partially lost; TI-only (no DreamBooth) fails to preserve the object.
Figure 5 illustrates the trade-off between editing quality (measured by CLIPdir-cos) and computational cost across all compared methods. Our method achieves the highest editing fidelity while requiring only ∼5 minutes per edit, comparable to MVEdit [6] and over an order of magnitude faster than Vox-E [36], which demands ∼70 minutes due to its iterative SDS-based voxel optimization. PrEditor3D [8] is the fastest at …
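
The Figure 5 caption names CLIPdir-cos without defining it. Assuming it denotes the commonly used directional CLIP similarity (the cosine between the change in image embedding and the change in text embedding), a minimal sketch on precomputed embedding vectors follows. The function names and the toy 2D vectors are hypothetical; real embeddings would come from a CLIP image and text encoder, which this sketch does not include.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def directional_similarity(img_src, img_edit, txt_src, txt_edit):
    """Cosine between the image-space edit direction and the text-space one."""
    img_delta = [e - s for e, s in zip(img_edit, img_src)]
    txt_delta = [e - s for e, s in zip(txt_edit, txt_src)]
    return cosine(img_delta, txt_delta)


# Toy embeddings: the image moved in the same direction as the text prompt.
score = directional_similarity([1.0, 0.0], [1.0, 1.0], [2.0, 1.0], [2.0, 3.0])
print(score)
```

A score near 1 means the rendered object changed in the direction the prompt asked for, and a score near -1 means it changed in the opposite direction.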
Original abstract

While 2D diffusion models have achieved remarkable success in identity-preserving personalization, extending this capability to 3D assets remains a significant challenge due to the complexities of multi-view consistency and spatial control. Inspired by these 2D advancements, we present a novel personalization method for text-guided 3D editing that enables compositional, object-level control through natural language. Given a 3D input, we render orthogonal views and extract object-level segmentation masks to isolate semantic components. We then learn distinct token embeddings for each component through a tailored two-phase optimization strategy: multi-view textual inversion with attention alignment, followed by full fine-tuning of multi-view diffusion model. During inference, these disentangled tokens seamlessly compose with editing prompts to generate multi-view consistent images, which are subsequently lifted into high-fidelity textured 3D meshes. Extensive evaluations across diverse editing scenarios demonstrate that our method successfully transfers the flexibility of 2D personalization to 3D, achieving state-of-the-art edit faithfulness and identity preservation compared to existing baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents DreamEdit3D, a method for text-guided 3D editing by personalizing multi-view diffusion models. Given a 3D input, orthogonal views are rendered and object-level segmentation masks are extracted to isolate semantic components. Distinct token embeddings are learned for each component using a two-phase optimization: multi-view textual inversion with attention alignment, followed by full fine-tuning. These tokens are composed with editing prompts to generate multi-view consistent images, which are then lifted to textured 3D meshes. The authors claim this achieves state-of-the-art edit faithfulness and identity preservation compared to baselines across diverse scenarios.

Significance. If the disentanglement of component tokens and multi-view consistency hold under quantitative scrutiny, the work would meaningfully advance 3D editing by transferring compositional 2D personalization techniques to 3D assets with natural-language control.

major comments (2)
  1. [§3.2] §3.2 (two-phase optimization): The claim that multi-view textual inversion with attention alignment followed by fine-tuning yields distinct, composable token embeddings from segmented orthogonal renders is load-bearing for the composability assertion, yet no quantitative verification (e.g., token-swap consistency scores or attention-map isolation metrics) is supplied to rule out cross-component leakage or view-dependent entanglement.
  2. [§4] §4 (Experiments): The manuscript asserts state-of-the-art edit faithfulness and identity preservation, but the evaluations lack reported quantitative tables, specific baseline comparisons, error analysis, or metrics that directly test the separability of the learned tokens under editing prompts.
minor comments (1)
  1. [Abstract] Abstract: The phrase 'object-level segmentation masks' would benefit from a short clarification on how masks are obtained and aligned across orthogonal views to ensure consistent component isolation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and outline the revisions we will make to strengthen the quantitative support for our claims.

Point-by-point responses
  1. Referee: [§3.2] §3.2 (two-phase optimization): The claim that multi-view textual inversion with attention alignment followed by fine-tuning yields distinct, composable token embeddings from segmented orthogonal renders is load-bearing for the composability assertion, yet no quantitative verification (e.g., token-swap consistency scores or attention-map isolation metrics) is supplied to rule out cross-component leakage or view-dependent entanglement.

    Authors: We agree that explicit quantitative verification of token disentanglement would better support the composability claims. In the revised manuscript we will add token-swap consistency scores computed by exchanging learned embeddings across components and measuring multi-view image consistency, together with attention-map isolation metrics that quantify the fraction of attention mass remaining within the intended segmentation mask. These will be reported on a held-out test set of 20 objects to demonstrate reduced cross-component leakage relative to single-phase baselines. revision: yes

  2. Referee: [§4] §4 (Experiments): The manuscript asserts state-of-the-art edit faithfulness and identity preservation, but the evaluations lack reported quantitative tables, specific baseline comparisons, error analysis, or metrics that directly test the separability of the learned tokens under editing prompts.

    Authors: We acknowledge the need for more comprehensive quantitative reporting. The revised experimental section will include tables with numerical results for edit faithfulness (CLIP similarity to target prompt) and identity preservation (DINO feature distance to input) against the cited baselines, accompanied by per-scenario error analysis. We will also add a separability test that measures editing success when tokens are deliberately swapped or omitted, directly quantifying the benefit of the two-phase optimization. revision: yes
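
Two of the measurements promised in the responses above are simple enough to sketch: the fraction of attention mass that stays inside a segmentation mask, and a cosine-based feature distance of the kind that could serve as the DINO identity measure. This is a minimal illustration under stated assumptions: attention maps and masks are plain 2D lists, features are precomputed vectors, and every function name is hypothetical rather than taken from the paper.

```python
import math


def attention_isolation(attn, mask):
    """Fraction of total attention mass that falls inside the mask (0 to 1)."""
    inside = 0.0
    total = 0.0
    for attn_row, mask_row in zip(attn, mask):
        for a, m in zip(attn_row, mask_row):
            total += a
            if m:
                inside += a
    if total == 0.0:
        return 0.0
    return inside / total


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """1 - cosine similarity; smaller means the features are more alike."""
    return 1.0 - cosine_similarity(u, v)


# Toy 2x2 attention map whose mask covers only the top-left pixel.
print(attention_isolation([[0.6, 0.2], [0.1, 0.1]], [[1, 0], [0, 0]]))
# Toy feature vectors standing in for source and edited renders.
print(cosine_distance([1.0, 0.0], [1.0, 0.0]))
```

An isolation score close to 1 for every component would indicate little cross-component leakage, while a low feature distance between source and edited renders would indicate that identity was preserved.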

Circularity Check

0 steps flagged

No equations or self-referential reductions; method builds on external 2D techniques

Full rationale

The provided abstract and description contain no equations, derivations, or fitted-parameter predictions that reduce claims to inputs by construction. The two-phase optimization (multi-view textual inversion with attention alignment followed by fine-tuning) is presented as a procedural strategy for learning composable tokens, with success asserted via evaluations rather than tautological definitions. No load-bearing self-citations or uniqueness theorems from the same authors are invoked in the given text to force the central claims. This qualifies as a normal non-finding of significant circularity.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The central claim rests on the effectiveness of learned token embeddings and the assumption that segmentation plus multi-view fine-tuning produces disentangled, consistent edits, but the abstract gives no explicit free parameters, axioms, or new entities beyond standard diffusion model components.

free parameters (1)
  • component token embeddings
    Learned during the two-phase optimization for each segmented object part; values are fitted to the input 3D asset.

pith-pipeline@v0.9.0 · 5717 in / 1275 out tokens · 57381 ms · 2026-05-19T20:24:04.921343+00:00 · methodology



Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · 14 internal anchors

  1. [1] Avrahami, O., Aberman, K., Fried, O., Cohen-Or, D., Lischinski, D.: Break-a-scene: Extracting multiple concepts from a single image. In: SIGGRAPH Asia 2023 Conference Papers. SA ’23, Association for Computing Machinery, New York, NY, USA (2023). https://doi.org/10.1145/3610548.3618154

  2. [2] Baldridge, J., Bauer, J., Bhutani, M., Brichtova, N., Bunner, A., Castrejon, L., Chan, K., Chen, Y., Dieleman, S., Du, Y., et al.: Imagen 3. arXiv preprint arXiv:2408.07009 (2024)

  3. [3] Barda, A., Gadelha, M., Kim, V.G., Aigerman, N., Bermano, A.H., Groueix, T.: Instant3dit: Multiview inpainting for fast editing of 3d objects (2024), https://arxiv.org/abs/2412.00518

  4. [4] Betker, J., Goh, G., Jing, L., Brooks, T., Wang, J., Li, L., Ouyang, L., Zhuang, J., Lee, J., Guo, Y., Manassra, W., Dhariwal, P., Chu, C., Jiao, Y., Ramesh, A.: Improving image generation with better captions. https://api.semanticscholar.org/CorpusID:264403242

  5. [5] Chan, E.R., Lin, C.Z., Chan, M.A., Nagano, K., Pan, B., De Mello, S., Gallo, O., Guibas, L.J., Tremblay, J., Khamis, S., et al.: Efficient geometry-aware 3d generative adversarial networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 16123–16133 (2022)

  6. [6] Chen, H., Shi, R., Liu, Y., Shen, B., Gu, J., Wetzstein, G., Su, H., Guibas, L.: Generic 3d diffusion adapter using controlled multi-view editing (2024)

  7. [7] Chen, M., Shapovalov, R., Laina, I., Monnier, T., Wang, J., Novotny, D., Vedaldi, A.: Partgen: Part-level 3d generation and reconstruction with multi-view diffusion models (2024), https://arxiv.org/abs/2412.18608

  8. [8] Erkoç, Z., Gümeli, C., Wang, C., Nießner, M., Dai, A., Wonka, P., Lee, H.Y., Zhuang, P.: Preditor3d: Fast and precise 3d shape editing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 640–649 (2025)

  9. [9] Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., Podell, D., Dockhorn, T., English, Z., Lacey, K., Goodwin, A., Marek, Y., Rombach, R.: Scaling rectified flow transformers for high-resolution image synthesis (2024), https://arxiv.org/abs/2403.03206

  10. [10] Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A.H., Chechik, G., Cohen-Or, D.: An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618 (2022)

  11. [11] Gal, R., Patashnik, O., Maron, H., Bermano, A.H., Chechik, G., Cohen-Or, D.: StyleGAN-NADA: CLIP-guided domain adaptation of image generators. In: ACM Transactions on Graphics (TOG). vol. 41, pp. 1–13 (2022)

  12. [12] Gao, R., Holynski, A., Henzler, P., Brussee, A., Martin-Brualla, R., Srinivasan, P., Barron, J.T., Poole, B.: Cat3d: Create anything in 3d with multi-view diffusion models (2024), https://arxiv.org/abs/2405.10314

  13. [13] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020)

  14. [14] Guo, Z., Wu, Y., Chen, Z., Chen, L., Zhang, P., He, Q.: Pulid: Pure and lightning id customization via contrastive alignment (2024), https://arxiv.org/abs/2404.16022

  15. [15] Haque, A., Tancik, M., Efros, A.A., Holynski, A., Kanazawa, A.: Instruct-nerf2nerf: Editing 3d scenes with instructions (2023), https://arxiv.org/abs/2303.12789

  16. [16] Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., Cohen-Or, D.: Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626 (2022)

  17. [17] Ho, J., Salimans, T.: Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598 (2022)

  18. [18] Hong, Y., Zhang, K., Gu, J., Bi, S., Zhou, Y., Liu, D., Liu, F., Sunkavalli, K., Bui, T., Tan, H.: Lrm: Large reconstruction model for single image to 3d. arXiv preprint arXiv:2311.04400 (2023)

  19. [19] Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4401–4410 (2019)

  20. [20] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., Dollár, P., Girshick, R.: Segment anything. arXiv:2304.02643 (2023)

  21. [21] Kumari, N., Zhang, B., Zhang, R., Shechtman, E., Zhu, J.Y.: Multi-concept customization of text-to-image diffusion (2023)

  22. [22] Lin, C.H., Gao, J., Tang, L., Takikawa, T., Zeng, X., Huang, X., Kreis, K., Fidler, S., Liu, M.Y., Lin, T.Y.: Magic3d: High-resolution text-to-3d content creation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 300–309 (2023)

  23. [23] Liu, R., Wu, R., Van Hoorick, B., Tokmakov, P., Zakharov, S., Vondrick, C.: Zero-1-to-3: Zero-shot one image to 3d object. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 9298–9309 (2023)

  24. [24] Liu, Y., Lin, C., Zeng, Z., Long, X., Liu, L., Komura, T., Wang, W.: Syncdreamer: Generating multiview-consistent images from a single-view image. arXiv preprint arXiv:2309.03453 (2023)

  25. [25] Long, X., Guo, Y.C., Lin, C., Liu, Y., Dou, Z., Liu, L., Ma, Y., Zhang, S.H., Habermann, M., Theobalt, C., Wang, W.: Wonder3d: Single image to 3d using cross-domain diffusion (2023), https://arxiv.org/abs/2310.15008

  26. [26] Lugmayr, A., Danelljan, M., Romero, A., Yu, F., Timofte, R., Van Gool, L.: RePaint: Inpainting using denoising diffusion probabilistic models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11461–11471 (2022)

  27. [27] Michel, O., Bar-On, R., Liu, R., Benaim, S., Hanocka, R.: Text2mesh: Text-driven neural stylization for meshes (2021), https://arxiv.org/abs/2112.03221

  28. [28] Mikaeili, A., Perel, O., Safaee, M., Cohen-Or, D., Mahdavi-Amiri, A.: Sked: Sketch-guided text-based 3d editing (2023), https://arxiv.org/abs/2303.10735

  29. [29] Ng, K.W., Zhu, X., Song, Y.Z., Xiang, T.: Partcraft: Crafting creative objects by parts (2024), https://arxiv.org/abs/2407.04604

  30. [30] Poole, B., Jain, A., Barron, J.T., Mildenhall, B.: Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988 (2022)

  31. [31] Raj, A., Kaza, S., Poole, B., Niemeyer, M., Ruiz, N., Mildenhall, B., Zada, S., Aberman, K., Rubinstein, M., Barron, J., et al.: Dreambooth3d: Subject-driven text-to-3d generation. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 2349–2359 (2023)

  32. [32] Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125 (2022)

  33. [33] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10684–10695 (2022)

  34. [34] Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., Aberman, K.: Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 22500–22510 (2023)

  35. [35] Ruiz, N., Li, Y., Jampani, V., Wei, W., Hou, T., Pritch, Y., Wadhwa, N., Rubinstein, M., Aberman, K.: Hyperdreambooth: Hypernetworks for fast personalization of text-to-image models (2024), https://arxiv.org/abs/2307.06949

  36. [36] Sella, E., Fiebelman, G., Hedman, P., Averbuch-Elor, H.: Vox-e: Text-guided voxel editing of 3d objects. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 430–440 (2023)

  37. [37] Shah, V., Ruiz, N., Cole, F., Lu, E., Lazebnik, S., Li, Y., Jampani, V.: Ziplora: Any subject in any style by effectively merging loras (2026), https://arxiv.org/abs/2311.13600

  38. [38] Shen, T., Gao, J., Yin, K., Liu, M.Y., Fidler, S.: Deep marching tetrahedra: a hybrid representation for high-resolution 3d shape synthesis. Advances in Neural Information Processing Systems 34, 6087–6101 (2021)

  39. [39] Shi, Y., Wang, P., Ye, J., Long, M., Li, K., Yang, X.: Mvdream: Multi-view diffusion for 3d generation. arXiv preprint arXiv:2308.16512 (2023)

  40. [40] Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020)

  41. [41] Voynov, A., Chu, Q., Cohen-Or, D., Aberman, K.: p+: Extended textual conditioning in text-to-image generation. arXiv preprint arXiv:2303.09522 (2023)

  42. [42] Wang, J., Fang, J., Zhang, X., Xie, L., Tian, Q.: Gaussianeditor: Editing 3d gaussians delicately with text instructions (2024), https://arxiv.org/abs/2311.16037

  43. [43] Wang, P., Shi, Y.: Imagedream: Image-prompt multi-view diffusion for 3d generation (2023), https://arxiv.org/abs/2312.02201

  44. [44] Wang, Q., Bai, X., Wang, H., Qin, Z., Chen, A., Li, H., Tang, X., Hu, Y.: Instantid: Zero-shot identity-preserving generation in seconds (2024), https://arxiv.org/abs/2401.07519

  45. [45] Wang, Z., Lu, C., Wang, Y., Bao, F., Li, C., Su, H., Zhu, J.: Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. Advances in neural information processing systems 36, 8406–8441 (2023)

  46. [46] Wu, J., Bian, J.W., Li, X., Wang, G., Reid, I., Torr, P., Prisacariu, V.A.: Gaussctrl: Multi-view consistent text-driven 3d gaussian splatting editing. In: European conference on computer vision. pp. 55–71. Springer (2024)

  47. [47] Yang, Y., Long, X.X., Dou, Z., Lin, C., Liu, Y., Yan, Q., Ma, Y., Wang, H., Wu, Z., Yin, W.: Wonder3d++: Cross-domain diffusion for high-fidelity 3d generation from a single image. IEEE Transactions on Pattern Analysis and Machine Intelligence 48(2), 1674–1688 (Feb 2026). https://doi.org/10.1109/tpami.2025.3618675, http://dx.doi.org/10.1109/TPAMI.2025.3618...

  48. [48] Ye, H., Zhang, J., Liu, S., Han, X., Yang, W.: Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721 (2023)

  49. [49] Ye, J., Xie, S., Zhao, R., Wang, Z., Yan, H., Zu, W., Ma, L., Zhu, J.: Nano3d: A training-free approach for efficient 3d editing without masks. arXiv preprint arXiv:2510.15019 (2025)

  50. [50] Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 3836–3847 (2023)

  51. [51] Zheng, Y., Huang, M., Chen, N., Mao, Z.: Pro3d-editor: A progressive-views perspective for consistent and precise 3d editing (2025), https://arxiv.org/abs/2506.00512

  52. [52] Zhuang, P., Han, S., Wang, C., Siarohin, A., Zou, J., Vasilkovsky, M., Shakhrai, V., Korolev, S., Tulyakov, S., Lee, H.Y.: Gtr: Improving large 3d reconstruction models through geometry and texture refinement. arXiv preprint arXiv:2406.05649 (2024)