
arxiv: 2506.18438 · v2 · submitted 2025-06-23 · 💻 cs.CV

CPAM: Context-Preserving Adaptive Manipulation for Zero-Shot Real Image Editing

Pith reviewed 2026-05-19 07:36 UTC · model grok-4.3

classification 💻 cs.CV
keywords zero-shot image editing · diffusion models · self-attention adaptation · mask guidance · context preservation · real image manipulation · text-to-image editing · non-rigid object editing

The pith

CPAM adjusts self-attention in diffusion models to edit real images by text while preserving object identities and undistorted backgrounds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CPAM as a zero-shot method for editing natural images according to text prompts in diffusion models. It targets the problem of handling complex non-rigid objects without losing their shapes, textures, or identities and without distorting the surrounding background. A preservation adaptation module modifies self-attention to control object and background regions separately through mask guidance. A localized extraction module reduces unwanted interference during cross-attention conditioning. The approach works across several diffusion backbones and ranks highest in human evaluations on the new IMBA benchmark for real image editing tasks.

Core claim

CPAM is a zero-shot framework that uses a preservation adaptation module to adjust self-attention mechanisms, thereby preserving and independently controlling object and background regions. Combined with mask guidance and a localized extraction module that limits interference in cross-attention, it maintains objects' shapes, textures, and identities while keeping backgrounds undistorted. The method supports various mask-guidance strategies for different editing tasks and integrates directly with diffusion backbones such as SD1.5, SD2.1, and SDXL, outperforming prior techniques on the IMBA benchmark according to human raters.

What carries the argument

The preservation adaptation module, which adjusts self-attention to preserve and independently control object and background regions using mask guidance.
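
As a rough illustration of what mask-guided self-attention could look like, the sketch below lets each query token attend only to keys from its own region (object or background). It is not the paper's formulation, which this page does not give. The function name `masked_self_attention`, the hard 0/1 region labels, and the strict same-region rule are all assumptions made for illustration.

```python
import math


def masked_self_attention(q, k, v, region):
    """Self-attention in which token i attends only to tokens of its own region.

    q, k, v: lists of float vectors, one per token (q and k share a dimension).
    region:  list of 0/1 labels per token (0 = background, 1 = object).
    Hypothetical illustration; not the formulation used by CPAM.
    """
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        # Keys allowed for this query: same region label (always includes i itself).
        idx = [j for j in range(len(k)) if region[j] == region[i]]
        # Scaled dot-product scores against the allowed keys only.
        scores = [sum(a * b for a, b in zip(qi, k[j])) / math.sqrt(d) for j in idx]
        # Numerically stable softmax over the allowed keys.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        # Weighted sum of the allowed value vectors.
        out.append([
            sum(w / total * v[j][c] for w, j in zip(weights, idx))
            for c in range(len(v[0]))
        ])
    return out
```

With a hard partition like this, no background value can reach an object token's output and vice versa. That is the kind of leakage-free behavior the referee report asks the authors to demonstrate for their actual module.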

If this is right

  • Objects retain their original shapes, textures, and identities after text-based edits.
  • Background regions stay visually consistent and undistorted throughout the process.
  • The framework operates without any model fine-tuning on the target images.
  • Multiple mask-guidance strategies support a range of manipulation tasks in one system.
  • The same modules apply across different diffusion backbones without architecture changes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar attention adjustments might improve consistency in other generative tasks that mix text and image inputs.
  • Extending the mask strategies to handle multiple objects could support more complex scene edits.
  • The zero-shot property suggests easier deployment in consumer photo tools compared with fine-tuned alternatives.
  • If the localized extraction reduces interference reliably, it could apply to related attention-heavy models beyond editing.

Load-bearing premise

The assumption that self-attention adjustments via the preservation adaptation module combined with mask guidance can independently control object and background regions without interference in cross-attention or the need for fine-tuning.

What would settle it

A side-by-side comparison on a non-rigid object edit where the background shows visible distortion or the edited object changes identity despite correct mask application.
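
One way to make the "visible distortion" check less subjective is to measure how much the pixels outside the edit mask changed. The sketch below computes PSNR over background pixels only. Using PSNR here is an assumption of this note rather than something the paper reports, and the function name `background_psnr` and the grayscale list-of-lists image format are illustrative.

```python
import math


def background_psnr(original, edited, mask, max_value=255.0):
    """PSNR computed only over pixels where mask == 0 (background).

    original, edited: 2D lists of grayscale pixel values with the same shape.
    mask:             2D list of 0/1 values, 1 marking the edited object region.
    Returns math.inf when the background is pixel-identical.
    """
    sq_err = 0.0
    count = 0
    for row_o, row_e, row_m in zip(original, edited, mask):
        for o, e, m in zip(row_o, row_e, row_m):
            if m == 0:  # skip pixels inside the edit region
                sq_err += (o - e) ** 2
                count += 1
    if count == 0:
        raise ValueError("mask covers the whole image; no background pixels")
    mse = sq_err / count
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)
```

A low score on an edit whose mask was applied correctly would be evidence of the background leakage described above. A high score alone does not show that the object's identity was preserved.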

Original abstract

Editing natural images using textual descriptions in text-to-image diffusion models remains a significant challenge, particularly in achieving consistent generation and handling complex, non-rigid objects. Existing methods often struggle to preserve textures and identity, require extensive fine-tuning, and exhibit limitations in editing specific spatial regions or objects while retaining background details. This paper proposes Context-Preserving Adaptive Manipulation (CPAM), a novel zero-shot framework for complicated, non-rigid real image editing. Specifically, we propose a preservation adaptation module that adjusts self-attention mechanisms to preserve and independently control the object and background effectively. This ensures that the objects' shapes, textures, and identities are maintained while keeping the background undistorted during the editing process using the mask guidance technique. Additionally, we develop a localized extraction module to mitigate the interference with the non-desired modified regions during conditioning in cross-attention mechanisms. We also introduce various mask-guidance strategies to facilitate diverse image manipulation tasks in a simple manner. CPAM can be seamlessly integrated with multiple diffusion backbones, including SD1.5, SD2.1, and SDXL, demonstrating strong generalization across different model architectures. Extensive experiments on our newly constructed Image Manipulation BenchmArk (IMBA), a robust benchmark dataset specifically designed for real image editing, demonstrate that our proposed method is the preferred choice among human raters, outperforming existing state-of-the-art editing techniques. The source code and data will be publicly released at the project page: https://vdkhoi20.github.io/CPAM

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes CPAM, a zero-shot framework for complex non-rigid real-image editing in text-to-image diffusion models. It introduces a preservation adaptation module that adjusts self-attention to preserve object shapes, textures, and identities while using mask guidance to keep backgrounds undistorted, a localized extraction module to reduce cross-attention interference with undesired regions, and multiple mask-guidance strategies. The method integrates with SD1.5, SD2.1, and SDXL backbones. A new Image Manipulation BenchmArk (IMBA) dataset is presented, with human-rater evaluations claiming CPAM outperforms prior state-of-the-art editing techniques.

Significance. If the preservation and localization claims hold with rigorous verification, the work could advance zero-shot editing by reducing reliance on fine-tuning and improving regional control for non-rigid edits. The introduction of the IMBA benchmark and the explicit commitment to public release of source code and data are clear strengths that support reproducibility and future research.

major comments (3)
  1. The preservation adaptation module is described only at a high level as 'adjusting self-attention mechanisms to preserve and independently control the object and background.' Because self-attention operates globally over the full feature map, the manuscript must provide the explicit formulation or algorithm (e.g., in §3) showing how localization is enforced without leakage during non-rigid deformations; absent this, the central claim of independent regional control remains unverified.
  2. The experimental claims rest on human-rater preference on the new IMBA benchmark, yet no details appear on rater count, rating protocol, statistical significance, inter-rater agreement, or any quantitative metrics (FID, CLIP similarity, etc.). This absence directly undermines the assertion that CPAM is 'the preferred choice among human raters' and is load-bearing for the superiority conclusion.
  3. No ablation studies are reported that isolate the contributions of the preservation adaptation module, the localized extraction module, and the mask-guidance strategies. Without such controls, it is impossible to attribute performance gains to the proposed components rather than to the underlying diffusion backbone or mask quality.
minor comments (2)
  1. Clarify the exact mathematical definition of the preservation adaptation and localized extraction modules with equations or pseudocode rather than prose descriptions alone.
  2. Add error bars or confidence intervals to any quantitative results and ensure all figures include captions that explicitly describe the editing task, input mask, and observed artifacts.
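
To make the request for statistical significance in major comment 2 concrete, the sketch below runs an exact two-sided binomial test on pairwise preference counts, asking whether raters prefer one method more often than a coin flip would. The test choice and the function name `binomial_two_sided_p` are assumptions for illustration; the paper's actual rating protocol is not described on this page.

```python
from math import comb


def binomial_two_sided_p(wins, trials, p0=0.5):
    """Exact two-sided binomial test p-value.

    Sums the null probabilities of every outcome that is no more likely
    than the observed number of wins.
    """
    if not 0 <= wins <= trials:
        raise ValueError("wins must lie between 0 and trials")
    pmf = [
        comb(trials, k) * p0 ** k * (1 - p0) ** (trials - k)
        for k in range(trials + 1)
    ]
    observed = pmf[wins]
    # Small relative tolerance so floating-point ties count as equally likely.
    tail = sum(p for p in pmf if p <= observed * (1 + 1e-12))
    return min(1.0, tail)


# Hypothetical counts, not taken from the paper: method A preferred in 9 of 10 comparisons.
p_value = binomial_two_sided_p(9, 10)
```

This treats comparisons as independent. If each rater judges many image pairs, a per-rater analysis would be more appropriate.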

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our CPAM manuscript. We address each major comment below and commit to revisions that strengthen the clarity, rigor, and verifiability of our claims.

Point-by-point responses
  1. Referee: The preservation adaptation module is described only at a high level as 'adjusting self-attention mechanisms to preserve and independently control the object and background.' Because self-attention operates globally over the full feature map, the manuscript must provide the explicit formulation or algorithm (e.g., in §3) showing how localization is enforced without leakage during non-rigid deformations; absent this, the central claim of independent regional control remains unverified.

    Authors: We agree that the current description of the preservation adaptation module would benefit from greater mathematical precision. In the revised manuscript we will expand §3 with the explicit formulation of the modified self-attention operation, including the precise mask-guided weighting terms and the algorithmic steps that enforce regional independence without cross-region leakage during non-rigid edits.
    Revision: yes

  2. Referee: The experimental claims rest on human-rater preference on the new IMBA benchmark, yet no details appear on rater count, rating protocol, statistical significance, inter-rater agreement, or any quantitative metrics (FID, CLIP similarity, etc.). This absence directly undermines the assertion that CPAM is 'the preferred choice among human raters' and is load-bearing for the superiority conclusion.

    Authors: We acknowledge the omission of evaluation-protocol details. The revised version will add a dedicated subsection reporting the exact number of raters, the full rating protocol, statistical significance tests, inter-rater agreement (e.g., Fleiss’ kappa), and supplementary quantitative metrics including FID and CLIP similarity scores computed on the IMBA benchmark.
    Revision: yes

  3. Referee: No ablation studies are reported that isolate the contributions of the preservation adaptation module, the localized extraction module, and the mask-guidance strategies. Without such controls, it is impossible to attribute performance gains to the proposed components rather than to the underlying diffusion backbone or mask quality.

    Authors: We concur that component-wise ablations are necessary to substantiate our claims. We will include new ablation experiments in the revised manuscript that systematically disable or replace each module (preservation adaptation, localized extraction, and mask-guidance variants) while keeping the diffusion backbone and input masks fixed, thereby isolating their individual contributions.
    Revision: yes
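
The inter-rater agreement statistic named in response 2, Fleiss' kappa, can be computed from a table of per-item category counts. The sketch below uses the standard definition. The function name `fleiss_kappa` and the example table are illustrative and are not data from the paper.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for counts[i][j] = number of raters who put item i in category j.

    Every item must be rated by the same number of raters (at least two).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number of raters (>= 2)")
    n_cats = len(counts[0])
    # Overall share of ratings that fall in each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    # Observed agreement per item: fraction of rater pairs that agree.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical table: 4 image pairs, 5 raters, categories = (prefers A, prefers B).
example = [[5, 0], [4, 1], [1, 4], [0, 5]]
kappa = fleiss_kappa(example)
```

Values near 1 indicate strong agreement and values near 0 indicate agreement no better than chance. Reporting kappa next to the preference rates shows whether the raters' rankings are consistent with one another.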

Circularity Check

0 steps flagged

No circularity detected; derivation is self-contained via new modules and external evaluation

Full rationale

The paper introduces a novel zero-shot framework CPAM consisting of a preservation adaptation module for adjusting self-attention and a localized extraction module for mitigating cross-attention interference, along with mask-guidance strategies. It constructs a new benchmark dataset IMBA and reports results from human raters comparing against prior methods. No equations, fitted parameters, or derivations are presented that reduce by construction to inputs, self-citations, or renamings of known results. The central claims rest on the proposed architecture and independent experimental validation rather than any self-referential loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on standard diffusion model assumptions for text conditioning and attention mechanisms, plus the novel modules introduced without independent evidence beyond the abstract description; no free parameters or invented physical entities are detailed.

axioms (1)
  • domain assumption: Diffusion models can be effectively conditioned and edited using text prompts and spatial masks without additional fine-tuning.
    Invoked in the description of zero-shot editing and mask guidance technique.
invented entities (2)
  • preservation adaptation module (no independent evidence)
    purpose: Adjusts self-attention mechanisms to preserve and independently control object and background.
    New component proposed to address texture and identity preservation.
  • localized extraction module (no independent evidence)
    purpose: Mitigates interference with non-desired regions during cross-attention conditioning.
    New component proposed to improve localized editing.

pith-pipeline@v0.9.0 · 5824 in / 1487 out tokens · 31655 ms · 2026-05-19T07:36:27.564054+00:00 · methodology

