CPAM: Context-Preserving Adaptive Manipulation for Zero-Shot Real Image Editing

Dinh-Khoi Vo; Minh-Triet Tran; Tam V. Nguyen; Thanh-Toan Do; Trung-Nghia Le

arxiv: 2506.18438 · v2 · submitted 2025-06-23 · 💻 cs.CV


Pith reviewed 2026-05-19 07:36 UTC · model grok-4.3


keywords: zero-shot image editing · diffusion models · self-attention adaptation · mask guidance · context preservation · real image manipulation · text-to-image editing · non-rigid object editing


The pith

CPAM adjusts self-attention in diffusion models to edit real images by text while preserving object identities and undistorted backgrounds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CPAM as a zero-shot method for editing natural images according to text prompts in diffusion models. It targets the problem of handling complex non-rigid objects without losing their shapes, textures, or identities and without distorting the surrounding background. A preservation adaptation module modifies self-attention to control object and background regions separately through mask guidance. A localized extraction module reduces unwanted interference during cross-attention conditioning. The approach works across several diffusion backbones and ranks highest in human evaluations on the new IMBA benchmark for real image editing tasks.

Core claim

CPAM is a zero-shot framework that uses a preservation adaptation module to adjust self-attention mechanisms, thereby preserving and independently controlling object and background regions. Combined with mask guidance and a localized extraction module that limits interference in cross-attention, it maintains objects' shapes, textures, and identities while keeping backgrounds undistorted. The method supports various mask-guidance strategies for different editing tasks and integrates directly with diffusion backbones such as SD1.5, SD2.1, and SDXL, outperforming prior techniques on the IMBA benchmark according to human raters.

What carries the argument

The preservation adaptation module, which adjusts self-attention to preserve and independently control object and background regions using mask guidance.
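The paper's actual formulation is not given in the abstract, so the following is only a generic sketch of what mask-restricted self-attention can look like: each token attends only to tokens that share its region label (object or background). The function names, the hard 0/1 region labels, and the strict blocking rule are assumptions made for illustration, not CPAM's method.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats (may contain -inf)."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def masked_self_attention(q, k, v, region):
    """Toy self-attention where tokens attend only within their own region.

    q, k, v: lists of float vectors, one per token (q and k share a width).
    region:  per-token label, 1 = object, 0 = background (assumed hard mask).
    Hypothetical illustration; not the formulation used in the paper.
    """
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        scores = []
        for j, kj in enumerate(k):
            if region[i] == region[j]:
                dot = sum(a * b for a, b in zip(qi, kj))
                scores.append(dot / math.sqrt(d))
            else:
                # Block attention across the object/background boundary.
                scores.append(float("-inf"))
        weights = softmax(scores)
        out.append([
            sum(w * vj[c] for w, vj in zip(weights, v))
            for c in range(len(v[0]))
        ])
    return out
```

Because a token always shares a region with itself, every score row has at least one finite entry and the softmax stays well defined. A soft mask or a per-layer schedule would be a different design choice; this sketch does not claim which one the authors use.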

If this is right

Objects retain their original shapes, textures, and identities after text-based edits.
Background regions stay visually consistent and undistorted throughout the process.
The framework operates without any model fine-tuning on the target images.
Multiple mask-guidance strategies support a range of manipulation tasks in one system.
The same modules apply across different diffusion backbones without architecture changes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar attention adjustments might improve consistency in other generative tasks that mix text and image inputs.
Extending the mask strategies to handle multiple objects could support more complex scene edits.
The zero-shot property suggests easier deployment in consumer photo tools compared with fine-tuned alternatives.
If the localized extraction reduces interference reliably, it could apply to related attention-heavy models beyond editing.

Load-bearing premise

The assumption that self-attention adjustments via the preservation adaptation module combined with mask guidance can independently control object and background regions without interference in cross-attention or the need for fine-tuning.

What would settle it

A side-by-side comparison on a non-rigid object edit where the background shows visible distortion or the edited object changes identity despite correct mask application.
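One way to make such a comparison less subjective is to measure how much the pixels outside the edit mask changed. The sketch below computes PSNR restricted to background pixels. The function name, the grayscale 2D-list image layout, and the convention that mask value 1 marks the edited region are all assumptions for illustration; the paper does not state that it uses this metric.

```python
import math


def background_psnr(source, edited, mask, max_value=255.0):
    """PSNR computed only over pixels where mask == 0 (background).

    source, edited: 2D lists of grayscale intensities with the same shape.
    mask:           2D list of 0/1 values, 1 = region the edit may change.
    Returns float('inf') when the background is bit-identical.
    """
    squared_error = 0.0
    count = 0
    for src_row, edit_row, mask_row in zip(source, edited, mask):
        for s, e, m in zip(src_row, edit_row, mask_row):
            if m == 0:
                squared_error += (s - e) ** 2
                count += 1
    if count == 0:
        raise ValueError("mask covers the whole image; no background pixels")
    mse = squared_error / count
    if mse == 0:
        return float("inf")
    return 10.0 * math.log10(max_value ** 2 / mse)
```

A low background PSNR on an edit with a correct mask would be direct evidence of the distortion described above; object identity change would still need a separate check.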

Original abstract

Editing natural images using textual descriptions in text-to-image diffusion models remains a significant challenge, particularly in achieving consistent generation and handling complex, non-rigid objects. Existing methods often struggle to preserve textures and identity, require extensive fine-tuning, and exhibit limitations in editing specific spatial regions or objects while retaining background details. This paper proposes Context-Preserving Adaptive Manipulation (CPAM), a novel zero-shot framework for complicated, non-rigid real image editing. Specifically, we propose a preservation adaptation module that adjusts self-attention mechanisms to preserve and independently control the object and background effectively. This ensures that the objects' shapes, textures, and identities are maintained while keeping the background undistorted during the editing process using the mask guidance technique. Additionally, we develop a localized extraction module to mitigate the interference with the non-desired modified regions during conditioning in cross-attention mechanisms. We also introduce various mask-guidance strategies to facilitate diverse image manipulation tasks in a simple manner. CPAM can be seamlessly integrated with multiple diffusion backbones, including SD1.5, SD2.1, and SDXL, demonstrating strong generalization across different model architectures. Extensive experiments on our newly constructed Image Manipulation BenchmArk (IMBA), a robust benchmark dataset specifically designed for real image editing, demonstrate that our proposed method is the preferred choice among human raters, outperforming existing state-of-the-art editing techniques. The source code and data will be publicly released at the project page: https://vdkhoi20.github.io/CPAM

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CPAM adds a preservation adaptation module for self-attention and a localized extraction module for cross-attention in zero-shot editing, but the abstract gives almost no experimental details to back the claims.


The key takeaway is that CPAM adds a preservation adaptation module that adjusts self-attention and a localized extraction module for cross-attention in diffusion models, aiming for better context preservation in zero-shot real image editing without fine-tuning. It also introduces the IMBA benchmark and several mask-guidance strategies. The novelty lies in how it combines these specific modules to control object and background regions independently.

On the strengths, it claims compatibility with multiple backbones (SD1.5, SD2.1, and SDXL), which shows some effort toward generalization. The mask guidance for diverse tasks is a practical touch that could make the method easy to use.

On the soft spots, the description stays high-level. The abstract gives no quantitative metrics, ablation results, or specifics of the experimental setup, so the claimed preference among human raters is hard to judge. Leakage is also a plausible concern: self-attention is global, so without a clear mechanism or test for strict localization where object and background features overlap, mask guidance alone may not fully isolate edits for non-rigid changes.

This paper would interest researchers focused on diffusion-based image manipulation. Someone working on practical editing tools might pick up ideas from the modules, but only if the full results back them up. I think it deserves peer review to examine the implementation details and to test the claims against the leakage concern above.

Referee Report

3 major / 2 minor

Summary. The paper proposes CPAM, a zero-shot framework for complex non-rigid real-image editing in text-to-image diffusion models. It introduces a preservation adaptation module that adjusts self-attention to preserve object shapes, textures, and identities while using mask guidance to keep backgrounds undistorted, a localized extraction module to reduce cross-attention interference with undesired regions, and multiple mask-guidance strategies. The method integrates with SD1.5, SD2.1, and SDXL backbones. A new Image Manipulation BenchmArk (IMBA) dataset is presented, with human-rater evaluations claiming CPAM outperforms prior state-of-the-art editing techniques.

Significance. If the preservation and localization claims hold with rigorous verification, the work could advance zero-shot editing by reducing reliance on fine-tuning and improving regional control for non-rigid edits. The introduction of the IMBA benchmark and the explicit commitment to public release of source code and data are clear strengths that support reproducibility and future research.

major comments (3)

The preservation adaptation module is described only at a high level as 'adjusting self-attention mechanisms to preserve and independently control the object and background.' Because self-attention operates globally over the full feature map, the manuscript must provide the explicit formulation or algorithm (e.g., in §3) showing how localization is enforced without leakage during non-rigid deformations; absent this, the central claim of independent regional control remains unverified.
The experimental claims rest on human-rater preference on the new IMBA benchmark, yet no details appear on rater count, rating protocol, statistical significance, inter-rater agreement, or any quantitative metrics (FID, CLIP similarity, etc.). This absence directly undermines the assertion that CPAM is 'the preferred choice among human raters' and is load-bearing for the superiority conclusion.
No ablation studies are reported that isolate the contributions of the preservation adaptation module, the localized extraction module, and the mask-guidance strategies. Without such controls, it is impossible to attribute performance gains to the proposed components rather than to the underlying diffusion backbone or mask quality.

minor comments (2)

Clarify the exact mathematical definition of the preservation adaptation and localized extraction modules with equations or pseudocode rather than prose descriptions alone.
Add error bars or confidence intervals to any quantitative results and ensure all figures include captions that explicitly describe the editing task, input mask, and observed artifacts.
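For a human-preference study, the confidence interval requested here could be as simple as a Wilson score interval on the fraction of comparisons in which raters preferred one method. The sketch below is a minimal illustration; the function name and the default z = 1.96 (roughly 95% coverage) are assumptions, and any counts passed to it are the caller's own data, not figures from the paper.

```python
import math


def wilson_interval(wins, total, z=1.96):
    """Wilson score interval for a binomial proportion wins / total.

    wins:  number of pairwise comparisons in which the method was preferred.
    total: number of comparisons (must be positive).
    z:     normal quantile; 1.96 gives roughly 95% coverage.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= wins <= total:
        raise ValueError("wins must lie between 0 and total")
    p = wins / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2.0 * total)) / denom
    half_width = z * math.sqrt(
        p * (1.0 - p) / total + z * z / (4.0 * total * total)
    ) / denom
    return center - half_width, center + half_width
```

If the lower bound of the interval stays above 0.5, the preference is unlikely to be a coin flip at that confidence level. The Wilson form is used here rather than the plain normal approximation because it behaves better for small samples and proportions near 0 or 1.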

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our CPAM manuscript. We address each major comment below and commit to revisions that strengthen the clarity, rigor, and verifiability of our claims.


Referee: The preservation adaptation module is described only at a high level as 'adjusting self-attention mechanisms to preserve and independently control the object and background.' Because self-attention operates globally over the full feature map, the manuscript must provide the explicit formulation or algorithm (e.g., in §3) showing how localization is enforced without leakage during non-rigid deformations; absent this, the central claim of independent regional control remains unverified.

Authors: We agree that the current description of the preservation adaptation module would benefit from greater mathematical precision. In the revised manuscript we will expand §3 with the explicit formulation of the modified self-attention operation, including the precise mask-guided weighting terms and the algorithmic steps that enforce regional independence without cross-region leakage during non-rigid edits. revision: yes
Referee: The experimental claims rest on human-rater preference on the new IMBA benchmark, yet no details appear on rater count, rating protocol, statistical significance, inter-rater agreement, or any quantitative metrics (FID, CLIP similarity, etc.). This absence directly undermines the assertion that CPAM is 'the preferred choice among human raters' and is load-bearing for the superiority conclusion.

Authors: We acknowledge the omission of evaluation-protocol details. The revised version will add a dedicated subsection reporting the exact number of raters, the full rating protocol, statistical significance tests, inter-rater agreement (e.g., Fleiss’ kappa), and supplementary quantitative metrics including FID and CLIP similarity scores computed on the IMBA benchmark. revision: yes
Referee: No ablation studies are reported that isolate the contributions of the preservation adaptation module, the localized extraction module, and the mask-guidance strategies. Without such controls, it is impossible to attribute performance gains to the proposed components rather than to the underlying diffusion backbone or mask quality.

Authors: We concur that component-wise ablations are necessary to substantiate our claims. We will include new ablation experiments in the revised manuscript that systematically disable or replace each module (preservation adaptation, localized extraction, and mask-guidance variants) while keeping the diffusion backbone and input masks fixed, thereby isolating their individual contributions. revision: yes
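The second response names Fleiss' kappa as the inter-rater agreement statistic. As a reference for what that number measures, here is a minimal sketch of the standard computation from a table of rating counts. The function name and input layout are assumptions for illustration, and no values from the paper are used.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a fixed number of raters per item.

    counts[i][j] is the number of raters who assigned item i to category j.
    Every row must sum to the same number of raters n (n >= 2).
    """
    num_items = len(counts)
    if num_items == 0:
        raise ValueError("need at least one item")
    n = sum(counts[0])
    if n < 2:
        raise ValueError("need at least two raters per item")
    if any(sum(row) != n for row in counts):
        raise ValueError("every item must have the same number of ratings")
    num_categories = len(counts[0])

    # Share of all ratings that fell into each category.
    p_cat = [
        sum(row[j] for row in counts) / (num_items * n)
        for j in range(num_categories)
    ]
    # Observed agreement per item, then averaged over items.
    p_item = [
        (sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts
    ]
    p_observed = sum(p_item) / num_items
    # Agreement expected by chance.
    p_expected = sum(p * p for p in p_cat)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (p_observed - p_expected) / (1.0 - p_expected)
```

A value of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean raters agreed less often than chance would predict.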

Circularity Check

0 steps flagged

No circularity detected; derivation is self-contained via new modules and external evaluation


The paper introduces a novel zero-shot framework CPAM consisting of a preservation adaptation module for adjusting self-attention and a localized extraction module for mitigating cross-attention interference, along with mask-guidance strategies. It constructs a new benchmark dataset IMBA and reports results from human raters comparing against prior methods. No equations, fitted parameters, or derivations are presented that reduce by construction to inputs, self-citations, or renamings of known results. The central claims rest on the proposed architecture and independent experimental validation rather than any self-referential loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on standard diffusion model assumptions for text conditioning and attention mechanisms, plus the novel modules introduced without independent evidence beyond the abstract description; no free parameters or invented physical entities are detailed.

axioms (1)

Domain assumption: Diffusion models can be effectively conditioned and edited using text prompts and spatial masks without additional fine-tuning.
Invoked in the description of zero-shot editing and the mask guidance technique.

invented entities (2)

preservation adaptation module (no independent evidence)
Purpose: Adjusts self-attention mechanisms to preserve and independently control object and background.
New component proposed to address texture and identity preservation.

localized extraction module (no independent evidence)
Purpose: Mitigates interference with non-desired regions during cross-attention conditioning.
New component proposed to improve localized editing.



[1] [1]

In: International Conference on Machine Learning, pp

Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., Sutskever, I.: Zero-shot text-to-image generation. In: International Conference on Machine Learning, pp. 8821–8831 (2021). Pmlr

work page 2021

[2] [2]

In: Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P.S., Vaughan, J.W

Dhariwal, P., Nichol, A.: Diffusion models beat gans on image synthesis. In: Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P.S., Vaughan, J.W. (eds.) Advances in Neural Information Processing Systems, vol. 34, pp. 8780–8794 (2021)

work page 2021

[3] [3]

International conference on machine learning (2022)

Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I., Chen, M.: Glide: Towards photorealistic image generation and editing with text-guided diffusion models. International conference on machine learning (2022)

work page 2022

[4] [4]

Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

Yu, J., Xu, Y., Koh, J.Y., Luong, T., Baid, G., Wang, Z., Vasudevan, V., Ku, A., Yang, Y., Ayan, B.K., et al. : Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789 2(3), 5 (2022) 16

work page internal anchor Pith review Pith/arXiv arXiv 2022

[5] Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125 (2022)

[6] Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., Ho, J., Fleet, D.J., Norouzi, M.: Photorealistic text-to-image diffusion models with deep language understanding. In: Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A. (eds.) Advances in Neural Informati...

[7] Black Forest Labs: Flux. https://github.com/black-forest-labs/flux. Accessed: 2024 (2024)

[8] Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image synthesis (2024). https://arxiv.org/abs/2403.03206

[9] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695 (2022)

[10] Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., Cohen-Or, D.: Prompt-to-prompt image editing with cross attention control. In: International Conference on Learning Representations (2023)

[11] Tumanyan, N., Geyer, M., Bagon, S., Dekel, T.: Plug-and-play diffusion features for text-driven image-to-image translation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1921–1930 (2023)

[12] Parmar, G., Kumar Singh, K., Zhang, R., Li, Y., Lu, J., Zhu, J.-Y.: Zero-shot image-to-image translation. In: ACM SIGGRAPH 2023 Conference Proceedings, pp. 1–11 (2023)

[13] Vo, D.-K., Ly, D.-N., Le, K.-D., Nguyen, T.V., Tran, M.-T., Le, T.-N.: icontra: Toward thematic collection design via interactive concept transfer. In: Extended Abstracts of the CHI Conference on Human Factors in Computing Systems (2024)

[14] Wallace, B., Gokul, A., Naik, N.: Edict: Exact diffusion inversion via coupled transformations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22532–22541 (2023)

[15] Pan, Z., Gherardi, R., Xie, X., Huang, S.: Effective real image editing with accelerated iterative diffusion inversion. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 15912–15921 (2023)

[16] Kim, G., Kwon, T., Ye, J.C.: Diffusionclip: Text-guided diffusion models for robust image manipulation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2426–2435 (2022)

[17] Deutch, G., Gal, R., Garibi, D., Patashnik, O., Cohen-Or, D.: Turboedit: Text-based image editing using few-step diffusion models. In: SIGGRAPH Asia 2024 Conference Papers, pp. 1–12 (2024)

[18] Garibi, D., Patashnik, O., Voynov, A., Averbuch-Elor, H., Cohen-Or, D.: Renoise: Real image inversion through iterative noising. In: European Conference on Computer Vision, pp. 395–413 (2024). Springer

[19] Han, L., Wen, S., Chen, Q., Zhang, Z., Song, K., Ren, M., Gao, R., Stathopoulos, A., He, X., Chen, Y., et al.: Proxedit: Improving tuning-free real image editing with proximal guidance. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 4291–4301 (2024)

[20] Huberman-Spiegelglas, I., Kulikov, V., Michaeli, T.: An edit friendly ddpm noise space: Inversion and manipulations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12469–12478 (2024)

[21] Ju, X., Zeng, A., Bian, Y., Liu, S., Xu, Q.: Direct inversion: Boosting diffusion-based editing with 3 lines of code. arXiv preprint arXiv:2310.01506 (2023)

[22] Schuhmann, C., Vencu, R., Beaumont, R., Kaczmarczyk, R., Mullis, C., Katta, A., Coombes, T., Jitsev, J., Komatsuzaki, A.: Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114 (2021)

[23] Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., et al.: Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems 35, 25278–25294 (2022)

[24] Patashnik, O., Garibi, D., Azuri, I., Averbuch-Elor, H., Cohen-Or, D.: Localizing object-level shape variations with text-to-image diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 23051–23061 (2023)

[25] Li, S., Zeng, B., Feng, Y., Gao, S., Liu, X., Liu, J., Li, L., Tang, X., Hu, Y., Liu, J., et al.: Zone: Zero-shot instruction-guided local editing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6254–6263 (2024)

[26] Lin, Y., Chen, Y.-W., Tsai, Y.-H., Jiang, L., Yang, M.-H.: Text-driven image editing via learnable regions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7059–7068 (2024)

[27] Bar-Tal, O., Ofri-Amar, D., Fridman, R., Kasten, Y., Dekel, T.: Text2live: Text-driven layered image and video editing. In: European Conference on Computer Vision, pp. 707–723 (2022). Springer

[28] Valevski, D., Kalman, M., Matias, Y., Leviathan, Y.: Unitune: Text-driven image editing by fine tuning an image generation model on a single image. arXiv preprint arXiv:2210.09477 (2022)

[29] Avrahami, O., Lischinski, D., Fried, O.: Blended diffusion for text-driven editing of natural images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18208–18218 (2022)

[30] Meng, C., He, Y., Song, Y., Song, J., Wu, J., Zhu, J.-Y., Ermon, S.: SDEdit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations (2022)

[31] Brack, M., Friedrich, F., Kornmeier, K., Tsaban, L., Schramowski, P., Kersting, K., Passos, A.: Ledits++: Limitless image editing using text-to-image models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8861–8870 (2024)

[32] Cao, M., Wang, X., Qi, Z., Shan, Y., Qie, X., Zheng, Y.: Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 22560–22570 (2023)

[33] Liu, B., Wang, C., Cao, T., Jia, K., Huang, J.: Towards understanding cross and self-attention in stable diffusion for text-guided image editing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7817–7826 (2024)

[34] Titov, V., Khalmatova, M., Ivanova, A., Vetrov, D., Alanov, A.: Guide-and-rescale: Self-guidance mechanism for effective tuning-free real image editing. In: European Conference on Computer Vision, pp. 235–251 (2024). Springer

[35] Avrahami, O., Fried, O., Lischinski, D.: Blended latent diffusion. ACM Transactions on Graphics (TOG) 42(4), 1–11 (2023)

[36] Couairon, G., Verbeek, J., Schwenk, H., Cord, M.: Diffedit: Diffusion-based semantic image editing with mask guidance. In: International Conference on Learning Representations (2023)

[37] Kawar, B., Zada, S., Lang, O., Tov, O., Chang, H., Dekel, T., Mosseri, I., Irani, M.: Imagic: Text-based real image editing with diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6007–6017 (2023)

[38] Mokady, R., Hertz, A., Aberman, K., Pritch, Y., Cohen-Or, D.: Null-text inversion for editing real images using guided diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6038–6047 (2023)

[39] Chen, X., Feng, Y., Chen, M., Wang, Y., Zhang, S., Liu, Y., Shen, Y., Zhao, H.: Zero-shot image editing with reference imitation. In: Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., Zhang, C. (eds.) Advances in Neural Information Processing Systems, vol. 37, pp. 84010–84032 (2024)

[40] Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., Aberman, K.: Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22500–22510 (2023)

[41] Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A.H., Chechik, G., Cohen-Or, D.: An image is worth one word: Personalizing text-to-image generation using textual inversion. In: International Conference on Learning Representations (2023)

[42] Brooks, T., Holynski, A., Efros, A.A.: Instructpix2pix: Learning to follow image editing instructions. In: CVPR (2023)

[43] Chefer, H., Alaluf, Y., Vinker, Y., Wolf, L., Cohen-Or, D.: Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models. ACM Transactions on Graphics (TOG) 42(4), 1–10 (2023)

[44] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)

[45] Ho, J., Salimans, T.: Classifier-free diffusion guidance. In: NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications (2021)

[46] Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. In: International Conference on Learning Representations (2021)

[47] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., Dollár, P., Girshick, R.: Segment anything. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4015–4026 (2023)

[48] Kingma, D.P., Welling, M.: Auto-encoding variational bayes. In: International Conference on Learning Representations (2014)