CPAM: Context-Preserving Adaptive Manipulation for Zero-Shot Real Image Editing
Pith reviewed 2026-05-19 07:36 UTC · model grok-4.3
The pith
CPAM adjusts self-attention in diffusion models to edit real images by text while preserving object identities and undistorted backgrounds.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CPAM is a zero-shot framework that uses a preservation adaptation module to adjust self-attention mechanisms, thereby preserving and independently controlling object and background regions. Combined with mask guidance and a localized extraction module that limits interference in cross-attention, it maintains objects' shapes, textures, and identities while keeping backgrounds undistorted. The method supports various mask-guidance strategies for different editing tasks and integrates directly with diffusion backbones such as SD1.5, SD2.1, and SDXL, outperforming prior techniques on the IMBA benchmark according to human raters.
What carries the argument
The preservation adaptation module, which adjusts self-attention to preserve and independently control object and background regions using mask guidance.
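The source does not give the module's formulation, so what follows is only a generic sketch of one way a binary mask can restrict self-attention to tokens in the same region. The function names, the hard same-region rule, and the list-based inputs are assumptions made for illustration; this is not CPAM's actual algorithm.

```python
import math


def softmax(row):
    """Numerically stable softmax; entries equal to -inf receive weight 0."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def masked_self_attention(q, k, v, mask):
    """Self-attention in which token i attends only to tokens in its own region.

    q, k, v: lists of n vectors (lists of floats); q and k share a dimension d.
    mask: list of n ints, 1 for object tokens and 0 for background tokens.
    Hypothetical illustration, not the formulation used in the paper.
    """
    n, d = len(q), len(q[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for i in range(n):
        scores = []
        for j in range(n):
            if mask[i] == mask[j]:
                scores.append(scale * sum(a * b for a, b in zip(q[i], k[j])))
            else:
                # Block attention across the object/background boundary.
                scores.append(float("-inf"))
        weights = softmax(scores)
        out.append(
            [sum(w * v[j][c] for j, w in enumerate(weights)) for c in range(len(v[0]))]
        )
    return out
```

Under this hard rule, changing the background values leaves the object outputs unchanged, which is the kind of independence the claim describes. A softer weighting across the boundary would be an equally plausible design; the source does not say which applies.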
If this is right
- Objects retain their original shapes, textures, and identities after text-based edits.
- Background regions stay visually consistent and undistorted throughout the process.
- The framework operates without any model fine-tuning on the target images.
- Multiple mask-guidance strategies support a range of manipulation tasks in one system.
- The same modules apply across different diffusion backbones without architecture changes.
Where Pith is reading between the lines
- Similar attention adjustments might improve consistency in other generative tasks that mix text and image inputs.
- Extending the mask strategies to handle multiple objects could support more complex scene edits.
- The zero-shot property suggests easier deployment in consumer photo tools compared with fine-tuned alternatives.
- If the localized extraction reduces interference reliably, it could apply to related attention-heavy models beyond editing.
Load-bearing premise
Self-attention adjustments made by the preservation adaptation module, combined with mask guidance, are assumed to control object and background regions independently, without interference in cross-attention and without fine-tuning.
What would settle it
A side-by-side comparison on a non-rigid object edit where the background shows visible distortion or the edited object changes identity despite correct mask application.
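One hedged way to put a number on "visible distortion" in such a comparison is the mean squared error over pixels outside the edit mask. The function name, the grayscale nested-list image format, and the choice of MSE are assumptions for illustration; the source does not specify a background-preservation metric.

```python
def background_mse(source, edited, object_mask):
    """Mean squared error over pixels where object_mask is 0 (the background).

    source, edited: 2D lists of floats with the same shape (grayscale images).
    object_mask: 2D list of 0/1 values, 1 marking the edited object region.
    Hypothetical helper, not a metric taken from the paper.
    """
    total, count = 0.0, 0
    for src_row, edt_row, mask_row in zip(source, edited, object_mask):
        for s, e, m in zip(src_row, edt_row, mask_row):
            if not m:
                total += (s - e) ** 2
                count += 1
    if count == 0:
        raise ValueError("mask covers the whole image; no background pixels")
    return total / count
```

A value near zero would indicate that the background was left untouched, while a clearly nonzero value on a correctly masked edit would count against the preservation claim.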
Original abstract
Editing natural images using textual descriptions in text-to-image diffusion models remains a significant challenge, particularly in achieving consistent generation and handling complex, non-rigid objects. Existing methods often struggle to preserve textures and identity, require extensive fine-tuning, and exhibit limitations in editing specific spatial regions or objects while retaining background details. This paper proposes Context-Preserving Adaptive Manipulation (CPAM), a novel zero-shot framework for complicated, non-rigid real image editing. Specifically, we propose a preservation adaptation module that adjusts self-attention mechanisms to preserve and independently control the object and background effectively. This ensures that the objects' shapes, textures, and identities are maintained while keeping the background undistorted during the editing process using the mask guidance technique. Additionally, we develop a localized extraction module to mitigate the interference with the non-desired modified regions during conditioning in cross-attention mechanisms. We also introduce various mask-guidance strategies to facilitate diverse image manipulation tasks in a simple manner. CPAM can be seamlessly integrated with multiple diffusion backbones, including SD1.5, SD2.1, and SDXL, demonstrating strong generalization across different model architectures. Extensive experiments on our newly constructed Image Manipulation BenchmArk (IMBA), a robust benchmark dataset specifically designed for real image editing, demonstrate that our proposed method is the preferred choice among human raters, outperforming existing state-of-the-art editing techniques. The source code and data will be publicly released at the project page: https://vdkhoi20.github.io/CPAM
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes CPAM, a zero-shot framework for complex non-rigid real-image editing in text-to-image diffusion models. It introduces a preservation adaptation module that adjusts self-attention to preserve object shapes, textures, and identities while using mask guidance to keep backgrounds undistorted, a localized extraction module to reduce cross-attention interference with undesired regions, and multiple mask-guidance strategies. The method integrates with SD1.5, SD2.1, and SDXL backbones. A new Image Manipulation BenchmArk (IMBA) dataset is presented, with human-rater evaluations claiming CPAM outperforms prior state-of-the-art editing techniques.
Significance. If the preservation and localization claims hold with rigorous verification, the work could advance zero-shot editing by reducing reliance on fine-tuning and improving regional control for non-rigid edits. The introduction of the IMBA benchmark and the explicit commitment to public release of source code and data are clear strengths that support reproducibility and future research.
major comments (3)
- The preservation adaptation module is described only at a high level as 'adjusting self-attention mechanisms to preserve and independently control the object and background.' Because self-attention operates globally over the full feature map, the manuscript must provide the explicit formulation or algorithm (e.g., in §3) showing how localization is enforced without leakage during non-rigid deformations; absent this, the central claim of independent regional control remains unverified.
- The experimental claims rest on human-rater preference on the new IMBA benchmark, yet no details appear on rater count, rating protocol, statistical significance, inter-rater agreement, or any quantitative metrics (FID, CLIP similarity, etc.). This absence directly undermines the assertion that CPAM is 'the preferred choice among human raters' and is load-bearing for the superiority conclusion.
- No ablation studies are reported that isolate the contributions of the preservation adaptation module, the localized extraction module, and the mask-guidance strategies. Without such controls, it is impossible to attribute performance gains to the proposed components rather than to the underlying diffusion backbone or mask quality.
minor comments (2)
- Clarify the exact mathematical definition of the preservation adaptation and localized extraction modules with equations or pseudocode rather than prose descriptions alone.
- Add error bars or confidence intervals to any quantitative results and ensure all figures include captions that explicitly describe the editing task, input mask, and observed artifacts.
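The statistical-significance concern raised in the major comments can be illustrated with an exact two-sided sign test on pairwise preferences. This is a minimal sketch under assumptions: the function name is hypothetical, ties are assumed to have been discarded beforehand, and nothing here reflects the protocol the authors actually used.

```python
from math import comb


def sign_test_p_value(wins, trials):
    """Exact two-sided sign test against the null of no preference (p = 0.5).

    wins: number of comparisons in which raters preferred the method.
    trials: total number of non-tied comparisons.
    """
    if trials <= 0 or not 0 <= wins <= trials:
        raise ValueError("need trials > 0 and 0 <= wins <= trials")
    # By symmetry at p = 0.5, double the probability of the more extreme tail.
    extreme = max(wins, trials - wins)
    tail = sum(comb(trials, i) for i in range(extreme, trials + 1)) / 2 ** trials
    return min(1.0, 2.0 * tail)
```

For instance, 8 preferences out of 10 comparisons gives a p-value of about 0.109, so a small rater pool can show a strong-looking preference that is not statistically significant.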
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our CPAM manuscript. We address each major comment below and commit to revisions that strengthen the clarity, rigor, and verifiability of our claims.
Point-by-point responses
- Referee: The preservation adaptation module is described only at a high level as 'adjusting self-attention mechanisms to preserve and independently control the object and background.' Because self-attention operates globally over the full feature map, the manuscript must provide the explicit formulation or algorithm (e.g., in §3) showing how localization is enforced without leakage during non-rigid deformations; absent this, the central claim of independent regional control remains unverified.
  Authors: We agree that the current description of the preservation adaptation module would benefit from greater mathematical precision. In the revised manuscript we will expand §3 with the explicit formulation of the modified self-attention operation, including the precise mask-guided weighting terms and the algorithmic steps that enforce regional independence without cross-region leakage during non-rigid edits.
  Revision: yes
- Referee: The experimental claims rest on human-rater preference on the new IMBA benchmark, yet no details appear on rater count, rating protocol, statistical significance, inter-rater agreement, or any quantitative metrics (FID, CLIP similarity, etc.). This absence directly undermines the assertion that CPAM is 'the preferred choice among human raters' and is load-bearing for the superiority conclusion.
  Authors: We acknowledge the omission of evaluation-protocol details. The revised version will add a dedicated subsection reporting the exact number of raters, the full rating protocol, statistical significance tests, inter-rater agreement (e.g., Fleiss’ kappa), and supplementary quantitative metrics including FID and CLIP similarity scores computed on the IMBA benchmark.
  Revision: yes
- Referee: No ablation studies are reported that isolate the contributions of the preservation adaptation module, the localized extraction module, and the mask-guidance strategies. Without such controls, it is impossible to attribute performance gains to the proposed components rather than to the underlying diffusion backbone or mask quality.
  Authors: We concur that component-wise ablations are necessary to substantiate our claims. We will include new ablation experiments in the revised manuscript that systematically disable or replace each module (preservation adaptation, localized extraction, and mask-guidance variants) while keeping the diffusion backbone and input masks fixed, thereby isolating their individual contributions.
  Revision: yes
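The inter-rater agreement statistic named in the second response, Fleiss' kappa, can be computed from a table of per-item category counts. The sketch below follows the standard definition; the function name and input layout are assumptions, and no rating data from the paper is used.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j. Every item must have the same number
    of ratings (at least 2).
    """
    if not counts:
        raise ValueError("need at least one item")
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    n_items = len(counts)
    n_categories = len(counts[0])
    # Mean observed agreement across items.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items
    # Agreement expected by chance from the overall category proportions.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_exp = sum(p * p for p in p_cat)
    if p_exp == 1.0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return (p_bar - p_exp) / (1.0 - p_exp)
```

A value of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean raters agree less than chance would predict.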
Circularity Check
No circularity detected; derivation is self-contained via new modules and external evaluation
full rationale
The paper introduces a novel zero-shot framework CPAM consisting of a preservation adaptation module for adjusting self-attention and a localized extraction module for mitigating cross-attention interference, along with mask-guidance strategies. It constructs a new benchmark dataset IMBA and reports results from human raters comparing against prior methods. No equations, fitted parameters, or derivations are presented that reduce by construction to inputs, self-citations, or renamings of known results. The central claims rest on the proposed architecture and independent experimental validation rather than any self-referential loop.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Diffusion models can be effectively conditioned and edited using text prompts and spatial masks without additional fine-tuning.
invented entities (2)
- preservation adaptation module: no independent evidence
- localized extraction module: no independent evidence