{"work":{"id":"196f7eef-d65a-47e4-b815-9a188f6aedcf","openalex_id":null,"doi":null,"arxiv_id":"2208.01626","raw_key":null,"title":"Prompt-to-Prompt Image Editing with Cross Attention Control","authors":null,"authors_text":"Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, Daniel Cohen-Or","year":2022,"venue":"cs.CV","abstract":"Recent large-scale text-driven synthesis models have attracted much attention thanks to their remarkable capabilities of generating highly diverse images that follow given text prompts. Such text-based synthesis methods are particularly appealing to humans who are used to verbally describe their intent. Therefore, it is only natural to extend the text-driven image synthesis to text-driven image editing. Editing is challenging for these generative models, since an innate property of an editing technique is to preserve most of the original image, while in the text-based models, even a small modification of the text prompt often leads to a completely different outcome. State-of-the-art methods mitigate this by requiring the users to provide a spatial mask to localize the edit, hence, ignoring the original structure and content within the masked region. In this paper, we pursue an intuitive prompt-to-prompt editing framework, where the edits are controlled by text only. To this end, we analyze a text-conditioned model in depth and observe that the cross-attention layers are the key to controlling the relation between the spatial layout of the image to each word in the prompt. With this observation, we present several applications which monitor the image synthesis by editing the textual prompt only. This includes localized editing by replacing a word, global editing by adding a specification, and even delicately controlling the extent to which a word is reflected in the image. We present our results over diverse images and prompts, demonstrating high-quality synthesis and fidelity to the edited prompts.","external_url":"https://arxiv.org/abs/2208.01626","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-25T04:50:21.444923+00:00","pith_arxiv_id":"2208.01626","created_at":"2026-05-08T18:23:55.250211+00:00","updated_at":"2026-05-25T04:50:21.444923+00:00","title_quality_ok":true,"display_title":"Prompt-to-Prompt Image Editing with Cross Attention Control","render_title":"Prompt-to-Prompt Image Editing with Cross Attention Control"},"hub":{"state":{"work_id":"196f7eef-d65a-47e4-b815-9a188f6aedcf","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":91,"external_cited_by_count":null,"distinct_field_count":6,"first_pith_cited_at":"2022-11-02T17:43:04+00:00","last_pith_cited_at":"2026-05-22T11:33:43+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-02T06:24:23.232138+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":20},{"context_role":"method","n":2},{"context_role":"extension","n":1}],"polarity_counts":[{"context_polarity":"background","n":21},{"context_polarity":"extend","n":1},{"context_polarity":"use_method","n":1}],"runs":{"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T16:22:28.070144+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"Qwen-Image Technical Report","work_id":"d06d7ecc-7579-4f89-a60b-4278a0f3c562","shared_citers":13},{"title":"SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations","work_id":"cfaeaf1f-2560-4505-bcdb-c9aa6cc99d83","shared_citers":13},{"title":"SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis","work_id":"8034c587-fba6-4941-87ba-c98f2ac962cb","shared_citers":13},{"title":"Denoising Diffusion Implicit Models","work_id":"8fa2128b-d18c-405c-ac92-0e669cf89ac0","shared_citers":11},{"title":"Wan: Open and Advanced Large-Scale Video Generative Models","work_id":"ad3ebc3b-4224-46c9-b61d-bcf135da0a7c","shared_citers":10},{"title":"OmniGen2: Towards Instruction-Aligned Multimodal Generation","work_id":"d3153e5f-b6e2-4ab3-9f41-e24e24d64496","shared_citers":9},{"title":"Step1X-Edit: A Practical Framework for General Image Editing","work_id":"3392f2c8-a1cb-4d6c-8c82-2cdccffa33f9","shared_citers":9},{"title":"Emerging Properties in Unified Multimodal Pretraining","work_id":"e0cfd82c-f5d4-44fd-b531-ec73ab0a805b","shared_citers":8},{"title":"Flow Matching for Generative Modeling","work_id":"6edb71c4-5d64-40af-a394-9757ea051a36","shared_citers":8},{"title":"Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow","work_id":"a1989e1b-d66d-4533-be3a-fb9c5fd62290","shared_citers":8},{"title":"FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space","work_id":"5dfe19d5-3541-4803-8fe9-3c8b9e29b281","shared_citers":8},{"title":"IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models","work_id":"98e51b10-54bd-4251-8a2d-f79bd6215c19","shared_citers":8},{"title":"arXiv preprint arXiv:2210.11427 (2022)","work_id":"e0e59c16-b6da-4ec1-9bf5-488374e2cde5","shared_citers":7},{"title":"HunyuanVideo: A Systematic Framework For Large Video Generative Models","work_id":"881efa7e-7e73-4c66-9cc3-2803e551061c","shared_citers":7},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":6},{"title":"Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis","work_id":"40702548-f094-4c67-a5db-a62f426f852e","shared_citers":6},{"title":"ImgEdit: A Unified Image Editing Dataset and Benchmark","work_id":"059b5c3a-404c-4d30-a631-68c1d88a08a7","shared_citers":6},{"title":"Hierarchical Text-Conditional Image Generation with CLIP Latents","work_id":"0c6a768b-70b8-4242-bb0e-459f1008c9fc","shared_citers":5},{"title":"Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets","work_id":"4f68eada-27e3-437a-a2fe-6e4ca524d0d3","shared_citers":5},{"title":"An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion","work_id":"ca618c21-3ba6-448e-bd86-bcecff3cdeb5","shared_citers":4},{"title":"AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning","work_id":"1f9d1d3b-a6d6-45a9-9f13-51393c03be8a","shared_citers":4},{"title":"arXiv preprint arXiv:2310.01506 (2023)","work_id":"eba12495-cf75-42e3-9cbb-d8ea203a3f5f","shared_citers":4},{"title":"arXiv preprint arXiv:2410.10792 (2024)","work_id":"6b803827-7c45-46a5-80be-22ac9d312ae3","shared_citers":4},{"title":"arXiv preprint arXiv:2411.04746 (2024)","work_id":"b996055d-5c8d-453c-91f5-4b01e7fe1393","shared_citers":4}],"time_series":[{"n":1,"year":2024},{"n":1,"year":2025},{"n":42,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T16:22:40.405308+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T16:22:30.710662+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Prompt-to-Prompt Image Editing with Cross Attention Control","claims":[{"claim_text":"Recent large-scale text-driven synthesis models have attracted much attention thanks to their remarkable capabilities of generating highly diverse images that follow given text prompts. Such text-based synthesis methods are particularly appealing to humans who are used to verbally describe their intent. Therefore, it is only natural to extend the text-driven image synthesis to text-driven image editing. Editing is challenging for these generative models, since an innate property of an editing technique is to preserve most of the original image, while in the text-based models, even a small modi","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Prompt-to-Prompt Image Editing with Cross Attention Control because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T16:22:26.103527+00:00"}},"summary":{"title":"Prompt-to-Prompt Image Editing with Cross Attention Control","claims":[{"claim_text":"Recent large-scale text-driven synthesis models have attracted much attention thanks to their remarkable capabilities of generating highly diverse images that follow given text prompts. Such text-based synthesis methods are particularly appealing to humans who are used to verbally describe their intent. Therefore, it is only natural to extend the text-driven image synthesis to text-driven image editing. Editing is challenging for these generative models, since an innate property of an editing technique is to preserve most of the original image, while in the text-based models, even a small modi","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Prompt-to-Prompt Image Editing with Cross Attention Control because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"Qwen-Image Technical Report","work_id":"d06d7ecc-7579-4f89-a60b-4278a0f3c562","shared_citers":13},{"title":"SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations","work_id":"cfaeaf1f-2560-4505-bcdb-c9aa6cc99d83","shared_citers":13},{"title":"SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis","work_id":"8034c587-fba6-4941-87ba-c98f2ac962cb","shared_citers":13},{"title":"Denoising Diffusion Implicit Models","work_id":"8fa2128b-d18c-405c-ac92-0e669cf89ac0","shared_citers":11},{"title":"Wan: Open and Advanced Large-Scale Video Generative Models","work_id":"ad3ebc3b-4224-46c9-b61d-bcf135da0a7c","shared_citers":10},{"title":"OmniGen2: Towards Instruction-Aligned Multimodal Generation","work_id":"d3153e5f-b6e2-4ab3-9f41-e24e24d64496","shared_citers":9},{"title":"Step1X-Edit: A Practical Framework for General Image Editing","work_id":"3392f2c8-a1cb-4d6c-8c82-2cdccffa33f9","shared_citers":9},{"title":"Emerging Properties in Unified Multimodal Pretraining","work_id":"e0cfd82c-f5d4-44fd-b531-ec73ab0a805b","shared_citers":8},{"title":"Flow Matching for Generative Modeling","work_id":"6edb71c4-5d64-40af-a394-9757ea051a36","shared_citers":8},{"title":"Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow","work_id":"a1989e1b-d66d-4533-be3a-fb9c5fd62290","shared_citers":8},{"title":"FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space","work_id":"5dfe19d5-3541-4803-8fe9-3c8b9e29b281","shared_citers":8},{"title":"IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models","work_id":"98e51b10-54bd-4251-8a2d-f79bd6215c19","shared_citers":8},{"title":"arXiv preprint arXiv:2210.11427 (2022)","work_id":"e0e59c16-b6da-4ec1-9bf5-488374e2cde5","shared_citers":7},{"title":"HunyuanVideo: A Systematic Framework For Large Video Generative Models","work_id":"881efa7e-7e73-4c66-9cc3-2803e551061c","shared_citers":7},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":6},{"title":"Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis","work_id":"40702548-f094-4c67-a5db-a62f426f852e","shared_citers":6},{"title":"ImgEdit: A Unified Image Editing Dataset and Benchmark","work_id":"059b5c3a-404c-4d30-a631-68c1d88a08a7","shared_citers":6},{"title":"Hierarchical Text-Conditional Image Generation with CLIP Latents","work_id":"0c6a768b-70b8-4242-bb0e-459f1008c9fc","shared_citers":5},{"title":"Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets","work_id":"4f68eada-27e3-437a-a2fe-6e4ca524d0d3","shared_citers":5},{"title":"An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion","work_id":"ca618c21-3ba6-448e-bd86-bcecff3cdeb5","shared_citers":4},{"title":"AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning","work_id":"1f9d1d3b-a6d6-45a9-9f13-51393c03be8a","shared_citers":4},{"title":"arXiv preprint arXiv:2310.01506 (2023)","work_id":"eba12495-cf75-42e3-9cbb-d8ea203a3f5f","shared_citers":4},{"title":"arXiv preprint arXiv:2410.10792 (2024)","work_id":"6b803827-7c45-46a5-80be-22ac9d312ae3","shared_citers":4},{"title":"arXiv preprint arXiv:2411.04746 (2024)","work_id":"b996055d-5c8d-453c-91f5-4b01e7fe1393","shared_citers":4}],"time_series":[{"n":1,"year":2024},{"n":1,"year":2025},{"n":42,"year":2026}],"dependency_candidates":[]},"authors":[]}}