pith. machine review for the scientific record.

arxiv: 2604.15948 · v1 · submitted 2026-04-17 · 💻 cs.CV

Recognition: unknown

From Competition to Coopetition: Coopetitive Training-Free Image Editing Based on Text Guidance

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 08:08 UTC · model grok-4.3

classification 💻 cs.CV
keywords: text-guided image editing · training-free editing · attention manipulation · coopetition · entropic refinement · diffusion models · semantic harmony

The pith

CoEdit replaces competitive attention control with coopetitive negotiation between editing and reconstruction branches to reduce semantic conflicts in text-guided image editing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that current training-free text-guided editing methods pit an editing branch against a reconstruction branch, each maximizing its own prompt alignment and thereby creating unpredictable semantic clashes. CoEdit instead treats the branches as negotiating partners by quantifying directional entropic interactions and reformulating attention control as an explicit harmony-maximization task. Spatially, this is done through Dual-Entropy Attention Manipulation, which improves localization of what should change versus what should stay fixed. Temporally, an Entropic Latent Refinement step adjusts the latent code at each denoising step to limit accumulated errors and keep transitions consistent. The authors further introduce a composite Fidelity-Constrained Editing Score to measure both successful edits and background preservation, reporting stronger results on standard benchmarks.

Core claim

By shifting from independent competitive optimization of editing and reconstruction objectives to a coopetitive framework that negotiates attention through measured entropic interactions, CoEdit produces more harmonious edits across both space and the denoising trajectory while preserving source structure.

What carries the argument

Dual-Entropy Attention Manipulation, which quantifies directional entropic interactions between the editing and reconstruction branches to recast attention control as a harmony-maximization problem.
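To make the spatial mechanism concrete, here is a minimal sketch of how per-pixel attention entropy from the two branches could be turned into a soft editable-versus-preservable mask. The array shapes, the normalization by log T, and the simple entropy-difference combination are illustrative assumptions, not the paper's actual Dual-Entropy formulation.

    import numpy as np

    def attention_entropy(attn):
        """Normalized Shannon entropy per spatial location.
        attn: (num_pixels, num_tokens) cross-attention weights, rows sum to 1,
        with num_tokens > 1."""
        eps = 1e-12
        h = -(attn * np.log(attn + eps)).sum(axis=1)   # entropy per pixel
        return h / np.log(attn.shape[1])               # scale to [0, 1]

    def dual_entropy_mask(attn_edit, attn_recon):
        """Soft mask: high where the editing branch is confident (low entropy)
        and the reconstruction branch is diffuse (high entropy)."""
        h_edit = attention_entropy(attn_edit)
        h_recon = attention_entropy(attn_recon)
        score = h_recon - h_edit                       # directional entropy gap
        return 1.0 / (1.0 + np.exp(-10.0 * score))     # squash to (0, 1); temperature is arbitrary

The point of the sketch is only that normalized entropies from the two branches are comparable per-pixel quantities that can be traded off without any training; how the paper actually maximizes harmony over both branches is not specified in the material above.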

If this is right

  • Editable and preservable regions become more accurately localized because attention is negotiated rather than fought over.
  • Semantic drift is reduced across the full denoising sequence because latent states are refined at every step using the same entropic harmony signal.
  • A single composite metric now jointly scores how well the edit succeeds and how faithfully the background is retained (see the sketch after this list).
  • The method remains fully training-free and zero-shot, inheriting the practical advantages of prior diffusion-based editors while addressing their coordination failure.
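The Fidelity-Constrained Editing Score itself is not specified in the material above, so the following is only a generic composite in the same spirit: one term rewards alignment between the edited image and the target prompt, the other penalizes change outside the edited region. The embedding inputs, the mask convention, and the harmonic-mean combination are assumptions for illustration, not the paper's metric.

    import numpy as np

    def composite_editing_score(img_emb, prompt_emb, src_img, edit_img, edit_mask):
        """Generic edit-vs-fidelity composite (not the paper's FCES).
        img_emb, prompt_emb: L2-normalized embeddings of the edited image and target prompt.
        src_img, edit_img: float arrays in [0, 1] with identical shape (H, W, C).
        edit_mask: boolean (H, W) array, True where editing is intended."""
        edit_score = (float(np.dot(img_emb, prompt_emb)) + 1.0) / 2.0   # alignment rescaled to [0, 1]
        bg = ~edit_mask
        bg_error = np.abs(src_img[bg] - edit_img[bg]).mean() if bg.any() else 0.0
        fidelity = 1.0 - bg_error                                        # background preservation in [0, 1]
        return 2 * edit_score * fidelity / (edit_score + fidelity + 1e-12)  # harmonic mean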

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same negotiation logic could be tested on video or 3D diffusion models where temporal consistency is even harder to maintain under competing objectives.
  • Other attention-heavy generative tasks, such as layout-conditioned synthesis or prompt interpolation, might benefit from recasting their internal objectives as explicit harmony problems.
  • If the entropic quantification proves stable across different diffusion backbones, the approach offers a lightweight plug-in module rather than a full architectural overhaul.

Load-bearing premise

Directional entropic interactions between the two branches can be quantified in a way that reliably converts attention control into harmony maximization without creating fresh semantic conflicts or needing per-image tuning.
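For readers who want something concrete to test this premise against, one plausible way to make "directional entropic interaction" precise uses normalized attention entropies and an asymmetric divergence between the branches' per-pixel attention distributions; this is an assumed reading, not the paper's definition.

    H(A_i) = -\sum_{t=1}^{T} A_{i,t} \log A_{i,t}, \qquad \bar{H}(A_i) = \frac{H(A_i)}{\log T} \in [0, 1]

    \Delta_{e \to r}(i) = \bar{H}\left(A^{\mathrm{recon}}_i\right) - \bar{H}\left(A^{\mathrm{edit}}_i\right), \qquad
    D_{e \to r}(i) = \mathrm{KL}\left(A^{\mathrm{edit}}_i \,\|\, A^{\mathrm{recon}}_i\right)

Under any such reading, the premise amounts to asking whether these per-pixel quantities stay stable across prompts and backbones without tuning; dividing by log T removes the dependence on prompt length, but stability across backbones remains an empirical question.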

What would settle it

A head-to-head run on the same editing benchmarks: if CoEdit's outputs score lower than strong competitive baselines on both semantic alignment with the target prompt and structural similarity to the source image, the coopetitive claim fails; consistent gains on both axes would support it.

Figures

Figures reproduced from arXiv: 2604.15948 by Haoqian Du, Jinhao Shen, Qing Li, Xiao-Yong Wei, Xulu Zhang.

Figure 1. Difference between competitive and coopetitive strategies. (a) Competitive …
Figure 2. Quantitative trade-off between editing diversity (CS…).
Figure 3. Overview of the proposed CoEdit framework, which integrates Dual-Entropy Attention Manipulation and Entropic Latent Refinement.
Figure 4. Comparative visualization of various zero-shot image editing methods.
Figure 5. Visualization of failure examples with higher …
Figure 6. Visualization of different learning rates.
Figure 7. Illustration of original attention maps and dual entropy, demonstrating the spatial coopetitive strategy. Mask Acc…
Figure 8. Post-hoc comparison of attention mask accuracy throughout denoising.
Original abstract

Text-guided image editing, a pivotal task in modern multimedia content creation, has seen remarkable progress with training-free methods that eliminate the need for additional optimization. Despite recent progress, existing methods are typically constrained by a competitive paradigm in which the editing and reconstruction branches are independently driven by their respective objectives to maximize alignment with target and source prompts. The adversarial strategy causes semantic conflicts and unpredictable outcomes due to the lack of coordination between branches. To overcome these issues, we propose Coopetitive Training-Free Image Editing (CoEdit), a novel zero-shot framework that transforms attention control from competition to coopetitive negotiation, achieving editing harmony across spatial and temporal dimensions. Spatially, CoEdit introduces Dual-Entropy Attention Manipulation, which quantifies directional entropic interactions between branches to reformulate attention control as a harmony-maximization problem, eventually improving the localization of editable and preservable regions. Temporally, we present Entropic Latent Refinement mechanism to dynamically adjust latent representations over time, minimizing accumulated editing errors and ensuring consistent semantic transitions throughout the denoising trajectory. Additionally, we propose the Fidelity-Constrained Editing Score, a composite metric that jointly evaluates semantic editing and background fidelity. Extensive experiments on standard benchmarks demonstrate that CoEdit achieves superior performance in both editing quality and structural preservation, enhancing multimedia information utilization by enabling more effective interaction between visual and textual modalities. The code will be available at https://github.com/JinhaoShen/CoEdit.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes CoEdit, a zero-shot training-free framework for text-guided image editing that reframes attention control as coopetitive negotiation rather than competition between editing and reconstruction branches. It introduces Dual-Entropy Attention Manipulation to quantify directional entropic interactions for spatial harmony maximization and improved localization, Entropic Latent Refinement to adjust latents temporally for consistent denoising trajectories, and the Fidelity-Constrained Editing Score as a composite metric for semantic editing and background fidelity. The authors claim superior editing quality and structural preservation on standard benchmarks.

Significance. If the entropy-based mechanisms are shown to deliver the claimed harmony without new conflicts or tuning, the work could meaningfully advance training-free diffusion editing by reducing adversarial branch interactions, with potential benefits for multimedia applications requiring precise yet faithful edits. The directional entropy quantification and the joint fidelity metric are distinctive contributions if empirically grounded.

major comments (3)
  1. §3.2 (Dual-Entropy Attention Manipulation): the reformulation of attention control as a harmony-maximization problem via directional entropic interactions is presented without a derivation or analysis showing that the resulting weights avoid prompt-dependent scales or new semantic conflicts; this is load-bearing for the central coopetitive claim.
  2. §4 (Experiments): superiority in editing quality and structural preservation is asserted, yet no quantitative tables, ablation results on the entropy terms, or direct comparisons to prior attention-control baselines are referenced in sufficient detail to evaluate the benchmark gains.
  3. §3.3 (Entropic Latent Refinement): the mechanism for dynamically adjusting latents to minimize accumulated errors is described at a high level; it is unclear whether the entropy weighting is parameter-free or requires implicit per-prompt calibration, undermining the training-free guarantee.
minor comments (2)
  1. Notation for the entropy terms (e.g., directional interaction definitions) should be explicitly tied to the attention maps for reproducibility.
  2. The abstract states code will be released, but the manuscript should include a reproducibility statement or pseudocode for the two proposed mechanisms.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and outline revisions that will strengthen the manuscript's rigor and clarity.

Point-by-point responses
  1. Referee: §3.2 (Dual-Entropy Attention Manipulation): the reformulation of attention control as a harmony-maximization problem via directional entropic interactions is presented without a derivation or analysis showing that the resulting weights avoid prompt-dependent scales or new semantic conflicts; this is load-bearing for the central coopetitive claim.

    Authors: We agree that an explicit derivation would better substantiate the coopetitive claim. The Dual-Entropy Attention Manipulation is motivated by quantifying directional entropic interactions to achieve spatial harmony, with empirical results across diverse prompts supporting stability. In the revised manuscript, we will add a dedicated analysis subsection deriving the weight normalization properties, proving scale-invariance, and demonstrating that the formulation avoids introducing new semantic conflicts through bounded entropy terms. revision: yes

  2. Referee: §4 (Experiments): superiority in editing quality and structural preservation is asserted, yet no quantitative tables, ablation results on the entropy terms, or direct comparisons to prior attention-control baselines are referenced in sufficient detail to evaluate the benchmark gains.

    Authors: We acknowledge that the experimental section would benefit from greater detail and explicit referencing. While the manuscript reports quantitative evaluations on standard benchmarks and includes initial ablations, we will expand §4 with full quantitative tables, comprehensive ablation studies isolating the entropy terms, and direct side-by-side comparisons to prior attention-control baselines (e.g., Prompt-to-Prompt, Attend-and-Excite) to clearly demonstrate the benchmark gains. revision: yes

  3. Referee: §3.3 (Entropic Latent Refinement): the mechanism for dynamically adjusting latents to minimize accumulated errors is described at a high level; it is unclear whether the entropy weighting is parameter-free or requires implicit per-prompt calibration, undermining the training-free guarantee.

    Authors: The Entropic Latent Refinement is designed to be fully parameter-free: entropy weights are computed dynamically from latent statistics and attention maps at each timestep with no per-prompt calibration or tunable hyperparameters. We will revise §3.3 to include explicit pseudocode and a step-by-step explanation confirming the absence of any calibration, thereby reinforcing the training-free guarantee. revision: yes
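The rebuttal promises pseudocode; in its absence, here is a hedged sketch of what a parameter-free, entropy-weighted refinement step could look like. The blend form, the reuse of a dual-entropy mask recomputed at each step, and the latent-space formulation are assumptions consistent with the description above, not the authors' actual Entropic Latent Refinement.

    import numpy as np

    def entropic_latent_refinement(z_edit, z_recon, harmony_mask):
        """Blend editing- and reconstruction-branch latents at one denoising step.
        z_edit, z_recon: latent arrays of shape (C, H, W) from the two branches.
        harmony_mask: (H, W) array in [0, 1], e.g. an entropy-derived mask recomputed
        each step from the current attention maps, so no per-prompt tuning is needed."""
        w = harmony_mask[None, :, :]                 # broadcast weight over channels
        return w * z_edit + (1.0 - w) * z_recon      # editable regions follow the edit branch

Whether the real mechanism operates in latent space, in noise-prediction space, or directly on attention outputs is exactly what the promised §3.3 pseudocode would need to pin down.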

Circularity Check

0 steps flagged

No significant circularity; new entropy-based mechanisms are introduced independently of fitted inputs or self-referential definitions.

Full rationale

The abstract and described framework propose Dual-Entropy Attention Manipulation and Entropic Latent Refinement as novel reformulations that quantify directional interactions to achieve harmony maximization. No load-bearing equations, self-citations, or reductions to prior fitted parameters are evident in the provided text. The central claims add independent controls for spatial-temporal coordination rather than deriving predictions from the same inputs by construction. On the available text, this reads as a normal non-finding rather than a flagged circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities; the Fidelity-Constrained Editing Score and the entropy quantifications may involve implicit choices, but these are not detailed.

pith-pipeline@v0.9.0 · 5568 in / 1140 out tokens · 51709 ms · 2026-05-10T08:08:57.548467+00:00 · methodology

