pith. machine review for the scientific record.

arxiv: 2604.15948 · v1 · submitted 2026-04-17 · 💻 cs.CV

Recognition: unknown

From Competition to Coopetition: Coopetitive Training-Free Image Editing Based on Text Guidance

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 08:08 UTC · model grok-4.3

classification 💻 cs.CV
keywords: text-guided image editing · training-free editing · attention manipulation · coopetition · entropic refinement · diffusion models · semantic harmony

The pith

CoEdit replaces competitive attention control with coopetitive negotiation between editing and reconstruction branches to reduce semantic conflicts in text-guided image editing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that current training-free text-guided editing methods pit an editing branch against a reconstruction branch, each maximizing its own prompt alignment and thereby creating unpredictable semantic clashes. CoEdit instead treats the branches as negotiating partners by quantifying directional entropic interactions and reformulating attention control as an explicit harmony-maximization task. Spatially, this is done through Dual-Entropy Attention Manipulation, which improves localization of what should change versus what should stay fixed. Temporally, an Entropic Latent Refinement step adjusts the latent code at each denoising step to limit accumulated errors and keep transitions consistent. The authors further introduce a composite Fidelity-Constrained Editing Score to measure both successful edits and background preservation, reporting stronger results on standard benchmarks.

Core claim

By shifting from independent competitive optimization of editing and reconstruction objectives to a coopetitive framework that negotiates attention through measured entropic interactions, CoEdit produces more harmonious edits across both space and the denoising trajectory while preserving source structure.

What carries the argument

Dual-Entropy Attention Manipulation, which quantifies directional entropic interactions between the editing and reconstruction branches to recast attention control as a harmony-maximization problem.
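To make the spatial mechanism concrete, here is a minimal sketch of how per-pixel attention entropy from the two branches could be turned into a soft editable-versus-preservable mask. The array shapes, the normalization by log T, and the simple entropy-difference combination are illustrative assumptions, not the paper's actual Dual-Entropy formulation.

    import numpy as np

    def attention_entropy(attn):
        """Normalized Shannon entropy per spatial location.
        attn: (num_pixels, num_tokens) cross-attention weights, rows sum to 1,
        with num_tokens > 1."""
        eps = 1e-12
        h = -(attn * np.log(attn + eps)).sum(axis=1)   # entropy per pixel
        return h / np.log(attn.shape[1])               # scale to [0, 1]

    def dual_entropy_mask(attn_edit, attn_recon):
        """Soft mask: high where the editing branch is confident (low entropy)
        and the reconstruction branch is diffuse (high entropy)."""
        h_edit = attention_entropy(attn_edit)
        h_recon = attention_entropy(attn_recon)
        score = h_recon - h_edit                       # directional entropy gap
        return 1.0 / (1.0 + np.exp(-10.0 * score))     # squash to (0, 1); temperature is arbitrary

The point of the sketch is only that normalized entropies from the two branches are comparable per-pixel quantities that can be traded off without any training; how the paper actually maximizes harmony over both branches is not specified in the material above.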

If this is right

  • Editable and preservable regions become more accurately localized because attention is negotiated rather than fought over.
  • Semantic drift is reduced across the full denoising sequence because latent states are refined at every step using the same entropic harmony signal.
  • A single composite metric now jointly scores how well the edit succeeds and how faithfully the background is retained (see the sketch after this list).
  • The method remains fully training-free and zero-shot, inheriting the practical advantages of prior diffusion-based editors while addressing their coordination failure.
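The Fidelity-Constrained Editing Score itself is not specified in the material above, so the following is only a generic composite in the same spirit: one term rewards alignment between the edited image and the target prompt, the other penalizes change outside the edited region. The embedding inputs, the mask convention, and the harmonic-mean combination are assumptions for illustration, not the paper's metric.

    import numpy as np

    def composite_editing_score(img_emb, prompt_emb, src_img, edit_img, edit_mask):
        """Generic edit-vs-fidelity composite (not the paper's FCES).
        img_emb, prompt_emb: L2-normalized embeddings of the edited image and target prompt.
        src_img, edit_img: float arrays in [0, 1] with identical shape (H, W, C).
        edit_mask: boolean (H, W) array, True where editing is intended."""
        edit_score = (float(np.dot(img_emb, prompt_emb)) + 1.0) / 2.0   # alignment rescaled to [0, 1]
        bg = ~edit_mask
        bg_error = np.abs(src_img[bg] - edit_img[bg]).mean() if bg.any() else 0.0
        fidelity = 1.0 - bg_error                                        # background preservation in [0, 1]
        return 2 * edit_score * fidelity / (edit_score + fidelity + 1e-12)  # harmonic mean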

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same negotiation logic could be tested on video or 3D diffusion models where temporal consistency is even harder to maintain under competing objectives.
  • Other attention-heavy generative tasks, such as layout-conditioned synthesis or prompt interpolation, might benefit from recasting their internal objectives as explicit harmony problems.
  • If the entropic quantification proves stable across different diffusion backbones, the approach offers a lightweight plug-in module rather than a full architectural overhaul.

Load-bearing premise

Directional entropic interactions between the two branches can be quantified in a way that reliably converts attention control into harmony maximization without creating fresh semantic conflicts or needing per-image tuning.
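For readers who want something concrete to test this premise against, one plausible way to make "directional entropic interaction" precise uses normalized attention entropies and an asymmetric divergence between the branches' per-pixel attention distributions; this is an assumed reading, not the paper's definition.

    H(A_i) = -\sum_{t=1}^{T} A_{i,t} \log A_{i,t}, \qquad \bar{H}(A_i) = \frac{H(A_i)}{\log T} \in [0, 1]

    \Delta_{e \to r}(i) = \bar{H}\left(A^{\mathrm{recon}}_i\right) - \bar{H}\left(A^{\mathrm{edit}}_i\right), \qquad
    D_{e \to r}(i) = \mathrm{KL}\left(A^{\mathrm{edit}}_i \,\|\, A^{\mathrm{recon}}_i\right)

Under any such reading, the premise amounts to asking whether these per-pixel quantities stay stable across prompts and backbones without tuning; dividing by log T removes the dependence on prompt length, but stability across backbones remains an empirical question.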

What would settle it

A head-to-head run on the same editing benchmarks: if CoEdit's outputs score lower than strong competitive baselines on both semantic alignment with the target prompt and structural similarity to the source image, the coopetitive claim fails; consistent gains on both axes would support it.

Figures

Figures reproduced from arXiv: 2604.15948 by Haoqian Du, Jinhao Shen, Qing Li, Xiao-Yong Wei, Xulu Zhang.

Figure 1. Difference between competitive and coopetitive strategies. (a) Competitive …
Figure 2. Quantitative trade-off between editing diversity (CS…).
Figure 3. Overview of the proposed CoEdit framework, which integrates Dual-Entropy Attention Manipulation and Entropic Latent Refinement.
Figure 4. Comparative visualization of various zero-shot image editing methods.
Figure 5. Visualization of failure examples with higher …
Figure 6. Visualization of different learning rates.
Figure 7. Illustration of original attention maps and dual entropy, demonstrating the spatial coopetitive strategy. Mask Acc…
Figure 8. Post-hoc comparison of attention mask accuracy throughout denoising.
Original abstract

Text-guided image editing, a pivotal task in modern multimedia content creation, has seen remarkable progress with training-free methods that eliminate the need for additional optimization. Despite recent progress, existing methods are typically constrained by a competitive paradigm in which the editing and reconstruction branches are independently driven by their respective objectives to maximize alignment with target and source prompts. The adversarial strategy causes semantic conflicts and unpredictable outcomes due to the lack of coordination between branches. To overcome these issues, we propose Coopetitive Training-Free Image Editing (CoEdit), a novel zero-shot framework that transforms attention control from competition to coopetitive negotiation, achieving editing harmony across spatial and temporal dimensions. Spatially, CoEdit introduces Dual-Entropy Attention Manipulation, which quantifies directional entropic interactions between branches to reformulate attention control as a harmony-maximization problem, eventually improving the localization of editable and preservable regions. Temporally, we present Entropic Latent Refinement mechanism to dynamically adjust latent representations over time, minimizing accumulated editing errors and ensuring consistent semantic transitions throughout the denoising trajectory. Additionally, we propose the Fidelity-Constrained Editing Score, a composite metric that jointly evaluates semantic editing and background fidelity. Extensive experiments on standard benchmarks demonstrate that CoEdit achieves superior performance in both editing quality and structural preservation, enhancing multimedia information utilization by enabling more effective interaction between visual and textual modalities. The code will be available at https://github.com/JinhaoShen/CoEdit.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes CoEdit, a zero-shot training-free framework for text-guided image editing that reframes attention control as coopetitive negotiation rather than competition between editing and reconstruction branches. It introduces Dual-Entropy Attention Manipulation to quantify directional entropic interactions for spatial harmony maximization and improved localization, Entropic Latent Refinement to adjust latents temporally for consistent denoising trajectories, and the Fidelity-Constrained Editing Score as a composite metric for semantic editing and background fidelity. The authors claim superior editing quality and structural preservation on standard benchmarks.

Significance. If the entropy-based mechanisms are shown to deliver the claimed harmony without new conflicts or tuning, the work could meaningfully advance training-free diffusion editing by reducing adversarial branch interactions, with potential benefits for multimedia applications requiring precise yet faithful edits. The directional entropy quantification and the joint fidelity metric are distinctive contributions if empirically grounded.

major comments (3)
  1. §3.2 (Dual-Entropy Attention Manipulation): the reformulation of attention control as a harmony-maximization problem via directional entropic interactions is presented without a derivation or analysis showing that the resulting weights avoid prompt-dependent scales or new semantic conflicts; this is load-bearing for the central coopetitive claim.
  2. §4 (Experiments): superiority in editing quality and structural preservation is asserted, yet no quantitative tables, ablation results on the entropy terms, or direct comparisons to prior attention-control baselines are referenced in sufficient detail to evaluate the benchmark gains.
  3. §3.3 (Entropic Latent Refinement): the mechanism for dynamically adjusting latents to minimize accumulated errors is described at a high level; it is unclear whether the entropy weighting is parameter-free or requires implicit per-prompt calibration, undermining the training-free guarantee.
minor comments (2)
  1. Notation for the entropy terms (e.g., directional interaction definitions) should be explicitly tied to the attention maps for reproducibility.
  2. The abstract states code will be released, but the manuscript should include a reproducibility statement or pseudocode for the two proposed mechanisms.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and outline revisions that will strengthen the manuscript's rigor and clarity.

Point-by-point responses
  1. Referee: §3.2 (Dual-Entropy Attention Manipulation): the reformulation of attention control as a harmony-maximization problem via directional entropic interactions is presented without a derivation or analysis showing that the resulting weights avoid prompt-dependent scales or new semantic conflicts; this is load-bearing for the central coopetitive claim.

    Authors: We agree that an explicit derivation would better substantiate the coopetitive claim. The Dual-Entropy Attention Manipulation is motivated by quantifying directional entropic interactions to achieve spatial harmony, with empirical results across diverse prompts supporting stability. In the revised manuscript, we will add a dedicated analysis subsection deriving the weight normalization properties, proving scale-invariance, and demonstrating that the formulation avoids introducing new semantic conflicts through bounded entropy terms. revision: yes

  2. Referee: §4 (Experiments): superiority in editing quality and structural preservation is asserted, yet no quantitative tables, ablation results on the entropy terms, or direct comparisons to prior attention-control baselines are referenced in sufficient detail to evaluate the benchmark gains.

    Authors: We acknowledge that the experimental section would benefit from greater detail and explicit referencing. While the manuscript reports quantitative evaluations on standard benchmarks and includes initial ablations, we will expand §4 with full quantitative tables, comprehensive ablation studies isolating the entropy terms, and direct side-by-side comparisons to prior attention-control baselines (e.g., Prompt-to-Prompt, Attend-and-Excite) to clearly demonstrate the benchmark gains. revision: yes

  3. Referee: §3.3 (Entropic Latent Refinement): the mechanism for dynamically adjusting latents to minimize accumulated errors is described at a high level; it is unclear whether the entropy weighting is parameter-free or requires implicit per-prompt calibration, undermining the training-free guarantee.

    Authors: The Entropic Latent Refinement is designed to be fully parameter-free: entropy weights are computed dynamically from latent statistics and attention maps at each timestep with no per-prompt calibration or tunable hyperparameters. We will revise §3.3 to include explicit pseudocode and a step-by-step explanation confirming the absence of any calibration, thereby reinforcing the training-free guarantee. revision: yes
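The rebuttal promises pseudocode; in its absence, here is a hedged sketch of what a parameter-free, entropy-weighted refinement step could look like. The blend form, the reuse of a dual-entropy mask recomputed at each step, and the latent-space formulation are assumptions consistent with the description above, not the authors' actual Entropic Latent Refinement.

    import numpy as np

    def entropic_latent_refinement(z_edit, z_recon, harmony_mask):
        """Blend editing- and reconstruction-branch latents at one denoising step.
        z_edit, z_recon: latent arrays of shape (C, H, W) from the two branches.
        harmony_mask: (H, W) array in [0, 1], e.g. an entropy-derived mask recomputed
        each step from the current attention maps, so no per-prompt tuning is needed."""
        w = harmony_mask[None, :, :]                 # broadcast weight over channels
        return w * z_edit + (1.0 - w) * z_recon      # editable regions follow the edit branch

Whether the real mechanism operates in latent space, in noise-prediction space, or directly on attention outputs is exactly what the promised §3.3 pseudocode would need to pin down.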

Circularity Check

0 steps flagged

No significant circularity; new entropy-based mechanisms are introduced independently of fitted inputs or self-referential definitions.

Full rationale

The abstract and described framework propose Dual-Entropy Attention Manipulation and Entropic Latent Refinement as novel reformulations that quantify directional interactions to achieve harmony maximization. No load-bearing equations, self-citations, or reductions to prior fitted parameters are evident in the provided text. The central claims add independent controls for spatial-temporal coordination rather than deriving predictions from the same inputs by construction. On the available text, this reads as a normal non-finding rather than a flagged circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities; the Fidelity-Constrained Editing Score and the entropy quantifications may involve implicit choices, but these are not detailed.

pith-pipeline@v0.9.0 · 5568 in / 1140 out tokens · 51709 ms · 2026-05-10T08:08:57.548467+00:00 · methodology

