ContextDrag: Precise Drag-Based Image Editing via Context-Preserving Token Injection and Position-Aligned Attention

Guanbin Li; Huan Yang; Huiguo He; Lianwen Jin; Pengyu Yan; Weizhi Zhong; Yejun Tang; Zheng Liu; Ziqi Yi

arxiv: 2512.08477 · v2 · submitted 2025-12-09 · 💻 cs.CV · cs.AI


Pith reviewed 2026-05-16 23:59 UTC · model grok-4.3


keywords: drag-based image editing · context-preserving token injection · position-aligned attention · in-context editing · image manipulation · diffusion models · VAE features


The pith

ContextDrag performs precise drag-based image editing by injecting context-preserving tokens at position-aligned locations in attention layers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ContextDrag as a framework for intuitive drag-based image editing that leverages in-context capabilities of models like FLUX-Kontext. It addresses issues in existing methods by injecting VAE-encoded reference features directly into attention layers at target positions using latent correspondences from control points. Position-Aligned Attention re-encodes positional embeddings of displaced tokens and masks overlaps to prevent conflicts. This results in better texture preservation and editing accuracy compared to inversion or warping approaches, as shown on DragBench benchmarks.

Core claim

ContextDrag brings drag-based manipulation into the in-context image editing paradigm through Context-preserving Token Injection, which injects clean VAE-encoded reference features at spatially aligned target positions, and Position-Aligned Attention, which re-encodes positional embeddings to match targets and masks overlapping regions to maintain consistency, enabling precise control without inversion or fine-tuning.
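To make the correspondence step concrete, here is a minimal sketch; it is not the authors' implementation. It assumes a rigid translation taken from a single handle/target pair and a hypothetical stride of 16 pixels per latent token. The function names and the stride are illustrative, and the paper's actual correspondence estimation may differ.

```python
def pixel_to_token(x, y, stride=16):
    """Map a pixel coordinate to (col, row) on the latent token grid.

    stride is an assumed pixels-per-token factor, not a value from the paper.
    """
    return x // stride, y // stride


def token_correspondences(region_tokens, handle_px, target_px,
                          grid_w, grid_h, stride=16):
    """Shift every source token by the drag offset, expressed in token units.

    Returns {source (col, row): target (col, row)} for tokens that stay
    on the grid. A rigid translation is assumed for illustration only.
    """
    hx, hy = pixel_to_token(handle_px[0], handle_px[1], stride)
    tx, ty = pixel_to_token(target_px[0], target_px[1], stride)
    dx, dy = tx - hx, ty - hy
    mapping = {}
    for col, row in region_tokens:
        new_col, new_row = col + dx, row + dy
        if 0 <= new_col < grid_w and 0 <= new_row < grid_h:
            mapping[(col, row)] = (new_col, new_row)
    return mapping
```

Source tokens whose shifted position falls outside the grid are simply dropped here. That is one possible policy; the summary above does not say how the paper treats that case.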

What carries the argument

Context-preserving Token Injection (CTI) and Position-Aligned Attention (PAA) that operate on clean encoded features guided by latent-space correspondences from user control points.

If this is right

Drag operations achieve higher texture fidelity by avoiding noisy inversion outputs.
Precise control is maintained even with spatial displacements through re-encoded embeddings.
Visual consistency improves by masking conflicting features in overlapping regions.
No fine-tuning or inversion steps are required, simplifying the editing process.
State-of-the-art results are obtained on DragBench-SR and DragBench-DR benchmarks.
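The overlap masking mentioned in the list above can be sketched as follows. This is an illustrative reading, not the paper's code: it assumes the drag is given as a hypothetical dictionary from source token coordinates to destination token coordinates, and it treats a token as conflicting when its location is both a source and a destination.

```python
def overlap_tokens(mapping):
    """Return token locations that are both a drag source and a drag destination."""
    sources = set(mapping)
    destinations = set(mapping.values())
    return sources & destinations


def keep_mask(grid_w, grid_h, mapping):
    """Build a row-major boolean grid over the token layout.

    False marks a location in the source/destination overlap, where the
    original feature would compete with the injected one.
    """
    blocked = overlap_tokens(mapping)
    return [
        [(col, row) not in blocked for col in range(grid_w)]
        for row in range(grid_h)
    ]
```

In an attention layer, a mask like this would typically be turned into a large negative bias on the blocked keys; how the paper applies it is not specified in this summary.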

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This method could be adapted to other in-context image editing models for broader applicability.
Real-time interactive editing applications may benefit from the reduced computational steps.
Extending the token injection to handle multiple simultaneous drags could enable more complex manipulations.
The alignment technique might apply to other token-based manipulation tasks in generative models.

Load-bearing premise

That latent-space correspondences from user-specified control points accurately map reference features to target regions without introducing misalignment or artifacts.

What would settle it

A test showing visible artifacts or loss of detail in dragged regions on DragBench images when using large drag distances would indicate the correspondences do not guide injection precisely enough.
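Such a test could be organised by grouping per-sample errors by drag distance. The sketch below assumes a hypothetical record layout of (handle point, target point, error score) and an arbitrary bin width in pixels; neither comes from the paper or from DragBench.

```python
import math
from collections import defaultdict


def mean_error_by_drag_distance(samples, bin_width=50.0):
    """Average an error score within bins of drag distance.

    samples: iterable of ((hx, hy), (tx, ty), error).
    Returns {bin_start_in_pixels: mean_error}, ordered by bin start.
    """
    buckets = defaultdict(list)
    for (hx, hy), (tx, ty), err in samples:
        dist = math.hypot(tx - hx, ty - hy)
        bin_start = int(dist // bin_width) * bin_width
        buckets[bin_start].append(err)
    return {b: sum(v) / len(v) for b, v in sorted(buckets.items())}
```

If the mean error rises steeply in the largest-distance bins, that would point to the correspondences degrading with displacement.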

Figures

Figures reproduced from arXiv: 2512.08477 by Guanbin Li, Huan Yang, Huiguo He, Lianwen Jin, Pengyu Yan, Weizhi Zhong, Yejun Tang, Zheng Liu, Ziqi Yi.

**Figure 1.** Illustration of our ContextDrag framework. The Drag-Guided Editing Framework injects noise-free reference tokens into the ...

**Figure 2.** Qualitative comparisons of drag editing between our ContextDrag and other SOTA methods. Our approach achieves the most ...

**Figure 3.** Qualitative comparisons of drag editing between our ...

**Figure 4.** Qualitative comparisons of ablation study. Our full ...

**Figure 5.** Failure cases of our ContextDrag and competing ap...

**Figure 6.** An illustration of the Concept Preservation (CP) Evaluation Instruction and the corresponding Summary & Planning returned by ...

**Figure 7.** An illustration of the Prompt Following (PF) Evaluation Instruction and the corresponding Summary & Planning returned by ...

**Figure 8.** Qualitative examples illustrating both Concept Preservation (CP) and Prompt Following (PF) for a given test case. Each row ...

**Figure 9.** Qualitative examples illustrating both Concept Preservation (CP) and Prompt Following (PF) for a given test case. Each row ...

**Figure 10.** Qualitative comparisons of editing results under different interpolation weights ...

**Figure 11.** Qualitative comparisons of drag editing between our ContextDrag and other SOTA methods. Our approach achieves the most ...

**Figure 12.** Qualitative comparisons of drag editing between our ContextDrag and other SOTA methods. Our approach achieves the most ...

**Figure 13.** Qualitative comparisons of drag editing between our warping strategy and other warping methods. Our warping approach ...

**Figure 14.** Qualitative comparisons of ablation study. Our full model accurately moves the object to the target destination while better ...

**Figure 15.** Additional qualitative results of our ContextDrag. These results demonstrate the effectiveness of our method. The target points ...

Original abstract

Drag-based image editing enables intuitive visual manipulation through point-based drag operations. Existing methods mainly rely on diffusion inversion or pixel-space warping with inpainting. However, inversion inherently introduces approximation errors that degrade texture fidelity, whereas rigid pixel-space operations discard semantic context and produce unnatural deformations. To address these issues, we introduce ContextDrag, to our knowledge the first framework that brings drag-based manipulation into the in-context image editing paradigm. By leveraging the in-context capabilities of editing models (e.g., FLUX-Kontext), ContextDrag enables precise drag editing without inversion or fine-tuning. Specifically, we first propose Context-preserving Token Injection (CTI), which injects VAE-encoded reference features into attention layers at spatially aligned target positions, guided by latent-space correspondences estimated directly from user-specified control points. By operating on clean, directly encoded features rather than noisy inversion outputs, CTI preserves rich texture details and enables precise drag control. Second, we propose Position-Aligned Attention (PAA) to eliminate interference caused by spatial displacement of reference features. PAA re-encodes positional embeddings of displaced reference tokens to match their target locations, and masks overlapping regions between source and destination to prevent conflicting features from degrading visual consistency. Experiments on DragBench-SR and DragBench-DR demonstrate that ContextDrag achieves SOTA editing accuracy and overall quality, and comprehensive ablations validate the effectiveness of each proposed component. Code will be publicly available.
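The positional re-encoding idea can be illustrated with rotary position embeddings (RoPE). The sketch below assumes the backbone uses RoPE and works in one dimension for brevity; it is a generic demonstration, not the paper's PAA code. With RoPE, the query-key score depends only on the relative position, so rotating a displaced reference key with its target position makes it look co-located with a query at that target.

```python
import math


def rope_rotate(vec, pos, base=10000.0):
    """Apply 1D rotary position embedding to an even-length vector."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        # Pair (i, i + 1) rotates by pos * base^(-i / d).
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend((x * c - y * s, x * s + y * c))
    return out


def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


# Illustrative vectors and positions (arbitrary values).
q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]
source_pos, target_pos = 3, 7

# Key left at its source position: the score carries a relative offset of 4.
score_unaligned = dot(rope_rotate(q, target_pos), rope_rotate(k, source_pos))
# Key re-encoded at the target position: relative offset 0, same as dot(q, k).
score_aligned = dot(rope_rotate(q, target_pos), rope_rotate(k, target_pos))
```

Image transformers usually apply this rotation per spatial axis rather than along a single index, which the sketch omits.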

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ContextDrag moves drag editing into in-context models with direct VAE token injection and positional fixes, but the unrefined latent point mappings look like the main risk for complex cases.


Hi, the main thing to know is that ContextDrag tries to do precise drag-based image edits inside an in-context model like FLUX-Kontext by injecting clean VAE-encoded reference features at target positions, using latent correspondences from the control points, plus a Position-Aligned Attention step to handle shifts and overlaps. This skips the usual diffusion inversion step entirely and claims better texture fidelity as a result.

They introduce two components, Context-preserving Token Injection for the feature placement and Position-Aligned Attention for re-encoding positions and masking conflicts, and position the work as the first to combine drag manipulation with the in-context editing setup without retraining or inversion.

What they do well is keep the reference features direct from encoding rather than noisy inversion outputs, which directly targets the texture degradation problem in prior methods. The ablations on DragBench-SR and DragBench-DR are presented to show each piece contributes, and the overall approach stays training-free, which is practical for editing tools.

The soft spots center on the correspondence step. The method estimates placements straight from user points in latent space without any learned matcher or refinement. For simple translations this may work, but non-rigid drags, rotations, scaling, or occlusions can easily misalign the injected tokens with actual image content and create the deformations or artifacts the method aims to avoid. The abstract reports SOTA accuracy and quality but gives no error bars, exact metrics, or detailed protocol, so those numbers are difficult to assess fully without the code or full results tables.

This paper is aimed at computer vision researchers and tool builders who work on interactive editing interfaces or extensions of in-context diffusion models. A reader following point-based manipulation or ways to leverage models like FLUX without fine-tuning would find the concrete mechanisms and benchmark comparisons useful to consider.

The work is coherent on its own terms and shows honest engagement with the limitations of inversion-based baselines, so it deserves a serious referee to check the alignment robustness and implementation details. I would send it for peer review.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces ContextDrag, a drag-based image editing framework that operates in the in-context paradigm using models such as FLUX-Kontext. It proposes Context-preserving Token Injection (CTI) to inject clean VAE-encoded reference features into attention layers at target positions, guided by direct latent-space correspondences derived from user-specified control points, and Position-Aligned Attention (PAA) to re-encode positional embeddings of displaced tokens while masking source-destination overlaps. The work claims state-of-the-art editing accuracy and overall quality on DragBench-SR and DragBench-DR, with ablations confirming the contribution of each component, all without diffusion inversion or fine-tuning.

Significance. If the results hold, the approach offers a meaningful alternative to inversion-based and pixel-warping methods by preserving texture fidelity through direct feature injection. The explicit avoidance of inversion artifacts and the use of in-context capabilities without additional training are clear strengths that could influence future editing pipelines. However, the significance is tempered by the load-bearing nature of the unrefined correspondence estimation, which has not yet been shown to generalize reliably beyond the reported benchmarks.

major comments (2)

[Method (CTI description)] The central mechanism in Context-preserving Token Injection relies on estimating latent-space correspondences directly from user control points to position injected VAE tokens. This assumption is load-bearing for the SOTA accuracy claim and the absence of misalignment artifacts, yet the manuscript provides no dedicated analysis or experiments testing robustness under rotation, scaling, non-rigid deformation, or partial occlusion—precisely the conditions where point-to-region mapping in VAE space is most likely to deviate from semantic correspondence.
[Experiments] Experiments section: While SOTA results and ablations are reported on DragBench-SR and DragBench-DR, the evaluation protocol lacks error bars, statistical significance tests, or a detailed description of how metrics are computed across drag types. This omission prevents independent verification of the quantitative claims and weakens the cross-benchmark generalization argument.
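The robustness cases named in the first major comment could be generated synthetically. The sketch below is one hedged way to do so: it produces target points by rotating and scaling handle points about a chosen pivot. The function name and parameters are hypothetical and not taken from the manuscript.

```python
import math


def transform_points(points, pivot, angle_deg=0.0, scale=1.0):
    """Rotate (counter-clockwise, in degrees) and scale points about a pivot.

    Returns a list of (x, y) target points, one per input handle point.
    """
    px, py = pivot
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    out = []
    for x, y in points:
        dx, dy = (x - px) * scale, (y - py) * scale
        out.append((px + dx * cos_a - dy * sin_a,
                    py + dx * sin_a + dy * cos_a))
    return out
```

Pairing each handle point with its transformed counterpart gives drag instructions with a known ground-truth rotation or scale, which makes misalignment measurable. Note that in image coordinates, where y grows downward, the same formula appears as a clockwise rotation on screen.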

minor comments (2)

[Abstract] The abstract states the method is 'to our knowledge, the first' to bring drag editing into the in-context paradigm; a brief related-work paragraph explicitly contrasting against the closest prior in-context editing works would strengthen this positioning.
[Overall] A schematic figure showing the flow from control points through CTI token placement and PAA masking would substantially improve readability of the technical contributions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have revised the manuscript to strengthen the presentation of our method and experiments.


Referee: [Method (CTI description)] The central mechanism in Context-preserving Token Injection relies on estimating latent-space correspondences directly from user control points to position injected VAE tokens. This assumption is load-bearing for the SOTA accuracy claim and the absence of misalignment artifacts, yet the manuscript provides no dedicated analysis or experiments testing robustness under rotation, scaling, non-rigid deformation, or partial occlusion—precisely the conditions where point-to-region mapping in VAE space is most likely to deviate from semantic correspondence.

Authors: We thank the referee for this observation. While DragBench-SR and DragBench-DR contain diverse drag operations that implicitly test some displacement and deformation scenarios, we acknowledge the absence of targeted robustness experiments for rotation, scaling, non-rigid deformation, and partial occlusion. In the revised manuscript we will add a dedicated subsection with both qualitative examples and quantitative metrics on synthetically transformed control points to evaluate CTI behavior under these conditions, along with a discussion of remaining limitations. revision: yes
Referee: [Experiments] Experiments section: While SOTA results and ablations are reported on DragBench-SR and DragBench-DR, the evaluation protocol lacks error bars, statistical significance tests, or a detailed description of how metrics are computed across drag types. This omission prevents independent verification of the quantitative claims and weakens the cross-benchmark generalization argument.

Authors: We agree that greater statistical rigor and transparency would improve reproducibility. In the revised Experiments section we will (1) report error bars as standard deviations over multiple random seeds, (2) include paired statistical significance tests against baselines, and (3) expand the protocol description to specify exactly how accuracy and quality metrics are aggregated across drag types and the two benchmarks. revision: yes
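The statistics promised in this response can be computed with the standard library alone. The sketch below is illustrative only: it takes per-seed or per-sample score lists as input, and the exact sign-flip permutation test stands in for whichever paired test the authors end up using.

```python
import itertools
import statistics


def mean_and_std(values):
    """Mean and sample standard deviation, e.g. over random seeds."""
    return statistics.mean(values), statistics.stdev(values)


def paired_sign_flip_pvalue(ours, baseline):
    """Exact two-sided sign-flip permutation test on paired differences.

    Enumerates all 2^n sign patterns, so it is only practical for small n.
    """
    diffs = [a - b for a, b in zip(ours, baseline)]
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        # Small tolerance guards against floating-point ties.
        if flipped >= observed - 1e-12:
            hits += 1
    return hits / total
```

For benchmark-sized sample counts, the enumeration would need to be replaced by random sampling of sign patterns or by a parametric paired test.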

Circularity Check

0 steps flagged

No significant circularity; new components and external benchmarks keep derivation self-contained


The paper introduces Context-preserving Token Injection (CTI) and Position-Aligned Attention (PAA) as novel mechanisms that inject VAE-encoded reference features at positions derived from user control points and re-encode positional embeddings to resolve displacement. These are presented as independent engineering contributions operating on clean encodings from existing models such as FLUX-Kontext. Performance claims rest on quantitative results and ablations conducted on the external DragBench-SR and DragBench-DR datasets rather than on any fitted parameter, self-referential equation, or load-bearing self-citation. No derivation step equates an output to its input by construction, and the method description contains no uniqueness theorems or ansatzes imported from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

Central claim rests on the assumption that in-context models can be directly extended via token injection and positional re-encoding without additional training or inversion.

axioms (1)

domain assumption: In-context capabilities of models such as FLUX-Kontext can be leveraged for precise drag editing via feature injection.
Invoked when introducing the framework as operating without inversion or fine-tuning.

invented entities (2)

Context-preserving Token Injection (CTI): no independent evidence
purpose: Injects VAE-encoded reference features into attention layers at spatially aligned target positions guided by latent correspondences.
New technique proposed to preserve texture details.

Position-Aligned Attention (PAA): no independent evidence
purpose: Re-encodes positional embeddings of displaced tokens and masks overlapping regions to prevent feature conflicts.
New technique proposed to maintain visual consistency.

pith-pipeline@v0.9.0 · 5585 in / 1275 out tokens · 26946 ms · 2026-05-16T23:59:34.449725+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · relation between the paper passage and the cited Recognition theorem: unclear

Context-preserving Token Injection (CTI) ... Position-Consistent Attention (PCA) ... RoPE re-encoding

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

81 extracted references · 81 canonical work pages · 1 internal anchor

[1]

Positional encoding field

Yunpeng Bai, Haoxiang Li, and Qixing Huang. Positional encoding field. arXiv preprint arXiv:2510.20385, 2025. 2, 5

work page arXiv 2025
[2]

Stephen Batifol, Andreas Blattmann, Frederic Boesel, Sak- sham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, et al. FLUX. 1 kontext: Flow matching for in-context image generation and editing in latent space. arXiv e-prints, pages arXiv–2506,

work page
[3]

In- structPix2Pix: Learning to follow image editing instructions

Tim Brooks, Aleksander Holynski, and Alexei A Efros. In- structPix2Pix: Learning to follow image editing instructions. In CVPR, pages 18392–18402, 2023. 2

work page 2023
[4]

MasaCtrl: Tuning-free mu- tual self-attention control for consistent image synthesis and editing

Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xi- aohu Qie, and Yinqiang Zheng. MasaCtrl: Tuning-free mu- tual self-attention control for consistent image synthesis and editing. In ICCV, pages 22560–22570, 2023. 4

work page 2023
[5]

XVerse: Consistent multi-subject control of identity and semantic attributes via dit modulation

Bowen Chen, Mengyi Zhao, Haomiao Sun, Li Chen, Xu Wang, Kang Du, and Xinglong Wu. XVerse: Consistent multi-subject control of identity and semantic attributes via dit modulation. NIPS, pages 1–10, 2025. 1

work page 2025
[6]

Scaling recti- fied flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling recti- fied flow transformers for high-resolution image synthesis. In ICML, 2024. 11

work page 2024
[7]

An image is worth one word: Personalizing text-to-image generation using textual inversion

Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit Haim Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. In ICLR, 2023. 2

work page 2023
[8]

Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio

Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014. 2

work page 2014
[9]

Learning profitable NFT image diffusions via mul- tiple visual-policy guided reinforcement learning

Huiguo He, Tianfu Wang, Huan Yang, Jianlong Fu, Nicholas Jing Yuan, Jian Yin, Hongyang Chao, and Qi Zhang. Learning profitable NFT image diffusions via mul- tiple visual-policy guided reinforcement learning. In ACM MM, pages 6831–6840, 2023. 2

work page 2023
[10]

Improving multi-subject consistency in open-domain image genera- tion with isolation and reposition attention

Huiguo He, Qiuyue Wang, Yuan Zhou, Yuxuan Cai, Hongyang Chao, Jian Yin, and Huan Yang. Improving multi-subject consistency in open-domain image genera- tion with isolation and reposition attention. arXiv preprint arXiv:2411.19261, 2024. 5

work page arXiv 2024
[11]

DreamStory: Open-domain story visualiza- tion by llm-guided multi-subject consistent diffusion

Huiguo He, Huan Yang, Zixi Tuo, Yuan Zhou, Qiuyue Wang, Yuhang Zhang, Zeyu Liu, Wenhao Huang, Hongyang Chao, and Jian Yin. DreamStory: Open-domain story visualiza- tion by llm-guided multi-subject consistent diffusion. IEEE Transactions on Pattern Analysis and Machine Intelligence,

work page
[12]

Prompt-to-prompt image editing with cross-attention control

Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross-attention control. In ICLR, 2023. 2

work page 2023
[13]

EasyDrag: Efficient point-based manip- ulation on diffusion models

Xingzhong Hou, Boxiao Liu, Yi Zhang, Jihao Liu, Yu Liu, and Haihang You. EasyDrag: Efficient point-based manip- ulation on diffusion models. In CVPR, pages 8404–8413,

work page
[14]

Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In ICLR, 2022. 2

work page 2022
[15]

Imagic: Text-based real image editing with diffusion models

Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. In CVPR, pages 6007–6017, 2023. 2, 6

work page 2023
[16]

Dif- fusionCLIP: Text-guided diffusion models for robust image manipulation

Gwanghyun Kim, Taesung Kwon, and Jong Chul Ye. Dif- fusionCLIP: Text-guided diffusion models for robust image manipulation. In CVPR, pages 2426–2435, 2022. 2

work page 2022
[17]

Segment any- thing

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C Berg, Wan-Yen Lo, et al. Segment any- thing. In CVPR, pages 4015–4026, 2023. 11

work page 2023
[18]

FreeDrag: Feature dragging for reliable point-based image editing

Pengyang Ling, Lin Chen, Pan Zhang, Huaian Chen, Yi Jin, and Jinjin Zheng. FreeDrag: Feature dragging for reliable point-based image editing. In CVPR, pages 6860–6870,

work page
[19]

Inpaint4Drag: Repurposing inpaint- ing models for drag-based image editing via bidirectional warping

Jingyi Lu and Kai Han. Inpaint4Drag: Repurposing inpaint- ing models for drag-based image editing via bidirectional warping. In ICCV, pages 18304–18313, 2025. 1, 2, 3, 4, 5, 6, 7, 8, 11, 13

work page 2025
[20]

RegionDrag: Fast region-based image editing with diffusion models

Jingyi Lu, Xinghui Li, and Kai Han. RegionDrag: Fast region-based image editing with diffusion models. InECCV, pages 231–246. Springer, 2024. 2

work page 2024
[21]

RotationDrag: point-based image editing with rotated diffusion features

Minxing Luo, Wentao Cheng, and Jian Yang. RotationDrag: point-based image editing with rotated diffusion features. arXiv preprint arXiv:2401.06442, 2024. 2

work page arXiv 2024
[22]

Null-text inversion for editing real images using guided diffusion models

Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. In CVPR, pages 6038–6047,

work page
[23]

DragonDiffusion: Enabling drag-style manipu- lation on diffusion models

Chong Mou, Xintao Wang, Jiechong Song, Ying Shan, and Jian Zhang. DragonDiffusion: Enabling drag-style manipu- lation on diffusion models. In ICLR, 2024. 2

work page 2024
[24]

DiffEditor: Boosting accuracy and flexibility on diffusion-based image editing

Chong Mou, Xintao Wang, Jiechong Song, Ying Shan, and Jian Zhang. DiffEditor: Boosting accuracy and flexibility on diffusion-based image editing. In CVPR, pages 8488–8497,

work page
[25]

GLIDE: Towards photorealistic image genera- tion and editing with text-guided diffusion models

Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealistic image genera- tion and editing with text-guided diffusion models. InICML, pages 16784–16804, 2022. 1

work page 2022
[26]

The blessing of random- ness: SDE beats ODE in general diffusion-based image edit- ing

Shen Nie, Hanzhong Allan Guo, Cheng Lu, Yuhao Zhou, Chenyu Zheng, and Chongxuan Li. The blessing of random- ness: SDE beats ODE in general diffusion-based image edit- ing. In ICLR, 2024. 2, 5, 6, 7, 8, 12, 13

work page 2024
[27]

Drag your GAN: Interactive point-based manipulation on the generative image manifold

Xingang Pan, Ayush Tewari, Thomas Leimk ¨uhler, Lingjie Liu, Abhimitra Meka, and Christian Theobalt. Drag your GAN: Interactive point-based manipulation on the generative image manifold. In ACM SIGGRAPH, pages 1–11, 2023. 1, 2, 6

work page 2023
[28]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. In CVPR, pages 4195–4205, 2023. 1

work page 2023
[29]

Dreambench++: A human-aligned bench- mark for personalized image generation

Yuang Peng, Yuxin Cui, Haomiao Tang, Zekun Qi, Runpei Dong, Jing Bai, Chunrui Han, Zheng Ge, Xiangyu Zhang, and Shu-Tao Xia. Dreambench++: A human-aligned bench- mark for personalized image generation. In ICLR, 2025. 6, 11

work page 2025
[30]

Drag- ging with geometry: From pixels to geometry-guided image editing

Xinyu Pu, Hongsong Wang, Jie Gui, and Pan Zhou. Drag- ging with geometry: From pixels to geometry-guided image editing. arXiv preprint arXiv:2509.25740, 2025. 2

work page arXiv 2025
[31]

High-resolution image syn- thesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj¨orn Ommer. High-resolution image syn- thesis with latent diffusion models. In CVPR, pages 10684– 10695, 2022. 1

work page 2022
[32]

DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation

Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. In CVPR, pages 22500–22510, 2023. 1

work page 2023
[33]

DragDiffusion: Harnessing diffusion models for interactive point-based image editing

Yujun Shi, Chuhui Xue, Jun Hao Liew, Jiachun Pan, Han- shu Yan, Wenqing Zhang, Vincent YF Tan, and Song Bai. DragDiffusion: Harnessing diffusion models for interactive point-based image editing. In CVPR, pages 8839–8849,

work page
[34]

Instant- Drag: Improving interactivity in drag-based image editing

Joonghyuk Shin, Daehyeon Choi, and Jaesik Park. Instant- Drag: Improving interactivity in drag-based image editing. In SIGGRAPH Asia 2024 Conference Papers, pages 1–10,

work page 2024
[35]

RoFormer: Enhanced transformer with rotary position embedding

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063,

work page
[36]

OminiControl: Minimal and universal control for diffusion transformer

Zhenxiong Tan, Songhua Liu, Xingyi Yang, Qiaochu Xue, and Xinchao Wang. OminiControl: Minimal and universal control for diffusion transformer. In ICCV, pages 14940– 14950, 2025. 1, 3

work page 2025
[37]

Training-free con- sistent text-to-image generation

Yoad Tewel, Omri Kaduri, Rinon Gal, Yoni Kasten, Lior Wolf, Gal Chechik, and Yuval Atzmon. Training-free con- sistent text-to-image generation. TOG, 43(4):1–18, 2024. 4

[38] Mengyu Wang, Henghui Ding, Jianing Peng, Yao Zhao, Yunpeng Chen, and Yunchao Wei. CharaConsist: Fine-grained consistent character generation. In ICCV, pages 16058–16067, 2025.

[39] Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-Image technical report. arXiv preprint arXiv:2508.02324, 2025.

[40] Chen Henry Wu and Fernando De la Torre. A latent space of stochastic diffusion models for zero-shot image editing and guidance. In ICCV, pages 7378–7387, 2023.

[41] Shaojin Wu, Mengqi Huang, Wenxu Wu, Yufeng Cheng, Fei Ding, and Qian He. Less-to-more generalization: Unlocking more controllability by in-context generation. In ICCV, pages 18682–18692, 2025.

[42] Siwei Xia, Li Sun, Tiantian Sun, and Qingli Li. DragLoRA: Online optimization of LoRA adapters for drag-based image editing in diffusion model. In ICML, pages 68277–68291. PMLR, 2025.

[43] Zixin Yin, Xili Dai, Duomin Wang, Xianfang Zeng, Lionel M Ni, Gang Yu, and Heung-Yeung Shum. LazyDrag: Enabling stable drag-based editing on multi-modal diffusion transformers via explicit correspondence. arXiv preprint arXiv:2509.12203, 2025.

[44] Fangneng Zhan, Yingchen Yu, Rongliang Wu, Jiahui Zhang, Shijian Lu, Lingjie Liu, Adam Kortylewski, Christian Theobalt, and Eric Xing. Multimodal image synthesis and editing: The generative AI era. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(12):15098–15119,

[45] Zewei Zhang, Huan Liu, Jun Chen, and Xiangyu Xu. GoodDrag: Towards good practices for drag editing with diffusion models. In ICLR, 2025.

[46] Xuanjia Zhao, Jian Guan, Congyi Fan, Dongli Xu, Youtian Lin, Haiwei Pan, and Pengming Feng. FastDrag: Manipulate anything in one step. NIPS, 37:74439–74460, 2024.

This supplementary material provides additional details and extended experimental results that complement the main paper. Sec. A presents additional implementation deta...

1. The presence and consistency of main semantic objects
2. The preservation of spatial layouts and object relationships
3. The plausibility of shapes and geometry
4. The consistency of color and texture
5. The global visual style and coherence.

Based on an integrated assessment of these aspects, it outputs an integer CP score ranging from 0–4, where a higher score indicates better preservation of the original concept. This CP metric enables us to systematically compare different drag-editing methods in terms of how faithfully they maintain the source im...

1. Whether the correct semantic object specified in the prompt is identified and manipulated
2. Whether the object’s movement direction is consistent with the arrow direction and the semantic intent
3. Whether the displaced object moves toward and aligns with the specified target point.

Based on an integrated assessment of these aspects, it outputs an integer PF score ranging from 0–4, where a higher score indicates better adherence to the intended manipulation. We further observe that explicitly marking the target positions in the generated image is...

This example shows that PF effectively penalizes cases where the semantic manipulation is not executed, even when the visual appearance remains consistent. In the same Fig. 8 (second row), the car rotation example illustrates the complementary role of the CP metric. Here, the object roughly follows the intended motion but undergoes substantial distortio...

1. Semantic Objects Consistency: (1) Compare the main semantic objects in the reference and edited images. (2) Check if any objects are missing, added, or have undergone semantic changes.
2. Semantic Structure & Relationships: (1) Assess the spatial positions, hierarchical relations, and logical interactions between objects in both images. (2) Ensure that relative positions, occlusions, poses, and interactions are consistent.
3. Shape & Geometry: (1) Examine the shape, proportions, and pose of each semantic object. (2) Look for distortions, misalignments, or unrealistic deformations.
4. Color & Texture: (1) Compare the tones, brightness, lighting direction, and texture style. (2) Identify any unnatural color shifts, blurriness, discontinuities, or editing artifacts.
5. Global Style & Visual Coherence: (1) Determine the alignment of overall visual style, lighting, and fine details. (2) Ensure all semantic elements are naturally integrated and visually harmonious.
6. Scoring: (1) Assign a score from 0 to 4 based on the overall degree of concept preservation, following the criteria provided.

Execution: (1) I will systematically analyze each dimension for consistency between the reference and edited images, using the drag map for context. (2) I will provide a brief internal thinking summary of my observations for each ...

1. Semantic Objects Consistency:
   - The main object, a crocodile, is present in both the reference and edited images.
   - No objects are missing or added.
2. Semantic Structure & Relationships:
   - The spatial position of the crocodile remains consistent.
   - The mouth area has been altered, with the teeth appearing more closed.
3. Shape & Geometry:
   - The shape of the crocodile's snout and mouth has changed, with the teeth now appearing less prominent and more closed.
   - There is a slight distortion in the mouth area.
4. Color & Texture:
   - The color and texture of the crocodile's skin are consistent with the reference image.
   - No unnatural color shifts or blurriness are observed.
5. Global Style & Visual Coherence:
   - The overall style and lighting are consistent with the reference image.
   - The edited image maintains visual coherence.

[Score] Concept Preservation (CP): 3. The edited image mostly preserves the semantic and visual characteristics of the reference image, with only minor differences in the shape and geometry of the crocodi...

1. Semantic Objects Consistency:
   - The main object, the car, is present in both the reference and edited images.
   - No new objects are added, but the car's details are significantly altered.
2. Semantic Structure & Relationships:
   - The car's position on the street remains consistent.
   - The spatial relationship with the background elements (buildings, street) is maintained.
3. Shape & Geometry:
   - The car's shape is distorted in the edited image, with unrealistic deformations.
   - Proportions and details like the wheels and windows are not well-preserved.
4. Color & Texture:
   - The color of the car is mostly consistent, but the texture is blurred and lacks detail.
   - The lighting and shadows are not as defined as in the reference image.
5. Global Style & Visual Coherence:
   - The overall style is somewhat maintained, but the car appears less integrated due to the loss of detail and texture.
   - The edited image lacks the sharpness and clarity of the reference.

[Score] Concept Preservation (CP): 1. The edited image retains the basic presence and position of the car but fails to preserve its shape...

1. Semantic Objects Consistency:
   - The main object, a puppy, is present in both the reference and edited images. No objects are missing or added.
2. Semantic Structure & Relationships:
   - The spatial position of the puppy remains consistent. However, the face has been altered, affecting the logical interaction of facial features.
3. Shape & Geometry:
   - The shape of the puppy's face is distorted, particularly around the nose and eyes, due to the dragging operation. This results in unrealistic deformations.
4. Color & Texture:
   - The colors and textures are mostly consistent with the reference image. There are no significant color shifts or texture issues.
5. Global Style & Visual Coherence:
   - The overall style and lighting remain consistent. However, the distortion in the face affects visual coherence.

[Score] Concept Preservation (CP): 2. The edited image retains most semantic objects and colors, but noticeable deviations in shape and geometry reduce the overall concept preservation.

reasoning: [Internal Thinking]

1. Semantic Objects Consistency:
   - The main semantic object, the sheep, is present in both the reference and edited images. No objects are missing or added.
2. Semantic Structure & Relationships:
   - The spatial position and orientation of the sheep remain consistent. The ears have been slightly adjusted upwards as indicated by the drag map, but the overall structure is maintained.
3. Shape & Geometry:
   - The shape and proportions of the sheep, including its ears, are consistent with the reference image. The ears have been moved slightly upwards, but there are no distortions or unrealistic deformations.
4. Color & Texture:
   - The colors and textures of the sheep's wool and the background are consistent with the reference image. There are no unnatural color shifts or blurriness.
