arxiv: 2512.08477 · v2 · submitted 2025-12-09 · 💻 cs.CV · cs.AI

ContextDrag: Precise Drag-Based Image Editing via Context-Preserving Token Injection and Position-Aligned Attention

Pith reviewed 2026-05-16 23:59 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords drag-based image editing · context-preserving token injection · position-aligned attention · in-context editing · image manipulation · diffusion models · VAE features

The pith

ContextDrag performs precise drag-based image editing by injecting clean VAE-encoded reference tokens into attention layers at their target positions and re-encoding their positional embeddings to match.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ContextDrag, a framework for drag-based image editing that leverages the in-context capabilities of models such as FLUX-Kontext. It addresses the weaknesses of existing methods by injecting VAE-encoded reference features directly into attention layers at target positions, using latent-space correspondences estimated from user control points. Position-Aligned Attention re-encodes the positional embeddings of displaced tokens and masks source-destination overlaps to prevent conflicts. The paper reports better texture preservation and editing accuracy than inversion-based or warping-based approaches on the DragBench-SR and DragBench-DR benchmarks.

Core claim

ContextDrag brings drag-based manipulation into the in-context image editing paradigm through Context-preserving Token Injection, which injects clean VAE-encoded reference features at spatially aligned target positions, and Position-Aligned Attention, which re-encodes positional embeddings to match targets and masks overlapping regions to maintain consistency, enabling precise control without inversion or fine-tuning.

What carries the argument

Context-preserving Token Injection (CTI) and Position-Aligned Attention (PAA) that operate on clean encoded features guided by latent-space correspondences from user control points.
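
One way to picture the correspondence step is as an index map on the latent token grid. The sketch below is an illustration only, not the authors' implementation: it assumes a single handle/target pair, a rigid translation of the dragged region, and a hypothetical stride of 16 pixels per token (`px_per_token`). All function names are invented for this example.

```python
from typing import Dict, Iterable, Tuple

Point = Tuple[float, float]   # (x, y) in pixels
Token = Tuple[int, int]       # (row, col) on the latent token grid


def pixel_to_token(p: Point, px_per_token: int = 16) -> Token:
    """Map a pixel coordinate to the token cell that contains it."""
    x, y = p
    return int(y // px_per_token), int(x // px_per_token)


def injection_map(
    region_tokens: Iterable[Token],
    handle: Point,
    target: Point,
    grid_hw: Tuple[int, int],
    px_per_token: int = 16,
) -> Dict[Token, Token]:
    """Map each source token of the dragged region to its target token.

    Assumption: the drag is a pure translation, so every token in the
    region moves by the same (row, col) offset as the handle point.
    """
    hr, hc = pixel_to_token(handle, px_per_token)
    tr, tc = pixel_to_token(target, px_per_token)
    dr, dc = tr - hr, tc - hc
    height, width = grid_hw
    mapping: Dict[Token, Token] = {}
    for r, c in region_tokens:
        nr, nc = r + dr, c + dc
        # Tokens that would land outside the grid are dropped.
        if 0 <= nr < height and 0 <= nc < width:
            mapping[(r, c)] = (nr, nc)
    return mapping
```

The resulting dictionary says which reference token would be injected at which target position. Real drags that rotate, scale, or deform the region would need a richer correspondence than a single offset.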

If this is right

  • Drag operations achieve higher texture fidelity by avoiding noisy inversion outputs.
  • Precise control is maintained even with spatial displacements through re-encoded embeddings.
  • Visual consistency improves by masking conflicting features in overlapping regions.
  • No fine-tuning or inversion steps are required, simplifying the editing process.
  • State-of-the-art results are obtained on DragBench-SR and DragBench-DR benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This method could be adapted to other in-context image editing models for broader applicability.
  • Real-time interactive editing applications may benefit from the reduced computational steps.
  • Extending the token injection to handle multiple simultaneous drags could enable more complex manipulations.
  • The alignment technique might apply to other token-based manipulation tasks in generative models.

Load-bearing premise

That latent-space correspondences from user-specified control points accurately map reference features to target regions without introducing misalignment or artifacts.

What would settle it

A test showing visible artifacts or loss of detail in dragged regions on DragBench images when using large drag distances would indicate the correspondences do not guide injection precisely enough.
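
A check of this kind could start by grouping per-sample error by drag length. The sketch below is a hypothetical analysis helper, not part of the paper's protocol: it assumes each sample supplies a handle point, a target point, and the point where the dragged content actually ended up (how that achieved point is located is outside this sketch), and it uses an arbitrary bin width.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Point = Tuple[float, float]


def error_by_drag_length(
    samples: Iterable[Tuple[Point, Point, Point]],
    bin_width: float = 40.0,
) -> Dict[int, float]:
    """Average target error per drag-length bin.

    Each sample is (handle, target, achieved) in pixel coordinates.
    Bin k collects drags with length in [k * bin_width, (k + 1) * bin_width).
    """
    bins = defaultdict(list)
    for handle, target, achieved in samples:
        drag_length = math.dist(handle, target)
        error = math.dist(achieved, target)
        bins[int(drag_length // bin_width)].append(error)
    return {k: sum(v) / len(v) for k, v in sorted(bins.items())}
```

An error that grows sharply in the higher bins would point toward the correspondence step rather than the base model.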

Figures

Figures reproduced from arXiv: 2512.08477 by Guanbin Li, Huan Yang, Huiguo He, Lianwen Jin, Pengyu Yan, Weizhi Zhong, Yejun Tang, Zheng Liu, Ziqi Yi.

Figure 1: Illustration of our ContextDrag framework. The Drag-Guided Editing Framework injects noise-free reference tokens into the ...
Figure 2: Qualitative comparisons of drag editing between our ContextDrag and other SOTA methods. Our approach achieves the most ...
Figure 3: Qualitative comparisons of drag editing between our ...
Figure 4: Qualitative comparisons of ablation study. Our full ...
Figure 5: Failure cases of our ContextDrag and competing ap...
Figure 6: An illustration of the Concept Preservation (CP) Evaluation Instruction and the corresponding Summary & Planning returned by ...
Figure 7: An illustration of the Prompt Following (PF) Evaluation Instruction and the corresponding Summary & Planning returned by ...
Figure 8: Qualitative examples illustrating both Concept Preservation (CP) and Prompt Following (PF) for a given test case. Each row ...
Figure 9: Qualitative examples illustrating both Concept Preservation (CP) and Prompt Following (PF) for a given test case. Each row ...
Figure 10: Qualitative comparisons of editing results under different interpolation weights ...
Figure 11: Qualitative comparisons of drag editing between our ContextDrag and other SOTA methods. Our approach achieves the most ...
Figure 12: Qualitative comparisons of drag editing between our ContextDrag and other SOTA methods. Our approach achieves the most ...
Figure 13: Qualitative comparisons of drag editing between our warping strategy and other warping methods. Our warping approach ...
Figure 14: Qualitative comparisons of ablation study. Our full model accurately moves the object to the target destination while better ...
Figure 15: Additional qualitative results of our ContextDrag. These results demonstrate the effectiveness of our method. The target points ...
Original abstract

Drag-based image editing enables intuitive visual manipulation through point-based drag operations. Existing methods mainly rely on diffusion inversion or pixel-space warping with inpainting. However, inversion inherently introduces approximation errors that degrade texture fidelity, whereas rigid pixel-space operations discard semantic context and produce unnatural deformations. To address these issues, we introduce ContextDrag, to our knowledge the first framework that brings drag-based manipulation into the in-context image editing paradigm. By leveraging the in-context capabilities of editing models (e.g., FLUX-Kontext), ContextDrag enables precise drag editing without inversion or fine-tuning. Specifically, we first propose Context-preserving Token Injection (CTI), which injects VAE-encoded reference features into attention layers at spatially aligned target positions, guided by latent-space correspondences estimated directly from user-specified control points. By operating on clean, directly encoded features rather than noisy inversion outputs, CTI preserves rich texture details and enables precise drag control. Second, we propose Position-Aligned Attention (PAA) to eliminate interference caused by spatial displacement of reference features. PAA re-encodes positional embeddings of displaced reference tokens to match their target locations, and masks overlapping regions between source and destination to prevent conflicting features from degrading visual consistency. Experiments on DragBench-SR and DragBench-DR demonstrate that ContextDrag achieves SOTA editing accuracy and overall quality, and comprehensive ablations validate the effectiveness of each proposed component. Code will be publicly available.
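
The two operations attributed to PAA, re-encoding positions and masking overlaps, can be sketched on top of a source-to-target token map. This is one reading of the abstract rather than the authors' code: it assumes that a reference token sitting at a position that a displaced token is moved onto counts as "overlapping" and is hidden from attention. The function name and data layout are hypothetical.

```python
from typing import Dict, Tuple

Token = Tuple[int, int]  # (row, col) on the latent token grid


def position_aligned_inputs(
    grid_hw: Tuple[int, int],
    mapping: Dict[Token, Token],
) -> Tuple[Dict[Token, Token], Dict[Token, bool]]:
    """Return per-reference-token position ids and visibility flags.

    - Displaced tokens (keys of `mapping`) take the position id of
      their target, so positional embeddings match where they land.
    - Undisplaced tokens that sit on a destination are masked out,
      so two tokens never compete for the same position.
    """
    height, width = grid_hw
    destinations = set(mapping.values())
    pos_ids: Dict[Token, Token] = {}
    visible: Dict[Token, bool] = {}
    for r in range(height):
        for c in range(width):
            tok = (r, c)
            if tok in mapping:
                pos_ids[tok] = mapping[tok]
                visible[tok] = True
            else:
                pos_ids[tok] = tok
                visible[tok] = tok not in destinations
    return pos_ids, visible
```

In a transformer with rotary position embeddings, `pos_ids` would feed the embedding computation and `visible` would become an attention mask. How the vacated source region is treated is not covered by this sketch.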

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces ContextDrag, a drag-based image editing framework that operates in the in-context paradigm using models such as FLUX-Kontext. It proposes Context-preserving Token Injection (CTI) to inject clean VAE-encoded reference features into attention layers at target positions, guided by direct latent-space correspondences derived from user-specified control points, and Position-Aligned Attention (PAA) to re-encode positional embeddings of displaced tokens while masking source-destination overlaps. The work claims state-of-the-art editing accuracy and overall quality on DragBench-SR and DragBench-DR, with ablations confirming the contribution of each component, all without diffusion inversion or fine-tuning.

Significance. If the results hold, the approach offers a meaningful alternative to inversion-based and pixel-warping methods by preserving texture fidelity through direct feature injection. The explicit avoidance of inversion artifacts and the use of in-context capabilities without additional training are clear strengths that could influence future editing pipelines. However, the significance is tempered by the load-bearing nature of the unrefined correspondence estimation, which has not yet been shown to generalize reliably beyond the reported benchmarks.

major comments (2)
  1. [Method (CTI description)] The central mechanism in Context-preserving Token Injection relies on estimating latent-space correspondences directly from user control points to position injected VAE tokens. This assumption is load-bearing for the SOTA accuracy claim and the absence of misalignment artifacts, yet the manuscript provides no dedicated analysis or experiments testing robustness under rotation, scaling, non-rigid deformation, or partial occlusion—precisely the conditions where point-to-region mapping in VAE space is most likely to deviate from semantic correspondence.
  2. [Experiments] Experiments section: While SOTA results and ablations are reported on DragBench-SR and DragBench-DR, the evaluation protocol lacks error bars, statistical significance tests, or a detailed description of how metrics are computed across drag types. This omission prevents independent verification of the quantitative claims and weakens the cross-benchmark generalization argument.
minor comments (2)
  1. [Abstract] The abstract states the method is 'to our knowledge, the first' to bring drag editing into the in-context paradigm; a brief related-work paragraph explicitly contrasting against the closest prior in-context editing works would strengthen this positioning.
  2. [Overall] A schematic figure showing the flow from control points through CTI token placement and PAA masking would substantially improve readability of the technical contributions.
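
The robustness conditions named in major comment 1 can be probed by generating target points from handle points under a known transform. The helper below is a hypothetical sketch of that idea, covering only rotation and uniform scaling about a pivot. It is not taken from the paper or from the DragBench tooling.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def transform_points(
    points: Sequence[Point],
    angle_deg: float = 0.0,
    scale: float = 1.0,
    pivot: Point = (0.0, 0.0),
) -> List[Point]:
    """Rotate and uniformly scale points about a pivot.

    The outputs can serve as synthetic target points for the given
    handle points, so the ground-truth transform is known exactly.
    """
    angle = math.radians(angle_deg)
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    px, py = pivot
    out: List[Point] = []
    for x, y in points:
        dx, dy = (x - px) * scale, (y - py) * scale
        out.append((px + cos_a * dx - sin_a * dy,
                    py + sin_a * dx + cos_a * dy))
    return out
```

Sweeping `angle_deg` and `scale` would give a controlled series of drags on which misalignment could be measured. Non-rigid deformation and occlusion need a different setup.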

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have revised the manuscript to strengthen the presentation of our method and experiments.

Point-by-point responses
  1. Referee: [Method (CTI description)] The central mechanism in Context-preserving Token Injection relies on estimating latent-space correspondences directly from user control points to position injected VAE tokens. This assumption is load-bearing for the SOTA accuracy claim and the absence of misalignment artifacts, yet the manuscript provides no dedicated analysis or experiments testing robustness under rotation, scaling, non-rigid deformation, or partial occlusion—precisely the conditions where point-to-region mapping in VAE space is most likely to deviate from semantic correspondence.

    Authors: We thank the referee for this observation. While DragBench-SR and DragBench-DR contain diverse drag operations that implicitly test some displacement and deformation scenarios, we acknowledge the absence of targeted robustness experiments for rotation, scaling, non-rigid deformation, and partial occlusion. In the revised manuscript we will add a dedicated subsection with both qualitative examples and quantitative metrics on synthetically transformed control points to evaluate CTI behavior under these conditions, along with a discussion of remaining limitations.
    Revision: yes

  2. Referee: [Experiments] Experiments section: While SOTA results and ablations are reported on DragBench-SR and DragBench-DR, the evaluation protocol lacks error bars, statistical significance tests, or a detailed description of how metrics are computed across drag types. This omission prevents independent verification of the quantitative claims and weakens the cross-benchmark generalization argument.

    Authors: We agree that greater statistical rigor and transparency would improve reproducibility. In the revised Experiments section we will (1) report error bars as standard deviations over multiple random seeds, (2) include paired statistical significance tests against baselines, and (3) expand the protocol description to specify exactly how accuracy and quality metrics are aggregated across drag types and the two benchmarks.
    Revision: yes
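
The statistics promised in the second response are simple to compute once per-seed or per-sample scores exist. The sketch below shows one possible form, using only the Python standard library: a mean with sample standard deviation for error bars, and a paired t statistic for a method-versus-baseline comparison. The function names are illustrative, and the choice of a paired t-test is an assumption, since the rebuttal does not name a specific test.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_std(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation, e.g. over random seeds."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(
    method: Sequence[float],
    baseline: Sequence[float],
) -> float:
    """Paired t statistic for scores measured on the same samples.

    t = mean(d) / (stdev(d) / sqrt(n)), with d the per-sample
    differences. Compare against a t distribution with n - 1
    degrees of freedom to obtain a p-value.
    """
    if len(method) != len(baseline):
        raise ValueError("paired scores must have the same length")
    diffs = [m - b for m, b in zip(method, baseline)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
```

Pairing matters here because every method is evaluated on the same benchmark images, so per-image differences remove image-level difficulty from the comparison.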

Circularity Check

0 steps flagged

No significant circularity; new components and external benchmarks keep derivation self-contained

The paper introduces Context-preserving Token Injection (CTI) and Position-Aligned Attention (PAA) as novel mechanisms that inject VAE-encoded reference features at positions derived from user control points and re-encode positional embeddings to resolve displacement. These are presented as independent engineering contributions operating on clean encodings from existing models such as FLUX-Kontext. Performance claims rest on quantitative results and ablations conducted on the external DragBench-SR and DragBench-DR datasets rather than on any fitted parameter, self-referential equation, or load-bearing self-citation. No derivation step equates an output to its input by construction, and the method description contains no uniqueness theorems or ansatzes imported from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

Central claim rests on the assumption that in-context models can be directly extended via token injection and positional re-encoding without additional training or inversion.

axioms (1)
  • domain assumption: In-context capabilities of models such as FLUX-Kontext can be leveraged for precise drag editing via feature injection.
    Invoked when introducing the framework as operating without inversion or fine-tuning.
invented entities (2)
  • Context-preserving Token Injection (CTI) · no independent evidence
    purpose: Injects VAE-encoded reference features into attention layers at spatially aligned target positions, guided by latent correspondences.
    New technique proposed to preserve texture details.
  • Position-Aligned Attention (PAA) · no independent evidence
    purpose: Re-encodes positional embeddings of displaced tokens and masks overlapping regions to prevent feature conflicts.
    New technique proposed to maintain visual consistency.

This supplementary material provides additional details and extended experimental results that complement the main paper. Sec. A presents additional implementation deta...

  47. [47]

    The presence and consistency of main semantic objects

  48. [48]

    The preservation of spatial layouts and object relation- ships

  49. [49]

    The plausibility of shapes and geometry

  50. [50]

    The consistency of color and texture

  51. [51]

    Based on an integrated assessment of these aspects, it out- puts an integer CP score ranging from 0–4, where a higher score indicates better preservation of the original concept

    The global visual style and coherence. Based on an integrated assessment of these aspects, it out- puts an integer CP score ranging from 0–4, where a higher score indicates better preservation of the original concept. This CP metric enables us to systematically compare dif- ferent drag-editing methods in terms of how faithfully they maintain the source im...

  52. [52]

    Whether the correct semantic object specified in the prompt is identified and manipulated

  53. [53]

    Whether the object’s movement direction is consistent with the arrow direction and the semantic intent

  54. [54]

    Whether the displaced object moves toward and aligns with the specified target point. Based on an integrated assessment of these aspects, it out- puts an integer PF score ranging from 0–4, where a higher score indicates better adherence to the intended manipula- tion. We further observe thatexplicitly marking the target positions in the generated image is...

  55. [55]

    In the same Fig

    This example shows that PF effectively penalizes cases where the semantic manipulation is not executed, even when the visual appearance remains consistent. In the same Fig. 8 (second row), the car rotation example illustrates the complementary role of the CP metric. Here, the object roughly follows the intended motion but under- goes substantial distortio...

  56. [56]

    Semantic Objects Consistency: (1) Compare the main semantic objects in the reference and edited images. (2) Check if any objects are missing, added, or have undergone semantic changes

  57. [57]

    Semantic Structure & Relationships: (1) Assess the spatial positions, hierarchical relations, and logical interactions between objects in both images. (2) Ensure that relative positions, occlusions, poses, and interactions are consistent

  58. [58]

    Shape & Geometry: (1) Examine the shape, proportions, and pose of each semantic object. (2) Look for distortions, misalignments, or unrealistic deformations

  59. [59]

    Color & Texture: (1) Compare the tones, brightness, lighting direction, and texture style. (2) Identify any unnatural color shifts, blurriness, discontinuities, or editing artifacts

  60. [60]

    Global Style & Visual Coherence: (1) Determine the alignment of overall visual style, lighting, and fine details. (2) Ensure all semantic elements are naturally integrated and visually harmonious

  61. [61]

    Scoring: (1) Assign a score from 0 to 4 based on the overall degree of concept preservation, following the criteria provided. Execution: (1) I will systematically analyze each dimension for consistency between the reference and edited images, using the drag map for context. (2) I will provide a brief internal thinking summary of my observations for each ...

  62. [62]

    Semantic Objects Consistency: - The main object, a crocodile, is present in both the reference and edited images. - No objects are missing or added

  63. [63]

    Semantic Structure & Relationships: - The spatial position of the crocodile remains consistent. - The mouth area has been altered, with the teeth appearing more closed

  64. [64]

    Shape & Geometry: - The shape of the crocodile's snout and mouth has changed, with the teeth now appearing less prominent and more closed. - There is a slight distortion in the mouth area

  65. [65]

    Color & Texture: - The color and texture of the crocodile's skin are consistent with the reference image. - No unnatural color shifts or blurriness are observed

  66. [66]

    Global Style & Visual Coherence: - The overall style and lighting are consistent with the reference image. - The edited image maintains visual coherence. [Score] Concept Preservation (CP): 3. The edited image mostly preserves the semantic and visual characteristics of the reference image, with only minor differences in the shape and geometry of the crocodi...

  67. [67]

    Semantic Objects Consistency: - The main object, the car, is present in both the reference and edited images. - No new objects are added, but the car's details are significantly altered

  68. [68]

    Semantic Structure & Relationships: - The car's position on the street remains consistent. - The spatial relationship with the background elements (buildings, street) is maintained

  69. [69]

    Shape & Geometry: - The car's shape is distorted in the edited image, with unrealistic deformations. - Proportions and details like the wheels and windows are not well-preserved

  70. [70]

    Color & Texture: - The color of the car is mostly consistent, but the texture is blurred and lacks detail. - The lighting and shadows are not as defined as in the reference image

  71. [71]

    Global Style & Visual Coherence: - The overall style is somewhat maintained, but the car appears less integrated due to the loss of detail and texture. - The edited image lacks the sharpness and clarity of the reference. [Score] Concept Preservation (CP): 1. The edited image retains the basic presence and position of the car but fails to preserve its shape...

  72. [72]

    Semantic Objects Consistency: - The main object, a puppy, is present in both the reference and edited images. No objects are missing or added

  73. [73]

    Semantic Structure & Relationships: - The spatial position of the puppy remains consistent. However, the face has been altered, affecting the logical interaction of facial features

  74. [74]

    Shape & Geometry: - The shape of the puppy's face is distorted, particularly around the nose and eyes, due to the dragging operation. This results in unrealistic deformations

  75. [75]

    Color & Texture: - The colors and textures are mostly consistent with the reference image. There are no significant color shifts or texture issues

  76. [76]

    Global Style & Visual Coherence: - The overall style and lighting remain consistent. However, the distortion in the face affects visual coherence. [Score] Concept Preservation (CP): 2. The edited image retains most semantic objects and colors, but noticeable deviations in shape and geometry reduce the overall concept preservation. reasoning: [Internal Thinking]

  77. [77]

    Semantic Objects Consistency: - The main semantic object, the sheep, is present in both the reference and edited images. No objects are missing or added

  78. [78]

    Semantic Structure & Relationships: - The spatial position and orientation of the sheep remain consistent. The ears have been slightly adjusted upwards as indicated by the drag map, but the overall structure is maintained

  79. [79]

    Shape & Geometry: - The shape and proportions of the sheep, including its ears, are consistent with the reference image. The ears have been moved slightly upwards, but there are no distortions or unrealistic deformations

  80. [80]

    Color & Texture: - The colors and textures of the sheep's wool and the background are consistent with the reference image. There are no unnatural color shifts or blurriness

Showing first 80 references.
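
The sample evaluator outputs above end with a line of the form "[Score] Concept Preservation (CP): 3". A minimal sketch of how such responses might be reduced to a per-method average is given below. The function names `parse_cp_score` and `mean_cp`, the regular expression, and the choice to skip unparseable responses are assumptions made for illustration; they are not taken from the paper.

```python
import re
from statistics import mean
from typing import Iterable, Optional

# Assumed pattern for the score line seen in the sample outputs, e.g.
# "[Score] Concept Preservation (CP): 3". Whitespace after "[Score]" is optional.
CP_SCORE_RE = re.compile(r"\[Score\]\s*Concept Preservation \(CP\):\s*([0-4])\b")


def parse_cp_score(response: str) -> Optional[int]:
    """Return the integer CP score (0 to 4) found in one response, or None."""
    match = CP_SCORE_RE.search(response)
    return int(match.group(1)) if match else None


def mean_cp(responses: Iterable[str]) -> Optional[float]:
    """Average the CP scores over responses, skipping any without a score line."""
    scores = [s for s in map(parse_cp_score, responses) if s is not None]
    return mean(scores) if scores else None
```

Skipping responses without a score line is only one possible policy; counting them as failures or re-querying the evaluator would be equally reasonable, and the paper does not say which it uses.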