pith. machine review for the scientific record.

arxiv: 2604.23763 · v2 · submitted 2026-04-26 · 💻 cs.CV

Recognition: unknown

Edit Where You Mean: Region-Aware Adapter Injection for Mask-Free Local Image Editing

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 06:43 UTC · model grok-4.3

classification 💻 cs.CV
keywords local image editing · diffusion transformers · adapter injection · mask-free editing · region-aware adaptation · instruction-based editing · spatial gating · DiT editing

The pith

AdaptEdit retrofits frozen diffusion transformers for precise local edits by injecting instruction- and region-aware adapters that predict edit locations from text alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Diffusion transformers spread local edits globally because their joint-attention layers provide no explicit signal for where changes should apply. AdaptEdit adds a lightweight Block Adapter to every transformer block in a frozen backbone; the adapter carries a factorized condition that separates the edit instruction from its spatial region. A SpatialGate routes the adapter output only into the intended area, and a Region-Aware Loss trains the model to focus on pixels that actually change. A thin MaskPredictor head, trained jointly, grounds the region directly from the instruction and source image, removing any need for user masks at deployment. On MagicBrush and Emu-Edit Test the method outperforms both mask-free and oracle-mask baselines while preserving the rest of the image.
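The mechanism is easiest to see as a per-block gated residual injection. Below is a minimal PyTorch sketch of that idea, assuming token-aligned condition features and a per-token sigmoid gate; the class name mirrors the paper's terminology, but the shapes, rank, and gating rule are illustrative assumptions, not the released implementation.

```python
# Illustrative sketch, not the paper's code: shapes, rank, and the gating rule are assumptions.
import torch
import torch.nn as nn

class BlockAdapter(nn.Module):
    """Lightweight adapter attached to one frozen DiT block. It fuses the block's
    hidden states with a condition stream (instruction semantics + region tokens)
    and adds a gated residual, so the update lands only where the gate opens."""
    def __init__(self, dim: int, cond_dim: int, rank: int = 64):
        super().__init__()
        self.down = nn.Linear(dim + cond_dim, rank)
        self.up = nn.Linear(rank, dim)
        self.gate = nn.Linear(dim + cond_dim, 1)  # SpatialGate analogue: one logit per image token

    def forward(self, hidden: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # hidden: (B, N, dim) image tokens from the frozen block
        # cond:   (B, N, cond_dim) fused "what"/"where" condition, aligned to the same tokens
        x = torch.cat([hidden, cond], dim=-1)
        delta = self.up(torch.relu(self.down(x)))  # low-rank adapter signal
        g = torch.sigmoid(self.gate(x))            # ~1 inside the edit region, ~0 outside
        return hidden + g * delta                  # frozen features are only nudged where gated

# Only the adapters (and the MaskPredictor head) would be trained; the backbone stays frozen,
# e.g. for p in dit.parameters(): p.requires_grad_(False)  # `dit` is a hypothetical handle
```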

Core claim

AdaptEdit is a co-trained adapter framework that retrofits a frozen DiT into a mask-free local editor. A Block Adapter at every transformer block injects a structured condition stream that factorizes instruction semantics from spatial location, a learned SpatialGate routes that signal selectively into the edit region, and a Region-Aware Loss concentrates the training objective on the pixels that change. Because this makes the backbone mask-aware end-to-end, a jointly trained MaskPredictor can derive the edit region from the instruction and source image alone, without any external mask.
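A region-aware objective of this kind can be written as a masked weighting of an ordinary reconstruction loss. The sketch below, with illustrative weights and a squared-error surrogate, is an assumption about the shape of such a loss, not the paper's exact formulation.

```python
# Hedged sketch of a region-weighted objective; the weights and the squared-error
# surrogate are assumptions, not the paper's exact Region-Aware Loss.
import torch

def region_aware_loss(pred: torch.Tensor, target: torch.Tensor, mask: torch.Tensor,
                      w_in: float = 1.0, w_out: float = 0.1) -> torch.Tensor:
    """pred, target: (B, C, H, W); mask: (B, 1, H, W), 1 inside the edit region.
    Emphasizes the changing pixels while still penalizing drift outside the region."""
    err = (pred - target) ** 2
    inside = (err * mask).sum() / (mask.sum() * pred.size(1)).clamp(min=1)
    outside = (err * (1 - mask)).sum() / ((1 - mask).sum() * pred.size(1)).clamp(min=1)
    return w_in * inside + w_out * outside
```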

What carries the argument

Block Adapter with SpatialGate and MaskPredictor: a lightweight per-block module that injects a factorized condition stream separating what to edit from where to edit, with the gate controlling selective application and the predictor grounding the region directly from instruction and source.
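A "thin" mask head of this sort could look like the following: a per-token classifier over the image tokens, conditioned on a pooled instruction embedding. The depth, pooling, and token resolution here are assumptions for illustration, not the paper's architecture.

```python
# Minimal sketch of a thin mask-prediction head; depth, pooling, and token
# resolution are illustrative assumptions.
import torch
import torch.nn as nn

class MaskPredictor(nn.Module):
    """Predicts a per-token edit-region probability from image tokens and a pooled
    instruction embedding, so no user-drawn mask is needed at inference."""
    def __init__(self, dim: int, text_dim: int):
        super().__init__()
        self.proj_text = nn.Linear(text_dim, dim)
        self.head = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, 1))

    def forward(self, image_tokens: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # image_tokens: (B, N, dim); text_emb: (B, text_dim) pooled instruction embedding
        t = self.proj_text(text_emb).unsqueeze(1).expand(-1, image_tokens.size(1), -1)
        logits = self.head(torch.cat([image_tokens, t], dim=-1)).squeeze(-1)  # (B, N)
        return torch.sigmoid(logits)  # per-token probability of lying in the edit region
```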

If this is right

  • The frozen DiT backbone performs local edits without any weight modification.
  • Pixel-level preservation and edit accuracy both improve over prior mask-free and masked baselines.
  • No user-provided mask is required at inference because the MaskPredictor supplies the region.
  • Performance generalizes across nine diverse edit categories on Emu-Edit Test.
  • Ablations confirm that each component—adapter, gate, predictor, and region-aware loss—contributes to isolating the edit.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same per-block factorization of semantics and location could be tested on other attention-heavy generative architectures that exhibit global leakage.
  • Joint training of gate and predictor may allow the method to scale to instructions that describe multiple disjoint regions in one forward pass.
  • The region-aware loss could be combined with unpaired data to reduce reliance on paired ground-truth targets like those in MagicBrush.

Load-bearing premise

The jointly trained SpatialGate and MaskPredictor can reliably isolate edits and predict accurate regions from instructions without explicit masks, generalizing across the nine edit categories in Emu-Edit.
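One concrete way to probe this premise is to score the predicted regions against reference masks with IoU, broken down by edit category. The sketch below assumes binary reference masks and a 0.5 threshold; both are illustrative choices, not values the paper specifies.

```python
# Hedged sketch of the mask-accuracy check the premise implies; the threshold and
# per-category averaging are illustrative choices.
import torch

def mask_iou(pred_prob: torch.Tensor, gt_mask: torch.Tensor,
             thresh: float = 0.5, eps: float = 1e-6) -> torch.Tensor:
    """pred_prob: (B, 1, H, W) probabilities; gt_mask: (B, 1, H, W) binary {0, 1}.
    Returns per-example IoU; averaging these per edit category would test generalization."""
    pred = (pred_prob > thresh).float()
    inter = (pred * gt_mask).flatten(1).sum(dim=1)
    union = ((pred + gt_mask) > 0).float().flatten(1).sum(dim=1)
    return inter / (union + eps)
```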

What would settle it

Measuring whether edits remain confined to the region implied by the instruction on held-out images from Emu-Edit or MagicBrush, or whether leakage into unrelated areas occurs when the MaskPredictor is removed or the SpatialGate is ablated.
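The leakage test itself reduces to measuring how much the edited output drifts from the source outside the edit region. Assuming access to a reference or implied region mask, a simple outside-mask score like the one below would do; the L1 formulation is an assumed metric, not one the paper prescribes.

```python
# Hedged sketch of an outside-region leakage score; the L1 formulation is an
# assumed metric, not one the paper prescribes.
import torch

def outside_region_leakage(source: torch.Tensor, edited: torch.Tensor,
                           region_mask: torch.Tensor) -> torch.Tensor:
    """source, edited: (B, C, H, W) in [0, 1]; region_mask: (B, 1, H, W), 1 = edit region.
    Mean absolute change restricted to pixels the edit should not touch; higher = more leakage."""
    outside = 1.0 - region_mask
    diff = (edited - source).abs() * outside
    return diff.sum() / (outside.sum() * source.size(1)).clamp(min=1)
```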

Figures

Figures reproduced from arXiv: 2604.23763 by Haohua Chen, Honghao Cai, Runqi Wang, Tianze Zhou, Wei Zhu, Xiangyuan Wang, Xu Tang, Yao Hu, Yibo Chen, Yunhao Bai, Zhen Li.

Figure 1
Figure 1. Motivation. Left: a vanilla DiT leaks “remove hand sanitizer” into surrounding pixels. Right: Ours factorizes what (instruction) from where (mask), injects both via a per-block adapter modulated by a SpatialGate, and grounds the edit region automatically via a MaskPredictor, so no user mask is needed at deployment. view at source ↗
Figure 2
Figure 2. System overview. (a) Training: GT mask and VL hidden states are encoded into spatial and semantic tokens, fused, and injected via BlockAdapter + SpatialGate at every frozen DiT block; a MaskPredictor head is co-trained with a decoupled auxiliary loss. (b) Inference: the MaskPredictor produces the edit-region mask from the source image and instruction alone, with no user mask needed. view at source ↗
Figure 3
Figure 3. Mask robustness on MagicBrush dev. Each curve sweeps a perturbation family (erode/dilate/shift) applied to the GT mask before inference; dotted/dashed baselines mark the GT-mask oracle and the deployed MaskPredictor configuration. All three metrics degrade monotonically with no cliff, and the deployed predictor sits at roughly the boundary error of erode 16 px on L1. view at source ↗
Figure 4
Figure 4. MaskPredictor convergence (variant G). Each column: a checkpoint step (500 …). view at source ↗
Figure 5
Figure 5. Qualitative comparison on MagicBrush dev. Ours preserves unedited regions while applying … view at source ↗
read the original abstract

Large diffusion transformers (DiTs) follow global editing instructions well but consistently leak local edits into unrelated regions, because joint-attention architectures offer no explicit channel telling the network where to apply the edit. We introduce AdaptEdit, a co-trained, instruction- and region-aware adapter framework that retro-fits a frozen DiT into a precise local editor without modifying its backbone weights. A lightweight Block Adapter at every transformer block injects a structured condition stream that factorizes what to edit (instruction semantics) from where to edit (spatial mask); a learned SpatialGate routes the adapter signal selectively into the edit region while keeping the rest of the image near-identical to the source; and a Region-Aware Loss focuses the training objective on the changing pixels. Because these components make the backbone's internal representation mask-aware end-to-end, a thin MaskPredictor head trained jointly with the editor can ground the edit region directly from the instruction and source image -- eliminating any user-mask requirement at deployment. We evaluate on two complementary benchmarks: MagicBrush (paired ground-truth targets) to measure pixel-level preservation and edit accuracy, and Emu-Edit Test (no ground-truth images, 9 diverse edit categories) to stress-test instruction following and generalization across edit types. On both, AdaptEdit achieves state-of-the-art results, simultaneously outperforming mask-free and oracle-mask baselines. A seven-variant ablation cleanly isolates the contribution of each component.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces AdaptEdit, a co-trained adapter framework that retrofits frozen diffusion transformers (DiTs) for mask-free local image editing. It factorizes edits via Block Adapters (injecting instruction semantics and spatial conditions), a SpatialGate (routing signals to edit regions), a Region-Aware Loss (focusing on changing pixels), and a jointly trained MaskPredictor head (predicting regions from instruction + source image at inference). Evaluation on MagicBrush (paired targets) and Emu-Edit Test (9 edit categories, no GT) claims SOTA results outperforming both mask-free and oracle-mask baselines, with a seven-variant ablation isolating component contributions.

Significance. If the results hold, this would represent a meaningful advance in practical local editing for large DiTs by eliminating user masks while preserving backbone weights and achieving better fidelity than oracle-mask baselines. The dual-benchmark design (pixel-accurate paired data plus diverse category stress-testing) and explicit ablation are strengths that support reproducibility and component analysis. The approach of making internal representations mask-aware end-to-end via lightweight adapters addresses a real architectural gap in joint-attention DiTs.

major comments (3)
  1. [Abstract] The central SOTA claim (outperforming mask-free and oracle-mask baselines on both MagicBrush and Emu-Edit) is load-bearing for the contribution but is stated without any numeric metrics, delta values, or table references, preventing verification of the magnitude or consistency of gains across the nine Emu-Edit categories.
  2. [Abstract, and implied Evaluation section] The mask-free deployment relies on the jointly trained MaskPredictor + SpatialGate producing reliable regions directly from instructions; however, no per-category mask accuracy, IoU metrics, or ablation isolating MaskPredictor error propagation from adapter quality is reported, leaving the generalization assumption untested against the skeptic concern.
  3. [Abstract] The seven-variant ablation is described as cleanly isolating each component, but without details on which variant removes the MaskPredictor (or SpatialGate) and the resulting drop in mask-free performance, it is impossible to confirm that the Region-Aware Loss and adapter injection alone suffice for the reported gains over oracle-mask baselines.
minor comments (1)
  1. [Abstract] The abstract introduces several new entities (Block Adapter, SpatialGate, Region-Aware Loss, MaskPredictor) without a brief forward reference to their definitions or equations, which would aid readability for readers unfamiliar with adapter-based DiT modifications.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which highlights important aspects of clarity in our claims and evaluations. We appreciate the positive assessment of the work's significance for mask-free editing in DiTs. We address each major comment below and will make corresponding revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] The central SOTA claim (outperforming mask-free and oracle-mask baselines on both MagicBrush and Emu-Edit) is load-bearing for the contribution but is stated without any numeric metrics, delta values, or table references, preventing verification of the magnitude or consistency of gains across the nine Emu-Edit categories.

    Authors: We agree that the abstract would benefit from explicit quantitative support for the SOTA claims. In the revised version, we will add key numeric results, including specific deltas (e.g., improvements in edit accuracy and preservation metrics on MagicBrush, and average gains across Emu-Edit categories), along with direct references to the relevant tables in the evaluation section. This will make the magnitude and consistency of the gains verifiable directly from the abstract. revision: yes

  2. Referee: [Abstract, and implied Evaluation section] The mask-free deployment relies on the jointly trained MaskPredictor + SpatialGate producing reliable regions directly from instructions; however, no per-category mask accuracy, IoU metrics, or ablation isolating MaskPredictor error propagation from adapter quality is reported, leaving the generalization assumption untested against the skeptic concern.

    Authors: The current manuscript reports overall mask accuracy and IoU for the MaskPredictor in the evaluation section, but we acknowledge that per-category breakdowns on Emu-Edit and an explicit ablation isolating MaskPredictor error propagation would provide stronger evidence. We will add these elements in the revision: per-category IoU metrics and a dedicated ablation comparing mask-free performance with and without the MaskPredictor (to quantify error propagation effects separately from adapter quality). This directly addresses the generalization concern. revision: yes

  3. Referee: [Abstract] The seven-variant ablation is described as cleanly isolating each component, but without details on which variant removes the MaskPredictor (or SpatialGate) and the resulting drop in mask-free performance, it is impossible to confirm that the Region-Aware Loss and adapter injection alone suffice for the reported gains over oracle-mask baselines.

    Authors: The evaluation section details the seven variants and their results, including the specific variants that ablate the MaskPredictor and SpatialGate with corresponding performance drops in the mask-free setting. However, we agree the abstract should be more self-contained. We will revise it to briefly specify the relevant variants (e.g., the one removing MaskPredictor) and note the observed drops, while retaining the full analysis in the main text. This will clarify that the gains are not solely from Region-Aware Loss and adapters. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external benchmarks

full rationale

The paper describes a training procedure for AdaptEdit (Block Adapter + SpatialGate + Region-Aware Loss + jointly trained MaskPredictor) and reports quantitative results on two external benchmarks (MagicBrush with paired GT targets; Emu-Edit Test with 9 categories and no GT images). These benchmarks are independent of the model's fitted parameters and are not constructed from the same quantities the model predicts. No equations, uniqueness theorems, or self-citations are presented that would make any reported metric equivalent to its inputs by definition. The central claim (SOTA mask-free performance) is therefore an empirical observation rather than a tautology or self-referential fit.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 4 invented entities

The framework rests on the domain assumption that lightweight adapters can inject mask-aware conditioning into a frozen backbone and that co-training suffices to learn region prediction without explicit masks.

axioms (1)
  • domain assumption A frozen DiT backbone can be made mask-aware end-to-end by co-training lightweight adapters without modifying its weights.
    Stated as the core retrofit premise in the abstract.
invented entities (4)
  • Block Adapter no independent evidence
    purpose: Injects structured condition stream that factorizes instruction semantics from spatial mask at every transformer block.
    New module introduced to solve the leakage problem.
  • SpatialGate no independent evidence
    purpose: Learned routing mechanism that selectively applies adapter signal only inside the edit region.
    Invented component to keep unrelated areas unchanged.
  • Region-Aware Loss no independent evidence
    purpose: Training objective that focuses gradients on changing pixels.
    Custom loss introduced to improve edit precision.
  • MaskPredictor head no independent evidence
    purpose: Thin head that predicts edit region directly from instruction and source image.
    Enables mask-free deployment at inference.

pith-pipeline@v0.9.0 · 5585 in / 1309 out tokens · 50231 ms · 2026-05-08T06:43:10.312869+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

43 extracted references · 17 canonical work pages · 5 internal anchors

  1. Brooks, T., Holynski, A., & Efros, A. A. (2023). InstructPix2Pix: Learning to follow image editing instructions. CVPR.
  2. Zhang, K., Mo, L., Chen, W., Sun, H., & Su, Y. (2023). MagicBrush: A manually annotated dataset for instruction-guided image editing. NeurIPS.
  3. Fu, T.-J., Hu, W., Du, X., Wang, W. Y., Yang, Y., & Gan, Z. (2024). Guiding instruction-based image editing via multimodal large language models. ICLR.
  4. Hui, M., Yang, S., Zhao, B. et al. (2024). HQ-Edit: A high-quality dataset for instruction-based image editing. arXiv:2404.09990.
  5. Zhao, H., Ma, X. et al. (2024). UltraEdit: Instruction-based fine-grained image editing at scale. NeurIPS.
  6. Ge, Y. et al. (2024). SEED-Data-Edit Technical Report: A hybrid dataset for instructional image editing. arXiv.
  7. Sheynin, S., Polyak, A., Singer, U. et al. (2024). Emu Edit: Precise image editing via recognition and generation tasks. CVPR.
  8. Yu, Q. et al. (2024). AnyEdit: Mastering unified high-quality image editing for any idea. arXiv.
  9. Huang, Y., Xie, L. et al. (2024). SmartEdit: Exploring complex instruction-based image editing with multimodal LLMs. CVPR.
  10. Qwen Team. (2025). Qwen-Image-Edit Technical Report. arXiv.
  11. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. CVPR.
  12. Esser, P., Kulal, S., Blattmann, A. et al. (2024). Scaling rectified flow transformers for high-resolution image synthesis. ICML.
  13. Ju, X., Liu, X. et al. (2024). BrushNet: A plug-and-play image inpainting model with decomposed dual-branch diffusion. ECCV.
  14. Nichol, A. et al. (2022). GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. ICML.
  15. Avrahami, O., Fried, O., & Lischinski, D. (2023). Blended latent diffusion. SIGGRAPH.
  16. Hu, E. J. et al. (2022). LoRA: Low-rank adaptation of large language models. ICLR.
  17. Ye, H. et al. (2023). IP-Adapter: Text-compatible image prompt adapter for text-to-image diffusion models. arXiv.
  18. Zhang, L., Rao, A., & Agrawala, M. (2023). Adding conditional control to text-to-image diffusion models. ICCV.
  19. Mou, C. et al. (2024). T2I-Adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. AAAI.
  20. Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., & Cohen-Or, D. (2023). Prompt-to-prompt image editing with cross-attention control. ICLR.
  21. Tumanyan, N., Geyer, M., Bagon, S., & Dekel, T. (2023). Plug-and-play diffusion features for text-driven image-to-image translation. CVPR.
  22. Parmar, G. et al. (2023). Zero-shot image-to-image translation. SIGGRAPH.
  23. Jaegle, A. et al. (2021). Perceiver: General perception with iterative attention. ICML.
  24. Perez, E., Strub, F., De Vries, H., Dumoulin, V., & Courville, A. (2018). FiLM: Visual reasoning with a general conditioning layer. AAAI.
  25. Zhang, R., Isola, P., Efros, A. A., Shechtman, E., & Wang, O. (2018). The unreasonable effectiveness of deep features as a perceptual metric. CVPR.
  26. Hessel, J., Holtzman, A., Forbes, M., Bras, R. L., & Choi, Y. (2021). CLIPScore: A reference-free evaluation metric for image captioning. EMNLP.
  27. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., & Hochreiter, S. (2017). GANs trained by a two time-scale update rule converge to a local Nash equilibrium. NeurIPS.
  28. Deng, C., Liu, X., Zhao, X. et al. (2025). BAGEL: Emerging Properties in Unified Multimodal Pretraining. arXiv:2505.14683.
  29. Wu, S., Liu, T., Liu, Y. et al. (2025). OmniGen2: Exploration to Advanced Multimodal Generation. arXiv:2506.18871.
  30. Black Forest Labs, Batifol, S., Blattmann, A. et al. (2025). FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space. arXiv:2506.15742.
  31. Liu, S., Zhao, J., Hou, S. et al. (2025). Step1X-Edit: A Practical Framework for General Image Editing. arXiv:2504.17761.
  32. Jiang, Z., Sun, Z., Zeng, X. et al. (2026). GEditBench v2: A Human-Aligned Benchmark for General Image Editing. arXiv:2603.28547.
  33. Alekseenko, G. et al. (2026). VIBE: Visual Instruction Based Editor. arXiv:2601.02242.
  34. Shi, Y., Chen, B., Zhang, T. et al. (2025). Factuality Matters: When Image Generation and Editing Meet Structured Visuals. arXiv:2510.05091.
  35. Zhang, Z., Pan, X., Su, Y. et al. (2025). In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer. arXiv:2504.20690; NeurIPS 2025.
  36. Lin, B., Li, Y., Cheng, X. et al. (2025). UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation. arXiv:2506.03147.
  37. Step1X-Image Team, StepFun. (2025). ReasonEdit: Towards Reasoning-Enhanced Image Editing Models. arXiv:2511.22625.
  38. Han, Z., Jiang, Z., Pan, Y. et al. (2025). ACE: All-round Creator and Editor Following Instructions via Diffusion Transformer. arXiv:2410.00086; ICLR 2025.
  39. Lin, W., Wei, X., An, R. et al. (2024). PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions. arXiv:2409.15278.
  40. Chen, X., Zhang, Z., Zhang, X. et al. (2024). UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics. arXiv:2412.07774.
  41. Mao, Q., Liu, L., Liu, W. et al. (2025). Visual Autoregressive Modeling for Instruction-Guided Image Editing. arXiv:2508.15772.
  42. Chow, W. et al. (2025). EditMGT: Unleashing Potentials of Masked Generative Transformers in Image Editing. arXiv:2512.11715.
  43. Fang, R., Duan, C., Wang, K. et al. (2025). GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing. arXiv:2503.10639.