pith. machine review for the scientific record.

arxiv: 2604.23763 · v2 · submitted 2026-04-26 · 💻 cs.CV

Recognition: unknown

Edit Where You Mean: Region-Aware Adapter Injection for Mask-Free Local Image Editing

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 06:43 UTC · model grok-4.3

classification 💻 cs.CV
keywords local image editing · diffusion transformers · adapter injection · mask-free editing · region-aware adaptation · instruction-based editing · spatial gating · DiT editing

The pith

AdaptEdit retrofits frozen diffusion transformers for precise local edits by injecting instruction- and region-aware adapters that predict edit locations from text alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Diffusion transformers spread local edits globally because their joint-attention layers provide no explicit signal for where changes should apply. AdaptEdit adds a lightweight Block Adapter to every transformer block in a frozen backbone; the adapter carries a factorized condition that separates the edit instruction from its spatial region. A SpatialGate routes the adapter output only into the intended area, and a Region-Aware Loss trains the model to focus on pixels that actually change. A thin MaskPredictor head, trained jointly, grounds the region directly from the instruction and source image, removing any need for user masks at deployment. On MagicBrush and Emu-Edit Test the method outperforms both mask-free and oracle-mask baselines while preserving the rest of the image.
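The mechanism is easiest to see as a per-block gated residual injection. Below is a minimal PyTorch sketch of that idea, assuming token-aligned condition features and a per-token sigmoid gate; the class name mirrors the paper's terminology, but the shapes, rank, and gating rule are illustrative assumptions, not the released implementation.

```python
# Illustrative sketch, not the paper's code: shapes, rank, and the gating rule are assumptions.
import torch
import torch.nn as nn

class BlockAdapter(nn.Module):
    """Lightweight adapter attached to one frozen DiT block. It fuses the block's
    hidden states with a condition stream (instruction semantics + region tokens)
    and adds a gated residual, so the update lands only where the gate opens."""
    def __init__(self, dim: int, cond_dim: int, rank: int = 64):
        super().__init__()
        self.down = nn.Linear(dim + cond_dim, rank)
        self.up = nn.Linear(rank, dim)
        self.gate = nn.Linear(dim + cond_dim, 1)  # SpatialGate analogue: one logit per image token

    def forward(self, hidden: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # hidden: (B, N, dim) image tokens from the frozen block
        # cond:   (B, N, cond_dim) fused "what"/"where" condition, aligned to the same tokens
        x = torch.cat([hidden, cond], dim=-1)
        delta = self.up(torch.relu(self.down(x)))  # low-rank adapter signal
        g = torch.sigmoid(self.gate(x))            # ~1 inside the edit region, ~0 outside
        return hidden + g * delta                  # frozen features are only nudged where gated

# Only the adapters (and the MaskPredictor head) would be trained; the backbone stays frozen,
# e.g. for p in dit.parameters(): p.requires_grad_(False)  # `dit` is a hypothetical handle
```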

Core claim

AdaptEdit is a co-trained adapter framework that retrofits a frozen DiT into a mask-free local editor. A Block Adapter at every transformer block injects a structured condition stream that factorizes instruction semantics from spatial location, a learned SpatialGate routes that signal selectively into the edit region, and a Region-Aware Loss concentrates the training objective on the pixels that change. Because this makes the backbone mask-aware end-to-end, a jointly trained MaskPredictor can derive the edit region from the instruction and source image alone, without any external mask.
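A region-aware objective of this kind can be written as a masked weighting of an ordinary reconstruction loss. The sketch below, with illustrative weights and a squared-error surrogate, is an assumption about the shape of such a loss, not the paper's exact formulation.

```python
# Hedged sketch of a region-weighted objective; the weights and the squared-error
# surrogate are assumptions, not the paper's exact Region-Aware Loss.
import torch

def region_aware_loss(pred: torch.Tensor, target: torch.Tensor, mask: torch.Tensor,
                      w_in: float = 1.0, w_out: float = 0.1) -> torch.Tensor:
    """pred, target: (B, C, H, W); mask: (B, 1, H, W), 1 inside the edit region.
    Emphasizes the changing pixels while still penalizing drift outside the region."""
    err = (pred - target) ** 2
    inside = (err * mask).sum() / (mask.sum() * pred.size(1)).clamp(min=1)
    outside = (err * (1 - mask)).sum() / ((1 - mask).sum() * pred.size(1)).clamp(min=1)
    return w_in * inside + w_out * outside
```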

What carries the argument

Block Adapter with SpatialGate and MaskPredictor: a lightweight per-block module that injects a factorized condition stream separating what to edit from where to edit, with the gate controlling selective application and the predictor grounding the region directly from instruction and source.
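A "thin" mask head of this sort could look like the following: a per-token classifier over the image tokens, conditioned on a pooled instruction embedding. The depth, pooling, and token resolution here are assumptions for illustration, not the paper's architecture.

```python
# Minimal sketch of a thin mask-prediction head; depth, pooling, and token
# resolution are illustrative assumptions.
import torch
import torch.nn as nn

class MaskPredictor(nn.Module):
    """Predicts a per-token edit-region probability from image tokens and a pooled
    instruction embedding, so no user-drawn mask is needed at inference."""
    def __init__(self, dim: int, text_dim: int):
        super().__init__()
        self.proj_text = nn.Linear(text_dim, dim)
        self.head = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, 1))

    def forward(self, image_tokens: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # image_tokens: (B, N, dim); text_emb: (B, text_dim) pooled instruction embedding
        t = self.proj_text(text_emb).unsqueeze(1).expand(-1, image_tokens.size(1), -1)
        logits = self.head(torch.cat([image_tokens, t], dim=-1)).squeeze(-1)  # (B, N)
        return torch.sigmoid(logits)  # per-token probability of lying in the edit region
```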

If this is right

  • The frozen DiT backbone performs local edits without any weight modification.
  • Pixel-level preservation and edit accuracy both improve over prior mask-free and masked baselines.
  • No user-provided mask is required at inference because the MaskPredictor supplies the region.
  • Performance generalizes across nine diverse edit categories on Emu-Edit Test.
  • Ablations confirm that each component—adapter, gate, predictor, and region-aware loss—contributes to isolating the edit.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same per-block factorization of semantics and location could be tested on other attention-heavy generative architectures that exhibit global leakage.
  • Joint training of gate and predictor may allow the method to scale to instructions that describe multiple disjoint regions in one forward pass.
  • The region-aware loss could be combined with unpaired data to reduce reliance on paired ground-truth targets like those in MagicBrush.

Load-bearing premise

The jointly trained SpatialGate and MaskPredictor can reliably isolate edits and predict accurate regions from instructions without explicit masks, generalizing across the nine edit categories in Emu-Edit.
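One concrete way to probe this premise is to score the predicted regions against reference masks with IoU, broken down by edit category. The sketch below assumes binary reference masks and a 0.5 threshold; both are illustrative choices, not values the paper specifies.

```python
# Hedged sketch of the mask-accuracy check the premise implies; the threshold and
# per-category averaging are illustrative choices.
import torch

def mask_iou(pred_prob: torch.Tensor, gt_mask: torch.Tensor,
             thresh: float = 0.5, eps: float = 1e-6) -> torch.Tensor:
    """pred_prob: (B, 1, H, W) probabilities; gt_mask: (B, 1, H, W) binary {0, 1}.
    Returns per-example IoU; averaging these per edit category would test generalization."""
    pred = (pred_prob > thresh).float()
    inter = (pred * gt_mask).flatten(1).sum(dim=1)
    union = ((pred + gt_mask) > 0).float().flatten(1).sum(dim=1)
    return inter / (union + eps)
```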

What would settle it

Measuring whether edits remain confined to the region implied by the instruction on held-out images from Emu-Edit or MagicBrush, or whether leakage into unrelated areas occurs when the MaskPredictor is removed or the SpatialGate is ablated.
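The leakage test itself reduces to measuring how much the edited output drifts from the source outside the edit region. Assuming access to a reference or implied region mask, a simple outside-mask score like the one below would do; the L1 formulation is an assumed metric, not one the paper prescribes.

```python
# Hedged sketch of an outside-region leakage score; the L1 formulation is an
# assumed metric, not one the paper prescribes.
import torch

def outside_region_leakage(source: torch.Tensor, edited: torch.Tensor,
                           region_mask: torch.Tensor) -> torch.Tensor:
    """source, edited: (B, C, H, W) in [0, 1]; region_mask: (B, 1, H, W), 1 = edit region.
    Mean absolute change restricted to pixels the edit should not touch; higher = more leakage."""
    outside = 1.0 - region_mask
    diff = (edited - source).abs() * outside
    return diff.sum() / (outside.sum() * source.size(1)).clamp(min=1)
```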

Figures

Figures reproduced from arXiv: 2604.23763 by Haohua Chen, Honghao Cai, Runqi Wang, Tianze Zhou, Wei Zhu, Xiangyuan Wang, Xu Tang, Yao Hu, Yibo Chen, Yunhao Bai, Zhen Li.

Figure 1
Figure 1. Motivation. Left: a vanilla DiT leaks “remove hand sanitizer” into surrounding pixels. Right: Ours factorizes what (instruction) from where (mask), injects both via a per-block adapter modulated by a SpatialGate, and grounds the edit region automatically via a MaskPredictor, so no user mask is needed at deployment. view at source ↗
Figure 2
Figure 2. System overview. (a) Training: GT mask and VL hidden states are encoded into spatial and semantic tokens, fused, and injected via BlockAdapter + SpatialGate at every frozen DiT block; a MaskPredictor head is co-trained with a decoupled auxiliary loss. (b) Inference: the MaskPredictor produces the edit-region mask from the source image and instruction alone, with no user mask needed. view at source ↗
Figure 3
Figure 3. Mask robustness on MagicBrush dev. Each curve sweeps a perturbation family (erode/dilate/shift) applied to the GT mask before inference; dotted/dashed baselines mark the GT-mask oracle and the deployed MaskPredictor configuration. All three metrics degrade monotonically with no cliff, and the deployed predictor sits at roughly the boundary error of erode 16 px on L1. view at source ↗
Figure 4
Figure 4. MaskPredictor convergence (variant G). Each column: a checkpoint step (500 …). view at source ↗
Figure 5
Figure 5. Qualitative comparison on MagicBrush dev. Ours preserves unedited regions while applying … view at source ↗
read the original abstract

Large diffusion transformers (DiTs) follow global editing instructions well but consistently leak local edits into unrelated regions, because joint-attention architectures offer no explicit channel telling the network where to apply the edit. We introduce AdaptEdit, a co-trained, instruction- and region-aware adapter framework that retro-fits a frozen DiT into a precise local editor without modifying its backbone weights. A lightweight Block Adapter at every transformer block injects a structured condition stream that factorizes what to edit (instruction semantics) from where to edit (spatial mask); a learned SpatialGate routes the adapter signal selectively into the edit region while keeping the rest of the image near-identical to the source; and a Region-Aware Loss focuses the training objective on the changing pixels. Because these components make the backbone's internal representation mask-aware end-to-end, a thin MaskPredictor head trained jointly with the editor can ground the edit region directly from the instruction and source image -- eliminating any user-mask requirement at deployment. We evaluate on two complementary benchmarks: MagicBrush (paired ground-truth targets) to measure pixel-level preservation and edit accuracy, and Emu-Edit Test (no ground-truth images, 9 diverse edit categories) to stress-test instruction following and generalization across edit types. On both, AdaptEdit achieves state-of-the-art results, simultaneously outperforming mask-free and oracle-mask baselines. A seven-variant ablation cleanly isolates the contribution of each component.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces AdaptEdit, a co-trained adapter framework that retrofits frozen diffusion transformers (DiTs) for mask-free local image editing. It factorizes edits via Block Adapters (injecting instruction semantics and spatial conditions), a SpatialGate (routing signals to edit regions), a Region-Aware Loss (focusing on changing pixels), and a jointly trained MaskPredictor head (predicting regions from instruction + source image at inference). Evaluation on MagicBrush (paired targets) and Emu-Edit Test (9 edit categories, no GT) claims SOTA results outperforming both mask-free and oracle-mask baselines, with a seven-variant ablation isolating component contributions.

Significance. If the results hold, this would represent a meaningful advance in practical local editing for large DiTs by eliminating user masks while preserving backbone weights and achieving better fidelity than oracle-mask baselines. The dual-benchmark design (pixel-accurate paired data plus diverse category stress-testing) and explicit ablation are strengths that support reproducibility and component analysis. The approach of making internal representations mask-aware end-to-end via lightweight adapters addresses a real architectural gap in joint-attention DiTs.

major comments (3)
  1. [Abstract] The central SOTA claim (outperforming mask-free and oracle-mask baselines on both MagicBrush and Emu-Edit) is load-bearing for the contribution but is stated without any numeric metrics, delta values, or table references, preventing verification of the magnitude or consistency of gains across the nine Emu-Edit categories.
  2. [Abstract, and implied Evaluation section] The mask-free deployment relies on the jointly trained MaskPredictor + SpatialGate producing reliable regions directly from instructions; however, no per-category mask accuracy, IoU metrics, or ablation isolating MaskPredictor error propagation from adapter quality is reported, leaving the generalization assumption untested against the skeptic concern.
  3. [Abstract] The seven-variant ablation is described as cleanly isolating each component, but without details on which variant removes the MaskPredictor (or SpatialGate) and the resulting drop in mask-free performance, it is impossible to confirm that the Region-Aware Loss and adapter injection alone suffice for the reported gains over oracle-mask baselines.
minor comments (1)
  1. [Abstract] The abstract introduces several new entities (Block Adapter, SpatialGate, Region-Aware Loss, MaskPredictor) without a brief forward reference to their definitions or equations, which would aid readability for readers unfamiliar with adapter-based DiT modifications.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which highlights important aspects of clarity in our claims and evaluations. We appreciate the positive assessment of the work's significance for mask-free editing in DiTs. We address each major comment below and will make corresponding revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] The central SOTA claim (outperforming mask-free and oracle-mask baselines on both MagicBrush and Emu-Edit) is load-bearing for the contribution but is stated without any numeric metrics, delta values, or table references, preventing verification of the magnitude or consistency of gains across the nine Emu-Edit categories.

    Authors: We agree that the abstract would benefit from explicit quantitative support for the SOTA claims. In the revised version, we will add key numeric results, including specific deltas (e.g., improvements in edit accuracy and preservation metrics on MagicBrush, and average gains across Emu-Edit categories), along with direct references to the relevant tables in the evaluation section. This will make the magnitude and consistency of the gains verifiable directly from the abstract. revision: yes

  2. Referee: [Abstract, and implied Evaluation section] The mask-free deployment relies on the jointly trained MaskPredictor + SpatialGate producing reliable regions directly from instructions; however, no per-category mask accuracy, IoU metrics, or ablation isolating MaskPredictor error propagation from adapter quality is reported, leaving the generalization assumption untested against the skeptic concern.

    Authors: The current manuscript reports overall mask accuracy and IoU for the MaskPredictor in the evaluation section, but we acknowledge that per-category breakdowns on Emu-Edit and an explicit ablation isolating MaskPredictor error propagation would provide stronger evidence. We will add these elements in the revision: per-category IoU metrics and a dedicated ablation comparing mask-free performance with and without the MaskPredictor (to quantify error propagation effects separately from adapter quality). This directly addresses the generalization concern. revision: yes

  3. Referee: [Abstract] The seven-variant ablation is described as cleanly isolating each component, but without details on which variant removes the MaskPredictor (or SpatialGate) and the resulting drop in mask-free performance, it is impossible to confirm that the Region-Aware Loss and adapter injection alone suffice for the reported gains over oracle-mask baselines.

    Authors: The evaluation section details the seven variants and their results, including the specific variants that ablate the MaskPredictor and SpatialGate with corresponding performance drops in the mask-free setting. However, we agree the abstract should be more self-contained. We will revise it to briefly specify the relevant variants (e.g., the one removing MaskPredictor) and note the observed drops, while retaining the full analysis in the main text. This will clarify that the gains are not solely from Region-Aware Loss and adapters. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external benchmarks

full rationale

The paper describes a training procedure for AdaptEdit (Block Adapter + SpatialGate + Region-Aware Loss + jointly trained MaskPredictor) and reports quantitative results on two external benchmarks (MagicBrush with paired GT targets; Emu-Edit Test with 9 categories and no GT images). These benchmarks are independent of the model's fitted parameters and are not constructed from the same quantities the model predicts. No equations, uniqueness theorems, or self-citations are presented that would make any reported metric equivalent to its inputs by definition. The central claim (SOTA mask-free performance) is therefore an empirical observation rather than a tautology or self-referential fit.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 4 invented entities

The framework rests on the domain assumption that lightweight adapters can inject mask-aware conditioning into a frozen backbone and that co-training suffices to learn region prediction without explicit masks.

axioms (1)
  • domain assumption A frozen DiT backbone can be made mask-aware end-to-end by co-training lightweight adapters without modifying its weights.
    Stated as the core retrofit premise in the abstract.
invented entities (4)
  • Block Adapter no independent evidence
    purpose: Injects structured condition stream that factorizes instruction semantics from spatial mask at every transformer block.
    New module introduced to solve the leakage problem.
  • SpatialGate no independent evidence
    purpose: Learned routing mechanism that selectively applies adapter signal only inside the edit region.
    Invented component to keep unrelated areas unchanged.
  • Region-Aware Loss no independent evidence
    purpose: Training objective that focuses gradients on changing pixels.
    Custom loss introduced to improve edit precision.
  • MaskPredictor head no independent evidence
    purpose: Thin head that predicts edit region directly from instruction and source image.
    Enables mask-free deployment at inference.

pith-pipeline@v0.9.0 · 5585 in / 1309 out tokens · 50231 ms · 2026-05-08T06:43:10.312869+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

43 extracted references · 17 canonical work pages · 5 internal anchors

  1. Brooks, T., Holynski, A., & Efros, A. A. (2023). InstructPix2Pix: Learning to follow image editing instructions. CVPR.
  2. Zhang, K., Mo, L., Chen, W., Sun, H., & Su, Y. (2023). MagicBrush: A manually annotated dataset for instruction-guided image editing. NeurIPS.
  3. Fu, T.-J., Hu, W., Du, X., Wang, W. Y., Yang, Y., & Gan, Z. (2024). Guiding instruction-based image editing via multimodal large language models. ICLR.
  4. Hui, M., Yang, S., Zhao, B. et al. (2024). HQ-Edit: A high-quality dataset for instruction-based image editing. arXiv:2404.09990.
  5. Zhao, H., Ma, X. et al. (2024). UltraEdit: Instruction-based fine-grained image editing at scale. NeurIPS.
  6. Ge, Y. et al. (2024). SEED-Data-Edit Technical Report: A hybrid dataset for instructional image editing. arXiv.
  7. Sheynin, S., Polyak, A., Singer, U. et al. (2024). Emu Edit: Precise image editing via recognition and generation tasks. CVPR.
  8. Yu, Q. et al. (2024). AnyEdit: Mastering unified high-quality image editing for any idea. arXiv.
  9. Huang, Y., Xie, L. et al. (2024). SmartEdit: Exploring complex instruction-based image editing with multimodal LLMs. CVPR.
  10. Qwen Team. (2025). Qwen-Image-Edit Technical Report. arXiv.
  11. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. CVPR.
  12. Esser, P., Kulal, S., Blattmann, A. et al. (2024). Scaling rectified flow transformers for high-resolution image synthesis. ICML.
  13. Ju, X., Liu, X. et al. (2024). BrushNet: A plug-and-play image inpainting model with decomposed dual-branch diffusion. ECCV.
  14. Nichol, A. et al. (2022). GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. ICML.
  15. Avrahami, O., Fried, O., & Lischinski, D. (2023). Blended latent diffusion. SIGGRAPH.
  16. Hu, E. J. et al. (2022). LoRA: Low-rank adaptation of large language models. ICLR.
  17. Ye, H. et al. (2023). IP-Adapter: Text-compatible image prompt adapter for text-to-image diffusion models. arXiv.
  18. Zhang, L., Rao, A., & Agrawala, M. (2023). Adding conditional control to text-to-image diffusion models. ICCV.
  19. Mou, C. et al. (2024). T2I-Adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. AAAI.
  20. Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., & Cohen-Or, D. (2023). Prompt-to-prompt image editing with cross-attention control. ICLR.
  21. Tumanyan, N., Geyer, M., Bagon, S., & Dekel, T. (2023). Plug-and-play diffusion features for text-driven image-to-image translation. CVPR.
  22. Parmar, G. et al. (2023). Zero-shot image-to-image translation. SIGGRAPH.
  23. Jaegle, A. et al. (2021). Perceiver: General perception with iterative attention. ICML.
  24. Perez, E., Strub, F., De Vries, H., Dumoulin, V., & Courville, A. (2018). FiLM: Visual reasoning with a general conditioning layer. AAAI.
  25. Zhang, R., Isola, P., Efros, A. A., Shechtman, E., & Wang, O. (2018). The unreasonable effectiveness of deep features as a perceptual metric. CVPR.
  26. Hessel, J., Holtzman, A., Forbes, M., Bras, R. L., & Choi, Y. (2021). CLIPScore: A reference-free evaluation metric for image captioning. EMNLP.
  27. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., & Hochreiter, S. (2017). GANs trained by a two time-scale update rule converge to a local Nash equilibrium. NeurIPS.
  28. Deng, C., Liu, X., Zhao, X. et al. (2025). BAGEL: Emerging Properties in Unified Multimodal Pretraining. arXiv:2505.14683.
  29. Wu, S., Liu, T., Liu, Y. et al. (2025). OmniGen2: Exploration to Advanced Multimodal Generation. arXiv:2506.18871.
  30. Black Forest Labs, Batifol, S., Blattmann, A. et al. (2025). FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space. arXiv:2506.15742.
  31. Liu, S., Zhao, J., Hou, S. et al. (2025). Step1X-Edit: A Practical Framework for General Image Editing. arXiv:2504.17761.
  32. Jiang, Z., Sun, Z., Zeng, X. et al. (2026). GEditBench v2: A Human-Aligned Benchmark for General Image Editing. arXiv:2603.28547.
  33. Alekseenko, G. et al. (2026). VIBE: Visual Instruction Based Editor. arXiv:2601.02242.
  34. Shi, Y., Chen, B., Zhang, T. et al. (2025). Factuality Matters: When Image Generation and Editing Meet Structured Visuals. arXiv:2510.05091.
  35. Zhang, Z., Pan, X., Su, Y. et al. (2025). In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer. arXiv:2504.20690; NeurIPS 2025.
  36. Lin, B., Li, Y., Cheng, X. et al. (2025). UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation. arXiv:2506.03147.
  37. Step1X-Image Team, StepFun. (2025). ReasonEdit: Towards Reasoning-Enhanced Image Editing Models. arXiv:2511.22625.
  38. Han, Z., Jiang, Z., Pan, Y. et al. (2025). ACE: All-round Creator and Editor Following Instructions via Diffusion Transformer. arXiv:2410.00086; ICLR 2025.
  39. Lin, W., Wei, X., An, R. et al. (2024). PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions. arXiv:2409.15278.
  40. Chen, X., Zhang, Z., Zhang, X. et al. (2024). UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics. arXiv:2412.07774.
  41. Mao, Q., Liu, L., Liu, W. et al. (2025). Visual Autoregressive Modeling for Instruction-Guided Image Editing. arXiv:2508.15772.
  42. Chow, W. et al. (2025). EditMGT: Unleashing Potentials of Masked Generative Transformers in Image Editing. arXiv:2512.11715.
  43. Fang, R., Duan, C., Wang, K. et al. (2025). GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing. arXiv:2503.10639.