
arxiv: 2605.01480 · v1 · submitted 2026-05-02 · 💻 cs.CV

AttnRouter: Per-Category Attention Routing for Training-Free Image Editing on MMDiT

Pith reviewed 2026-05-09 14:36 UTC · model grok-4.3

classification 💻 cs.CV
keywords image editing · MMDiT · attention routing · training-free editing · diffusion transformer · KV injection · AttnRouter · category-based selection

The pith

A per-category router for attention operations raises the CLIP-T+DINO-I composite score of training-free image editing on MMDiT by 6.4% over the editing baseline.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that training-free edits on a 60-block MMDiT model succeed when the right attention manipulation is chosen for each edit type. It first presents KVInject, a single-pass method that blends source-image key and value projections into the noise stream within a chosen layer and step band. This avoids the prompt-mismatch failure of prior two-pass methods. AttnRouter then builds a lookup table that sends each semantic category to its best-performing operation, raising the CLIP-T plus DINO-I composite score 6.4% above baseline when categories are known. An automatic zero-shot classifier recovers 98% of that gain even with only 55% category accuracy. Ablations localize the useful signal to early denoising steps and a narrow alpha range.
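The automatic routing step mentioned above can be sketched as a nearest-label lookup. This is a minimal sketch, assuming the classifier compares one L2-normalized query embedding against one L2-normalized text embedding per category name; the summary does not say what the classifier embeds (instruction, image, or both), and the function name, the toy vectors, and the category names are hypothetical stand-ins for real CLIP outputs.

```python
from typing import Dict, List


def dot(a: List[float], b: List[float]) -> float:
    """Dot product; equals cosine similarity when both inputs are unit-norm."""
    return sum(x * y for x, y in zip(a, b))


def zero_shot_category(query: List[float],
                       label_embeddings: Dict[str, List[float]]) -> str:
    """Return the category whose label embedding is most similar to the query."""
    if not label_embeddings:
        raise ValueError("need at least one category label")
    return max(label_embeddings, key=lambda name: dot(query, label_embeddings[name]))


# Toy 2-D unit vectors standing in for real embeddings (hypothetical values).
LABELS = {"style": [1.0, 0.0], "add": [0.0, 1.0]}
predicted = zero_shot_category([0.8, 0.6], LABELS)
```

The predicted category would then index the routing table; a misclassification only costs score when the wrong category maps to a different operation.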

Core claim

No single attention operation dominates across edit categories on Qwen-Image-Edit-2511. A routing table that assigns each category its strongest operation improves the CLIP-T+DINO-I composite by 6.4% over the editing baseline when ground-truth categories are supplied; a CLIP zero-shot classifier closes 98% of the gap despite 55% accuracy. KVInject itself is a localized alpha-blend of source K/V into the noise half that outperforms the classical MasaCtrl recipe on this architecture.
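A minimal sketch of the alpha-blend, assuming the image stream is laid out as [noise half | source half] with equal token counts and that the blend is the convex combination K_noise ← (1 − α)·K_noise + α·K_src. The function name `kv_inject`, the token layout, and the exact blend formula are illustrative assumptions, not the authors' released code.

```python
from typing import List

Vector = List[float]


def kv_inject(tokens: List[Vector], alpha: float) -> List[Vector]:
    """Blend the source half of a K (or V) projection into the noise half.

    Assumes `tokens` is [noise tokens..., source tokens...] with equal halves.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(tokens) % 2 != 0:
        raise ValueError("expected equal noise and source halves")
    half = len(tokens) // 2
    noise, source = tokens[:half], tokens[half:]
    blended = [
        [(1.0 - alpha) * n + alpha * s for n, s in zip(n_tok, s_tok)]
        for n_tok, s_tok in zip(noise, source)
    ]
    # The source half is returned unchanged; only the noise half moves toward it.
    return blended + [list(s_tok) for s_tok in source]
```

With α = 0 the call is a no-op, which matches the description of the baseline as the unmodified editing model.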

What carries the argument

AttnRouter, a per-category routing table that dispatches each edit to the attention operation (KVInject or alternatives) that best preserves source structure for that category.
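The dispatch itself amounts to a table lookup, sketched below. Only the "add" entry reflects the source (the Figure 6 caption says add edits route to the baseline); the "style" entry, the operation stubs, and the fallback to baseline for unknown categories are placeholders chosen for illustration.

```python
from typing import Callable, Dict

Edit = Dict[str, str]


def baseline(edit: Edit) -> str:
    """Stub for the unmodified editing model."""
    return f"baseline({edit['prompt']})"


def kv_inject_op(edit: Edit) -> str:
    """Stub for an edit run with K/V injection enabled."""
    return f"kvinject({edit['prompt']})"


# Hypothetical routing table: category name -> attention operation.
ROUTING_TABLE: Dict[str, Callable[[Edit], str]] = {
    "style": kv_inject_op,  # placeholder assignment, not taken from the paper
    "add": baseline,        # Figure 6 caption: add edits route to the baseline
}


def route(edit: Edit, category: str) -> str:
    """Dispatch an edit to its category's operation, defaulting to baseline."""
    operation = ROUTING_TABLE.get(category, baseline)
    return operation(edit)
```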

If this is right

  • KVInject succeeds with a single forward pass and avoids the prompt-mismatch collapse that drops MasaCtrl 31% below baseline.
  • Injection confined to early denoising steps (S0-7) recovers nearly all of the full-step gain.
  • Alpha blending in the interval [0.3, 0.5] forms a stable operating range, while injection confined to the early (L0-15) or late (L45-60) layer bands fails to drive editing.
  • Simple K/V rescaling never beats baseline and aggressive rescaling collapses generation quality to 0.084 composite.
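The band-limited behavior in the bullets above can be sketched as a gate that returns the blend strength for a given (layer, step) pair. The configuration α = 0.3, L30–45, S0–7 is the one quoted in the Figure 5 caption; treating the bands as half-open integer intervals, and the class and method names, are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InjectionBand:
    layer_start: int
    layer_stop: int   # exclusive (assumed)
    step_start: int
    step_stop: int    # exclusive (assumed)
    alpha: float

    def alpha_at(self, layer: int, step: int) -> float:
        """Blend strength inside the band, 0.0 (no injection) outside it."""
        in_layers = self.layer_start <= layer < self.layer_stop
        in_steps = self.step_start <= step < self.step_stop
        return self.alpha if (in_layers and in_steps) else 0.0


# Configuration quoted in the Figure 5 caption: alpha=0.3, L30-45, S0-7.
BAND = InjectionBand(layer_start=30, layer_stop=45,
                     step_start=0, step_stop=7, alpha=0.3)
```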

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The existence of stable per-category optima suggests that MMDiT attention layers encode edit-type-specific structure-preservation rules that can be pre-tabulated once.
  • The large gap closed by a low-accuracy classifier indicates that even coarse semantic signals are enough to route most edits productively.
  • The same routing principle could be tested on other diffusion transformers by collecting a small stratified set and building an analogous table.

Load-bearing premise

The best attention operation for any given edit category stays stable and generalizes from the 100-sample stratified subset to unseen images and prompts.

What would settle it

Apply the fixed routing table (or the auto-classifier version) to a fresh 100-image subset with ground-truth categories and check whether the composite score still exceeds the baseline by roughly 6%.
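The check described above reduces to one number per fresh subset. A minimal sketch, assuming the 6.4% figure is a relative improvement in mean composite score and that per-sample composites are available for both runs; the function name is hypothetical.

```python
from typing import Sequence


def relative_gain(router_scores: Sequence[float],
                  baseline_scores: Sequence[float]) -> float:
    """Relative improvement of the mean router composite over the mean baseline."""
    if not router_scores or len(router_scores) != len(baseline_scores):
        raise ValueError("need two non-empty score lists of equal length")
    mean_router = sum(router_scores) / len(router_scores)
    mean_baseline = sum(baseline_scores) / len(baseline_scores)
    if mean_baseline == 0:
        raise ValueError("baseline mean is zero; relative gain is undefined")
    return (mean_router - mean_baseline) / mean_baseline
```

A value near 0.064 on images the table was not fitted to would support the claim; a value near zero would suggest the table does not transfer.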

Figures

Figures reproduced from arXiv: 2605.01480 by Guandong Li, Mengxia Ye.

Figure 1. End-to-end pipeline. Source image and noise latent are encoded into a shared VAE latent space; their tokens are concatenated to form the 8192-token image stream. Text tokens enter through the parallel add_*_proj branch. After projection but before joint attention, KVInject overwrites the noise-half of to_k/to_v with an α-blend toward the source-half within a chosen (layer, step) band. AttnRouter selects the…
Figure 2. CLIP-T vs. DINO-I scatter of every variant evaluated in this paper (23 points on ImgEdit-Bench-100). Dotted lines are iso-composite contours 0.5(CLIP-T + DINO-I) = c. AttnRouter (blue stars, oracle and auto) sits on the highest contour reached by any variant; baseline (black square) and the KVInject sweep cluster between comp=0.38–0.40; MasaCtrl-proper (brown plus) and degenerate simple-K/V configurations fa…
Figure 3. α-sweep curves for two layer bands. Left: composite. Right: DINO-I. Sweet spot is α=0.3 in band L30–45; pushing α to 0.7 in L30–45 triggers the over-injection collapse (DINO-I drops from 0.586 to 0.438 while CLIP-T rises to 0.228, see Tab. 2).
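As a worked check, the composite formula from the Figure 2 caption can be applied to the over-injection numbers quoted in the Figure 3 caption (α = 0.7, band L30–45). This assumes the composite is the unweighted mean with no further normalization, as the contour formula suggests.

```latex
\[
\text{composite} = \tfrac{1}{2}\,(\text{CLIP-T} + \text{DINO-I})
= \tfrac{1}{2}\,(0.228 + 0.438) = 0.333
\]
```

That value sits below the 0.38–0.40 range where Figure 2 places the baseline and the KVInject sweep, consistent with the caption calling this setting a collapse.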
Figure 4. Step-band ablation. Composite (red) and DINO-I (blue) jump above baseline (dotted) only when K/V injection covers early denoising steps S0–7; the remaining three step bands return to baseline. … residual stream has not yet committed to a final pixel structure. 4.5 Step-band ablation. We split the 28 denoising steps into four contiguous 7-step bands (Tab. 4, visualized in…
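The step-band split can be sketched as follows. The 28 steps and four contiguous 7-step bands come from the text above; reading the label "S0–7" as the half-open index range [0, 7) is an assumption, and the function name is hypothetical.

```python
from typing import List


def split_into_bands(num_steps: int, num_bands: int) -> List[range]:
    """Partition step indices 0..num_steps-1 into equal contiguous bands."""
    if num_bands <= 0 or num_steps % num_bands != 0:
        raise ValueError("num_steps must be a positive multiple of num_bands")
    width = num_steps // num_bands
    return [range(i * width, (i + 1) * width) for i in range(num_bands)]


STEP_BANDS = split_into_bands(28, 4)  # [0,7), [7,14), [14,21), [21,28)
```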
Figure 5. Schematic attention visualization. (a) Baseline noise→source attention is unstructured. (b) After KVInject (α=0.3, L30–45, S0–7) the noise stream's K inherits the source's geometric structure and a clear diagonal emerges: noise position p now attends preferentially to source position p, propagating identity. (c) Per-layer cosine similarity between K^noise_ℓ and K^src_ℓ is highest at shallow/deep layers and…
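Panel (c) reports a per-layer cosine similarity between the noise and source keys. One plausible way to compute such a statistic is the mean token-wise cosine at matching positions, sketched below; the caption does not say how the paper aggregates over tokens or heads, so the averaging scheme and the function names are assumptions.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def mean_token_cosine(k_noise: List[Vector], k_src: List[Vector]) -> float:
    """Average cosine between noise and source keys at matching positions."""
    if not k_noise or len(k_noise) != len(k_src):
        raise ValueError("need two non-empty key lists of equal length")
    return sum(cosine(n, s) for n, s in zip(k_noise, k_src)) / len(k_noise)
```

Running this once per layer on the keys captured at a fixed denoising step would give one curve of the kind the caption describes.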
Figure 6. Per-category composite improvement of AttnRouter (oracle) over baseline. Style edits gain the most (+24%); add edits route to the baseline (no improvement) because injecting source K/V suppresses content insertion. … on style edits and small object identity on attribute edits, which is consistent with the DINO-I gain. 5 Discussion. What MMDiT changes about editing. On UNet diffusion, source preservation is a…
Figure 7. Qualitative comparison on ImgEdit-Bench-100. One representative sample per category. …
Original abstract

We study training-free image editing on Qwen-Image-Edit-2511, a 60-block multi-modal diffusion transformer (MMDiT) that concatenates noise and source-image tokens within a single attention stream. We make three contributions. (i) We introduce KVInject, a single-forward attention manipulation that alpha-blends source-half key/value projections into the noise-half within a localized layer/step band. KVInject is simpler than the classical two-pass MasaCtrl recipe and avoids the prompt-mismatch failure mode that disables MasaCtrl on MMDiT (composite score drops 31% versus baseline). (ii) We show that no single attention operation dominates across edit types, motivating AttnRouter, a per-category routing table that dispatches edits to the operation that best preserves source structure for that type. With ground-truth categories the router improves the CLIP-T+DINO-I composite by 6.4% over the editing baseline; an automatic CLIP zero-shot classifier closes 98% of this gap despite only 55% category accuracy. (iii) Through layer-, step-, and alpha-band ablations we localize the editing-effective attention sub-circuit: K/V injection in early denoising steps (S0-7) recovers nearly all of the gain of full-step injection, while injection in early (L0-15) or late (L45-60) layer bands fails to drive editing entirely; alpha in [0.3, 0.5] is a stable sweet spot. We also report negative results that highlight what does not transfer from the UNet folklore: simple K/V rescaling never beats baseline and aggressive variants collapse generation entirely (composite 0.084). We release code, pre-computed routing tables, and a 100-sample stratified subset of ImgEdit-Bench used in all ablations.
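The "closes 98% of this gap" statement can be read as a ratio of gains over baseline. The sketch below assumes that reading: the gap is oracle minus baseline, and the fraction closed is the automatic router's gain divided by that gap. The function names are hypothetical.

```python
def gap_closed(baseline: float, oracle: float, auto: float) -> float:
    """Fraction of the oracle-vs-baseline gap recovered by the automatic router."""
    if oracle == baseline:
        raise ValueError("oracle equals baseline; there is no gap to close")
    return (auto - baseline) / (oracle - baseline)


def implied_auto_gain(oracle_gain_pct: float, fraction_closed: float) -> float:
    """Gain of the automatic router implied by the oracle gain and gap fraction."""
    return oracle_gain_pct * fraction_closed
```

Under this reading, a 6.4% oracle gain with 98% of the gap closed implies roughly a 6.3% gain for the automatic router (0.98 × 6.4 = 6.272).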

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces KVInject, a single-forward attention manipulation that alpha-blends source key/value projections into the noise half within localized layer/step bands on the Qwen-Image-Edit-2511 MMDiT, and AttnRouter, a per-category routing table that dispatches to the best operation for each edit type. It reports that no single operation dominates, that ground-truth categories yield a 6.4% lift in the CLIP-T+DINO-I composite over baseline, and that a CLIP zero-shot classifier recovers 98% of the gain despite 55% accuracy. Extensive layer/step/alpha ablations localize effective bands (early steps S0-7, alpha [0.3,0.5]), negative results show K/V rescaling fails, and code, routing tables, and the 100-sample stratified ImgEdit-Bench subset are released.

Significance. If the per-category optima prove stable and the routing table generalizes, the work supplies a practical, training-free improvement to MMDiT editing by showing that attention operations are category-dependent and that simple injection in early denoising steps suffices. The released code, pre-computed tables, and stratified subset, together with the negative rescaling results and the localized ablation findings, are concrete strengths that support reproducibility and further investigation.

major comments (1)
  1. [Abstract] The 6.4% CLIP-T+DINO-I gain with ground-truth categories and the 98% recovery via the CLIP classifier both rest on a routing table obtained by ablating operations on the identical 100-sample stratified subset used for all reported metrics. No held-out validation, cross-validation, or transfer test on disjoint images/prompts is described; if per-category optima are not stable across source images, the composite improvement reduces to the single-operation baseline.
minor comments (2)
  1. [Abstract] The abstract states that the 100-sample subset is 'used in all ablations' but does not clarify whether the routing table itself was frozen before final metric computation or whether any post-hoc selection occurred.
  2. Clarify the exact definition of the CLIP-T+DINO-I composite (weights, normalization) and the number of edit categories used for routing.
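The held-out protocol the referee asks for can be sketched as fitting the table on one split and scoring it on a disjoint one. This is a minimal sketch, assuming each sample carries a category label and a composite score for every candidate operation; the data layout, function names, and fallback to a "baseline" operation are hypothetical and do not describe the released code.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Sample = Tuple[str, Dict[str, float]]  # (category, {operation: composite score})


def fit_table(samples: List[Sample]) -> Dict[str, str]:
    """Pick, per category, the operation with the highest mean composite."""
    scores: Dict[str, Dict[str, List[float]]] = defaultdict(lambda: defaultdict(list))
    for category, per_op in samples:
        for op, score in per_op.items():
            scores[category][op].append(score)
    return {
        category: max(ops, key=lambda op: sum(ops[op]) / len(ops[op]))
        for category, ops in scores.items()
    }


def evaluate(table: Dict[str, str], samples: List[Sample],
             default_op: str = "baseline") -> float:
    """Mean composite obtained by routing each sample through the table."""
    picked = [per_op[table.get(category, default_op)] for category, per_op in samples]
    return sum(picked) / len(picked)
```

Comparing `evaluate(table, fit_split)` with `evaluate(table, held_out_split)` separates the in-sample score from the score on unseen samples; repeating this over several splits would give a cross-validated estimate.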

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed reading and for raising this important point about the evaluation protocol for AttnRouter. We address the major comment below.

  1. Referee: [Abstract] The 6.4% CLIP-T+DINO-I gain with ground-truth categories and the 98% recovery via the CLIP classifier both rest on a routing table obtained by ablating operations on the identical 100-sample stratified subset used for all reported metrics. No held-out validation, cross-validation, or transfer test on disjoint images/prompts is described; if per-category optima are not stable across source images, the composite improvement reduces to the single-operation baseline.

    Authors: We agree that the routing table was derived from ablations performed on the same 100-sample stratified subset used for all reported metrics, and that no held-out validation set, cross-validation, or transfer test on disjoint images/prompts was conducted. This constitutes a genuine limitation of the current evaluation. The subset was constructed to be stratified across edit categories from ImgEdit-Bench to ensure coverage, and we release the exact samples, the pre-computed routing tables, and the full code precisely to enable independent checks of stability on other data. The observation that a CLIP zero-shot classifier (never tuned on this subset) recovers 98% of the ground-truth-category gain offers indirect support for robustness, but we do not claim this substitutes for a proper held-out test. In the revised manuscript we will add an explicit discussion of this limitation in the Experiments section, clarify the evaluation scope in the abstract, and note that the reported 6.4% gain is measured under the released distribution. revision: yes

Circularity Check

1 step flagged

Routing table fitted on the 100-sample evaluation subset makes the 6.4% gain an in-sample result

specific steps
  1. fitted input called prediction [Abstract]
    "We show that no single attention operation dominates across edit types, motivating AttnRouter, a per-category routing table that dispatches edits to the operation that best preserves source structure for that type. With ground-truth categories the router improves the CLIP-T+DINO-I composite by 6.4% over the editing baseline; an automatic CLIP zero-shot classifier closes 98% of this gap despite only 55% category accuracy. [...] We release code, pre-computed routing tables, and a 100-sample stratified subset of ImgEdit-Bench used in all ablations."

    The routing table is built by choosing the per-category optimum from ablations on the 100-sample subset; the 6.4% composite improvement (and 98% recovery) is then measured on the same subset. The reported gain is therefore the in-sample performance of the fitted router rather than a generalization.

full rationale

The paper's headline claim is that AttnRouter (a per-category dispatch table) yields a 6.4% CLIP-T+DINO-I lift. This table is constructed by selecting the best operation per category via ablations explicitly performed on the identical 100-sample stratified subset released and used for all metric reporting. The reported improvement with ground-truth categories is therefore the in-sample maximum over the ablated choices rather than an out-of-sample prediction. This matches the fitted-input-called-prediction pattern. No self-definitional equations, self-citation load-bearing steps, or other enumerated circularity patterns appear in the provided text.
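Why an in-sample maximum overstates the gain can be illustrated with a small simulation. This is a toy model for a single category, not a reproduction of the paper's data: every operation, baseline included, is given the same true mean composite, so any measured gain from picking the best one is pure selection noise. The distribution parameters and names are arbitrary assumptions.

```python
import random
from typing import List, Tuple


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


def simulate(num_samples: int = 20, num_ops: int = 5, seed: int = 0) -> Tuple[float, float]:
    """Return (in-sample gain, fresh-sample gain) of the best-looking operation.

    Operation 0 plays the baseline. All operations share one true mean.
    """
    rng = random.Random(seed)
    fit = [[rng.gauss(0.4, 0.05) for _ in range(num_samples)] for _ in range(num_ops)]
    fresh = [[rng.gauss(0.4, 0.05) for _ in range(num_samples)] for _ in range(num_ops)]
    # Select the operation that looks best on the fitting samples.
    best = max(range(num_ops), key=lambda i: mean(fit[i]))
    in_sample_gain = mean(fit[best]) - mean(fit[0])
    fresh_gain = mean(fresh[best]) - mean(fresh[0])
    return in_sample_gain, fresh_gain
```

The in-sample gain can never be negative because the baseline is one of the candidates, while the gain on fresh samples has no such guarantee; that asymmetry is the pattern flagged here.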

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claims rest on empirical selection of injection bands and a data-derived routing table rather than new theoretical axioms; standard transformer attention is presupposed but directly tested via ablations.

free parameters (2)
  • alpha blending factor = [0.3, 0.5]
    Identified as a stable sweet spot by the alpha sweep
  • injection layer and step bands = S0-7 (steps), L30-45 (layers)
    Localized via ablations: early steps S0-7 recover nearly all of the full-step gain, while the early (L0-15) and late (L45-60) layer bands fail to drive editing
axioms (1)
  • domain assumption KV injection in attention layers can transfer source structure without prompt mismatch on MMDiT
    Tested empirically but presupposed as transferable from UNet folklore

pith-pipeline@v0.9.0 · 5641 in / 1417 out tokens · 58719 ms · 2026-05-09T14:36:11.565304+00:00 · methodology

