AttnRouter: Per-Category Attention Routing for Training-Free Image Editing on MMDiT
Pith reviewed 2026-05-09 14:36 UTC · model grok-4.3
The pith
A per-category router for attention operations improves the CLIP-T+DINO-I composite of training-free image editing on MMDiT by 6.4% over the editing baseline.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
No single attention operation dominates across edit categories on Qwen-Image-Edit-2511. A routing table that assigns each category its strongest operation improves the CLIP-T+DINO-I composite by 6.4% over the editing baseline when ground-truth categories are supplied; a CLIP zero-shot classifier closes 98% of the gap despite 55% accuracy. KVInject itself is a localized alpha-blend of source K/V into the noise half that outperforms the classical MasaCtrl recipe on this architecture.
What carries the argument
AttnRouter, a per-category routing table that dispatches each edit to the attention operation (KVInject or alternatives) that best preserves source structure for that category.
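As a rough illustration of the dispatch idea, the sketch below models the routing table as a plain dictionary from edit category to operation name. The category names, the table entries, and the two stand-in operations are hypothetical placeholders, not the released routing table or the paper's implementation; a real operation would patch attention inside the MMDiT rather than return a string.

```python
from typing import Callable, Dict

# Hypothetical stand-ins for attention operations. Each one only returns a
# label so the dispatch logic can be exercised without a diffusion model.
def baseline(request: dict) -> str:
    return f"baseline({request['prompt']})"

def kv_inject(request: dict) -> str:
    return f"kv_inject({request['prompt']})"

OPERATIONS: Dict[str, Callable[[dict], str]] = {
    "baseline": baseline,
    "kv_inject": kv_inject,
}

# Hypothetical routing table: category names and assignments are placeholders.
ROUTING_TABLE: Dict[str, str] = {
    "replace": "kv_inject",
    "style": "baseline",
}

def route(category: str, request: dict) -> str:
    """Dispatch to the tabulated operation; unknown categories use the baseline."""
    op_name = ROUTING_TABLE.get(category, "baseline")
    return OPERATIONS[op_name](request)
```

Falling back to the baseline for unknown categories is a design choice of this sketch; the source does not say how the router treats categories missing from the table.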
If this is right
- KVInject succeeds with a single forward pass and avoids the prompt-mismatch collapse that drops MasaCtrl 31% below baseline.
- Injection confined to early denoising steps (S0-7) recovers nearly all of the full-step gain.
- Alpha blending in the interval [0.3, 0.5] forms a stable operating range, while injection confined to the early (L0-15) or late (L45-60) layer bands produces no editing effect.
- Simple K/V rescaling never beats baseline and aggressive rescaling collapses generation quality to 0.084 composite.
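The bullets above describe KVInject as an alpha blend gated to a layer/step band. A minimal sketch of that idea in plain Python follows, assuming the key (or value) matrix is laid out as noise-half rows followed by an equal number of source-half rows, that alpha weights the source side, and that text tokens are ignored. The function names, the band arguments, and the example band values are assumptions for illustration, not the authors' code.

```python
from typing import List

Matrix = List[List[float]]  # rows are tokens, columns are channels

def blend(noise: Matrix, source: Matrix, alpha: float) -> Matrix:
    """Return (1 - alpha) * noise + alpha * source, element-wise."""
    if len(noise) != len(source):
        raise ValueError("noise and source halves must have equal token counts")
    return [
        [(1.0 - alpha) * n + alpha * s for n, s in zip(n_row, s_row)]
        for n_row, s_row in zip(noise, source)
    ]

def kv_inject(kv: Matrix, n_noise: int, alpha: float, layer: int, step: int,
              layer_band: range, step_band: range) -> Matrix:
    """Blend source-half rows into noise-half rows, but only inside the band."""
    if layer not in layer_band or step not in step_band:
        return kv  # outside the band: leave the projections untouched
    noise, source = kv[:n_noise], kv[n_noise:]
    # Source rows pass through unchanged; only the noise half is modified.
    return blend(noise, source, alpha) + source

# Usage with hypothetical bands: steps 0..7 (S0-7) and an arbitrary layer range.
kv = [[0.0, 0.0], [1.0, 1.0]]  # one noise token, one source token
blended = kv_inject(kv, n_noise=1, alpha=0.5, layer=20, step=3,
                    layer_band=range(16, 45), step_band=range(0, 8))
```

The same function would be applied to keys and values separately. Whether the paper shares a single alpha across both is not stated in the text above.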
Where Pith is reading between the lines
- The existence of stable per-category optima suggests that MMDiT attention layers encode edit-type-specific structure-preservation rules that can be pre-tabulated once.
- The large gap closed by a low-accuracy classifier indicates that even coarse semantic signals are enough to route most edits productively.
- The same routing principle could be tested on other diffusion transformers by collecting a small stratified set and building an analogous table.
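Building an analogous table for another model could look roughly like the sketch below: group composite scores by category and operation, then keep the operation with the highest mean per category. The record layout, the function name, and the example scores are assumptions made for illustration; the example numbers are invented and are not results from the paper.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple

Record = Tuple[str, str, float]  # (category, operation, composite score)

def build_routing_table(records: Iterable[Record]) -> Dict[str, str]:
    """Pick, for each category, the operation with the highest mean score."""
    scores = defaultdict(lambda: defaultdict(list))
    for category, operation, score in records:
        scores[category][operation].append(score)
    table = {}
    for category, per_op in scores.items():
        # Ties resolve to whichever operation was seen first.
        table[category] = max(per_op, key=lambda op: mean(per_op[op]))
    return table

# Invented example scores for two hypothetical categories.
example_records = [
    ("replace", "kv_inject", 0.5), ("replace", "kv_inject", 0.7),
    ("replace", "baseline", 0.4), ("replace", "baseline", 0.6),
    ("style", "kv_inject", 0.3), ("style", "baseline", 0.5),
]
example_table = build_routing_table(example_records)
```

A table built this way is fitted to whatever samples it sees, so its scores on those same samples are in-sample numbers.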
Load-bearing premise
The best attention operation for any given edit category stays stable and generalizes from the 100-sample stratified subset to unseen images and prompts.
What would settle it
Apply the fixed routing table (or the auto-classifier version) to a fresh 100-image subset with ground-truth categories and check whether the composite score still exceeds the baseline by roughly 6%.
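The check above comes down to two ratios, sketched here under two assumptions: that the 6.4% figure is a relative improvement over the baseline composite, and that "closes 98% of the gap" means the fraction of the ground-truth-router gain recovered by the classifier-driven router. The function names are hypothetical, and the numbers in the usage lines are placeholders normalized to a baseline of 1.0, not reported scores.

```python
def relative_gain(routed: float, baseline: float) -> float:
    """Relative improvement of the routed composite over the baseline."""
    if baseline == 0:
        raise ValueError("baseline composite must be non-zero")
    return (routed - baseline) / baseline

def gap_closed(auto_routed: float, gt_routed: float, baseline: float) -> float:
    """Fraction of the ground-truth-router gain recovered by the auto router."""
    if gt_routed == baseline:
        raise ValueError("ground-truth router shows no gain over baseline")
    return (auto_routed - baseline) / (gt_routed - baseline)

# Placeholder scores on a fresh subset, normalized so that baseline = 1.0.
fresh_gain = relative_gain(routed=1.064, baseline=1.0)
fresh_gap = gap_closed(auto_routed=1.0627, gt_routed=1.064, baseline=1.0)
```

On a held-out subset the routing table has to stay frozen while these ratios are computed; otherwise the test repeats the in-sample measurement.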
Original abstract
We study training-free image editing on Qwen-Image-Edit-2511, a 60-block multi-modal diffusion transformer (MMDiT) that concatenates noise and source-image tokens within a single attention stream. We make three contributions. (i) We introduce KVInject, a single-forward attention manipulation that alpha-blends source-half key/value projections into the noise-half within a localized layer/step band. KVInject is simpler than the classical two-pass MasaCtrl recipe and avoids the prompt-mismatch failure mode that disables MasaCtrl on MMDiT (composite score drops 31% versus baseline). (ii) We show that no single attention operation dominates across edit types, motivating AttnRouter, a per-category routing table that dispatches edits to the operation that best preserves source structure for that type. With ground-truth categories the router improves the CLIP-T+DINO-I composite by 6.4% over the editing baseline; an automatic CLIP zero-shot classifier closes 98% of this gap despite only 55% category accuracy. (iii) Through layer-, step-, and alpha-band ablations we localize the editing-effective attention sub-circuit: K/V injection in early denoising steps (S0-7) recovers nearly all of the gain of full-step injection, while injection in early (L0-15) or late (L45-60) layer bands fails to drive editing entirely; alpha in [0.3, 0.5] is a stable sweet spot. We also report negative results that highlight what does not transfer from the UNet folklore: simple K/V rescaling never beats baseline and aggressive variants collapse generation entirely (composite 0.084). We release code, pre-computed routing tables, and a 100-sample stratified subset of ImgEdit-Bench used in all ablations.
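The automatic classifier mentioned in the abstract can be pictured as nearest-label matching in an embedding space. The sketch below shows only that generic zero-shot step with toy vectors. In practice the embeddings would come from CLIP encoders, and the abstract does not say whether the classifier looks at the instruction text, the source image, or both, so the input choice, the label names, and the vectors here are all assumptions.

```python
import math
from typing import Dict, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def zero_shot_category(query: Sequence[float],
                       label_embeddings: Dict[str, Sequence[float]]) -> str:
    """Return the label whose embedding is most similar to the query."""
    return max(label_embeddings,
               key=lambda name: cosine(query, label_embeddings[name]))

# Toy two-dimensional embeddings for two hypothetical category labels.
toy_labels = {"add": [1.0, 0.0], "remove": [0.0, 1.0]}
predicted = zero_shot_category([0.9, 0.1], toy_labels)
```

The predicted label would then be used as the key into the routing table in place of the ground-truth category.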
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces KVInject, a single-forward attention manipulation that alpha-blends source key/value projections into the noise half within localized layer/step bands on the Qwen-Image-Edit-2511 MMDiT, and AttnRouter, a per-category routing table that dispatches to the best operation for each edit type. It reports that no single operation dominates, that ground-truth categories yield a 6.4% lift in the CLIP-T+DINO-I composite over baseline, and that a CLIP zero-shot classifier recovers 98% of the gain despite 55% accuracy. Extensive layer/step/alpha ablations localize effective bands (early steps S0-7, alpha [0.3,0.5]), negative results show K/V rescaling fails, and code, routing tables, and the 100-sample stratified ImgEdit-Bench subset are released.
Significance. If the per-category optima prove stable and the routing table generalizes, the work supplies a practical, training-free improvement to MMDiT editing by showing that attention operations are category-dependent and that simple injection in early denoising steps suffices. The released code, pre-computed tables, and stratified subset, together with the negative rescaling results and localized ablation findings, are concrete strengths that enable reproducibility and further investigation.
major comments (1)
- [Abstract] The 6.4% CLIP-T+DINO-I gain with ground-truth categories and the 98% recovery via the CLIP classifier both rest on a routing table obtained by ablating operations on the identical 100-sample stratified subset used for all reported metrics. No held-out validation, cross-validation, or transfer test on disjoint images/prompts is described; if per-category optima are not stable across source images, the composite improvement reduces to the single-operation baseline.
minor comments (2)
- [Abstract] The abstract states that the 100-sample subset is 'used in all ablations' but does not clarify whether the routing table itself was frozen before final metric computation or whether any post-hoc selection occurred.
- Clarify the exact definition of the CLIP-T+DINO-I composite (weights, normalization) and the number of edit categories used for routing.
Simulated Author's Rebuttal
We thank the referee for the detailed reading and for raising this important point about the evaluation protocol for AttnRouter. We address the major comment below.
Point-by-point responses
Referee: [Abstract] The 6.4% CLIP-T+DINO-I gain with ground-truth categories and the 98% recovery via the CLIP classifier both rest on a routing table obtained by ablating operations on the identical 100-sample stratified subset used for all reported metrics. No held-out validation, cross-validation, or transfer test on disjoint images/prompts is described; if per-category optima are not stable across source images, the composite improvement reduces to the single-operation baseline.
Authors: We agree that the routing table was derived from ablations performed on the same 100-sample stratified subset used for all reported metrics, and that no held-out validation set, cross-validation, or transfer test on disjoint images/prompts was conducted. This constitutes a genuine limitation of the current evaluation. The subset was constructed to be stratified across edit categories from ImgEdit-Bench to ensure coverage, and we release the exact samples, the pre-computed routing tables, and the full code precisely to enable independent checks of stability on other data. The observation that a CLIP zero-shot classifier (never tuned on this subset) recovers 98% of the ground-truth-category gain offers indirect support for robustness, but we do not claim this substitutes for a proper held-out test. In the revised manuscript we will add an explicit discussion of this limitation in the Experiments section, clarify the evaluation scope in the abstract, and note that the reported 6.4% gain is measured under the released distribution.
revision: yes
Circularity Check
Routing table fitted on the 100-sample evaluation subset makes the 6.4% gain an in-sample result
specific steps
- fitted input called prediction [Abstract]
"We show that no single attention operation dominates across edit types, motivating AttnRouter, a per-category routing table that dispatches edits to the operation that best preserves source structure for that type. With ground-truth categories the router improves the CLIP-T+DINO-I composite by 6.4% over the editing baseline; an automatic CLIP zero-shot classifier closes 98% of this gap despite only 55% category accuracy. [...] We release code, pre-computed routing tables, and a 100-sample stratified subset of ImgEdit-Bench used in all ablations."
The routing table is built by choosing the per-category optimum from ablations on the 100-sample subset; the 6.4% composite improvement (and 98% recovery) is then measured on the same subset. The reported gain is therefore the in-sample performance of the fitted router rather than a generalization.
full rationale
The paper's headline claim is that AttnRouter (a per-category dispatch table) yields a 6.4% CLIP-T+DINO-I lift. This table is constructed by selecting the best operation per category via ablations explicitly performed on the identical 100-sample stratified subset released and used for all metric reporting. The reported improvement with ground-truth categories is therefore the in-sample maximum over the ablated choices rather than an out-of-sample prediction. This matches the fitted-input-called-prediction pattern. No self-definitional equations, self-citation load-bearing steps, or other enumerated circularity patterns appear in the provided text.
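To make the in-sample concern concrete, the simulation below gives every operation the same true mean score in every category, fits a per-category table on one noisy draw, and then scores that table on both the fitting draw and a fresh draw. It is a toy model under stated assumptions: the counts of categories, operations, and samples, the Gaussian noise, and the function name are all arbitrary choices of this sketch and do not reflect the paper's data.

```python
import random
from statistics import mean

def simulate(n_categories: int = 8, n_ops: int = 4,
             n_per_cat: int = 12, seed: int = 0):
    """Return (in-sample routed mean, held-out routed mean, best single op)."""
    rng = random.Random(seed)

    def draw():
        # scores[c][o] holds noisy scores; every operation has the SAME true mean.
        return [[[rng.gauss(0.5, 0.1) for _ in range(n_per_cat)]
                 for _ in range(n_ops)]
                for _ in range(n_categories)]

    fit, fresh = draw(), draw()
    # Routing table: per category, the operation with the best mean on `fit`.
    table = [max(range(n_ops), key=lambda o: mean(fit[c][o]))
             for c in range(n_categories)]
    in_sample = mean(mean(fit[c][table[c]]) for c in range(n_categories))
    held_out = mean(mean(fresh[c][table[c]]) for c in range(n_categories))
    # Best single operation applied to all categories, scored on `fit`.
    best_single = max(mean(mean(fit[c][o]) for c in range(n_categories))
                      for o in range(n_ops))
    return in_sample, held_out, best_single

in_sample, held_out, best_single = simulate()
```

By construction the in-sample routed mean can never fall below the best single operation on the same data, even though no operation is truly better here. This is why a held-out comparison is needed to separate a real per-category effect from selection noise.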
Axiom & Free-Parameter Ledger
free parameters (2)
- alpha blending factor = [0.3, 0.5]
- injection layer and step bands = S0-7, L0-15/L45-60
axioms (1)
- domain assumption: KV injection in attention layers can transfer source structure without prompt mismatch on MMDiT
Figure 7: Qualitative comparison on ImgEdit-Bench-100. One representative sample per category. ...