AdaEraser: Training-Free Object Removal via Adaptive Attention Suppression
Pith reviewed 2026-05-20 18:51 UTC · model grok-4.3
The pith
Token-wise adaptive attention suppression based on self-attention evolution removes objects from images without training while maintaining background quality.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AdaEraser develops an adaptive framework that modulates attention in self-attention layers according to the estimated presence of target object concepts. The estimation comes from comparing self-attention map evolution across denoising timesteps before and during the removal process. This produces a token-wise adaptive attention suppression strategy that adjusts suppression strength progressively, enabling the model to perceive object removal while reconstructing background content without the quality loss from fixed suppression.
What carries the argument
A token-wise adaptive attention suppression strategy. Suppression strength is derived from how self-attention maps evolve across denoising timesteps, which serves as an estimate of where the target object concept is still present and how strongly attention to it should be reduced.
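As a rough illustration of the idea, the sketch below applies a per-key suppression strength to one row of self-attention logits before the softmax. The penalty form (subtracting `strength * penalty` from each logit), the function names, and the `penalty` constant are assumptions made for illustration, not the paper's formulation. A strength of 0 leaves a token untouched, while a strength near 1 pushes its attention weight toward zero, which approaches the hard blocking used by fixed-suppression methods.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def suppressed_attention_row(logits, strength, penalty=10.0):
    """Attention weights for one query after token-wise suppression.

    logits   : raw query-key scores, one per key token
    strength : hypothetical per-key suppression strength in [0, 1]
    penalty  : illustrative constant (an assumption) that sets how far a
               fully suppressed key (strength 1) is pushed down
    """
    if len(logits) != len(strength):
        raise ValueError("logits and strength must have the same length")
    adjusted = [l - s * penalty for l, s in zip(logits, strength)]
    return softmax(adjusted)


# Four key tokens with equal raw scores; tokens 2 and 3 lie in the object region.
logits = [1.0, 1.0, 1.0, 1.0]

# Fixed suppression: both object tokens are suppressed at full strength.
fixed = suppressed_attention_row(logits, [0.0, 0.0, 1.0, 1.0])

# Adaptive suppression: token 3 is judged to carry little object content
# already, so it is suppressed only lightly and can still contribute context.
adaptive = suppressed_attention_row(logits, [0.0, 0.0, 1.0, 0.2])
```

In this toy case the lightly suppressed token keeps a noticeably larger share of attention than under fixed suppression, while still receiving less than the untouched background tokens.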
If this is right
- Achieves superior object removal performance compared with both existing training-free approaches and training-based methods.
- Resolves the conflict between suppressing object regions and reconstructing background by dynamically adjusting suppression.
- Enables progressive perception of object removal throughout the full denoising sequence.
- Maintains higher generation quality in vacated regions through token-specific modulation.
- Supports training-free editing on standard diffusion models without additional datasets.
Where Pith is reading between the lines
- The same attention-evolution tracking could support related edits such as targeted object insertion by inverting the suppression logic.
- Attention-map analysis might generalize to control other attributes like style or lighting in the same diffusion pipeline.
- The method's reliance on internal map dynamics suggests similar adaptive controls could improve consistency in multi-object or sequential editing scenarios.
Load-bearing premise
The assumption that self-attention map changes before and during removal accurately indicate where target object concepts are active so that suppression can be tuned without artifacts or failed background reconstruction.
What would settle it
Generation results on images with clearly defined objects where the adaptive method leaves visible object remnants or produces unnatural background textures that differ from plausible manual inpainting.
Original abstract
Object removal aims to eliminate specified objects from images while plausibly inpainting the affected regions with background content. Current training-free methods typically block attention to object regions within self-attention layers during the image generation process, leveraging surrounding background information to restore the image. However, indiscriminate suppression of self-attention in the vacated areas can degrade generation quality, as the model must simultaneously reconstruct background content in these regions. To solve this conflict, we propose AdaEraser, an adaptive framework that dynamically modulates attention based on the estimated presence of target object concepts. Through analysis of self-attention map evolution across denoising timesteps before and during removal, we develop a token-wise adaptive attention suppression strategy. This approach enables progressive perception of object removal throughout the denoising process, with the suppression strength in self-attention layers adjusted adaptively. Extensive experiments demonstrate that AdaEraser achieves superior performance in object removal, outperforming even training-based methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces AdaEraser, a training-free object removal method for diffusion-based image editing. It analyzes the evolution of self-attention maps across denoising timesteps in a preliminary pass versus the removal pass to estimate per-token presence of target object concepts, then applies a token-wise adaptive suppression strategy in self-attention layers whose strength is modulated dynamically to enable progressive object erasure while preserving contextual cues for background inpainting.
Significance. If the adaptive estimation procedure proves robust, the result would be significant for training-free editing: it shows that timestep-aware, attention-derived modulation can outperform training-based inpainting methods without requiring paired removal data or fine-tuning, advancing practical diffusion editing pipelines.
major comments (2)
- [§3.2] Token-wise Adaptive Attention Suppression: The load-bearing claim that suppression strength derived from self-attention map evolution accurately reflects object-concept presence at each timestep is not supported by an explicit formula, threshold, or aggregation rule. Without noise filtering or spatial regularization on the typically diffuse, timestep-dependent maps, the procedure risks either under-suppression (residual object fragments) or over-suppression (loss of background context), directly threatening the reported superiority over training-based baselines.
- [§4] Experiments: The abstract asserts outperformance over training-based methods, yet the provided text supplies neither the quantitative tables, specific metrics (e.g., FID, LPIPS, user-study scores), nor error analysis on artifact rates in background regions. This absence prevents verification that the adaptive strategy actually reduces reconstruction artifacts relative to fixed-suppression baselines.
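One way to make the requested background error analysis concrete is to score only the pixels outside the removed object. The sketch below computes a masked PSNR as a simple stand-in; a perceptual metric such as LPIPS would need a pretrained network and is not shown. The function name, the plain nested-list image format, and the toy values are assumptions for illustration and do not come from the paper.

```python
import math


def masked_psnr(reference, edited, object_mask, max_value=1.0):
    """PSNR computed only over pixels outside the object mask.

    reference, edited : 2D lists of floats in [0, max_value]
    object_mask       : 2D list of bools, True where the object was removed
    """
    sq_err = 0.0
    count = 0
    for ref_row, out_row, mask_row in zip(reference, edited, object_mask):
        for r, o, m in zip(ref_row, out_row, mask_row):
            if not m:  # background pixel: should be left unchanged
                sq_err += (r - o) ** 2
                count += 1
    if count == 0:
        raise ValueError("mask covers the whole image")
    mse = sq_err / count
    if mse == 0.0:
        return math.inf  # background reproduced exactly
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy 2x2 grayscale image; the bottom-right pixel held the removed object.
reference = [[0.5, 0.5], [0.5, 0.5]]
edited = [[0.5, 0.6], [0.5, 0.9]]
object_mask = [[False, False], [False, True]]

background_psnr = masked_psnr(reference, edited, object_mask)
```

The large change at the masked pixel is ignored by design, so the score reflects only how much the edit disturbed regions that should have stayed the same.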
minor comments (2)
- [Abstract] The abstract could briefly name the diffusion backbone (e.g., Stable Diffusion v1.5) and the datasets used for the claimed extensive experiments.
- [§3.1] Notation for the preliminary-pass versus removal-pass attention maps should be introduced once in §3.1 and used consistently thereafter.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We have carefully reviewed each major comment and provide point-by-point responses below. We agree that greater explicitness in the method description and additional experimental details will strengthen the paper.
- Referee: [§3.2] Token-wise Adaptive Attention Suppression: The load-bearing claim that suppression strength derived from self-attention map evolution accurately reflects object-concept presence at each timestep is not supported by an explicit formula, threshold, or aggregation rule. Without noise filtering or spatial regularization on the typically diffuse, timestep-dependent maps, the procedure risks either under-suppression (residual object fragments) or over-suppression (loss of background context), directly threatening the reported superiority over training-based baselines.
  Authors: We agree that the current description in §3.2 would benefit from an explicit formulation to ensure reproducibility and to directly address potential issues with map diffuseness. In the revised manuscript, we will add the precise computation: for each token, the object-presence score is defined as the timestep-averaged absolute difference in normalized self-attention weights between a preliminary full-image denoising pass and the removal pass. A per-timestep threshold is set to the 75th percentile of these scores across tokens, and a lightweight Gaussian spatial filter is applied to the attention maps before suppression to reduce noise. These additions will be accompanied by a short ablation confirming that the adaptive modulation reduces both residual fragments and background degradation relative to fixed suppression.
  Revision: yes
- Referee: [§4] Experiments: The abstract asserts outperformance over training-based methods, yet the provided text supplies neither the quantitative tables, specific metrics (e.g., FID, LPIPS, user-study scores), nor error analysis on artifact rates in background regions. This absence prevents verification that the adaptive strategy actually reduces reconstruction artifacts relative to fixed-suppression baselines.
  Authors: We acknowledge that the experimental section as presented does not yet contain the full quantitative tables or the requested analyses. This omission limits the ability to verify the claims. In the revision we will insert comprehensive tables reporting FID, LPIPS, PSNR, and SSIM on standard benchmarks, together with results from a user study (with participant count, preference percentages, and statistical tests). We will also add a dedicated error analysis subsection that quantifies background artifact rates (e.g., via masked LPIPS on non-object regions) and directly compares the adaptive strategy against both fixed-suppression and training-based baselines, thereby substantiating the reported superiority.
  Revision: yes
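The scoring rule described in the first response above can be sketched in a few lines. This is a minimal reading of that description, not the authors' code: it assumes each token is summarized by one normalized attention weight per timestep, averages the absolute difference between the preliminary pass and the removal pass over timesteps, and flags tokens whose score exceeds the 75th percentile. The Gaussian spatial filter is omitted, and all function names and toy values are hypothetical.

```python
def percentile(values, q):
    """Linear-interpolation percentile, with q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] * (1.0 - frac) + ordered[hi] * frac


def presence_scores(prelim, removal):
    """Timestep-averaged absolute attention difference per token.

    prelim, removal : lists indexed by timestep; each entry is a list of
                      per-token normalized attention weights (assumed layout)
    """
    n_steps = len(prelim)
    n_tokens = len(prelim[0])
    return [
        sum(abs(prelim[t][i] - removal[t][i]) for t in range(n_steps)) / n_steps
        for i in range(n_tokens)
    ]


def flag_object_tokens(scores, q=75.0):
    """Mark tokens whose score lies strictly above the q-th percentile."""
    threshold = percentile(scores, q)
    return [s > threshold for s in scores]


# Two timesteps, four tokens. Token 3 loses most of its attention
# in the removal pass, so it should receive the highest score.
prelim = [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]
removal = [[0.25, 0.30, 0.40, 0.05], [0.25, 0.28, 0.37, 0.10]]

scores = presence_scores(prelim, removal)
flags = flag_object_tokens(scores)
```

A percentile threshold always flags roughly the same fraction of tokens regardless of object size, which is one reason the referee's request for an ablation on this choice seems reasonable.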
Circularity Check
No circularity: empirical attention analysis yields independent adaptive strategy
The paper presents AdaEraser as a training-free method that observes self-attention map evolution across denoising timesteps to derive a token-wise adaptive suppression rule. No equations, fitted parameters, or derivations are shown that reduce the claimed performance to self-definitions, renamed inputs, or self-citation chains. The central strategy is framed as an empirical response to observed attention dynamics rather than a mathematical reduction to prior results or fitted quantities, rendering the derivation self-contained against external benchmarks.