OSOR: One-Step Diffusion Inpainting for Effect-Aware Object Removal

Chenxi Sun; Deyang Kong; Haotian Wu; Junhao He; Leilei Cao; Linfeng Zhang; Peike Yu; Qinming Zhou; Xiangheng Tang

arxiv: 2606.28094 · v1 · pith:HS5HXG3Knew · submitted 2026-06-26 · 💻 cs.CV · cs.AI

Pith reviewed 2026-06-29 04:06 UTC · model grok-4.3

keywords object removal · diffusion inpainting · one-step diffusion · effect-aware editing · mask robustness · image editing · dataset curation

The pith

OSOR performs effect-aware object removal in one diffusion step while handling imperfect masks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to show that object removal that accounts for shadows, reflections, and other non-local effects can be done reliably in a single diffusion step rather than dozens. Current diffusion methods deliver strong results, but their compute cost blocks interactive or mobile use. OSOR adds three pieces: an occupancy-guided discriminator to stabilize single-step training, an alpha head that draws on pretrained diffusion knowledge to cope with inaccurate masks, and a verification pipeline (SAVP) that filters noisy triplets into 280K effect-aware training pairs. If the claim holds, high-quality removal becomes fast enough for real-time editing without precise user masks.

Core claim

OSOR introduces an occupancy-guided discriminator for precise boundary supervision that stabilizes single-step diffusion training, an alpha head that uses pretrained diffusion knowledge to predict removal regions and tolerate imperfect masks, and a semantic-anchored verification pipeline that filters noisy triplets to generate effect-aware supervision at scale; using the resulting 280K-pair CORNE dataset, the model surpasses multi-step diffusion baselines in perceptual quality at 4x to 30x faster inference.

What carries the argument

Occupancy-guided discriminator and alpha head that together enable stable single-step training and mask-robust, effect-aware inpainting.

If this is right

Object removal becomes practical for interactive applications and edge devices.
Users no longer need accurate masks for good results.
Effect-aware training data can be produced automatically at large scale.
Single-step diffusion training generalizes to other removal tasks with non-local effects.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same single-step stabilization tricks could accelerate other diffusion editing operations.
Verification pipelines like SAVP may help curate paired data for additional generative tasks.
The method's success on anime and text benchmarks suggests it may transfer to stylized or text-heavy images.

Load-bearing premise

The semantic-anchored verification pipeline can reliably filter noisy instruction-based triplets into high-quality effect-aware supervision at scale.

What would settle it

A side-by-side evaluation of perceptual scores and measured inference time for OSOR versus multi-step baselines on AnimeEraseBench or TextEraseBench would confirm or refute the quality and speed claims.
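The speed half of that evaluation comes down to measuring latency the same way for both systems and taking a ratio. The sketch below is a minimal illustration, not the paper's protocol: `median_latency`, `speedup`, and the `run_steps` stand-in workload are hypothetical names, and the workload is plain Python arithmetic rather than a diffusion model.

```python
import time
from statistics import median


def median_latency(fn, n_warmup=2, n_runs=5):
    """Median wall-clock seconds per call of fn(), after warm-up calls."""
    for _ in range(n_warmup):
        fn()
    samples = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return median(samples)


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    if candidate_seconds <= 0:
        raise ValueError("candidate latency must be positive")
    return baseline_seconds / candidate_seconds


def run_steps(n_steps, work=2000):
    """Hypothetical stand-in: n_steps units of identical CPU work."""
    total = 0
    for _ in range(n_steps):
        total += sum(i * i for i in range(work))
    return total


# Assumption: 30 steps stands in for a multi-step sampler, 1 for a one-step model.
multi = median_latency(lambda: run_steps(30))
single = median_latency(lambda: run_steps(1))
print(f"measured speedup: {speedup(multi, single):.1f}x")
```

For real GPU models the timed region would also need device synchronization before each clock read; otherwise asynchronous kernel launches make the one-step model look faster than it is.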

Figures

Figures reproduced from arXiv: 2606.28094 by Chenxi Sun, Deyang Kong, Haotian Wu, Junhao He, Leilei Cao, Linfeng Zhang, Peike Yu, Qinming Zhou, Xiangheng Tang.

**Figure 1.** Comparison between OSOR and other methods. OSOR effectively removes object-associated effects, such as shadows, while running 10.6× faster than ObjectClear. A 1024 × 1024 image can be processed in under one second on a single NVIDIA A100 GPU. The average rank is computed across six benchmarks.

**Figure 2.** Overview of SAVP and CORNE. Starting from single-edit instruction triplets, SAVP verifies semantically aligned and localized differences, then fuses the validated difference region with promptable segmentation to form an effect-aware mask. It further derives object-core masks for Phase II incomplete-mask conditioning.

**Figure 3.** Two-phase training curriculum. Phase I adapts a diffusion inpainting backbone with hard latent blending and occupancy-guided patch supervision for boundary-consistent one-step removal. Phase II predicts a soft alpha map under incomplete-mask conditioning and performs adaptive blending to remove residual shadows and reflections beyond the provided mask.

**Figure 4.** Mask-derived patch targets for a four-scale discriminator. Left shows the input mask and its overlay on the shot image for visualization. Right compares three target constructions at each scale. HM uses nearest-neighbor downsampling. SM applies Gaussian smoothing after downsampling. OG uses area pooling to produce fractional occupancies. Differences grow on coarser grids.

**Figure 5.** Qualitative comparison of OSOR and existing methods on CORNE-Val and AnimeEraseBench.

**Figure 6.** Examples of incomplete conditioning masks.

**Figure 7.** Qualitative comparison of OSOR and existing methods on RORD-Val, RemovalBench and TextEraseBench.

**Figure 8.** Alpha compositing under imperfect masks.

**Figure 9.** Representative CORNE annotation cases. Each row shows the input image …

**Figure 10.** Overall architecture of the occupancy-guided multi-scale discriminator. A frozen feature trunk …

**Figure 11.** Structure of one trainable head $h^{k}_{\xi}$. Each head applies spectral-normalized convolution, LeakyReLU, BlurPool downsampling, and a final 1 × 1 convolution to produce a single-channel patch logit map.

The packed alpha logits are then unpacked to the latent grid, $\ell_\theta = \mathrm{Unpack}(\ell^{\mathrm{pack}}_\theta)$, $\hat{\alpha} = \sigma(\ell_\theta)$, so that the final alpha map is also defined at the latent resolution. The modified terminal projection is ex…

**Figure 12.** User scribble-guided removal examples. For each case, we show the input image with user scribble, the …

**Figure 13.** More qualitative comparisons of OSOR and existing methods on RemovalBench and CORNE-Val.

**Figure 14.** More qualitative comparisons of OSOR and existing methods on RORD-Val, AnimeEraseBench, and …
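The Figure 4 caption contrasts nearest-neighbor downsampling of the mask with area pooling that yields fractional occupancies. The sketch below illustrates that difference on a toy mask. It is an assumption-laden illustration rather than the paper's code: the function names are hypothetical, the nearest-neighbor variant simply samples the top-left pixel of each cell, and the Gaussian-smoothed variant is omitted.

```python
import numpy as np


def nearest_targets(mask, factor):
    """Hard targets: one sample per factor x factor cell (top-left pixel)."""
    return mask[::factor, ::factor].astype(np.float32)


def occupancy_targets(mask, factor):
    """Fractional occupancy: mean mask value inside each factor x factor cell."""
    h, w = mask.shape
    if h % factor or w % factor:
        raise ValueError("mask size must be divisible by factor")
    cells = mask.reshape(h // factor, factor, w // factor, factor)
    return cells.mean(axis=(1, 3), dtype=np.float32)


# Toy 8 x 8 mask with a 5 x 5 box that straddles all four 4 x 4 cells.
mask = np.zeros((8, 8), dtype=np.float32)
mask[1:6, 2:7] = 1.0

hard = nearest_targets(mask, 4)
soft = occupancy_targets(mask, 4)
print(hard)  # [[0, 0], [0, 1]]: three partly covered cells are labelled empty
print(soft)  # [[0.375, 0.5625], [0.25, 0.375]]: coverage fraction per cell
```

On this toy input the hard targets report a single occupied patch, while area pooling preserves the total masked area (25 pixels) across the four patches. That is one way to read the caption's remark that the differences grow on coarser grids.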

Original abstract

Real-world object removal is challenging due to two key difficulties: the target object's non-local effects, such as shadows and reflections, which are difficult to model, and the fact that user-provided masks are often inaccurate or incomplete. With billions of parameters and tens of denoising steps, diffusion-based models achieve strong removal performance at the expense of substantial computational cost, limiting their use in interactive applications and on edge devices. To address these challenges, we present OSOR (One-Step Object Removal), which simultaneously achieves efficient, effect-aware, and mask-robust object removal. Concretely, OSOR introduces: (1) an occupancy-guided discriminator for precise boundary supervision, enabling stable single-step diffusion training; (2) an alpha head that leverages knowledge from pretrained diffusion models to predict appropriate removal regions with minimal overhead, thereby handling imperfect masks; and (3) a semantic-anchored verification pipeline (SAVP) that filters noisy instruction-based triplets to produce effect-aware supervision at scale. Using SAVP, we curate CORNE, which contains 280K verified removal pairs, and further annotate AnimeEraseBench and TextEraseBench to evaluate performance on more complex removal tasks. Experiments show that OSOR surpasses strong multi-step diffusion baselines in perceptual quality while achieving $4\times$ to $30\times$ faster inference.
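The abstract's alpha head and the Figure 3 caption's contrast between hard latent blending and a soft alpha map can be illustrated with standard alpha compositing. The sketch below assumes the usual convex blend `alpha * generated + (1 - alpha) * original`, with alpha obtained from logits through a sigmoid. The paper's exact blending rule is not given in the text here, so the formula and all names are illustrative assumptions.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def hard_blend(generated, original, mask):
    """Binary blend: generated content inside the mask, original elsewhere."""
    return mask * generated + (1.0 - mask) * original


def soft_blend(generated, original, alpha_logits):
    """Soft blend: per-location weights in (0, 1) from predicted logits."""
    alpha = sigmoid(alpha_logits)
    return alpha * generated + (1.0 - alpha) * original


original = np.ones((2, 2))    # stand-in for the encoded input
generated = np.zeros((2, 2))  # stand-in for the one-step prediction

# The user mask covers only the top-left location.
mask = np.array([[1.0, 0.0], [0.0, 0.0]])
print(hard_blend(generated, original, mask))

# Hypothetical logits that extend removal to the top-right location,
# e.g. a shadow that the user mask missed.
logits = np.array([[8.0, 8.0], [-8.0, -8.0]])
print(soft_blend(generated, original, logits).round(3))
```

With a hard mask, anything outside the user's region is copied from the input unchanged, so a missed shadow survives. A predicted alpha map can assign weight to such locations, which is the behaviour the abstract attributes to the alpha head.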

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

OSOR's one-step setup with targeted components for effects and masks is a reasonable efficiency play, but SAVP data quality is unverified in the provided text.

The main thing to know is that OSOR trains a single-step diffusion model for object removal that tries to handle non-local effects like shadows and reflections while tolerating sloppy user masks. It adds an occupancy-guided discriminator for boundary control, an alpha head that reuses pretrained diffusion knowledge to adjust removal regions, and SAVP to clean instruction triplets into the 280K-pair CORNE dataset.

The new pieces are those three additions plus the curation pipeline and the two new benchmarks. The framing of the practical constraints (billions of parameters and many denoising steps limiting interactive use) is accurate, and the goal of 4× to 30× faster inference with equal or better perceptual quality is the right target for mobile or real-time editing.

The work is clearest on the problem setup and the intended fixes. The architectural choices make sense on paper for stabilizing one-step training and dealing with imperfect masks.

The soft spot is the lack of any numbers, ablations, or validation for SAVP. The abstract describes it only as filtering noisy triplets, with no precision/recall figures, no human agreement on retained effect pairs, and no comparison of models trained on raw versus filtered data. If SAVP lets through substantial noise on non-local effects, the discriminator and alpha head have nothing solid to train on, which undercuts both the quality and mask-robustness claims. The provided text gives no experimental details at all, so the performance statements cannot be checked.

This is for people working on efficient diffusion inpainting and image editing tools. A reader already building one-step or mobile variants might pick up the component ideas or the dataset if the full paper supplies the missing checks.

It should go to peer review if the complete manuscript contains proper ablations and quantitative SAVP validation, because the direction addresses a real constraint even if the current evidence is thin.

Referee Report

2 major / 1 minor

Summary. The paper presents OSOR, a one-step diffusion inpainting model for effect-aware object removal that handles non-local effects (shadows, reflections) and inaccurate user masks. It introduces an occupancy-guided discriminator for stable single-step training, an alpha head leveraging pretrained diffusion knowledge for mask robustness, and a semantic-anchored verification pipeline (SAVP) to filter instruction-based triplets and curate the CORNE dataset of 280K verified pairs, along with new benchmarks AnimeEraseBench and TextEraseBench. The central claim is that OSOR achieves superior perceptual quality to strong multi-step diffusion baselines at 4×–30× faster inference.

Significance. If the performance and stability claims hold with rigorous validation, the work would be significant for enabling interactive, edge-deployable object removal by reducing diffusion steps while addressing real-world effect modeling and mask issues. The dataset curation approach and architectural components for one-step stability could influence efficient generative editing methods, though the current evidence base is limited to high-level descriptions without metrics or ablations.

major comments (2)

[Abstract] Abstract / SAVP description: The headline claims of better perceptual quality and one-step stability rest on effect-aware supervision from the SAVP-curated CORNE dataset, yet the manuscript provides no quantitative validation of SAVP (e.g., filter precision/recall, human agreement on retained non-local effects, or ablation of OSOR trained on raw vs. filtered triplets). This is load-bearing for both the occupancy-guided discriminator and alpha head contributions.
[Abstract] Abstract: Performance gains are stated without any reported metrics (FID, LPIPS, user studies), baseline details, or experimental setup, preventing verification of the 4×–30× speedup and quality superiority over multi-step models.
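The filter precision and recall requested in the first major comment could be estimated from a human-audited sample of triplets. The sketch below is a minimal illustration under that assumption: `filter_precision_recall` is a hypothetical helper, and the audit labels are made up for the example, not taken from the paper.

```python
def filter_precision_recall(kept, is_good):
    """Precision and recall of a data filter against human judgements.

    kept[i]    -- True if the filter retained sample i
    is_good[i] -- True if a human judged sample i a valid removal pair
    """
    if len(kept) != len(is_good):
        raise ValueError("inputs must have the same length")
    tp = sum(k and g for k, g in zip(kept, is_good))
    fp = sum(k and not g for k, g in zip(kept, is_good))
    fn = sum(g and not k for k, g in zip(kept, is_good))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up audit of six triplets, for illustration only.
kept = [True, True, True, False, False, True]
is_good = [True, True, False, True, False, True]
precision, recall = filter_precision_recall(kept, is_good)
print(precision, recall)  # 0.75 0.75
```

Precision measures how clean the retained set is, which bears on the supervision quality the comment calls load-bearing. Recall measures how much valid data the filter throws away.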

minor comments (1)

[Abstract] Abstract: The description of the alpha head as adding 'minimal overhead' would benefit from a parameter/FLOPs comparison to the base diffusion model.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and detailed review. We address each major comment below, indicating where revisions will be made to strengthen the manuscript.

Referee: [Abstract] Abstract / SAVP description: The headline claims of better perceptual quality and one-step stability rest on effect-aware supervision from the SAVP-curated CORNE dataset, yet the manuscript provides no quantitative validation of SAVP (e.g., filter precision/recall, human agreement on retained non-local effects, or ablation of OSOR trained on raw vs. filtered triplets). This is load-bearing for both the occupancy-guided discriminator and alpha head contributions.

Authors: We acknowledge that the manuscript would benefit from explicit quantitative validation of SAVP. The current description focuses on the pipeline design and its role in curating CORNE, with downstream performance serving as indirect evidence. In the revision we will add an ablation comparing OSOR trained on raw versus SAVP-filtered triplets, along with verification statistics such as inter-annotator agreement on retained non-local effects. This directly supports the load-bearing role of SAVP for the other components.

revision: yes

Referee: [Abstract] Abstract: Performance gains are stated without any reported metrics (FID, LPIPS, user studies), baseline details, or experimental setup, preventing verification of the 4×–30× speedup and quality superiority over multi-step models.

Authors: The abstract is written as a concise summary and therefore omits specific numbers and setup details. The full manuscript contains a complete experiments section that reports FID, LPIPS, user-study results, the exact multi-step diffusion baselines, and the inference-time measurements supporting the 4×–30× speedup claim. To improve immediate verifiability we will expand the abstract with the key quantitative highlights while retaining brevity.

revision: partial
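The inter-annotator agreement promised in the first response is commonly reported as Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below shows that statistic for two annotators. The choice of kappa is an assumption (the rebuttal does not name a statistic), and the labels are invented for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        count_a[label] * count_b[label]
        for label in count_a.keys() | count_b.keys()
    ) / (n * n)
    if expected == 1.0:  # both annotators used one identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Invented labels: 1 = "effect correctly removed", 0 = "not removed".
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 1]
annotator_2 = [1, 0, 0, 0, 1, 0, 1, 1]
print(cohen_kappa(annotator_1, annotator_2))  # 0.75
```

Here the annotators agree on 7 of 8 items (0.875) while chance agreement is 0.5, giving kappa = (0.875 - 0.5) / (1 - 0.5) = 0.75.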

Circularity Check

0 steps flagged

No circularity: claims rest on new architectural components and empirical validation

The paper introduces three new components—an occupancy-guided discriminator, an alpha head leveraging pretrained diffusion knowledge, and the SAVP filtering pipeline used to curate the CORNE dataset—without presenting equations, derivations, or first-principles predictions. Performance claims (perceptual quality and speed) are evaluated via external comparisons to multi-step baselines on annotated benchmarks, not by construction from fitted parameters or self-referential definitions. No load-bearing self-citations or uniqueness theorems are invoked in the provided text; the derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; all components are presented as novel engineering contributions without detailed mathematical assumptions.

Reference graph

Works this paper leans on

47 extracted references · 21 canonical work pages · 1 internal anchor

[1]

In: CVPR

Avrahami, O., Lischinski, D., Fried, O.: Blended diffusion for text-driven editing of natural images. In: CVPR. pp. 18187–18197 (2022)

2022
[2]

Black Forest Labs: Flux.https://github.com/black-forest-labs/flux(2024)

2024
[3]

In: CVPR

Brooks, T., Holynski, A., Efros, A.A.: Instructpix2pix: Learning to follow image editing instructions. In: CVPR. pp. 18392–18402 (2023)

2023
[4]

Black, and Otmar Hilliges

Cherti, M., Beaumont, R., Wightman, R., Wortsman, M., Ilharco, G., Gordon, C., Schuhmann, C., Schmidt, L., Jitsev, J.: Reproducible scaling laws for contrastive language-image learning. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023. pp. 2818–2829 (2023). https://doi.org/10.1109/CVPR5...

work page doi:10.1109/cvpr52729.2023.00276 2023
[5]

In: CVPR

Dong, Q., Cao, C., Fu, Y .: Incremental transformer structure enhanced image inpainting with masking positional encoding. In: CVPR. pp. 11348–11358 (2022)

2022
[6]

In: Globersons, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J.M., Zhang, C

Ekin, Y ., Yildirim, A.B., Caglar, E.E., Erdem, A., Erdem, E., Dundar, A.: Clipaway: Harmonizing focused embeddings for removing objects via diffusion models. In: Globersons, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J.M., Zhang, C. (eds.) Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Proce...

2024
[7]

In: NeurIPS

Goodfellow, I.J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A.C., Bengio, Y .: Generative adversarial nets. In: NeurIPS. pp. 2672–2680 (2014)

2014
[8]

In: Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans trained by a two time- scale update rule converge to a local nash equilibrium. In: Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA. pp. 6626–6637 (2017), https://proceed...

2017
[9]

In: NeurIPS

Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: NeurIPS. pp. 6840–6851 (2020)

2020
[10]

In: The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022 (2022),https://openreview.net/forum?id=nZeVKeeFYf9

Hu, E.J., Shen, Y ., Wallis, P., Allen-Zhu, Z., Li, Y ., Wang, S., Wang, L., Chen, W.: Lora: Low-rank adaptation of large language models. In: The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022 (2022),https://openreview.net/forum?id=nZeVKeeFYf9

2022
[11]

In: Computer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Part LXVIII

Hu, X., Peng, X., Luo, D., Ji, X., Peng, J., Jiang, Z., Zhang, J., Jin, T., Wang, C., Ji, R.: Diffumatting: Synthesizing arbitrary objects with matting-level annotation. In: Computer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Part LXVIII. vol. 15126, pp. 396–413 (2024). https://doi.org/10.1007/9...

work page doi:10.1007/978-3-031-73113-6_23 2024
[12]

ACM TOG36(4), 107:1–107:14 (2017) 11 APREPRINT- JUNE29, 2026

Iizuka, S., Simo-Serra, E., Ishikawa, H.: Globally and locally consistent image completion. ACM TOG36(4), 107:1–107:14 (2017) 11 APREPRINT- JUNE29, 2026

2017
[13]

In: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017

Isola, P., Zhu, J., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017. pp. 5967–5976 (2017). https://doi.org/10.1109/CVPR.2017.632

work page doi:10.1109/cvpr.2017.632 2017
[14]

URL https://proceedings.mlr

Jayasumana, S., Ramalingam, S., Veit, A., Glasner, D., Chakrabarti, A., Kumar, S.: Rethinking FID: towards a better evaluation metric for image generation. In: IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024. pp. 9307–9315 (2024). https://doi.org/10.1109/CVPR52733.2024.00889

work page doi:10.1109/cvpr52733.2024.00889 2024
[15]

Showui: One vision-language- action model for GUI visual agent

Jiang, L., Wang, Z., Bao, J., Zhou, W., Chen, D., Shi, L., Chen, D., Li, H.: Smarteraser: Remove anything from images using masked-region guidance. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025. pp. 24452–24462 (2025). https://doi.org/10.1109/CVPR52734.2025.02277

work page doi:10.1109/cvpr52734.2025.02277 2025
[17]

IEEE Trans

Levin, A., Lischinski, D., Weiss, Y .: A closed-form solution to natural image matting. IEEE Trans. Pattern Anal. Mach. Intell.30(2), 228–242 (2008). https://doi.org/10.1109/TPAMI.2007.1177

work page doi:10.1109/tpami.2007.1177 2008
[18]

In: CVPR

Li, W., Lin, Z., Zhou, K., Qi, L., Wang, Y ., Jia, J.: Mat: Mask-aware transformer for large hole image inpainting. In: CVPR. pp. 10748–10758 (2022)

2022
[19]

Li, X., Yang, Z., Quan, R., Yang, Y .: DRIP: unleashing diffusion priors for joint foreground and al- pha prediction in image matting. In: Advances in Neural Information Processing Systems 38: An- nual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024 (2024), http://papers.nips.cc/paper_f...

2024
[20]

ACM Trans

Lin, X., Yu, F., Hu, J., You, Z., Shi, W., Ren, J.S., Gu, J., Dong, C.: Harnessing diffusion-yielded score priors for image restoration. ACM Trans. Graph.44(6), 208:1–208:21 (2025). https://doi.org/10.1145/3763346

work page doi:10.1145/3763346 2025
[21]

In: Computer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Part XLVII

Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Jiang, Q., Li, C., Yang, J., Su, H., Zhu, J., Zhang, L.: Grounding DINO: marrying DINO with grounded pre-training for open-set object detection. In: Computer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Part XLVII. vol. 15105, pp. 38–55 (202...

work page doi:10.1007/978-3-031-72970-6_3 2024
[22]

Showui: One vision-language- action model for GUI visual agent

Liu, Y ., Zhou, H., Cui, B., Shang, W., Lin, R.: Erase diffusion: Empowering object removal through calibrating diffusion pathways. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025. pp. 2418–2427 (2025). https://doi.org/10.1109/CVPR52734.2025.00231

work page doi:10.1109/cvpr52734.2025.00231 2025
[23]

Price, I., Sanchez-Gonzalez, A., Alet, F., Ewalds, T., El- Kadi, A., Stott, J., Mohamed, S., Battaglia, P

Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022. pp. 11966–11976 (2022). https://doi.org/10.1109/CVPR52688.2022.01167

work page doi:10.1109/cvpr52688.2022.01167 2022
[24]

In: CVPR

Lugmayr, A., Danelljan, M., Romero, A., Yu, F., Timofte, R., Van Gool, L.: Repaint: Inpainting using denoising diffusion probabilistic models. In: CVPR. pp. 11451–11461 (2022)

2022
[26]

In: The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022 (2022)

Meng, C., He, Y ., Song, Y ., Song, J., Wu, J., Zhu, J., Ermon, S.: Sdedit: Guided image synthesis and editing with stochastic differential equations. In: The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022 (2022)

2022
[27]

(eds.) Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018

Mescheder, L.M., Geiger, A., Nowozin, S.: Which training methods for gans do actually converge? In: Dy, J.G., Krause, A. (eds.) Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018. vol. 80, pp. 3478–3487 (2018), http: //proceedings.mlr.press/v80/mescheder18a.html

2018
[28]

In: CVPR

Pathak, D., Krähenbühl, P., Donahue, J., Darrell, T., Efros, A.A.: Context encoders: Feature learning by inpainting. In: CVPR. pp. 2536–2544 (2016)

2016
[29]

In: The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024 (2024)

Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: SDXL: improving latent diffusion models for high-resolution image synthesis. In: The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024 (2024)

2024
[30]

In: Christiansen, H

Porter, T.K., Duff, T.: Compositing digital images. In: Christiansen, H. (ed.) Proceedings of the 11th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH 1984, Minneapolis, Minnesota, USA, July 23-27, 1984. pp. 253–259 (1984). https://doi.org/10.1145/800031.808606 12 APREPRINT- JUNE29, 2026

work page doi:10.1145/800031.808606 1984
[31]

In: Meila, M., Zhang, T

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual E...

2021
[32]

In: The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025 (2025),https://openreview.net/forum?id=Ha6RTeWMd0

Ravi, N., Gabeur, V ., Hu, Y ., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., Mintun, E., Pan, J., Alwala, K.V ., Carion, N., Wu, C., Girshick, R.B., Dollár, P., Feichtenhofer, C.: SAM 2: Segment anything in images and videos. In: The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, Apri...

2025
[33]

In: CVPR

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR. pp. 10674–10685 (2022)

2022
[34]

In: 33rd British Machine Vision Conference 2022, BMVC 2022, London, UK, November 21-24, 2022

Sagong, M., Yeo, Y ., Jung, S., Ko, S.: RORD: A real-world object removal dataset. In: 33rd British Machine Vision Conference 2022, BMVC 2022, London, UK, November 21-24, 2022. p. 542 (2022)

2022
[35]

In: The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022 (2022), https:// openreview.net/forum?id=TIdIXIpzhoI

Salimans, T., Ho, J.: Progressive distillation for fast sampling of diffusion models. In: The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022 (2022), https:// openreview.net/forum?id=TIdIXIpzhoI

2022
[36]

In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G

Sauer, A., Lorenz, D., Blattmann, A., Rombach, R.: Adversarial diffusion distillation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds.) Computer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Part LXXXVI. vol. 15144, pp. 87–103 (2024). https://doi.org/10.1007/97...

work page doi:10.1007/978-3-031-73016-0_6 2024
[37]

In: International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA

Song, Y ., Dhariwal, P., Chen, M., Sutskever, I.: Consistency models. In: International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA. vol. 202, pp. 32211–32252 (2023)

2023
[38]

In: AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25 - March 4, 2025, Philadelphia, PA, USA

Sun, W., Dong, X., Cui, B., Tang, J.: Attentive eraser: Unleashing diffusion model’s object removal poten- tial via self-attention redirection guidance. In: AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25 - March 4, 2025, Philadelphia, PA, USA. pp. 20734–20742 (2025). https://doi.org/10.1609/AAAI.V39I19.34285

work page doi:10.1609/aaai.v39i19.34285 2025
[39]

In: IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2022, Waikoloa, HI, USA, January 3-8, 2022

Suvorov, R., Logacheva, E., Mashikhin, A., Remizova, A., Ashukha, A., Silvestrov, A., Kong, N., Goka, H., Park, K., Lempitsky, V .: Resolution-robust large mask inpainting with fourier convolutions. In: IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2022, Waikoloa, HI, USA, January 3-8, 2022. pp. 3172–3182 (2022). https://doi.org/10.1...

work page doi:10.1109/w 2022
[40]

Wei, R., Yin, Z., Zhang, S., Zhou, L., Wang, X., Ban, C., Cao, T., Sun, H., He, Z., Liang, K., Ma, Z.: Omnieraser: Remove objects and their effects in images with paired video-frame data (2025), https://arxiv.org/abs/ 2501.07397

arXiv 2025
[41]

In: Computer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Part LXXVII

Winter, D., Cohen, M., Fruchter, S., Pritch, Y ., Rav-Acha, A., Hoshen, Y .: Objectdrop: Bootstrapping counter- factuals for photorealistic object removal and insertion. In: Computer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Part LXXVII. vol. 15135, pp. 112–129 (2024)

2024
[42]

In: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017

Xu, N., Price, B.L., Cohen, S., Huang, T.S.: Deep image matting. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017. pp. 311–320 (2017). https://doi.org/10.1109/CVPR.2017.41

work page doi:10.1109/cvpr.2017.41 2017
[43]

URL https://proceedings.mlr

Yin, T., Gharbi, M., Zhang, R., Shechtman, E., Durand, F., Freeman, W.T., Park, T.: One-step diffusion with distri- bution matching distillation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, W A, USA, June 16-22, 2024. pp. 6613–6623 (2024). https://doi.org/10.1109/CVPR52733.2024.00632

work page doi:10.1109/cvpr52733.2024.00632 2024
[44]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

Yu, J., Lin, Z., Yang, J., Shen, X., Lu, X., Huang, T.S.: Free-form image inpainting with gated convolution. In: 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019. pp. 4470–4479 (2019). https://doi.org/10.1109/ICCV .2019.00457

work page doi:10.1109/iccv 2019
[45]

Yu, Y., Zeng, Z., Zheng, H., Luo, J.: Omnipaint: Mastering object-oriented editing via disentangled insertion-removal inpainting. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 17324–17334 (October 2025)

[46]

Zeng, Y., Fu, J., Chao, H., Guo, B.: Aggregated contextual transformations for high-resolution image inpainting. IEEE Trans. Vis. Comput. Graph. 29(7), 3266–3280 (2023). https://doi.org/10.1109/TVCG.2022.3156949

[47]

Zhang, L., Agrawala, M.: Transparent image layer diffusion using latent transparency. ACM Trans. Graph. 43(4), 100:1–100:15 (2024). https://doi.org/10.1145/3658150

[48]

Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018. pp. 586–595 (2018). https://doi.org/10.1109/CVPR.2018.00068

[49]

Zhao, J., Zhou, S., Wang, Z., Yang, P., Loy, C.C.: Objectclear: Complete object removal via object-effect attention. CoRR abs/2505.22636 (2025). https://doi.org/10.48550/ARXIV.2505.22636

Supplementary Material

A Supplementary Details of SAVP and CORNE

This section provides additional implementation details and dataset statistics...

