
arxiv: 2606.28094 · v1 · pith:HS5HXG3Knew · submitted 2026-06-26 · 💻 cs.CV · cs.AI

OSOR: One-Step Diffusion Inpainting for Effect-Aware Object Removal

Pith reviewed 2026-06-29 04:06 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords object removal · diffusion inpainting · one-step diffusion · effect-aware editing · mask robustness · image editing · dataset curation

The pith

OSOR performs effect-aware object removal in one diffusion step while handling imperfect masks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to show that object removal accounting for shadows, reflections and other non-local effects can be done reliably in a single diffusion step rather than dozens. Current diffusion methods deliver strong results but at high compute cost that blocks interactive or mobile use. OSOR adds an occupancy-guided discriminator to stabilize single-step training, an alpha head drawn from pretrained models to fix inaccurate masks, and a verification pipeline that scales up effect-aware training pairs to 280K. If the claim holds, high-quality removal becomes fast enough for real-time editing without precise user masks.

Core claim

OSOR introduces three components: an occupancy-guided discriminator that provides precise boundary supervision and stabilizes single-step diffusion training; an alpha head that uses pretrained diffusion knowledge to predict removal regions and tolerate imperfect masks; and a semantic-anchored verification pipeline (SAVP) that filters noisy triplets to generate effect-aware supervision at scale. Trained on the resulting 280K-pair CORNE dataset, the model surpasses multi-step diffusion baselines in perceptual quality at 4× to 30× faster inference.

What carries the argument

Occupancy-guided discriminator and alpha head that together enable stable single-step training and mask-robust, effect-aware inpainting.
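To make the role of the alpha head concrete, the sketch below contrasts blending with a hard user mask against blending with a soft alpha map. It assumes the standard compositing form `alpha * generated + (1 - alpha) * original`; the paper's exact blending equation is not reproduced in this summary, and the function name `blend` and all pixel values are hypothetical toy choices.

```python
from typing import List

Grid = List[List[float]]

def blend(original: Grid, generated: Grid, alpha: Grid) -> Grid:
    """Per-pixel composite: alpha * generated + (1 - alpha) * original."""
    return [
        [a * g + (1.0 - a) * o for o, g, a in zip(o_row, g_row, a_row)]
        for o_row, g_row, a_row in zip(original, generated, alpha)
    ]

# Toy 1 x 3 image: object pixel, shadow pixel, clean background (assumed values).
original = [[0.2, 0.2, 0.8]]
# Hypothetical model output in which both object and shadow are gone.
generated = [[0.8, 0.8, 0.8]]

hard_mask = [[1.0, 0.0, 0.0]]   # user mask covers the object only
soft_alpha = [[1.0, 0.7, 0.0]]  # assumed predicted alpha that also reaches the shadow

hard_result = blend(original, generated, hard_mask)
soft_result = blend(original, generated, soft_alpha)
```

With the hard mask the shadow pixel keeps its original dark value, while the soft alpha pulls it most of the way toward the generated background. That is the behavior an alpha head would need in order to tolerate a mask that misses the shadow.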

If this is right

  • Object removal becomes practical for interactive applications and edge devices.
  • Users no longer need accurate masks for good results.
  • Effect-aware training data can be produced automatically at large scale.
  • Single-step diffusion training generalizes to other removal tasks with non-local effects.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same single-step stabilization tricks could accelerate other diffusion editing operations.
  • Verification pipelines like SAVP may help curate paired data for additional generative tasks.
  • The method's success on anime and text benchmarks suggests it may transfer to stylized or text-heavy images.

Load-bearing premise

The semantic-anchored verification pipeline can reliably filter noisy instruction-based triplets into high-quality effect-aware supervision at scale.
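As a rough illustration of what such a filter has to decide, the sketch below checks whether the change between a before/after pair is localized and then fuses the changed region with an object segmentation to obtain a mask that also covers effects such as shadows. This is not SAVP itself: the thresholds, the union fusion rule, and every function name are assumptions made for illustration, and the semantic-alignment check is omitted entirely.

```python
from typing import List

Image = List[List[float]]
Mask = List[List[int]]

def difference_mask(before: Image, after: Image, threshold: float = 0.1) -> Mask:
    """Mark pixels whose absolute change exceeds a (hypothetical) threshold."""
    return [[1 if abs(b - a) > threshold else 0 for b, a in zip(b_row, a_row)]
            for b_row, a_row in zip(before, after)]

def is_localized(diff: Mask, max_changed_fraction: float = 0.5) -> bool:
    """Accept only edits that change a non-empty, limited share of the image."""
    total = sum(len(row) for row in diff)
    changed = sum(sum(row) for row in diff)
    return 0 < changed <= max_changed_fraction * total

def fuse(diff: Mask, segmentation: Mask) -> Mask:
    """Union of the changed region and the object segmentation (assumed rule)."""
    return [[1 if d or s else 0 for d, s in zip(d_row, s_row)]
            for d_row, s_row in zip(diff, segmentation)]

# Toy 1 x 4 pair: object, shadow, background, background.
before = [[0.9, 0.3, 0.5, 0.5]]
after = [[0.5, 0.5, 0.5, 0.5]]   # object and shadow both removed
segmentation = [[1, 0, 0, 0]]    # segmenter finds the object but not its shadow

diff = difference_mask(before, after)
keep_triplet = is_localized(diff)
effect_aware_mask = fuse(diff, segmentation)
```

In this toy case the segmentation alone plays the role of an object-core mask, and the fused mask additionally covers the shadow pixel.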

What would settle it

A side-by-side evaluation of perceptual scores and measured inference time for OSOR versus multi-step baselines on AnimeEraseBench or TextEraseBench would confirm or refute the quality and speed claims.
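The speed half of that comparison could be measured with a small harness like the one below. It is a minimal sketch, assuming wall-clock median latency is the quantity of interest; `one_step_model` and `multi_step_model` are stand-ins that do dummy CPU work, not the actual models.

```python
import statistics
import time
from typing import Callable, List

def median_latency(fn: Callable[[], object], warmup: int = 2, runs: int = 5) -> float:
    """Median wall-clock seconds per call, measured after a few warm-up calls."""
    for _ in range(warmup):
        fn()
    samples: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds

def fake_step() -> int:
    """Stand-in for one denoising step: a fixed chunk of CPU work."""
    return sum(range(20_000))

def one_step_model() -> None:
    fake_step()

def multi_step_model(steps: int = 20) -> None:
    for _ in range(steps):
        fake_step()

ratio = speedup(median_latency(multi_step_model), median_latency(one_step_model))
```

On a GPU the timed callable would also have to wait for the device to finish before the clock stops (in PyTorch, for example, by calling `torch.cuda.synchronize()`); otherwise the measurement captures only kernel launch time.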

Figures

Figures reproduced from arXiv: 2606.28094 by Chenxi Sun, Deyang Kong, Haotian Wu, Junhao He, Leilei Cao, Linfeng Zhang, Peike Yu, Qinming Zhou, Xiangheng Tang.

Figure 1: Comparison between OSOR and other methods. OSOR effectively removes object-associated effects, such as shadows, while running 10.6× faster than ObjectClear. A 1024 × 1024 image can be processed in under one second on a single NVIDIA A100 GPU. The average rank is computed across six benchmarks.
Figure 2: Overview of SAVP and CORNE. Starting from single-edit instruction triplets, SAVP verifies semantically aligned and localized differences, then fuses the validated difference region with promptable segmentation to form an effect-aware mask. It further derives object-core masks for Phase II incomplete-mask conditioning.
Figure 3: Two-phase training curriculum. Phase I adapts a diffusion inpainting backbone with hard latent blending and occupancy-guided patch supervision for boundary-consistent one-step removal. Phase II predicts a soft alpha map under incomplete-mask conditioning and performs adaptive blending to remove residual shadows and reflections beyond the provided mask.
Figure 4: Mask-derived patch targets for a four-scale discriminator. Left shows the input mask and its overlay on the shot image for visualization. Right compares three target constructions at each scale. HM uses nearest-neighbor downsampling. SM applies Gaussian smoothing after downsampling. OG uses area pooling to produce fractional occupancies. Differences grow on coarser grids.
Figure 5: Qualitative comparison of OSOR and existing methods on CORNE-Val and AnimeEraseBench.
Figure 6: Examples of incomplete conditioning masks.
Figure 7: Qualitative comparison of OSOR and existing methods on RORD-Val, RemovalBench and TextEraseBench.
Figure 8: Alpha compositing under imperfect masks.
Figure 9: Representative CORNE annotation cases. Each row shows the input image …
Figure 10: Overall architecture of the occupancy-guided multi-scale discriminator. A frozen feature trunk …
Figure 11: Structure of one trainable head $h_\xi^k$. Each head applies spectral-normalized convolution, LeakyReLU, BlurPool downsampling, and a final 1 × 1 convolution to produce a single-channel patch logit map. The packed alpha logits are then unpacked to the latent grid, $\ell_\theta = \mathrm{Unpack}(\ell_\theta^{\mathrm{pack}})$, $\hat{\alpha} = \sigma(\ell_\theta)$, so that the final alpha map is also defined at the latent resolution.
Figure 12: User scribble-guided removal examples. For each case, we show the input image with user scribble, the …
Figure 13: More qualitative comparisons of OSOR and existing methods on RemovalBench and CORNE-Val.
Figure 14: More qualitative comparisons of OSOR and existing methods on RORD-Val, AnimeEraseBench, and …
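
The Figure 4 caption contrasts hard targets from nearest-neighbor downsampling (HM) with fractional occupancies from area pooling (OG). The sketch below shows that difference on a toy mask. It is an illustration only: the function names are hypothetical, the nearest-neighbor variant samples the top-left pixel of each cell (one of several possible conventions), and the mask size is assumed to be divisible by the pooling factor.

```python
from typing import List

Mask = List[List[float]]

def nearest_downsample(mask: Mask, factor: int) -> Mask:
    """HM-style target: keep the top-left pixel of each factor x factor cell."""
    return [[mask[r][c] for c in range(0, len(mask[0]), factor)]
            for r in range(0, len(mask), factor)]

def area_pool(mask: Mask, factor: int) -> Mask:
    """OG-style target: mean of each cell, i.e. the fraction of masked pixels."""
    out: Mask = []
    for r in range(0, len(mask), factor):
        row = []
        for c in range(0, len(mask[0]), factor):
            cell = [mask[r + i][c + j] for i in range(factor) for j in range(factor)]
            row.append(sum(cell) / len(cell))
        out.append(row)
    return out

# Toy 4 x 4 mask with a diagonal boundary.
mask = [
    [0, 0, 0, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [1, 1, 1, 1],
]

hard_targets = nearest_downsample(mask, 2)  # [[0, 0], [0, 1]]
occupancy_targets = area_pool(mask, 2)      # [[0.0, 0.75], [0.75, 1.0]]
```

The two boundary cells are three-quarters covered by the mask, yet the hard target labels them as fully outside. Area pooling keeps that partial coverage, which is the kind of boundary information the caption says diverges further on coarser grids.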

Abstract

Real-world object removal is challenging due to two key difficulties: the target object's non-local effects, such as shadows and reflections, which are difficult to model, and the fact that user-provided masks are often inaccurate or incomplete. With billions of parameters and tens of denoising steps, diffusion-based models achieve strong removal performance at the expense of substantial computational cost, limiting their use in interactive applications and on edge devices. To address these challenges, we present OSOR (One-Step Object Removal), which simultaneously achieves efficient, effect-aware, and mask-robust object removal. Concretely, OSOR introduces: (1) an occupancy-guided discriminator for precise boundary supervision, enabling stable single-step diffusion training; (2) an alpha head that leverages knowledge from pretrained diffusion models to predict appropriate removal regions with minimal overhead, thereby handling imperfect masks; and (3) a semantic-anchored verification pipeline (SAVP) that filters noisy instruction-based triplets to produce effect-aware supervision at scale. Using SAVP, we curate CORNE, which contains 280K verified removal pairs, and further annotate AnimeEraseBench and TextEraseBench to evaluate performance on more complex removal tasks. Experiments show that OSOR surpasses strong multi-step diffusion baselines in perceptual quality while achieving $4\times$ to $30\times$ faster inference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents OSOR, a one-step diffusion inpainting model for effect-aware object removal that handles non-local effects (shadows, reflections) and inaccurate user masks. It introduces an occupancy-guided discriminator for stable single-step training, an alpha head leveraging pretrained diffusion knowledge for mask robustness, and a semantic-anchored verification pipeline (SAVP) to filter instruction-based triplets and curate the CORNE dataset of 280K verified pairs, along with new benchmarks AnimeEraseBench and TextEraseBench. The central claim is that OSOR achieves superior perceptual quality to strong multi-step diffusion baselines at 4×–30× faster inference.

Significance. If the performance and stability claims hold with rigorous validation, the work would be significant for enabling interactive, edge-deployable object removal by reducing diffusion steps while addressing real-world effect modeling and mask issues. The dataset curation approach and architectural components for one-step stability could influence efficient generative editing methods, though the current evidence base is limited to high-level descriptions without metrics or ablations.

major comments (2)
  1. [Abstract] Abstract / SAVP description: The headline claims of better perceptual quality and one-step stability rest on effect-aware supervision from the SAVP-curated CORNE dataset, yet the manuscript provides no quantitative validation of SAVP (e.g., filter precision/recall, human agreement on retained non-local effects, or ablation of OSOR trained on raw vs. filtered triplets). This is load-bearing for both the occupancy-guided discriminator and alpha head contributions.
  2. [Abstract] Abstract: Performance gains are stated without any reported metrics (FID, LPIPS, user studies), baseline details, or experimental setup, preventing verification of the 4×–30× speedup and quality superiority over multi-step models.
minor comments (1)
  1. [Abstract] Abstract: The description of the alpha head as adding 'minimal overhead' would benefit from a parameter/FLOPs comparison to the base diffusion model.
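
The filter precision and recall requested in the first major comment could be computed from a hand-labeled sample of triplets, roughly as sketched below. The labels here are invented toy data, and `precision_recall` is a hypothetical helper rather than anything taken from the paper.

```python
from typing import List, Tuple

def precision_recall(kept: List[bool], is_good: List[bool]) -> Tuple[float, float]:
    """Precision and recall of a filter against human quality labels.

    kept[i]    -- the filter retained triplet i
    is_good[i] -- a human judged triplet i to be a valid effect-aware pair
    """
    tp = sum(k and g for k, g in zip(kept, is_good))
    fp = sum(k and not g for k, g in zip(kept, is_good))
    fn = sum((not k) and g for k, g in zip(kept, is_good))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy audit of six triplets (invented labels).
kept = [True, True, True, False, False, True]
is_good = [True, True, False, True, False, True]

precision, recall = precision_recall(kept, is_good)
```

Precision indicates how clean the retained set is, and recall indicates how much usable data the filter discards. Both matter for a dataset whose scale is itself part of the claim.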

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and detailed review. We address each major comment below, indicating where revisions will be made to strengthen the manuscript.

  1. Referee: [Abstract] Abstract / SAVP description: The headline claims of better perceptual quality and one-step stability rest on effect-aware supervision from the SAVP-curated CORNE dataset, yet the manuscript provides no quantitative validation of SAVP (e.g., filter precision/recall, human agreement on retained non-local effects, or ablation of OSOR trained on raw vs. filtered triplets). This is load-bearing for both the occupancy-guided discriminator and alpha head contributions.

    Authors: We acknowledge that the manuscript would benefit from explicit quantitative validation of SAVP. The current description focuses on the pipeline design and its role in curating CORNE, with downstream performance serving as indirect evidence. In the revision we will add an ablation comparing OSOR trained on raw versus SAVP-filtered triplets, along with verification statistics such as inter-annotator agreement on retained non-local effects. This directly supports the load-bearing role of SAVP for the other components. revision: yes

  2. Referee: [Abstract] Abstract: Performance gains are stated without any reported metrics (FID, LPIPS, user studies), baseline details, or experimental setup, preventing verification of the 4×–30× speedup and quality superiority over multi-step models.

    Authors: The abstract is written as a concise summary and therefore omits specific numbers and setup details. The full manuscript contains a complete experiments section that reports FID, LPIPS, user-study results, the exact multi-step diffusion baselines, and the inference-time measurements supporting the 4×–30× speedup claim. To improve immediate verifiability we will expand the abstract with the key quantitative highlights while retaining brevity. revision: partial

Circularity Check

0 steps flagged

No circularity: claims rest on new architectural components and empirical validation


The paper introduces three new components—an occupancy-guided discriminator, an alpha head leveraging pretrained diffusion knowledge, and the SAVP filtering pipeline used to curate the CORNE dataset—without presenting equations, derivations, or first-principles predictions. Performance claims (perceptual quality and speed) are evaluated via external comparisons to multi-step baselines on annotated benchmarks, not by construction from fitted parameters or self-referential definitions. No load-bearing self-citations or uniqueness theorems are invoked in the provided text; the derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; all components are presented as novel engineering contributions without detailed mathematical assumptions.

pith-pipeline@v0.9.1-grok · 5789 in / 1090 out tokens · 46059 ms · 2026-06-29T04:06:29.047972+00:00 · methodology

