
arxiv: 2504.18361 · v2 · pith:VIZSQHCXnew · submitted 2025-04-25 · 💻 cs.CV · cs.AI

COCO-Inpaint: A Benchmark for Detecting and Localizing Inpainting-Based Image Manipulations

Pith reviewed 2026-05-22 17:46 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords image manipulation detection · inpainting forgery · localization benchmark · IMDL evaluation · COCO-Inpaint · image forensics · forgery localization

The pith

The COCO-Inpaint benchmark supplies 238,302 inpainted images from six inpainting models to test forgery detection and localization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces COCO-Inpaint to fill the gap in benchmarks for detecting inpainting-based image manipulations. It generates high-quality samples using six state-of-the-art models and four mask strategies, including optional text guidance, to create diverse and semantically rich cases. The design emphasizes intrinsic inconsistencies between inpainted and authentic regions instead of obvious semantic mismatches. A standard evaluation protocol with three metrics then measures how existing detection methods perform and where they encounter difficulties.

Core claim

We present COCO-Inpaint, a comprehensive benchmark with high-quality inpainting samples generated by six state-of-the-art inpainting models, diverse generation scenarios enabled by four mask generation strategies with optional text guidance, and large-scale coverage of 238,302 inpainted images with rich semantic diversity. The benchmark is constructed to highlight intrinsic inconsistencies between inpainted and authentic regions rather than superficial semantic artifacts such as object shapes. We further establish a rigorous evaluation protocol with three standard metrics to benchmark existing IMDL methods and reveal current trends and challenges.

What carries the argument

The COCO-Inpaint dataset: a controlled set of inpainted images, constructed to highlight intrinsic inconsistencies, on which image manipulation detection and localization (IMDL) methods are evaluated.

If this is right

  • Existing IMDL methods can be directly compared on inpainting manipulations using a shared large-scale test set.
  • Detection approaches must address intrinsic region inconsistencies rather than relying on semantic cues alone.
  • The four mask strategies allow testing of robustness across different editing patterns.
  • Large semantic diversity ensures evaluations cover varied real-world image content.
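The paper's Figure 2 caption names Random Polygon as one of the four mask strategies. The sketch below shows one plausible way such a mask could be produced, not the authors' procedure: the function names, the vertex count, the radius range, and the jittered-angle construction are all assumptions made for illustration. It uses only the Python standard library.

```python
import math
import random


def random_polygon(width, height, n_vertices=6, seed=None):
    """Return vertices of a random polygon centred in the image (illustrative only)."""
    if n_vertices < 4:
        raise ValueError("n_vertices must be at least 4 for this sketch")
    rng = random.Random(seed)
    cx, cy = width / 2, height / 2
    max_r = min(width, height) / 2 - 1
    verts = []
    for i in range(n_vertices):
        # Evenly spaced angles with jitter keep the vertices ordered around the centre.
        angle = 2 * math.pi * (i + rng.uniform(0.0, 0.5)) / n_vertices
        radius = rng.uniform(0.3 * max_r, max_r)
        verts.append((cx + radius * math.cos(angle), cy + radius * math.sin(angle)))
    return verts


def point_in_polygon(x, y, verts):
    """Even-odd ray-casting test for a point against a polygon."""
    inside = False
    n = len(verts)
    for i in range(n):
        x1, y1 = verts[i]
        x2, y2 = verts[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def polygon_mask(width, height, verts):
    """Rasterize the polygon: 1 marks pixels to inpaint, 0 marks pixels to keep."""
    return [
        [1 if point_in_polygon(x + 0.5, y + 0.5, verts) else 0 for x in range(width)]
        for y in range(height)
    ]


def mask_area_ratio(mask):
    """Fraction of pixels covered by the mask."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


verts = random_polygon(32, 32, seed=0)
mask = polygon_mask(32, 32, verts)
print(round(mask_area_ratio(mask), 3))
```

Because the polygon's edges follow no object contour, a mask of this kind carries little semantic shape information, which is the property the benchmark design is said to aim for.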

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The benchmark could guide creation of inpainting models that deliberately reduce detectable inconsistencies.
  • Similar controlled generation approaches might apply to building test sets for other manipulation types.
  • Text-guided inpainting cases open questions about how language conditioning affects forensic traces.

Load-bearing premise

The inpainted regions generated by the six models contain intrinsic inconsistencies representative of real-world manipulations that standard metrics can expose in current detection methods.

What would settle it

Running the existing IMDL methods on the full COCO-Inpaint set and finding that they reach near-perfect scores on all three metrics would show the benchmark fails to reveal meaningful limitations.

Figures

Figures reproduced from arXiv: 2504.18361 by Haozhen Yan, Huijia Zhu, Jiahui Zhan, Jianfu Zhang, Jun Lan, Suning Lang, Yan Hong, Yikun Ji.

Figure 1: Comparison of IMDL model performance on cross …
Figure 2: Visualization of the COCO-Inpaint dataset. Mask1, Mask2, Mask3, and Mask4 represent Random Polygon, …
Figure 3: Visualization results of the IMDL models on the …
Original abstract

Recent advances in image manipulation have enabled highly photorealistic content generation, but also lowered the barrier to arbitrary editing, raising concerns about multimedia authenticity and security. Existing Image Manipulation Detection and Localization (IMDL) methods mainly target splicing or copy-move forgeries, while benchmarks for inpainting-based manipulations remain limited. To bridge this gap, we present COCO-Inpaint, a comprehensive benchmark specifically designed for inpainting detection and localization, with three key contributions: 1) High-quality inpainting samples generated by six state-of-the-art inpainting models, 2) Diverse generation scenarios enabled by four mask generation strategies with optional text guidance, and 3) Large-scale coverage of 238,302 inpainted images with rich semantic diversity. Our benchmark is constructed to highlight intrinsic inconsistencies between inpainted and authentic regions, rather than superficial semantic artifacts such as object shapes. We further establish a rigorous evaluation protocol with three standard metrics to benchmark existing IMDL methods and reveal current trends and challenges.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces COCO-Inpaint, a benchmark dataset of 238,302 inpainted images generated from COCO using six state-of-the-art inpainting models, four mask generation strategies (with optional text guidance), and a focus on creating samples that expose intrinsic inconsistencies (texture, lighting, blending) rather than semantic artifacts such as unnatural object shapes. It establishes an evaluation protocol using three standard IMDL metrics to benchmark existing detection and localization methods and highlight current limitations.

Significance. A well-validated inpainting-specific benchmark would address a clear gap in the IMDL literature, where most existing datasets target splicing or copy-move forgeries. If the generated samples demonstrably avoid superficial semantic artifacts and produce inconsistencies representative of real-world manipulations, the resource could enable more targeted progress on inpainting detection and provide reproducible baselines for future work.

major comments (2)
  1. [Abstract and §3 (Dataset Construction)] The central claim that the benchmark highlights intrinsic inconsistencies rather than superficial semantic artifacts (Abstract and §3) is load-bearing for the contribution but lacks supporting validation. No perceptual study, comparison against real inpainted images, or ablation quantifying boundary/shape artifacts introduced by the four mask strategies is reported; without this, it remains unclear whether standard IMDL metrics will isolate the intended intrinsic features or simply detect generation artifacts.
  2. [§4 (Evaluation Protocol)] The evaluation protocol (§4) applies three standard metrics to existing IMDL methods but does not include controls for selection bias or confirmation that the 238k images are balanced across semantic categories and mask types. This weakens the claim that the benchmark reveals representative trends and challenges in current methods.
minor comments (2)
  1. [§3] Clarify the exact overlap or differences between the four mask generation strategies and whether text guidance is applied uniformly or selectively across models.
  2. [§3] Provide more detail on the train/validation/test splits and any steps taken to prevent data leakage from the original COCO annotations.
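The leakage concern in the second minor comment can be checked mechanically once each inpainted image is traced back to its source COCO image. The following is a minimal sketch under that assumption; the split names and integer image IDs are hypothetical, and the paper's actual split definitions are not given here.

```python
from itertools import combinations


def split_overlaps(splits):
    """Return source-image IDs shared between any two splits.

    `splits` maps a split name to an iterable of source-image IDs.
    An empty result means no source image appears in more than one split.
    """
    overlaps = {}
    for (name_a, ids_a), (name_b, ids_b) in combinations(splits.items(), 2):
        shared = set(ids_a) & set(ids_b)
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps


# Hypothetical IDs: image 5 leaks from validation into test.
example_splits = {"train": [1, 2, 3], "val": [4, 5], "test": [5, 6]}
print(split_overlaps(example_splits))
```

Checking at the level of source images matters because several inpainted variants (different models or masks) of one photograph would otherwise count as distinct samples.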

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate where revisions will be made to strengthen the presentation of our benchmark.

Point-by-point responses
  1. Referee: [Abstract and §3 (Dataset Construction)] The central claim that the benchmark highlights intrinsic inconsistencies rather than superficial semantic artifacts (Abstract and §3) is load-bearing for the contribution but lacks supporting validation. No perceptual study, comparison against real inpainted images, or ablation quantifying boundary/shape artifacts introduced by the four mask strategies is reported; without this, it remains unclear whether standard IMDL metrics will isolate the intended intrinsic features or simply detect generation artifacts.

    Authors: We agree that explicit validation of the claim would strengthen the paper. Our mask generation strategies were intentionally designed to produce irregular, non-semantic boundaries (e.g., random scribbles, boundary perturbations, and text-guided regions) rather than complete object removal or unnatural shapes. The six SOTA inpainting models were selected precisely because they minimize visible blending artifacts. We will revise §3 to expand the description of each mask strategy with additional qualitative examples and a brief discussion of why these choices reduce semantic artifacts. We will also add a limitations paragraph acknowledging the absence of a formal perceptual study or real-world inpainting comparison, as no large-scale public dataset of verified real inpainted forgeries currently exists for direct benchmarking.

    revision: partial

  2. Referee: [§4 (Evaluation Protocol)] The evaluation protocol (§4) applies three standard metrics to existing IMDL methods but does not include controls for selection bias or confirmation that the 238k images are balanced across semantic categories and mask types. This weakens the claim that the benchmark reveals representative trends and challenges in current methods.

    Authors: The 238,302 images were generated by applying the four mask strategies uniformly to images from the COCO validation and test sets, which are already balanced across 80 semantic categories. To make this explicit, we will add summary statistics (e.g., histograms or tables) in §4 or the supplementary material showing the distribution of object categories, mask area ratios, and mask types across the full benchmark. This will confirm coverage and allow readers to assess potential selection effects.

    revision: yes
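The summary statistics promised in the second response amount to a tally over per-image metadata. A minimal sketch of such a tally follows, assuming each benchmark image has a record with `model` and `mask_type` fields; those field names, the placeholder model labels, and the max-to-min imbalance ratio are illustrative choices, not taken from the paper. The mask labels follow the Mask1 to Mask4 naming in the Figure 2 caption.

```python
from collections import Counter


def tally(records, key):
    """Count benchmark images per value of one metadata field."""
    return Counter(record[key] for record in records)


def imbalance_ratio(counts, expected_labels):
    """Largest count divided by smallest count over the expected labels.

    Returns infinity when an expected label has no images at all.
    """
    values = [counts.get(label, 0) for label in expected_labels]
    if min(values) == 0:
        return float("inf")
    return max(values) / min(values)


# Hypothetical metadata records; a real run would load one per benchmark image.
records = [
    {"model": "model_a", "mask_type": "Mask1"},
    {"model": "model_a", "mask_type": "Mask2"},
    {"model": "model_b", "mask_type": "Mask1"},
    {"model": "model_b", "mask_type": "Mask3"},
    {"model": "model_b", "mask_type": "Mask4"},
]

mask_counts = tally(records, "mask_type")
pair_counts = Counter((r["model"], r["mask_type"]) for r in records)
print(mask_counts)
print(imbalance_ratio(mask_counts, ["Mask1", "Mask2", "Mask3", "Mask4"]))
```

A ratio near 1 would support the claim of uniform coverage, while the pair counts expose model and mask combinations that are missing entirely.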

Circularity Check

0 steps flagged

No circularity: benchmark construction is self-contained dataset generation

Full rationale

The paper constructs COCO-Inpaint by applying six external SOTA inpainting models and four mask-generation strategies to COCO images, then defines an evaluation protocol using standard IMDL metrics. No equations, fitted parameters, or predictions are derived; the central claim is the existence and scale of the resulting 238k-image collection with its stated properties. The design choice to emphasize intrinsic inconsistencies is an input assumption rather than a result obtained from the paper's own outputs or self-citations. No load-bearing step reduces to a prior result by the same authors or by redefinition.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

No free parameters or invented physical entities. The work rests on standard computer-vision assumptions about image semantics and forgery detection metrics.

axioms (2)
  • domain assumption: Existing IMDL methods primarily target splicing or copy-move forgeries rather than inpainting.
    Stated in the abstract as motivation for the benchmark.
  • standard math: Standard metrics (precision, recall, F1 or equivalent) are appropriate for evaluating inpainting localization.
    Abstract refers to three standard metrics without further justification.
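For concreteness, pixel-level precision, recall, and F1 on binary localization masks can be computed as sketched below. The abstract does not name the three metrics, so treating them as these three is the ledger's assumption, not a confirmed detail of the paper's protocol; the function names and the convention of returning 0.0 for an undefined score are also illustrative.

```python
def confusion_counts(pred, truth):
    """Count true positives, false positives, and false negatives over all pixels."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t and not p:
                fn += 1
    return tp, fp, fn


def localization_scores(pred, truth):
    """Pixel-level precision, recall, and F1 for one predicted mask."""
    tp, fp, fn = confusion_counts(pred, truth)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy 3x3 masks: 1 marks an inpainted pixel, 0 an authentic one.
truth = [[0, 1, 1],
         [0, 1, 1],
         [0, 0, 0]]
pred = [[0, 0, 1],
        [0, 1, 1],
        [0, 1, 0]]
scores = localization_scores(pred, truth)
print(scores)  # {'precision': 0.75, 'recall': 0.75, 'f1': 0.75}
```

In the toy case three of the four inpainted pixels are found and one authentic pixel is wrongly flagged, so all three scores equal 0.75.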



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Multi-axis Analysis of Image Manipulation Localization

    cs.CV 2026-05 unverdicted novelty 6.0

    Introduces the AUDITS benchmark for multi-axis evaluation of image manipulation localization under domain shifts and other factors.
