arxiv: 2506.00721 · v2 · submitted 2025-05-31 · 💻 cs.CV · cs.LG

Common Inpainted Objects In-N-Out of Context

Pith reviewed 2026-05-19 11:34 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords context reasoning · out-of-context detection · inpainting · vision dataset · scene understanding · image forensics · diffusion models · object replacement

The pith

A dataset of nearly 100,000 edited photos gives vision models clear examples of objects that fit or clash with their surroundings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper creates a large collection of edited photos where objects have been swapped in or out of their usual settings. The goal is to give computer vision systems better examples for understanding what belongs in a scene and what does not. By using AI image editing to change objects in standard scene photos, the authors build both matching and mismatched versions. They check each change with vision-language models to label them correctly. This setup supports new ways to train models on context, such as classifying fit, predicting suitable objects, and spotting edited images more reliably.

Core claim

By systematically replacing objects in COCO images through diffusion-based inpainting, the authors create 97,722 unique images featuring both contextually coherent and inconsistent scenes. Each inpainted object is verified and categorized as in- or out-of-context through large vision language model (LVLM) assessments. This controlled testbed enables three tasks: fine-grained context reasoning that classifies objects based on three criteria, a novel Objects-from-Context prediction task at instance and clique levels, and context-enhanced fake detection applied to existing methods without fine-tuning.

What carries the argument

Diffusion-based inpainting to replace objects in original scenes, combined with large vision language model verification to label each result as contextually fitting or inconsistent.
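The control flow of this inpaint, verify, and label loop can be pictured with the minimal Python sketch below. It is an illustration only, not the authors' code: `build_example`, its callable parameters, and the `max_attempts` cap are hypothetical names introduced here, and the lambdas are stand-ins for the diffusion inpainting model, the success check, and the context judgment.

```python
from typing import Callable, Optional


def build_example(
    image: str,
    inpaint: Callable[[str, int], str],
    is_success: Callable[[str], bool],
    label_context: Callable[[str], str],
    max_attempts: int = 3,  # assumption: the retry budget is not stated in the source
) -> Optional[dict]:
    """Inpaint, verify, and label one image; return None if every attempt fails."""
    for attempt in range(max_attempts):
        candidate = inpaint(image, attempt)  # hypothetical stand-in for diffusion inpainting
        if is_success(candidate):  # hypothetical stand-in for the success check
            return {
                "image": candidate,
                "label": label_context(candidate),  # "in" or "out" of context
                "attempts": attempt + 1,
            }
    return None


# Stub components: the second attempt is the first one that "succeeds".
result = build_example(
    "coco_000001.jpg",
    inpaint=lambda img, k: f"{img}#try{k}",
    is_success=lambda img: img.endswith("try1"),
    label_context=lambda img: "out",
)
print(result)
```

Passing the three stages in as callables keeps the sketch independent of any particular inpainting or vision-language library.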

If this is right

  • Fine-grained classification becomes possible by scoring objects against three explicit context criteria.
  • Models can predict which new objects naturally belong in a given scene, both for single instances and groups sharing semantic relations.
  • Existing fake-image detectors gain accuracy when supplied with context signals from the dataset, without any additional training on the detectors themselves.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same editing approach could extend to video clips to study how context violations unfold over time.
  • Training on these examples might improve model performance on real-world tasks like spotting anomalies in surveillance footage.
  • The dataset offers a way to test whether context understanding helps models handle unusual but realistic scene variations beyond the training distribution.

Load-bearing premise

Large vision language models can reliably and accurately judge whether each inpainted object belongs in the surrounding scene or not.

What would settle it

A direct check would compare the large vision language model labels against human judgments on a random sample of the images, or measure whether models trained with these examples show clear gains on separate context-reasoning benchmarks.
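A minimal sketch of such a comparison, assuming each sampled image has one model label and one human label drawn from the same label set, could report raw agreement together with Cohen's kappa. The function name and the toy labels are illustrative assumptions, not data from the paper.

```python
from collections import Counter


def agreement_and_kappa(model_labels, human_labels):
    """Return (raw agreement, Cohen's kappa) for two aligned label lists."""
    if len(model_labels) != len(human_labels) or not model_labels:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(model_labels)
    # Observed agreement: fraction of items where both raters give the same label.
    observed = sum(m == h for m, h in zip(model_labels, human_labels)) / n
    # Chance agreement: product of the two raters' marginal label frequencies.
    model_freq = Counter(model_labels)
    human_freq = Counter(human_labels)
    expected = sum(model_freq[c] * human_freq[c] for c in model_freq) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; kappa is undefined.
        return observed, float("nan")
    return observed, (observed - expected) / (1.0 - expected)


# Toy labels, invented for illustration only.
model = ["in", "in", "out", "out", "in", "out", "in", "out"]
human = ["in", "in", "out", "in", "in", "out", "out", "out"]
observed, kappa = agreement_and_kappa(model, human)
print(observed, kappa)  # 0.75 0.5
```

Kappa is worth reporting alongside raw agreement because it discounts the agreement two raters would reach by chance given how often each uses each label.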

Figures

Figures reproduced from arXiv: 2506.00721 by Jin Sun, Ninghao Liu, Ruitong Sun, Tianze Yang, Tyson Jordan.

Figure 1
Figure 1: Which object is fake? Only one object per image is inpainted. Out-of-context inpainted objects are easier to identify. Answers are revealed at the bottom of this page. Using Stable Diffusion’s inpainting model [9], we replace exactly one object per COCO image. This selective approach allows us to maintain the broader scene context while introducing precise, controlled variations in object-scene relation…
Figure 2
Figure 2: Our COinCO pipeline. (a) For a given COCO image, an object is randomly replaced using Stable Diffusion inpainting. (b) Inpainting success is verified using object detection and an MLLM. Successes are added to the dataset, while failed cases are regenerated and retested. (c) Inpainted images are classified as in-context or out-of-context using the MLLM. (d) Instance-level (object category) and clique-level …
Figure 3
Figure 3: (a) Inpainting success rate for original-inpainted object pairs. Rows are classes of original …
Figure 4
Figure 4: Inpainting, fake detection, and objects-from-context results. Context reasoning responses …
Figure 5
Figure 5: Object-from-Context prediction. A red box is a query. The top row shows three examples …
Figure 6
Figure 6: Context enhancement results
Figure 7
Figure 7: Fake localization performance. We evaluate this context-enhancement approach under two settings. In the oracle setting, we use ground truth annotations to enhance predictions within the fake object’s mask region when it is out-of-context. This serves as an upper bound for performance gains. In the practical setting, the fake objects are unknown. We propose the use of Molmo to identify suspicious objects …
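The oracle setting in the Figure 7 caption can be sketched roughly as follows, assuming the detector outputs a per-pixel fake score in [0, 1] and the ground-truth object mask is binary. The additive `boost` rule and the function name are assumptions made for illustration; the source does not state the exact enhancement formula.

```python
def enhance_scores(scores, mask, out_of_context, boost=0.25):
    """Raise per-pixel fake scores inside the mask when the object is out of context.

    scores: 2D list of floats in [0, 1] from an existing fake detector.
    mask:   2D list of 0/1 values marking the object's region.
    boost:  hypothetical additive increase, clipped so scores stay at most 1.0.
    """
    if not out_of_context:
        # In-context objects leave the detector's prediction unchanged.
        return [row[:] for row in scores]
    return [
        [min(1.0, s + boost) if m else s for s, m in zip(score_row, mask_row)]
        for score_row, mask_row in zip(scores, mask)
    ]


scores = [[0.125, 0.5], [0.875, 0.25]]
mask = [[0, 1], [1, 0]]
enhanced = enhance_scores(scores, mask, out_of_context=True)
print(enhanced)  # [[0.125, 0.75], [1.0, 0.25]]
```

Because the detector itself is untouched, a post-hoc adjustment of this kind matches the "without fine-tuning" framing of the task.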
read the original abstract

We present Common Inpainted Objects In-N-Out of Context (COinCO), a novel dataset addressing the scarcity of out-of-context examples in existing vision datasets. By systematically replacing objects in COCO images through diffusion-based inpainting, we create 97,722 unique images featuring both contextually coherent and inconsistent scenes, enabling effective context learning. Each inpainted object is meticulously verified and categorized as in- or out-of-context through Large Vision Language Model assessments. We demonstrate three key tasks enabled by COinCO: (1) a fine-grained context reasoning approach that classifies objects as in- or out-of-context based on three criteria; (2) a novel Objects-from-Context prediction task that determines which new objects naturally belong in given scenes at both instance and clique level semantics, and (3) context-enhanced fake detection on state-of-the-art methods without fine-tuning. COinCO provides a controlled testbed with contextual variations, establishing a foundation for advancing context-aware visual understanding in computer vision, including image forensics. Code and dataset are available at https://co-in-co.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces the COinCO dataset comprising 97,722 images generated by systematically inpainting objects in COCO scenes using diffusion models to produce both contextually coherent and inconsistent variants. Each inpainted object is categorized as in- or out-of-context via large vision language model (LVLM) assessments, which the authors describe as meticulous verification. The dataset is then used to demonstrate three tasks: (1) fine-grained context reasoning that classifies objects according to three criteria, (2) Objects-from-Context prediction at instance and clique levels, and (3) context-enhanced fake detection on existing methods without fine-tuning.

Significance. A reliably labeled dataset of this scale could serve as a useful controlled testbed for context-aware vision models and image forensics. The construction method is straightforward and the three downstream tasks are well-motivated. However, the absence of any quantitative validation for the LVLM-generated labels means the central contribution rests on an unverified assumption; if that assumption does not hold, the utility of the entire resource and all reported experiments is undermined.

major comments (2)
  1. [§3.2] §3.2 (Verification procedure): The manuscript asserts that 'each inpainted object is meticulously verified and categorized as in- or out-of-context through Large Vision Language Model assessments,' yet reports no human agreement metrics, inter-rater reliability scores, error analysis, or ablation on prompt sensitivity. Because every downstream task (context reasoning classifier, Objects-from-Context prediction, and fake-detection experiments) depends directly on these labels, the lack of validation constitutes a load-bearing gap.
  2. [§4.1 and §4.2] §4.1 and §4.2 (Context reasoning and Objects-from-Context tasks): Both tasks treat the LVLM-derived in/out-of-context labels as ground truth for training and evaluation. Without reported label accuracy or noise analysis, the quantitative improvements claimed over baselines cannot be interpreted as evidence of improved context understanding; they may simply reflect propagation of LVLM biases.
minor comments (3)
  1. [Abstract and §3.1] The abstract and §3.1 should explicitly state the total number of unique source COCO images used and the average number of inpainted objects per image to allow readers to assess diversity.
  2. [Figure 2] Figure 2 (example images) would benefit from clearer annotation of which object was inpainted and its assigned context label; current captions are ambiguous.
  3. [§3.2] A brief discussion of known LVLM failure modes on fine-grained scene semantics (e.g., object affordance or spatial relations) should be added to §3.2 to contextualize the verification choice.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our paper. We address the major concerns regarding the validation of the LVLM labels below.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (Verification procedure): The manuscript asserts that 'each inpainted object is meticulously verified and categorized as in- or out-of-context through Large Vision Language Model assessments,' yet reports no human agreement metrics, inter-rater reliability scores, error analysis, or ablation on prompt sensitivity. Because every downstream task (context reasoning classifier, Objects-from-Context prediction, and fake-detection experiments) depends directly on these labels, the lack of validation constitutes a load-bearing gap.

    Authors: We agree that providing quantitative validation for the LVLM labels would strengthen the paper. The current version relies on the LVLM's capability for this task without reporting human agreement. In the revised manuscript, we will add a section with human evaluation on a random sample of 500 images, reporting agreement rates between LVLM and human annotators, along with an error analysis. We will also include an ablation study on different prompts to show sensitivity. revision: yes

  2. Referee: [§4.1 and §4.2] §4.1 and §4.2 (Context reasoning and Objects-from-Context tasks): Both tasks treat the LVLM-derived in/out-of-context labels as ground truth for training and evaluation. Without reported label accuracy or noise analysis, the quantitative improvements claimed over baselines cannot be interpreted as evidence of improved context understanding; they may simply reflect propagation of LVLM biases.

    Authors: We acknowledge this valid point. The improvements are demonstrated using the same label set for all methods, meaning that if there is bias, it affects baselines and our method similarly. However, to better address potential label noise, we will add a noise robustness analysis in the revision, such as injecting controlled noise into labels and observing performance. This will help interpret the results more carefully. revision: yes
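The controlled label-noise injection mentioned in the second response could look like the sketch below. It assumes binary "in"/"out" labels and flips an exact fraction of them with a seeded generator; `flip_labels` and its parameters are hypothetical names, not part of the paper or its code release.

```python
import random


def flip_labels(labels, noise_rate, seed=0):
    """Return a copy of labels with a fixed fraction of entries flipped."""
    if not 0.0 <= noise_rate <= 1.0:
        raise ValueError("noise_rate must be between 0 and 1")
    rng = random.Random(seed)  # seeded so each noise level is reproducible
    n_flip = round(noise_rate * len(labels))
    # Sample distinct positions so exactly n_flip labels change.
    flip_idx = set(rng.sample(range(len(labels)), n_flip))
    other = {"in": "out", "out": "in"}
    return [other[y] if i in flip_idx else y for i, y in enumerate(labels)]


labels = ["in"] * 50 + ["out"] * 50
noisy = flip_labels(labels, noise_rate=0.2, seed=1)
n_changed = sum(a != b for a, b in zip(labels, noisy))
print(n_changed)  # 20
```

Flipping an exact count rather than flipping each label independently keeps the realized noise level identical across seeds, which makes performance curves over noise rates easier to compare.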

Circularity Check

0 steps flagged

No circularity detected in dataset construction methodology

full rationale

The paper presents a dataset creation pipeline using diffusion inpainting on COCO images followed by LVLM-based categorization of in/out-of-context objects, then demonstrates three downstream tasks. No equations, fitted parameters, or derivations are present that reduce any claimed result to its own inputs by construction. The LVLM verification step is an empirical labeling process rather than a self-referential fit or self-citation chain. Self-citations, if any, are not load-bearing for the core contribution of the new dataset and tasks. The work is self-contained as a standard dataset paper with independent methodological content.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that diffusion inpainting produces sufficiently realistic images and that LVLM judgments provide reliable context labels without additional human validation.

axioms (2)
  • domain assumption Diffusion-based inpainting can generate images that preserve scene coherence when objects are contextually appropriate.
    Invoked in the description of creating coherent and inconsistent scenes.
  • domain assumption Large Vision Language Models can accurately classify objects as in- or out-of-context.
    Stated as the verification method for all 97,722 images.

pith-pipeline@v0.9.0 · 5726 in / 1271 out tokens · 41896 ms · 2026-05-19T11:34:43.105371+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. DiffusionPrint: Learning Generative Fingerprints for Diffusion-Based Inpainting Localization

    cs.CV 2026-04 unverdicted novelty 7.0

    DiffusionPrint learns robust forensic feature maps via MoCo-style contrastive training on diffusion inpainting fingerprints, boosting localization accuracy by up to 28% when fused into existing IFL systems and general...

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · cited by 1 Pith paper · 7 internal anchors

  1. [1]

    Visual objects in context

    Moshe Bar. Visual objects in context. Nature Reviews Neuroscience, 5(8):617–629, 2004

  2. [2]

    Context models and out-of-context objects

    Myung Jin Choi, Antonio Torralba, and Alan S Willsky. Context models and out-of-context objects. Pattern Recognition Letters, 33(7):853–862, 2012

  3. [3]

    Noise or signal: The role of image backgrounds in object recognition

    Kai Xiao, Logan Engstrom, Andrew Ilyas, and Aleksander Madry. Noise or signal: The role of image backgrounds in object recognition. arXiv preprint arXiv:2006.09994, 2020

  4. [4]

    Scene perception: Detecting and judging objects undergoing relational violations

    Irving Biederman, Robert J Mezzanotte, and Jan C Rabinowitz. Scene perception: Detecting and judging objects undergoing relational violations. Cognitive psychology, 14(2):143–177, 1982

  5. [5]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014

  6. [6]

    Coco-stuff: Thing and stuff classes in context

    Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. Coco-stuff: Thing and stuff classes in context. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1209–1218, 2018

  7. [7]

    Lvis: A dataset for large vocabulary instance segmentation

    Agrim Gupta, Piotr Dollar, and Ross Girshick. Lvis: A dataset for large vocabulary instance segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5356–5364, 2019

  8. [8]

    Referitgame: Referring to objects in photographs of natural scenes

    Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 787–798, 2014

  9. [9]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

  10. [10]

    Whole-body human pose estimation in the wild

    Sheng Jin, Lumin Xu, Jin Xu, Can Wang, Wentao Liu, Chen Qian, Wanli Ouyang, and Ping Luo. Whole-body human pose estimation in the wild. In European Conference on Computer Vision, pages 196–214. Springer, 2020

  11. [11]

    Microsoft COCO Captions: Data Collection and Evaluation Server

    Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015

  12. [12]

    COCO-Text: Dataset and Benchmark for Text Detection and Recognition in Natural Images

    Andreas Veit, Tomas Matera, Lukas Neumann, Jiri Matas, and Serge Belongie. Coco-text: Dataset and benchmark for text detection and recognition in natural images. arXiv preprint arXiv:1601.07140, 2016

  13. [13]

    Cd-coco: A versatile complex distorted coco database for scene-context-aware computer vision

    Ayman Beghdadi, Azeddine Beghdadi, Malik Mallem, Lotfi Beji, and Faouzi Alaya Cheikh. Cd-coco: A versatile complex distorted coco database for scene-context-aware computer vision. In 2023 11th European Workshop on Visual Information Processing (EUVIP), pages 1–6. IEEE, 2023

  14. [14]

    Coconut: Modernizing coco segmentation

    Xueqing Deng, Qihang Yu, Peng Wang, Xiaohui Shen, and Liang-Chieh Chen. Coconut: Modernizing coco segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21863–21873, 2024

  15. [15]

    Detecting out-of-context objects using graph context reasoning network

    Manoj Acharya, Anirban Roy, Kaushik Koneripalli, Susmit Jha, Christopher Kanan, and Ajay Divakaran. Detecting out-of-context objects using graph context reasoning network. In IJCAI, 2022

  16. [16]

    Context understanding in computer vision: A survey

    Xuan Wang and Zhigang Zhu. Context understanding in computer vision: A survey. Computer Vision and Image Understanding, 229:103646, 2023

  17. [17]

    Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

    Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al. Molmo and pixmo: Open weights and open data for state-of-the-art multimodal models. arXiv preprint arXiv:2409.17146, 2024

  18. [18]

    Visual instruction tuning, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023

  19. [19]

    Llava-next: Improved reasoning, ocr, and world knowledge, January 2024

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024. URL https://llava-vl.github.io/blog/2024-01-30-llava-next/

  20. [20]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

  21. [21]

    Unified language-vision pretraining in llm with dynamic discrete visual tokenization

    Yang Jin, Kun Xu, Kun Xu, Liwei Chen, Chao Liao, Jianchao Tan, Yadong Mu, et al. Unified language-vision pretraining in llm with dynamic discrete visual tokenization. In International Conference on Learning Representations, 2024

  22. [22]

    Image forgery detection

    Hany Farid. Image forgery detection. IEEE Signal processing magazine, 26(2):16–25, 2009

  23. [23]

    A comprehensive framework for image inpainting

    Aurélie Bugeau, Marcelo Bertalmío, Vicent Caselles, and Guillermo Sapiro. A comprehensive framework for image inpainting. IEEE transactions on image processing, 19(10):2634–2645, 2010

  24. [24]

    Generative adversarial networks

    Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Commun. ACM, 63(11):139–144, October 2020. ISSN 0001-0782. doi: 10.1145/3422622. URL https://doi.org/10.1145/3422622

  26. [26]

    Image-to-image translation with conditional adversarial networks

    Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134, 2017

  27. [27]

    Large Scale GAN Training for High Fidelity Natural Image Synthesis

    Andrew Brock. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018

  28. [28]

    A style-based generator architecture for generative adversarial networks

    Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4396–4405, 2019. doi: 10.1109/CVPR.2019.00453

  29. [29]

    Analyzing and improving the image quality of stylegan

    Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8110–8119, 2020

  30. [30]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020

  31. [31]

    De-fake: Detection and attribution of fake images generated by text-to-image generation models

    Zeyang Sha, Zheng Li, Ning Yu, and Yang Zhang. De-fake: Detection and attribution of fake images generated by text-to-image generation models. In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, pages 3418–3432, 2023

  32. [32]

    Zero-shot text-to-image generation

    Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International conference on machine learning, pages 8821–8831. PMLR, 2021

  33. [33]

    Adding conditional control to text-to-image diffusion models

    Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3836–3847, October 2023

  34. [34]

    Anydoor: Zero-shot object-level image customization

    Xi Chen, Lianghua Huang, Yu Liu, Yujun Shen, Deli Zhao, and Hengshuang Zhao. Anydoor: Zero-shot object-level image customization, 2024. URL https://arxiv.org/abs/2307.09481

  35. [35]

    Controlcom: Controllable image composition using diffusion model, 2023

    Bo Zhang, Yuxuan Duan, Jun Lan, Yan Hong, Huijia Zhu, Weiqiang Wang, and Li Niu. Controlcom: Controllable image composition using diffusion model, 2023. URL https://arxiv.org/abs/2308.10040

  36. [36]

    Zero-1-to-3: Zero-shot one image to 3d object, 2023

    Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3d object, 2023

  37. [37]

    Diffrelight: Diffusion-based facial performance relighting

    Mingming He, Pascal Clausen, Ahmet Levent Taşel, Li Ma, Oliver Pilarski, Wenqi Xian, Laszlo Rikker, Xueming Yu, Ryan Burgert, Ning Yu, et al. Diffrelight: Diffusion-based facial performance relighting. arXiv preprint arXiv:2410.08188, 2024

  38. [38]

    Neural gaffer: Relighting any object via diffusion, 2024

    Haian Jin, Yuan Li, Fujun Luan, Yuanbo Xiangli, Sai Bi, Kai Zhang, Zexiang Xu, Jin Sun, and Noah Snavely. Neural gaffer: Relighting any object via diffusion, 2024. URL https://arxiv.org/abs/2406.07520

  39. [39]

    Deep learning based computer generated face identification using convolutional neural network

    L Minh Dang, Syed Ibrahim Hassan, Suhyeon Im, Jaecheol Lee, Sujin Lee, and Hyeonjoon Moon. Deep learning based computer generated face identification using convolutional neural network. Applied Sciences, 8(12):2610, 2018

  40. [40]

    Fake faces identification via convolutional neural network

    Huaxiao Mo, Bolin Chen, and Weiqi Luo. Fake faces identification via convolutional neural network. In Proceedings of the 6th ACM workshop on information hiding and multimedia security, pages 43–47, 2018

  41. [41]

    Pscc-net: Progressive spatio-channel correlation network for image manipulation detection and localization

    Xiaohong Liu, Yaojie Liu, Jun Chen, and Xiaoming Liu. Pscc-net: Progressive spatio-channel correlation network for image manipulation detection and localization. IEEE Transactions on Circuits and Systems for Video Technology, 32(11):7505–7517, 2022

  42. [42]

    Learning jpeg compression artifacts for image manipulation detection and localization

    Myung-Joon Kwon, Seung-Hun Nam, In-Jae Yu, Heung-Kyu Lee, and Changick Kim. Learning jpeg compression artifacts for image manipulation detection and localization. International Journal of Computer Vision, 130(8):1875–1895, August 2022. doi: 10.1007/s11263-022-01617-5

  43. [43]

    Mantra-net: Manipulation tracing network for detection and localization of image forgeries with anomalous features

    Yue Wu, Wael AbdAlmageed, and Premkumar Natarajan. Mantra-net: Manipulation tracing network for detection and localization of image forgeries with anomalous features. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9543–9552, 2019

  44. [44]

    Trufor: Leveraging all-round clues for trustworthy image forgery detection and localization

    Fabrizio Guillaro, Davide Cozzolino, Avneesh Sud, Nicholas Dufour, and Luisa Verdoliva. Trufor: Leveraging all-round clues for trustworthy image forgery detection and localization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20606–20615, 2023

  45. [45]

    Genimage: A million-scale benchmark for detecting ai-generated image, 2023

    Mingjian Zhu, Hanting Chen, Qiangyu Yan, Xudong Huang, Guanyu Lin, Wei Li, Zhijun Tu, Hailin Hu, Jie Hu, and Yunhe Wang. Genimage: A million-scale benchmark for detecting ai-generated image, 2023

  46. [46]

    Cifake: Image classification and explainable identification of ai-generated synthetic images

    Jordan J. Bird and Ahmad Lotfi. Cifake: Image classification and explainable identification of ai-generated synthetic images, 2023. URL https://arxiv.org/abs/2303.14126

  47. [47]

    Tgif: Text-guided inpainting forgery dataset

    Hannes Mareen, Dimitrios Karageorgiou, Glenn Van Wallendael, Peter Lambert, and Symeon Papadopoulos. Tgif: Text-guided inpainting forgery dataset. In 2024 IEEE International Workshop on Information Forensics and Security (WIFS), pages 1–6. IEEE, 2024

  48. [48]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, June 2022

  49. [49]

    Ultralytics yolov8, 2023

    Glenn Jocher, Ayush Chaurasia, and Jing Qiu. Ultralytics yolov8, 2023. URL https://github.com/ultralytics/ultralytics

  50. [50]

    MMDetection: Open MMLab Detection Toolbox and Benchmark

    Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, Zheng Zhang, Dazhi Cheng, Chenchen Zhu, Tianheng Cheng, Qijie Zhao, Buyu Li, Xin Lu, Rui Zhu, Yue Wu, Jifeng Dai, Jingdong Wang, Jianping Shi, Wanli Ouyang, Chen Change Loy, and Dahua Lin. MMDetection: Open mmlab detection toolbox and b...

  51. [51]

    Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023

  52. [52]

    Bert: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pages 4171–4186, 2019

  53. [53]

    Contextual priming for object detection

    Antonio Torralba. Contextual priming for object detection. International journal of computer vision, 53: 169–191, 2003

  54. [54]

    Object co-occurrence serves as a contextual cue to guide and facilitate visual search in a natural viewing environment

    Stephen C Mack and Miguel P Eckstein. Object co-occurrence serves as a contextual cue to guide and facilitate visual search in a natural viewing environment. Journal of vision, 11(9):9–9, 2011

  55. [55]

    Deep learning for anomaly detection: A review

    Guansong Pang, Chunhua Shen, Longbing Cao, and Anton Van Den Hengel. Deep learning for anomaly detection: A review. ACM computing surveys (CSUR), 54(2):1–38, 2021