Common Inpainted Objects In-N-Out of Context
Pith reviewed 2026-05-19 11:34 UTC · model grok-4.3
The pith
A dataset of nearly 100,000 edited photos gives vision models clear examples of objects that fit or clash with their surroundings.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By systematically replacing objects in COCO images through diffusion-based inpainting, the authors create 97,722 unique images featuring both contextually coherent and inconsistent scenes. Each inpainted object is verified and categorized as in- or out-of-context through large vision language model assessments. This controlled testbed enables three tasks: fine-grained context reasoning that classifies objects based on three criteria, a novel Objects-from-Context prediction task at instance and clique levels, and context-enhanced fake detection on existing methods without fine-tuning.
What carries the argument
Diffusion-based inpainting to replace objects in original scenes, combined with large vision language model verification to label each result as contextually fitting or inconsistent.
If this is right
- Fine-grained classification becomes possible by scoring objects against three explicit context criteria.
- Models can predict which new objects naturally belong in a given scene, both for single instances and groups sharing semantic relations.
- Existing fake-image detectors gain accuracy when supplied with context signals from the dataset, without any additional training on the detectors themselves.
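To make the second point concrete, here is a minimal co-occurrence baseline for ranking which candidate objects plausibly belong in a scene. It is an illustrative sketch, not the authors' Objects-from-Context model: the function names `cooccurrence_scores` and `rank_candidates` and the toy scene lists are assumptions made for this example, and a real run would read category lists from COCO annotations.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_scores(scenes):
    """Estimate P(b present | a present) from per-scene category lists."""
    single = Counter()
    pair = Counter()
    for objs in scenes:
        cats = sorted(set(objs))  # count each category once per scene
        single.update(cats)
        for a, b in combinations(cats, 2):
            pair[(a, b)] += 1
            pair[(b, a)] += 1
    return {(a, b): c / single[a] for (a, b), c in pair.items()}


def rank_candidates(scene, candidates, scores):
    """Rank candidates by mean conditional co-occurrence with the scene's objects."""
    present = set(scene)

    def score(candidate):
        total = sum(scores.get((p, candidate), 0.0) for p in present)
        return total / max(len(present), 1)

    return sorted(candidates, key=score, reverse=True)


# Toy scenes (hypothetical data, for illustration only).
scenes = [
    ["fork", "knife", "dining table"],
    ["fork", "dining table", "cup"],
    ["surfboard", "person"],
    ["cup", "dining table"],
]
scores = cooccurrence_scores(scenes)
ranking = rank_candidates(["dining table", "fork"], ["cup", "surfboard"], scores)
print(ranking)
```

A baseline like this captures only the co-occurrence signal; it says nothing about whether an object's location or size fits the scene.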
Where Pith is reading between the lines
- The same editing approach could extend to video clips to study how context violations unfold over time.
- Training on these examples might improve model performance on real-world tasks like spotting anomalies in surveillance footage.
- The dataset offers a way to test whether context understanding helps models handle unusual but realistic scene variations beyond the training distribution.
Load-bearing premise
Large vision language models can reliably and accurately judge whether each inpainted object belongs in the surrounding scene or not.
What would settle it
A direct check would compare the large vision language model labels against human judgments on a random sample of the images, or measure whether models trained with these examples show clear gains on separate context-reasoning benchmarks.
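The label comparison described above could be summarized with raw agreement and Cohen's kappa, which corrects for agreement expected by chance. The sketch below is a plain-Python illustration under stated assumptions: the label strings `"in"` and `"out"` and the eight toy labels are hypothetical, and nothing here comes from the paper's own evaluation.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if both raters labeled independently at their own base rates.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    categories = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for eight sampled images (illustration only).
lvlm_labels = ["in", "in", "out", "out", "in", "out", "in", "out"]
human_labels = ["in", "in", "out", "in", "in", "out", "out", "out"]

raw_agreement = sum(a == b for a, b in zip(lvlm_labels, human_labels)) / len(lvlm_labels)
kappa = cohen_kappa(lvlm_labels, human_labels)
print(f"raw agreement = {raw_agreement:.2f}, kappa = {kappa:.2f}")
```

Reporting kappa alongside raw agreement matters when one class dominates, since high raw agreement can then arise by chance alone.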
Original abstract
We present Common Inpainted Objects In-N-Out of Context (COinCO), a novel dataset addressing the scarcity of out-of-context examples in existing vision datasets. By systematically replacing objects in COCO images through diffusion-based inpainting, we create 97,722 unique images featuring both contextually coherent and inconsistent scenes, enabling effective context learning. Each inpainted object is meticulously verified and categorized as in- or out-of-context through Large Vision Language Model assessments. We demonstrate three key tasks enabled by COinCO: (1) a fine-grained context reasoning approach that classifies objects as in- or out-of-context based on three criteria; (2) a novel Objects-from-Context prediction task that determines which new objects naturally belong in given scenes at both instance and clique level semantics, and (3) context-enhanced fake detection on state-of-the-art methods without fine-tuning. COinCO provides a controlled testbed with contextual variations, establishing a foundation for advancing context-aware visual understanding in computer vision, including image forensics. Code and dataset are available at https://co-in-co.github.io/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the COinCO dataset comprising 97,722 images generated by systematically inpainting objects from COCO scenes using diffusion models to produce both contextually coherent and inconsistent variants. Each inpainted object is categorized as in- or out-of-context via LVLM assessments, which the authors describe as meticulous verification. The dataset is then used to demonstrate three tasks: (1) fine-grained context reasoning that classifies objects according to three criteria, (2) Objects-from-Context prediction at instance and clique levels, and (3) context-enhanced fake detection on existing methods without fine-tuning.
Significance. A reliably labeled dataset of this scale could serve as a useful controlled testbed for context-aware vision models and image forensics. The construction method is straightforward and the three downstream tasks are well-motivated. However, the absence of any quantitative validation for the LVLM-generated labels means the central contribution rests on an unverified assumption; if that assumption does not hold, the utility of the entire resource and all reported experiments is undermined.
major comments (2)
- [§3.2] §3.2 (Verification procedure): The manuscript asserts that 'each inpainted object is meticulously verified and categorized as in- or out-of-context through Large Vision Language Model assessments,' yet reports no human agreement metrics, inter-rater reliability scores, error analysis, or ablation on prompt sensitivity. Because every downstream task (context reasoning classifier, Objects-from-Context prediction, and fake-detection experiments) depends directly on these labels, the lack of validation constitutes a load-bearing gap.
- [§4.1 and §4.2] §4.1 and §4.2 (Context reasoning and Objects-from-Context tasks): Both tasks treat the LVLM-derived in/out-of-context labels as ground truth for training and evaluation. Without reported label accuracy or noise analysis, the quantitative improvements claimed over baselines cannot be interpreted as evidence of improved context understanding; they may simply reflect propagation of LVLM biases.
minor comments (3)
- [Abstract and §3.1] The abstract and §3.1 should explicitly state the total number of unique source COCO images used and the average number of inpainted objects per image to allow readers to assess diversity.
- [Figure 2] Figure 2 (example images) would benefit from clearer annotation of which object was inpainted and its assigned context label; current captions are ambiguous.
- [§3.2] A brief discussion of known LVLM failure modes on fine-grained scene semantics (e.g., object affordance or spatial relations) should be added to §3.2 to contextualize the verification choice.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our paper. We address the major concerns regarding the validation of the LVLM labels below.
- Referee: [§3.2] §3.2 (Verification procedure): The manuscript asserts that 'each inpainted object is meticulously verified and categorized as in- or out-of-context through Large Vision Language Model assessments,' yet reports no human agreement metrics, inter-rater reliability scores, error analysis, or ablation on prompt sensitivity. Because every downstream task (context reasoning classifier, Objects-from-Context prediction, and fake-detection experiments) depends directly on these labels, the lack of validation constitutes a load-bearing gap.
  Authors: We agree that providing quantitative validation for the LVLM labels would strengthen the paper. The current version relies on the LVLM's capability for this task without reporting human agreement. In the revised manuscript, we will add a section with human evaluation on a random sample of 500 images, reporting agreement rates between LVLM and human annotators, along with an error analysis. We will also include an ablation study on different prompts to show sensitivity.
  Revision: yes
- Referee: [§4.1 and §4.2] §4.1 and §4.2 (Context reasoning and Objects-from-Context tasks): Both tasks treat the LVLM-derived in/out-of-context labels as ground truth for training and evaluation. Without reported label accuracy or noise analysis, the quantitative improvements claimed over baselines cannot be interpreted as evidence of improved context understanding; they may simply reflect propagation of LVLM biases.
  Authors: We acknowledge this valid point. The improvements are demonstrated using the same label set for all methods, meaning that if there is bias, it affects baselines and our method similarly. However, to better address potential label noise, we will add a noise robustness analysis in the revision, such as injecting controlled noise into labels and observing performance. This will help interpret the results more carefully.
  Revision: yes
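The noise robustness analysis mentioned in the rebuttal could start from a helper that flips a controlled fraction of binary labels. The sketch below shows only the injection step and assumes labels are encoded as 0 (in-context) and 1 (out-of-context); the name `flip_labels`, the encoding, and the noise rates are illustrative choices, not details from the paper. Retraining and evaluating a model at each noise level is left out.

```python
import random


def flip_labels(labels, noise_rate, seed=0):
    """Return a copy of binary 0/1 labels with a fixed fraction flipped at random."""
    if not 0.0 <= noise_rate <= 1.0:
        raise ValueError("noise_rate must lie in [0, 1]")
    rng = random.Random(seed)  # local generator keeps the flips reproducible
    n_flip = round(noise_rate * len(labels))
    flip_idx = set(rng.sample(range(len(labels)), n_flip))
    return [1 - y if i in flip_idx else y for i, y in enumerate(labels)]


# Hypothetical clean labels: 100 items, balanced classes.
clean = [0, 1] * 50

changed_by_rate = {}
for rate in (0.0, 0.1, 0.2, 0.4):
    noisy = flip_labels(clean, rate)
    changed_by_rate[rate] = sum(a != b for a, b in zip(clean, noisy))
    # A full analysis would retrain and evaluate on `noisy` here.

print(changed_by_rate)
```

Flipping an exact count rather than sampling each label independently keeps the realized noise rate identical across seeds, which makes performance curves easier to compare.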
Circularity Check
No circularity detected in dataset construction methodology
The paper presents a dataset creation pipeline using diffusion inpainting on COCO images followed by LVLM-based categorization of in/out-of-context objects, then demonstrates three downstream tasks. No equations, fitted parameters, or derivations are present that reduce any claimed result to its own inputs by construction. The LVLM verification step is an empirical labeling process rather than a self-referential fit or self-citation chain. Self-citations, if any, are not load-bearing for the core contribution of the new dataset and tasks. The work is self-contained as a standard dataset paper with independent methodological content.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Diffusion-based inpainting can generate images that preserve scene coherence when objects are contextually appropriate.
- domain assumption: Large Vision Language Models can accurately classify objects as in- or out-of-context.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "By systematically replacing objects in COCO images through diffusion-based inpainting, we create 97,722 unique images... Each inpainted object is meticulously verified and categorized as in- or out-of-context through Large Vision Language Model assessments."
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "Our context reasoning is based on three fundamental principles regarding context: location, size, and co-occurrence."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- DiffusionPrint: Learning Generative Fingerprints for Diffusion-Based Inpainting Localization
  DiffusionPrint learns robust forensic feature maps via MoCo-style contrastive training on diffusion inpainting fingerprints, boosting localization accuracy by up to 28% when fused into existing IFL systems and general...