SynSpill: Improved Industrial Spill Detection With Synthetic Data

Aaditya Baranwal; Abdul Mueez; Guneet Bhatia; Jason Voelker; Shruti Vyas

arxiv: 2508.10171 · v1 · submitted 2025-08-13 · 💻 cs.CV · cs.ET


Pith reviewed 2026-05-18 22:25 UTC · model grok-4.3


keywords: synthetic data · spill detection · vision-language models · object detection · industrial safety · domain adaptation · parameter-efficient fine-tuning


The pith

Synthetic data lets vision-language models and object detectors perform comparably on industrial spill detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper addresses the scarcity of real data for industrial spill detection by building a synthetic data generation pipeline that produces the SynSpill dataset. The synthetic corpus is used for parameter-efficient fine-tuning of vision-language models (VLMs) and for improving detectors such as YOLO and DETR. Without synthetic data, the VLMs already generalize better to unseen spill scenarios than the detectors. Adding SynSpill yields marked gains for both model families and brings their performance to comparable levels. The result is a cost-effective route to detection systems in industrial settings where real spill examples are hard to obtain.

Core claim

The central discovery is that a high-quality synthetic data generation pipeline produces a corpus that supports parameter-efficient fine-tuning of vision-language models and boosts state-of-the-art object detectors. Without this synthetic data, VLMs generalize better to unseen spill scenarios than detectors. When the SynSpill dataset is used, both types of models achieve marked improvements and their performance becomes comparable, showing that synthetic data can bridge the domain gap in safety-critical applications.

What carries the argument

The SynSpill synthetic data generation pipeline, which creates representative images of industrial spills for effective model adaptation.

Load-bearing premise

The synthetic spill images are close enough to real ones that models trained on them work well on actual industrial footage without major mismatches or biases.

What would settle it

A large collection of real industrial spill images where the models trained with SynSpill show little or no accuracy gain over those trained only on limited real data or none at all.

Figures

Figures reproduced from arXiv: 2508.10171 by Aaditya Baranwal, Abdul Mueez, Guneet Bhatia, Jason Voelker, Shruti Vyas.

**Figure 1.** Comparative detection performance of competing methods on a real-world CCTV image of an industrial spill. We visualize and contrast the predictions of three models: (1) a Zero-Shot Qwen2.5-VL-32B baseline without adaptation, (2) a PEFT-adapted Qwen2.5-VL-32B using LoRA on synthetic and web-scraped public data, and (3) a fine-tuned RF-DETR Base model trained on the same hybrid dataset. The ground-truth annot…

**Figure 2.** Overview of the Industrial Spill Detection Framework. The system ingests live CCTV feeds alongside a user-defined text prompt. A Vision-Language Model (VLM), selected and fine-tuned via a chosen strategy, analyzes the input to detect and localize potential spills, triggering an alert when the detection confidence surpasses a predefined threshold. Our results underscore that high-fidelity synthetic data is…

**Figure 3.** Adaptation strategies and their impact on spatial precision. The left subfigure illustrates that LoRA-adapted models retain higher localization accuracy at stricter IoU thresholds, validating the effectiveness of synthetic supervision. The right subfigure presents a component-level breakdown of LoRA application across vision and language backbones. From Supervised Detectors to Foundation Models. Tradition…

**Figure 4.** End-to-end synthetic data generation workflow. Stage 1 …

**Figure 5.** Dataset composition for adaptation and evaluation. The left panel highlights the variability achieved through synthetic generation, while the right panel summarizes the total image count from each source category. Pipeline Summary and Implementation Details The full pipeline, background generation, expert-informed bounding box annotation, and guided inpainting, is lightweight, modular, and scalable …

**Figure 6.** Performances across Methods, Models, and Datasets

**Figure 7.** Dataset Samples
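
The captions for Figures 1 and 3 compare predicted boxes against ground truth and refer to "stricter IoU thresholds". As a rough illustration of what that means, the sketch below computes intersection over union for two axis-aligned boxes and checks it against several thresholds. The `(x1, y1, x2, y2)` box format, the function names, the threshold values, and the example coordinates are assumptions made for illustration; none of them come from the paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region, clamped at zero when boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def hits_at_thresholds(pred, truth, thresholds=(0.5, 0.75, 0.9)):
    """Report whether a prediction counts as a hit at each IoU threshold."""
    score = iou(pred, truth)
    return {t: score >= t for t in thresholds}


# Hypothetical boxes: the prediction is shifted 2 px to the right of the truth.
truth_box = (0, 0, 10, 10)
pred_box = (2, 0, 12, 10)
result = hits_at_thresholds(pred_box, truth_box)
```

With these hypothetical boxes the overlap is 80 of 120 units of union area, so the prediction passes at 0.5 but fails at 0.75 and 0.9. A model that "retains localization accuracy at stricter thresholds" is one whose boxes keep passing as the threshold rises.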

Original abstract

Large-scale Vision-Language Models (VLMs) have transformed general-purpose visual recognition through strong zero-shot capabilities. However, their performance degrades significantly in niche, safety-critical domains such as industrial spill detection, where hazardous events are rare, sensitive, and difficult to annotate. This scarcity -- driven by privacy concerns, data sensitivity, and the infrequency of real incidents -- renders conventional fine-tuning of detectors infeasible for most industrial settings. We address this challenge by introducing a scalable framework centered on a high-quality synthetic data generation pipeline. We demonstrate that this synthetic corpus enables effective Parameter-Efficient Fine-Tuning (PEFT) of VLMs and substantially boosts the performance of state-of-the-art object detectors such as YOLO and DETR. Notably, in the absence of synthetic data (SynSpill dataset), VLMs still generalize better to unseen spill scenarios than these detectors. When SynSpill is used, both VLMs and detectors achieve marked improvements, with their performance becoming comparable. Our results underscore that high-fidelity synthetic data is a powerful means to bridge the domain gap in safety-critical applications. The combination of synthetic generation and lightweight adaptation offers a cost-effective, scalable pathway for deploying vision systems in industrial environments where real data is scarce/impractical to obtain. Project Page: https://synspill.vercel.app
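
The abstract's appeal to parameter-efficient fine-tuning can be made concrete with a small count. The following is a minimal sketch, assuming LoRA (the adapter named in the Figure 1 caption) is applied to a single weight matrix: LoRA freezes a `d_out × d_in` weight and trains two low-rank factors of shapes `d_out × r` and `r × d_in`. The layer size and rank used here are hypothetical and are not the paper's configuration.

```python
def full_params(d_out, d_in):
    """Number of weights in a dense d_out x d_in matrix."""
    return d_out * d_in


def lora_trainable_params(d_out, d_in, rank):
    """Trainable weights when only the LoRA factors B (d_out x r) and A (r x d_in) are updated."""
    return rank * (d_out + d_in)


# Hypothetical layer: a 4096 x 4096 projection adapted with rank 16.
d_out, d_in, rank = 4096, 4096, 16
dense = full_params(d_out, d_in)                      # 16777216
adapter = lora_trainable_params(d_out, d_in, rank)    # 131072
ratio = adapter / dense                               # 0.0078125, i.e. under 1% of the layer
```

Under these assumed dimensions, the adapter trains fewer than 1% of the layer's weights, which is the sense in which this kind of adaptation is lightweight.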

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SynSpill, a scalable synthetic data generation pipeline for industrial spill detection. It shows that PEFT of VLMs and training of detectors (YOLO, DETR) on this data yields marked performance gains on real scenarios, making the two families comparable; without synthetic data, VLMs already outperform detectors on unseen spills. The work positions high-fidelity synthetic data as a practical solution for data-scarce, safety-critical vision tasks.

Significance. If the synthetic images are sufficiently representative of real industrial spills, the results demonstrate a cost-effective route to deploy reliable vision systems where real incident data cannot be collected at scale. The before/after comparison and the observation that synthetic data equalizes VLM and detector performance are the core contributions; reproducible code or parameter-free derivations are not claimed.

major comments (2)

[§3] §3 (Synthetic Data Pipeline): the central comparability claim requires that the generated images match the distribution of real spills in appearance, lighting, backgrounds, and rare configurations. No quantitative validation (e.g., FID, MMD, or failure-case analysis against a held-out real set) is reported; without it the observed gains could reflect reduced domain gap on the particular evaluation split rather than robust generalization.
[§4] §4 (Experiments): the abstract asserts 'marked improvements' and 'performance becoming comparable,' yet the provided text supplies no numerical metrics, dataset sizes, or ablation tables. The before-after comparison must be shown with exact mAP, F1, or zero-shot accuracy figures on the same real test set to rule out post-hoc selection effects.
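
To illustrate the kind of distributional check the first major comment asks for, here is a minimal sketch of a squared maximum mean discrepancy (MMD) estimate with an RBF kernel. It is not the authors' procedure. In practice the inputs would be feature vectors from a pretrained image encoder for synthetic and real images; the toy 2-D vectors, the `gamma` value, and the function names are assumptions for illustration only.

```python
import math


def rbf_kernel(u, v, gamma):
    """Gaussian RBF kernel exp(-gamma * ||u - v||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


def mmd_squared(xs, ys, gamma=1.0):
    """Biased (V-statistic) estimate of squared MMD between two samples."""
    def mean_kernel(a_set, b_set):
        total = sum(rbf_kernel(a, b, gamma) for a in a_set for b in b_set)
        return total / (len(a_set) * len(b_set))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Toy stand-ins for image features (hypothetical values).
synthetic_feats = [(0.0, 0.0), (1.0, 0.0)]
real_feats_near = [(0.1, 0.0), (1.1, 0.0)]
real_feats_far = [(5.0, 5.0), (6.0, 5.0)]

gap_near = mmd_squared(synthetic_feats, real_feats_near)
gap_far = mmd_squared(synthetic_feats, real_feats_far)
```

A smaller value indicates that the two samples are harder to tell apart under the chosen kernel. The estimate depends on the feature extractor and on `gamma`, so any reported score would need both stated alongside it.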

minor comments (2)

[Abstract] The parenthetical '(SynSpill dataset)' in the abstract is ambiguous; clarify that SynSpill refers exclusively to the synthetic corpus and not to any real data.
[Figures] Figure captions and axis labels should explicitly state whether results are zero-shot, PEFT, or fully supervised to avoid reader confusion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on SynSpill. The comments highlight important aspects of validation and presentation that we address below. We provide point-by-point responses to the major comments.


Referee: [§3] §3 (Synthetic Data Pipeline): the central comparability claim requires that the generated images match the distribution of real spills in appearance, lighting, backgrounds, and rare configurations. No quantitative validation (e.g., FID, MMD, or failure-case analysis against a held-out real set) is reported; without it the observed gains could reflect reduced domain gap on the particular evaluation split rather than robust generalization.

Authors: We agree that explicit quantitative measures of distributional similarity would strengthen the central claim. Our primary evidence for the utility of SynSpill is the measured improvement in downstream detection performance on held-out real images; however, this indirect validation leaves open the possibility of split-specific effects. In the revised manuscript we will add FID and MMD scores computed between SynSpill images and a held-out real spill collection, together with a qualitative review of failure cases on real data. These additions will directly address the concern about domain-gap reduction versus robust generalization.

Revision: yes

Referee: [§4] §4 (Experiments): the abstract asserts 'marked improvements' and 'performance becoming comparable,' yet the provided text supplies no numerical metrics, dataset sizes, or ablation tables. The before-after comparison must be shown with exact mAP, F1, or zero-shot accuracy figures on the same real test set to rule out post-hoc selection effects.

Authors: We acknowledge that the abstract and early sections do not contain the numerical results. The full manuscript reports mAP, F1, and zero-shot accuracy for YOLO, DETR, and the PEFT-tuned VLM on an identical real test set, with explicit before/after comparisons and dataset sizes (synthetic corpus and real test split). To eliminate any ambiguity, we will revise the abstract to include the key quantitative figures and ensure the main experimental section presents all metrics, ablation tables, and test-set details in a single, clearly labeled table.

Revision: yes
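
The metrics named in this exchange depend on how predicted boxes are matched to ground truth. As a rough sketch of one common convention (not the paper's evaluation code), the block below greedily matches each prediction to its best unmatched ground-truth box and derives precision, recall, and F1 at a fixed IoU threshold. It assumes predictions arrive sorted by descending confidence and boxes use the `(x1, y1, x2, y2)` format; the function names and example boxes are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def detection_prf(preds, truths, iou_threshold=0.5):
    """Greedy one-to-one matching; returns (precision, recall, f1)."""
    unmatched = list(truths)
    true_pos = 0
    for pred in preds:  # assumed sorted by descending confidence
        best = max(unmatched, key=lambda t: iou(pred, t), default=None)
        if best is not None and iou(pred, best) >= iou_threshold:
            unmatched.remove(best)  # each ground-truth box can be claimed once
            true_pos += 1
    precision = true_pos / len(preds) if preds else 0.0
    recall = true_pos / len(truths) if truths else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


# Hypothetical frame: two real spills, one found and one missed, plus one false alarm.
truth_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
pred_boxes = [(1, 0, 11, 10), (50, 50, 60, 60)]
precision, recall, f1 = detection_prf(pred_boxes, truth_boxes)
```

Reporting the IoU threshold and the matching rule next to each number is what makes a before/after comparison on the same real test set reproducible.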

Circularity Check

0 steps flagged

No circularity: empirical evaluation on held-out real scenarios is self-contained


The paper advances an empirical pipeline: a synthetic data generator produces the SynSpill corpus, which is then used for PEFT of VLMs and supervised training of detectors such as YOLO and DETR; performance is measured on held-out real spill images. No equations, uniqueness theorems, or first-principles derivations appear in the provided text that would reduce any reported gain to a fitted parameter or self-citation by construction. The central claim—that synthetic data makes VLM and detector performance comparable—is supported by direct comparison of metrics before and after training on the synthetic set, a procedure that remains falsifiable against external real-world test distributions and does not collapse into definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that synthetic images can faithfully simulate real spill appearances and contexts; no explicit free parameters or invented entities are described in the abstract.

axioms (1)

Domain assumption: High-fidelity synthetic data can bridge the domain gap for spill detection without introducing harmful biases or artifacts.
Invoked to justify why training on SynSpill produces real-world improvements.




[1] [1]

Gaussian pyramid image and its application to change detection.Computer Vision and Image Understanding, 1995

John Adams and Howard Bloom. Gaussian pyramid image and its application to change detection.Computer Vision and Image Understanding, 1995. 2

work page 1995

[2] [2]

Qwen2.5-vl technical report, 2025

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhao- hai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report, 2025. 4

work page 2025

[3] [3]

Language models are few-shot learn- ers

Tom Brown and et al. Language models are few-shot learn- ers. In NeurIPS, 2020. 4

work page 2020

[4] [4]

End-to- end object detection with transformers, 2020

Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to- end object detection with transformers, 2020. 3

work page 2020

[5] [5]

Adaptformer: Adapt- ing vision transformers for scalable visual recognition, 2022

Shoufa Chen, Chongjian Ge, Zhan Tong, Jiangliu Wang, Yibing Song, Jue Wang, and Ping Luo. Adaptformer: Adapt- ing vision transformers for scalable visual recognition, 2022. 3

work page 2022

[6] [6]

Icleval: Evaluating in-context learning ability of large language mod- els, 2024

Wentong Chen, Yankai Lin, ZhenHao Zhou, HongYun Huang, Yantao Jia, Zhao Cao, and Ji-Rong Wen. Icleval: Evaluating in-context learning ability of large language mod- els, 2024. 7

work page 2024

[7] [7]

Understanding domain randomization for sim-to-real transfer, 2022

Xiaoyu Chen, Jiachen Hu, Chi Jin, Lihong Li, and Liwei Wang. Understanding domain randomization for sim-to-real transfer, 2022. 3

work page 2022

[8] [8]

Qlora: Efficient finetuning of quan- tized llms

Tim Dettmers and et al. Qlora: Efficient finetuning of quan- tized llms. arXiv preprint arXiv:2310.02578, 2023. 3

work page arXiv 2023

[9] [9]

Carla: An open urban driving simulator, 2017

Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. Carla: An open urban driving simulator, 2017. 3

work page 2017

[10] [10]

Pixmix: Dreamlike pictures comprehensively improve safety measures, 2022

Dan Hendrycks, Andy Zou, Mantas Mazeika, Leonard Tang, Bo Li, Dawn Song, and Jacob Steinhardt. Pixmix: Dreamlike pictures comprehensively improve safety measures, 2022. 3

work page 2022

[11] [11]

Yolov8 to yolo11: A comprehensive architecture in- depth comparative review, 2025

Priyanto Hidayatullah, Nurjannah Syakrani, Muham- mad Rizqi Sholahuddin, Trisna Gelar, and Refdinal Tuba- gus. Yolov8 to yolo11: A comprehensive architecture in- depth comparative review, 2025. 2, 3

work page 2025

[12] [12]

Denoising diffu- sion probabilistic models, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffu- sion probabilistic models, 2020. 3

work page 2020

[13] [13]

Lora: Low-rank adaptation of large language models

Edward Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In ICLR, 2022. 2, 3, 4

work page 2022

[14] [14]

Yolo evolution: A comprehensive benchmark and architectural review of yolov12, yolo11, and their previous versions, 2025

Nidhal Jegham, Chan Young Koh, Marwan Abdelatti, and Abdeltawab Hendawi. Yolo evolution: A comprehensive benchmark and architectural review of yolov12, yolo11, and their previous versions, 2025. 2

work page 2025

[15] [15]

Le, Yunhsuan Sung, Zhen Li, and Tom Duerig

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V . Le, Yunhsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision, 2021. 2, 3, 4

work page 2021

[16] Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning, 2022.

[17] Glenn Jocher, Ayush Chaurasia, and Jing Qiu. Ultralytics yolov8. https://github.com/ultralytics/ultralytics, 2023.

[18] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan, 2020.

[19] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023.

[20] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Lily Rolland, Laura Gustafson, Callen Romero, Michael Krainin, David Li, Chao Li, et al. Segment anything. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2426–2437, 2023.

[21] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13)...

[22] Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Matt Deitke, Kiana Ehsani, Daniel Gordon, Yuke Zhu, Aniruddha Kembhavi, Abhinav Gupta, and Ali Farhadi. Ai2-thor: An interactive 3d environment for visual ai, 2022.

[23] laion. CLIP ViT-H/14 – LAION-2B (laion/CLIP-ViT-H-14-laion2B-s32B-b79K). https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K, 2023. Published ca. Sep 2023; Accessed: 2025-07-04.

[24] Eran Levin and Ohad Fried. Differential diffusion: Giving each pixel its strength, 2024.

[25] Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10965–10975, 2022.

[26] Dongze Lian, Daquan Zhou, Jiashi Feng, and Xinchao Wang. Scaling & shifting your features: A new baseline for efficient model tuning, 2023.

[27] Hui Lin, Nan Li, Pengjuan Yao, Kexin Dong, Yuhan Guo, Danfeng Hong, Ying Zhang, and Congcong Wen. Generalization-enhanced few-shot object detection in remote sensing, 2025.

[28] Ruibo Liu, Jerry Wei, Fangyu Liu, Chenglei Si, Yanzhe Zhang, Jinmeng Rao, Steven Zheng, Daiyi Peng, Diyi Yang, Denny Zhou, and Andrew M. Dai. Best practices and lessons learned on synthetic data, 2024.

[29] Xin Liu et al. Grounding dino: Marrying object detection with grounded language queries. In CVPR, 2023.

[30] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows, 2021.

[31] Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. Repaint: Inpainting using denoising diffusion probabilistic models, 2022.

[32] Simian Luo, Yiqin Tan, Suraj Patil, Daniel Gu, Patrick von Platen, Apolinário Passos, Longbo Huang, Jian Li, and Hang Zhao. Lcm-lora: A universal stable-diffusion acceleration module, 2023.

[33] Samarth Mishra, Rameswar Panda, Cheng Perng Phoo, Chun-Fu Chen, Leonid Karlinsky, Kate Saenko, Venkatesh Saligrama, and Rogerio S. Feris. Task2sim: Towards effective pre-training and transfer from synthetic data, 2022.

[34] Keno Moenck, Duc Trung Thieu, Julian Koch, and Thorsten Schüppstuhl. Industrial language-image dataset (ilid): Adapting vision foundation models for industrial settings,

[35] ostris. IP Composition Adapter (stable diffusion) model. https://huggingface.co/ostris/ip-composition-adapter, 2024. Published: March 20, 2024; Accessed: 2025-07-04.

[36] Jonas Pfeiffer, Aishwarya Kamath, Andreas Rücklé, Kyunghyun Cho, and Iryna Gurevych. Adapterfusion: Non-destructive task composition for transfer learning, 2021.

[37] Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion, 2022.

[38] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021.

[39] R. J. Radke, S. Andra, O. Al-Kofahi, and B. Roysam. Image change detection algorithms: A systematic survey. IEEE Transactions on Image Processing, 14(3):294–307, 2005.

[40] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents, 2022.

[41] Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.

[42] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NeurIPS, 2015.

[43] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2022.

[44] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation, 2023.

[45] Kaziwa Saleh, Sandor Szenasi, and Zoltan Vamossy. Occlusion handling in generic object detection: A review. In 2021 IEEE 19th World Symposium on Applied Machine Intelligence and Informatics (SAMI), pages 000477–000484. IEEE,

[46] Ranjan Sapkota, Rahul Harsha Cheppally, Ajay Sharda, and Manoj Karkee. Rf-detr object detection vs yolov12: A study of transformer-based and cnn-based architectures for single-class and multi-class greenfruit detection in complex orchard environments under label ambiguity, 2025.

[47] Stability AI. SDXL-Turbo 1.0 fp16 model checkpoint. https://huggingface.co/stabilityai/sdxl-turbo/blob/main/sd_xl_turbo_1.0_fp16.safetensors, 2023. Accessed: 2025-07-04.

[48] Unknown Author. Interior scene xl (stable diffusion xl model). https://civitai.com/models/715747/interior-scene-xl, 2025. Accessed: 2025-07-04.

[49] Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao. Yolov7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors, 2022.

[50] Ping Wang, Jiangbo Liu, Yilai Yan, and Zongjian Tang. A survey on deep learning-based industrial defect detection. IEEE Transactions on Neural Networks and Learning Systems, 2021.

[51] Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models, 2023.

[52] Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, Ce Liu, Mengchen Liu, Zicheng Liu, Yumao Lu, Yu Shi, Lijuan Wang, Jianfeng Wang, Bin Xiao, Zhen Xiao, Jianwei Yang, Michael Zeng, Luowei Zhou, and Pengchuan Zhang. Florence: A new foundation model for computer vision, 2021.

[53] Elad Ben Zaken, Shauli Ravfogel, and Yoav Goldberg. Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models, 2022.

[54] Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M. Ni, and Heung-Yeung Shum. Dino: Detr with improved denoising anchor boxes for end-to-end object detection, 2022.

[55] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models, 2023.

[56] Yucheng Zhou, Xiang Li, Qianning Wang, and Jianbing Shen. Visual in-context learning for large vision-language models, 2024.

Supplementary Material

This supplementary section provides additional details, visualizations, and implementation specifics that support the main claims and methodology of our work. It is organized as follows: A. Additional Qual...