arXiv: 2508.10171 · v1 · submitted 2025-08-13 · 💻 cs.CV · cs.ET

SynSpill: Improved Industrial Spill Detection With Synthetic Data

Pith reviewed 2026-05-18 22:25 UTC · model grok-4.3

classification 💻 cs.CV cs.ET
keywords synthetic data · spill detection · vision-language models · object detection · industrial safety · domain adaptation · parameter-efficient fine-tuning

The pith

Synthetic data lets vision-language models and object detectors perform comparably on industrial spill detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper addresses the scarcity of real data for detecting industrial spills by building a synthetic data pipeline called SynSpill. The resulting data is used for parameter-efficient fine-tuning of vision-language models (VLMs) and for improving the accuracy of detectors such as YOLO and DETR. Without synthetic data, the VLMs already generalize better to unseen spill scenarios than the detectors. Adding the synthetic dataset yields marked gains for both and brings their results to comparable levels. This offers a cost-effective way to build reliable detection systems in industrial settings where real examples of spills are hard to obtain.

Core claim

The central discovery is that a high-quality synthetic data generation pipeline produces a corpus that supports parameter-efficient fine-tuning of vision-language models and boosts state-of-the-art object detectors. Without this synthetic data, VLMs generalize better to unseen spill scenarios than detectors. When the SynSpill dataset is used, both types of models achieve marked improvements and their performance becomes comparable, showing that synthetic data can bridge the domain gap in safety-critical applications.

What carries the argument

The SynSpill high-quality synthetic data generation pipeline that creates representative images of industrial spills for effective model adaptation.

Load-bearing premise

The synthetic spill images are close enough to real ones that models trained on them work well on actual industrial footage without major mismatches or biases.

What would settle it

A large collection of real industrial spill images where the models trained with SynSpill show little or no accuracy gain over those trained only on limited real data or none at all.

Figures

Figures reproduced from arXiv: 2508.10171 by Aaditya Baranwal, Abdul Mueez, Guneet Bhatia, Jason Voelker, Shruti Vyas.

Figure 1: Comparative detection performance of competing methods on a real-world CCTV image of an industrial spill. We visualize and contrast the predictions of three models: (1) a Zero-Shot Qwen2.5-VL-32B baseline without adaptation, (2) a PEFT-adapted Qwen2.5-VL-32B using LoRA on synthetic and web-scraped public data, and (3) a finetuned RF-DETR Base model trained on the same hybrid dataset. The ground-truth annot…
Figure 2: Overview of the Industrial Spill Detection Framework. The system ingests live CCTV feeds alongside a user-defined text prompt. A Vision-Language Model (VLM), selected and fine-tuned via a chosen strategy, analyzes the input to detect and localize potential spills, triggering an alert when the detection confidence surpasses a predefined threshold.
Figure 3: Adaptation strategies and their impact on spatial precision. The left subfigure illustrates that LoRA-adapted models retain higher localization accuracy at stricter IoU thresholds, validating the effectiveness of synthetic supervision. The right subfigure presents a component-level breakdown of LoRA application across vision and language backbones.
Figure 4: End-to-end synthetic data generation workflow. Stage 1 …
Figure 5: Dataset composition for adaptation and evaluation. The left panel highlights the variability achieved through synthetic generation, while the right panel summarizes the total image count from each source category. Pipeline Summary and Implementation Details: the full pipeline (background generation, expert-informed bounding box annotation, and guided inpainting) is lightweight, modular, and scalable.
Figure 6: Performances across Methods, Models, and Datasets
Figure 7: Dataset Samples
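
The alerting step described in the Figure 2 caption (raise an alert when detection confidence surpasses a predefined threshold) can be sketched as a simple filter over per-frame detections. This is a minimal illustration, not the authors' implementation: the `Detection` record, the `spill_alerts` helper, the label string `"spill"`, and the default threshold of 0.5 are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    """Hypothetical record for one detection returned by a VLM or detector."""
    label: str
    confidence: float                      # model score, assumed to lie in [0, 1]
    box: Tuple[float, float, float, float] # (x_min, y_min, x_max, y_max) in pixels


def spill_alerts(detections: List[Detection], threshold: float = 0.5) -> List[Detection]:
    """Keep only spill detections whose confidence exceeds the threshold."""
    return [d for d in detections if d.label == "spill" and d.confidence > threshold]


# Two made-up detections for a single CCTV frame.
frame_detections = [
    Detection("spill", 0.82, (120, 300, 260, 380)),
    Detection("spill", 0.31, (400, 90, 450, 130)),
]
alerts = spill_alerts(frame_detections, threshold=0.5)
print(len(alerts))  # 1: only the 0.82 detection passes
```

In a deployment the threshold would presumably be tuned on real footage, since it trades missed spills against false alarms.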
Original abstract

Large-scale Vision-Language Models (VLMs) have transformed general-purpose visual recognition through strong zero-shot capabilities. However, their performance degrades significantly in niche, safety-critical domains such as industrial spill detection, where hazardous events are rare, sensitive, and difficult to annotate. This scarcity -- driven by privacy concerns, data sensitivity, and the infrequency of real incidents -- renders conventional fine-tuning of detectors infeasible for most industrial settings. We address this challenge by introducing a scalable framework centered on a high-quality synthetic data generation pipeline. We demonstrate that this synthetic corpus enables effective Parameter-Efficient Fine-Tuning (PEFT) of VLMs and substantially boosts the performance of state-of-the-art object detectors such as YOLO and DETR. Notably, in the absence of synthetic data (SynSpill dataset), VLMs still generalize better to unseen spill scenarios than these detectors. When SynSpill is used, both VLMs and detectors achieve marked improvements, with their performance becoming comparable. Our results underscore that high-fidelity synthetic data is a powerful means to bridge the domain gap in safety-critical applications. The combination of synthetic generation and lightweight adaptation offers a cost-effective, scalable pathway for deploying vision systems in industrial environments where real data is scarce/impractical to obtain. Project Page: https://synspill.vercel.app
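
Figure 1 names LoRA as the PEFT method applied to Qwen2.5-VL-32B. As a rough illustration of why such adaptation is lightweight, the sketch below applies a low-rank update to a single frozen weight matrix in NumPy. The layer sizes, rank, and scaling value are arbitrary assumptions chosen for the example, not settings reported for SynSpill.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy dimensions; real VLM layers are far larger.
d_out, d_in, rank, alpha = 64, 64, 4, 8

W0 = rng.normal(size=(d_out, d_in))            # pretrained weight, kept frozen
A = rng.normal(scale=0.01, size=(rank, d_in))  # trainable down-projection
B = np.zeros((d_out, rank))                    # trainable up-projection, zero-initialised


def lora_forward(x: np.ndarray) -> np.ndarray:
    """Compute x W0^T plus the scaled low-rank correction (alpha / rank) * x A^T B^T."""
    return x @ W0.T + (alpha / rank) * (x @ A.T) @ B.T


x = rng.normal(size=(2, d_in))
full_params = W0.size           # parameters touched by full fine-tuning of this layer
lora_params = A.size + B.size   # parameters trained under LoRA
print(full_params, lora_params)
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly; only `A` and `B` would receive gradient updates during fine-tuning.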

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SynSpill, a scalable synthetic data generation pipeline for industrial spill detection. It shows that PEFT of VLMs and training of detectors (YOLO, DETR) on this data yields marked performance gains on real scenarios, making the two families comparable; without synthetic data, VLMs already outperform detectors on unseen spills. The work positions high-fidelity synthetic data as a practical solution for data-scarce, safety-critical vision tasks.

Significance. If the synthetic images are sufficiently representative of real industrial spills, the results demonstrate a cost-effective route to deploy reliable vision systems where real incident data cannot be collected at scale. The before/after comparison and the observation that synthetic data equalizes VLM and detector performance are the core contributions; reproducible code or parameter-free derivations are not claimed.

major comments (2)
  1. [§3] §3 (Synthetic Data Pipeline): the central comparability claim requires that the generated images match the distribution of real spills in appearance, lighting, backgrounds, and rare configurations. No quantitative validation (e.g., FID, MMD, or failure-case analysis against a held-out real set) is reported; without it the observed gains could reflect reduced domain gap on the particular evaluation split rather than robust generalization.
  2. [§4] §4 (Experiments): the abstract asserts 'marked improvements' and 'performance becoming comparable,' yet the provided text supplies no numerical metrics, dataset sizes, or ablation tables. The before-after comparison must be shown with exact mAP, F1, or zero-shot accuracy figures on the same real test set to rule out post-hoc selection effects.
minor comments (2)
  1. [Abstract] The parenthetical '(SynSpill dataset)' in the abstract is ambiguous; clarify that SynSpill refers exclusively to the synthetic corpus and not to any real data.
  2. [Figures] Figure captions and axis labels should explicitly state whether results are zero-shot, PEFT, or fully supervised to avoid reader confusion.
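
The distributional check requested in major comment 1 could take the form of a maximum mean discrepancy (MMD) between feature vectors of real and synthetic images. The sketch below is one possible version under stated assumptions: it uses the biased squared-MMD estimator with an RBF kernel, a bandwidth heuristic of `1 / feature_dim`, and random Gaussian vectors standing in for image embeddings that would in practice come from a pretrained encoder. None of these choices are taken from the paper.

```python
import numpy as np


def rbf_kernel(a: np.ndarray, b: np.ndarray, gamma: float) -> np.ndarray:
    """Pairwise RBF kernel exp(-gamma * ||a_i - b_j||^2)."""
    sq_dist = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2 * a @ b.T
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))


def mmd2(x: np.ndarray, y: np.ndarray, gamma=None) -> float:
    """Biased estimate of squared MMD between samples x and y (rows are feature vectors)."""
    if gamma is None:
        gamma = 1.0 / x.shape[1]  # assumed bandwidth heuristic
    k_xx = rbf_kernel(x, x, gamma).mean()
    k_yy = rbf_kernel(y, y, gamma).mean()
    k_xy = rbf_kernel(x, y, gamma).mean()
    return float(k_xx + k_yy - 2 * k_xy)


# Stand-in features: "real", a well-matched synthetic set, and a shifted synthetic set.
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(200, 16))
synthetic_close = rng.normal(0.0, 1.0, size=(200, 16))
synthetic_far = rng.normal(1.5, 1.0, size=(200, 16))

print(mmd2(real, synthetic_close), mmd2(real, synthetic_far))
```

A smaller value indicates that the two feature distributions are harder to tell apart; the shifted set should score noticeably higher than the matched one.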

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on SynSpill. The comments highlight important aspects of validation and presentation that we address below. We provide point-by-point responses to the major comments.

  1. Referee: [§3] §3 (Synthetic Data Pipeline): the central comparability claim requires that the generated images match the distribution of real spills in appearance, lighting, backgrounds, and rare configurations. No quantitative validation (e.g., FID, MMD, or failure-case analysis against a held-out real set) is reported; without it the observed gains could reflect reduced domain gap on the particular evaluation split rather than robust generalization.

    Authors: We agree that explicit quantitative measures of distributional similarity would strengthen the central claim. Our primary evidence for the utility of SynSpill is the measured improvement in downstream detection performance on held-out real images; however, this indirect validation leaves open the possibility of split-specific effects. In the revised manuscript we will add FID and MMD scores computed between SynSpill images and a held-out real spill collection, together with a qualitative review of failure cases on real data. These additions will directly address the concern about domain-gap reduction versus robust generalization. revision: yes

  2. Referee: [§4] §4 (Experiments): the abstract asserts 'marked improvements' and 'performance becoming comparable,' yet the provided text supplies no numerical metrics, dataset sizes, or ablation tables. The before-after comparison must be shown with exact mAP, F1, or zero-shot accuracy figures on the same real test set to rule out post-hoc selection effects.

    Authors: We acknowledge that the abstract and early sections do not contain the numerical results. The full manuscript reports mAP, F1, and zero-shot accuracy for YOLO, DETR, and the PEFT-tuned VLM on an identical real test set, with explicit before/after comparisons and dataset sizes (synthetic corpus and real test split). To eliminate any ambiguity, we will revise the abstract to include the key quantitative figures and ensure the main experimental section presents all metrics, ablation tables, and test-set details in a single, clearly labeled table. revision: yes
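
The metrics named in this exchange depend on how predicted boxes are matched to ground truth. As a hedged sketch of that step (simpler than full mAP, which also sweeps confidence thresholds), the code below computes IoU and then precision, recall, and F1 at a chosen IoU threshold using greedy one-to-one matching. The function names, the matching rule, and the example boxes are assumptions for illustration, not the paper's evaluation protocol.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def f1_at_iou(preds: List[Box], gts: List[Box], iou_thr: float = 0.5):
    """Greedily match each prediction to its best unmatched ground-truth box."""
    matched, tp = set(), 0
    for p in preds:
        best_j, best_iou = -1, 0.0
        for j, g in enumerate(gts):
            if j not in matched and iou(p, g) > best_iou:
                best_j, best_iou = j, iou(p, g)
        if best_j >= 0 and best_iou >= iou_thr:
            matched.add(best_j)
            tp += 1
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(gts) if gts else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [(0, 0, 10, 8), (50, 50, 60, 60)]  # one close hit (IoU 0.8), one miss
print(f1_at_iou(preds, gts, iou_thr=0.5))   # (0.5, 0.5, 0.5)
print(f1_at_iou(preds, gts, iou_thr=0.9))   # (0.0, 0.0, 0.0)
```

The second call shows why reporting the IoU threshold matters: the same predictions score zero once the threshold is stricter than the achieved overlap.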

Circularity Check

0 steps flagged

No circularity: empirical evaluation on held-out real scenarios is self-contained

The paper advances an empirical pipeline: a synthetic data generator produces the SynSpill corpus, which is then used for PEFT of VLMs and supervised training of detectors such as YOLO and DETR; performance is measured on held-out real spill images. No equations, uniqueness theorems, or first-principles derivations appear in the provided text that would reduce any reported gain to a fitted parameter or self-citation by construction. The central claim—that synthetic data makes VLM and detector performance comparable—is supported by direct comparison of metrics before and after training on the synthetic set, a procedure that remains falsifiable against external real-world test distributions and does not collapse into definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that synthetic images can faithfully simulate real spill appearances and contexts; no explicit free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption: High-fidelity synthetic data can bridge the domain gap for spill detection without introducing harmful biases or artifacts.
    Invoked to justify why training on SynSpill produces real-world improvements.
