pith. sign in

arxiv: 2606.07953 · v1 · pith:C4VXY7WAnew · submitted 2026-06-06 · 💻 cs.AI

Unification of Closed-Open Industrial Detection Scenarios: New Large-Scale Benchmarks,Challenges and Baselines

Pith reviewed 2026-06-27 20:06 UTC · model grok-4.3

classification 💻 cs.AI
keywords industrial defect detectionvisual-language modelsopen-vocabulary detectionclosed-set detectionlarge-scale benchmarkprompt refinementdomain adaptation
0
0 comments X

The pith

A new million-sample benchmark unifies open-vocabulary and closed-set industrial defect detection, and RTVPNet adapts vision-language models to it with automatic domain projection and prompt refinement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates MMIOC-1M, a dataset of over one million images spanning 14 super-categories, 29 scenes, and 351 defect types, to fill the gap in large-scale industrial data for visual-language models. It introduces RTVPNet, which uses expert-assisted projection to shift general models into industrial domains, energy-based sampling to create visual prompts automatically, and bidirectional interaction to align text and image features more precisely. These elements let the model handle both open and closed detection scenarios without relying on manual points or boxes. The result is reported state-of-the-art accuracy on the new benchmark as well as on LVIS and COCO while keeping computation efficient.

Core claim

The central claim is that expert-assisted domain projection, energy-based sparse sampling, and bidirectional text-visual interaction together allow a refined text-visual prompt network to adapt large vision-language models to industrial defect detection, achieving state-of-the-art results on the first unified open-closed benchmark of more than one million samples across 351 subcategories.

What carries the argument

RTVPNet, whose expert-assisted domain projection adapts general vision models, energy-based sparse sampling generates refined visual prompts, and bidirectional text-visual interaction improves cross-modal alignment.

If this is right

  • LVLMs can now be pre-trained on large industrial data that covers both open-vocabulary and closed-set detection.
  • Manual prompt engineering is no longer required for fine-grained industrial inspection tasks.
  • The same architecture improves results on standard natural-image benchmarks such as LVIS and COCO.
  • A single model can switch between open and closed detection modes in factory settings without retraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same projection and sampling approach could be tested on other narrow domains such as medical imaging or satellite analysis.
  • The benchmark size may allow systematic study of how defect rarity affects open-vocabulary performance.
  • Real-time factory deployment would require measuring whether the reported efficiency holds under continuous video streams.

Load-bearing premise

The three proposed components and the new benchmark produce real generalization across industrial scenes rather than overfitting to the collected data or the chosen 29 scenes.

What would settle it

Performance of RTVPNet falling below existing baselines when evaluated on an industrial dataset drawn from scenes and defect types completely outside the 29 scenes and 351 subcategories in MMIOC-1M.

Figures

Figures reproduced from arXiv: 2606.07953 by Cong Bai, Cong Liu, Da Chen, Dagang Li, Gang Li, He Wang, Jinglin Zhang, Qinghui Chen, Shengyong Chen, Shuainan Jing, Zekai Zhang.

Figure 1
Figure 1. Figure 1: (a) Comparison between traditional prompting methods and [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: MMIOC-1M category attribute analysis. MMIOC-1M contains 14 supercategories, 29 scene categories, and 351 subcategories. To our [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Subcategories of some scene categories in MMIOC-1M. Each [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Label calibration process. According to the category properties [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Comparison of MMIOC-1M with other industrial datasets. (a) MMIOC-1M has significant advantages in categories and quantity. (b) MMIOC [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: The distributions over each category in the MMIOC-1M. It can be [PITH_FULL_IMAGE:figures/full_fig_p006_6.png] view at source ↗
Figure 8
Figure 8. Figure 8: Challenging examples in MMIOC-1M. MMIOC-1M has the char [PITH_FULL_IMAGE:figures/full_fig_p006_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: RTVPNet framework. RTVPNet contains three basic tasks: Visual Grounding, Vision Question Answering, and Object Detection. (a) Domain [PITH_FULL_IMAGE:figures/full_fig_p007_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: T-SNE visualization in MMIOC-1M open scenes. Compared with Ground Truth and other methods, RTVPNet can generate the most compact [PITH_FULL_IMAGE:figures/full_fig_p012_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Comparison of detection results in open and closed scenarios of MMIOC-1M. RTVPNet has significant advantages in closed and open [PITH_FULL_IMAGE:figures/full_fig_p013_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Visual Question Answering results on MMIOC-1M. The combi [PITH_FULL_IMAGE:figures/full_fig_p013_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Refined visual prompts visualization. The yellow area represents [PITH_FULL_IMAGE:figures/full_fig_p015_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Visualization of the Visual-Text prompt bidirectional interaction. [PITH_FULL_IMAGE:figures/full_fig_p015_14.png] view at source ↗
read the original abstract

Large-scale Visual-Language Models (LVLMs) have achieved remarkable success in natural visual tasks, yet their application to industrial defect detection remains challenging due to two fundamental limitations: (i) the scarcity of large-scale industrial datasets that cover diverse defect categories across multiple domains, and (ii) the reliance on manual prompts (points, boxes, masks) that introduce subjective noise and lack text-visual interaction for fine-grained understanding. To address these challenges, we introduce a Large-Scale Multi-Modal Industrial Open-Closed benchmark (MMIOC-1M) containing over one million samples across $14$ super-categories, $29$ industrial scenes, and $351$ defect subcategories. To our knowledge, MMIOC-1M is the first unified largest benchmark supporting both open-vocabulary and closed-set industrial detection, providing valuable pre-training data for LVLMs in industrial scenarios. Furthermore, we propose a Refined Text-Visual Prompt Network (RTVPNet) that incorporates three key innovations: (1) an expert-assisted domain projection mechanism that enables rapid adaptation of general vision models to industrial domains, (2) an energy-based sparse sampling strategy that automatically generates refined visual prompts without manual intervention, and (3) a bidirectional text-visual interaction module that enhances cross-modal semantic alignment and understanding. Extensive experiments demonstrate that RTVPNet achieves state-of-the-art performance on MMIOC-1M, LVIS, and COCO benchmarks while maintaining computational efficiency. The dataset and code are available at https://github.com/hellozzk/MMIO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces the MMIOC-1M benchmark containing over one million samples across 14 super-categories, 29 industrial scenes, and 351 defect subcategories to support both open-vocabulary and closed-set industrial defect detection. It proposes RTVPNet incorporating expert-assisted domain projection, energy-based sparse sampling, and bidirectional text-visual interaction, claiming that this model achieves state-of-the-art performance on MMIOC-1M as well as on the LVIS and COCO benchmarks while remaining computationally efficient. The dataset and code are released publicly.

Significance. If the results hold, the work supplies the first large-scale unified benchmark for industrial scenarios and a concrete adaptation strategy for LVLMs that avoids manual prompts. Public release of data and code is a clear strength that could accelerate research on domain-specific vision-language applications in manufacturing.

major comments (1)
  1. [Abstract / Experiments] The headline SOTA claim on LVIS and COCO rests on the assertion that the three RTVPNet components produce measurable gains outside the industrial domain. The abstract states that 'extensive experiments demonstrate' this performance but supplies no ablation tables, performance deltas, or component-removal results on LVIS/COCO; without such evidence the generalization step remains unsecured.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the generalization claims require stronger supporting evidence and will revise the manuscript to address this.

read point-by-point responses
  1. Referee: [Abstract / Experiments] The headline SOTA claim on LVIS and COCO rests on the assertion that the three RTVPNet components produce measurable gains outside the industrial domain. The abstract states that 'extensive experiments demonstrate' this performance but supplies no ablation tables, performance deltas, or component-removal results on LVIS/COCO; without such evidence the generalization step remains unsecured.

    Authors: We acknowledge the validity of this observation. The current manuscript presents component ablations and performance deltas primarily on the MMIOC-1M benchmark (Tables 3-5 and associated figures), while LVIS and COCO results focus on overall SOTA comparisons. To secure the generalization claim, we will add new ablation tables in the revised version showing the effect of removing each RTVPNet component (domain projection, sparse sampling, bidirectional interaction) on LVIS and COCO, including quantitative deltas relative to the full model. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark and model proposal with external validation

full rationale

The paper introduces the MMIOC-1M dataset and RTVPNet architecture (with expert-assisted projection, energy-based sampling, and bidirectional interaction) then reports empirical results on MMIOC-1M plus the independent external benchmarks LVIS and COCO. No derivation chain, first-principles predictions, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claims rest on measured performance rather than any reduction of outputs to inputs by construction, satisfying the self-contained-against-external-benchmarks criterion for score 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical model, free parameters, or invented physical entities are described in the abstract; the contribution is a new dataset and an empirical neural architecture.

pith-pipeline@v0.9.1-grok · 5843 in / 1140 out tokens · 17337 ms · 2026-06-27T20:06:01.152765+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

98 extracted references · 12 linked inside Pith

  1. [1]

    Yolov7: Trainable bag-of-freebies sets new state-of- the-art for real-time object detectors,

    C.-Y. Wang, “Yolov7: Trainable bag-of-freebies sets new state-of- the-art for real-time object detectors,” inCVPR, 2023, pp. 7464– 7475

  2. [2]

    Yolox: Exceeding yolo series in 2021,

    Z. Ge and S. Liu, “Yolox: Exceeding yolo series in 2021,”arXiv preprint arXiv:2107.08430, 2021

  3. [3]

    Yolov6: A single-stage object detection framework for industrial applications,

    C. Li and L. Li, “Yolov6: A single-stage object detection framework for industrial applications,”arXiv preprint arXiv:2209.02976, 2022

  4. [4]

    Available: https://github.com/ultralytics/yolov8, 2023

    ultralytics, “Yolov8,”[Online]. Available: https://github.com/ultralytics/yolov8, 2023

  5. [5]

    Liteyolo-id: A lightweight object detection network for insulator defect detection,

    D. Li and Y. Lu, “Liteyolo-id: A lightweight object detection network for insulator defect detection,”IEEE Trans. on Instru. and Measure., vol. 73, pp. 1–12, 2024

  6. [6]

    Segment anything,

    A. Kirillov and E. Mintun, “Segment anything,” inICCV, October 2023, pp. 4015–4026

  7. [7]

    Fast segment anything,

    X. Zhao and W. Ding, “Fast segment anything,” 2023. [Online]. Available: https://arxiv.org/abs/2306.12156

  8. [8]

    Faster segment anything: Towards lightweight sam for mobile applications,

    C. Zhang, “Faster segment anything: Towards lightweight sam for mobile applications,”arXiv preprint arXiv:2306.14289, 2023

  9. [9]

    Visual instruction tuning,

    H. Liu and C. Li, “Visual instruction tuning,” vol. 36, 2024

  10. [10]

    Grounding dino: Marrying dino with grounded pre-training for open-set object detection,

    S. Liu and Z. Zeng, “Grounding dino: Marrying dino with grounded pre-training for open-set object detection,” inECCV. Springer, 2024, pp. 38–55

  11. [11]

    S2dbft: Spectral-spatial dual-branch fusion trans- former for hyperspectral image classification,

    Y. Zhang, Z. Wang, M. Huang, M. Li, J. Zhang, S. Wang, J. Zhang, and H. Zhang, “S2dbft: Spectral-spatial dual-branch fusion trans- former for hyperspectral image classification,”IEEE Transactions on Geoscience and Remote Sensing, 2025

  12. [12]

    Implementation of motion estimation based on heterogeneous parallel computing system with opencl,

    J. Zhang, J.-F. Nezan, and J.-G. Cousin, “Implementation of motion estimation based on heterogeneous parallel computing system with opencl,” in2012 IEEE 14th International Conference on High Performance Computing and Communication & 2012 IEEE 9th Inter- national Conference on Embedded Software and Systems. IEEE, 2012, pp. 41–45

  13. [13]

    3d octave and 2d vanilla mixed convolutional neural network for hyperspectral image classification with limited samples,

    Y. Feng, J. Zheng, M. Qin, C. Bai, and J. Zhang, “3d octave and 2d vanilla mixed convolutional neural network for hyperspectral image classification with limited samples,”Remote Sensing, vol. 13, no. 21, p. 4407, 2021. UNIFICATION OF CLOSED-OPEN INDUSTRIAL DETECTION SCENARIOS: NEW LARGE-SCALE BENCHMARKS, CHALLENGES AND BASELINES 17

  14. [14]

    Learning vertex representations for bipartite networks,

    M. Gao, X. He, L. Chen, T. Liu, J. Zhang, and A. Zhou, “Learning vertex representations for bipartite networks,”IEEE transactions on knowledge and data engineering, vol. 34, no. 1, pp. 379–393, 2020

  15. [15]

    Multi-granularity episodic contrastive learning for few-shot learning,

    P . Zhu, Z. Zhu, Y. Wang, J. Zhang, and S. Zhao, “Multi-granularity episodic contrastive learning for few-shot learning,”Pattern Recog- nition, vol. 131, p. 108820, 2022

  16. [16]

    A novel ground-based cloud image segmentation method by using deep transfer learning,

    Z. Zhou, F. Zhang, H. Xiao, F. Wang, X. Hong, K. Wu, and J. Zhang, “A novel ground-based cloud image segmentation method by using deep transfer learning,”IEEE Geoscience and Remote Sensing Letters, vol. 19, pp. 1–5, 2021

  17. [17]

    Ensemble meteorological cloud classifica- tion meets internet of dependable and controllable things,

    J. Zhang, P . Liu, F. Zhang, H. Iwabuchi, A. A. d. H. e Ayres, V . H. C. De Albuquerqueet al., “Ensemble meteorological cloud classifica- tion meets internet of dependable and controllable things,”IEEE Internet of Things Journal, vol. 8, no. 5, pp. 3323–3330, 2020

  18. [18]

    Automated cca-mwf algorithm for unsuper- vised identification and removal of eog artifacts from eeg,

    M. Miao, W. Hu, B. Xu, J. Zhang, J. J. Rodrigues, and V . H. C. De Albuquerque, “Automated cca-mwf algorithm for unsuper- vised identification and removal of eog artifacts from eeg,”IEEE Journal of Biomedical and Health Informatics, vol. 26, no. 8, pp. 3607– 3617, 2021

  19. [19]

    Supervised learning based discrete hashing for image retrieval,

    Q. Ma, C. Bai, J. Zhang, Z. Liu, and S. Chen, “Supervised learning based discrete hashing for image retrieval,”Pattern Recognition, vol. 92, pp. 156–164, 2019

  20. [20]

    Clothing sale forecasting by a composite gru–prophet model with an attention mechanism,

    Y. Li, Y. Yang, K. Zhu, and J. Zhang, “Clothing sale forecasting by a composite gru–prophet model with an attention mechanism,” IEEE Transactions on Industrial Informatics, vol. 17, no. 12, pp. 8335– 8344, 2021

  21. [21]

    Distilled large language model-driven dynamic sparse expert activation mechanism,

    Q. Chen, Z. Zhang, Z. Zhang, K. Zhang, D. Li, W. Wang, J. Zhang, and C. Liu, “Distilled large language model-driven dynamic sparse expert activation mechanism,”Applied Soft Computing, p. 114037, 2025

  22. [22]

    Dual-path aggregation transformer network for super-resolution with images occlusions and variabil- ity,

    Q. Chen, L. Wang, Z. Zhang, X. Wang, W. Liu, B. Xia, H. Ding, J. Zhang, S. Xu, and X. Wang, “Dual-path aggregation transformer network for super-resolution with images occlusions and variabil- ity,”Engineering Applications of Artificial Intelligence, vol. 139, no. PartA, 2025

  23. [23]

    Kftd: Koopman- fourier time-differentiable network for continuous ocean spa- tiotemporal forecasting,

    Q. Chen, Z. Zhang, H. Liu, J. Zhang, and C. Bai, “Kftd: Koopman- fourier time-differentiable network for continuous ocean spa- tiotemporal forecasting,” inProceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 1, 2026, pp. 94–103

  24. [24]

    A novel dataset and lightweight distillation baseline for highlight transparent object detection,

    Z. Zhang, G. Li, H. Zhang, Q. Chen, Q. Zhang, J. Wan, M. Xiong, C. Bai, D. Li, W. Zhanget al., “A novel dataset and lightweight distillation baseline for highlight transparent object detection,” International Journal of Computer Vision, vol. 134, no. 4, p. 157, 2026

  25. [25]

    Idd- net: Industrial defect detection method based on deep-learning,

    Z. Zhang, M. Zhou, H. Wan, M. Li, G. Li, and D. Han, “Idd- net: Industrial defect detection method based on deep-learning,” Engineering Applications of Artificial Intelligence, vol. 123, p. 106390, 2023

  26. [26]

    Zero-shot learning in industrial scenarios: New large-scale benchmark, challenges and baseline,

    Z. Zhang, Q. Chen, M. Xiong, S. Ding, Z. Su, X. Yao, Y. Sun, C. Bai, and J. Zhang, “Zero-shot learning in industrial scenarios: New large-scale benchmark, challenges and baseline,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 10, 2025, pp. 10 357–10 366

  27. [27]

    Representation learning based on co- evolutionary combined with probability distribution optimization for precise defect location,

    J. Zhang, Z. Zhang, Q. Chen, G. Li, W. Li, S. Ding, M. Xiong, W. Zhang, and S. Chen, “Representation learning based on co- evolutionary combined with probability distribution optimization for precise defect location,”IEEE Transactions on Neural Networks and Learning Systems, vol. 36, no. 7, pp. 11 989–12 003, 2024

  28. [28]

    Unification of closed-open industrial detection scenarios: New large-scale benchmarks, challenges and baselines,

    Z. Zhang, J. Zhang, Q. Chen, G. Li, D. Chen, S. Jing, H. Wang, D. Li, C. Liu, C. Baiet al., “Unification of closed-open industrial detection scenarios: New large-scale benchmarks, challenges and baselines,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026

  29. [29]

    Distillation-based fabric anomaly detection,

    S. Thomine and H. Snoussi, “Distillation-based fabric anomaly detection,”Textile Research Journal, vol. 94, no. 5-6, pp. 552–565, 2024

  30. [30]

    Deep learning for medical anomaly detection–a survey,

    T. Fernando and H. Gammulle, “Deep learning for medical anomaly detection–a survey,”ACM Computing Surveys (CSUR), vol. 54, no. 7, pp. 1–37, 2021

  31. [31]

    Fdsnet: An accurate real-time surface defect segmentation network,

    J. Zhang and R. Ding, “Fdsnet: An accurate real-time surface defect segmentation network,” inICASSP, 2022, pp. 3803–3807

  32. [32]

    Matcher: Segment anything with one shot us- ing all-purpose feature matching,

    Y. Liu and M. Zhu, “Matcher: Segment anything with one shot us- ing all-purpose feature matching,”arXiv preprint arXiv:2305.13310, 2023

  33. [33]

    Personalize segment anything model with one shot,

    R. Zhang and Z. Jiang, “Personalize segment anything model with one shot,”arXiv preprint arXiv:2305.03048, 2023

  34. [34]

    Aggregated residual transformations for deep neural networks,

    S. Xie and R. Girshick, “Aggregated residual transformations for deep neural networks,” inCVPR, July 2017

  35. [35]

    Deep residual learning for image recogni- tion,

    K. He and X. Zhang, “Deep residual learning for image recogni- tion,” inCVPR, 2016, pp. 770–778

  36. [36]

    Cpt: Colorful prompt tuning for pre-trained vision-language models,

    Y. Yao and A. Zhang, “Cpt: Colorful prompt tuning for pre-trained vision-language models,”AI Open, vol. 5, pp. 30–38, 2024

  37. [37]

    Reclip: A strong zero-shot baseline for referring expression comprehension,

    S. Subramanian, “Reclip: A strong zero-shot baseline for referring expression comprehension,”arXiv preprint arXiv:2204.05991, 2022

  38. [38]

    How to efficiently adapt large segmentation model (sam) to medical images,

    X. Hu and X. Xu, “How to efficiently adapt large segmentation model (sam) to medical images,”arXiv preprint arXiv:2306.13731, 2023

  39. [39]

    Cocoopter: Pre-train, prompt, and fine-tune the vision-language model for few-shot image classification,

    J. Yan and Y. Xie, “Cocoopter: Pre-train, prompt, and fine-tune the vision-language model for few-shot image classification,”Int. J. Multi. Inform. Retri., vol. 12, no. 2, p. 27, 2023

  40. [40]

    Yolo-world: Real-time open-vocabulary object detection,

    T. Cheng and L. Song, “Yolo-world: Real-time open-vocabulary object detection,” inCVPR, 2024, pp. 16 901–16 911

  41. [41]

    LVIS: A dataset for large vocabulary instance segmentation,

    A. Gupta and P . Doll ´ar, “LVIS: A dataset for large vocabulary instance segmentation,”CoRR, vol. abs/1908.03195, 2019

  42. [42]

    Microsoft coco: Common objects in context,

    T.-Y. Lin and M. Maire, “Microsoft coco: Common objects in context,” inECCV, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, Eds., vol. 8693, 2014, pp. 740–755

  43. [43]

    Zero-shot learning in industrial scenarios: New large-scale benchmark, challenges and baseline,

    Z. Zhang and Q. Chen, “Zero-shot learning in industrial scenarios: New large-scale benchmark, challenges and baseline,” inAAAI, vol. 39, no. 10, Apr. 2025, pp. 10 357–10 366. [Online]. Available: https://ojs.aaai.org/index.php/AAAI/article/view/33124

  44. [44]

    Mmad: The first-ever comprehensive bench- mark for multimodal large language models in industrial anomaly detection,

    X. Jiang and J. Li, “Mmad: The first-ever comprehensive bench- mark for multimodal large language models in industrial anomaly detection,”arXiv preprint arXiv:2410.09453, 2024

  45. [45]

    Defect spectrum: a granular look of large- scale defect datasets with rich semantics,

    S. Yang and Z. Chen, “Defect spectrum: a granular look of large- scale defect datasets with rich semantics,” inECCV. Springer, 2024, pp. 187–203

  46. [46]

    Vision datasets: A benchmark for vision-based industrial inspection,

    H. Bai and S. Mou, “Vision datasets: A benchmark for vision-based industrial inspection,”arXiv preprint arXiv:2306.07890, 2023

  47. [47]

    Pku-goodsad: A supermarket goods dataset for unsupervised anomaly detection and segmentation,

    J. Zhang and R. Ding, “Pku-goodsad: A supermarket goods dataset for unsupervised anomaly detection and segmentation,” IEEE Robot. and Auto. Lett., vol. 9, no. 3, pp. 2008–2015, 2024

  48. [48]

    Mvtec ad–a comprehensive real- world dataset for unsupervised anomaly detection,

    P . Bergmann and M. Fauser, “Mvtec ad–a comprehensive real- world dataset for unsupervised anomaly detection,” inCVPR, 2019, pp. 9592–9600

  49. [49]

    Spot-the-difference self-supervised pre- training for anomaly detection and segmentation,

    Y. Zou and J. Jeong, “Spot-the-difference self-supervised pre- training for anomaly detection and segmentation,” inECCV. Springer, 2022, pp. 392–408

  50. [50]

    Real-iad: A real-world multi-view dataset for benchmarking versatile industrial anomaly detection,

    C. Wang and W. Zhu, “Real-iad: A real-world multi-view dataset for benchmarking versatile industrial anomaly detection,” in CVPR, 2024, pp. 22 883–22 892

  51. [51]

    Multi-sensor object anomaly detection: Uni- fying appearance, geometry, and internal properties,

    W. Li and B. Zheng, “Multi-sensor object anomaly detection: Uni- fying appearance, geometry, and internal properties,” inCVPR, 2025, pp. 9984–9993

  52. [52]

    3cad: A large-scale real-world 3c product dataset for unsupervised anomaly detection,

    E. Yang and P . Xing, “3cad: A large-scale real-world 3c product dataset for unsupervised anomaly detection,” inAAAI, vol. 39, no. 9, 2025, pp. 9175–9183

  53. [53]

    Pixel-level contrastive pretrainer for indus- trial image representation,

    B. Zhu and Y. Chen, “Pixel-level contrastive pretrainer for indus- trial image representation,”IEEE Trans. on Instru. and Measure., vol. 73, 2024

  54. [54]

    Vt-adl: A vision transformer network for image anomaly detection and localization,

    P . Mishra and R. Verk, “Vt-adl: A vision transformer network for image anomaly detection and localization,” inISIE. IEEE, 2021, pp. 01–06

  55. [55]

    Gpt-4 technical report,

    J. Achiam and S. Adler, “Gpt-4 technical report,”arXiv preprint arXiv:2303.08774, 2023

  56. [56]

    Learning transferable visual models from natural language supervision,

    A. Radford and J. W. Kim, “Learning transferable visual models from natural language supervision,” inICML. PMLR, 2021, pp. 8748–8763

  57. [57]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,

    J. Li and D. Li, “Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” inICML. PMLR, 2023, pp. 19 730–19 742

  58. [58]

    Application of segment anything model for civil infrastructure defect assessment,

    M. Ahmadi and A. G. Lonbar, “Application of segment anything model for civil infrastructure defect assessment,”arXiv preprint arXiv:2304.12600, 2023

  59. [59]

    Medical sam adapter: Adapting segment anything model for medical image segmentation,

    J. Wu and Z. Wang, “Medical sam adapter: Adapting segment anything model for medical image segmentation,”Medical image analysis, vol. 102, p. 103547, 2025

  60. [60]

    Eviprompt: A training-free evidential prompt generation method for segment anything model in medical im- ages,

    Y. Xu and J. Tang, “Eviprompt: A training-free evidential prompt generation method for segment anything model in medical im- ages,”arXiv preprint arXiv:2311.06400, 2023

  61. [61]

    Adaptershadow: Adapting seg- ment anything model for shadow detection,

    L. Jie and H. Zhang, “Adaptershadow: Adapting seg- ment anything model for shadow detection,”arXiv preprint arXiv:2311.08891, 2023. UNIFICATION OF CLOSED-OPEN INDUSTRIAL DETECTION SCENARIOS: NEW LARGE-SCALE BENCHMARKS, CHALLENGES AND BASELINES 18

  62. [62]

    Autoprompt: Eliciting knowledge from language models with automatically generated prompts,

    T. Shin and Y. Razeghi, “Autoprompt: Eliciting knowledge from language models with automatically generated prompts,”arXiv preprint arXiv:2010.15980, 2020

  63. [63]

    Fine-grained visual prompting,

    L. Yang and Y. Wang, “Fine-grained visual prompting,”NeurIPS, vol. 36, 2024

  64. [64]

    Vrp-sam: Sam with visual reference prompt,

    Y. Sun and J. Chen, “Vrp-sam: Sam with visual reference prompt,” inCVPR, 2024, pp. 23 565–23 574

  65. [65]

    Yolo-uniow: Efficient universal open-world object detection,

    L. Liu and J. Feng, “Yolo-uniow: Efficient universal open-world object detection,”arXiv preprint arXiv:2412.20645, 2024

  66. [66]

    Relational proxies: Fine-grained relationships as zero-shot discriminators,

    A. Chaudhuri and M. Mancini, “Relational proxies: Fine-grained relationships as zero-shot discriminators,”IEEE Trans. on Pattern Anal. and Mach. Intell., 2024

  67. [67]

    Segment anything is not always perfect: An investigation of sam on different real-world applications,

    W. Ji and J. Li, “Segment anything is not always perfect: An investigation of sam on different real-world applications,”Machine Intelligence Research, vol. 21, pp. 617–630, 2024

  68. [68]

    Masked autoencoders are scalable vision learners,

    K. He and X. Chen, “Masked autoencoders are scalable vision learners,” inCVPR, 2022, pp. 16 000–16 009

  69. [69]

    Distance-iou loss: Faster and better learning for bounding box regression,

    Z. Zheng and P . Wang, “Distance-iou loss: Faster and better learning for bounding box regression,” inAAAI, vol. 34, no. 07, 2020, pp. 12 993–13 000

  70. [70]

    Representation learning with contrastive predictive coding,

    A. v. d. Oord and Y. Li, “Representation learning with contrastive predictive coding,”arXiv preprint arXiv:1807.03748, 2018

  71. [71]

    Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,

    D. Guo and D. Yang, “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,”arXiv preprint arXiv:2501.12948, 2025

  72. [72]

    A light-weight framework for open-set object detection with decoupled feature alignment in joint space,

    Y. He and H. Su, “A light-weight framework for open-set object detection with decoupled feature alignment in joint space,” 2024. [Online]. Available: https://arxiv.org/abs/2412.14680

  73. [73]

    Mamba-yolo-world: Marrying yolo-world with mamba for open-vocabulary detection,

    H. Wang and Q. He, “Mamba-yolo-world: Marrying yolo-world with mamba for open-vocabulary detection,” inICASSP, 2025, pp. 1–5

  74. [74]

    Qwen2.5-vl technical report,

    S. Bai and K. Chen, “Qwen2.5-vl technical report,” 2025. [Online]. Available: arXivpreprintarXiv:2502.13923

  75. [75]

    Qwen3-omni technical report,

    J. Xu and Z. Guo, “Qwen3-omni technical report,” 2025. [Online]. Available: arXivpreprintarXiv:2505.09388

  76. [76]

    Large-scale visual language model boosted by contrast domain adaptation for intelligent industrial visual monitoring,

    H. Wang and C. Li, “Large-scale visual language model boosted by contrast domain adaptation for intelligent industrial visual monitoring,”IEEE Trans. on Indus. Infor., vol. 20, no. 12, pp. 14 114– 14 123, 2024

  77. [77]

    Llava-next-interleave: Tackling multi-image, video, and 3d in large multimodal models,

    F. Li, “Llava-next-interleave: Tackling multi-image, video, and 3d in large multimodal models,” 2025. [Online]. Available: arXivpreprintarXiv:2407.07895

  78. [78]

    Llava-onevision: Easy visual task transfer,

    B. Li and Y. Zhang, “Llava-onevision: Easy visual task transfer,”

  79. [79]

    Available: arXivpreprintarXiv:2408.03326

    [Online]. Available: arXivpreprintarXiv:2408.03326

  80. [80]

    Yolov11: An overview of the key architectural enhancements,

    R. Khanam and M. Hussain, “Yolov11: An overview of the key architectural enhancements,” 2024. [Online]. Available: https://arxiv.org/abs/2410.17725

Showing first 80 references.