Unification of Closed-Open Industrial Detection Scenarios: New Large-Scale Benchmarks,Challenges and Baselines

Cong Bai; Cong Liu; Da Chen; Dagang Li; Gang Li; He Wang; Jinglin Zhang; Qinghui Chen; Shengyong Chen; Shuainan Jing

arxiv: 2606.07953 · v1 · pith:C4VXY7WAnew · submitted 2026-06-06 · 💻 cs.AI

Unification of Closed-Open Industrial Detection Scenarios: New Large-Scale Benchmarks,Challenges and Baselines

Zekai Zhang , Jinglin Zhang , Qinghui Chen , Gang Li , Da Chen , Shuainan Jing , He Wang , Dagang Li

show 3 more authors

Cong Liu Cong Bai Shengyong Chen

This is my paper

Pith reviewed 2026-06-27 20:06 UTC · model grok-4.3

classification 💻 cs.AI

keywords industrial defect detectionvisual-language modelsopen-vocabulary detectionclosed-set detectionlarge-scale benchmarkprompt refinementdomain adaptation

0 comments

The pith

A new million-sample benchmark unifies open-vocabulary and closed-set industrial defect detection, and RTVPNet adapts vision-language models to it with automatic domain projection and prompt refinement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates MMIOC-1M, a dataset of over one million images spanning 14 super-categories, 29 scenes, and 351 defect types, to fill the gap in large-scale industrial data for visual-language models. It introduces RTVPNet, which uses expert-assisted projection to shift general models into industrial domains, energy-based sampling to create visual prompts automatically, and bidirectional interaction to align text and image features more precisely. These elements let the model handle both open and closed detection scenarios without relying on manual points or boxes. The result is reported state-of-the-art accuracy on the new benchmark as well as on LVIS and COCO while keeping computation efficient.

Core claim

The central claim is that expert-assisted domain projection, energy-based sparse sampling, and bidirectional text-visual interaction together allow a refined text-visual prompt network to adapt large vision-language models to industrial defect detection, achieving state-of-the-art results on the first unified open-closed benchmark of more than one million samples across 351 subcategories.

What carries the argument

RTVPNet, whose expert-assisted domain projection adapts general vision models, energy-based sparse sampling generates refined visual prompts, and bidirectional text-visual interaction improves cross-modal alignment.

If this is right

LVLMs can now be pre-trained on large industrial data that covers both open-vocabulary and closed-set detection.
Manual prompt engineering is no longer required for fine-grained industrial inspection tasks.
The same architecture improves results on standard natural-image benchmarks such as LVIS and COCO.
A single model can switch between open and closed detection modes in factory settings without retraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same projection and sampling approach could be tested on other narrow domains such as medical imaging or satellite analysis.
The benchmark size may allow systematic study of how defect rarity affects open-vocabulary performance.
Real-time factory deployment would require measuring whether the reported efficiency holds under continuous video streams.

Load-bearing premise

The three proposed components and the new benchmark produce real generalization across industrial scenes rather than overfitting to the collected data or the chosen 29 scenes.

What would settle it

Performance of RTVPNet falling below existing baselines when evaluated on an industrial dataset drawn from scenes and defect types completely outside the 29 scenes and 351 subcategories in MMIOC-1M.

Figures

Figures reproduced from arXiv: 2606.07953 by Cong Bai, Cong Liu, Da Chen, Dagang Li, Gang Li, He Wang, Jinglin Zhang, Qinghui Chen, Shengyong Chen, Shuainan Jing, Zekai Zhang.

**Figure 2.** Figure 2: MMIOC-1M category attribute analysis. MMIOC-1M contains 14 supercategories, 29 scene categories, and 351 subcategories. To our [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Subcategories of some scene categories in MMIOC-1M. Each [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Label calibration process. According to the category properties [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 5.** Figure 5: Comparison of MMIOC-1M with other industrial datasets. (a) MMIOC-1M has significant advantages in categories and quantity. (b) MMIOC [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: The distributions over each category in the MMIOC-1M. It can be [PITH_FULL_IMAGE:figures/full_fig_p006_6.png] view at source ↗

**Figure 8.** Figure 8: Challenging examples in MMIOC-1M. MMIOC-1M has the char [PITH_FULL_IMAGE:figures/full_fig_p006_8.png] view at source ↗

**Figure 9.** Figure 9: RTVPNet framework. RTVPNet contains three basic tasks: Visual Grounding, Vision Question Answering, and Object Detection. (a) Domain [PITH_FULL_IMAGE:figures/full_fig_p007_9.png] view at source ↗

**Figure 10.** Figure 10: T-SNE visualization in MMIOC-1M open scenes. Compared with Ground Truth and other methods, RTVPNet can generate the most compact [PITH_FULL_IMAGE:figures/full_fig_p012_10.png] view at source ↗

**Figure 11.** Figure 11: Comparison of detection results in open and closed scenarios of MMIOC-1M. RTVPNet has significant advantages in closed and open [PITH_FULL_IMAGE:figures/full_fig_p013_11.png] view at source ↗

**Figure 12.** Figure 12: Visual Question Answering results on MMIOC-1M. The combi [PITH_FULL_IMAGE:figures/full_fig_p013_12.png] view at source ↗

**Figure 13.** Figure 13: Refined visual prompts visualization. The yellow area represents [PITH_FULL_IMAGE:figures/full_fig_p015_13.png] view at source ↗

**Figure 14.** Figure 14: Visualization of the Visual-Text prompt bidirectional interaction. [PITH_FULL_IMAGE:figures/full_fig_p015_14.png] view at source ↗

read the original abstract

Large-scale Visual-Language Models (LVLMs) have achieved remarkable success in natural visual tasks, yet their application to industrial defect detection remains challenging due to two fundamental limitations: (i) the scarcity of large-scale industrial datasets that cover diverse defect categories across multiple domains, and (ii) the reliance on manual prompts (points, boxes, masks) that introduce subjective noise and lack text-visual interaction for fine-grained understanding. To address these challenges, we introduce a Large-Scale Multi-Modal Industrial Open-Closed benchmark (MMIOC-1M) containing over one million samples across $14$ super-categories, $29$ industrial scenes, and $351$ defect subcategories. To our knowledge, MMIOC-1M is the first unified largest benchmark supporting both open-vocabulary and closed-set industrial detection, providing valuable pre-training data for LVLMs in industrial scenarios. Furthermore, we propose a Refined Text-Visual Prompt Network (RTVPNet) that incorporates three key innovations: (1) an expert-assisted domain projection mechanism that enables rapid adaptation of general vision models to industrial domains, (2) an energy-based sparse sampling strategy that automatically generates refined visual prompts without manual intervention, and (3) a bidirectional text-visual interaction module that enhances cross-modal semantic alignment and understanding. Extensive experiments demonstrate that RTVPNet achieves state-of-the-art performance on MMIOC-1M, LVIS, and COCO benchmarks while maintaining computational efficiency. The dataset and code are available at https://github.com/hellozzk/MMIO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MMIOC-1M is a practical new industrial benchmark, but the RTVPNet transfer claims to LVIS and COCO rest on the weakest link.

read the letter

The paper's main deliverable is MMIOC-1M, a dataset of over one million samples spanning 14 super-categories, 29 industrial scenes, and 351 defect subcategories. It is positioned as the first large resource that supports both open-vocabulary and closed-set industrial detection at once. The authors also describe RTVPNet, which adds an expert-assisted domain projection, energy-based sparse sampling, and a bidirectional text-visual interaction module.

The benchmark itself is the part that holds up. Collecting data at this scale in a real manufacturing setting is non-trivial work, and a unified resource like this can serve as pre-training material or a standard testbed for people building inspection systems. Releasing the data and code is the right move.

The model claims are the soft spot. The abstract states that the same three components produce state-of-the-art results on LVIS and COCO while staying efficient. Those components read as if they were shaped around the 29 industrial scenes, so it is not clear they remain helpful once the domain shift disappears. The stress-test note flags exactly this point, and nothing in the provided text supplies ablations or protocol details that would settle it. Without those numbers it is hard to separate genuine architectural gains from other factors.

This is a paper for applied computer-vision groups that work on industrial defect detection. The benchmark alone gives it enough weight to reach peer review, even if the model section will need tightening on the transfer results.

Referee Report

1 major / 0 minor

Summary. The paper introduces the MMIOC-1M benchmark containing over one million samples across 14 super-categories, 29 industrial scenes, and 351 defect subcategories to support both open-vocabulary and closed-set industrial defect detection. It proposes RTVPNet incorporating expert-assisted domain projection, energy-based sparse sampling, and bidirectional text-visual interaction, claiming that this model achieves state-of-the-art performance on MMIOC-1M as well as on the LVIS and COCO benchmarks while remaining computationally efficient. The dataset and code are released publicly.

Significance. If the results hold, the work supplies the first large-scale unified benchmark for industrial scenarios and a concrete adaptation strategy for LVLMs that avoids manual prompts. Public release of data and code is a clear strength that could accelerate research on domain-specific vision-language applications in manufacturing.

major comments (1)

[Abstract / Experiments] The headline SOTA claim on LVIS and COCO rests on the assertion that the three RTVPNet components produce measurable gains outside the industrial domain. The abstract states that 'extensive experiments demonstrate' this performance but supplies no ablation tables, performance deltas, or component-removal results on LVIS/COCO; without such evidence the generalization step remains unsecured.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the generalization claims require stronger supporting evidence and will revise the manuscript to address this.

read point-by-point responses

Referee: [Abstract / Experiments] The headline SOTA claim on LVIS and COCO rests on the assertion that the three RTVPNet components produce measurable gains outside the industrial domain. The abstract states that 'extensive experiments demonstrate' this performance but supplies no ablation tables, performance deltas, or component-removal results on LVIS/COCO; without such evidence the generalization step remains unsecured.

Authors: We acknowledge the validity of this observation. The current manuscript presents component ablations and performance deltas primarily on the MMIOC-1M benchmark (Tables 3-5 and associated figures), while LVIS and COCO results focus on overall SOTA comparisons. To secure the generalization claim, we will add new ablation tables in the revised version showing the effect of removing each RTVPNet component (domain projection, sparse sampling, bidirectional interaction) on LVIS and COCO, including quantitative deltas relative to the full model. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark and model proposal with external validation

full rationale

The paper introduces the MMIOC-1M dataset and RTVPNet architecture (with expert-assisted projection, energy-based sampling, and bidirectional interaction) then reports empirical results on MMIOC-1M plus the independent external benchmarks LVIS and COCO. No derivation chain, first-principles predictions, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claims rest on measured performance rather than any reduction of outputs to inputs by construction, satisfying the self-contained-against-external-benchmarks criterion for score 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical model, free parameters, or invented physical entities are described in the abstract; the contribution is a new dataset and an empirical neural architecture.

pith-pipeline@v0.9.1-grok · 5843 in / 1140 out tokens · 17337 ms · 2026-06-27T20:06:01.152765+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

98 extracted references · 12 linked inside Pith

[1]

Yolov7: Trainable bag-of-freebies sets new state-of- the-art for real-time object detectors,

C.-Y. Wang, “Yolov7: Trainable bag-of-freebies sets new state-of- the-art for real-time object detectors,” inCVPR, 2023, pp. 7464– 7475

2023
[2]

Yolox: Exceeding yolo series in 2021,

Z. Ge and S. Liu, “Yolox: Exceeding yolo series in 2021,”arXiv preprint arXiv:2107.08430, 2021

Pith/arXiv arXiv 2021
[3]

Yolov6: A single-stage object detection framework for industrial applications,

C. Li and L. Li, “Yolov6: A single-stage object detection framework for industrial applications,”arXiv preprint arXiv:2209.02976, 2022

arXiv 2022
[4]

Available: https://github.com/ultralytics/yolov8, 2023

ultralytics, “Yolov8,”[Online]. Available: https://github.com/ultralytics/yolov8, 2023

2023
[5]

Liteyolo-id: A lightweight object detection network for insulator defect detection,

D. Li and Y. Lu, “Liteyolo-id: A lightweight object detection network for insulator defect detection,”IEEE Trans. on Instru. and Measure., vol. 73, pp. 1–12, 2024

2024
[6]

Segment anything,

A. Kirillov and E. Mintun, “Segment anything,” inICCV, October 2023, pp. 4015–4026

2023
[7]

Fast segment anything,

X. Zhao and W. Ding, “Fast segment anything,” 2023. [Online]. Available: https://arxiv.org/abs/2306.12156

arXiv 2023
[8]

Faster segment anything: Towards lightweight sam for mobile applications,

C. Zhang, “Faster segment anything: Towards lightweight sam for mobile applications,”arXiv preprint arXiv:2306.14289, 2023

Pith/arXiv arXiv 2023
[9]

Visual instruction tuning,

H. Liu and C. Li, “Visual instruction tuning,” vol. 36, 2024

2024
[10]

Grounding dino: Marrying dino with grounded pre-training for open-set object detection,

S. Liu and Z. Zeng, “Grounding dino: Marrying dino with grounded pre-training for open-set object detection,” inECCV. Springer, 2024, pp. 38–55

2024
[11]

S2dbft: Spectral-spatial dual-branch fusion trans- former for hyperspectral image classification,

Y. Zhang, Z. Wang, M. Huang, M. Li, J. Zhang, S. Wang, J. Zhang, and H. Zhang, “S2dbft: Spectral-spatial dual-branch fusion trans- former for hyperspectral image classification,”IEEE Transactions on Geoscience and Remote Sensing, 2025

2025
[12]

Implementation of motion estimation based on heterogeneous parallel computing system with opencl,

J. Zhang, J.-F. Nezan, and J.-G. Cousin, “Implementation of motion estimation based on heterogeneous parallel computing system with opencl,” in2012 IEEE 14th International Conference on High Performance Computing and Communication & 2012 IEEE 9th Inter- national Conference on Embedded Software and Systems. IEEE, 2012, pp. 41–45

2012
[13]

3d octave and 2d vanilla mixed convolutional neural network for hyperspectral image classification with limited samples,

Y. Feng, J. Zheng, M. Qin, C. Bai, and J. Zhang, “3d octave and 2d vanilla mixed convolutional neural network for hyperspectral image classification with limited samples,”Remote Sensing, vol. 13, no. 21, p. 4407, 2021. UNIFICATION OF CLOSED-OPEN INDUSTRIAL DETECTION SCENARIOS: NEW LARGE-SCALE BENCHMARKS, CHALLENGES AND BASELINES 17

2021
[14]

Learning vertex representations for bipartite networks,

M. Gao, X. He, L. Chen, T. Liu, J. Zhang, and A. Zhou, “Learning vertex representations for bipartite networks,”IEEE transactions on knowledge and data engineering, vol. 34, no. 1, pp. 379–393, 2020

2020
[15]

Multi-granularity episodic contrastive learning for few-shot learning,

P . Zhu, Z. Zhu, Y. Wang, J. Zhang, and S. Zhao, “Multi-granularity episodic contrastive learning for few-shot learning,”Pattern Recog- nition, vol. 131, p. 108820, 2022

2022
[16]

A novel ground-based cloud image segmentation method by using deep transfer learning,

Z. Zhou, F. Zhang, H. Xiao, F. Wang, X. Hong, K. Wu, and J. Zhang, “A novel ground-based cloud image segmentation method by using deep transfer learning,”IEEE Geoscience and Remote Sensing Letters, vol. 19, pp. 1–5, 2021

2021
[17]

Ensemble meteorological cloud classifica- tion meets internet of dependable and controllable things,

J. Zhang, P . Liu, F. Zhang, H. Iwabuchi, A. A. d. H. e Ayres, V . H. C. De Albuquerqueet al., “Ensemble meteorological cloud classifica- tion meets internet of dependable and controllable things,”IEEE Internet of Things Journal, vol. 8, no. 5, pp. 3323–3330, 2020

2020
[18]

Automated cca-mwf algorithm for unsuper- vised identification and removal of eog artifacts from eeg,

M. Miao, W. Hu, B. Xu, J. Zhang, J. J. Rodrigues, and V . H. C. De Albuquerque, “Automated cca-mwf algorithm for unsuper- vised identification and removal of eog artifacts from eeg,”IEEE Journal of Biomedical and Health Informatics, vol. 26, no. 8, pp. 3607– 3617, 2021

2021
[19]

Supervised learning based discrete hashing for image retrieval,

Q. Ma, C. Bai, J. Zhang, Z. Liu, and S. Chen, “Supervised learning based discrete hashing for image retrieval,”Pattern Recognition, vol. 92, pp. 156–164, 2019

2019
[20]

Clothing sale forecasting by a composite gru–prophet model with an attention mechanism,

Y. Li, Y. Yang, K. Zhu, and J. Zhang, “Clothing sale forecasting by a composite gru–prophet model with an attention mechanism,” IEEE Transactions on Industrial Informatics, vol. 17, no. 12, pp. 8335– 8344, 2021

2021
[21]

Distilled large language model-driven dynamic sparse expert activation mechanism,

Q. Chen, Z. Zhang, Z. Zhang, K. Zhang, D. Li, W. Wang, J. Zhang, and C. Liu, “Distilled large language model-driven dynamic sparse expert activation mechanism,”Applied Soft Computing, p. 114037, 2025

2025
[22]

Dual-path aggregation transformer network for super-resolution with images occlusions and variabil- ity,

Q. Chen, L. Wang, Z. Zhang, X. Wang, W. Liu, B. Xia, H. Ding, J. Zhang, S. Xu, and X. Wang, “Dual-path aggregation transformer network for super-resolution with images occlusions and variabil- ity,”Engineering Applications of Artificial Intelligence, vol. 139, no. PartA, 2025

2025
[23]

Kftd: Koopman- fourier time-differentiable network for continuous ocean spa- tiotemporal forecasting,

Q. Chen, Z. Zhang, H. Liu, J. Zhang, and C. Bai, “Kftd: Koopman- fourier time-differentiable network for continuous ocean spa- tiotemporal forecasting,” inProceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 1, 2026, pp. 94–103

2026
[24]

A novel dataset and lightweight distillation baseline for highlight transparent object detection,

Z. Zhang, G. Li, H. Zhang, Q. Chen, Q. Zhang, J. Wan, M. Xiong, C. Bai, D. Li, W. Zhanget al., “A novel dataset and lightweight distillation baseline for highlight transparent object detection,” International Journal of Computer Vision, vol. 134, no. 4, p. 157, 2026

2026
[25]

Idd- net: Industrial defect detection method based on deep-learning,

Z. Zhang, M. Zhou, H. Wan, M. Li, G. Li, and D. Han, “Idd- net: Industrial defect detection method based on deep-learning,” Engineering Applications of Artificial Intelligence, vol. 123, p. 106390, 2023

2023
[26]

Zero-shot learning in industrial scenarios: New large-scale benchmark, challenges and baseline,

Z. Zhang, Q. Chen, M. Xiong, S. Ding, Z. Su, X. Yao, Y. Sun, C. Bai, and J. Zhang, “Zero-shot learning in industrial scenarios: New large-scale benchmark, challenges and baseline,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 10, 2025, pp. 10 357–10 366

2025
[27]

Representation learning based on co- evolutionary combined with probability distribution optimization for precise defect location,

J. Zhang, Z. Zhang, Q. Chen, G. Li, W. Li, S. Ding, M. Xiong, W. Zhang, and S. Chen, “Representation learning based on co- evolutionary combined with probability distribution optimization for precise defect location,”IEEE Transactions on Neural Networks and Learning Systems, vol. 36, no. 7, pp. 11 989–12 003, 2024

2024
[28]

Unification of closed-open industrial detection scenarios: New large-scale benchmarks, challenges and baselines,

Z. Zhang, J. Zhang, Q. Chen, G. Li, D. Chen, S. Jing, H. Wang, D. Li, C. Liu, C. Baiet al., “Unification of closed-open industrial detection scenarios: New large-scale benchmarks, challenges and baselines,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026

2026
[29]

Distillation-based fabric anomaly detection,

S. Thomine and H. Snoussi, “Distillation-based fabric anomaly detection,”Textile Research Journal, vol. 94, no. 5-6, pp. 552–565, 2024

2024
[30]

Deep learning for medical anomaly detection–a survey,

T. Fernando and H. Gammulle, “Deep learning for medical anomaly detection–a survey,”ACM Computing Surveys (CSUR), vol. 54, no. 7, pp. 1–37, 2021

2021
[31]

Fdsnet: An accurate real-time surface defect segmentation network,

J. Zhang and R. Ding, “Fdsnet: An accurate real-time surface defect segmentation network,” inICASSP, 2022, pp. 3803–3807

2022
[32]

Matcher: Segment anything with one shot us- ing all-purpose feature matching,

Y. Liu and M. Zhu, “Matcher: Segment anything with one shot us- ing all-purpose feature matching,”arXiv preprint arXiv:2305.13310, 2023

arXiv 2023
[33]

Personalize segment anything model with one shot,

R. Zhang and Z. Jiang, “Personalize segment anything model with one shot,”arXiv preprint arXiv:2305.03048, 2023

arXiv 2023
[34]

Aggregated residual transformations for deep neural networks,

S. Xie and R. Girshick, “Aggregated residual transformations for deep neural networks,” inCVPR, July 2017

2017
[35]

Deep residual learning for image recogni- tion,

K. He and X. Zhang, “Deep residual learning for image recogni- tion,” inCVPR, 2016, pp. 770–778

2016
[36]

Cpt: Colorful prompt tuning for pre-trained vision-language models,

Y. Yao and A. Zhang, “Cpt: Colorful prompt tuning for pre-trained vision-language models,”AI Open, vol. 5, pp. 30–38, 2024

2024
[37]

Reclip: A strong zero-shot baseline for referring expression comprehension,

S. Subramanian, “Reclip: A strong zero-shot baseline for referring expression comprehension,”arXiv preprint arXiv:2204.05991, 2022

arXiv 2022
[38]

How to efficiently adapt large segmentation model (sam) to medical images,

X. Hu and X. Xu, “How to efficiently adapt large segmentation model (sam) to medical images,”arXiv preprint arXiv:2306.13731, 2023

arXiv 2023
[39]

Cocoopter: Pre-train, prompt, and fine-tune the vision-language model for few-shot image classification,

J. Yan and Y. Xie, “Cocoopter: Pre-train, prompt, and fine-tune the vision-language model for few-shot image classification,”Int. J. Multi. Inform. Retri., vol. 12, no. 2, p. 27, 2023

2023
[40]

Yolo-world: Real-time open-vocabulary object detection,

T. Cheng and L. Song, “Yolo-world: Real-time open-vocabulary object detection,” inCVPR, 2024, pp. 16 901–16 911

2024
[41]

LVIS: A dataset for large vocabulary instance segmentation,

A. Gupta and P . Doll ´ar, “LVIS: A dataset for large vocabulary instance segmentation,”CoRR, vol. abs/1908.03195, 2019

arXiv 1908
[42]

Microsoft coco: Common objects in context,

T.-Y. Lin and M. Maire, “Microsoft coco: Common objects in context,” inECCV, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, Eds., vol. 8693, 2014, pp. 740–755

2014
[43]

Zero-shot learning in industrial scenarios: New large-scale benchmark, challenges and baseline,

Z. Zhang and Q. Chen, “Zero-shot learning in industrial scenarios: New large-scale benchmark, challenges and baseline,” inAAAI, vol. 39, no. 10, Apr. 2025, pp. 10 357–10 366. [Online]. Available: https://ojs.aaai.org/index.php/AAAI/article/view/33124

2025
[44]

Mmad: The first-ever comprehensive bench- mark for multimodal large language models in industrial anomaly detection,

X. Jiang and J. Li, “Mmad: The first-ever comprehensive bench- mark for multimodal large language models in industrial anomaly detection,”arXiv preprint arXiv:2410.09453, 2024

arXiv 2024
[45]

Defect spectrum: a granular look of large- scale defect datasets with rich semantics,

S. Yang and Z. Chen, “Defect spectrum: a granular look of large- scale defect datasets with rich semantics,” inECCV. Springer, 2024, pp. 187–203

2024
[46]

Vision datasets: A benchmark for vision-based industrial inspection,

H. Bai and S. Mou, “Vision datasets: A benchmark for vision-based industrial inspection,”arXiv preprint arXiv:2306.07890, 2023

arXiv 2023
[47]

Pku-goodsad: A supermarket goods dataset for unsupervised anomaly detection and segmentation,

J. Zhang and R. Ding, “Pku-goodsad: A supermarket goods dataset for unsupervised anomaly detection and segmentation,” IEEE Robot. and Auto. Lett., vol. 9, no. 3, pp. 2008–2015, 2024

2008
[48]

Mvtec ad–a comprehensive real- world dataset for unsupervised anomaly detection,

P . Bergmann and M. Fauser, “Mvtec ad–a comprehensive real- world dataset for unsupervised anomaly detection,” inCVPR, 2019, pp. 9592–9600

2019
[49]

Spot-the-difference self-supervised pre- training for anomaly detection and segmentation,

Y. Zou and J. Jeong, “Spot-the-difference self-supervised pre- training for anomaly detection and segmentation,” inECCV. Springer, 2022, pp. 392–408

2022
[50]

Real-iad: A real-world multi-view dataset for benchmarking versatile industrial anomaly detection,

C. Wang and W. Zhu, “Real-iad: A real-world multi-view dataset for benchmarking versatile industrial anomaly detection,” in CVPR, 2024, pp. 22 883–22 892

2024
[51]

Multi-sensor object anomaly detection: Uni- fying appearance, geometry, and internal properties,

W. Li and B. Zheng, “Multi-sensor object anomaly detection: Uni- fying appearance, geometry, and internal properties,” inCVPR, 2025, pp. 9984–9993

2025
[52]

3cad: A large-scale real-world 3c product dataset for unsupervised anomaly detection,

E. Yang and P . Xing, “3cad: A large-scale real-world 3c product dataset for unsupervised anomaly detection,” inAAAI, vol. 39, no. 9, 2025, pp. 9175–9183

2025
[53]

Pixel-level contrastive pretrainer for indus- trial image representation,

B. Zhu and Y. Chen, “Pixel-level contrastive pretrainer for indus- trial image representation,”IEEE Trans. on Instru. and Measure., vol. 73, 2024

2024
[54]

Vt-adl: A vision transformer network for image anomaly detection and localization,

P . Mishra and R. Verk, “Vt-adl: A vision transformer network for image anomaly detection and localization,” inISIE. IEEE, 2021, pp. 01–06

2021
[55]

Gpt-4 technical report,

J. Achiam and S. Adler, “Gpt-4 technical report,”arXiv preprint arXiv:2303.08774, 2023

Pith/arXiv arXiv 2023
[56]

Learning transferable visual models from natural language supervision,

A. Radford and J. W. Kim, “Learning transferable visual models from natural language supervision,” inICML. PMLR, 2021, pp. 8748–8763

2021
[57]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,

J. Li and D. Li, “Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” inICML. PMLR, 2023, pp. 19 730–19 742

2023
[58]

Application of segment anything model for civil infrastructure defect assessment,

M. Ahmadi and A. G. Lonbar, “Application of segment anything model for civil infrastructure defect assessment,”arXiv preprint arXiv:2304.12600, 2023

arXiv 2023
[59]

Medical sam adapter: Adapting segment anything model for medical image segmentation,

J. Wu and Z. Wang, “Medical sam adapter: Adapting segment anything model for medical image segmentation,”Medical image analysis, vol. 102, p. 103547, 2025

2025
[60]

Eviprompt: A training-free evidential prompt generation method for segment anything model in medical im- ages,

Y. Xu and J. Tang, “Eviprompt: A training-free evidential prompt generation method for segment anything model in medical im- ages,”arXiv preprint arXiv:2311.06400, 2023

arXiv 2023
[61]

Adaptershadow: Adapting seg- ment anything model for shadow detection,

L. Jie and H. Zhang, “Adaptershadow: Adapting seg- ment anything model for shadow detection,”arXiv preprint arXiv:2311.08891, 2023. UNIFICATION OF CLOSED-OPEN INDUSTRIAL DETECTION SCENARIOS: NEW LARGE-SCALE BENCHMARKS, CHALLENGES AND BASELINES 18

arXiv 2023
[62]

Autoprompt: Eliciting knowledge from language models with automatically generated prompts,

T. Shin and Y. Razeghi, “Autoprompt: Eliciting knowledge from language models with automatically generated prompts,”arXiv preprint arXiv:2010.15980, 2020

arXiv 2010
[63]

Fine-grained visual prompting,

L. Yang and Y. Wang, “Fine-grained visual prompting,”NeurIPS, vol. 36, 2024

2024
[64]

Vrp-sam: Sam with visual reference prompt,

Y. Sun and J. Chen, “Vrp-sam: Sam with visual reference prompt,” inCVPR, 2024, pp. 23 565–23 574

2024
[65]

Yolo-uniow: Efficient universal open-world object detection,

L. Liu and J. Feng, “Yolo-uniow: Efficient universal open-world object detection,”arXiv preprint arXiv:2412.20645, 2024

arXiv 2024
[66]

Relational proxies: Fine-grained relationships as zero-shot discriminators,

A. Chaudhuri and M. Mancini, “Relational proxies: Fine-grained relationships as zero-shot discriminators,”IEEE Trans. on Pattern Anal. and Mach. Intell., 2024

2024
[67]

Segment anything is not always perfect: An investigation of sam on different real-world applications,

W. Ji and J. Li, “Segment anything is not always perfect: An investigation of sam on different real-world applications,”Machine Intelligence Research, vol. 21, pp. 617–630, 2024

2024
[68]

Masked autoencoders are scalable vision learners,

K. He and X. Chen, “Masked autoencoders are scalable vision learners,” inCVPR, 2022, pp. 16 000–16 009

2022
[69]

Distance-iou loss: Faster and better learning for bounding box regression,

Z. Zheng and P . Wang, “Distance-iou loss: Faster and better learning for bounding box regression,” inAAAI, vol. 34, no. 07, 2020, pp. 12 993–13 000

2020
[70]

Representation learning with contrastive predictive coding,

A. v. d. Oord and Y. Li, “Representation learning with contrastive predictive coding,”arXiv preprint arXiv:1807.03748, 2018

Pith/arXiv arXiv 2018
[71]

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,

D. Guo and D. Yang, “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,”arXiv preprint arXiv:2501.12948, 2025

Pith/arXiv arXiv 2025
[72]

A light-weight framework for open-set object detection with decoupled feature alignment in joint space,

Y. He and H. Su, “A light-weight framework for open-set object detection with decoupled feature alignment in joint space,” 2024. [Online]. Available: https://arxiv.org/abs/2412.14680

arXiv 2024
[73]

Mamba-yolo-world: Marrying yolo-world with mamba for open-vocabulary detection,

H. Wang and Q. He, “Mamba-yolo-world: Marrying yolo-world with mamba for open-vocabulary detection,” inICASSP, 2025, pp. 1–5

2025
[74]

Qwen2.5-vl technical report,

S. Bai and K. Chen, “Qwen2.5-vl technical report,” 2025. [Online]. Available: arXivpreprintarXiv:2502.13923

Pith/arXiv arXiv 2025
[75]

Qwen3-omni technical report,

J. Xu and Z. Guo, “Qwen3-omni technical report,” 2025. [Online]. Available: arXivpreprintarXiv:2505.09388

Pith/arXiv arXiv 2025
[76]

Large-scale visual language model boosted by contrast domain adaptation for intelligent industrial visual monitoring,

H. Wang and C. Li, “Large-scale visual language model boosted by contrast domain adaptation for intelligent industrial visual monitoring,”IEEE Trans. on Indus. Infor., vol. 20, no. 12, pp. 14 114– 14 123, 2024

2024
[77]

Llava-next-interleave: Tackling multi-image, video, and 3d in large multimodal models,

F. Li, “Llava-next-interleave: Tackling multi-image, video, and 3d in large multimodal models,” 2025. [Online]. Available: arXivpreprintarXiv:2407.07895

Pith/arXiv arXiv 2025
[78]

Llava-onevision: Easy visual task transfer,

B. Li and Y. Zhang, “Llava-onevision: Easy visual task transfer,”
[79]

Available: arXivpreprintarXiv:2408.03326

[Online]. Available: arXivpreprintarXiv:2408.03326

Pith/arXiv arXiv
[80]

Yolov11: An overview of the key architectural enhancements,

R. Khanam and M. Hussain, “Yolov11: An overview of the key architectural enhancements,” 2024. [Online]. Available: https://arxiv.org/abs/2410.17725

Pith/arXiv arXiv 2024

Showing first 80 references.

[1] [1]

Yolov7: Trainable bag-of-freebies sets new state-of- the-art for real-time object detectors,

C.-Y. Wang, “Yolov7: Trainable bag-of-freebies sets new state-of- the-art for real-time object detectors,” inCVPR, 2023, pp. 7464– 7475

2023

[2] [2]

Yolox: Exceeding yolo series in 2021,

Z. Ge and S. Liu, “Yolox: Exceeding yolo series in 2021,”arXiv preprint arXiv:2107.08430, 2021

Pith/arXiv arXiv 2021

[3] [3]

Yolov6: A single-stage object detection framework for industrial applications,

C. Li and L. Li, “Yolov6: A single-stage object detection framework for industrial applications,”arXiv preprint arXiv:2209.02976, 2022

arXiv 2022

[4] [4]

Available: https://github.com/ultralytics/yolov8, 2023

ultralytics, “Yolov8,”[Online]. Available: https://github.com/ultralytics/yolov8, 2023

2023

[5] [5]

Liteyolo-id: A lightweight object detection network for insulator defect detection,

D. Li and Y. Lu, “Liteyolo-id: A lightweight object detection network for insulator defect detection,”IEEE Trans. on Instru. and Measure., vol. 73, pp. 1–12, 2024

2024

[6] [6]

Segment anything,

A. Kirillov and E. Mintun, “Segment anything,” inICCV, October 2023, pp. 4015–4026

2023

[7] [7]

Fast segment anything,

X. Zhao and W. Ding, “Fast segment anything,” 2023. [Online]. Available: https://arxiv.org/abs/2306.12156

arXiv 2023

[8] [8]

Faster segment anything: Towards lightweight sam for mobile applications,

C. Zhang, “Faster segment anything: Towards lightweight sam for mobile applications,”arXiv preprint arXiv:2306.14289, 2023

Pith/arXiv arXiv 2023

[9] [9]

Visual instruction tuning,

H. Liu and C. Li, “Visual instruction tuning,” vol. 36, 2024

2024

[10] [10]

Grounding dino: Marrying dino with grounded pre-training for open-set object detection,

S. Liu and Z. Zeng, “Grounding dino: Marrying dino with grounded pre-training for open-set object detection,” inECCV. Springer, 2024, pp. 38–55

2024

[11] [11]

S2dbft: Spectral-spatial dual-branch fusion trans- former for hyperspectral image classification,

Y. Zhang, Z. Wang, M. Huang, M. Li, J. Zhang, S. Wang, J. Zhang, and H. Zhang, “S2dbft: Spectral-spatial dual-branch fusion trans- former for hyperspectral image classification,”IEEE Transactions on Geoscience and Remote Sensing, 2025

2025

[12] [12]

Implementation of motion estimation based on heterogeneous parallel computing system with opencl,

J. Zhang, J.-F. Nezan, and J.-G. Cousin, “Implementation of motion estimation based on heterogeneous parallel computing system with opencl,” in2012 IEEE 14th International Conference on High Performance Computing and Communication & 2012 IEEE 9th Inter- national Conference on Embedded Software and Systems. IEEE, 2012, pp. 41–45

2012

[13] [13]

3d octave and 2d vanilla mixed convolutional neural network for hyperspectral image classification with limited samples,

Y. Feng, J. Zheng, M. Qin, C. Bai, and J. Zhang, “3d octave and 2d vanilla mixed convolutional neural network for hyperspectral image classification with limited samples,”Remote Sensing, vol. 13, no. 21, p. 4407, 2021. UNIFICATION OF CLOSED-OPEN INDUSTRIAL DETECTION SCENARIOS: NEW LARGE-SCALE BENCHMARKS, CHALLENGES AND BASELINES 17

2021

[14] [14]

Learning vertex representations for bipartite networks,

M. Gao, X. He, L. Chen, T. Liu, J. Zhang, and A. Zhou, “Learning vertex representations for bipartite networks,”IEEE transactions on knowledge and data engineering, vol. 34, no. 1, pp. 379–393, 2020

2020

[15] [15]

Multi-granularity episodic contrastive learning for few-shot learning,

P . Zhu, Z. Zhu, Y. Wang, J. Zhang, and S. Zhao, “Multi-granularity episodic contrastive learning for few-shot learning,”Pattern Recog- nition, vol. 131, p. 108820, 2022

2022

[16] [16]

A novel ground-based cloud image segmentation method by using deep transfer learning,

Z. Zhou, F. Zhang, H. Xiao, F. Wang, X. Hong, K. Wu, and J. Zhang, “A novel ground-based cloud image segmentation method by using deep transfer learning,”IEEE Geoscience and Remote Sensing Letters, vol. 19, pp. 1–5, 2021

2021

[17] [17]

Ensemble meteorological cloud classifica- tion meets internet of dependable and controllable things,

J. Zhang, P . Liu, F. Zhang, H. Iwabuchi, A. A. d. H. e Ayres, V . H. C. De Albuquerqueet al., “Ensemble meteorological cloud classifica- tion meets internet of dependable and controllable things,”IEEE Internet of Things Journal, vol. 8, no. 5, pp. 3323–3330, 2020

2020

[18] [18]

Automated cca-mwf algorithm for unsuper- vised identification and removal of eog artifacts from eeg,

M. Miao, W. Hu, B. Xu, J. Zhang, J. J. Rodrigues, and V . H. C. De Albuquerque, “Automated cca-mwf algorithm for unsuper- vised identification and removal of eog artifacts from eeg,”IEEE Journal of Biomedical and Health Informatics, vol. 26, no. 8, pp. 3607– 3617, 2021

2021

[19] [19]

Supervised learning based discrete hashing for image retrieval,

Q. Ma, C. Bai, J. Zhang, Z. Liu, and S. Chen, “Supervised learning based discrete hashing for image retrieval,”Pattern Recognition, vol. 92, pp. 156–164, 2019

2019

[20] [20]

Clothing sale forecasting by a composite gru–prophet model with an attention mechanism,

Y. Li, Y. Yang, K. Zhu, and J. Zhang, “Clothing sale forecasting by a composite gru–prophet model with an attention mechanism,” IEEE Transactions on Industrial Informatics, vol. 17, no. 12, pp. 8335– 8344, 2021

2021

[21] [21]

Distilled large language model-driven dynamic sparse expert activation mechanism,

Q. Chen, Z. Zhang, Z. Zhang, K. Zhang, D. Li, W. Wang, J. Zhang, and C. Liu, “Distilled large language model-driven dynamic sparse expert activation mechanism,”Applied Soft Computing, p. 114037, 2025

2025

[22] [22]

Dual-path aggregation transformer network for super-resolution with images occlusions and variabil- ity,

Q. Chen, L. Wang, Z. Zhang, X. Wang, W. Liu, B. Xia, H. Ding, J. Zhang, S. Xu, and X. Wang, “Dual-path aggregation transformer network for super-resolution with images occlusions and variabil- ity,”Engineering Applications of Artificial Intelligence, vol. 139, no. PartA, 2025

2025

[23] [23]

Kftd: Koopman- fourier time-differentiable network for continuous ocean spa- tiotemporal forecasting,

Q. Chen, Z. Zhang, H. Liu, J. Zhang, and C. Bai, “Kftd: Koopman- fourier time-differentiable network for continuous ocean spa- tiotemporal forecasting,” inProceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 1, 2026, pp. 94–103

2026

[24] [24]

A novel dataset and lightweight distillation baseline for highlight transparent object detection,

Z. Zhang, G. Li, H. Zhang, Q. Chen, Q. Zhang, J. Wan, M. Xiong, C. Bai, D. Li, W. Zhanget al., “A novel dataset and lightweight distillation baseline for highlight transparent object detection,” International Journal of Computer Vision, vol. 134, no. 4, p. 157, 2026

2026

[25] [25]

Idd- net: Industrial defect detection method based on deep-learning,

Z. Zhang, M. Zhou, H. Wan, M. Li, G. Li, and D. Han, “Idd- net: Industrial defect detection method based on deep-learning,” Engineering Applications of Artificial Intelligence, vol. 123, p. 106390, 2023

2023

[26] [26]

Zero-shot learning in industrial scenarios: New large-scale benchmark, challenges and baseline,

Z. Zhang, Q. Chen, M. Xiong, S. Ding, Z. Su, X. Yao, Y. Sun, C. Bai, and J. Zhang, “Zero-shot learning in industrial scenarios: New large-scale benchmark, challenges and baseline,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 10, 2025, pp. 10 357–10 366

2025

[27] [27]

Representation learning based on co- evolutionary combined with probability distribution optimization for precise defect location,

J. Zhang, Z. Zhang, Q. Chen, G. Li, W. Li, S. Ding, M. Xiong, W. Zhang, and S. Chen, “Representation learning based on co- evolutionary combined with probability distribution optimization for precise defect location,”IEEE Transactions on Neural Networks and Learning Systems, vol. 36, no. 7, pp. 11 989–12 003, 2024

2024

[28] [28]

Unification of closed-open industrial detection scenarios: New large-scale benchmarks, challenges and baselines,

Z. Zhang, J. Zhang, Q. Chen, G. Li, D. Chen, S. Jing, H. Wang, D. Li, C. Liu, C. Baiet al., “Unification of closed-open industrial detection scenarios: New large-scale benchmarks, challenges and baselines,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026

2026

[29] [29]

Distillation-based fabric anomaly detection,

S. Thomine and H. Snoussi, “Distillation-based fabric anomaly detection,”Textile Research Journal, vol. 94, no. 5-6, pp. 552–565, 2024

2024

[30] [30]

Deep learning for medical anomaly detection–a survey,

T. Fernando and H. Gammulle, “Deep learning for medical anomaly detection–a survey,”ACM Computing Surveys (CSUR), vol. 54, no. 7, pp. 1–37, 2021

2021

[31] [31]

Fdsnet: An accurate real-time surface defect segmentation network,

J. Zhang and R. Ding, “Fdsnet: An accurate real-time surface defect segmentation network,” inICASSP, 2022, pp. 3803–3807

2022

[32] [32]

Matcher: Segment anything with one shot us- ing all-purpose feature matching,

Y. Liu and M. Zhu, “Matcher: Segment anything with one shot us- ing all-purpose feature matching,”arXiv preprint arXiv:2305.13310, 2023

arXiv 2023

[33] [33]

Personalize segment anything model with one shot,

R. Zhang and Z. Jiang, “Personalize segment anything model with one shot,”arXiv preprint arXiv:2305.03048, 2023

arXiv 2023

[34] [34]

Aggregated residual transformations for deep neural networks,

S. Xie and R. Girshick, “Aggregated residual transformations for deep neural networks,” inCVPR, July 2017

2017

[35] [35]

Deep residual learning for image recogni- tion,

K. He and X. Zhang, “Deep residual learning for image recogni- tion,” inCVPR, 2016, pp. 770–778

2016

[36] [36]

Cpt: Colorful prompt tuning for pre-trained vision-language models,

Y. Yao and A. Zhang, “Cpt: Colorful prompt tuning for pre-trained vision-language models,”AI Open, vol. 5, pp. 30–38, 2024

2024

[37] [37]

Reclip: A strong zero-shot baseline for referring expression comprehension,

S. Subramanian, “Reclip: A strong zero-shot baseline for referring expression comprehension,”arXiv preprint arXiv:2204.05991, 2022

arXiv 2022

[38] [38]

How to efficiently adapt large segmentation model (sam) to medical images,

X. Hu and X. Xu, “How to efficiently adapt large segmentation model (sam) to medical images,”arXiv preprint arXiv:2306.13731, 2023

arXiv 2023

[39] [39]

Cocoopter: Pre-train, prompt, and fine-tune the vision-language model for few-shot image classification,

J. Yan and Y. Xie, “Cocoopter: Pre-train, prompt, and fine-tune the vision-language model for few-shot image classification,”Int. J. Multi. Inform. Retri., vol. 12, no. 2, p. 27, 2023

2023

[40] [40]

Yolo-world: Real-time open-vocabulary object detection,

T. Cheng and L. Song, “Yolo-world: Real-time open-vocabulary object detection,” inCVPR, 2024, pp. 16 901–16 911

2024

[41] [41]

LVIS: A dataset for large vocabulary instance segmentation,

A. Gupta and P . Doll ´ar, “LVIS: A dataset for large vocabulary instance segmentation,”CoRR, vol. abs/1908.03195, 2019

arXiv 1908

[42] [42]

Microsoft coco: Common objects in context,

T.-Y. Lin and M. Maire, “Microsoft coco: Common objects in context,” inECCV, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, Eds., vol. 8693, 2014, pp. 740–755

2014

[43] [43]

Zero-shot learning in industrial scenarios: New large-scale benchmark, challenges and baseline,

Z. Zhang and Q. Chen, “Zero-shot learning in industrial scenarios: New large-scale benchmark, challenges and baseline,” inAAAI, vol. 39, no. 10, Apr. 2025, pp. 10 357–10 366. [Online]. Available: https://ojs.aaai.org/index.php/AAAI/article/view/33124

2025

[44] [44]

Mmad: The first-ever comprehensive bench- mark for multimodal large language models in industrial anomaly detection,

X. Jiang and J. Li, “Mmad: The first-ever comprehensive bench- mark for multimodal large language models in industrial anomaly detection,”arXiv preprint arXiv:2410.09453, 2024

arXiv 2024

[45] [45]

Defect spectrum: a granular look of large- scale defect datasets with rich semantics,

S. Yang and Z. Chen, “Defect spectrum: a granular look of large- scale defect datasets with rich semantics,” inECCV. Springer, 2024, pp. 187–203

2024

[46] [46]

Vision datasets: A benchmark for vision-based industrial inspection,

H. Bai and S. Mou, “Vision datasets: A benchmark for vision-based industrial inspection,”arXiv preprint arXiv:2306.07890, 2023

arXiv 2023

[47] [47]

Pku-goodsad: A supermarket goods dataset for unsupervised anomaly detection and segmentation,

J. Zhang and R. Ding, “Pku-goodsad: A supermarket goods dataset for unsupervised anomaly detection and segmentation,” IEEE Robot. and Auto. Lett., vol. 9, no. 3, pp. 2008–2015, 2024

2008

[48] [48]

Mvtec ad–a comprehensive real- world dataset for unsupervised anomaly detection,

P . Bergmann and M. Fauser, “Mvtec ad–a comprehensive real- world dataset for unsupervised anomaly detection,” inCVPR, 2019, pp. 9592–9600

2019

[49] [49]

Spot-the-difference self-supervised pre- training for anomaly detection and segmentation,

Y. Zou and J. Jeong, “Spot-the-difference self-supervised pre- training for anomaly detection and segmentation,” inECCV. Springer, 2022, pp. 392–408

2022

[50] [50]

Real-iad: A real-world multi-view dataset for benchmarking versatile industrial anomaly detection,

C. Wang and W. Zhu, “Real-iad: A real-world multi-view dataset for benchmarking versatile industrial anomaly detection,” in CVPR, 2024, pp. 22 883–22 892

2024

[51] [51]

Multi-sensor object anomaly detection: Uni- fying appearance, geometry, and internal properties,

W. Li and B. Zheng, “Multi-sensor object anomaly detection: Uni- fying appearance, geometry, and internal properties,” inCVPR, 2025, pp. 9984–9993

2025

[52] [52]

3cad: A large-scale real-world 3c product dataset for unsupervised anomaly detection,

E. Yang and P . Xing, “3cad: A large-scale real-world 3c product dataset for unsupervised anomaly detection,” inAAAI, vol. 39, no. 9, 2025, pp. 9175–9183

2025

[53] [53]

Pixel-level contrastive pretrainer for indus- trial image representation,

B. Zhu and Y. Chen, “Pixel-level contrastive pretrainer for indus- trial image representation,”IEEE Trans. on Instru. and Measure., vol. 73, 2024

2024

[54] [54]

Vt-adl: A vision transformer network for image anomaly detection and localization,

P . Mishra and R. Verk, “Vt-adl: A vision transformer network for image anomaly detection and localization,” inISIE. IEEE, 2021, pp. 01–06

2021

[55] [55]

Gpt-4 technical report,

J. Achiam and S. Adler, “Gpt-4 technical report,”arXiv preprint arXiv:2303.08774, 2023

Pith/arXiv arXiv 2023

[56] [56]

Learning transferable visual models from natural language supervision,

A. Radford and J. W. Kim, “Learning transferable visual models from natural language supervision,” inICML. PMLR, 2021, pp. 8748–8763

2021

[57] [57]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,

J. Li and D. Li, “Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” inICML. PMLR, 2023, pp. 19 730–19 742

2023

[58] [58]

Application of segment anything model for civil infrastructure defect assessment,

M. Ahmadi and A. G. Lonbar, “Application of segment anything model for civil infrastructure defect assessment,”arXiv preprint arXiv:2304.12600, 2023

arXiv 2023

[59] [59]

Medical sam adapter: Adapting segment anything model for medical image segmentation,

J. Wu and Z. Wang, “Medical sam adapter: Adapting segment anything model for medical image segmentation,”Medical image analysis, vol. 102, p. 103547, 2025

2025

[60] [60]

Eviprompt: A training-free evidential prompt generation method for segment anything model in medical im- ages,

Y. Xu and J. Tang, “Eviprompt: A training-free evidential prompt generation method for segment anything model in medical im- ages,”arXiv preprint arXiv:2311.06400, 2023

arXiv 2023

[61] [61]

Adaptershadow: Adapting seg- ment anything model for shadow detection,

L. Jie and H. Zhang, “Adaptershadow: Adapting seg- ment anything model for shadow detection,”arXiv preprint arXiv:2311.08891, 2023. UNIFICATION OF CLOSED-OPEN INDUSTRIAL DETECTION SCENARIOS: NEW LARGE-SCALE BENCHMARKS, CHALLENGES AND BASELINES 18

arXiv 2023

[62] [62]

Autoprompt: Eliciting knowledge from language models with automatically generated prompts,

T. Shin and Y. Razeghi, “Autoprompt: Eliciting knowledge from language models with automatically generated prompts,”arXiv preprint arXiv:2010.15980, 2020

arXiv 2010

[63] [63]

Fine-grained visual prompting,

L. Yang and Y. Wang, “Fine-grained visual prompting,”NeurIPS, vol. 36, 2024

2024

[64] [64]

Vrp-sam: Sam with visual reference prompt,

Y. Sun and J. Chen, “Vrp-sam: Sam with visual reference prompt,” inCVPR, 2024, pp. 23 565–23 574

2024

[65] [65]

Yolo-uniow: Efficient universal open-world object detection,

L. Liu and J. Feng, “Yolo-uniow: Efficient universal open-world object detection,”arXiv preprint arXiv:2412.20645, 2024

arXiv 2024

[66] [66]

Relational proxies: Fine-grained relationships as zero-shot discriminators,

A. Chaudhuri and M. Mancini, “Relational proxies: Fine-grained relationships as zero-shot discriminators,”IEEE Trans. on Pattern Anal. and Mach. Intell., 2024

2024

[67] [67]

Segment anything is not always perfect: An investigation of sam on different real-world applications,

W. Ji and J. Li, “Segment anything is not always perfect: An investigation of sam on different real-world applications,”Machine Intelligence Research, vol. 21, pp. 617–630, 2024

2024

[68] [68]

Masked autoencoders are scalable vision learners,

K. He and X. Chen, “Masked autoencoders are scalable vision learners,” inCVPR, 2022, pp. 16 000–16 009

2022

[69] [69]

Distance-iou loss: Faster and better learning for bounding box regression,

Z. Zheng and P . Wang, “Distance-iou loss: Faster and better learning for bounding box regression,” inAAAI, vol. 34, no. 07, 2020, pp. 12 993–13 000

2020

[70] [70]

Representation learning with contrastive predictive coding,

A. v. d. Oord and Y. Li, “Representation learning with contrastive predictive coding,”arXiv preprint arXiv:1807.03748, 2018

Pith/arXiv arXiv 2018

[71] [71]

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,

D. Guo and D. Yang, “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,”arXiv preprint arXiv:2501.12948, 2025

Pith/arXiv arXiv 2025

[72] [72]

A light-weight framework for open-set object detection with decoupled feature alignment in joint space,

Y. He and H. Su, “A light-weight framework for open-set object detection with decoupled feature alignment in joint space,” 2024. [Online]. Available: https://arxiv.org/abs/2412.14680

arXiv 2024

[73] [73]

Mamba-yolo-world: Marrying yolo-world with mamba for open-vocabulary detection,

H. Wang and Q. He, “Mamba-yolo-world: Marrying yolo-world with mamba for open-vocabulary detection,” inICASSP, 2025, pp. 1–5

2025

[74] [74]

Qwen2.5-vl technical report,

S. Bai and K. Chen, “Qwen2.5-vl technical report,” 2025. [Online]. Available: arXivpreprintarXiv:2502.13923

Pith/arXiv arXiv 2025

[75] [75]

Qwen3-omni technical report,

J. Xu and Z. Guo, “Qwen3-omni technical report,” 2025. [Online]. Available: arXivpreprintarXiv:2505.09388

Pith/arXiv arXiv 2025

[76] [76]

Large-scale visual language model boosted by contrast domain adaptation for intelligent industrial visual monitoring,

H. Wang and C. Li, “Large-scale visual language model boosted by contrast domain adaptation for intelligent industrial visual monitoring,”IEEE Trans. on Indus. Infor., vol. 20, no. 12, pp. 14 114– 14 123, 2024

2024

[77] [77]

Llava-next-interleave: Tackling multi-image, video, and 3d in large multimodal models,

F. Li, “Llava-next-interleave: Tackling multi-image, video, and 3d in large multimodal models,” 2025. [Online]. Available: arXivpreprintarXiv:2407.07895

Pith/arXiv arXiv 2025

[78] [78]

Llava-onevision: Easy visual task transfer,

B. Li and Y. Zhang, “Llava-onevision: Easy visual task transfer,”

[79] [79]

Available: arXivpreprintarXiv:2408.03326

[Online]. Available: arXivpreprintarXiv:2408.03326

Pith/arXiv arXiv

[80] [80]

Yolov11: An overview of the key architectural enhancements,

R. Khanam and M. Hussain, “Yolov11: An overview of the key architectural enhancements,” 2024. [Online]. Available: https://arxiv.org/abs/2410.17725

Pith/arXiv arXiv 2024