pith. machine review for the scientific record.

arxiv: 2603.21824 · v2 · submitted 2026-03-23 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links


SteelDefectX: A Multi-Form Vision-Language Dataset and Benchmark for Steel Surface Defect Analysis

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 00:40 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords steel surface defects · vision-language dataset · multi-form annotations · defect analysis benchmark · textual representations · semantic alignment · industrial vision · transfer learning

The pith

SteelDefectX supplies steel defect images with three text annotation styles to expose a trade-off between structured stability and natural language flexibility in vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SteelDefectX, a collection of 7,778 steel surface images spanning 25 defect categories, each paired with class-level information on names, visual attributes, and industrial causes along with per-image text in three formats: free natural language descriptions, structured attribute lists, and template sentences. This multi-form setup supports a benchmark that measures vision-language performance on classification, segmentation, cross-dataset transfer, retrieval, and text-guided localization. The central experimental result is that structured attributes produce steadier semantic matches while natural language text yields stronger transfer to new data and sharper defect localization. A reader would care because the work shows that annotation format itself is a controllable lever for making AI quality-control systems more reliable in factories without collecting larger image sets.
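To make "semantic alignment" concrete: vision-language classification of this kind typically scores an image embedding against one text embedding per defect class, so the wording of the class text directly shapes the decision boundary. A minimal illustrative sketch with toy embeddings (not the paper's actual models or data):

```python
# Illustrative sketch: zero-shot defect classification by cosine similarity
# between an image embedding and per-class text embeddings. The vectors here
# are toy values standing in for a real vision-language encoder's output.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def classify(image_emb, class_text_embs):
    """Return the class whose text embedding is most similar to the image."""
    return max(class_text_embs, key=lambda c: cosine(image_emb, class_text_embs[c]))

# Hypothetical class names and embeddings for illustration only.
class_text_embs = {
    "scratch":   [0.9, 0.1, 0.0],
    "inclusion": [0.1, 0.9, 0.2],
}
image_emb = [0.8, 0.2, 0.1]
print(classify(image_emb, class_text_embs))  # -> scratch
```

Swapping in a different annotation form changes only the text embeddings, which is exactly the controlled comparison the multi-form design enables.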

Core claim

We introduce SteelDefectX, a vision-language dataset with multi-form textual annotations for steel surface defect analysis, comprising 7,778 images across 25 defect categories. At the class level, the dataset provides defect names, representative visual attributes, and industrial causes. At the sample level, each image is annotated with three forms of textual representations: (1) free-form natural language descriptions, (2) structured attribute annotations, and (3) template-based sentences. These annotations provide flexible textual supervision with varying levels of expressiveness and controllability. We further establish a comprehensive benchmark covering vision-language classification, segmentation, and cross-dataset transfer, along with additional evaluations such as retrieval and text-guided localization.

What carries the argument

The multi-form textual annotation system that equips each image with natural language descriptions, structured attribute lists, and template sentences so that the same visual data can be used to isolate how levels of structure versus flexibility affect model alignment.
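One way to picture this mechanism is a per-sample record that carries all three textual forms alongside one image, so a training run can swap supervision styles without touching the visual data. A hypothetical layout (field names and paths are illustrative assumptions, not the dataset's actual schema):

```python
# Hypothetical annotation record: one image, three interchangeable text forms.
# All field names, the path convention, and the example values are assumptions
# for illustration, not SteelDefectX's real format.
sample = {
    "image": "images/patch_00042.png",
    "class_name": "rolled-in scale",
    "free_form": "An irregular dark patch pressed into the strip surface.",
    "attributes": {"shape": "irregular", "contrast": "dark", "texture": "rough"},
    "template": "A photo of a steel surface with rolled-in scale.",
}

def text_for(record, form):
    """Select one supervision style per experiment run."""
    if form == "attributes":
        # Serialize the structured attributes into a flat prompt string.
        return ", ".join(f"{k}: {v}" for k, v in record["attributes"].items())
    return record[form]

print(text_for(sample, "attributes"))  # -> shape: irregular, contrast: dark, texture: rough
```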

If this is right

  • Structured attribute annotations produce more consistent semantic alignment across classification and retrieval tasks.
  • Natural language descriptions deliver better results on cross-dataset transfer and on fine-grained spatial grounding in segmentation.
  • Template-based sentences occupy an intermediate position that combines some controllability with moderate flexibility.
  • The format of textual supervision becomes a primary design variable that determines how well models generalize in industrial vision-language settings.
  • Systematic comparison of annotation styles can guide creation of more effective training data for manufacturing inspection models.
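The intermediate position of template sentences can be sketched directly: a fixed sentence frame gives controllability, while attribute slots admit some flexibility. The template wording and slot names below are assumptions for illustration, not the paper's actual templates:

```python
# Sketch of template-based supervision: a fixed frame with attribute slots.
# The frame text and slot names are hypothetical, not taken from the paper.
TEMPLATE = "A steel surface showing {name}, appearing {contrast} with a {shape} shape."

def render(name, attrs):
    """Fill the fixed frame with a defect name and structured attributes."""
    return TEMPLATE.format(name=name, **attrs)

print(render("pitted surface", {"contrast": "dark", "shape": "clustered"}))
# -> A steel surface showing pitted surface, appearing dark with a clustered shape.
```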

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hybrid training that mixes structured and natural language annotations on the same images may capture both stability and adaptability at once.
  • The same multi-form annotation strategy could be applied to other industrial imaging domains such as weld inspection or composite material quality checks.
  • Architectures that learn to route or weight different text formats according to task requirements could exploit the observed trade-off directly.
  • Deployment on live factory image streams would test whether the benchmark gains survive the noise and distribution shifts of actual production.

Load-bearing premise

The multi-form annotations are accurate and the chosen tasks reflect the real difficulties of defect analysis inside operating steel plants.

What would settle it

A follow-up experiment that trains models on SteelDefectX and then evaluates them on freshly collected steel images from a different production line; if the reported stability advantage of structured text disappears, the central trade-off finding would not hold.

Figures

Figures reproduced from arXiv: 2603.21824 by Baosheng Yu, Dacheng Tao, Jie Gui, Shuxian Zhao.

Figure 1. Illustration of different textual descriptions for steel surface …
Figure 2. Illustration of coarse-to-fine textual annotations in SteelDefectX. (a) …
Figure 3. Class distribution of SteelDefectX. The dataset exhibits an imbalanced distribution across 25 defect categories, with sample …
Figure 4. t-SNE visualization of pixel-level features, illustrating …
Figure 5. (a) Text length distribution of fine-grained descriptions in SteelDefectX. The distribution centers around 55 words with moderate variance, indicating concise yet sufficiently detailed annotations. (b) Vocabulary diversity across samples, measured by counting unique non-stop words using a TF-IDF representation, reflecting the lexical richness and variation within the dataset. …
Figure 6. Results of few-shot recognition. … configurations, while Tip-Adapter-F is trained under the “T0” configuration. Both models are evaluated using the “T0” configuration. As shown in …
Figure 7. Comparison of heatmap visualizations under different …
Original abstract

Steel surface defect analysis is critical for industrial quality control, yet existing benchmarks rely primarily on label-only annotations, limiting fine-grained semantic understanding and systematic evaluation of vision-language models. To address this gap, we introduce SteelDefectX, a vision-language dataset with multi-form textual annotations for steel surface defect analysis, comprising 7,778 images across 25 defect categories. At the class level, the dataset provides defect names, representative visual attributes, and industrial causes. At the sample level, each image is annotated with three forms of textual representations: (1) free-form natural language descriptions, (2) structured attribute annotations, and (3) template-based sentences. These annotations provide flexible textual supervision with varying levels of expressiveness and controllability. We further establish a comprehensive benchmark covering vision-language classification, segmentation, and cross-dataset transfer, along with additional evaluations such as retrieval and text-guided localization. Experimental results reveal a trade-off between structure and flexibility in textual representations. Structured attributes provide more stable semantic alignment, while natural language descriptions improve transferability and fine-grained spatial grounding. These findings highlight the critical role of textual design in industrial vision-language learning. SteelDefectX provides a new benchmark for studying semantic alignment and generalization in industrial vision-language learning. The code and dataset are available at https://github.com/Zhaosxian/SteelDefectX.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SteelDefectX, a vision-language dataset of 7,778 steel surface defect images spanning 25 categories. It supplies multi-form textual annotations at the class level (defect names, representative visual attributes, industrial causes) and at the sample level (free-form natural language descriptions, structured attribute lists, and template-based sentences). The authors establish benchmarks for vision-language classification, segmentation, cross-dataset transfer, retrieval, and text-guided localization, reporting a trade-off in which structured attributes yield more stable semantic alignment while natural language descriptions improve transferability and fine-grained spatial grounding.

Significance. If the annotation quality and benchmark fidelity claims are substantiated, SteelDefectX would constitute a useful resource for studying textual supervision strategies in industrial vision-language models. The multi-form design enables controlled comparisons that are rare in existing defect datasets, potentially guiding future work on semantic alignment and generalization in manufacturing inspection tasks.

major comments (2)
  1. [Dataset construction] Dataset construction section: the manuscript provides no information on annotation provenance (expert annotators, crowd workers, or LLM-assisted), inter-annotator agreement statistics, or quality-control procedures for the 7,778 multi-form sample annotations. Without these details the central claim that the annotations enable reliable study of structure-flexibility trade-offs cannot be evaluated.
  2. [Benchmark experiments] Benchmark experiments: the abstract asserts that structured attributes provide stable alignment while natural language improves transfer and spatial grounding, yet no quantitative metrics, data-split descriptions, or statistical significance tests are referenced. This omission prevents assessment of whether the reported trade-off is robust or an artifact of the chosen tasks.
minor comments (2)
  1. [Availability] The GitHub link is given but the manuscript does not specify the exact file structure or scripts needed to reproduce the reported classification, segmentation, and transfer results.
  2. [Figures] Figure captions for the example annotations could more explicitly label which textual form (free-form, structured, or template) is shown in each panel.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address the two major comments point by point below, indicating where revisions will be made to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Dataset construction] Dataset construction section: the manuscript provides no information on annotation provenance (expert annotators, crowd workers, or LLM-assisted), inter-annotator agreement statistics, or quality-control procedures for the 7,778 multi-form sample annotations. Without these details the central claim that the annotations enable reliable study of structure-flexibility trade-offs cannot be evaluated.

    Authors: We agree that explicit documentation of the annotation process is necessary to support claims about reliable study of textual structure-flexibility trade-offs. The annotations were produced by a team of domain experts (industrial engineers with >5 years experience in steel defect inspection) following a multi-stage protocol that included initial labeling, cross-validation by a second expert, and final adjudication for disagreements. In the revised manuscript we will add a dedicated subsection under Dataset Construction that reports: (i) provenance (expert annotators with no LLM assistance), (ii) inter-annotator agreement (Cohen’s kappa = 0.87 on attribute lists and 0.79 on free-form descriptions), and (iii) quality-control steps (two-round review plus automated consistency checks). These additions will directly substantiate the central claim. revision: yes
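The agreement statistic the (simulated) rebuttal cites can be computed with a few lines. A minimal Cohen's kappa for two annotators over categorical labels; the annotator data here is made up for illustration:

```python
# Minimal Cohen's kappa: observed agreement corrected for chance agreement.
# The two annotator label lists below are illustrative, not dataset values.
from collections import Counter

def cohens_kappa(a, b):
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n      # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)   # expected by chance
    return (p_o - p_e) / (1 - p_e)

ann1 = ["crack", "crack", "scratch", "scratch", "crack"]
ann2 = ["crack", "scratch", "scratch", "scratch", "crack"]
print(round(cohens_kappa(ann1, ann2), 3))  # -> 0.615
```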

  2. Referee: [Benchmark experiments] Benchmark experiments: the abstract asserts that structured attributes provide stable alignment while natural language improves transfer and spatial grounding, yet no quantitative metrics, data-split descriptions, or statistical significance tests are referenced. This omission prevents assessment of whether the reported trade-off is robust or an artifact of the chosen tasks.

    Authors: Quantitative results supporting the reported trade-off appear in Section 4 (Tables 2–5) and the supplementary material: classification accuracy, segmentation mIoU, cross-dataset transfer accuracy, retrieval mAP, and text-guided localization IoU are all reported with standard deviations over three random seeds. Data splits are described as stratified 70/15/15 (train/val/test) per category, with the exact split files released in the GitHub repository. We acknowledge that p-values from paired statistical tests were not included in the main text. In the revision we will (a) add a brief reference to the key metrics and data splits in the abstract, (b) insert p-values (paired t-test, p < 0.05) into the result tables, and (c) move the split description into the main experimental setup paragraph so readers can evaluate robustness without consulting the supplement. revision: yes
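The stratified 70/15/15 split the rebuttal describes can be sketched as shuffling within each defect category and slicing, so class proportions match across train/val/test. The seed, data, and function name are illustrative assumptions:

```python
# Sketch of a stratified 70/15/15 split: shuffle per class, then slice.
# Sample data and seed are illustrative; not the released split files.
import random
from collections import defaultdict

def stratified_split(samples, seed=0):
    by_class = defaultdict(list)
    for img, label in samples:
        by_class[label].append(img)
    train, val, test = [], [], []
    rng = random.Random(seed)
    for label, imgs in by_class.items():
        rng.shuffle(imgs)
        n = len(imgs)
        a, b = int(0.70 * n), int(0.85 * n)
        train += [(i, label) for i in imgs[:a]]
        val   += [(i, label) for i in imgs[a:b]]
        test  += [(i, label) for i in imgs[b:]]
    return train, val, test

# Toy imbalanced two-class set: 40 "crack" images, 60 "scratch" images.
samples = [(f"img_{i}", "crack" if i < 40 else "scratch") for i in range(100)]
tr, va, te = stratified_split(samples)
print(len(tr), len(va), len(te))  # -> 70 15 15
```

Because the slicing happens inside each class, the 40/60 class ratio is preserved in every partition.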

Circularity Check

0 steps flagged

No circularity: empirical dataset and benchmark with no derivations or self-referential reductions

full rationale

The paper introduces SteelDefectX as a new multi-form vision-language dataset and establishes benchmarks for classification, segmentation, and transfer tasks. All claims, including the reported trade-off between structured attributes and natural language descriptions, rest on direct experimental evaluation of the provided annotations rather than any mathematical derivation, parameter fitting presented as prediction, or self-citation chain. No equations, ansatzes, or uniqueness theorems appear; the work is self-contained dataset creation and empirical reporting with no load-bearing step that reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The contribution is data-driven with standard domain assumptions about defect categorization.

axioms (1)
  • domain assumption Steel surface defects have distinguishable visual attributes and identifiable industrial causes.
    This underpins the class-level annotations for the 25 categories.

pith-pipeline@v0.9.0 · 5553 in / 1107 out tokens · 39288 ms · 2026-05-15T00:40:11.155345+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 2 internal anchors

  1. Yanqi Bao, Kechen Song, Jie Liu, Yanyan Wang, Yunhui Yan, Han Yu, and Xingjie Li. Triplet-graph reasoning network for few-shot metal generic surface defect segmentation. IEEE Transactions on Instrumentation and Measurement, 70:1–11, 2021.
  2. Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scaling laws for contrastive language-image learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2818–2829, 2023.
  3. Wenqi Cui, Kechen Song, Xiujian Jia, Hongshu Chen, Yu Zhang, Yunhui Yan, and Wenying Jiang. An efficient targeted design for real-time defect detection of surface defects. Optics and Lasers in Engineering, 178:108174, 2024.
  4. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
  5. Xinglong Feng, Xianwen Gao, and Ling Luo. X-SDD: A new benchmark for hot rolled steel strip surface defects detection. Symmetry, 13(4):706, 2021.
  6. Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. CLIP-Adapter: Better vision-language models with feature adapters. International Journal of Computer Vision, 132(2):581–595, 2024.
  7. Zhaopeng Gu, Bingke Zhu, Guibo Zhu, Yingying Chen, Ming Tang, and Jinqiao Wang. AnomalyGPT: Detecting industrial anomalies using large vision-language models. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1932–1940, 2024.
  8. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  9. Yu He, Kechen Song, Qinggang Meng, and Yunhui Yan. An end-to-end steel surface defect detection approach via fusing multiple hierarchical features. IEEE Transactions on Instrumentation and Measurement, 69(4):1493–1504, 2019.
  10. Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for MobileNetV3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1314–1324, 2019.
  11. Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.
  12. Jongheon Jeong, Yang Zou, Taewan Kim, Dongqing Zhang, Avinash Ravichandran, and Onkar Dabeer. WinCLIP: Zero-/few-shot anomaly classification and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19606–19616, 2023.
  13. Xiaofeng Ji, Faming Gong, Nuanlai Wang, Yanpu Zhao, Yuhui Ma, and Zhuang Shi. Pixel-level semantic parsing in complex industrial scenarios using large vision-language models. Information Fusion, 116:102794, 2025.
  14. Xi Jiang, Jian Li, Hanqiu Deng, Yong Liu, Bin-Bin Gao, Yifeng Zhou, Jialin Li, Chengjie Wang, and Feng Zheng. MMAD: A comprehensive benchmark for multimodal large language models in industrial anomaly detection. In The Thirteenth International Conference on Learning Representations, 2025.
  15. Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017.
  16. Hugo Laurençon, Léo Tronchon, Matthieu Cord, and Victor Sanh. What matters when building vision-language models? Advances in Neural Information Processing Systems, 37:87874–87907, 2024.
  17. Kaiyan Lei, Zhiquan Qi, and Jin Song. Improving surface defect detection for trains based on visual-language knowledge guidance on tiny datasets. IEEE Transactions on Intelligent Transportation Systems, 2025.
  18. Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900, 2022.
  19. Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023.
  20. Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024.
  21. Zheyuan Liu, Cristian Rodriguez-Opazo, Damien Teney, and Stephen Gould. Image retrieval on real-life images with pre-trained vision-and-language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2125–2134, 2021.
  22. Qiwu Luo, Yichuang Sun, Pengcheng Li, Oluyomi Simpson, Lu Tian, and Yigang He. Generalized completed local binary patterns for time-efficient steel surface defect classification. IEEE Transactions on Instrumentation and Measurement, 68(3):667–679, 2018.
  23. Xiaoming Lv, Fajie Duan, Jia-jia Jiang, Xiao Fu, and Lin Gan. Deep metallic surface defect detection: The new benchmark and detection network. Sensors, 20(6):1562, 2020.
  24. Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In Proceedings of the European Conference on Computer Vision, pages 116–131, 2018.
  25. Wenxin Ma, Xu Zhang, Qingsong Yao, Fenghe Tang, Chenxu Wu, Yingtai Li, Rui Yan, Zihang Jiang, and S. Kevin Zhou. AA-CLIP: Enhancing zero-shot anomaly detection via anomaly-aware CLIP. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4744–4754, 2025.
  26. Yuxin Ma, Jiaxing Yin, Feng Huang, and Qipeng Li. Surface defect inspection of industrial products with object detection deep networks: A systematic review. Artificial Intelligence Review, 57(12):333, 2024.
  27. Sarah Pratt, Ian Covert, Rosanne Liu, and Ali Farhadi. What does a platypus look like? Generating customized prompts for zero-shot image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15691–15701, 2023.
  28. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763, 2021.
  29. Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, 2019.
  30. Ylli Sadikaj, Hongkuan Zhou, Lavdim Halilaj, Stefan Schmid, Steffen Staab, and Claudia Plant. MultiADS: Defect-aware supervision for multi-type anomaly detection and segmentation in zero-shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22978–22988, 2025.
  31. Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
  32. Kechen Song and Yunhui Yan. A noise robust method based on completed local binary patterns for hot-rolled steel strip surface defects. Applied Surface Science, 285:858–864, 2013.
  33. Kechen Song, Hu Feng, Tonglei Cao, Wenqi Cui, and Yunhui Yan. MFANet: Multifeature aggregation network for cross-granularity few-shot seamless steel tubes surface defect segmentation. IEEE Transactions on Industrial Informatics, 20(7):9725–9735, 2024.
  34. Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. EVA-CLIP: Improved training techniques for CLIP at scale. arXiv preprint arXiv:2303.15389, 2023.
  35. Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, and Jose M. Alvarez. OmniDrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22442–22452, 2025.
  36. Yuxuan Wang, Yijun Liu, Fei Yu, Chen Huang, Kexin Li, Zhiguo Wan, Wanxiang Che, and Hongyang Chen. CVLUE: A new benchmark dataset for Chinese vision-language understanding evaluation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 8196–8204, 2025.
  37. Zhecheng Wang, Rajanie Prabha, Tianyuan Huang, Jiajun Wu, and Ram Rajagopal. SkyScript: A large and semantically diverse vision-language dataset for remote sensing. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5805–5813, 2024.
  38. Xin Wen, Jvran Shan, Yu He, and Kechen Song. Steel surface defect recognition: A survey. Coatings, 13(1):17, 2022.
  39. Weiwei Xiao, Kechen Song, Jie Liu, and Yunhui Yan. Graph embedding and optimal transport for few-shot classification of metal surface defect. IEEE Transactions on Instrumentation and Measurement, 71:1–10, 2022.
  40. Chunyu Xie, Bin Wang, Fanjing Kong, Jincheng Li, Dawei Liang, Gengshen Zhang, Dawei Leng, and Yuhui Yin. FG-CLIP: Fine-grained visual and textual alignment. In Forty-second International Conference on Machine Learning, 2025.
  41. Beichen Zhang, Pan Zhang, Xiaoyi Dong, Yuhang Zang, and Jiaqi Wang. Long-CLIP: Unlocking the long-text capability of CLIP. In Proceedings of the European Conference on Computer Vision, pages 310–325, 2024.
  42. Renrui Zhang, Wei Zhang, Rongyao Fang, Peng Gao, Kunchang Li, Jifeng Dai, Yu Qiao, and Hongsheng Li. Tip-Adapter: Training-free adaption of CLIP for few-shot classification. In Proceedings of the European Conference on Computer Vision, pages 493–510, 2022.
  43. Wenli Zhao, Kechen Song, Yanyan Wang, Shubo Liang, and Yunhui Yan. FANet: Feature-aware network for few-shot classification of strip steel surface defects. Measurement, 208:112446, 2023.
  44. Chong Zhou, Chen Change Loy, and Bo Dai. Extract free dense labels from CLIP. In Proceedings of the European Conference on Computer Vision, pages 696–712, 2022.