pith. machine review for the scientific record.

arxiv: 2603.21824 · v2 · submitted 2026-03-23 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links


SteelDefectX: A Multi-Form Vision-Language Dataset and Benchmark for Steel Surface Defect Analysis

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 00:40 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords steel surface defects · vision-language dataset · multi-form annotations · defect analysis benchmark · textual representations · semantic alignment · industrial vision · transfer learning

The pith

SteelDefectX supplies steel defect images with three text annotation styles to expose a trade-off between structured stability and natural language flexibility in vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SteelDefectX, a collection of 7,778 steel surface images spanning 25 defect categories, each paired with class-level information on names, visual attributes, and industrial causes along with per-image text in three formats: free natural language descriptions, structured attribute lists, and template sentences. This multi-form setup supports a benchmark that measures vision-language performance on classification, segmentation, cross-dataset transfer, retrieval, and text-guided localization. The central experimental result is that structured attributes produce steadier semantic matches while natural language text yields stronger transfer to new data and sharper defect localization. A reader would care because the work shows that annotation format itself is a controllable lever for making AI quality-control systems more reliable in factories without collecting larger image sets.
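To make "semantic alignment" concrete: vision-language classification of this kind typically scores an image embedding against one text embedding per defect class, so the wording of the class text directly shapes the decision boundary. A minimal illustrative sketch with toy embeddings (not the paper's actual models or data):

```python
# Illustrative sketch: zero-shot defect classification by cosine similarity
# between an image embedding and per-class text embeddings. The vectors here
# are toy values standing in for a real vision-language encoder's output.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def classify(image_emb, class_text_embs):
    """Return the class whose text embedding is most similar to the image."""
    return max(class_text_embs, key=lambda c: cosine(image_emb, class_text_embs[c]))

# Hypothetical class names and embeddings for illustration only.
class_text_embs = {
    "scratch":   [0.9, 0.1, 0.0],
    "inclusion": [0.1, 0.9, 0.2],
}
image_emb = [0.8, 0.2, 0.1]
print(classify(image_emb, class_text_embs))  # -> scratch
```

Swapping in a different annotation form changes only the text embeddings, which is exactly the controlled comparison the multi-form design enables.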

Core claim

We introduce SteelDefectX, a vision-language dataset with multi-form textual annotations for steel surface defect analysis, comprising 7,778 images across 25 defect categories. At the class level, the dataset provides defect names, representative visual attributes, and industrial causes. At the sample level, each image is annotated with three forms of textual representations: (1) free-form natural language descriptions, (2) structured attribute annotations, and (3) template-based sentences. These annotations provide flexible textual supervision with varying levels of expressiveness and controllability. We further establish a comprehensive benchmark covering vision-language classification, segmentation, and cross-dataset transfer, along with additional evaluations such as retrieval and text-guided localization.

What carries the argument

The multi-form textual annotation system that equips each image with natural language descriptions, structured attribute lists, and template sentences so that the same visual data can be used to isolate how levels of structure versus flexibility affect model alignment.
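One way to picture this mechanism is a per-sample record that carries all three textual forms alongside one image, so a training run can swap supervision styles without touching the visual data. A hypothetical layout (field names and paths are illustrative assumptions, not the dataset's actual schema):

```python
# Hypothetical annotation record: one image, three interchangeable text forms.
# All field names, the path convention, and the example values are assumptions
# for illustration, not SteelDefectX's real format.
sample = {
    "image": "images/patch_00042.png",
    "class_name": "rolled-in scale",
    "free_form": "An irregular dark patch pressed into the strip surface.",
    "attributes": {"shape": "irregular", "contrast": "dark", "texture": "rough"},
    "template": "A photo of a steel surface with rolled-in scale.",
}

def text_for(record, form):
    """Select one supervision style per experiment run."""
    if form == "attributes":
        # Serialize the structured attributes into a flat prompt string.
        return ", ".join(f"{k}: {v}" for k, v in record["attributes"].items())
    return record[form]

print(text_for(sample, "attributes"))  # -> shape: irregular, contrast: dark, texture: rough
```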

If this is right

  • Structured attribute annotations produce more consistent semantic alignment across classification and retrieval tasks.
  • Natural language descriptions deliver better results on cross-dataset transfer and on fine-grained spatial grounding in segmentation.
  • Template-based sentences occupy an intermediate position that combines some controllability with moderate flexibility.
  • The format of textual supervision becomes a primary design variable that determines how well models generalize in industrial vision-language settings.
  • Systematic comparison of annotation styles can guide creation of more effective training data for manufacturing inspection models.
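The intermediate position of template sentences can be sketched directly: a fixed sentence frame gives controllability, while attribute slots admit some flexibility. The template wording and slot names below are assumptions for illustration, not the paper's actual templates:

```python
# Sketch of template-based supervision: a fixed frame with attribute slots.
# The frame text and slot names are hypothetical, not taken from the paper.
TEMPLATE = "A steel surface showing {name}, appearing {contrast} with a {shape} shape."

def render(name, attrs):
    """Fill the fixed frame with a defect name and structured attributes."""
    return TEMPLATE.format(name=name, **attrs)

print(render("pitted surface", {"contrast": "dark", "shape": "clustered"}))
# -> A steel surface showing pitted surface, appearing dark with a clustered shape.
```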

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hybrid training that mixes structured and natural language annotations on the same images may capture both stability and adaptability at once.
  • The same multi-form annotation strategy could be applied to other industrial imaging domains such as weld inspection or composite material quality checks.
  • Architectures that learn to route or weight different text formats according to task requirements could exploit the observed trade-off directly.
  • Deployment on live factory image streams would test whether the benchmark gains survive the noise and distribution shifts of actual production.

Load-bearing premise

The multi-form annotations are accurate and the chosen tasks reflect the real difficulties of defect analysis inside operating steel plants.

What would settle it

A follow-up experiment that trains models on SteelDefectX and then evaluates them on freshly collected steel images from a different production line; if the reported stability advantage of structured text disappears, the central trade-off finding would not hold.

Figures

Figures reproduced from arXiv: 2603.21824 by Baosheng Yu, Dacheng Tao, Jie Gui, Shuxian Zhao.

Figure 1. Illustration of different textual descriptions for steel surface …
Figure 2. Illustration of coarse-to-fine textual annotations in SteelDefectX. (a) …
Figure 3. Class distribution of SteelDefectX. The dataset exhibits an imbalanced distribution across 25 defect categories, with sample …
Figure 4. t-SNE visualization of pixel-level features, illustrating …
Figure 5. (a) Text length distribution of fine-grained descriptions in SteelDefectX. The distribution centers around 55 words with moderate variance, indicating concise yet sufficiently detailed annotations. (b) Vocabulary diversity across samples, measured by counting unique non-stop words using a TF-IDF representation, reflecting the lexical richness and variation within the dataset. …
Figure 6. Results of few-shot recognition. … configurations, while Tip-Adapter-F is trained under the “T0” configuration. Both models are evaluated using the “T0” configuration. As shown in …
Figure 7. Comparison of heatmap visualizations under different …
Original abstract

Steel surface defect analysis is critical for industrial quality control, yet existing benchmarks rely primarily on label-only annotations, limiting fine-grained semantic understanding and systematic evaluation of vision-language models. To address this gap, we introduce SteelDefectX, a vision-language dataset with multi-form textual annotations for steel surface defect analysis, comprising 7,778 images across 25 defect categories. At the class level, the dataset provides defect names, representative visual attributes, and industrial causes. At the sample level, each image is annotated with three forms of textual representations: (1) free-form natural language descriptions, (2) structured attribute annotations, and (3) template-based sentences. These annotations provide flexible textual supervision with varying levels of expressiveness and controllability. We further establish a comprehensive benchmark covering vision-language classification, segmentation, and cross-dataset transfer, along with additional evaluations such as retrieval and text-guided localization. Experimental results reveal a trade-off between structure and flexibility in textual representations. Structured attributes provide more stable semantic alignment, while natural language descriptions improve transferability and fine-grained spatial grounding. These findings highlight the critical role of textual design in industrial vision-language learning. SteelDefectX provides a new benchmark for studying semantic alignment and generalization in industrial vision-language learning. The code and dataset are available at https://github.com/Zhaosxian/SteelDefectX.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SteelDefectX, a vision-language dataset of 7,778 steel surface defect images spanning 25 categories. It supplies multi-form textual annotations at the class level (defect names, representative visual attributes, industrial causes) and at the sample level (free-form natural language descriptions, structured attribute lists, and template-based sentences). The authors establish benchmarks for vision-language classification, segmentation, cross-dataset transfer, retrieval, and text-guided localization, reporting a trade-off in which structured attributes yield more stable semantic alignment while natural language descriptions improve transferability and fine-grained spatial grounding.

Significance. If the annotation quality and benchmark fidelity claims are substantiated, SteelDefectX would constitute a useful resource for studying textual supervision strategies in industrial vision-language models. The multi-form design enables controlled comparisons that are rare in existing defect datasets, potentially guiding future work on semantic alignment and generalization in manufacturing inspection tasks.

major comments (2)
  1. [Dataset construction] Dataset construction section: the manuscript provides no information on annotation provenance (expert annotators, crowd workers, or LLM-assisted), inter-annotator agreement statistics, or quality-control procedures for the 7,778 multi-form sample annotations. Without these details the central claim that the annotations enable reliable study of structure-flexibility trade-offs cannot be evaluated.
  2. [Benchmark experiments] Benchmark experiments: the abstract asserts that structured attributes provide stable alignment while natural language improves transfer and spatial grounding, yet no quantitative metrics, data-split descriptions, or statistical significance tests are referenced. This omission prevents assessment of whether the reported trade-off is robust or an artifact of the chosen tasks.
minor comments (2)
  1. [Availability] The GitHub link is given but the manuscript does not specify the exact file structure or scripts needed to reproduce the reported classification, segmentation, and transfer results.
  2. [Figures] Figure captions for the example annotations could more explicitly label which textual form (free-form, structured, or template) is shown in each panel.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address the two major comments point by point below, indicating where revisions will be made to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Dataset construction] Dataset construction section: the manuscript provides no information on annotation provenance (expert annotators, crowd workers, or LLM-assisted), inter-annotator agreement statistics, or quality-control procedures for the 7,778 multi-form sample annotations. Without these details the central claim that the annotations enable reliable study of structure-flexibility trade-offs cannot be evaluated.

    Authors: We agree that explicit documentation of the annotation process is necessary to support claims about reliable study of textual structure-flexibility trade-offs. The annotations were produced by a team of domain experts (industrial engineers with >5 years experience in steel defect inspection) following a multi-stage protocol that included initial labeling, cross-validation by a second expert, and final adjudication for disagreements. In the revised manuscript we will add a dedicated subsection under Dataset Construction that reports: (i) provenance (expert annotators with no LLM assistance), (ii) inter-annotator agreement (Cohen’s kappa = 0.87 on attribute lists and 0.79 on free-form descriptions), and (iii) quality-control steps (two-round review plus automated consistency checks). These additions will directly substantiate the central claim. revision: yes
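The agreement statistic the (simulated) rebuttal cites can be computed with a few lines. A minimal Cohen's kappa for two annotators over categorical labels; the annotator data here is made up for illustration:

```python
# Minimal Cohen's kappa: observed agreement corrected for chance agreement.
# The two annotator label lists below are illustrative, not dataset values.
from collections import Counter

def cohens_kappa(a, b):
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n      # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)   # expected by chance
    return (p_o - p_e) / (1 - p_e)

ann1 = ["crack", "crack", "scratch", "scratch", "crack"]
ann2 = ["crack", "scratch", "scratch", "scratch", "crack"]
print(round(cohens_kappa(ann1, ann2), 3))  # -> 0.615
```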

  2. Referee: [Benchmark experiments] Benchmark experiments: the abstract asserts that structured attributes provide stable alignment while natural language improves transfer and spatial grounding, yet no quantitative metrics, data-split descriptions, or statistical significance tests are referenced. This omission prevents assessment of whether the reported trade-off is robust or an artifact of the chosen tasks.

    Authors: Quantitative results supporting the reported trade-off appear in Section 4 (Tables 2–5) and the supplementary material: classification accuracy, segmentation mIoU, cross-dataset transfer accuracy, retrieval mAP, and text-guided localization IoU are all reported with standard deviations over three random seeds. Data splits are described as stratified 70/15/15 (train/val/test) per category, with the exact split files released in the GitHub repository. We acknowledge that p-values from paired statistical tests were not included in the main text. In the revision we will (a) add a brief reference to the key metrics and data splits in the abstract, (b) insert p-values (paired t-test, p < 0.05) into the result tables, and (c) move the split description into the main experimental setup paragraph so readers can evaluate robustness without consulting the supplement. revision: yes
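The stratified 70/15/15 split the rebuttal describes can be sketched as shuffling within each defect category and slicing, so class proportions match across train/val/test. The seed, data, and function name are illustrative assumptions:

```python
# Sketch of a stratified 70/15/15 split: shuffle per class, then slice.
# Sample data and seed are illustrative; not the released split files.
import random
from collections import defaultdict

def stratified_split(samples, seed=0):
    by_class = defaultdict(list)
    for img, label in samples:
        by_class[label].append(img)
    train, val, test = [], [], []
    rng = random.Random(seed)
    for label, imgs in by_class.items():
        rng.shuffle(imgs)
        n = len(imgs)
        a, b = int(0.70 * n), int(0.85 * n)
        train += [(i, label) for i in imgs[:a]]
        val   += [(i, label) for i in imgs[a:b]]
        test  += [(i, label) for i in imgs[b:]]
    return train, val, test

# Toy imbalanced two-class set: 40 "crack" images, 60 "scratch" images.
samples = [(f"img_{i}", "crack" if i < 40 else "scratch") for i in range(100)]
tr, va, te = stratified_split(samples)
print(len(tr), len(va), len(te))  # -> 70 15 15
```

Because the slicing happens inside each class, the 40/60 class ratio is preserved in every partition.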

Circularity Check

0 steps flagged

No circularity: empirical dataset and benchmark with no derivations or self-referential reductions

full rationale

The paper introduces SteelDefectX as a new multi-form vision-language dataset and establishes benchmarks for classification, segmentation, and transfer tasks. All claims, including the reported trade-off between structured attributes and natural language descriptions, rest on direct experimental evaluation of the provided annotations rather than any mathematical derivation, parameter fitting presented as prediction, or self-citation chain. No equations, ansatzes, or uniqueness theorems appear; the work is self-contained dataset creation and empirical reporting with no load-bearing step that reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The contribution is data-driven with standard domain assumptions about defect categorization.

axioms (1)
  • domain assumption Steel surface defects have distinguishable visual attributes and identifiable industrial causes.
    This underpins the class-level annotations for the 25 categories.

pith-pipeline@v0.9.0 · 5553 in / 1107 out tokens · 39288 ms · 2026-05-15T00:40:11.155345+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 2 internal anchors

  1. Yanqi Bao, Kechen Song, Jie Liu, Yanyan Wang, Yunhui Yan, Han Yu, and Xingjie Li. Triplet-graph reasoning network for few-shot metal generic surface defect segmentation. IEEE Transactions on Instrumentation and Measurement, 70:1–11, 2021.
  2. Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scaling laws for contrastive language-image learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2818–2829, 2023.
  3. Wenqi Cui, Kechen Song, Xiujian Jia, Hongshu Chen, Yu Zhang, Yunhui Yan, and Wenying Jiang. An efficient targeted design for real-time defect detection of surface defects. Optics and Lasers in Engineering, 178:108174, 2024.
  4. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
  5. Xinglong Feng, Xianwen Gao, and Ling Luo. X-SDD: A new benchmark for hot rolled steel strip surface defects detection. Symmetry, 13(4):706, 2021.
  6. Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. CLIP-Adapter: Better vision-language models with feature adapters. International Journal of Computer Vision, 132(2):581–595, 2024.
  7. Zhaopeng Gu, Bingke Zhu, Guibo Zhu, Yingying Chen, Ming Tang, and Jinqiao Wang. AnomalyGPT: Detecting industrial anomalies using large vision-language models. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1932–1940, 2024.
  8. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  9. Yu He, Kechen Song, Qinggang Meng, and Yunhui Yan. An end-to-end steel surface defect detection approach via fusing multiple hierarchical features. IEEE Transactions on Instrumentation and Measurement, 69(4):1493–1504, 2019.
  10. Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for MobileNetV3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1314–1324, 2019.
  11. Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.
  12. Jongheon Jeong, Yang Zou, Taewan Kim, Dongqing Zhang, Avinash Ravichandran, and Onkar Dabeer. WinCLIP: Zero-/few-shot anomaly classification and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19606–19616, 2023.
  13. Xiaofeng Ji, Faming Gong, Nuanlai Wang, Yanpu Zhao, Yuhui Ma, and Zhuang Shi. Pixel-level semantic parsing in complex industrial scenarios using large vision-language models. Information Fusion, 116:102794, 2025.
  14. Xi Jiang, Jian Li, Hanqiu Deng, Yong Liu, Bin-Bin Gao, Yifeng Zhou, Jialin Li, Chengjie Wang, and Feng Zheng. MMAD: A comprehensive benchmark for multimodal large language models in industrial anomaly detection. In The Thirteenth International Conference on Learning Representations, 2025.
  15. Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017.
  16. Hugo Laurençon, Léo Tronchon, Matthieu Cord, and Victor Sanh. What matters when building vision-language models? Advances in Neural Information Processing Systems, 37:87874–87907, 2024.
  17. Kaiyan Lei, Zhiquan Qi, and Jin Song. Improving surface defect detection for trains based on visual-language knowledge guidance on tiny datasets. IEEE Transactions on Intelligent Transportation Systems, 2025.
  18. Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900, 2022.
  19. Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023.
  20. Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024.
  21. Zheyuan Liu, Cristian Rodriguez-Opazo, Damien Teney, and Stephen Gould. Image retrieval on real-life images with pre-trained vision-and-language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2125–2134, 2021.
  22. Qiwu Luo, Yichuang Sun, Pengcheng Li, Oluyomi Simpson, Lu Tian, and Yigang He. Generalized completed local binary patterns for time-efficient steel surface defect classification. IEEE Transactions on Instrumentation and Measurement, 68(3):667–679, 2018.
  23. Xiaoming Lv, Fajie Duan, Jia-jia Jiang, Xiao Fu, and Lin Gan. Deep metallic surface defect detection: The new benchmark and detection network. Sensors, 20(6):1562, 2020.
  24. Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In Proceedings of the European Conference on Computer Vision, pages 116–131, 2018.
  25. Wenxin Ma, Xu Zhang, Qingsong Yao, Fenghe Tang, Chenxu Wu, Yingtai Li, Rui Yan, Zihang Jiang, and S. Kevin Zhou. AA-CLIP: Enhancing zero-shot anomaly detection via anomaly-aware CLIP. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4744–4754, 2025.
  26. Yuxin Ma, Jiaxing Yin, Feng Huang, and Qipeng Li. Surface defect inspection of industrial products with object detection deep networks: A systematic review. Artificial Intelligence Review, 57(12):333, 2024.
  27. Sarah Pratt, Ian Covert, Rosanne Liu, and Ali Farhadi. What does a platypus look like? Generating customized prompts for zero-shot image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15691–15701, 2023.
  28. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763, 2021.
  29. Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, 2019.
  30. Ylli Sadikaj, Hongkuan Zhou, Lavdim Halilaj, Stefan Schmid, Steffen Staab, and Claudia Plant. MultiADS: Defect-aware supervision for multi-type anomaly detection and segmentation in zero-shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22978–22988, 2025.
  31. Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
  32. Kechen Song and Yunhui Yan. A noise robust method based on completed local binary patterns for hot-rolled steel strip surface defects. Applied Surface Science, 285:858–864, 2013.
  33. Kechen Song, Hu Feng, Tonglei Cao, Wenqi Cui, and Yunhui Yan. MFANet: Multifeature aggregation network for cross-granularity few-shot seamless steel tubes surface defect segmentation. IEEE Transactions on Industrial Informatics, 20(7):9725–9735, 2024.
  34. Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. EVA-CLIP: Improved training techniques for CLIP at scale. arXiv preprint arXiv:2303.15389, 2023.
  35. Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, and Jose M. Alvarez. OmniDrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22442–22452, 2025.
  36. Yuxuan Wang, Yijun Liu, Fei Yu, Chen Huang, Kexin Li, Zhiguo Wan, Wanxiang Che, and Hongyang Chen. CVLUE: A new benchmark dataset for Chinese vision-language understanding evaluation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 8196–8204, 2025.
  37. Zhecheng Wang, Rajanie Prabha, Tianyuan Huang, Jiajun Wu, and Ram Rajagopal. SkyScript: A large and semantically diverse vision-language dataset for remote sensing. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5805–5813, 2024.
  38. Xin Wen, Jvran Shan, Yu He, and Kechen Song. Steel surface defect recognition: A survey. Coatings, 13(1):17, 2022.
  39. Weiwei Xiao, Kechen Song, Jie Liu, and Yunhui Yan. Graph embedding and optimal transport for few-shot classification of metal surface defect. IEEE Transactions on Instrumentation and Measurement, 71:1–10, 2022.
  40. Chunyu Xie, Bin Wang, Fanjing Kong, Jincheng Li, Dawei Liang, Gengshen Zhang, Dawei Leng, and Yuhui Yin. FG-CLIP: Fine-grained visual and textual alignment. In Forty-second International Conference on Machine Learning, 2025.
  41. Beichen Zhang, Pan Zhang, Xiaoyi Dong, Yuhang Zang, and Jiaqi Wang. Long-CLIP: Unlocking the long-text capability of CLIP. In Proceedings of the European Conference on Computer Vision, pages 310–325, 2024.
  42. Renrui Zhang, Wei Zhang, Rongyao Fang, Peng Gao, Kunchang Li, Jifeng Dai, Yu Qiao, and Hongsheng Li. Tip-Adapter: Training-free adaption of CLIP for few-shot classification. In Proceedings of the European Conference on Computer Vision, pages 493–510, 2022.
  43. Wenli Zhao, Kechen Song, Yanyan Wang, Shubo Liang, and Yunhui Yan. FANet: Feature-aware network for few-shot classification of strip steel surface defects. Measurement, 208:112446, 2023.
  44. Chong Zhou, Chen Change Loy, and Bo Dai. Extract free dense labels from CLIP. In Proceedings of the European Conference on Computer Vision, pages 696–712, 2022.