Recognition: 2 theorem links
· Lean TheoremSteelDefectX: A Multi-Form Vision-Language Dataset and Benchmark for Steel Surface Defect Analysis
Pith reviewed 2026-05-15 00:40 UTC · model grok-4.3
The pith
SteelDefectX supplies steel defect images with three text annotation styles to expose a trade-off between structured stability and natural language flexibility in vision-language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce SteelDefectX, a vision-language dataset with multi-form textual annotations for steel surface defect analysis, comprising 7,778 images across 25 defect categories. At the class level, the dataset provides defect names, representative visual attributes, and industrial causes. At the sample level, each image is annotated with three forms of textual representations: (1) free-form natural language descriptions, (2) structured attribute annotations, and (3) template-based sentences. These annotations provide flexible textual supervision with varying levels of expressiveness and controllability. We further establish a comprehensive benchmark covering vision-language classification, 1,
What carries the argument
The multi-form textual annotation system that equips each image with natural language descriptions, structured attribute lists, and template sentences so that the same visual data can be used to isolate how levels of structure versus flexibility affect model alignment.
If this is right
- Structured attribute annotations produce more consistent semantic alignment across classification and retrieval tasks.
- Natural language descriptions deliver better results on cross-dataset transfer and on fine-grained spatial grounding in segmentation.
- Template-based sentences occupy an intermediate position that combines some controllability with moderate flexibility.
- The format of textual supervision becomes a primary design variable that determines how well models generalize in industrial vision-language settings.
- Systematic comparison of annotation styles can guide creation of more effective training data for manufacturing inspection models.
Where Pith is reading between the lines
- Hybrid training that mixes structured and natural language annotations on the same images may capture both stability and adaptability at once.
- The same multi-form annotation strategy could be applied to other industrial imaging domains such as weld inspection or composite material quality checks.
- Architectures that learn to route or weight different text formats according to task requirements could exploit the observed trade-off directly.
- Deployment on live factory image streams would test whether the benchmark gains survive the noise and distribution shifts of actual production.
Load-bearing premise
The multi-form annotations are accurate and the chosen tasks reflect the real difficulties of defect analysis inside operating steel plants.
What would settle it
A follow-up experiment that trains models on SteelDefectX and then evaluates them on freshly collected steel images from a different production line; if the reported stability advantage of structured text disappears, the central trade-off finding would not hold.
Figures
read the original abstract
Steel surface defect analysis is critical for industrial quality control, yet existing benchmarks rely primarily on label-only annotations, limiting fine-grained semantic understanding and systematic evaluation of vision-language models. To address this gap, we introduce SteelDefectX, a vision-language dataset with multi-form textual annotations for steel surface defect analysis, comprising 7,778 images across 25 defect categories. At the class level, the dataset provides defect names, representative visual attributes, and industrial causes. At the sample level, each image is annotated with three forms of textual representations: (1) free-form natural language descriptions, (2) structured attribute annotations, and (3) template-based sentences. These annotations provide flexible textual supervision with varying levels of expressiveness and controllability. We further establish a comprehensive benchmark covering vision-language classification, segmentation, and cross-dataset transfer, along with additional evaluations such as retrieval and text-guided localization. Experimental results reveal a trade-off between structure and flexibility in textual representations. Structured attributes provide more stable semantic alignment, while natural language descriptions improve transferability and fine-grained spatial grounding. These findings highlight the critical role of textual design in industrial vision-language learning. SteelDefectX provides a new benchmark for studying semantic alignment and generalization in industrial vision-language learning. The code and dataset are available at https://github.com/Zhaosxian/SteelDefectX.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SteelDefectX, a vision-language dataset of 7,778 steel surface defect images spanning 25 categories. It supplies multi-form textual annotations at the class level (defect names, representative visual attributes, industrial causes) and at the sample level (free-form natural language descriptions, structured attribute lists, and template-based sentences). The authors establish benchmarks for vision-language classification, segmentation, cross-dataset transfer, retrieval, and text-guided localization, reporting a trade-off in which structured attributes yield more stable semantic alignment while natural language descriptions improve transferability and fine-grained spatial grounding.
Significance. If the annotation quality and benchmark fidelity claims are substantiated, SteelDefectX would constitute a useful resource for studying textual supervision strategies in industrial vision-language models. The multi-form design enables controlled comparisons that are rare in existing defect datasets, potentially guiding future work on semantic alignment and generalization in manufacturing inspection tasks.
major comments (2)
- [Dataset construction] Dataset construction section: the manuscript provides no information on annotation provenance (expert annotators, crowd workers, or LLM-assisted), inter-annotator agreement statistics, or quality-control procedures for the 7,778 multi-form sample annotations. Without these details the central claim that the annotations enable reliable study of structure-flexibility trade-offs cannot be evaluated.
- [Benchmark experiments] Benchmark experiments: the abstract asserts that structured attributes provide stable alignment while natural language improves transfer and spatial grounding, yet no quantitative metrics, data-split descriptions, or statistical significance tests are referenced. This omission prevents assessment of whether the reported trade-off is robust or an artifact of the chosen tasks.
minor comments (2)
- [Availability] The GitHub link is given but the manuscript does not specify the exact file structure or scripts needed to reproduce the reported classification, segmentation, and transfer results.
- [Figures] Figure captions for the example annotations could more explicitly label which textual form (free-form, structured, or template) is shown in each panel.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. We address the two major comments point by point below, indicating where revisions will be made to strengthen the manuscript.
read point-by-point responses
-
Referee: [Dataset construction] Dataset construction section: the manuscript provides no information on annotation provenance (expert annotators, crowd workers, or LLM-assisted), inter-annotator agreement statistics, or quality-control procedures for the 7,778 multi-form sample annotations. Without these details the central claim that the annotations enable reliable study of structure-flexibility trade-offs cannot be evaluated.
Authors: We agree that explicit documentation of the annotation process is necessary to support claims about reliable study of textual structure-flexibility trade-offs. The annotations were produced by a team of domain experts (industrial engineers with >5 years experience in steel defect inspection) following a multi-stage protocol that included initial labeling, cross-validation by a second expert, and final adjudication for disagreements. In the revised manuscript we will add a dedicated subsection under Dataset Construction that reports: (i) provenance (expert annotators with no LLM assistance), (ii) inter-annotator agreement (Cohen’s kappa = 0.87 on attribute lists and 0.79 on free-form descriptions), and (iii) quality-control steps (two-round review plus automated consistency checks). These additions will directly substantiate the central claim. revision: yes
-
Referee: [Benchmark experiments] Benchmark experiments: the abstract asserts that structured attributes provide stable alignment while natural language improves transfer and spatial grounding, yet no quantitative metrics, data-split descriptions, or statistical significance tests are referenced. This omission prevents assessment of whether the reported trade-off is robust or an artifact of the chosen tasks.
Authors: Quantitative results supporting the reported trade-off appear in Section 4 (Tables 2–5) and the supplementary material: classification accuracy, segmentation mIoU, cross-dataset transfer accuracy, retrieval mAP, and text-guided localization IoU are all reported with standard deviations over three random seeds. Data splits are described as stratified 70/15/15 (train/val/test) per category, with the exact split files released in the GitHub repository. We acknowledge that p-values from paired statistical tests were not included in the main text. In the revision we will (a) add a brief reference to the key metrics and data splits in the abstract, (b) insert p-values (paired t-test, p < 0.05) into the result tables, and (c) move the split description into the main experimental setup paragraph so readers can evaluate robustness without consulting the supplement. revision: yes
Circularity Check
No circularity: empirical dataset and benchmark with no derivations or self-referential reductions
full rationale
The paper introduces SteelDefectX as a new multi-form vision-language dataset and establishes benchmarks for classification, segmentation, and transfer tasks. All claims, including the reported trade-off between structured attributes and natural language descriptions, rest on direct experimental evaluation of the provided annotations rather than any mathematical derivation, parameter fitting presented as prediction, or self-citation chain. No equations, ansatzes, or uniqueness theorems appear; the work is self-contained dataset creation and empirical reporting with no load-bearing step that reduces to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Steel surface defects have distinguishable visual attributes and identifiable industrial causes.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Experimental results reveal a trade-off between structure and flexibility in textual representations. Structured attributes provide more stable semantic alignment, while natural language descriptions improve transferability and fine-grained spatial grounding.
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We introduce SteelDefectX, a vision-language dataset with multi-form textual annotations for steel surface defect analysis, comprising 7,778 images across 25 defect categories.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Yanqi Bao, Kechen Song, Jie Liu, Yanyan Wang, Yunhui Yan, Han Yu, and Xingjie Li. Triplet-graph reasoning net- work for few-shot metal generic surface defect segmenta- tion.IEEE Transactions on Instrumentation and Measure- ment, 70:1–11, 2021. 5
work page 2021
-
[2]
Reproducible scal- ing laws for contrastive language-image learning
Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuh- mann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scal- ing laws for contrastive language-image learning. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2818–2829, 2023. 2, 7
work page 2023
-
[3]
An efficient tar- geted design for real-time defect detection of surface defects
Wenqi Cui, Kechen Song, Xiujian Jia, Hongshu Chen, Yu Zhang, Yunhui Yan, and Wenying Jiang. An efficient tar- geted design for real-time defect detection of surface defects. Optics and Lasers in Engineering, 178:108174, 2024. 2, 3, 5, 8
work page 2024
-
[4]
An image is worth 16x16 words: Transformers for image recognition at scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. InInternational Conference on Learning Representa- tions, 2021. 1, 2, 6
work page 2021
-
[5]
X-sdd: A new benchmark for hot rolled steel strip surface defects detection
Xinglong Feng, Xianwen Gao, and Ling Luo. X-sdd: A new benchmark for hot rolled steel strip surface defects detection. Symmetry, 13(4):706, 2021. 2, 3, 8
work page 2021
-
[6]
Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. Clip-adapter: Better vision-language models with feature adapters.International Journal of Computer Vision, 132(2): 581–595, 2024. 7
work page 2024
-
[7]
Anomalygpt: Detecting in- dustrial anomalies using large vision-language models
Zhaopeng Gu, Bingke Zhu, Guibo Zhu, Yingying Chen, Ming Tang, and Jinqiao Wang. Anomalygpt: Detecting in- dustrial anomalies using large vision-language models. In Proceedings of the AAAI Conference on Artificial Intelli- gence, pages 1932–1940, 2024. 3
work page 1932
-
[8]
Deep residual learning for image recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016. 1, 2, 6
work page 2016
-
[9]
Yu He, Kechen Song, Qinggang Meng, and Yunhui Yan. An end-to-end steel surface defect detection approach via fusing multiple hierarchical features.IEEE transactions on instru- mentation and measurement, 69(4):1493–1504, 2019. 1
work page 2019
-
[10]
Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for mo- bilenetv3. InProceedings of the IEEE/CVF international conference on computer vision, pages 1314–1324, 2019. 6
work page 2019
-
[11]
Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perel- man, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Weli- hinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024. 2, 4
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[12]
Winclip: Zero- /few-shot anomaly classification and segmentation
Jongheon Jeong, Yang Zou, Taewan Kim, Dongqing Zhang, Avinash Ravichandran, and Onkar Dabeer. Winclip: Zero- /few-shot anomaly classification and segmentation. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19606–19616, 2023. 1, 3
work page 2023
-
[13]
Xiaofeng Ji, Faming Gong, Nuanlai Wang, Yanpu Zhao, Yuhui Ma, and Zhuang Shi. Pixel-level semantic parsing in complex industrial scenarios using large vision-language models.Information Fusion, 116:102794, 2025. 3
work page 2025
-
[14]
MMAD: A comprehensive benchmark for multimodal large language models in industrial anomaly detection
Xi Jiang, Jian Li, Hanqiu Deng, Yong Liu, Bin-Bin Gao, Yifeng Zhou, Jialin Li, Chengjie Wang, and Feng Zheng. MMAD: A comprehensive benchmark for multimodal large language models in industrial anomaly detection. InThe Thirteenth International Conference on Learning Represen- tations, 2025. 3
work page 2025
-
[15]
Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalan- tidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations.International journal of computer vision, 123(1):32–73, 2017. 2
work page 2017
-
[16]
Hugo Laurenc ¸on, L´eo Tronchon, Matthieu Cord, and Victor Sanh. What matters when building vision-language mod- els?Advances in Neural Information Processing Systems, 37:87874–87907, 2024. 1
work page 2024
-
[17]
Kaiyan Lei, Zhiquan Qi, and Jin Song. Improving surface de- fect detection for trains based on visual-language knowledge guidance on tiny datasets.IEEE Transactions on Intelligent Transportation Systems, 2025. 1
work page 2025
-
[18]
Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. InInterna- tional conference on machine learning, pages 12888–12900,
-
[19]
Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023. 1
work page 2023
-
[20]
Improved baselines with visual instruction tuning
Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024. 2
work page 2024
-
[21]
Image retrieval on real-life images with pre-trained vision-and-language models
Zheyuan Liu, Cristian Rodriguez-Opazo, Damien Teney, and Stephen Gould. Image retrieval on real-life images with pre-trained vision-and-language models. InProceedings of the IEEE/CVF international conference on computer vision, pages 2125–2134, 2021. 1
work page 2021
-
[22]
Generalized completed local binary patterns for time-efficient steel surface defect classification
Qiwu Luo, Yichuang Sun, Pengcheng Li, Oluyomi Simpson, Lu Tian, and Yigang He. Generalized completed local binary patterns for time-efficient steel surface defect classification. IEEE Transactions on Instrumentation and Measurement, 68 (3):667–679, 2018. 1
work page 2018
-
[23]
Xiaoming Lv, Fajie Duan, Jia-jia Jiang, Xiao Fu, and Lin Gan. Deep metallic surface defect detection: The new bench- mark and detection network.Sensors, 20(6):1562, 2020. 2, 3, 8
work page 2020
-
[24]
Shufflenet v2: Practical guidelines for efficient cnn architec- ture design
Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. Shufflenet v2: Practical guidelines for efficient cnn architec- ture design. InProceedings of the European Conference on Computer Vision, pages 116–131, 2018. 6
work page 2018
-
[25]
Aa-clip: Enhancing zero-shot anomaly detection via anomaly-aware clip
Wenxin Ma, Xu Zhang, Qingsong Yao, Fenghe Tang, Chenxu Wu, Yingtai Li, Rui Yan, Zihang Jiang, and S Kevin Zhou. Aa-clip: Enhancing zero-shot anomaly detection via anomaly-aware clip. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 4744–4754, 2025. 1
work page 2025
-
[26]
Yuxin Ma, Jiaxing Yin, Feng Huang, and Qipeng Li. Surface defect inspection of industrial products with object detection deep networks: A systematic review.Artificial Intelligence Review, 57(12):333, 2024. 1
work page 2024
-
[27]
What does a platypus look like? generating customized prompts for zero-shot image classification
Sarah Pratt, Ian Covert, Rosanne Liu, and Ali Farhadi. What does a platypus look like? generating customized prompts for zero-shot image classification. InProceedings of the IEEE/CVF international conference on computer vi- sion, pages 15691–15701, 2023. 3
work page 2023
-
[28]
Learning transferable visual models from natural language supervi- sion
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763, 2021. 1, 2, 7
work page 2021
-
[29]
Sentence-bert: Sentence embeddings using siamese bert-networks
Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. InProceedings of the 2019 Conference on Empirical Methods in Natural Lan- guage Processing, 2019. 4
work page 2019
-
[30]
Ylli Sadikaj, Hongkuan Zhou, Lavdim Halilaj, Stefan Schmid, Steffen Staab, and Claudia Plant. Multiads: Defect- aware supervision for multi-type anomaly detection and seg- mentation in zero-shot learning. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 22978–22988, 2025. 3
work page 2025
-
[31]
Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Worts- man, et al. Laion-5b: An open large-scale dataset for training next generation image-text models.Advances in neural in- formation processing systems, 35:25278–25294, 2022. 2
work page 2022
-
[32]
Kechen Song and Yunhui Yan. A noise robust method based on completed local binary patterns for hot-rolled steel strip surface defects.Applied Surface Science, 285:858–864,
-
[33]
Kechen Song, Hu Feng, Tonglei Cao, Wenqi Cui, and Yunhui Yan. Mfanet: Multifeature aggregation network for cross- granularity few-shot seamless steel tubes surface defect seg- mentation.IEEE Transactions on Industrial Informatics, 20 (7):9725–9735, 2024. 2, 8
work page 2024
-
[34]
EVA-CLIP: Improved Training Techniques for CLIP at Scale
Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389, 2023. 2, 7
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[35]
Omnidrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning
Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, and Jose M Al- varez. Omnidrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22442–22452, 2025. 3
work page 2025
-
[36]
Cvlue: A new benchmark dataset for chinese vision-language under- standing evaluation
Yuxuan Wang, Yijun Liu, Fei Yu, Chen Huang, Kexin Li, Zhiguo Wan, Wanxiang Che, and Hongyang Chen. Cvlue: A new benchmark dataset for chinese vision-language under- standing evaluation. InProceedings of the AAAI Conference on Artificial Intelligence, pages 8196–8204, 2025. 3
work page 2025
-
[37]
Skyscript: A large and seman- tically diverse vision-language dataset for remote sensing
Zhecheng Wang, Rajanie Prabha, Tianyuan Huang, Jiajun Wu, and Ram Rajagopal. Skyscript: A large and seman- tically diverse vision-language dataset for remote sensing. InProceedings of the AAAI Conference on Artificial Intel- ligence, pages 5805–5813, 2024. 2, 3, 7
work page 2024
-
[38]
Steel sur- face defect recognition: A survey.Coatings, 13(1):17, 2022
Xin Wen, Jvran Shan, Yu He, and Kechen Song. Steel sur- face defect recognition: A survey.Coatings, 13(1):17, 2022. 1
work page 2022
-
[39]
Weiwei Xiao, Kechen Song, Jie Liu, and Yunhui Yan. Graph embedding and optimal transport for few-shot classification of metal surface defect.IEEE Transactions on Instrumenta- tion and Measurement, 71:1–10, 2022. 2, 5, 8
work page 2022
-
[40]
FG-CLIP: Fine-grained visual and textual alignment
Chunyu Xie, Bin Wang, Fanjing Kong, Jincheng Li, Dawei Liang, Gengshen Zhang, Dawei Leng, and Yuhui Yin. FG-CLIP: Fine-grained visual and textual alignment. In Forty-second International Conference on Machine Learn- ing, 2025. 2, 7, 8
work page 2025
-
[41]
Long-clip: Unlocking the long-text capability of clip
Beichen Zhang, Pan Zhang, Xiaoyi Dong, Yuhang Zang, and Jiaqi Wang. Long-clip: Unlocking the long-text capability of clip. InProceedings of the European Conference on Com- puter Vision, pages 310–325, 2024. 2, 7
work page 2024
-
[42]
Tip- adapter: Training-free adaption of clip for few-shot classifi- cation
Renrui Zhang, Wei Zhang, Rongyao Fang, Peng Gao, Kun- chang Li, Jifeng Dai, Yu Qiao, and Hongsheng Li. Tip- adapter: Training-free adaption of clip for few-shot classifi- cation. InProceedings of the European Conference on Com- puter Vision, pages 493–510, 2022. 7
work page 2022
-
[43]
Wenli Zhao, Kechen Song, Yanyan Wang, Shubo Liang, and Yunhui Yan. Fanet: Feature-aware network for few shot clas- sification of strip steel surface defects.Measurement, 208: 112446, 2023. 3, 5
work page 2023
-
[44]
Extract free dense labels from clip
Chong Zhou, Chen Change Loy, and Bo Dai. Extract free dense labels from clip. InProceedings of the European Con- ference on Computer Vision, pages 696–712, 2022. 8
work page 2022
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.