Contrastive Augmented Transformer with Domain-specific Enhancement for Robust Multi-scenario Metal Surface Defect Detection
Pith reviewed 2026-06-28 15:10 UTC · model grok-4.3
The pith
The CAT framework combines a Swin Transformer backbone with droplet augmentation to reach 99.54% pixel-level AUROC on KolektorSDD2 metal defect detection, and is reported to generalize to unseen datasets.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The CAT framework employs a hierarchical Swin Transformer backbone and a redesigned feature pyramid network to model subtle multi-scale defect patterns. Combined with a domain-specific droplet augmentation algorithm and hard negative mining in the contrastive loss, it achieves a pixel-level AUROC of 99.54% on KolektorSDD2 and superior generalization on unseen datasets including KSDD1, MTD for tile defects, and MSDD for rail surface defects.
What carries the argument
Contrastive Augmented Transformer (CAT) with domain-specific droplet augmentation and hard negative mining in contrastive loss
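One common way to realize hard negative mining in a contrastive loss is to keep only the negatives most similar to the anchor in the InfoNCE denominator. The sketch below illustrates that reading. It is an assumption, since the paper's own mining rule is not given on this page; the function names, `k`, and `temperature` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def hard_negative_info_nce(anchor, positive, negatives, k=2, temperature=0.1):
    """InfoNCE loss that keeps only the k negatives most similar to the anchor.

    Assumption: "hard" means highest cosine similarity to the anchor.
    """
    pos_logit = cosine(anchor, positive) / temperature
    neg_sims = sorted((cosine(anchor, n) for n in negatives), reverse=True)
    hard_logits = [s / temperature for s in neg_sims[:k]]
    # Numerically stable log-sum-exp over the positive and the hard negatives.
    logits = [pos_logit] + hard_logits
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - pos_logit
```

With this formulation, a negative that lies close to the anchor contributes far more to the loss than a distant one, which is the intuition behind focusing training on ambiguous regions.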
If this is right
- CAT achieves a pixel-level AUROC of 99.54% on KolektorSDD2, outperforming existing methods.
- CAT exhibits superior generalization and robustness on three unseen datasets: KSDD1, MTD for tile defects, and MSDD for rail surface defects.
- The domain-specific droplet augmentation enhances robustness under real-world noise conditions.
- Hard negative mining strengthens the model's discrimination ability in ambiguous defect regions.
- The framework shows potential for wide-scale industrial deployment without extensive per-scenario retraining.
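For readers unfamiliar with the headline metric, the following is a minimal sketch of how a pixel-level AUROC is typically computed: every pixel of every test image is pooled, and the score is the probability that a defect pixel is ranked above a defect-free one. The function name and the list-of-lists input format are assumptions for illustration; the paper's evaluation code is not shown on this page.

```python
def pixel_auroc(score_maps, mask_maps):
    """Pixel-level AUROC: pool every pixel, then rank scores (Mann-Whitney U).

    score_maps: list of 2-D lists of anomaly scores (higher = more defective).
    mask_maps:  list of 2-D lists of 0/1 ground-truth labels, same shapes.
    """
    scores = [s for img in score_maps for row in img for s in row]
    labels = [m for img in mask_maps for row in img for m in row]
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs both defect and defect-free pixels")

    # Assign average ranks so tied scores count as half a correct ordering.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for t in range(i, j + 1):
            ranks[order[t]] = avg_rank
        i = j + 1

    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    u_stat = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u_stat / (n_pos * n_neg)
```

Because defect pixels are usually a tiny fraction of each image, a pixel-level AUROC can be very high even when small defects are partly missed, which is one reason the choice of metric matters when comparing methods.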
Where Pith is reading between the lines
- The combination of contrastive learning and targeted augmentation may reduce reliance on large labeled datasets for other industrial vision tasks with scarce annotations.
- The approach could extend to surface inspection in non-metal domains where defects vary in scale and appear under varying lighting or contamination.
- Hard negative mining paired with domain-specific augmentations might improve performance in related ambiguous-label problems such as anomaly detection in medical imaging.
Load-bearing premise
The domain-specific droplet augmentation algorithm and hard negative mining strategy will enhance robustness and discrimination in ambiguous regions across real-world noise conditions and unseen datasets without overfitting or requiring dataset-specific tuning.
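To make the premise concrete, here is a purely hypothetical sketch of what a droplet-style augmentation could look like: soft circular highlights blended into a grayscale image. The paper's actual algorithm is not described on this page, so the function `add_droplets` and all of its parameters (`n_droplets`, `max_radius`, `strength`) are illustrative assumptions.

```python
import random

def add_droplets(image, n_droplets=3, max_radius=4, strength=0.5, seed=None):
    """Return a copy of a grayscale image (2-D list, values in [0, 1]) with
    soft circular highlights blended in to imitate liquid droplets.

    Hypothetical stand-in: not the paper's droplet algorithm.
    """
    rng = random.Random(seed)
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]  # leave the input untouched
    for _ in range(n_droplets):
        cy, cx = rng.randrange(height), rng.randrange(width)
        radius = rng.randint(1, max_radius)
        for y in range(max(0, cy - radius), min(height, cy + radius + 1)):
            for x in range(max(0, cx - radius), min(width, cx + radius + 1)):
                dist = ((y - cy) ** 2 + (x - cx) ** 2) ** 0.5
                if dist <= radius:
                    # Blend toward white, strongest at the droplet centre.
                    alpha = strength * (1.0 - dist / (radius + 1))
                    out[y][x] = (1.0 - alpha) * out[y][x] + alpha
    return out
```

Parameters such as droplet count, radius, and blend strength are exactly the kind of knobs that might need retuning per dataset, which is why the premise of no dataset-specific tuning carries so much weight.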
What would settle it
Testing CAT on a fourth unseen industrial dataset with novel noise patterns and observing whether its AUROC advantage over baselines disappears or requires retuning of the augmentation parameters.
Original abstract
Metal surface defect detection is critical for maintaining product quality in industrial manufacturing. However, it faces significant challenges, including limited annotated data, difficulty in identifying subtle multi-scale defects, and poor generalization across diverse scenarios. To address these issues, this paper proposes a novel Contrastive Augmented Transformer (CAT) framework for robust defect detection. CAT employs a hierarchical Swin Transformer backbone and redesigns the feature pyramid network to effectively fuse low-level textures with high-level semantics, enabling precise modeling of subtle and multi-scale defect patterns. To enhance robustness under real-world noise conditions, we propose a domain-specific droplet augmentation algorithm. Furthermore, we incorporate a hard negative mining strategy into the contrastive loss to strengthen the model's discrimination ability in ambiguous defect regions. Experimental results on the KolektorSDD2 dataset demonstrate that CAT achieves a pixel-level AUROC of 99.54%, outperforming existing methods. In addition, CAT exhibits superior generalization and robustness on three unseen datasets, including KSDD1, MTD for tile defects, and MSDD for rail surface defects, demonstrating its potential for wide-scale industrial deployment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes the Contrastive Augmented Transformer (CAT) for robust multi-scenario metal surface defect detection. It uses a hierarchical Swin Transformer backbone with a redesigned feature pyramid network to fuse low-level textures and high-level semantics, introduces a domain-specific droplet augmentation algorithm for noise robustness, and adds hard negative mining to the contrastive loss for better discrimination in ambiguous regions. Central claims are a pixel-level AUROC of 99.54% on KolektorSDD2 (outperforming existing methods) plus superior generalization and robustness on three unseen datasets (KSDD1, MTD, MSDD).
Significance. If the empirical results hold under detailed verification, the work addresses practical challenges in industrial defect detection with limited annotations and cross-scenario noise, potentially enabling more reliable automated inspection systems.
major comments (1)
- [Experiments section] The reported 99.54% pixel-level AUROC and the outperformance claims come with no description of data splits, baseline implementations, number of runs, error bars, or hyperparameter tuning protocols. Without these, the central performance claims cannot be assessed for reproducibility or statistical significance.
minor comments (1)
- [Abstract] The abstract should state explicitly which metric (e.g., pixel-level AUROC) and which comparison methods underlie the generalization claims, so that those claims can be evaluated at a glance.
Simulated Author's Rebuttal
We thank the referee for highlighting the need for greater experimental transparency. We will revise the manuscript to address this concern directly.
Point-by-point responses
Referee: [Experiments section] The reported 99.54% pixel-level AUROC and the outperformance claims come with no description of data splits, baseline implementations, number of runs, error bars, or hyperparameter tuning protocols. Without these, the central performance claims cannot be assessed for reproducibility or statistical significance.
Authors: We agree that the current manuscript lacks explicit details on these experimental protocols. In the revised version, we will expand the Experiments section with a new subsection that specifies: (1) the exact train/validation/test splits and ratios used on KolektorSDD2 (and the three generalization datasets), (2) baseline implementation sources (official code or our re-implementations with hyperparameters), (3) the number of independent runs performed (with random seeds), (4) standard deviations reported as error bars on all metrics, and (5) the hyperparameter tuning procedure (grid or random search ranges and final selected values). These additions will enable proper evaluation of reproducibility and statistical significance.
Revision: yes
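As an illustration of items (3) and (4) in the authors' response, a minimal sketch of reporting a metric as mean and sample standard deviation over seeded runs might look as follows. The helper name `summarize_runs` is hypothetical, and the numbers are placeholders for illustration only; they are not results from the paper.

```python
import statistics

def summarize_runs(auroc_by_seed):
    """Return (mean, sample standard deviation) of per-seed AUROC values."""
    values = list(auroc_by_seed.values())
    if len(values) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return statistics.mean(values), statistics.stdev(values)

# Placeholder numbers for illustration only; they are not results from the paper.
example_runs = {0: 0.990, 1: 0.992, 2: 0.994}
mean, std = summarize_runs(example_runs)
print(f"pixel AUROC: {100 * mean:.2f} ± {100 * std:.2f} (n={len(example_runs)})")
```

Reporting the spread alongside the mean lets a reader judge whether a gap to a baseline is larger than run-to-run variation.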
Circularity Check
No significant circularity
The paper is an applied computer vision contribution proposing the CAT architecture (Swin Transformer plus a redesigned FPN), a domain-specific droplet augmentation, and hard negative mining in the contrastive loss. All load-bearing claims are direct empirical measurements (a pixel-level AUROC of 99.54% on KolektorSDD2, and generalization on KSDD1, MTD, and MSDD). No equations, fitted parameters, or self-citations are presented as derivations that reduce to the inputs by construction. The argument is therefore self-contained against external benchmarks.