arxiv: 2606.01962 · v2 · pith:REKS2KFWnew · submitted 2026-06-01 · 💻 cs.CV

Contrastive Augmented Transformer with Domain-specific Enhancement for Robust Multi-scenario Metal Surface Defect Detection

Pith reviewed 2026-06-28 15:10 UTC · model grok-4.3

classification 💻 cs.CV
keywords metal surface defect detection · transformer · contrastive learning · data augmentation · industrial inspection · Swin Transformer · generalization · feature pyramid network

The pith

The CAT framework pairs a Swin Transformer backbone with droplet augmentation to reach a 99.54% pixel-level AUROC on KolektorSDD2 while generalizing to three unseen datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Contrastive Augmented Transformer (CAT) to address limited annotated data, subtle multi-scale defects, and poor generalization in metal surface defect detection. It employs a hierarchical Swin Transformer backbone with a redesigned feature pyramid network to fuse low-level textures and high-level semantics. A domain-specific droplet augmentation algorithm improves robustness to real-world noise, while hard negative mining in the contrastive loss strengthens discrimination in ambiguous regions. The approach is evaluated on KolektorSDD2 and shows strong results on three additional datasets without per-dataset tuning.
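The fusion of low-level textures with high-level semantics can be illustrated with a generic two-pass pyramid. This is a minimal sketch, not the paper's redesigned network: the function names are hypothetical, all levels are assumed to share one channel count (a real feature pyramid would typically add lateral projections), and each level is assumed to halve the spatial size of the previous one.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def downsample2x(x):
    """2x2 average pooling of a (C, H, W) map with even H and W."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def bidirectional_fuse(features):
    """Fuse maps ordered fine -> coarse (assumed layout, same channels)."""
    # Top-down pass: push high-level semantics into the finer levels.
    top_down = [None] * len(features)
    top_down[-1] = features[-1]
    for i in range(len(features) - 2, -1, -1):
        top_down[i] = features[i] + upsample2x(top_down[i + 1])
    # Bottom-up pass: push low-level texture back into the coarser levels.
    fused = [top_down[0]]
    for i in range(1, len(features)):
        fused.append(top_down[i] + downsample2x(fused[i - 1]))
    return fused

# Toy pyramid: three levels of constant feature maps.
pyramid = [np.ones((4, 8, 8)), np.ones((4, 4, 4)), np.ones((4, 2, 2))]
fused = bidirectional_fuse(pyramid)
print([f.shape for f in fused])
```

Plain addition is used here only to keep the example short; learned or weighted fusion would slot into the same two loops.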

Core claim

The CAT framework employs a hierarchical Swin Transformer backbone and a redesigned feature pyramid network to model subtle multi-scale defect patterns. Combined with a domain-specific droplet augmentation algorithm and hard negative mining in the contrastive loss, it achieves a pixel-level AUROC of 99.54% on KolektorSDD2 and superior generalization on unseen datasets including KSDD1, MTD for tile defects, and MSDD for rail surface defects.
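For readers unfamiliar with the headline metric, pixel-level AUROC treats every pixel as one binary decision and measures how well the anomaly scores rank defect pixels above defect-free ones. The sketch below uses the rank-based (Mann-Whitney) formulation; the function name and array layout are assumptions for illustration, and the paper's own evaluation code is not shown in the source.

```python
import numpy as np

def pixel_auroc(scores, masks):
    """Rank-based AUROC over all pixels; ties count as half a win."""
    s = np.asarray(scores, dtype=np.float64).ravel()
    y = np.asarray(masks).ravel().astype(bool)
    n_pos = int(y.sum())
    n_neg = y.size - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need both defect and defect-free pixels")
    order = np.argsort(s, kind="mergesort")
    sorted_s = s[order]
    ranks = np.empty(s.size, dtype=np.float64)
    # Find runs of equal scores and give each run its average 1-based rank.
    starts = np.flatnonzero(np.r_[True, sorted_s[1:] != sorted_s[:-1]])
    ends = np.r_[starts[1:], s.size]
    for a, b in zip(starts, ends):
        ranks[order[a:b]] = 0.5 * (a + 1 + b)
    # Mann-Whitney U statistic of the positive (defect) pixels.
    u = ranks[y].sum() - n_pos * (n_pos + 1) / 2.0
    return float(u / (n_pos * n_neg))

# Toy 2x2 score map with two defect pixels in the bottom row.
toy_scores = np.array([[0.1, 0.4], [0.35, 0.8]])
toy_mask = np.array([[0, 0], [1, 1]])
print(pixel_auroc(toy_scores, toy_mask))
```

Because defect pixels are usually a small fraction of an image, a very high pixel-level AUROC can coexist with visible localization errors, which is one reason the referee report asks for splits and error bars.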

What carries the argument

Contrastive Augmented Transformer (CAT) with domain-specific droplet augmentation and hard negative mining in contrastive loss

If this is right

  • CAT achieves a pixel-level AUROC of 99.54% on KolektorSDD2, outperforming existing methods.
  • CAT exhibits superior generalization and robustness on three unseen datasets: KSDD1, MTD for tile defects, and MSDD for rail surface defects.
  • The domain-specific droplet augmentation enhances robustness under real-world noise conditions.
  • Hard negative mining strengthens the model's discrimination ability in ambiguous defect regions.
  • The framework shows potential for wide-scale industrial deployment without extensive per-scenario retraining.
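The hard negative mining mentioned in the list above can be sketched with a generic InfoNCE-style loss that keeps only the negatives most similar to the anchor. The source does not give the paper's loss formulation, so the top-k selection rule, the temperature value, and the function name here are all assumptions.

```python
import numpy as np

def hard_negative_info_nce(anchor, positive, negatives, k=2, temperature=0.1):
    """InfoNCE loss for one anchor using only the k hardest negatives (k >= 1)."""
    def unit(v):
        # L2-normalise so dot products are cosine similarities.
        return v / np.linalg.norm(v, axis=-1, keepdims=True)

    a, p, n = unit(anchor), unit(positive), unit(negatives)
    pos_logit = float(a @ p) / temperature
    neg_logits = (n @ a) / temperature
    # "Hard" negatives are those most similar to the anchor (assumed rule).
    hard = np.sort(neg_logits)[-k:]
    logits = np.concatenate(([pos_logit], hard))
    # Numerically stable log-sum-exp for the softmax denominator.
    m = logits.max()
    return float(m + np.log(np.exp(logits - m).sum()) - pos_logit)

anchor = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])
negatives = np.array([[0.8, 0.2], [-1.0, 0.0], [0.0, 1.0]])
print(hard_negative_info_nce(anchor, positive, negatives, k=1))
```

Restricting the denominator to the hardest negatives concentrates the gradient on look-alike regions, which matches the stated goal of sharper discrimination in ambiguous areas.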

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The combination of contrastive learning and targeted augmentation may reduce reliance on large labeled datasets for other industrial vision tasks with scarce annotations.
  • The approach could extend to surface inspection in non-metal domains where defects vary in scale and appear under varying lighting or contamination.
  • Hard negative mining paired with domain-specific augmentations might improve performance in related ambiguous-label problems such as anomaly detection in medical imaging.

Load-bearing premise

The domain-specific droplet augmentation algorithm and hard negative mining strategy will enhance robustness and discrimination in ambiguous regions across real-world noise conditions and unseen datasets without overfitting or requiring dataset-specific tuning.
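To make the premise concrete: the abstract does not describe how the droplet algorithm works, so the following is a purely hypothetical stand-in that blends soft-edged blobs into a grayscale image. Every name and parameter (`n_droplets`, `max_radius`, `strength`, `seed`) is invented for illustration and should not be read as the paper's method.

```python
import numpy as np

def droplet_like_augment(image, n_droplets=3, max_radius=6, strength=0.5, seed=0):
    """Blend soft circular blobs into a grayscale image with values in [0, 1].

    Hypothetical stand-in, NOT the paper's droplet algorithm.
    Returns the augmented image and a boolean mask of altered pixels.
    """
    rng = np.random.default_rng(seed)
    h, w = image.shape
    yy, xx = np.mgrid[0:h, 0:w]
    out = image.astype(np.float64).copy()
    mask = np.zeros((h, w), dtype=bool)
    for _ in range(n_droplets):
        cy, cx = rng.integers(0, h), rng.integers(0, w)
        r = rng.integers(2, max_radius + 1)  # requires max_radius >= 2
        dist2 = (yy - cy) ** 2 + (xx - cx) ** 2
        # Gaussian falloff gives each blob a soft edge; cut off at radius r.
        alpha = strength * np.exp(-dist2 / (2.0 * (r / 2.0) ** 2))
        alpha[dist2 > r ** 2] = 0.0
        tone = rng.uniform(0.0, 1.0)  # random blob intensity
        out = (1.0 - alpha) * out + alpha * tone
        mask |= alpha > 0
    return np.clip(out, 0.0, 1.0), mask

flat = np.full((32, 32), 0.5)
augmented, altered = droplet_like_augment(flat, seed=1)
print(augmented.shape, int(altered.sum()))
```

Even this toy version exposes several tunable parameters, which is why the claim of no dataset-specific tuning depends on how such parameters were fixed.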

What would settle it

Testing CAT on a fourth unseen industrial dataset with novel noise patterns and observing whether its AUROC advantage over baselines disappears or requires retuning of the augmentation parameters.

Figures

Figures reproduced from arXiv: 2606.01962 by Huan Wang, Liyuan Ren, Wenxiao He, Yiyao Liu.

Figure 1: CAT’s architecture. Images are first padded into a fixed size. One copy of …
Figure 2: Structure of Swin Transformer block. One uses Window-based Multi-Head Self …
Figure 3: Process of droplet augmentation, it simulates physical damage of metal surfaces. …
Figure 4: Multi-Scale Fusion FPN (MSF-FPN). A bi-directional fusion module combines …
Figure 5: Representative examples of defect samples from four datasets (KolektorSDD1, …
Figure 6: Example results of eight synthetic defect-generation methods applied to one …
Figure 7: Qualitative visualization of failure cases in the CAT framework, highlighting …
Original abstract

Metal surface defect detection is critical for maintaining product quality in industrial manufacturing. However, it faces significant challenges, including limited annotated data, difficulty in identifying subtle multi-scale defects, and poor generalization across diverse scenarios. To address these issues, this paper proposes a novel Contrastive Augmented Transformer (CAT) framework for robust defect detection. CAT employs a hierarchical Swin Transformer backbone and redesigns the feature pyramid network to effectively fuse low-level textures with high-level semantics, enabling precise modeling of subtle and multi-scale defect patterns. To enhance robustness under real-world noise conditions, we propose a domain-specific droplet augmentation algorithm. Furthermore, we incorporate a hard negative mining strategy into the contrastive loss to strengthen the model's discrimination ability in ambiguous defect regions. Experimental results on the KolektorSDD2 dataset demonstrate that CAT achieves a pixel-level AUROC of 99.54%, outperforming existing methods. In addition, CAT exhibits superior generalization and robustness on three unseen datasets, including KSDD1, MTD for tile defects, and MSDD for rail surface defects, demonstrating its potential for wide-scale industrial deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes the Contrastive Augmented Transformer (CAT) for robust multi-scenario metal surface defect detection. It uses a hierarchical Swin Transformer backbone with a redesigned feature pyramid network to fuse low-level textures and high-level semantics, introduces a domain-specific droplet augmentation algorithm for noise robustness, and adds hard negative mining to the contrastive loss for better discrimination in ambiguous regions. Central claims are a pixel-level AUROC of 99.54% on KolektorSDD2 (outperforming existing methods) plus superior generalization and robustness on three unseen datasets (KSDD1, MTD, MSDD).

Significance. If the empirical results hold under detailed verification, the work addresses practical challenges in industrial defect detection with limited annotations and cross-scenario noise, potentially enabling more reliable automated inspection systems.

major comments (1)
  1. [Experiments section] The reported 99.54% pixel-level AUROC and outperformance claims lack any description of data splits, baseline implementations, number of runs, error bars, or hyperparameter tuning protocols; without these, the central performance claims cannot be assessed for reproducibility or statistical significance.
minor comments (1)
  1. [Abstract] The metrics (pixel-level AUROC) and exact comparison methods should be stated more explicitly to allow immediate evaluation of the generalization claims.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the need for greater experimental transparency. We will revise the manuscript to address this concern directly.

Point-by-point responses
  1. Referee: [Experiments section] The reported 99.54% pixel-level AUROC and outperformance claims lack any description of data splits, baseline implementations, number of runs, error bars, or hyperparameter tuning protocols; without these, the central performance claims cannot be assessed for reproducibility or statistical significance.

    Authors: We agree that the current manuscript lacks explicit details on these experimental protocols. In the revised version, we will expand the Experiments section with a new subsection that specifies: (1) the exact train/validation/test splits and ratios used on KolektorSDD2 (and the three generalization datasets), (2) baseline implementation sources (official code or our re-implementations with hyperparameters), (3) the number of independent runs performed (with random seeds), (4) standard deviations reported as error bars on all metrics, and (5) the hyperparameter tuning procedure (grid or random search ranges and final selected values). These additions will enable proper evaluation of reproducibility and statistical significance.

    Revision: yes
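The reporting protocol promised in this response (several seeded runs, with standard deviations as error bars) can be sketched as follows. The function name is hypothetical and the AUROC values are placeholders chosen for illustration only; they are not results from the paper.

```python
import numpy as np

def summarize_runs(auroc_by_seed):
    """Return (mean, sample standard deviation) of per-seed AUROC values."""
    values = np.asarray(list(auroc_by_seed.values()), dtype=np.float64)
    if values.size < 2:
        raise ValueError("need at least two runs for a standard deviation")
    # ddof=1 gives the sample (not population) standard deviation.
    return float(values.mean()), float(values.std(ddof=1))

# Placeholder numbers, NOT taken from the paper: seed -> pixel-level AUROC.
runs = {0: 0.990, 1: 0.992, 2: 0.994}
mean, std = summarize_runs(runs)
print(f"pixel AUROC: {100 * mean:.2f} ± {100 * std:.2f} % over {len(runs)} seeds")
```

Reporting the spread alongside the mean lets a reader judge whether a gap between methods exceeds run-to-run variation.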

Circularity Check

0 steps flagged

No significant circularity

The paper is an applied computer vision contribution proposing a CAT architecture (Swin Transformer + redesigned FPN), a domain-specific droplet augmentation, and hard-negative contrastive mining. All load-bearing claims are direct empirical measurements (pixel AUROC 99.54% on KolektorSDD2, generalization on KSDD1/MTD/MSDD). No equations, fitted parameters, or self-citations are presented as derivations that reduce to the inputs by construction. The argument is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities. The 'domain-specific droplet augmentation algorithm' is introduced without details on its formulation or parameters.

pith-pipeline@v0.9.1-grok · 5730 in / 1197 out tokens · 29436 ms · 2026-06-28T15:10:25.126037+00:00 · methodology

