arxiv: 2601.20524 · v2 · submitted 2026-01-28 · 💻 cs.CV

AnomalyVFM -- Transforming Vision Foundation Models into Zero-Shot Anomaly Detectors

Pith reviewed 2026-05-16 10:43 UTC · model grok-4.3

classification 💻 cs.CV
keywords zero-shot anomaly detection · vision foundation models · synthetic data generation · anomaly localization · parameter-efficient adaptation · low-rank adapters

The pith

AnomalyVFM converts any vision foundation model into a zero-shot anomaly detector using synthetic data and efficient adaptation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that vision foundation models can be adapted for zero-shot anomaly detection by generating diverse synthetic anomalies in three stages and training low-rank feature adapters with a confidence-weighted loss. This approach targets the performance gap between vision-language models and pure vision models when no in-domain training data is available. A sympathetic reader would care because it enables anomaly detection in new domains without collecting real anomaly examples, which are often rare or expensive to obtain. With the RADIO backbone, it reaches 94.1% average image-level AUROC across nine datasets, 3.3 percentage points above prior methods. The framework is general and works with various pretrained VFMs.

Core claim

AnomalyVFM is a framework that combines a three-stage synthetic dataset generation scheme with parameter-efficient adaptation, using low-rank feature adapters and a confidence-weighted pixel loss, to transform pretrained vision foundation models into strong zero-shot anomaly detectors, reaching 94.1% average image-level AUROC across nine datasets with the RADIO backbone.

What carries the argument

The three-stage synthetic anomaly dataset generation combined with low-rank feature adapters and confidence-weighted pixel loss, which adapts the VFM features for anomaly scoring without full retraining.
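
To make the "low-rank feature adapter" idea concrete, here is a minimal sketch in plain Python of a LoRA-style residual adapter acting on a single feature vector. It is an illustration under stated assumptions, not the paper's implementation: the class name `LowRankAdapter`, the feature dimension 768, the rank 8, and the zero initialisation of `B` are all hypothetical choices, and where the adapters sit inside the VFM is not specified in this summary.

```python
import random

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

class LowRankAdapter:
    """Residual low-rank update: h -> h + scale * B(A h).

    A is (rank x dim) with small random entries; B is (dim x rank) and
    starts at zero, so the adapter is an identity map before training.
    All shapes and the init scheme are assumptions for illustration.
    """
    def __init__(self, dim, rank, scale=1.0, seed=0):
        rng = random.Random(seed)
        self.A = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(dim)]
        self.scale = scale

    def num_params(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])

    def __call__(self, h):
        down = matvec(self.A, h)   # project dim -> rank
        up = matvec(self.B, down)  # project rank -> dim
        return [x + self.scale * u for x, u in zip(h, up)]

adapter = LowRankAdapter(dim=768, rank=8)   # hypothetical sizes
feature = [0.1] * 768
print(adapter(feature) == feature)          # identity before any training
print(adapter.num_params(), "vs", 768 * 768, "for a full dense layer")
```

The parameter count is what makes the adaptation parameter-efficient: the two thin matrices hold far fewer weights than one full `dim x dim` layer, and the frozen backbone is left untouched.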

If this is right

  • Any pretrained VFM can be turned into a zero-shot anomaly detector without domain-specific training images.
  • With the RADIO backbone, zero-shot image-level AUROC reaches 94.1% on average across nine datasets, 3.3 percentage points above prior methods.
  • The method closes the gap with VLM-based approaches by using only vision models.
  • Adaptation is parameter-efficient, making it practical for large models.
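
The headline number is an image-level AUROC, so it may help to recall what that metric measures: the probability that a randomly chosen anomalous image receives a higher anomaly score than a randomly chosen normal one. The sketch below computes it by direct pairwise comparison; the function name `auroc` and the toy scores are illustrative placeholders, not values from the paper.

```python
def auroc(scores, labels):
    """AUROC as the fraction of (anomalous, normal) pairs ranked correctly.

    scores: anomaly score per image (higher means more anomalous).
    labels: 1 for anomalous, 0 for normal. Ties count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one normal and one anomalous sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy placeholder scores: one anomalous image (0.35) ranks below a normal one (0.4).
scores = [0.9, 0.8, 0.35, 0.4, 0.1]
labels = [1, 1, 1, 0, 0]
print(auroc(scores, labels))
```

An average over datasets, as reported above, would then be the mean of one such value per dataset.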

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Such synthetic data methods could extend to other vision tasks where real anomalies are scarce.
  • Future work might explore combining this with multi-modal models for even better localization.
  • Testing on more industrial or medical domains could reveal if the synthetic generation generalizes further.

Load-bearing premise

The synthetic anomalies generated in three stages have statistical properties close enough to real anomalies in the nine test domains.

What would settle it

Evaluating AnomalyVFM on a new dataset where real anomalies differ markedly in appearance or distribution from the synthetic ones generated, and finding performance drops below prior methods.

Figures

Figures reproduced from arXiv: 2601.20524 by Danijel Skočaj, Matic Fučka, Vitjan Zavrtanik.

Figure 1 (p. 1): Vision–language models excel in zero-shot anomaly de…
Figure 2 (p. 4): Examples of generated anomaly-free images
Figure 3: Dataset generation pipeline. The image I is generated using a text-conditioned image generation model. Then, the foreground mask M_fg is extracted and an anomalous region R is sampled from it. Then, the anomalous image I_a is generated by inpainting an anomaly inside R. Finally, features are extracted from I and I_a, and then compared and thresholded to obtain M. f and f_a, respectively. The extracted featu…
Figure 4 (p. 5): Architecture of AnomalyVFM. All additions to the base VFM are colored in…
Figure 5 (p. 7): Qualitative comparison of the anomaly segmentation masks produced by AnomalyVFM and two other best-performing methods.
Figure 1 (p. 13): Failure Cases in Image Generation Process
Figure 2 (p. 14): Anomalous Area Distribution in the generated synthetic…
Figure 3 (p. 15): Model performance and rejection rate in relation to filtering threshold
Figure 4 (p. 16): Model performance in comparison to the number of…
Figure 5 (p. 16): Model performance in comparison to the number of images in the training set.
Figure 6 (p. 16): Qualitative examples of anomaly segmentation masks produced by AnomalyVFM. In the first row, the image is shown. In the…
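
The last step in the dataset generation caption (features of the clean and anomalous image are compared and thresholded to obtain the mask M) can be sketched as follows. This is a hedged illustration only: the caption does not say which distance or threshold is used, so cosine distance, the value `tau=0.5`, and the helper names `cosine_distance` and `mask_from_features` are assumptions.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv + 1e-12)

def mask_from_features(f, fa, tau):
    """f, fa: grids (rows x cols) of patch features from the clean and the
    anomalous image. Returns a binary grid with 1 where they differ by more
    than tau (the distance and threshold are assumed, not from the paper)."""
    return [[1 if cosine_distance(p, q) > tau else 0
             for p, q in zip(row_f, row_fa)]
            for row_f, row_fa in zip(f, fa)]

# Toy 2x2 grid of 2-D patch features; only the top-right patch changes a lot.
f  = [[[1.0, 0.0], [1.0, 0.0]],
      [[0.0, 1.0], [0.0, 1.0]]]
fa = [[[1.0, 0.0], [0.0, 1.0]],
      [[0.0, 1.0], [0.1, 1.0]]]
M = mask_from_features(f, fa, tau=0.5)
print(M)
```

Deriving the mask from feature differences rather than from the sampled region R means the label follows what the inpainting actually changed, which is presumably why the pipeline includes this step.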

Original abstract

Zero-shot anomaly detection aims to detect and localise abnormal regions in the image without access to any in-domain training images. While recent approaches leverage vision-language models (VLMs), such as CLIP, to transfer high-level concept knowledge, methods based on purely vision foundation models (VFMs), like DINOv2, have lagged behind in performance. We argue that this gap stems from two practical issues: (i) limited diversity in existing auxiliary anomaly detection datasets and (ii) overly shallow VFM adaptation strategies. To address both challenges, we propose AnomalyVFM, a general and effective framework that turns any pretrained VFM into a strong zero-shot anomaly detector. Our approach combines a robust three-stage synthetic dataset generation scheme with a parameter-efficient adaptation mechanism, utilising low-rank feature adapters and a confidence-weighted pixel loss. Together, these components enable modern VFMs to substantially outperform current state-of-the-art methods. More specifically, with RADIO as a backbone, AnomalyVFM achieves an average image-level AUROC of 94.1% across 9 diverse datasets, surpassing previous methods by significant 3.3 percentage points. Project Page: https://maticfuc.github.io/anomaly_vfm/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes AnomalyVFM, a framework that adapts pretrained vision foundation models (VFMs) such as RADIO and DINOv2 into zero-shot anomaly detectors. It combines a three-stage synthetic anomaly dataset generation scheme with parameter-efficient low-rank feature adapters and a confidence-weighted pixel loss. The central empirical claim is that this yields an average image-level AUROC of 94.1% across nine diverse datasets with the RADIO backbone, outperforming prior methods by 3.3 percentage points.

Significance. If the performance claims hold under rigorous validation, the work would be significant for zero-shot anomaly detection. It shows that modern VFMs can close the gap with VLM-based approaches through synthetic data augmentation and efficient adaptation rather than direct parameter fitting to target domains. This could enable more practical deployment in domains with scarce real anomalies, such as industrial inspection.

major comments (3)
  1. [§3, Synthetic Dataset Generation] The three-stage procedure is load-bearing for the zero-shot claim, yet the manuscript provides no quantitative validation (e.g., feature distribution distances, texture statistics, or scale histograms) that the generated anomalies match the visual properties of real anomalies across the nine evaluation domains. Without this, the reported gains may reflect memorization of the generation recipe rather than transferable anomaly cues.
  2. [Results section, Table 1] The 94.1% average AUROC and 3.3 pp improvement are presented without statistical significance tests, standard deviations across runs, or explicit train/test split details. This leaves moderate uncertainty about whether the central performance claim is robust.
  3. [§4.2, Ablation Studies] The ablations on adapter rank and loss weighting coefficient do not include controls that vary the synthetic generation parameters, making it impossible to isolate whether gains arise from the adaptation mechanism or from the specific synthetic distribution.
minor comments (2)
  1. [Abstract] The abstract states 'surpassing previous methods by significant 3.3 percentage points' without naming the exact baselines or providing the per-dataset breakdown in the summary.
  2. [§4.1] Notation for the confidence-weighted loss could be clarified with an explicit equation reference to avoid ambiguity in the weighting term.
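
Major comment 2 asks for standard deviations across runs and a significance test. A minimal sketch of both, using only the Python standard library, is given below. The per-seed values are placeholders invented for illustration and are not results from the paper; the helper name `paired_t_statistic` is also hypothetical.

```python
import math
import statistics

def paired_t_statistic(a, b):
    """Paired t statistic for two methods evaluated on the same seeds or datasets.

    Returns (t, degrees_of_freedom). A p-value would then come from a
    t-distribution table or a statistics library.
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_d / (sd_d / math.sqrt(n)), n - 1

# Placeholder AUROC values for three seeds (NOT from the paper).
method_a = [0.91, 0.93, 0.92]
method_b = [0.88, 0.89, 0.90]

print(f"A: {statistics.mean(method_a):.3f} +/- {statistics.stdev(method_a):.3f}")
print(f"B: {statistics.mean(method_b):.3f} +/- {statistics.stdev(method_b):.3f}")
t, dof = paired_t_statistic(method_a, method_b)
print(f"paired t = {t:.3f} with {dof} degrees of freedom")
```

With only a handful of seeds the test has little power, so reporting the per-dataset differences alongside the statistic would likely be more informative than the p-value alone.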

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important aspects of rigor in validating the synthetic data, statistical robustness, and ablation design. We address each major comment below and will incorporate revisions to strengthen the manuscript.

point-by-point responses
  1. Referee: [§3, Synthetic Dataset Generation] The three-stage procedure is load-bearing for the zero-shot claim, yet the manuscript provides no quantitative validation (e.g., feature distribution distances, texture statistics, or scale histograms) that the generated anomalies match the visual properties of real anomalies across the nine evaluation domains. Without this, the reported gains may reflect memorization of the generation recipe rather than transferable anomaly cues.

    Authors: We agree that explicit quantitative validation of the synthetic anomalies would better support the zero-shot transferability claim. Although the consistent gains across nine diverse, unseen datasets provide indirect evidence against pure memorization, we will add analyses in the revised manuscript, including feature distribution distances (e.g., MMD between VFM embeddings of synthetic vs. real anomalies), texture statistics (e.g., LBP histograms), and scale histograms. These will be reported for representative datasets to demonstrate alignment with real anomaly properties. revision: yes

  2. Referee: [Results section, Table 1] The 94.1% average AUROC and 3.3 pp improvement are presented without statistical significance tests, standard deviations across runs, or explicit train/test split details. This leaves moderate uncertainty about whether the central performance claim is robust.

    Authors: We acknowledge that reporting variability and significance strengthens the central claim. In the revision, we will add standard deviations computed over multiple random seeds (minimum 3 runs), paired statistical significance tests (e.g., t-tests) against prior methods, and explicit clarification of the evaluation protocol: zero-shot adaptation uses only the synthetic dataset with no in-domain real images, while test splits follow the standard benchmarks from prior zero-shot anomaly detection literature. revision: yes

  3. Referee: [§4.2, Ablation Studies] The ablations on adapter rank and loss weighting coefficient do not include controls that vary the synthetic generation parameters, making it impossible to isolate whether gains arise from the adaptation mechanism or from the specific synthetic distribution.

    Authors: We will expand §4.2 with new ablation experiments that systematically vary synthetic generation parameters (e.g., anomaly density, scale ranges, and blending factors in the three-stage pipeline) while holding the adapter and loss fixed. This will help disentangle the contributions of the low-rank feature adapters and confidence-weighted pixel loss from the synthetic data characteristics, with results added as an additional table or figure. revision: yes
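
The first response proposes reporting MMD between embeddings of synthetic and real anomalies. As a rough illustration of what that check computes, here is a sketch of the biased squared-MMD estimate with an RBF kernel. The kernel choice, the `gamma` value, and the toy 2-D vectors standing in for VFM embeddings are all assumptions made for the example.

```python
import math

def rbf(x, y, gamma):
    """Gaussian (RBF) kernel between two vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def mmd2(xs, ys, gamma=1.0):
    """Biased estimate of squared maximum mean discrepancy between two samples."""
    def mean_k(a, b):
        return sum(rbf(p, q, gamma) for p in a for q in b) / (len(a) * len(b))
    return mean_k(xs, xs) + mean_k(ys, ys) - 2.0 * mean_k(xs, ys)

# Toy stand-ins for embedding sets (placeholders, not real features).
synthetic = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]]
near      = [[0.05, 0.0], [0.0, 0.05], [0.1, 0.1]]
far       = [[3.0, 3.0], [3.1, 3.0], [3.0, 3.1]]

print(mmd2(synthetic, near))  # small: distributions overlap
print(mmd2(synthetic, far))   # large: distributions are well separated
```

A small value for a given test domain would support the load-bearing premise that synthetic anomalies resemble real ones there; the result depends on the kernel bandwidth, so it would need to be reported for more than one `gamma`.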

Circularity Check

0 steps flagged

No circularity: method uses independent synthetic data and external pretrained backbones

full rationale

The paper's core derivation trains low-rank adapters on procedurally generated synthetic anomalies (three-stage scheme) and applies them zero-shot to nine real evaluation datasets. No equations or steps reduce by construction to fitted parameters from the target distributions, no self-citation chains justify uniqueness or ansatzes, and no predictions are statistically forced by input fitting. The 94.1% AUROC claim is an external evaluation result, not a tautology. This is the normal self-contained case.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that pretrained vision foundation models already encode anomaly-relevant features and that synthetic anomalies generated in three stages are representative enough to transfer to real test distributions.

free parameters (2)
  • adapter rank
    Low-rank adapters require a chosen rank hyperparameter that is tuned on the synthetic data.
  • loss weighting coefficient
    The confidence-weighted pixel loss introduces at least one scalar that balances certain versus uncertain pixels.
axioms (1)
  • domain assumption: Pretrained vision foundation models contain transferable features useful for anomaly localization without task-specific fine-tuning.
    Invoked when the authors state that modern VFMs have lagged behind VLMs due to adaptation strategy rather than inherent feature quality.
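
The ledger's second free parameter concerns how certain and uncertain pixels are balanced in the confidence-weighted pixel loss. The exact form of that loss is not given in this summary, so the sketch below swaps in a simple stand-in: per-pixel binary cross-entropy averaged with a confidence weight in [0, 1]. The function name, the use of cross-entropy, and the normalisation by the weight sum are all assumptions.

```python
import math

def confidence_weighted_bce(pred, target, conf, eps=1e-7):
    """Confidence-weighted mean of per-pixel binary cross-entropy.

    pred:   predicted anomaly probability per pixel.
    target: 0/1 label per pixel (e.g. from a synthetic mask).
    conf:   weight in [0, 1]; low values down-weight uncertain labels.
    This is an assumed stand-in, not the paper's loss.
    """
    total, weight_sum = 0.0, 0.0
    for p, t, w in zip(pred, target, conf):
        p = min(max(p, eps), 1.0 - eps)  # clamp so the logs stay finite
        bce = -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
        total += w * bce
        weight_sum += w
    return total / max(weight_sum, eps)

pred   = [0.9, 0.2, 0.6]
target = [1, 0, 0]
# The third pixel's label is treated as unreliable and given zero weight.
weighted = confidence_weighted_bce(pred, target, conf=[1.0, 1.0, 0.0])
uniform  = confidence_weighted_bce(pred, target, conf=[1.0, 1.0, 1.0])
print(weighted, uniform)
```

In this toy case the uniform loss is dominated by the one disputed pixel, while the weighted loss ignores it, which is the behaviour one would want when mask boundaries from the synthetic pipeline are noisy.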

pith-pipeline@v0.9.0 · 5531 in / 1395 out tokens · 21916 ms · 2026-05-16T10:43:52.831084+00:00 · methodology

Reference graph

Works this paper leans on

91 extracted references · 91 canonical work pages · 5 internal anchors

  1. [1]

    Zero-shot versus many-shot: Unsupervised texture anomaly detection

    Toshimichi Aota, Lloyd Teh Tzer Tong, and Takayuki Okatani. Zero-shot versus many-shot: Unsupervised texture anomaly detection. InProceedings of the IEEE/CVF Win- ter Conference on Applications of Computer Vision, pages 5564–5572, 2023. 2, 5

  2. [2]

    Efficien- tAD: Accurate Visual Anomaly Detection at Millisecond- Level Latencies

    Kilian Batzner, Lars Heckler, and Rebecca K ¨onig. Efficien- tAD: Accurate Visual Anomaly Detection at Millisecond- Level Latencies. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 128– 138, 2024. 2, 8

  3. [3]

    Improving Unsupervised Defect Segmentation by Applying Structural Similarity to Autoencoders

    Paul Bergmann, Sindy L ¨owe, Michael Fauser, David Sattleg- ger, and Carsten Steger. Improving Unsupervised defect seg- mentation by applying structural similarity to autoencoders. ArXiv, abs/1807.02011, 2018. 2

  4. [4]

    MVTec AD–A Comprehensive Real-World Dataset for Unsupervised Anomaly Detection

    Paul Bergmann, Michael Fauser, David Sattlegger, and Carsten Steger. MVTec AD–A Comprehensive Real-World Dataset for Unsupervised Anomaly Detection. InProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9592–9600, 2019. 1, 2, 5, 7

  5. [5]

    Wm-dova maps for accurate polyp highlighting in colonoscopy: Validation vs

    Jorge Bernal, F Javier S ´anchez, Gloria Fern ´andez- Esparrach, Debora Gil, Cristina Rodr ´ıguez, and Fernando Vilari˜no. Wm-dova maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physi- cians.Computerized medical imaging and graphics, 43:99– 111, 2015. 5

  6. [6]

    Language mod- els are realistic tabular data generators.arXiv preprint arXiv:2210.06280, 2022

    Vadim Borisov, Kathrin Seßler, Tobias Leemann, Mar- tin Pawelczyk, and Gjergji Kasneci. Language mod- els are realistic tabular data generators.arXiv preprint arXiv:2210.06280, 2022. 2

  7. [7]

    Mixed supervision for surface-defect detection: From weakly to fully supervised learning.Computers in Industry, 129: 103459, 2021

    Jakob Bo ˇziˇc, Domen Tabernik, and Danijel Sko ˇcaj. Mixed supervision for surface-defect detection: From weakly to fully supervised learning.Computers in Industry, 129: 103459, 2021. 5

  8. [8]

    Segment Any Anomaly without Training via Hybrid Prompt Regularization.arXiv preprint arXiv:2305.10724, 2023

    Yunkang Cao, Xiaohao Xu, Chen Sun, Yuqi Cheng, Zongwei Du, Liang Gao, and Weiming Shen. Segment Any Anomaly without Training via Hybrid Prompt Regularization.arXiv preprint arXiv:2305.10724, 2023. 2, 3, 6

  9. [9]

    AdaCLIP: Adapting CLIP with hybrid learnable prompts for zero-shot anomaly detection

    Yunkang Cao, Jiangning Zhang, Luca Frittoli, Yuqi Cheng, Weiming Shen, and Giacomo Boracchi. AdaCLIP: Adapting CLIP with hybrid learnable prompts for zero-shot anomaly detection. InEuropean Conference on Computer Vision, pages 55–72. Springer, 2024. 1, 2, 3, 4, 5, 6, 7, 8

  10. [10]

    Back on track: Bundle adjustment for dynamic scene re- construction

    Weirong Chen, Ganlin Zhang, Felix Wimbauer, Rui Wang, Nikita Araslanov, Andrea Vedaldi, and Daniel Cremers. Back on track: Bundle adjustment for dynamic scene re- construction. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4951–4960,

  11. [11]

    Clip-AD: A Language-Guided Staged Dual-Path Model for Zero-shot Anomaly Detection

    Xuhai Chen, Jiangning Zhang, Guanzhong Tian, Haoyang He, Wuhao Zhang, Yabiao Wang, Chengjie Wang, and Yong Liu. Clip-AD: A Language-Guided Staged Dual-Path Model for Zero-shot Anomaly Detection. InInternational Joint Conference on Artificial Intelligence, pages 17–33. Springer,

  12. [12]

    Noel CF Codella, David Gutman, M Emre Celebi, Brian Helba, Michael A Marchetti, Stephen W Dusza, Aadi Kalloo, Konstantinos Liopyris, Nabin Mishra, Harald Kit- tler, et al. Skin lesion analysis toward melanoma detection: A challenge at the 2017 international symposium on biomed- ical imaging (isbi), hosted by the international skin imaging collaboration (i...

  13. [13]

    Padim: a patch distribution modeling framework for anomaly detection and localization

    Thomas Defard, Aleksandr Setkov, Angelique Loesch, and Romaric Audigier. Padim: a patch distribution modeling framework for anomaly detection and localization. InInter- national conference on pattern recognition, pages 475–489. Springer, 2021. 2

  14. [14]

    Outlier detec- tion by ensembling uncertainty with negative objectness

    Anja Deli ´c, Matej Grcic, and Sini ˇsa ˇSegvi´c. Outlier detec- tion by ensembling uncertainty with negative objectness. In 35th British Machine Vision Conference 2024, BMVC 2024, Glasgow, UK, November 25-28, 2024. BMV A, 2024. 1

  15. [15]

    Anomaly Detection via Re- verse Distillation from One-Class Embedding

    Hanqiu Deng and Xingyu Li. Anomaly Detection via Re- verse Distillation from One-Class Embedding. InProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9737–9746, 2022. 2, 5

  16. [16]

    Few- shot defect image generation via defect-aware feature ma- nipulation.Proceedings of the AAAI Conference on Artificial Intelligence, 37(1):571–578, 2023

    Yuxuan Duan, Yan Hong, Li Niu, and Liqing Zhang. Few- shot defect image generation via defect-aware feature ma- nipulation.Proceedings of the AAAI Conference on Artificial Intelligence, 37(1):571–578, 2023. 2

  17. [17]

    Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling Rectified Flow Transformers for High-Resolution Image Synthesis. In Forty-first International Conference on Machine Learning,

  18. [18]

    TransFusion–a Transparency-based Diffusion Model for Anomaly Detection

    Matic Fu ˇcka, Vitjan Zavrtanik, and Danijel Sko ˇcaj. TransFusion–a Transparency-based Diffusion Model for Anomaly Detection. InEuropean conference on computer vision, pages 91–108. Springer, 2025. 1, 2

  19. [19]

    SALAD – Semantics-Aware Logical Anomaly Detection

    Matic Fu ˇcka, Vitjan Zavrtanik, and Danijel Skoˇcaj. SALAD – Semantics-Aware Logical Anomaly Detection. InPro- ceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025. 2

  20. [20]

    Multi- task learning for thyroid nodule segmentation with thyroid region prior

    Haifan Gong, Guanqi Chen, Ranran Wang, Xiang Xie, Mingzhi Mao, Yizhou Yu, Fei Chen, and Guanbin Li. Multi- task learning for thyroid nodule segmentation with thyroid region prior. In2021 IEEE 18th international symposium on biomedical imaging (ISBI), pages 257–261. IEEE, 2021. 5

  21. [21]

    A. Hamada. Br35h: Brain tumor detection.https: //www.kaggle.com/datasets/ahmedhamada0/ braintumor - detection, 2020. Online; accessed

  22. [22]

    The 9 endotect 2020 challenge: evaluation and comparison of clas- sification, segmentation and inference time for endoscopy

    Steven A Hicks, Debesh Jha, Vajira Thambawita, P ˚al Halvorsen, Hugo L Hammer, and Michael A Riegler. The 9 endotect 2020 challenge: evaluation and comparison of clas- sification, segmentation and inference time for endoscopy. InInternational Conference on Pattern Recognition, pages 263–274. Springer, 2021. 5

  23. [23]

    LoRA: Low-rank adaptation of large language models

    Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. InIn- ternational Conference on Learning Representations, 2022. 4, 8

  24. [24]

    Anomalyd- iffusion: Few-shot anomaly image generation with diffusion model.Proceedings of the AAAI Conference on Artificial Intelligence, 38(8):8526–8534, 2024

    Teng Hu, Jiangning Zhang, Ran Yi, Yuzhen Du, Xu Chen, Liang Liu, Yabiao Wang, and Chengjie Wang. Anomalyd- iffusion: Few-shot anomaly image generation with diffusion model.Proceedings of the AAAI Conference on Artificial Intelligence, 38(8):8526–8534, 2024. 2

  25. [25]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perel- man, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Weli- hinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024. 3

  26. [26]

    WinCLIP: Zero- /Few-Shot Anomaly Classification and Segmentation

    Jongheon Jeong, Yang Zou, Taewan Kim, Dongqing Zhang, Avinash Ravichandran, and Onkar Dabeer. WinCLIP: Zero- /Few-Shot Anomaly Classification and Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 19606–19616, 2023. 3, 6, 7

  27. [27]

    Deep learning-based defect detection of metal parts: evaluating current methods in complex condi- tions

    Stepan Jezek, Martin Jonak, Radim Burget, Pavel Dvorak, and Milos Skotak. Deep learning-based defect detection of metal parts: evaluating current methods in complex condi- tions. In2021 13th International Congress on Ultra Modern Telecommunications and Control Systems and Workshops (ICUMT), pages 66–71, 2021. 5

  28. [28]

    Kvasir-seg: A segmented polyp dataset

    Debesh Jha, Pia H Smedsrud, Michael A Riegler, P ˚al Halvorsen, Thomas De Lange, Dag Johansen, and H˚avard D Johansen. Kvasir-seg: A segmented polyp dataset. InIn- ternational conference on multimedia modeling, pages 451–

  29. [29]

    Springer, 2019. 1, 5

  30. [30]

    Vi- sual prompt tuning

    Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Vi- sual prompt tuning. InEuropean Conference on Computer Vision, pages 709–727. Springer, 2022. 8

  31. [31]

    Brain tumor detec- tion using mri images.Brain, 3(2):146–150, 2015

    Pranita Balaji Kanade and PP Gumaste. Brain tumor detec- tion using mri images.Brain, 3(2):146–150, 2015. 5

  32. [32]

    Diffusion Models for Open-Vocabulary Segmen- tation

    Laurynas Karazija, Iro Laina, Andrea Vedaldi, and Christian Rupprecht. Diffusion Models for Open-Vocabulary Segmen- tation. InEuropean Conference on Computer Vision, pages 299–317. Springer, 2024. 2

  33. [33]

    Repurpos- ing diffusion-based image generators for monocular depth estimation

    Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Met- zger, Rodrigo Caye Daudt, and Konrad Schindler. Repurpos- ing diffusion-based image generators for monocular depth estimation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9492–9502,

  34. [34]

    Multi-task learning using uncertainty to weigh losses for scene geome- try and semantics

    Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geome- try and semantics. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 7482–7491,

  35. [35]

    Segment any- thing

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C Berg, Wan-Yen Lo, et al. Segment any- thing. InProceedings of the IEEE/CVF International Con- ference on Computer Vision, pages 4015–4026, 2023. 3

  36. [36]

    Dataset Enhancement with Instance-Level Augmentations

    Orest Kupyn and Christian Rupprecht. Dataset Enhancement with Instance-Level Augmentations. InEuropean Confer- ence on Computer Vision, pages 384–402. Springer, 2024. 2

  37. [37]

    Flux.https://github.com/ black-forest-labs/flux, 2024

    Black Forest Labs. Flux.https://github.com/ black-forest-labs/flux, 2024. 2, 3, 5, 7, 8, 1

  38. [38]

    Zero-Shot Anomaly Detection via Batch Normalization.Advances in Neural Information Processing Systems, 36, 2024

    Aodong Li, Chen Qiu, Marius Kloft, Padhraic Smyth, Maja Rudolph, and Stephan Mandt. Zero-Shot Anomaly Detection via Batch Normalization.Advances in Neural Information Processing Systems, 36, 2024. 3

  39. [39]

    PromptAD: Learning Prompts with only Normal Samples for Few-Shot Anomaly Detection

    Xiaofan Li, Zhizhong Zhang, Xin Tan, Chengwei Chen, Yanyun Qu, Yuan Xie, and Lizhuang Ma. PromptAD: Learning Prompts with only Normal Samples for Few-Shot Anomaly Detection. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 16838–16848, 2024. 7

  40. [40]

    PromptAD: Learning Prompts with only Normal Samples for Few-Shot Anomaly Detection

    Xiaofan Li, Zhizhong Zhang, Xin Tan, Chengwei Chen, Yanyun Qu, Yuan Xie, and Lizhuang Ma. PromptAD: Learning Prompts with only Normal Samples for Few-Shot Anomaly Detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16838–16848, 2024. 1

  41. [41]

    Scaling & Shifting Your Features: A New Baseline for Efficient Model Tuning.Advances in Neural Information Processing Systems, 35:109–123, 2022

    Dongze Lian, Daquan Zhou, Jiashi Feng, and Xinchao Wang. Scaling & Shifting Your Features: A New Baseline for Efficient Model Tuning.Advances in Neural Information Processing Systems, 35:109–123, 2022. 8

  42. [42]

    Focal loss for dense object detection

    Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Doll´ar. Focal loss for dense object detection. InPro- ceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017. 4

  43. [43]

    Can OOD Object Detectors Learn from Founda- tion Models? InEuropean Conference on Computer Vision, pages 213–231

    Jiahui Liu, Xin Wen, Shizhen Zhao, Yingxian Chen, and Xi- aojuan Qi. Can OOD Object Detectors Learn from Founda- tion Models? InEuropean Conference on Computer Vision, pages 213–231. Springer, 2024. 2

  44. [44]

    Grounding DINO: Marrying dino with grounded pre-training for open-set object detection

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding DINO: Marrying dino with grounded pre-training for open-set object detection. In European Conference on Computer Vision, pages 38–55. Springer, 2024. 3

  45. [45]

    SimpleNet: A Simple Network for Image Anomaly Detec- tion and Localization

    Zhikang Liu, Yiming Zhou, Yuansheng Xu, and Zilei Wang. SimpleNet: A Simple Network for Image Anomaly Detec- tion and Localization. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 20402–20411, 2023. 2

  46. [46]

    RePaint: Inpainting using denoising diffusion probabilistic models

    Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. RePaint: Inpainting using denoising diffusion probabilistic models. InProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11461–11471, 2022. 3

  47. [47]

    Exploring intrinsic normal prototypes within a single im- age for universal anomaly detection

    Wei Luo, Yunkang Cao, Haiming Yao, Xiaotian Zhang, Jianan Lou, Yuqi Cheng, Weiming Shen, and Wenyong Yu. Exploring intrinsic normal prototypes within a single im- age for universal anomaly detection. InProceedings of the 10 Computer Vision and Pattern Recognition Conference, pages 9974–9983, 2025. 7

  48. [48]

    Aa-clip: Enhancing zero-shot anomaly detection via anomaly-aware clip

    Wenxin Ma, Xu Zhang, Qingsong Yao, Fenghe Tang, Chenxu Wu, Yingtai Li, Rui Yan, Zihang Jiang, and S Kevin Zhou. Aa-clip: Enhancing zero-shot anomaly detection via anomaly-aware clip. InProceedings of the Computer Vi- sion and Pattern Recognition Conference, pages 4744–4754,

  49. [49]

    VT-ADL: A vision trans- former network for image anomaly detection and localiza- tion

    Pankaj Mishra, Riccardo Verk, Daniele Fornasier, Claudio Piciarelli, and Gian Luca Foresti. VT-ADL: A vision trans- former network for image anomaly detection and localiza- tion. In30th IEEE/IES International Symposium on Indus- trial Electronics (ISIE), 2021. 5

  50. [50]

    Maxime Oquab, Timoth ´ee Darcet, Th´eo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel HAZIZA, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Je- gou, Julien Mairal, Patr...

  51. [51]

    Inpainting transformer for anomaly detection

    Jonathan Pirnay and Keng Chai. Inpainting transformer for anomaly detection. InInternational Conference on Image Analysis and Processing, pages 394–406. Springer, 2022. 2

  52. [52]

    Supporting high-level to low-level requirements coverage reviewing with large lan- guage models

    Anamaria-Roberta Preda, Christoph Mayr-Dorn, Atif Mashkoor, and Alexander Egyed. Supporting high-level to low-level requirements coverage reviewing with large lan- guage models. InProceedings of the 21st International Con- ference on Mining Software Repositories, pages 242–253,

  53. [53]

    Highly Accurate Dichotomous Im- age Segmentation

    Xuebin Qin, Hang Dai, Xiaobin Hu, Deng-Ping Fan, Ling Shao, and Luc Van Gool. Highly Accurate Dichotomous Im- age Segmentation. InEuropean Conference on Computer Vision, pages 38–56. Springer, 2022. 3

  54. [54]

    Bayesian Prompt Flow Learning for Zero-Shot Anomaly Detection

    Zhen Qu, Xian Tao, Xinyi Gong, ShiChen Qu, Qiyu Chen, Zhengtao Zhang, Xingang Wang, and Guiguang Ding. Bayesian Prompt Flow Learning for Zero-Shot Anomaly Detection. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pages 30398–30408, 2025. 1, 5, 6, 7, 8

  55. [55]

    Learning Transferable Visual Models From Natural Language Supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning Transferable Visual Models From Natural Language Supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021. 1, 3

  56. [56]

    AM-RADIO: Agglomerative vision foundation model reduce all domains into one

    Mike Ranzinger, Greg Heinrich, Jan Kautz, and Pavlo Molchanov. AM-RADIO: Agglomerative vision foundation model reduce all domains into one. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12490–12500, 2024. 5

  57. [57]

    SuperSimpleNet: Unifying Unsupervised and Supervised Learning for Fast and Reliable Surface Defect Detection

    Blaž Rolih, Matic Fučka, and Danijel Skočaj. SuperSimpleNet: Unifying Unsupervised and Supervised Learning for Fast and Reliable Surface Defect Detection. In International Conference on Pattern Recognition, 2024. 2

  58. [58]

    No Label Left Behind: A Unified Surface Defect Detection model for all Supervision Regimes. Journal of Intelligent Manufacturing,

    Blaž Rolih, Matic Fučka, and Danijel Skočaj. No Label Left Behind: A Unified Surface Defect Detection model for all Supervision Regimes. Journal of Intelligent Manufacturing,

  59. [59]

    High-Resolution Image Synthesis with Latent Diffusion Models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 2

  60. [60]

    Towards Total Recall in Industrial Anomaly Detection

    Karsten Roth, Latha Pemula, Joaquin Zepeda, Bernhard Schölkopf, Thomas Brox, and Peter Gehler. Towards Total Recall in Industrial Anomaly Detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14318–14328, 2022. 1, 2, 7

  61. [61]

    Multiresolution knowledge distillation for anomaly detection

    Mohammadreza Salehi, Niousha Sadjadi, Soroosh Baselizadeh, Mohammad H Rohban, and Hamid R Rabiee. Multiresolution knowledge distillation for anomaly detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14902–14912, 2021. 1, 5

  62. [62]

    DINOv3

    Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3. arXiv preprint arXiv:2508.10104, 2025. 5

  63. [63]

    Segmentation-Based Deep-Learning Approach for Surface-Defect Detection. Journal of Intelligent Manufacturing, 2019

    Domen Tabernik, Samo Šela, Jure Skvarč, and Danijel Skočaj. Segmentation-Based Deep-Learning Approach for Surface-Defect Detection. Journal of Intelligent Manufacturing, 2019. 5

  64. [64]

    Automated polyp detection in colonoscopy videos using shape and context information. IEEE transactions on medical imaging, 35(2):630–644, 2015

    Nima Tajbakhsh, Suryakanth R Gurudu, and Jianming Liang. Automated polyp detection in colonoscopy videos using shape and context information. IEEE transactions on medical imaging, 35(2):630–644, 2015. 5

  65. [65]

    Kernel-aware graph prompt learning for few-shot anomaly detection

    Fenfang Tao, Guo-Sen Xie, Fang Zhao, and Xiangbo Shu. Kernel-aware graph prompt learning for few-shot anomaly detection. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 7347–7355, 2025. 1

  66. [66]

    Attention is all you need. Advances in neural information processing systems, 30, 2017

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017. 4

  67. [67]

    Image-consistent detection of road anomalies as unpredictable patches

    Tomáš Vojíř and Jiří Matas. Image-consistent detection of road anomalies as unpredictable patches. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 5491–5500, 2023. 1

  68. [68]

    Pixood: Pixel-level out-of-distribution detection

    Tomáš Vojíř, Jan Šochman, and Jiří Matas. Pixood: Pixel-level out-of-distribution detection. In European Conference on Computer Vision, pages 93–109. Springer, 2024. 1

  69. [69]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025. 7, 8, 1

  70. [70]

    Real-IAD: A real-world multi-view dataset for benchmarking versatile industrial anomaly detection

    Chengjie Wang, Wenbing Zhu, Bin-Bin Gao, Zhenye Gan, Jiangning Zhang, Zhihao Gu, Shuguang Qian, Mingang Chen, and Lizhuang Ma. Real-IAD: A real-world multi-view dataset for benchmarking versatile industrial anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22883–22892,

  71. [71]

    DUST3R: Geometric 3d vision made easy

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUST3R: Geometric 3d vision made easy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20697–20709, 2024. 4

  72. [72]

    LLM-powered data augmentation for enhanced cross-lingual performance

    Chenxi Whitehouse, Monojit Choudhury, and Alham Fikri Aji. LLM-powered data augmentation for enhanced cross-lingual performance. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023. 2

  73. [73]

    Weakly supervised learning for industrial optical inspection

    Matthias Wieler and Tobias Hahn. Weakly supervised learning for industrial optical inspection. In DAGM symposium in, page 11, 2007. 5

  74. [74]

    Qwen-Image Technical Report

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report. arXiv preprint arXiv:2508.02324, 2025. 7, 8, 1

  75. [75]

    Group normalization

    Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European conference on computer vision (ECCV), pages 3–19, 2018. 4

  76. [76]

    Memseg: A semi-supervised method for image surface defect detection using differences and commonalities. Engineering Applications of Artificial Intelligence, 119:105835, 2023

    Minghui Yang, Peng Wu, and Hui Feng. Memseg: A semi-supervised method for image surface defect detection using differences and commonalities. Engineering Applications of Artificial Intelligence, 119:105835, 2023. 4

  77. [77]

    Defect spectrum: A granular look of large-scale defect datasets with rich semantics

    Shuai Yang, Zhifei Chen, Pengguang Chen, Xi Fang, Yixun Liang, Shu Liu, and Yingcong Chen. Defect spectrum: A granular look of large-scale defect datasets with rich semantics. In Computer Vision – ECCV 2024, pages 187–203, Cham, 2024. Springer Nature Switzerland. 2

  78. [78]

    GPT3Mix: Leveraging large-scale language models for text augmentation

    Kang Min Yoo, Dongju Park, Jaewook Kang, Sang-Woo Lee, and Woomyoung Park. GPT3Mix: Leveraging large-scale language models for text augmentation. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2225–2239, Punta Cana, Dominican Republic,

  79. [79]

    Association for Computational Linguistics. 2

  80. [80]

    DRÆM - a Discriminatively Trained Reconstruction Embedding for Surface Anomaly Detection

    Vitjan Zavrtanik, Matej Kristan, and Danijel Skočaj. DRÆM - a Discriminatively Trained Reconstruction Embedding for Surface Anomaly Detection. In Proceedings of the IEEE/CVF international conference on computer vision, pages 8330–8339, 2021. 2
