Recognition: 2 theorem links · Lean theorem
Anomaly-Aware Vision-Language Adapters for Zero-Shot Anomaly Detection
Pith reviewed 2026-05-13 06:24 UTC · model grok-4.3
The pith
Dual specialized branches for normal and anomalous patterns, combined via text-guided routing, enable zero-shot anomaly detection by exploiting distribution asymmetry.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AVA-DINO adapts frozen DINOv3 visual features using an anomaly-aware vision-language framework with two branches: one specialized for normal patterns and one for anomalous patterns. The branches are learned jointly on auxiliary data through a text-guided routing mechanism and explicit regularization that encourages specialization. At test time the input image and predefined language descriptions produce a dynamic, asymmetric combination of the branches that applies context-specific transformations respecting the compact-versus-diverse nature of normal and anomalous data.
What carries the argument
Anomaly-aware vision-language adapters (AVA) consisting of dual normal and anomalous branches with text-guided routing that dynamically selects and combines their outputs.
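To make the routing concrete, here is a minimal sketch of a dual-branch adapter combined by image-conditioned, text-guided routing. It assumes a frozen DINOv3-style backbone that yields patch features and an image-level feature, plus projected text embeddings for the normal and anomalous descriptions; the linear adapter layers, temperature value, and tensor shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

class DualBranchAdapter(torch.nn.Module):
    """Hypothetical dual-branch adapter: a 'normal' branch and an 'anomalous'
    branch over frozen backbone features, combined by text-guided routing."""

    def __init__(self, dim: int, tau: float = 0.07):
        super().__init__()
        self.normal_branch = torch.nn.Linear(dim, dim)     # placeholder adapter layer
        self.anomalous_branch = torch.nn.Linear(dim, dim)  # placeholder adapter layer
        self.tau = tau  # routing temperature (assumed hyperparameter)

    def forward(self, patch_feats, f_cls, t_normal, t_anomalous):
        # patch_feats: frozen DINOv3 patch features, shape (B, N, D)
        # f_cls:       image-level feature, shape (B, D)
        # t_normal, t_anomalous: projected text embeddings, shape (D,)
        sim_n = F.cosine_similarity(f_cls, t_normal.unsqueeze(0), dim=-1)     # (B,)
        sim_a = F.cosine_similarity(f_cls, t_anomalous.unsqueeze(0), dim=-1)  # (B,)
        # [w_n, w_a] = softmax([cos(f_cls, t_n), cos(f_cls, t_a)] / tau)
        w = torch.softmax(torch.stack([sim_n, sim_a], dim=-1) / self.tau, dim=-1)  # (B, 2)
        adapted = (w[:, 0, None, None] * self.normal_branch(patch_feats)
                   + w[:, 1, None, None] * self.anomalous_branch(patch_feats))
        # return routing weights too, for regularization and diagnostics
        return adapted, w
```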
If this is right
- Reaches 93.5 percent image-AUROC on the MVTec-AD benchmark.
- Generalizes to medical imaging domains without any domain-specific fine-tuning.
- Produces context-specific feature transformations instead of uniform ones.
- Avoids degenerate uniform routing through explicit regularization during training.
Where Pith is reading between the lines
- The same dual-branch routing idea could be tested on other asymmetric tasks such as out-of-distribution detection where normal data is compact and outliers vary widely.
- If predefined language descriptions prove insufficient for novel anomaly types, performance would drop on categories whose defects are poorly described by the fixed text prompts.
- Jointly training specialized adapters under routing constraints may improve zero-shot results in any vision task where data naturally splits into a tight normal mode and a scattered anomalous mode.
Load-bearing premise
Joint training with text-guided routing and explicit regularization will produce genuine branch specialization rather than uniform routing, and fixed predefined language descriptions will suffice to select the correct combination for completely unseen categories.
What would settle it
On a new industrial benchmark, removing the routing regularization causes the two branches to produce identical outputs and performance falls to the level of a single uniform adapter.
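One minimal way to run that check, sketched under the assumption that both branch outputs are available as batched tensors: measure how far apart the two branches' outputs are on a held-out benchmark, with and without the regularization term. The function name and shapes are illustrative, not taken from the paper.

```python
import torch

def branch_divergence(out_normal, out_anomalous):
    """Mean cosine distance between the two branch outputs over a batch
    of shape (B, N, D). If removing the routing regularization drives this
    toward zero, the branches have collapsed into one uniform adapter."""
    cos = torch.nn.functional.cosine_similarity(
        out_normal.flatten(1), out_anomalous.flatten(1), dim=-1)
    return float((1.0 - cos).mean())
```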
Original abstract
Zero-shot anomaly detection aims to identify defects in unseen categories without target-specific training. Existing methods usually apply the same feature transformation to all samples, treating normal and anomalous data uniformly despite their fundamentally asymmetric distributions, compact normals versus diverse anomalies. We instead exploit this natural asymmetry by proposing AVA-DINO, an anomaly-aware vision-language adaptation framework with dual specialized branches for normal and anomalous patterns that adapt frozen DINOv3 visual features. During training on auxiliary data, the two branches are learned jointly with a text-guided routing mechanism and explicit routing regularization that encourages branch specialization. At test time, only the input image and fixed, predefined language descriptions are used to dynamically combine the two branches, enabling an asymmetric activation. This design prevents degenerate uniform routing and allows context-specific feature transformations. Experiments across nine industrial and medical benchmarks demonstrate state-of-the-art performance, achieving 93.5% image-AUROC on MVTec-AD and strong cross-domain generalization to medical imaging without domain-specific fine-tuning. https://github.com/aqeeelmirza/AVA-DINO
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes AVA-DINO, a zero-shot anomaly detection framework that adapts frozen DINOv3 visual features using dual specialized branches for normal and anomalous patterns. Training on auxiliary data employs a text-guided routing mechanism with explicit regularization to encourage branch specialization; at inference, fixed predefined language descriptions dynamically combine the branches for asymmetric feature transformation. Experiments across nine industrial and medical benchmarks report state-of-the-art results, including 93.5% image-AUROC on MVTec-AD and cross-domain generalization to medical imaging without target-specific fine-tuning.
Significance. If the dual-branch asymmetry and routing specialization hold for unseen categories, the approach would meaningfully advance zero-shot anomaly detection by exploiting the compact-vs-diverse distributional asymmetry rather than applying symmetric adaptation. The use of frozen DINOv3 plus language-guided routing without domain-specific fine-tuning could enable more robust cross-domain transfer, provided the specialization is empirically verified rather than assumed.
major comments (3)
- [Methods and Experiments] The central claim that text-guided routing plus regularization produces genuine branch specialization (rather than uniform or misaligned activation) for unseen categories lacks direct verification. No routing histograms, per-category activation statistics, or ablation removing the anomalous branch are reported to confirm non-degenerate behavior on MVTec-AD or medical test sets.
- [Experiments] The reported 93.5% image-AUROC on MVTec-AD and cross-domain medical gains are presented without isolating the contribution of the dual-branch design versus a single-branch baseline using the same frozen DINOv3 backbone and language descriptions. This makes it impossible to attribute gains specifically to the asymmetry argument.
- [Methods] The fixed predefined language descriptions are asserted to suffice for dynamic branch selection at test time on unseen categories, yet no analysis shows how these descriptions were chosen or whether they generalize beyond the auxiliary training distribution.
minor comments (2)
- [Methods] Notation for the routing weights and regularization term should be defined explicitly with equations rather than prose descriptions.
- [Experiments] The abstract and results tables would benefit from reporting standard deviations over multiple runs or seeds to contextualize the 93.5% figure.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of verifying branch specialization and isolating design contributions, which we address below by committing to targeted revisions that strengthen the empirical support for our claims without altering the core methodology.
Point-by-point responses
Referee: [Methods and Experiments] The central claim that text-guided routing plus regularization produces genuine branch specialization (rather than uniform or misaligned activation) for unseen categories lacks direct verification. No routing histograms, per-category activation statistics, or ablation removing the anomalous branch are reported to confirm non-degenerate behavior on MVTec-AD or medical test sets.
Authors: We agree that explicit verification of non-degenerate routing is necessary to substantiate the specialization claim. In the revised version, we will add routing histograms and per-category activation statistics computed on the MVTec-AD and medical test sets. We will also include an ablation that removes the anomalous branch entirely, demonstrating its necessity for the reported performance on unseen categories. These additions will directly confirm that the text-guided routing and regularization prevent uniform activation. revision: yes
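A sketch of the kind of diagnostic committed to here, assuming per-image routing weights (pairs summing to one) and category labels are available; the helper is illustrative and not from the paper.

```python
import numpy as np
from collections import defaultdict

def routing_statistics(routing_weights, categories, n_bins=10):
    """Per-category mean/std of the anomalous-branch weight plus a global
    histogram; weights clustered near 0.5 for every category would indicate
    degenerate (uniform) routing."""
    w_anom = np.asarray([w[1] for w in routing_weights], dtype=float)
    per_category = defaultdict(list)
    for w, cat in zip(w_anom, categories):
        per_category[cat].append(w)
    stats = {cat: (float(np.mean(v)), float(np.std(v)))
             for cat, v in per_category.items()}
    hist, edges = np.histogram(w_anom, bins=n_bins, range=(0.0, 1.0))
    return stats, hist, edges
```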
Referee: [Experiments] The reported 93.5% image-AUROC on MVTec-AD and cross-domain medical gains are presented without isolating the contribution of the dual-branch design versus a single-branch baseline using the same frozen DINOv3 backbone and language descriptions. This makes it impossible to attribute gains specifically to the asymmetry argument.
Authors: We acknowledge the need to isolate the dual-branch contribution. The revised manuscript will include a single-branch baseline that uses identical frozen DINOv3 features and the same language descriptions, allowing direct comparison of image-AUROC on MVTec-AD and the medical benchmarks. This will quantify the specific benefit of the asymmetric dual-branch design over symmetric adaptation. revision: yes
Referee: [Methods] The fixed predefined language descriptions are asserted to suffice for dynamic branch selection at test time on unseen categories, yet no analysis shows how these descriptions were chosen or whether they generalize beyond the auxiliary training distribution.
Authors: The language descriptions were selected as a compact set of general anomaly descriptors (e.g., 'defect', 'anomaly', 'irregularity') derived from patterns observed in the auxiliary training data to enable broad coverage without category-specific tuning. In revision, we will add a dedicated paragraph detailing this selection process and include a sensitivity analysis testing descriptor variations on held-out unseen categories to demonstrate generalization. revision: yes
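A sketch of how a fixed descriptor set could be turned into the text prototypes used for routing, assuming a frozen CLIP-style text encoder exposed as a callable. The prompt templates, the normal-side descriptors, and the averaging step are assumptions rather than the authors' exact recipe; only the anomaly terms echo the examples given above.

```python
import torch
import torch.nn.functional as F

# Illustrative descriptor sets: the anomaly terms echo the authors' examples
# ('defect', 'anomaly', 'irregularity'); the normal set and the prompt
# template are assumptions.
NORMAL_PROMPTS = ["a photo of a flawless object", "a photo of an intact surface"]
ANOMALY_PROMPTS = ["a photo of a defect", "a photo of an anomaly",
                   "a photo of an irregularity"]

def build_text_prototype(text_encoder, prompts):
    """Encode a fixed prompt set with a frozen text encoder (placeholder
    callable returning a (D,) embedding) and average into one prototype."""
    with torch.no_grad():
        embs = torch.stack([F.normalize(text_encoder(p), dim=-1) for p in prompts])
    return F.normalize(embs.mean(dim=0), dim=-1)  # (D,) routing prototype
```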
Circularity Check
No circularity: empirical claims rest on external benchmarks and frozen pre-trained features
full rationale
The paper introduces AVA-DINO as a vision-language adaptation method that uses frozen DINOv3 features, trains dual normal/anomalous branches jointly on auxiliary data with text-guided routing and explicit regularization, and evaluates zero-shot performance on standard held-out benchmarks (MVTec-AD at 93.5% image-AUROC plus eight others, including medical cross-domain). No equations, derivations, or self-citations are present that reduce the performance claims to quantities defined by the same inputs or fitted parameters by construction. The central design choices (asymmetric branches, routing mechanism) are motivated by asymmetry in normal vs. anomalous distributions and validated empirically against external data, rendering the derivation chain self-contained.
Axiom & Free-Parameter Ledger
free parameters (1)
- routing regularization coefficient (one assumed functional form is sketched after this ledger)
axioms (1)
- domain assumption: Frozen DINOv3 visual features provide a suitable base representation for zero-shot anomaly detection across domains.
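The excerpt does not state the functional form of the routing regularization, so the sketch below assumes a common choice, an entropy penalty on the routing weights, with the coefficient corresponding to the free parameter listed above; treat it as one plausible instantiation, not the paper's definition.

```python
import torch

def routing_regularization(w, lam=0.1, eps=1e-8):
    """Assumed entropy-style penalty on routing weights w of shape (B, 2):
    penalizing per-sample routing entropy pushes weights away from the
    uniform [0.5, 0.5] point, encouraging branch specialization. `lam` is
    the regularization coefficient noted in the ledger."""
    entropy = -(w * (w + eps).log()).sum(dim=-1)  # per-sample routing entropy
    return lam * entropy.mean()
```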
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tagged: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "dual specialized branches for normal and anomalous patterns... text-guided routing mechanism and explicit routing regularization that encourages branch specialization... [w_n, w_a] = softmax([cos(f_cls, t_n^proj), cos(f_cls, t_a^proj)] / τ)"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tagged: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "Experiments across nine industrial and medical benchmarks demonstrate state-of-the-art performance, achieving 93.5% image-AUROC on MVTec-AD"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] INTRODUCTION: Industrial quality inspection demands detecting diverse defect types across varying object categories, yet collecting exhaustive anomaly samples for supervised training remains impractical due to the rarity and diversity of failure modes. Zero-shot anomaly detection (ZSAD) addresses this challenge by leveraging vision-language models pre...
- [2] RELATED WORK: Recent advances in vision-language models have enabled zero-shot anomaly detection without target-specific training data. WinCLIP [1] applies CLIP features with learnable text prompts in a sliding window manner for localized anomaly detection. AnomalyCLIP [2] introduces object-aware prompt learning to enhance semantic alignment between visual...
- [3] PROPOSED APPROACH, 3.1 Overview: Zero-shot anomaly detection aims to generate a pixel-wise anomaly map M ∈ [0,1]^{H×W} for a test image x ∈ R^{H×W×3} without exposure to target-specific samples during training. Following established protocols, we train on an auxiliary dataset D_a containing normal and anomalous samples with ground-truth masks, then evaluate on a disj...
- [4] EXPERIMENTS: Datasets. We evaluate on nine benchmarks spanning industrial and medical domains. Industrial datasets include MVTec-AD [10] (15 categories), ViSA [11] (12 categories), BTAD [12] (3 categories), KSDD2 [13] (surface defects), MPDD [14] (6 categories), and MVTec-AD2 [15] (8 categories). In line with [2, 5], we further evaluate on medical dat...
- [5] Ground truth boundaries shown in green: and medical (columns 5-6) samples. Ground truth boundaries shown in green. F1, improving over the second-best by 7.8 and 15.1 points respectively. Kvasir results (90.6% P-AUC, 66.5% Pixel-F1) further confirm that our anomaly-aware adapters transfer to polyp segmentation without domain-specific fine-tuning. Figure 3 provides qualitative comparisons acros...
- [6] CONCLUSION: We present AVA-DINO, a dual-branch adaptation framework for zero-shot anomaly detection that learns context-specific feature transformations. Unlike uniform adaptation, AVA-DINO uses separate normal and anomaly pathways, combined through text-guided routing with explicit regularization to enforce specialization. Experiments on industrial...
- [7] Jongheon Jeong, Yang Zou, Taewan Kim, Dongqing Zhang, Avinash Ravichandran, and Onkar Dabeer, "WinCLIP: Zero-/few-shot anomaly classification and segmentation," in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 19606–19616.
- [8] Qihang Zhou, Guansong Pang, Yu Tian, Shibo He, and Jiming Chen, "AnomalyCLIP: Object-agnostic prompt learning for zero-shot anomaly detection," in International Conference on Learning Representations (ICLR), 2024, pp. 49705–49737.
- [9] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al., "Learning transferable visual models from natural language supervision," in International Conference on Machine Learning (ICML). PMLR, 2021, pp. 8748–8763.
- [10] Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al., "DINOv3," arXiv preprint arXiv:2508.10104, 2025.
- [11] Zhen Qu, Xian Tao, Xinyi Gong, Shichen Qu, Qiyu Chen, Zhengtao Zhang, Xingang Wang, and Guiguang Ding, "Bayesian prompt flow learning for zero-shot anomaly detection," in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025, pp. 30398–30408.
- [12] Yunkang Cao, Jiangning Zhang, Luca Frittoli, Yuqi Cheng, Weiming Shen, and Giacomo Boracchi, "AdaCLIP: Adapting CLIP with hybrid learnable prompts for zero-shot anomaly detection," in European Conference on Computer Vision (ECCV). Springer, 2024, pp. 55–72.
- [13] Muhammad Aqeel, Danijel Skočaj, Marco Cristani, and Francesco Setti, "A contrastive learning-guided confident meta-learning for zero shot anomaly detection," in IEEE/CVF International Conference on Computer Vision (ICCV), 2025, pp. 1452–1461.
- [14] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly, "Parameter-efficient transfer learning for NLP," in International Conference on Machine Learning (ICML). PMLR, 2019, pp. 2790–2799.
- [15] Shoufa Chen, Chongjian Ge, Zhan Tong, Jiangliu Wang, Yibing Song, Jue Wang, and Ping Luo, "AdaptFormer: Adapting vision transformers for scalable visual recognition," Advances in Neural Information Processing Systems (NeurIPS), vol. 35, pp. 16664–16678, 2022.
- [16] Paul Bergmann, Michael Fauser, David Sattlegger, and Carsten Steger, "MVTec AD – A comprehensive real-world dataset for unsupervised anomaly detection," in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 9592–9600.
- [17] Yang Zou, Jongheon Jeong, Latha Pemula, Dongqing Zhang, and Onkar Dabeer, "SPot-the-Difference self-supervised pre-training for anomaly detection and segmentation," in European Conference on Computer Vision (ECCV). Springer, 2022, pp. 392–408.
- [18] Pankaj Mishra, Riccardo Verk, Daniele Fornasier, Claudio Piciarelli, and Gian Luca Foresti, "VT-ADL: A vision transformer network for image anomaly detection and localization," in IEEE 30th International Symposium on Industrial Electronics (ISIE). IEEE, 2021, pp. 01–06.
- [19] Jakob Božič, Domen Tabernik, and Danijel Skočaj, "Mixed supervision for surface-defect detection: from weakly to fully supervised learning," Computers in Industry, 2021.
- [20] Stepan Jezek, Martin Jonak, Radim Burget, Pavel Dvorak, and Milos Skotak, "Deep learning-based defect detection of metal parts: evaluating current methods in complex conditions," in 13th International Congress on Ultra Modern Telecommunications and Control Systems and Workshops (ICUMT). IEEE, 2021, pp. 66–71.
- [21] Lars Heckler-Kram, Jan-Hendrik Neudeck, Ulla Scheler, Rebecca König, and Carsten Steger, "The MVTec AD 2 dataset: Advanced scenarios for unsupervised anomaly detection," arXiv preprint arXiv:2503.21622, 2025.
- [22] Debesh Jha, Pia H. Smedsrud, Michael A. Riegler, Pål Halvorsen, Thomas De Lange, Dag Johansen, and Håvard D. Johansen, "Kvasir-SEG: A segmented polyp dataset," in International Conference on Multimedia Modeling (MMM). Springer, 2019, pp. 451–462.
- [23] Nima Tajbakhsh, Suryakanth R. Gurudu, and Jianming Liang, "Automated polyp detection in colonoscopy videos using shape and context information," IEEE Transactions on Medical Imaging, vol. 35, no. 2, pp. 630–644, 2015.
- [24] Jorge Bernal, F. Javier Sánchez, Gloria Fernández-Esparrach, Debora Gil, Cristina Rodríguez, and Fernando Vilariño, "WM-DOVA maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians," Computerized Medical Imaging and Graphics, vol. 43, pp. 99–111, 2015.