arxiv: 2605.12069 · v1 · submitted 2026-05-12 · 💻 cs.CV · cs.AI· cs.LG

Recognition: 2 theorem links

· Lean Theorem

Anomaly-Aware Vision-Language Adapters for Zero-Shot Anomaly Detection

Muhammad Aqeel , Maham Nazir , Uzair Khan , Marco Cristani , Francesco Setti

Authors on Pith no claims yet

Pith reviewed 2026-05-13 06:24 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.LG

keywords zero-shot anomaly detectionvision-language adaptersdual branch routingDINOv3 featurestext-guided routingMVTec-ADcross-domain generalizationasymmetric distributions

0 comments

The pith

Dual specialized branches for normal and anomalous patterns, combined via text-guided routing, enable zero-shot anomaly detection by exploiting distribution asymmetry.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that standard uniform feature transformations fail for zero-shot anomaly detection because normal samples form compact clusters while anomalies are diverse. It instead introduces separate adapter branches that specialize during training on auxiliary data, then combines them at test time using only the input image and fixed language descriptions. A reader would care because this asymmetric treatment avoids treating all inputs identically and supports detection on entirely unseen categories in both industrial and medical settings. The design relies on joint training with routing regularization to keep the branches distinct rather than collapsing to one behavior. Results show this produces better performance than prior uniform approaches across multiple benchmarks.

Core claim

AVA-DINO adapts frozen DINOv3 visual features using an anomaly-aware vision-language framework with two branches: one specialized for normal patterns and one for anomalous patterns. The branches are learned jointly on auxiliary data through a text-guided routing mechanism and explicit regularization that encourages specialization. At test time the input image and predefined language descriptions produce a dynamic, asymmetric combination of the branches that applies context-specific transformations respecting the compact-versus-diverse nature of normal and anomalous data.

What carries the argument

Anomaly-aware vision-language adapters (AVA) consisting of dual normal and anomalous branches with text-guided routing that dynamically selects and combines their outputs.

If this is right

Reaches 93.5 percent image-AUROC on the MVTec-AD benchmark.
Generalizes to medical imaging domains without any domain-specific fine-tuning.
Produces context-specific feature transformations instead of uniform ones.
Avoids degenerate uniform routing through explicit regularization during training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same dual-branch routing idea could be tested on other asymmetric tasks such as out-of-distribution detection where normal data is compact and outliers vary widely.
If predefined language descriptions prove insufficient for novel anomaly types, performance would drop on categories whose defects are poorly described by the fixed text prompts.
Jointly training specialized adapters under routing constraints may improve zero-shot results in any vision task where data naturally splits into a tight normal mode and a scattered anomalous mode.

Load-bearing premise

Joint training with text-guided routing and explicit regularization will produce genuine branch specialization rather than uniform routing, and fixed predefined language descriptions will suffice to select the correct combination for completely unseen categories.

What would settle it

On a new industrial benchmark, removing the routing regularization causes the two branches to produce identical outputs and performance falls to the level of a single uniform adapter.

read the original abstract

Zero-shot anomaly detection aims to identify defects in unseen categories without target-specific training. Existing methods usually apply the same feature transformation to all samples, treating normal and anomalous data uniformly despite their fundamentally asymmetric distributions, compact normals versus diverse anomalies. We instead exploit this natural asymmetry by proposing AVA-DINO, an anomaly-aware vision-language adaptation framework with dual specialized branches for normal and anomalous patterns that adapt frozen DINOv3 visual features. During training on auxiliary data, the two branches are learned jointly with a text-guided routing mechanism and explicit routing regularization that encourages branch specialization. At test time, only the input image and fixed, predefined language descriptions are used to dynamically combine the two branches, enabling an asymmetric activation. This design prevents degenerate uniform routing and allows context-specific feature transformations. Experiments across nine industrial and medical benchmarks demonstrate state-of-the-art performance, achieving 93.5% image-AUROC on MVTec-AD and strong cross-domain generalization to medical imaging without domain-specific fine-tuning. https://github.com/aqeeelmirza/AVA-DINO

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Dual-branch adapters with text-guided routing give a clean way to handle normal-anomaly asymmetry on frozen DINOv3, but the evidence that routing actually specializes rather than collapses is still thin.

read the letter

The main thing to know is that this paper adds dual specialized branches to a vision-language adapter on top of frozen DINOv3 features, using text-guided routing plus explicit regularization during auxiliary training so the normal and anomalous paths learn different transformations. At test time it combines them with fixed language prompts. That setup directly targets the compact-normal versus diverse-anomaly mismatch that uniform adapters ignore, and the reported 93.5% image-AUROC on MVTec-AD plus cross-domain medical results without fine-tuning are the headline numbers.

Referee Report

3 major / 2 minor

Summary. The paper proposes AVA-DINO, a zero-shot anomaly detection framework that adapts frozen DINOv3 visual features using dual specialized branches for normal and anomalous patterns. Training on auxiliary data employs a text-guided routing mechanism with explicit regularization to encourage branch specialization; at inference, fixed predefined language descriptions dynamically combine the branches for asymmetric feature transformation. Experiments across nine industrial and medical benchmarks report state-of-the-art results, including 93.5% image-AUROC on MVTec-AD and cross-domain generalization to medical imaging without target-specific fine-tuning.

Significance. If the dual-branch asymmetry and routing specialization hold for unseen categories, the approach would meaningfully advance zero-shot anomaly detection by exploiting the compact-vs-diverse distributional asymmetry rather than applying symmetric adaptation. The use of frozen DINOv3 plus language-guided routing without domain-specific fine-tuning could enable more robust cross-domain transfer, provided the specialization is empirically verified rather than assumed.

major comments (3)

[Methods and Experiments] The central claim that text-guided routing plus regularization produces genuine branch specialization (rather than uniform or misaligned activation) for unseen categories lacks direct verification. No routing histograms, per-category activation statistics, or ablation removing the anomalous branch are reported to confirm non-degenerate behavior on MVTec-AD or medical test sets.
[Experiments] The reported 93.5% image-AUROC on MVTec-AD and cross-domain medical gains are presented without isolating the contribution of the dual-branch design versus a single-branch baseline using the same frozen DINOv3 backbone and language descriptions. This makes it impossible to attribute gains specifically to the asymmetry argument.
[Methods] The fixed predefined language descriptions are asserted to suffice for dynamic branch selection at test time on unseen categories, yet no analysis shows how these descriptions were chosen or whether they generalize beyond the auxiliary training distribution.

minor comments (2)

[Methods] Notation for the routing weights and regularization term should be defined explicitly with equations rather than prose descriptions.
[Experiments] The abstract and results tables would benefit from reporting standard deviations over multiple runs or seeds to contextualize the 93.5% figure.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of verifying branch specialization and isolating design contributions, which we address below by committing to targeted revisions that strengthen the empirical support for our claims without altering the core methodology.

read point-by-point responses

Referee: [Methods and Experiments] The central claim that text-guided routing plus regularization produces genuine branch specialization (rather than uniform or misaligned activation) for unseen categories lacks direct verification. No routing histograms, per-category activation statistics, or ablation removing the anomalous branch are reported to confirm non-degenerate behavior on MVTec-AD or medical test sets.

Authors: We agree that explicit verification of non-degenerate routing is necessary to substantiate the specialization claim. In the revised version, we will add routing histograms and per-category activation statistics computed on the MVTec-AD and medical test sets. We will also include an ablation that removes the anomalous branch entirely, demonstrating its necessity for the reported performance on unseen categories. These additions will directly confirm that the text-guided routing and regularization prevent uniform activation. revision: yes
Referee: [Experiments] The reported 93.5% image-AUROC on MVTec-AD and cross-domain medical gains are presented without isolating the contribution of the dual-branch design versus a single-branch baseline using the same frozen DINOv3 backbone and language descriptions. This makes it impossible to attribute gains specifically to the asymmetry argument.

Authors: We acknowledge the need to isolate the dual-branch contribution. The revised manuscript will include a single-branch baseline that uses identical frozen DINOv3 features and the same language descriptions, allowing direct comparison of image-AUROC on MVTec-AD and the medical benchmarks. This will quantify the specific benefit of the asymmetric dual-branch design over symmetric adaptation. revision: yes
Referee: [Methods] The fixed predefined language descriptions are asserted to suffice for dynamic branch selection at test time on unseen categories, yet no analysis shows how these descriptions were chosen or whether they generalize beyond the auxiliary training distribution.

Authors: The language descriptions were selected as a compact set of general anomaly descriptors (e.g., 'defect', 'anomaly', 'irregularity') derived from patterns observed in the auxiliary training data to enable broad coverage without category-specific tuning. In revision, we will add a dedicated paragraph detailing this selection process and include a sensitivity analysis testing descriptor variations on held-out unseen categories to demonstrate generalization. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical claims rest on external benchmarks and frozen pre-trained features

full rationale

The paper introduces AVA-DINO as a vision-language adaptation method that uses frozen DINOv3 features, trains dual normal/anomalous branches jointly on auxiliary data with text-guided routing and explicit regularization, and evaluates zero-shot performance on standard held-out benchmarks (MVTec-AD at 93.5% image-AUROC plus eight others, including medical cross-domain). No equations, derivations, or self-citations are present that reduce the performance claims to quantities defined by the same inputs or fitted parameters by construction. The central design choices (asymmetric branches, routing mechanism) are motivated by asymmetry in normal vs. anomalous distributions and validated empirically against external data, rendering the derivation chain self-contained.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that frozen DINOv3 features are sufficiently rich for adaptation and that the introduced routing regularization will enforce useful specialization; no new physical entities are postulated.

free parameters (1)

routing regularization coefficient
Controls the strength of the explicit term that encourages the two branches to specialize during joint training on auxiliary data.

axioms (1)

domain assumption Frozen DINOv3 visual features provide a suitable base representation for zero-shot anomaly detection across domains
The framework adapts these frozen features rather than training a new backbone.

pith-pipeline@v0.9.0 · 5499 in / 1260 out tokens · 36211 ms · 2026-05-13T06:24:49.786885+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

dual specialized branches for normal and anomalous patterns... text-guided routing mechanism and explicit routing regularization that encourages branch specialization... [wn, wa] = softmax([cos(fcls, tproj_n), cos(fcls, tproj_a)] / τ)
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Experiments across nine industrial and medical benchmarks demonstrate state-of-the-art performance, achieving 93.5% image-AUROC on MVTec-AD

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · 2 internal anchors

[1]

INTRODUCTION Industrial quality inspection demands detecting diverse defect types across varying object categories, yet collecting exhaus- tive anomaly samples for supervised training remains imprac- tical due to the rarity and diversity of failure modes. Zero- shot anomaly detection (ZSAD) addresses this challenge by leveraging vision-language models pre...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[2]

WinCLIP [1] applies CLIP features with learnable text prompts in a sliding window manner for localized anomaly detection

RELATED WORK Recent advances in vision-language models have enabled zero-shot anomaly detection without target-specific training data. WinCLIP [1] applies CLIP features with learnable text prompts in a sliding window manner for localized anomaly detection. AnomalyCLIP [2] introduces object-aware prompt learning to enhance semantic alignment between visual...

work page
[3]

a photo of [normal] [class]

PROPOSED APPROACH 3.1. Overview Zero-shot anomaly detection aims to generate a pixel-wise anomaly mapM∈[0,1] H×W for a test imagex∈R H×W×3 without exposure to target-specific samples during training. Following established protocols, we train on an auxiliary datasetD a containing normal and anomalous samples with ground-truth masks, then evaluate on a disj...

work page 2023
[4]

EXPERIMENTS Datasets.We evaluate on nine benchmarks spanning in- dustrial and medical domains. Industrial datasets include MVTec-AD [10] (15 categories), ViSA [11] (12 categories), BTAD [12] (3 categories), KSDD2 [13] (surface defects), MPDD [14] (6 categories), and MVTec-AD2 [15] (8 cat- egories). In line with [2, 5], we further evaluate on med- ical dat...

work page
[5]

Ground truth bound- aries shown in green

and medical (columns 5-6) samples. Ground truth bound- aries shown in green. F1, improving over the second-best by 7.8 and 15.1 points respectively. Kvasir results (90.6% P-AUC, 66.5% Pixel-F1) further confirm that our anomaly-aware adapters transfer to polyp segmentation without domain-specific fine-tuning. Figure 3 provides qualitative comparisons acros...

work page
[6]

Unlike uniform adaptation, A V A- DINO uses separate normal and anomaly pathways, com- bined through text-guided routing with explicit regularization to enforce specialization

CONCLUSION We present A V A-DINO, a dual-branch adaptation framework for zero-shot anomaly detection that learns context-specific feature transformations. Unlike uniform adaptation, A V A- DINO uses separate normal and anomaly pathways, com- bined through text-guided routing with explicit regularization to enforce specialization. Experiments on industrial...

work page
[7]

Winclip: Zero-/few-shot anomaly classification and segmentation,

Jongheon Jeong, Yang Zou, Taewan Kim, Dongqing Zhang, Avinash Ravichandran, and Onkar Dabeer, “Winclip: Zero-/few-shot anomaly classification and segmentation,” inIEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), 2023, pp. 19606–19616

work page 2023
[8]

Anomalyclip: Object-agnostic prompt learning for zero-shot anomaly detection,

Qihang Zhou, Guansong Pang, Yu Tian, Shibo He, and Jiming Chen, “Anomalyclip: Object-agnostic prompt learning for zero-shot anomaly detection,” inInterna- tional Conference on Learning Representations (ICLR), 2024, vol. 2024, pp. 49705–49737

work page 2024
[9]

Learning transferable visual models from natural lan- guage supervision,

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sas- try, Amanda Askell, Pamela Mishkin, Jack Clark, et al., “Learning transferable visual models from natural lan- guage supervision,” inInternational Conference on Ma- chine Learning (ICML). PmLR, 2021, pp. 8748–8763

work page 2021
[10]

Oriane Sim ´eoni, Huy V V o, Maximilian Seitzer, Fed- erico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Micha ¨el Ramamonjisoa, et al., “Dinov3,”arXiv preprint arXiv:2508.10104, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

Bayesian prompt flow learning for zero-shot anomaly detection,

Zhen Qu, Xian Tao, Xinyi Gong, Shichen Qu, Qiyu Chen, Zhengtao Zhang, Xingang Wang, and Guiguang Ding, “Bayesian prompt flow learning for zero-shot anomaly detection,” inIEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), 2025, pp. 30398–30408

work page 2025
[12]

Ada- clip: Adapting clip with hybrid learnable prompts for zero-shot anomaly detection,

Yunkang Cao, Jiangning Zhang, Luca Frittoli, Yuqi Cheng, Weiming Shen, and Giacomo Boracchi, “Ada- clip: Adapting clip with hybrid learnable prompts for zero-shot anomaly detection,” inEuropean Conference on Computer Vision (ECCV). Springer, 2024, pp. 55–72

work page 2024
[13]

A contrastive learning-guided confi- dent meta-learning for zero shot anomaly detection,

Muhammad Aqeel, Danijel Sko ˇcaj, Marco Cristani, and Francesco Setti, “A contrastive learning-guided confi- dent meta-learning for zero shot anomaly detection,” in IEEE/CVF International Conference on Computer Vi- sion (ICCV), 2025, pp. 1452–1461

work page 2025
[14]

Parameter- efficient transfer learning for nlp,

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Ges- mundo, Mona Attariyan, and Sylvain Gelly, “Parameter- efficient transfer learning for nlp,” inInternational Con- ference on Machine Learning (ICML). PMLR, 2019, pp. 2790–2799

work page 2019
[15]

Adaptformer: Adapting vision transformers for scalable visual recog- nition,

Shoufa Chen, Chongjian Ge, Zhan Tong, Jiangliu Wang, Yibing Song, Jue Wang, and Ping Luo, “Adaptformer: Adapting vision transformers for scalable visual recog- nition,”Advances in Neural Information Processing Systems (NeurIPS), vol. 35, pp. 16664–16678, 2022

work page 2022
[16]

Mvtec ad–a comprehensive real- world dataset for unsupervised anomaly detection,

Paul Bergmann, Michael Fauser, David Sattlegger, and Carsten Steger, “Mvtec ad–a comprehensive real- world dataset for unsupervised anomaly detection,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 9592–9600

work page 2019
[17]

Spot-the-difference self- supervised pre-training for anomaly detection and seg- mentation,

Yang Zou, Jongheon Jeong, Latha Pemula, Dongqing Zhang, and Onkar Dabeer, “Spot-the-difference self- supervised pre-training for anomaly detection and seg- mentation,” inEuropean Conference on Computer Vi- sion (ECCV). Springer, 2022, pp. 392–408

work page 2022
[18]

Vt-adl: A vision transformer network for image anomaly detection and localization,

Pankaj Mishra, Riccardo Verk, Daniele Fornasier, Clau- dio Piciarelli, and Gian Luca Foresti, “Vt-adl: A vision transformer network for image anomaly detection and localization,” inIEEE 30th International Symposium on Industrial Electronics (ISIE). IEEE, 2021, pp. 01–06

work page 2021
[19]

Mixed supervision for surface-defect detection: from weakly to fully supervised learning,

Jakob Bo ˇziˇc, Domen Tabernik, and Danijel Sko ˇcaj, “Mixed supervision for surface-defect detection: from weakly to fully supervised learning,”Computers in In- dustry, 2021

work page 2021
[20]

Deep learning-based defect detection of metal parts: evaluating current methods in complex conditions,

Stepan Jezek, Martin Jonak, Radim Burget, Pavel Dvo- rak, and Milos Skotak, “Deep learning-based defect detection of metal parts: evaluating current methods in complex conditions,” in13th International Congress on Ultra Modern Telecommunications and Control Systems and Workshops (ICUMT). IEEE, 2021, pp. 66–71

work page 2021
[21]

The mvtec ad 2 dataset: Advanced scenarios for unsupervised anomaly detection.arXiv preprint arXiv:2503.21622,

Lars Heckler-Kram, Jan-Hendrik Neudeck, Ulla Scheler, Rebecca K ¨onig, and Carsten Steger, “The mvtec ad 2 dataset: Advanced scenarios for un- supervised anomaly detection,”arXiv preprint arXiv:2503.21622, 2025

work page arXiv 2025
[22]

Kvasir-seg: A segmented polyp dataset,

Debesh Jha, Pia H Smedsrud, Michael A Riegler, P ˚al Halvorsen, Thomas De Lange, Dag Johansen, and H˚avard D Johansen, “Kvasir-seg: A segmented polyp dataset,” inInternational Conference on Multimedia Modeling (MMM). Springer, 2019, pp. 451–462

work page 2019
[23]

Automated polyp detection in colonoscopy videos using shape and context information,

Nima Tajbakhsh, Suryakanth R Gurudu, and Jianming Liang, “Automated polyp detection in colonoscopy videos using shape and context information,”IEEE Transactions on Medical Imaging, vol. 35, no. 2, pp. 630–644, 2015

work page 2015
[24]

Wm-dova maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians,

Jorge Bernal, F Javier S ´anchez, Gloria Fern ´andez- Esparrach, Debora Gil, Cristina Rodr ´ıguez, and Fer- nando Vilari ˜no, “Wm-dova maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians,”Computerized Medical Imag- ing and Graphics (CMGI), vol. 43, pp. 99–111, 2015

work page 2015