CoPS: Conditional Prompt Synthesis for Zero-Shot Anomaly Detection

Chengkan Lv; Haiming Yao; Huiyuan Luo; Qiyu Chen; Wei Luo; Yinan Duan; Yunkang Cao; Yuxin Jiang; Zhengtao Zhang; Zhen Qu

arxiv: 2508.03447 · v2 · submitted 2025-08-05 · 💻 cs.CV

CoPS: Conditional Prompt Synthesis for Zero-Shot Anomaly Detection

Qiyu Chen , Zhen Qu , Wei Luo , Haiming Yao , Yunkang Cao , Yuxin Jiang , Yinan Duan , Huiyuan Luo

show 2 more authors

Chengkan Lv Zhengtao Zhang

This is my paper

Pith reviewed 2026-05-19 00:55 UTC · model grok-4.3

classification 💻 cs.CV

keywords zero-shot anomaly detectionprompt synthesisvision-language modelspatch featuresvariational autoencoderindustrial defect detectionmedical lesion detection

0 comments

The pith

CoPS generates dynamic prompts by conditioning on visual prototypes and VAE semantic tokens to advance zero-shot anomaly detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that synthesizing prompts dynamically from image-derived normal and anomaly prototypes, plus modeled class tokens, overcomes the limits of static learnable tokens and sparse labels in vision-language models for anomaly detection. This would matter if true because it enables better generalization to entirely unseen categories after fine-tuning on only one auxiliary dataset, reducing the need for manual prompt design or category-specific training. A sympathetic reader would care because current approaches overfit to limited semantic patterns and fail to capture continuous variations in normal versus anomalous states across industrial defects and medical lesions. The framework injects patch-level prototypes explicitly into prompts and uses a variational autoencoder to implicitly enrich class information. Reported results across thirteen datasets show measurable gains in both image-level classification and pixel-level segmentation performance.

Core claim

CoPS synthesizes dynamic prompts conditioned on visual features by extracting representative normal and anomaly prototypes from fine-grained patch features for adaptive state modeling, while a variational autoencoder models semantic image features to implicitly fuse varied class tokens into the prompts; when paired with a spatially-aware alignment mechanism, this produces prompts that support effective zero-shot anomaly detection on unseen categories after auxiliary fine-tuning.

What carries the argument

Conditional Prompt Synthesis (CoPS), which injects extracted normal and anomaly prototypes from patch features into prompts and fuses VAE-modeled semantic class tokens to create adaptive rather than static prompts.

If this is right

Prompts adapt to continuous and diverse patterns of normal and anomalous states instead of remaining static.
Sparsity in fixed textual labels is mitigated by implicit fusion of varied class tokens from the VAE.
Cross-category anomaly detection becomes feasible on diverse industrial and medical datasets after single-dataset fine-tuning.
Spatially-aware alignment further supports precise localization of anomalies at the segmentation level.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The prototype extraction step could be tested for transfer to other prompt-conditioned vision tasks such as zero-shot classification or retrieval.
If the VAE component proves stable, similar semantic enrichment might reduce the size of auxiliary datasets needed for other zero-shot settings.
The conditional synthesis approach suggests a path toward prompts that evolve with incoming image batches rather than remaining fixed after training.

Load-bearing premise

Representative normal and anomaly prototypes from patch features combined with VAE-modeled semantic class tokens will produce prompts that generalize to entirely unseen categories without overfitting to the single auxiliary fine-tuning dataset.

What would settle it

Performance on a held-out test set of categories whose visual and semantic distributions differ markedly from the auxiliary dataset shows no improvement or a drop relative to static-prompt baselines in both classification and segmentation AUROC.

read the original abstract

Recently, large pre-trained vision-language models have shown remarkable performance in zero-shot anomaly detection (ZSAD). With fine-tuning on a single auxiliary dataset, the model enables cross-category anomaly detection on diverse datasets covering industrial defects and medical lesions. Compared to manually designed prompts, prompt learning eliminates the need for expert knowledge and trial-and-error. However, it still faces the following challenges: (i) static learnable tokens struggle to capture the continuous and diverse patterns of normal and anomalous states, limiting generalization to unseen categories; (ii) fixed textual labels provide overly sparse category information, making the model prone to overfitting to a specific semantic subspace. To address these issues, we propose Conditional Prompt Synthesis (CoPS), a novel framework that synthesizes dynamic prompts conditioned on visual features to enhance ZSAD performance. Specifically, we extract representative normal and anomaly prototypes from fine-grained patch features and explicitly inject them into prompts, enabling adaptive state modeling. Given the sparsity of class labels, we leverage a variational autoencoder to model semantic image features and implicitly fuse varied class tokens into prompts. Additionally, integrated with our spatially-aware alignment mechanism, extensive experiments demonstrate that CoPS surpasses state-of-the-art methods by 1.4% in classification AUROC and 1.9% in segmentation AUROC across 13 industrial and medical datasets. The code is available at https://github.com/cqylunlun/CoPS.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CoPS adds visual prototype injection and VAE-based token fusion to make prompts dynamic for zero-shot anomaly detection, delivering small gains on 13 datasets but resting on an auxiliary-to-unseen transfer that still needs checking.

read the letter

CoPS stands out for trying to make prompts dynamic in zero-shot anomaly detection by pulling in visual prototypes from patches and using a VAE to handle semantic class tokens. This targets the static nature of prior learnable prompts and the sparsity of labels. The new elements are the explicit injection of normal and anomaly prototypes into the prompt and the VAE-based fusion for class information. They pair this with a spatially-aware alignment. The results show improvements of 1.4% AUROC in classification and 1.9% in segmentation over SOTA on 13 datasets, which is decent progress for the field. The paper does a reasonable job laying out the challenges and the approach. Code availability helps. On the soft side, the gains are incremental, and success depends heavily on whether the prototypes and VAE outputs from the auxiliary set generalize without picking up on shared low-level features across datasets. If the auxiliary data is too similar to the test sets in texture or defect types, the apparent zero-shot performance could be overstated. I'd like to see more on how they chose the auxiliary dataset and any tests for semantic overfitting. This paper is for computer vision folks working on practical zero-shot methods for anomaly detection, especially those applying it to manufacturing or healthcare. A reader looking for incremental improvements in prompt-based ZSAD would get value from the specific conditioning strategy. It deserves a serious referee because the idea is clear, the experiments are broad, and the code is shared. The central claim can be checked with the provided details. I would recommend sending it to peer review, focusing review on the robustness of the generalization and the contribution of each new component.

Referee Report

2 major / 2 minor

Summary. The paper introduces Conditional Prompt Synthesis (CoPS) for zero-shot anomaly detection in vision-language models. It fine-tunes on a single auxiliary dataset and synthesizes dynamic prompts conditioned on visual features: representative normal/anomaly prototypes are extracted from fine-grained patch features and injected into prompts, while a variational autoencoder models semantic image features to fuse varied class tokens. This targets limitations of static learnable tokens and sparse fixed labels. The manuscript reports that CoPS outperforms state-of-the-art methods by 1.4% classification AUROC and 1.9% segmentation AUROC across 13 industrial and medical datasets.

Significance. If the reported gains prove robust, the conditional synthesis approach could meaningfully improve generalization in zero-shot anomaly detection by making prompts adaptive to continuous normal/anomaly patterns and richer semantic information, reducing dependence on manual prompt design.

major comments (2)

[§3.2 and §4.3] The central generalization claim—that prototypes and VAE tokens derived from one auxiliary dataset produce prompts that adapt to entirely disjoint categories and domains without overfitting—is load-bearing for the headline 1.4%/1.9% AUROC improvements, yet the manuscript provides no dedicated analysis or ablation on auxiliary-dataset choice, domain-shift statistics, or VAE mode collapse (see §3.2 and §4.3).
[Table 2] Table 2 (or equivalent main results table) reports aggregate AUROC gains but supplies no per-dataset breakdowns, standard deviations across seeds, or statistical significance tests; with margins this small, it is impossible to determine whether the improvements are reliable or sensitive to post-hoc hyperparameter choices.

minor comments (2)

[Abstract] The abstract states 'extensive experiments' without naming the 13 datasets or the auxiliary dataset used for fine-tuning; adding one sentence with this information would improve readability.
[§3.1] Notation for the prototype injection step (Eq. 3 or equivalent) is introduced without an explicit diagram showing how the normal/anomaly prototypes are concatenated or attended into the prompt tokens.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. The comments highlight important aspects of generalization analysis and result reporting that we will address to strengthen the manuscript. Below we respond point by point to the major comments.

read point-by-point responses

Referee: [§3.2 and §4.3] The central generalization claim—that prototypes and VAE tokens derived from one auxiliary dataset produce prompts that adapt to entirely disjoint categories and domains without overfitting—is load-bearing for the headline 1.4%/1.9% AUROC improvements, yet the manuscript provides no dedicated analysis or ablation on auxiliary-dataset choice, domain-shift statistics, or VAE mode collapse (see §3.2 and §4.3).

Authors: We agree that dedicated analysis would better support the generalization claim. In the revised manuscript we will add an ablation study on auxiliary-dataset choice by training on alternative auxiliary sets and reporting performance on the 13 evaluation datasets. We will also include quantitative domain-shift statistics (e.g., feature distribution distances between auxiliary and target domains) and an examination of VAE behavior, such as latent-space visualizations and reconstruction error statistics, to address potential mode collapse. These additions will be placed in §4.3 or a new subsection. revision: yes
Referee: [Table 2] Table 2 (or equivalent main results table) reports aggregate AUROC gains but supplies no per-dataset breakdowns, standard deviations across seeds, or statistical significance tests; with margins this small, it is impossible to determine whether the improvements are reliable or sensitive to post-hoc hyperparameter choices.

Authors: We concur that greater transparency is needed given the reported margins. We will revise the main results table (currently Table 2) to include per-dataset AUROC values for both classification and segmentation. We will also report standard deviations computed over multiple random seeds and include statistical significance tests (paired t-test or Wilcoxon signed-rank test) comparing CoPS against the strongest baseline. These details will appear in the main paper or as supplementary material. revision: yes

Circularity Check

0 steps flagged

No circularity: method is a new empirical architecture evaluated on held-out data

full rationale

The paper introduces CoPS as a prompt synthesis framework that extracts prototypes from patch features and uses a VAE for semantic tokens, then reports AUROC gains on 13 separate datasets after auxiliary fine-tuning. No equations or steps are shown that reduce a claimed prediction or uniqueness result to a fitted parameter or self-citation by construction. The central claims rest on external pre-trained VLMs plus new components whose performance is measured against held-out industrial and medical sets, satisfying the self-contained criterion. Minor self-citation of prior prompt-learning work is not load-bearing for the reported gains.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on the domain assumption that pre-trained vision-language models contain transferable features for anomaly detection and that prototypes plus VAE outputs can be fused into prompts without introducing new overfitting modes.

axioms (1)

domain assumption Pre-trained vision-language models encode features that can be adapted for cross-category anomaly detection via prompt conditioning.
Invoked as the foundation for extracting prototypes and semantic features.

pith-pipeline@v0.9.0 · 5812 in / 1269 out tokens · 36784 ms · 2026-05-19T00:55:24.815064+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We extract representative normal and anomaly prototypes from fine-grained patch features and explicitly inject them into prompts... leverage a variational autoencoder to model semantic image features
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

extensive experiments demonstrate that CoPS surpasses state-of-the-art methods by 1.4% in classification AUROC and 1.9% in segmentation AUROC across 13 industrial and medical datasets

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.