Recognition: no theorem link
SPG: Sparse-Projected Guides with Sparse Autoencoders for Zero-Shot Anomaly Detection
Pith reviewed 2026-05-13 20:59 UTC · model grok-4.3
The pith
Sparse-Projected Guides learn sparse coefficients in a sparse autoencoder's latent space to produce normal and anomaly reference vectors from auxiliary data alone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SPG is a prompt-free framework that learns sparse guide coefficients in the SAE latent space on a labeled auxiliary dataset. In the first stage an SAE is trained on patch-token features; in the second stage only the guide coefficients are optimized using auxiliary masks while the backbone and SAE remain frozen. The resulting coefficients produce normal and anomaly guide vectors via the SAE dictionary, enabling deployment to unseen target categories without adaptation. On MVTec AD and VisA under cross-dataset zero-shot protocols the approach yields competitive image-level detection and strong pixel-level segmentation, with the DINOv3 instantiation reaching the highest pixel-level AUROC among the compared methods.
What carries the argument
Sparse-Projected Guides (SPG) formed by sparse guide coefficients in the latent space of a sparse autoencoder; the coefficients are optimized on auxiliary masks and then used to reconstruct normal and anomalous reference vectors through the SAE dictionary.
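Concretely, the guide-vector construction and its use at test time can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the feature dimension, dictionary size, sparsity level, and two-way softmax scoring rule are all hypothetical, and the random coefficients stand in for the auxiliary-optimized ones.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, k = 768, 4096, 32  # hypothetical feature dim, dictionary size, sparsity

# Frozen SAE dictionary: m atoms in R^d (one per row).
D = rng.standard_normal((m, d)) / np.sqrt(d)

def sparsify_topk(c, k):
    """Keep the k largest entries of a non-negative vector, zero the rest."""
    out = np.zeros_like(c)
    idx = np.argsort(c)[-k:]
    out[idx] = c[idx]
    return out

# Random stand-ins for the coefficients SPG optimizes on auxiliary masks.
c_normal = sparsify_topk(np.abs(rng.standard_normal(m)), k)
c_anomaly = sparsify_topk(np.abs(rng.standard_normal(m)), k)

# Reconstruct guide (reference) vectors through the dictionary.
g_normal = c_normal @ D    # shape (d,)
g_anomaly = c_anomaly @ D  # shape (d,)

def anomaly_scores(patch_feats):
    """Per-patch anomaly probability from cosine similarity to the two guides."""
    f = patch_feats / np.linalg.norm(patch_feats, axis=-1, keepdims=True)
    gn = g_normal / np.linalg.norm(g_normal)
    ga = g_anomaly / np.linalg.norm(g_anomaly)
    s_n, s_a = f @ gn, f @ ga
    return np.exp(s_a) / (np.exp(s_n) + np.exp(s_a))  # two-way softmax

patches = rng.standard_normal((196, d))  # e.g. a 14x14 grid of patch tokens
scores = anomaly_scores(patches)         # one value in (0, 1) per patch
```

Because only k of the m coefficients are nonzero, each guide vector decomposes into a small set of dictionary atoms, which is the interpretability hook the review points to.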
If this is right
- SPG attains competitive image-level AUROC and strong pixel-level segmentation on MVTec AD and VisA under cross-dataset zero-shot conditions.
- With a DINOv3 backbone SPG records the highest pixel-level AUROC among the methods compared in the paper.
- The learned coefficients allow tracing model decisions back to a small number of dictionary atoms that separate category-general from category-specific factors.
- No target-domain adaptation or prompt engineering is required at deployment time.
Where Pith is reading between the lines
- The same sparse-projection idea could be tested on other dense-prediction tasks that rely on reference vectors in foundation-model feature space.
- Because decisions reduce to a handful of dictionary atoms, the coefficients may offer a route to category-agnostic explanations of anomalies.
- Performance on diverse target domains may improve if auxiliary datasets are selected to cover a broader range of visual factors.
- The approach suggests that foundation-model patch features already contain enough structure for anomaly tasks once a sparse linear projection is learned.
Load-bearing premise
Sparse guide coefficients optimized solely on auxiliary dataset masks will produce effective normal and anomaly reference vectors that generalize to unseen target categories using only frozen foundation model features without any adaptation.
What would settle it
Running SPG on a new cross-dataset zero-shot split where pixel-level AUROC falls substantially below the strongest baseline would falsify the generalization claim.
Figures
Original abstract
We study zero-shot anomaly detection and segmentation using frozen foundation model features, where all learnable parameters are trained only on a labeled auxiliary dataset and deployed to unseen target categories without any target-domain adaptation. Existing prompt-based approaches use handcrafted or learned prompt embeddings as reference vectors for normal/anomalous states. We propose Sparse-Projected Guides (SPG), a prompt-free framework that learns sparse guide coefficients in the Sparse Autoencoder (SAE) latent space, which generate normal/anomaly guide vectors via the SAE dictionary. SPG employs a two-stage learning strategy on the labeled auxiliary dataset: (i) train an SAE on patch-token features, and (ii) optimize only guide coefficients using auxiliary pixel-level masks while freezing the backbone and SAE. On MVTec AD and VisA under cross-dataset zero-shot settings, SPG achieves competitive image-level detection and strong pixel-level segmentation; with DINOv3, SPG attains the highest pixel-level AUROC among the compared methods. We also report SPG instantiated with OpenCLIP (ViT-L/14@336px) to align the backbone with CLIP-based baselines. Moreover, the learned guide coefficients trace decisions back to a small set of dictionary atoms, revealing category-general and category-specific factors.
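The two-stage strategy in the abstract can be sketched end to end. Stage 1 is assumed done here (a random frozen dictionary stands in for the trained SAE); Stage 2 optimizes only the guide coefficients. The toy dimensions, random stand-in data, and sigmoid/BCE objective are assumptions for illustration, not the paper's actual choices.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, k = 64, 256, 8  # toy sizes; the paper's dimensions are not stated here

# Stage 1 (assumed already done): training an SAE on patch-token features
# leaves a frozen dictionary D of m atoms in R^d.
D = rng.standard_normal((m, d)) / np.sqrt(d)

# Stage 2: with backbone and SAE frozen, optimize only the anomaly guide
# coefficients c against auxiliary pixel-level masks (normal guide analogous).
feats = rng.standard_normal((500, d))          # stand-in auxiliary patch features
mask = (rng.random(500) < 0.1).astype(float)   # stand-in per-patch anomaly labels

def bce(c):
    """Binary cross-entropy of per-patch anomaly probabilities vs. the mask."""
    p = 1.0 / (1.0 + np.exp(-(feats @ (c @ D))))
    return float(-(mask * np.log(p) + (1 - mask) * np.log(1 - p)).mean())

c = np.zeros(m)
bce_init = bce(c)
lr = 0.05
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(feats @ (c @ D))))
    grad_logit = (p - mask) / len(mask)        # d(BCE)/d(logit)
    c -= lr * (D @ (feats.T @ grad_logit))     # chain rule back to coefficients
bce_final = bce(c)  # lower than bce_init: the coefficients have fit the masks

# Optional hard sparsification afterwards: keep only the k largest-magnitude
# coefficients (a TopK projection), one way to enforce sparsity.
keep = np.argsort(np.abs(c))[-k:]
c_sparse = np.zeros(m)
c_sparse[keep] = c[keep]
```

Only `c` receives gradient updates; `D` and `feats` never change, mirroring the frozen-backbone, frozen-SAE setup.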
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Sparse-Projected Guides (SPG), a prompt-free framework for zero-shot anomaly detection and segmentation. It trains a Sparse Autoencoder (SAE) on patch-token features from a labeled auxiliary dataset, then optimizes only sparse guide coefficients using auxiliary pixel masks while keeping the foundation model backbone and SAE frozen. The learned coefficients produce normal and anomaly reference vectors via the SAE dictionary, which are applied directly to unseen target categories (MVTec AD and VisA) in a cross-dataset setting. Experiments report competitive image-level detection and strong pixel-level segmentation, with the DINOv3 variant achieving the highest pixel-level AUROC among compared methods; an OpenCLIP instantiation is also presented for alignment with CLIP baselines. The approach additionally claims interpretability by tracing decisions to a small set of dictionary atoms that capture category-general and category-specific factors.
Significance. If the reported generalization holds, SPG would provide a practical prompt-free alternative to existing zero-shot methods, leveraging the sparsity and interpretability of SAE latent spaces to generate reference vectors without handcrafted prompts or target adaptation. The two-stage training (SAE pretraining followed by coefficient optimization) and the ability to trace decisions to dictionary atoms represent a clear methodological contribution that could improve both performance and explainability in anomaly detection. The claim of highest pixel AUROC with DINOv3, if substantiated with full numerical results and ablations, would indicate meaningful gains over prior art on standard benchmarks.
Major comments (2)
- [§3.2 (two-stage learning) and §4 (experiments)] The central zero-shot claim rests on the transfer of auxiliary-optimized guide coefficients to unseen MVTec/VisA categories. The manuscript provides no ablation that isolates the contribution of the learned coefficients (e.g., by replacing them with random or zero vectors while keeping the frozen SAE dictionary and backbone fixed) versus the raw foundation-model features alone. Without this control, it is impossible to determine whether the reported performance gains derive from the coefficient optimization or simply from the choice of backbone (DINOv3).
- [Abstract and §4.1] The abstract states that SPG attains the highest pixel-level AUROC with DINOv3, yet supplies neither the numerical value, the full list of baselines, standard deviations, nor the corresponding table/figure reference. This omission prevents verification of the magnitude and statistical reliability of the claimed improvement.
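The control requested in the first major comment could be run with a harness along these lines. This is a hedged sketch: the scoring rule and all tensors are random stand-ins, so every variant scores near chance here; only the protocol (learned vs. random vs. zero coefficients under a fixed frozen dictionary) is illustrated.

```python
import numpy as np

rng = np.random.default_rng(2)

def auroc(scores, labels):
    """Rank-based AUROC (Mann-Whitney statistic), no sklearn dependency."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return float((ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg))

# Stand-ins: frozen dictionary, target-category patch features, ground truth.
d, m = 64, 256
D = rng.standard_normal((m, d)) / np.sqrt(d)
feats = rng.standard_normal((1000, d))
labels = (rng.random(1000) < 0.05).astype(int)

def pixel_scores(c):
    return feats @ (c @ D)  # similarity to the anomaly guide vector

# Compare learned coefficients against random and all-zero controls; on real
# features, the size of the gap is what would settle the referee's objection.
c_learned = rng.standard_normal(m)  # placeholder for the optimized coefficients
results = {name: auroc(pixel_scores(c), labels)
           for name, c in [("learned", c_learned),
                           ("random", rng.standard_normal(m)),
                           ("zero", np.zeros(m))]}
```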
Minor comments (2)
- [§3.1] The notation for the guide coefficients and their projection through the SAE dictionary should be formalized with an explicit equation (e.g., defining the normal and anomaly reference vectors as linear combinations of dictionary atoms weighted by the learned sparse coefficients).
- [§4.3] Figure captions and axis labels in the qualitative results should explicitly state the backbone (DINOv3 vs. OpenCLIP) and the exact metric (image-level vs. pixel-level AUROC) for each panel to improve readability.
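The formalization requested in the first minor comment could take roughly the following shape. The notation is an assumption, not taken from the paper:

```latex
% Assumed notation (not the paper's): D = [d_1, ..., d_M] is the frozen SAE
% dictionary, c^n and c^a are the learned sparse guide coefficients.
\begin{align}
  g^{n} = D\,c^{n} = \sum_{j=1}^{M} c^{n}_{j}\, d_{j},
  \qquad
  g^{a} = D\,c^{a} = \sum_{j=1}^{M} c^{a}_{j}\, d_{j},
\end{align}
```

with sparsity enforced on $c^{n}, c^{a} \in \mathbb{R}^{M}_{\ge 0}$ (e.g. an $\ell_1$ penalty or a TopK projection), so that each reference vector is a linear combination of a small number of dictionary atoms.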
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and will incorporate revisions to strengthen the presentation of our contributions and experimental results.
Point-by-point responses
Referee: [§3.2 (two-stage learning) and §4 (experiments)] The central zero-shot claim rests on the transfer of auxiliary-optimized guide coefficients to unseen MVTec/VisA categories. The manuscript provides no ablation that isolates the contribution of the learned coefficients (e.g., by replacing them with random or zero vectors while keeping the frozen SAE dictionary and backbone fixed) versus the raw foundation-model features alone. Without this control, it is impossible to determine whether the reported performance gains derive from the coefficient optimization or simply from the choice of backbone (DINOv3).
Authors: We agree that an explicit ablation isolating the learned guide coefficients is necessary to substantiate the role of the two-stage optimization. In the revised manuscript we will add a control experiment in Section 4 that replaces the optimized coefficients with random or zero vectors while keeping the frozen SAE dictionary and backbone unchanged. The results will be reported alongside the main tables and discussed to clarify that performance gains arise from the auxiliary-optimized coefficients rather than backbone features alone. (Revision: yes)
Referee: [Abstract and §4.1] The abstract states that SPG attains the highest pixel-level AUROC with DINOv3, yet supplies neither the numerical value, the full list of baselines, standard deviations, nor the corresponding table/figure reference. This omission prevents verification of the magnitude and statistical reliability of the claimed improvement.
Authors: We acknowledge this omission. In the revised manuscript we will update the abstract to report the exact pixel-level AUROC value for the DINOv3 instantiation, explicitly reference Table 2 (or the corresponding table in §4.1), list the primary baselines, and include standard deviations where available to support the statistical reliability of the result. (Revision: yes)
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper trains the SAE on auxiliary patch tokens and optimizes only the sparse guide coefficients via a mask-supervised loss on the auxiliary dataset while keeping the backbone and SAE frozen. These components are then applied without adaptation to produce anomaly scores on entirely unseen target categories (MVTec AD, VisA). The reported AUROC values are therefore measured on data never used in any fitting step, so the central zero-shot claim does not reduce to a self-defined or fitted quantity on the evaluation set. No self-citation chain, ansatz smuggling, or renaming of known results is invoked as load-bearing in the provided description; the method remains a genuine cross-dataset prediction.
Axiom & Free-Parameter Ledger
Free parameters (1)
- Guide coefficients (the sparse coefficients in the SAE latent space; the only parameters optimized in Stage 2)
Axioms (1)
- Domain assumption: frozen foundation-model patch-token features contain sufficient information to distinguish normal from anomalous states across categories.
Invented entities (1)
- Sparse-Projected Guides (no independent evidence)
Reference graph
Works this paper leans on
- [1] Aimira Baitieva, David Hurych, Victor Besnier, and Olivier Bernard. Supervised anomaly detection for complex industrial images. In CVPR, 2024.
- [2] Paul Bergmann, Michael Fauser, David Sattlegger, and Carsten Steger. MVTec AD – a comprehensive real-world dataset for unsupervised anomaly detection. In CVPR, 2019.
- [3] Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, an...
- [4] https://transformer-circuits.pub/2023/monosemantic-features/index.html, 2023.
- [5] Yunkang Cao, Jiangning Zhang, Luca Frittoli, Yuqi Cheng, Weiming Shen, and Giacomo Boracchi. AdaCLIP: Adapting CLIP with hybrid learnable prompts for zero-shot anomaly detection. In ECCV, 2024.
- [6] Simon Damm, Mike Laszkiewicz, Johannes Lederer, and Asja Fischer. AnomalyDINO: Boosting patch-based few-shot anomaly detection with DINOv2. In WACV, 2025.
- [7] Bin-Bin Gao, Yue Zhu, Jiangtao Yan, Yuezhi Cai, Weixi Zhang, Meng Wang, Jun Liu, Yong Liu, Lei Wang, and Chengjie Wang. AdaptCLIP: Adapting CLIP for universal visual anomaly detection. In AAAI, 2026.
- [8] Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. Scaling and evaluating sparse autoencoders. In ICLR, 2025.
- [9] Jongheon Jeong, Yang Zou, Taewan Kim, Dongqing Zhang, Avinash Ravichandran, and Onkar Dabeer. WinCLIP: Zero-/few-shot anomaly classification and segmentation. In CVPR, 2023.
- [10] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
- [11] Tsung-Yi Lin, Priya Goyal, Ross B. Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In ICCV, 2017.
- [12] Wenxin Ma, Xu Zhang, Qingsong Yao, Fenghe Tang, Chenxu Wu, Yingtai Li, Rui Yan, Zihang Jiang, and S. Kevin Zhou. AA-CLIP: Enhancing zero-shot anomaly detection via anomaly-aware CLIP. In CVPR, 2025.
- [13] Alireza Makhzani and Brendan Frey. k-Sparse autoencoders. arXiv preprint arXiv:1312.5663, 2013.
- [14] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In 3DV, 2016.
- [15] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jégou, Julien Mairal, P... 2024.
- [16] Zhen Qu, Xian Tao, Mukesh Prasad, Fei Shen, Zhengtao Zhang, Xinyi Gong, and Guiguang Ding. VCP-CLIP: A visual context prompting model for zero-shot anomaly segmentation. In ECCV, 2024.
- [17] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In ICML, 2021.
- [18] Karsten Roth, Latha Pemula, Joaquin Zepeda, Bernhard Schölkopf, Thomas Brox, and Peter Gehler. Towards total recall in industrial anomaly detection. In CVPR, 2022.
- [19] Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie,... arXiv, 2025.
- [20] Shun Wei, Jielin Jiang, and Xiaolong Xu. UniNet: A contrastive learning-guided unified framework with feature selection for anomaly detection. In CVPR, 2025.
- [21] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In ICCV, 2023.
- [22] Qihang Zhou, Guansong Pang, Yu Tian, Shibo He, and Jiming Chen. AnomalyCLIP: Object-agnostic prompt learning for zero-shot anomaly detection. In ICLR, 2024.
- [23] Yang Zou, Jongheon Jeong, Latha Pemula, Dongqing Zhang, and Onkar Dabeer. SPot-the-difference self-supervised pre-training for anomaly detection and segmentation. In ECCV, 2022.