pith. machine review for the scientific record.

arxiv: 2604.02871 · v1 · submitted 2026-04-03 · 💻 cs.CV

Recognition: no theorem link

SPG: Sparse-Projected Guides with Sparse Autoencoders for Zero-Shot Anomaly Detection

Junichi Okubo, Junichiro Fujii, Takayoshi Yamashita, Tomoyasu Nanaumi, Yukino Tsuzuki


Pith reviewed 2026-05-13 20:59 UTC · model grok-4.3

classification 💻 cs.CV
keywords zero-shot anomaly detection · sparse autoencoders · anomaly segmentation · foundation model features · cross-dataset evaluation · MVTec AD · VisA · sparse guide coefficients

The pith

Sparse-Projected Guides learn sparse coefficients in the latent space of a sparse autoencoder to produce normal and anomaly reference vectors from auxiliary data alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SPG for zero-shot anomaly detection and segmentation using only frozen foundation model features. It trains a sparse autoencoder on patch tokens from an auxiliary dataset and then optimizes sparse guide coefficients with pixel masks while keeping everything else fixed. These coefficients generate reference vectors through the autoencoder dictionary for normal and anomalous states. The method is evaluated on MVTec AD and VisA in cross-dataset settings and shows competitive image-level detection plus strong pixel-level segmentation. A reader would care because it removes the need for handcrafted prompts or any target adaptation while still matching or beating existing baselines.

Core claim

SPG is a prompt-free framework that learns sparse guide coefficients in the SAE latent space on a labeled auxiliary dataset. In the first stage an SAE is trained on patch-token features; in the second stage only the guide coefficients are optimized using auxiliary masks while the backbone and SAE remain frozen. The resulting coefficients produce normal and anomaly guide vectors via the SAE dictionary, enabling deployment to unseen target categories without adaptation. On MVTec AD and VisA under cross-dataset zero-shot protocols the approach yields competitive image-level detection and strong pixel-level segmentation, with the DINOv3 instantiation reaching the highest pixel-level AUROC among the compared methods.

What carries the argument

Sparse-Projected Guides (SPG) formed by sparse guide coefficients in the latent space of a sparse autoencoder; the coefficients are optimized on auxiliary masks and then used to reconstruct normal and anomalous reference vectors through the SAE dictionary.
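
A minimal sketch of that mechanism in Python, assuming a ReLU parameterization of the coefficients with an ℓ1 sparsity penalty; all names, shapes, and the stand-in dictionary are illustrative, not the authors' code:

    import torch

    # Illustrative shapes only: d_model is the backbone feature width,
    # n_atoms the SAE dictionary size. D stands in for the frozen,
    # Stage-1-pretrained dictionary.
    d_model, n_atoms = 1024, 8192
    D = torch.randn(n_atoms, d_model)

    # Stage 2: only these sparse guide coefficients are learnable; the
    # backbone and the SAE (hence D) stay frozen throughout.
    c_normal = torch.zeros(n_atoms, requires_grad=True)
    c_anomaly = torch.zeros(n_atoms, requires_grad=True)

    def guide_vectors():
        # Each guide is a non-negative combination of dictionary atoms;
        # an l1 penalty on the coefficients (added to the training loss)
        # keeps the combination sparse.
        g_normal = torch.relu(c_normal) @ D    # (d_model,)
        g_anomaly = torch.relu(c_anomaly) @ D  # (d_model,)
        return g_normal, g_anomaly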

If this is right

  • SPG attains competitive image-level AUROC and strong pixel-level segmentation on MVTec AD and VisA under cross-dataset zero-shot conditions.
  • With a DINOv3 backbone SPG records the highest pixel-level AUROC among the methods compared in the paper.
  • The learned coefficients allow tracing model decisions back to a small number of dictionary atoms that separate category-general from category-specific factors.
  • No target-domain adaptation or prompt engineering is required at deployment time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same sparse-projection idea could be tested on other dense-prediction tasks that rely on reference vectors in foundation-model feature space.
  • Because decisions reduce to a handful of dictionary atoms, the coefficients may offer a route to category-agnostic explanations of anomalies (a sketch of such an atom-level probe follows this list).
  • Performance on diverse target domains may improve if auxiliary datasets are selected to cover a broader range of visual factors.
  • The approach suggests that foundation-model patch features already contain enough structure for anomaly tasks once a sparse linear projection is learned.
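
On the explanation route raised above, one concrete probe is to rank dictionary atoms by their weight in the learned anomaly coefficients and inspect each one; a sketch under the same illustrative names as the earlier block:

    import torch

    def top_atoms(c_anomaly, k=10):
        # The learned anomaly coefficients are sparse, so a handful of
        # atoms carry most of the decision; ranking them by weight gives
        # the candidates for category-agnostic inspection.
        weights = torch.relu(c_anomaly)
        vals, idx = torch.topk(weights, k)
        return [(int(i), float(v)) for i, v in zip(idx, vals) if v > 0]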

Load-bearing premise

Sparse guide coefficients optimized solely on auxiliary dataset masks will produce effective normal and anomaly reference vectors that generalize to unseen target categories using only frozen foundation model features without any adaptation.
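
To make the premise concrete, here is one plausible deployment-time scoring rule; the exact similarity measure, temperature, and aggregation are assumptions here, not the paper's stated choices:

    import torch
    import torch.nn.functional as F

    def anomaly_map(patch_feats, g_normal, g_anomaly, tau=0.07):
        # patch_feats: (num_patches, d_model) frozen backbone tokens for
        # one image; tau is an assumed temperature. Each patch is scored
        # by a softmax over its similarities to the two guide vectors.
        f = F.normalize(patch_feats, dim=-1)
        g_n = F.normalize(g_normal, dim=-1)
        g_a = F.normalize(g_anomaly, dim=-1)
        sims = torch.stack([f @ g_n, f @ g_a], dim=-1) / tau
        return sims.softmax(dim=-1)[..., 1]  # per-patch P(anomalous)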

What would settle it

Running SPG on a new cross-dataset zero-shot split where pixel-level AUROC falls substantially below the strongest baseline would falsify the generalization claim.
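
Running that check is mechanical once score maps exist: pixel-level AUROC is a ranking of per-pixel scores against ground-truth masks. A sketch of the evaluation, assuming the common convention of pooling pixels across the test split:

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def pixel_auroc(score_maps, gt_masks):
        # score_maps and gt_masks are same-length lists of (H, W) arrays;
        # masks are binary ground truth. Pixels are pooled across the
        # whole split before computing the AUROC.
        y_score = np.concatenate([s.ravel() for s in score_maps])
        y_true = np.concatenate([m.ravel() for m in gt_masks]).astype(int)
        return roc_auc_score(y_true, y_score)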

Figures

Figures reproduced from arXiv: 2604.02871 by Junichi Okubo, Junichiro Fujii, Takayoshi Yamashita, Tomoyasu Nanaumi, Yukino Tsuzuki.

Figure 1. Overview of SPG (Sparse-Projected Guides). Stage 1 trains a Sparse Autoencoder (SAE) on patch-token features extracted by …
Figure 2. Sensitivity to SAE hyperparameters. Heatmaps summarize cross-dataset performance when sweeping the SAE dictionary …
Figure 3. Ablation of image-level anomaly-score aggregation via …
Figure 4. Effect of the visual backbone on SPG under cross-dataset …
Figure 5. Qualitative interpretation of SAE dictionary atoms emphasized by the learned anomaly guide. We select representative atoms …
read the original abstract

We study zero-shot anomaly detection and segmentation using frozen foundation model features, where all learnable parameters are trained only on a labeled auxiliary dataset and deployed to unseen target categories without any target-domain adaptation. Existing prompt-based approaches use handcrafted or learned prompt embeddings as reference vectors for normal/anomalous states. We propose Sparse-Projected Guides (SPG), a prompt-free framework that learns sparse guide coefficients in the Sparse Autoencoder (SAE) latent space, which generate normal/anomaly guide vectors via the SAE dictionary. SPG employs a two-stage learning strategy on the labeled auxiliary dataset: (i) train an SAE on patch-token features, and (ii) optimize only guide coefficients using auxiliary pixel-level masks while freezing the backbone and SAE. On MVTec AD and VisA under cross-dataset zero-shot settings, SPG achieves competitive image-level detection and strong pixel-level segmentation; with DINOv3, SPG attains the highest pixel-level AUROC among the compared methods. We also report SPG instantiated with OpenCLIP (ViT-L/14@336px) to align the backbone with CLIP-based baselines. Moreover, the learned guide coefficients trace decisions back to a small set of dictionary atoms, revealing category-general and category-specific factors.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Sparse-Projected Guides (SPG), a prompt-free framework for zero-shot anomaly detection and segmentation. It trains a Sparse Autoencoder (SAE) on patch-token features from a labeled auxiliary dataset, then optimizes only sparse guide coefficients using auxiliary pixel masks while keeping the foundation model backbone and SAE frozen. The learned coefficients produce normal and anomaly reference vectors via the SAE dictionary, which are applied directly to unseen target categories (MVTec AD and VisA) in a cross-dataset setting. Experiments report competitive image-level detection and strong pixel-level segmentation, with the DINOv3 variant achieving the highest pixel-level AUROC among compared methods; an OpenCLIP instantiation is also presented for alignment with CLIP baselines. The approach additionally claims interpretability by tracing decisions to a small set of dictionary atoms that capture category-general and category-specific factors.

Significance. If the reported generalization holds, SPG would provide a practical prompt-free alternative to existing zero-shot methods, leveraging the sparsity and interpretability of SAE latent spaces to generate reference vectors without handcrafted prompts or target adaptation. The two-stage training (SAE pretraining followed by coefficient optimization) and the ability to trace decisions to dictionary atoms represent a clear methodological contribution that could improve both performance and explainability in anomaly detection. The claim of highest pixel AUROC with DINOv3, if substantiated with full numerical results and ablations, would indicate meaningful gains over prior art on standard benchmarks.

major comments (2)
  1. [§3.2 (two-stage learning) and §4 (experiments)] The central zero-shot claim rests on the transfer of auxiliary-optimized guide coefficients to unseen MVTec/VisA categories. The manuscript provides no ablation that isolates the contribution of the learned coefficients (e.g., by replacing them with random or zero vectors while keeping the frozen SAE dictionary and backbone fixed) versus the raw foundation-model features alone. Without this control, it is impossible to determine whether the reported performance gains derive from the coefficient optimization or simply from the choice of backbone (DINOv3). A sketch of such a control appears after these comments.
  2. [Abstract and §4.1] The abstract states that SPG attains the highest pixel-level AUROC with DINOv3, yet supplies neither the numerical value, the full list of baselines, standard deviations, nor the corresponding table/figure reference. This omission prevents verification of the magnitude and statistical reliability of the claimed improvement.
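
One way to implement the control the first major comment asks for, sketched with hypothetical names; the frozen dictionary and the rest of the scoring pipeline would be reused unchanged:

    import torch

    def control_coefficients(n_atoms, mode="zero", seed=0):
        # Stand-ins for the learned Stage-2 coefficients. Because the
        # backbone and SAE dictionary stay frozen, any gap between these
        # controls and the learned coefficients isolates the contribution
        # of the Stage-2 optimization itself.
        if mode == "zero":
            return torch.zeros(n_atoms), torch.zeros(n_atoms)
        gen = torch.Generator().manual_seed(seed)
        c_n = torch.rand(n_atoms, generator=gen)
        c_a = torch.rand(n_atoms, generator=gen)
        return c_n, c_a
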
minor comments (2)
  1. [§3.1] The notation for the guide coefficients and their projection through the SAE dictionary should be formalized with an explicit equation (e.g., defining the normal and anomaly reference vectors as linear combinations of dictionary atoms weighted by the learned sparse coefficients); a plausible form is sketched after this list.
  2. [§4.3] Figure captions and axis labels in the qualitative results should explicitly state the backbone (DINOv3 vs. OpenCLIP) and the exact metric (image-level vs. pixel-level AUROC) for each panel to improve readability.
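
A plausible form of the equation requested in the first minor comment, with D ∈ ℝ^{d×K} the SAE dictionary (atoms as columns) and c ∈ ℝ^K the learned sparse coefficients; the notation is assumed, not taken from the paper:

    g_{\mathrm{normal}} = D\,\sigma(c_{\mathrm{normal}}), \qquad
    g_{\mathrm{anomaly}} = D\,\sigma(c_{\mathrm{anomaly}})

where σ enforces non-negativity (e.g., ReLU) and an ℓ1 penalty on c in the Stage-2 loss encourages sparsity, so each guide is a combination of a small number of atoms.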

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and will incorporate revisions to strengthen the presentation of our contributions and experimental results.

read point-by-point responses
  1. Referee: [§3.2 (two-stage learning) and §4 (experiments)] The central zero-shot claim rests on the transfer of auxiliary-optimized guide coefficients to unseen MVTec/VisA categories. The manuscript provides no ablation that isolates the contribution of the learned coefficients (e.g., by replacing them with random or zero vectors while keeping the frozen SAE dictionary and backbone fixed) versus the raw foundation-model features alone. Without this control, it is impossible to determine whether the reported performance gains derive from the coefficient optimization or simply from the choice of backbone (DINOv3).

    Authors: We agree that an explicit ablation isolating the learned guide coefficients is necessary to substantiate the role of the two-stage optimization. In the revised manuscript we will add a control experiment in Section 4 that replaces the optimized coefficients with random or zero vectors while keeping the frozen SAE dictionary and backbone unchanged. The results will be reported alongside the main tables and discussed to clarify that performance gains arise from the auxiliary-optimized coefficients rather than backbone features alone. revision: yes

  2. Referee: [Abstract and §4.1] The abstract states that SPG attains the highest pixel-level AUROC with DINOv3, yet supplies neither the numerical value, the full list of baselines, standard deviations, nor the corresponding table/figure reference. This omission prevents verification of the magnitude and statistical reliability of the claimed improvement.

    Authors: We acknowledge this omission. In the revised manuscript we will update the abstract to report the exact pixel-level AUROC value for the DINOv3 instantiation, explicitly reference Table 2 (or the corresponding table in §4.1), list the primary baselines, and include standard deviations where available to support the statistical reliability of the result. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper trains the SAE on auxiliary patch tokens and optimizes only the sparse guide coefficients via a mask-supervised loss on the auxiliary dataset while keeping the backbone and SAE frozen. These components are then applied without adaptation to produce anomaly scores on entirely unseen target categories (MVTec AD, VisA). The reported AUROC values are therefore measured on data never used in any fitting step, so the central zero-shot claim does not reduce to a self-defined or fitted quantity on the evaluation set. No self-citation chain, ansatz smuggling, or renaming of known results is invoked as load-bearing in the provided description; the method remains a genuine cross-dataset prediction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The approach rests on the transferability of frozen foundation model features and the utility of sparse coefficients for generating guides; the SAE dictionary and guide coefficients are learned rather than derived from first principles.

free parameters (1)
  • guide coefficients
    Sparse coefficients optimized on auxiliary pixel-level masks to produce normal and anomaly guides.
axioms (1)
  • domain assumption: Frozen foundation model patch-token features contain sufficient information to distinguish normal and anomalous states across categories.
    Invoked to justify zero-shot deployment without target adaptation or fine-tuning.
invented entities (1)
  • Sparse-Projected Guides · no independent evidence
    purpose: Generate normal and anomaly reference vectors from SAE dictionary atoms using learned sparse coefficients
    New construct introduced to replace prompt embeddings in zero-shot anomaly detection.

pith-pipeline@v0.9.0 · 5547 in / 1346 out tokens · 44106 ms · 2026-05-13T20:59:27.244876+00:00 · methodology
