pith. machine review for the scientific record.

arxiv: 2604.02871 · v1 · submitted 2026-04-03 · 💻 cs.CV

Recognition: no theorem link

SPG: Sparse-Projected Guides with Sparse Autoencoders for Zero-Shot Anomaly Detection

Junichi Okubo, Junichiro Fujii, Takayoshi Yamashita, Tomoyasu Nanaumi, Yukino Tsuzuki


Pith reviewed 2026-05-13 20:59 UTC · model grok-4.3

classification 💻 cs.CV
keywords zero-shot anomaly detection · sparse autoencoders · anomaly segmentation · foundation model features · cross-dataset evaluation · MVTec AD · VisA · sparse guide coefficients

The pith

Sparse-Projected Guides learn sparse coefficients in the latent space of a sparse autoencoder to produce normal and anomaly reference vectors from auxiliary data alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SPG for zero-shot anomaly detection and segmentation using only frozen foundation model features. It trains a sparse autoencoder on patch tokens from an auxiliary dataset and then optimizes sparse guide coefficients with pixel masks while keeping everything else fixed. These coefficients generate reference vectors through the autoencoder dictionary for normal and anomalous states. The method is evaluated on MVTec AD and VisA in cross-dataset settings and shows competitive image-level detection plus strong pixel-level segmentation. A reader would care because it removes the need for handcrafted prompts or any target adaptation while still matching or beating existing baselines.

Core claim

SPG is a prompt-free framework that learns sparse guide coefficients in the SAE latent space on a labeled auxiliary dataset. In the first stage an SAE is trained on patch-token features; in the second stage only the guide coefficients are optimized using auxiliary masks while the backbone and SAE remain frozen. The resulting coefficients produce normal and anomaly guide vectors via the SAE dictionary, enabling deployment to unseen target categories without adaptation. On MVTec AD and VisA under cross-dataset zero-shot protocols the approach yields competitive image-level detection and strong pixel-level segmentation, with the DINOv3 instantiation reaching the highest pixel-level AUROC among the compared methods.

What carries the argument

Sparse-Projected Guides (SPG) formed by sparse guide coefficients in the latent space of a sparse autoencoder; the coefficients are optimized on auxiliary masks and then used to reconstruct normal and anomalous reference vectors through the SAE dictionary.
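
A minimal sketch of that mechanism in Python, assuming a ReLU parameterization of the coefficients with an ℓ1 sparsity penalty; all names, shapes, and the stand-in dictionary are illustrative, not the authors' code:

    import torch

    # Illustrative shapes only: d_model is the backbone feature width,
    # n_atoms the SAE dictionary size. D stands in for the frozen,
    # Stage-1-pretrained dictionary.
    d_model, n_atoms = 1024, 8192
    D = torch.randn(n_atoms, d_model)

    # Stage 2: only these sparse guide coefficients are learnable; the
    # backbone and the SAE (hence D) stay frozen throughout.
    c_normal = torch.zeros(n_atoms, requires_grad=True)
    c_anomaly = torch.zeros(n_atoms, requires_grad=True)

    def guide_vectors():
        # Each guide is a non-negative combination of dictionary atoms;
        # an l1 penalty on the coefficients (added to the training loss)
        # keeps the combination sparse.
        g_normal = torch.relu(c_normal) @ D    # (d_model,)
        g_anomaly = torch.relu(c_anomaly) @ D  # (d_model,)
        return g_normal, g_anomaly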

If this is right

  • SPG attains competitive image-level AUROC and strong pixel-level segmentation on MVTec AD and VisA under cross-dataset zero-shot conditions.
  • With a DINOv3 backbone SPG records the highest pixel-level AUROC among the methods compared in the paper.
  • The learned coefficients allow tracing model decisions back to a small number of dictionary atoms that separate category-general from category-specific factors.
  • No target-domain adaptation or prompt engineering is required at deployment time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same sparse-projection idea could be tested on other dense-prediction tasks that rely on reference vectors in foundation-model feature space.
  • Because decisions reduce to a handful of dictionary atoms, the coefficients may offer a route to category-agnostic explanations of anomalies (a sketch of such an atom-level probe follows this list).
  • Performance on diverse target domains may improve if auxiliary datasets are selected to cover a broader range of visual factors.
  • The approach suggests that foundation-model patch features already contain enough structure for anomaly tasks once a sparse linear projection is learned.
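
On the explanation route raised above, one concrete probe is to rank dictionary atoms by their weight in the learned anomaly coefficients and inspect each one; a sketch under the same illustrative names as the earlier block:

    import torch

    def top_atoms(c_anomaly, k=10):
        # The learned anomaly coefficients are sparse, so a handful of
        # atoms carry most of the decision; ranking them by weight gives
        # the candidates for category-agnostic inspection.
        weights = torch.relu(c_anomaly)
        vals, idx = torch.topk(weights, k)
        return [(int(i), float(v)) for i, v in zip(idx, vals) if v > 0]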

Load-bearing premise

Sparse guide coefficients optimized solely on auxiliary dataset masks will produce effective normal and anomaly reference vectors that generalize to unseen target categories using only frozen foundation model features without any adaptation.
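
To make the premise concrete, here is one plausible deployment-time scoring rule; the exact similarity measure, temperature, and aggregation are assumptions here, not the paper's stated choices:

    import torch
    import torch.nn.functional as F

    def anomaly_map(patch_feats, g_normal, g_anomaly, tau=0.07):
        # patch_feats: (num_patches, d_model) frozen backbone tokens for
        # one image; tau is an assumed temperature. Each patch is scored
        # by a softmax over its similarities to the two guide vectors.
        f = F.normalize(patch_feats, dim=-1)
        g_n = F.normalize(g_normal, dim=-1)
        g_a = F.normalize(g_anomaly, dim=-1)
        sims = torch.stack([f @ g_n, f @ g_a], dim=-1) / tau
        return sims.softmax(dim=-1)[..., 1]  # per-patch P(anomalous)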

What would settle it

Running SPG on a new cross-dataset zero-shot split where pixel-level AUROC falls substantially below the strongest baseline would falsify the generalization claim.
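
Running that check is mechanical once score maps exist: pixel-level AUROC is a ranking of per-pixel scores against ground-truth masks. A sketch of the evaluation, assuming the common convention of pooling pixels across the test split:

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def pixel_auroc(score_maps, gt_masks):
        # score_maps and gt_masks are same-length lists of (H, W) arrays;
        # masks are binary ground truth. Pixels are pooled across the
        # whole split before computing the AUROC.
        y_score = np.concatenate([s.ravel() for s in score_maps])
        y_true = np.concatenate([m.ravel() for m in gt_masks]).astype(int)
        return roc_auc_score(y_true, y_score)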

Figures

Figures reproduced from arXiv: 2604.02871 by Junichi Okubo, Junichiro Fujii, Takayoshi Yamashita, Tomoyasu Nanaumi, Yukino Tsuzuki.

Figure 1. Overview of SPG (Sparse-Projected Guides). Stage 1 trains a Sparse Autoencoder (SAE) on patch-token features extracted by …
Figure 2. Sensitivity to SAE hyperparameters. Heatmaps summarize cross-dataset performance when sweeping the SAE dictionary …
Figure 3. Ablation of image-level anomaly-score aggregation via …
Figure 4. Effect of the visual backbone on SPG under cross-dataset …
Figure 5. Qualitative interpretation of SAE dictionary atoms emphasized by the learned anomaly guide. We select representative atoms …
read the original abstract

We study zero-shot anomaly detection and segmentation using frozen foundation model features, where all learnable parameters are trained only on a labeled auxiliary dataset and deployed to unseen target categories without any target-domain adaptation. Existing prompt-based approaches use handcrafted or learned prompt embeddings as reference vectors for normal/anomalous states. We propose Sparse-Projected Guides (SPG), a prompt-free framework that learns sparse guide coefficients in the Sparse Autoencoder (SAE) latent space, which generate normal/anomaly guide vectors via the SAE dictionary. SPG employs a two-stage learning strategy on the labeled auxiliary dataset: (i) train an SAE on patch-token features, and (ii) optimize only guide coefficients using auxiliary pixel-level masks while freezing the backbone and SAE. On MVTec AD and VisA under cross-dataset zero-shot settings, SPG achieves competitive image-level detection and strong pixel-level segmentation; with DINOv3, SPG attains the highest pixel-level AUROC among the compared methods. We also report SPG instantiated with OpenCLIP (ViT-L/14@336px) to align the backbone with CLIP-based baselines. Moreover, the learned guide coefficients trace decisions back to a small set of dictionary atoms, revealing category-general and category-specific factors.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Sparse-Projected Guides (SPG), a prompt-free framework for zero-shot anomaly detection and segmentation. It trains a Sparse Autoencoder (SAE) on patch-token features from a labeled auxiliary dataset, then optimizes only sparse guide coefficients using auxiliary pixel masks while keeping the foundation model backbone and SAE frozen. The learned coefficients produce normal and anomaly reference vectors via the SAE dictionary, which are applied directly to unseen target categories (MVTec AD and VisA) in a cross-dataset setting. Experiments report competitive image-level detection and strong pixel-level segmentation, with the DINOv3 variant achieving the highest pixel-level AUROC among compared methods; an OpenCLIP instantiation is also presented for alignment with CLIP baselines. The approach additionally claims interpretability by tracing decisions to a small set of dictionary atoms that capture category-general and category-specific factors.

Significance. If the reported generalization holds, SPG would provide a practical prompt-free alternative to existing zero-shot methods, leveraging the sparsity and interpretability of SAE latent spaces to generate reference vectors without handcrafted prompts or target adaptation. The two-stage training (SAE pretraining followed by coefficient optimization) and the ability to trace decisions to dictionary atoms represent a clear methodological contribution that could improve both performance and explainability in anomaly detection. The claim of highest pixel AUROC with DINOv3, if substantiated with full numerical results and ablations, would indicate meaningful gains over prior art on standard benchmarks.

major comments (2)
  1. [§3.2 (two-stage learning) and §4 (experiments)] The central zero-shot claim rests on the transfer of auxiliary-optimized guide coefficients to unseen MVTec/VisA categories. The manuscript provides no ablation that isolates the contribution of the learned coefficients (e.g., by replacing them with random or zero vectors while keeping the frozen SAE dictionary and backbone fixed) versus the raw foundation-model features alone. Without this control, it is impossible to determine whether the reported performance gains derive from the coefficient optimization or simply from the choice of backbone (DINOv3). A sketch of such a control appears after these comments.
  2. [Abstract and §4.1] The abstract states that SPG attains the highest pixel-level AUROC with DINOv3, yet supplies neither the numerical value, the full list of baselines, standard deviations, nor the corresponding table/figure reference. This omission prevents verification of the magnitude and statistical reliability of the claimed improvement.
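
One way to implement the control the first major comment asks for, sketched with hypothetical names; the frozen dictionary and the rest of the scoring pipeline would be reused unchanged:

    import torch

    def control_coefficients(n_atoms, mode="zero", seed=0):
        # Stand-ins for the learned Stage-2 coefficients. Because the
        # backbone and SAE dictionary stay frozen, any gap between these
        # controls and the learned coefficients isolates the contribution
        # of the Stage-2 optimization itself.
        if mode == "zero":
            return torch.zeros(n_atoms), torch.zeros(n_atoms)
        gen = torch.Generator().manual_seed(seed)
        c_n = torch.rand(n_atoms, generator=gen)
        c_a = torch.rand(n_atoms, generator=gen)
        return c_n, c_a
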
minor comments (2)
  1. [§3.1] The notation for the guide coefficients and their projection through the SAE dictionary should be formalized with an explicit equation (e.g., defining the normal and anomaly reference vectors as linear combinations of dictionary atoms weighted by the learned sparse coefficients); a plausible form is sketched after this list.
  2. [§4.3] Figure captions and axis labels in the qualitative results should explicitly state the backbone (DINOv3 vs. OpenCLIP) and the exact metric (image-level vs. pixel-level AUROC) for each panel to improve readability.
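
A plausible form of the equation requested in the first minor comment, with D ∈ ℝ^{d×K} the SAE dictionary (atoms as columns) and c ∈ ℝ^K the learned sparse coefficients; the notation is assumed, not taken from the paper:

    g_{\mathrm{normal}} = D\,\sigma(c_{\mathrm{normal}}), \qquad
    g_{\mathrm{anomaly}} = D\,\sigma(c_{\mathrm{anomaly}})

where σ enforces non-negativity (e.g., ReLU) and an ℓ1 penalty on c in the Stage-2 loss encourages sparsity, so each guide is a combination of a small number of atoms.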

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and will incorporate revisions to strengthen the presentation of our contributions and experimental results.

read point-by-point responses
  1. Referee: [§3.2 (two-stage learning) and §4 (experiments)] The central zero-shot claim rests on the transfer of auxiliary-optimized guide coefficients to unseen MVTec/VisA categories. The manuscript provides no ablation that isolates the contribution of the learned coefficients (e.g., by replacing them with random or zero vectors while keeping the frozen SAE dictionary and backbone fixed) versus the raw foundation-model features alone. Without this control, it is impossible to determine whether the reported performance gains derive from the coefficient optimization or simply from the choice of backbone (DINOv3).

    Authors: We agree that an explicit ablation isolating the learned guide coefficients is necessary to substantiate the role of the two-stage optimization. In the revised manuscript we will add a control experiment in Section 4 that replaces the optimized coefficients with random or zero vectors while keeping the frozen SAE dictionary and backbone unchanged. The results will be reported alongside the main tables and discussed to clarify that performance gains arise from the auxiliary-optimized coefficients rather than backbone features alone. revision: yes

  2. Referee: [Abstract and §4.1] The abstract states that SPG attains the highest pixel-level AUROC with DINOv3, yet supplies neither the numerical value, the full list of baselines, standard deviations, nor the corresponding table/figure reference. This omission prevents verification of the magnitude and statistical reliability of the claimed improvement.

    Authors: We acknowledge this omission. In the revised manuscript we will update the abstract to report the exact pixel-level AUROC value for the DINOv3 instantiation, explicitly reference Table 2 (or the corresponding table in §4.1), list the primary baselines, and include standard deviations where available to support the statistical reliability of the result. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper trains the SAE on auxiliary patch tokens and optimizes only the sparse guide coefficients via a mask-supervised loss on the auxiliary dataset while keeping the backbone and SAE frozen. These components are then applied without adaptation to produce anomaly scores on entirely unseen target categories (MVTec AD, VisA). The reported AUROC values are therefore measured on data never used in any fitting step, so the central zero-shot claim does not reduce to a self-defined or fitted quantity on the evaluation set. No self-citation chain, ansatz smuggling, or renaming of known results is invoked as load-bearing in the provided description; the method remains a genuine cross-dataset prediction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The approach rests on the transferability of frozen foundation model features and the utility of sparse coefficients for generating guides; the SAE dictionary and guide coefficients are learned rather than derived from first principles.

free parameters (1)
  • guide coefficients
    Sparse coefficients optimized on auxiliary pixel-level masks to produce normal and anomaly guides.
axioms (1)
  • domain assumption: Frozen foundation model patch-token features contain sufficient information to distinguish normal and anomalous states across categories.
    Invoked to justify zero-shot deployment without target adaptation or fine-tuning.
invented entities (1)
  • Sparse-Projected Guides · no independent evidence
    purpose: Generate normal and anomaly reference vectors from SAE dictionary atoms using learned sparse coefficients
    New construct introduced to replace prompt embeddings in zero-shot anomaly detection.

pith-pipeline@v0.9.0 · 5547 in / 1346 out tokens · 44106 ms · 2026-05-13T20:59:27.244876+00:00 · methodology
