Res²CLIP: Few-Shot Generalist Anomaly Detection with Residual-to-Residual Alignment
Pith reviewed 2026-05-20 18:52 UTC · model grok-4.3
The pith
Shifting multimodal alignment into CLIP's residual space resolves cross-granularity mismatch and domain shift for few-shot generalist anomaly detection.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that conducting multimodal alignment entirely inside CLIP's residual space lets residual representations eliminate fine-grained normal feature differences across regions and class-specific biases at once. The resulting Res²CLIP framework uses three branches (text prompt, visual prompt, residual-to-residual alignment), and its optimization objectives force attention to relative anomaly deviations instead of class-specific features, thereby avoiding both cross-granularity mismatch and cross-category generalization loss.
What carries the argument
The residual-to-residual alignment branch that symmetrically connects visual and text residuals inside CLIP's residual space and whose objectives are designed to emphasize relative anomaly deviations.
If this is right
- Alignment adapts automatically to fine-grained foreground-background differences without custom prompts per region.
- All learnable parameters remain inside the residual domain, preserving CLIP's open-world generalization.
- Optimization objectives direct the model toward relative deviations rather than absolute class features.
- The three-branch architecture supports symmetric bridging of visual and text modalities without domain shift from auxiliary training.
Where Pith is reading between the lines
- The residual-space idea could be tested on other vision-language models beyond CLIP to check whether the bias-elimination effect is architecture-specific.
- Combining the residual alignment branch with existing prompt-learning methods might further reduce the number of shots needed for new categories.
- The focus on relative deviations suggests the framework could help in related tasks such as open-set recognition where class biases also limit generalization.
Load-bearing premise
Residual representations naturally remove fine-grained normal feature differences across regions and class-specific biases while keeping anomaly signals intact.
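The premise above can be made concrete with a toy experiment. The following is a minimal sketch on synthetic vectors, not the paper's method: it assumes each feature is a class-specific offset plus small noise plus an optional anomaly shift, and that the residual is the feature minus the mean of a few normal references. The names `class_bias`, `anomaly_shift`, `sample`, and `residual` are hypothetical and do not come from the paper or its code.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16

# Hypothetical class-specific offsets standing in for "class bias".
class_bias = {"A": rng.normal(0.0, 5.0, dim), "B": rng.normal(0.0, 5.0, dim)}

# Hypothetical anomaly deviation shared by both classes.
anomaly_shift = np.zeros(dim)
anomaly_shift[0] = 3.0


def sample(cls, n, anomalous=False):
    """Draw n toy feature vectors for a class, optionally shifted by the anomaly."""
    x = class_bias[cls] + rng.normal(0.0, 0.1, (n, dim))
    return x + anomaly_shift if anomalous else x


def residual(features, normal_refs):
    """Subtract the mean of a few normal reference features (the prototype)."""
    return features - normal_refs.mean(axis=0)


refs = {c: sample(c, 4) for c in class_bias}  # 4-shot normal references
normal = {c: sample(c, 50) for c in class_bias}
abnormal = {c: sample(c, 50, anomalous=True) for c in class_bias}

# Gap between the two class means before and after taking residuals.
raw_gap = np.linalg.norm(normal["A"].mean(0) - normal["B"].mean(0))
res_gap = np.linalg.norm(
    residual(normal["A"], refs["A"]).mean(0) - residual(normal["B"], refs["B"]).mean(0)
)
# Size of the anomaly deviation that survives in residual space.
anomaly_norm = np.linalg.norm(residual(abnormal["A"], refs["A"]).mean(0))

print(f"class gap raw={raw_gap:.2f} residual={res_gap:.2f} anomaly={anomaly_norm:.2f}")
```

In this toy setting the class gap collapses after subtraction while the anomaly shift remains. Whether real CLIP features behave this way is exactly what the premise leaves open.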
What would settle it
A test set of completely novel anomaly categories where the residual-to-residual model shows no improvement in detection accuracy over standard CLIP prompt tuning or auxiliary fine-tuning baselines.
Original abstract
Few-shot Generalist Anomaly Detection requires models to generalize to novel categories without retraining, posing significant challenges in real-world scenarios with scarce samples and rapidly changing categories. Existing CLIP-based methods face two major challenges: coarse-grained unified text prompts struggle to adapt to fine-grained foreground-background differences, causing cross-granularity mismatch; and fine-tuning on auxiliary datasets disrupts CLIP's inherent open-world generalization due to domain shift, leading to cross-category generalization degradation. To address these, we propose to shift multimodal alignment entirely into a unified residual space, where residual representations naturally eliminate fine-grained normal feature differences across regions and class-specific biases, simultaneously resolving both problems. Based on this insight, Res$^2$CLIP, the first residual-to-residual alignment framework that symmetrically bridges visual and text modalities within CLIP's residual space, is designed. The framework is developed from a residual perspective into three branches: a text prompt-based branch, a visual prompt-based branch, and a novel residual-to-residual alignment branch. All learnable optimizations are constrained within the residual domain, and the residual alignment optimization objectives are designed to force the model to focus on relative anomaly deviations rather than optimizing class-specific features. Experiments on multiple datasets demonstrate the effectiveness of our architecture. The code is available at https://github.com/hito2448/Res2CLIP.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Res²CLIP, a residual-to-residual alignment framework for few-shot generalist anomaly detection in CLIP. It identifies two challenges in prior CLIP-based methods—cross-granularity mismatch from coarse unified text prompts and loss of open-world generalization from fine-tuning on auxiliary data—and addresses them by shifting multimodal alignment into a unified residual space. The framework consists of three branches (text prompt-based, visual prompt-based, and residual-to-residual alignment) with all learnable optimizations constrained to the residual domain; residual alignment objectives are designed to emphasize relative anomaly deviations rather than class-specific features. Experiments on multiple datasets are reported to demonstrate effectiveness, with code released.
Significance. If the core premise holds—that operating in residual space selectively suppresses fine-grained normal variations and class biases while preserving anomaly signals—this would offer a principled way to improve generalization in few-shot anomaly detection without sacrificing CLIP’s open-vocabulary capabilities. The symmetric residual bridging and constrained optimization constitute a clean architectural contribution; public code is a positive factor for reproducibility.
major comments (2)
- [Abstract / Introduction] Abstract and introduction: The central claim that 'residual representations naturally eliminate fine-grained normal feature differences across regions and class-specific biases' is asserted without a formal derivation, proof sketch, or controlled ablation that isolates the residual operation from prompt engineering and training constraints. This premise directly motivates the three-branch design and residual alignment objectives; its lack of isolated validation is load-bearing for the paper’s novelty claim.
- [Method] Method section (three-branch framework): The optimization objectives are described as forcing focus on relative deviations, yet no equation or analysis shows how the residual-to-residual loss mathematically differs from standard contrastive alignment in suppressing normal intra-class variance while retaining anomaly signals. A concrete comparison (e.g., gradient flow or feature distribution analysis) is needed.
minor comments (2)
- [Experiments] The abstract states that experiments demonstrate effectiveness, but the provided text contains no quantitative results, tables, or ablation studies. Adding at least one table summarizing key metrics and a dedicated ablation on the residual branch would strengthen the manuscript.
- [Method] Notation for residual computation (e.g., how visual and text residuals are obtained and aligned) should be introduced with explicit equations early in the method section to improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive and insightful comments. We address each major comment below, indicating the revisions we will make to strengthen the manuscript.
- Referee: [Abstract / Introduction] Abstract and introduction: The central claim that 'residual representations naturally eliminate fine-grained normal feature differences across regions and class-specific biases' is asserted without a formal derivation, proof sketch, or controlled ablation that isolates the residual operation from prompt engineering and training constraints. This premise directly motivates the three-branch design and residual alignment objectives; its lack of isolated validation is load-bearing for the paper’s novelty claim.
  Authors: We acknowledge that the original manuscript presents this property as an empirical insight motivating the design rather than a formally derived result. In the revision we will add a dedicated subsection providing a proof sketch based on the centering effect of residual computation (subtracting a class- or region-agnostic prototype) and its impact on intra-class variance in the CLIP embedding space. We will also include a controlled ablation that isolates the residual operation while freezing prompt engineering and training constraints, directly addressing the load-bearing nature of this premise for the novelty claim.
  revision: yes
- Referee: [Method] Method section (three-branch framework): The optimization objectives are described as forcing focus on relative deviations, yet no equation or analysis shows how the residual-to-residual loss mathematically differs from standard contrastive alignment in suppressing normal intra-class variance while retaining anomaly signals. A concrete comparison (e.g., gradient flow or feature distribution analysis) is needed.
  Authors: We agree that an explicit mathematical comparison would improve clarity. In the revised manuscript we will add an analysis (main text or appendix) that contrasts the residual-to-residual loss with standard contrastive alignment, including a derivation of how the loss gradients preferentially attenuate normal intra-class directions while preserving anomaly signals. We will further support this with feature-distribution visualizations (e.g., t-SNE or variance statistics) before and after residual alignment to illustrate the selective suppression effect.
  revision: yes
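One property the requested comparison could start from is invariance to a shared offset. The sketch below is an illustration under stated assumptions, not the paper's loss. It assumes the visual residual is the query feature minus a normal reference feature, and the text residual is the abnormal text embedding minus the normal one. The functions `standard_score` and `residual_score` and all variable names are hypothetical. A class-specific bias added to both query and reference cancels in the residual score but not in the direct image-text score.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def standard_score(x, t_normal, t_abnormal):
    """Direct image-text alignment: abnormal similarity minus normal similarity."""
    return cosine(x, t_abnormal) - cosine(x, t_normal)


def residual_score(x, x_ref, t_normal, t_abnormal):
    """Residual-to-residual alignment: visual residual against text residual."""
    return cosine(x - x_ref, t_abnormal - t_normal)


rng = np.random.default_rng(0)
dim = 32
t_normal, t_abnormal = rng.normal(size=dim), rng.normal(size=dim)

x_ref = rng.normal(size=dim)                 # toy normal reference feature
x = x_ref + 0.5 * (t_abnormal - t_normal)    # query deviating along the text residual
class_bias = 10.0 * rng.normal(size=dim)     # hypothetical class-specific offset

res_before = residual_score(x, x_ref, t_normal, t_abnormal)
res_after = residual_score(x + class_bias, x_ref + class_bias, t_normal, t_abnormal)
std_before = standard_score(x, t_normal, t_abnormal)
std_after = standard_score(x + class_bias, t_normal, t_abnormal)

print(f"residual score: {res_before:.4f} -> {res_after:.4f}")
print(f"standard score: {std_before:.4f} -> {std_after:.4f}")
```

This only shows that a bias common to query and reference drops out of the residual score by construction. It says nothing about intra-class variance that is not shared with the reference, which is the harder part of the referee's question.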
Circularity Check
No significant circularity detected in derivation chain
The paper presents the shift to residual space and the claim that residual representations naturally eliminate fine-grained normal feature differences and class-specific biases as a motivating insight for the three-branch architecture, without any equations, derivations, or self-citations that reduce this premise or the resulting predictions back to the inputs by construction. The optimization objectives are described as independent constraints within the residual domain to focus on relative deviations, and effectiveness is validated through experiments on multiple datasets rather than tautological fits or renamed known results. No load-bearing self-citation chains, ansatzes smuggled via prior work, or uniqueness theorems imported from the authors appear in the provided text, making the approach self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Residual representations naturally eliminate fine-grained normal feature differences across regions and class-specific biases.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: echoes)
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: "shift multimodal alignment entirely into a unified residual space, where residual representations naturally eliminate fine-grained normal feature differences across regions and class-specific biases"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
A Theoretical Derivations
Notation. Throughout this appendix we adopt the following conventions consistent with the main...
1. Sort the entries of $z$ in descending order: $z_{(1)} \ge z_{(2)} \ge \cdots \ge z_{(M)}$.
2. Identify the support size $k(z) = \max\{\, m \in [M] : 1 + m\, z_{(m)} > \sum_{j=1}^{m} z_{(j)} \,\}$.
3. Compute the threshold $\tau(z) = \bigl(\sum_{j=1}^{k(z)} z_{(j)} - 1\bigr) / k(z)$.
4. Output the sparse weights $\mathrm{Sparsemax}(z)_i = \max\{0,\, z_i - \tau(z)\}$. Entries with $z_i \le \tau(z)$ are mapped to exactly zero, while the remaining entries form a probability distribution that sums to 1.

Application to reference retrieval. Applying the operator above row-wise to $S_l$ yields a sparse weight matrix $W_l \in \mathbb{R}^{N \times (K \cdot N)}$, where each query row contains a small number o...
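The four Sparsemax steps above can be sketched in NumPy as follows. This is a minimal illustration of the operator as described in the text, not code from the paper's repository. The function names `sparsemax` and `sparsemax_rows` are chosen here for illustration, and the row-wise helper assumes a plain 2-D score matrix in place of $S_l$.

```python
import numpy as np


def sparsemax(z):
    """Project a 1-D score vector onto the probability simplex (Sparsemax)."""
    z = np.asarray(z, dtype=float)
    # Step 1: sort entries in descending order.
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    m = np.arange(1, z.size + 1)
    # Step 2: support size k(z) is the largest m with 1 + m * z_(m) > sum_{j<=m} z_(j).
    k = m[1.0 + m * z_sorted > cumsum].max()
    # Step 3: threshold tau(z).
    tau = (cumsum[k - 1] - 1.0) / k
    # Step 4: entries at or below tau become exactly zero.
    return np.maximum(z - tau, 0.0)


def sparsemax_rows(scores):
    """Apply sparsemax independently to each row of a 2-D score matrix."""
    return np.apply_along_axis(sparsemax, 1, np.asarray(scores, dtype=float))


weights = sparsemax([2.0, 1.0, 0.1])
print(weights)  # -> [1. 0. 0.]

row_weights = sparsemax_rows([[2.0, 1.0, 0.1], [0.5, 0.4, 0.1]])
print(row_weights.sum(axis=1))  # each row sums to 1
```

For the input `[2.0, 1.0, 0.1]` only the first sorted entry satisfies the support condition, so $k(z) = 1$, $\tau(z) = 1$, and all weight lands on the largest score. This is the sparsity that the retrieval step relies on.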