CoReVAD: A Contextual Reasoning Framework for Training-Free Video Anomaly Detection
Pith reviewed 2026-05-25 05:15 UTC · model grok-4.3
The pith
A single frozen vision-language model can detect video anomalies and generate explanations without training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CoReVAD is a contextual reasoning framework for training-free video anomaly detection that operates with a single frozen VLM. The model directly generates anomaly scores and temporal descriptions. A Local Response Cleaning module based on local vision-text alignment mitigates noise in the generative outputs. Global temporal context and progression are then added through softmax-based refinement, Gaussian smoothing, and position weighting. On UCF-Crime and XD-Violence the method achieves competitive performance among training-free approaches while also delivering reliable and interpretable explanations.
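The three temporal refinement steps named above can be illustrated with a minimal Python sketch. This summary does not give the paper's equations, so everything here is an assumption: the function names, the attention-style form of the softmax refinement, the `similarity` matrix between segments, the linear position ramp, and all default parameter values are hypothetical and chosen only for illustration.

```python
import math


def softmax(values, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp((v - m) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_refine(scores, similarity, temperature=0.1):
    """Assumed form: each segment's score becomes a softmax-weighted
    average of all segment scores, weighted by similarity[i][j]."""
    refined = []
    for row in similarity:
        weights = softmax(row, temperature)
        refined.append(sum(w * s for w, s in zip(weights, scores)))
    return refined


def gaussian_smooth(scores, sigma=1.0, radius=2):
    """Smooth scores with a truncated Gaussian kernel, renormalized
    at the sequence boundaries."""
    n = len(scores)
    smoothed = []
    for i in range(n):
        total, weight_sum = 0.0, 0.0
        for j in range(max(0, i - radius), min(n, i + radius + 1)):
            w = math.exp(-((i - j) ** 2) / (2.0 * sigma ** 2))
            total += w * scores[j]
            weight_sum += w
        smoothed.append(total / weight_sum)
    return smoothed


def position_weight(scores, alpha=0.2):
    """Hypothetical position weighting: a linear ramp that scales later
    segments up by at most (1 + alpha), clipped to 1.0."""
    n = len(scores)
    if n == 1:
        return list(scores)
    return [min(1.0, s * (1.0 + alpha * i / (n - 1)))
            for i, s in enumerate(scores)]


def refine_temporal(scores, similarity):
    """Apply the three steps in the order the summary lists them."""
    out = softmax_refine(scores, similarity)
    out = gaussian_smooth(out)
    return position_weight(out)
```

With scores in [0, 1], the first two steps are convex combinations and the last is clipped, so the refined scores stay in [0, 1]. Whether the paper orders or parameterizes the steps this way would need to be checked against the released code.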
What carries the argument
The Local Response Cleaning (LRC) module that performs local vision-text alignment to filter generative noise from a frozen VLM, followed by temporal refinement via softmax, Gaussian smoothing, and position weighting.
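As a rough illustration of what cleaning by local vision-text alignment could look like, the sketch below scores each segment's alignment as the cosine similarity between a visual embedding and the embedding of its generated description. Segments with low alignment then have their anomaly score replaced by an alignment-weighted average of trusted neighbours. This is an assumed mechanism, not the paper's LRC definition: the function names, the `window` and `threshold` parameters, and the replacement rule are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def local_response_cleaning(scores, frame_embs, text_embs,
                            window=1, threshold=0.5):
    """Hypothetical cleaning rule: keep scores whose description aligns
    with the visual content; otherwise borrow from aligned neighbours."""
    n = len(scores)
    align = [cosine(f, t) for f, t in zip(frame_embs, text_embs)]
    cleaned = []
    for i in range(n):
        if align[i] >= threshold:
            cleaned.append(scores[i])  # response is trusted as-is
            continue
        # Trusted neighbours inside the local window, excluding i itself.
        neighbours = [j for j in range(max(0, i - window),
                                       min(n, i + window + 1))
                      if j != i and align[j] >= threshold]
        total_w = sum(align[j] for j in neighbours)
        if not neighbours or total_w <= 0.0:
            cleaned.append(scores[i])  # nothing reliable nearby; keep raw
            continue
        cleaned.append(sum(align[j] * scores[j] for j in neighbours) / total_w)
    return cleaned
```

In practice the embeddings would come from the frozen VLM's own vision and text encoders. How the paper obtains them, and what it does with poorly aligned responses, is not stated in this summary.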
If this is right
- Video anomaly detection becomes feasible without any task-specific training or extra models beyond one frozen VLM.
- Both scalar scores and human-interpretable temporal descriptions are obtained from the same generative process.
- Performance stays competitive with other training-free methods on established benchmarks such as UCF-Crime and XD-Violence.
- Domain dependency is lowered because the method does not require retraining on target data.
Where Pith is reading between the lines
- The same cleaning and refinement steps could be tested on other generative video tasks that need both detection and explanations.
- Lowering reliance on external LLMs may reduce inference latency for real-time surveillance applications.
- Extending the local alignment step to additional modalities might improve robustness on noisier video sources.
Load-bearing premise
The generative outputs of a single frozen VLM, after local vision-text alignment cleaning and temporal refinement, will reliably correspond to true anomalies without task-specific training or external models.
What would settle it
The central claim would be falsified if, on a new video dataset with independent human annotations, the cleaned anomaly scores correlated with the labels no better than a random or constant baseline.
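One way to run this check is to compare the ROC AUC of the cleaned scores with the 0.5 that a constant scorer obtains, and to use a label permutation test as the random baseline. The sketch below assumes binary frame-level labels (1 for anomalous, 0 for normal). The function names and the choice of AUC as the statistic are illustrative assumptions, not the paper's evaluation code.

```python
import random


def roc_auc(scores, labels):
    """ROC AUC as the probability that a random positive outranks a
    random negative, counting ties as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def permutation_p_value(scores, labels, trials=1000, seed=0):
    """Fraction of label shuffles whose AUC is at least the observed AUC.
    A large value means the scores do no better than chance."""
    rng = random.Random(seed)
    observed = roc_auc(scores, labels)
    shuffled = list(labels)
    count = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if roc_auc(scores, shuffled) >= observed:
            count += 1
    return (count + 1) / (trials + 1)
```

An AUC near 0.5 together with a large permutation p-value on the new dataset would be the outcome that counts against the claim. The pairwise AUC loop is quadratic in the number of frames, so a rank-based implementation would be preferable at benchmark scale.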
Original abstract
Existing Video Anomaly Detection (VAD) methods typically rely on task-specific training, leading to strong domain dependency and high training costs. Moreover, most existing methods output only scalar anomaly scores, providing limited insight into why specific events are considered abnormal. Recent advances in Vision-Language Models (VLMs) have enabled both anomaly detection and human-interpretable reasoning. However, many VLM-based approaches still require additional training steps (e.g., instruction tuning or verbalized learning) or external Large Language Models (LLMs), incurring further training costs and inference overhead. To address these challenges, we propose CoReVAD, a contextual reasoning framework for training-free video anomaly detection that operates with a single frozen VLM. CoReVAD directly generates anomaly scores and temporal descriptions from the VLM. To mitigate noise in generative outputs, we introduce a Local Response Cleaning (LRC) module based on local vision-text alignment. Furthermore, global temporal context and progression are incorporated through softmax-based refinement, Gaussian smoothing, and position weighting. Experiments on UCF-Crime and XD-Violence demonstrate that CoReVAD achieves competitive performance among training-free methods while providing reliable and interpretable explanations. Our official code is available at: https://github.com/Muk-00/CoReVAD
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes CoReVAD, a contextual reasoning framework for training-free video anomaly detection using a single frozen VLM. It generates anomaly scores and temporal descriptions directly, introduces a Local Response Cleaning (LRC) module based on local vision-text alignment to reduce generative noise, and applies softmax-based refinement, Gaussian smoothing, and position weighting for global temporal context. Experiments on UCF-Crime and XD-Violence report competitive performance among training-free methods along with interpretable explanations; official code is released at the provided GitHub link.
Significance. If the results hold, the work is significant for enabling low-cost, training-free VAD with built-in interpretability, avoiding the domain dependency of task-specific training and the overhead of additional LLMs or fine-tuning required by prior VLM-based methods. Explicit credit is due for the public code release, which directly supports reproducibility.
minor comments (2)
- [Abstract] The claim of 'competitive performance' would be more informative if accompanied by the key quantitative metrics (e.g., AUC or AP values) and the exact training-free baselines compared against.
- [§3, method] The LRC module and the three temporal-refinement steps are described at a high level; adding a short algorithm box or explicit equations for the softmax refinement and position weighting would improve clarity without altering the central claim.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of CoReVAD, the recognition of its significance for training-free VAD with built-in interpretability, and the recommendation for minor revision. We note that no specific major comments were raised in the report.
Circularity Check
No significant circularity detected
The paper describes CoReVAD as a training-free framework that applies a single frozen external VLM to generate anomaly scores and descriptions, followed by post-processing modules (LRC based on vision-text alignment, softmax refinement, Gaussian smoothing, position weighting). No equations, derivations, or self-citations are presented that reduce claimed performance or explanations to fitted parameters, self-definitions, or prior author results by construction. Results are reported on standard external benchmarks (UCF-Crime, XD-Violence) under conventional metrics, keeping the method self-contained against independent data and models.
Axiom & Free-Parameter Ledger
invented entities (2)
- CoReVAD framework: no independent evidence
- Local Response Cleaning (LRC) module: no independent evidence