arxiv: 2605.23116 · v1 · pith:4KGYT77Lnew · submitted 2026-05-22 · 💻 cs.CV · cs.AI

CoReVAD: A Contextual Reasoning Framework for Training-Free Video Anomaly Detection

Pith reviewed 2026-05-25 05:15 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords video anomaly detection · training-free methods · vision-language models · contextual reasoning · interpretable detection · temporal refinement · local vision-text alignment

The pith

A single frozen vision-language model can detect video anomalies and generate explanations without training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Video anomaly detection has long depended on task-specific training that creates domain dependency and high costs, while most methods output only scalar scores with little insight into the reasons. The paper presents CoReVAD as a framework that runs on one frozen VLM to produce both anomaly scores and temporal descriptions directly from video. It introduces a Local Response Cleaning step that uses local vision-text alignment to reduce noise in the model outputs, then applies softmax refinement, Gaussian smoothing, and position weighting to incorporate global temporal context. On the UCF-Crime and XD-Violence datasets this approach reaches competitive results among training-free methods and supplies human-readable explanations for detected events.

Core claim

CoReVAD is a contextual reasoning framework for training-free video anomaly detection that operates with a single frozen VLM. The model directly generates anomaly scores and temporal descriptions. A Local Response Cleaning module based on local vision-text alignment mitigates noise in the generative outputs. Global temporal context and progression are then added through softmax-based refinement, Gaussian smoothing, and position weighting. On UCF-Crime and XD-Violence the method achieves competitive performance among training-free approaches while also delivering reliable and interpretable explanations.
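The Figure 2 caption describes LRC as choosing, for each segment, the most semantically aligned response among neighboring responses using f_vision and f_text. The selection rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: cosine similarity as the alignment measure, precomputed embeddings, the function names, and the toy vectors are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def local_response_cleaning(vision_embs, text_embs, responses, l=1):
    """For each segment j, keep the response from segments j-l..j+l whose text
    embedding is most similar to segment j's vision embedding.
    Assumption: alignment is measured by cosine similarity."""
    n = len(responses)
    cleaned = []
    for j in range(n):
        lo, hi = max(0, j - l), min(n, j + l + 1)
        best = max(range(lo, hi), key=lambda k: cosine(vision_embs[j], text_embs[k]))
        cleaned.append(responses[best])
    return cleaned

# Toy data (hypothetical): segment 1 looks like segment 0, but its raw response is noisy.
vision = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
text = [[1.0, 0.0], [0.2, 0.98], [0.0, 1.0]]
raw = ["normal: person walking", "anomaly: explosion", "anomaly: fight"]
print(local_response_cleaning(vision, text, raw, l=1))
```

In this toy case the noisy response for the middle segment is replaced by its neighbor's response, because that neighbor's text embedding matches the segment's visual embedding better.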

What carries the argument

The Local Response Cleaning (LRC) module that performs local vision-text alignment to filter generative noise from a frozen VLM, followed by temporal refinement via softmax, Gaussian smoothing, and position weighting.
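The abstract names three temporal refinement steps (softmax-based refinement, Gaussian smoothing, position weighting) but this page gives no equations for them. The sketch below shows one plausible reading, purely as an illustration. Every formula here is an assumption: the softmax step is taken to be a similarity-weighted average of segment scores, the position weight is taken to be a Gaussian centered on the middle of the video, and all names and default parameters (`tau`, `sigma`, `sigma_frac`) are hypothetical.

```python
import math

def softmax(xs, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def gaussian_smooth(scores, sigma=1.0):
    """1-D Gaussian smoothing; the kernel is renormalized at the borders."""
    radius = max(1, int(3 * sigma))
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-0.5 * (d / sigma) ** 2) for d in offsets]
    out = []
    for i in range(len(scores)):
        num = den = 0.0
        for d, w in zip(offsets, kernel):
            k = i + d
            if 0 <= k < len(scores):
                num += w * scores[k]
                den += w
        out.append(num / den)
    return out

def position_weights(n, sigma_frac=0.5):
    """Assumed form: Gaussian weight centered on the middle segment."""
    center = (n - 1) / 2
    sigma = sigma_frac * n
    return [math.exp(-0.5 * ((i - center) / sigma) ** 2) for i in range(n)]

def refine(scores, sims, tau=0.1, sigma=1.0):
    """Assumed pipeline: similarity-softmax averaging, then smoothing, then weighting."""
    mixed = []
    for row in sims:
        weights = softmax(row, tau)
        mixed.append(sum(w * s for w, s in zip(weights, scores)))
    smoothed = gaussian_smooth(mixed, sigma)
    return [s * p for s, p in zip(smoothed, position_weights(len(scores)))]

# Toy per-segment scores and a hypothetical segment-similarity matrix.
scores = [0.0, 1.0, 0.0, 1.0, 1.0]
sims = [[1.0 if i == k else 0.0 for k in range(5)] for i in range(5)]
print([round(x, 3) for x in refine(scores, sims)])
```

The order of the three steps and the choice of similarity matrix are design decisions the paper itself would have to settle; the sketch only shows that each step is a cheap post-processing pass over per-segment scores.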

If this is right

  • Video anomaly detection becomes feasible without any task-specific training or extra models beyond one frozen VLM.
  • Both scalar scores and human-interpretable temporal descriptions are obtained from the same generative process.
  • Performance stays competitive with other training-free methods on established benchmarks such as UCF-Crime and XD-Violence.
  • Domain dependency is lowered because the method does not require retraining on target data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same cleaning and refinement steps could be tested on other generative video tasks that need both detection and explanations.
  • Lowering reliance on external LLMs may reduce inference latency for real-time surveillance applications.
  • Extending the local alignment step to additional modalities might improve robustness on noisier video sources.

Load-bearing premise

The generative outputs of a single frozen VLM, after local vision-text alignment cleaning and temporal refinement, will reliably correspond to true anomalies without task-specific training or external models.

What would settle it

Evaluating the cleaned anomaly scores on a new video dataset with independent human annotations and finding that they correlate no better with the labels than a random or constant baseline would falsify the central claim.
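The comparison described above can be made concrete with a ranking metric. The sketch below assumes ROC AUC as the measure of agreement between scores and labels (the page does not fix a metric for this check) and uses made-up toy labels and scores; the function name is hypothetical.

```python
def roc_auc(labels, scores):
    """ROC AUC as the probability that a random anomalous item outscores a
    random normal item, counting ties as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy human annotations and model scores (illustrative values only).
labels = [0, 0, 1, 1]
model_scores = [0.1, 0.4, 0.35, 0.8]
constant_scores = [0.5] * len(labels)

print(roc_auc(labels, model_scores))     # 0.75
print(roc_auc(labels, constant_scores))  # 0.5
```

A constant baseline always yields 0.5 under this definition, so the central claim would be in trouble if the cleaned scores stayed near that value on independently annotated data.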

Figures

Figures reproduced from arXiv: 2605.23116 by Hyeongmuk Lim, Youngbum Hur.

Figure 1. Existing explainable VAD methods can be broadly categorized into three types: (a) training-free VAD methods that rely on external LLMs for reasoning, (b) VAD methods that adapt VLMs via instruction tuning and (c) VAD methods that introduce learnable guiding questions through verbalized learning. In contrast, (d) Our CoReVAD proposes an explainable VAD framework that requires neither external LLMs nor any t…

Figure 2. Architecture of our CoReVAD. The pre-trained VLM first generates a response r̃_j (red bounding box) for each segment s_j ∈ V, including a segment-level anomaly decision and a temporal description (blue bounding box), forming a response sequence R̃. The Local Response Cleaning (LRC) refines r̃_j by selecting the most semantically aligned one response from neighboring responses using f_vision and f_text. Visua…

Figure 3. PVLM and LRC. Local Response Cleaning. The raw response R̃ may contain noise due to randomness in generative outputs, leading to incorrect anomaly decisions or irrelevant explanations. Meanwhile, adjacent video frames typically share similar visual semantics because a video is typically recorded at high frame rates. Motivated by this observation, we introduce a local vision-text alignment mechanism for r…

Figure 4. Results of LRC over the number of l neighboring segments used for reasoning selection. Impact of neighboring responses. We also analyze the impact of the number of neighboring responses l used in LRC on VAD performance. As shown in …

Figure 5. Figure 5 presents the qualitative results of CoReVAD with sample videos from UCF-Crime and XD-Violence. The figure shows representative frames along with their corresponding temporal descriptions. In abnormal cases (red boxes), the model accurately describes visual content such as assaults, explosions, and shootings, which is well aligned with the high anomaly scores. Normal samples (blue boxes) also show low anoma…
Original abstract

Existing Video Anomaly Detection (VAD) methods typically rely on task-specific training, leading to strong domain dependency and high training costs. Moreover, most existing methods output only scalar anomaly scores, providing limited insight into why specific events are considered abnormal. Recent advances in Vision-Language Models (VLMs) have enabled both anomaly detection and human-interpretable reasoning. However, many VLM-based approaches still require additional training steps (e.g., instruction tuning or verbalized learning) or external Large Language Models (LLMs), incurring further training costs and inference overhead. To address these challenges, we propose CoReVAD, a contextual reasoning framework for training-free video anomaly detection that operates with a single frozen VLM. CoReVAD directly generates anomaly scores and temporal descriptions from the VLM. To mitigate noise in generative outputs, we introduce a Local Response Cleaning (LRC) module based on local vision-text alignment. Furthermore, global temporal context and progression are incorporated through softmax-based refinement, Gaussian smoothing, and position weighting. Experiments on UCF-Crime and XD-Violence demonstrate that CoReVAD achieves competitive performance among training-free methods while providing reliable and interpretable explanations. Our official code is available at: https://github.com/Muk-00/CoReVAD

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript proposes CoReVAD, a contextual reasoning framework for training-free video anomaly detection using a single frozen VLM. It generates anomaly scores and temporal descriptions directly, introduces a Local Response Cleaning (LRC) module based on local vision-text alignment to reduce generative noise, and applies softmax-based refinement, Gaussian smoothing, and position weighting for global temporal context. Experiments on UCF-Crime and XD-Violence report competitive performance among training-free methods along with interpretable explanations; official code is released at the provided GitHub link.

Significance. If the results hold, the work is significant for enabling low-cost, training-free VAD with built-in interpretability, avoiding the domain dependency of task-specific training and the overhead of additional LLMs or fine-tuning required by prior VLM-based methods. Explicit credit is due for the public code release, which directly supports reproducibility.

minor comments (2)
  1. [Abstract] The claim of 'competitive performance' would be more informative if accompanied by the key quantitative metrics (e.g., AUC or AP values) and the exact training-free baselines compared against.
  2. [§3, method] The LRC module and the three temporal-refinement steps are described at a high level; adding a short algorithm box or explicit equations for the softmax refinement and position weighting would improve clarity without altering the central claim.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of CoReVAD, the recognition of its significance for training-free VAD with built-in interpretability, and the recommendation for minor revision. We note that no specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper describes CoReVAD as a training-free framework that applies a single frozen external VLM to generate anomaly scores and descriptions, followed by post-processing modules (LRC based on vision-text alignment, softmax refinement, Gaussian smoothing, position weighting). No equations, derivations, or self-citations are presented that reduce claimed performance or explanations to fitted parameters, self-definitions, or prior author results by construction. Results are reported on standard external benchmarks (UCF-Crime, XD-Violence) under conventional metrics, keeping the method self-contained against independent data and models.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented physical entities are described. The framework itself and the LRC module are new constructs introduced to address noise and temporal context.

invented entities (2)
  • CoReVAD framework (no independent evidence)
    purpose: Training-free contextual reasoning for VAD
    New named system proposed in the paper.
  • Local Response Cleaning (LRC) module (no independent evidence)
    purpose: Mitigate noise in VLM generative outputs via local vision-text alignment
    Module introduced to clean model responses.


