CoReVAD: A Contextual Reasoning Framework for Training-Free Video Anomaly Detection

Hyeongmuk Lim; Youngbum Hur

arxiv: 2605.23116 · v1 · pith:4KGYT77Lnew · submitted 2026-05-22 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-25 05:15 UTC · model grok-4.3

classification: cs.CV · cs.AI

keywords: video anomaly detection · training-free methods · vision-language models · contextual reasoning · interpretable detection · temporal refinement · local vision-text alignment

The pith

A single frozen vision-language model can detect video anomalies and generate explanations without training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Video anomaly detection has long depended on task-specific training that creates domain dependency and high costs, while most methods output only scalar scores with little insight into the reasons. The paper presents CoReVAD as a framework that runs on one frozen VLM to produce both anomaly scores and temporal descriptions directly from video. It introduces a Local Response Cleaning step that uses local vision-text alignment to reduce noise in the model outputs, then applies softmax refinement, Gaussian smoothing, and position weighting to incorporate global temporal context. On the UCF-Crime and XD-Violence datasets this approach reaches competitive results among training-free methods and supplies human-readable explanations for detected events.

Core claim

CoReVAD is a contextual reasoning framework for training-free video anomaly detection that operates with a single frozen VLM. The model directly generates anomaly scores and temporal descriptions. A Local Response Cleaning module based on local vision-text alignment mitigates noise in the generative outputs. Global temporal context and progression are then added through softmax-based refinement, Gaussian smoothing, and position weighting. On UCF-Crime and XD-Violence the method achieves competitive performance among training-free approaches while also delivering reliable and interpretable explanations.
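
The LRC selection step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that per-segment visual embeddings and per-response text embeddings are already computed, that alignment is measured by cosine similarity, and that the window spans `num_neighbors` segments on each side. The function name, the parameter names, and the cosine choice are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def local_response_cleaning(vision_emb, text_emb, num_neighbors=2):
    """Return, for each segment j, the index of the neighboring response
    whose text embedding best matches segment j's visual embedding.

    vision_emb: (n, d) array, one visual embedding per segment (assumed given).
    text_emb:   (n, d) array, one text embedding per raw response (assumed given).
    num_neighbors: hypothetical window half-width (the paper's l may differ).
    """
    v = l2_normalize(np.asarray(vision_emb, dtype=float))
    t = l2_normalize(np.asarray(text_emb, dtype=float))
    n = v.shape[0]
    selected = np.empty(n, dtype=int)
    for j in range(n):
        lo = max(0, j - num_neighbors)
        hi = min(n, j + num_neighbors + 1)
        sims = t[lo:hi] @ v[j]          # cosine similarity to each candidate
        selected[j] = lo + int(np.argmax(sims))
    return selected

# Toy example: response 1 disagrees with what segment 1 shows,
# so segment 1 borrows the better-aligned response 0.
vision = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
text = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
print(local_response_cleaning(vision, text, num_neighbors=1))
```

Selecting an existing neighboring response, rather than averaging or regenerating, keeps each cleaned description a verbatim model output, which matches the caption's wording of "selecting the most semantically aligned" response.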

What carries the argument

The Local Response Cleaning (LRC) module that performs local vision-text alignment to filter generative noise from a frozen VLM, followed by temporal refinement via softmax, Gaussian smoothing, and position weighting.
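
The three temporal refinement steps can be illustrated with a short sketch. The summary above does not give the formulas, so everything here is an assumption chosen for illustration: softmax refinement is read as a similarity-weighted average of segment scores, Gaussian smoothing as a 1-D convolution with edge padding, and position weighting as a center-peaked Gaussian weight. The names `temperature`, `sigma`, and `width` are hypothetical parameters, not values from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    z = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)

def softmax_refine(scores, features, temperature=0.1):
    """Assumed form: each score becomes a softmax-weighted mean over all
    segments, weighted by cosine similarity of the segment features."""
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-12)
    weights = softmax((f @ f.T) / temperature, axis=1)  # each row sums to 1
    return weights @ scores

def gaussian_smooth(scores, sigma=1.0):
    """1-D Gaussian smoothing with edge padding; output length equals input."""
    radius = max(1, int(3 * sigma))
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    kernel /= kernel.sum()
    padded = np.pad(scores, radius, mode="edge")
    return np.convolve(padded, kernel, mode="valid")

def position_weight(scores, width=0.5):
    """Assumed form: down-weight scores near the start and end of the video."""
    pos = np.linspace(0.0, 1.0, len(scores))
    return scores * np.exp(-((pos - 0.5) ** 2) / (2 * width ** 2))

def refine_scores(scores, features):
    """Chain the three steps in the order the summary lists them."""
    s = np.asarray(scores, dtype=float)
    s = softmax_refine(s, np.asarray(features, dtype=float))
    s = gaussian_smooth(s)
    return position_weight(s)

raw = [0.0, 0.0, 1.0, 0.0, 0.0]
print(refine_scores(raw, np.eye(5)))
```

Because the first two steps are convex combinations and the position weights never exceed 1, scores that start in [0, 1] stay in [0, 1] under this sketch.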

If this is right

Video anomaly detection becomes feasible without any task-specific training or extra models beyond one frozen VLM.
Both scalar scores and human-interpretable temporal descriptions are obtained from the same generative process.
Performance stays competitive with other training-free methods on established benchmarks such as UCF-Crime and XD-Violence.
Domain dependency is lowered because the method does not require retraining on target data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same cleaning and refinement steps could be tested on other generative video tasks that need both detection and explanations.
Lowering reliance on external LLMs may reduce inference latency for real-time surveillance applications.
Extending the local alignment step to additional modalities might improve robustness on noisier video sources.

Load-bearing premise

The generative outputs of a single frozen VLM, after local vision-text alignment cleaning and temporal refinement, will reliably correspond to true anomalies without task-specific training or external models.

What would settle it

Evaluating the cleaned anomaly scores on a new video dataset with independent human annotations and finding that they correlate no better with the labels than a random or constant baseline would falsify the central claim.
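
One way to run such a check is to compare the ROC AUC of the cleaned scores against a constant-score baseline, which sits at 0.5 by construction. The sketch below is an illustration under assumptions: the labels are binary per-frame or per-segment annotations, AUC is the chosen metric, and the function name and toy data are hypothetical. It uses a pairwise comparison that is simple to follow but quadratic in memory, so it suits small samples only.

```python
import numpy as np

def roc_auc(labels, scores):
    """ROC AUC as the probability that a random anomalous item outscores
    a random normal item, counting ties as one half."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels]
    neg = scores[~labels]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("labels must contain both classes")
    diff = pos[:, None] - neg[None, :]  # every anomalous/normal pair
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / diff.size)

# Hypothetical human annotations and model scores for four segments.
labels = [0, 0, 1, 1]
model_scores = [0.1, 0.4, 0.35, 0.8]
constant_scores = [0.5, 0.5, 0.5, 0.5]

print("model AUC:   ", roc_auc(labels, model_scores))
print("constant AUC:", roc_auc(labels, constant_scores))
```

If the model's AUC on the new dataset were statistically indistinguishable from the constant baseline, that would be the falsifying outcome described above.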

Figures

Figures reproduced from arXiv: 2605.23116 by Hyeongmuk Lim, Youngbum Hur.

**Figure 1.** Existing explainable VAD methods can be broadly categorized into three types: (a) training-free VAD methods that rely on external LLMs for reasoning, (b) VAD methods that adapt VLMs via instruction tuning, and (c) VAD methods that introduce learnable guiding questions through verbalized learning. In contrast, (d) our CoReVAD proposes an explainable VAD framework that requires neither external LLMs nor any t…

**Figure 2.** Architecture of our CoReVAD. The pre-trained VLM first generates a response $\tilde{r}_j$ (red bounding box) for each segment $s_j \in V$, including a segment-level anomaly decision and a temporal description (blue bounding box), forming a response sequence $\tilde{R}$. The Local Response Cleaning (LRC) refines $\tilde{r}_j$ by selecting the most semantically aligned response from neighboring responses using $f_{vision}$ and $f_{text}$. Visua…

**Figure 3.** PVLM and LRC. Local Response Cleaning. The raw response $\tilde{R}$ may contain noise due to randomness in generative outputs, leading to incorrect anomaly decisions or irrelevant explanations. Meanwhile, adjacent video frames typically share similar visual semantics because a video is usually recorded at a high frame rate. Motivated by this observation, we introduce a local vision-text alignment mechanism for r…

**Figure 4.** Results of LRC over the number $l$ of neighboring segments used for reasoning selection. Impact of neighboring responses. We also analyze the impact of the number of neighboring responses $l$ used in LRC on VAD performance.

**Figure 5.** Qualitative results of CoReVAD with sample videos from UCF-Crime and XD-Violence. The figure shows representative frames along with their corresponding temporal descriptions. In abnormal cases (red boxes), the model accurately describes visual content such as assaults, explosions, and shootings, which is well aligned with the high anomaly scores. Normal samples (blue boxes) also show low anoma…

Original abstract

Existing Video Anomaly Detection (VAD) methods typically rely on task-specific training, leading to strong domain dependency and high training costs. Moreover, most existing methods output only scalar anomaly scores, providing limited insight into why specific events are considered abnormal. Recent advances in Vision-Language Models (VLMs) have enabled both anomaly detection and human-interpretable reasoning. However, many VLM-based approaches still require additional training steps (e.g., instruction tuning or verbalized learning) or external Large Language Models (LLMs), incurring further training costs and inference overhead. To address these challenges, we propose CoReVAD, a contextual reasoning framework for training-free video anomaly detection that operates with a single frozen VLM. CoReVAD directly generates anomaly scores and temporal descriptions from the VLM. To mitigate noise in generative outputs, we introduce a Local Response Cleaning (LRC) module based on local vision-text alignment. Furthermore, global temporal context and progression are incorporated through softmax-based refinement, Gaussian smoothing, and position weighting. Experiments on UCF-Crime and XD-Violence demonstrate that CoReVAD achieves competitive performance among training-free methods while providing reliable and interpretable explanations. Our official code is available at: https://github.com/Muk-00/CoReVAD

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CoReVAD adds a Local Response Cleaning module plus temporal refinements on a single frozen VLM for training-free VAD, but the abstract shows no numbers so the performance claim stays unverified.

The core of this paper is a training-free setup that runs anomaly detection and generates temporal descriptions from one frozen VLM. It introduces a Local Response Cleaning module that uses local vision-text alignment to cut noise in the generative outputs, then layers on softmax refinement, Gaussian smoothing, and position weighting to handle temporal context. That combination is presented as new for this subfield, and the authors release code, which is useful for anyone who wants to test it directly on UCF-Crime and XD-Violence.

Referee Report

0 major / 2 minor

Summary. The manuscript proposes CoReVAD, a contextual reasoning framework for training-free video anomaly detection using a single frozen VLM. It generates anomaly scores and temporal descriptions directly, introduces a Local Response Cleaning (LRC) module based on local vision-text alignment to reduce generative noise, and applies softmax-based refinement, Gaussian smoothing, and position weighting for global temporal context. Experiments on UCF-Crime and XD-Violence report competitive performance among training-free methods along with interpretable explanations; official code is released at the provided GitHub link.

Significance. If the results hold, the work is significant for enabling low-cost, training-free VAD with built-in interpretability, avoiding the domain dependency of task-specific training and the overhead of additional LLMs or fine-tuning required by prior VLM-based methods. Explicit credit is due for the public code release, which directly supports reproducibility.

minor comments (2)

[Abstract] The claim of 'competitive performance' would be more informative if accompanied by the key quantitative metrics (e.g., AUC or AP values) and the exact training-free baselines compared against.
[§3, method] The LRC module and the three temporal-refinement steps are described at a high level; adding a short algorithm box or explicit equations for the softmax refinement and position weighting would improve clarity without altering the central claim.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of CoReVAD, the recognition of its significance for training-free VAD with built-in interpretability, and the recommendation for minor revision. We note that no specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity detected

The paper describes CoReVAD as a training-free framework that applies a single frozen external VLM to generate anomaly scores and descriptions, followed by post-processing modules (LRC based on vision-text alignment, softmax refinement, Gaussian smoothing, position weighting). No equations, derivations, or self-citations are presented that reduce claimed performance or explanations to fitted parameters, self-definitions, or prior author results by construction. Results are reported on standard external benchmarks (UCF-Crime, XD-Violence) under conventional metrics, keeping the method self-contained against independent data and models.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented physical entities are described. The framework itself and the LRC module are new constructs introduced to address noise and temporal context.

invented entities (2)

CoReVAD framework (no independent evidence)
purpose: Training-free contextual reasoning for VAD
New named system proposed in the paper.

Local Response Cleaning (LRC) module (no independent evidence)
purpose: Mitigate noise in VLM generative outputs via local vision-text alignment
Module introduced to clean model responses.


[1] [1]

GPT-4 Technical Report

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Bogdoll, D., Nitsche, M., Zöllner, J.M.: Anomaly detection in autonomous driving: A survey. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4488–4499 (2022)

work page 2022

[3] [3]

In: Proceedings of the AAAI conference on artificial intelligence

Chen, Y., Liu, Z., Zhang, B., Fok, W., Qi, X., Wu, Y.C.: Mgfn: Magnitude- contrastive glance-and-focus network for weakly-supervised video anomaly detec- tion. In: Proceedings of the AAAI conference on artificial intelligence. vol. 37, pp. 387–395 (2023)

work page 2023

[4] [4]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al.: Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 24185–24198 (2024)

work page 2024

[5] [5]

In: International Conference on Pattern Recognition

Dev, P.P., Hazari, R., Das, P.: Mcanet: Multimodal caption aware training-free video anomaly detection via large language model. In: International Conference on Pattern Recognition. pp. 362–379. Springer (2024)

work page 2024

[6] [6]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Dosovitskiy, A.: An image is worth 16x16 words: Transformers for image recogni- tion at scale. arXiv preprint arXiv:2010.11929 (2020)

work page internal anchor Pith review Pith/arXiv arXiv 2010

[7] [7]

In: Proceedings of the IEEE/CVF interna- tional conference on computer vision

Gong, D., Liu, L., Le, V., Saha, B., Mansour, M.R., Venkatesh, S., Hengel, A.v.d.: Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection. In: Proceedings of the IEEE/CVF interna- tional conference on computer vision. pp. 1705–1714 (2019)

work page 2019

[8] [8]

In: Proceedings of the AAAI conference on artificial intelligence

Gu, Z., Zhu, B., Zhu, G., Chen, Y., Tang, M., Wang, J.: Anomalygpt: Detecting industrial anomalies using large vision-language models. In: Proceedings of the AAAI conference on artificial intelligence. vol. 38, pp. 1932–1940 (2024)

work page 1932

[9] [9]

In: Proceedings of the IEEE conference on computer vision and pattern recognition

Hasan, M., Choi, J., Neumann, J., Roy-Chowdhury, A.K., Davis, L.S.: Learning temporal regularity in video sequences. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 733–742 (2016) 14 H. Lim and Y. Hur

work page 2016

[10] [10]

In: 2023 IEEE Interna- tional Conference on Image Processing (ICIP)

Joo, H.K., Vo, K., Yamazaki, K., Le, N.: Clip-tsa: Clip-assisted temporal self- attention for weakly-supervised video anomaly detection. In: 2023 IEEE Interna- tional Conference on Image Processing (ICIP). pp. 3230–3234. IEEE (2023)

work page 2023

[11] [11]

In: 2022 26th International Conference on Pattern Recognition (ICPR)

Lee,J.,Nam,W.J.,Lee,S.W.:Multi-contextualpredictionswithvisiontransformer for video anomaly detection. In: 2022 26th International Conference on Pattern Recognition (ICPR). pp. 1012–1018. IEEE (2022)

work page 2022

[12] [12]

In: European Conference on Computer Vision

Li, G., Cai, G., Zeng, X., Zhao, R.: Scale-aware spatio-temporal relation learning for video anomaly detection. In: European Conference on Computer Vision. pp. 333–350. Springer (2022)

work page 2022

[13] [13]

In: Proceedings of the AAAI Confer- ence on Artificial Intelligence

Li, S., Liu, F., Jiao, L.: Self-training multi-sequence learning with transformer for weakly supervised video anomaly detection. In: Proceedings of the AAAI Confer- ence on Artificial Intelligence. vol. 36, pp. 1395–1403 (2022)

work page 2022

[14] [14]

Advances in neural information processing systems36, 34892–34916 (2023)

Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural information processing systems36, 34892–34916 (2023)

work page 2023

[15] [15]

In: Proceedings of the IEEE conference on computer vision and pattern recognition

Liu,W.,Luo,W.,Lian,D.,Gao,S.:Futureframepredictionforanomalydetection– a new baseline. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 6536–6545 (2018)

work page 2018

[16] [16]

In: Proceedings of the IEEE/CVF international conference on computer vision

Liu, Z., Nie, Y., Long, C., Zhang, Q., Li, G.: A hybrid video anomaly detection framework via memory-augmented flow reconstruction and flow-guided frame pre- diction. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 13588–13597 (2021)

work page 2021

[17] [17]

In: Pro- ceedings of the IEEE international conference on computer vision

Lu, C., Shi, J., Jia, J.: Abnormal event detection at 150 fps in matlab. In: Pro- ceedings of the IEEE international conference on computer vision. pp. 2720–2727 (2013)

work page 2013

[18] [18]

arXiv preprint arXiv:2401.05702 (2024)

Lv, H., Sun, Q.: Video anomaly detection and explanation via large language mod- els. arXiv preprint arXiv:2401.05702 (2024)

work page arXiv 2024

[19] [19]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Park, H., Noh, J., Ham, B.: Learning memory-guided normality for anomaly detec- tion. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 14372–14381 (2020)

work page 2020

[20] [20]

In: International conference on machine learning

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PmLR (2021)

work page 2021

[21] [21]

In: Proceedings of the 33rd ACM International Conference on Multimedia

Shao, Y., He, H., Li, S., Chen, S., Long, X., Zeng, F., Fan, Y., Zhang, M., Yan, Z., Ma, A., et al.: Eventvad: Training-free event-aware video anomaly detection. In: Proceedings of the 33rd ACM International Conference on Multimedia. pp. 2586–2595 (2025)

work page 2025

[22] [22]

In: 2018 24th International Conference on Pattern Recognition (ICPR)

Sohrab, F., Raitoharju, J., Gabbouj, M., Iosifidis, A.: Subspace support vector data description. In: 2018 24th International Conference on Pattern Recognition (ICPR). pp. 722–727. IEEE (2018)

work page 2018

[23] [23]

In: Proceedings of the IEEE conference on computer vision and pattern recognition

Sultani, W., Chen, C., Shah, M.: Real-world anomaly detection in surveillance videos. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 6479–6488 (2018)

work page 2018

[24] [24]

Pattern Recognition140, 109567 (2023)

Thakare, K.V., Dogra, D.P., Choi, H., Kim, H., Kim, I.J.: Rareanom: A benchmark video dataset for rare type anomalies. Pattern Recognition140, 109567 (2023)

work page 2023

[25] [25]

In: Pro- ceedings of the IEEE/CVF Winter conference on applications of computer vision

Thakare, K.V., Raghuwanshi, Y., Dogra, D.P., Choi, H., Kim, I.J.: Dyannet: A scene dynamicity guided self-trained video anomaly detection network. In: Pro- ceedings of the IEEE/CVF Winter conference on applications of computer vision. pp. 5541–5550 (2023) CoReVAD 15

[26]

Tian, Y., Pang, G., Chen, Y., Singh, R., Verjans, J.W., Carneiro, G.: Weakly-supervised video anomaly detection with robust temporal feature magnitude learning. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 4975–4986 (2021)

[27]

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)

[28]

Wang, J., Cherian, A.: Gods: Generalized one-class discriminative subspaces for anomaly detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 8201–8211 (2019)

[29]

Wu, J.C., Hsieh, H.Y., Chen, D.J., Fuh, C.S., Liu, T.L.: Self-supervised sparse representation for video anomaly detection. In: European Conference on Computer Vision. pp. 729–745. Springer (2022)

[30]

Wu, P., Liu, J.: Learning causal temporal relation and feature discrimination for anomaly detection. IEEE Transactions on Image Processing 30, 3513–3527 (2021)

[31]

Wu, P., Liu, J., Shi, Y., Sun, Y., Shao, F., Wu, Z., Yang, Z.: Not only look, but also listen: Learning multimodal violence detection under weak supervision. In: European conference on computer vision. pp. 322–339. Springer (2020)

[32]

Wu, P., Zhou, X., Pang, G., Zhou, L., Yan, Q., Wang, P., Zhang, Y.: Vadclip: Adapting vision-language models for weakly supervised video anomaly detection. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 6074–6082 (2024)

[33]

Xiao, T.Z., Bamler, R., Schölkopf, B., Liu, W.: Verbalized machine learning: Revisiting machine learning with language models. arXiv preprint arXiv:2406.04344 (2024)

[34]

Yang, Y., Lee, K., Dariush, B., Cao, Y., Lo, S.Y.: Follow the rules: Reasoning for video anomaly detection with large language models. In: European Conference on Computer Vision. pp. 304–322. Springer (2024)

[35]

Ye, M., Liu, W., He, P.: Vera: Explainable video anomaly detection via verbalized learning of vision-language models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 8679–8688 (2025)

[36]

Zaheer, M.Z., Mahmood, A., Khan, M.H., Segu, M., Yu, F., Lee, S.I.: Generative cooperative learning for unsupervised video anomaly detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 14744–14754 (2022)

[37]

Zanella, L., Menapace, W., Mancini, M., Wang, Y., Ricci, E.: Harnessing large language models for training-free video anomaly detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18527–18536 (2024)

[38]

Zhang, H., Xu, X., Wang, X., Zuo, J., Huang, X., Gao, C., Zhang, S., Yu, L., Sang, N.: Holmes-vau: Towards long-term video anomaly understanding at any granularity. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 13843–13853 (2025)

[39]

Zhao, M., Liu, Y., Liu, J., Zeng, X.: Exploiting spatial-temporal correlations for video anomaly detection. In: 2022 26th International Conference on Pattern Recognition (ICPR). pp. 1727–1733. IEEE (2022)

[40]

Zhu, S., Chen, C., Sultani, W.: Video anomaly detection for smart surveillance. In: Computer Vision: A Reference Guide, pp. 1–8. Springer (2020)
