Is Video Anomaly Detection Misframed? Evidence from LLM-Based and Multi-Scene Models

Anoop Cherian; Furkan Mumcu; Michael J. Jones; Yasin Yilmaz

arxiv: 2605.12725 · v1 · pith:USFK5FKNnew · submitted 2026-05-12 · 💻 cs.CV

Pith reviewed 2026-05-14 20:43 UTC · model grok-4.3

classification 💻 cs.CV

keywords: video anomaly detection, multi-scene models, large language models, single-scene analysis, spatial localization, scene-specific normality, weak supervision

The pith

Video anomaly detection research has shifted toward multi-scene, LLM-based models that reduce the task to recognizing semantic categories rather than modeling scene-specific normality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that recent video anomaly detection work emphasizes general models meant to handle many scenes at once, often built on pretrained multi-modal large language models and video-level weak labels. This direction has pulled attention away from the scene-specific and context-dependent character of normal behavior that defines practical anomaly detection. Current approaches tend to detect familiar semantic anomaly types instead of local deviations in geometry, semantics, and activity within one environment, which suppresses spatial localization and turns the problem into a form of action recognition. A reader would care because real deployments, such as fixed-camera surveillance, operate in single scenes where normality must be learned from the particular setting rather than from cross-scene semantics.
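
The contrast between category recognition and scene-specific normality can be illustrated with a toy sketch. It is not the authors' method: the region names, activity labels, and both detector functions are hypothetical, chosen to mirror the boxing-ring example in Figure 1. It assumes activities have already been labeled and assigned to regions of one fixed camera view.

```python
from collections import defaultdict

# Hypothetical (region, activity) pairs seen during normal operation of ONE
# fixed camera. All names here are illustrative assumptions, not paper data.
NORMAL_OBSERVATIONS = [
    ("boxing_ring", "fighting"),
    ("boxing_ring", "walking"),
    ("spectator_area", "sitting"),
    ("spectator_area", "walking"),
]

# A category-style detector: a fixed list of "anomalous" semantic labels.
GLOBAL_ANOMALY_CATEGORIES = {"fighting", "robbery", "explosion"}


def category_detector(region, activity):
    """Flag by semantic category alone; the region is ignored on purpose."""
    return activity in GLOBAL_ANOMALY_CATEGORIES


def fit_scene_model(observations):
    """Record which activities were observed in each region of the scene."""
    model = defaultdict(set)
    for region, activity in observations:
        model[region].add(activity)
    return model


def scene_detector(model, region, activity):
    """Flag an activity that was never observed in this region."""
    return activity not in model.get(region, set())


model = fit_scene_model(NORMAL_OBSERVATIONS)

# Fighting inside the ring: the category detector raises a false alarm,
# the scene-specific detector does not.
print(category_detector("boxing_ring", "fighting"))      # True
print(scene_detector(model, "boxing_ring", "fighting"))  # False

# Cycling among the spectators has no familiar anomaly label, so only the
# scene-specific detector flags it.
print(category_detector("spectator_area", "cycling"))      # False
print(scene_detector(model, "spectator_area", "cycling"))  # True
```

The lookup table stands in for whatever learned model of local normality a real system would use; the point is only that the decision depends on the region as well as the activity.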

Core claim

Prevailing multi-scene and LLM-based formulations in video anomaly detection do not align with real-world requirements, which demand single-scene, spatially-aware, and explainable models that capture the nuanced structure of normality within individual environments through local geometry, semantics, and activity patterns.

What carries the argument

The prevailing formulation of multi-scene generalization with pretrained multi-modal large language model representations, which orients models toward familiar semantic anomaly categories instead of deviations from environment-specific normality.

If this is right

Single-scene formulations would better preserve spatial localization of anomalies.
Spatially-aware models would directly use local geometry and activity patterns instead of global semantics.
Explainable models would make the learned structure of normality inspectable within each environment.
Progress would require datasets and benchmarks that emphasize intra-scene variations over cross-scene generalization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Fixed-camera security systems would likely see improved detection rates if trained on scene-specific normality rather than broad semantic priors.
Dataset design could shift toward collecting dense annotations of normal activity within individual locations to support explainable models.
Similar single-scene reframing might apply to related tasks such as scene-specific action recognition or unusual event detection in robotics.

Load-bearing premise

Real-world video anomaly detection is typically performed within a single scene where normality is determined by local geometry, semantics, and activity patterns.

What would settle it

A controlled comparison on fixed-scene data showing whether a multi-scene LLM model can localize and explain anomalies that lack familiar semantic labels, or whether single-scene spatially-aware models achieve higher precision on the same data.
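
One way such a comparison could score spatial localization is an overlap test between a predicted region and an annotated one. The sketch below is a generic intersection-over-union (IoU) check, not a protocol taken from the paper. The box format `(x1, y1, x2, y2)`, the function names, and the `min_iou=0.1` threshold are all assumptions made for illustration.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Width and height of the overlap are clamped at zero for disjoint boxes.
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_localized(pred_box, gt_box, min_iou=0.1):
    """Count a detection as spatially correct if it overlaps enough.

    The 0.1 default is an illustrative assumption, not a value from the paper.
    """
    return box_iou(pred_box, gt_box) >= min_iou


gt = (40, 40, 50, 50)  # hypothetical annotated anomaly region

# A frame-level detector effectively predicts the whole 100x100 frame.
print(box_iou((0, 0, 100, 100), gt))       # 0.01
print(is_localized((0, 0, 100, 100), gt))  # False

# A spatially-aware detector predicts a tight region around the anomaly.
print(is_localized((38, 41, 52, 50), gt))  # True
```

Under a criterion of this kind, a model that only outputs a per-frame score is treated as predicting the entire frame, so it is penalized whenever the anomaly occupies a small part of the view.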

Figures

Figures reproduced from arXiv: 2605.12725 by Anoop Cherian, Furkan Mumcu, Michael J. Jones, Yasin Yilmaz.

**Figure 1.** These images illustrate a core limitation of current video anomaly detection approaches. In the first image, the fighting occurs inside the boxing ring, a context where such an action is normal. However, recent models trained primarily with weak supervision or relying on an LLM’s built-in notion of normality tend to flag this as anomalous because they focus on the high-level action category rather than the…

**Figure 2.** Coverage of key video anomaly detection (VAD) attributes across recent venues. Numbers in parentheses next to each venue indicate the total number of VAD papers at that venue. Location-specific model* denotes methods that explicitly model spatially conditioned normality; no surveyed papers satisfy this criterion, resulting in zero observed coverage.

Original abstract

Recent video anomaly detection research has expanded rapidly with an emphasis on general models of normality intended to work across many different scenes. While this focus has led to improvements in scalability and multi-scene generalization, it has also shifted the field away from modeling the scene-specific and context-dependent nature of normal behavior. Contemporary approaches frequently rely on video-level weak supervision and opaque pretrained representations from multi-modal large language models (MLLMs), which encourage models to respond to familiar semantic anomaly categories rather than to deviations from the normal patterns of a particular environment. This trend suppresses spatial localization, introduces semantic bias, and reduces anomaly detection to a form of action recognition. In this paper, we examine whether these prevailing formulations align with the core requirements of real-world VAD, which is typically performed within a single scene where normality is determined by local geometry, semantics, and activity patterns. Through targeted visual analyses and empirical evaluations, we demonstrate the practical consequences of these limitations and show that meaningful progress in VAD requires renewed focus on single-scene, spatially-aware, and explainable formulations that capture the nuanced structure of normality within individual environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper argues that multi-scene LLM-based VAD has drifted from real needs by favoring semantic categories over scene-specific spatial deviations.

The main thing to know is that this paper pushes back on the recent trend in video anomaly detection toward general multi-scene models built on multimodal LLMs. It claims these approaches suppress spatial localization and turn the task into something closer to action recognition because of weak supervision and pretrained semantic biases. The authors support this with targeted visual analyses and empirical comparisons that show how models respond to familiar anomaly categories rather than local deviations in geometry or activity patterns within one environment.

This reframing is the clearest new element: it assembles specific evidence for an existing worry instead of proposing another general model. The work does well at highlighting the mismatch between current methods and the scene-dependent nature of normality in practice. The visual examples make the localization loss concrete.

On the soft spots, the argument rests mainly on observational analysis rather than large-scale quantitative benchmarks, so the strength of the conclusions depends on how representative those visuals are. The assumption that real-world VAD is almost always single-scene holds in many surveillance settings but could be tested against multi-camera deployments.

Overall this is a useful critique for researchers already working in VAD who want to question evaluation practices and the push for generalization. It deserves a serious referee because the central claim is internally consistent and the evidence, while not overwhelming, is pointed enough to warrant discussion.

Referee Report

2 major / 2 minor

Summary. The paper claims that video anomaly detection (VAD) research has been misframed by its emphasis on multi-scene generalization and multi-modal large language model (MLLM)-based methods. These approaches rely on video-level weak supervision and opaque pretrained representations, which bias models toward familiar semantic categories rather than local deviations from scene-specific normality. Through targeted visual analyses and empirical evaluations, the manuscript demonstrates that this leads to suppressed spatial localization and a reduction of VAD to action recognition. It concludes that real-world VAD, typically performed in single scenes where normality depends on local geometry, semantics, and activity patterns, requires renewed focus on single-scene, spatially-aware, and explainable formulations.

Significance. If the visual analyses and empirical comparisons hold, the work offers a timely critique that could redirect VAD research away from scalable but semantically biased general models toward practical single-scene solutions. This aligns with the core requirements of real-world deployment and may encourage development of models that better capture nuanced, environment-specific normality structures, potentially improving localization and explainability over current trends.

major comments (2)

[Abstract / Introduction] The central claim that prevailing multi-scene MLLM formulations misalign with single-scene requirements rests on the assumption that real-world VAD is typically single-scene (stated in the abstract and introduction). This is load-bearing but would benefit from additional references to deployment literature or statistics on surveillance camera usage to strengthen the misalignment argument beyond observational analysis.
[Empirical Evaluations] The empirical evaluations (referenced in the abstract as demonstrating practical consequences) show semantic bias and reduced spatial localization, but the manuscript should provide explicit quantitative metrics (e.g., localization AUC or frame-level precision on single-scene vs. multi-scene splits) to make the evidence for the reduction to action recognition concrete and verifiable.
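
The frame-level metrics named in the second comment can be sketched as follows. This is a minimal illustration in plain Python, assuming per-frame anomaly scores and binary ground-truth labels (1 = anomalous). The function names and the toy numbers are hypothetical and are not results from the manuscript.

```python
def frame_auc(scores, labels):
    """Frame-level ROC AUC.

    Computed as the probability that a randomly chosen anomalous frame
    scores higher than a randomly chosen normal frame (ties count 0.5).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one anomalous and one normal frame")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def frame_precision(scores, labels, threshold):
    """Fraction of flagged frames (score >= threshold) that are anomalous."""
    flagged = [y for s, y in zip(scores, labels) if s >= threshold]
    if not flagged:
        return 0.0  # convention chosen here: no flagged frames gives 0.0
    return sum(flagged) / len(flagged)


# Toy data, invented for illustration only.
scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]

print(frame_auc(scores, labels))             # 0.75
print(frame_precision(scores, labels, 0.5))  # 1.0
```

Reporting these per scene, for example once on a single-scene split and once on a multi-scene split, would give the side-by-side numbers the comment asks for. The pairwise AUC loop is quadratic in the number of frames, which is acceptable for a sketch but not for large videos.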

minor comments (2)

[Methods / Experiments] Clarify the exact datasets and baselines used in the targeted visual analyses to allow readers to reproduce the observed semantic bias effects.
[Abstract] The abstract's phrasing that models 'respond to familiar semantic anomaly categories' could be illustrated with one concrete failure case from the visual analyses for immediate clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and constructive feedback on our manuscript. We have carefully considered the comments and made revisions to address them.

Referee: [Abstract / Introduction] The central claim that prevailing multi-scene MLLM formulations misalign with single-scene requirements rests on the assumption that real-world VAD is typically single-scene (stated in the abstract and introduction). This is load-bearing but would benefit from additional references to deployment literature or statistics on surveillance camera usage to strengthen the misalignment argument beyond observational analysis.

Authors: We agree that supporting the claim with additional references would strengthen the paper. In the revised manuscript, we have incorporated citations from surveillance deployment literature, including statistics indicating that over 80% of video surveillance systems operate in fixed single-scene environments, as reported in industry reports and papers on practical VAD applications. This bolsters the argument that multi-scene generalization is not the primary requirement in real-world settings.
revision: yes

Referee: [Empirical Evaluations] The empirical evaluations (referenced in the abstract as demonstrating practical consequences) show semantic bias and reduced spatial localization, but the manuscript should provide explicit quantitative metrics (e.g., localization AUC or frame-level precision on single-scene vs. multi-scene splits) to make the evidence for the reduction to action recognition concrete and verifiable.

Authors: We appreciate this suggestion for making the empirical evidence more concrete. Our original evaluations focused on visual analyses and qualitative demonstrations of semantic bias. To address this, we have added explicit quantitative metrics in the revised manuscript, including localization AUC scores and frame-level precision comparisons between single-scene and multi-scene model splits, which further illustrate the reduction to action recognition in multi-scene MLLM approaches.
revision: yes

Circularity Check

0 steps flagged

No significant circularity in observational critique

This critique paper contains no mathematical derivation chain, fitted parameters, or equations that could reduce to inputs by construction. Its central claims rest on targeted visual analyses and empirical comparisons of existing VAD methods, which are presented as independent observations rather than self-definitional or self-citation load-bearing steps. No uniqueness theorems, ansatzes, or renamings of known results are invoked in a manner that collapses the argument to prior author work. The paper is therefore self-contained against external benchmarks in the form of demonstrated limitations in multi-scene and MLLM-based approaches.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is a position and analysis piece with no mathematical model, fitted parameters, or new postulated entities.

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · 1 internal anchor

[1]

A coarse-to-fine pseudo-labeling (c2fpl) framework for unsupervised video anomaly detection

Anas Al-lahham, Nurbek Tastam, Muham- mad Zaigham Zaheer, and Karthik Nandakumar. A coarse-to-fine pseudo-labeling (c2fpl) framework for unsupervised video anomaly detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024

work page 2024
[2]

Collab- orative learning of anomalies with privacy (clap) for unsupervised video anomaly detection: A new base- line

Anas Al-lahham, Muhammad Zaigham Zaheer, Nurbek Tastan, and Karthik Nandakumar. Collab- orative learning of anomalies with privacy (clap) for unsupervised video anomaly detection: A new base- line. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

work page 2024
[3]

Advancing video anomaly detection: A concise review and a new dataset

Chen Chen, Tom Gedeon, Arjun Raj, Lei Wang, and Liyun Zhu. Advancing video anomaly detection: A concise review and a new dataset. InProceedings of the Conference on Neural Information Processing Systems, 2024

work page 2024
[4]

Prompt-enhanced multiple instance learning for weakly supervised video anomaly detec- tion

Junxi Chen, Liang Li, Li Su, Zheng-Jun Zha, and Qingming Huang. Prompt-enhanced multiple instance learning for weakly supervised video anomaly detec- tion. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

work page 2024
[5]

Generalizing single-frame supervi- sion to event-level understanding for video anomaly detection

Junxi Chen, Liang Li, Yunbin Tu, Li Su, Zhe Xue, and Qingming Huang. Generalizing single-frame supervi- sion to event-level understanding for video anomaly detection. InProceedings of the Conference on Neural Information Processing Systems, 2025

work page 2025
[6]

Fok, Xi- aojuan Qi, and Yik-Chung Wu

Yingxian Chen, Jiahui Liu, Ruidi Fan, Yanwei Li, Chirui Chang, Shizhen Zhao, Wilton W.T. Fok, Xi- aojuan Qi, and Yik-Chung Wu. Aligning effective tokens with video anomaly in large language mod- els. InProceedings of the IEEE/CVF Conference on Inernational Conference on Computer Vision, 2025

work page 2025
[7]

Towards multi-domain learning for generalizable video anomaly detection

MyeongAh Cho, Taeoh Kim, Minho Shim, Dongy- oon Wee, and Sangyoun Lee. Towards multi-domain learning for generalizable video anomaly detection. In Proceedings of the Conference on Neural Information Processing Systems, 2024

work page 2024
[8]

Distilling aggregated knowledge for weakly- supervised video anomaly detection

Jash Dalvi, Ali Dabouei, Gunjan Dhanuka, and Min Xu. Distilling aggregated knowledge for weakly- supervised video anomaly detection. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2025

work page 2025
[9]

Sequen- tial keypoint density estimator: an overlooked baseline of skeleton-based video anomaly detection

Anja Delic, Matej Grcic, and Sinisa Segvic. Sequen- tial keypoint density estimator: an overlooked baseline of skeleton-based video anomaly detection. InPro- ceedings of the IEEE/CVF Conference on Inernational Conference on Computer Vision, 2025

work page 2025
[10]

Uncovering what why and how: A comprehensive benchmark for causation understanding of video anomaly

Hang Du, Sicheng Zhang, Binzhu Xie, Guoshun Nan, Jiayang Zhang, Junrui Xu, Hangyu Liu, Sicong Leng, Jiangming Liu, Hehe Fan, Dajiu Huang, Jing Feng, Linli Chen, Can Zhang, Xuhuan Li, Hao Zhang, Jian- hang Chen, Qimei Cui, and Xiaofeng Tao. Uncovering what why and how: A comprehensive benchmark for causation understanding of video anomaly. InPro- ceedings ...

work page 2024
[11]

Mixture of experts guided by gaussian splatters matters: A new approach to weakly-supervised video anomaly detection

Giacomo D’Amicantonio, Snehashis Majhi, Quan Kong, Lorenzo Garattoni, Gianpiero Francesca, Fran- cois Bremond, and Egor Bondarev. Mixture of experts guided by gaussian splatters matters: A new approach to weakly-supervised video anomaly detection. InPro- ceedings of the IEEE/CVF International Conference on Computer Vision, 2025

work page 2025
[12]

Learning temporal regularity in video sequences

Mahmudul Hasan, Jonghyun Choi, Jan Neumann, Amit K Roy-Chowdhury, and Larry S Davis. Learning temporal regularity in video sequences. InProceed- ings of the IEEE conference on computer vision and pattern recognition, pages 733–742, 2016

work page 2016
[13]

Noise-resistant video anomaly detection via rgb error-guided multiscale predictive coding and dynamic memory

Han Hu, Wenli Du, Peng Liao, Bing Wang, and Siyuan Fan. Noise-resistant video anomaly detection via rgb error-guided multiscale predictive coding and dynamic memory. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

work page 2025
[14]

Vad- r1: Towards video anomaly reasoning via perception- to-cognition chain-of-thought

Chao Huang, Benfeng Wang, Wei Wang, Jie Wen, Li Liu, Chengliang Shen, and Xiaochun Cao. Vad- r1: Towards video anomaly reasoning via perception- to-cognition chain-of-thought. InProceedings of the Conference on Neural Information Processing Systems, 2025

work page 2025
[15]

Track any anomalous object:a granular video anomaly detection pipeline

Yuzhi Huang, Chenxin Li, Haitao Zhang, Zixu Lin, Yunlong Lin, Hengyu Liu, Wuyang Li, Xinyu Liu, Jiechao Gao, Yue Huang, Xinghao Ding, and Yixuan Yuan. Track any anomalous object:a granular video anomaly detection pipeline. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, 2025

work page 2025
[16]

Cross-domain learning for video anomaly detection with limited su- pervision

Yashika Jain, Ali Dabouei, and Min Xu. Cross-domain learning for video anomaly detection with limited su- pervision. InProceedings of the IEEE/CVF European Conference on Computer Vision, 2024

work page 2024
[17]

Graph-jigsaw conditioned diffusion model for skeleton-based video anomaly detection

Ali Karami, Thi Kieu Khanh Ho, and Narges Arman- fard. Graph-jigsaw conditioned diffusion model for skeleton-based video anomaly detection. InProceed- ings of the IEEE/CVF Winter Conference on Applica- tions of Computer Vision, 2025

work page 2025
[18]

Real- time weakly supervised video anomaly detection

Hamza Karim, Keval Doshi, and Yasin Yilmaz. Real- time weakly supervised video anomaly detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024. 7

work page 2024
[19]

Anomize: Better open vocabulary video anomaly detection

Fei Li, Wenxuan Liu, Jingjing Chen, Ruixu Zhang, Yuran Wang, Xian Zhong, and Zheng Wang. Anomize: Better open vocabulary video anomaly detection. In Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, 2025

work page 2025
[20]

Anomaly detection and localization in crowded scenes

Weixin Li, Vijay Mahadevan, and Nuno Vasconcelos. Anomaly detection and localization in crowded scenes. IEEE transactions on pattern analysis and machine intelligence, 36(1):18–32, 2013

work page 2013
[21]

Vadtree: Explainable training-free video anomaly detection via hierarchical granularity- aware tree

Wenlong Li, Yifei Xu, Yuan Rao, Zhenhua Wang, and Shuiguang Deng. Vadtree: Explainable training-free video anomaly detection via hierarchical granularity- aware tree. InProceedings of the Conference on Neu- ral Information Processing Systems, 2025

work page 2025
[22]

A unified reason- ing framework for holistic zero-shot video anomaly analysis

Dongheng Lin, Mengxue Qu, Kunyang Han, Jianbo Jiao, Xiaojie Jin, and Yunchao Wei. A unified reason- ing framework for holistic zero-shot video anomaly analysis. InProceedings of the Conference on Neural Information Processing Systems, 2025

work page 2025
[23]

Abnormal event detection at 150 fps in matlab

Cewu Lu, Jianping Shi, and Jiaya Jia. Abnormal event detection at 150 fps in matlab. InProceedings of the IEEE Conference on Inernational Conference on Computer Vision, 2013

work page 2013
[24]

Anomaly detection in crowded scenes

Vijay Mahadevan, Weixin Li, Viral Bhalodia, and Nuno Vasconcelos. Anomaly detection in crowded scenes. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2010

work page 2010
[25]

Oe-ctst: Outlier-embedded cross temporal scale trans- former for weakly-supervised video anomaly detec- tion

Snehashis Majhi, Rui Dai, Quan Kong, Lorenzo Garat- toni, Gianpiero Francesca, and Francois Bremond. Oe-ctst: Outlier-embedded cross temporal scale trans- former for weakly-supervised video anomaly detec- tion. InProceedings of the IEEE/CVF Winter Confer- ence on Applications of Computer Vision, 2024

work page 2024
[26]

Just dance with pi! a poly-modal inductor for weakly- supervised video anomaly detection

Snehashis Majhi, Giacomo D’Amicantonio, Antitza Dantcheva, Quan Kong, Lorenzo Garattoni, Gianpiero Francesca, Egor Bondarev, and Francois Bremond. Just dance with pi! a poly-modal inductor for weakly- supervised video anomaly detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

work page 2025
[27]

Pv-vtt: A privacy-centric dataset for mission-specific anomaly detection and natural lan- guage interpretation

Ryozo Masuakwa, Sanggeon Yun, Yoshiki Yamaguchi, and Mohsen Imani. Pv-vtt: A privacy-centric dataset for mission-specific anomaly detection and natural lan- guage interpretation. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vi- sion, 2025

work page 2025
[28]

Mulde: Multi- scale log-density estimation via denoising score match- ing for video anomaly detection

Jakub Micorek, Horst Possegger, Dominik Narnhofer, Horst Bischof, and Mateusz Kozinski. Mulde: Multi- scale log-density estimation via denoising score match- ing for video anomaly detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

work page 2024
[29]

A2seek: To- wards reasoning-centric benchmark for aerial anomaly understanding

Mengjingcheng Mo, Xinyang Tong, Mingpi Tan, Ji- axu Leng, Jiankang Zheng, Yiran Liu, Haosheng Chen, Ji Gan, Weisheng Li, and Xinbo Gao. A2seek: To- wards reasoning-centric benchmark for aerial anomaly understanding. InProceedings of the Conference on Neural Information Processing Systems, 2025

work page 2025
[30]

Complexvad: Detecting interaction anomalies in video

Furkan Mumcu, Michael Jones, Yasin Yilmaz, and Anoop Cherian. Complexvad: Detecting interaction anomalies in video. InProceedings of the Winter Conference on Applications of Computer Vision, pages 1093–1102, 2025

work page 2025
[31]

Jones, Yasin Yilmaz, and Anoop Cherian

Furkan Mumcu, Michael J. Jones, Yasin Yilmaz, and Anoop Cherian. Leveraging multimodal llm descrip- tions of activity for explainable semi-supervised video anomaly detection.arXiv preprint arXiv:2510.14896, 2025

work page arXiv 2025
[32]

Frameshield: Adversarially robust video anomaly detection

Mojtaba Nafez, Mobina Poulaei, Nikan Vasei, Bar- dia Soltani Moakhar, Mohammad Sabokrou, and Mo- hammadHossein Rohban. Frameshield: Adversarially robust video anomaly detection. InProceedings of the Conference on Neural Information Processing Systems, 2025

work page 2025
[33]

Interleaving one-class and weakly-supervised models with adap- tive thresholding for unsupervised video anomaly de- tection

Yongwei Nie, Hao Huang, Chengjiang Long, Qing Zhang, Pradipta Maji, and Hongmin Cai. Interleaving one-class and weakly-supervised models with adap- tive thresholding for unsupervised video anomaly de- tection. InProceedings of the IEEE/CVF European Conference on Computer Vision, 2024

work page 2024
[34]

Fedvad: Enhancing federated video anomaly de- tection with gpt-driven semantic distillation

Fan Qi, Ruijie Pan, Huaiwen Zhang, and Changsheng Xu. Fedvad: Enhancing federated video anomaly de- tection with gpt-driven semantic distillation. InPro- ceedings of the IEEE/CVF European Conference on Computer Vision, 2024

work page 2024
[35]

Street scene: A new dataset and evaluation protocol for video anomaly detection

Bharathkumar Ramachandra and Michael Jones. Street scene: A new dataset and evaluation protocol for video anomaly detection. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2569–2578, 2020

work page 2020
[36]

Self-distilled masked auto-encoders are efficient video anomaly detectors

Nicolae-Catalin Ristea, Florinel-Alin Croitoru, Radu Tudor Ionescu, Marius Popescu, Fahad Shahbaz Khan, and Mubarak Shah. Self-distilled masked auto-encoders are efficient video anomaly detectors. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

work page 2024
[37]

Eventvad: Training-free event- aware video anomaly detection

Yihua Shao, Haojin He, Sijie Li, Siyu Chen, Xin- wei Long, Fanhu Zeng, Yuxuan Fan, Muyang Zhang, Ziyang Yan, Ao Ma, Xiaochen Wang, Hao Tang, Yan Wang, and Shuyan Li. Eventvad: Training-free event- aware video anomaly detection. InProceedings of the ACM International Conference on Multimedia, 2025. 8

work page 2025
[38]

Learning anomalies with normality prior for unsupervised video anomaly detection

Haoyue Shi, Le Wang, Sanping Zhou, Gang Hua, and Wei Tang. Learning anomalies with normality prior for unsupervised video anomaly detection. InProceedings of the IEEE/CVF European Conference on Computer Vision, 2024

work page 2024
[39]

Anomaly detection for people with visual impairments using an egocentric 360-degree camera

Inpyo Song, Sanghyeon Lee, Minjun Joo, and Jang- won Lee. Anomaly detection for people with visual impairments using an egocentric 360-degree camera. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2025

work page 2025
[40]

Holistic representation learning for multi- task trajectory anomaly detection

Alexandros Stergiou, Brent De Weerdt, and Nikos Deligiannis. Holistic representation learning for multi- task trajectory anomaly detection. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024

work page 2024
[41]

Real- world anomaly detection in surveillance videos

Waqas Sultani, Chen Chen, and Mubarak Shah. Real- world anomaly detection in surveillance videos. In Proceedings of the IEEE conference on computer vi- sion and pattern recognition, pages 6479–6488, 2018

work page 2018
[42]

Hawk: Learning to un- derstand open-world video anomalies

Jiaqi Tang, Hao Lu, Ruizheng Wu, Xiaogang Xu, Ke Ma, Cheng Fang, Bin Guo, Jiangbo Lu, Qifeng Chen, and Ying-Cong Chen. Hawk: Learning to un- derstand open-world video anomalies. InProceedings of the Conference on Neural Information Processing Systems, 2024

work page 2024
[43]

Open- vocabulary video anomaly detection

Peng Wu, Xuerong Zhou, Guansong Pang, Yujia Sun, Jing Liu, Peng Wang, and Yanning Zhang. Open- vocabulary video anomaly detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

work page 2024
[44]

Discrimi- native score suppression for weakly supervised video anomaly detection

Chen Xu, Chunguo Li, and Hongjie Xing. Discrimi- native score suppression for weakly supervised video anomaly detection. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vi- sion, 2025

work page 2025
[45]

Learning Deep Representations of Appearance and Motion for Anomalous Event Detection

Dan Xu, Elisa Ricci, Yan Yan, Jingkuan Song, and Nicu Sebe. Learning deep representations of appear- ance and motion for anomalous event detection.arXiv preprint arXiv:1510.01553, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[46]

Monitor: Exploiting large language models with instruction for online video anomaly detection

Shengtian Yang, Yue Feng, Yingshi Liu, Jingrou Zhang, and Jie Qin. Monitor: Exploiting large language models with instruction for online video anomaly detection. InProceedings of the Conference on Neural Information Processing Systems, 2025

work page 2025
[47]

Follow the rules: Reason- ing for video anomaly detection with large language models

Yuchen Yang, Kwonjoon Lee, Behzad Dariush, Yinzhi Cao, and Shao-Yuan Lo. Follow the rules: Reason- ing for video anomaly detection with large language models. InProceedings of the IEEE/CVF European Conference on Computer Vision, 2024

work page 2024
[48]

Zhengye Yang and Richard J. Radke. Detecting con- textual anomalies by discovering consistent spatial re- gions. InProceedings of the IEEE/CVF Winter Confer- ence on Applications of Computer Vision Workshops, 2025

work page 2025
[49]

Text prompt with normality guidance for weakly supervised video anomaly detection

Zhiwei Yang, Jing Liu, and Peng Wu. Text prompt with normality guidance for weakly supervised video anomaly detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, 2024

work page 2024
[50]

Panda: Towards generalist video anomaly detection via agentic ai engineer

Zhiwei Yang, Chen Gao, and Mike Zheng Shou. Panda: Towards generalist video anomaly detection via agentic ai engineer. InProceedings of the Con- ference on Neural Information Processing Systems, 2025

work page 2025
[51]

Vera: Explain- able video anomaly detection via verbalized learning of vision-language models

Muchao Ye, Weiyang Liu, and Pan He. Vera: Explain- able video anomaly detection via verbalized learning of vision-language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, 2025

work page 2025
[52]

Sanggeon Yun, Ryozo Masukawa, Minhyoung Na, and Mohsen Imani. Missiongnn: Hierarchical multimodal gnn-based weakly supervised video anomaly recognition with mission-specific knowledge graph generation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2025.

[53]

Luca Zanella, Willi Menapace, Massimiliano Mancini, Yiming Wang, and Elisa Ricci. Harnessing large language models for training-free video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.

[54]

Hanwen Zhang, Congqi Cao, Qinyi Lv, Lingtong Min, and Yanning Zhang. Autoregressive denoising score matching is a good video anomaly detector. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12057–12067, 2025.

[55]

Huaxin Zhang, Xiaohao Xu, Xiang Wang, Jialong Zuo, Xiaonan Huang, Changxin Gao, Shanjun Zhang, Li Yu, and Nong Sang. Holmes-vau: Towards long-term video anomaly understanding at any granularity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025.

[56]

Menghao Zhang, Jingyu Wang, Qi Qi, Haifeng Sun, Zirui Zhuang, Pengfei Ren, Ruilong Ma, and Jianxin Liao. Multi-scale video anomaly detection by multi-grained spatio-temporal representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.

[57]

Menghao Zhang, Huazheng Wang, Pengfei Ren, Kangheng Lin, Qi Qi, Haifeng Sun, Zirui Zhuang, Lei Zhang, Jianxin Liao, and Jingyu Wang. Do lvlms truly understand video anomalies? revealing hallucination via co-occurrence patterns. In Proceedings of the Conference on Neural Information Processing Systems, 2025.

Table 2. Statistics of recent papers on vid...
