pith. sign in

arxiv: 2605.12725 · v1 · pith:USFK5FKNnew · submitted 2026-05-12 · 💻 cs.CV

Is Video Anomaly Detection Misframed? Evidence from LLM-Based and Multi-Scene Models

Pith reviewed 2026-05-14 20:43 UTC · model grok-4.3

classification 💻 cs.CV
keywords video anomaly detectionmulti-scene modelslarge language modelssingle-scene analysisspatial localizationscene-specific normalityweak supervision
0
0 comments X

The pith

Video anomaly detection research has shifted to multi-scene LLM models that reduce the task to semantic category recognition rather than scene-specific normality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that recent video anomaly detection work emphasizes general models meant to handle many scenes at once, often built on pretrained multi-modal large language models and video-level weak labels. This direction has pulled attention away from the scene-specific and context-dependent character of normal behavior that defines practical anomaly detection. Current approaches tend to detect familiar semantic anomaly types instead of local deviations in geometry, semantics, and activity within one environment, which suppresses spatial localization and turns the problem into a form of action recognition. A reader would care because real deployments, such as fixed-camera surveillance, operate in single scenes where normality must be learned from the particular setting rather than from cross-scene semantics.

Core claim

Prevailing multi-scene and LLM-based formulations in video anomaly detection do not align with real-world requirements, which demand single-scene, spatially-aware, and explainable models that capture the nuanced structure of normality within individual environments through local geometry, semantics, and activity patterns.

What carries the argument

The prevailing formulation of multi-scene generalization with pretrained multi-modal large language model representations, which orients models toward familiar semantic anomaly categories instead of deviations from environment-specific normality.

If this is right

  • Single-scene formulations would better preserve spatial localization of anomalies.
  • Spatially-aware models would directly use local geometry and activity patterns instead of global semantics.
  • Explainable models would make the learned structure of normality inspectable within each environment.
  • Progress would require datasets and benchmarks that emphasize intra-scene variations over cross-scene generalization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Fixed-camera security systems would likely see improved detection rates if trained on scene-specific normality rather than broad semantic priors.
  • Dataset design could shift toward collecting dense annotations of normal activity within individual locations to support explainable models.
  • Similar single-scene reframing might apply to related tasks such as scene-specific action recognition or unusual event detection in robotics.

Load-bearing premise

Real-world video anomaly detection is typically performed within a single scene where normality is determined by local geometry, semantics, and activity patterns.

What would settle it

A controlled comparison on fixed-scene data showing whether a multi-scene LLM model can localize and explain anomalies that lack familiar semantic labels, or whether single-scene spatially-aware models achieve higher precision on the same data.

Figures

Figures reproduced from arXiv: 2605.12725 by Anoop Cherian, Furkan Mumcu, Michael J. Jones, Yasin Yilmaz.

Figure 1
Figure 1. Figure 1: These images illustrate a core limitation of current video anomaly detection approaches. In the first image, the fighting occurs inside the boxing ring, a context where such an action is normal. However, recent models trained primarily with weak supervision or relying on an LLM’s built-in notion of normality tend to flag this as anomalous because they focus on the high￾level action category rather than the… view at source ↗
Figure 2
Figure 2. Figure 2: Coverage of key video anomaly detection (VAD) attributes across recent venues. Numbers in parentheses next to each venue indicate the total number of VAD papers at that venue. Location-specific model* denotes methods that explicitly model spatially conditioned normality; no surveyed papers satisfy this criterion, resulting in zero observed coverage. 3. Limitations of Current VAD Paradigms The recent shift … view at source ↗
read the original abstract

Recent video anomaly detection research has expanded rapidly with an emphasis on general models of normality intended to work across many different scenes. While this focus has led to improvements in scalability and multi-scene generalization, it has also shifted the field away from modeling the scene-specific and context-dependent nature of normal behavior. Contemporary approaches frequently rely on video-level weak supervision and opaque pretrained representations from multi-modal large language models (MLLMs), which encourage models to respond to familiar semantic anomaly categories rather than to deviations from the normal patterns of a particular environment. This trend suppresses spatial localization, introduces semantic bias, and reduces anomaly detection to a form of action recognition. In this paper, we examine whether these prevailing formulations align with the core requirements of real-world VAD, which is typically performed within a single scene where normality is determined by local geometry, semantics, and activity patterns. Through targeted visual analyses and empirical evaluations, we demonstrate the practical consequences of these limitations and show that meaningful progress in VAD requires renewed focus on single-scene, spatially-aware, and explainable formulations that capture the nuanced structure of normality within individual environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that video anomaly detection (VAD) research has been misframed by its emphasis on multi-scene generalization and multi-modal large language model (MLLM)-based methods. These approaches rely on video-level weak supervision and opaque pretrained representations, which bias models toward familiar semantic categories rather than local deviations from scene-specific normality. Through targeted visual analyses and empirical evaluations, the manuscript demonstrates that this leads to suppressed spatial localization and a reduction of VAD to action recognition. It concludes that real-world VAD, typically performed in single scenes where normality depends on local geometry, semantics, and activity patterns, requires renewed focus on single-scene, spatially-aware, and explainable formulations.

Significance. If the visual analyses and empirical comparisons hold, the work offers a timely critique that could redirect VAD research away from scalable but semantically biased general models toward practical single-scene solutions. This aligns with the core requirements of real-world deployment and may encourage development of models that better capture nuanced, environment-specific normality structures, potentially improving localization and explainability over current trends.

major comments (2)
  1. [Abstract / Introduction] The central claim that prevailing multi-scene MLLM formulations misalign with single-scene requirements rests on the assumption that real-world VAD is typically single-scene (stated in the abstract and introduction). This is load-bearing but would benefit from additional references to deployment literature or statistics on surveillance camera usage to strengthen the misalignment argument beyond observational analysis.
  2. [Empirical Evaluations] The empirical evaluations (referenced in the abstract as demonstrating practical consequences) show semantic bias and reduced spatial localization, but the manuscript should provide explicit quantitative metrics (e.g., localization AUC or frame-level precision on single-scene vs. multi-scene splits) to make the evidence for the reduction to action recognition concrete and verifiable.
minor comments (2)
  1. [Methods / Experiments] Clarify the exact datasets and baselines used in the targeted visual analyses to allow readers to reproduce the observed semantic bias effects.
  2. [Abstract] The abstract's phrasing that models 'respond to familiar semantic anomaly categories' could be illustrated with one concrete failure case from the visual analyses for immediate clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and constructive feedback on our manuscript. We have carefully considered the comments and made revisions to address them.

read point-by-point responses
  1. Referee: [Abstract / Introduction] The central claim that prevailing multi-scene MLLM formulations misalign with single-scene requirements rests on the assumption that real-world VAD is typically single-scene (stated in the abstract and introduction). This is load-bearing but would benefit from additional references to deployment literature or statistics on surveillance camera usage to strengthen the misalignment argument beyond observational analysis.

    Authors: We agree that supporting the claim with additional references would strengthen the paper. In the revised manuscript, we have incorporated citations from surveillance deployment literature, including statistics indicating that over 80% of video surveillance systems operate in fixed single-scene environments, as reported in industry reports and papers on practical VAD applications. This bolsters the argument that multi-scene generalization is not the primary requirement in real-world settings. revision: yes

  2. Referee: [Empirical Evaluations] The empirical evaluations (referenced in the abstract as demonstrating practical consequences) show semantic bias and reduced spatial localization, but the manuscript should provide explicit quantitative metrics (e.g., localization AUC or frame-level precision on single-scene vs. multi-scene splits) to make the evidence for the reduction to action recognition concrete and verifiable.

    Authors: We appreciate this suggestion for making the empirical evidence more concrete. Our original evaluations focused on visual analyses and qualitative demonstrations of semantic bias. To address this, we have added explicit quantitative metrics in the revised manuscript, including localization AUC scores and frame-level precision comparisons between single-scene and multi-scene model splits, which further illustrate the reduction to action recognition in multi-scene MLLM approaches. revision: yes

Circularity Check

0 steps flagged

No significant circularity in observational critique

full rationale

This critique paper contains no mathematical derivation chain, fitted parameters, or equations that could reduce to inputs by construction. Its central claims rest on targeted visual analyses and empirical comparisons of existing VAD methods, which are presented as independent observations rather than self-definitional or self-citation load-bearing steps. No uniqueness theorems, ansatzes, or renamings of known results are invoked in a manner that collapses the argument to prior author work. The paper is therefore self-contained against external benchmarks in the form of demonstrated limitations in multi-scene and MLLM-based approaches.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is a position and analysis piece with no mathematical model, fitted parameters, or new postulated entities.

pith-pipeline@v0.9.0 · 5507 in / 1063 out tokens · 25194 ms · 2026-05-14T20:43:59.215636+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · 1 internal anchor

  1. [1]

    A coarse-to-fine pseudo-labeling (c2fpl) framework for unsupervised video anomaly detection

    Anas Al-lahham, Nurbek Tastam, Muham- mad Zaigham Zaheer, and Karthik Nandakumar. A coarse-to-fine pseudo-labeling (c2fpl) framework for unsupervised video anomaly detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024

  2. [2]

    Collab- orative learning of anomalies with privacy (clap) for unsupervised video anomaly detection: A new base- line

    Anas Al-lahham, Muhammad Zaigham Zaheer, Nurbek Tastan, and Karthik Nandakumar. Collab- orative learning of anomalies with privacy (clap) for unsupervised video anomaly detection: A new base- line. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  3. [3]

    Advancing video anomaly detection: A concise review and a new dataset

    Chen Chen, Tom Gedeon, Arjun Raj, Lei Wang, and Liyun Zhu. Advancing video anomaly detection: A concise review and a new dataset. InProceedings of the Conference on Neural Information Processing Systems, 2024

  4. [4]

    Prompt-enhanced multiple instance learning for weakly supervised video anomaly detec- tion

    Junxi Chen, Liang Li, Li Su, Zheng-Jun Zha, and Qingming Huang. Prompt-enhanced multiple instance learning for weakly supervised video anomaly detec- tion. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  5. [5]

    Generalizing single-frame supervi- sion to event-level understanding for video anomaly detection

    Junxi Chen, Liang Li, Yunbin Tu, Li Su, Zhe Xue, and Qingming Huang. Generalizing single-frame supervi- sion to event-level understanding for video anomaly detection. InProceedings of the Conference on Neural Information Processing Systems, 2025

  6. [6]

    Fok, Xi- aojuan Qi, and Yik-Chung Wu

    Yingxian Chen, Jiahui Liu, Ruidi Fan, Yanwei Li, Chirui Chang, Shizhen Zhao, Wilton W.T. Fok, Xi- aojuan Qi, and Yik-Chung Wu. Aligning effective tokens with video anomaly in large language mod- els. InProceedings of the IEEE/CVF Conference on Inernational Conference on Computer Vision, 2025

  7. [7]

    Towards multi-domain learning for generalizable video anomaly detection

    MyeongAh Cho, Taeoh Kim, Minho Shim, Dongy- oon Wee, and Sangyoun Lee. Towards multi-domain learning for generalizable video anomaly detection. In Proceedings of the Conference on Neural Information Processing Systems, 2024

  8. [8]

    Distilling aggregated knowledge for weakly- supervised video anomaly detection

    Jash Dalvi, Ali Dabouei, Gunjan Dhanuka, and Min Xu. Distilling aggregated knowledge for weakly- supervised video anomaly detection. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2025

  9. [9]

    Sequen- tial keypoint density estimator: an overlooked baseline of skeleton-based video anomaly detection

    Anja Delic, Matej Grcic, and Sinisa Segvic. Sequen- tial keypoint density estimator: an overlooked baseline of skeleton-based video anomaly detection. InPro- ceedings of the IEEE/CVF Conference on Inernational Conference on Computer Vision, 2025

  10. [10]

    Uncovering what why and how: A comprehensive benchmark for causation understanding of video anomaly

    Hang Du, Sicheng Zhang, Binzhu Xie, Guoshun Nan, Jiayang Zhang, Junrui Xu, Hangyu Liu, Sicong Leng, Jiangming Liu, Hehe Fan, Dajiu Huang, Jing Feng, Linli Chen, Can Zhang, Xuhuan Li, Hao Zhang, Jian- hang Chen, Qimei Cui, and Xiaofeng Tao. Uncovering what why and how: A comprehensive benchmark for causation understanding of video anomaly. InPro- ceedings ...

  11. [11]

    Mixture of experts guided by gaussian splatters matters: A new approach to weakly-supervised video anomaly detection

    Giacomo D’Amicantonio, Snehashis Majhi, Quan Kong, Lorenzo Garattoni, Gianpiero Francesca, Fran- cois Bremond, and Egor Bondarev. Mixture of experts guided by gaussian splatters matters: A new approach to weakly-supervised video anomaly detection. InPro- ceedings of the IEEE/CVF International Conference on Computer Vision, 2025

  12. [12]

    Learning temporal regularity in video sequences

    Mahmudul Hasan, Jonghyun Choi, Jan Neumann, Amit K Roy-Chowdhury, and Larry S Davis. Learning temporal regularity in video sequences. InProceed- ings of the IEEE conference on computer vision and pattern recognition, pages 733–742, 2016

  13. [13]

    Noise-resistant video anomaly detection via rgb error-guided multiscale predictive coding and dynamic memory

    Han Hu, Wenli Du, Peng Liao, Bing Wang, and Siyuan Fan. Noise-resistant video anomaly detection via rgb error-guided multiscale predictive coding and dynamic memory. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

  14. [14]

    Vad- r1: Towards video anomaly reasoning via perception- to-cognition chain-of-thought

    Chao Huang, Benfeng Wang, Wei Wang, Jie Wen, Li Liu, Chengliang Shen, and Xiaochun Cao. Vad- r1: Towards video anomaly reasoning via perception- to-cognition chain-of-thought. InProceedings of the Conference on Neural Information Processing Systems, 2025

  15. [15]

    Track any anomalous object:a granular video anomaly detection pipeline

    Yuzhi Huang, Chenxin Li, Haitao Zhang, Zixu Lin, Yunlong Lin, Hengyu Liu, Wuyang Li, Xinyu Liu, Jiechao Gao, Yue Huang, Xinghao Ding, and Yixuan Yuan. Track any anomalous object:a granular video anomaly detection pipeline. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, 2025

  16. [16]

    Cross-domain learning for video anomaly detection with limited su- pervision

    Yashika Jain, Ali Dabouei, and Min Xu. Cross-domain learning for video anomaly detection with limited su- pervision. InProceedings of the IEEE/CVF European Conference on Computer Vision, 2024

  17. [17]

    Graph-jigsaw conditioned diffusion model for skeleton-based video anomaly detection

    Ali Karami, Thi Kieu Khanh Ho, and Narges Arman- fard. Graph-jigsaw conditioned diffusion model for skeleton-based video anomaly detection. InProceed- ings of the IEEE/CVF Winter Conference on Applica- tions of Computer Vision, 2025

  18. [18]

    Real- time weakly supervised video anomaly detection

    Hamza Karim, Keval Doshi, and Yasin Yilmaz. Real- time weakly supervised video anomaly detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024. 7

  19. [19]

    Anomize: Better open vocabulary video anomaly detection

    Fei Li, Wenxuan Liu, Jingjing Chen, Ruixu Zhang, Yuran Wang, Xian Zhong, and Zheng Wang. Anomize: Better open vocabulary video anomaly detection. In Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, 2025

  20. [20]

    Anomaly detection and localization in crowded scenes

    Weixin Li, Vijay Mahadevan, and Nuno Vasconcelos. Anomaly detection and localization in crowded scenes. IEEE transactions on pattern analysis and machine intelligence, 36(1):18–32, 2013

  21. [21]

    Vadtree: Explainable training-free video anomaly detection via hierarchical granularity- aware tree

    Wenlong Li, Yifei Xu, Yuan Rao, Zhenhua Wang, and Shuiguang Deng. Vadtree: Explainable training-free video anomaly detection via hierarchical granularity- aware tree. InProceedings of the Conference on Neu- ral Information Processing Systems, 2025

  22. [22]

    A unified reason- ing framework for holistic zero-shot video anomaly analysis

    Dongheng Lin, Mengxue Qu, Kunyang Han, Jianbo Jiao, Xiaojie Jin, and Yunchao Wei. A unified reason- ing framework for holistic zero-shot video anomaly analysis. InProceedings of the Conference on Neural Information Processing Systems, 2025

  23. [23]

    Abnormal event detection at 150 fps in matlab

    Cewu Lu, Jianping Shi, and Jiaya Jia. Abnormal event detection at 150 fps in matlab. InProceedings of the IEEE Conference on Inernational Conference on Computer Vision, 2013

  24. [24]

    Anomaly detection in crowded scenes

    Vijay Mahadevan, Weixin Li, Viral Bhalodia, and Nuno Vasconcelos. Anomaly detection in crowded scenes. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2010

  25. [25]

    Oe-ctst: Outlier-embedded cross temporal scale trans- former for weakly-supervised video anomaly detec- tion

    Snehashis Majhi, Rui Dai, Quan Kong, Lorenzo Garat- toni, Gianpiero Francesca, and Francois Bremond. Oe-ctst: Outlier-embedded cross temporal scale trans- former for weakly-supervised video anomaly detec- tion. InProceedings of the IEEE/CVF Winter Confer- ence on Applications of Computer Vision, 2024

  26. [26]

    Just dance with pi! a poly-modal inductor for weakly- supervised video anomaly detection

    Snehashis Majhi, Giacomo D’Amicantonio, Antitza Dantcheva, Quan Kong, Lorenzo Garattoni, Gianpiero Francesca, Egor Bondarev, and Francois Bremond. Just dance with pi! a poly-modal inductor for weakly- supervised video anomaly detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

  27. [27]

    Pv-vtt: A privacy-centric dataset for mission-specific anomaly detection and natural lan- guage interpretation

    Ryozo Masuakwa, Sanggeon Yun, Yoshiki Yamaguchi, and Mohsen Imani. Pv-vtt: A privacy-centric dataset for mission-specific anomaly detection and natural lan- guage interpretation. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vi- sion, 2025

  28. [28]

    Mulde: Multi- scale log-density estimation via denoising score match- ing for video anomaly detection

    Jakub Micorek, Horst Possegger, Dominik Narnhofer, Horst Bischof, and Mateusz Kozinski. Mulde: Multi- scale log-density estimation via denoising score match- ing for video anomaly detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  29. [29]

    A2seek: To- wards reasoning-centric benchmark for aerial anomaly understanding

    Mengjingcheng Mo, Xinyang Tong, Mingpi Tan, Ji- axu Leng, Jiankang Zheng, Yiran Liu, Haosheng Chen, Ji Gan, Weisheng Li, and Xinbo Gao. A2seek: To- wards reasoning-centric benchmark for aerial anomaly understanding. InProceedings of the Conference on Neural Information Processing Systems, 2025

  30. [30]

    Complexvad: Detecting interaction anomalies in video

    Furkan Mumcu, Michael Jones, Yasin Yilmaz, and Anoop Cherian. Complexvad: Detecting interaction anomalies in video. InProceedings of the Winter Conference on Applications of Computer Vision, pages 1093–1102, 2025

  31. [31]

    Jones, Yasin Yilmaz, and Anoop Cherian

    Furkan Mumcu, Michael J. Jones, Yasin Yilmaz, and Anoop Cherian. Leveraging multimodal llm descrip- tions of activity for explainable semi-supervised video anomaly detection.arXiv preprint arXiv:2510.14896, 2025

  32. [32]

    Frameshield: Adversarially robust video anomaly detection

    Mojtaba Nafez, Mobina Poulaei, Nikan Vasei, Bar- dia Soltani Moakhar, Mohammad Sabokrou, and Mo- hammadHossein Rohban. Frameshield: Adversarially robust video anomaly detection. InProceedings of the Conference on Neural Information Processing Systems, 2025

  33. [33]

    Interleaving one-class and weakly-supervised models with adap- tive thresholding for unsupervised video anomaly de- tection

    Yongwei Nie, Hao Huang, Chengjiang Long, Qing Zhang, Pradipta Maji, and Hongmin Cai. Interleaving one-class and weakly-supervised models with adap- tive thresholding for unsupervised video anomaly de- tection. InProceedings of the IEEE/CVF European Conference on Computer Vision, 2024

  34. [34]

    Fedvad: Enhancing federated video anomaly de- tection with gpt-driven semantic distillation

    Fan Qi, Ruijie Pan, Huaiwen Zhang, and Changsheng Xu. Fedvad: Enhancing federated video anomaly de- tection with gpt-driven semantic distillation. InPro- ceedings of the IEEE/CVF European Conference on Computer Vision, 2024

  35. [35]

    Street scene: A new dataset and evaluation protocol for video anomaly detection

    Bharathkumar Ramachandra and Michael Jones. Street scene: A new dataset and evaluation protocol for video anomaly detection. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2569–2578, 2020

  36. [36]

    Self-distilled masked auto-encoders are efficient video anomaly detectors

    Nicolae-Catalin Ristea, Florinel-Alin Croitoru, Radu Tudor Ionescu, Marius Popescu, Fahad Shahbaz Khan, and Mubarak Shah. Self-distilled masked auto-encoders are efficient video anomaly detectors. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  37. [37]

    Eventvad: Training-free event- aware video anomaly detection

    Yihua Shao, Haojin He, Sijie Li, Siyu Chen, Xin- wei Long, Fanhu Zeng, Yuxuan Fan, Muyang Zhang, Ziyang Yan, Ao Ma, Xiaochen Wang, Hao Tang, Yan Wang, and Shuyan Li. Eventvad: Training-free event- aware video anomaly detection. InProceedings of the ACM International Conference on Multimedia, 2025. 8

  38. [38]

    Learning anomalies with normality prior for unsupervised video anomaly detection

    Haoyue Shi, Le Wang, Sanping Zhou, Gang Hua, and Wei Tang. Learning anomalies with normality prior for unsupervised video anomaly detection. InProceedings of the IEEE/CVF European Conference on Computer Vision, 2024

  39. [39]

    Anomaly detection for people with visual impairments using an egocentric 360-degree camera

    Inpyo Song, Sanghyeon Lee, Minjun Joo, and Jang- won Lee. Anomaly detection for people with visual impairments using an egocentric 360-degree camera. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2025

  40. [40]

    Holistic representation learning for multi- task trajectory anomaly detection

    Alexandros Stergiou, Brent De Weerdt, and Nikos Deligiannis. Holistic representation learning for multi- task trajectory anomaly detection. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024

  41. [41]

    Real- world anomaly detection in surveillance videos

    Waqas Sultani, Chen Chen, and Mubarak Shah. Real- world anomaly detection in surveillance videos. In Proceedings of the IEEE conference on computer vi- sion and pattern recognition, pages 6479–6488, 2018

  42. [42]

    Hawk: Learning to un- derstand open-world video anomalies

    Jiaqi Tang, Hao Lu, Ruizheng Wu, Xiaogang Xu, Ke Ma, Cheng Fang, Bin Guo, Jiangbo Lu, Qifeng Chen, and Ying-Cong Chen. Hawk: Learning to un- derstand open-world video anomalies. InProceedings of the Conference on Neural Information Processing Systems, 2024

  43. [43]

    Open- vocabulary video anomaly detection

    Peng Wu, Xuerong Zhou, Guansong Pang, Yujia Sun, Jing Liu, Peng Wang, and Yanning Zhang. Open- vocabulary video anomaly detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  44. [44]

    Discrimi- native score suppression for weakly supervised video anomaly detection

    Chen Xu, Chunguo Li, and Hongjie Xing. Discrimi- native score suppression for weakly supervised video anomaly detection. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vi- sion, 2025

  45. [45]

    Learning Deep Representations of Appearance and Motion for Anomalous Event Detection

    Dan Xu, Elisa Ricci, Yan Yan, Jingkuan Song, and Nicu Sebe. Learning deep representations of appear- ance and motion for anomalous event detection.arXiv preprint arXiv:1510.01553, 2015

  46. [46]

    Monitor: Exploiting large language models with instruction for online video anomaly detection

    Shengtian Yang, Yue Feng, Yingshi Liu, Jingrou Zhang, and Jie Qin. Monitor: Exploiting large language models with instruction for online video anomaly detection. InProceedings of the Conference on Neural Information Processing Systems, 2025

  47. [47]

    Follow the rules: Reason- ing for video anomaly detection with large language models

    Yuchen Yang, Kwonjoon Lee, Behzad Dariush, Yinzhi Cao, and Shao-Yuan Lo. Follow the rules: Reason- ing for video anomaly detection with large language models. InProceedings of the IEEE/CVF European Conference on Computer Vision, 2024

  48. [48]

    Zhengye Yang and Richard J. Radke. Detecting con- textual anomalies by discovering consistent spatial re- gions. InProceedings of the IEEE/CVF Winter Confer- ence on Applications of Computer Vision Workshops, 2025

  49. [49]

    Text prompt with normality guidance for weakly supervised video anomaly detection

    Zhiwei Yang, Jing Liu, and Peng Wu. Text prompt with normality guidance for weakly supervised video anomaly detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, 2024

  50. [50]

    Panda: Towards generalist video anomaly detection via agentic ai engineer

    Zhiwei Yang, Chen Gao, and Mike Zheng Shou. Panda: Towards generalist video anomaly detection via agentic ai engineer. InProceedings of the Con- ference on Neural Information Processing Systems, 2025

  51. [51]

    Vera: Explain- able video anomaly detection via verbalized learning of vision-language models

    Muchao Ye, Weiyang Liu, and Pan He. Vera: Explain- able video anomaly detection via verbalized learning of vision-language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, 2025

  52. [52]

    Missiongnn: Hierarchical multi- modal gnn-based weakly supervised video anomaly recognition with mission-specific knowledge graph generation

    Sanggeon Yun, Ryozo Masukawa, Minhyoung Na, and Mohsen Imani. Missiongnn: Hierarchical multi- modal gnn-based weakly supervised video anomaly recognition with mission-specific knowledge graph generation. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2025

  53. [53]

    Harnessing large language models for training-free video anomaly de- tection

    Luca Zanella, Willi Menapace, Massimiliano Mancini, Yiming Wang Wang, and Elisa Ricci. Harnessing large language models for training-free video anomaly de- tection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  54. [54]

    Autoregressive denoising score matching is a good video anomaly detector

    Hanwen Zhang, Congqi Cao, Qinyi Lv, Lingtong Min, and Yanning Zhang. Autoregressive denoising score matching is a good video anomaly detector. InPro- ceedings of the IEEE/CVF International Conference on Computer Vision, pages 12057–12067, 2025

  55. [55]

    Holmes-vau: Towards long- term video anomaly understanding at any granularity

    Huaxin Zhang, Xiaohao Xu, Xiang Wang, Jialong Zuo, Xiaonan Huang, Changxin Gao, Shanjun Zhang, Li Yu, and Nong Sang. Holmes-vau: Towards long- term video anomaly understanding at any granularity. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, 2025

  56. [56]

    Multi-scale video anomaly detection by multi- grained spatio-temporal representation learning

    Menghao Zhang, Jingyu Wang, Qi Qi, Haifeng Sun, Zirui Zhuang, Pengfei Ren, Ruilong Ma, and Jianxin Liao. Multi-scale video anomaly detection by multi- grained spatio-temporal representation learning. In Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, 2024

  57. [57]

    Do lvlms truly understand video anomalies? revealing hallu- cination via co-occurrence patterns

    Menghao Zhang, Huazheng Wang, Pengfei Ren, Kangheng Lin, Qi Qi, Haifeng Sun, Zirui Zhuang, Lei Zhang, Jianxin Liao, and Jingyu Wang. Do lvlms truly understand video anomalies? revealing hallu- cination via co-occurrence patterns. InProceedings 9 of the Conference on Neural Information Processing Systems, 2025. 10 Table 2.Statistics of recent papers on vid...