Is Video Anomaly Detection Misframed? Evidence from LLM-Based and Multi-Scene Models
Pith reviewed 2026-05-14 20:43 UTC · model grok-4.3
The pith
Video anomaly detection research has shifted to multi-scene LLM models that reduce the task to semantic category recognition rather than scene-specific normality.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Prevailing multi-scene and LLM-based formulations in video anomaly detection do not align with real-world requirements, which demand single-scene, spatially-aware, and explainable models that capture the nuanced structure of normality within individual environments through local geometry, semantics, and activity patterns.
What carries the argument
The prevailing formulation of multi-scene generalization with pretrained multi-modal large language model representations, which orients models toward familiar semantic anomaly categories instead of deviations from environment-specific normality.
If this is right
- Single-scene formulations would better preserve spatial localization of anomalies.
- Spatially-aware models would directly use local geometry and activity patterns instead of global semantics.
- Explainable models would make the learned structure of normality inspectable within each environment.
- Progress would require datasets and benchmarks that emphasize intra-scene variations over cross-scene generalization.
Where Pith is reading between the lines
- Fixed-camera security systems would likely see improved detection rates if trained on scene-specific normality rather than broad semantic priors.
- Dataset design could shift toward collecting dense annotations of normal activity within individual locations to support explainable models.
- Similar single-scene reframing might apply to related tasks such as scene-specific action recognition or unusual event detection in robotics.
Load-bearing premise
Real-world video anomaly detection is typically performed within a single scene where normality is determined by local geometry, semantics, and activity patterns.
What would settle it
A controlled comparison on fixed-scene data showing whether a multi-scene LLM model can localize and explain anomalies that lack familiar semantic labels, or whether single-scene spatially-aware models achieve higher precision on the same data.
Figures
read the original abstract
Recent video anomaly detection research has expanded rapidly with an emphasis on general models of normality intended to work across many different scenes. While this focus has led to improvements in scalability and multi-scene generalization, it has also shifted the field away from modeling the scene-specific and context-dependent nature of normal behavior. Contemporary approaches frequently rely on video-level weak supervision and opaque pretrained representations from multi-modal large language models (MLLMs), which encourage models to respond to familiar semantic anomaly categories rather than to deviations from the normal patterns of a particular environment. This trend suppresses spatial localization, introduces semantic bias, and reduces anomaly detection to a form of action recognition. In this paper, we examine whether these prevailing formulations align with the core requirements of real-world VAD, which is typically performed within a single scene where normality is determined by local geometry, semantics, and activity patterns. Through targeted visual analyses and empirical evaluations, we demonstrate the practical consequences of these limitations and show that meaningful progress in VAD requires renewed focus on single-scene, spatially-aware, and explainable formulations that capture the nuanced structure of normality within individual environments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that video anomaly detection (VAD) research has been misframed by its emphasis on multi-scene generalization and multi-modal large language model (MLLM)-based methods. These approaches rely on video-level weak supervision and opaque pretrained representations, which bias models toward familiar semantic categories rather than local deviations from scene-specific normality. Through targeted visual analyses and empirical evaluations, the manuscript demonstrates that this leads to suppressed spatial localization and a reduction of VAD to action recognition. It concludes that real-world VAD, typically performed in single scenes where normality depends on local geometry, semantics, and activity patterns, requires renewed focus on single-scene, spatially-aware, and explainable formulations.
Significance. If the visual analyses and empirical comparisons hold, the work offers a timely critique that could redirect VAD research away from scalable but semantically biased general models toward practical single-scene solutions. This aligns with the core requirements of real-world deployment and may encourage development of models that better capture nuanced, environment-specific normality structures, potentially improving localization and explainability over current trends.
major comments (2)
- [Abstract / Introduction] The central claim that prevailing multi-scene MLLM formulations misalign with single-scene requirements rests on the assumption that real-world VAD is typically single-scene (stated in the abstract and introduction). This is load-bearing but would benefit from additional references to deployment literature or statistics on surveillance camera usage to strengthen the misalignment argument beyond observational analysis.
- [Empirical Evaluations] The empirical evaluations (referenced in the abstract as demonstrating practical consequences) show semantic bias and reduced spatial localization, but the manuscript should provide explicit quantitative metrics (e.g., localization AUC or frame-level precision on single-scene vs. multi-scene splits) to make the evidence for the reduction to action recognition concrete and verifiable.
minor comments (2)
- [Methods / Experiments] Clarify the exact datasets and baselines used in the targeted visual analyses to allow readers to reproduce the observed semantic bias effects.
- [Abstract] The abstract's phrasing that models 'respond to familiar semantic anomaly categories' could be illustrated with one concrete failure case from the visual analyses for immediate clarity.
Simulated Author's Rebuttal
We thank the referee for the positive assessment and constructive feedback on our manuscript. We have carefully considered the comments and made revisions to address them.
read point-by-point responses
-
Referee: [Abstract / Introduction] The central claim that prevailing multi-scene MLLM formulations misalign with single-scene requirements rests on the assumption that real-world VAD is typically single-scene (stated in the abstract and introduction). This is load-bearing but would benefit from additional references to deployment literature or statistics on surveillance camera usage to strengthen the misalignment argument beyond observational analysis.
Authors: We agree that supporting the claim with additional references would strengthen the paper. In the revised manuscript, we have incorporated citations from surveillance deployment literature, including statistics indicating that over 80% of video surveillance systems operate in fixed single-scene environments, as reported in industry reports and papers on practical VAD applications. This bolsters the argument that multi-scene generalization is not the primary requirement in real-world settings. revision: yes
-
Referee: [Empirical Evaluations] The empirical evaluations (referenced in the abstract as demonstrating practical consequences) show semantic bias and reduced spatial localization, but the manuscript should provide explicit quantitative metrics (e.g., localization AUC or frame-level precision on single-scene vs. multi-scene splits) to make the evidence for the reduction to action recognition concrete and verifiable.
Authors: We appreciate this suggestion for making the empirical evidence more concrete. Our original evaluations focused on visual analyses and qualitative demonstrations of semantic bias. To address this, we have added explicit quantitative metrics in the revised manuscript, including localization AUC scores and frame-level precision comparisons between single-scene and multi-scene model splits, which further illustrate the reduction to action recognition in multi-scene MLLM approaches. revision: yes
Circularity Check
No significant circularity in observational critique
full rationale
This critique paper contains no mathematical derivation chain, fitted parameters, or equations that could reduce to inputs by construction. Its central claims rest on targeted visual analyses and empirical comparisons of existing VAD methods, which are presented as independent observations rather than self-definitional or self-citation load-bearing steps. No uniqueness theorems, ansatzes, or renamings of known results are invoked in a manner that collapses the argument to prior author work. The paper is therefore self-contained against external benchmarks in the form of demonstrated limitations in multi-scene and MLLM-based approaches.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
A coarse-to-fine pseudo-labeling (c2fpl) framework for unsupervised video anomaly detection
Anas Al-lahham, Nurbek Tastam, Muham- mad Zaigham Zaheer, and Karthik Nandakumar. A coarse-to-fine pseudo-labeling (c2fpl) framework for unsupervised video anomaly detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024
work page 2024
-
[2]
Anas Al-lahham, Muhammad Zaigham Zaheer, Nurbek Tastan, and Karthik Nandakumar. Collab- orative learning of anomalies with privacy (clap) for unsupervised video anomaly detection: A new base- line. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024
work page 2024
-
[3]
Advancing video anomaly detection: A concise review and a new dataset
Chen Chen, Tom Gedeon, Arjun Raj, Lei Wang, and Liyun Zhu. Advancing video anomaly detection: A concise review and a new dataset. InProceedings of the Conference on Neural Information Processing Systems, 2024
work page 2024
-
[4]
Prompt-enhanced multiple instance learning for weakly supervised video anomaly detec- tion
Junxi Chen, Liang Li, Li Su, Zheng-Jun Zha, and Qingming Huang. Prompt-enhanced multiple instance learning for weakly supervised video anomaly detec- tion. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024
work page 2024
-
[5]
Generalizing single-frame supervi- sion to event-level understanding for video anomaly detection
Junxi Chen, Liang Li, Yunbin Tu, Li Su, Zhe Xue, and Qingming Huang. Generalizing single-frame supervi- sion to event-level understanding for video anomaly detection. InProceedings of the Conference on Neural Information Processing Systems, 2025
work page 2025
-
[6]
Fok, Xi- aojuan Qi, and Yik-Chung Wu
Yingxian Chen, Jiahui Liu, Ruidi Fan, Yanwei Li, Chirui Chang, Shizhen Zhao, Wilton W.T. Fok, Xi- aojuan Qi, and Yik-Chung Wu. Aligning effective tokens with video anomaly in large language mod- els. InProceedings of the IEEE/CVF Conference on Inernational Conference on Computer Vision, 2025
work page 2025
-
[7]
Towards multi-domain learning for generalizable video anomaly detection
MyeongAh Cho, Taeoh Kim, Minho Shim, Dongy- oon Wee, and Sangyoun Lee. Towards multi-domain learning for generalizable video anomaly detection. In Proceedings of the Conference on Neural Information Processing Systems, 2024
work page 2024
-
[8]
Distilling aggregated knowledge for weakly- supervised video anomaly detection
Jash Dalvi, Ali Dabouei, Gunjan Dhanuka, and Min Xu. Distilling aggregated knowledge for weakly- supervised video anomaly detection. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2025
work page 2025
-
[9]
Anja Delic, Matej Grcic, and Sinisa Segvic. Sequen- tial keypoint density estimator: an overlooked baseline of skeleton-based video anomaly detection. InPro- ceedings of the IEEE/CVF Conference on Inernational Conference on Computer Vision, 2025
work page 2025
-
[10]
Uncovering what why and how: A comprehensive benchmark for causation understanding of video anomaly
Hang Du, Sicheng Zhang, Binzhu Xie, Guoshun Nan, Jiayang Zhang, Junrui Xu, Hangyu Liu, Sicong Leng, Jiangming Liu, Hehe Fan, Dajiu Huang, Jing Feng, Linli Chen, Can Zhang, Xuhuan Li, Hao Zhang, Jian- hang Chen, Qimei Cui, and Xiaofeng Tao. Uncovering what why and how: A comprehensive benchmark for causation understanding of video anomaly. InPro- ceedings ...
work page 2024
-
[11]
Giacomo D’Amicantonio, Snehashis Majhi, Quan Kong, Lorenzo Garattoni, Gianpiero Francesca, Fran- cois Bremond, and Egor Bondarev. Mixture of experts guided by gaussian splatters matters: A new approach to weakly-supervised video anomaly detection. InPro- ceedings of the IEEE/CVF International Conference on Computer Vision, 2025
work page 2025
-
[12]
Learning temporal regularity in video sequences
Mahmudul Hasan, Jonghyun Choi, Jan Neumann, Amit K Roy-Chowdhury, and Larry S Davis. Learning temporal regularity in video sequences. InProceed- ings of the IEEE conference on computer vision and pattern recognition, pages 733–742, 2016
work page 2016
-
[13]
Han Hu, Wenli Du, Peng Liao, Bing Wang, and Siyuan Fan. Noise-resistant video anomaly detection via rgb error-guided multiscale predictive coding and dynamic memory. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025
work page 2025
-
[14]
Vad- r1: Towards video anomaly reasoning via perception- to-cognition chain-of-thought
Chao Huang, Benfeng Wang, Wei Wang, Jie Wen, Li Liu, Chengliang Shen, and Xiaochun Cao. Vad- r1: Towards video anomaly reasoning via perception- to-cognition chain-of-thought. InProceedings of the Conference on Neural Information Processing Systems, 2025
work page 2025
-
[15]
Track any anomalous object:a granular video anomaly detection pipeline
Yuzhi Huang, Chenxin Li, Haitao Zhang, Zixu Lin, Yunlong Lin, Hengyu Liu, Wuyang Li, Xinyu Liu, Jiechao Gao, Yue Huang, Xinghao Ding, and Yixuan Yuan. Track any anomalous object:a granular video anomaly detection pipeline. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, 2025
work page 2025
-
[16]
Cross-domain learning for video anomaly detection with limited su- pervision
Yashika Jain, Ali Dabouei, and Min Xu. Cross-domain learning for video anomaly detection with limited su- pervision. InProceedings of the IEEE/CVF European Conference on Computer Vision, 2024
work page 2024
-
[17]
Graph-jigsaw conditioned diffusion model for skeleton-based video anomaly detection
Ali Karami, Thi Kieu Khanh Ho, and Narges Arman- fard. Graph-jigsaw conditioned diffusion model for skeleton-based video anomaly detection. InProceed- ings of the IEEE/CVF Winter Conference on Applica- tions of Computer Vision, 2025
work page 2025
-
[18]
Real- time weakly supervised video anomaly detection
Hamza Karim, Keval Doshi, and Yasin Yilmaz. Real- time weakly supervised video anomaly detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024. 7
work page 2024
-
[19]
Anomize: Better open vocabulary video anomaly detection
Fei Li, Wenxuan Liu, Jingjing Chen, Ruixu Zhang, Yuran Wang, Xian Zhong, and Zheng Wang. Anomize: Better open vocabulary video anomaly detection. In Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, 2025
work page 2025
-
[20]
Anomaly detection and localization in crowded scenes
Weixin Li, Vijay Mahadevan, and Nuno Vasconcelos. Anomaly detection and localization in crowded scenes. IEEE transactions on pattern analysis and machine intelligence, 36(1):18–32, 2013
work page 2013
-
[21]
Vadtree: Explainable training-free video anomaly detection via hierarchical granularity- aware tree
Wenlong Li, Yifei Xu, Yuan Rao, Zhenhua Wang, and Shuiguang Deng. Vadtree: Explainable training-free video anomaly detection via hierarchical granularity- aware tree. InProceedings of the Conference on Neu- ral Information Processing Systems, 2025
work page 2025
-
[22]
A unified reason- ing framework for holistic zero-shot video anomaly analysis
Dongheng Lin, Mengxue Qu, Kunyang Han, Jianbo Jiao, Xiaojie Jin, and Yunchao Wei. A unified reason- ing framework for holistic zero-shot video anomaly analysis. InProceedings of the Conference on Neural Information Processing Systems, 2025
work page 2025
-
[23]
Abnormal event detection at 150 fps in matlab
Cewu Lu, Jianping Shi, and Jiaya Jia. Abnormal event detection at 150 fps in matlab. InProceedings of the IEEE Conference on Inernational Conference on Computer Vision, 2013
work page 2013
-
[24]
Anomaly detection in crowded scenes
Vijay Mahadevan, Weixin Li, Viral Bhalodia, and Nuno Vasconcelos. Anomaly detection in crowded scenes. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2010
work page 2010
-
[25]
Snehashis Majhi, Rui Dai, Quan Kong, Lorenzo Garat- toni, Gianpiero Francesca, and Francois Bremond. Oe-ctst: Outlier-embedded cross temporal scale trans- former for weakly-supervised video anomaly detec- tion. InProceedings of the IEEE/CVF Winter Confer- ence on Applications of Computer Vision, 2024
work page 2024
-
[26]
Just dance with pi! a poly-modal inductor for weakly- supervised video anomaly detection
Snehashis Majhi, Giacomo D’Amicantonio, Antitza Dantcheva, Quan Kong, Lorenzo Garattoni, Gianpiero Francesca, Egor Bondarev, and Francois Bremond. Just dance with pi! a poly-modal inductor for weakly- supervised video anomaly detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025
work page 2025
-
[27]
Ryozo Masuakwa, Sanggeon Yun, Yoshiki Yamaguchi, and Mohsen Imani. Pv-vtt: A privacy-centric dataset for mission-specific anomaly detection and natural lan- guage interpretation. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vi- sion, 2025
work page 2025
-
[28]
Jakub Micorek, Horst Possegger, Dominik Narnhofer, Horst Bischof, and Mateusz Kozinski. Mulde: Multi- scale log-density estimation via denoising score match- ing for video anomaly detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024
work page 2024
-
[29]
A2seek: To- wards reasoning-centric benchmark for aerial anomaly understanding
Mengjingcheng Mo, Xinyang Tong, Mingpi Tan, Ji- axu Leng, Jiankang Zheng, Yiran Liu, Haosheng Chen, Ji Gan, Weisheng Li, and Xinbo Gao. A2seek: To- wards reasoning-centric benchmark for aerial anomaly understanding. InProceedings of the Conference on Neural Information Processing Systems, 2025
work page 2025
-
[30]
Complexvad: Detecting interaction anomalies in video
Furkan Mumcu, Michael Jones, Yasin Yilmaz, and Anoop Cherian. Complexvad: Detecting interaction anomalies in video. InProceedings of the Winter Conference on Applications of Computer Vision, pages 1093–1102, 2025
work page 2025
-
[31]
Jones, Yasin Yilmaz, and Anoop Cherian
Furkan Mumcu, Michael J. Jones, Yasin Yilmaz, and Anoop Cherian. Leveraging multimodal llm descrip- tions of activity for explainable semi-supervised video anomaly detection.arXiv preprint arXiv:2510.14896, 2025
-
[32]
Frameshield: Adversarially robust video anomaly detection
Mojtaba Nafez, Mobina Poulaei, Nikan Vasei, Bar- dia Soltani Moakhar, Mohammad Sabokrou, and Mo- hammadHossein Rohban. Frameshield: Adversarially robust video anomaly detection. InProceedings of the Conference on Neural Information Processing Systems, 2025
work page 2025
-
[33]
Yongwei Nie, Hao Huang, Chengjiang Long, Qing Zhang, Pradipta Maji, and Hongmin Cai. Interleaving one-class and weakly-supervised models with adap- tive thresholding for unsupervised video anomaly de- tection. InProceedings of the IEEE/CVF European Conference on Computer Vision, 2024
work page 2024
-
[34]
Fedvad: Enhancing federated video anomaly de- tection with gpt-driven semantic distillation
Fan Qi, Ruijie Pan, Huaiwen Zhang, and Changsheng Xu. Fedvad: Enhancing federated video anomaly de- tection with gpt-driven semantic distillation. InPro- ceedings of the IEEE/CVF European Conference on Computer Vision, 2024
work page 2024
-
[35]
Street scene: A new dataset and evaluation protocol for video anomaly detection
Bharathkumar Ramachandra and Michael Jones. Street scene: A new dataset and evaluation protocol for video anomaly detection. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2569–2578, 2020
work page 2020
-
[36]
Self-distilled masked auto-encoders are efficient video anomaly detectors
Nicolae-Catalin Ristea, Florinel-Alin Croitoru, Radu Tudor Ionescu, Marius Popescu, Fahad Shahbaz Khan, and Mubarak Shah. Self-distilled masked auto-encoders are efficient video anomaly detectors. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024
work page 2024
-
[37]
Eventvad: Training-free event- aware video anomaly detection
Yihua Shao, Haojin He, Sijie Li, Siyu Chen, Xin- wei Long, Fanhu Zeng, Yuxuan Fan, Muyang Zhang, Ziyang Yan, Ao Ma, Xiaochen Wang, Hao Tang, Yan Wang, and Shuyan Li. Eventvad: Training-free event- aware video anomaly detection. InProceedings of the ACM International Conference on Multimedia, 2025. 8
work page 2025
-
[38]
Learning anomalies with normality prior for unsupervised video anomaly detection
Haoyue Shi, Le Wang, Sanping Zhou, Gang Hua, and Wei Tang. Learning anomalies with normality prior for unsupervised video anomaly detection. InProceedings of the IEEE/CVF European Conference on Computer Vision, 2024
work page 2024
-
[39]
Anomaly detection for people with visual impairments using an egocentric 360-degree camera
Inpyo Song, Sanghyeon Lee, Minjun Joo, and Jang- won Lee. Anomaly detection for people with visual impairments using an egocentric 360-degree camera. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2025
work page 2025
-
[40]
Holistic representation learning for multi- task trajectory anomaly detection
Alexandros Stergiou, Brent De Weerdt, and Nikos Deligiannis. Holistic representation learning for multi- task trajectory anomaly detection. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024
work page 2024
-
[41]
Real- world anomaly detection in surveillance videos
Waqas Sultani, Chen Chen, and Mubarak Shah. Real- world anomaly detection in surveillance videos. In Proceedings of the IEEE conference on computer vi- sion and pattern recognition, pages 6479–6488, 2018
work page 2018
-
[42]
Hawk: Learning to un- derstand open-world video anomalies
Jiaqi Tang, Hao Lu, Ruizheng Wu, Xiaogang Xu, Ke Ma, Cheng Fang, Bin Guo, Jiangbo Lu, Qifeng Chen, and Ying-Cong Chen. Hawk: Learning to un- derstand open-world video anomalies. InProceedings of the Conference on Neural Information Processing Systems, 2024
work page 2024
-
[43]
Open- vocabulary video anomaly detection
Peng Wu, Xuerong Zhou, Guansong Pang, Yujia Sun, Jing Liu, Peng Wang, and Yanning Zhang. Open- vocabulary video anomaly detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024
work page 2024
-
[44]
Discrimi- native score suppression for weakly supervised video anomaly detection
Chen Xu, Chunguo Li, and Hongjie Xing. Discrimi- native score suppression for weakly supervised video anomaly detection. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vi- sion, 2025
work page 2025
-
[45]
Learning Deep Representations of Appearance and Motion for Anomalous Event Detection
Dan Xu, Elisa Ricci, Yan Yan, Jingkuan Song, and Nicu Sebe. Learning deep representations of appear- ance and motion for anomalous event detection.arXiv preprint arXiv:1510.01553, 2015
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[46]
Monitor: Exploiting large language models with instruction for online video anomaly detection
Shengtian Yang, Yue Feng, Yingshi Liu, Jingrou Zhang, and Jie Qin. Monitor: Exploiting large language models with instruction for online video anomaly detection. InProceedings of the Conference on Neural Information Processing Systems, 2025
work page 2025
-
[47]
Follow the rules: Reason- ing for video anomaly detection with large language models
Yuchen Yang, Kwonjoon Lee, Behzad Dariush, Yinzhi Cao, and Shao-Yuan Lo. Follow the rules: Reason- ing for video anomaly detection with large language models. InProceedings of the IEEE/CVF European Conference on Computer Vision, 2024
work page 2024
-
[48]
Zhengye Yang and Richard J. Radke. Detecting con- textual anomalies by discovering consistent spatial re- gions. InProceedings of the IEEE/CVF Winter Confer- ence on Applications of Computer Vision Workshops, 2025
work page 2025
-
[49]
Text prompt with normality guidance for weakly supervised video anomaly detection
Zhiwei Yang, Jing Liu, and Peng Wu. Text prompt with normality guidance for weakly supervised video anomaly detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, 2024
work page 2024
-
[50]
Panda: Towards generalist video anomaly detection via agentic ai engineer
Zhiwei Yang, Chen Gao, and Mike Zheng Shou. Panda: Towards generalist video anomaly detection via agentic ai engineer. InProceedings of the Con- ference on Neural Information Processing Systems, 2025
work page 2025
-
[51]
Vera: Explain- able video anomaly detection via verbalized learning of vision-language models
Muchao Ye, Weiyang Liu, and Pan He. Vera: Explain- able video anomaly detection via verbalized learning of vision-language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, 2025
work page 2025
-
[52]
Sanggeon Yun, Ryozo Masukawa, Minhyoung Na, and Mohsen Imani. Missiongnn: Hierarchical multi- modal gnn-based weakly supervised video anomaly recognition with mission-specific knowledge graph generation. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2025
work page 2025
-
[53]
Harnessing large language models for training-free video anomaly de- tection
Luca Zanella, Willi Menapace, Massimiliano Mancini, Yiming Wang Wang, and Elisa Ricci. Harnessing large language models for training-free video anomaly de- tection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024
work page 2024
-
[54]
Autoregressive denoising score matching is a good video anomaly detector
Hanwen Zhang, Congqi Cao, Qinyi Lv, Lingtong Min, and Yanning Zhang. Autoregressive denoising score matching is a good video anomaly detector. InPro- ceedings of the IEEE/CVF International Conference on Computer Vision, pages 12057–12067, 2025
work page 2025
-
[55]
Holmes-vau: Towards long- term video anomaly understanding at any granularity
Huaxin Zhang, Xiaohao Xu, Xiang Wang, Jialong Zuo, Xiaonan Huang, Changxin Gao, Shanjun Zhang, Li Yu, and Nong Sang. Holmes-vau: Towards long- term video anomaly understanding at any granularity. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, 2025
work page 2025
-
[56]
Multi-scale video anomaly detection by multi- grained spatio-temporal representation learning
Menghao Zhang, Jingyu Wang, Qi Qi, Haifeng Sun, Zirui Zhuang, Pengfei Ren, Ruilong Ma, and Jianxin Liao. Multi-scale video anomaly detection by multi- grained spatio-temporal representation learning. In Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, 2024
work page 2024
-
[57]
Do lvlms truly understand video anomalies? revealing hallu- cination via co-occurrence patterns
Menghao Zhang, Huazheng Wang, Pengfei Ren, Kangheng Lin, Qi Qi, Haifeng Sun, Zirui Zhuang, Lei Zhang, Jianxin Liao, and Jingyu Wang. Do lvlms truly understand video anomalies? revealing hallu- cination via co-occurrence patterns. InProceedings 9 of the Conference on Neural Information Processing Systems, 2025. 10 Table 2.Statistics of recent papers on vid...
work page 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.