Pistachio: Towards Synthetic, Balanced, and Long-Form Video Anomaly Benchmarks

Fei Wang; Hongyi Cai; Jie Li; Mingkang Dong; Muxin Pu; Shan You; Tao Huang

arXiv: 2511.19474 · v5 · submitted 2025-11-22 · 💻 cs.CV · cs.AI · cs.MM

Jie Li, Hongyi Cai, Mingkang Dong, Muxin Pu, Shan You, Fei Wang, Tao Huang

Pith reviewed 2026-05-17 06:25 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.MM

keywords video anomaly detection · video anomaly understanding · synthetic benchmarks · video generation · long-form video · benchmark dataset · anomaly types

The pith

A controlled video generation pipeline produces balanced, diverse long-form anomaly benchmarks without internet data biases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that recent video generation models can be used to build a new benchmark for video anomaly detection and understanding that fixes the scene bias, imbalance, and short duration problems of existing real-world datasets. Current benchmarks make it hard to test methods reliably because they come from uncontrolled internet sources and require costly manual labels for deeper causal reasoning tasks. Pistachio instead runs a pipeline of scene-conditioned anomaly assignment followed by multi-step storyline generation and temporally consistent synthesis to output coherent 41-second videos with exact control over events. If this works, researchers gain a scalable way to evaluate methods on complex, multi-event scenarios and can focus development on handling dynamic temporal narratives.

Core claim

Pistachio is a new VAD and VAU benchmark built entirely through a controlled generation-based pipeline. The pipeline combines scene-conditioned anomaly assignment, multi-step storyline generation, and temporally consistent long-form synthesis to create coherent 41-second videos. This gives precise control over scenes, anomaly types, and temporal narratives, removing the biases and limitations of Internet-collected datasets while demonstrating scale, diversity, and complexity that expose new challenges for existing methods.

What carries the argument

The controlled generation-based pipeline integrating scene-conditioned anomaly assignment, multi-step storyline generation, and temporally consistent long-form synthesis to produce 41-second videos.
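
The three stages named above can be pictured with a minimal sketch. Everything in it is hypothetical: the function names, the scene catalog, and the equal-length storyline steps are assumptions made for illustration, not the authors' implementation, and the synthesis stage is a placeholder that only prints segment descriptions. The only value taken from the paper is the 41-second target length.

```python
from dataclasses import dataclass
import random

# All names below are hypothetical; the paper's real interfaces are not shown on this page.
VIDEO_SECONDS = 41.0  # target video length stated in the paper


@dataclass
class StoryStep:
    start: float        # seconds from the start of the video
    end: float
    description: str
    is_anomalous: bool


def assign_anomaly(scene: str, catalog: dict[str, list[str]], rng: random.Random) -> str:
    """Stage 1: pick an anomaly type that is listed as plausible for the scene."""
    return rng.choice(catalog[scene])


def build_storyline(scene: str, anomaly: str, n_steps: int, anomaly_step: int) -> list[StoryStep]:
    """Stage 2: split the video into equal steps (an assumption) and mark one as anomalous."""
    step_len = VIDEO_SECONDS / n_steps
    steps = []
    for i in range(n_steps):
        abnormal = i == anomaly_step
        text = f"{anomaly} in {scene}" if abnormal else f"normal activity in {scene}"
        steps.append(StoryStep(i * step_len, (i + 1) * step_len, text, abnormal))
    return steps


def synthesize(steps: list[StoryStep]) -> list[str]:
    """Stage 3 placeholder: a real system would call a video generator per step,
    conditioning each segment on the previous one to keep the video consistent."""
    return [f"[{s.start:.1f}s-{s.end:.1f}s] {s.description}" for s in steps]


# Illustrative catalog; these scene and anomaly names are made up.
catalog = {"parking lot": ["vehicle collision", "theft"]}
rng = random.Random(0)
anomaly = assign_anomaly("parking lot", catalog, rng)
storyline = build_storyline("parking lot", anomaly, n_steps=4, anomaly_step=2)
for line in synthesize(storyline):
    print(line)
```

One consequence this sketch makes visible: because each step carries its own start and end time, temporal anomaly labels fall out of the storyline itself rather than requiring manual annotation.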

If this is right

Existing VAD and VAU methods encounter new performance challenges on long-form videos with balanced and diverse anomaly coverage.
Benchmark creation for semantic and causal anomaly reasoning becomes feasible with far less manual annotation effort.
Research can shift toward models that handle dynamic multi-event sequences and temporal causality in anomalies.
Precise control over anomaly types and narratives enables targeted testing of method robustness on specific patterns.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same pipeline could generate on-demand benchmarks for other video tasks such as action prediction or event localization by swapping the anomaly assignment step.
As generation quality rises, fully synthetic datasets might reduce reliance on real-world collection for many video understanding benchmarks.
Performance differences between Pistachio and real datasets could highlight specific weaknesses in current models for long temporal context.
Domain-specific versions could be produced quickly by conditioning the scene and anomaly choices on particular environments like traffic or indoor surveillance.

Load-bearing premise

Videos produced by current generation models are realistic enough, and match real-world anomaly distributions and timing closely enough, to act as a reliable stand-in for evaluating detection and understanding methods.

What would settle it

If methods that rank highest on Pistachio videos produce markedly different rankings or much lower accuracy when tested on established real-world anomaly video datasets, the synthetic benchmark would fail to serve as a valid proxy.
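
One way to make this test concrete is a rank correlation between method scores on the synthetic benchmark and on a real one. The sketch below computes Spearman's rho in plain Python. The scores are invented for illustration, the helper names are hypothetical, and the closed-form formula used here assumes no tied scores.

```python
def ranks(scores: list[float]) -> list[int]:
    """Return ranks where 1 is the best (highest) score. Assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(a: list[float], b: list[float]) -> float:
    """Spearman rank correlation for two equally long score lists without ties."""
    n = len(a)
    ra, rb = ranks(a), ranks(b)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


# Made-up scores for four hypothetical methods; not results from any paper.
synthetic_scores = [0.81, 0.74, 0.69, 0.60]
real_scores = [0.88, 0.79, 0.83, 0.70]
rho = spearman(synthetic_scores, real_scores)
print(f"Spearman rho = {rho:.2f}")
```

A rho near 1 would mean the synthetic benchmark orders methods much as the real data does; a low or negative value would be the "markedly different rankings" outcome described above.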

Figures

Figures reproduced from arXiv: 2511.19474 by Fei Wang, Hongyi Cai, Jie Li, Mingkang Dong, Muxin Pu, Shan You, Tao Huang.

**Figure 1.** We introduce Pistachio - a benchmark for video anomaly analysis, which aims at two fundamental tasks: Video Anomaly Detection (VAD) and Video Anomaly Understanding (VAU). The VAD dataset totals 1.6 million frames and extends existing datasets by expanding the number of scenes from hundreds to thousands, covering 31 distinct anomaly types, over half of which are unique to this benchmark. Pistachio offers mu…

**Figure 2.** Overview of our video anomaly dataset generation pipeline.

**Figure 3.** Distribution of anomaly videos across different anomaly types. For each type, the left bar represents short videos and the right…

**Figure 4.** Distribution of anomaly video ratios in each scenario.

**Figure 5.** Comparison of Different Video Generation Schemes and Non-compliant Videos.

**Figure 6.** Distribution of all exception categories across all scenarios.

Original abstract

Automatically detecting abnormal events in videos is crucial for modern autonomous systems, yet existing Video Anomaly Detection (VAD) benchmarks lack the scene diversity, balanced anomaly coverage, and temporal complexity needed to reliably assess real-world performance. Meanwhile, the community is increasingly moving toward Video Anomaly Understanding (VAU), which requires deeper semantic and causal reasoning but remains difficult to benchmark due to the heavy manual annotation effort it demands. In this paper, we introduce Pistachio, a new VAD/VAU benchmark constructed entirely through a controlled, generation-based pipeline. By leveraging recent advances in video generation models, Pistachio provides precise control over scenes, anomaly types, and temporal narratives, effectively eliminating the biases and limitations of Internet-collected datasets. Our pipeline integrates scene-conditioned anomaly assignment, multi-step storyline generation, and a temporally consistent long-form synthesis strategy that produces coherent 41-second videos with minimal human intervention. Extensive experiments demonstrate the scale, diversity, and complexity of Pistachio, revealing new challenges for existing methods and motivating future research on dynamic and multi-event anomaly understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Pistachio gives a controlled synthetic pipeline for balanced long-form video anomaly data, but the realism transfer to real-world cases still needs numbers to hold up.

Pistachio is a new synthetic benchmark for video anomaly detection and understanding built entirely from generation models. The core idea is to fix the usual problems with scraped datasets (poor scene variety, unbalanced anomaly types, and short clips) by generating coherent 41-second videos with explicit control over scenes, anomaly placement, and narrative flow through a multi-step storyline process and minimal human input. That pipeline is the main new piece here, and it lines up with the documented gaps in datasets such as ShanghaiTech and UCF-Crime that the abstract flags.

If the full experiments really show scale plus fresh failure modes for current detectors, the work gives the field a concrete way to test longer, more structured anomaly sequences and to move toward semantic VAU evaluation without massive annotation costs. The low-intervention design is a practical plus for reproducibility too.

The soft spot is the missing validation layer. The claim that this setup eliminates biases rests on the generated videos serving as faithful proxies for real motion, physics, and anomaly timing. The abstract mentions extensive experiments on diversity and complexity but does not surface quantitative checks such as video-level fidelity scores, human realism ratings, or direct distribution comparisons against real datasets. That leaves the transfer assumption untested in what is shown, which is the load-bearing part for anyone who would actually use the benchmark downstream.

This paper is for people who build or evaluate VAD and VAU models and want better-controlled test sets. Readers working on synthetic data pipelines in computer vision will find the construction steps useful even if they stay skeptical on the realism side. It deserves a serious referee because the problem it targets is real and the approach is specific enough to review in detail. I would send it to peer review and expect the reviewers to ask for the fidelity metrics up front.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Pistachio, a synthetic VAD/VAU benchmark constructed via a controlled generation pipeline. The pipeline combines scene-conditioned anomaly assignment, multi-step storyline generation, and temporally consistent long-form synthesis to produce 41-second videos with explicit control over scenes, anomaly types, and temporal narratives. The central claim is that this approach eliminates the biases and limitations of Internet-collected datasets while providing the scale, diversity, and complexity needed to expose new challenges for existing VAD and VAU methods.

Significance. If the generated videos prove to be faithful proxies for real-world anomaly distributions and temporal dynamics, Pistachio would offer a scalable, precisely controllable benchmark that reduces annotation burden and enables systematic study of dynamic and multi-event scenarios, directly addressing gaps in current VAD/VAU evaluation.

major comments (2)

[Experiments] Experiments section: The abstract asserts that 'extensive experiments demonstrate the scale, diversity, and complexity of Pistachio, revealing new challenges,' yet no quantitative results (e.g., VAD/VAU performance metrics, error analysis, or baseline comparisons on Pistachio versus ShanghaiTech/UCF-Crime) are referenced. This absence is load-bearing for the claim that the benchmark exposes new challenges.
[Pipeline] Pipeline description (Section 3): The claim that the pipeline 'effectively eliminat[es] the biases and limitations of Internet-collected datasets' rests on the untested assumption that generated 41-second videos match real-world motion, physics, and anomaly temporal profiles. No video-level fidelity metrics (FID, FVD), human realism scores, or statistical tests (e.g., Kolmogorov-Smirnov on anomaly duration/frequency distributions) against real benchmarks are reported.
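
The Kolmogorov-Smirnov comparison mentioned in the second major comment can be sketched as follows. This is a plain-Python version of the two-sample KS statistic, applied to invented duration values. The function name and the data are assumptions for illustration, and the sketch computes only the statistic D, not a p-value.

```python
from bisect import bisect_right


def ks_statistic(sample_a: list[float], sample_b: list[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the two empirical cumulative distribution functions."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    # The largest gap between two step functions occurs at an observed value.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Illustrative anomaly durations in seconds; these are not measurements from any dataset.
synthetic_durations = [4.0, 6.0, 8.0, 10.0]
real_durations = [5.0, 7.0, 9.0, 30.0]
d_stat = ks_statistic(synthetic_durations, real_durations)
print(f"KS statistic D = {d_stat:.2f}")
```

For a significance test rather than the raw statistic, a library routine such as `scipy.stats.ks_2samp` would be the usual choice; the sketch avoids the dependency so it stays self-contained.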

minor comments (2)

[Abstract] Abstract: 'VAU' is expanded on first use, but subsequent references to 'dynamic and multi-event anomaly understanding' would benefit from a brief forward pointer to the specific VAU tasks evaluated.
[Figures] Figure captions (throughout): Several figures showing generated video frames lack explicit labels for anomaly start/end times or scene conditioning parameters, reducing immediate readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive feedback. We address each major comment below and describe the revisions we will make to strengthen the manuscript.

Referee: [Experiments] Experiments section: The abstract asserts that 'extensive experiments demonstrate the scale, diversity, and complexity of Pistachio, revealing new challenges,' yet no quantitative results (e.g., VAD/VAU performance metrics, error analysis, or baseline comparisons on Pistachio versus ShanghaiTech/UCF-Crime) are referenced. This absence is load-bearing for the claim that the benchmark exposes new challenges.

Authors: We agree that the current manuscript lacks the quantitative baseline evaluations needed to fully support the claim that Pistachio reveals new challenges. The experiments section in the submitted version focuses on dataset statistics, diversity measures, and qualitative examples of generated videos. In the revision we will add a new subsection reporting VAD and VAU baseline results on Pistachio, including standard metrics such as AUC-ROC, comparisons against ShanghaiTech and UCF-Crime, and an error analysis highlighting failure modes that are more prevalent in our long-form, multi-event setting.

revision: yes

Referee: [Pipeline] Pipeline description (Section 3): The claim that the pipeline 'effectively eliminat[es] the biases and limitations of Internet-collected datasets' rests on the untested assumption that generated 41-second videos match real-world motion, physics, and anomaly temporal profiles. No video-level fidelity metrics (FID, FVD), human realism scores, or statistical tests (e.g., Kolmogorov-Smirnov on anomaly duration/frequency distributions) against real benchmarks are reported.

Authors: The referee is correct that the manuscript does not yet provide direct quantitative evidence that the generated videos match real-world distributions in motion, physics, or anomaly timing. While the pipeline was designed to reduce collection biases through explicit control, we will add video fidelity measurements (FVD scores), human realism ratings from a user study, and statistical comparisons (including Kolmogorov-Smirnov tests on anomaly duration and frequency) against real benchmarks such as UCF-Crime to substantiate the claim.

revision: yes
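
The first response names AUC-ROC as the standard metric for the promised baselines. As a reminder of what that number measures, here is a minimal sketch using the pairwise (Mann-Whitney) formulation: the fraction of anomalous/normal pairs in which the anomalous item receives the higher score, with ties counted as half. Frame-level labels are an assumption here, and the labels and scores are invented.

```python
def auc_roc(labels: list[int], scores: list[float]) -> float:
    """AUC-ROC as the probability that a random anomalous item outscores
    a random normal item (ties count as 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one anomalous and one normal item")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Invented per-frame labels (1 = anomalous) and detector scores, for illustration only.
frame_labels = [0, 0, 1, 1, 0, 1]
frame_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]
auc = auc_roc(frame_labels, frame_scores)
print(f"AUC-ROC = {auc:.3f}")
```

The double loop is quadratic in the number of frames, which is fine for a sketch; for a dataset of 1.6 million frames a sort-based or library implementation would be needed.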

Circularity Check

0 steps flagged

No significant circularity: dataset construction without derivations or fitted predictions

This is a dataset construction paper that describes a procedural pipeline for generating synthetic long-form videos using existing video generation models, scene-conditioned assignment, and storyline generation. No equations, predictions, or first-principles results are presented that could reduce to inputs by construction. The central claims concern control over content and elimination of internet-data biases, but these rest on the external properties of the generators rather than any internal self-definition, fitted-parameter renaming, or load-bearing self-citation chain. The work is self-contained as a benchmark proposal whose utility is to be assessed by downstream users against real-world data, consistent with the reader's assessment of no derivation or fitted quantities.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that current video generation models can produce temporally coherent long-form videos with controllable anomalies that match real-world distributions closely enough for benchmarking purposes. No free parameters or invented entities are introduced.

axioms (1)

domain assumption: Video generation models can be conditioned to produce temporally consistent 41-second videos with specified anomaly narratives.
Invoked in the pipeline description to justify minimal human intervention and bias elimination.

Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · 2 internal anchors

[1]

Ubnor- mal: New benchmark for supervised open-set video anomaly detection

Andra Acsintoae, Andrei Florescu, Mariana-Iuliana Georgescu, Tudor Mare, Paul Sumedrea, Radu Tudor Ionescu, Fahad Shahbaz Khan, and Mubarak Shah. Ubnor- mal: New benchmark for supervised open-set video anomaly detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 2

work page 2022
[2]

Ub- normal: New benchmark for supervised open-set video anomaly detection

Andra Acsintoae, Andrei Florescu, Mariana-Iuliana Georgescu, Tudor Mare, Paul Sumedrea, Radu Tudor Ionescu, Fahad Shahbaz Khan, and Mubarak Shah. Ub- normal: New benchmark for supervised open-set video anomaly detection. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20143–20153, 2022. 3

work page 2022
[3]

Robust real-time unusual event detection using mul- tiple fixed-location monitors.IEEE transactions on pattern analysis and machine intelligence, 30(3):555–560, 2008

Amit Adam, Ehud Rivlin, Ilan Shimshoni, and Daviv Reinitz. Robust real-time unusual event detection using mul- tiple fixed-location monitors.IEEE transactions on pattern analysis and machine intelligence, 30(3):555–560, 2008. 3

work page 2008
[4]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhao- hai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Jun- yang Lin. Qwen2.5-vl technical repor...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

A new comprehensive benchmark for semi-supervised video anomaly detection and anticipation

Congqi Cao, Yue Lu, Peng Wang, and Yanning Zhang. A new comprehensive benchmark for semi-supervised video anomaly detection and anticipation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20392–20401, 2023. 3

work page 2023
[6]

Mgfn: Magnitude- contrastive glance-and-focus network for weakly-supervised video anomaly detection, 2022

Yingxian Chen, Zhengzhe Liu, Baoheng Zhang, Wilton Fok, Xiaojuan Qi, and Yik-Chung Wu. Mgfn: Magnitude- contrastive glance-and-focus network for weakly-supervised video anomaly detection, 2022. 2, 6, 7

work page 2022
[7]

Mgfn: Magnitude- contrastive glance-and-focus network for weakly-supervised video anomaly detection

Yingxian Chen, Zhengzhe Liu, Baoheng Zhang, Wilton Fok, Xiaojuan Qi, and Yik-Chung Wu. Mgfn: Magnitude- contrastive glance-and-focus network for weakly-supervised video anomaly detection. InProceedings of the AAAI con- ference on artificial intelligence, pages 387–395, 2023. 3

work page 2023
[8]

Internvl: Scaling up vision foundation mod- els and aligning for generic visual-linguistic tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation mod- els and aligning for generic visual-linguistic tasks. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024. 8

work page 2024
[9]

A discriminative framework for anomaly detection in large videos

Allison Del Giorno, J Andrew Bagnell, and Martial Hebert. A discriminative framework for anomaly detection in large videos. InEuropean conference on computer vision, pages 334–349. Springer, 2016. 3

work page 2016
[10]

Uncovering what, why and how: A comprehensive benchmark for causation understanding of video anomaly, 2024

Hang Du, Sicheng Zhang, Binzhu Xie, Guoshun Nan, Ji- ayang Zhang, Junrui Xu, Hangyu Liu, Sicong Leng, Jiang- ming Liu, Hehe Fan, Dajiu Huang, Jing Feng, Linli Chen, Can Zhang, Xuhuan Li, Hao Zhang, Jianhang Chen, Qimei Cui, and Xiaofeng Tao. Uncovering what, why and how: A comprehensive benchmark for causation understanding of video anomaly, 2024. 2

work page 2024
[11]

Clip-tsa: Clip-assisted temporal self-attention for weakly-supervised video anomaly detection, 2023

Hyekang Kevin Joo, Khoa V o, Kashu Yamazaki, and Ngan Le. Clip-tsa: Clip-assisted temporal self-attention for weakly-supervised video anomaly detection, 2023. 2

work page 2023
[12]

Hunyuanvideo: A systematic framework for large video generative models, 2025

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, Kathrina Wu, Qin Lin, Junkun Yuan, Yanxin Long, Aladdin Wang, Andong Wang, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Hongmei Wang, Jacob Song, Jiawang Bai, Jianbing Wu, Jinbao Xue, Joey Wang, Kai Wang, Mengyang Liu, Pengyu Li, Shuai Li, ...

work page 2025
[13]

Anomaly detection and localization in crowded scenes.IEEE transactions on pattern analysis and machine intelligence, 36(1):18–32, 2013

Weixin Li, Vijay Mahadevan, and Nuno Vasconcelos. Anomaly detection and localization in crowded scenes.IEEE transactions on pattern analysis and machine intelligence, 36(1):18–32, 2013. 3

work page 2013
[14]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll´ar, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014. 6

work page 2014
[15]

Exploring background-bias for anomaly detection in surveillance videos

Kun Liu and Huadong Ma. Exploring background-bias for anomaly detection in surveillance videos. InProceedings of the 27th ACM International Conference on Multimedia, pages 1490–1499, 2019. 3

work page 2019
[16]

Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen, Zhengqing Yuan, Yue Huang, Hanchi Sun, Jian- feng Gao, et al. Sora: A review on background, technology, limitations, and opportunities of large vision models.arXiv preprint arXiv:2402.17177, 2024. 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2024
[17]

Abnormal event de- tection at 150 fps in matlab

Cewu Lu, Jianping Shi, and Jiaya Jia. Abnormal event de- tection at 150 fps in matlab. 2013. 3

work page 2013
[18]

A revisit of sparse coding based anomaly detection in stacked rnn framework

Weixin Luo, Wen Liu, and Shenghua Gao. A revisit of sparse coding based anomaly detection in stacked rnn framework. InProceedings of the IEEE international conference on com- puter vision, pages 341–349, 2017. 3

work page 2017
[19]

Mulde: Multiscale log- density estimation via denoising score matching for video anomaly detection

Jakub Micorek, Horst Possegger, Dominik Narnhofer, Horst Bischof, and Mateusz Kozinski. Mulde: Multiscale log- density estimation via denoising score matching for video anomaly detection. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR), pages 18868–18877, 2024. 2, 6, 7

work page 2024
[20]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF inter- national conference on computer vision, pages 4195–4205,

work page
[21]

Scalable diffusion models with transformers, 2023

William Peebles and Saining Xie. Scalable diffusion models with transformers, 2023. 3

work page 2023
[22]

Learning prompt-enhanced context features for weakly- supervised video anomaly detection.IEEE Transactions on Image Processing, 2024

Yujiang Pu, Xiaoyu Wu, Lulu Yang, and Shengjin Wang. Learning prompt-enhanced context features for weakly- supervised video anomaly detection.IEEE Transactions on Image Processing, 2024. 3, 2, 5

work page 2024
[23]

Recognizing indoor scenes

Ariadna Quattoni and Antonio Torralba. Recognizing indoor scenes. In2009 IEEE conference on computer vision and pattern recognition, pages 413–420. IEEE, 2009. 6

work page 2009
[24]

Street scene: A new dataset and evaluation protocol for video anomaly detection

Bharathkumar Ramachandra and Michael Jones. Street scene: A new dataset and evaluation protocol for video anomaly detection. InProceedings of the IEEE/CVF winter conference on applications of computer vision, pages 2569– 2578, 2020. 2, 3

work page 2020
[25]

Self-distilled masked auto-encoders are efficient video anomaly detectors

Nicolae-C Ristea, Florinel-Alin Croitoru, Radu Tudor Ionescu, Marius Popescu, Fahad Shahbaz Khan, Mubarak Shah, et al. Self-distilled masked auto-encoders are efficient video anomaly detectors. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15984–15995, 2024. 3

work page 2024
[26]

Real-world anomaly detection in surveillance videos

Waqas Sultani, Chen Chen, and Mubarak Shah. Real-world anomaly detection in surveillance videos. InProceedings of the IEEE conference on computer vision and pattern recog- nition, pages 6479–6488, 2018. 3

work page 2018
[27]

Real-world anomaly detection in surveillance videos, 2019

Waqas Sultani, Chen Chen, and Mubarak Shah. Real-world anomaly detection in surveillance videos, 2019. 3

work page 2019
[28]

Hawk: Learning to understand open-world video anomalies, 2024

Jiaqi Tang, Hao Lu, Ruizheng Wu, Xiaogang Xu, Ke Ma, Cheng Fang, Bin Guo, Jiangbo Lu, Qifeng Chen, and Ying- Cong Chen. Hawk: Learning to understand open-world video anomalies, 2024. 2

work page 2024
[29]

Qwen3 technical report, 2025

Qwen Team. Qwen3 technical report, 2025. 7, 8

work page 2025
[31]

Weakly-supervised video anomaly detection with robust temporal feature mag- nitude learning.arXiv preprint arXiv:2101.10030, 2021

Yu Tian, Guansong Pang, Yuanhong Chen, Rajvinder Singh, Johan W Verjans, and Gustavo Carneiro. Weakly-supervised video anomaly detection with robust temporal feature mag- nitude learning.arXiv preprint arXiv:2101.10030, 2021. 2

work page arXiv 2021
[32]

Weakly-supervised video anomaly detection with robust temporal feature magni- tude learning

Yu Tian, Guansong Pang, Yuanhong Chen, Rajvinder Singh, Johan W Verjans, and Gustavo Carneiro. Weakly-supervised video anomaly detection with robust temporal feature magni- tude learning. InProceedings of the IEEE/CVF international conference on computer vision, pages 4975–4986, 2021. 3

work page 2021
[33]

Wan: Open and advanced large-scale video generative models, 2025

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianx- iao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jin- gren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fan...

work page 2025
[34]

Robust unsupervised video anomaly detection by multipath frame prediction.IEEE transactions on neural networks and learn- ing systems, 33(6):2301–2312, 2021

Xuanzhao Wang, Zhengping Che, Bo Jiang, Ning Xiao, Ke Yang, Jian Tang, Jieping Ye, Jingyu Wang, and Qi Qi. Robust unsupervised video anomaly detection by multipath frame prediction.IEEE transactions on neural networks and learn- ing systems, 33(6):2301–2312, 2021. 2

work page 2021
[35]

Video models are zero-shot learn- ers and reasoners, 2025

Thadd ¨aus Wiedemer, Yuxuan Li, Paul Vicol, Shixiang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank Jaini, and Robert Geirhos. Video models are zero-shot learn- ers and reasoners, 2025. 2, 3

work page 2025
[36]

Not only look, but also listen: Learning multimodal violence detection under weak supervision

Peng Wu, Jing Liu, Yujia Shi, Yujia Sun, Fangtao Shao, Zhaoyang Wu, and Zhiwei Yang. Not only look, but also listen: Learning multimodal violence detection under weak supervision. InEuropean conference on computer vision, pages 322–339. Springer, 2020. 3

work page 2020
[37]

Vadclip: Adapting vision-language models for weakly supervised video anomaly detection, 2023

Peng Wu, Xuerong Zhou, Guansong Pang, Lingru Zhou, Qingsen Yan, Peng Wang, and Yanning Zhang. Vadclip: Adapting vision-language models for weakly supervised video anomaly detection, 2023. 2

work page 2023
[38]

Open-vocabulary video anomaly detection, 2024

Peng Wu, Xuerong Zhou, Guansong Pang, Yujia Sun, Jing Liu, Peng Wang, and Yanning Zhang. Open-vocabulary video anomaly detection, 2024. 3

work page 2024
[39]

Follow the rules: Reasoning for video anomaly detection with large language models, 2024

Yuchen Yang, Kwonjoon Lee, Behzad Dariush, Yinzhi Cao, and Shao-Yuan Lo. Follow the rules: Reasoning for video anomaly detection with large language models, 2024. 3

work page 2024
[40]

Harnessing large language mod- els for training-free video anomaly detection, 2024

Luca Zanella, Willi Menapace, Massimiliano Mancini, Yim- ing Wang, and Elisa Ricci. Harnessing large language mod- els for training-free video anomaly detection, 2024. 3

work page 2024
[41]

Holmes-vau: Towards long-term video anomaly understanding at any granularity

Huaxin Zhang, Xiaohao Xu, Xiang Wang, Jialong Zuo, Xi- aonan Huang, Changxin Gao, Shanjun Zhang, Li Yu, and Nong Sang. Holmes-vau: Towards long-term video anomaly understanding at any granularity. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 13843–13853, 2025. 8

work page 2025
[42]

Holmes-vau: Towards long-term video anomaly understanding at any granularity, 2025

Huaxin Zhang, Xiaohao Xu, Xiang Wang, Jialong Zuo, Xi- aonan Huang, Changxin Gao, Shanjun Zhang, Li Yu, and Nong Sang. Holmes-vau: Towards long-term video anomaly understanding at any granularity, 2025. 2

work page 2025
[43]

Multi- scale video anomaly detection by multi-grained spatio- temporal representation learning

Menghao Zhang, Jingyu Wang, Qi Qi, Haifeng Sun, Zirui Zhuang, Pengfei Ren, Ruilong Ma, and Jianxin Liao. Multi- scale video anomaly detection by multi-grained spatio- temporal representation learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17385–17394, 2024. 3

work page 2024
[44]

Single-image crowd counting via multi-column convolutional neural network

Yingying Zhang, Desen Zhou, Siqin Chen, Shenghua Gao, and Yi Ma. Single-image crowd counting via multi-column convolutional neural network. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 589–597, 2016. 2

work page 2016
[45]

Llava- next: A strong zero-shot video understanding model, 2024

Yuanhan Zhang, Bo Li, haotian Liu, Yong jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, and Chunyuan Li. Llava- next: A strong zero-shot video understanding model, 2024. 8

work page 2024
[46]

Dual memory units with uncertainty regulation for weakly supervised video anomaly detection

Hang Zhou, Junqing Yu, and Wei Yang. Dual memory units with uncertainty regulation for weakly supervised video anomaly detection. InProceedings of the AAAI Conference on Artificial Intelligence, pages 3769–3777, 2023. 2, 6, 7

work page 2023
[47]

Dual memory units with uncertainty regulation for weakly supervised video anomaly detection

Hang Zhou, Junqing Yu, and Wei Yang. Dual memory units with uncertainty regulation for weakly supervised video anomaly detection. InProceedings of the AAAI Conference on Artificial Intelligence, pages 3769–3777, 2023. 3

work page 2023
[48]

Advancing video anomaly detection: A concise re- view and a new dataset.Advances in Neural Information Processing Systems, 37:89943–89977, 2024

Liyun Zhu, Lei Wang, Arjun Raj, Tom Gedeon, and Chen Chen. Advancing video anomaly detection: A concise re- view and a new dataset.Advances in Neural Information Processing Systems, 37:89943–89977, 2024. 3 Pistachio: Towards Synthetic, Balanced, and Long-Form Video Anomaly Benchmarks Supplementary Material

Details of the System Prompts. The comprehensive system prompts (Tab. 5) are pivotal for precisely guiding the Large Language Model (LLM) through the multi-stage annotation workflow of the Pistachio dataset. These stage-specific prompts are designed to ensure structured and consistent output generation across four critical phases, aligning with the metho...

Dataset Characterization and Curation Details. To further substantiate the quality, diversity, and robust generation methodology of the Pistachio dataset, we provide additional visualizations and analysis related to our curation process. Our complete generation pipeline, formalized in Algorithm 1, demonstrates a systematic three-stage approach: Sc...

1. Public Roads & Transportation Areas
2. Enclosed & Indoor Premises
3. Commercial & Entertainment Gathering Points
4. Industrial & Construction Zones
5. Outdoor & Natural Environments
6. Critical Infrastructure

Anomaly Type Determination (6 prompts, one per scene). Example for Commercial & Entertainment Gathering Points: You are an expert in multi-image analysis and event inference for "Commercial & Entertainment Gathering Points." You will receive a set of image files. Your task is to assign the most likely and specific anomalous events...

Panoramic fixed shot, a man crawls into the frame, visibly in pain with a clear gunshot wound to his waist, and then crawls out of the frame

leverages a dual-branch structure that makes full use of the frozen CLIP model's fine-grained vision-language alignment. The visual features are enhanced by alignment with rich semantic language representations. This cross-modal knowledge acts as a strong semantic prior, providing a generalized understanding of "anomaly" that transcends dataset-specific ...
