
arxiv: 2511.19474 · v5 · submitted 2025-11-22 · 💻 cs.CV · cs.AI · cs.MM

Pistachio: Towards Synthetic, Balanced, and Long-Form Video Anomaly Benchmarks

Pith reviewed 2026-05-17 06:25 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.MM
keywords video anomaly detection · video anomaly understanding · synthetic benchmarks · video generation · long-form video · benchmark dataset · anomaly types

The pith

A controlled video generation pipeline produces balanced, diverse long-form anomaly benchmarks without internet data biases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that recent video generation models can be used to build a new benchmark for video anomaly detection and understanding that fixes the scene bias, imbalance, and short duration problems of existing real-world datasets. Current benchmarks make it hard to test methods reliably because they come from uncontrolled internet sources and require costly manual labels for deeper causal reasoning tasks. Pistachio instead runs a pipeline of scene-conditioned anomaly assignment followed by multi-step storyline generation and temporally consistent synthesis to output coherent 41-second videos with exact control over events. If this works, researchers gain a scalable way to evaluate methods on complex, multi-event scenarios and can focus development on handling dynamic temporal narratives.

Core claim

Pistachio is a new VAD and VAU benchmark built entirely through a controlled generation-based pipeline. The pipeline combines scene-conditioned anomaly assignment, multi-step storyline generation, and temporally consistent long-form synthesis to create coherent 41-second videos. This gives precise control over scenes, anomaly types, and temporal narratives, removing the biases and limitations of Internet-collected datasets while demonstrating scale, diversity, and complexity that expose new challenges for existing methods.

What carries the argument

The controlled generation-based pipeline integrating scene-conditioned anomaly assignment, multi-step storyline generation, and temporally consistent long-form synthesis to produce 41-second videos.
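
As a rough illustration of how the three stages named above might fit together, here is a minimal Python sketch. It is not the authors' implementation: the names `Storyline`, `build_video`, `assign_anomaly`, `write_storyline`, and `synthesize_segment` are hypothetical, and passing the previous segment into the synthesizer is only one assumed way to obtain temporal consistency, which the summary does not specify.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple


@dataclass
class Storyline:
    scene: str
    anomaly_type: str
    steps: List[str]  # ordered text descriptions, one per video segment


def build_video(
    scene: str,
    assign_anomaly: Callable[[str], str],
    write_storyline: Callable[[str, str], List[str]],
    synthesize_segment: Callable[[str, Optional[Any]], Any],
) -> Tuple[Storyline, List[Any]]:
    """Chain the three stages; each stage is supplied by the caller."""
    # Stage 1: pick an anomaly type that fits the scene.
    anomaly = assign_anomaly(scene)
    # Stage 2: expand scene + anomaly into an ordered multi-step storyline.
    steps = write_storyline(scene, anomaly)
    # Stage 3: synthesize segments in order, handing each call the previous
    # segment (an assumption made for this sketch, not a documented mechanism).
    segments: List[Any] = []
    previous: Optional[Any] = None
    for step in steps:
        previous = synthesize_segment(step, previous)
        segments.append(previous)
    return Storyline(scene, anomaly, steps), segments


# Usage with stand-in stages (no real models are called).
story, segments = build_video(
    "parking lot",
    assign_anomaly=lambda scene: "vehicle collision",
    write_storyline=lambda scene, anomaly: [
        f"normal activity in a {scene}",
        f"{anomaly} occurs",
        "aftermath",
    ],
    synthesize_segment=lambda step, prev: {
        "prompt": step,
        "conditioned_on": None if prev is None else prev["prompt"],
    },
)
```

Keeping the stages as injected callables means the sketch makes no claim about which language or video models are used at each step.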

If this is right

  • Existing VAD and VAU methods encounter new performance challenges on long-form videos with balanced and diverse anomaly coverage.
  • Benchmark creation for semantic and causal anomaly reasoning becomes feasible with far less manual annotation effort.
  • Research can shift toward models that handle dynamic multi-event sequences and temporal causality in anomalies.
  • Precise control over anomaly types and narratives enables targeted testing of method robustness on specific patterns.
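
The "balanced" coverage in the first point can be quantified in several ways. One common choice, which is an assumption here and not a metric taken from the paper, is the Shannon entropy of the anomaly-type distribution normalized by its maximum. The function name `balance_score` and the optional `num_classes` parameter are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, Optional


def balance_score(labels: Iterable[str], num_classes: Optional[int] = None) -> float:
    """Normalized Shannon entropy in [0, 1]; 1.0 means perfectly uniform.

    num_classes is the number of categories the benchmark is meant to cover;
    if omitted, only the categories actually observed are counted.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    k = num_classes if num_classes is not None else len(counts)
    if total == 0 or k < 2:
        raise ValueError("need at least one label and two categories")
    if len(counts) > k:
        raise ValueError("observed more categories than num_classes")
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(k)
```

Passing the intended category count as `num_classes` penalizes categories that never appear, which the observed-only default cannot detect.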

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pipeline could generate on-demand benchmarks for other video tasks such as action prediction or event localization by swapping the anomaly assignment step.
  • As generation quality rises, fully synthetic datasets might reduce reliance on real-world collection for many video understanding benchmarks.
  • Performance differences between Pistachio and real datasets could highlight specific weaknesses in current models for long temporal context.
  • Domain-specific versions could be produced quickly by conditioning the scene and anomaly choices on particular environments like traffic or indoor surveillance.

Load-bearing premise

Videos produced by current generation models are realistic enough, and match real-world anomaly distributions and timing closely enough, to act as a reliable stand-in for evaluating detection and understanding methods.

What would settle it

If methods that rank highest on Pistachio videos produce markedly different rankings or much lower accuracy when tested on established real-world anomaly video datasets, the synthetic benchmark would fail to serve as a valid proxy.
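
One way to make "markedly different rankings" concrete is a rank correlation between per-method scores on the two benchmarks. The sketch below computes Spearman's coefficient in plain Python. The choice of Spearman (over, say, Kendall's tau) is an assumption, and any scores fed to it would have to come from actual evaluations; none are supplied here.

```python
from typing import List


def ranks(scores: List[float]) -> List[float]:
    """Average ranks (1 = lowest score); tied scores share their mean rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of scores tied with position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(a: List[float], b: List[float]) -> float:
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long lists with at least 2 entries")
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined when all scores are tied")
    return cov / (var_a * var_b) ** 0.5
```

Here `a` would hold each method's score on Pistachio and `b` the same methods' scores on a real-world dataset, in the same order. A coefficient near 1 means the rankings agree; a value near 0 or below would support the failure scenario described above.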

Figures

Figures reproduced from arXiv: 2511.19474 by Fei Wang, Hongyi Cai, Jie Li, Mingkang Dong, Muxin Pu, Shan You, Tao Huang.

Figure 1: We introduce Pistachio - a benchmark for video anomaly analysis, which aims at two fundamental tasks: Video Anomaly Detection (VAD) and Video Anomaly Understanding (VAU). The VAD dataset totals 1.6 million frames and extends existing datasets by expanding the number of scenes from hundreds to thousands, covering 31 distinct anomaly types, over half of which are unique to this benchmark. Pistachio offers mu…
Figure 2: Overview of our video anomaly dataset generation pipeline.
Figure 3: Distribution of anomaly videos across different anomaly types. For each type, the left bar represents short videos and the right…
Figure 4: Distribution of anomaly video ratios in each scenario.
Figure 5: Comparison of Different Video Generation Schemes and Non-compliant Videos.
Figure 6: Distribution of all exception categories across all scenarios.
Original abstract

Automatically detecting abnormal events in videos is crucial for modern autonomous systems, yet existing Video Anomaly Detection (VAD) benchmarks lack the scene diversity, balanced anomaly coverage, and temporal complexity needed to reliably assess real-world performance. Meanwhile, the community is increasingly moving toward Video Anomaly Understanding (VAU), which requires deeper semantic and causal reasoning but remains difficult to benchmark due to the heavy manual annotation effort it demands. In this paper, we introduce Pistachio, a new VAD/VAU benchmark constructed entirely through a controlled, generation-based pipeline. By leveraging recent advances in video generation models, Pistachio provides precise control over scenes, anomaly types, and temporal narratives, effectively eliminating the biases and limitations of Internet-collected datasets. Our pipeline integrates scene-conditioned anomaly assignment, multi-step storyline generation, and a temporally consistent long-form synthesis strategy that produces coherent 41-second videos with minimal human intervention. Extensive experiments demonstrate the scale, diversity, and complexity of Pistachio, revealing new challenges for existing methods and motivating future research on dynamic and multi-event anomaly understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Pistachio, a synthetic VAD/VAU benchmark constructed via a controlled generation pipeline. The pipeline combines scene-conditioned anomaly assignment, multi-step storyline generation, and temporally consistent long-form synthesis to produce 41-second videos with explicit control over scenes, anomaly types, and temporal narratives. The central claim is that this approach eliminates the biases and limitations of Internet-collected datasets while providing the scale, diversity, and complexity needed to expose new challenges for existing VAD and VAU methods.

Significance. If the generated videos prove to be faithful proxies for real-world anomaly distributions and temporal dynamics, Pistachio would offer a scalable, precisely controllable benchmark that reduces annotation burden and enables systematic study of dynamic and multi-event scenarios, directly addressing gaps in current VAD/VAU evaluation.

major comments (2)
  1. [Experiments] Experiments section: The abstract asserts that 'extensive experiments demonstrate the scale, diversity, and complexity of Pistachio, revealing new challenges,' yet no quantitative results (e.g., VAD/VAU performance metrics, error analysis, or baseline comparisons on Pistachio versus ShanghaiTech/UCF-Crime) are referenced. This absence is load-bearing for the claim that the benchmark exposes new challenges.
  2. [Pipeline] Pipeline description (Section 3): The claim that the pipeline 'effectively eliminat[es] the biases and limitations of Internet-collected datasets' rests on the untested assumption that generated 41-second videos match real-world motion, physics, and anomaly temporal profiles. No video-level fidelity metrics (FID, FVD), human realism scores, or statistical tests (e.g., Kolmogorov-Smirnov on anomaly duration/frequency distributions) against real benchmarks are reported.
minor comments (2)
  1. [Abstract] Abstract: 'VAU' is expanded on first use, but subsequent references to 'dynamic and multi-event anomaly understanding' would benefit from a brief forward pointer to the specific VAU tasks evaluated.
  2. [Figures] Figure captions (throughout): Several figures showing generated video frames lack explicit labels for anomaly start/end times or scene conditioning parameters, reducing immediate readability.
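
The Kolmogorov-Smirnov comparison requested in major comment 2 can be sketched as follows. This computes only the two-sample KS statistic (the largest gap between the two empirical CDFs), not a p-value. The inputs are assumed to be per-video anomaly durations in seconds, one list from the synthetic set and one from a real benchmark. For a p-value, `scipy.stats.ks_2samp` is a standard option.

```python
from bisect import bisect_right
from typing import List


def ks_statistic(x: List[float], y: List[float]) -> float:
    """Two-sample KS statistic: max |F_x(v) - F_y(v)| over all observed v."""
    if not x or not y:
        raise ValueError("both samples must be non-empty")
    xs, ys = sorted(x), sorted(y)
    gap = 0.0
    # Both empirical CDFs are step functions that only change at sample
    # points, so checking every observed value finds the maximum gap.
    for v in xs + ys:
        fx = bisect_right(xs, v) / len(xs)
        fy = bisect_right(ys, v) / len(ys)
        gap = max(gap, abs(fx - fy))
    return gap
```

A statistic of 0 means the two empirical distributions coincide, and 1 means they do not overlap at all.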

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive feedback. We address each major comment below and describe the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: The abstract asserts that 'extensive experiments demonstrate the scale, diversity, and complexity of Pistachio, revealing new challenges,' yet no quantitative results (e.g., VAD/VAU performance metrics, error analysis, or baseline comparisons on Pistachio versus ShanghaiTech/UCF-Crime) are referenced. This absence is load-bearing for the claim that the benchmark exposes new challenges.

    Authors: We agree that the current manuscript lacks the quantitative baseline evaluations needed to fully support the claim that Pistachio reveals new challenges. The experiments section in the submitted version focuses on dataset statistics, diversity measures, and qualitative examples of generated videos. In the revision we will add a new subsection reporting VAD and VAU baseline results on Pistachio, including standard metrics such as AUC-ROC, comparisons against ShanghaiTech and UCF-Crime, and an error analysis highlighting failure modes that are more prevalent in our long-form, multi-event setting. revision: yes

  2. Referee: [Pipeline] Pipeline description (Section 3): The claim that the pipeline 'effectively eliminat[es] the biases and limitations of Internet-collected datasets' rests on the untested assumption that generated 41-second videos match real-world motion, physics, and anomaly temporal profiles. No video-level fidelity metrics (FID, FVD), human realism scores, or statistical tests (e.g., Kolmogorov-Smirnov on anomaly duration/frequency distributions) against real benchmarks are reported.

    Authors: The referee is correct that the manuscript does not yet provide direct quantitative evidence that the generated videos match real-world distributions in motion, physics, or anomaly timing. While the pipeline was designed to reduce collection biases through explicit control, we will add video fidelity measurements (FVD scores), human realism ratings from a user study, and statistical comparisons (including Kolmogorov-Smirnov tests on anomaly duration and frequency) against real benchmarks such as UCF-Crime to substantiate the claim. revision: yes
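
For readers unfamiliar with the AUC-ROC metric promised in the first response, the sketch below computes it from frame-level labels and anomaly scores using its pairwise definition: the probability that a randomly chosen anomalous frame scores higher than a randomly chosen normal one, with ties counted as half. Frame-level evaluation with binary labels is an assumption of this sketch, and the quadratic loop is for clarity only. A library routine such as `sklearn.metrics.roc_auc_score` would be the practical choice at benchmark scale.

```python
from typing import List


def auc_roc(labels: List[int], scores: List[float]) -> float:
    """AUC-ROC via pairwise comparison (1 = anomalous, 0 = normal)."""
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one anomalous and one normal frame")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # anomalous frame correctly ranked above normal
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of anomalous from normal frames.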

Circularity Check

0 steps flagged

No significant circularity: dataset construction without derivations or fitted predictions

Full rationale

This is a dataset construction paper that describes a procedural pipeline for generating synthetic long-form videos using existing video generation models, scene-conditioned assignment, and storyline generation. No equations, predictions, or first-principles results are presented that could reduce to inputs by construction. The central claims concern control over content and elimination of internet-data biases, but these rest on the external properties of the generators rather than any internal self-definition, fitted-parameter renaming, or load-bearing self-citation chain. The work is self-contained as a benchmark proposal whose utility is to be assessed by downstream users against real-world data, consistent with the reader's assessment of no derivation or fitted quantities.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that current video generation models can produce temporally coherent long-form videos with controllable anomalies that match real-world distributions closely enough for benchmarking purposes. No free parameters or invented entities are introduced.

axioms (1)
  • Domain assumption: Video generation models can be conditioned to produce temporally consistent 41-second videos with specified anomaly narratives.
    Invoked in the pipeline description to justify minimal human intervention and bias elimination.

pith-pipeline@v0.9.0 · 5503 in / 1202 out tokens · 52523 ms · 2026-05-17T06:25:47.337826+00:00 · methodology

