CoLoRSMamba: Conditional LoRA-Steered Mamba for Supervised Multimodal Violence Detection

Damith Chamalke Senadeera; Dimitrios Kollias; Gregory Slabaugh

arXiv: 2604.03329 · v1 · submitted 2026-04-02 · cs.CV · cs.AI · cs.LG · cs.SD

Pith reviewed 2026-05-13 21:06 UTC · model grok-4.3

classification: cs.CV · cs.AI · cs.LG · cs.SD

keywords: violence detection · multimodal fusion · Mamba · conditional LoRA · state-space models · video-audio alignment · surveillance

The pith

Video CLS tokens steer audio Mamba parameters via conditional LoRA to improve multimodal violence detection accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

CoLoRSMamba is a directional video-to-audio architecture that uses the video model's class (CLS) token to guide audio processing in Mamba models for violence detection. At each layer, the CLS token generates a modulation vector and a stabilization gate through conditional LoRA, which adapt the audio model's selective state-space parameters. This allows scene-aware audio dynamics without relying on token-level cross-attention. Training uses a classification loss plus a symmetric AV-InfoNCE loss for embedding alignment. On audio-filtered subsets of NTU-CCTV and DVD, it outperforms the reported baselines with a favorable accuracy-efficiency tradeoff, which matters for practical surveillance systems where audio can be noisy.

Core claim

CoLoRSMamba is a directional Video to Audio multimodal architecture that couples VideoMamba and AudioMamba through CLS-guided conditional LoRA. At each layer, the VideoMamba CLS token produces a channel-wise modulation vector and a stabilization gate that adapt the AudioMamba projections responsible for the selective state-space parameters (Delta, B, C), including the step-size pathway, yielding scene-aware audio dynamics without token-level cross-attention. Training combines binary classification with a symmetric AV-InfoNCE objective that aligns clip-level audio and video embeddings.
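For orientation, the selective state-space recurrence from the Mamba literature shows where Delta, B, and C enter. This is the standard formulation, not an equation taken from this paper; the symbols $h_t$, $x_t$, $y_t$, and $A$ are the usual Mamba notation and are assumptions here, and the expression for $\bar{B}_t$ is the first-order simplification commonly used in implementations.

```latex
\begin{aligned}
h_t &= \bar{A}_t\, h_{t-1} + \bar{B}_t\, x_t, \qquad y_t = C_t\, h_t, \\
\bar{A}_t &= \exp(\Delta_t A), \qquad \bar{B}_t \approx \Delta_t B_t .
\end{aligned}
```

In standard Mamba, $\Delta_t$, $B_t$, and $C_t$ are produced by input-dependent projections of the audio tokens. According to the claim above, those projections are what the video CLS token adapts, so the video signal changes how quickly the audio state updates ($\Delta_t$) and what is written to and read from it ($B_t$, $C_t$).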

What carries the argument

CLS-guided conditional LoRA adapting AudioMamba's Delta, B, C parameters from VideoMamba CLS token at each layer
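A minimal NumPy sketch of how one such layer could look is given here. It is an illustration under stated assumptions, not the authors' implementation: the function name, every weight name (`W_frozen`, `A_down`, `B_up`, `W_mod`, `w_gate`), the choice to apply the modulation vector over the LoRA rank channels, the scalar sigmoid gate, and all sizes are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conditional_lora_projection(x, cls_video, W_frozen, A_down, B_up, W_mod, w_gate):
    """Project audio tokens to the raw (Delta, B, C) features of one layer.

    x:         (T, d)  audio tokens
    cls_video: (d_v,)  video CLS token for the same clip
    Returns a (T, p) array. All shapes and names are illustrative assumptions.
    """
    m = W_mod @ cls_video              # (r,) channel-wise modulation vector
    g = sigmoid(w_gate @ cls_video)    # scalar stabilization gate in (0, 1)
    base = x @ W_frozen                # (T, p) unmodified audio projection
    low_rank = (x @ A_down) * m        # (T, r) low-rank path scaled by the video signal
    return base + g * (low_rank @ B_up)

# Hypothetical sizes: T tokens, width d, LoRA rank r, SSM state size n_state.
rng = np.random.default_rng(0)
T, d, d_v, r, n_state, dt_rank = 6, 16, 16, 4, 8, 2
p = dt_rank + 2 * n_state              # output is split into Delta | B | C

x = rng.normal(size=(T, d))
cls_video = rng.normal(size=d_v)
W_frozen = rng.normal(size=(d, p)) * 0.1
A_down = rng.normal(size=(d, r)) * 0.1
B_up = np.zeros((r, p))                # zero init, as in standard LoRA
W_mod = rng.normal(size=(r, d_v)) * 0.1
w_gate = rng.normal(size=d_v) * 0.1

out = conditional_lora_projection(x, cls_video, W_frozen, A_down, B_up, W_mod, w_gate)
delta_raw, B_ssm, C_ssm = np.split(out, [dt_rank, dt_rank + n_state], axis=1)
```

With `B_up` initialized to zero, the layer starts out identical to the unconditioned audio projection, so any video influence has to be learned. The cost per layer is two small matrix products plus one vector per clip, which is the sense in which this kind of coupling avoids token-level cross-attention.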

If this is right

Outperforms audio-only, video-only, and multimodal baselines on audio-filtered NTU-CCTV and DVD subsets
Achieves 88.63% accuracy and 86.24% F1-V on NTU-CCTV
Achieves 75.77% accuracy and 72.94% F1-V on DVD
Provides better accuracy-efficiency tradeoff with fewer parameters and FLOPs than several larger models
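As a reminder of how the two reported metrics relate, the sketch below computes accuracy and a positive-class F1 from confusion counts. It assumes that F1-V means F1 on the violent class, which this page does not spell out, and the counts are placeholders rather than results from the paper.

```python
def accuracy_and_f1(tp, fp, fn, tn):
    """Accuracy and positive-class F1 from binary confusion counts.

    The positive class is taken to be "violent" (an assumption about F1-V).
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, f1

# Placeholder counts for illustration only; not taken from the paper.
acc, f1_violent = accuracy_and_f1(tp=40, fp=10, fn=10, tn=40)
```

Because F1 ignores true negatives, accuracy and F1-V can diverge when the violent class is a minority, which is one reason to report both.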

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The method may extend to other video-audio tasks by replacing attention with parameter adaptation in state space models
Curating audio-filtered subsets enables fairer testing of multimodal models in noisy real-world conditions
Directional video-to-audio steering could reduce compute in surveillance pipelines compared to bidirectional fusion

Load-bearing premise

The video CLS token can reliably generate channel-wise modulation vectors and stabilization gates that adapt the audio parameters to produce scene-aware dynamics without misalignment or noise.

What would settle it

A standard multimodal baseline that matches or exceeds 88.63% accuracy on NTU-CCTV and 75.77% on DVD, using the same audio-filtered clip subsets with comparable or lower FLOPs, would challenge the claimed advantage.

Figures

Figures reproduced from arXiv: 2604.03329 by Damith Chamalke Senadeera, Dimitrios Kollias, Gregory Slabaugh.

**Figure 2.** Overview of CoLoRSMamba. (a) Full architecture: The video backbone processes the video input …

**Figure 3.** Accuracy-efficiency comparison on the DVD bench…

**Figure 4.** Prediction flip analysis on the DVD test split. Audio …

Original abstract

Violence detection benefits from audio, but real-world soundscapes can be noisy or weakly related to the visible scene. We present CoLoRSMamba, a directional Video to Audio multimodal architecture that couples VideoMamba and AudioMamba through CLS-guided conditional LoRA. At each layer, the VideoMamba CLS token produces a channel-wise modulation vector and a stabilization gate that adapt the AudioMamba projections responsible for the selective state-space parameters (Delta, B, C), including the step-size pathway, yielding scene-aware audio dynamics without token-level cross-attention. Training combines binary classification with a symmetric AV-InfoNCE objective that aligns clip-level audio and video embeddings. To support fair multimodal evaluation, we curate audio-filtered clip level subsets of the NTU-CCTV and DVD datasets from temporal annotations, retaining only clips with available audio. On these subsets, CoLoRSMamba outperforms representative audio-only, video-only, and multimodal baselines, achieving 88.63% accuracy / 86.24% F1-V on NTU-CCTV and 75.77% accuracy / 72.94% F1-V on DVD. It further offers a favorable accuracy-efficiency tradeoff, surpassing several larger models with fewer parameters and FLOPs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CoLoRSMamba's CLS-guided conditional LoRA steers AudioMamba parameters directionally for violence detection and reports efficiency gains on filtered clips, but the mechanism lacks direct validation that it creates genuine scene-aware audio dynamics.

The main thing here is a directional fusion trick: the VideoMamba CLS token feeds a conditional LoRA that modulates AudioMamba's Delta, B, and C projections at each layer, plus a stabilization gate, to produce scene-aware audio processing without token-level cross-attention. They train with binary classification plus symmetric AV-InfoNCE and test on audio-filtered clip subsets of NTU-CCTV and DVD that they curated from temporal annotations. The model beats representative baselines at 88.63% accuracy on NTU-CCTV and 75.77% on DVD while using fewer parameters and FLOPs than some larger alternatives. That efficiency angle is the clearest practical plus for surveillance-style tasks. The conditional LoRA coupling itself is a fresh, specific technique rather than a routine extension of existing Mamba work.

The soft spots sit in the validation. No parameter visualizations, correlation checks with visual events, or controlled misalignment tests appear to confirm that the modulation actually aligns audio processing to the scene instead of applying generic shifts. Because the datasets drop clips without audio, some of the reported lift could come from that curation step or the loss combination rather than the steering mechanism. The abstract gives clean numbers but no error bars, statistical tests, or full ablation breakdowns on the LoRA rank and scaling factors.

This paper is for people working on lightweight multimodal video-audio models, especially those already using state-space architectures for real-time or edge deployment. It is worth a serious referee. The directional adaptation idea and the efficiency results are both concrete enough that reviewers can usefully pressure-test the mechanism and ask for the missing controls.
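One way the controlled misalignment test mentioned in the letter could be set up is to feed each audio clip the CLS token of a different clip in the same batch. The sketch below is a hypothetical control, not something the paper describes; the function name and the cyclic-shift choice are assumptions.

```python
import numpy as np

def misalign_conditioning(video_cls, shift=1):
    """Cyclically shift a batch of video CLS tokens along the batch axis.

    video_cls: (N, d_v). After the shift, clip i is conditioned on the CLS
    token of clip (i - shift) mod N, so no clip keeps its own video signal.
    """
    n = video_cls.shape[0]
    if n < 2:
        raise ValueError("need at least two clips to misalign")
    if shift % n == 0:
        raise ValueError("shift must not be a multiple of the batch size")
    return np.roll(video_cls, shift, axis=0)

# Toy batch of four distinct CLS tokens (placeholder values).
batch_cls = np.arange(12, dtype=float).reshape(4, 3)
misaligned_cls = misalign_conditioning(batch_cls)
```

If accuracy with `misaligned_cls` stays close to accuracy with the true pairing, that would suggest the modulation is acting as a generic shift rather than a scene-specific one.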

Referee Report

3 major / 2 minor

Summary. The paper presents CoLoRSMamba, a directional multimodal architecture coupling VideoMamba and AudioMamba via CLS-token-guided conditional LoRA. At each layer the VideoMamba CLS token generates a channel-wise modulation vector and stabilization gate that adapt AudioMamba's selective SSM parameters (Delta, B, C). Training uses binary classification plus symmetric AV-InfoNCE. On audio-filtered clip-level subsets of NTU-CCTV and DVD the model reports 88.63% accuracy / 86.24% F1-V and 75.77% accuracy / 72.94% F1-V respectively, together with favorable parameter/FLOP counts versus baselines.

Significance. If the conditional LoRA mechanism demonstrably produces scene-aware audio dynamics aligned to visual events, the work would offer a lightweight, attention-free route to multimodal fusion that preserves Mamba's linear scaling. This would be valuable for real-time violence detection on resource-constrained devices where noisy audio-visual correspondence is common.

major comments (3)

[Methods] Methods section (architecture description): the central claim that VideoMamba CLS-driven modulation of AudioMamba's Delta/B/C yields scene-aware dynamics lacks any mechanistic validation (parameter visualizations, correlation with scene events, or controlled misalignment ablations). Without such checks the reported gains cannot be confidently attributed to the proposed adaptation rather than the loss functions or dataset filtering.
[Experiments] Experiments / Results tables: accuracy and F1 improvements are presented as single-point estimates with no error bars, multiple random seeds, or statistical significance tests against baselines. This weakens the claim of consistent outperformance on the audio-filtered NTU-CCTV and DVD subsets.
[Dataset] Dataset curation paragraph: the process for creating the audio-filtered clip-level subsets from temporal annotations is described only at high level. Missing details include exact filtering criteria, audio availability thresholds, and any resulting distribution shifts that could affect fair multimodal comparison.
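The reporting asked for in the second major comment can be sketched with the Python standard library alone. The per-seed scores below are placeholders, not results from the paper, and the helper names are hypothetical; the paired t statistic is one of several reasonable tests.

```python
import math
import statistics

def summarize(scores):
    """Mean and sample standard deviation across random seeds."""
    return statistics.mean(scores), statistics.stdev(scores)

def paired_t_statistic(model_scores, baseline_scores):
    """Paired t statistic and degrees of freedom for per-seed score pairs."""
    diffs = [m - b for m, b in zip(model_scores, baseline_scores)]
    n = len(diffs)
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    return statistics.mean(diffs) / (sd / math.sqrt(n)), n - 1

# Placeholder F1 scores for five seeds; illustrative only.
model_f1 = [0.80, 0.82, 0.81, 0.83, 0.79]
baseline_f1 = [0.78, 0.79, 0.80, 0.80, 0.78]

model_mean, model_std = summarize(model_f1)
t_stat, dof = paired_t_statistic(model_f1, baseline_f1)
```

The resulting statistic would then be compared against a t distribution with `dof` degrees of freedom. With only a handful of seeds, a non-parametric alternative such as the Wilcoxon signed-rank test is also a common choice.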

minor comments (2)

[Methods] The symmetry property of the AV-InfoNCE loss is mentioned in the abstract but not explicitly defined or derived in the methods; a short equation or pseudocode would improve clarity.
Figure captions for the architecture diagram should explicitly label the conditional LoRA rank, scaling factors, and the stabilization gate pathway.
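As one reading of what the pseudocode requested in the first minor comment might look like, the sketch below implements a generic symmetric InfoNCE over a batch of paired clip embeddings. It follows the common contrastive formulation rather than anything confirmed from the paper; the temperature value and function names are assumptions.

```python
import numpy as np

def log_softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def symmetric_info_nce(audio_emb, video_emb, temperature=0.07):
    """Average of audio-to-video and video-to-audio InfoNCE losses.

    audio_emb, video_emb: (N, d) clip-level embeddings; row i of each comes
    from the same clip. The temperature is a hypothetical default.
    """
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    logits = a @ v.T / temperature             # (N, N); diagonal holds matched pairs
    loss_a2v = -np.mean(np.diag(log_softmax(logits, axis=1)))
    loss_v2a = -np.mean(np.diag(log_softmax(logits, axis=0)))
    return 0.5 * (loss_a2v + loss_v2a)

# Toy check: perfectly matched embeddings versus a shuffled pairing.
eye = np.eye(4)
loss_matched = symmetric_info_nce(eye, eye)
loss_shuffled = symmetric_info_nce(eye, np.roll(eye, 1, axis=0))
```

The loss is symmetric in the sense that swapping the audio and video arguments transposes the logit matrix and exchanges the two terms, leaving their average unchanged.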

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and commit to revisions that strengthen the presentation of the conditional LoRA mechanism, experimental rigor, and dataset details.

Referee: [Methods] Methods section (architecture description): the central claim that VideoMamba CLS-driven modulation of AudioMamba's Delta/B/C yields scene-aware dynamics lacks any mechanistic validation (parameter visualizations, correlation with scene events, or controlled misalignment ablations). Without such checks the reported gains cannot be confidently attributed to the proposed adaptation rather than the loss functions or dataset filtering.

Authors: We agree that mechanistic validation is required to attribute performance gains specifically to the CLS-guided conditional LoRA. In the revised manuscript we will add visualizations of the generated modulation vectors and the resulting changes to AudioMamba's Delta, B, and C parameters, together with quantitative correlations against visual scene events. We will also include controlled ablations that replace the video-guided modulation with random or misaligned conditioning to isolate the contribution of scene-aware adaptation from the loss functions and filtering. revision: yes
Referee: [Experiments] Experiments / Results tables: accuracy and F1 improvements are presented as single-point estimates with no error bars, multiple random seeds, or statistical significance tests against baselines. This weakens the claim of consistent outperformance on the audio-filtered NTU-CCTV and DVD subsets.

Authors: We acknowledge that single-run results limit confidence in the reported improvements. We will rerun all experiments using at least five random seeds, report mean accuracy and F1 scores with standard deviations as error bars, and add statistical significance tests (paired t-tests or Wilcoxon signed-rank tests) against each baseline in the updated tables. revision: yes
Referee: [Dataset] Dataset curation paragraph: the process for creating the audio-filtered clip-level subsets from temporal annotations is described only at high level. Missing details include exact filtering criteria, audio availability thresholds, and any resulting distribution shifts that could affect fair multimodal comparison.

Authors: We will expand the dataset curation paragraph with precise details: the minimum audio duration and RMS energy thresholds applied, the exact temporal alignment procedure between annotations and audio tracks, and a quantitative comparison of class balance and audio-visual correlation statistics before versus after filtering to demonstrate that the subsets remain representative for multimodal evaluation. revision: yes
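The filtering criteria promised in the last response could be expressed as a small predicate like the one below. The thresholds and function names are hypothetical, since neither this page nor the abstract gives actual values, and a real pipeline would read samples from the audio track rather than a Python list.

```python
import math

def rms(samples):
    """Root-mean-square energy of a mono waveform given as floats in [-1, 1]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def keep_clip(samples, sample_rate, min_duration_s=1.0, min_rms=0.01):
    """Keep a clip only if its audio is long enough and not near-silent.

    Both thresholds are placeholder values chosen for illustration.
    """
    duration_s = len(samples) / sample_rate
    return duration_s >= min_duration_s and rms(samples) >= min_rms

# Toy inputs: one second of a 440 Hz tone and one second of silence at 8 kHz.
sample_rate = 8000
tone = [0.5 * math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(sample_rate)]
silence = [0.0] * sample_rate

kept = [keep_clip(clip, sample_rate) for clip in (tone, silence)]
```

Logging how many clips of each class pass such a filter would give the before-and-after class balance comparison mentioned in the response.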

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper's central claims consist of an empirical architecture (VideoMamba CLS token producing conditional LoRA modulation for AudioMamba's Delta/B/C parameters) trained end-to-end with standard binary classification plus AV-InfoNCE losses on externally curated subsets of NTU-CCTV and DVD. Reported accuracies (88.63% / 75.77%) are outcomes of optimization on held-out data rather than quantities defined by the model's own equations or by self-citation chains. No self-definitional loops, fitted-input predictions, or load-bearing self-citations appear in the provided text; the derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard assumptions from the Mamba and LoRA literature plus the novel conditional modulation mechanism; no new physical entities are postulated.

free parameters (1)

Conditional LoRA rank and scaling factors
These control the modulation vectors produced from the video CLS token and are learned during training.

axioms (1)

Domain assumption: Mamba selective state-space models can be effectively adapted via external conditional signals for multimodal fusion without cross-attention
Invoked in the layer-wise steering design described in the abstract.

