CoLoRSMamba: Conditional LoRA-Steered Mamba for Supervised Multimodal Violence Detection
Pith reviewed 2026-05-13 21:06 UTC · model grok-4.3
The pith
Video CLS tokens steer audio Mamba parameters via conditional LoRA to improve multimodal violence detection accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CoLoRSMamba is a directional Video to Audio multimodal architecture that couples VideoMamba and AudioMamba through CLS-guided conditional LoRA. At each layer, the VideoMamba CLS token produces a channel-wise modulation vector and a stabilization gate that adapt the AudioMamba projections responsible for the selective state-space parameters (Delta, B, C), including the step-size pathway, yielding scene-aware audio dynamics without token-level cross-attention. Training combines binary classification with a symmetric AV-InfoNCE objective that aligns clip-level audio and video embeddings.
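To show where Delta, B, and C enter the audio dynamics, the generic selective state-space recurrence from the Mamba paper (Gu and Dao) can be written as below. This is a simplified, single-channel sketch of the standard formulation, not notation taken from the CoLoRSMamba paper; the projection names $W_\Delta$, $W_B$, $W_C$ and the bias $b_\Delta$ are illustrative.

```latex
\begin{aligned}
\Delta_t &= \operatorname{softplus}(W_\Delta x_t + b_\Delta), &
B_t &= W_B x_t, &
C_t &= W_C x_t, \\
\bar{A}_t &= \exp(\Delta_t A), &
\bar{B}_t &\approx \Delta_t B_t, \\
h_t &= \bar{A}_t h_{t-1} + \bar{B}_t x_t, &
y_t &= C_t h_t .
\end{aligned}
```

Under this reading, adapting the projections that produce $\Delta_t$, $B_t$, and $C_t$ changes the step size and the input and output maps of the audio recurrence, which is how video could influence audio dynamics without any token-level mixing.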
What carries the argument
CLS-guided conditional LoRA adapting AudioMamba's Delta, B, C parameters from VideoMamba CLS token at each layer
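A minimal sketch of the idea follows, in pure Python for self-containment. It makes several assumptions that this page does not confirm: the modulation vector scales the rank-r bottleneck of the LoRA update, the stabilization gate is a single sigmoid scalar, and both are linear functions of the video CLS token. The class name `ConditionalLoRAProjection` and the weights `W_mod` and `w_gate` are hypothetical.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class ConditionalLoRAProjection:
    """Base audio projection plus a low-rank update steered by a video CLS vector.

    Assumed form (not confirmed by the source):
        y = W x + g(c) * B (m(c) * (A x))
    where c is the video CLS token, m(c) a modulation vector, g(c) a scalar gate.
    """

    def __init__(self, d_in, d_out, d_cls, rank, seed=0):
        rng = random.Random(seed)
        self.W = rand_matrix(d_out, d_in, rng)      # base projection (e.g. for Delta, B, C)
        self.A = rand_matrix(rank, d_in, rng)       # LoRA down-projection
        self.B = rand_matrix(d_out, rank, rng)      # LoRA up-projection
        self.W_mod = rand_matrix(rank, d_cls, rng)  # CLS -> modulation vector (hypothetical)
        self.w_gate = [rng.uniform(-0.1, 0.1) for _ in range(d_cls)]  # CLS -> gate (hypothetical)

    def forward(self, x_audio, cls_video):
        base = matvec(self.W, x_audio)
        modulation = matvec(self.W_mod, cls_video)
        gate = sigmoid(sum(w * c for w, c in zip(self.w_gate, cls_video)))
        low_rank = [m * a for m, a in zip(modulation, matvec(self.A, x_audio))]
        update = matvec(self.B, low_rank)
        return [b + gate * u for b, u in zip(base, update)]


# Same audio feature, two different video contexts.
layer = ConditionalLoRAProjection(d_in=4, d_out=6, d_cls=3, rank=2)
x = [0.5, -1.0, 0.25, 2.0]
y_a = layer.forward(x, [1.0, 0.0, -1.0])
y_b = layer.forward(x, [-1.0, 2.0, 0.5])
```

In this sketch a zero CLS vector yields a zero modulation vector, so the layer falls back to the unadapted base projection. The extra cost per adapted projection is roughly `rank * (d_in + d_out + d_cls) + d_cls` parameters, which is one plausible reason such a scheme could stay lightweight.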
If this is right
- Outperforms audio-only, video-only, and multimodal baselines on audio-filtered NTU-CCTV and DVD subsets
- Achieves 88.63% accuracy and 86.24% F1-V on the audio-filtered NTU-CCTV subset
- Achieves 75.77% accuracy and 72.94% F1-V on the audio-filtered DVD subset
- Provides a better accuracy-efficiency tradeoff, with fewer parameters and FLOPs than several larger models
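For reference, the two reported metrics can be computed from binary confusion counts as sketched below. This assumes F1-V denotes the F1 score of the violent (positive) class, which the page does not state explicitly; the function name is hypothetical.

```python
def accuracy_and_f1(tp, fp, fn, tn):
    """Return (accuracy, F1 of the positive class) from confusion counts.

    tp/fp/fn/tn are counts with 'violent' treated as the positive class
    (an assumption about what F1-V means).
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, f1
```

Because accuracy counts true negatives and F1 does not, the two numbers can diverge on imbalanced subsets, which is why both are worth reporting.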
Where Pith is reading between the lines
- The method may extend to other video-audio tasks by replacing attention with parameter adaptation in state space models
- Curating audio-filtered subsets enables fairer testing of multimodal models in noisy real-world conditions
- Directional video-to-audio steering could reduce compute in surveillance pipelines compared to bidirectional fusion
Load-bearing premise
The video CLS token can reliably generate channel-wise modulation vectors and stabilization gates that adapt the audio parameters to produce scene-aware dynamics without misalignment or noise.
What would settle it
A standard multimodal baseline achieving equal or higher accuracy than 88.63% on NTU-CCTV and 75.77% on DVD using the same audio-filtered clip subsets with comparable or lower FLOPs would challenge the claimed advantage.
Original abstract
Violence detection benefits from audio, but real-world soundscapes can be noisy or weakly related to the visible scene. We present CoLoRSMamba, a directional Video to Audio multimodal architecture that couples VideoMamba and AudioMamba through CLS-guided conditional LoRA. At each layer, the VideoMamba CLS token produces a channel-wise modulation vector and a stabilization gate that adapt the AudioMamba projections responsible for the selective state-space parameters (Delta, B, C), including the step-size pathway, yielding scene-aware audio dynamics without token-level cross-attention. Training combines binary classification with a symmetric AV-InfoNCE objective that aligns clip-level audio and video embeddings. To support fair multimodal evaluation, we curate audio-filtered clip-level subsets of the NTU-CCTV and DVD datasets from temporal annotations, retaining only clips with available audio. On these subsets, CoLoRSMamba outperforms representative audio-only, video-only, and multimodal baselines, achieving 88.63% accuracy / 86.24% F1-V on NTU-CCTV and 75.77% accuracy / 72.94% F1-V on DVD. It further offers a favorable accuracy-efficiency tradeoff, surpassing several larger models with fewer parameters and FLOPs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents CoLoRSMamba, a directional multimodal architecture coupling VideoMamba and AudioMamba via CLS-token-guided conditional LoRA. At each layer the VideoMamba CLS token generates a channel-wise modulation vector and stabilization gate that adapt AudioMamba's selective SSM parameters (Delta, B, C). Training uses binary classification plus symmetric AV-InfoNCE. On audio-filtered clip-level subsets of NTU-CCTV and DVD the model reports 88.63% accuracy / 86.24% F1-V and 75.77% accuracy / 72.94% F1-V respectively, together with favorable parameter/FLOP counts versus baselines.
Significance. If the conditional LoRA mechanism demonstrably produces scene-aware audio dynamics aligned to visual events, the work would offer a lightweight, attention-free route to multimodal fusion that preserves Mamba's linear scaling. This would be valuable for real-time violence detection on resource-constrained devices where noisy audio-visual correspondence is common.
major comments (3)
- [Methods] Methods section (architecture description): the central claim that VideoMamba CLS-driven modulation of AudioMamba's Delta/B/C yields scene-aware dynamics lacks any mechanistic validation (parameter visualizations, correlation with scene events, or controlled misalignment ablations). Without such checks the reported gains cannot be confidently attributed to the proposed adaptation rather than the loss functions or dataset filtering.
- [Experiments] Experiments / Results tables: accuracy and F1 improvements are presented as single-point estimates with no error bars, multiple random seeds, or statistical significance tests against baselines. This weakens the claim of consistent outperformance on the audio-filtered NTU-CCTV and DVD subsets.
- [Dataset] Dataset curation paragraph: the process for creating the audio-filtered clip-level subsets from temporal annotations is described only at high level. Missing details include exact filtering criteria, audio availability thresholds, and any resulting distribution shifts that could affect fair multimodal comparison.
minor comments (2)
- [Methods] The symmetry property of the AV-InfoNCE loss is mentioned in the abstract but not explicitly defined or derived in the methods; a short equation or pseudocode would improve clarity.
- Figure captions for the architecture diagram should explicitly label the conditional LoRA rank, scaling factors, and the stabilization gate pathway.
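The first minor comment asks for a definition of the symmetric AV-InfoNCE loss. A common formulation averages an audio-to-video and a video-to-audio cross-entropy over a batch similarity matrix; the sketch below follows that generic form. Whether the paper uses exactly this form, cosine similarity, or this temperature is an assumption, and the function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0  # guard against zero vectors
    return [a / norm for a in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(z - m) for z in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_info_nce(audio, video, temperature=0.07):
    """Average of audio->video and video->audio InfoNCE over a batch.

    audio, video: lists of clip-level embeddings; row i of each is the same clip.
    temperature: assumed value, not taken from the paper.
    """
    a = [l2_normalize(x) for x in audio]
    v = [l2_normalize(x) for x in video]
    sim = [[dot(ai, vj) / temperature for vj in v] for ai in a]
    sim_t = [list(col) for col in zip(*sim)]  # transpose for the reverse direction
    return 0.5 * (cross_entropy_diag(sim) + cross_entropy_diag(sim_t))
```

Swapping the two arguments leaves the value unchanged, which is the sense in which the objective is symmetric.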
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and commit to revisions that strengthen the presentation of the conditional LoRA mechanism, experimental rigor, and dataset details.
- Referee: [Methods] Methods section (architecture description): the central claim that VideoMamba CLS-driven modulation of AudioMamba's Delta/B/C yields scene-aware dynamics lacks any mechanistic validation (parameter visualizations, correlation with scene events, or controlled misalignment ablations). Without such checks the reported gains cannot be confidently attributed to the proposed adaptation rather than the loss functions or dataset filtering.
  Authors: We agree that mechanistic validation is required to attribute performance gains specifically to the CLS-guided conditional LoRA. In the revised manuscript we will add visualizations of the generated modulation vectors and the resulting changes to AudioMamba's Delta, B, and C parameters, together with quantitative correlations against visual scene events. We will also include controlled ablations that replace the video-guided modulation with random or misaligned conditioning to isolate the contribution of scene-aware adaptation from the loss functions and filtering.
  Revision: yes
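The random and misaligned conditioning ablations could be set up along the following lines. This is a sketch under stated assumptions: CLS tokens are plain lists of floats, misalignment is done by a cyclic shift within the batch, and random conditioning uses unit Gaussian noise. None of these choices come from the paper, and all function names are hypothetical.

```python
import random


def misaligned_indices(batch_size, seed=0):
    """Return a permutation with no fixed points, so every audio clip is
    conditioned on a different clip's video CLS token."""
    if batch_size < 2:
        raise ValueError("need at least two clips to misalign")
    rng = random.Random(seed)
    shift = rng.randrange(1, batch_size)  # a non-zero cyclic shift is always a derangement
    return [(i + shift) % batch_size for i in range(batch_size)]


def misalign(cls_tokens, seed=0):
    """Pair each audio clip with the video CLS token of another clip in the batch."""
    indices = misaligned_indices(len(cls_tokens), seed)
    return [cls_tokens[j] for j in indices]


def random_conditioning(cls_tokens, seed=0):
    """Replace every CLS token with Gaussian noise of the same shape."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in token] for token in cls_tokens]
```

If accuracy with `misalign` or `random_conditioning` stays close to the full model, the gains would more plausibly come from the losses or the data filtering than from scene-aware steering.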
- Referee: [Experiments] Experiments / Results tables: accuracy and F1 improvements are presented as single-point estimates with no error bars, multiple random seeds, or statistical significance tests against baselines. This weakens the claim of consistent outperformance on the audio-filtered NTU-CCTV and DVD subsets.
  Authors: We acknowledge that single-run results limit confidence in the reported improvements. We will rerun all experiments using at least five random seeds, report mean accuracy and F1 scores with standard deviations as error bars, and add statistical significance tests (paired t-tests or Wilcoxon signed-rank tests) against each baseline in the updated tables.
  Revision: yes
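The multi-seed reporting promised here can be sketched with the standard library alone. The scores below are placeholder numbers for illustration only and are not results from the paper; the helper names are hypothetical, and only the paired t statistic is computed (a p-value would need a t distribution table or a statistics package).

```python
import math
import statistics


def summarize(scores):
    """Return (mean, sample standard deviation) over per-seed scores."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(model_scores, baseline_scores):
    """Return (t statistic, degrees of freedom) for paired per-seed scores."""
    diffs = [m - b for m, b in zip(model_scores, baseline_scores)]
    n = len(diffs)
    sd = statistics.stdev(diffs)
    if sd == 0.0:
        raise ValueError("all paired differences are identical; t is undefined")
    return statistics.mean(diffs) / (sd / math.sqrt(n)), n - 1


# Placeholder per-seed accuracies (NOT from the paper), five seeds each.
model = [0.80, 0.82, 0.81, 0.83, 0.79]
baseline = [0.78, 0.79, 0.80, 0.80, 0.77]

mean_acc, std_acc = summarize(model)
t_stat, dof = paired_t_statistic(model, baseline)
# With 4 degrees of freedom, the two-sided 5% critical value is about 2.776.
```

Pairing by seed matters: it removes seed-to-seed variance shared by both systems, so a small but consistent gap can still be detected with only five runs.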
- Referee: [Dataset] Dataset curation paragraph: the process for creating the audio-filtered clip-level subsets from temporal annotations is described only at high level. Missing details include exact filtering criteria, audio availability thresholds, and any resulting distribution shifts that could affect fair multimodal comparison.
  Authors: We will expand the dataset curation paragraph with precise details: the minimum audio duration and RMS energy thresholds applied, the exact temporal alignment procedure between annotations and audio tracks, and a quantitative comparison of class balance and audio-visual correlation statistics before versus after filtering to demonstrate that the subsets remain representative for multimodal evaluation.
  Revision: yes
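A filter of the kind described, based on minimum duration and RMS energy, might look like the sketch below. The threshold values, the function names, and the assumption that audio arrives as a list of samples scaled to [-1, 1] are all hypothetical; the paper's actual criteria are not given on this page.

```python
import math


def rms(samples):
    """Root-mean-square energy of a list of audio samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def keep_clip(samples, sample_rate, min_seconds=1.0, min_rms=0.01):
    """Keep a clip only if it is long enough and not effectively silent.

    min_seconds and min_rms are illustrative defaults, not the paper's values.
    """
    duration = len(samples) / sample_rate
    return duration >= min_seconds and rms(samples) >= min_rms


# One second of a 440 Hz tone, one second of silence, and a half-second tone.
rate = 16000
tone = [0.5 * math.sin(2 * math.pi * 440 * t / rate) for t in range(rate)]
silence = [0.0] * rate
short_tone = tone[: rate // 2]
decisions = [keep_clip(c, rate) for c in (tone, silence, short_tone)]
```

Logging how many clips each criterion rejects per class would give the before-and-after class balance comparison that the referee requests.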
Circularity Check
No significant circularity in derivation chain
The paper's central claims consist of an empirical architecture (VideoMamba CLS token producing conditional LoRA modulation for AudioMamba's Delta/B/C parameters) trained end-to-end with standard binary classification plus AV-InfoNCE losses on externally curated subsets of NTU-CCTV and DVD. Reported accuracies (88.63% / 75.77%) are outcomes of optimization on held-out data rather than quantities defined by the model's own equations or by self-citation chains. No self-definitional loops, fitted-input predictions, or load-bearing self-citations appear in the provided text; the derivation remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- Conditional LoRA rank and scaling factors
axioms (1)
- Domain assumption: Mamba selective state-space models can be effectively adapted via external conditional signals for multimodal fusion without cross-attention