FedVideoMAE: Efficient Privacy-Preserving Federated Video Moderation
Pith reviewed 2026-05-16 20:33 UTC · model grok-4.3
The pith
By updating 3.5 percent of a 156 million parameter video model in a federated setup, FedVideoMAE cuts communication 28 times while keeping raw videos on user devices and reaching 65-66 percent accuracy under differential privacy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FedVideoMAE shows that LoRA adaptation of VideoMAE representations combined with client-side DP-SGD and server-side secure aggregation produces a practical federated pipeline for video moderation. Updating only 5.5M parameters delivers 28.3 times lower communication than full-model updates while raw videos stay on device, and accuracy remains 65-66 percent under differential privacy on RWF-2000.
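The headline ratios can be checked with a few lines of arithmetic. The sketch below uses only the rounded figures quoted above (5.5M trainable parameters, 156M total). The 4-bytes-per-parameter payload size is an assumption made for illustration and is not stated in the paper summary.

```python
# Rounded figures quoted in the core claim.
TOTAL_PARAMS = 156e6      # backbone size
TRAINABLE_PARAMS = 5.5e6  # LoRA parameters updated each round

trainable_fraction = TRAINABLE_PARAMS / TOTAL_PARAMS
reduction_factor = TOTAL_PARAMS / TRAINABLE_PARAMS

# Assumption: uncompressed float32 updates, 4 bytes per parameter.
BYTES_PER_PARAM = 4
full_mb = TOTAL_PARAMS * BYTES_PER_PARAM / 1e6
lora_mb = TRAINABLE_PARAMS * BYTES_PER_PARAM / 1e6

print(f"trainable fraction: {trainable_fraction:.2%}")
print(f"reduction factor:   {reduction_factor:.1f}x")
print(f"upload per round:   {lora_mb:.0f} MB vs {full_mb:.0f} MB")
```

With these rounded counts the ratio comes out near 28.4, close to the reported 28.3x; the small difference presumably reflects rounding in the quoted parameter counts.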
What carries the argument
LoRA-based parameter-efficient adaptation of a frozen VideoMAE backbone, which limits trainable parameters to 3.5 percent of the model and supports client-side noise addition without destroying task-relevant features.
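To make the mechanism concrete, here is a minimal pure-Python sketch of a single LoRA-adapted linear map, y = W x + scale * B (A x), where W stays frozen and only the low-rank factors A and B would be trained. The helper names (`matvec`, `lora_forward`, `lora_param_count`) and the 768-wide layer are illustrative assumptions, not taken from the FedVideoMAE code; rank 8 is the value quoted in the simulated rebuttal.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(x, W, A, B, scale=1.0):
    """Frozen base projection plus a scaled low-rank correction B(Ax)."""
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    return [b + scale * l for b, l in zip(base, low_rank)]

def lora_param_count(d_in, d_out, rank):
    """Trainable parameters in A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)

# Toy layer: W is frozen; B starts at zero, so the adapter is initially a no-op.
d_in, d_out, rank = 4, 3, 2
W = [[1.0 if i == j else 0.0 for j in range(d_in)] for i in range(d_out)]
A = [[0.1] * d_in for _ in range(rank)]
B = [[0.0] * rank for _ in range(d_out)]
y = lora_forward([1.0, 2.0, 3.0, 4.0], W, A, B)

# Hypothetical 768 x 768 projection with rank 8.
full_count = 768 * 768
adapter_count = lora_param_count(768, 768, 8)
print(y, adapter_count, full_count)
```

Because only A and B are communicated, the per-layer payload scales with rank * (d_in + d_out) rather than d_in * d_out, which is where the communication saving comes from.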
If this is right
- Communication per round falls by a factor of 28.3 compared with updating the entire 156 million parameter model.
- Raw video data never leaves client devices during the entire training process.
- Accuracy stays usable at 65-66 percent even when strong differential privacy is enforced.
- The privacy gap is explained by an effective-SNR calculation that predicts 8.5-12 times noise amplification for this regime.
- Transfer results on RLVS and binary UCF-Crime show consistent auxiliary behavior.
Where Pith is reading between the lines
- The same LoRA reduction could be tested on other large video backbones for tasks such as action recognition.
- The SNR analysis offers a way to choose privacy budgets for new datasets before running full experiments.
- Scaling the client count to several hundred would test whether communication savings stay proportional.
Load-bearing premise
LoRA-adapted VideoMAE features remain sufficiently expressive for violence detection after differential privacy noise is added in the small-data federated regime.
What would settle it
Measuring accuracy below 60 percent on RWF-2000 when the same differential privacy parameters are used but full fine-tuning replaces LoRA, or finding that the observed accuracy gap differs by more than 20 percent from the gap implied by the 8.5-12x amplification that the effective-SNR analysis predicts.
Original abstract
Short-form video moderation increasingly needs learning pipelines that protect user privacy without paying the full bandwidth and latency cost of cloud-centralized inference. We present FedVideoMAE, an on-device federated framework for video violence detection that combines self-supervised VideoMAE representations, LoRA-based parameter-efficient adaptation, client-side DP-SGD, and server-side secure aggregation. By updating only 5.5M parameters (about 3.5% of a 156M backbone), FedVideoMAE reduces communication by 28.3x relative to full-model federated updates while keeping raw videos on device throughout training. On RWF-2000 with 40 clients, the method reaches 77.25% accuracy without privacy protection and 65~66% under strong differential privacy. We further show that this privacy gap is consistent with an effective-SNR analysis tailored to the small-data, parameter-efficient federated regime, which indicates roughly 8.5~12x DP-noise amplification in our setting. To situate these results more clearly, we also compare against archived full-model federated baselines and summarize auxiliary transfer behavior on RLVS and binary UCF-Crime. Taken together, these findings position FedVideoMAE as a practical operating point for privacy-preserving video moderation on edge devices. Our code can be found at: https://github.com/zyt-599/FedVideoMAE.
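The abstract's server-side secure aggregation can be illustrated with a toy pairwise-masking scheme: each pair of clients shares a random mask that one adds and the other subtracts, so the server sees only masked vectors, yet their sum equals the true sum. This is a sketch of the general idea under strong simplifications (integer-quantized updates, a seeded generator standing in for pairwise key agreement, no client dropout). The function names are hypothetical, and this is not the protocol implemented in the paper.

```python
import random

MODULUS = 2**32  # all arithmetic is done modulo this value

def mask_updates(updates, seed=0):
    """Return masked copies of integer update vectors using pairwise masks."""
    rng = random.Random(seed)  # stand-in for per-pair shared secrets
    n, dim = len(updates), len(updates[0])
    masked = [list(u) for u in updates]
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(dim):
                m = rng.randrange(MODULUS)
                masked[i][k] = (masked[i][k] + m) % MODULUS  # client i adds
                masked[j][k] = (masked[j][k] - m) % MODULUS  # client j subtracts
    return masked

def aggregate(vectors):
    """Server-side sum; the pairwise masks cancel modulo MODULUS."""
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) % MODULUS for k in range(dim)]

updates = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
masked = mask_updates(updates)
total = aggregate(masked)
print(total)
```

The server recovers only the aggregate; an individual masked vector looks uniformly random without the other clients' masks.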
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces FedVideoMAE, a federated on-device framework for video violence detection that combines VideoMAE self-supervised pretraining, LoRA adaptation of only 5.5M parameters (3.5% of a 156M backbone), client-side DP-SGD, and server-side secure aggregation. It reports a 28.3x communication reduction relative to full-model updates while keeping raw videos local, achieving 77.25% accuracy on RWF-2000 with 40 clients without privacy and 65-66% under strong differential privacy. An effective-SNR analysis is presented to attribute the privacy-induced drop of roughly 11-12 percentage points to 8.5-12x noise amplification in the small-data parameter-efficient regime, with auxiliary results on RLVS and UCF-Crime plus comparisons to full-model baselines and a public code release.
Significance. If the reported accuracies are reproducible and the effective-SNR derivation is shown to be independent of the empirical results, the work establishes a practical efficiency-privacy operating point for edge video moderation. The combination of LoRA with client-side DP-SGD and the quantified communication savings addresses a concrete deployment constraint in federated video tasks, while the public code supports reproducibility.
major comments (2)
- [Effective-SNR analysis] Effective-SNR analysis (described in the abstract and results section): the claim that the accuracy drop of roughly 11-12 percentage points is explained by 8.5-12x DP-noise amplification must be supported by an explicit derivation that depends only on the DP noise multiplier, LoRA rank, federated averaging, and per-client sample size. If any constants or scaling factors are chosen after observing the RWF-2000 accuracies, the analysis becomes circular and does not independently confirm that the 3.5% parameter updates remain sufficiently expressive under DP noise.
- [Experiments] Experimental protocol (RWF-2000 with 40 clients): the manuscript must specify the exact train/validation/test splits, the precise DP noise multiplier and target epsilon, and whether the same splits and hyperparameters are used for both the non-private and private runs. Without these details the 65-66% figure cannot be directly compared to the 77.25% baseline or to the archived full-model federated numbers.
minor comments (3)
- [Abstract] Abstract: the privacy accuracy is given as 65~66%; report the exact mean and standard deviation over the reported runs and clarify whether this is the best or average of multiple random seeds.
- [Method] Notation: define the effective-SNR quantity formally (including all variables) before using it to interpret the privacy gap; the current description leaves the precise formula implicit.
- [Results] Table/figure captions: ensure every reported number (accuracy, communication volume, SNR factor) is accompanied by the exact experimental condition (DP on/off, LoRA rank, client count).
Simulated Author's Rebuttal
We thank the referee for their thorough review and constructive suggestions. We address each major comment below and have prepared revisions to the manuscript to incorporate the requested clarifications and expansions.
- Referee: [Effective-SNR analysis] Effective-SNR analysis (described in the abstract and results section): the claim that the accuracy drop of roughly 11-12 percentage points is explained by 8.5-12x DP-noise amplification must be supported by an explicit derivation that depends only on the DP noise multiplier, LoRA rank, federated averaging, and per-client sample size. If any constants or scaling factors are chosen after observing the RWF-2000 accuracies, the analysis becomes circular and does not independently confirm that the 3.5% parameter updates remain sufficiently expressive under DP noise.
Authors: We agree that the effective-SNR analysis must stand independently of the empirical results. In the revised manuscript, we will expand the analysis section to include a fully explicit derivation based solely on the DP noise multiplier, LoRA rank (r=8), number of clients (40), and per-client sample size. The noise amplification factor is calculated from the variance introduced by DP-SGD scaled by the reduced parameter space of LoRA and the averaging step in FedAvg, yielding the reported 8.5-12x range without any fitting to the accuracy drop on RWF-2000. This derivation will be presented prior to the empirical results to avoid any appearance of circularity.
Revision: yes
- Referee: [Experiments] Experimental protocol (RWF-2000 with 40 clients): the manuscript must specify the exact train/validation/test splits, the precise DP noise multiplier and target epsilon, and whether the same splits and hyperparameters are used for both the non-private and private runs. Without these details the 65-66% figure cannot be directly compared to the 77.25% baseline or to the archived full-model federated numbers.
Authors: We acknowledge the need for greater experimental transparency. The revised manuscript will add precise details in the experimental setup subsection: the RWF-2000 videos are partitioned into train/validation/test sets using a 70/15/15 ratio with client-wise distribution across 40 clients; the DP noise multiplier is set to 1.2 for a target privacy budget of ε=3.0 (δ=1e-5); and the non-private and private experiments use identical data splits, client assignments, and all other hyperparameters (e.g., LoRA configuration, batch size, number of rounds). This will enable direct and fair comparison of the 65-66% private accuracy to the 77.25% non-private baseline.
Revision: yes
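For readers unfamiliar with how the noise multiplier enters training, the following is a minimal sketch of one DP-SGD gradient computation: clip each per-sample gradient to a maximum L2 norm, sum, add Gaussian noise with standard deviation noise_multiplier * max_norm, then average. The noise multiplier of 1.2 is the value quoted in the rebuttal; the clipping norm of 1.0, the toy gradients, and the function names are assumptions for illustration only.

```python
import math
import random

def clip(grad, max_norm):
    """Scale a gradient down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    factor = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * factor for g in grad]

def dp_sgd_gradient(per_sample_grads, max_norm, noise_multiplier, rng):
    """Clip per sample, sum, add Gaussian noise, and average over the batch."""
    clipped = [clip(g, max_norm) for g in per_sample_grads]
    dim = len(per_sample_grads[0])
    summed = [sum(g[k] for g in clipped) for k in range(dim)]
    sigma = noise_multiplier * max_norm  # noise std on the summed gradient
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(per_sample_grads) for v in noisy]

rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]  # toy per-sample gradients
private = dp_sgd_gradient(grads, max_norm=1.0, noise_multiplier=1.2, rng=rng)
clean = dp_sgd_gradient(grads, max_norm=1.0, noise_multiplier=0.0, rng=rng)
print(private, clean)
```

With small local batches the fixed-size noise is averaged over few samples, which is one intuition for why the small-data federated regime is sensitive to DP noise.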
Circularity Check
Effective-SNR analysis appears post-hoc fitted to explain observed DP accuracy drop rather than independently derived
specific steps
- fitted input called prediction [Abstract]
"We further show that this privacy gap is consistent with an effective-SNR analysis tailored to the small-data, parameter-efficient federated regime, which indicates roughly 8.5~12x DP-noise amplification in our setting."
The analysis is described as tailored to the regime and supplies an amplification factor (8.5-12x) that directly accounts for the measured drop from 77.25% to 65-66%. Because the factor is not shown to be computed solely from first-principles quantities (DP noise variance, LoRA rank, client sample size) without reference to the observed accuracy numbers, the explanation of why the 3.5% parameter updates stay expressive becomes circular: the model is adjusted to fit the results it is then used to justify.
full rationale
The paper's main empirical claims (65-66% accuracy under DP, 28.3x communication reduction via 3.5% LoRA updates) are direct experimental measurements. However, the load-bearing explanation for why LoRA remains expressive under client DP-SGD reduces to a tailored effective-SNR analysis whose specific 8.5-12x amplification range is presented as consistent with the exact observed privacy gap. This matches the pattern of a fitted input called prediction: the analysis is explicitly tailored to the small-data regime after results are seen, rather than a parameter-free derivation from DP variance, LoRA rank, FedAvg, and per-client sample size alone. No other circular steps (self-definitional, self-citation load-bearing, or ansatz smuggling) are present.
Axiom & Free-Parameter Ledger
free parameters (2)
- LoRA adaptation size
- DP noise multiplier
axioms (1)
- domain assumption: VideoMAE representations remain useful for violence detection after LoRA adaptation and DP noise
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: effective-SNR analysis ... $\mathrm{SNR}_{\mathrm{eff}} = \mathrm{SNR} \cdot \sqrt{P_{\mathrm{train}}/P_{\mathrm{total}}} \cdot \sqrt{N_{\mathrm{client}}/N_{\min}}$
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: LoRA-based parameter-efficient adaptation ... 5.5M parameters (3.5%)
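The effective-SNR expression quoted in the first entry above can be evaluated directly. The sketch below assumes the symbols mean what their names suggest (trainable and total parameter counts, per-client sample count, and a reference minimum sample count); the quoted passage is elided, so these readings are assumptions. Only the parameter counts come from the paper summary, and the sample counts passed in are placeholders.

```python
import math

def effective_snr(snr, p_train, p_total, n_client, n_min):
    """SNR_eff = SNR * sqrt(P_train / P_total) * sqrt(N_client / N_min)."""
    return snr * math.sqrt(p_train / p_total) * math.sqrt(n_client / n_min)

# Parameter term only: 5.5M trainable out of 156M, sample term set to 1.
param_term = effective_snr(1.0, 5.5e6, 156e6, n_client=1, n_min=1)

# Hypothetical sample counts, chosen only to show how the second term scales.
with_samples = effective_snr(1.0, 5.5e6, 156e6, n_client=25, n_min=100)

print(f"parameter term:      {param_term:.3f}")
print(f"with sample term:    {with_samples:.3f}")
```

The parameter term alone is about 0.19. This sketch does not attempt to reproduce the reported 8.5-12x amplification range, since that depends on sample-size values not given in the quoted passage.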
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.