Privacy Beyond Pixels: Latent Anonymization for Privacy-Preserving Video Understanding

Ishan Rajendrakumar Dave; Joseph Fioresi; Mubarak Shah

arxiv: 2511.08666 · v2 · submitted 2025-11-11 · 💻 cs.CV


Pith reviewed 2026-05-17 23:09 UTC · model grok-4.3

keywords: privacy preservation · latent anonymization · video foundation models · anonymizing adapter · action recognition · gender bias · temporal action detection · privacy leakage

The pith

A lightweight adapter removes private details from video model features while keeping their usefulness for action recognition and other tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a method to protect privacy in video understanding by working directly on the internal features extracted by foundation models instead of changing the raw video pixels. It adds a small Anonymizing Adapter Module that plugs into existing frozen encoders and trains with three objectives to lower the amount of sensitive information such as gender or skin color that leaks out. The approach reports a 35 percent drop in privacy leakage while performance on downstream tasks stays close to the original levels across action recognition, temporal detection, and anomaly detection benchmarks. This matters because video features are often shared or stored for analysis, and current pixel-level fixes require heavy retraining that does not scale to foundation models. The method also includes new ways to measure and reduce gender bias in these models.

Core claim

The paper shows that an Anonymizing Adapter Module inserted into frozen video encoders, trained using a clip-level self-supervised privacy objective to reduce mutual information, a co-training objective to keep utility on known tasks, and a latent consistency loss to support unseen tasks, achieves a 35 percent reduction in privacy leakage measured by sensitive attribute classifiers while delivering near-baseline results on action recognition using Kinetics400, UCF101, and HMDB51, temporal action detection on THUMOS14, and anomaly detection on UCF-Crime, plus mitigation of gender bias.

What carries the argument

The Anonymizing Adapter Module (AAM), a lightweight plug-in network that applies three training objectives to minimize private information in latent video features while preserving task utility.

If this is right

Privacy leakage on static clips for attributes like gender and skin color drops measurably without altering the input video.
Performance on action recognition, temporal action detection, and anomaly detection stays within a small margin of the baseline frozen encoder.
The adapter can be added to different video foundation models in a plug-and-play way without full model retraining or feature re-extraction.
New evaluation protocols reveal reduced gender bias in action recognition outputs after anonymization.
The latent consistency loss supports better generalization to tasks not seen during adapter training.
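
The plug-and-play point in the list above can be illustrated with a toy sketch: features that were already extracted and stored are passed through a small adapter, and the encoder is never run again. The residual linear form, the `make_residual_adapter` and `anonymize_cached` names, and the weight values are all hypothetical; the page does not describe the AAM's actual architecture.

```python
from typing import Callable, List

Vector = List[float]


def make_residual_adapter(weights: List[Vector], scale: float = 1.0) -> Callable[[Vector], Vector]:
    """Build a toy adapter z -> z + scale * (W @ z).

    Hypothetical stand-in for the AAM; the real module's design is not given here.
    """
    def adapter(z: Vector) -> Vector:
        # Matrix-vector product W @ z, written out in plain Python.
        delta = [sum(w * v for w, v in zip(row, z)) for row in weights]
        return [v + scale * d for v, d in zip(z, delta)]

    return adapter


def anonymize_cached(features: List[Vector], adapter: Callable[[Vector], Vector]) -> List[Vector]:
    """Apply the adapter to stored features; no encoder forward pass is needed."""
    return [adapter(z) for z in features]


# Hypothetical cached encoder outputs (two clips, two dimensions each).
cached = [[1.0, 2.0], [0.5, -1.0]]
# Illustrative weights that cancel the second dimension and keep the first.
adapter = make_residual_adapter([[0.0, 0.0], [0.0, -1.0]])
anonymized = anonymize_cached(cached, adapter)
```

The stored features are left untouched, so the same cache can be re-anonymized if the adapter is later retrained.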

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same latent-space approach might apply to other foundation models that process sequences, such as those handling audio or sensor data.
If the adapter generalizes reliably, organizations could safely share pre-computed video features for collaborative analysis without exposing personal details.
Testing the method on continuous real-world video streams rather than short clips would reveal whether temporal privacy leaks persist across longer durations.

Load-bearing premise

The three training objectives will keep removing private information from features of new encoders and unseen tasks without any retraining of the adapter.

What would settle it

Running the anonymized features through gender or clothing classifiers on a held-out video dataset and finding prediction accuracy remains near the level of the original unadapted features would show the privacy reduction does not hold.
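
The check described above can be sketched as a comparison of attacker scores before and after anonymization. This assumes the 35 percent figure is a relative reduction in an attacker metric, which the page does not state explicitly; the function names and all scores below are hypothetical placeholders, not results from the paper.

```python
def relative_leakage_reduction(baseline_score: float, anonymized_score: float) -> float:
    """Fraction by which the attacker's score drops after anonymization.

    Assumes a higher attacker score means more leakage (an assumption of this sketch).
    """
    if baseline_score <= 0:
        raise ValueError("baseline_score must be positive")
    return (baseline_score - anonymized_score) / baseline_score


def reduction_holds(baseline_score: float, anonymized_score: float,
                    min_reduction: float = 0.35) -> bool:
    """True if the held-out attacker loses at least `min_reduction` of its score."""
    return relative_leakage_reduction(baseline_score, anonymized_score) >= min_reduction


# Hypothetical held-out attacker scores, for illustration only.
failed_case = reduction_holds(baseline_score=0.80, anonymized_score=0.78)
passed_case = reduction_holds(baseline_score=0.80, anonymized_score=0.40)
```

In the first case the attacker stays near its original score, which is the outcome that would count against the privacy claim.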

Figures

Figures reproduced from arXiv: 2511.08666 by Ishan Rajendrakumar Dave, Joseph Fioresi, Mubarak Shah.

**Figure 2.** Workflow illustrating the SPLAVU training process. The process begins with a video clip [...]

**Figure 3.** Privacy-utility trade-off on PA-HMDB51. Privacy measured by attacker cMAP ( [...]

**Figure 4.** Graph showcasing the overall runtime and accuracy of 3 privacy-preserving methods. The [...]

Original abstract

We introduce a novel formulation of visual privacy preservation for video foundation models that operates entirely in the latent space. While spatio-temporal features learned by foundation models have deepened general understanding of video content, sharing or storing these extracted visual features for downstream tasks inadvertently reveals sensitive personal information like skin color, gender, or clothing. Current privacy preservation methods focus on input-pixel-level anonymization, which requires retraining the entire utility video model and results in task-specific anonymization, making them unsuitable for recent video foundational models. To address these challenges, we introduce a lightweight Anonymizing Adapter Module (AAM) that removes private information from video features while retaining general task utility. AAM can be applied in a plug-and-play fashion to frozen video encoders, minimizing the computational burden of finetuning and re-extracting features. Our framework employs three newly designed training objectives: (1) a clip-level self-supervised privacy objective to reduce mutual information between static clips, (2) a co-training objective to retain utility across seen tasks, and (3) a latent consistency loss for generalization on unseen tasks. Our extensive evaluations demonstrate a significant 35% reduction in privacy leakage while maintaining near-baseline utility performance across various downstream tasks: Action Recognition (Kinetics400, UCF101, HMDB51), Temporal Action Detection (THUMOS14), and Anomaly Detection (UCF-Crime). We also provide an analysis on anonymization for sensitive temporal attribute recognition. Additionally, we propose new protocols for assessing gender bias in action recognition models, showing that our method effectively mitigates such biases and promotes more equitable video understanding. https://joefioresi718.github.io/SPLAVU_webpage/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper gives a workable plug-and-play latent adapter that trims privacy leakage on video embeddings, but the main 35% figure comes from static-clip metrics and leaves temporal leakage under-tested.

The main thing here is a lightweight Anonymizing Adapter Module that sits on top of a frozen video encoder. It trains with three losses: a clip-level self-supervised privacy term that cuts mutual information, a co-training term that keeps utility on seen tasks, and a consistency term meant to help on unseen ones. That combination lets them report a 35% privacy reduction while staying close to baseline on Kinetics, UCF101, THUMOS14, and UCF-Crime tasks, plus some extra checks on temporal attributes and gender bias in action recognition. The plug-and-play aspect is the clearest practical win; it avoids retraining the whole foundation model, which matters for anyone who already has embeddings stored or shared downstream.

Referee Report

3 major / 3 minor

Summary. The manuscript introduces a lightweight Anonymizing Adapter Module (AAM) that operates in the latent space of frozen video foundation models to anonymize sensitive attributes (e.g., gender, skin color, clothing) while preserving utility. It employs three objectives: (1) a clip-level self-supervised privacy loss reducing mutual information between static clips and sensitive-attribute classifiers, (2) a co-training objective to retain performance on seen tasks, and (3) a latent consistency loss to support generalization to unseen tasks. The method is evaluated on action recognition (Kinetics400, UCF101, HMDB51), temporal action detection (THUMOS14), and anomaly detection (UCF-Crime), reporting a 35% privacy leakage reduction with near-baseline utility, plus separate analysis on temporal attributes and new protocols for assessing gender bias in action recognition.

Significance. If the privacy reduction generalizes beyond static-clip metrics and the adapter enables true plug-and-play use without retraining, the work would offer an efficient latent-space alternative to pixel-level anonymization for video foundation models, reducing the need for full model retraining when sharing features. The multi-dataset evaluation across action recognition, detection, and anomaly tasks, together with the introduction of gender-bias assessment protocols, strengthens the practical relevance. The significance would be higher if the temporal privacy analysis were integrated into the primary training objectives and metrics.

major comments (3)

The central privacy claim of 35% leakage reduction rests on a clip-level self-supervised objective that minimizes mutual information between static clips and sensitive-attribute classifiers. For video foundation models, however, attributes such as gait, behavioral sequences, or clothing dynamics are encoded across time; the latent consistency loss and co-training objective do not explicitly penalize these temporal correlations. Consequently, the reported reduction (measured on static-clip classifiers) may hold while real-world privacy leakage through downstream temporal models remains largely intact. This directly affects the soundness of the primary privacy claim for video data.
The main results report a 35% privacy reduction and near-baseline utility without error bars, standard deviations, or statistical significance tests, and without an ablation isolating the contribution of each of the three loss terms. These omissions make it difficult to determine whether the privacy-utility trade-off is robust or driven primarily by one objective.
The co-training objective and latent consistency loss are optimized jointly with the utility tasks. As a result, the reported utility numbers on seen and unseen tasks are partly fitted rather than reflecting pure out-of-distribution generalization, weakening the claim that the adapter supports plug-and-play use on new tasks without retraining.
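
The second major comment asks for error bars. A minimal sketch of the reporting it has in mind, using the standard library's `statistics` module: the `mean_and_std` helper and the run scores are hypothetical and only illustrate the form of the summary.

```python
import statistics
from typing import Sequence, Tuple


def mean_and_std(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation over repeated training runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return statistics.mean(scores), statistics.stdev(scores)


# Hypothetical top-1 accuracies from three seeds, for illustration only.
utility_runs = [0.70, 0.72, 0.74]
mean, std = mean_and_std(utility_runs)
summary = f"{mean:.3f} +/- {std:.3f}"
```

The same helper would apply to the privacy metric, so both axes of the trade-off carry a spread.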

minor comments (3)

Provide explicit details on how the mutual-information privacy metric was computed, including the architecture and training data distribution of the sensitive-attribute classifiers.
Clarify whether the adapter and base encoder require any task-specific fine-tuning for the temporal attribute analysis or for truly unseen downstream tasks, as the current description leaves this ambiguous.
The abstract states the method 'minimizes the computational burden of finetuning,' but the manuscript should quantify the parameter count and training cost of the AAM relative to full encoder fine-tuning to support this claim.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. We address each of the major comments point by point below.

Referee: The central privacy claim of 35% leakage reduction rests on a clip-level self-supervised objective that minimizes mutual information between static clips and sensitive-attribute classifiers. For video foundation models, however, attributes such as gait, behavioral sequences, or clothing dynamics are encoded across time; the latent consistency loss and co-training objective do not explicitly penalize these temporal correlations. Consequently, the reported reduction (measured on static-clip classifiers) may hold while real-world privacy leakage through downstream temporal models remains largely intact. This directly affects the soundness of the primary privacy claim for video data.

Authors: We appreciate this observation regarding the temporal aspects of privacy in video data. Our latent consistency loss is intended to maintain temporal coherence in the anonymized features, thereby reducing the risk of leakage through dynamic attributes. The manuscript also presents a separate analysis on temporal attribute anonymization. In the revised version, we will integrate temporal privacy evaluation more closely with the primary metrics and objectives to provide stronger evidence for the claim.

Revision: partial

Referee: The main results report a 35% privacy reduction and near-baseline utility without error bars, standard deviations, or statistical significance tests, and without an ablation isolating the contribution of each of the three loss terms. These omissions make it difficult to determine whether the privacy-utility trade-off is robust or driven primarily by one objective.

Authors: We concur that adding error bars, standard deviations from multiple runs, and statistical significance tests will enhance the credibility of the results. We will revise the main results to include these. We will also include an ablation study detailing the impact of each individual loss term on both privacy reduction and utility preservation.

Revision: yes

Referee: The co-training objective and latent consistency loss are optimized jointly with the utility tasks. As a result, the reported utility numbers on seen and unseen tasks are partly fitted rather than reflecting pure out-of-distribution generalization, weakening the claim that the adapter supports plug-and-play use on new tasks without retraining.

Authors: The training process involves joint optimization to balance privacy and utility on seen tasks, but the plug-and-play functionality is demonstrated by applying the fixed adapter to entirely new tasks and datasets without any retraining of the adapter or the foundation model. The utility results on unseen tasks reflect this zero-shot application of the adapter. We will revise the text to more clearly distinguish between the training phase and the inference-time plug-and-play usage.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity in the method's objectives or empirical claims

The paper proposes an Anonymizing Adapter Module trained with three objectives: a clip-level self-supervised privacy loss that reduces mutual information between static clips, a co-training loss to preserve utility on seen tasks, and a latent consistency loss to support generalization. The reported 35% privacy reduction and near-baseline utility on downstream tasks (Kinetics400, UCF101, HMDB51, THUMOS14, UCF-Crime) are presented as measured evaluation outcomes against external benchmarks and baselines, not as quantities that reduce by definition to the training objectives themselves. No self-citations, uniqueness theorems, or ansatzes imported from prior author work are invoked as load-bearing steps in the provided description. The framework remains self-contained with independent experimental validation on standard datasets, so the derivation chain does not collapse into its inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that the three training objectives can be balanced without task-specific retraining and that the privacy metric (mutual information on static clips) is a sufficient proxy for real privacy leakage. No explicit free parameters are named in the abstract, but the weighting of the three losses is implicitly fitted.

free parameters (1)

loss weighting coefficients
Relative weights among the privacy, utility co-training, and consistency losses must be chosen or tuned to achieve the reported trade-off.
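
A minimal sketch of what such a weighting could look like, assuming the three terms are combined as a plain weighted sum. The `total_loss` function, the weight names, and the default values are hypothetical; the page does not give the paper's actual combination or the sign convention of the privacy term.

```python
def total_loss(l_privacy: float, l_utility: float, l_consistency: float,
               w_privacy: float = 1.0, w_utility: float = 1.0,
               w_consistency: float = 1.0) -> float:
    """Weighted sum of the three adapter objectives.

    Assumes each term is already oriented so that lower is better;
    the weights are the free parameters this ledger entry refers to.
    """
    return (w_privacy * l_privacy
            + w_utility * l_utility
            + w_consistency * l_consistency)


# Illustrative loss values; doubling the consistency weight shifts the balance.
balanced = total_loss(0.5, 1.0, 0.25)
consistency_heavy = total_loss(0.5, 1.0, 0.25, w_consistency=2.0)
```

Sweeping one weight while fixing the other two is the simplest way to trace the privacy-utility trade-off that these coefficients control.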

axioms (1)

domain assumption: Reducing mutual information between static clips removes private identity information without harming motion-based task utility.
Invoked in the design of the clip-level self-supervised privacy objective.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear


clip-level self-supervised privacy objective to reduce mutual information between static clips ... NT-Xent contrastive loss ... latent consistency loss L_LC = ||f_E(x) - f_A(f_E(x))||_2^2
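
The two formulas named in the passage above can be sketched in plain Python. `latent_consistency_loss` follows the quoted expression L_LC = ||f_E(x) - f_A(f_E(x))||_2^2 directly. `nt_xent_pair_loss` is the standard NT-Xent form for one positive pair; how the paper feeds it into the privacy objective (including its sign) is not shown in this excerpt, so that part is an assumption, as are the function names and the temperature default.

```python
import math
from typing import Sequence


def latent_consistency_loss(encoder_feat: Sequence[float],
                            adapted_feat: Sequence[float]) -> float:
    """Squared L2 distance between f_E(x) and f_A(f_E(x))."""
    if len(encoder_feat) != len(adapted_feat):
        raise ValueError("feature vectors must have the same length")
    return sum((a - b) ** 2 for a, b in zip(encoder_feat, adapted_feat))


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nt_xent_pair_loss(anchor: Sequence[float], positive: Sequence[float],
                      negatives: Sequence[Sequence[float]],
                      temperature: float = 0.1) -> float:
    """Standard NT-Xent loss for one anchor/positive pair against negatives."""
    pos = math.exp(cosine_similarity(anchor, positive) / temperature)
    neg = sum(math.exp(cosine_similarity(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Toy vectors, for illustration only.
lc = latent_consistency_loss([1.0, 2.0], [1.0, 0.0])
contrastive = nt_xent_pair_loss([1.0, 0.0], [1.0, 0.0], negatives=[[0.0, 1.0]])
```

The consistency term grows as the adapter moves features away from the encoder output, which is why its weight trades off against the privacy term.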

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · 9 internal anchors

[1]

Revisiting Feature Prediction for Learning Visual Representations from Video

Adrien Bardes, Quentin Garrido, Jean Ponce, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual representations from video. arXiv preprint, 2024a. Adrien Bardes, Quentin Garrido, Jean Ponce, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learnin...

[2]

ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness

Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A Wichmann, and Wieland Brendel. Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv preprint arXiv:1811.12231.

[3]

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017a. Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzyńska, Susanne Westphal, Heuna Kim, Valentin...

[4]

Privacy-preserving visual localization with event cameras

Junho Kim, Young Min Kim, Yicheng Wu, Ramzi Zahreddine, Weston A Welge, Gurunandan Krishnan, Sizhuo Ma, and Jian Wang. Privacy-preserving visual localization with event cameras. arXiv preprint arXiv:2212.03177.

[5]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.

[6]

Sb-bench: Stereotype bias benchmark for large multimodal models

Vishal Narnaware, Ashmal Vayani, Rohit Gupta, Sirnam Swetha, and Mubarak Shah. Sb-bench: Stereotype bias benchmark for large multimodal models. arXiv preprint arXiv:2502.08779.

[7]

U-net: Convolutional networks for biomedical image segmentation

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pp. 234–241. Springer, 2015.

[8]

Distributionally Robust Neural Networks for Group Shifts: On the Importance of Regularization for Worst-Case Generalization

Shiori Sagawa, Pang Wei Koh, Tatsunori B Hashimoto, and Percy Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. arXiv preprint arXiv:1911.08731.

[9]

UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402.

[10]

Image representations learned with unsupervised pre-training contain human-like biases

Ryan Steed and Aylin Caliskan. Image representations learned with unsupervised pre-training contain human-like biases. In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency, pp. 701–713, 2021.

[11]

ConvNets and ImageNet Beyond Accuracy: Understanding Mistakes and Uncovering Biases

Pierre Stock and Moustapha Cisse. Convnets and imagenet beyond accuracy: Explanations, bias detection, adversarial examples and model criticism. arXiv preprint arXiv:1711.11443.

[12]

Investigating bias in image classification using model explanations

Schrasing Tong and Lalana Kagal. Investigating bias in image classification using model explanations. arXiv preprint arXiv:2012.05463.

[13]

InternVideo: General Video Foundation Models via Generative and Discriminative Learning

Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, et al. Internvideo: General video foundation models via generative and discriminative learning. arXiv preprint arXiv:2212.03191.

[14]

Predictive Inequity in Object Detection

Benjamin Wilson, Judy Hoffman, and Jamie Morgenstern. Predictive inequity in object detection. arXiv preprint arXiv:1902.11097.

[15]

Fedprompt: Communication-efficient and privacy-preserving prompt tuning in federated learning

Haodong Zhao, Wei Du, Fangqi Li, Peixuan Li, and Gongshen Liu. Fedprompt: Communication-efficient and privacy-preserving prompt tuning in federated learning. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE, 2023.

[16]

Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints

Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. Men also like shopping: Reducing gender bias amplification using corpus-level constraints. arXiv preprint arXiv:1707.09457.

[17]

Each clip lasts around 10 seconds and is labeled with a single action class

Appendix overview. Section A: Dataset details; Section B: Implementation details; Section C: Additional experiment details; Section D: Training algorithm. A. Dataset details. Kinetics400 (Carreira & Zisserman, 2017) is a large-scale video action dataset of YouTube videos which includes 400 human action classes with at least 400 video clips for each action. Each clip...

[18]

For training, only random resized crop and random horizontal flip with probability 50% are utilized

B.2 Inputs and augmentations. All inputs consist of 16-frame clips sampled with consecutive frames, resized to a spatial resolution of 224×224. For training, only random resized crop and random horizontal flip with probability 50% are utilized. In validation, the short edge is resized to 256, then a center crop of 224×224 is taken. Standard ImageNet Krizhevsky ...

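
The input pipeline quoted above can be sketched as index and box arithmetic, without any image library. The function names and the random choice of start frame are assumptions made for illustration; the excerpt only states 16 consecutive frames, a 50% horizontal flip in training, and a short-edge resize to 256 followed by a 224×224 center crop in validation.

```python
import random
from typing import List, Optional, Tuple


def sample_consecutive_clip(num_video_frames: int, clip_len: int = 16,
                            rng: Optional[random.Random] = None) -> List[int]:
    """Pick `clip_len` consecutive frame indices (start chosen at random here)."""
    if num_video_frames < clip_len:
        raise ValueError("video is shorter than the requested clip length")
    rng = rng or random.Random()
    start = rng.randint(0, num_video_frames - clip_len)
    return list(range(start, start + clip_len))


def should_flip(rng: random.Random, probability: float = 0.5) -> bool:
    """Training-time decision for the random horizontal flip."""
    return rng.random() < probability


def resize_short_edge(height: int, width: int, short: int = 256) -> Tuple[int, int]:
    """New (height, width) after scaling so the shorter side equals `short`."""
    scale = short / min(height, width)
    return round(height * scale), round(width * scale)


def center_crop_box(height: int, width: int, crop: int = 224) -> Tuple[int, int, int, int]:
    """(top, left, bottom, right) of a centered square crop."""
    top = (height - crop) // 2
    left = (width - crop) // 2
    return top, left, top + crop, left + crop


# Validation geometry for a hypothetical 480x640 source video.
indices = sample_consecutive_clip(100, rng=random.Random(0))
resized = resize_short_edge(480, 640)
box = center_crop_box(*resized)
```

The random resized crop used in training is omitted here because the excerpt does not give its scale or aspect-ratio ranges.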
[19]

Since we want to ensure generalization across unseen tasks, action recognition is the only training utility task in this experiment. We found more solid support that with increasing the weightage of the latent consistency loss, performance maintains on the action-related utility, however, it significantly increases performance on the unseen anomaly det...

[20]

In this instance, our method did not make use of precomputed features, yet it still completed ≈3.5x faster than the next fastest method. The combined accuracy/privacy metric is simply defined as follows: y_t = (acc_t + (1 - priv_t)) * 0.5 (14), where t is the current time, y_t is the performance score, and acc_t and priv_t are the top-1 accuracy scores and privacy pre...

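
Equation (14) above is simple enough to check by hand. A minimal sketch, assuming both inputs are fractions in [0, 1] (the excerpt is truncated before it says so); the `combined_score` name and the example values are hypothetical.

```python
def combined_score(acc: float, priv: float) -> float:
    """Equation (14): y_t = (acc_t + (1 - priv_t)) * 0.5.

    Higher accuracy and lower privacy leakage both raise the score.
    """
    return (acc + (1.0 - priv)) * 0.5


# Illustrative values only: 75% utility accuracy, 25% attacker score.
example = combined_score(acc=0.75, priv=0.25)
best_case = combined_score(acc=1.0, priv=0.0)
worst_case = combined_score(acc=0.0, priv=1.0)
```

Because the two terms are weighted equally, a one-point gain in accuracy offsets a one-point rise in the privacy score.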
