Detecting Backdoor Attacks on Deep Neural Networks by Activation Clustering
Abstract
While machine learning (ML) models are increasingly trusted to make decisions in a wide range of areas, the safety of systems that use such models has become a growing concern. In particular, ML models are often trained on data from potentially untrustworthy sources, giving adversaries the opportunity to manipulate them by inserting carefully crafted samples into the training set. Recent work has shown that this type of attack, called a poisoning attack, allows adversaries to insert backdoors or trojans into the model, enabling malicious behavior to be triggered at inference time by simple external backdoor triggers, with only black-box access to the model itself. Detecting this type of attack is challenging because the unexpected behavior occurs only when a backdoor trigger, which is known only to the adversary, is present. Model users, whether they train on the data directly or obtain a pre-trained model from a catalog, may therefore be unable to guarantee the safe operation of their ML-based system. In this paper, we propose a novel approach to backdoor detection and removal for neural networks. Through extensive experimental results, we demonstrate its effectiveness for neural networks classifying text and images. To the best of our knowledge, this is the first methodology capable of detecting poisonous data crafted to insert backdoors, and of repairing the model, that does not require a verified and trusted dataset.
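The abstract names the method but not its mechanics: the core idea is to cluster the activations that each class's training samples produce at the network's last hidden layer, since poisoned samples tend to activate differently from clean ones, and to flag classes whose activations split into two well-separated clusters. Below is a minimal sketch of that pipeline, assuming scikit-learn; the function name, the silhouette threshold of 0.15, and the smaller-cluster heuristic are illustrative assumptions for this sketch, not the authors' reference implementation.

```python
# Minimal sketch of the activation-clustering idea, assuming scikit-learn.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import FastICA
from sklearn.metrics import silhouette_score

def find_suspect_classes(activations_by_class, n_components=10,
                         silhouette_threshold=0.15):
    """Flag classes whose last-hidden-layer activations split cleanly in two.

    activations_by_class maps each class label to an (n_samples, n_units)
    array of activations collected from that class's training samples.
    Returns {label: boolean mask over that class's samples marking the
    smaller (suspect) cluster}.
    """
    suspects = {}
    for label, acts in activations_by_class.items():
        # Reduce dimensionality before clustering; the paper uses ICA for this.
        reduced = FastICA(n_components=n_components).fit_transform(acts)
        # k = 2: a poisoned class tends to split into clean vs. backdoored samples.
        assignments = KMeans(n_clusters=2, n_init=10).fit_predict(reduced)
        # Well-separated clusters (high silhouette score) are the red flag;
        # the 0.15 threshold here is an illustrative assumption, not a value
        # taken from the paper.
        if silhouette_score(reduced, assignments) > silhouette_threshold:
            smaller = int(np.argmin(np.bincount(assignments)))
            suspects[label] = assignments == smaller
    return suspects
```

Taking the smaller cluster as the suspect set relies on poisoned samples being a minority of their target class; the paper also analyzes clusters by relative size, and removal then amounts to retraining after excluding the flagged samples.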
Forward citations
Cited by 8 Pith papers
- McNdroid: A Longitudinal Multimodal Benchmark for Robust Drift Detection in Android Malware
  McNdroid is a new longitudinal multimodal benchmark showing that Android malware detectors degrade over time but multimodal approaches maintain better performance across long temporal gaps.
- Undetectable Backdoors in Model Parameters: Hiding Sparse Secrets in High Dimensions
  Sparse Backdoor plants a provably undetectable backdoor in neural network weights via structured sparse perturbations and isotropic Gaussian dithering, with detection hardness reduced to Sparse PCA.
- Follow My Eyes: Backdoor Attacks on VLM-based Scanpath Prediction
  Backdoor attacks on VLM-based scanpath predictors can redirect fixations toward chosen objects or inflate durations using input-conditioned triggers that evade cluster detection, and no tested defense blocks them with...
- DETOUR: A Practical Backdoor Attack against Object Detection
  DETOUR enables practical backdoor attacks on object detectors by training with rescaled semantic triggers from real-world objects placed at multiple locations to exploit the trigger radiating effect for reliable activ...
- CSC: Turning the Adversary's Poison against Itself
  CSC identifies backdoored samples via early-epoch latent clustering and conceals them by relabeling to a virtual class, driving attack success rates near zero on benchmarks with little clean accuracy loss.
- PASTA: A Patch-Agnostic Twofold-Stealthy Backdoor Attack on Vision Transformers
  PASTA enables patch-agnostic backdoor activation in ViTs via multi-location trigger insertion during training and bi-level optimization, achieving 99.13% average attack success with large gains in visual/attention ste...
- A Patch-based Cross-view Regularized Framework for Backdoor Defense in Multimodal Large Language Models
  A patch-augmented cross-view regularization method reduces backdoor attack success rates in multimodal LLMs by enforcing output differences between original and perturbed views while using entropy constraints to prese...
- DeepSeek Robustness Against Semantic-Character Dual-Space Mutated Prompt Injection
  Dual-space semantic-character mutations on prompts achieve higher misuse success rates against DeepSeek than single-space attacks alone.