On the (Statistical) Detection of Adversarial Examples

Kathrin Grosse; Michael Backes; Nicolas Papernot; Patrick McDaniel; Praveen Manoharan

arXiv: 1702.06280 · v2 · pith:KGA5AYORnew · submitted 2017-02-21 · 💻 cs.CR · cs.LG · stat.ML

classification 💻 cs.CR · cs.LG · stat.ML

keywords adversarial · inputs · examples · model · statistical · detection · approach · data

Machine Learning (ML) models are applied in a variety of tasks such as network intrusion detection or malware classification. Yet, these models are vulnerable to a class of malicious inputs known as adversarial examples. These are slightly perturbed inputs that are classified incorrectly by the ML model. The mitigation of these adversarial inputs remains an open problem. As a step towards understanding adversarial examples, we show that they are not drawn from the same distribution as the original data, and can thus be detected using statistical tests. Using this knowledge, we introduce a complementary approach to identify specific inputs that are adversarial. Specifically, we augment our ML model with an additional output, in which the model is trained to classify all adversarial inputs. We evaluate our approach on multiple adversarial example crafting methods (including the fast gradient sign and saliency map methods) with several datasets. The statistical test flags sample sets containing adversarial inputs confidently at sample sizes between 10 and 100 data points. Furthermore, our augmented model either detects adversarial examples as outliers with high accuracy (> 80%) or increases the adversary's cost (the perturbation added) by more than 150%. In this way, we show that statistical properties of adversarial examples are essential to their detection.
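The idea of flagging a sample set with a statistical test can be illustrated with a small two-sample test. The abstract does not name the test, so the sketch below assumes a kernel Maximum Mean Discrepancy (MMD) statistic with a permutation-based p-value. The function names (`rbf_kernel`, `mmd2`, `permutation_test`), the bandwidth default, and the synthetic Gaussian data are all hypothetical choices made for illustration; the shifted sample merely stands in for a set of adversarial inputs and is not produced by a real attack.

```python
import numpy as np

def rbf_kernel(a, b, gamma):
    # Gaussian (RBF) kernel between every row of a and every row of b.
    sq = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def mmd2(x, y, gamma):
    # Biased estimate of the squared MMD between samples x and y.
    return (rbf_kernel(x, x, gamma).mean()
            + rbf_kernel(y, y, gamma).mean()
            - 2.0 * rbf_kernel(x, y, gamma).mean())

def permutation_test(x, y, gamma=None, n_perm=500, seed=0):
    # Null hypothesis: x and y are drawn from the same distribution.
    rng = np.random.default_rng(seed)
    pooled = np.vstack([x, y])
    if gamma is None:
        gamma = 1.0 / pooled.shape[1]  # assumed bandwidth heuristic
    observed = mmd2(x, y, gamma)
    n = len(x)
    count = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        # Re-split the pooled data at random and recompute the statistic.
        if mmd2(pooled[idx[:n]], pooled[idx[n:]], gamma) >= observed:
            count += 1
    # The +1 terms keep the p-value strictly positive.
    return observed, (count + 1) / (n_perm + 1)

# Synthetic stand-ins: 50 "clean" points and 50 points shifted by 0.5 per feature.
rng = np.random.default_rng(1)
clean = rng.normal(size=(50, 20))
shifted = rng.normal(size=(50, 20)) + 0.5
stat, p_value = permutation_test(clean, shifted)
print(f"MMD^2 = {stat:.4f}, p-value = {p_value:.4f}")
```

A small p-value indicates that the two sets are unlikely to share a distribution. Note that a test of this kind works on sets of inputs rather than on individual ones, which is why the abstract pairs it with a per-input detector.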

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

A Unified Perspective on Adversarial Membership Manipulation in Vision Models
cs.CV 2026-04 conditional novelty 8.0

Adversarial perturbations reliably fabricate membership signals in vision-model MIAs, separated by a gradient-norm collapse trajectory that enables robust detection and inference.

Local Hessian Spectral Filtering for Robust Intrinsic Dimension Estimation
cs.LG 2026-05 unverdicted novelty 7.0

LHSD uses spectral filtering on the log-density Hessian to isolate tangent directions from noise and estimate local intrinsic dimension scalably via Stochastic Lanczos Quadrature.

MirrorCheck: Efficient Adversarial Defense for Vision-Language Models
cs.CV 2024-06 unverdicted novelty 7.0

MirrorCheck detects adversarial attacks on VLMs via T2I regeneration for semantic consistency checks, using stochastic model selection and one-time perturbations for robustness against adaptive attacks.

Stateful Detection of Black-Box Adversarial Attacks
cs.CR 2019-07 unverdicted novelty 7.0

The paper argues for stateful defenses over stateless ones to detect adversarial example generation via query history and introduces query blinding as a counter-attack.

Whispers in the Machine: Confidentiality in Agentic Systems
cs.CR 2024-02 unverdicted novelty 6.0

Systematic testing of ten LLM agents across 20 tool scenarios and 14 attacks finds universal vulnerability to prompt injection enabling data exfiltration, with tooling amplifying leakage.

Baseline Defenses for Adversarial Attacks Against Aligned Language Models
cs.LG 2023-09 conditional novelty 6.0

Baseline defenses including perplexity-based detection, input preprocessing, and adversarial training offer partial robustness to text adversarial attacks on LLMs, with challenges arising from weak discrete optimizers.

MalPurifier: Enhancing Android Malware Detection with Adversarial Purification against Evasion Attacks
cs.CR 2023-12 unverdicted novelty 5.0

MalPurifier combines diversified adversarial perturbations, protective noise injection, and a denoising autoencoder with dual loss to defend Android malware detectors, reporting over 90.91% robust accuracy against 37 ...