On Evaluating Adversarial Robustness

Aleksander Madry; Alexey Kurakin; Anish Athalye; Dimitris Tsipras; Ian Goodfellow; Jonas Rauber; Nicholas Carlini; Nicolas Papernot; Wieland Brendel

arxiv: 1902.06705 · v2 · pith:NJEVEL4Inew · submitted 2019-02-18 · 💻 cs.LG · cs.CR · stat.ML

classification 💻 cs.LG · cs.CR · stat.ML

keywords defenses · adversarial · evaluating · examples · accepted · adaptive · advice · amount

Correctly evaluating defenses against adversarial examples has proven to be extremely difficult. Despite the significant amount of recent work attempting to design defenses that withstand adaptive attacks, few have succeeded; most papers that propose defenses are quickly shown to be incorrect. We believe a large contributing factor is the difficulty of performing security evaluations. In this paper, we discuss the methodological foundations, review commonly accepted best practices, and suggest new methods for evaluating defenses to adversarial examples. We hope that both researchers developing defenses as well as readers and reviewers who wish to understand the completeness of an evaluation consider our advice in order to avoid common pitfalls.

Forward citations

Cited by 10 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Fortifying Time Series: DTW-Certified Robust Anomaly Detection
cs.LG 2026-05 unverdicted novelty 8.0

First DTW-certified robust anomaly detection for time series via randomized smoothing adapted through an l_p-to-DTW lower-bound transformation.
Low Rank Adaptation for Adversarial Perturbation
cs.LG 2026-04 unverdicted novelty 7.0

Adversarial perturbations possess an inherently low-rank structure that enables more efficient and effective black-box adversarial attacks via subspace projection.
Stateful Detection of Black-Box Adversarial Attacks
cs.CR 2019-07 unverdicted novelty 7.0

The paper argues for stateful defenses over stateless ones to detect adversarial example generation via query history and introduces query blinding as a counter-attack.
On the (In-)Security of the Shuffling Defense in the Transformer Secure Inference
cs.CR 2026-05 conditional novelty 6.0

An attack aligns differently shuffled intermediate activations from secure Transformer inference queries to recover model weights with low error using roughly one dollar of queries.
When AI reviews science: Can we trust the referee?
cs.AI 2026-04 unverdicted novelty 6.0

AI peer review systems are vulnerable to prompt injections, prestige biases, assertion strength effects, and contextual poisoning, as demonstrated by a new attack taxonomy and causal experiments on real conference sub...
Latent Instruction Representation Alignment: defending against jailbreaks, backdoors and undesired knowledge in LLMs
cs.LG 2026-04 unverdicted novelty 6.0

LIRA aligns latent instruction representations in LLMs to defend against jailbreaks, backdoors, and undesired knowledge, blocking over 99% of PEZ attacks and achieving optimal WMDP forgetting.
FABLE: A Localized, Targeted Adversarial Attack on Weather Forecasting Models
cs.LG 2025-05 conditional novelty 6.0

FABLE applies 3D discrete wavelet decomposition to generate localized adversarial perturbations that steer deep learning weather forecasting models toward chosen forecast outcomes while keeping inputs close to the originals.
Baseline Defenses for Adversarial Attacks Against Aligned Language Models
cs.LG 2023-09 conditional novelty 6.0

Baseline defenses including perplexity-based detection, input preprocessing, and adversarial training offer partial robustness to text adversarial attacks on LLMs, with challenges arising from weak discrete optimizers.
Scaling Laws for Reward Model Overoptimization
cs.LG 2022-10 unverdicted novelty 6.0

Synthetic measurements show that gold-standard performance degrades according to distinct functional forms when optimizing proxy reward models via RL or best-of-n, with coefficients scaling smoothly by reward model pa...
Position: Mind the Gap-AI Security and the Limits of Current Reporting Standards
cs.CR 2024-12 unverdicted novelty 3.0

Existing AI security incident reporting practices are misaligned with AI system characteristics, leaving key issues like IP treatment and vulnerability ownership unresolved and necessitating specialized standards as A...