On Evaluating Adversarial Robustness

Nicholas Carlini , Anish Athalye , Nicolas Papernot , Wieland Brendel , Jonas Rauber , Dimitris Tsipras , Ian Goodfellow , Aleksander Madry

show 1 more author

Alexey Kurakin

Authors on Pith no claims yet

classification 💻 cs.LG cs.CRstat.ML

keywords defensesadversarialevaluatingexamplesacceptedadaptiveadviceamount

0 comments

read the original abstract

Correctly evaluating defenses against adversarial examples has proven to be extremely difficult. Despite the significant amount of recent work attempting to design defenses that withstand adaptive attacks, few have succeeded; most papers that propose defenses are quickly shown to be incorrect. We believe a large contributing factor is the difficulty of performing security evaluations. In this paper, we discuss the methodological foundations, review commonly accepted best practices, and suggest new methods for evaluating defenses to adversarial examples. We hope that both researchers developing defenses as well as readers and reviewers who wish to understand the completeness of an evaluation consider our advice in order to avoid common pitfalls.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Fortifying Time Series: DTW-Certified Robust Anomaly Detection
cs.LG 2026-05 unverdicted novelty 8.0

First DTW-certified robust anomaly detection for time series via randomized smoothing adapted through an l_p-to-DTW lower-bound transformation.
Low Rank Adaptation for Adversarial Perturbation
cs.LG 2026-04 unverdicted novelty 7.0

Adversarial perturbations possess an inherently low-rank structure that enables more efficient and effective black-box adversarial attacks via subspace projection.
On the (In-)Security of the Shuffling Defense in the Transformer Secure Inference
cs.CR 2026-05 conditional novelty 6.0

An attack aligns differently shuffled intermediate activations from secure Transformer inference queries to recover model weights with low error using roughly one dollar of queries.
When AI reviews science: Can we trust the referee?
cs.AI 2026-04 unverdicted novelty 6.0

AI peer review systems are vulnerable to prompt injections, prestige biases, assertion strength effects, and contextual poisoning, as demonstrated by a new attack taxonomy and causal experiments on real conference sub...
Latent Instruction Representation Alignment: defending against jailbreaks, backdoors and undesired knowledge in LLMs
cs.LG 2026-04 unverdicted novelty 6.0

LIRA aligns latent instruction representations in LLMs to defend against jailbreaks, backdoors, and undesired knowledge, blocking over 99% of PEZ attacks and achieving optimal WMDP forgetting.
Baseline Defenses for Adversarial Attacks Against Aligned Language Models
cs.LG 2023-09 conditional novelty 6.0

Baseline defenses including perplexity-based detection, input preprocessing, and adversarial training offer partial robustness to text adversarial attacks on LLMs, with challenges arising from weak discrete optimizers.