
arxiv: 1902.06705 · v2 · pith:NJEVEL4Inew · submitted 2019-02-18 · 💻 cs.LG · cs.CR · stat.ML

On Evaluating Adversarial Robustness

classification 💻 cs.LG cs.CR stat.ML
keywords defenses adversarial evaluating examples accepted adaptive advice amount

Correctly evaluating defenses against adversarial examples has proven to be extremely difficult. Despite the significant amount of recent work attempting to design defenses that withstand adaptive attacks, few have succeeded; most papers that propose defenses are quickly shown to be incorrect. We believe a large contributing factor is the difficulty of performing security evaluations. In this paper, we discuss the methodological foundations, review commonly accepted best practices, and suggest new methods for evaluating defenses to adversarial examples. We hope that both researchers developing defenses as well as readers and reviewers who wish to understand the completeness of an evaluation consider our advice in order to avoid common pitfalls.



Forward citations

Cited by 10 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Fortifying Time Series: DTW-Certified Robust Anomaly Detection

    cs.LG 2026-05 unverdicted novelty 8.0

    First DTW-certified robust anomaly detection for time series via randomized smoothing adapted through an l_p-to-DTW lower-bound transformation.

  2. Low Rank Adaptation for Adversarial Perturbation

    cs.LG 2026-04 unverdicted novelty 7.0

    Adversarial perturbations possess an inherently low-rank structure that enables more efficient and effective black-box adversarial attacks via subspace projection.

  3. Stateful Detection of Black-Box Adversarial Attacks

    cs.CR 2019-07 unverdicted novelty 7.0

    The paper argues for stateful defenses over stateless ones to detect adversarial example generation via query history and introduces query blinding as a counter-attack.

  4. On the (In-)Security of the Shuffling Defense in the Transformer Secure Inference

    cs.CR 2026-05 conditional novelty 6.0

    An attack aligns differently shuffled intermediate activations from secure Transformer inference queries to recover model weights with low error using roughly one dollar of queries.

  5. When AI reviews science: Can we trust the referee?

    cs.AI 2026-04 unverdicted novelty 6.0

    AI peer review systems are vulnerable to prompt injections, prestige biases, assertion strength effects, and contextual poisoning, as demonstrated by a new attack taxonomy and causal experiments on real conference sub...

  6. Latent Instruction Representation Alignment: defending against jailbreaks, backdoors and undesired knowledge in LLMs

    cs.LG 2026-04 unverdicted novelty 6.0

    LIRA aligns latent instruction representations in LLMs to defend against jailbreaks, backdoors, and undesired knowledge, blocking over 99% of PEZ attacks and achieving optimal WMDP forgetting.

  7. FABLE: A Localized, Targeted Adversarial Attack on Weather Forecasting Models

    cs.LG 2025-05 conditional novelty 6.0

    FABLE applies 3D discrete wavelet decomposition to generate localized adversarial perturbations that steer deep learning weather forecasting models toward chosen forecast outcomes while keeping inputs close to the originals.

  8. Baseline Defenses for Adversarial Attacks Against Aligned Language Models

    cs.LG 2023-09 conditional novelty 6.0

    Baseline defenses including perplexity-based detection, input preprocessing, and adversarial training offer partial robustness to text adversarial attacks on LLMs, with challenges arising from weak discrete optimizers.

  9. Scaling Laws for Reward Model Overoptimization

    cs.LG 2022-10 unverdicted novelty 6.0

    Synthetic measurements show that gold-standard performance degrades according to distinct functional forms when optimizing proxy reward models via RL or best-of-n, with coefficients scaling smoothly by reward model pa...

  10. Position: Mind the Gap-AI Security and the Limits of Current Reporting Standards

    cs.CR 2024-12 unverdicted novelty 3.0

    Existing AI security incident reporting practices are misaligned with AI system characteristics, leaving key issues like IP treatment and vulnerability ownership unresolved and necessitating specialized standards as A...