Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples
We identify obfuscated gradients, a kind of gradient masking, as a phenomenon that leads to a false sense of security in defenses against adversarial examples. While defenses that cause obfuscated gradients appear to defeat iterative optimization-based attacks, we find defenses relying on this effect can be circumvented. We describe characteristic behaviors of defenses exhibiting the effect, and for each of the three types of obfuscated gradients we discover, we develop attack techniques to overcome it. In a case study, examining non-certified white-box-secure defenses at ICLR 2018, we find obfuscated gradients are a common occurrence, with 7 of 9 defenses relying on obfuscated gradients. Our new attacks successfully circumvent 6 completely, and 1 partially, in the original threat model each paper considers.
Forward citations
Cited by 8 Pith papers
- The Statistical Cost of Adaptation in Multi-Source Transfer Learning
  Multi-source transfer learning incurs an intrinsic adaptation cost that can exceed one, with phase transitions separating regimes where bias-agnostic estimators match oracle performance from those where they cannot.
- When AI reviews science: Can we trust the referee?
  AI peer review systems are vulnerable to prompt injections, prestige biases, assertion strength effects, and contextual poisoning, as demonstrated by a new attack taxonomy and causal experiments on real conference sub...
- Agent Security is a Systems Problem
  Agent security must be treated as a systems problem by viewing the AI model as untrusted and applying established systems security principles to enforce invariants.
- Auto-ART: Structured Literature Synthesis and Automated Adversarial Robustness Testing
  Auto-ART delivers the first structured synthesis of adversarial robustness consensus plus an executable multi-norm testing framework that flags gradient masking in 92% of cases on RobustBench and reveals a 23.5 pp rob...
- Connecting Lyapunov Control Theory to Adversarial Attacks
  Connects Lyapunov control theory to a provable defense against weaker adversarial attacks on neural networks.
- Why Blocking Targeted Adversarial Perturbations Impairs the Ability to Learn
  Defensive distillation blocks non-targeted adversarial attacks but cannot block targeted ones without preventing the network from learning via its input gradient.
- Agent Security is a Systems Problem
  The paper argues that agent security is best addressed as a systems problem by applying principles from operating systems, networks, and formal methods rather than relying solely on model robustness improvements.
- Using Intuition from Empirical Properties to Simplify Adversarial Training Defense
  Modifications to single-step adversarial training based on empirical properties of iterative methods improve accuracy by up to 16.93% against iterative attacks while reducing training cost by 28.75%.