Adversarial Attacks on Neural Network Policies

Ian Goodfellow; Nicolas Papernot; Pieter Abbeel; Sandy Huang; Yan Duan

arXiv: 1702.02284 · v1 · pith:LRF3K7ZFnew · submitted 2017-02-08 · cs.LG · cs.CR · stat.ML

Sandy Huang, Nicolas Papernot, Ian Goodfellow, Yan Duan, Pieter Abbeel

classification cs.LG cs.CR stat.ML

keywords adversarial attacks policies adversaries learning network neural performance

Machine learning classifiers are known to be vulnerable to inputs maliciously constructed by adversaries to force misclassification. Such adversarial examples have been extensively studied in the context of computer vision applications. In this work, we show adversarial attacks are also effective when targeting neural network policies in reinforcement learning. Specifically, we show existing adversarial example crafting techniques can be used to significantly degrade test-time performance of trained policies. Our threat model considers adversaries capable of introducing small perturbations to the raw input of the policy. We characterize the degree of vulnerability across tasks and training algorithms, for a subclass of adversarial-example attacks in white-box and black-box settings. Regardless of the learned task or training algorithm, we observe a significant drop in performance, even with small adversarial perturbations that do not interfere with human perception. Videos are available at http://rll.berkeley.edu/adversarial.
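
The abstract does not name the crafting technique, so the following is only an illustrative sketch of the threat model it describes: a small, bounded perturbation to the raw observation that changes the action a trained policy selects. It uses the fast gradient sign method (FGSM), one widely known adversarial-example technique, on a hypothetical two-action linear softmax policy. The weights, observation, and epsilon are made-up values chosen for illustration, and all function names are assumptions rather than anything taken from the paper.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_probs(weights, obs):
    # Hypothetical linear policy: logits = W @ obs, then softmax.
    logits = [sum(w * o for w, o in zip(row, obs)) for row in weights]
    return softmax(logits)

def greedy_action(weights, obs):
    probs = policy_probs(weights, obs)
    return max(range(len(probs)), key=probs.__getitem__)

def sign(v):
    return (v > 0) - (v < 0)

def fgsm_perturb(weights, obs, epsilon):
    # Loss: cross-entropy against the action the policy currently prefers.
    probs = policy_probs(weights, obs)
    action = greedy_action(weights, obs)
    # Gradient of the loss w.r.t. the logits is probs - onehot(action).
    dlogits = [p - (1.0 if i == action else 0.0) for i, p in enumerate(probs)]
    # Chain rule through the linear layer: grad_obs = W^T @ dlogits.
    grad = [sum(weights[i][j] * dlogits[i] for i in range(len(weights)))
            for j in range(len(obs))]
    # Step of size epsilon in the sign of the gradient (L-infinity bounded).
    return [o + epsilon * sign(g) for o, g in zip(obs, grad)]

# Made-up example values (assumptions, not from the paper).
W = [[1.0, -1.0], [-1.0, 1.0]]
obs = [0.55, 0.45]
adv_obs = fgsm_perturb(W, obs, epsilon=0.1)

print(greedy_action(W, obs), greedy_action(W, adv_obs))  # prints: 0 1
```

In this toy case a perturbation of at most 0.1 per input dimension is enough to flip the greedy action. It says nothing about the magnitudes or performance drops reported in the paper, which concern deep policies trained on full reinforcement learning tasks.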

Forward citations

Cited by 11 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Efficient Preference Poisoning Attack on Offline RLHF
cs.LG 2026-05 unverdicted novelty 8.0

Label-flip attacks on log-linear DPO reduce to binary sparse approximation problems that can be solved efficiently by lattice-based and binary matching pursuit methods with recovery guarantees.
Corruption-robust Offline Multi-agent Reinforcement Learning From Human Feedback
cs.LG 2026-03 unverdicted novelty 7.0

Introduces robust estimators for linear Markov games in offline MARLHF that achieve O(ε^{1-o(1)}) or O(√ε) bounds on Nash or CCE gaps under uniform or unilateral coverage.
A Speculative GLRT-Backed Approach: Robust Deep Learning-Based Array Processing
eess.SP 2025-12 unverdicted novelty 7.0

A speculative DL classifier validated by GLRT on spatially robust second-order statistics provides adversarially resilient array processing.
How Vulnerable Is My Learned Policy? Universal Adversarial Perturbation Attacks On Modern Behavior Cloning Policies
cs.LG 2025-02 unverdicted novelty 7.0

Modern imitation learning methods including Diffusion Policy and Implicit Behavior Cloning are highly vulnerable to universal adversarial perturbations, with successful black-box transfer attacks across algorithms.
MirrorCheck: Efficient Adversarial Defense for Vision-Language Models
cs.CV 2024-06 unverdicted novelty 7.0

MirrorCheck detects adversarial attacks on VLMs via T2I regeneration for semantic consistency checks, using stochastic model selection and one-time perturbations for robustness against adaptive attacks.
When Actions Disappear: Adversarial Action Removal in Self-Play Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

Adversarial action removal in self-play RL inflicts greater damage than random masking or learned perturbations, persists across algorithms and domains, transfers between agents, and resists recovery through extended ...
Density-Ratio Weighted Behavioral Cloning: Learning Control Policies from Corrupted Datasets
cs.LG 2025-10 conditional novelty 6.0

Weighted BC estimates trajectory density ratios from a clean reference set via binary discrimination and reweights the BC loss to converge to the clean expert policy with finite-sample bounds independent of contaminat...
Wolfpack Adversarial Attack for Robust Multi-Agent Reinforcement Learning
cs.LG 2025-02 unverdicted novelty 6.0

Wolfpack attack framework disrupts MARL cooperation by targeting initial and assisting agents; WALL trains robust policies against it with reported experimental gains.
Threats to Arabic Handwriting Recognition: Investigating Black-Box Adversarial Attacks on embedded ConvNet models
cs.CV 2026-05 conditional novelty 5.0

Black-box attacks, especially Pixle, reach 99-100% success on Arabic handwriting ConvNet models across two benchmark datasets while preserving character structure.
Learning to Cope with Adversarial Attacks
cs.LG 2019-06 unverdicted novelty 5.0

MLAH agent in deep RL demonstrates hierarchical coping mechanisms and improved reward maintenance under spaced adversarial attacks, at the expense of stability.
SoK: A Comprehensive Analysis of the Current Status of Neural Tangent Generalization Attacks with Research Directions
cs.LG 2026-05 accept novelty 3.0

NTGA is the first clean-label generalization attack under black-box settings but is vulnerable to adversarial training and image transformations, with newer attacks outperforming it.