arXiv preprint arXiv:1702.02284

Adversarial attacks on neural network policies · 2017 · cs.LG · arXiv 1702.02284

9 Pith papers cite this work. Polarity classification is still indexing.

abstract

Machine learning classifiers are known to be vulnerable to inputs maliciously constructed by adversaries to force misclassification. Such adversarial examples have been extensively studied in the context of computer vision applications. In this work, we show adversarial attacks are also effective when targeting neural network policies in reinforcement learning. Specifically, we show existing adversarial example crafting techniques can be used to significantly degrade test-time performance of trained policies. Our threat model considers adversaries capable of introducing small perturbations to the raw input of the policy. We characterize the degree of vulnerability across tasks and training algorithms, for a subclass of adversarial-example attacks in white-box and black-box settings. Regardless of the learned task or training algorithm, we observe a significant drop in performance, even with small adversarial perturbations that do not interfere with human perception. Videos are available at http://rll.berkeley.edu/adversarial.
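The threat model above (small perturbations to a policy's raw input) can be illustrated with one widely used crafting technique, the fast gradient sign method (FGSM). This is a minimal sketch under stated assumptions, not the authors' implementation. The policy is a hypothetical linear softmax policy π(a|s) = softmax(Ws + b), the helper names (`policy_probs`, `fgsm_policy_attack`) are invented for this example, observations are assumed to lie in [0, 1], and the perturbation budget is an ℓ∞ bound `eps`. The loss is the cross-entropy between the policy's output and a one-hot vector on its currently preferred action, so the step pushes probability mass away from that action.

```python
import numpy as np


def softmax(z):
    # Numerically stable softmax over a 1-D logit vector.
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()


def policy_probs(W, b, obs):
    # Hypothetical linear softmax policy: pi(a | s) = softmax(W @ s + b).
    return softmax(W @ obs + b)


def fgsm_policy_attack(W, b, obs, eps, lo=0.0, hi=1.0):
    """Untargeted FGSM on the raw observation under an l_inf budget eps."""
    p = policy_probs(W, b, obs)
    # One-hot target on the action the clean policy currently prefers.
    y = np.zeros_like(p)
    y[np.argmax(p)] = 1.0
    # Cross-entropy CE = -log p[y]; with logits z = W s + b,
    # dCE/dz = p - y, so the input gradient is W.T @ (p - y).
    grad = W.T @ (p - y)
    # Step along the gradient sign to increase the loss, then clip
    # back into the assumed valid observation range [lo, hi].
    adv = obs + eps * np.sign(grad)
    return np.clip(adv, lo, hi)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_actions, obs_dim = 4, 16
    W = rng.normal(size=(n_actions, obs_dim))
    b = np.zeros(n_actions)
    obs = rng.uniform(0.2, 0.8, size=obs_dim)  # interior, so clipping is inactive

    adv = fgsm_policy_attack(W, b, obs, eps=0.05)
    a = int(np.argmax(policy_probs(W, b, obs)))
    print("max |perturbation|:", np.abs(adv - obs).max())
    print("p(a) clean:", policy_probs(W, b, obs)[a])
    print("p(a) adversarial:", policy_probs(W, b, adv)[a])
```

Because the logits here are affine in the observation and cross-entropy is convex in the logits, an unclipped FGSM step always increases the loss whenever the input gradient is nonzero. In other words, it lowers the probability of the originally preferred action. For deep policies this holds only approximately. For a black-box variant, the same function could be called with the weights of a separately trained surrogate policy, and the resulting observation could then be fed to the target policy. This relies on transferability, which this sketch does not guarantee.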

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Efficient Preference Poisoning Attack on Offline RLHF

cs.LG · 2026-05-04 · unverdicted · novelty 8.0

Label-flip attacks on log-linear DPO reduce to binary sparse approximation problems that can be solved efficiently by lattice-based and binary matching pursuit methods with recovery guarantees.

Corruption-robust Offline Multi-agent Reinforcement Learning From Human Feedback

cs.LG · 2026-03-30 · unverdicted · novelty 7.0

Introduces robust estimators for linear Markov games in offline MARLHF that achieve O(ε^{1-o(1)}) or O(√ε) bounds on Nash or CCE gaps under uniform or unilateral coverage.

A Speculative GLRT-Backed ApproachRobust Deep Learning-Based Array Processing

eess.SP · 2025-12-10 · unverdicted · novelty 7.0

A speculative DL classifier validated by GLRT on spatially robust second-order statistics provides adversarially resilient array processing.

How Vulnerable Is My Learned Policy? Universal Adversarial Perturbation Attacks On Modern Behavior Cloning Policies

cs.LG · 2025-02-06 · unverdicted · novelty 7.0

Modern imitation learning methods including Diffusion Policy and Implicit Behavior Cloning are highly vulnerable to universal adversarial perturbations, with successful black-box transfer attacks across algorithms.

When Actions Disappear: Adversarial Action Removal in Self-Play Reinforcement Learning

cs.LG · 2026-05-04 · unverdicted · novelty 6.0

Adversarial action removal in self-play RL inflicts greater damage than random masking or learned perturbations, persists across algorithms and domains, transfers between agents, and resists recovery through extended training.

Density-Ratio Weighted Behavioral Cloning: Learning Control Policies from Corrupted Datasets

cs.LG · 2025-10-01 · conditional · novelty 6.0

Weighted BC estimates trajectory density ratios from a clean reference set via binary discrimination and reweights the BC loss to converge to the clean expert policy with finite-sample bounds independent of contamination rate.

Wolfpack Adversarial Attack for Robust Multi-Agent Reinforcement Learning

cs.LG · 2025-02-05 · unverdicted · novelty 6.0

Wolfpack attack framework disrupts MARL cooperation by targeting initial and assisting agents; WALL trains robust policies against it with reported experimental gains.

Threats to Arabic Handwriting Recognition: Investigating Black-Box Adversarial Attacks on embedded ConvNet models

cs.CV · 2026-05-18 · conditional · novelty 5.0

Black-box attacks, especially Pixle, reach 99-100% success on Arabic handwriting ConvNet models across two benchmark datasets while preserving character structure.

SoK: A Comprehensive Analysis of the Current Status of Neural Tangent Generalization Attacks with Research Directions

cs.LG · 2026-05-12 · accept · novelty 3.0

NTGA is the first clean-label generalization attack under black-box settings but is vulnerable to adversarial training and image transformations, with newer attacks outperforming it.

citing papers explorer

Showing 9 of 9 citing papers.

Efficient Preference Poisoning Attack on Offline RLHF · cs.LG · 2026-05-04 · unverdicted · none · ref 124
Corruption-robust Offline Multi-agent Reinforcement Learning From Human Feedback · cs.LG · 2026-03-30 · unverdicted · none · ref 5
A Speculative GLRT-Backed ApproachRobust Deep Learning-Based Array Processing · eess.SP · 2025-12-10 · unverdicted · none · ref 39
How Vulnerable Is My Learned Policy? Universal Adversarial Perturbation Attacks On Modern Behavior Cloning Policies · cs.LG · 2025-02-06 · unverdicted · none · ref 28
When Actions Disappear: Adversarial Action Removal in Self-Play Reinforcement Learning · cs.LG · 2026-05-04 · unverdicted · none · ref 2
Density-Ratio Weighted Behavioral Cloning: Learning Control Policies from Corrupted Datasets · cs.LG · 2025-10-01 · conditional · none · ref 7
Wolfpack Adversarial Attack for Robust Multi-Agent Reinforcement Learning · cs.LG · 2025-02-05 · unverdicted · none · ref 6
Threats to Arabic Handwriting Recognition: Investigating Black-Box Adversarial Attacks on embedded ConvNet models · cs.CV · 2026-05-18 · conditional · none · ref 23
SoK: A Comprehensive Analysis of the Current Status of Neural Tangent Generalization Attacks with Research Directions · cs.LG · 2026-05-12 · accept · none · ref 42