arxiv: 1702.02284 · v1 · pith:LRF3K7ZFnew · submitted 2017-02-08 · 💻 cs.LG · cs.CR · stat.ML

Adversarial Attacks on Neural Network Policies

classification 💻 cs.LG cs.CR stat.ML
keywords adversarial attacks policies adversaries learning network neural performance

Machine learning classifiers are known to be vulnerable to inputs maliciously constructed by adversaries to force misclassification. Such adversarial examples have been extensively studied in the context of computer vision applications. In this work, we show adversarial attacks are also effective when targeting neural network policies in reinforcement learning. Specifically, we show existing adversarial example crafting techniques can be used to significantly degrade test-time performance of trained policies. Our threat model considers adversaries capable of introducing small perturbations to the raw input of the policy. We characterize the degree of vulnerability across tasks and training algorithms, for a subclass of adversarial-example attacks in white-box and black-box settings. Regardless of the learned task or training algorithm, we observe a significant drop in performance, even with small adversarial perturbations that do not interfere with human perception. Videos are available at http://rll.berkeley.edu/adversarial.
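The threat model in the abstract, a small perturbation added to the raw policy input at test time, can be illustrated with a minimal sketch. The abstract does not name a specific crafting technique, so the fast gradient sign method (FGSM) used here is an assumption, as are the toy linear softmax policy and the names `policy_probs`, `fgsm_perturb`, `W`, `b`, and `eps`. It is not the authors' implementation.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D logit vector.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def policy_probs(W, b, x):
    # Hypothetical toy policy: action probabilities from a linear layer.
    return softmax(W @ x + b)

def fgsm_perturb(W, b, x, eps):
    # Treat the policy's own preferred action as the label, then take one
    # signed-gradient step that increases the cross-entropy loss on it.
    p = policy_probs(W, b, x)
    y = int(np.argmax(p))
    onehot = np.zeros_like(p)
    onehot[y] = 1.0
    # Gradient of -log p[y] with respect to x for a linear softmax policy.
    grad_x = W.T @ (p - onehot)
    # Each input coordinate moves by at most eps (L-infinity bound).
    return x + eps * np.sign(grad_x), y

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 16))   # 4 actions, 16 input features (assumed sizes)
b = np.zeros(4)
x = rng.normal(size=16)        # stand-in for a raw observation
eps = 0.05                     # assumed perturbation budget

x_adv, y = fgsm_perturb(W, b, x, eps)
p_clean = policy_probs(W, b, x)
p_adv = policy_probs(W, b, x_adv)
print("preferred action:", y)
print("probability before and after:", p_clean[y], p_adv[y])
```

In this sketch the adversary needs the policy parameters to compute the gradient, which corresponds to the white-box setting mentioned in the abstract. A black-box variant would compute the same step on a separately trained surrogate policy and rely on transfer.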

This paper has not been read by Pith yet.

Forward citations

Cited by 11 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Efficient Preference Poisoning Attack on Offline RLHF

    cs.LG 2026-05 unverdicted novelty 8.0

    Label-flip attacks on log-linear DPO reduce to binary sparse approximation problems that can be solved efficiently by lattice-based and binary matching pursuit methods with recovery guarantees.

  2. Corruption-robust Offline Multi-agent Reinforcement Learning From Human Feedback

    cs.LG 2026-03 unverdicted novelty 7.0

    Introduces robust estimators for linear Markov games in offline MARLHF that achieve O(ε^{1-o(1)}) or O(√ε) bounds on Nash or CCE gaps under uniform or unilateral coverage.

  3. A Speculative GLRT-Backed Approach: Robust Deep Learning-Based Array Processing

    eess.SP 2025-12 unverdicted novelty 7.0

    A speculative DL classifier validated by GLRT on spatially robust second-order statistics provides adversarially resilient array processing.

  4. How Vulnerable Is My Learned Policy? Universal Adversarial Perturbation Attacks On Modern Behavior Cloning Policies

    cs.LG 2025-02 unverdicted novelty 7.0

    Modern imitation learning methods including Diffusion Policy and Implicit Behavior Cloning are highly vulnerable to universal adversarial perturbations, with successful black-box transfer attacks across algorithms.

  5. MirrorCheck: Efficient Adversarial Defense for Vision-Language Models

    cs.CV 2024-06 unverdicted novelty 7.0

    MirrorCheck detects adversarial attacks on VLMs via T2I regeneration for semantic consistency checks, using stochastic model selection and one-time perturbations for robustness against adaptive attacks.

  6. When Actions Disappear: Adversarial Action Removal in Self-Play Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    Adversarial action removal in self-play RL inflicts greater damage than random masking or learned perturbations, persists across algorithms and domains, transfers between agents, and resists recovery through extended ...

  7. Density-Ratio Weighted Behavioral Cloning: Learning Control Policies from Corrupted Datasets

    cs.LG 2025-10 conditional novelty 6.0

    Weighted BC estimates trajectory density ratios from a clean reference set via binary discrimination and reweights the BC loss to converge to the clean expert policy with finite-sample bounds independent of contaminat...

  8. Wolfpack Adversarial Attack for Robust Multi-Agent Reinforcement Learning

    cs.LG 2025-02 unverdicted novelty 6.0

    Wolfpack attack framework disrupts MARL cooperation by targeting initial and assisting agents; WALL trains robust policies against it with reported experimental gains.

  9. Threats to Arabic Handwriting Recognition: Investigating Black-Box Adversarial Attacks on embedded ConvNet models

    cs.CV 2026-05 conditional novelty 5.0

    Black-box attacks, especially Pixle, reach 99-100% success on Arabic handwriting ConvNet models across two benchmark datasets while preserving character structure.

  10. Learning to Cope with Adversarial Attacks

    cs.LG 2019-06 unverdicted novelty 5.0

    MLAH agent in deep RL demonstrates hierarchical coping mechanisms and improved reward maintenance under spaced adversarial attacks, at the expense of stability.

  11. SoK: A Comprehensive Analysis of the Current Status of Neural Tangent Generalization Attacks with Research Directions

    cs.LG 2026-05 accept novelty 3.0

    NTGA is the first clean-label generalization attack under black-box settings but is vulnerable to adversarial training and image transformations, with newer attacks outperforming it.