Label-flip attacks on log-linear DPO reduce to binary sparse approximation problems that can be solved efficiently by lattice-based and binary matching pursuit methods with recovery guarantees.
arXiv preprint arXiv:1702.02284 , year=
9 Pith papers cite this work. Polarity classification is still indexing.
abstract
Machine learning classifiers are known to be vulnerable to inputs maliciously constructed by adversaries to force misclassification. Such adversarial examples have been extensively studied in the context of computer vision applications. In this work, we show adversarial attacks are also effective when targeting neural network policies in reinforcement learning. Specifically, we show existing adversarial example crafting techniques can be used to significantly degrade test-time performance of trained policies. Our threat model considers adversaries capable of introducing small perturbations to the raw input of the policy. We characterize the degree of vulnerability across tasks and training algorithms, for a subclass of adversarial-example attacks in white-box and black-box settings. Regardless of the learned task or training algorithm, we observe a significant drop in performance, even with small adversarial perturbations that do not interfere with human perception. Videos are available at http://rll.berkeley.edu/adversarial.
citation-role summary
citation-polarity summary
roles
background 1polarities
background 1representative citing papers
Introduces robust estimators for linear Markov games in offline MARLHF that achieve O(ε^{1-o(1)}) or O(√ε) bounds on Nash or CCE gaps under uniform or unilateral coverage.
A speculative DL classifier validated by GLRT on spatially robust second-order statistics provides adversarially resilient array processing.
Modern imitation learning methods including Diffusion Policy and Implicit Behavior Cloning are highly vulnerable to universal adversarial perturbations, with successful black-box transfer attacks across algorithms.
Adversarial action removal in self-play RL inflicts greater damage than random masking or learned perturbations, persists across algorithms and domains, transfers between agents, and resists recovery through extended training.
Weighted BC estimates trajectory density ratios from a clean reference set via binary discrimination and reweights the BC loss to converge to the clean expert policy with finite-sample bounds independent of contamination rate.
Wolfpack attack framework disrupts MARL cooperation by targeting initial and assisting agents; WALL trains robust policies against it with reported experimental gains.
Black-box attacks, especially Pixle, reach 99-100% success on Arabic handwriting ConvNet models across two benchmark datasets while preserving character structure.
NTGA is the first clean-label generalization attack under black-box settings but is vulnerable to adversarial training and image transformations, with newer attacks outperforming it.
citing papers explorer
-
Efficient Preference Poisoning Attack on Offline RLHF
Label-flip attacks on log-linear DPO reduce to binary sparse approximation problems that can be solved efficiently by lattice-based and binary matching pursuit methods with recovery guarantees.
-
Corruption-robust Offline Multi-agent Reinforcement Learning From Human Feedback
Introduces robust estimators for linear Markov games in offline MARLHF that achieve O(ε^{1-o(1)}) or O(√ε) bounds on Nash or CCE gaps under uniform or unilateral coverage.
-
A Speculative GLRT-Backed ApproachRobust Deep Learning-Based Array Processing
A speculative DL classifier validated by GLRT on spatially robust second-order statistics provides adversarially resilient array processing.
-
How Vulnerable Is My Learned Policy? Universal Adversarial Perturbation Attacks On Modern Behavior Cloning Policies
Modern imitation learning methods including Diffusion Policy and Implicit Behavior Cloning are highly vulnerable to universal adversarial perturbations, with successful black-box transfer attacks across algorithms.
-
When Actions Disappear: Adversarial Action Removal in Self-Play Reinforcement Learning
Adversarial action removal in self-play RL inflicts greater damage than random masking or learned perturbations, persists across algorithms and domains, transfers between agents, and resists recovery through extended training.
-
Density-Ratio Weighted Behavioral Cloning: Learning Control Policies from Corrupted Datasets
Weighted BC estimates trajectory density ratios from a clean reference set via binary discrimination and reweights the BC loss to converge to the clean expert policy with finite-sample bounds independent of contamination rate.
-
Wolfpack Adversarial Attack for Robust Multi-Agent Reinforcement Learning
Wolfpack attack framework disrupts MARL cooperation by targeting initial and assisting agents; WALL trains robust policies against it with reported experimental gains.
-
Threats to Arabic Handwriting Recognition: Investigating Black-Box Adversarial Attacks on embedded ConvNet models
Black-box attacks, especially Pixle, reach 99-100% success on Arabic handwriting ConvNet models across two benchmark datasets while preserving character structure.
-
SoK: A Comprehensive Analysis of the Current Status of Neural Tangent Generalization Attacks with Research Directions
NTGA is the first clean-label generalization attack under black-box settings but is vulnerable to adversarial training and image transformations, with newer attacks outperforming it.