How Vulnerable Is My Learned Policy? Universal Adversarial Perturbation Attacks On Modern Behavior Cloning Policies
Pith reviewed 2026-05-23 03:29 UTC · model grok-4.3
The pith
Modern behavior cloning policies are highly vulnerable to universal adversarial perturbation attacks, including black-box transfers across algorithms.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that most existing imitation learning algorithms for behavior cloning are highly vulnerable to universal adversarial perturbations. This vulnerability appears in white-box settings where full model access is available as well as in black-box transfer attacks where perturbations crafted on one algorithm affect others. The study compares vulnerabilities across classic and recent methods and concludes that these algorithms share common weaknesses to such attacks.
What carries the argument
Universal adversarial perturbation attacks applied to the input observations of behavior cloning policies, tested in white-box, grey-box, and black-box transfer settings across multiple algorithms.
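A minimal sketch of what "universal" means here, assuming a toy scalar policy and sign-gradient ascent with projection onto an L-infinity ball. The policy, the deviation objective, and every name (`policy`, `craft_uap`, `eps`) are illustrative assumptions, not the paper's models, losses, or code. The only point is that one shared `delta` is optimized over the whole set of observations instead of one perturbation per input.

```python
import math
import random


def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def policy(w, b, obs):
    """Toy stand-in for a learned policy: one tanh unit over the observation."""
    return math.tanh(sum(wi * oi for wi, oi in zip(w, obs)) + b)


def mean_deviation(w, b, dataset, delta):
    """Mean squared change in action when delta is added to every observation."""
    total = 0.0
    for obs in dataset:
        shifted = [o + d for o, d in zip(obs, delta)]
        total += (policy(w, b, shifted) - policy(w, b, obs)) ** 2
    return total / len(dataset)


def craft_uap(w, b, dataset, eps, steps=50, seed=0):
    """Sign-gradient ascent on mean_deviation, projected onto |delta_i| <= eps."""
    rng = random.Random(seed)
    step_size = eps / 10.0
    # Random start: the gradient of the deviation is exactly zero at delta = 0.
    delta = [rng.uniform(-eps, eps) for _ in w]
    for _ in range(steps):
        grad = [0.0] * len(w)
        for obs in dataset:
            shifted = [o + d for o, d in zip(obs, delta)]
            a_adv = policy(w, b, shifted)
            a_clean = policy(w, b, obs)
            # Chain rule through tanh: d/dz tanh(z) = 1 - tanh(z)^2.
            coeff = 2.0 * (a_adv - a_clean) * (1.0 - a_adv ** 2)
            for i, wi in enumerate(w):
                grad[i] += coeff * wi / len(dataset)
        # One shared update for all observations, then clip back into the ball.
        delta = [max(-eps, min(eps, d + step_size * sign(g)))
                 for d, g in zip(delta, grad)]
    return delta


# Hypothetical data: 200 random 4-dimensional observations.
rng = random.Random(1)
w, b, eps = [0.8, -1.2, 0.5, 0.3], 0.1, 0.1
dataset = [[rng.uniform(-1, 1) for _ in w] for _ in range(200)]
delta = craft_uap(w, b, dataset, eps)
print("delta:", delta)
print("mean squared action change:", mean_deviation(w, b, dataset, delta))
```

In this toy case the perturbation ends up on a corner of the allowed box, since every coordinate can keep pushing the pre-activation in the same direction. Real policies with image observations and multi-step action heads would need an autodiff framework and a task-level success metric instead of this hand-written gradient.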
If this is right
- Vulnerability holds for both white-box and black-box attacks across a range of imitation learning algorithms.
- Black-box transfer attacks succeed, allowing perturbations to move between different algorithms without model access.
- Current imitation learning methods share limitations that make them susceptible to input perturbations.
- The findings point to the need for new approaches to improve robustness in behavior cloning policies.
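The black-box transfer setting in the list above can be sketched as a source-by-target matrix: craft a perturbation with access to one policy, then measure its effect on every policy. This is a toy illustration under stated assumptions. Each "algorithm" is a single tanh unit with hypothetical weights, and `eps * sign(w)` (the perturbation that maximizes the pre-activation shift for that unit) stands in for whatever optimizer a real attack would use. None of the names or numbers come from the paper.

```python
import math
import random


def act(w, obs):
    """Toy policy: a single tanh unit with weight vector w."""
    return math.tanh(sum(wi * oi for wi, oi in zip(w, obs)))


def white_box_delta(w, eps):
    """Perturbation maximizing |w . delta| subject to |delta_i| <= eps."""
    return [eps if wi > 0 else -eps if wi < 0 else 0.0 for wi in w]


def mean_abs_shift(w, dataset, delta):
    """Average absolute change in action caused by adding delta to each input."""
    shifts = []
    for obs in dataset:
        shifted = [o + d for o, d in zip(obs, delta)]
        shifts.append(abs(act(w, shifted) - act(w, obs)))
    return sum(shifts) / len(shifts)


def transfer_matrix(policies, dataset, eps):
    """matrix[source][target]: effect on target of a delta crafted on source."""
    deltas = {name: white_box_delta(w, eps) for name, w in policies.items()}
    return {
        src: {tgt: mean_abs_shift(w_tgt, dataset, deltas[src])
              for tgt, w_tgt in policies.items()}
        for src in policies
    }


# Hypothetical stand-ins for three policies trained on the same task.
policies = {
    "policy_a": [0.9, -1.1, 0.4],
    "policy_b": [0.7, -0.8, 0.6],
    "policy_c": [1.0, 0.5, -0.3],
}
rng = random.Random(0)
dataset = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(100)]
matrix = transfer_matrix(policies, dataset, eps=0.1)
for src, row in matrix.items():
    print(src, {tgt: round(v, 4) for tgt, v in row.items()})
```

Here `policy_a` and `policy_b` share the sign pattern of their weights, so a perturbation crafted on one is as effective on the other as a white-box one, while `policy_c` differs and the transfer from it to `policy_a` is weaker. Whether real behavior cloning policies share structure in this way is exactly what a transfer experiment measures; the toy only shows how such a matrix is laid out.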
Where Pith is reading between the lines
- If the claim holds, adversarial robustness testing should become a standard part of evaluating any new behavior cloning method.
- This raises the possibility that sensor noise or small environmental changes in real-world settings could act like these attacks and disrupt deployed policies.
- The results connect to larger questions about how learned policies handle input variations that were not present in training demonstrations.
Load-bearing premise
The tested algorithms and attack setups are representative of modern behavior cloning policies and realistic threats in their intended deployment domains.
What would settle it
New experiments on additional imitation learning algorithms or in physical robot deployments that show low attack success rates would falsify the claim of widespread high vulnerability.
Original abstract
Learning from demonstrations is a popular approach to train AI models; however, their vulnerability to adversarial attacks remains underexplored. We present the first systematic study of adversarial attacks across a range of both classic and recently proposed imitation learning algorithms, including Vanilla Behavior Cloning (Vanilla BC), LSTM-GMM, Implicit Behavior Cloning (IBC), Diffusion Policy (DP), and Vector-Quantized Behavior Transformer (VQ-BET). We study the vulnerability of these methods to white-box, grey-box, and black-box adversarial perturbations. Our experiments reveal that most existing methods are highly vulnerable to these attacks, including black-box transfer attacks that transfer across algorithms. To the best of our knowledge, we are the first to study and compare the vulnerabilities of different popular imitation learning algorithms to both white-box and black-box attacks. Our findings highlight the vulnerabilities of modern imitation learning algorithms, paving the way for future work in addressing such limitations. Videos and code are available at https://sites.google.com/view/uap-attacks-on-bc.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper conducts the first systematic empirical study of universal adversarial perturbation (UAP) attacks on behavior cloning policies. It evaluates five imitation learning algorithms—Vanilla BC, LSTM-GMM, IBC, Diffusion Policy (DP), and VQ-BET—under white-box, grey-box, and black-box attack settings, reporting that most are highly vulnerable with successful black-box transfer across algorithms. The work positions itself as the initial comparison of these vulnerabilities and releases code and videos.
Significance. If the experimental results hold under broader conditions, the findings would establish a concrete security limitation for deployed imitation learning systems, particularly in robotics and control where BC is common. The cross-algorithm black-box transfer result, if robust, would be especially notable as it suggests shared vulnerabilities rather than algorithm-specific weaknesses. The public release of code supports reproducibility and follow-on work on defenses.
major comments (3)
- §5 (Experiments) and Table 2: The headline claim that 'most existing methods are highly vulnerable' rests on results from only five algorithms. No explicit justification or coverage argument is given for why Vanilla BC, LSTM-GMM, IBC, DP, and VQ-BET are representative of the current diversity of BC methods (e.g., newer transformer-based or flow-matching variants trained on large heterogeneous datasets). Without this, the transferability and vulnerability conclusions cannot be generalized beyond the tested suite.
- §4.3 and §5.2: The black-box transfer attack protocol is described at a high level, but the manuscript does not report the precise success-rate thresholds, number of source-target pairs, or statistical significance tests used to declare 'transfer across algorithms.' This detail is load-bearing for the central empirical claim.
- §5.1, Table 1: The environments and observation spaces used (e.g., dimensionality, horizon length, presence of safety constraints) are not compared against typical real-world BC deployment settings. If the chosen tasks are low-dimensional or lack the complexity of modern applications, the reported attack success rates may not indicate vulnerability under realistic threat models.
minor comments (2)
- Introduction: The abstract states the study covers 'classic and recently proposed' algorithms, but the introduction does not cite the original papers for each of the five methods with publication years; adding these would improve context.
- Figure 3 (attack visualization) caption does not specify the perturbation magnitude (ε) or the exact policy being visualized; this reduces clarity for readers reproducing the results.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below with clarifications and indicate planned revisions to strengthen the manuscript.
- Referee: §5 (Experiments) and Table 2: The headline claim that 'most existing methods are highly vulnerable' rests on results from only five algorithms. No explicit justification or coverage argument is given for why Vanilla BC, LSTM-GMM, IBC, DP, and VQ-BET are representative of the current diversity of BC methods (e.g., newer transformer-based or flow-matching variants trained on large heterogeneous datasets). Without this, the transferability and vulnerability conclusions cannot be generalized beyond the tested suite.
Authors: These five algorithms were deliberately selected to span classical (Vanilla BC, LSTM-GMM) to modern (IBC, Diffusion Policy, VQ-BET) approaches, covering feedforward, recurrent, implicit, diffusion, and transformer-based paradigms that dominate recent BC literature. VQ-BET specifically addresses transformer-based methods. We will add an explicit justification subsection in §5 discussing selection criteria, prevalence in the field, and scope limitations (e.g., excluding certain flow-matching variants). This will support the claims without overgeneralization.
Revision: yes
- Referee: §4.3 and §5.2: The black-box transfer attack protocol is described at a high level, but the manuscript does not report the precise success-rate thresholds, number of source-target pairs, or statistical significance tests used to declare 'transfer across algorithms.' This detail is load-bearing for the central empirical claim.
Authors: The referee correctly notes that these implementation details are not fully reported in the current manuscript. We will revise §4.3 and §5.2 to explicitly state the success-rate thresholds, the number of source-target pairs evaluated, and any statistical significance tests used, ensuring the transfer results are fully reproducible and supported.
Revision: yes
- Referee: §5.1, Table 1: The environments and observation spaces used (e.g., dimensionality, horizon length, presence of safety constraints) are not compared against typical real-world BC deployment settings. If the chosen tasks are low-dimensional or lack the complexity of modern applications, the reported attack success rates may not indicate vulnerability under realistic threat models.
Authors: The environments were chosen as standard benchmarks from the source papers of each method to enable fair comparisons. We agree a direct mapping to real-world settings is absent. In the revision we will expand §5.1 with a discussion comparing observation dimensionality, horizon lengths, and constraints to typical robotic deployments, plus an explicit limitations paragraph on the gap to more complex real-world scenarios.
Revision: partial
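As an illustration of the kind of significance test the second exchange above asks for, here is a hedged sketch of a two-sided two-proportion z-test comparing task success rates with and without a transferred perturbation. The function name and the counts are hypothetical, and the source does not say which test, if any, the authors used.

```python
import math


def two_proportion_z_test(successes_clean, n_clean, successes_attacked, n_attacked):
    """Two-sided z-test for a difference between two success rates.

    Returns (z, p_value). Uses the pooled-variance normal approximation,
    which is only reasonable when both samples are not too small.
    """
    p_clean = successes_clean / n_clean
    p_attacked = successes_attacked / n_attacked
    pooled = (successes_clean + successes_attacked) / (n_clean + n_attacked)
    variance = pooled * (1.0 - pooled) * (1.0 / n_clean + 1.0 / n_attacked)
    if variance == 0.0:
        # Both samples are all-success or all-failure: no evidence of a gap.
        return 0.0, 1.0
    z = (p_clean - p_attacked) / math.sqrt(variance)
    # Two-sided tail probability of a standard normal: erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Illustrative counts only, not results from the paper:
# 45/50 successful clean rollouts vs 20/50 under a transferred perturbation.
z, p_value = two_proportion_z_test(45, 50, 20, 50)
print(f"z = {z:.2f}, p = {p_value:.2e}")
```

With many source-target pairs, one test per pair would also call for a multiple-comparison correction, and small rollout counts may be better served by an exact test than by this normal approximation.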
Circularity Check
No circularity: purely empirical evaluation of existing algorithms
The paper conducts an experimental comparison of adversarial vulnerability across five imitation learning methods (Vanilla BC, LSTM-GMM, IBC, DP, VQ-BET) using white-box, grey-box, and black-box attacks. No central claim rests on a derivation chain, on equations, on fitted parameters renamed as predictions, or on self-citations. Results are reported directly from the described experiments on the chosen tasks and algorithms, without reduction to prior definitions or ansatzes. The representativeness concern raised by the skeptic is a question of external validity, not of circularity under the enumerated patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Imitation learning policies can be subjected to gradient-based or transfer-based adversarial perturbations using techniques from supervised learning.
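One common way to write the objective behind this assumption, given as a hedged sketch and not as the paper's exact formulation: a single perturbation $\delta$ is shared by all observations $o$ drawn from a dataset $\mathcal{D}$, bounded in size by $\epsilon$, and updated by projected sign-gradient steps in the style of standard supervised-learning attacks. The loss $\mathcal{L}$, the step size $\alpha$, the projection $\Pi$, and the choice of the $\ell_\infty$ norm are all assumptions made for illustration.

```latex
\delta^\star = \arg\max_{\|\delta\|_\infty \le \epsilon}
  \; \mathbb{E}_{o \sim \mathcal{D}}
  \left[ \mathcal{L}\big(\pi_\theta(o + \delta),\, \pi_\theta(o)\big) \right],
\qquad
\delta_{k+1} = \Pi_{\|\delta\|_\infty \le \epsilon}
  \left( \delta_k + \alpha \,\operatorname{sign}\!\left(
    \nabla_\delta \, \mathbb{E}_{o \sim \mathcal{D}}
    \left[ \mathcal{L}\big(\pi_\theta(o + \delta_k),\, \pi_\theta(o)\big) \right]
  \right) \right)
```

The gradient $\nabla_\delta$ is only available when the attacker can differentiate through $\pi_\theta$. In a transfer setting it would be taken through a surrogate policy, and the resulting $\delta^\star$ applied unchanged to the target.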
Forward citations
Cited by 1 Pith paper
- Immune2V: Image Immunization Against Dual-Stream Image-to-Video Generation
Immune2V immunizes images against dual-stream I2V generation by enforcing temporally balanced latent divergence and aligning generative features to a precomputed collapse trajectory, yielding stronger persistent degra...