pith. sign in

arxiv: 2606.10371 · v1 · pith:3IFOZSSJnew · submitted 2026-06-09 · 💻 cs.RO · cs.AI

Test-time Adversarial Takeover: A Real-time Hijacking Interface against Robotic Diffusion Policies

Pith reviewed 2026-06-27 13:02 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords adversarial attacksdiffusion policiesvisuomotor policiestest-time attacksrobotic manipulationuniversal patchesgenerative inference
0
0 comments X

The pith

An attacker can hijack frozen diffusion-based robot policies in real time by overlaying learned visual patches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that diffusion policies for robots can be taken over at test time without changing the policy itself. An attacker learns a small vocabulary of universal patches that bias the visual input so the generated actions follow attacker-chosen goals instead. These patches are switched in the live camera stream to compose trajectories, and the bias holds through the full iterative sampling process. Complete takeover occurs in every tested case across manipulation and navigation tasks, two encoders, and three inference methods. The result indicates that visual conditioning in embodied diffusion policies creates a controllable interface for an adversary.

Core claim

TAKO learns a vocabulary of reusable universal patches via differentiable diffusion inference; at test time an attacker switches among the patches in the camera stream to steer a frozen policy toward any chosen trajectory. The patches act on the visual conditioning pathway so the induced bias persists through iterative generative inference. A natural targeted baseline fails because the victim policy cannot supervise itself on out-of-distribution shifts. The attack reaches 100 percent success on attacker objectives across four tasks, two visual encoders, and DDPM, DDIM, and flow-matching inference.

What carries the argument

A small vocabulary of reusable universal patches learned through differentiable diffusion inference and switched in real time on the visual input to bias action generation.

If this is right

  • The attack reaches 100 percent takeover success on attacker-defined objectives in every evaluated setting.
  • The method works across 2D manipulation, simulated aerial delivery, simulated ground navigation, and physical-world ground navigation.
  • It succeeds with both ResNet-18 and EfficientNet-B0 plus Transformer encoders.
  • It succeeds with DDPM, DDIM, and flow-matching generative inference.
  • Target-policy matching fails as a defense because the policy cannot supervise out-of-distribution target shifts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Deployed robot systems using diffusion policies may need real-time checks on incoming visual frames for patch-like patterns.
  • The same conditioning-pathway vulnerability could appear in other generative models used for control.
  • Testing the patches under changing lighting or partial occlusions would show whether the takeover remains reliable outside controlled conditions.

Load-bearing premise

The learned patches remain effective and undetectable when applied in real time through the live camera stream, and the induced bias reliably persists through the full iterative generative inference process without the policy recovering or the system intervening.

What would settle it

Apply the patches to a live physical robot camera feed and record whether the generated actions match the attacker-defined trajectory or revert to the original trained behavior.

Figures

Figures reproduced from arXiv: 2606.10371 by Peilin Chai, Siyuan Huang, Zhanhao Hu, Zi Yin.

Figure 1
Figure 1. Figure 1: Real-time adversarially steering imitation learning robots to follow arbitrary trajecto￾ries. The tasks include: (a) RealNav, (b) SimNav, (c) PushT, (d) DeliverDrone. The trajectories are controlled by a human operator and resemble four letters: S, P, I, and N. be applied to VLA models, cannot achieve out-of-distribution target trajectories, and can be detected by semantic checks on the language input. A m… view at source ↗
Figure 2
Figure 2. Figure 2: Attack pipeline. Top: offline universal-patch optimization via differentiable inference. Gradients flow from the directional loss through the frozen denoising chain and vision encoder back to the patch. Bottom: closed-loop online interactive deployment. The attacker selects a vocabulary entry via keyboard; the victim policy thus can be steered in real time. (DDPM, DDIM, or a forward-Euler integrator of a f… view at source ↗
Figure 3
Figure 3. Figure 3: Illustration of the valuation tasks: (a) PushT: 2D manipulation with a T-shaped block. [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Visualization of the analysis. Top: Per-direction action predictions across five timesteps of a single rollout. TPM traces collapse into nearly identical paths (top row), whereas our traces fan out into four distinct directions (bottom row). Bottom: off-axis drift under open-loop vs. closed-loop attacker control with the same patch vocabulary. For each command, off-axis drift is the agent’s vertical displa… view at source ↗
read the original abstract

Diffusion-based action generation has become a foundational component of embodied AI, but its reliance on visual conditioning leaves deployed visuomotor policies vulnerable to adversarial manipulation. Most prior attacks focus on disruption: they perturb the observation stream to reduce task success or induce erratic behavior. We study a stronger threat, Test-time Adversarial Takeover (TAKO), in which an attacker obtains a real-time steering interface over a frozen robot policy and turns it into a remotely piloted instrument. TAKO learns a small vocabulary of reusable universal patches through differentiable diffusion inference; at test time, the attacker switches among these patches in the camera stream to compose attacker-chosen trajectories. This works because the perturbation acts on the visual conditioning pathway, where the induced bias can persist through iterative generative inference. We further show that the natural targeted baseline, target-policy matching, fails because the victim policy cannot reliably supervise itself on out-of-distribution target shifts. Across four tasks (2D manipulation, simulated aerial delivery, simulated ground navigation, and physical-world ground navigation), two visual encoders (ResNet-18 and EfficientNet-B0 + Transformer), and three generative inference families (DDPM, DDIM, and flow matching), human operators achieve 100\% takeover success on attacker-defined objectives in every evaluated setting. The project page is available at https://tako-attack.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Test-time Adversarial Takeover (TAKO), a method to learn a small set of reusable universal patches via differentiable diffusion inference that, when applied to the live camera stream at test time, allow an attacker to steer a frozen robotic diffusion policy toward arbitrary attacker-chosen trajectories. The central empirical claim is that this achieves 100% takeover success across four tasks (2D manipulation, simulated aerial delivery, simulated ground navigation, physical ground navigation), two visual encoders (ResNet-18, EfficientNet-B0+Transformer), and three inference families (DDPM, DDIM, flow matching).

Significance. If the empirical results hold, the work is significant because it demonstrates a stronger threat model than prior disruption attacks: diffusion policies can be converted into a real-time remote-control interface rather than merely made unreliable. The finding that target-policy matching fails while patch-based conditioning succeeds highlights a structural property of visual conditioning in generative policies.

major comments (2)
  1. [Abstract and Experiments] The 100% success claim across all configurations is load-bearing on the assumption that patch-induced bias persists through the full iterative denoising process (DDPM/DDIM/flow matching) without the policy recovering. The manuscript provides no ablation on the number of denoising steps, sensor noise levels, or lighting shifts that would test whether recovery occurs in the live camera stream (see skeptic concern on persistence).
  2. [Method] The claim that the perturbation 'acts on the visual conditioning pathway' and therefore persists is stated without a concrete analysis (e.g., intermediate feature visualizations or per-step action deviation measurements) showing that the bias is not corrected by the generative process under physical-world conditions.
minor comments (2)
  1. [Abstract] The project page link is given but the manuscript does not indicate whether the released code includes the exact patch optimization procedure and real-time application pipeline.
  2. [Method] Notation for the patch vocabulary and switching mechanism could be formalized more clearly to distinguish training-time optimization from test-time composition.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the persistence of patch-induced bias through the diffusion process. We address each major comment below.

read point-by-point responses
  1. Referee: [Abstract and Experiments] The 100% success claim across all configurations is load-bearing on the assumption that patch-induced bias persists through the full iterative denoising process (DDPM/DDIM/flow matching) without the policy recovering. The manuscript provides no ablation on the number of denoising steps, sensor noise levels, or lighting shifts that would test whether recovery occurs in the live camera stream (see skeptic concern on persistence).

    Authors: We agree that explicit ablations on denoising steps, sensor noise, and lighting shifts would strengthen the persistence claim. The reported physical navigation results already succeed under real sensor noise and lighting variation, but we will add controlled ablations varying denoising step count and introducing synthetic noise in the revision to directly test recovery. revision: yes

  2. Referee: [Method] The claim that the perturbation 'acts on the visual conditioning pathway' and therefore persists is stated without a concrete analysis (e.g., intermediate feature visualizations or per-step action deviation measurements) showing that the bias is not corrected by the generative process under physical-world conditions.

    Authors: We acknowledge that direct mechanistic evidence would be valuable. While cross-method success (DDPM, DDIM, flow matching) already indicates the bias is not corrected by the generative process, we will incorporate intermediate visual-encoder feature visualizations and per-step action deviation measurements in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical attack demonstration with no derivations reducing to inputs by construction

full rationale

The paper is an empirical study of an adversarial attack method (TAKO) that learns universal patches via differentiable diffusion inference and evaluates takeover success rates across tasks, encoders, and inference families. No load-bearing steps involve self-definitional equations, fitted inputs renamed as predictions, or self-citation chains that reduce the central claims to their own inputs. The 100% success figures are presented as experimental outcomes, not derived quantities. The persistence of bias through iterative inference is treated as an empirical observation under test conditions rather than a mathematical necessity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on abstract only; no explicit free parameters, axioms, or invented entities are detailed beyond standard assumptions of machine learning robustness research.

pith-pipeline@v0.9.1-grok · 5775 in / 1099 out tokens · 20237 ms · 2026-06-27T13:02:24.474057+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

55 extracted references · 16 canonical work pages · 6 internal anchors

  1. [1]

    Diffusion policy: Visuomotor policy learning via action diffusion

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. In Robotics: Science and Systems (RSS), 2023. URL https://arxiv.org/abs/2303.04137. arXiv:2303.04137

  2. [2]

    Tactics of adversarial attack on deep reinforcement learning agents

    Yen-Chen Lin, Zhang-Wei Hong, Yuan-Hong Liao, Meng-Li Shih, Ming-Yu Liu, and Min Sun. Tactics of adversarial attack on deep reinforcement learning agents. InProceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI), pages 3756–3762, 2017. doi: 10.24963/ijcai.2017/525. URLhttps://doi.org/10.24963/ijcai.2017/525

  3. [3]

    Diffusion policy attacker: Craft- ing adversarial attacks for diffusion-based policies

    Yipu Chen, Haotian Xue, and Yongxin Chen. Diffusion policy attacker: Craft- ing adversarial attacks for diffusion-based policies. InAdvances in Neu- ral Information Processing Systems, volume 37, 2024. doi: 10.52202/ 079017-3800. URL https://proceedings.neurips.cc/paper_files/paper/ 2024/hash/d83fd70a31c64e020844ec80705ba87f-Abstract-Conference.html. arXi...

  4. [4]

    2025.doi: 10.48550/arXiv.2506

    Eliot Krzysztof Jones, Alexander Robey, Andy Zou, Zachary Ravichandran, George J. Pappas, Hamed Hassani, Matt Fredrikson, and J. Zico Kolter. Adversarial attacks on robotic vision language action models.arXiv preprint arXiv:2506.03350, 2025. doi: 10.48550/arXiv.2506. 03350. URLhttps://arxiv.org/abs/2506.03350

  5. [5]

    Dirty road can attack: Security of deep learning based automated lane centering under physical-world at- tack

    Takami Sato, Junjie Shen, Ningfei Wang, Yunhan Jia, Xue Lin, and Qi Alfred Chen. Dirty road can attack: Security of deep learning based automated lane centering under physical-world at- tack. In30th USENIX Security Symposium (USENIX Security 21), pages 3309–3326, 2021. URL https://www.usenix.org/conference/usenixsecurity21/presentation/sato

  6. [6]

    3D diffusion policy: Generalizable visuomotor policy learning via simple 3D representations

    Yanjie Ze, Gu Zhang, Kangning Zhang, Chenyuan Hu, Muhan Wang, and Huazhe Xu. 3D diffusion policy: Generalizable visuomotor policy learning via simple 3D representations. In Robotics: Science and Systems (RSS), 2024. URL https://arxiv.org/abs/2403.03954. arXiv:2403.03954

  7. [7]

    3D diffuser actor: Policy diffusion with 3D scene representations

    Tsung-Wei Ke, Nikolaos Gkanatsios, and Katerina Fragkiadaki. 3D diffuser actor: Policy diffusion with 3D scene representations. InConference on Robot Learning (CoRL), 2024. URL https://arxiv.org/abs/2402.10885. arXiv:2402.10885

  8. [8]

    Consistency policy: Accelerated visuomotor policies via consistency distillation

    Aaditya Prasad, Kevin Lin, Jimmy Wu, Linqi Zhou, and Jeannette Bohg. Consistency policy: Accelerated visuomotor policies via consistency distillation. InRobotics: Science and Systems (RSS), 2024. URLhttps://arxiv.org/abs/2405.07503. arXiv:2405.07503

  9. [9]

    best checkpoint / average of last 5 checkpoints

    Qinglun Zhang, Zhen Liu, Haoqiang Fan, Guanghui Liu, Bing Zeng, and Shuaicheng Liu. FlowPolicy: Enabling fast and robust 3D flow-based policy via consistency flow matching for robot manipulation.arXiv preprint arXiv:2412.04987, 2024. doi: 10.48550/arXiv.2412.04987. URLhttps://arxiv.org/abs/2412.04987

  10. [10]

    One-step diffusion policy: Fast visuomotor policies via diffusion distillation.arXiv preprint arXiv:2410.21257,

    Zhendong Wang, Zhaoshuo Li, Ajay Mandlekar, Zhenjia Xu, Jiaojiao Fan, Yashraj Narang, Linxi Fan, Yuke Zhu, Yogesh Balaji, Mingyuan Zhou, Ming-Yu Liu, and Yu Zeng. One-step diffusion policy: Fast visuomotor policies via diffusion distillation.arXiv preprint arXiv:2410.21257,

  11. [11]
  12. [12]

    Scaling diffusion policy in transformer to 1 billion parameters for robotic manipulation

    Minjie Zhu, Yichen Zhu, Jinming Li, Junjie Wen, Zhiyuan Xu, Ning Liu, Ran Cheng, Chaomin Shen, Yaxin Peng, Feifei Feng, and Jian Tang. Scaling diffusion policy in transformer to 1 billion parameters for robotic manipulation. InIEEE International Conference on Robotics and Automation (ICRA), 2025. URL https://arxiv.org/abs/2409.14411. arXiv:2409.14411

  13. [13]

    RDT-1B: A diffusion foundation model for bimanual manipulation

    Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. RDT-1B: A diffusion foundation model for bimanual manipulation. In International Conference on Learning Representations (ICLR), 2025. URL https://arxiv. org/abs/2410.07864. arXiv:2410.07864. 10

  14. [14]

    Octo: An open-source generalist robot policy

    Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, Jianlan Luo, You Liang Tan, Lawrence Yun- liang Chen, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy. InRobotics: Science and Systems (RSS),

  15. [15]

    arXiv:2405.12213

    URLhttps://arxiv.org/abs/2405.12213. arXiv:2405.12213

  16. [16]

    π0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164,

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky. π0: A visi...

  17. [17]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    doi: 10.48550/arXiv.2410.24164. URLhttps://arxiv.org/abs/2410.24164

  18. [18]

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y . Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsc...

  19. [19]

    CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

    Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, Xiaofan Wang, Bei Liu, Jianlong Fu, Jianmin Bao, Dong Chen, Yuanchun Shi, Jiaolong Yang, and Baining Guo. CogACT: A foundational vision-language- action model for synergizing cognition and action in robotic manipulation.arXiv preprint arXiv:2...

  20. [20]

    Diffusion transformer policy.arXiv preprint arXiv:2410.15959,

    Zhi Hou, Tianyi Zhang, Yuwen Xiong, Hengjun Pu, Chengyang Zhao, Ronglei Tong, Yu Qiao, Jifeng Dai, and Yuntao Chen. Diffusion transformer policy.arXiv preprint arXiv:2410.15959,

  21. [21]

    URLhttps://arxiv.org/abs/2410.15959

    doi: 10.48550/arXiv.2410.15959. URLhttps://arxiv.org/abs/2410.15959

  22. [22]

    Brown, Dandelion Mané, Aurko Roy, Martín Abadi, and Justin Gilmer

    Tom B. Brown, Dandelion Mané, Aurko Roy, Martín Abadi, and Justin Gilmer. Adversarial patch. InNeurIPS Workshop on Machine Learning and Computer Security, 2017. URL https://arxiv.org/abs/1712.09665. arXiv:1712.09665

  23. [23]

    Synthesizing robust adver- sarial examples

    Anish Athalye, Logan Engstrom, Andrew Ilyas, and Kevin Kwok. Synthesizing robust adver- sarial examples. InInternational Conference on Machine Learning (ICML), pages 284–293,

  24. [24]

    arXiv:1707.07397

    URLhttps://arxiv.org/abs/1707.07397. arXiv:1707.07397

  25. [25]

    Robust physical-world attacks on deep learning models

    Kevin Eykholt, Ivan Evtimov, Earlence Fernandes, Bo Li, Amir Rahmati, Chaowei Xiao, Atul Prakash, Tadayoshi Kohno, and Dawn Song. Robust physical-world attacks on deep learning models. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1625–1634, 2018. URLhttps://arxiv.org/abs/1707.08945. arXiv:1707.08945

  26. [26]

    DPatch: An Adversarial Patch Attack on Object Detectors

    Xin Liu, Huanrui Yang, Ziwei Liu, Linghao Song, Hai Li, and Yiran Chen. DPatch: An adversarial patch attack on object detectors.arXiv preprint arXiv:1806.02299, 2018. doi: 10.48550/arXiv.1806.02299. URLhttps://arxiv.org/abs/1806.02299

  27. [27]

    Fooling automated surveillance cameras: Adversarial patches to attack person detection

    Simen Thys, Wiebe Van Ranst, and Toon Goedemé. Fooling automated surveillance cameras: Adversarial patches to attack person detection. InCVPR Workshop on The Bright and Dark Sides of Computer Vision, 2019. URL https://arxiv.org/abs/1904.08653. arXiv:1904.08653

  28. [28]

    Adversarial T-shirt! evading person detectors in a physical world

    Kaidi Xu, Gaoyuan Zhang, Sijia Liu, Quanfu Fan, Mengshu Sun, Hongge Chen, Pin-Yu Chen, Yanzhi Wang, and Xue Lin. Adversarial T-shirt! evading person detectors in a physical world. InEuropean Conference on Computer Vision (ECCV), pages 665–681, 2020. URL https://arxiv.org/abs/1910.11099. arXiv:1910.11099

  29. [29]

    Adversarial patch attacks on monocular depth estimation networks.IEEE Access, 8:179094–179104,

    Koichiro Yamanaka, Ryutaroh Matsumoto, Keita Takahashi, and Toshiaki Fujii. Adversarial patch attacks on monocular depth estimation networks.IEEE Access, 8:179094–179104,

  30. [30]

    URL https://arxiv.org/abs/2010.03072

    doi: 10.1109/ACCESS.2020.3027791. URL https://arxiv.org/abs/2010.03072. arXiv:2010.03072

  31. [31]

    Physical attack on monocular depth estimation with optimal adversarial 11 patches

    Zhiyuan Cheng, James Liang, Hongjun Choi, Guanhong Tao, Zhiwen Cao, Dongfang Liu, and Xiangyu Zhang. Physical attack on monocular depth estimation with optimal adversarial 11 patches. InEuropean Conference on Computer Vision (ECCV), pages 514–532, 2022. URL https://arxiv.org/abs/2207.04718. arXiv:2207.04718

  32. [32]

    Anurag Ranjan, Joel Janai, Andreas Geiger, and Michael J. Black. Attacking optical flow. InInternational Conference on Computer Vision (ICCV), pages 2404–2413, 2019. URL https://arxiv.org/abs/1910.10053. arXiv:1910.10053

  33. [33]

    Adversarial texture for fooling person detectors in the physical world

    Zhanhao Hu, Siyuan Huang, Xiaopei Zhu, Fuchun Sun, Bo Zhang, and Xiaolin Hu. Adversarial texture for fooling person detectors in the physical world. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13307–13316, 2022

  34. [34]

    Physically real- izable natural-looking clothing textures evade person detectors via 3d modeling

    Zhanhao Hu, Wenda Chu, Xiaopei Zhu, Hui Zhang, Bo Zhang, and Xiaolin Hu. Physically real- izable natural-looking clothing textures evade person detectors via 3d modeling. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16975–16984, 2023

  35. [35]

    Patch-fool: Are vision transformers always robust against adversarial perturbations? InInternational Conference on Learning Representations (ICLR), 2022

    Yonggan Fu, Shunyao Zhang, Shang Wu, Cheng Wan, and Yingyan Celine Lin. Patch-fool: Are vision transformers always robust against adversarial perturbations? InInternational Conference on Learning Representations (ICLR), 2022. URL https://arxiv.org/abs/2203.08392. arXiv:2203.08392

  36. [36]

    Ad- vCLIP: Downstream-agnostic adversarial examples in multimodal contrastive learning

    Ziqi Zhou, Shengshan Hu, Minghui Li, Hangtao Zhang, Yechao Zhang, and Hai Jin. Ad- vCLIP: Downstream-agnostic adversarial examples in multimodal contrastive learning. In ACM Multimedia, pages 6311–6320, 2023. doi: 10.1145/3581783.3612454. URL https: //doi.org/10.1145/3581783.3612454. arXiv:2308.07026

  37. [37]

    PhysGAN: Generating physical-world- resilient adversarial examples for autonomous driving

    Zelun Kong, Junfeng Guo, Ang Li, and Cong Liu. PhysGAN: Generating physical-world- resilient adversarial examples for autonomous driving. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14254–14263, 2020. URL https://arxiv. org/abs/1907.04449. arXiv:1907.04449

  38. [38]

    Marius Zöllner

    Svetlana Pavlitskaya, Sefa Ünver, and J. Marius Zöllner. Feasibility and suppression of ad- versarial patch attacks on end-to-end vehicle control. InIEEE Intelligent Transportation Systems Conference (ITSC), pages 1–8, 2020. doi: 10.1109/ITSC45102.2020.9294426. URL https://doi.org/10.1109/ITSC45102.2020.9294426

  39. [39]

    Adversarial attacks on neural network policies

    Sandy Huang, Nicolas Papernot, Ian Goodfellow, Yan Duan, and Pieter Abbeel. Adversarial attacks on neural network policies. InICLR Workshop, 2017. URL https://arxiv.org/ abs/1702.02284. arXiv:1702.02284

  40. [40]

    Robust deep reinforcement learning against adversarial perturbations on state observations

    Huan Zhang, Hongge Chen, Chaowei Xiao, Bo Li, Mingyan Liu, Duane Boning, and Cho- Jui Hsieh. Robust deep reinforcement learning against adversarial perturbations on state observations. InAdvances in Neural Information Processing Systems, volume 33, pages 21024–21037, 2020. URLhttps://arxiv.org/abs/2003.08938. arXiv:2003.08938

  41. [41]

    Akansha Kalra, Basavasagar Patil, Guanhong Tao, and Daniel S. Brown. How vulnerable is my learned policy? universal adversarial perturbation attacks on modern behavior cloning policies.arXiv preprint arXiv:2502.03698, 2025. doi: 10.48550/arXiv.2502.03698. URL https://arxiv.org/abs/2502.03698

  42. [42]

    Exploring the adversarial vulnerabilities of vision-language-action models in robotics.arXiv preprint arXiv:2411.13587, 2024

    Taowen Wang, Cheng Han, James Chenhao Liang, Wenhao Yang, Dongfang Liu, Luna Xinyu Zhang, Qifan Wang, Jiebo Luo, and Ruixiang Tang. Exploring the adversarial vulnerabilities of vision-language-action models in robotics.arXiv preprint arXiv:2411.13587, 2024. doi: 10.48550/arXiv.2411.13587. URL https://arxiv.org/abs/2411.13587. ICCV camera ready

  43. [43]

    BadRobot: Jail- breaking embodied LLMs in the physical world

    Hangtao Zhang, Chenyu Zhu, Xianlong Wang, Ziqi Zhou, Changgan Yin, Minghui Li, Lulu Xue, Yichen Wang, Shengshan Hu, Aishan Liu, Peijin Guo, and Leo Yu Zhang. BadRobot: Jail- breaking embodied LLMs in the physical world. InInternational Conference on Learning Rep- resentations (ICLR), 2025. URL https://arxiv.org/abs/2407.20242. arXiv:2407.20242

  44. [44]

    Alexander Robey, Zachary Ravichandran, Vijay Kumar, Hamed Hassani, and George J. Pappas. Jailbreaking LLM-controlled robots.arXiv preprint arXiv:2410.13691, 2024. doi: 10.48550/ arXiv.2410.13691. URLhttps://arxiv.org/abs/2410.13691

  45. [45]

    TrojDRL: Trojan attacks on deep reinforcement learning agents

    Panagiota Kiourti, Kacper Wardega, Susmit Jha, and Wenchao Li. TrojDRL: Trojan attacks on deep reinforcement learning agents. InDesign Automation Conference (DAC), 2020. URL https://arxiv.org/abs/1903.06638. arXiv:1903.06638. 12

  46. [46]

    Robot collapse: Supply chain backdoor attacks against VLM-based robotic manipulation.arXiv preprint arXiv:2411.11683, 2024

    Xianlong Wang, Hewen Pan, Hangtao Zhang, Minghui Li, Shengshan Hu, Ziqi Zhou, Lulu Xue, Peijin Guo, Aishan Liu, Leo Yu Zhang, and Xiaohua Jia. Robot collapse: Supply chain backdoor attacks against VLM-based robotic manipulation.arXiv preprint arXiv:2411.11683, 2024. doi: 10.48550/arXiv.2411.11683. URL https://arxiv.org/abs/2411.11683. Introduces the Troja...

  47. [47]

    Denoising diffusion implicit models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations (ICLR), 2021. URL https://arxiv. org/abs/2010.02502. arXiv:2010.02502

  48. [48]

    Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow match- ing for generative modeling. InInternational Conference on Learning Representations (ICLR),

  49. [49]

    arXiv:2210.02747

    URLhttps://openreview.net/forum?id=PqvMRDCJT9t. arXiv:2210.02747

  50. [50]

    NoMaD: Goal masked diffu- sion policies for navigation and exploration

    Ajay Sridhar, Dhruv Shah, Catherine Glossop, and Sergey Levine. NoMaD: Goal masked diffu- sion policies for navigation and exploration. InIEEE International Conference on Robotics and Automation (ICRA), 2024. URL https://arxiv.org/abs/2310.07896. arXiv:2310.07896

  51. [51]

    ViNT: A foundation model for visual navigation

    Dhruv Shah, Ajay Sridhar, Nitish Dashora, Kyle Stachowicz, Kevin Black, Noriaki Hirose, and Sergey Levine. ViNT: A foundation model for visual navigation. InConference on Robot Learning (CoRL), 2023. URLhttps://arxiv.org/abs/2306.14846. arXiv:2306.14846

  52. [52]

    Kingma and Jimmy Ba

    Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. InInterna- tional Conference on Learning Representations (ICLR), 2015. URL https://arxiv.org/ abs/1412.6980. arXiv:1412.6980

  53. [53]

    AI2-THOR: An Interactive 3D Environment for Visual AI

    Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Matt Deitke, Kiana Ehsani, Daniel Gordon, Yuke Zhu, Aniruddha Kembhavi, Abhinav Gupta, and Ali Farhadi. AI2-THOR: An interactive 3D environment for visual AI.arXiv preprint arXiv:1712.05474, 2017. doi: 10.48550/arXiv.1712.05474. URL https://arxiv.org/abs/ 1712.05474

  54. [54]

    dm_env_rpc: A networking protocol for agent-environment communication, 2019

    Tom Ward and Jay Lemmon. dm_env_rpc: A networking protocol for agent-environment communication, 2019. URLhttp://github.com/deepmind/dm_env_rpc

  55. [55]

    Using Unity to help solve intelligence

    Tom Ward, Andrew Bolt, Nik Hemmings, Simon Carter, Manuel Sanchez, Ricardo Barreira, Seb Noury, Keith Anderson, Jay Lemmon, Jonathan Coe, et al. Using Unity to help solve intelligence. arXiv preprint arXiv:2011.09294, 2020. URLhttps://arxiv.org/abs/2011.09294. 13 This appendix is organized into four sections. Section A1 supplements the experimental setup ...