ATAAT: Adaptive Threat-Aware Adversarial Tuning Framework against Backdoor Attacks on Vision-Language-Action Models

Kewei Chen; Mingsheng Shang; Shuai Li; Yayu Long

arxiv: 2605.08612 · v1 · submitted 2026-05-09 · 💻 cs.RO

ATAAT: Adaptive Threat-Aware Adversarial Tuning Framework against Backdoor Attacks on Vision-Language-Action Models

Kewei Chen , Yayu Long , Shuai Li , Mingsheng Shang This is my paper

Pith reviewed 2026-05-12 01:11 UTC · model grok-4.3

classification 💻 cs.RO

keywords backdoor attacksvision-language-action modelsgradient interferencedata poisoningadversarial tuningrobotics securitymultimodal models

0 comments

The pith

An adaptive framework resolves gradient interference to enable backdoor attacks on vision-language-action models with over 80% success at 5% poisoning rate.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies gradient interference as the reason traditional backdoor attacks fail on VLA models during end-to-end training. It proposes the ATAAT framework, whose Threat-Method Adaptive Mapping selects the best gradient decoupling approach based on the attacker's capabilities. This setup achieves targeted attack success rates above 80% while poisoning just 5% of the data and handles complex semantic triggers, including the first implicit decoupled attacks in poisoning scenarios. A reader would care because VLA models convert vision and language into robot actions, so reliable backdoors could compromise physical systems that rely on these models.

Core claim

Traditional backdoor attacks on VLA models fail due to gradient interference from conflicting strategies in end-to-end training. The ATAAT framework overcomes this through its Threat-Method Adaptive Mapping mechanism, which selects the optimal gradient decoupling strategy according to adversary capabilities, resulting in robust targeted attack success rates above 80% at a 5% poisoning rate, efficient handling of semantic-level triggers, and the first achievement of implicit decoupled attacks in data poisoning.

What carries the argument

Threat-Method Adaptive Mapping mechanism, which selects the optimal gradient decoupling strategy based on the adversary's capabilities to resolve gradient interference during VLA model training.

If this is right

Traditional backdoor attack methods fail on VLA models because of conflicting optimization strategies during training.
Adaptive selection of decoupling strategies enables attacks to succeed on complex semantic-level triggers.
Implicit decoupled attacks become feasible for the first time in data poisoning scenarios for these models.
High targeted success rates above 80% can be maintained alongside extreme stealth at a 5% poisoning rate.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Defenses for VLA models may need to target adaptive gradient selection rather than fixed backdoor patterns.
The vulnerability could extend to other end-to-end trained multimodal action models beyond those tested.
Further experiments on varied VLA architectures would test whether the adaptive mapping holds in broader settings.

Load-bearing premise

The assumption that gradient interference is the primary obstacle to traditional backdoor attacks on VLA models and that an adaptive mapping based on adversary capabilities will generalize beyond the tested scenarios.

What would settle it

An experiment in which ATAAT fails to reach high targeted attack success rates when gradient interference is removed from the training process or when the adaptive mapping is replaced by a fixed strategy.

Figures

Figures reproduced from arXiv: 2605.08612 by Kewei Chen, Mingsheng Shang, Shuai Li, Yayu Long.

**Figure 1.** Figure 1: Overview of the Adaptive Threat-Aware Adversarial Tuning (ATAAT) Framework. This figure illustrates how ATAAT achieves robust backdoor injection by eliminating “Gradient Interference” (Sim(θ) ≈ 0) across different supply chain privilege scenarios. (a) Left - Scenario 1: Data Poisoning (Implicit De-confliction). Under restricted access (Black-box), the attacker employs “Dual-Objective Sample Design.” This p… view at source ↗

**Figure 2.** Figure 2: ATAAT Implicit Decoupling Mechanism. The Dual-Objective Sample Design achieves feature separation without training intervention. A poisoned sample x˜ combines a visible trigger t (defining logic) with an invisible orthogonal perturbation δ (generated via gradient engineering). During end-to-end training, δ directs poisoned inputs to an independent malicious subspace, naturally separating them from the beni… view at source ↗

**Figure 3.** Figure 3: ATAAT Real-World Evaluation. Performance across threat models: (a) Scenario 1: Data Poisoning. Implicit mechanism successfully triggers on both fixed objects and dynamic interactive cues (e.g., bottom hands), proving robust feature-space decoupling without parameter anchoring. (b) Scenario 2: Semantic Backdoor. Explicit anchoring enables response to highlevel semantic triggers. Precise activation across … view at source ↗

**Figure 4.** Figure 4: Evolution of Gradient Cosine Similarity during Training. Shaded areas represent the Standard Deviation over multiple experiments. The gray dashed line (y = 0) is the orthogonal baseline. (1) Baseline (BadVLA) (red line) drops rapidly early in training and stabilizes in the negative range (Sim ≈ −0.4 after 400 steps), indicating persistent gradient cancellation that triggers performance collapse. (2) ATAAT … view at source ↗

**Figure 5.** Figure 5: Visual comparison of attack effectiveness [PITH_FULL_IMAGE:figures/full_fig_p015_5.png] view at source ↗

read the original abstract

Addressing the escalating security vulnerabilities in Vision-Language-Action (VLA) models, this study investigates backdoor attacks targeting the visual pathway. We identify a core obstacle causing the failure of traditional attack paradigms: "Gradient Interference." This phenomenon represents an optimization failure triggered by conflicting strategies during end-to-end training. To resolve this, we propose an Adaptive Threat-Aware Adversarial Tuning (ATAAT) framework. Through its core "Threat-Method Adaptive Mapping" mechanism, ATAAT intelligently selects the optimal gradient decoupling strategy based on the adversary's capabilities. Extensive experiments demonstrate that ATAAT exhibits significant advantages, achieving a highly robust Targeted Attack Success Rate (TASR > 80%) while maintaining extreme stealthiness with merely a 5% poisoning rate. It efficiently handles complex semantic-level triggers and achieves implicit decoupled attacks in data poisoning scenarios for the first time. This work reveals a critical security vulnerability in VLAs and provides theoretical and methodological support for future defense architectures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ATAAT introduces an adaptive mapping to handle gradient interference in VLA backdoor attacks, but the reported results lack the details needed to assess whether the gains are real.

read the letter

The core claim is that standard backdoor attacks fail on vision-language-action models because of gradient interference during joint training, and ATAAT fixes this by using Threat-Method Adaptive Mapping to pick the right decoupling approach based on attacker resources. That mapping mechanism is the piece that looks new compared with earlier work on semantic triggers or poisoning in vision or language models alone. The authors also report that it reaches over 80% targeted attack success with only 5% poisoned samples while keeping the trigger stealthy, which would matter for anyone deploying these models in robotics settings.

Referee Report

2 major / 2 minor

Summary. The paper claims that backdoor attacks on Vision-Language-Action (VLA) models fail due to a core obstacle termed 'Gradient Interference' arising from conflicting optimization strategies in end-to-end training; it proposes the ATAAT framework whose Threat-Method Adaptive Mapping selects gradient-decoupling strategies based on adversary capabilities, yielding TASR >80% at a 5% poisoning rate while handling semantic-level triggers and enabling implicit decoupled attacks for the first time.

Significance. If the reported attack performance and generalization hold under rigorous verification, the work would be significant for exposing previously under-appreciated vulnerabilities in VLA models deployed in robotics, while the adaptive-mapping idea could inform both attack and defense research; the absence of any parameter-free derivation or machine-checked component, however, means the contribution rests entirely on empirical claims whose reproducibility remains unverified.

major comments (2)

[Abstract] Abstract: the claims of TASR >80% and 5% poisoning success are presented without any description of the VLA models tested, datasets, attack baselines, evaluation metrics, error bars, or statistical tests, rendering it impossible to assess whether the data actually support the central performance assertions.
[§3 (Method)] The weakest assumption—that gradient interference is the primary, resolvable obstacle and that the adaptive mapping generalizes beyond the tested scenarios—is stated without supporting ablation studies or counter-examples; if other factors (e.g., model scale or trigger semantics) dominate, the framework's claimed novelty collapses.

minor comments (2)

Define all acronyms (TASR, VLA, ATAAT) on first use and ensure consistent notation for 'gradient decoupling' versus 'implicit decoupled attacks'.
Add a dedicated related-work subsection contrasting ATAAT with prior backdoor attacks on vision-language or robotic models to clarify incremental contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point by point below and will revise the manuscript to incorporate the suggested improvements.

read point-by-point responses

Referee: [Abstract] Abstract: the claims of TASR >80% and 5% poisoning success are presented without any description of the VLA models tested, datasets, attack baselines, evaluation metrics, error bars, or statistical tests, rendering it impossible to assess whether the data actually support the central performance assertions.

Authors: We agree that the abstract requires additional context to support the performance claims. In the revised version, we will expand the abstract to specify the VLA models (RT-1, RT-2, OpenVLA), datasets (BridgeData V2, RT-X), attack baselines (BadNet, Blended, and others), the TASR metric, and note that results include error bars from multiple independent runs along with statistical significance tests. This will allow direct assessment of the claims. revision: yes
Referee: [§3 (Method)] The weakest assumption—that gradient interference is the primary, resolvable obstacle and that the adaptive mapping generalizes beyond the tested scenarios—is stated without supporting ablation studies or counter-examples; if other factors (e.g., model scale or trigger semantics) dominate, the framework's claimed novelty collapses.

Authors: We acknowledge the need for explicit validation of the core assumption. While the main experiments across models and triggers provide indirect support, we will add a new ablation subsection in the revised manuscript. This will include studies isolating gradient interference, varying model scale and trigger semantics, comparisons to non-adaptive baselines, and discussion of counter-examples where the framework underperforms, to demonstrate both the primacy of the obstacle and the generalization of the mapping. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The abstract identifies 'Gradient Interference' as a core obstacle and introduces the ATAAT framework with its 'Threat-Method Adaptive Mapping' mechanism, but presents no equations, derivations, predictions, or self-citations that reduce any claimed result to fitted inputs or prior self-referential definitions by construction. Claims rest on the proposed adaptive selection strategy and reported experimental outcomes (TASR > 80% at 5% poisoning) rather than any self-definitional loop or renamed known result. The derivation chain is therefore self-contained against external benchmarks with no load-bearing steps matching the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

Review is limited to the abstract; no explicit free parameters, mathematical axioms, or independently evidenced invented entities are detailed. Gradient Interference is introduced as a core phenomenon but lacks external validation in the provided text.

invented entities (1)

Gradient Interference no independent evidence
purpose: Optimization failure from conflicting strategies in end-to-end training of backdoor attacks on VLA models
Presented as the key obstacle identified by the authors; no independent evidence or falsifiable prediction supplied in abstract.

pith-pipeline@v0.9.0 · 5475 in / 1312 out tokens · 66709 ms · 2026-05-12T01:11:26.114061+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We identify the theoretical root cause as 'Gradient Interference'... Sim(θ) = cos(g_benign, g_backdoor) ... min_θ L_backdoor(θ) s.t. Sim(θ)≈0
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Implicit De-confliction via Orthogonal Triggers... δ^*_orth ... Explicit De-confliction via Semantic Anchoring... binary mask M

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages

[2]

2017 IEEE International Conference on Data Mining (ICDM) , pages=

Gadei: On scale-up training as a service for deep learning , author=. 2017 IEEE International Conference on Data Mining (ICDM) , pages=. 2017 , organization=

work page 2017
[3]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

Exploring the adversarial vulnerabilities of vision-language-action models in robotics , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

work page
[4]

Conference on Robot Learning , pages=

Rt-2: Vision-language-action models transfer web knowledge to robotic control , author=. Conference on Robot Learning , pages=. 2023 , organization=

work page 2023
[5]

Conference on Robot Learning , pages=

OpenVLA: An Open-Source Vision-Language-Action Model , author=. Conference on Robot Learning , pages=. 2025 , organization=

work page 2025
[6]

Advances in Neural Information Processing Systems , volume=

Humanvla: Towards vision-language directed object rearrangement by physical humanoid , author=. Advances in Neural Information Processing Systems , volume=

work page
[7]

QAVA: Query-Agnostic Visual Attack to Large Vision-Language Models , author=. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=

work page 2025
[8]

Ieee Access , volume=

Badnets: Evaluating backdooring attacks on deep neural networks , author=. Ieee Access , volume=. 2019 , publisher=

work page 2019
[9]

Advances in Neural Information Processing Systems , volume=

Libero: Benchmarking knowledge transfer for lifelong robot learning , author=. Advances in Neural Information Processing Systems , volume=

work page
[10]

Is Embedding-as-a-Service Safe? Meta-Prompt-Based Backdoor Attacks for User-Specific Trigger Migration , volume=

Bagwe, Gaurav and Zhang, Lan and Guo, Linke and Pan, Miao and Ma, Xiaolong and Yuan, Xiaoyong , year=. Is Embedding-as-a-Service Safe? Meta-Prompt-Based Backdoor Attacks for User-Specific Trigger Migration , volume=. Transactions on Artificial Intelligence , publisher=

work page
[12]

The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

work page
[13]

IEEE Robotics and Automation Letters , year=

Safety Guardrails for LLM-Enabled Robots , author=. IEEE Robotics and Automation Letters , year=

work page
[14]

Advances in Neural Information Processing Systems , volume=

Improving alignment and robustness with circuit breakers , author=. Advances in Neural Information Processing Systems , volume=

work page
[16]

Xueyang Zhou and Guiyao Tie and Guowen Zhang and Hecheng Wang and Pan Zhou and Lichao Sun , booktitle=. Bad

work page
[17]

The Thirteenth International Conference on Learning Representations , year=

BadRobot: Jailbreaking Embodied LLM Agents in the Physical World , author=. The Thirteenth International Conference on Learning Representations , year=

work page
[19]

The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

SAFE: Multitask Failure Detection for Vision-Language-Action Models , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

work page
[20]

Advances in Neural Information Processing Systems , volume=

Training with more confidence: Mitigating injected and natural backdoors during training , author=. Advances in Neural Information Processing Systems , volume=

work page
[21]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Backdoor Token Unlearning: Exposing and Defending Backdoors in Pretrained Language Models , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

work page
[22]

2024 , eprint=

Towards Action Hijacking of Large Language Model-based Agent , author=. 2024 , eprint=

work page 2024
[23]

Proceedings of the IEEE conference on Computer Vision and Pattern Recognition , pages=

Packnet: Adding multiple tasks to a single network by iterative pruning , author=. Proceedings of the IEEE conference on Computer Vision and Pattern Recognition , pages=

work page
[24]

International conference on machine learning , pages=

Overcoming catastrophic forgetting with hard attention to the task , author=. International conference on machine learning , pages=. 2018 , organization=

work page 2018
[25]

Gaurav Bagwe, Lan Zhang, Linke Guo, Miao Pan, Xiaolong Ma, and Xiaoyong Yuan. 2025. Is embedding-as-a-service safe? meta-prompt-based backdoor attacks for user-specific trigger migration. Transactions on Artificial Intelligence, 1(1):16--27

work page 2025
[26]

Li Changjiang, Liang Jiacheng, Cao Bochuan, Chen Jinghui, and Wang Ting. 2025. Your agent can defend itself against backdoor attacks. arXiv preprint arXiv:2506.08336

work page arXiv 2025
[27]

Qiao Gu, Yuanliang Ju, Shengxiang Sun, Igor Gilitschenski, Haruki Nishimura, Masha Itkina, and Florian Shkurti. 2025. Safe: Multitask failure detection for vision-language-action models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems

work page 2025
[28]

Tianyu Gu, Kang Liu, Brendan Dolan-Gavitt, and Siddharth Garg. 2019. Badnets: Evaluating backdooring attacks on deep neural networks. Ieee Access, 7:47230--47244

work page 2019
[29]

Changyue Jiang, Xudong Pan, and Min Yang. 2025 a . Think twice before you act: Enhancing agent behavioral safety with thought correction. arXiv preprint arXiv:2505.11063

work page arXiv 2025
[30]

Peihai Jiang, Xixiang Lyu, Yige Li, and Jing Ma. 2025 b . Backdoor token unlearning: Exposing and defending backdoors in pretrained language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 24285--24293

work page 2025
[31]

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, and 1 others. 2025. Openvla: An open-source vision-language-action model. In Conference on Robot Learning, pages 2679--2713. PMLR

work page 2025
[32]

Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. 2023. Libero: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems, 36:44776--44791

work page 2023
[33]

Xuancun Lu, Zhengxian Huang, Xinfeng Li, Chi Zhang, Xiaoyu ji, and Wenyuan Xu. 2024. Poex: Towards policy executable jailbreak attacks against the llm-based robots. arXiv preprint arXiv:2412.16633

work page arXiv 2024
[34]

Oubo Ma, Linkang Du, Yang Dai, Chunyi Zhou, Qingming Li, Yuwen Pu, and Shouling Ji. 2025. Unidoor: A universal framework for action-level backdoor attacks in deep reinforcement learning. arXiv preprint arXiv:2501.15529

work page arXiv 2025
[35]

Arun Mallya and Svetlana Lazebnik. 2018. Packnet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 7765--7773

work page 2018
[36]

Zachary Ravichandran, Alexander Robey, Vijay Kumar, George J Pappas, and Hamed Hassani. 2026. Safety guardrails for llm-enabled robots. IEEE Robotics and Automation Letters

work page 2026
[37]

Joan Serra, Didac Suris, Marius Miron, and Alexandros Karatzoglou. 2018. Overcoming catastrophic forgetting with hard attention to the task. In International conference on machine learning, pages 4548--4557. PMLR

work page 2018
[38]

Zhenting Wang, Hailun Ding, Juan Zhai, and Shiqing Ma. 2022. Training with more confidence: Mitigating injected and natural backdoors during training. Advances in Neural Information Processing Systems, 35:36396--36410

work page 2022
[39]

Xinyu Xu, Yizheng Zhang, Yong-Lu Li, Lei Han, and Cewu Lu. 2024. Humanvla: Towards vision-language directed object rearrangement by physical humanoid. Advances in Neural Information Processing Systems, 37:18633--18659

work page 2024
[40]

Borong Zhang, Yuhao Zhang, Jiaming Ji, Yingshan Lei, Josef Dai, Yuanpei Chen, and Yaodong Yang. 2025 a . Safevla: Towards safety alignment of vision-language-action model via constrained learning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems

work page 2025
[41]

Hangtao Zhang, Chenyu Zhu, Xianlong Wang, Ziqi Zhou, Changgan Yin, Minghui Li, Lulu Xue, Yichen Wang, Shengshan Hu, Aishan Liu, and 1 others. 2025 b . Badrobot: Jailbreaking embodied llm agents in the physical world. In The Thirteenth International Conference on Learning Representations

work page 2025
[42]

Xueyang Zhou, Guiyao Tie, Guowen Zhang, Hecheng Wang, Pan Zhou, and Lichao Sun. 2025. Bad VLA : Towards backdoor attacks on vision-language-action models via objective-decoupled optimization. In The Thirty-ninth Annual Conference on Neural Information Processing Systems

work page 2025
[43]

Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, and 1 others. 2023. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165--2183. PMLR

work page 2023
[44]

Andy Zou, Long Phan, Justin Wang, Derek Duenas, Maxwell Lin, Maksym Andriushchenko, Rowan Wang, Zico Kolter, Matt Fredrikson, and Dan Hendrycks. 2024. Improving alignment and robustness with circuit breakers. Advances in Neural Information Processing Systems, 37:83345--83373

work page 2024

[1] [2]

2017 IEEE International Conference on Data Mining (ICDM) , pages=

Gadei: On scale-up training as a service for deep learning , author=. 2017 IEEE International Conference on Data Mining (ICDM) , pages=. 2017 , organization=

work page 2017

[2] [3]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

Exploring the adversarial vulnerabilities of vision-language-action models in robotics , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

work page

[3] [4]

Conference on Robot Learning , pages=

Rt-2: Vision-language-action models transfer web knowledge to robotic control , author=. Conference on Robot Learning , pages=. 2023 , organization=

work page 2023

[4] [5]

Conference on Robot Learning , pages=

OpenVLA: An Open-Source Vision-Language-Action Model , author=. Conference on Robot Learning , pages=. 2025 , organization=

work page 2025

[5] [6]

Advances in Neural Information Processing Systems , volume=

Humanvla: Towards vision-language directed object rearrangement by physical humanoid , author=. Advances in Neural Information Processing Systems , volume=

work page

[6] [7]

QAVA: Query-Agnostic Visual Attack to Large Vision-Language Models , author=. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=

work page 2025

[7] [8]

Ieee Access , volume=

Badnets: Evaluating backdooring attacks on deep neural networks , author=. Ieee Access , volume=. 2019 , publisher=

work page 2019

[8] [9]

Advances in Neural Information Processing Systems , volume=

Libero: Benchmarking knowledge transfer for lifelong robot learning , author=. Advances in Neural Information Processing Systems , volume=

work page

[9] [10]

Is Embedding-as-a-Service Safe? Meta-Prompt-Based Backdoor Attacks for User-Specific Trigger Migration , volume=

Bagwe, Gaurav and Zhang, Lan and Guo, Linke and Pan, Miao and Ma, Xiaolong and Yuan, Xiaoyong , year=. Is Embedding-as-a-Service Safe? Meta-Prompt-Based Backdoor Attacks for User-Specific Trigger Migration , volume=. Transactions on Artificial Intelligence , publisher=

work page

[10] [12]

The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

work page

[11] [13]

IEEE Robotics and Automation Letters , year=

Safety Guardrails for LLM-Enabled Robots , author=. IEEE Robotics and Automation Letters , year=

work page

[12] [14]

Advances in Neural Information Processing Systems , volume=

Improving alignment and robustness with circuit breakers , author=. Advances in Neural Information Processing Systems , volume=

work page

[13] [16]

Xueyang Zhou and Guiyao Tie and Guowen Zhang and Hecheng Wang and Pan Zhou and Lichao Sun , booktitle=. Bad

work page

[14] [17]

The Thirteenth International Conference on Learning Representations , year=

BadRobot: Jailbreaking Embodied LLM Agents in the Physical World , author=. The Thirteenth International Conference on Learning Representations , year=

work page

[15] [19]

The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

SAFE: Multitask Failure Detection for Vision-Language-Action Models , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

work page

[16] [20]

Advances in Neural Information Processing Systems , volume=

Training with more confidence: Mitigating injected and natural backdoors during training , author=. Advances in Neural Information Processing Systems , volume=

work page

[17] [21]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Backdoor Token Unlearning: Exposing and Defending Backdoors in Pretrained Language Models , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

work page

[18] [22]

2024 , eprint=

Towards Action Hijacking of Large Language Model-based Agent , author=. 2024 , eprint=

work page 2024

[19] [23]

Proceedings of the IEEE conference on Computer Vision and Pattern Recognition , pages=

Packnet: Adding multiple tasks to a single network by iterative pruning , author=. Proceedings of the IEEE conference on Computer Vision and Pattern Recognition , pages=

work page

[20] [24]

International conference on machine learning , pages=

Overcoming catastrophic forgetting with hard attention to the task , author=. International conference on machine learning , pages=. 2018 , organization=

work page 2018

[21] [25]

Gaurav Bagwe, Lan Zhang, Linke Guo, Miao Pan, Xiaolong Ma, and Xiaoyong Yuan. 2025. Is embedding-as-a-service safe? meta-prompt-based backdoor attacks for user-specific trigger migration. Transactions on Artificial Intelligence, 1(1):16--27

work page 2025

[22] [26]

Li Changjiang, Liang Jiacheng, Cao Bochuan, Chen Jinghui, and Wang Ting. 2025. Your agent can defend itself against backdoor attacks. arXiv preprint arXiv:2506.08336

work page arXiv 2025

[23] [27]

Qiao Gu, Yuanliang Ju, Shengxiang Sun, Igor Gilitschenski, Haruki Nishimura, Masha Itkina, and Florian Shkurti. 2025. Safe: Multitask failure detection for vision-language-action models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems

work page 2025

[24] [28]

Tianyu Gu, Kang Liu, Brendan Dolan-Gavitt, and Siddharth Garg. 2019. Badnets: Evaluating backdooring attacks on deep neural networks. Ieee Access, 7:47230--47244

work page 2019

[25] [29]

Changyue Jiang, Xudong Pan, and Min Yang. 2025 a . Think twice before you act: Enhancing agent behavioral safety with thought correction. arXiv preprint arXiv:2505.11063

work page arXiv 2025

[26] [30]

Peihai Jiang, Xixiang Lyu, Yige Li, and Jing Ma. 2025 b . Backdoor token unlearning: Exposing and defending backdoors in pretrained language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 24285--24293

work page 2025

[27] [31]

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, and 1 others. 2025. Openvla: An open-source vision-language-action model. In Conference on Robot Learning, pages 2679--2713. PMLR

work page 2025

[28] [32]

Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. 2023. Libero: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems, 36:44776--44791

work page 2023

[29] [33]

Xuancun Lu, Zhengxian Huang, Xinfeng Li, Chi Zhang, Xiaoyu ji, and Wenyuan Xu. 2024. Poex: Towards policy executable jailbreak attacks against the llm-based robots. arXiv preprint arXiv:2412.16633

work page arXiv 2024

[30] [34]

Oubo Ma, Linkang Du, Yang Dai, Chunyi Zhou, Qingming Li, Yuwen Pu, and Shouling Ji. 2025. Unidoor: A universal framework for action-level backdoor attacks in deep reinforcement learning. arXiv preprint arXiv:2501.15529

work page arXiv 2025

[31] [35]

Arun Mallya and Svetlana Lazebnik. 2018. Packnet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 7765--7773

work page 2018

[32] [36]

Zachary Ravichandran, Alexander Robey, Vijay Kumar, George J Pappas, and Hamed Hassani. 2026. Safety guardrails for llm-enabled robots. IEEE Robotics and Automation Letters

work page 2026

[33] [37]

Joan Serra, Didac Suris, Marius Miron, and Alexandros Karatzoglou. 2018. Overcoming catastrophic forgetting with hard attention to the task. In International conference on machine learning, pages 4548--4557. PMLR

work page 2018

[34] [38]

Zhenting Wang, Hailun Ding, Juan Zhai, and Shiqing Ma. 2022. Training with more confidence: Mitigating injected and natural backdoors during training. Advances in Neural Information Processing Systems, 35:36396--36410

work page 2022

[35] [39]

Xinyu Xu, Yizheng Zhang, Yong-Lu Li, Lei Han, and Cewu Lu. 2024. Humanvla: Towards vision-language directed object rearrangement by physical humanoid. Advances in Neural Information Processing Systems, 37:18633--18659

work page 2024

[36] [40]

Borong Zhang, Yuhao Zhang, Jiaming Ji, Yingshan Lei, Josef Dai, Yuanpei Chen, and Yaodong Yang. 2025 a . Safevla: Towards safety alignment of vision-language-action model via constrained learning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems

work page 2025

[37] [41]

Hangtao Zhang, Chenyu Zhu, Xianlong Wang, Ziqi Zhou, Changgan Yin, Minghui Li, Lulu Xue, Yichen Wang, Shengshan Hu, Aishan Liu, and 1 others. 2025 b . Badrobot: Jailbreaking embodied llm agents in the physical world. In The Thirteenth International Conference on Learning Representations

work page 2025

[38] [42]

Xueyang Zhou, Guiyao Tie, Guowen Zhang, Hecheng Wang, Pan Zhou, and Lichao Sun. 2025. Bad VLA : Towards backdoor attacks on vision-language-action models via objective-decoupled optimization. In The Thirty-ninth Annual Conference on Neural Information Processing Systems

work page 2025

[39] [43]

Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, and 1 others. 2023. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165--2183. PMLR

work page 2023

[40] [44]

Andy Zou, Long Phan, Justin Wang, Derek Duenas, Maxwell Lin, Maksym Andriushchenko, Rowan Wang, Zico Kolter, Matt Fredrikson, and Dan Hendrycks. 2024. Improving alignment and robustness with circuit breakers. Advances in Neural Information Processing Systems, 37:83345--83373

work page 2024