Vision-Language-Action Safety: Threats, Challenges, Evaluations, and Mechanisms
Pith reviewed 2026-05-08 05:52 UTC · model grok-4.3
The pith
A survey unifies the safety literature for Vision-Language-Action models by organizing threats and defenses along training-time versus inference-time axes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VLA safety literature can be organized into four quadrants defined by attack timing (training-time versus inference-time) and defense timing (training-time versus inference-time). Training-time attacks include data poisoning and backdoors; inference-time attacks include adversarial patches, cross-modal perturbations, semantic jailbreaks, and freezing attacks. Corresponding defenses are reviewed at both stages, together with benchmarks, metrics, and domain-specific deployment issues.
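The four-quadrant structure above can be sketched as a small lookup. This is a hypothetical illustration: the threat classes come from the abstract, but the grid placement of any particular (threat, defense) pair shown here is an assumption, not the survey's own mapping.

```python
# Hypothetical sketch of the survey's two-axis timing grid.
# Threat classes follow the abstract; nothing here is the survey's
# own threat-to-defense assignment.
ATTACK_TIMING = {
    "data_poisoning": "training",
    "backdoor": "training",
    "adversarial_patch": "inference",
    "cross_modal_perturbation": "inference",
    "semantic_jailbreak": "inference",
    "freezing_attack": "inference",
}

def quadrant(threat: str, defense_timing: str) -> tuple[str, str]:
    """Place a (threat, defense) pair on the attack-timing x defense-timing grid."""
    if defense_timing not in ("training", "inference"):
        raise ValueError("defense timing must be 'training' or 'inference'")
    return (ATTACK_TIMING[threat], defense_timing)

# e.g., a backdoor countered by a runtime defense sits in the
# (training-time attack, inference-time defense) quadrant:
assert quadrant("backdoor", "inference") == ("training", "inference")
```

The point of the grid is exactly this kind of lookup: every reviewed threat has a fixed attack-timing coordinate, and each candidate mitigation adds the defense-timing coordinate.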
What carries the argument
The two parallel timing axes (attack timing and defense timing, each divided into training-time and inference-time) that structure every threat and mitigation reviewed in the survey.
If this is right
- Training-time defenses can block data poisoning and backdoors before a VLA model is deployed.
- Inference-time defenses must operate under real-time latency limits to counter patches and cross-modal attacks on physical hardware.
- Benchmarks must include long-horizon trajectories and physical irreversibility to measure true safety.
- Deployment in domains such as autonomous vehicles or household robots will require separate safety analyses because attack surfaces differ.
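The latency constraint in the second bullet can be sketched as a runtime defense chain that only runs as many inference-time checks as fit in a fixed per-step time budget. The function names and structure are illustrative assumptions, not from the survey.

```python
import time

def run_defense_chain(observation, defenses, budget_s=0.05):
    """Apply inference-time defenses in priority order within a time budget.

    Hypothetical sketch: each defense is a callable
    observation -> (observation, ok_flag). Returns the (possibly
    sanitized) observation and whether every applied check passed.
    """
    deadline = time.monotonic() + budget_s
    for defend in defenses:
        if time.monotonic() >= deadline:
            break  # out of budget: remaining checks are skipped this step
        observation, ok = defend(observation)
        if not ok:
            return observation, False  # flag unsafe input to the controller
    return observation, True
```

A priority ordering matters here: under a tight budget, only the first few defenses run, so the cheapest or highest-coverage checks should come first.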
Where Pith is reading between the lines
- The same timing grid could be applied to other embodied multimodal systems that are not strictly VLA.
- Certified robustness techniques for trajectories would directly address one of the open problems listed in the survey.
- A unified runtime safety layer might combine multiple inference-time defenses into a single lightweight module.
Load-bearing premise
The existing literature on VLA safety is fragmented enough that this particular two-axis timing organization adds clear value and that the cited papers already cover the main threats.
What would settle it
A sizable set of VLA safety papers whose threats or defenses cannot be placed on the training-versus-inference grid for attacks and defenses, or a major class of embodied threats absent from the survey.
Read the original abstract
Vision-Language-Action (VLA) models are emerging as a unified substrate for embodied intelligence. This shift raises a new class of safety challenges, stemming from the embodied nature of VLA systems, including irreversible physical consequences, a multimodal attack surface across vision, language, and state, real-time latency constraints on defense, error propagation over long-horizon trajectories, and vulnerabilities in the data supply chain. Yet the literature remains fragmented across robotic learning, adversarial machine learning, AI alignment, and autonomous systems safety. This survey provides a unified and up-to-date overview of safety in Vision-Language-Action models. We organize the field along two parallel timing axes, attack timing (training-time vs. inference-time and defense timing (training-time vs. inference-time, linking each class of threat to the stage at which it can be mitigated. We first define the scope of VLA safety, distinguishing it from text-only LLM safety and classical robotic safety, and review the foundations of VLA models, including architectures, training paradigms, and inference mechanisms. We then examine the literature through four lenses: Attacks, Defenses, Evaluation, and Deployment. We survey training-time threats such as data poisoning and backdoors, as well as inference-time attacks including adversarial patches, cross-modal perturbations, semantic jailbreaks, and freezing attacks. We review training-time and runtime defenses, analyze existing benchmarks and metrics, and discuss safety challenges across six deployment domains. Finally, we highlight key open problems, including certified robustness for embodied trajectories, physically realizable defenses, safety-aware training, unified runtime safety architectures, and standardized evaluation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to deliver a unified survey of safety issues in Vision-Language-Action (VLA) models for embodied intelligence. It distinguishes VLA safety from text-only LLM safety and classical robotic safety by emphasizing embodied factors such as physical irreversibility, multimodal attack surfaces, latency constraints, trajectory error propagation, and data supply chain vulnerabilities. The central contribution is an organizational framework using two parallel timing axes—attack timing (training-time vs. inference-time) and defense timing (training-time vs. inference-time)—that maps threats to mitigation stages. The manuscript reviews VLA foundations, then surveys the literature through four lenses (Attacks, Defenses, Evaluation, Deployment), covering training-time threats like data poisoning and backdoors, inference-time attacks like adversarial patches and semantic jailbreaks, corresponding defenses, benchmarks, and domain-specific deployment challenges, before listing open problems such as certified robustness for trajectories and standardized evaluation.
Significance. If the two-axis organization proves effective at linking threats to mitigation stages, the survey would provide a valuable structured reference for the emerging VLA safety literature, which the abstract correctly notes is currently fragmented across robotic learning, adversarial ML, AI alignment, and autonomous systems. The explicit grounding in embodied specifics (e.g., irreversible physical consequences and cross-modal surfaces) strengthens its relevance beyond generic LLM safety surveys. The paper earns credit for adopting a standard yet appropriate timing-based lens without overclaiming exhaustiveness or superiority to all alternatives, and for clearly scoping the VLA definition before applying the framework.
minor comments (1)
- Abstract: the sentence introducing the two axes contains a missing closing parenthesis and awkward phrasing ('attack timing (training-time vs. inference-time and defense timing (training-time vs. inference-time, linking'), which reduces readability; this should be corrected to 'attack timing (training-time vs. inference-time) and defense timing (training-time vs. inference-time), linking'.
Simulated Author's Rebuttal
We thank the referee for the positive and accurate summary of our survey on Vision-Language-Action safety. The recommendation for minor revision is noted, and we appreciate the recognition of the two-axis organizational framework and the emphasis on embodied factors. As no specific major comments were raised in the report, we have no rebuttals to provide and will incorporate any editorial or minor suggestions in the revised version.
Circularity Check
No significant circularity: the work is a literature survey
full rationale
This paper is a survey that reviews and organizes external literature on VLA safety without any original derivations, equations, fitted parameters, or predictions that could reduce to its own inputs by construction. The two-axis organization (attack timing and defense timing) is introduced as a conceptual lens to structure the review of threats, defenses, evaluations, and deployments, motivated by embodied characteristics listed in the abstract, but not derived from or equivalent to any self-referential content. No self-citations are load-bearing for a central claim, no uniqueness theorems are invoked from prior author work, and no ansatzes or renamings of known results occur. The scope definition and four-lens structure precede the organization, and the paper makes no claim that the axes are exhaustive or mathematically forced, rendering the work self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: VLA models form a unified substrate for embodied intelligence, distinct from text-only LLMs and classical robotic systems.
Reference graph
Works this paper leans on
- [1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- [2] Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, et al. Do as I can, not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022.
- [3] Peter Anderson, Angel Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, et al. On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757, 2018.
- [4] Hidehisa Arai, Keita Miwa, Kento Sasaki, Kohei Watanabe, Yu Yamaguchi, Shunsuke Aoki, and Issei Yamamoto. CoVLA: Comprehensive vision-language-action dataset for autonomous driving. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1933–1943. IEEE, 2025.
- [5] Shaohan Bian, Ying Zhang, Guohui Tian, Zhiqiang Miao, Edmond Q Wu, Simon X Yang, and Changchun Hua. Large language model-based task planning for service robots: A review. Biomimetic Intelligence and Robotics, page 100274, 2026.
- [6] Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024.
- [7] Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
- [8] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. RT-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022.
- [9] Hieu Bui, Nathaniel E Chodosh, and Arash Tavakoli. Can vision-language models understand construction workers? An exploratory study. arXiv preprint arXiv:2601.10835, 2026.
- [10] Jiamin Chang, Minhui Xue, Ruoxi Sun, Shuchao Pang, Salil S. Kanhere, and Hammond Pearce. If you're waiting for a sign... that might not be it! Mitigating trust boundary confusion from visual injections on vision-language agentic systems, 2026. https://arxiv.org/abs/2604.19844.
- [11] Ruolin Chen, Yinqian Sun, Jihang Wang, Mingyang Lv, Qian Zhang, and Yi Zeng. SafeMind: Benchmarking and mitigating safety risks in embodied LLM agents. arXiv preprint arXiv:2509.25885, 2025.
- [12] Zixing Chen, Yifeng Gao, Li Wang, Yunhan Zhao, Yi Liu, Jiayu Li, Xiang Zheng, Zuxuan Wu, Cong Wang, Xingjun Ma, et al. HazardArena: Evaluating semantic safety in vision-language-action models. arXiv preprint arXiv:2604.12447, 2026.
- [13] Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44(10-11):1684–1704, 2025.
- [14] Zeyuan Feng, Haimingyue Zhang, and Somil Bansal. From words to safety: Language-conditioned safety filtering for robot navigation. arXiv preprint arXiv:2511.05889, 2025.
- [15] Zipeng Fu, Tony Z Zhao, and Chelsea Finn. Mobile ALOHA: Learning bimanual mobile manipulation with low-cost whole-body teleoperation. arXiv preprint arXiv:2401.02117, 2024.
- [16] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- [17] Qiao Gu, Yuanliang Ju, Shengxiang Sun, Igor Gilitschenski, Haruki Nishimura, Masha Itkina, and Florian Shkurti. SAFE: Multitask failure detection for vision-language-action models. arXiv preprint arXiv:2506.09937, 2025.
- [18] Ji Guo, Wenbo Jiang, Yansong Lin, Yijing Liu, Ruichen Zhang, Guomin Lu, Aiguo Chen, Xinshuo Han, Hongwei Li, and Dusit Niyato. State backdoor: Towards stealthy real-world poisoning attack on vision-language-action model in state space, 2026.
- [19] Satyandra K Gupta. Embodied AI for smart robotic cells in manufacturing applications. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 28630–28636, 2025.
- [20] Kaiser Hamid, Can Cui, and Nade Liang. ICR-Drive: Instruction counterfactual robustness for end-to-end language-driven autonomous driving. arXiv preprint arXiv:2604.05378, 2026.
- [21] Asher J Hancock, Allen Z Ren, and Anirudha Majumdar. Run-time observation interventions make vision-language-action models more visually robust. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 9499–9506. IEEE, 2025.
- [22] Homayoun Honari, Mehran Ghafarian Tamizi, and Homayoun Najjaran. Safety optimized reinforcement learning via multi-objective policy optimization, 2024.
- [23] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.
- [24] Songqiao Hu, Zeyi Liu, Shuang Liu, Jun Cen, Zihan Meng, and Xiao He. VLSA: Vision-language-action models with plug-and-play safety constraint layer. arXiv preprint arXiv:2512.11891, 2025.
- [25] Tianshuai Hu, Xiaolu Liu, Song Wang, Yiyao Zhu, Ao Liang, Lingdong Kong, Guoyang Zhao, Zeying Gong, Jun Cen, Zhiyu Huang, et al. Vision-language-action models for autonomous driving: Past, present, and future. arXiv preprint arXiv:2512.16760, 2025.
- [26] Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, and Li Fei-Fei. VoxPoser: Composable 3D value maps for robotic manipulation with language models. arXiv preprint arXiv:2307.05973, 2023.
- [27] Zilin Huang, Zihao Sheng, Yansong Qu, Junwei You, and Sikai Chen. VLM-RL: A unified vision language models and reinforcement learning framework for safe autonomous driving. Transportation Research Part C: Emerging Technologies, 180:105321, 2025.
- [28] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.
- [29] Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, et al. EMMA: End-to-end multimodal model for autonomous driving. arXiv preprint arXiv:2410.23262, 2024.
- [30] Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. π0.5: A vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025.
- [31] Sicong Jiang, Zilin Huang, Kangan Qian, Ziang Luo, Tianze Zhu, Yang Zhong, Yihong Tang, Menglin Kong, Yunlong Wang, Siwen Jiao, et al. A survey on vision-language-action models for autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4524–4536, 2025.
- [32] Eliot Krzysztof Jones, Alexander Robey, Andy Zou, Zachary Ravichandran, George J Pappas, Hamed Hassani, Matt Fredrikson, and J Zico Kolter. Adversarial attacks on robotic vision language action models. arXiv preprint arXiv:2506.03350, 2025.
- [33] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. OpenVLA: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024.
- [34] Unggi Lee, Jahyun Jeong, Sunyoung Shin, Haeun Park, Jeongsu Moon, Youngchang Song, Jaechang Shim, JaeHwan Lee, Yunju Noh, Seungwon Choi, Ahhyun Kim, TaeHyeon Kim, Kyungtae Joo, Taeyeong Kim, and Gyeonggeon Lee. Pedagogical alignment for vision-language-action models: A comprehensive framework for data, architecture, and evaluation in education, 2026.
- [35] Chunyang Li, Zifeng Kang, Junwei Zhang, Zhuo Ma, Anda Cheng, Xinghua Li, and Jianfeng Ma. The Shawshank redemption of embodied AI: Understanding and benchmarking indirect environmental jailbreaks. arXiv preprint arXiv:2511.16347, 2025.
- [36] Jiayu Li, Yunhan Zhao, Xiang Zheng, Zonghuan Xu, Yige Li, Xingjun Ma, and Yu-Gang Jiang. AttackVLA: Benchmarking adversarial and backdoor attacks on vision-language-action models. arXiv preprint arXiv:2511.12149, 2025.
- [37] Shunlei Li, Jin Wang, Rui Dai, Wanyu Ma, Wing Yin Ng, Yingbai Hu, and Zheng Li. RoboNurse-VLA: Robotic scrub nurse system based on vision-language-action model. In 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3986–3993. IEEE, 2025.
- [38] Yun Li, Yidu Zhang, Simon Thompson, Ehsan Javanmardi, and Manabu Tsukada. Causal scene narration with runtime safety supervision for vision-language-action driving. arXiv preprint arXiv:2604.01723, 2026.
- [39] Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 9493–9500. IEEE, 2023.
- [40] Zijun Lin, Jiafei Duan, Haoquan Fang, Dieter Fox, Ranjay Krishna, Cheston Tan, and Bihan Wen. FailSafe: Reasoning and recovery from failures in vision-language-action models. arXiv preprint arXiv:2510.01642, 2025.
- [41] Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. LIBERO: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems, 36:44776–44791, 2023.
- [42] Zeting Liu, Zida Yang, Zeyu Zhang, and Hao Tang. EvoVLA: Self-evolving vision-language-action model, 2025.
- [43] Zeyi Liu, Arpit Bahety, and Shuran Song. REFLECT: Summarizing robot experiences for failure explanation and correction. arXiv preprint arXiv:2306.15724, 2023.
- [44] Guanxing Lu, Rui Zhao, Haitao Lin, He Zhang, and Yansong Tang. Human-in-the-loop online rejection sampling for robotic manipulation, 2025.
- [45] Xuancun Lu, Jiaxiang Chen, Shilin Xiao, Zizhi Jin, Ruochen Zhou, Xiaoyu Ji, and Wenyuan Xu. Exploring the robustness of vision-language-action models against sensor attacks. In Proceedings of the 2025 Workshop on Large AI Systems and Models with Privacy and Security Analysis, pages 11–18, 2025.
- [46] Xuancun Lu, Jiaxiang Chen, Shilin Xiao, Zizhi Jin, Zhangrui Chen, Hanwen Yu, Bohan Qian, Ruochen Zhou, Xiaoyu Ji, and Wenyuan Xu. Phantom menace: Exploring and enhancing the robustness of VLA models against physical sensor attacks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 35689–35697, 2026.
- [47] Fawad Mehboob, Monijesu James, Amir Habel, Jeffrin Sam, Miguel Altamirano Cabrera, and Dzmitry Tsetserukou. DroneVLA: VLA based aerial manipulation. arXiv preprint arXiv:2601.13809, 2026.
- [48] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- [49] Abby O'Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open X-Embodiment: Robotic learning datasets and RT-X models. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024.
- [50] Zezhong Qian, Xiaowei Chi, Yuming Li, Shizun Wang, Zhiyuan Qin, Xiaozhu Ju, Sirui Han, and Shanghang Zhang. WristWorld: Generating wrist-views via 4D world models for robotic manipulation. arXiv preprint arXiv:2510.07313, 2025.
- [51] Yansong Qu, Zilin Huang, Zihao Sheng, Jiancong Chen, Sikai Chen, and Samuel Labi. VL-Safe: Vision-language guided safety-aware reinforcement learning with world models for autonomous driving. arXiv preprint arXiv:2505.16377, 2025.
- [52] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- [53] Ravi Ranjan and Agoritsa Polyzou. VLA-Forget: Vision-language-action unlearning for embodied foundation models. arXiv preprint arXiv:2604.03956, 2026.
- [54] Amir Rasouli, Yangzheng Wu, Zhiyuan Li, Rui Heng Yang, Xuan Zhao, Charles Eret, and Sajjad Pakdamansavoji. How VLAs (really) work in open-world environments, 2026. https://arxiv.org/abs/2604.21192.
- [55] Zachary Ravichandran, Alexander Robey, Vijay Kumar, George J Pappas, and Hamed Hassani. Safety guardrails for LLM-enabled robots. IEEE Robotics and Automation Letters, 2026.
- [56] Alexander Robey, Zachary Ravichandran, Vijay Kumar, Hamed Hassani, and George J Pappas. Jailbreaking LLM-controlled robots. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 11948–11956. IEEE, 2025.
- [57] Yanchi Ru, Zhengyue Zhao, Yingzi Ma, Xiaogeng Liu, and Chaowei Xiao. VLA-Risk: Benchmarking vision-language-action models with physical robustness, 2026. https://openreview.net/forum?id=31EjDFwFEe.
- [58] Kristy Sakano, Jianyu An, Dinesh Manocha, and Huan Xu. Safe-SMART: Safety analysis and formal evaluation using STL metrics for autonomous robots. arXiv preprint arXiv:2511.17781, 2025.
- [59] Haebin Seong, Sungmin Kim, Minchan Kim, Yongjun Cho, Myunchul Joe, Suhwan Choi, Jaeyoon Jung, Jiyong Youn, Yoonshik Kim, Samwoo Seong, et al. CostNav: A navigation benchmark for cost-aware evaluation of embodied agents. arXiv preprint arXiv:2511.20216, 2025.
- [60] Pierre Sermanet, Anirudha Majumdar, Alex Irpan, Dmitry Kalashnikov, and Vikas Sindhwani. Generating robot constitutions & benchmarks for semantic safety. arXiv preprint arXiv:2503.08663, 2025.
- [61] Daeun Song, Jing Liang, Amirreza Payandeh, Amir Hossain Raj, Xuesu Xiao, and Dinesh Manocha. VLM-Social-Nav: Socially aware robot navigation through scoring using vision-language models. IEEE Robotics and Automation Letters, 10(1):508–515, 2024.
- [62] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
- [63] Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.
- [64] Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213, 2024.
- [65] Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Yang Wang, Zhiyong Zhao, Kun Zhan, Peng Jia, Xianpeng Lang, and Hang Zhao. DriveVLM: The convergence of autonomous driving and large vision-language models. arXiv preprint arXiv:2402.12289, 2024.
- [66] Maximilian Tölle, Theo Gruner, Daniel Palenicek, Tim Schneider, Jonas Günster, Joe Watson, Davide Tateo, Puze Liu, and Jan Peters. Towards safe robot foundation models using inductive biases. arXiv preprint arXiv:2505.10219, 2025.
- [67] Pablo Valle, Chengjie Lu, Shaukat Ali, and Aitor Arrieta. Evaluating uncertainty and quality of visual language action-enabled robots. arXiv preprint arXiv:2507.17049, 2025.
- [68] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- [69] Akifumi Wachi, Xun Shen, and Yanan Sui. A survey of constraint formulations in safe reinforcement learning. arXiv preprint arXiv:2402.02025, 2024.
- [70] Guankun Wang, Long Bai, Wan Jun Nah, Jie Wang, Zhaoxi Zhang, Zhen Chen, Jinlin Wu, Mobarakol Islam, Hongbin Liu, and Hongliang Ren. Surgical-LVLM: Learning to adapt large vision-language model for grounded visual question answering in robotic surgery. arXiv preprint arXiv:2405.10948, 2024.
- [71] Le Wang, Zonghao Ying, Xiao Yang, Quanchen Zou, Zhenfei Yin, Tianlin Li, Jian Yang, Yaodong Yang, Aishan Liu, and Xianglong Liu. RoboSafe: Safeguarding embodied agents via executable safety logic, 2025.
- [72] Meng Wang, Yohei Hayamizu, Matthew Tang, Kevin Gopalan, Shiqi Zhang, and Ping Yang. Physical attacks on robot navigation systems. In RSS 2025 Workshop on Reliable Robotics: Safety and Security in the Face of Generative AI, 2025. https://openreview.net/forum?id=A4AWclA4aC.
- [73] Taowen Wang, Cheng Han, James Liang, Wenhao Yang, Dongfang Liu, Luna Xinyu Zhang, Qifan Wang, Jiebo Luo, and Ruixiang Tang. Exploring the adversarial vulnerabilities of vision-language-action models in robotics. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6948–6958, 2025.
- [74] Xin Wang, Jie Li, Zejia Weng, Yixu Wang, Yifeng Gao, Tianyu Pang, Chao Du, Yan Teng, Yingchun Wang, Zuxuan Wu, et al. FreezeVLA: Action-freezing attacks against vision-language-action models. arXiv preprint arXiv:2509.19870, 2025.
- [75] Zhijie Wang, Zhehua Zhou, Jiayang Song, Yuheng Huang, Zhan Shu, and Lei Ma. VLATest: Testing and evaluating vision-language-action models for robotic manipulation. Proceedings of the ACM on Software Engineering, 2(FSE):1615–1638, 2025.
- [76] Wenke Xia, Yichu Yang, Hongtao Wu, Xiao Ma, Tao Kong, and Di Hu. Human-assisted robotic policy refinement via action preference optimization, 2025.
- [77] Tian-Yu Xiang, Ao-Qun Jin, Xiao-Hu Zhou, Mei-Jiang Gui, Xiao-Liang Xie, Shi-Qi Liu, Shuang-Yi Wang, Sheng-Bin Duan, Fu-Chao Xie, Wen-Kai Wang, et al. Parallels between VLA model post-training and human motor learning: Progress, challenges, and trends. arXiv preprint arXiv:2506.20966, 2025.
- [78] Bingxin Xu, Yuzhang Shang, Binghui Wang, and Emilio Ferrara. SilentDrift: Exploiting action chunking for stealthy backdoor attacks on vision-language-action models, 2026.
- [79] Siyu Xu, Zijian Wang, Yunke Wang, Chenghao Xia, Tao Huang, and Chang Xu. Affordance field intervention: Enabling VLAs to escape memory traps in robotic manipulation. arXiv preprint arXiv:2512.07472, 2025.
- [80] Zonghuan Xu, Jiayu Li, Yunhan Zhao, Xiang Zheng, Xingjun Ma, and Yu-Gang Jiang. DropVLA: An action-level backdoor attack on vision-language-action models, 2026.
discussion (0)