Vision-Language-Action Safety: Threats, Challenges, Evaluations, and Mechanisms
Pith reviewed 2026-05-08 05:52 UTC · model grok-4.3
The pith
A survey unifies the safety literature for Vision-Language-Action models by organizing threats and defenses along training-time versus inference-time axes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VLA safety literature can be organized into four quadrants defined by attack timing (training-time versus inference-time) and defense timing (training-time versus inference-time). Training-time attacks include data poisoning and backdoors; inference-time attacks include adversarial patches, cross-modal perturbations, semantic jailbreaks, and freezing attacks. Corresponding defenses are reviewed at both stages, together with benchmarks, metrics, and domain-specific deployment issues.
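The four-quadrant structure above can be sketched as a small lookup. This is a hypothetical illustration: the threat classes come from the abstract, but the grid placement of any particular (threat, defense) pair shown here is an assumption, not the survey's own mapping.

```python
# Hypothetical sketch of the survey's two-axis timing grid.
# Threat classes follow the abstract; nothing here is the survey's
# own threat-to-defense assignment.
ATTACK_TIMING = {
    "data_poisoning": "training",
    "backdoor": "training",
    "adversarial_patch": "inference",
    "cross_modal_perturbation": "inference",
    "semantic_jailbreak": "inference",
    "freezing_attack": "inference",
}

def quadrant(threat: str, defense_timing: str) -> tuple[str, str]:
    """Place a (threat, defense) pair on the attack-timing x defense-timing grid."""
    if defense_timing not in ("training", "inference"):
        raise ValueError("defense timing must be 'training' or 'inference'")
    return (ATTACK_TIMING[threat], defense_timing)

# e.g., a backdoor countered by a runtime defense sits in the
# (training-time attack, inference-time defense) quadrant:
assert quadrant("backdoor", "inference") == ("training", "inference")
```

The point of the grid is exactly this kind of lookup: every reviewed threat has a fixed attack-timing coordinate, and each candidate mitigation adds the defense-timing coordinate.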
What carries the argument
The two parallel timing axes (attack timing and defense timing, each divided into training-time and inference-time) that structure every threat and mitigation reviewed in the survey.
If this is right
- Training-time defenses can block data poisoning and backdoors before a VLA model is deployed.
- Inference-time defenses must operate under real-time latency limits to counter patches and cross-modal attacks on physical hardware.
- Benchmarks must include long-horizon trajectories and physical irreversibility to measure true safety.
- Deployment in domains such as autonomous vehicles or household robots will require separate safety analyses because attack surfaces differ.
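The latency constraint in the second bullet can be sketched as a runtime defense chain that only runs as many inference-time checks as fit in a fixed per-step time budget. The function names and structure are illustrative assumptions, not from the survey.

```python
import time

def run_defense_chain(observation, defenses, budget_s=0.05):
    """Apply inference-time defenses in priority order within a time budget.

    Hypothetical sketch: each defense is a callable
    observation -> (observation, ok_flag). Returns the (possibly
    sanitized) observation and whether every applied check passed.
    """
    deadline = time.monotonic() + budget_s
    for defend in defenses:
        if time.monotonic() >= deadline:
            break  # out of budget: remaining checks are skipped this step
        observation, ok = defend(observation)
        if not ok:
            return observation, False  # flag unsafe input to the controller
    return observation, True
```

A priority ordering matters here: under a tight budget, only the first few defenses run, so the cheapest or highest-coverage checks should come first.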
Where Pith is reading between the lines
- The same timing grid could be applied to other embodied multimodal systems that are not strictly VLA.
- Certified robustness techniques for trajectories would directly address one of the open problems listed in the survey.
- A unified runtime safety layer might combine multiple inference-time defenses into a single lightweight module.
Load-bearing premise
The existing literature on VLA safety is fragmented enough that this particular two-axis timing organization adds clear value and that the cited papers already cover the main threats.
What would settle it
A sizable set of VLA safety papers whose threats or defenses cannot be placed on the training-versus-inference grid for attacks and defenses, or a major class of embodied threats absent from the survey.
Read the original abstract
Vision-Language-Action (VLA) models are emerging as a unified substrate for embodied intelligence. This shift raises a new class of safety challenges, stemming from the embodied nature of VLA systems, including irreversible physical consequences, a multimodal attack surface across vision, language, and state, real-time latency constraints on defense, error propagation over long-horizon trajectories, and vulnerabilities in the data supply chain. Yet the literature remains fragmented across robotic learning, adversarial machine learning, AI alignment, and autonomous systems safety. This survey provides a unified and up-to-date overview of safety in Vision-Language-Action models. We organize the field along two parallel timing axes, attack timing (training-time vs. inference-time and defense timing (training-time vs. inference-time, linking each class of threat to the stage at which it can be mitigated. We first define the scope of VLA safety, distinguishing it from text-only LLM safety and classical robotic safety, and review the foundations of VLA models, including architectures, training paradigms, and inference mechanisms. We then examine the literature through four lenses: Attacks, Defenses, Evaluation, and Deployment. We survey training-time threats such as data poisoning and backdoors, as well as inference-time attacks including adversarial patches, cross-modal perturbations, semantic jailbreaks, and freezing attacks. We review training-time and runtime defenses, analyze existing benchmarks and metrics, and discuss safety challenges across six deployment domains. Finally, we highlight key open problems, including certified robustness for embodied trajectories, physically realizable defenses, safety-aware training, unified runtime safety architectures, and standardized evaluation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to deliver a unified survey of safety issues in Vision-Language-Action (VLA) models for embodied intelligence. It distinguishes VLA safety from text-only LLM safety and classical robotic safety by emphasizing embodied factors such as physical irreversibility, multimodal attack surfaces, latency constraints, trajectory error propagation, and data supply chain vulnerabilities. The central contribution is an organizational framework using two parallel timing axes—attack timing (training-time vs. inference-time) and defense timing (training-time vs. inference-time)—that maps threats to mitigation stages. The manuscript reviews VLA foundations, then surveys the literature through four lenses (Attacks, Defenses, Evaluation, Deployment), covering training-time threats like data poisoning and backdoors, inference-time attacks like adversarial patches and semantic jailbreaks, corresponding defenses, benchmarks, and domain-specific deployment challenges, before listing open problems such as certified robustness for trajectories and standardized evaluation.
Significance. If the two-axis organization proves effective at linking threats to mitigation stages, the survey would provide a valuable structured reference for the emerging VLA safety literature, which the abstract correctly notes is currently fragmented across robotic learning, adversarial ML, AI alignment, and autonomous systems. The explicit grounding in embodied specifics (e.g., irreversible physical consequences and cross-modal surfaces) strengthens its relevance beyond generic LLM safety surveys. The paper earns credit for adopting a standard yet appropriate timing-based lens without overclaiming exhaustiveness or superiority to all alternatives, and for clearly scoping the VLA definition before applying the framework.
minor comments (1)
- Abstract: the sentence introducing the two axes contains a missing closing parenthesis and awkward phrasing ('attack timing (training-time vs. inference-time and defense timing (training-time vs. inference-time, linking'), which reduces readability; this should be corrected to 'attack timing (training-time vs. inference-time) and defense timing (training-time vs. inference-time), linking'.
Simulated Author's Rebuttal
We thank the referee for the positive and accurate summary of our survey on Vision-Language-Action safety. The recommendation for minor revision is noted, and we appreciate the recognition of the two-axis organizational framework and the emphasis on embodied factors. As no specific major comments were raised in the report, we have no rebuttals to provide and will incorporate any editorial or minor suggestions in the revised version.
Circularity Check
No significant circularity: the work is a literature survey
full rationale
This paper is a survey that reviews and organizes external literature on VLA safety without any original derivations, equations, fitted parameters, or predictions that could reduce to its own inputs by construction. The two-axis organization (attack timing and defense timing) is introduced as a conceptual lens to structure the review of threats, defenses, evaluations, and deployments, motivated by embodied characteristics listed in the abstract, but not derived from or equivalent to any self-referential content. No self-citations are load-bearing for a central claim, no uniqueness theorems are invoked from prior author work, and no ansatzes or renamings of known results occur. The scope definition and four-lens structure precede the organization, and the paper makes no claim that the axes are exhaustive or mathematically forced, rendering the work self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: VLA models form a unified substrate for embodied intelligence, distinct from text-only LLMs and classical robotic systems.
Reference graph
Works this paper leans on
- [1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- [2] Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, et al. Do as I can, not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022.
- [3] Peter Anderson, Angel Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, et al. On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757, 2018.
- [4] Hidehisa Arai, Keita Miwa, Kento Sasaki, Kohei Watanabe, Yu Yamaguchi, Shunsuke Aoki, and Issei Yamamoto. CoVLA: Comprehensive vision-language-action dataset for autonomous driving. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1933–1943. IEEE, 2025.
- [5] Shaohan Bian, Ying Zhang, Guohui Tian, Zhiqiang Miao, Edmond Q Wu, Simon X Yang, and Changchun Hua. Large language model-based task planning for service robots: A review. Biomimetic Intelligence and Robotics, page 100274, 2026.
- [6] Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024.
- [7] Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
- [8] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. RT-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022.
- [9] Hieu Bui, Nathaniel E Chodosh, and Arash Tavakoli. Can vision-language models understand construction workers? An exploratory study. arXiv preprint arXiv:2601.10835, 2026.
- [10] Jiamin Chang, Minhui Xue, Ruoxi Sun, Shuchao Pang, Salil S. Kanhere, and Hammond Pearce. If you're waiting for a sign... that might not be it! Mitigating trust boundary confusion from visual injections on vision-language agentic systems, 2026. https://arxiv.org/abs/2604.19844.
- [11] Ruolin Chen, Yinqian Sun, Jihang Wang, Mingyang Lv, Qian Zhang, and Yi Zeng. SafeMind: Benchmarking and mitigating safety risks in embodied LLM agents. arXiv preprint arXiv:2509.25885, 2025.
- [12] Zixing Chen, Yifeng Gao, Li Wang, Yunhan Zhao, Yi Liu, Jiayu Li, Xiang Zheng, Zuxuan Wu, Cong Wang, Xingjun Ma, et al. HazardArena: Evaluating semantic safety in vision-language-action models. arXiv preprint arXiv:2604.12447, 2026.
- [13] Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44(10-11):1684–1704, 2025.
- [14] Zeyuan Feng, Haimingyue Zhang, and Somil Bansal. From words to safety: Language-conditioned safety filtering for robot navigation. arXiv preprint arXiv:2511.05889, 2025.
- [15] Zipeng Fu, Tony Z Zhao, and Chelsea Finn. Mobile ALOHA: Learning bimanual mobile manipulation with low-cost whole-body teleoperation. arXiv preprint arXiv:2401.02117, 2024.
- [16] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- [17] Qiao Gu, Yuanliang Ju, Shengxiang Sun, Igor Gilitschenski, Haruki Nishimura, Masha Itkina, and Florian Shkurti. SAFE: Multitask failure detection for vision-language-action models. arXiv preprint arXiv:2506.09937, 2025.
- [18] Ji Guo, Wenbo Jiang, Yansong Lin, Yijing Liu, Ruichen Zhang, Guomin Lu, Aiguo Chen, Xinshuo Han, Hongwei Li, and Dusit Niyato. State backdoor: Towards stealthy real-world poisoning attack on vision-language-action model in state space, 2026.
- [19] Satyandra K Gupta. Embodied AI for smart robotic cells in manufacturing applications. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 28630–28636, 2025.
- [20] Kaiser Hamid, Can Cui, and Nade Liang. ICR-Drive: Instruction counterfactual robustness for end-to-end language-driven autonomous driving. arXiv preprint arXiv:2604.05378, 2026.
- [21] Asher J Hancock, Allen Z Ren, and Anirudha Majumdar. Run-time observation interventions make vision-language-action models more visually robust. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 9499–9506. IEEE, 2025.
- [22] Homayoun Honari, Mehran Ghafarian Tamizi, and Homayoun Najjaran. Safety optimized reinforcement learning via multi-objective policy optimization, 2024.
- [23] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.
- [24] Songqiao Hu, Zeyi Liu, Shuang Liu, Jun Cen, Zihan Meng, and Xiao He. VLSA: Vision-language-action models with plug-and-play safety constraint layer. arXiv preprint arXiv:2512.11891, 2025.
- [25] Tianshuai Hu, Xiaolu Liu, Song Wang, Yiyao Zhu, Ao Liang, Lingdong Kong, Guoyang Zhao, Zeying Gong, Jun Cen, Zhiyu Huang, et al. Vision-language-action models for autonomous driving: Past, present, and future. arXiv preprint arXiv:2512.16760, 2025.
- [26] Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, and Li Fei-Fei. VoxPoser: Composable 3D value maps for robotic manipulation with language models. arXiv preprint arXiv:2307.05973, 2023.
- [27] Zilin Huang, Zihao Sheng, Yansong Qu, Junwei You, and Sikai Chen. VLM-RL: A unified vision language models and reinforcement learning framework for safe autonomous driving. Transportation Research Part C: Emerging Technologies, 180:105321, 2025.
- [28] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.
- [29] Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, et al. EMMA: End-to-end multimodal model for autonomous driving. arXiv preprint arXiv:2410.23262, 2024.
- [30] Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. π0.5: A vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025.
- [31] Sicong Jiang, Zilin Huang, Kangan Qian, Ziang Luo, Tianze Zhu, Yang Zhong, Yihong Tang, Menglin Kong, Yunlong Wang, Siwen Jiao, et al. A survey on vision-language-action models for autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4524–4536, 2025.
- [32] Eliot Krzysztof Jones, Alexander Robey, Andy Zou, Zachary Ravichandran, George J Pappas, Hamed Hassani, Matt Fredrikson, and J Zico Kolter. Adversarial attacks on robotic vision language action models. arXiv preprint arXiv:2506.03350, 2025.
- [33] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. OpenVLA: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024.
- [34] Unggi Lee, Jahyun Jeong, Sunyoung Shin, Haeun Park, Jeongsu Moon, Youngchang Song, Jaechang Shim, JaeHwan Lee, Yunju Noh, Seungwon Choi, Ahhyun Kim, TaeHyeon Kim, Kyungtae Joo, Taeyeong Kim, and Gyeonggeon Lee. Pedagogical alignment for vision-language-action models: A comprehensive framework for data, architecture, and evaluation in education, 2026.
- [35] Chunyang Li, Zifeng Kang, Junwei Zhang, Zhuo Ma, Anda Cheng, Xinghua Li, and Jianfeng Ma. The Shawshank redemption of embodied AI: Understanding and benchmarking indirect environmental jailbreaks. arXiv preprint arXiv:2511.16347, 2025.
- [36] Jiayu Li, Yunhan Zhao, Xiang Zheng, Zonghuan Xu, Yige Li, Xingjun Ma, and Yu-Gang Jiang. AttackVLA: Benchmarking adversarial and backdoor attacks on vision-language-action models. arXiv preprint arXiv:2511.12149, 2025.
- [37] Shunlei Li, Jin Wang, Rui Dai, Wanyu Ma, Wing Yin Ng, Yingbai Hu, and Zheng Li. RoboNurse-VLA: Robotic scrub nurse system based on vision-language-action model. In 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3986–3993. IEEE, 2025.
- [38] Yun Li, Yidu Zhang, Simon Thompson, Ehsan Javanmardi, and Manabu Tsukada. Causal scene narration with runtime safety supervision for vision-language-action driving. arXiv preprint arXiv:2604.01723, 2026.
- [39] Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 9493–9500. IEEE, 2023.
- [40] Zijun Lin, Jiafei Duan, Haoquan Fang, Dieter Fox, Ranjay Krishna, Cheston Tan, and Bihan Wen. FailSafe: Reasoning and recovery from failures in vision-language-action models. arXiv preprint arXiv:2510.01642, 2025.
- [41] Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. LIBERO: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems, 36:44776–44791, 2023.
- [42] Zeting Liu, Zida Yang, Zeyu Zhang, and Hao Tang. EvoVLA: Self-evolving vision-language-action model, 2025.
- [43] Zeyi Liu, Arpit Bahety, and Shuran Song. REFLECT: Summarizing robot experiences for failure explanation and correction. arXiv preprint arXiv:2306.15724, 2023.
- [44] Guanxing Lu, Rui Zhao, Haitao Lin, He Zhang, and Yansong Tang. Human-in-the-loop online rejection sampling for robotic manipulation, 2025.
- [45] Xuancun Lu, Jiaxiang Chen, Shilin Xiao, Zizhi Jin, Ruochen Zhou, Xiaoyu Ji, and Wenyuan Xu. Exploring the robustness of vision-language-action models against sensor attacks. In Proceedings of the 2025 Workshop on Large AI Systems and Models with Privacy and Security Analysis, pages 11–18, 2025.
- [46] Xuancun Lu, Jiaxiang Chen, Shilin Xiao, Zizhi Jin, Zhangrui Chen, Hanwen Yu, Bohan Qian, Ruochen Zhou, Xiaoyu Ji, and Wenyuan Xu. Phantom menace: Exploring and enhancing the robustness of VLA models against physical sensor attacks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 35689–35697, 2026.
- [47] Fawad Mehboob, Monijesu James, Amir Habel, Jeffrin Sam, Miguel Altamirano Cabrera, and Dzmitry Tsetserukou. DroneVLA: VLA based aerial manipulation. arXiv preprint arXiv:2601.13809, 2026.
- [48] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- [49] Abby O'Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open X-Embodiment: Robotic learning datasets and RT-X models. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024.
- [50] Zezhong Qian, Xiaowei Chi, Yuming Li, Shizun Wang, Zhiyuan Qin, Xiaozhu Ju, Sirui Han, and Shanghang Zhang. WristWorld: Generating wrist-views via 4D world models for robotic manipulation. arXiv preprint arXiv:2510.07313, 2025.
- [51] Yansong Qu, Zilin Huang, Zihao Sheng, Jiancong Chen, Sikai Chen, and Samuel Labi. VL-Safe: Vision-language guided safety-aware reinforcement learning with world models for autonomous driving. arXiv preprint arXiv:2505.16377, 2025.
- [52] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- [53] Ravi Ranjan and Agoritsa Polyzou. VLA-Forget: Vision-language-action unlearning for embodied foundation models. arXiv preprint arXiv:2604.03956, 2026.
- [54] Amir Rasouli, Yangzheng Wu, Zhiyuan Li, Rui Heng Yang, Xuan Zhao, Charles Eret, and Sajjad Pakdamansavoji. How VLAs (really) work in open-world environments, 2026. https://arxiv.org/abs/2604.21192.
- [55] Zachary Ravichandran, Alexander Robey, Vijay Kumar, George J Pappas, and Hamed Hassani. Safety guardrails for LLM-enabled robots. IEEE Robotics and Automation Letters, 2026.
- [56] Alexander Robey, Zachary Ravichandran, Vijay Kumar, Hamed Hassani, and George J Pappas. Jailbreaking LLM-controlled robots. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 11948–11956. IEEE, 2025.
- [57] Yanchi Ru, Zhengyue Zhao, Yingzi Ma, Xiaogeng Liu, and Chaowei Xiao. VLA-Risk: Benchmarking vision-language-action models with physical robustness, 2026. https://openreview.net/forum?id=31EjDFwFEe.
- [58] Kristy Sakano, Jianyu An, Dinesh Manocha, and Huan Xu. Safe-SMART: Safety analysis and formal evaluation using STL metrics for autonomous robots. arXiv preprint arXiv:2511.17781, 2025.
- [59] Haebin Seong, Sungmin Kim, Minchan Kim, Yongjun Cho, Myunchul Joe, Suhwan Choi, Jaeyoon Jung, Jiyong Youn, Yoonshik Kim, Samwoo Seong, et al. CostNav: A navigation benchmark for cost-aware evaluation of embodied agents. arXiv preprint arXiv:2511.20216, 2025.
- [60] Pierre Sermanet, Anirudha Majumdar, Alex Irpan, Dmitry Kalashnikov, and Vikas Sindhwani. Generating robot constitutions & benchmarks for semantic safety. arXiv preprint arXiv:2503.08663, 2025.
- [61] Daeun Song, Jing Liang, Amirreza Payandeh, Amir Hossain Raj, Xuesu Xiao, and Dinesh Manocha. VLM-Social-Nav: Socially aware robot navigation through scoring using vision-language models. IEEE Robotics and Automation Letters, 10(1):508–515, 2024.
- [62] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
- [63] Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.
- [64] Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213, 2024.
- [65] Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Yang Wang, Zhiyong Zhao, Kun Zhan, Peng Jia, Xianpeng Lang, and Hang Zhao. DriveVLM: The convergence of autonomous driving and large vision-language models. arXiv preprint arXiv:2402.12289, 2024.
- [66] Maximilian Tölle, Theo Gruner, Daniel Palenicek, Tim Schneider, Jonas Günster, Joe Watson, Davide Tateo, Puze Liu, and Jan Peters. Towards safe robot foundation models using inductive biases. arXiv preprint arXiv:2505.10219, 2025.
- [67] Pablo Valle, Chengjie Lu, Shaukat Ali, and Aitor Arrieta. Evaluating uncertainty and quality of visual language action-enabled robots. arXiv preprint arXiv:2507.17049, 2025.
- [68] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- [69] Akifumi Wachi, Xun Shen, and Yanan Sui. A survey of constraint formulations in safe reinforcement learning. arXiv preprint arXiv:2402.02025, 2024.
- [70] Guankun Wang, Long Bai, Wan Jun Nah, Jie Wang, Zhaoxi Zhang, Zhen Chen, Jinlin Wu, Mobarakol Islam, Hongbin Liu, and Hongliang Ren. Surgical-LVLM: Learning to adapt large vision-language model for grounded visual question answering in robotic surgery. arXiv preprint arXiv:2405.10948, 2024.
- [71] Le Wang, Zonghao Ying, Xiao Yang, Quanchen Zou, Zhenfei Yin, Tianlin Li, Jian Yang, Yaodong Yang, Aishan Liu, and Xianglong Liu. RoboSafe: Safeguarding embodied agents via executable safety logic, 2025.
- [72] Meng Wang, Yohei Hayamizu, Matthew Tang, Kevin Gopalan, Shiqi Zhang, and Ping Yang. Physical attacks on robot navigation systems. In RSS 2025 Workshop on Reliable Robotics: Safety and Security in the Face of Generative AI, 2025. https://openreview.net/forum?id=A4AWclA4aC.
- [73] Taowen Wang, Cheng Han, James Liang, Wenhao Yang, Dongfang Liu, Luna Xinyu Zhang, Qifan Wang, Jiebo Luo, and Ruixiang Tang. Exploring the adversarial vulnerabilities of vision-language-action models in robotics. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6948–6958, 2025.
- [74] Xin Wang, Jie Li, Zejia Weng, Yixu Wang, Yifeng Gao, Tianyu Pang, Chao Du, Yan Teng, Yingchun Wang, Zuxuan Wu, et al. FreezeVLA: Action-freezing attacks against vision-language-action models. arXiv preprint arXiv:2509.19870, 2025.
- [75] Zhijie Wang, Zhehua Zhou, Jiayang Song, Yuheng Huang, Zhan Shu, and Lei Ma. VLATest: Testing and evaluating vision-language-action models for robotic manipulation. Proceedings of the ACM on Software Engineering, 2(FSE):1615–1638, 2025.
- [76] Wenke Xia, Yichu Yang, Hongtao Wu, Xiao Ma, Tao Kong, and Di Hu. Human-assisted robotic policy refinement via action preference optimization, 2025.
- [77] Tian-Yu Xiang, Ao-Qun Jin, Xiao-Hu Zhou, Mei-Jiang Gui, Xiao-Liang Xie, Shi-Qi Liu, Shuang-Yi Wang, Sheng-Bin Duan, Fu-Chao Xie, Wen-Kai Wang, et al. Parallels between VLA model post-training and human motor learning: Progress, challenges, and trends. arXiv preprint arXiv:2506.20966, 2025.
- [78] Bingxin Xu, Yuzhang Shang, Binghui Wang, and Emilio Ferrara. SilentDrift: Exploiting action chunking for stealthy backdoor attacks on vision-language-action models, 2026.
- [79] Siyu Xu, Zijian Wang, Yunke Wang, Chenghao Xia, Tao Huang, and Chang Xu. Affordance field intervention: Enabling VLAs to escape memory traps in robotic manipulation. arXiv preprint arXiv:2512.07472, 2025.
- [80] Zonghuan Xu, Jiayu Li, Yunhan Zhao, Xiang Zheng, Xingjun Ma, and Yu-Gang Jiang. DropVLA: An action-level backdoor attack on vision-language-action models, 2026.
discussion (0)