ForesightSafety-VLA: A Unified Diagnostic Safety Benchmark for Vision-Language-Action Models
Pith reviewed 2026-06-30 09:49 UTC · model grok-4.3
The pith
Even the strongest vision-language-action policies incur non-trivial safety costs and unsafe successes, with scene structure and visual variation causing more degradation than language changes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ForesightSafety-VLA instantiates 66 safety-augmented base scenarios in RoboTwin across five embodiments and evaluates VLA baselines under three variation dimensions. Even the strongest policy shows non-trivial safety cost and unsafe nominal success. Structure and visual variation produce substantially stronger safety degradation than ordinary language variation, indicating that embodied safety is coupled to perception, grounding, and control competence rather than addressable by post-hoc filtering alone.
What carries the argument
The 13-category safety taxonomy (Safe-Core for physical interactions, Safe-Lang for instructions, Safe-Vis for perception) paired with controlled variations in scene structure, language command, and visual observation, plus metrics of cumulative safety cost (CC), risk exposure time (RET), and four-quadrant decomposition of safe/unsafe success and failure.
If this is right
- Safety in VLA systems requires integration into perception, grounding, and control rather than reliance on separate filtering steps.
- Diagnostic evaluation across multiple variation dimensions is required to isolate whether failures originate in scene structure, visuals, or commands.
- Binary task success alone is insufficient; process-level measures such as cumulative safety cost and risk exposure time must be reported.
- Stronger degradation from structure and visual changes implies that improvements in visual grounding will have larger safety impact than language-only adjustments.
- Claims about VLA safety limits depend on testing across multiple embodiments and controlled scenario variations.
Where Pith is reading between the lines
- Training or fine-tuning VLA models directly on the benchmark's safety metrics could produce policies with lower risk exposure in unstructured settings.
- Extending the scenarios beyond simulation to physical robot trials would test whether the observed safety patterns hold outside RoboTwin.
- The greater effect of visual and structural variation suggests that safety research in embodied AI should prioritize perception robustness over command understanding.
- Common failure patterns identified here could inform unified safety standards across different robot platforms and task domains.
Load-bearing premise
The 66 safety-augmented scenarios across five embodiments in RoboTwin are representative enough of real-world physical interaction risks to support general claims about VLA safety limits.
What would settle it
A VLA policy achieving zero cumulative safety cost and exclusively safe successes across all structure, language, and visual variations on the 66 scenarios would falsify the claim of persistent non-trivial safety issues.
Figures
read the original abstract
In embodied intelligence, safety is a prerequisite for reliable robot deployment in the physical world. Current vision-language-action (VLA) models continue to advance toward general-purpose task capability, yet their embodied safety limits remain poorly understood. To address this gap, we introduce ForesightSafety-VLA, a diagnostic benchmark that makes safety the primary evaluation target for VLA systems. We define a 13-category safety taxonomy covering physical interaction safety (Safe-Core), instruction-side safety (Safe-Lang), and perception-side safety (Safe-Vis), and evaluate policies under three controlled dimensions of variation -- scene structure, language command, and visual observation -- so that failure sources can be diagnosed rather than hidden in a single aggregate score. Beyond binary task success, ForesightSafety-VLA measures process-level risk through cumulative safety cost (CC) and risk exposure time (RET), together with a four-quadrant decomposition of safe/unsafe success and failure. We instantiate 66 safety-augmented base scenarios in RoboTwin across 5 embodiments and report results on representative VLA baselines. Across the evaluated baselines, even the strongest policy incurs non-trivial safety cost and unsafe nominal success, while structure and visual variation induce substantially stronger safety degradation than ordinary language variation. These results suggest that embodied safety is tightly coupled to perception, grounding, and control competence rather than being reducible to post-hoc safety filtering alone.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ForesightSafety-VLA, a diagnostic benchmark for safety in vision-language-action (VLA) models. It defines a 13-category taxonomy spanning Safe-Core (physical interaction), Safe-Lang (instruction-side), and Safe-Vis (perception-side) safety, evaluates policies under controlled variations in scene structure, language commands, and visual observations, and employs process-level metrics including cumulative safety cost (CC), risk exposure time (RET), and a four-quadrant safe/unsafe success/failure decomposition. The benchmark is instantiated via 66 safety-augmented scenarios in RoboTwin across 5 embodiments; results on representative VLA baselines indicate non-trivial safety costs even for the strongest policies, with structure and visual variations causing substantially stronger degradation than language variation, suggesting safety is coupled to core perception/grounding/control competence rather than post-hoc filtering.
Significance. If the empirical results hold under broader validation, the work supplies a needed diagnostic framework that moves beyond aggregate success rates to isolate failure sources in embodied VLA systems. The controlled variation dimensions and multi-metric decomposition are strengths that could support reproducible safety auditing in the field.
major comments (2)
- [Abstract] Abstract: the claim that 'structure and visual variation induce substantially stronger safety degradation than ordinary language variation' and the broader suggestion that 'embodied safety is tightly coupled to perception, grounding, and control competence' rest on results from the 66 RoboTwin scenarios; however, the manuscript provides no coverage analysis, external validation against real-world incident data, or ablation of omitted failure modes (e.g., long-horizon dynamics or contact-rich manipulation), which is load-bearing for the generalizability of the ordering of degradation sources.
- [Abstract] Abstract (and instantiation paragraph): the 13-category taxonomy is presented as comprehensively covering relevant failure modes, yet no validation procedure, inter-rater agreement, or mapping to documented physical interaction risks is described; this directly affects the diagnostic reliability of the Safe-Core/Safe-Lang/Safe-Vis partition and the four-quadrant decomposition.
minor comments (1)
- The abstract refers to 'representative VLA baselines' without naming the specific models or reporting their individual CC/RET values; adding these details (presumably in §4 or the results tables) would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on generalizability and taxonomy validation. We respond to each major comment below, clarifying the benchmark's scope as a controlled diagnostic tool in simulation while committing to targeted revisions for transparency.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claim that 'structure and visual variation induce substantially stronger safety degradation than ordinary language variation' and the broader suggestion that 'embodied safety is tightly coupled to perception, grounding, and control competence' rest on results from the 66 RoboTwin scenarios; however, the manuscript provides no coverage analysis, external validation against real-world incident data, or ablation of omitted failure modes (e.g., long-horizon dynamics or contact-rich manipulation), which is load-bearing for the generalizability of the ordering of degradation sources.
Authors: The observed ordering of degradation sources is reported specifically for the 66 RoboTwin scenarios and the tested VLA baselines; the abstract frames this as a suggestive finding within the benchmark rather than a universal claim. We agree that coverage analysis, real-world incident mapping, and ablations of long-horizon or contact-rich modes are absent and would strengthen generalizability statements. We will add a limitations subsection explicitly discussing the simulation scope, the absence of external real-world validation, and potential omitted failure modes to prevent overgeneralization. revision: partial
-
Referee: [Abstract] Abstract (and instantiation paragraph): the 13-category taxonomy is presented as comprehensively covering relevant failure modes, yet no validation procedure, inter-rater agreement, or mapping to documented physical interaction risks is described; this directly affects the diagnostic reliability of the Safe-Core/Safe-Lang/Safe-Vis partition and the four-quadrant decomposition.
Authors: The taxonomy was derived by synthesizing categories from prior robotics safety literature and observed VLA failure patterns, with the Safe-Core/Safe-Lang/Safe-Vis partition intended to isolate distinct risk sources. We did not include a formal validation procedure or inter-rater study in the original manuscript. We will revise the instantiation section to add explicit references to source literature, example mappings to physical risks, and a brief rationale for the partition to improve transparency and diagnostic interpretability. revision: yes
Circularity Check
No circularity: empirical benchmark with direct measurements on external baselines
full rationale
The paper defines a 13-category safety taxonomy, instantiates 66 scenarios in RoboTwin, and reports process-level metrics (CC, RET, four-quadrant decomposition) on representative VLA baselines. All load-bearing claims are empirical observations from these evaluations rather than derivations, fitted predictions, or self-referential definitions. No equations, ansatzes, or uniqueness theorems are invoked; self-citations (if any) are not load-bearing for the central results. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- ad hoc to paper The 13 safety categories (Safe-Core, Safe-Lang, Safe-Vis) comprehensively cover the relevant failure modes for embodied VLA systems.
Reference graph
Works this paper leans on
-
[1]
RT-1: Robotics Transformer for Real-World Control at Scale
A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsuet al., “Rt-1: Robotics transformer for real-world control at scale,”arXiv preprint arXiv:2212.06817, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[2]
Rt-2: Vision-language-action models transfer web knowledge to robotic control,
B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahidet al., “Rt-2: Vision-language-action models transfer web knowledge to robotic control,” inConference on Robot Learning. PMLR, 2023, pp. 2165–2183
2023
-
[3]
Octo: An Open-Source Generalist Robot Policy
O. M. Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, T. Kreiman, C. Xuet al., “Octo: An open-source generalist robot policy,”arXiv preprint arXiv:2405.12213, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[4]
OpenVLA: An Open-Source Vision-Language-Action Model
M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketiet al., “Openvla: An open- source vision-language-action model,”arXiv preprint arXiv:2406.09246, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[5]
Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collab- oration 0,
A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jainet al., “Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collab- oration 0,” in2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 6892–6903
2024
-
[6]
Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation
Z. Fu, T. Z. Zhao, and C. Finn, “Mobile aloha: Learning bimanual mobile manipulation with low-cost whole-body teleoperation,”arXiv preprint arXiv:2401.02117, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[7]
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichteret al., “ π0: A vision-language-action flow model for general robot control,”arXiv preprint arXiv:2410.24164, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[8]
$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization
P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusaiet al., “ π0. 5: A vision-language- action model with open-world generalization. arxiv 2025,”arXiv preprint arXiv:2504.16054, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[9]
Rlbench: The robot learning benchmark and learning environment. ieee robotics and automation letters 5, 2 (2020), 3019–3026,
S. James, Z. Ma, D. R. Arrojo, and A. J. Davison, “Rlbench: The robot learning benchmark and learning environment. ieee robotics and automation letters 5, 2 (2020), 3019–3026,” 2020
2020
-
[10]
Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks,
O. Mees, L. Hermann, E. Rosete-Beas, and W. Burgard, “Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks,”IEEE Robotics and Automation Letters, vol. 7, no. 3, pp. 7327–7334, 2022
2022
-
[11]
Man- iSkill2: A unified benchmark for generalizable ma- nipulation skills, 2023
J. Gu, F. Xiang, X. Li, Z. Ling, X. Liu, T. Mu, Y . Tang, S. Tao, X. Wei, Y . Yaoet al., “Maniskill2: A unified benchmark for generalizable manipulation skills,”arXiv preprint arXiv:2302.04659, 2023
-
[12]
Robotwin: Dual-arm robot benchmark with generative digital twins,
Y . Mu, T. Chen, Z. Chen, S. Peng, Z. Lan, Z. Gao, Z. Liang, Q. Yu, Y . Zou, M. Xuet al., “Robotwin: Dual-arm robot benchmark with generative digital twins,” inProceedings of the computer vision and pattern recognition conference, 2025, pp. 27 649–27 660
2025
-
[13]
Evaluating Real-World Robot Manipulation Policies in Simulation
X. Li, K. Hsu, J. Gu, K. Pertsch, O. Mees, H. R. Walke, C. Fu, I. Lunawat, I. Sieh, S. Kirmaniet al., “Evaluating real-world robot manipulation policies in simulation,”arXiv preprint arXiv:2405.05941, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[14]
VLA-Arena: An Open-Source Framework for Benchmarking Vision-Language-Action Models
B. Zhang, J. Li, J. Shen, Y . Cai, Y . Zhang, Y . Chen, J. Dai, J. Ji, and Y . Yang, “Vla-arena: An open-source framework for benchmarking vision-language-action models,”arXiv preprint arXiv:2512.22539, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[15]
Generative image as action models,
M. Shridhar, Y . L. Lo, and S. James, “Generative image as action models,” arXiv preprint arXiv:2407.07875, 2024
-
[16]
Ro- bustnav: Towards benchmarking robustness in embodied navigation,
P. Chattopadhyay, J. Hoffman, R. Mottaghi, and A. Kembhavi, “Ro- bustnav: Towards benchmarking robustness in embodied navigation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 15 691–15 700
2021
-
[17]
T. B. Brown, D. Man ´e, A. Roy, M. Abadi, and J. Gilmer, “Adversarial patch,”arXiv preprint arXiv:1712.09665, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[18]
Prompt Injection attack against LLM-integrated Applications
Y . Liu, G. Deng, Y . Li, K. Wang, Z. Wang, X. Wang, T. Zhang, Y . Liu, H. Wang, Y . Zhenget al., “Prompt injection attack against llm-integrated applications,”arXiv preprint arXiv:2306.05499, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[19]
Safebench: A benchmarking platform for safety evaluation of autonomous vehicles,
C. Xu, W. Ding, W. Lyu, Z. Liu, S. Wang, Y . He, H. Hu, D. Zhao, and B. Li, “Safebench: A benchmarking platform for safety evaluation of autonomous vehicles,”Advances in Neural Information Processing Systems, vol. 35, pp. 25 667–25 682, 2022
2022
-
[20]
Benchmarking Batch Deep Reinforcement Learning Algorithms
A. Ray, J. Achiam, and D. Amodei, “Benchmarking safe exploration in deep reinforcement learning,”arXiv preprint arXiv:1910.01708, vol. 7, no. 1, p. 2, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1910
-
[21]
Altman,Constrained Markov decision processes
E. Altman,Constrained Markov decision processes. Routledge, 2021
2021
-
[22]
Safety gymnasium: A unified safe reinforcement learning benchmark,
J. Ji, B. Zhang, J. Zhou, X. Pan, W. Huang, R. Sun, Y . Geng, Y . Zhong, J. Dai, and Y . Yang, “Safety gymnasium: A unified safe reinforcement learning benchmark,”Advances in Neural Information Processing Systems, vol. 36, pp. 18 964–18 993, 2023
2023
-
[23]
Safe-control-gym: A unified benchmark suite for safe learning-based control and reinforcement learning in robotics,
Z. Yuan, A. W. Hall, S. Zhou, L. Brunke, M. Greeff, J. Panerati, and A. P. Schoellig, “Safe-control-gym: A unified benchmark suite for safe learning-based control and reinforcement learning in robotics,”IEEE Robotics and Automation Letters, vol. 7, no. 4, pp. 11 142–11 149, 2022
2022
-
[24]
Concrete Problems in AI Safety
D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Man ´e, “Concrete problems in ai safety,”arXiv preprint arXiv:1606.06565, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[25]
Libero: Benchmarking knowledge transfer for lifelong robot learning,
B. Liu, Y . Zhu, C. Gao, Y . Feng, Q. Liu, Y . Zhu, and P. Stone, “Libero: Benchmarking knowledge transfer for lifelong robot learning,”Advances in Neural Information Processing Systems, vol. 36, pp. 44 776–44 791, 2023
2023
-
[26]
Control barrier function based quadratic programs for safety critical systems,
A. D. Ames, X. Xu, J. W. Grizzle, and P. Tabuada, “Control barrier function based quadratic programs for safety critical systems,”IEEE Transactions on Automatic Control, vol. 62, no. 8, pp. 3861–3876, 2016
2016
-
[27]
A comprehensive survey on safe reinforce- ment learning,
J. Garcıa and F. Fern ´andez, “A comprehensive survey on safe reinforce- ment learning,”Journal of Machine Learning Research, vol. 16, no. 1, pp. 1437–1480, 2015
2015
-
[28]
Safe learning in robotics: From learning-based control to safe reinforcement learning,
L. Brunke, M. Greeff, A. W. Hall, Z. Yuan, S. Zhou, J. Panerati, and A. P. Schoellig, “Safe learning in robotics: From learning-based control to safe reinforcement learning,”Annual Review of Control, Robotics, and Autonomous Systems, vol. 5, no. 1, pp. 411–444, 2022
2022
-
[29]
A survey of methods for safe human-robot interaction,
P. A. Lasota, T. Fong, and J. A. Shah, “A survey of methods for safe human-robot interaction,”Foundations and Trends® in Robotics, vol. 5, no. 4, pp. 261–349, 2017
2017
-
[30]
Safety of embodied navigation: A survey,
Z. Wang, J. Hu, and R. Mu, “Safety of embodied navigation: A survey,” arXiv preprint arXiv:2508.05855, 2025
-
[31]
Generating robot constitutions & benchmarks for semantic safety,
P. Sermanet, A. Majumdar, A. Irpan, D. Kalashnikov, and V . Sindhwani, “Generating robot constitutions & benchmarks for semantic safety,”arXiv preprint arXiv:2503.08663, 2025
-
[32]
Ensuring force safety in vision- guided robotic manipulation via implicit tactile calibration,
L. Wei, J. Ma, Y . Hu, and R. Zhang, “Ensuring force safety in vision- guided robotic manipulation via implicit tactile calibration,”arXiv preprint arXiv:2412.10349, 2024
-
[33]
Towards safe robot foundation models,
T. Gruner, D. Palenicek, P. Liu, J. Watson, D. Tateo, J. Peterset al., “Towards safe robot foundation models,”arXiv e-prints, pp. arXiv–2503, 2025
2025
-
[34]
Sapien: A simulated part-based interactive environment,
F. Xiang, Y . Qin, K. Mo, Y . Xia, H. Zhu, F. Liu, M. Liu, H. Jiang, Y . Yuan, H. Wang, L. Yi, A. X. Chang, L. J. Guibas, and H. Su, “Sapien: A simulated part-based interactive environment,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2020, pp. 11 097–11 107. [Online]. Available: https://openaccess.the...
2020
-
[35]
Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation,
T. Chen, Z. Chen, B. Chen, Z. Cai, Y . Liu, Z. Li, Q. Liang, X. Lin, Y . Ge, Z. Gu, W. Deng, Y . Guo, T. Nian, X. Xie, Q. Chen, K. Su, T. Xu, G. Liu, M. Hu, H. ang Gao, K. Wang, Z. Liang, Y . Qin, X. Yang, P. Luo, and Y . Mu, “Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation,”
-
[36]
[Online]. Available: https://arxiv.org/abs/2506.18088
work page internal anchor Pith review Pith/arXiv arXiv
-
[37]
RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation
S. Liu, L. Wu, B. Li, H. Tan, H. Chen, Z. Wang, K. Xu, H. Su, and J. Zhu, “Rdt-1b: a diffusion foundation model for bimanual manipulation,” arXiv preprint arXiv:2410.07864, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[38]
Diffusion policy: Visuomotor policy learning via action diffusion,
C. Chi, Z. Xu, S. Feng, E. Cousineau, Y . Du, B. Burchfiel, R. Tedrake, and S. Song, “Diffusion policy: Visuomotor policy learning via action diffusion,”The International Journal of Robotics Research, vol. 44, no. 10-11, pp. 1684–1704, 2025
2025
-
[39]
Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware
T. Z. Zhao, V . Kumar, S. Levine, and C. Finn, “Learning fine- grained bimanual manipulation with low-cost hardware,”arXiv preprint arXiv:2304.13705, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[40]
DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control
J. Wen, Y . Zhu, J. Li, Z. Tang, C. Shen, and F. Feng, “Dexvla: Vision- language model with plug-in diffusion expert for general robot control,” arXiv preprint arXiv:2502.05855, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[41]
Visual instruction tuning,
H. Liu, C. Li, Q. Wu, and Y . J. Lee, “Visual instruction tuning,”Advances in neural information processing systems, vol. 36, pp. 34 892–34 916, 2023
2023
-
[42]
Tinyvla: Towards fast, data-efficient vision-language- action models for robotic manipulation,
J. Wen, Y . Zhu, J. Li, M. Zhu, Z. Tang, K. Wu, Z. Xu, N. Liu, R. Cheng, C. Shenet al., “Tinyvla: Towards fast, data-efficient vision-language- action models for robotic manipulation,”IEEE Robotics and Automation Letters, 2025
2025
-
[43]
3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations
Y . Ze, G. Zhang, K. Zhang, C. Hu, M. Wang, and H. Xu, “3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations,”arXiv preprint arXiv:2403.03954, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.