ForesightSafety-VLA: A Unified Diagnostic Safety Benchmark for Vision-Language-Action Models
Pith reviewed 2026-06-26 05:07 UTC · model grok-4.3
The pith
Vision-language-action models incur non-trivial safety costs and unsafe successes, with structure and visual variation degrading safety more than language changes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ForesightSafety-VLA shows that embodied safety in VLA models is tightly coupled to perception, grounding, and control competence rather than reducible to post-hoc safety filtering, as even the strongest evaluated policies produce non-trivial safety cost and unsafe nominal success, while structure and visual variation induce substantially stronger safety degradation than ordinary language variation.
What carries the argument
The 13-category safety taxonomy (Safe-Core, Safe-Lang, Safe-Vis) combined with process-level metrics of cumulative safety cost and risk exposure time, applied across three controlled variation dimensions on 66 scenarios in five embodiments.
If this is right
- Safety evaluation for VLA models must track process-level risk metrics in addition to binary task success.
- Diagnostic benchmarks should separate effects of scene structure, language, and visual changes to isolate failure sources.
- Embodied safety improvements require advances in perception and control rather than relying solely on output filtering.
- Benchmarks that decompose safe versus unsafe success and failure provide clearer signals than aggregate scores alone.
- Variation in visual observations and scene structure should be prioritized in safety testing over language variation alone.
Where Pith is reading between the lines
- Models that improve visual grounding accuracy may show measurable reductions in the observed safety degradation under visual variation.
- Extending the benchmark to real-world robot deployments could test whether simulation-based safety costs predict physical-world outcomes.
- The taxonomy could be applied to non-VLA robot policies to check whether the coupling of safety to perception holds more broadly.
- Training regimes that explicitly penalize risk exposure time during learning might lower the safety costs reported here.
Load-bearing premise
The 66 safety-augmented scenarios instantiated in RoboTwin across five embodiments, together with the 13-category taxonomy, are representative enough to diagnose general safety limitations in VLA models.
What would settle it
A new VLA policy that records near-zero cumulative safety cost and zero unsafe nominal successes across all three variation dimensions on the 66 scenarios would directly challenge the reported non-trivial safety costs in the strongest policies.
Figures
read the original abstract
In embodied intelligence, safety is a prerequisite for reliable robot deployment in the physical world. Current vision-language-action (VLA) models continue to advance toward general-purpose task capability, yet their embodied safety limits remain poorly understood. To address this gap, we introduce ForesightSafety-VLA, a diagnostic benchmark that makes safety the primary evaluation target for VLA systems. We define a 13-category safety taxonomy covering physical interaction safety (Safe-Core), instruction-side safety (Safe-Lang), and perception-side safety (Safe-Vis), and evaluate policies under three controlled dimensions of variation -- scene structure, language command, and visual observation -- so that failure sources can be diagnosed rather than hidden in a single aggregate score. Beyond binary task success, ForesightSafety-VLA measures process-level risk through cumulative safety cost (CC) and risk exposure time (RET), together with a four-quadrant decomposition of safe/unsafe success and failure. We instantiate 66 safety-augmented base scenarios in RoboTwin across 5 embodiments and report results on representative VLA baselines. Across the evaluated baselines, even the strongest policy incurs non-trivial safety cost and unsafe nominal success, while structure and visual variation induce substantially stronger safety degradation than ordinary language variation. These results suggest that embodied safety is tightly coupled to perception, grounding, and control competence rather than being reducible to post-hoc safety filtering alone.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ForesightSafety-VLA, a diagnostic benchmark for safety in vision-language-action (VLA) models. It defines a 13-category taxonomy spanning Safe-Core (physical interaction), Safe-Lang (instruction-side), and Safe-Vis (perception-side) safety. The benchmark evaluates policies on 66 safety-augmented scenarios instantiated in RoboTwin across 5 embodiments, under controlled variations in scene structure, language, and visual observation. It reports process-level metrics including cumulative safety cost (CC), risk exposure time (RET), and a four-quadrant safe/unsafe success/failure decomposition. Results on representative baselines indicate that even the strongest policy shows non-trivial safety cost and unsafe nominal success, with structure and visual variation causing substantially stronger degradation than language variation, implying embodied safety is coupled to perception, grounding, and control rather than reducible to post-hoc filtering.
Significance. If the benchmark construction and observed patterns hold under broader validation, the work supplies a needed diagnostic framework that shifts safety evaluation from aggregate success to source-specific risk measurement. The controlled variation axes and multi-metric reporting (CC, RET, quadrant decomposition) enable targeted diagnosis of failure modes in VLA systems, which is directly relevant to reliable physical deployment. The explicit separation of Safe-Core/Lang/Vis categories and the finding that perception/control competence dominates safety outcomes provide concrete guidance for model design beyond filtering approaches.
major comments (2)
- [Benchmark construction and scenario selection] The instantiation of the 66 safety-augmented base scenarios (described in the benchmark construction section) provides no coverage argument, diversity metric, or sensitivity analysis demonstrating that the chosen scenarios and 5 embodiments are representative of general VLA safety limits. This is load-bearing for the central claim that structure and visual variation induce substantially stronger degradation than language variation, because the differential patterns could be artifacts of simulator-specific couplings or under-sampling of perception-grounding-control interactions.
- [Evaluation and results reporting] The results reporting (including the abstract and evaluation sections) states outcomes on baselines but supplies no details on exact evaluation protocols, number of trials per scenario, statistical significance testing, or baseline selection/implementation criteria. Without these, the support for the reported degradation patterns and the interpretive claim that safety is not reducible to post-hoc filtering cannot be verified.
minor comments (2)
- [Abstract] The abstract refers to 'representative VLA baselines' without naming the models or their selection rationale; adding this would improve immediate readability.
- [Metrics definition] Notation for the four-quadrant decomposition and the definitions of CC and RET could be introduced with a small table or explicit equations in the metrics section to aid quick reference.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on ForesightSafety-VLA. The two major comments highlight areas where additional justification and protocol details will improve clarity and verifiability. We will revise the manuscript accordingly to strengthen the presentation of benchmark construction and evaluation procedures while preserving the core diagnostic contributions.
read point-by-point responses
-
Referee: [Benchmark construction and scenario selection] The instantiation of the 66 safety-augmented base scenarios (described in the benchmark construction section) provides no coverage argument, diversity metric, or sensitivity analysis demonstrating that the chosen scenarios and 5 embodiments are representative of general VLA safety limits. This is load-bearing for the central claim that structure and visual variation induce substantially stronger degradation than language variation, because the differential patterns could be artifacts of simulator-specific couplings or under-sampling of perception-grounding-control interactions.
Authors: We agree that an explicit coverage argument and sensitivity analysis are currently absent and would strengthen the manuscript. The 66 scenarios were derived by mapping each of the 13 taxonomy categories to concrete RoboTwin instantiations across the five embodiments, with at least one base scenario per category and controlled axis variations. In revision we will add a dedicated subsection with a coverage table, selection rationale (prioritizing common physical, linguistic, and perceptual failure modes reported in prior VLA literature), and a sensitivity study that subsamples scenarios and re-computes the structure-vs-language and vision-vs-language degradation gaps. This will directly support the reported patterns. We note that claims of full representativeness across all possible real-world VLA deployments would require multi-simulator validation, which we position as future work rather than a current claim. revision: yes
-
Referee: [Evaluation and results reporting] The results reporting (including the abstract and evaluation sections) states outcomes on baselines but supplies no details on exact evaluation protocols, number of trials per scenario, statistical significance testing, or baseline selection/implementation criteria. Without these, the support for the reported degradation patterns and the interpretive claim that safety is not reducible to post-hoc filtering cannot be verified.
Authors: We acknowledge that the current manuscript omits these protocol details. Evaluations used five independent rollouts per scenario (with fixed random seeds for reproducibility across baselines), and statistical significance of metric differences (CC, RET) was assessed via paired t-tests (p < 0.05). Baselines were chosen as the leading publicly released VLA models with available checkpoints and code at submission time; implementation followed the original papers' recommended inference settings. In the revision we will expand the evaluation section with a complete protocol description, trial counts, significance tables, and explicit baseline selection criteria. These additions will allow direct verification of the degradation patterns and the interpretation that safety outcomes depend on perception-grounding-control competence. revision: yes
Circularity Check
No significant circularity; empirical benchmark with direct reporting.
full rationale
The paper defines a 13-category taxonomy, instantiates 66 scenarios in RoboTwin, and reports measured safety costs, risk exposure, and degradation patterns under controlled variations. No equations, fitted parameters, predictions, or self-citation chains appear in the provided text. All load-bearing claims rest on explicit benchmark construction and observed outcomes rather than reducing to inputs by definition or prior self-work. This is the standard non-circular case for benchmark papers.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The 13-category safety taxonomy (Safe-Core, Safe-Lang, Safe-Vis) covers the primary safety concerns for VLA models in embodied settings.
Reference graph
Works this paper leans on
-
[1]
Rt-1: Robotics transformer for real-world control at scale,
A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsuet al., “Rt-1: Robotics transformer for real-world control at scale,”arXiv preprint arXiv:2212.06817, 2022
Pith/arXiv arXiv 2022
-
[2]
Rt-2: Vision-language-action models transfer web knowledge to robotic control,
B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahidet al., “Rt-2: Vision-language-action models transfer web knowledge to robotic control,” inConference on Robot Learning. PMLR, 2023, pp. 2165–2183
2023
-
[3]
Octo: An open-source generalist robot policy,
O. M. Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, T. Kreiman, C. Xuet al., “Octo: An open-source generalist robot policy,”arXiv preprint arXiv:2405.12213, 2024
Pith/arXiv arXiv 2024
-
[4]
Openvla: An open- source vision-language-action model,
M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketiet al., “Openvla: An open- source vision-language-action model,”arXiv preprint arXiv:2406.09246, 2024
Pith/arXiv arXiv 2024
-
[5]
Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collab- oration 0,
A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jainet al., “Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collab- oration 0,” in2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 6892–6903
2024
-
[6]
Mobile aloha: Learning bimanual mobile manipulation with low-cost whole-body teleoperation,
Z. Fu, T. Z. Zhao, and C. Finn, “Mobile aloha: Learning bimanual mobile manipulation with low-cost whole-body teleoperation,”arXiv preprint arXiv:2401.02117, 2024
Pith/arXiv arXiv 2024
-
[7]
π0: A vision-language-action flow model for general robot control,
K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichteret al., “ π0: A vision-language-action flow model for general robot control,”arXiv preprint arXiv:2410.24164, 2024
Pith/arXiv arXiv 2024
-
[8]
π0. 5: A vision-language- action model with open-world generalization. arxiv 2025,
P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusaiet al., “ π0. 5: A vision-language- action model with open-world generalization. arxiv 2025,”arXiv preprint arXiv:2504.16054, 2025
Pith/arXiv arXiv 2025
-
[9]
Rlbench: The robot learning benchmark and learning environment. ieee robotics and automation letters 5, 2 (2020), 3019–3026,
S. James, Z. Ma, D. R. Arrojo, and A. J. Davison, “Rlbench: The robot learning benchmark and learning environment. ieee robotics and automation letters 5, 2 (2020), 3019–3026,” 2020
2020
-
[10]
Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks,
O. Mees, L. Hermann, E. Rosete-Beas, and W. Burgard, “Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks,”IEEE Robotics and Automation Letters, vol. 7, no. 3, pp. 7327–7334, 2022
2022
-
[11]
Maniskill2: A unified benchmark for generalizable manipulation skills,
J. Gu, F. Xiang, X. Li, Z. Ling, X. Liu, T. Mu, Y . Tang, S. Tao, X. Wei, Y . Yaoet al., “Maniskill2: A unified benchmark for generalizable manipulation skills,”arXiv preprint arXiv:2302.04659, 2023
arXiv 2023
-
[12]
Robotwin: Dual-arm robot benchmark with generative digital twins,
Y . Mu, T. Chen, Z. Chen, S. Peng, Z. Lan, Z. Gao, Z. Liang, Q. Yu, Y . Zou, M. Xuet al., “Robotwin: Dual-arm robot benchmark with generative digital twins,” inProceedings of the computer vision and pattern recognition conference, 2025, pp. 27 649–27 660
2025
-
[13]
Evaluating real-world robot manipulation policies in simulation,
X. Li, K. Hsu, J. Gu, K. Pertsch, O. Mees, H. R. Walke, C. Fu, I. Lunawat, I. Sieh, S. Kirmaniet al., “Evaluating real-world robot manipulation policies in simulation,”arXiv preprint arXiv:2405.05941, 2024
Pith/arXiv arXiv 2024
-
[14]
Vla-arena: An open-source framework for benchmarking vision-language-action models,
B. Zhang, J. Li, J. Shen, Y . Cai, Y . Zhang, Y . Chen, J. Dai, J. Ji, and Y . Yang, “Vla-arena: An open-source framework for benchmarking vision-language-action models,”arXiv preprint arXiv:2512.22539, 2025
Pith/arXiv arXiv 2025
-
[15]
Generative image as action models,
M. Shridhar, Y . L. Lo, and S. James, “Generative image as action models,” arXiv preprint arXiv:2407.07875, 2024
arXiv 2024
-
[16]
Ro- bustnav: Towards benchmarking robustness in embodied navigation,
P. Chattopadhyay, J. Hoffman, R. Mottaghi, and A. Kembhavi, “Ro- bustnav: Towards benchmarking robustness in embodied navigation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 15 691–15 700
2021
-
[17]
T. B. Brown, D. Man ´e, A. Roy, M. Abadi, and J. Gilmer, “Adversarial patch,”arXiv preprint arXiv:1712.09665, 2017
Pith/arXiv arXiv 2017
-
[18]
Prompt injection attack against llm-integrated applications,
Y . Liu, G. Deng, Y . Li, K. Wang, Z. Wang, X. Wang, T. Zhang, Y . Liu, H. Wang, Y . Zhenget al., “Prompt injection attack against llm-integrated applications,”arXiv preprint arXiv:2306.05499, 2023
Pith/arXiv arXiv 2023
-
[19]
Safebench: A benchmarking platform for safety evaluation of autonomous vehicles,
C. Xu, W. Ding, W. Lyu, Z. Liu, S. Wang, Y . He, H. Hu, D. Zhao, and B. Li, “Safebench: A benchmarking platform for safety evaluation of autonomous vehicles,”Advances in Neural Information Processing Systems, vol. 35, pp. 25 667–25 682, 2022
2022
-
[20]
Benchmarking safe exploration in deep reinforcement learning,
A. Ray, J. Achiam, and D. Amodei, “Benchmarking safe exploration in deep reinforcement learning,”arXiv preprint arXiv:1910.01708, vol. 7, no. 1, p. 2, 2019
Pith/arXiv arXiv 1910
-
[21]
Altman,Constrained Markov decision processes
E. Altman,Constrained Markov decision processes. Routledge, 2021
2021
-
[22]
Safety gymnasium: A unified safe reinforcement learning benchmark,
J. Ji, B. Zhang, J. Zhou, X. Pan, W. Huang, R. Sun, Y . Geng, Y . Zhong, J. Dai, and Y . Yang, “Safety gymnasium: A unified safe reinforcement learning benchmark,”Advances in Neural Information Processing Systems, vol. 36, pp. 18 964–18 993, 2023
2023
-
[23]
Safe-control-gym: A unified benchmark suite for safe learning-based control and reinforcement learning in robotics,
Z. Yuan, A. W. Hall, S. Zhou, L. Brunke, M. Greeff, J. Panerati, and A. P. Schoellig, “Safe-control-gym: A unified benchmark suite for safe learning-based control and reinforcement learning in robotics,”IEEE Robotics and Automation Letters, vol. 7, no. 4, pp. 11 142–11 149, 2022
2022
-
[24]
Concrete problems in ai safety,
D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Man ´e, “Concrete problems in ai safety,”arXiv preprint arXiv:1606.06565, 2016
Pith/arXiv arXiv 2016
-
[25]
Libero: Benchmarking knowledge transfer for lifelong robot learning,
B. Liu, Y . Zhu, C. Gao, Y . Feng, Q. Liu, Y . Zhu, and P. Stone, “Libero: Benchmarking knowledge transfer for lifelong robot learning,”Advances in Neural Information Processing Systems, vol. 36, pp. 44 776–44 791, 2023
2023
-
[26]
Control barrier function based quadratic programs for safety critical systems,
A. D. Ames, X. Xu, J. W. Grizzle, and P. Tabuada, “Control barrier function based quadratic programs for safety critical systems,”IEEE Transactions on Automatic Control, vol. 62, no. 8, pp. 3861–3876, 2016
2016
-
[27]
A comprehensive survey on safe reinforce- ment learning,
J. Garcıa and F. Fern ´andez, “A comprehensive survey on safe reinforce- ment learning,”Journal of Machine Learning Research, vol. 16, no. 1, pp. 1437–1480, 2015
2015
-
[28]
Safe learning in robotics: From learning-based control to safe reinforcement learning,
L. Brunke, M. Greeff, A. W. Hall, Z. Yuan, S. Zhou, J. Panerati, and A. P. Schoellig, “Safe learning in robotics: From learning-based control to safe reinforcement learning,”Annual Review of Control, Robotics, and Autonomous Systems, vol. 5, no. 1, pp. 411–444, 2022
2022
-
[29]
A survey of methods for safe human-robot interaction,
P. A. Lasota, T. Fong, and J. A. Shah, “A survey of methods for safe human-robot interaction,”Foundations and Trends® in Robotics, vol. 5, no. 4, pp. 261–349, 2017
2017
-
[30]
Safety of embodied navigation: A survey,
Z. Wang, J. Hu, and R. Mu, “Safety of embodied navigation: A survey,” arXiv preprint arXiv:2508.05855, 2025
arXiv 2025
-
[31]
Generating robot constitutions & benchmarks for semantic safety,
P. Sermanet, A. Majumdar, A. Irpan, D. Kalashnikov, and V . Sindhwani, “Generating robot constitutions & benchmarks for semantic safety,”arXiv preprint arXiv:2503.08663, 2025
arXiv 2025
-
[32]
Ensuring force safety in vision- guided robotic manipulation via implicit tactile calibration,
L. Wei, J. Ma, Y . Hu, and R. Zhang, “Ensuring force safety in vision- guided robotic manipulation via implicit tactile calibration,”arXiv preprint arXiv:2412.10349, 2024
arXiv 2024
-
[33]
Towards safe robot foundation models,
T. Gruner, D. Palenicek, P. Liu, J. Watson, D. Tateo, J. Peterset al., “Towards safe robot foundation models,”arXiv e-prints, pp. arXiv–2503, 2025
2025
-
[34]
Sapien: A simulated part-based interactive environment,
F. Xiang, Y . Qin, K. Mo, Y . Xia, H. Zhu, F. Liu, M. Liu, H. Jiang, Y . Yuan, H. Wang, L. Yi, A. X. Chang, L. J. Guibas, and H. Su, “Sapien: A simulated part-based interactive environment,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2020, pp. 11 097–11 107. [Online]. Available: https://openaccess.the...
2020
-
[35]
Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation,
T. Chen, Z. Chen, B. Chen, Z. Cai, Y . Liu, Z. Li, Q. Liang, X. Lin, Y . Ge, Z. Gu, W. Deng, Y . Guo, T. Nian, X. Xie, Q. Chen, K. Su, T. Xu, G. Liu, M. Hu, H. ang Gao, K. Wang, Z. Liang, Y . Qin, X. Yang, P. Luo, and Y . Mu, “Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation,”
-
[36]
Available: https://arxiv.org/abs/2506.18088
[Online]. Available: https://arxiv.org/abs/2506.18088
-
[37]
Rdt-1b: a diffusion foundation model for bimanual manipulation,
S. Liu, L. Wu, B. Li, H. Tan, H. Chen, Z. Wang, K. Xu, H. Su, and J. Zhu, “Rdt-1b: a diffusion foundation model for bimanual manipulation,” arXiv preprint arXiv:2410.07864, 2024
Pith/arXiv arXiv 2024
-
[38]
Diffusion policy: Visuomotor policy learning via action diffusion,
C. Chi, Z. Xu, S. Feng, E. Cousineau, Y . Du, B. Burchfiel, R. Tedrake, and S. Song, “Diffusion policy: Visuomotor policy learning via action diffusion,”The International Journal of Robotics Research, vol. 44, no. 10-11, pp. 1684–1704, 2025
2025
-
[39]
Learning fine- grained bimanual manipulation with low-cost hardware,
T. Z. Zhao, V . Kumar, S. Levine, and C. Finn, “Learning fine- grained bimanual manipulation with low-cost hardware,”arXiv preprint arXiv:2304.13705, 2023
Pith/arXiv arXiv 2023
-
[40]
Dexvla: Vision- language model with plug-in diffusion expert for general robot control,
J. Wen, Y . Zhu, J. Li, Z. Tang, C. Shen, and F. Feng, “Dexvla: Vision- language model with plug-in diffusion expert for general robot control,” arXiv preprint arXiv:2502.05855, 2025
Pith/arXiv arXiv 2025
-
[41]
Visual instruction tuning,
H. Liu, C. Li, Q. Wu, and Y . J. Lee, “Visual instruction tuning,”Advances in neural information processing systems, vol. 36, pp. 34 892–34 916, 2023
2023
-
[42]
Tinyvla: Towards fast, data-efficient vision-language- action models for robotic manipulation,
J. Wen, Y . Zhu, J. Li, M. Zhu, Z. Tang, K. Wu, Z. Xu, N. Liu, R. Cheng, C. Shenet al., “Tinyvla: Towards fast, data-efficient vision-language- action models for robotic manipulation,”IEEE Robotics and Automation Letters, 2025
2025
-
[43]
3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations,
Y . Ze, G. Zhang, K. Zhang, C. Hu, M. Wang, and H. Xu, “3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations,”arXiv preprint arXiv:2403.03954, 2024
Pith/arXiv arXiv 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.