arxiv: 2606.23686 · v2 · pith:6AZFLEP2new · submitted 2026-06-22 · 💻 cs.RO

LIBERO-Safety: A Comprehensive Benchmark for Physical and Semantic Safety in Vision-Language-Action Models

Pith reviewed 2026-06-29 04:38 UTC · model grok-4.3

classification 💻 cs.RO
keywords Vision-Language-Action models · safety benchmark · embodied AI · robot manipulation · trajectory synthesis · semantic misalignment · data generation pipeline · domain randomization

The pith

LIBERO-Safety benchmark reveals that high-diversity training improves VLA trajectory safety but task success stays limited by sub-optimal synthesis and semantic misalignment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LIBERO-Safety, a parametric benchmark that procedurally generates safety-critical physical and semantic scenarios for vision-language-action models. It pairs this with a keypose-driven pipeline to produce 19,664 collision-free demonstrations that incorporate extensive domain randomization. Cross-paradigm tests of eight VLA models and two embodied foundation models expose a generalization-safety tension in which greater training diversity yields safer behaviors while overall task completion remains constrained by trajectory quality and semantic alignment problems. The work supplies both the infrastructure and the failure-mode observations needed to guide development of more reliable VLA systems.
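As a rough illustration of what "parametric" scenario generation with domain randomization can look like, the sketch below draws scenario descriptions from a seeded random generator. It is not the paper's implementation: the `Scenario` fields, the value ranges, and the function names are hypothetical, and only the physical/semantic split and the three difficulty levels echo wording that appears in this summary and in the Figure 2 caption.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Scenario:
    """One sampled scenario description (all fields are illustrative)."""
    category: str                      # "physical" or "semantic"
    difficulty: int                    # 1..3, assumed encoding of difficulty
    obstacle_xy: Tuple[float, float]   # hypothetical obstacle position (m)
    light_intensity: float             # hypothetical visual randomization
    camera_yaw_deg: float              # hypothetical camera perturbation


def sample_scenario(rng: random.Random) -> Scenario:
    """Draw one scenario from fixed, made-up parameter ranges."""
    return Scenario(
        category=rng.choice(["physical", "semantic"]),
        difficulty=rng.randint(1, 3),
        obstacle_xy=(rng.uniform(-0.3, 0.3), rng.uniform(-0.3, 0.3)),
        light_intensity=rng.uniform(0.5, 1.5),
        camera_yaw_deg=rng.uniform(-10.0, 10.0),
    )


def generate_scenarios(n: int, seed: int) -> List[Scenario]:
    """Generate n scenarios reproducibly from a single seed."""
    rng = random.Random(seed)
    return [sample_scenario(rng) for _ in range(n)]
```

Seeding a private `random.Random` instance, rather than the global generator, keeps a generated scenario set reproducible, which matters if the same scenarios are to be replayed across several models.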

Core claim

The central claim is that although high-diversity training fosters safer trajectories in VLA models, task success remains fundamentally bottlenecked by sub-optimal trajectory synthesis and semantic misalignment, established through systematic evaluation on the new parametric safety benchmark and the curated collision-free dataset.

What carries the argument

The keypose-driven data generation pipeline that produces strictly collision-free demonstrations covering stochastic physical and semantic safety scenarios with extensive domain randomization.
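The summary does not say how the pipeline turns keyposes into trajectories or how it checks for collisions. Purely as a toy picture of the idea, the sketch below linearly interpolates between 3D keyposes and rejects a path if any waypoint falls inside a spherical obstacle. The linear interpolation, the sphere obstacle model, and every name here are assumptions made for illustration, not the authors' method.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
Sphere = Tuple[Point, float]  # (center, radius), a simplified obstacle model


def interpolate(keyposes: Sequence[Point], steps_per_segment: int) -> List[Point]:
    """Linearly interpolate waypoints between consecutive keyposes."""
    if not keyposes:
        raise ValueError("at least one keypose is required")
    if steps_per_segment < 1:
        raise ValueError("steps_per_segment must be >= 1")
    path: List[Point] = [keyposes[0]]
    for a, b in zip(keyposes, keyposes[1:]):
        for i in range(1, steps_per_segment + 1):
            t = i / steps_per_segment
            path.append((
                a[0] + t * (b[0] - a[0]),
                a[1] + t * (b[1] - a[1]),
                a[2] + t * (b[2] - a[2]),
            ))
    return path


def is_collision_free(path: Sequence[Point],
                      obstacles: Sequence[Sphere],
                      margin: float = 0.0) -> bool:
    """True if every waypoint stays outside every sphere plus a margin."""
    return all(
        math.dist(p, center) > radius + margin
        for p in path
        for center, radius in obstacles
    )
```

A check of this kind only inspects the sampled waypoints, so a coarse `steps_per_segment` can miss a collision between two samples. That gap is one reason the referee's request for explicit verification of the "strictly collision-free" claim is reasonable.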

If this is right

  • High-diversity training produces safer trajectories in VLA models.
  • Task success rates are limited by the quality of trajectory synthesis.
  • Semantic misalignment constitutes a primary bottleneck for VLA performance under safety constraints.
  • The parametric benchmark and dataset enable consistent cross-paradigm safety evaluation of VLA and embodied foundation models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Methods that directly improve trajectory synthesis could relax the observed success bottleneck.
  • The same parametric generation approach could be reused to create safety benchmarks for additional robot morphologies or task families.
  • Integrating explicit semantic alignment objectives during training might reduce the misalignment failures identified in the evaluations.

Load-bearing premise

The keypose-driven data generation pipeline produces strictly collision-free demonstrations that comprehensively cover stochastic physical and semantic safety-critical scenarios without introducing systematic artifacts or biases.

What would settle it

An evaluation in which models trained on the high-diversity LIBERO-Safety data fail to produce measurably safer trajectories than low-diversity baselines, or in which trajectory synthesis and semantic misalignment do not account for the majority of task failures, would refute the reported tension.
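The two refutation conditions above can be written down as a simple check over per-episode evaluation records. The sketch below is a minimal version under stated assumptions: the `Episode` record, the regime labels, and the failure-cause labels are hypothetical, and "measurably safer" is reduced to a plain comparison of collision rates with no significance test.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Episode:
    regime: str                          # "high_diversity" or "low_diversity" (assumed labels)
    collided: bool
    succeeded: bool
    failure_cause: Optional[str] = None  # e.g. "trajectory", "semantic", "other"


def collision_rate(episodes: Sequence[Episode], regime: str) -> float:
    """Fraction of episodes in the given training regime that collided."""
    subset = [e for e in episodes if e.regime == regime]
    if not subset:
        raise ValueError(f"no episodes for regime {regime!r}")
    return sum(e.collided for e in subset) / len(subset)


def bottleneck_share(episodes: Sequence[Episode]) -> float:
    """Share of failed episodes attributed to trajectory or semantic causes."""
    failures = [e for e in episodes if not e.succeeded]
    if not failures:
        raise ValueError("no failed episodes to attribute")
    hits = sum(e.failure_cause in {"trajectory", "semantic"} for e in failures)
    return hits / len(failures)


def tension_refuted(episodes: Sequence[Episode]) -> bool:
    """True if either refutation condition from the text holds."""
    safer = (collision_rate(episodes, "high_diversity")
             < collision_rate(episodes, "low_diversity"))
    majority = bottleneck_share(episodes) > 0.5
    return not (safer and majority)
```

A real analysis would need confidence intervals on both rates and a documented protocol for assigning failure causes; this sketch only fixes the logical shape of the test.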

Figures

Figures reproduced from arXiv: 2606.23686 by Guocai Yao, Haohan Chi, Hao Zhao, Jiaolong Yang, Jinbang Guo, Jingrui Pang, Rongxu Cui, Saining Zhang, Shaoxuan Xie, Xianyuan Zhan, Xin Jin, Yao Mu, Ya-Qin Zhang, Zongzheng Zhang.

Figure 1: Real-world VLA deployment is severely bottlenecked by physical safety and semantic reasoning, constituting critical (a) VLA Safety Challenges. To systematically evaluate these challenges, we introduce a comprehensive VLA safety benchmark and develop an efficient (b) Data Generation Pipeline to synthesize 19.7K strictly collision-free demonstrations. By evaluating VLA models fine-tuned on this corpus along…
Figure 2: Overview of our VLA Safety Benchmark. (a) Comprehensive Environments: Powered by our UBDDL, we construct massive, stochastic simulation environments featuring multi-dimensional visual/physical randomizations and human-object interactions. (b) Hierarchical Safety Taxonomy: A systematic evaluation suite assessing five critical dimensions of physical and semantic safety, strictly scaled across 3 difficulty t…
Figure 3: Comparison of State Space Distributions.
Figure 4: Emergent Spatial Reasoning. High-diversity training enables the model to transition from (a) non-linear avoidance to (b) optimal trajectory synthesis in obstacle-free workspaces. Key Finding 3: High-diversity training data mitigates trajectory overfitting and facilitates emergent spatial reasoning. To investigate the trade-off between trajectory memorization and visual-spatial generalization, we condu…
Figure 6: Representative examples of (a) Instruction-Aligned Execution and (b) Semantic Misalignment. While the policy is capable of generating collision-free trajectories, perceptual errors in multi-object scenes can lead the end-effector toward incorrect targets. … yields a collision-free task incompletion, sacrificing the manipulation objective to kinematically sub-optimal planning. Key Finding 8: Semantic misalig…
Original abstract

Despite the impressive manipulation capabilities of Vision-Language-Action (VLA) models, their operational safety under strict constraints remains largely unverified. To address this, we introduce a parametric safety benchmark to procedurally generate safety-critical scenarios with comprehensive stochasticity. To overcome the scalability bottlenecks of human teleoperation, we develop a novel keypose-driven data generation pipeline. Leveraging this infrastructure, we curate a large-scale dataset of 19,664 strictly collision-free demonstrations with extensive domain randomization. We then conduct a systematic cross-paradigm evaluation of eight VLA and two embodied foundation models. Our analysis reveals a critical generalization-safety tension: although high-diversity training fosters safer trajectories, task success remains fundamentally bottlenecked by sub-optimal trajectory synthesis and semantic misalignment. By providing a scalable pipeline, a robust dataset, and profound failure-mode insights, LIBERO-Safety establishes a crucial foundation for developing safe and reliable VLA models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces LIBERO-Safety, a parametric benchmark for generating safety-critical scenarios for Vision-Language-Action (VLA) models. It proposes a keypose-driven data generation pipeline to curate a dataset of 19,664 strictly collision-free demonstrations with domain randomization, evaluates eight VLA models and two embodied foundation models, and reports a generalization-safety tension: high-diversity training yields safer trajectories, yet task success remains limited by sub-optimal trajectory synthesis and semantic misalignment.

Significance. If the central empirical claims hold, the work supplies a scalable pipeline and large dataset for safety evaluation in VLAs together with concrete failure-mode analysis across paradigms; these elements would constitute a useful foundation for subsequent safety research in the field.

major comments (1)
  1. [Abstract] Abstract (paragraph describing the pipeline and dataset curation): the assertion that the keypose-driven pipeline produces 'strictly collision-free demonstrations' that 'comprehensively cover stochastic physical and semantic safety-critical scenarios without introducing systematic artifacts or biases' is load-bearing for the reported generalization-safety tension, yet the manuscript provides no verification methods, quantitative error metrics, or controls for generation artifacts.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the abstract claims. We address the major comment below.

Point-by-point responses
  1. Referee: [Abstract] Abstract (paragraph describing the pipeline and dataset curation): the assertion that the keypose-driven pipeline produces 'strictly collision-free demonstrations' that 'comprehensively cover stochastic physical and semantic safety-critical scenarios without introducing systematic artifacts or biases' is load-bearing for the reported generalization-safety tension, yet the manuscript provides no verification methods, quantitative error metrics, or controls for generation artifacts.

    Authors: We agree that the abstract's claims about the pipeline require explicit supporting verification to substantiate the generalization-safety tension. The manuscript describes the keypose-driven pipeline (Section 3) as using simulation-based collision checking and domain randomization to produce the 19,664 demonstrations, but does not include dedicated verification metrics or bias controls. In the revised manuscript we will add a new subsection on pipeline validation that reports quantitative collision rates from simulation logs (targeting zero collisions), coverage statistics over stochastic parameters, and controls such as randomization ablation to rule out systematic artifacts. This addition will directly support the empirical claims.

    revision: yes
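The validation metrics promised in this simulated response are easy to state concretely. The sketch below shows one possible reading: a collision rate over per-demonstration log records and a histogram-based coverage statistic for a single randomized parameter. The log format (a `collided` flag per demonstration), the binning scheme, and the function names are assumptions; the summary does not describe the paper's logs or its coverage measure.

```python
from typing import Iterable, Mapping, Sequence


def demo_collision_rate(logs: Sequence[Mapping[str, bool]]) -> float:
    """Fraction of demonstrations whose log records a collision."""
    if not logs:
        raise ValueError("no demonstration logs provided")
    return sum(1 for entry in logs if entry["collided"]) / len(logs)


def bin_coverage(values: Iterable[float], low: float, high: float,
                 n_bins: int) -> float:
    """Fraction of equal-width bins on [low, high] hit by at least one value."""
    if high <= low or n_bins < 1:
        raise ValueError("need high > low and n_bins >= 1")
    occupied = set()
    for v in values:
        if not (low <= v <= high):
            continue  # ignore out-of-range samples
        idx = min(int((v - low) / (high - low) * n_bins), n_bins - 1)
        occupied.add(idx)
    return len(occupied) / n_bins
```

For a "strictly collision-free" dataset, `demo_collision_rate` should come out as exactly 0.0. A coverage value well below 1.0 for some parameter would indicate regions of the randomization range that the generated demonstrations never visit.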

Circularity Check

0 steps flagged

No circularity: empirical benchmark construction with no self-referential derivations

The paper introduces a procedural benchmark and keypose-driven data pipeline to generate 19,664 collision-free demonstrations, then evaluates eight VLA models empirically. No equations, fitted parameters, or predictions are presented that reduce to the inputs by construction. The generalization-safety tension claim follows directly from observed model performance on the generated data rather than from any self-definition or self-citation chain. The load-bearing assumption about the pipeline's lack of artifacts is a validity concern, not a circularity issue per the enumerated patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the unverified correctness of the procedural generation process and the assumption that the generated scenarios adequately represent real-world safety constraints.

axioms (2)
  • domain assumption: The parametric safety benchmark generates scenarios that comprehensively cover physical and semantic safety-critical cases with appropriate stochasticity.
    Invoked when claiming the benchmark addresses 'comprehensive' safety; no independent validation described in abstract.
  • domain assumption: The keypose-driven pipeline produces strictly collision-free demonstrations without systematic bias from the generation method itself.
    Central to the dataset curation claim; verification method not detailed in abstract.

pith-pipeline@v0.9.1-grok · 5735 in / 1261 out tokens · 70867 ms · 2026-06-29T04:38:24.333968+00:00 · methodology
