LIBERO-Safety: A Comprehensive Benchmark for Physical and Semantic Safety in Vision-Language-Action Models

Guocai Yao; Haohan Chi; Hao Zhao; Jiaolong Yang; Jinbang Guo; Jingrui Pang; Rongxu Cui; Saining Zhang; Shaoxuan Xie; Xianyuan Zhan

arxiv: 2606.23686 · v2 · pith:6AZFLEP2new · submitted 2026-06-22 · 💻 cs.RO

Rongxu Cui, Zongzheng Zhang, Jingrui Pang, Haohan Chi, Jinbang Guo, Saining Zhang, Shaoxuan Xie, Xin Jin, Yao Mu, Jiaolong Yang, Guocai Yao, Xianyuan Zhan, Ya-Qin Zhang, Hao Zhao

Pith reviewed 2026-06-29 04:38 UTC · model grok-4.3

classification 💻 cs.RO

keywords Vision-Language-Action models, safety benchmark, embodied AI, robot manipulation, trajectory synthesis, semantic misalignment, data generation pipeline, domain randomization

The pith

LIBERO-Safety benchmark reveals that high-diversity training improves VLA trajectory safety but task success stays limited by sub-optimal synthesis and semantic misalignment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LIBERO-Safety, a parametric benchmark that procedurally generates safety-critical physical and semantic scenarios for vision-language-action models. It pairs this with a keypose-driven pipeline to produce 19,664 collision-free demonstrations that incorporate extensive domain randomization. Cross-paradigm tests of eight VLA models and two embodied foundation models expose a generalization-safety tension in which greater training diversity yields safer behaviors while overall task completion remains constrained by trajectory quality and semantic alignment problems. The work supplies both the infrastructure and the failure-mode observations needed to guide development of more reliable VLA systems.
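To make the idea of a parametric, procedurally generated scenario concrete, the sketch below draws one scene from a seeded random generator. Every name and range in it (`Scenario`, `sample_scenario`, the tier-to-spread table, the lighting and friction intervals) is a hypothetical illustration, not the benchmark's actual schema or API. Only the general notions of difficulty tiers and visual/physical randomization come from the paper's description.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    difficulty: int         # hypothetical tier: 1 (easy) to 3 (hard)
    obstacle_xy: tuple      # obstacle offset from the workspace centre, metres
    light_intensity: float  # visual randomization (arbitrary units)
    friction: float         # physical randomization (friction coefficient)


# Assumed design choice: harder tiers keep the obstacle closer to the centre.
SPREAD_BY_TIER = {1: 0.30, 2: 0.20, 3: 0.10}


def sample_scenario(seed: int, difficulty: int) -> Scenario:
    """Draw one scenario; the same (seed, difficulty) always gives the same scene."""
    if difficulty not in SPREAD_BY_TIER:
        raise ValueError("difficulty must be 1, 2 or 3")
    # Distinct integer seed per (seed, tier) pair, for non-negative seeds.
    rng = random.Random(seed * 3 + (difficulty - 1))
    spread = SPREAD_BY_TIER[difficulty]
    return Scenario(
        difficulty=difficulty,
        obstacle_xy=(rng.uniform(-spread, spread), rng.uniform(-spread, spread)),
        light_intensity=rng.uniform(0.5, 1.5),
        friction=rng.uniform(0.4, 1.0),
    )


if __name__ == "__main__":
    print(sample_scenario(seed=7, difficulty=3))
```

Seeding each scenario explicitly is what would make such a benchmark reproducible: two groups evaluating different models can regenerate exactly the same scenes from a list of seeds.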

Core claim

The central claim is that although high-diversity training fosters safer trajectories in VLA models, task success remains fundamentally bottlenecked by sub-optimal trajectory synthesis and semantic misalignment, established through systematic evaluation on the new parametric safety benchmark and the curated collision-free dataset.

What carries the argument

The keypose-driven data generation pipeline that produces strictly collision-free demonstrations covering stochastic physical and semantic safety scenarios with extensive domain randomization.
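The review does not describe the pipeline's internals, so the following is not a reconstruction of it. It is a minimal sketch of the general pattern the phrase suggests: interpolate a path through a few keyposes, then reject the path if any waypoint comes too close to an obstacle. The function names, the spherical obstacle model, and the 2 cm margin are all assumptions made for illustration.

```python
import math


def interpolate(keyposes, steps_per_segment=10):
    """Linearly interpolate 3D waypoints between consecutive keyposes."""
    path = []
    for a, b in zip(keyposes, keyposes[1:]):
        for i in range(steps_per_segment):
            t = i / steps_per_segment
            path.append(tuple(pa + t * (pb - pa) for pa, pb in zip(a, b)))
    path.append(tuple(keyposes[-1]))  # include the final keypose itself
    return path


def min_clearance(path, obstacles):
    """Smallest gap between any waypoint and any spherical obstacle surface.

    `obstacles` is a list of (centre, radius) pairs; negative means penetration.
    """
    if not obstacles:
        return math.inf
    return min(math.dist(p, c) - r for p in path for c, r in obstacles)


def is_collision_free(path, obstacles, margin=0.02):
    """Accept the path only if every waypoint keeps at least `margin` metres."""
    return min_clearance(path, obstacles) >= margin


if __name__ == "__main__":
    obstacle = [((0.25, 0.0, 0.2), 0.05)]
    direct = interpolate([(0.0, 0.0, 0.2), (0.5, 0.0, 0.2)])
    detour = interpolate([(0.0, 0.0, 0.2), (0.25, 0.0, 0.4), (0.5, 0.0, 0.2)])
    print(is_collision_free(direct, obstacle), is_collision_free(detour, obstacle))
```

Note that this check only samples discrete waypoints of a point end-effector. A claim of "strictly collision-free" data would need a stronger test, such as swept-volume or full-geometry checking in the simulator, which is exactly the kind of detail the premise below depends on.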

If this is right

High-diversity training produces safer trajectories in VLA models.
Task success rates are limited by the quality of trajectory synthesis.
Semantic misalignment constitutes a primary bottleneck for VLA performance under safety constraints.
The parametric benchmark and dataset enable consistent cross-paradigm safety evaluation of VLA and embodied foundation models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Methods that directly improve trajectory synthesis could relax the observed success bottleneck.
The same parametric generation approach could be reused to create safety benchmarks for additional robot morphologies or task families.
Integrating explicit semantic alignment objectives during training might reduce the misalignment failures identified in the evaluations.

Load-bearing premise

The keypose-driven data generation pipeline produces strictly collision-free demonstrations that comprehensively cover stochastic physical and semantic safety-critical scenarios without introducing systematic artifacts or biases.

What would settle it

An evaluation in which models trained on the high-diversity LIBERO-Safety data fail to produce measurably safer trajectories than low-diversity baselines, or in which trajectory synthesis and semantic misalignment do not account for the majority of task failures, would refute the reported tension.
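One way such a comparison could be scored is a one-sided two-proportion z-test on collision rates. The sketch below is an assumption about how a reader might run the check, not a procedure taken from the paper, and the counts in the usage example are made-up placeholders rather than reported results.

```python
from math import sqrt
from statistics import NormalDist


def collision_rate_gap(unsafe_low, n_low, unsafe_high, n_high):
    """Compare unsafe-episode rates of a low- and a high-diversity model.

    Returns (rate gap, z statistic, one-sided p-value) for the hypothesis
    that the high-diversity model has the lower collision rate.
    """
    p_low = unsafe_low / n_low
    p_high = unsafe_high / n_high
    pooled = (unsafe_low + unsafe_high) / (n_low + n_high)
    se = sqrt(pooled * (1 - pooled) * (1 / n_low + 1 / n_high))
    if se == 0:
        # All episodes safe or all unsafe in both groups: no evidence either way.
        return p_low - p_high, 0.0, 0.5
    z = (p_low - p_high) / se
    p_value = 1 - NormalDist().cdf(z)
    return p_low - p_high, z, p_value


if __name__ == "__main__":
    # Placeholder counts for illustration only.
    gap, z, p = collision_rate_gap(unsafe_low=40, n_low=200, unsafe_high=20, n_high=200)
    print(f"gap={gap:.3f} z={z:.2f} p={p:.4f}")
```

A large p-value on held-out scenarios would be the "fail to produce measurably safer trajectories" outcome described above. The test assumes independent episodes, which may not hold if many episodes share a scene seed.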

Figures

Figures reproduced from arXiv: 2606.23686 by Guocai Yao, Haohan Chi, Hao Zhao, Jiaolong Yang, Jinbang Guo, Jingrui Pang, Rongxu Cui, Saining Zhang, Shaoxuan Xie, Xianyuan Zhan, Xin Jin, Yao Mu, Ya-Qin Zhang, Zongzheng Zhang.

**Figure 1.** Real-world VLA deployment is severely bottlenecked by physical safety and semantic reasoning, constituting critical (a) VLA Safety Challenges. To systematically evaluate these challenges, we introduce a comprehensive VLA safety benchmark and develop an efficient (b) Data Generation Pipeline to synthesize 19.7K strictly collision-free demonstrations. By evaluating VLA models fine-tuned on this corpus along…

**Figure 2.** Overview of our VLA Safety Benchmark. (a) Comprehensive Environments: Powered by our UBDDL, we construct massive, stochastic simulation environments featuring multi-dimensional visual/physical randomizations and human-object interactions. (b) Hierarchical Safety Taxonomy: A systematic evaluation suite assessing five critical dimensions of physical and semantic safety, strictly scaled across 3 difficulty t…

**Figure 3.** Comparison of State Space Distributions.

**Figure 4.** Emergent Spatial Reasoning. High-diversity training enables the model to transition from (a) non-linear avoidance to (b) optimal trajectory synthesis in obstacle-free workspaces. Key Finding 3: High-diversity training data mitigates trajectory overfitting and facilitates emergent spatial reasoning. To investigate the trade-off between trajectory memorization and visual-spatial generalization, we condu…

**Figure 6.** Representative examples of (a) Instruction-Aligned Execution and (b) Semantic Misalignment. While the policy is capable of generating collision-free trajectories, perceptual errors in multi-object scenes can lead the end-effector toward incorrect targets. yields a collision-free task incompletion, sacrificing the manipulation objective to kinematically sub-optimal planning. Key Finding 8: Semantic misalig…

Abstract

Despite the impressive manipulation capabilities of Vision-Language-Action (VLA) models, their operational safety under strict constraints remains largely unverified. To address this, we introduce a parametric safety benchmark to procedurally generate safety-critical scenarios with comprehensive stochasticity. To overcome the scalability bottlenecks of human teleoperation, we develop a novel keypose-driven data generation pipeline. Leveraging this infrastructure, we curate a large-scale dataset of 19,664 strictly collision-free demonstrations with extensive domain randomization. We then conduct a systematic cross-paradigm evaluation of eight VLA and two embodied foundation models. Our analysis reveals a critical generalization-safety tension: although high-diversity training fosters safer trajectories, task success remains fundamentally bottlenecked by sub-optimal trajectory synthesis and semantic misalignment. By providing a scalable pipeline, a robust dataset, and profound failure-mode insights, LIBERO-Safety establishes a crucial foundation for developing safe and reliable VLA models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper ships a new parametric benchmark and 19k-demo dataset for VLA safety testing, but the reported generalization-safety tension rests on an unverified claim that the keypose pipeline produces strictly collision-free data without artifacts.

The core offering here is infrastructure: a parametric way to generate safety-critical scenarios, a keypose-driven pipeline that scales past teleoperation, and a curated set of 19,664 demonstrations with domain randomization. They run ten models across paradigms and surface the pattern that more diverse training reduces unsafe trajectories while task completion stays capped by trajectory quality and semantic gaps.

That scale and the procedural approach are the parts that could actually get used. Robotics groups working on deployment safety often lack large, controlled negative examples; this gives them one without manual collection.

The soft spot is exactly where the stress test flags it. The abstract asserts the demonstrations are strictly collision-free and cover stochastic physical and semantic cases, yet supplies no verification method, collision metric, or artifact check. Without that, the claimed tension between diversity and success could trace back to how the data was synthesized rather than to the models themselves. If keypose selection or trajectory stitching quietly avoids certain failure modes, the results become harder to interpret as model properties.

The work is aimed at people who need concrete safety testbeds for VLA systems rather than theorists chasing new architectures. A reader who wants to run their own models against standardized risky scenarios would find the dataset and pipeline useful once the generation details are confirmed.

It is worth sending to peer review. The infrastructure angle is concrete enough that referees can check the pipeline reproducibility and data quality directly; the empirical claim can be strengthened or qualified in revision.

Referee Report

1 major / 0 minor

Summary. The paper introduces LIBERO-Safety, a parametric benchmark for generating safety-critical scenarios for Vision-Language-Action (VLA) models. It proposes a keypose-driven data generation pipeline to curate a dataset of 19,664 strictly collision-free demonstrations with domain randomization, evaluates eight VLA models and two embodied foundation models, and reports a generalization-safety tension: high-diversity training yields safer trajectories, yet task success remains limited by sub-optimal trajectory synthesis and semantic misalignment.

Significance. If the central empirical claims hold, the work supplies a scalable pipeline and large dataset for safety evaluation in VLAs together with concrete failure-mode analysis across paradigms; these elements would constitute a useful foundation for subsequent safety research in the field.

major comments (1)

[Abstract] Abstract (paragraph describing the pipeline and dataset curation): the assertion that the keypose-driven pipeline produces 'strictly collision-free demonstrations' that 'comprehensively cover stochastic physical and semantic safety-critical scenarios without introducing systematic artifacts or biases' is load-bearing for the reported generalization-safety tension, yet the manuscript provides no verification methods, quantitative error metrics, or controls for generation artifacts.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the abstract claims. We address the major comment below.

Referee: [Abstract] Abstract (paragraph describing the pipeline and dataset curation): the assertion that the keypose-driven pipeline produces 'strictly collision-free demonstrations' that 'comprehensively cover stochastic physical and semantic safety-critical scenarios without introducing systematic artifacts or biases' is load-bearing for the reported generalization-safety tension, yet the manuscript provides no verification methods, quantitative error metrics, or controls for generation artifacts.

Authors: We agree that the abstract's claims about the pipeline require explicit supporting verification to substantiate the generalization-safety tension. The manuscript describes the keypose-driven pipeline (Section 3) as using simulation-based collision checking and domain randomization to produce the 19,664 demonstrations, but does not include dedicated verification metrics or bias controls. In the revised manuscript we will add a new subsection on pipeline validation that reports quantitative collision rates from simulation logs (targeting zero collisions), coverage statistics over stochastic parameters, and controls such as randomization ablation to rule out systematic artifacts. This addition will directly support the empirical claims.

revision: yes
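The kind of validation promised here can be sketched in a few lines. The log schema below (`contacts`, `param`) and the function `validation_report` are hypothetical, chosen only to show what a collision rate and a coverage statistic over one randomized parameter might look like; the manuscript's actual log format is not given in this review.

```python
from collections import Counter


def validation_report(episodes, bins=5):
    """Summarise collision rate and parameter coverage for a list of episode logs.

    Each episode is a dict with (assumed) keys:
      "contacts": number of unintended contact events in the rollout
      "param":    one randomized parameter, normalised to [0, 1]
    """
    if not episodes:
        raise ValueError("no episodes to validate")
    collided = sum(1 for e in episodes if e["contacts"] > 0)
    # Map each parameter value to an equal-width bin; 1.0 falls in the last bin.
    per_bin = Counter(min(int(e["param"] * bins), bins - 1) for e in episodes)
    return {
        "collision_rate": collided / len(episodes),
        "coverage": len(per_bin) / bins,  # fraction of bins with at least one episode
        "per_bin": dict(sorted(per_bin.items())),
    }


if __name__ == "__main__":
    logs = [
        {"contacts": 0, "param": 0.05},
        {"contacts": 0, "param": 0.45},
        {"contacts": 1, "param": 0.50},
        {"contacts": 0, "param": 1.00},
    ]
    print(validation_report(logs))
```

A report like this would address two of the referee's three requests (collision metric and coverage). The randomization ablation would still need separate training runs.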

Circularity Check

0 steps flagged

No circularity: empirical benchmark construction with no self-referential derivations

The paper introduces a procedural benchmark and keypose-driven data pipeline to generate 19,664 collision-free demonstrations, then empirically evaluates eight VLA models and two embodied foundation models. No equations, fitted parameters, or predictions are presented that reduce to the inputs by construction. The generalization-safety tension claim follows directly from observed model performance on the generated data rather than from any self-definition or self-citation chain. The load-bearing assumption about the pipeline's lack of artifacts is a validity concern, not a circularity issue per the enumerated patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the unverified correctness of the procedural generation process and the assumption that the generated scenarios adequately represent real-world safety constraints.

axioms (2)

domain assumption: The parametric safety benchmark generates scenarios that comprehensively cover physical and semantic safety-critical cases with appropriate stochasticity.
Invoked when claiming the benchmark addresses 'comprehensive' safety; no independent validation is described in the abstract.
domain assumption: The keypose-driven pipeline produces strictly collision-free demonstrations without systematic bias from the generation method itself.
Central to the dataset curation claim; the verification method is not detailed in the abstract.

pith-pipeline@v0.9.1-grok · 5735 in / 1261 out tokens · 70867 ms · 2026-06-29T04:38:24.333968+00:00 · methodology

[1] [1]

arXiv preprint arXiv:2511.14759 (2025)

Amin, A., Aniceto, R., Balakrishna, A., Black, K., Conley, K., Connors, G., Darpinian, J., Dhabalia, K., DiCarlo, J., Driess, D., et al.:π∗ 0.6: a vla that learns from experience. arXiv preprint arXiv:2511.14759 (2025)

Pith/arXiv arXiv 2025

[2] [2]

arXiv preprint arXiv:2506.09985 (2025)

Assran, M., Bardes, A., Fan, D., Garrido, Q., Howes, R., Muckley, M., Rizvi, A., Roberts, C., Sinha, K., Zholus, A., et al.: V-jepa 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985 (2025)

Pith/arXiv arXiv 2025

[3] [3]

IEEE Transactions on Biomedical Engineering59(8), 2126–2136 (2012)

Balasubramanian, S., Melendez-Calderon, A., Burdet, E.: A robust and sensitive metric for quantifying movement smoothness. IEEE Transactions on Biomedical Engineering59(8), 2126–2136 (2012)

2012

[4] [4]

arXiv preprint arXiv:2503.14734 (2025)

Bjorck, J., Castañeda, F., Cherniadev, N., Da, X., Ding, R., Fan, L., Fang, Y., Fox, D., Hu, F., Huang, S., et al.: Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734 (2025)

Pith/arXiv arXiv 2025

[5] Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., et al.: π0: a vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164 (2024)

[6] Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Chen, X., Choromanski, K., Ding, T., Driess, D., Dubey, A., Finn, C., et al.: Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818 (2023)

[7] Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Dabis, J., Finn, C., Gopalakrishnan, K., Hausman, K., Herzog, A., Hsu, J., et al.: Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817 (2022)

[8] Bu, Q., Yang, Y., Cai, J., Gao, S., Ren, G., Yao, M., Luo, P., Li, H.: Univla: Learning to act anywhere with task-centric latent actions. In: RSS (2025)

[9] Cen, J., Huang, S., Yuan, Y., Yuan, H., Yu, C., Jiang, Y., Guo, J., Li, K., Luo, H., Wang, F., et al.: Rynnvla-002: A unified vision-language-action and world model. arXiv preprint arXiv:2511.17502 (2025)

[10] Cen, J., Yu, C., Yuan, H., Jiang, Y., Huang, S., Guo, J., Li, X., Song, Y., Luo, H., Wang, F., et al.: Worldvla: Towards autoregressive action world model. arXiv preprint arXiv:2506.21539 (2025)

[11] Chen, T., Chen, Z., Chen, B., Cai, Z., Liu, Y., Liang, Q., Li, Z., Lin, X., Ge, Y., Gu, Z., et al.: Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation. arXiv preprint arXiv:2506.18088 (2025)

[12] Dang, R., Guo, J., Hou, B., Leng, S., Li, K., Li, X., Liu, J., Mao, Y., Wang, Z., Yuan, Y., et al.: Rynnbrain: Open embodied foundation models. arXiv preprint arXiv:2602.14979 (2026)

[13] Deng, H., Guo, W., Wang, Q., Wu, Z., Wang, Z.: Safebimanual: Diffusion-based trajectory optimization for safe bimanual manipulation. In: CoRL (2025)

[14] Ding, K., Chen, B., Wu, R., Li, Y., Zhang, Z., Gao, H.a., Li, S., Zhou, G., Zhu, Y., Dong, H., et al.: Preafford: Universal affordance-based pre-grasping for diverse objects and environments. In: 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 7278–7285. IEEE (2024)

[15] Fei, S., Wang, S., Shi, J., Dai, Z., Cai, J., Qian, P., Ji, L., He, X., Zhang, S., Fei, Z., et al.: Libero-plus: In-depth robustness analysis of vision-language-action models. arXiv preprint arXiv:2510.13626 (2025)

[16] Guo, Y., Zhang, J., Chen, X., Ji, X., Wang, Y.J., Hu, Y., Chen, J.: Improving vision-language-action model with online reinforcement learning. In: ICRA (2025)

[17] Hu, S., Liu, Z., Liu, S., Cen, J., Meng, Z., He, X.: Vlsa: Vision-language-action models with plug-and-play safety constraint layer. arXiv preprint arXiv:2512.11891 (2025)

[18] Hung, C.Y., Majumder, N., Deng, H., Renhang, L., Ang, Y., Zadeh, A., Li, C., Herremans, D., Wang, Z., Poria, S.: Nora-1.5: A vision-language-action model trained using world model-and action-based preference rewards. arXiv preprint arXiv:2511.14659 (2025)

[19] Intelligence, P., Black, K., Brown, N., Darpinian, J., Dhabalia, K., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., et al.: π0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054 (2025)

[20] James, S., Ma, Z., Arrojo, D.R., Davison, A.J.: Rlbench: The robot learning benchmark & learning environment. IEEE Robotics and Automation Letters 5(2), 3019–3026 (2020)

[21] Jiang, Y., Huang, S., Xue, S., Zhao, Y., Cen, J., Leng, S., Li, K., Guo, J., Wang, K., Chen, M., et al.: Rynnvla-001: Using human demonstrations to improve robot manipulation. arXiv preprint arXiv:2509.15212 (2025)

[22] Kim, M.J., Finn, C., Liang, P.: Fine-tuning vision-language-action models: Optimizing speed and success. arXiv preprint arXiv:2502.19645 (2025)

[23] Kim, M., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E., Lam, G., Sanketi, P., et al.: Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246 (2024)

[24] Lei, K., Li, H., Yu, D., Wei, Z., Guo, L., Jiang, Z., Wang, Z., Liang, S., Xu, H.: Rl-100: Performant robotic manipulation with real-world reinforcement learning. arXiv preprint arXiv:2510.14830 (2025)

[25] Li, H., Zuo, Y., Yu, J., Zhang, Y., Yang, Z., Zhang, K., Zhu, X., Zhang, Y., Chen, T., Cui, G., et al.: Simplevla-rl: Scaling vla training via reinforcement learning. arXiv preprint arXiv:2509.09674 (2025)

[26] Li, X., Hsu, K., Gu, J., Mees, O., Pertsch, K., Walke, H.R., Fu, C., Lunawat, I., Sieh, I., Kirmani, S., Levine, S., Wu, J., Finn, C., Su, H., Vuong, Q., Xiao, T.: Evaluating real-world robot manipulation policies in simulation. In: CoRL (2024)

[27] Li, Y., Ma, X., Xu, J., Cui, Y., Cui, Z., Han, Z., Huang, L., Kong, T., Liu, Y., Niu, H., et al.: Gr-rl: Going dexterous and precise for long-horizon robotic manipulation. arXiv preprint arXiv:2512.01801 (2025)

[28] Liu, B., Zhu, Y., Gao, C., Feng, Y., Liu, Q., Zhu, Y., Stone, P.: Libero: Benchmarking knowledge transfer for lifelong robot learning. NeurIPS 36, 44776–44791 (2023)

[29] Liu, S., Wu, L., Li, B., Tan, H., Chen, H., Wang, Z., Xu, K., Su, H., Zhu, J.: Rdt-1b: A diffusion foundation model for bimanual manipulation. In: Yue, Y., Garg, A., Peng, N., Sha, F., Yu, R. (eds.) ICLR. vol. 2025, pp. 29982–30009 (2025)

[30] Ma, Y., Song, Z., Zhuang, Y., Hao, J., King, I.: A survey on vision-language-action models for embodied ai. arXiv preprint arXiv:2405.14093 (2024)

[31] Mees, O., Hermann, L., Rosete-Beas, E., Burgard, W.: Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. IEEE Robotics and Automation Letters 7(3), 7327–7334 (2022)

[32] Morton, D., Pavone, M.: Safe, task-consistent manipulation with operational space control barrier functions. In: 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 187–194 (2025)

[33] Mu, Y., Chen, T., Chen, Z., Peng, S., Lan, Z., Gao, Z., Liang, Z., Yu, Q., Zou, Y., Xu, M., Lin, L., Xie, Z., Ding, M., Luo, P.: Robotwin: Dual-arm robot benchmark with generative digital twins. In: CVPR. pp. 27649–27660 (2025)

[34] Nakamura, K., Peters, L., Bajcsy, A.: Generalizing safety beyond collision-avoidance via latent-space reachability analysis. arXiv preprint arXiv:2502.00935 (2025)

[35] Nasiriany, S., Maddukuri, A., Zhang, L., Parikh, A., Lo, A., Joshi, A., Mandlekar, A., Zhu, Y.: Robocasa: Large-scale simulation of everyday tasks for generalist robots. In: RSS (2024)

[36] Octo Model Team, Ghosh, D., Walke, H., Pertsch, K., Black, K., Mees, O., Dasari, S., Hejna, J., Xu, C., Luo, J., et al.: Octo: An open-source generalist robot policy. In: RSS (2024)

[37] Ranjan, A., Agrawal, S., Jain, A., Jagtap, P., Kolathaya, S., et al.: Barrier functions inspired reward shaping for reinforcement learning. In: ICRA. pp. 10807–10813 (2024)

[38] Romero, J., Tzionas, D., Black, M.J.: Embodied hands: modeling and capturing hands and bodies together. ACM Transactions on Graphics 36(6) (2017)

[39] Srivastava, S., Li, C., Lingelbach, M., Martín-Martín, R., Xia, F., Vainio, K.E., Lian, Z., Gokmen, C., Buch, S., Liu, K., Savarese, S., Gweon, H., Wu, J., Fei-Fei, L.: Behavior: Benchmark for everyday household activities in virtual, interactive, and ecological environments. In: CoRL. vol. 164, pp. 477–490 (2022)

[40] Sun, J., Zhang, W., Qi, Z., Ren, S., Liu, Z., Zhu, H., Sun, G., Jin, X., Chen, Z.: Vla-jepa: Enhancing vision-language-action model with latent world model. arXiv preprint arXiv:2602.10098 (2026)

[41] Sun, Y., Chen, R., Yun, K.S., Fang, Y., Jung, S., Li, F., Li, B., Zhao, W., Liu, C.: Spark: A modular benchmark for humanoid robot safety. arXiv preprint arXiv:2502.03132 (2025)

[42] Sundaralingam, B., Hari, S.K.S., Fishman, A., Garrett, C., Wyk, K.V., Blukis, V., Millane, A., Oleynikova, H., Handa, A., Ramos, F., Ratliff, N., Fox, D.: curobo: Parallelized collision-free minimum-jerk robot motion generation. arXiv preprint arXiv:2310.17274 (2023)

[43] Taheri, O., Ghorbani, N., Black, M.J., Tzionas, D.: Grab: A dataset of whole-body human grasping of objects. In: ECCV. pp. 581–600 (2020)

[44] Tan, S., Dou, K., Zhao, Y., Krähenbühl, P.: Interactive post-training for vision-language-action models. arXiv preprint arXiv:2505.17016 (2025)

[45] Team, B.R., Cao, M., Tan, H., Ji, Y., Chen, X., Lin, M., Li, Z., Cao, Z., Wang, P., Zhou, E., et al.: Robobrain 2.0 technical report. arXiv preprint arXiv:2507.02029 (2025)

[46] Wabersich, K.P., Taylor, A.J., Choi, J.J., Sreenath, K., Tomlin, C.J., Ames, A.D., Zeilinger, M.N.: Data-driven safety filters: Hamilton-jacobi reachability, control barrier functions, and predictive methods for uncertain systems. IEEE Control Systems Magazine 43(5), 137–177 (2023)

[47] Wang, G., Zhang, C., Liu, Q., Zhang, J., Cai, J., Liu, J., Liu, X.: Libero-x: Robustness litmus for vision-language-action models. arXiv preprint arXiv:2602.06556 (2026)

[48] Wu, S., Liu, X., Xie, S., Wang, P., Li, X., Yang, B., Li, Z., Zhu, K., Wu, H., Liu, Y., et al.: Robocoin: An open-sourced bimanual robotic data collection for integrated manipulation. arXiv preprint arXiv:2511.17441 (2025)

[49] Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al.: Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)

[50] Yang, L., Werner, B., de Sa, M., Ames, A.D.: Cbf-rl: Safety filtering reinforcement learning in training with control barrier functions. arXiv preprint arXiv:2510.14959 (2025)

[51] Zacharaki, A., Kostavelis, I., Gasteratos, A., Dokas, I.: Safety bounds in human robot interaction: A survey. Safety Science 127, 104667 (2020)

[52] Zhang, B., Li, J., Shen, J., Cai, Y., Zhang, Y., Chen, Y., Dai, J., Ji, J., Yang, Y.: Vla-arena: An open-source framework for benchmarking vision-language-action models. arXiv preprint arXiv:2512.22539 (2025)

[53] Zhang, B., Zhang, Y., Ji, J., Lei, Y., Dai, J., Chen, Y., Yang, Y.: SafeVLA: Towards safety alignment of vision-language-action model via constrained learning. In: NeurIPS (2025)

[54] Zhang, Z., Pang, J., Yang, Z., Li, K., Liao, M., Zhang, S., Chi, G., Guo, J., Gao, H.a., Shi, M., et al.: Dexora: Open-source vla for high-dof bimanual dexterity. arXiv preprint arXiv:2605.18722 (2026)

[55] Zhang, Z., Xu, H., Yang, Z., Yue, C., Lin, Z., Gao, H.a., Wang, Z., Zhao, H.: Ta-vla: Elucidating the design space of torque-aware vision-language-action models. arXiv preprint arXiv:2509.07962 (2025)

[56] Zhang, Z., Yue, C., Xu, H., Liao, M., Qi, X., Gao, H.a., Wang, Z., Zhao, H.: Robochemist: Long-horizon and safety-compliant robotic chemical experimentation. arXiv preprint arXiv:2509.08820 (2025)

[57] Zhao, R., Li, B., Liu, Z., Liang, Y., Ye, J., Liu, F., Wu, D., Wang, Z., Yu, X., Rao, Y., et al.: Gem: Generative supervision helps embodied intelligence. arXiv preprint arXiv:2605.28548 (2026)

[58] Zheng, J., Li, J., Wang, Z., Liu, D., Kang, X., Feng, Y., Zheng, Y., Zou, J., Chen, Y., Zeng, J., et al.: X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model. arXiv preprint arXiv:2510.10274 (2025)

[59] Zhong, C., Zheng, Y., Zheng, Y., Zhao, H., Yi, L., Mu, X., Wang, L., Li, P., Zhou, G., Yang, C., et al.: 3d implicit transporter for temporally consistent keypoint discovery. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 3869–3880 (2023)

[60] Zhou, X., Xu, Y., Tie, G., Chen, Y., Zhang, G., Chu, D., Zhou, P., Sun, L.: Libero-pro: Towards robust and fair evaluation of vision-language-action models beyond memorization. arXiv preprint arXiv:2510.03827 (2025)

Appendix

This appendix provides supplementary technical details, extended discussions to support the main finding...

[61] to interpret multimodal instructions and spatial contexts, which continuously conditions a dedicated Diffusion Transformer module (System 1). This structural decoupling enables robust high-level semantic reasoning while the diffusion module independently denoises and generates fluid, high-frequency continuous motor actions, en...