LIBERO-Safety: A Comprehensive Benchmark for Physical and Semantic Safety in Vision-Language-Action Models
Pith reviewed 2026-06-29 04:38 UTC · model grok-4.3
The pith
LIBERO-Safety benchmark reveals that high-diversity training improves VLA trajectory safety but task success stays limited by sub-optimal synthesis and semantic misalignment.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that although high-diversity training fosters safer trajectories in VLA models, task success remains fundamentally bottlenecked by sub-optimal trajectory synthesis and semantic misalignment, established through systematic evaluation on the new parametric safety benchmark and the curated collision-free dataset.
What carries the argument
The keypose-driven data generation pipeline that produces strictly collision-free demonstrations covering stochastic physical and semantic safety scenarios with extensive domain randomization.
If this is right
- High-diversity training produces safer trajectories in VLA models.
- Task success rates are limited by the quality of trajectory synthesis.
- Semantic misalignment constitutes a primary bottleneck for VLA performance under safety constraints.
- The parametric benchmark and dataset enable consistent cross-paradigm safety evaluation of VLA and embodied foundation models.
Where Pith is reading between the lines
- Methods that directly improve trajectory synthesis could relax the observed success bottleneck.
- The same parametric generation approach could be reused to create safety benchmarks for additional robot morphologies or task families.
- Integrating explicit semantic alignment objectives during training might reduce the misalignment failures identified in the evaluations.
Load-bearing premise
The keypose-driven data generation pipeline produces strictly collision-free demonstrations that comprehensively cover stochastic physical and semantic safety-critical scenarios without introducing systematic artifacts or biases.
What would settle it
An evaluation in which models trained on the high-diversity LIBERO-Safety data fail to produce measurably safer trajectories than low-diversity baselines, or in which trajectory synthesis and semantic misalignment do not account for the majority of task failures, would refute the reported tension.
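The test described above can be sketched as a small analysis script. This is only an illustration under stated assumptions: the `Episode` record, the regime labels `high_diversity` and `low_diversity`, and the failure-cause names are hypothetical and do not come from the paper. The two-proportion z-test is one common way to compare collision rates, not necessarily the one the authors would use.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Episode:
    """One evaluation rollout (hypothetical schema, not from the paper)."""
    regime: str                   # "high_diversity" or "low_diversity"
    collided: bool                # any safety violation during the rollout
    succeeded: bool               # task completed
    failure_cause: Optional[str]  # hypothetical label; None when succeeded


def collision_rate(episodes: Sequence[Episode], regime: str) -> Tuple[float, int]:
    """Return (collision rate, episode count) for one training regime."""
    group = [e for e in episodes if e.regime == regime]
    if not group:
        raise ValueError(f"no episodes for regime {regime!r}")
    return sum(e.collided for e in group) / len(group), len(group)


def two_proportion_z(p1: float, n1: int, p2: float, n2: int) -> float:
    """Pooled two-proportion z statistic; negative when p1 < p2."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    return 0.0 if se == 0.0 else (p1 - p2) / se


def bottleneck_share(
    episodes: Sequence[Episode],
    causes: Tuple[str, ...] = ("trajectory_synthesis", "semantic_misalignment"),
) -> float:
    """Fraction of failed episodes attributed to the named causes."""
    failures = [e for e in episodes if not e.succeeded]
    if not failures:
        return 0.0
    return sum(e.failure_cause in causes for e in failures) / len(failures)
```

Under these assumptions, the claim would be in trouble if the z statistic for high-diversity versus low-diversity collision rates were not clearly negative, or if `bottleneck_share` came out below one half.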
Original abstract
Despite the impressive manipulation capabilities of Vision-Language-Action (VLA) models, their operational safety under strict constraints remains largely unverified. To address this, we introduce a parametric safety benchmark to procedurally generate safety-critical scenarios with comprehensive stochasticity. To overcome the scalability bottlenecks of human teleoperation, we develop a novel keypose-driven data generation pipeline. Leveraging this infrastructure, we curate a large-scale dataset of 19,664 strictly collision-free demonstrations with extensive domain randomization. We then conduct a systematic cross-paradigm evaluation of eight VLA and two embodied foundation models. Our analysis reveals a critical generalization-safety tension: although high-diversity training fosters safer trajectories, task success remains fundamentally bottlenecked by sub-optimal trajectory synthesis and semantic misalignment. By providing a scalable pipeline, a robust dataset, and profound failure-mode insights, LIBERO-Safety establishes a crucial foundation for developing safe and reliable VLA models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LIBERO-Safety, a parametric benchmark for generating safety-critical scenarios for Vision-Language-Action (VLA) models. It proposes a keypose-driven data generation pipeline to curate a dataset of 19,664 strictly collision-free demonstrations with domain randomization, evaluates eight VLA models and two embodied foundation models, and reports a generalization-safety tension: high-diversity training yields safer trajectories, yet task success remains limited by sub-optimal trajectory synthesis and semantic misalignment.
Significance. If the central empirical claims hold, the work supplies a scalable pipeline and large dataset for safety evaluation in VLAs together with concrete failure-mode analysis across paradigms; these elements would constitute a useful foundation for subsequent safety research in the field.
major comments (1)
- [Abstract] Abstract (paragraph describing the pipeline and dataset curation): the assertion that the keypose-driven pipeline produces 'strictly collision-free demonstrations' that 'comprehensively cover stochastic physical and semantic safety-critical scenarios without introducing systematic artifacts or biases' is load-bearing for the reported generalization-safety tension, yet the manuscript provides no verification methods, quantitative error metrics, or controls for generation artifacts.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract claims. We address the major comment below.
Point-by-point responses
- Referee: [Abstract] Abstract (paragraph describing the pipeline and dataset curation): the assertion that the keypose-driven pipeline produces 'strictly collision-free demonstrations' that 'comprehensively cover stochastic physical and semantic safety-critical scenarios without introducing systematic artifacts or biases' is load-bearing for the reported generalization-safety tension, yet the manuscript provides no verification methods, quantitative error metrics, or controls for generation artifacts.
Authors: We agree that the abstract's claims about the pipeline require explicit supporting verification to substantiate the generalization-safety tension. The manuscript describes the keypose-driven pipeline (Section 3) as using simulation-based collision checking and domain randomization to produce the 19,664 demonstrations, but does not include dedicated verification metrics or bias controls. In the revised manuscript we will add a new subsection on pipeline validation that reports quantitative collision rates from simulation logs (targeting zero collisions), coverage statistics over stochastic parameters, and controls such as randomization ablation to rule out systematic artifacts. This addition will directly support the empirical claims.
Revision: yes
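The validation metrics promised in the rebuttal could look roughly like the sketch below. It rests on assumptions: the input format (a per-demonstration contact count and a list of sampled randomization values) and the function names are hypothetical, and histogram-bin occupancy is just one simple stand-in for "coverage statistics over stochastic parameters".

```python
from typing import Sequence


def collision_free_fraction(contact_counts: Sequence[int]) -> float:
    """Fraction of demonstrations whose logged unwanted-contact count is zero."""
    if not contact_counts:
        raise ValueError("no demonstrations provided")
    return sum(c == 0 for c in contact_counts) / len(contact_counts)


def bin_coverage(samples: Sequence[float], low: float, high: float, n_bins: int) -> float:
    """Fraction of equal-width bins in [low, high] hit by at least one sample."""
    if n_bins <= 0 or high <= low:
        raise ValueError("need n_bins > 0 and high > low")
    width = (high - low) / n_bins
    occupied = set()
    for x in samples:
        if not (low <= x <= high):
            continue  # ignore values outside the declared randomization range
        # Clamp so that x == high falls into the last bin.
        occupied.add(min(int((x - low) / width), n_bins - 1))
    return len(occupied) / n_bins
```

A claim of "strictly collision-free" would correspond to `collision_free_fraction` returning exactly 1.0 over all 19,664 demonstrations, and low `bin_coverage` for any randomized parameter would flag a region of the scenario space the dataset does not reach.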
Circularity Check
No circularity: empirical benchmark construction with no self-referential derivations
Full rationale
The paper introduces a procedural benchmark and keypose-driven data pipeline to generate 19,664 collision-free demonstrations, then empirically evaluates eight VLA models and two embodied foundation models. No equations, fitted parameters, or predictions are presented that reduce to the inputs by construction. The generalization-safety tension claim follows directly from observed model performance on the generated data rather than from any self-definition or self-citation chain. The load-bearing assumption that the pipeline introduces no artifacts is a validity concern, not a circularity issue under the enumerated patterns.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: The parametric safety benchmark generates scenarios that comprehensively cover physical and semantic safety-critical cases with appropriate stochasticity.
- domain assumption: The keypose-driven pipeline produces strictly collision-free demonstrations without systematic bias from the generation method itself.