
arxiv: 2606.20698 · v1 · pith:AYWLE74Jnew · submitted 2026-06-15 · 💻 cs.RO

SafeDojo: Safe Reinforcement Learning for VLA via Interactive World Model

Pith reviewed 2026-06-27 04:10 UTC · model grok-4.3

classification 💻 cs.RO
keywords: safe reinforcement learning · vision-language-action · interactive world model · model-based RL · Lagrangian optimization · SafeLIBERO · embodied safety · Franka deployment

The pith

SafeDojo trains vision-language-action policies safely by running reinforcement learning inside an interactive video world model that imagines action outcomes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that safe reinforcement learning for vision-language-action models becomes practical when an interactive video world model generates action-conditioned future frames and latent states, from which a ResNet classifier scores task progress and a lightweight head scores safety costs per step. These decoupled signals are then optimized together under explicit safety constraints using a Lagrangian formulation of the GRPO objective. A sympathetic reader would care because prior safe RL approaches either demand risky real-world exploration or rely on hand-engineered cost functions, neither of which scales to open physical environments. If the claim holds, VLAs can improve both success and safety through imagination before any physical deployment.
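For orientation, the textbook constrained-MDP problem that Lagrangian methods of this kind relax can be written as below. The symbols $J_r$ (expected task return), $J_c$ (expected safety cost), $d$ (cost budget), and $\lambda$ (multiplier) are generic notation assumed here; the paper's exact objective is not reproduced on this page.

```latex
\[
\begin{aligned}
&\max_{\pi}\; J_r(\pi) \quad \text{s.t.} \quad J_c(\pi) \le d, \\
&\min_{\lambda \ge 0}\, \max_{\pi}\; \mathcal{L}(\pi, \lambda)
  = J_r(\pi) - \lambda \bigl( J_c(\pi) - d \bigr).
\end{aligned}
\]
```

When the constraint is violated, $J_c(\pi) - d > 0$ and raising $\lambda$ increases the penalty; when it is satisfied, $\lambda$ can shrink toward zero and the task term dominates.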

Core claim

SafeDojo performs online reinforcement learning on top of an interactive video world model that produces action-conditioned future predictions; a ResNet success classifier estimates per-step task progress from the imagined frames while a lightweight safety head predicts per-step safety costs from latent context and the proposed action chunk; the resulting task-reward and safety-cost signals are balanced through a Lagrangian-based constrained GRPO objective, producing coordinated gains in task success and safety.
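One common way to combine decoupled reward and cost signals in a group-relative scheme is sketched below. This is a minimal illustration, not the paper's formulation: the function names, the `(A_r - lam * A_c) / (1 + lam)` mixing rule, the dual-ascent step, and the step size `lr_lambda` are all assumptions chosen for clarity.

```python
import math

def group_normalize(values, eps=1e-8):
    """Standardize values within one sampled group (GRPO-style baseline)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var)
    return [(v - mean) / (std + eps) for v in values]

def constrained_advantages(rewards, costs, lam):
    """Mix reward and cost advantages with multiplier lam (assumed rule)."""
    adv_r = group_normalize(rewards)
    adv_c = group_normalize(costs)
    # Dividing by (1 + lam) keeps the scale bounded as lam grows.
    return [(ar - lam * ac) / (1.0 + lam) for ar, ac in zip(adv_r, adv_c)]

def update_multiplier(lam, costs, cost_limit, lr_lambda=0.05):
    """Projected dual ascent: raise lam when mean cost exceeds the limit."""
    mean_cost = sum(costs) / len(costs)
    return max(0.0, lam + lr_lambda * (mean_cost - cost_limit))

# Hypothetical group of four imagined rollouts for one prompt.
rewards = [1.0, 0.0, 1.0, 0.0]   # task success per rollout
costs = [0.0, 0.0, 1.0, 1.0]     # safety cost per rollout
advantages = constrained_advantages(rewards, costs, lam=1.0)
new_lam = update_multiplier(1.0, costs, cost_limit=0.25)
```

In this toy group the safe success receives the largest advantage and the unsafe failure the smallest, while the unsafe success is pulled down to roughly zero, which is the qualitative behavior a constrained objective is meant to produce.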

What carries the argument

The interactive video world model, which supplies action-conditioned future predictions used by separate heads to estimate task progress and safety costs for constrained optimization.
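The data flow described above can be sketched with toy stand-ins. Everything here is hypothetical: the callables `world_model`, `success_head`, and `safety_head`, the scalar "state", and the choice to sum per-step signals over the rollout are illustrative assumptions, not the paper's components.

```python
def score_imagined_rollout(world_model, success_head, safety_head,
                           state, action_chunks):
    """Roll action chunks through a world model and score each step."""
    rewards, costs = [], []
    for chunk in action_chunks:
        # Safety cost is read from the current latent context plus the
        # proposed action chunk, before the step is imagined.
        costs.append(safety_head(state, chunk))
        # The world model imagines the next latent state and frame.
        state, frame = world_model(state, chunk)
        # Task progress is read from the imagined frame.
        rewards.append(success_head(frame))
    # Summing per-step signals is an assumption made for this sketch.
    return sum(rewards), sum(costs)

# Toy stand-ins: the "state" is a 1-D position, the "frame" is that position.
def toy_world_model(state, chunk):
    new_state = state + sum(chunk)
    return new_state, new_state

def toy_success_head(frame):
    return 1.0 if frame >= 1.0 else 0.0

def toy_safety_head(state, chunk):
    return 1.0 if any(abs(a) > 0.5 for a in chunk) else 0.0

total_reward, total_cost = score_imagined_rollout(
    toy_world_model, toy_success_head, toy_safety_head,
    state=0.0, action_chunks=[[0.3, 0.3], [0.2, 0.6]],
)
```

Keeping the two heads as separate callables mirrors the decoupling the summary describes: either signal can be swapped or retrained without touching the other.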

If this is right

  • The same world-model imagination loop yields the highest aggregate task success, safe success, and execution efficiency among compared inference-time, model-free, and model-based baselines on SafeLIBERO.
  • An 8.25 percentage-point gain in average safe-success rate appears on Level I relative to the strongest baseline.
  • Real-world Franka experiments across five tasks show the highest average task-success and safe-success rates.
  • The decoupled reward and cost signals allow explicit safety constraints to be maintained while task performance improves.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the world model generalizes across new objects and scenes, the same training loop could be applied to additional VLA architectures without new hand-crafted safety functions.
  • The separation of task and safety heads inside the world-model loop suggests a route to adding further constraints, such as energy or collision limits, by attaching new prediction heads.
  • Because all learning occurs in imagined trajectories, the approach may lower the total number of real-robot trials needed to reach a target safety level.

Load-bearing premise

The world model must produce sufficiently accurate action-conditioned predictions so that the success classifier and safety head can reliably judge imagined trajectories.

What would settle it

Deploy the learned policy on the real Franka arm and observe whether the measured safety violations or task failures match the rates predicted from the world model's imagined rollouts.
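One simple way to run that comparison is to put a confidence interval around the measured real-robot rate and ask whether the world-model prediction falls inside it. The sketch below uses the Wilson score interval; the helper names and the example counts are hypothetical, and the paper does not state that it performs this test.

```python
import math

def wilson_interval(events, trials, z=1.96):
    """Wilson score interval for a binomial rate (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = events / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials
                         + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)

def prediction_consistent(predicted_rate, events, trials):
    """True if the imagined-rollout rate lies inside the measured interval."""
    low, high = wilson_interval(events, trials)
    return low <= predicted_rate <= high

# Hypothetical numbers: 5 safety violations observed in 50 real trials.
low, high = wilson_interval(5, 50)
matches = prediction_consistent(0.12, 5, 50)
mismatch = prediction_consistent(0.30, 5, 50)
```

With small real-robot trial counts the interval is wide, so a match is weak evidence; a prediction that falls outside the interval is the more informative outcome.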

Figures

Figures reproduced from arXiv: 2606.20698 by Chun-Kai Fan, Fangyuan Zhao, Fubing Yang, Jiajun Cao, Jian Tang, Jinchang Xu, Jixian Wu, Kai Tang, Kevin Zhang, Peidong Jia, Rui Ma, Shanghang Zhang, Sixiang Chen, Weishi Mi, Xiaowei Chi, Xiaozhu Ju, Yichen Guo, Zhong Chu.

Figure 1
Figure 1: Overview of SafeDojo. SafeDojo enhances VLA policies with world model based reward-cost evaluation and safe GRPO, boosting safe success and efficiency in simulation and real scenarios.
Figure 2
Figure 2: Detailed SafeDojo Pipeline. SafeDojo optimizes VLA policies entirely inside an interactive video world model by rolling out candidate action trajectories into imagined future dynamics. Task reward and safety cost are decoupled and optimized via Lagrangian-based constrained GRPO, improving task success while reducing safety risks without potentially damaging real-world rollouts. In implementation, we build …
Figure 3
Figure 3: Real-World Experiment Visualization. SafeDojo completes the real-world task safely, while baselines either fail the task, violate safety, or succeed only with unsafe contacts.
[Plot residue from an adjacent chart: η sensitivity, with η on the horizontal axis from 0.1 to 0.6 (0.2 marked "Ours") and Metric Value on the vertical axis; series: Task Success Rate (TSR) ↑, Safe Success Rate (SSR) ↑, Execute …]
Figure 4
Figure 4: Ablation studies on SafeLIBERO Level I Spatial Task 0. (a) Component ablation: removing …
Figure 5
Figure 5: Representative SafeDojo Real-World Demos. Representative snapshots from SafeDojo executions on five real-world tasks.

… constraint, and dual-arm coordination under obstacle interference. A trial is considered task-successful if the final object configuration satisfies the language instruction, and safe-successful only if the task is completed without contacting the obstacle. This mirrors the SafeLIBERO evalu…

Original abstract

Safe control is a prerequisite for real-world embodied intelligence, for which safe reinforcement learning has emerged as a promising paradigm. However, existing safe reinforcement learning methods either require costly real-world exploration or depend on hand-crafted safety functions. Neither scales to vision-language-action models deployed in open-world physical environments. We propose SafeDojo, the first model-based safe reinforcement learning framework for vision-language-action policies designed to learn safe actions through world model-based imagination. Specifically, SafeDojo performs online reinforcement learning on top of an interactive video world model. The world model generates action-conditioned future predictions, from which a tailored ResNet success classifier estimates per-step task progress from imagined frames and a lightweight safety head predicts per-step safety costs from latent context together with the proposed action chunk, enabling simultaneous assessment of task execution and trajectory safety. The decoupled task-reward and safety-cost signals are balanced through a Lagrangian-based constrained GRPO objective, enabling coordinated improvement of task success and safety under explicit constraints. On SafeLIBERO, SafeDojo achieves the best aggregate task success, safe success, and execution efficiency among inference-time safety, model-free RL, and model-based RL baselines, with the best average safe-success rate on both levels and an 8.25 percentage-point improvement over the strongest baseline on Level I. Real-world Franka deployment further shows the best average task and safe-success rates across five tasks. Our results position world model-based safe reinforcement learning as a scalable and generalizable path toward safe embodied intelligence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes SafeDojo, the first model-based safe RL framework for vision-language-action (VLA) policies. It performs online RL atop an interactive video world model that generates action-conditioned future predictions; a ResNet success classifier estimates per-step task progress from imagined frames while a lightweight safety head predicts per-step safety costs from latent context plus the action chunk. These decoupled signals are balanced via a Lagrangian-constrained GRPO objective. On SafeLIBERO the method reports the best aggregate task success, safe success, and execution efficiency versus inference-time safety, model-free RL, and model-based RL baselines (including an 8.25 pp safe-success gain on Level I), with additional best-in-class results on five real-world Franka tasks.

Significance. If the empirical claims hold after addressing the validation gap, the work would represent a meaningful step toward scalable safe embodied intelligence. It demonstrates how world-model imagination can replace hand-crafted safety functions and costly real-world exploration for VLAs, while the Lagrangian-GRPO formulation provides a principled way to trade off task and safety objectives. The real-world Franka results add practical weight.

major comments (1)
  1. [Abstract and Methods (world-model and classifier sections)] The headline performance claims (best aggregate task/safe success on SafeLIBERO Levels I/II and real-world Franka results) rest on the assumption that the interactive video world model produces faithful action-conditioned rollouts that can be reliably fed to the ResNet classifier and safety head. No quantitative world-model metrics (frame-level MSE, FVD, or classifier accuracy on imagined versus real frames) are reported anywhere in the manuscript, leaving the quality of the estimated rewards and costs unverified. This is load-bearing for the central claim that the observed gains reflect true policy improvement rather than artifacts of prediction error.
minor comments (2)
  1. [Methods] The description of the Lagrangian multiplier update schedule and the precise form of the GRPO objective would benefit from an explicit equation or pseudocode block to allow reproduction.
  2. [Experiments] Table captions and axis labels in the SafeLIBERO results should explicitly state the number of evaluation episodes and random seeds used for each method.
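Two of the checks named in the major comment, frame-level MSE and classifier agreement between real and imagined frames, can be sketched in a few lines. The sketch assumes frames are flattened lists of pixel intensities in [0, 1] and uses a toy brightness threshold in place of the ResNet success classifier; FVD is omitted because it requires a pretrained video feature network.

```python
def frame_mse(real, imagined):
    """Mean squared error between two flattened frames of equal length."""
    if len(real) != len(imagined):
        raise ValueError("frames must have the same number of pixels")
    return sum((r - i) ** 2 for r, i in zip(real, imagined)) / len(real)

def rollout_mse(real_frames, imagined_frames):
    """Per-step MSE along a paired real/imagined trajectory."""
    return [frame_mse(r, i) for r, i in zip(real_frames, imagined_frames)]

def label_agreement(classifier, real_frames, imagined_frames):
    """Fraction of steps where the classifier gives the same label."""
    pairs = list(zip(real_frames, imagined_frames))
    same = sum(classifier(r) == classifier(i) for r, i in pairs)
    return same / len(pairs)

# Toy stand-in for a success classifier: mean brightness above 0.5.
def toy_classifier(frame):
    return sum(frame) / len(frame) > 0.5

# Hypothetical three-step trajectory with two-pixel frames.
real = [[0.0, 0.2], [0.6, 0.8], [0.9, 1.0]]
imagined = [[0.1, 0.2], [0.4, 0.4], [0.9, 0.8]]
per_step = rollout_mse(real, imagined)
agreement = label_agreement(toy_classifier, real, imagined)
```

Reporting the per-step curve rather than a single average shows whether prediction error grows with rollout horizon, which is where imagined rewards and costs are most likely to drift.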

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the importance of validating the interactive world model's prediction quality, which is central to our claims. We address this point below and will incorporate the requested metrics in the revision.

  1. Referee: [Abstract and Methods (world-model and classifier sections)] The headline performance claims (best aggregate task/safe success on SafeLIBERO Levels I/II and real-world Franka results) rest on the assumption that the interactive video world model produces faithful action-conditioned rollouts that can be reliably fed to the ResNet classifier and safety head. No quantitative world-model metrics (frame-level MSE, FVD, or classifier accuracy on imagined versus real frames) are reported anywhere in the manuscript, leaving the quality of the estimated rewards and costs unverified. This is load-bearing for the central claim that the observed gains reflect true policy improvement rather than artifacts of prediction error.

    Authors: We agree that quantitative validation of the world model is essential to substantiate that the observed gains arise from policy improvement rather than prediction artifacts. The current manuscript focuses on end-to-end task and safety metrics but omits direct evaluation of rollout fidelity. In the revised version, we will add a dedicated subsection in Methods (and corresponding results) reporting frame-level MSE, Fréchet Video Distance (FVD), and per-frame classifier accuracy of the ResNet success head on both real and imagined frames across SafeLIBERO levels. These metrics will be computed on held-out trajectories to confirm that the imagined rollouts remain sufficiently accurate for reward and cost estimation. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external benchmarks.

full rationale

The paper describes a model-based safe RL framework (world model + ResNet classifier + safety head + Lagrangian GRPO) and reports aggregate performance metrics on SafeLIBERO Levels I/II plus real-world Franka tasks. All load-bearing claims are comparative results against listed baselines rather than any derivation that reduces to fitted parameters or self-citations by construction. No equations or sections exhibit self-definitional loops, fitted inputs renamed as predictions, or uniqueness theorems imported from the same authors. The method is self-contained against the reported external evaluations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on abstract only; key components like the world model, classifiers, and GRPO objective are introduced but no explicit free parameters, axioms, or invented entities are detailed. The framework relies on the assumption of accurate world model predictions.

pith-pipeline@v0.9.1-grok · 5861 in / 1075 out tokens · 67296 ms · 2026-06-27T04:10:31.573994+00:00 · methodology

