pith. sign in

arxiv: 2606.27079 · v1 · pith:CZXMLXYBnew · submitted 2026-06-25 · 💻 cs.RO

ForesightSafety-VLA: A Unified Diagnostic Safety Benchmark for Vision-Language-Action Models

Pith reviewed 2026-06-26 05:07 UTC · model grok-4.3

classification 💻 cs.RO
keywords vision-language-actionsafety benchmarkembodied safetyrobot safetyVLA modelsdiagnostic evaluationphysical interaction safetyperception safety
0
0 comments X

The pith

Vision-language-action models incur non-trivial safety costs and unsafe successes, with structure and visual variation degrading safety more than language changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ForesightSafety-VLA as a benchmark that treats safety as the main evaluation target for vision-language-action models rather than an afterthought. It organizes evaluation around a 13-category taxonomy spanning physical interaction, instruction safety, and perception safety, then applies controlled changes in scene structure, language commands, and visual observations. Measurements go beyond task success to track cumulative safety cost and risk exposure time, revealing that even leading policies produce unsafe outcomes and that visual and structural shifts drive larger failures than language shifts. This points to safety being inseparable from the model's core abilities in perception, grounding, and control.

Core claim

ForesightSafety-VLA shows that embodied safety in VLA models is tightly coupled to perception, grounding, and control competence rather than reducible to post-hoc safety filtering, as even the strongest evaluated policies produce non-trivial safety cost and unsafe nominal success, while structure and visual variation induce substantially stronger safety degradation than ordinary language variation.

What carries the argument

The 13-category safety taxonomy (Safe-Core, Safe-Lang, Safe-Vis) combined with process-level metrics of cumulative safety cost and risk exposure time, applied across three controlled variation dimensions on 66 scenarios in five embodiments.

If this is right

  • Safety evaluation for VLA models must track process-level risk metrics in addition to binary task success.
  • Diagnostic benchmarks should separate effects of scene structure, language, and visual changes to isolate failure sources.
  • Embodied safety improvements require advances in perception and control rather than relying solely on output filtering.
  • Benchmarks that decompose safe versus unsafe success and failure provide clearer signals than aggregate scores alone.
  • Variation in visual observations and scene structure should be prioritized in safety testing over language variation alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models that improve visual grounding accuracy may show measurable reductions in the observed safety degradation under visual variation.
  • Extending the benchmark to real-world robot deployments could test whether simulation-based safety costs predict physical-world outcomes.
  • The taxonomy could be applied to non-VLA robot policies to check whether the coupling of safety to perception holds more broadly.
  • Training regimes that explicitly penalize risk exposure time during learning might lower the safety costs reported here.

Load-bearing premise

The 66 safety-augmented scenarios instantiated in RoboTwin across five embodiments, together with the 13-category taxonomy, are representative enough to diagnose general safety limitations in VLA models.

What would settle it

A new VLA policy that records near-zero cumulative safety cost and zero unsafe nominal successes across all three variation dimensions on the 66 scenarios would directly challenge the reported non-trivial safety costs in the strongest policies.

Figures

Figures reproduced from arXiv: 2606.27079 by Feifei Zhao, Huangrui Li, Mingyang Lyu, Moquan Sha, Sicheng Shen, Yinqian Sun, Yiyang Jia, Yi Zeng.

Figure 1
Figure 1. Figure 1: Overview of ForesightSafety-VLA. Left: the safety taxonomy of ForesightSafety-VLA, covering Safe-Core physical safety together with Safe-Lang instruction safety and Safe-Vis perceptual safety. Right: the three diagnostic evaluation dimensions used in the benchmark. The top row shows structure/layout variation (L0–L2), the middle row shows language variation (W0–W4), and the bottom row shows visual variatio… view at source ↗
Figure 5
Figure 5. Figure 5 [PITH_FULL_IMAGE:figures/full_fig_p004_5.png] view at source ↗
Figure 2
Figure 2. Figure 2: Anticipatory safety comparison. Two trajectories may share the same task outcome (success without hard violation) while exhibiting substantially different anticipatory safety behavior. (a) Policy A reaches the goal by skimming the boundary and repeatedly entering the soft risk buffer, which leads to higher cumulative cost (CC) and higher risk exposure time (RET). (b) Policy B detours earlier in response to… view at source ↗
Figure 3
Figure 3. Figure 3: Global safety–success landscape across measured model runs. Each point denotes a model. The vertical axis reports safe success rate (SSR), the horizontal axis reports cumulative safety cost (CC), and bubble size encodes unsafe success rate (USR). The upper-left region is preferable, corresponding to higher safe success with lower accumulated risk. non-zero cumulative safety cost (CC in [0.18, 0.39]), non￾z… view at source ↗
Figure 4
Figure 4. Figure 4: Safety diagnosis under structure, language, and vision variation. Solid lines show safe success rate (SSR, left axis) and dashed lines show cumulative safety cost (CC, right axis) for representative models as severity increases. Structure/layout variation (a) and visual variation (c) induce steady degradation, while ordinary language variation (b, W0–W2) is comparatively mild; the shaded adversarial region… view at source ↗
Figure 5
Figure 5. Figure 5: Process-level risk dissection of an unsafe-success episode. Top: four keyframes from a lift_pot rollout in which the task is completed but the arm traverses a heat hazard zone. Bottom: time-aligned channel-wise safety scores for Force/Torque, Thermal/Energy, and Spatial Boundary. The boundary channel crosses the hard boundary near the middle of the episode, while the heat channel remains in the soft-buffer… view at source ↗
read the original abstract

In embodied intelligence, safety is a prerequisite for reliable robot deployment in the physical world. Current vision-language-action (VLA) models continue to advance toward general-purpose task capability, yet their embodied safety limits remain poorly understood. To address this gap, we introduce ForesightSafety-VLA, a diagnostic benchmark that makes safety the primary evaluation target for VLA systems. We define a 13-category safety taxonomy covering physical interaction safety (Safe-Core), instruction-side safety (Safe-Lang), and perception-side safety (Safe-Vis), and evaluate policies under three controlled dimensions of variation -- scene structure, language command, and visual observation -- so that failure sources can be diagnosed rather than hidden in a single aggregate score. Beyond binary task success, ForesightSafety-VLA measures process-level risk through cumulative safety cost (CC) and risk exposure time (RET), together with a four-quadrant decomposition of safe/unsafe success and failure. We instantiate 66 safety-augmented base scenarios in RoboTwin across 5 embodiments and report results on representative VLA baselines. Across the evaluated baselines, even the strongest policy incurs non-trivial safety cost and unsafe nominal success, while structure and visual variation induce substantially stronger safety degradation than ordinary language variation. These results suggest that embodied safety is tightly coupled to perception, grounding, and control competence rather than being reducible to post-hoc safety filtering alone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces ForesightSafety-VLA, a diagnostic benchmark for safety in vision-language-action (VLA) models. It defines a 13-category taxonomy spanning Safe-Core (physical interaction), Safe-Lang (instruction-side), and Safe-Vis (perception-side) safety. The benchmark evaluates policies on 66 safety-augmented scenarios instantiated in RoboTwin across 5 embodiments, under controlled variations in scene structure, language, and visual observation. It reports process-level metrics including cumulative safety cost (CC), risk exposure time (RET), and a four-quadrant safe/unsafe success/failure decomposition. Results on representative baselines indicate that even the strongest policy shows non-trivial safety cost and unsafe nominal success, with structure and visual variation causing substantially stronger degradation than language variation, implying embodied safety is coupled to perception, grounding, and control rather than reducible to post-hoc filtering.

Significance. If the benchmark construction and observed patterns hold under broader validation, the work supplies a needed diagnostic framework that shifts safety evaluation from aggregate success to source-specific risk measurement. The controlled variation axes and multi-metric reporting (CC, RET, quadrant decomposition) enable targeted diagnosis of failure modes in VLA systems, which is directly relevant to reliable physical deployment. The explicit separation of Safe-Core/Lang/Vis categories and the finding that perception/control competence dominates safety outcomes provide concrete guidance for model design beyond filtering approaches.

major comments (2)
  1. [Benchmark construction and scenario selection] The instantiation of the 66 safety-augmented base scenarios (described in the benchmark construction section) provides no coverage argument, diversity metric, or sensitivity analysis demonstrating that the chosen scenarios and 5 embodiments are representative of general VLA safety limits. This is load-bearing for the central claim that structure and visual variation induce substantially stronger degradation than language variation, because the differential patterns could be artifacts of simulator-specific couplings or under-sampling of perception-grounding-control interactions.
  2. [Evaluation and results reporting] The results reporting (including the abstract and evaluation sections) states outcomes on baselines but supplies no details on exact evaluation protocols, number of trials per scenario, statistical significance testing, or baseline selection/implementation criteria. Without these, the support for the reported degradation patterns and the interpretive claim that safety is not reducible to post-hoc filtering cannot be verified.
minor comments (2)
  1. [Abstract] The abstract refers to 'representative VLA baselines' without naming the models or their selection rationale; adding this would improve immediate readability.
  2. [Metrics definition] Notation for the four-quadrant decomposition and the definitions of CC and RET could be introduced with a small table or explicit equations in the metrics section to aid quick reference.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on ForesightSafety-VLA. The two major comments highlight areas where additional justification and protocol details will improve clarity and verifiability. We will revise the manuscript accordingly to strengthen the presentation of benchmark construction and evaluation procedures while preserving the core diagnostic contributions.

read point-by-point responses
  1. Referee: [Benchmark construction and scenario selection] The instantiation of the 66 safety-augmented base scenarios (described in the benchmark construction section) provides no coverage argument, diversity metric, or sensitivity analysis demonstrating that the chosen scenarios and 5 embodiments are representative of general VLA safety limits. This is load-bearing for the central claim that structure and visual variation induce substantially stronger degradation than language variation, because the differential patterns could be artifacts of simulator-specific couplings or under-sampling of perception-grounding-control interactions.

    Authors: We agree that an explicit coverage argument and sensitivity analysis are currently absent and would strengthen the manuscript. The 66 scenarios were derived by mapping each of the 13 taxonomy categories to concrete RoboTwin instantiations across the five embodiments, with at least one base scenario per category and controlled axis variations. In revision we will add a dedicated subsection with a coverage table, selection rationale (prioritizing common physical, linguistic, and perceptual failure modes reported in prior VLA literature), and a sensitivity study that subsamples scenarios and re-computes the structure-vs-language and vision-vs-language degradation gaps. This will directly support the reported patterns. We note that claims of full representativeness across all possible real-world VLA deployments would require multi-simulator validation, which we position as future work rather than a current claim. revision: yes

  2. Referee: [Evaluation and results reporting] The results reporting (including the abstract and evaluation sections) states outcomes on baselines but supplies no details on exact evaluation protocols, number of trials per scenario, statistical significance testing, or baseline selection/implementation criteria. Without these, the support for the reported degradation patterns and the interpretive claim that safety is not reducible to post-hoc filtering cannot be verified.

    Authors: We acknowledge that the current manuscript omits these protocol details. Evaluations used five independent rollouts per scenario (with fixed random seeds for reproducibility across baselines), and statistical significance of metric differences (CC, RET) was assessed via paired t-tests (p < 0.05). Baselines were chosen as the leading publicly released VLA models with available checkpoints and code at submission time; implementation followed the original papers' recommended inference settings. In the revision we will expand the evaluation section with a complete protocol description, trial counts, significance tables, and explicit baseline selection criteria. These additions will allow direct verification of the degradation patterns and the interpretation that safety outcomes depend on perception-grounding-control competence. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical benchmark with direct reporting.

full rationale

The paper defines a 13-category taxonomy, instantiates 66 scenarios in RoboTwin, and reports measured safety costs, risk exposure, and degradation patterns under controlled variations. No equations, fitted parameters, predictions, or self-citation chains appear in the provided text. All load-bearing claims rest on explicit benchmark construction and observed outcomes rather than reducing to inputs by definition or prior self-work. This is the standard non-circular case for benchmark papers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The benchmark rests on the domain assumption that the chosen taxonomy and simulator scenarios capture the relevant safety space; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption The 13-category safety taxonomy (Safe-Core, Safe-Lang, Safe-Vis) covers the primary safety concerns for VLA models in embodied settings.
    Invoked as the foundation for all evaluation categories and scenario design.

pith-pipeline@v0.9.1-grok · 5807 in / 1293 out tokens · 27841 ms · 2026-06-26T05:07:42.830815+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

43 extracted references · 17 linked inside Pith

  1. [1]

    Rt-1: Robotics transformer for real-world control at scale,

    A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsuet al., “Rt-1: Robotics transformer for real-world control at scale,”arXiv preprint arXiv:2212.06817, 2022

  2. [2]

    Rt-2: Vision-language-action models transfer web knowledge to robotic control,

    B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahidet al., “Rt-2: Vision-language-action models transfer web knowledge to robotic control,” inConference on Robot Learning. PMLR, 2023, pp. 2165–2183

  3. [3]

    Octo: An open-source generalist robot policy,

    O. M. Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, T. Kreiman, C. Xuet al., “Octo: An open-source generalist robot policy,”arXiv preprint arXiv:2405.12213, 2024

  4. [4]

    Openvla: An open- source vision-language-action model,

    M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketiet al., “Openvla: An open- source vision-language-action model,”arXiv preprint arXiv:2406.09246, 2024

  5. [5]

    Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collab- oration 0,

    A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jainet al., “Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collab- oration 0,” in2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 6892–6903

  6. [6]

    Mobile aloha: Learning bimanual mobile manipulation with low-cost whole-body teleoperation,

    Z. Fu, T. Z. Zhao, and C. Finn, “Mobile aloha: Learning bimanual mobile manipulation with low-cost whole-body teleoperation,”arXiv preprint arXiv:2401.02117, 2024

  7. [7]

    π0: A vision-language-action flow model for general robot control,

    K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichteret al., “ π0: A vision-language-action flow model for general robot control,”arXiv preprint arXiv:2410.24164, 2024

  8. [8]

    π0. 5: A vision-language- action model with open-world generalization. arxiv 2025,

    P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusaiet al., “ π0. 5: A vision-language- action model with open-world generalization. arxiv 2025,”arXiv preprint arXiv:2504.16054, 2025

  9. [9]

    Rlbench: The robot learning benchmark and learning environment. ieee robotics and automation letters 5, 2 (2020), 3019–3026,

    S. James, Z. Ma, D. R. Arrojo, and A. J. Davison, “Rlbench: The robot learning benchmark and learning environment. ieee robotics and automation letters 5, 2 (2020), 3019–3026,” 2020

  10. [10]

    Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks,

    O. Mees, L. Hermann, E. Rosete-Beas, and W. Burgard, “Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks,”IEEE Robotics and Automation Letters, vol. 7, no. 3, pp. 7327–7334, 2022

  11. [11]

    Maniskill2: A unified benchmark for generalizable manipulation skills,

    J. Gu, F. Xiang, X. Li, Z. Ling, X. Liu, T. Mu, Y . Tang, S. Tao, X. Wei, Y . Yaoet al., “Maniskill2: A unified benchmark for generalizable manipulation skills,”arXiv preprint arXiv:2302.04659, 2023

  12. [12]

    Robotwin: Dual-arm robot benchmark with generative digital twins,

    Y . Mu, T. Chen, Z. Chen, S. Peng, Z. Lan, Z. Gao, Z. Liang, Q. Yu, Y . Zou, M. Xuet al., “Robotwin: Dual-arm robot benchmark with generative digital twins,” inProceedings of the computer vision and pattern recognition conference, 2025, pp. 27 649–27 660

  13. [13]

    Evaluating real-world robot manipulation policies in simulation,

    X. Li, K. Hsu, J. Gu, K. Pertsch, O. Mees, H. R. Walke, C. Fu, I. Lunawat, I. Sieh, S. Kirmaniet al., “Evaluating real-world robot manipulation policies in simulation,”arXiv preprint arXiv:2405.05941, 2024

  14. [14]

    Vla-arena: An open-source framework for benchmarking vision-language-action models,

    B. Zhang, J. Li, J. Shen, Y . Cai, Y . Zhang, Y . Chen, J. Dai, J. Ji, and Y . Yang, “Vla-arena: An open-source framework for benchmarking vision-language-action models,”arXiv preprint arXiv:2512.22539, 2025

  15. [15]

    Generative image as action models,

    M. Shridhar, Y . L. Lo, and S. James, “Generative image as action models,” arXiv preprint arXiv:2407.07875, 2024

  16. [16]

    Ro- bustnav: Towards benchmarking robustness in embodied navigation,

    P. Chattopadhyay, J. Hoffman, R. Mottaghi, and A. Kembhavi, “Ro- bustnav: Towards benchmarking robustness in embodied navigation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 15 691–15 700

  17. [17]

    Adversarial patch,

    T. B. Brown, D. Man ´e, A. Roy, M. Abadi, and J. Gilmer, “Adversarial patch,”arXiv preprint arXiv:1712.09665, 2017

  18. [18]

    Prompt injection attack against llm-integrated applications,

    Y . Liu, G. Deng, Y . Li, K. Wang, Z. Wang, X. Wang, T. Zhang, Y . Liu, H. Wang, Y . Zhenget al., “Prompt injection attack against llm-integrated applications,”arXiv preprint arXiv:2306.05499, 2023

  19. [19]

    Safebench: A benchmarking platform for safety evaluation of autonomous vehicles,

    C. Xu, W. Ding, W. Lyu, Z. Liu, S. Wang, Y . He, H. Hu, D. Zhao, and B. Li, “Safebench: A benchmarking platform for safety evaluation of autonomous vehicles,”Advances in Neural Information Processing Systems, vol. 35, pp. 25 667–25 682, 2022

  20. [20]

    Benchmarking safe exploration in deep reinforcement learning,

    A. Ray, J. Achiam, and D. Amodei, “Benchmarking safe exploration in deep reinforcement learning,”arXiv preprint arXiv:1910.01708, vol. 7, no. 1, p. 2, 2019

  21. [21]

    Altman,Constrained Markov decision processes

    E. Altman,Constrained Markov decision processes. Routledge, 2021

  22. [22]

    Safety gymnasium: A unified safe reinforcement learning benchmark,

    J. Ji, B. Zhang, J. Zhou, X. Pan, W. Huang, R. Sun, Y . Geng, Y . Zhong, J. Dai, and Y . Yang, “Safety gymnasium: A unified safe reinforcement learning benchmark,”Advances in Neural Information Processing Systems, vol. 36, pp. 18 964–18 993, 2023

  23. [23]

    Safe-control-gym: A unified benchmark suite for safe learning-based control and reinforcement learning in robotics,

    Z. Yuan, A. W. Hall, S. Zhou, L. Brunke, M. Greeff, J. Panerati, and A. P. Schoellig, “Safe-control-gym: A unified benchmark suite for safe learning-based control and reinforcement learning in robotics,”IEEE Robotics and Automation Letters, vol. 7, no. 4, pp. 11 142–11 149, 2022

  24. [24]

    Concrete problems in ai safety,

    D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Man ´e, “Concrete problems in ai safety,”arXiv preprint arXiv:1606.06565, 2016

  25. [25]

    Libero: Benchmarking knowledge transfer for lifelong robot learning,

    B. Liu, Y . Zhu, C. Gao, Y . Feng, Q. Liu, Y . Zhu, and P. Stone, “Libero: Benchmarking knowledge transfer for lifelong robot learning,”Advances in Neural Information Processing Systems, vol. 36, pp. 44 776–44 791, 2023

  26. [26]

    Control barrier function based quadratic programs for safety critical systems,

    A. D. Ames, X. Xu, J. W. Grizzle, and P. Tabuada, “Control barrier function based quadratic programs for safety critical systems,”IEEE Transactions on Automatic Control, vol. 62, no. 8, pp. 3861–3876, 2016

  27. [27]

    A comprehensive survey on safe reinforce- ment learning,

    J. Garcıa and F. Fern ´andez, “A comprehensive survey on safe reinforce- ment learning,”Journal of Machine Learning Research, vol. 16, no. 1, pp. 1437–1480, 2015

  28. [28]

    Safe learning in robotics: From learning-based control to safe reinforcement learning,

    L. Brunke, M. Greeff, A. W. Hall, Z. Yuan, S. Zhou, J. Panerati, and A. P. Schoellig, “Safe learning in robotics: From learning-based control to safe reinforcement learning,”Annual Review of Control, Robotics, and Autonomous Systems, vol. 5, no. 1, pp. 411–444, 2022

  29. [29]

    A survey of methods for safe human-robot interaction,

    P. A. Lasota, T. Fong, and J. A. Shah, “A survey of methods for safe human-robot interaction,”Foundations and Trends® in Robotics, vol. 5, no. 4, pp. 261–349, 2017

  30. [30]

    Safety of embodied navigation: A survey,

    Z. Wang, J. Hu, and R. Mu, “Safety of embodied navigation: A survey,” arXiv preprint arXiv:2508.05855, 2025

  31. [31]

    Generating robot constitutions & benchmarks for semantic safety,

    P. Sermanet, A. Majumdar, A. Irpan, D. Kalashnikov, and V . Sindhwani, “Generating robot constitutions & benchmarks for semantic safety,”arXiv preprint arXiv:2503.08663, 2025

  32. [32]

    Ensuring force safety in vision- guided robotic manipulation via implicit tactile calibration,

    L. Wei, J. Ma, Y . Hu, and R. Zhang, “Ensuring force safety in vision- guided robotic manipulation via implicit tactile calibration,”arXiv preprint arXiv:2412.10349, 2024

  33. [33]

    Towards safe robot foundation models,

    T. Gruner, D. Palenicek, P. Liu, J. Watson, D. Tateo, J. Peterset al., “Towards safe robot foundation models,”arXiv e-prints, pp. arXiv–2503, 2025

  34. [34]

    Sapien: A simulated part-based interactive environment,

    F. Xiang, Y . Qin, K. Mo, Y . Xia, H. Zhu, F. Liu, M. Liu, H. Jiang, Y . Yuan, H. Wang, L. Yi, A. X. Chang, L. J. Guibas, and H. Su, “Sapien: A simulated part-based interactive environment,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2020, pp. 11 097–11 107. [Online]. Available: https://openaccess.the...

  35. [35]

    Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation,

    T. Chen, Z. Chen, B. Chen, Z. Cai, Y . Liu, Z. Li, Q. Liang, X. Lin, Y . Ge, Z. Gu, W. Deng, Y . Guo, T. Nian, X. Xie, Q. Chen, K. Su, T. Xu, G. Liu, M. Hu, H. ang Gao, K. Wang, Z. Liang, Y . Qin, X. Yang, P. Luo, and Y . Mu, “Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation,”

  36. [36]

    Available: https://arxiv.org/abs/2506.18088

    [Online]. Available: https://arxiv.org/abs/2506.18088

  37. [37]

    Rdt-1b: a diffusion foundation model for bimanual manipulation,

    S. Liu, L. Wu, B. Li, H. Tan, H. Chen, Z. Wang, K. Xu, H. Su, and J. Zhu, “Rdt-1b: a diffusion foundation model for bimanual manipulation,” arXiv preprint arXiv:2410.07864, 2024

  38. [38]

    Diffusion policy: Visuomotor policy learning via action diffusion,

    C. Chi, Z. Xu, S. Feng, E. Cousineau, Y . Du, B. Burchfiel, R. Tedrake, and S. Song, “Diffusion policy: Visuomotor policy learning via action diffusion,”The International Journal of Robotics Research, vol. 44, no. 10-11, pp. 1684–1704, 2025

  39. [39]

    Learning fine- grained bimanual manipulation with low-cost hardware,

    T. Z. Zhao, V . Kumar, S. Levine, and C. Finn, “Learning fine- grained bimanual manipulation with low-cost hardware,”arXiv preprint arXiv:2304.13705, 2023

  40. [40]

    Dexvla: Vision- language model with plug-in diffusion expert for general robot control,

    J. Wen, Y . Zhu, J. Li, Z. Tang, C. Shen, and F. Feng, “Dexvla: Vision- language model with plug-in diffusion expert for general robot control,” arXiv preprint arXiv:2502.05855, 2025

  41. [41]

    Visual instruction tuning,

    H. Liu, C. Li, Q. Wu, and Y . J. Lee, “Visual instruction tuning,”Advances in neural information processing systems, vol. 36, pp. 34 892–34 916, 2023

  42. [42]

    Tinyvla: Towards fast, data-efficient vision-language- action models for robotic manipulation,

    J. Wen, Y . Zhu, J. Li, M. Zhu, Z. Tang, K. Wu, Z. Xu, N. Liu, R. Cheng, C. Shenet al., “Tinyvla: Towards fast, data-efficient vision-language- action models for robotic manipulation,”IEEE Robotics and Automation Letters, 2025

  43. [43]

    3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations,

    Y . Ze, G. Zhang, K. Zhang, C. Hu, M. Wang, and H. Xu, “3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations,”arXiv preprint arXiv:2403.03954, 2024