
arxiv: 2606.27079 · v2 · pith:CZXMLXYBnew · submitted 2026-06-25 · 💻 cs.RO

ForesightSafety-VLA: A Unified Diagnostic Safety Benchmark for Vision-Language-Action Models

Pith reviewed 2026-06-30 09:49 UTC · model grok-4.3

classification 💻 cs.RO
keywords vision-language-action models · safety benchmark · embodied AI · robot safety · diagnostic evaluation · physical interaction safety · VLA policies · risk metrics

The pith

Even the strongest vision-language-action policies incur non-trivial safety costs and unsafe successes, with scene structure and visual variation causing more degradation than language changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ForesightSafety-VLA as a benchmark that treats safety as the primary evaluation target for vision-language-action models rather than an add-on. It creates a 13-category taxonomy across physical interaction, instruction, and perception safety, then measures performance under controlled changes to scene structure, language commands, and visual observations. Metrics include cumulative safety cost and risk exposure time plus a breakdown into safe versus unsafe successes and failures. Tests on representative baselines across 66 scenarios in five robot embodiments show that safety problems persist even in top policies and worsen more from structural or visual shifts than from language shifts. This setup allows specific diagnosis of whether failures arise from perception, grounding, or control instead of hiding them in overall success rates.

Core claim

ForesightSafety-VLA instantiates 66 safety-augmented base scenarios in RoboTwin across five embodiments and evaluates VLA baselines under three variation dimensions. Even the strongest policy shows non-trivial safety cost and unsafe nominal success. Structure and visual variation produce substantially stronger safety degradation than ordinary language variation, indicating that embodied safety is coupled to perception, grounding, and control competence rather than addressable by post-hoc filtering alone.

What carries the argument

The 13-category safety taxonomy (Safe-Core for physical interactions, Safe-Lang for instructions, Safe-Vis for perception) paired with controlled variations in scene structure, language command, and visual observation, plus metrics of cumulative safety cost (CC), risk exposure time (RET), and four-quadrant decomposition of safe/unsafe success and failure.
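The exact formulas for CC and RET are not reproduced on this page, so the following is only a minimal sketch under assumed definitions: CC as the sum of non-negative per-step safety costs, RET as the time spent with a positive cost, and an episode counted as unsafe if it triggers any hard violation. The names `Episode`, `cumulative_cost`, `risk_exposure_time`, and `quadrant`, and the default control period, are illustrative and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Episode:
    step_costs: List[float]  # assumed: non-negative safety cost per control step
    hard_violation: bool     # assumed: True if any hard safety limit was crossed
    task_success: bool       # nominal task outcome
    dt: float = 0.05         # assumed control period in seconds


def cumulative_cost(ep: Episode) -> float:
    """Assumed CC: total safety cost accumulated over the episode."""
    return sum(ep.step_costs)


def risk_exposure_time(ep: Episode) -> float:
    """Assumed RET: seconds spent in steps with a positive safety cost."""
    return sum(1 for c in ep.step_costs if c > 0) * ep.dt


def quadrant(ep: Episode) -> str:
    """Place the episode in one of the four safe/unsafe x success/failure cells."""
    safety = "unsafe" if ep.hard_violation else "safe"
    outcome = "success" if ep.task_success else "failure"
    return f"{safety}_{outcome}"
```

Under these assumptions, an episode that completes its task after crossing a hard limit lands in `unsafe_success`, which is the case a binary success rate would hide.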

If this is right

  • Safety in VLA systems requires integration into perception, grounding, and control rather than reliance on separate filtering steps.
  • Diagnostic evaluation across multiple variation dimensions is required to isolate whether failures originate in scene structure, visuals, or commands.
  • Binary task success alone is insufficient; process-level measures such as cumulative safety cost and risk exposure time must be reported.
  • Stronger degradation from structure and visual changes implies that improvements in visual grounding will have larger safety impact than language-only adjustments.
  • Claims about VLA safety limits depend on testing across multiple embodiments and controlled scenario variations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training or fine-tuning VLA models directly on the benchmark's safety metrics could produce policies with lower risk exposure in unstructured settings.
  • Extending the scenarios beyond simulation to physical robot trials would test whether the observed safety patterns hold outside RoboTwin.
  • The greater effect of visual and structural variation suggests that safety research in embodied AI should prioritize perception robustness over command understanding.
  • Common failure patterns identified here could inform unified safety standards across different robot platforms and task domains.

Load-bearing premise

The 66 safety-augmented scenarios across five embodiments in RoboTwin are representative enough of real-world physical interaction risks to support general claims about VLA safety limits.

What would settle it

A VLA policy achieving zero cumulative safety cost and exclusively safe successes across all structure, language, and visual variations on the 66 scenarios would falsify the claim of persistent non-trivial safety issues.
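As a rough illustration of that criterion, the check below assumes results are available as one record per evaluated condition (scenario and variation level), with hypothetical fields `cc`, `safe_success`, and `episodes`. The record layout and the function name are assumptions made for this sketch, not the benchmark's actual output format.

```python
from typing import Iterable


def settles_against_claim(results: Iterable[dict]) -> bool:
    """Return True only if every condition shows zero cumulative safety cost
    and every episode in it is a safe success (no unsafe successes, no failures)."""
    rows = list(results)
    if not rows:
        return False  # no evidence is not a falsification
    return all(
        r["cc"] == 0.0 and r["safe_success"] == r["episodes"]
        for r in rows
    )


# Hypothetical records, for illustration only (not measured data).
perfect = [
    {"condition": "L0", "cc": 0.0, "safe_success": 10, "episodes": 10},
    {"condition": "W2", "cc": 0.0, "safe_success": 10, "episodes": 10},
]
flawed = perfect + [
    {"condition": "L2", "cc": 0.2, "safe_success": 7, "episodes": 10},
]
print(settles_against_claim(perfect), settles_against_claim(flawed))
```

A single condition with non-zero cost or a non-safe outcome is enough to keep the paper's claim standing under this reading.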

Figures

Figures reproduced from arXiv: 2606.27079 by Feifei Zhao, Huangrui Li, Mingyang Lyu, Moquan Sha, Sicheng Shen, Yinqian Sun, Yiyang Jia, Yi Zeng.

Figure 1: Overview of ForesightSafety-VLA. Left: the safety taxonomy of ForesightSafety-VLA, covering Safe-Core physical safety together with Safe-Lang instruction safety and Safe-Vis perceptual safety. Right: the three diagnostic evaluation dimensions used in the benchmark. The top row shows structure/layout variation (L0–L2), the middle row shows language variation (W0–W4), and the bottom row shows visual variatio…
Figure 2: Anticipatory safety comparison. Two trajectories may share the same task outcome (success without hard violation) while exhibiting substantially different anticipatory safety behavior. (a) Policy A reaches the goal by skimming the boundary and repeatedly entering the soft risk buffer, which leads to higher cumulative cost (CC) and higher risk exposure time (RET). (b) Policy B detours earlier in response to…
Figure 3: Global safety–success landscape across measured model runs. Each point denotes a model. The vertical axis reports safe success rate (SSR), the horizontal axis reports cumulative safety cost (CC), and bubble size encodes unsafe success rate (USR). The upper-left region is preferable, corresponding to higher safe success with lower accumulated risk. … non-zero cumulative safety cost (CC in [0.18, 0.39]), non-z…
Figure 4: Safety diagnosis under structure, language, and vision variation. Solid lines show safe success rate (SSR, left axis) and dashed lines show cumulative safety cost (CC, right axis) for representative models as severity increases. Structure/layout variation (a) and visual variation (c) induce steady degradation, while ordinary language variation (b, W0–W2) is comparatively mild; the shaded adversarial region…
Figure 5: Process-level risk dissection of an unsafe-success episode. Top: four keyframes from a lift_pot rollout in which the task is completed but the arm traverses a heat hazard zone. Bottom: time-aligned channel-wise safety scores for Force/Torque, Thermal/Energy, and Spatial Boundary. The boundary channel crosses the hard boundary near the middle of the episode, while the heat channel remains in the soft-buffer…
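The captions for Figures 2 and 5 distinguish a soft risk buffer from a hard boundary on per-channel safety scores. A minimal sketch of that three-zone reading follows, assuming that higher scores mean more risk and that each channel has one soft and one hard threshold. The threshold values, the traces, and the function names are hypothetical and are not taken from the paper.

```python
from typing import List, Optional


def zone(score: float, soft: float, hard: float) -> str:
    """Label one score as 'safe', 'soft' (inside the buffer), or 'hard' (violation)."""
    if score >= hard:
        return "hard"
    if score >= soft:
        return "soft"
    return "safe"


def zone_trace(scores: List[float], soft: float, hard: float) -> List[str]:
    """Label every time step of a single channel."""
    return [zone(s, soft, hard) for s in scores]


def first_hard_step(scores: List[float], soft: float, hard: float) -> Optional[int]:
    """Index of the first hard violation, or None if the channel never crosses it."""
    for i, label in enumerate(zone_trace(scores, soft, hard)):
        if label == "hard":
            return i
    return None


# Hypothetical traces: one channel crosses the hard limit, the other stays in the buffer.
boundary = [0.1, 0.4, 0.9, 0.5]
heat = [0.2, 0.5, 0.6, 0.3]
print(zone_trace(boundary, soft=0.4, hard=0.8))
print(zone_trace(heat, soft=0.4, hard=0.8))
```

Keeping the channels separate is what lets an episode be traced back to a specific hazard type rather than to a single aggregate score.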
Abstract

In embodied intelligence, safety is a prerequisite for reliable robot deployment in the physical world. Current vision-language-action (VLA) models continue to advance toward general-purpose task capability, yet their embodied safety limits remain poorly understood. To address this gap, we introduce ForesightSafety-VLA, a diagnostic benchmark that makes safety the primary evaluation target for VLA systems. We define a 13-category safety taxonomy covering physical interaction safety (Safe-Core), instruction-side safety (Safe-Lang), and perception-side safety (Safe-Vis), and evaluate policies under three controlled dimensions of variation -- scene structure, language command, and visual observation -- so that failure sources can be diagnosed rather than hidden in a single aggregate score. Beyond binary task success, ForesightSafety-VLA measures process-level risk through cumulative safety cost (CC) and risk exposure time (RET), together with a four-quadrant decomposition of safe/unsafe success and failure. We instantiate 66 safety-augmented base scenarios in RoboTwin across 5 embodiments and report results on representative VLA baselines. Across the evaluated baselines, even the strongest policy incurs non-trivial safety cost and unsafe nominal success, while structure and visual variation induce substantially stronger safety degradation than ordinary language variation. These results suggest that embodied safety is tightly coupled to perception, grounding, and control competence rather than being reducible to post-hoc safety filtering alone.
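To make the diagnostic idea concrete, the sketch below groups per-episode quadrant labels by variation level and converts them to rates, so that a safe success rate and an unsafe success rate can be compared across levels. The level labels `L0` and `L2` follow the Figure 1 caption; the records, the quadrant strings, and the function name are illustrative assumptions rather than the benchmark's data or API.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

QUADRANTS = ("safe_success", "unsafe_success", "safe_failure", "unsafe_failure")


def rates_by_level(records: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, float]]:
    """Map each variation level to the fraction of episodes in each quadrant."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for level, quad in records:
        if quad not in QUADRANTS:
            raise ValueError(f"unknown quadrant: {quad}")
        counts[level][quad] += 1
    rates = {}
    for level, c in counts.items():
        total = sum(c.values())
        rates[level] = {q: c[q] / total for q in QUADRANTS}
    return rates


# Hypothetical episode outcomes, for illustration only (not measured data).
records = [
    ("L0", "safe_success"), ("L0", "unsafe_success"),
    ("L2", "safe_success"), ("L2", "unsafe_success"),
    ("L2", "safe_failure"), ("L2", "unsafe_failure"),
]
rates = rates_by_level(records)
print(rates["L0"]["safe_success"], rates["L2"]["safe_success"])
```

Reporting all four rates per level, rather than one success number, is what allows a drop in safe success to be separated from a rise in unsafe success.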

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces ForesightSafety-VLA, a diagnostic benchmark for safety in vision-language-action (VLA) models. It defines a 13-category taxonomy spanning Safe-Core (physical interaction), Safe-Lang (instruction-side), and Safe-Vis (perception-side) safety, evaluates policies under controlled variations in scene structure, language commands, and visual observations, and employs process-level metrics including cumulative safety cost (CC), risk exposure time (RET), and a four-quadrant safe/unsafe success/failure decomposition. The benchmark is instantiated via 66 safety-augmented scenarios in RoboTwin across 5 embodiments; results on representative VLA baselines indicate non-trivial safety costs even for the strongest policies, with structure and visual variations causing substantially stronger degradation than language variation, suggesting safety is coupled to core perception/grounding/control competence rather than post-hoc filtering.

Significance. If the empirical results hold under broader validation, the work supplies a needed diagnostic framework that moves beyond aggregate success rates to isolate failure sources in embodied VLA systems. The controlled variation dimensions and multi-metric decomposition are strengths that could support reproducible safety auditing in the field.

major comments (2)
  1. [Abstract] The claim that 'structure and visual variation induce substantially stronger safety degradation than ordinary language variation' and the broader suggestion that 'embodied safety is tightly coupled to perception, grounding, and control competence' rest on results from the 66 RoboTwin scenarios; however, the manuscript provides no coverage analysis, external validation against real-world incident data, or ablation of omitted failure modes (e.g., long-horizon dynamics or contact-rich manipulation), which is load-bearing for the generalizability of the ordering of degradation sources.
  2. [Abstract and instantiation paragraph] The 13-category taxonomy is presented as comprehensively covering relevant failure modes, yet no validation procedure, inter-rater agreement, or mapping to documented physical interaction risks is described; this directly affects the diagnostic reliability of the Safe-Core/Safe-Lang/Safe-Vis partition and the four-quadrant decomposition.
minor comments (1)
  1. The abstract refers to 'representative VLA baselines' without naming the specific models or reporting their individual CC/RET values; adding these details (presumably in §4 or the results tables) would improve reproducibility.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on generalizability and taxonomy validation. We respond to each major comment below, clarifying the benchmark's scope as a controlled diagnostic tool in simulation while committing to targeted revisions for transparency.

  1. Referee: [Abstract] The claim that 'structure and visual variation induce substantially stronger safety degradation than ordinary language variation' and the broader suggestion that 'embodied safety is tightly coupled to perception, grounding, and control competence' rest on results from the 66 RoboTwin scenarios; however, the manuscript provides no coverage analysis, external validation against real-world incident data, or ablation of omitted failure modes (e.g., long-horizon dynamics or contact-rich manipulation), which is load-bearing for the generalizability of the ordering of degradation sources.

    Authors: The observed ordering of degradation sources is reported specifically for the 66 RoboTwin scenarios and the tested VLA baselines; the abstract frames this as a suggestive finding within the benchmark rather than a universal claim. We agree that coverage analysis, real-world incident mapping, and ablations of long-horizon or contact-rich modes are absent and would strengthen generalizability statements. We will add a limitations subsection explicitly discussing the simulation scope, the absence of external real-world validation, and potential omitted failure modes to prevent overgeneralization. revision: partial

  2. Referee: [Abstract and instantiation paragraph] The 13-category taxonomy is presented as comprehensively covering relevant failure modes, yet no validation procedure, inter-rater agreement, or mapping to documented physical interaction risks is described; this directly affects the diagnostic reliability of the Safe-Core/Safe-Lang/Safe-Vis partition and the four-quadrant decomposition.

    Authors: The taxonomy was derived by synthesizing categories from prior robotics safety literature and observed VLA failure patterns, with the Safe-Core/Safe-Lang/Safe-Vis partition intended to isolate distinct risk sources. We did not include a formal validation procedure or inter-rater study in the original manuscript. We will revise the instantiation section to add explicit references to source literature, example mappings to physical risks, and a brief rationale for the partition to improve transparency and diagnostic interpretability. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark with direct measurements on external baselines


The paper defines a 13-category safety taxonomy, instantiates 66 scenarios in RoboTwin, and reports process-level metrics (CC, RET, four-quadrant decomposition) on representative VLA baselines. All load-bearing claims are empirical observations from these evaluations rather than derivations, fitted predictions, or self-referential definitions. No equations, ansatzes, or uniqueness theorems are invoked; self-citations (if any) are not load-bearing for the central results. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central contribution rests on an author-defined safety taxonomy and scenario set with no external derivation or validation shown in the abstract; no free parameters or invented physical entities are introduced.

axioms (1)
  • [ad hoc to paper] The 13 safety categories (Safe-Core, Safe-Lang, Safe-Vis) comprehensively cover the relevant failure modes for embodied VLA systems.
    Defined by the authors as the primary evaluation target without reference to prior validation studies.

pith-pipeline@v0.9.1-grok · 5807 in / 1306 out tokens · 33092 ms · 2026-06-30T09:49:26.399198+00:00 · methodology

