ForesightSafety-VLA: A Unified Diagnostic Safety Benchmark for Vision-Language-Action Models

Feifei Zhao; Huangrui Li; Mingyang Lyu; Moquan Sha; Sicheng Shen; Yinqian Sun; Yiyang Jia; Yi Zeng

arxiv: 2606.27079 · v2 · pith:CZXMLXYBnew · submitted 2026-06-25 · 💻 cs.RO

Mingyang Lyu, Yinqian Sun, Yiyang Jia, Sicheng Shen, Moquan Sha, Huangrui Li, Feifei Zhao, Yi Zeng

Pith reviewed 2026-06-30 09:49 UTC · model grok-4.3

keywords: vision-language-action models, safety benchmark, embodied AI, robot safety, diagnostic evaluation, physical interaction safety, VLA policies, risk metrics

The pith

Even the strongest vision-language-action policies incur non-trivial safety costs and unsafe successes, with scene structure and visual variation causing more degradation than language changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ForesightSafety-VLA as a benchmark that treats safety as the primary evaluation target for vision-language-action models rather than an add-on. It creates a 13-category taxonomy across physical interaction, instruction, and perception safety, then measures performance under controlled changes to scene structure, language commands, and visual observations. Metrics include cumulative safety cost and risk exposure time plus a breakdown into safe versus unsafe successes and failures. Tests on representative baselines across 66 scenarios in five robot embodiments show that safety problems persist even in top policies and worsen more from structural or visual shifts than from language shifts. This setup allows specific diagnosis of whether failures arise from perception, grounding, or control instead of hiding them in overall success rates.

Core claim

ForesightSafety-VLA instantiates 66 safety-augmented base scenarios in RoboTwin across five embodiments and evaluates VLA baselines under three variation dimensions. Even the strongest policy shows non-trivial safety cost and unsafe nominal success. Structure and visual variation produce substantially stronger safety degradation than ordinary language variation, indicating that embodied safety is coupled to perception, grounding, and control competence rather than addressable by post-hoc filtering alone.

What carries the argument

The 13-category safety taxonomy (Safe-Core for physical interactions, Safe-Lang for instructions, Safe-Vis for perception) paired with controlled variations in scene structure, language command, and visual observation, plus metrics of cumulative safety cost (CC), risk exposure time (RET), and four-quadrant decomposition of safe/unsafe success and failure.
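
To make the two process-level metrics concrete, here is a minimal Python sketch. This page does not reproduce the paper's formal definitions, so the sketch assumes that CC is the sum of per-step safety costs over an episode and that RET is the time spent with a positive cost (for example, inside a soft risk buffer). The function names, the `dt` and `threshold` parameters, and the two cost traces are hypothetical.

```python
from typing import Sequence


def cumulative_cost(step_costs: Sequence[float]) -> float:
    """Assumed CC: sum of per-step safety costs over one episode."""
    return float(sum(step_costs))


def risk_exposure_time(
    step_costs: Sequence[float], dt: float = 1.0, threshold: float = 0.0
) -> float:
    """Assumed RET: total time during which the step cost exceeds `threshold`."""
    return dt * sum(1 for cost in step_costs if cost > threshold)


# Hypothetical per-step costs for two episodes that both finish the task
# without a hard violation.
policy_a = [0.0, 0.2, 0.3, 0.2, 0.0, 0.1]  # skims the risk buffer repeatedly
policy_b = [0.0, 0.0, 0.1, 0.0, 0.0, 0.0]  # detours early, brief exposure

cc_a, ret_a = cumulative_cost(policy_a), risk_exposure_time(policy_a)
cc_b, ret_b = cumulative_cost(policy_b), risk_exposure_time(policy_b)
```

Under these assumed definitions, both episodes would count as successes on a binary metric, yet the first accumulates more cost and more exposure time. That is the kind of contrast Figure 2 describes.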

If this is right

Safety in VLA systems requires integration into perception, grounding, and control rather than reliance on separate filtering steps.
Diagnostic evaluation across multiple variation dimensions is required to isolate whether failures originate in scene structure, visuals, or commands.
Binary task success alone is insufficient; process-level measures such as cumulative safety cost and risk exposure time must be reported.
Stronger degradation from structure and visual changes implies that improvements in visual grounding will have larger safety impact than language-only adjustments.
Claims about VLA safety limits depend on testing across multiple embodiments and controlled scenario variations.
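
The four-quadrant decomposition behind these points can be sketched in a few lines of Python. This is an illustration only: it assumes each episode is summarized by two booleans (task success, and whether any safety violation occurred), which may be simpler than the paper's actual criterion. The quadrant names and the sample episodes are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

QUADRANTS = ["safe_success", "unsafe_success", "safe_failure", "unsafe_failure"]


def quadrant(success: bool, violated: bool) -> str:
    """Map one episode to its quadrant label."""
    safety = "unsafe" if violated else "safe"
    outcome = "success" if success else "failure"
    return f"{safety}_{outcome}"


def quadrant_rates(episodes: Iterable[Tuple[bool, bool]]) -> Dict[str, float]:
    """Fraction of episodes in each quadrant; all zeros for an empty input."""
    episodes = list(episodes)
    counts = Counter(quadrant(success, violated) for success, violated in episodes)
    total = len(episodes)
    return {name: (counts[name] / total if total else 0.0) for name in QUADRANTS}


# Hypothetical episodes as (success, violated) pairs.
episodes = [(True, False), (True, True), (False, False), (True, False)]
rates = quadrant_rates(episodes)

# A binary success metric merges the two success quadrants.
nominal_success = rates["safe_success"] + rates["unsafe_success"]
```

In this toy sample the nominal success rate is 0.75, but only 0.5 of the episodes are safe successes. The gap is the unsafe-success share that an aggregate score would hide.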

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Training or fine-tuning VLA models directly on the benchmark's safety metrics could produce policies with lower risk exposure in unstructured settings.
Extending the scenarios beyond simulation to physical robot trials would test whether the observed safety patterns hold outside RoboTwin.
The greater effect of visual and structural variation suggests that safety research in embodied AI should prioritize perception robustness over command understanding.
Common failure patterns identified here could inform unified safety standards across different robot platforms and task domains.

Load-bearing premise

The 66 safety-augmented scenarios across five embodiments in RoboTwin are representative enough of real-world physical interaction risks to support general claims about VLA safety limits.

What would settle it

A VLA policy achieving zero cumulative safety cost and exclusively safe successes across all structure, language, and visual variations on the 66 scenarios would falsify the claim of persistent non-trivial safety issues.
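
As a rough illustration, the settling condition can be written as a predicate over episode results. The sketch below takes one reading of "exclusively safe successes", namely that every episode succeeds with no violation and zero cumulative cost. The `EpisodeResult` record, its field names, and the `tol` parameter are hypothetical and do not come from the paper.

```python
from typing import Iterable, NamedTuple


class EpisodeResult(NamedTuple):
    dimension: str          # hypothetical label: "structure", "language", or "visual"
    success: bool           # task completed
    violated: bool          # any safety violation during the episode
    cumulative_cost: float  # assumed per-episode CC


def is_counterexample(results: Iterable[EpisodeResult], tol: float = 0.0) -> bool:
    """True only if every episode is a safe success with CC <= tol."""
    results = list(results)
    if not results:
        return False  # no episodes means no evidence either way
    return all(
        r.success and not r.violated and r.cumulative_cost <= tol for r in results
    )


clean = [
    EpisodeResult("structure", True, False, 0.0),
    EpisodeResult("visual", True, False, 0.0),
]
risky = clean + [EpisodeResult("language", True, False, 0.2)]

clean_verdict = is_counterexample(clean)
risky_verdict = is_counterexample(risky)
```

A real check would also need to confirm that the results cover all 66 scenarios and every variation level, which this sketch does not attempt.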

Figures

Figures reproduced from arXiv: 2606.27079 by Feifei Zhao, Huangrui Li, Mingyang Lyu, Moquan Sha, Sicheng Shen, Yinqian Sun, Yiyang Jia, Yi Zeng.

**Figure 1.** Overview of ForesightSafety-VLA. Left: the safety taxonomy of ForesightSafety-VLA, covering Safe-Core physical safety together with Safe-Lang instruction safety and Safe-Vis perceptual safety. Right: the three diagnostic evaluation dimensions used in the benchmark. The top row shows structure/layout variation (L0–L2), the middle row shows language variation (W0–W4), and the bottom row shows visual variatio…

**Figure 2.** Anticipatory safety comparison. Two trajectories may share the same task outcome (success without hard violation) while exhibiting substantially different anticipatory safety behavior. (a) Policy A reaches the goal by skimming the boundary and repeatedly entering the soft risk buffer, which leads to higher cumulative cost (CC) and higher risk exposure time (RET). (b) Policy B detours earlier in response to…

**Figure 3.** Global safety–success landscape across measured model runs. Each point denotes a model. The vertical axis reports safe success rate (SSR), the horizontal axis reports cumulative safety cost (CC), and bubble size encodes unsafe success rate (USR). The upper-left region is preferable, corresponding to higher safe success with lower accumulated risk. non-zero cumulative safety cost (CC in [0.18, 0.39]), nonz…

**Figure 4.** Safety diagnosis under structure, language, and vision variation. Solid lines show safe success rate (SSR, left axis) and dashed lines show cumulative safety cost (CC, right axis) for representative models as severity increases. Structure/layout variation (a) and visual variation (c) induce steady degradation, while ordinary language variation (b, W0–W2) is comparatively mild; the shaded adversarial region…

**Figure 5.** Process-level risk dissection of an unsafe-success episode. Top: four keyframes from a lift_pot rollout in which the task is completed but the arm traverses a heat hazard zone. Bottom: time-aligned channel-wise safety scores for Force/Torque, Thermal/Energy, and Spatial Boundary. The boundary channel crosses the hard boundary near the middle of the episode, while the heat channel remains in the soft-buffer…

Original abstract

In embodied intelligence, safety is a prerequisite for reliable robot deployment in the physical world. Current vision-language-action (VLA) models continue to advance toward general-purpose task capability, yet their embodied safety limits remain poorly understood. To address this gap, we introduce ForesightSafety-VLA, a diagnostic benchmark that makes safety the primary evaluation target for VLA systems. We define a 13-category safety taxonomy covering physical interaction safety (Safe-Core), instruction-side safety (Safe-Lang), and perception-side safety (Safe-Vis), and evaluate policies under three controlled dimensions of variation -- scene structure, language command, and visual observation -- so that failure sources can be diagnosed rather than hidden in a single aggregate score. Beyond binary task success, ForesightSafety-VLA measures process-level risk through cumulative safety cost (CC) and risk exposure time (RET), together with a four-quadrant decomposition of safe/unsafe success and failure. We instantiate 66 safety-augmented base scenarios in RoboTwin across 5 embodiments and report results on representative VLA baselines. Across the evaluated baselines, even the strongest policy incurs non-trivial safety cost and unsafe nominal success, while structure and visual variation induce substantially stronger safety degradation than ordinary language variation. These results suggest that embodied safety is tightly coupled to perception, grounding, and control competence rather than being reducible to post-hoc safety filtering alone.
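
The comparison between variation dimensions can be illustrated with a small Python sketch. It assumes degradation is measured as the drop in safe success rate (SSR) between the mildest and the most severe level of a dimension, which is one plausible reading and not necessarily the paper's. The level labels L0–L2 and W0–W2 follow the figure captions on this page; the visual dimension is left out because its level labels are truncated here. All numbers are invented for illustration and are not results from the paper.

```python
from typing import Dict


def ssr_drop(ssr_by_level: Dict[str, float], baseline: str, worst: str) -> float:
    """Assumed degradation measure: SSR at the baseline level minus SSR at the worst level."""
    return ssr_by_level[baseline] - ssr_by_level[worst]


# Invented SSR values per severity level (not taken from the paper).
ssr = {
    "structure": {"L0": 0.60, "L1": 0.45, "L2": 0.30},
    "language": {"W0": 0.60, "W1": 0.58, "W2": 0.55},
}

drops = {
    "structure": ssr_drop(ssr["structure"], "L0", "L2"),
    "language": ssr_drop(ssr["language"], "W0", "W2"),
}

# Dimensions ordered from most to least degrading.
ranking = sorted(drops, key=drops.get, reverse=True)
```

The same comparison could be run on CC (where an increase, not a drop, signals degradation) to check whether both metrics agree on the ordering.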

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

New diagnostic taxonomy and metrics for VLA safety, but the 66 scenarios leave the main claims on shaky ground.

The paper introduces a 13-category taxonomy split across Safe-Core, Safe-Lang, and Safe-Vis, plus process metrics (cumulative safety cost and risk exposure time) and a four-quadrant success/failure breakdown. It tests these under controlled changes in scene structure, language, and visuals on 66 RoboTwin scenarios across five embodiments. That structured approach to spotting failure sources is the clearest step forward from standard aggregate success rates.

The taxonomy and metrics give a practical way to diagnose whether problems come from perception, grounding, or control rather than just post-hoc filtering. Reporting results on representative VLA baselines shows the idea in action and surfaces that even stronger policies still carry non-trivial safety cost.

The main weakness is the narrow base for the stronger claims. The finding that structure and visual variation hurt safety more than language variation, and the suggestion that embodied safety is tightly coupled to core competences, rests on those 66 scenarios. The abstract gives no coverage check against real incident data, no external validation of the categories, and no ablation on omitted failure modes like long-horizon contact-rich tasks. Without that, the general conclusions stay preliminary.

This is for groups building or evaluating VLA systems who want better diagnostic tools. A reader focused on benchmark design would find the taxonomy and metrics useful to adapt. The work shows clear thinking on what safety evaluation should track, so it deserves referee time even if the empirical section needs more grounding on scenario selection and statistical support.

Referee Report

2 major / 1 minor

Summary. The paper introduces ForesightSafety-VLA, a diagnostic benchmark for safety in vision-language-action (VLA) models. It defines a 13-category taxonomy spanning Safe-Core (physical interaction), Safe-Lang (instruction-side), and Safe-Vis (perception-side) safety, evaluates policies under controlled variations in scene structure, language commands, and visual observations, and employs process-level metrics including cumulative safety cost (CC), risk exposure time (RET), and a four-quadrant safe/unsafe success/failure decomposition. The benchmark is instantiated via 66 safety-augmented scenarios in RoboTwin across 5 embodiments; results on representative VLA baselines indicate non-trivial safety costs even for the strongest policies, with structure and visual variations causing substantially stronger degradation than language variation, suggesting safety is coupled to core perception/grounding/control competence rather than post-hoc filtering.

Significance. If the empirical results hold under broader validation, the work supplies a needed diagnostic framework that moves beyond aggregate success rates to isolate failure sources in embodied VLA systems. The controlled variation dimensions and multi-metric decomposition are strengths that could support reproducible safety auditing in the field.

major comments (2)

[Abstract] The claim that 'structure and visual variation induce substantially stronger safety degradation than ordinary language variation' and the broader suggestion that 'embodied safety is tightly coupled to perception, grounding, and control competence' rest on results from the 66 RoboTwin scenarios; however, the manuscript provides no coverage analysis, external validation against real-world incident data, or ablation of omitted failure modes (e.g., long-horizon dynamics or contact-rich manipulation), which is load-bearing for the generalizability of the ordering of degradation sources.

[Abstract and instantiation paragraph] The 13-category taxonomy is presented as comprehensively covering relevant failure modes, yet no validation procedure, inter-rater agreement, or mapping to documented physical interaction risks is described; this directly affects the diagnostic reliability of the Safe-Core/Safe-Lang/Safe-Vis partition and the four-quadrant decomposition.

minor comments (1)

The abstract refers to 'representative VLA baselines' without naming the specific models or reporting their individual CC/RET values; adding these details (presumably in §4 or the results tables) would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on generalizability and taxonomy validation. We respond to each major comment below, clarifying the benchmark's scope as a controlled diagnostic tool in simulation while committing to targeted revisions for transparency.

Referee: [Abstract] The claim that 'structure and visual variation induce substantially stronger safety degradation than ordinary language variation' and the broader suggestion that 'embodied safety is tightly coupled to perception, grounding, and control competence' rest on results from the 66 RoboTwin scenarios; however, the manuscript provides no coverage analysis, external validation against real-world incident data, or ablation of omitted failure modes (e.g., long-horizon dynamics or contact-rich manipulation), which is load-bearing for the generalizability of the ordering of degradation sources.

Authors: The observed ordering of degradation sources is reported specifically for the 66 RoboTwin scenarios and the tested VLA baselines; the abstract frames this as a suggestive finding within the benchmark rather than a universal claim. We agree that coverage analysis, real-world incident mapping, and ablations of long-horizon or contact-rich modes are absent and would strengthen generalizability statements. We will add a limitations subsection explicitly discussing the simulation scope, the absence of external real-world validation, and potential omitted failure modes to prevent overgeneralization.
revision: partial

Referee: [Abstract and instantiation paragraph] The 13-category taxonomy is presented as comprehensively covering relevant failure modes, yet no validation procedure, inter-rater agreement, or mapping to documented physical interaction risks is described; this directly affects the diagnostic reliability of the Safe-Core/Safe-Lang/Safe-Vis partition and the four-quadrant decomposition.

Authors: The taxonomy was derived by synthesizing categories from prior robotics safety literature and observed VLA failure patterns, with the Safe-Core/Safe-Lang/Safe-Vis partition intended to isolate distinct risk sources. We did not include a formal validation procedure or inter-rater study in the original manuscript. We will revise the instantiation section to add explicit references to source literature, example mappings to physical risks, and a brief rationale for the partition to improve transparency and diagnostic interpretability.
revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark with direct measurements on external baselines

The paper defines a 13-category safety taxonomy, instantiates 66 scenarios in RoboTwin, and reports process-level metrics (CC, RET, four-quadrant decomposition) on representative VLA baselines. All load-bearing claims are empirical observations from these evaluations rather than derivations, fitted predictions, or self-referential definitions. No equations, ansatzes, or uniqueness theorems are invoked; self-citations (if any) are not load-bearing for the central results. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution rests on an author-defined safety taxonomy and scenario set with no external derivation or validation shown in the abstract; no free parameters or invented physical entities are introduced.

axioms (1)

ad hoc to paper: The 13 safety categories (Safe-Core, Safe-Lang, Safe-Vis) comprehensively cover the relevant failure modes for embodied VLA systems.
Defined by the authors as the primary evaluation target without reference to prior validation studies.

pith-pipeline@v0.9.1-grok · 5807 in / 1306 out tokens · 33092 ms · 2026-06-30T09:49:26.399198+00:00 · methodology

Review history (2 revisions) →



O. Mees, L. Hermann, E. Rosete-Beas, and W. Burgard, “Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks,”IEEE Robotics and Automation Letters, vol. 7, no. 3, pp. 7327–7334, 2022

2022

[11] [11]

Man- iSkill2: A unified benchmark for generalizable ma- nipulation skills, 2023

J. Gu, F. Xiang, X. Li, Z. Ling, X. Liu, T. Mu, Y . Tang, S. Tao, X. Wei, Y . Yaoet al., “Maniskill2: A unified benchmark for generalizable manipulation skills,”arXiv preprint arXiv:2302.04659, 2023

work page arXiv 2023

[12] [12]

Robotwin: Dual-arm robot benchmark with generative digital twins,

Y . Mu, T. Chen, Z. Chen, S. Peng, Z. Lan, Z. Gao, Z. Liang, Q. Yu, Y . Zou, M. Xuet al., “Robotwin: Dual-arm robot benchmark with generative digital twins,” inProceedings of the computer vision and pattern recognition conference, 2025, pp. 27 649–27 660

2025

[13] [13]

Evaluating Real-World Robot Manipulation Policies in Simulation

X. Li, K. Hsu, J. Gu, K. Pertsch, O. Mees, H. R. Walke, C. Fu, I. Lunawat, I. Sieh, S. Kirmaniet al., “Evaluating real-world robot manipulation policies in simulation,”arXiv preprint arXiv:2405.05941, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[14] [14]

VLA-Arena: An Open-Source Framework for Benchmarking Vision-Language-Action Models

B. Zhang, J. Li, J. Shen, Y . Cai, Y . Zhang, Y . Chen, J. Dai, J. Ji, and Y . Yang, “Vla-arena: An open-source framework for benchmarking vision-language-action models,”arXiv preprint arXiv:2512.22539, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[15] [15]

Generative image as action models,

M. Shridhar, Y . L. Lo, and S. James, “Generative image as action models,” arXiv preprint arXiv:2407.07875, 2024

work page arXiv 2024

[16] [16]

Ro- bustnav: Towards benchmarking robustness in embodied navigation,

P. Chattopadhyay, J. Hoffman, R. Mottaghi, and A. Kembhavi, “Ro- bustnav: Towards benchmarking robustness in embodied navigation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 15 691–15 700

2021

[17] [17]

Adversarial Patch

T. B. Brown, D. Man ´e, A. Roy, M. Abadi, and J. Gilmer, “Adversarial patch,”arXiv preprint arXiv:1712.09665, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[18] [18]

Prompt Injection attack against LLM-integrated Applications

Y . Liu, G. Deng, Y . Li, K. Wang, Z. Wang, X. Wang, T. Zhang, Y . Liu, H. Wang, Y . Zhenget al., “Prompt injection attack against llm-integrated applications,”arXiv preprint arXiv:2306.05499, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[19] [19]

Safebench: A benchmarking platform for safety evaluation of autonomous vehicles,

C. Xu, W. Ding, W. Lyu, Z. Liu, S. Wang, Y . He, H. Hu, D. Zhao, and B. Li, “Safebench: A benchmarking platform for safety evaluation of autonomous vehicles,”Advances in Neural Information Processing Systems, vol. 35, pp. 25 667–25 682, 2022

2022

[20] [20]

Benchmarking Batch Deep Reinforcement Learning Algorithms

A. Ray, J. Achiam, and D. Amodei, “Benchmarking safe exploration in deep reinforcement learning,”arXiv preprint arXiv:1910.01708, vol. 7, no. 1, p. 2, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1910

[21] [21]

Altman,Constrained Markov decision processes

E. Altman,Constrained Markov decision processes. Routledge, 2021

2021

[22] [22]

Safety gymnasium: A unified safe reinforcement learning benchmark,

J. Ji, B. Zhang, J. Zhou, X. Pan, W. Huang, R. Sun, Y . Geng, Y . Zhong, J. Dai, and Y . Yang, “Safety gymnasium: A unified safe reinforcement learning benchmark,”Advances in Neural Information Processing Systems, vol. 36, pp. 18 964–18 993, 2023

2023

[23] [23]

Safe-control-gym: A unified benchmark suite for safe learning-based control and reinforcement learning in robotics,

Z. Yuan, A. W. Hall, S. Zhou, L. Brunke, M. Greeff, J. Panerati, and A. P. Schoellig, “Safe-control-gym: A unified benchmark suite for safe learning-based control and reinforcement learning in robotics,”IEEE Robotics and Automation Letters, vol. 7, no. 4, pp. 11 142–11 149, 2022

2022

[24] [24]

Concrete Problems in AI Safety

D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Man ´e, “Concrete problems in ai safety,”arXiv preprint arXiv:1606.06565, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[25] [25]

Libero: Benchmarking knowledge transfer for lifelong robot learning,

B. Liu, Y . Zhu, C. Gao, Y . Feng, Q. Liu, Y . Zhu, and P. Stone, “Libero: Benchmarking knowledge transfer for lifelong robot learning,”Advances in Neural Information Processing Systems, vol. 36, pp. 44 776–44 791, 2023

2023

[26] [26]

Control barrier function based quadratic programs for safety critical systems,

A. D. Ames, X. Xu, J. W. Grizzle, and P. Tabuada, “Control barrier function based quadratic programs for safety critical systems,”IEEE Transactions on Automatic Control, vol. 62, no. 8, pp. 3861–3876, 2016

2016

[27] [27]

A comprehensive survey on safe reinforce- ment learning,

J. Garcıa and F. Fern ´andez, “A comprehensive survey on safe reinforce- ment learning,”Journal of Machine Learning Research, vol. 16, no. 1, pp. 1437–1480, 2015

2015

[28] [28]

Safe learning in robotics: From learning-based control to safe reinforcement learning,

L. Brunke, M. Greeff, A. W. Hall, Z. Yuan, S. Zhou, J. Panerati, and A. P. Schoellig, “Safe learning in robotics: From learning-based control to safe reinforcement learning,”Annual Review of Control, Robotics, and Autonomous Systems, vol. 5, no. 1, pp. 411–444, 2022

2022

[29] [29]

A survey of methods for safe human-robot interaction,

P. A. Lasota, T. Fong, and J. A. Shah, “A survey of methods for safe human-robot interaction,”Foundations and Trends® in Robotics, vol. 5, no. 4, pp. 261–349, 2017

2017

[30] [30]

Safety of embodied navigation: A survey,

Z. Wang, J. Hu, and R. Mu, “Safety of embodied navigation: A survey,” arXiv preprint arXiv:2508.05855, 2025

work page arXiv 2025

[31] [31]

Generating robot constitutions & benchmarks for semantic safety,

P. Sermanet, A. Majumdar, A. Irpan, D. Kalashnikov, and V . Sindhwani, “Generating robot constitutions & benchmarks for semantic safety,”arXiv preprint arXiv:2503.08663, 2025

work page arXiv 2025

[32] [32]

Ensuring force safety in vision- guided robotic manipulation via implicit tactile calibration,

L. Wei, J. Ma, Y . Hu, and R. Zhang, “Ensuring force safety in vision- guided robotic manipulation via implicit tactile calibration,”arXiv preprint arXiv:2412.10349, 2024

work page arXiv 2024

[33] [33]

Towards safe robot foundation models,

T. Gruner, D. Palenicek, P. Liu, J. Watson, D. Tateo, J. Peterset al., “Towards safe robot foundation models,”arXiv e-prints, pp. arXiv–2503, 2025

2025

[34] [34]

Sapien: A simulated part-based interactive environment,

F. Xiang, Y . Qin, K. Mo, Y . Xia, H. Zhu, F. Liu, M. Liu, H. Jiang, Y . Yuan, H. Wang, L. Yi, A. X. Chang, L. J. Guibas, and H. Su, “Sapien: A simulated part-based interactive environment,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2020, pp. 11 097–11 107. [Online]. Available: https://openaccess.the...

2020

[35] [35]

Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation,

T. Chen, Z. Chen, B. Chen, Z. Cai, Y . Liu, Z. Li, Q. Liang, X. Lin, Y . Ge, Z. Gu, W. Deng, Y . Guo, T. Nian, X. Xie, Q. Chen, K. Su, T. Xu, G. Liu, M. Hu, H. ang Gao, K. Wang, Z. Liang, Y . Qin, X. Yang, P. Luo, and Y . Mu, “Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation,”

[36] [36]

RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

[Online]. Available: https://arxiv.org/abs/2506.18088

work page internal anchor Pith review Pith/arXiv arXiv

[37] [37]

RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

S. Liu, L. Wu, B. Li, H. Tan, H. Chen, Z. Wang, K. Xu, H. Su, and J. Zhu, “Rdt-1b: a diffusion foundation model for bimanual manipulation,” arXiv preprint arXiv:2410.07864, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[38] [38]

Diffusion policy: Visuomotor policy learning via action diffusion,

C. Chi, Z. Xu, S. Feng, E. Cousineau, Y . Du, B. Burchfiel, R. Tedrake, and S. Song, “Diffusion policy: Visuomotor policy learning via action diffusion,”The International Journal of Robotics Research, vol. 44, no. 10-11, pp. 1684–1704, 2025

2025

[39] [39]

Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

T. Z. Zhao, V . Kumar, S. Levine, and C. Finn, “Learning fine- grained bimanual manipulation with low-cost hardware,”arXiv preprint arXiv:2304.13705, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[40] [40]

DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control

J. Wen, Y . Zhu, J. Li, Z. Tang, C. Shen, and F. Feng, “Dexvla: Vision- language model with plug-in diffusion expert for general robot control,” arXiv preprint arXiv:2502.05855, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[41] [41]

Visual instruction tuning,

H. Liu, C. Li, Q. Wu, and Y . J. Lee, “Visual instruction tuning,”Advances in neural information processing systems, vol. 36, pp. 34 892–34 916, 2023

2023

[42] [42]

Tinyvla: Towards fast, data-efficient vision-language- action models for robotic manipulation,

J. Wen, Y . Zhu, J. Li, M. Zhu, Z. Tang, K. Wu, Z. Xu, N. Liu, R. Cheng, C. Shenet al., “Tinyvla: Towards fast, data-efficient vision-language- action models for robotic manipulation,”IEEE Robotics and Automation Letters, 2025

2025

[43] [43]

3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations

Y . Ze, G. Zhang, K. Zhang, C. Hu, M. Wang, and H. Xu, “3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations,”arXiv preprint arXiv:2403.03954, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024