arxiv: 2606.00773 · v1 · pith:62G5EWQ5new · submitted 2026-05-30 · 💻 cs.RO

SafeVLA-Bench: A Benchmark for the Success-Safety Gap in Vision-Language-Action Models

Pith reviewed 2026-06-28 18:25 UTC · model grok-4.3

classification 💻 cs.RO
keywords vision-language-action · safety evaluation · Signal Temporal Logic · robot manipulation · benchmark · success-safety gap · unsafe-success metrics

The pith

High task success in vision-language-action models often comes with unsafe behavior such as excessive contact or disturbing objects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SafeVLA-Bench as a post-hoc evaluation layer that adds formal safety checks to existing VLA benchmarks. It defines safety rules through Signal Temporal Logic and measures how often successful task completions still break those rules. On LIBERO tabletop tasks, high-success baselines still show unsafe-episode rates of 13 to 15 percent, and on RoboCasa-365 kitchen tasks 36 to 56 percent of successful rollouts violate at least one active safety clause. A reader would care because deploying these models in homes or factories requires both achieving the goal and avoiding damage or collisions.

Core claim

SafeVLA-Bench formalizes task-aware safety requirements as Signal Temporal Logic specifications and introduces two unsafe-success metrics: Succ-But-Unsafe (SBU), the fraction of rollouts that both succeed and violate safety, and Violation Severity Index (VSI), a bounded score of the worst violation depth. Evaluation of nine policy-benchmark entries on LIBERO and RoboCasa-365 shows that high success rates do not imply safe execution: high-success tabletop baselines still have unsafe-episode rates of 13 to 15 percent, and 36 to 56 percent of successful kitchen rollouts violate at least one active safety clause.
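As a rough illustration of the two metrics, the sketch below computes an SBU-style rate and a VSI-style score from per-clause robustness values. It assumes each rollout carries a success label and one normalized STL robustness value per applicable clause (negative means violated). The `Rollout` type, the `scale` parameter, and the clipping rule are hypothetical; the paper's exact VSI formula is not given in this summary. Both a joint rate (over all rollouts, as the abstract words SBU) and a conditional rate (over successful rollouts only) are shown, since the two denominators give different numbers.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Rollout:
    success: bool                # native benchmark success label
    robustness: Sequence[float]  # one value per applicable clause; negative = violated


def is_unsafe(r: Rollout) -> bool:
    """A rollout is unsafe if any applicable clause has negative robustness."""
    return any(v < 0 for v in r.robustness)


def succ_but_unsafe(rollouts: Sequence[Rollout]) -> float:
    """Joint rate: rollouts that both succeed and violate, over all rollouts."""
    if not rollouts:
        return 0.0
    return sum(1 for r in rollouts if r.success and is_unsafe(r)) / len(rollouts)


def unsafe_given_success(rollouts: Sequence[Rollout]) -> float:
    """Conditional rate: violating rollouts among the successful ones only."""
    succ = [r for r in rollouts if r.success]
    if not succ:
        return 0.0
    return sum(1 for r in succ if is_unsafe(r)) / len(succ)


def violation_severity(r: Rollout, scale: float = 1.0) -> float:
    """Hypothetical bounded severity in [0, 1]: worst violation depth, clipped at `scale`."""
    if not r.robustness:
        return 0.0
    depth = max(0.0, -min(r.robustness))
    return min(1.0, depth / scale)


# Made-up rollouts, for illustration only.
demo = [
    Rollout(True, [0.5, -0.2]),   # success, mild violation
    Rollout(True, [0.3, 0.1]),    # success, safe
    Rollout(False, [-1.5]),       # failure, unsafe (not counted by SBU)
    Rollout(True, [-2.0, 0.4]),   # success, deep violation
]
sbu = succ_but_unsafe(demo)
conditional = unsafe_given_success(demo)
worst = max(violation_severity(r) for r in demo if r.success)
```

Reporting the joint and conditional rates side by side makes it clear which denominator a given percentage refers to.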

What carries the argument

Signal Temporal Logic specifications that encode safety constraints such as limits on contact force, avoidance of bystander objects, and prevention of self-collision, combined with Succ-But-Unsafe and Violation Severity Index to quantify the gap between task success and safety compliance.
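A minimal sketch of how such clauses can be scored, assuming the standard quantitative (robustness) semantics for STL: the predicate x <= c has robustness c - x, "always" takes the minimum over time, and conjunction takes the minimum over clauses. The signal names, thresholds, and trace values below are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def rho_upper_bound(signal: Sequence[float], limit: float) -> List[float]:
    """Pointwise robustness of the predicate `signal[t] <= limit`."""
    return [limit - x for x in signal]


def rho_always(rho: Sequence[float]) -> float:
    """Robustness of 'always phi' over the whole trace: the minimum over time."""
    return min(rho)


def rho_and(*values: float) -> float:
    """Robustness of a conjunction: the minimum over its conjuncts."""
    return min(values)


# Hypothetical logged signals for one rollout (illustrative values only).
contact_force = [0.0, 3.2, 7.9, 12.4, 6.1]            # newtons
bystander_shift = [0.0, 0.001, 0.004, 0.004, 0.004]   # metres

# Hypothetical thresholds: 10 N contact limit, 2 cm bystander displacement.
force_ok = rho_always(rho_upper_bound(contact_force, limit=10.0))
bystander_ok = rho_always(rho_upper_bound(bystander_shift, limit=0.02))

# Negative robustness means the clause (or the whole specification) is violated.
spec = rho_and(force_ok, bystander_ok)
```

Here the force clause is violated at the fourth sample, so the conjunction is negative even though the bystander clause holds. Because the two clauses are in different units, a real implementation would likely normalize each robustness value before combining or comparing them.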

If this is right

  • Benchmarking of vision-language-action models must report both task success and safety violation rates rather than success alone.
  • Policies trained only to maximize task success on current datasets will leave a measurable fraction of executions unsafe.
  • Post-hoc safety evaluation can be applied to any existing simulator-based VLA benchmark without retraining the policies.
  • Kitchen-scale tasks exhibit higher unsafe-success rates than tabletop tasks under the same evaluation framework.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Integrating the safety specifications directly into policy training objectives could close the observed gap rather than only measuring it after training.
  • The same approach could be extended to real-robot data if sensor streams are logged at sufficient temporal resolution to support the temporal logic checks.
  • Different task domains may require distinct STL templates, so the framework's value depends on developing reusable safety libraries per environment type.

Load-bearing premise

The chosen Signal Temporal Logic rules correctly and completely capture the safety requirements that matter for these manipulation tasks.

What would settle it

If re-evaluating the same policy rollouts with an alternative set of safety specifications, or against human judgments of unsafe behavior, produced substantially different Succ-But-Unsafe percentages, that would indicate that the reported gap depends on the particular rules chosen.

Figures

Figures reproduced from arXiv: 2606.00773 by Fanxin Kong, Insup Lee, Jialiang Fan, Oleg Sokolsky, Weizhe Xu.

Figure 1: SafeVLA-Bench overview. SafeVLA-Bench combines task-aware STL safety specifications, per-task applicability, and SBU/VSI metrics to measure how often successful rollouts violate safety and how severe their worst violation is, exposing success-safety gaps hidden by native success rates.
Figure 2: Task success, safety, and violation type are complementary. Left: each marker is a LIBERO model-suite cell or RoboCasa-365 model aggregate; bars show Wilson 95% CIs for SR and Safety, and the dashed curve marks the non-dominated frontier. Right: cells show episode violation rates (%) for the three scored safety families.
Figure 3: Successful rollouts can still be physically unsafe. Each row shows one successful-but-unsafe rollout with three frames: the initial state, the safety-violating moment (red border), and the final task-success state (green border). The examples are chosen from SafeVLA-Bench clauses and cover scene interaction and execution semantics from …
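The Figure 2 caption mentions Wilson 95% confidence intervals for success rate and safety. For readers unfamiliar with that interval, the sketch below applies the standard Wilson score formula for a binomial proportion. The function name and the example counts are illustrative and do not come from the paper.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int,
                    z: float = 1.959963984540054) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (default z gives ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie in [0, trials]")
    p = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p + z2 / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials + z2 / (4.0 * trials * trials)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot at the edges.
    return max(0.0, center - half), min(1.0, center + half)


# Illustrative counts only: 45 safe episodes out of 50.
low, high = wilson_interval(45, 50)
```

Unlike the simple normal-approximation interval, the Wilson interval stays inside [0, 1] and behaves sensibly for rates near 0 or 1, which matters when a benchmark cell has few episodes.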
Abstract

Vision-language-action (VLA) benchmarks measure whether a policy completes a requested manipulation task, but binary success can hide safety-relevant trajectory behavior: reaching the goal while applying excessive contact, disturbing bystander objects, destabilizing the held object, or entering robot self-contact. We present SafeVLA-Bench, a post-hoc safety-evaluation framework for existing simulator-based VLA benchmarks. It formalizes task-aware safety requirements as Signal Temporal Logic (STL) specifications and reports native success with two unsafe-success metrics: Succ-But-Unsafe (SBU), the fraction of rollouts that both succeed and violate safety, and Violation Severity Index (VSI), a bounded worst-violation depth score. We instantiate SafeVLA-Bench on LIBERO and RoboCasa-365, evaluating nine policy-benchmark entries across tabletop and kitchen manipulation tasks. High task success does not imply safe execution: high-SR tabletop baselines still leave 13 to 15 percent unsafe-episode rates, and 36 to 56 percent of successful RoboCasa-365 rollouts violate at least one active safety clause. Project page: https://safevla.org.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces SafeVLA-Bench, a post-hoc framework that applies task-aware Signal Temporal Logic (STL) specifications to existing VLA benchmark rollouts (LIBERO and RoboCasa-365) to quantify the gap between task success and safety. It defines two metrics—Succ-But-Unsafe (SBU) and Violation Severity Index (VSI)—and evaluates nine policy-benchmark combinations, reporting that high success rates still yield 13–15% unsafe episodes on tabletop tasks and 36–56% safety violations among successful RoboCasa-365 rollouts.

Significance. If the STL specifications hold, the work provides concrete, reproducible evidence that binary success is an incomplete proxy for safe execution in VLA models. The approach is strengthened by its reliance on existing simulator rollouts and first-principles STL definitions rather than fitted parameters or new data collection, offering a practical tool for post-hoc auditing of published policies.

major comments (1)
  1. [Section 3] STL formalization (Section 3): the predicates and thresholds for contact force, bystander disturbance, object stability, and self-contact are presented without validation (human ratings, threshold sensitivity analysis, or alignment with real incident data). This directly affects the load-bearing claim that the reported 13–15% and 36–56% unsafe-success rates measure genuine safety gaps rather than specification artifacts.
minor comments (2)
  1. [Abstract] Abstract and §4: the nine evaluated policies are referenced only by aggregate results; a table listing each policy, benchmark, and success rate would improve traceability.
  2. [Section 3.3] Notation: the bounded worst-violation depth in VSI is described qualitatively; an explicit formula or pseudocode would clarify its computation from STL robustness.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the need for stronger justification of the STL predicates and thresholds. We agree this is an important point for ensuring the reported safety gaps reflect meaningful concerns rather than arbitrary choices, and we outline our response below.

Point-by-point responses
  1. Referee: [Section 3] STL formalization (Section 3): the predicates and thresholds for contact force, bystander disturbance, object stability, and self-contact are presented without validation (human ratings, threshold sensitivity analysis, or alignment with real incident data). This directly affects the load-bearing claim that the reported 13–15% and 36–56% unsafe-success rates measure genuine safety gaps rather than specification artifacts.

    Authors: We acknowledge that the manuscript presents the STL predicates and thresholds (e.g., force limits, stability margins) without empirical validation such as human ratings or direct alignment to real-world incident data. These choices are grounded in first-principles physical considerations and standard robotics references (e.g., manufacturer force limits for contact and common stability criteria from manipulation literature), rather than fitted parameters. However, the absence of sensitivity analysis or external validation does represent a limitation for interpreting the absolute SBU rates as definitive safety gaps. In revision we will expand Section 3 with: (1) explicit rationale and citations for each threshold, (2) a sensitivity analysis demonstrating how SBU and VSI change under reasonable threshold perturbations, and (3) a clearer statement that the framework is intended to support customizable specifications. This addition will not alter the core empirical observation that success and safety diverge under the chosen specs, but will better contextualize the results.

    revision: yes
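The threshold sensitivity analysis discussed in this exchange could look roughly like the sketch below. It rescales a single contact-force limit and recomputes an SBU-style rate on fixed rollouts. The log format (a success flag plus a peak contact force per rollout), the nominal 10 N limit, and the scale grid are all hypothetical stand-ins, not the paper's setup.

```python
from typing import Dict, Sequence, Tuple

# Hypothetical log format: (success label, peak contact force in newtons).
LoggedRollout = Tuple[bool, float]


def sbu_at_threshold(rollouts: Sequence[LoggedRollout], force_limit: float) -> float:
    """Fraction of all rollouts that succeed and exceed the force limit."""
    if not rollouts:
        return 0.0
    hits = sum(1 for ok, peak in rollouts if ok and peak > force_limit)
    return hits / len(rollouts)


def sweep(rollouts: Sequence[LoggedRollout], nominal: float,
          scales: Sequence[float] = (0.8, 0.9, 1.0, 1.1, 1.2)) -> Dict[float, float]:
    """Recompute the rate with the nominal limit tightened or loosened by each scale."""
    return {s: sbu_at_threshold(rollouts, nominal * s) for s in scales}


# Made-up rollouts, for illustration only.
logged = [(True, 8.5), (True, 10.5), (True, 12.5), (False, 15.0), (True, 5.0)]
curve = sweep(logged, nominal=10.0)
rates = [curve[s] for s in sorted(curve)]
```

With policies, seeds, and success labels held fixed, loosening the limit can only keep the rate the same or lower it, so a flat curve around the nominal value would suggest the reported gap is not an artifact of one particular threshold.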

Circularity Check

0 steps flagged

No significant circularity detected

The paper defines SBU and VSI directly from STL predicates applied to existing benchmark rollouts (LIBERO, RoboCasa-365). These are computed quantities on fixed trajectories, not predictions fitted to or derived from the target statistics themselves. No equations reduce the reported unsafe-success rates to the input definitions by construction, no parameters are tuned then relabeled as predictions, and no self-citation chain supplies the load-bearing uniqueness or ansatz for the central empirical claim. The work is a measurement framework whose outputs are falsifiable against the same external rollouts.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests primarily on the domain assumption that STL can express task-aware safety constraints for manipulation trajectories; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Signal Temporal Logic specifications can be defined to capture safety requirements such as contact force limits and collision avoidance in a task-aware manner
    The entire evaluation pipeline depends on this assumption for the post-hoc analysis.

pith-pipeline@v0.9.1-grok · 5745 in / 1275 out tokens · 23725 ms · 2026-06-28T18:25:09.258015+00:00 · methodology

