{"total":12,"items":[{"citing_arxiv_id":"2606.27079","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ForesightSafety-VLA: A Unified Diagnostic Safety Benchmark for Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-06-25T14:19:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ForesightSafety-VLA creates a diagnostic benchmark for VLA safety with taxonomy across physical, language, and visual risks, showing perception and structure variations cause more safety degradation than language changes in tested models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.23686","ref_index":52,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LIBERO-Safety: A Comprehensive Benchmark for Physical and Semantic Safety in Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-06-22T17:59:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LIBERO-Safety supplies a scalable benchmark, data-generation pipeline, and 19,664-demonstration dataset that exposes a generalization-safety tension in current VLA models where diverse training improves collision avoidance but task success stays limited by trajectory quality and semantic understandi","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23856","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Point Tracking Improves World Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-22T17:08:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"JOPAT jointly models pixels, point tracks, and actions in a diffusion transformer and 
reports gains over pixel-only baselines on long-horizon robot tasks with occlusion and off-screen motion.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00053","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VLAMotor: Test-Guided Enhancement of Vision-Language-Action Models via Agent-Based Data Synthesis","primary_cat":"cs.RO","submitted_at":"2026-05-16T08:52:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"VLAMotor exposes VLA failures via distance-aware uncertainty testing and synthesizes agent-planned repair data to fine-tune models, reporting 49.25% success rate gains in simulation and 57.5% on hardware.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14950","ref_index":55,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Evo-Depth: A Lightweight Depth-Enhanced Vision-Language-Action Model","primary_cat":"cs.CV","submitted_at":"2026-05-14T15:21:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Evo-Depth is a compact VLA model using a lightweight implicit depth encoder from RGB views plus progressive alignment to boost manipulation performance without added hardware.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12386","ref_index":23,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SafeManip: A Property-Driven Benchmark for Temporal Safety Evaluation in Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-12T16:49:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SafeManip is a benchmark applying reusable LTLf templates 
across eight safety categories to evaluate temporal properties in robotic manipulation on VLA policies.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Long-horizon mobile manipulation in AI2-THOR; FailureBench [9] | Intervention failures and recovery | Binary indicators of unsafe-state | Push-style manipulation with fragile, bounded, and obstructed variants; RedVLA [26] | robot damage, environmental harm | Safety predicates and cumulative costs | LIBERO manipulation tasks with injected physical risk factors; VLA-Arena [23] | Contact, force, distance, spill, and falling-object | CBDDL predicates and cost blocks | RoboSuite manipulation with obstacles, hazards, and skill composition; SENTINEL [22] | State, ordering, response, and timing constraints | LTL/CTL specifications | Household embodied tasks across semantic, plan, and trajectory levels. Table 1: Comparison of safety-oriented robotic manipulation benchmarks."},{"citing_arxiv_id":"2605.12160","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Premover: Fast Vision-Language-Action Control by Acting Before Instructions Are Complete","primary_cat":"cs.RO","submitted_at":"2026-05-12T14:10:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Premover enables VLA policies to act on partial instructions by precomputing focus maps from intermediate backbone layers, reducing wall-clock time 13.6 percent on LIBERO while preserving 95 percent success rate.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"conventional inference: where a conventional VLA waits for the prompt to be complete before its first forward pass, Premover interleaves focus map computation with the user's typing and begins acting before the prompt is finalized. 
We evaluate Premover on π0.5 [2], a recent vision-language-action model with hierarchical subtask prediction. We use two simulated benchmarks: LIBERO [11] (Spatial, Object, Goal, and LIBERO-10) and VLA-arena [22] (Extrapolation, Distractor, Safe, and Long-horizon). Both expose the per-instance segmentation masks needed for focus map supervision, and they cover distinct task distributions, scene compositions, and instruction styles. Premover reduces end-to-end wall-clock time with little to no loss in full-prompt success rate; for example, on the LIBERO benchmark"},{"citing_arxiv_id":"2605.09948","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models","primary_cat":"cs.AI","submitted_at":"2026-05-11T03:51:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LoopVLA adds recurrent refinement and learned sufficiency estimation to VLA models, cutting parameters 45% and raising throughput 1.7x while matching baseline task success on LIBERO and VLA-Arena.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"where θ is a predefined threshold determined heuristically based on the wσ rule of a normal distribution (e.g., θ ≈ 0.68, 0.95, 0.997 for w = 1, 2, 3, respectively). The final action is selected from the most probable visited step: n* = argmax_{k ≤ n} p(k), A = A^{(n*)}. (16) 4 Experiments 4.1 Experiment Settings Simulation Benchmark Details. We evaluate on LIBERO [35], VLA-Arena [36], and LIBERO-Plus [37]. LIBERO contains multiple task suites covering spatial reasoning, object interaction, and long-horizon manipulation, while VLA-Arena provides a standardized benchmark with diverse tasks and difficulty levels. 
To assess generalization, we additionally report zero-shot performance on LIBERO-Plus, which introduces unseen task compositions."},{"citing_arxiv_id":"2605.07381","ref_index":68,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation","primary_cat":"cs.RO","submitted_at":"2026-05-08T07:35:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Anchor-Centric Adaptation escapes the diversity trap by prioritizing repeated demonstrations at core anchors over broad coverage, yielding higher success rates under fixed data budgets in robotic manipulation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"For fixed N_r, minimizing (65) over K_r is the same calculation as in (36), with LcP replaced by L c_r v_r^{1/d} and σ replaced by σ_r. This yields (66). To minimize the maximum regional error, the two optimized errors should be balanced unless one region receives a boundary allocation. Setting E*_core(N_core) ≍ E*_bd(N_bd) and raising both sides to the power d + 2 gives (C σ_core)^2 (L c_core v_core^{1/d})^d / N_core ≍ (C σ_bd)^2 (L c_bd v_bd^{1/d})^d / N_bd. (68) Canceling common constants gives (67). Interpretation. For a worst-case objective, the split depends on region volume, geometry, and noise scale. Larger or geometrically harder regions require more coverage; noisier regions require more replication. 
If c_core ≈ c_bd, the split simplifies to N_core / N_bd ≍ (v_core / v_bd)(σ_core / σ_bd)^2. (69) Thus, a boundary region with larger noise can require a non-negligible fraction of the budget even if its volume is smaller."},{"citing_arxiv_id":"2604.23775","ref_index":90,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Vision-Language-Action Safety: Threats, Challenges, Evaluations, and Mechanisms","primary_cat":"cs.RO","submitted_at":"2026-04-26T15:58:19+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A literature survey that unifies fragmented work on attacks, defenses, evaluations, and deployment challenges for Vision-Language-Action models in robotics.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":", successful grasping) fail to fully capture embodied safety. Consequently, evaluation has evolved into a multi-layered certification process encompassing physical resilience, semantic alignment, and human-centric awareness. At the baseline level, frameworks like VLA-Arena standardize the capability evaluation of models across diverse manipulation tasks [90]. Since real environments are rarely benign, VLA-Risk explicitly tests physical robustness by quantifying task success rates under multimodal perturbations [57]. 
However, as traditional success rates are often \"progress-agnostic\", newer benchmarks in open-world environments introduce safety-aware metrics, such as the Safety Q-score (sQ), to penalize safety violations during the execution process"},{"citing_arxiv_id":"2604.18000","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-04-20T09:25:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"State-of-the-art vision-language-action models catastrophically fail dynamic embodied reasoning due to lexical-kinematic shortcuts, behavioral inertia, and semantic feature collapse caused by architectural bottlenecks, as shown by the new BeTTER benchmark with real-world validation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"For instance, GemBench [39] evaluates robustness to novel placements, object instances, and temporal extensibility, and VLABench [40] incorporates new object categories and implicit language instructions. LIBERO-Plus [16] and LIBERO-Pro [17] further expose the fragility of current VLAs under visual perturbations. To better characterize capability boundaries, VLA-Arena [41] proposes a structured evaluation framework that disentangles task difficulty across task structure, language, and visual observation, enabling more fine-grained analysis of model behavior. This work introduces BeTTER, shifting the diagnostic focus from perceptual robustness to embodied reasoning. 
Rather than applying surface-level visual perturbations to static scenes, our extensible framework systematically"},{"citing_arxiv_id":"2604.11751","ref_index":62,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Grounded World Model for Semantically Generalizable Planning","primary_cat":"cs.RO","submitted_at":"2026-04-13T17:25:41+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"Grasp Reach Success Grasp Reach Success InstructVLA[56] 70 0.98 0.92 0.89 0.79 0.51 0.47 SmolVLA[46] 75 0.99 1.000.99 0.29 0.31 0.08 Wall-OSS[57] 801.00 1.00 1.000.68 0.50 0.40 GR00T-N1.6[42] 1001.00 1.00 1.000.72 0.18 0.18 InternVLA-A1[10] 1001.000.91 0.88 0.63 0.40 0.26 π0.5[25] 1001.000.99 0.99 0.70 0.38 0.26 π0[3] 1001.00 1.00 1.000.47 0.14 0.08 XVLA[62] 1001.000.88 0.88 0.44 0.17 0.17 UniVLA[8] 120 0.79 0.62 0.63 0.38 0.18 0.13 Motus[2] 300 0.78 0.72 0.72 0.34 0.14 0.14 Baseline Average-0.95 0.90 0.90 0.54 0.29 0.22 GWM-MPC 20 0.97 0.95 0.920.99 0.88 0.87 GWM Ablation Study DreamDojo-MPC [16] 24 0.22 0.41 0.15 0.28 0.44 0.17 GWM-MPC-AC 20 0.91 0.77 0.74 0.47 0.42 0.24 GWM-MPC-xArm6 - 0.96 0.91 0."}],"limit":50,"offset":0}