How VLAs (Really) Work In Open-World Environments

· 2026 · cs.RO · arXiv 2604.21192

3 Pith papers cite this work. Polarity classification is still indexing.


abstract

Vision-language-action models (VLAs) have been extensively used in robotics applications, achieving great success in various manipulation problems. More recently, VLAs have been applied to long-horizon tasks and evaluated on benchmarks such as BEHAVIOR1K (B1K) for solving complex household chores. The common metric for measuring progress on such benchmarks is success rate or a partial score based on the satisfaction of progress-agnostic criteria, meaning that only the final states of the objects are considered, regardless of the events that lead to those states. In this paper, we argue that such evaluation protocols say little about the safety aspects of operation and can potentially exaggerate reported performance, undermining core challenges for future real-world deployment. To this end, we conduct a thorough analysis of state-of-the-art models on the B1K Challenge and evaluate policies in terms of robustness (via reproducibility and consistency of performance), the safety aspects of policy operation, task awareness, and the key factors that lead to incomplete tasks. We then propose evaluation protocols that capture safety violations, to better measure the true performance of policies in more complex and interactive scenarios. Finally, we discuss the limitations of existing VLAs and motivate future research.
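To make the distinction in the abstract concrete, the following is a minimal sketch of a progress-agnostic score next to a score that also looks at events along the trajectory. It is not the paper's protocol: the `Step` schema, the event names, the `penalized_score` function, and the `penalty` weight are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One simulation step (hypothetical schema, not from B1K)."""
    object_states: dict[str, bool]
    safety_events: list[str] = field(default_factory=list)


def final_state_score(trajectory: list[Step], goal: dict[str, bool]) -> float:
    """Progress-agnostic partial score: only the last step is inspected."""
    final = trajectory[-1].object_states
    satisfied = sum(1 for key, value in goal.items() if final.get(key) == value)
    return satisfied / len(goal)


def count_safety_events(trajectory: list[Step]) -> int:
    """Count every safety event recorded anywhere in the trajectory."""
    return sum(len(step.safety_events) for step in trajectory)


def penalized_score(trajectory: list[Step], goal: dict[str, bool],
                    penalty: float = 0.25) -> float:
    """Illustrative trajectory-aware score: subtract an assumed fixed
    penalty per safety event, clamped at zero."""
    raw = final_state_score(trajectory, goal)
    return max(0.0, raw - penalty * count_safety_events(trajectory))


goal = {"cup_on_table": True, "door_closed": True}

# Both rollouts end in the same final state.
careful = [
    Step({"cup_on_table": False, "door_closed": False}),
    Step({"cup_on_table": True, "door_closed": True}),
]
careless = [
    Step({"cup_on_table": False, "door_closed": False}, ["collision"]),
    Step({"cup_on_table": True, "door_closed": True}, ["dropped_object"]),
]

print(final_state_score(careful, goal), final_state_score(careless, goal))  # 1.0 1.0
print(penalized_score(careful, goal), penalized_score(careless, goal))      # 1.0 0.5
```

Under these assumptions the two rollouts are indistinguishable by final-state scoring, while the trajectory-aware variant separates them. A linear penalty is only one possible design; a protocol could equally report violations as a separate count rather than folding them into the score.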

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Flash-WAM: Modality-Aware Distillation for World Action Models

cs.LG · 2026-06-03 · unverdicted · novelty 6.0

Flash-WAM introduces modality-specific consistency parametrizations to distill joint video-action diffusion models to single-step inference, delivering 23x speedup with preserved benchmark performance.

Make Your VLA More Robust Without More Data By Interleaving Motion Planning

cs.RO · 2026-05-31 · unverdicted · novelty 5.0

MPVI interleaves model-based motion planning with VLAs via VLM completion checking to achieve 113% higher task progress on BEHAVIOR-1K without extra data.

Vision-Language-Action Safety: Threats, Challenges, Evaluations, and Mechanisms

cs.RO · 2026-04-26 · accept · novelty 4.0

A literature survey that unifies fragmented work on attacks, defenses, evaluations, and deployment challenges for Vision-Language-Action models in robotics.

citing papers explorer

Showing 1 of 1 citing paper after filters.

Flash-WAM: Modality-Aware Distillation for World Action Models

cs.LG · 2026-06-03 · unverdicted · none · ref 27 · internal anchor

Flash-WAM introduces modality-specific consistency parametrizations to distill joint video-action diffusion models to single-step inference, delivering 23x speedup with preserved benchmark performance.
