Recognition: unknown
How VLAs (Really) Work In Open-World Environments
Pith reviewed 2026-05-09 22:07 UTC · model grok-4.3
The pith
Current success metrics for vision-language-action models only check final task states and ignore safety during execution.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Vision-language-action models achieve their reported success on complex household tasks when scored only on final object states, a criterion that ignores unsafe intermediate actions and inconsistent execution. Analysis of top models on BEHAVIOR1K shows low reproducibility across runs, frequent safety violations, limited task awareness, and task incompletions driven by issues not captured in standard scores. Protocols that record safety violations during operation are proposed to measure true performance in more complex interactive scenarios.
What carries the argument
Progress-agnostic success criteria that evaluate only end states of objects regardless of path taken, together with proposed protocols for capturing safety violations.
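To make the contrast concrete, here is a minimal Python sketch, with hypothetical field and predicate names not taken from the paper, of a final-state-only scorer next to one that also logs unsafe events observed during execution.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Step:
    """One executed timestep; the fields are illustrative, not the B1K log schema."""
    collided_with_human: bool = False
    dropped_held_object: bool = False

@dataclass
class Trajectory:
    steps: List[Step]
    final_state: Dict[str, bool] = field(default_factory=dict)  # e.g. {"plate_in_cabinet": True}

def progress_agnostic_success(traj: Trajectory, goal_predicates: List[str]) -> bool:
    """Final-state-only check, in the spirit of the progress-agnostic criteria described above."""
    return all(traj.final_state.get(p, False) for p in goal_predicates)

def safety_aware_score(traj: Trajectory, goal_predicates: List[str]) -> dict:
    """Scoring that also counts unsafe events that occurred along the way."""
    violations = sum(
        step.collided_with_human or step.dropped_held_object for step in traj.steps
    )
    success = progress_agnostic_success(traj, goal_predicates)
    return {
        "success": success,
        "safety_violations": violations,
        "safe_success": success and violations == 0,
    }
```

Under the first function, a run that bumps a person or drops the held object en route still counts as a full success; the second flags it.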
If this is right
- Existing success rates on benchmarks provide limited insight into the safety of VLA operations.
- Reported performance can be exaggerated because intermediate unsafe events are ignored.
- VLAs often lack reproducibility and task awareness that current metrics do not detect.
- Safety-focused protocols allow better evaluation in interactive and complex scenarios.
Where Pith is reading between the lines
- Robotics applications in homes would require evaluation methods that penalize unsafe actions even if the final state is correct.
- Similar gaps between reported success and actual safety may exist when applying current VLAs to other benchmarks or real environments.
- Training or inference methods for VLAs could benefit from explicit mechanisms to avoid safety violations rather than only optimizing for end states.
Load-bearing premise
That adding safety violation tracking will give a meaningfully better measure of real performance and that patterns observed on the BEHAVIOR1K benchmark will hold in broader open-world settings.
What would settle it
Re-evaluate the same VLAs on BEHAVIOR1K tasks while counting both standard success and safety violations, then observe whether model orderings or overall scores shift substantially compared to progress-agnostic results alone.
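One rough way to run that check, assuming per-model aggregates of standard and safety-aware success rates are available (the model names and numbers below are invented purely for illustration, not results from the paper):

```python
# Hypothetical per-model aggregates over repeated B1K runs.
models = {
    "vla_a": {"success_rate": 0.62, "safe_success_rate": 0.31},
    "vla_b": {"success_rate": 0.55, "safe_success_rate": 0.44},
    "vla_c": {"success_rate": 0.48, "safe_success_rate": 0.40},
}

def ranking(metric: str) -> list:
    """Model names ordered best-to-worst by the given metric."""
    return sorted(models, key=lambda m: models[m][metric], reverse=True)

standard_order = ranking("success_rate")
safety_order = ranking("safe_success_rate")

# Count pairs of models whose relative order flips once safety violations count.
discordant = sum(
    (standard_order.index(a) < standard_order.index(b))
    != (safety_order.index(a) < safety_order.index(b))
    for i, a in enumerate(standard_order)
    for b in standard_order[i + 1:]
)
print(standard_order, safety_order, discordant)
```

Any discordant pair, or a large gap between the two rates, would indicate that the progress-agnostic scores are hiding safety-relevant behavior.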
Original abstract
Vision-language-action models (VLAs) have been extensively used in robotics applications, achieving great success in various manipulation problems. More recently, VLAs have been used in long-horizon tasks and evaluated on benchmarks, such as BEHAVIOR1K (B1K), for solving complex household chores. The common metric for measuring progress in such benchmarks is success rate or partial score based on satisfaction of progress-agnostic criteria, meaning only the final states of the objects are considered, regardless of the events that lead to such states. In this paper, we argue that using such evaluation protocols say little about safety aspects of operation and can potentially exaggerate reported performance, undermining core challenges for future real-world deployment. To this end, we conduct a thorough analysis of state-of-the-art models on the B1K Challenge and evaluate policies in terms of robustness via reproducibility and consistency of performance, safety aspects of policies operations, task awareness, and key elements leading to the incompletion of tasks. We then propose evaluation protocols to capture safety violations to better measure the true performance of the policies in more complex and interactive scenarios. At the end, we discuss the limitations of the existing VLAs and motivate future research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that standard success-rate and partial-score metrics for VLAs on benchmarks such as BEHAVIOR1K (B1K) are inadequate because they are progress-agnostic and consider only final object states, thereby ignoring safety violations during execution and potentially exaggerating reported performance. It reports an analysis of SOTA VLAs on B1K along axes of robustness/reproducibility, safety, task awareness, and failure modes, then proposes new evaluation protocols to detect safety violations and better measure true performance in complex interactive scenarios.
Significance. If the proposed safety-violation protocols were shown to be reproducible, to produce quantitatively different model rankings, and to generalize beyond the 1K household tasks, the work would usefully highlight a gap between current benchmark scores and real-world deployability of VLAs. The manuscript does not yet supply the concrete criteria, quantitative re-ranking results, or validation needed to establish that improvement.
Major comments (2)
- [Evaluation protocols section] The description of the safety-violation protocols (introduced after the B1K analysis) supplies no explicit, reproducible decision rules, contact-force thresholds, annotated violation taxonomy, or inter-annotator agreement statistics. Without these, the claim that the protocols 'better measure true performance' cannot be evaluated or reproduced.
- [Analysis of SOTA models on B1K] The analysis of SOTA VLAs on B1K reports no quantitative comparison (e.g., number or fraction of trajectories reclassified as unsafe under the new protocols versus success-rate metrics) and no evidence that the new protocols alter model rankings. This leaves the central assertion that existing metrics 'exaggerate reported performance' unsupported by data.
Minor comments (1)
- [Abstract / Introduction] The abstract and introduction use 'progress-agnostic criteria' without a concise definition or reference to the exact B1K success criteria being critiqued.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our analysis of VLA evaluation limitations and the proposed protocols. The comments identify key areas where additional detail and quantification will strengthen the manuscript. We address each major comment below and will revise accordingly.
Point-by-point responses
-
Referee: [Evaluation protocols section] The description of the safety-violation protocols (introduced after the B1K analysis) supplies no explicit, reproducible decision rules, contact-force thresholds, annotated violation taxonomy, or inter-annotator agreement statistics. Without these, the claim that the protocols 'better measure true performance' cannot be evaluated or reproduced.
Authors: We agree that the protocols section requires more precise specifications to enable reproducibility. In the revised manuscript we will add explicit decision rules for violation detection, concrete contact-force thresholds (derived from the simulation environment parameters), a categorized taxonomy of safety violations with annotated examples from the B1K trajectories, and inter-annotator agreement statistics for any human-validated components. These additions will directly support the claim that the protocols better capture true performance; an illustrative sketch of such a decision rule appears after these responses.
Revision: yes
-
Referee: [Analysis of SOTA models on B1K] The analysis of SOTA VLAs on B1K reports no quantitative comparison (e.g., number or fraction of trajectories reclassified as unsafe under the new protocols versus success-rate metrics) and no evidence that the new protocols alter model rankings. This leaves the central assertion that existing metrics 'exaggerate reported performance' unsupported by data.
Authors: The manuscript's B1K analysis already documents concrete safety violations and intermediate failures that standard success-rate metrics overlook, providing qualitative support for the exaggeration claim. However, we acknowledge that quantitative reclassification statistics and ranking shifts would offer stronger evidence. We will incorporate these in the revision by reporting the fraction of trajectories newly flagged as unsafe, the resulting changes in model performance scores, and any re-ordering of SOTA models under the proposed protocols.
Revision: yes
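For illustration of what the first response commits to: a minimal Python sketch of an explicit decision rule and an agreement statistic. The categories, signal names, and the 50 N threshold below are placeholders, not values from the paper.

```python
from enum import Enum

class Violation(Enum):
    """Illustrative taxonomy; the revised paper's categories may differ."""
    EXCESSIVE_CONTACT_FORCE = "excessive_contact_force"
    COLLISION_WITH_AGENT = "collision_with_agent"
    HELD_OBJECT_DROPPED = "held_object_dropped"

# Placeholder value; the authors state their thresholds will be derived
# from the simulation environment's physics parameters.
CONTACT_FORCE_LIMIT_N = 50.0

def detect_violations(step: dict) -> list:
    """Map one logged simulator step to zero or more violation labels."""
    found = []
    if step.get("max_contact_force_n", 0.0) > CONTACT_FORCE_LIMIT_N:
        found.append(Violation.EXCESSIVE_CONTACT_FORCE)
    if step.get("collided_with_agent", False):
        found.append(Violation.COLLISION_WITH_AGENT)
    if step.get("held_object_dropped", False):
        found.append(Violation.HELD_OBJECT_DROPPED)
    return found

def cohen_kappa(labels_a: list, labels_b: list) -> float:
    """Inter-annotator agreement for two binary (unsafe=1 / safe=0) label lists."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_a, p_b = sum(labels_a) / n, sum(labels_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)
```

Publishing rules of this form alongside the annotated trajectories would let others reproduce both the violation counts and the agreement figures.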
Circularity Check
No circularity: empirical critique with no derivations or self-referential reductions
Full rationale
The paper conducts an empirical analysis of VLAs on the B1K benchmark and proposes new safety-violation evaluation protocols based on observed failure modes. No equations, fitted parameters, predictions, or first-principles derivations are present. The central argument—that final-state success metrics overlook safety issues—is an independent interpretive claim supported by the reported analysis rather than a quantity that reduces to its own inputs by construction. No self-citation chains or ansatzes are invoked as load-bearing steps. The absence of any mathematical or definitional loop satisfies the criteria for a score of 0.
Forward citations
Cited by 1 Pith paper
- Vision-Language-Action Safety: Threats, Challenges, Evaluations, and Mechanisms. A literature survey that unifies fragmented work on attacks, defenses, evaluations, and deployment challenges for Vision-Language-Action models in robotics.