OpenVLA: An Open-Source Vision-Language-Action Model

Ashwin Balakrishna, Karl Pertsch, Moo Jin Kim, Siddharth Karamcheti, Suraj Nair, Ted Xiao · 2024 · cs.RO · arXiv 2406.09246

Canonical reference. 72% of citing Pith papers cite this work as background.

566 Pith papers citing it

Background 72% of classified citations

abstract

Large policies pretrained on a combination of Internet-scale vision-language data and diverse robot demonstrations have the potential to change how we teach robots new skills: rather than training new behaviors from scratch, we can fine-tune such vision-language-action (VLA) models to obtain robust, generalizable policies for visuomotor control. Yet, widespread adoption of VLAs for robotics has been challenging as 1) existing VLAs are largely closed and inaccessible to the public, and 2) prior work fails to explore methods for efficiently fine-tuning VLAs for new tasks, a key component for adoption. Addressing these challenges, we introduce OpenVLA, a 7B-parameter open-source VLA trained on a diverse collection of 970k real-world robot demonstrations. OpenVLA builds on a Llama 2 language model combined with a visual encoder that fuses pretrained features from DINOv2 and SigLIP. As a product of the added data diversity and new model components, OpenVLA demonstrates strong results for generalist manipulation, outperforming closed models such as RT-2-X (55B) by 16.5% in absolute task success rate across 29 tasks and multiple robot embodiments, with 7x fewer parameters. We further show that we can effectively fine-tune OpenVLA for new settings, with especially strong generalization results in multi-task environments involving multiple objects and strong language grounding abilities, and outperform expressive from-scratch imitation learning methods such as Diffusion Policy by 20.4%. We also explore compute efficiency; as a separate contribution, we show that OpenVLA can be fine-tuned on consumer GPUs via modern low-rank adaptation methods and served efficiently via quantization without a hit to downstream success rate. Finally, we release model checkpoints, fine-tuning notebooks, and our PyTorch codebase with built-in support for training VLAs at scale on Open X-Embodiment datasets.

citation-role summary

background 94 · baseline 20 · method 7 · other 2

citation-polarity summary

background 89 · baseline 20 · use method 7 · unclear 6 · support 1

representative citing papers

HABIT: Human-Aware Behavior and Interaction Training Dataset for Robot Manipulation

cs.RO · 2026-06-30 · unverdicted · novelty 8.0

HABIT is a large-scale robot demonstration dataset for human-present environments that elicits spatiotemporal synchronization, yielding, and gesture grounding behaviors absent from robot-only training data.

Where to Look: Can Foundation Models Reach a Target Viewpoint Through Active Exploration?

cs.CV · 2026-05-31 · accept · novelty 8.0

Introduces the TVR active viewpoint-matching task and TVRBench indoor simulation benchmark, where foundation models start at low single-digit success rates but reach 51.4% after visual-action SFT and multi-turn GRPO post-training.

FlowHijack: A Dynamics-Aware Backdoor Attack on Flow-Matching Vision-Language-Action Models

cs.CV · 2026-03-30 · unverdicted · novelty 8.0

FlowHijack is the first dynamics-aware backdoor attack on flow-matching VLAs that achieves high success rates with stealthy triggers while preserving benign performance and making malicious actions kinematically indistinguishable from normal ones.

Embodied.cpp: A Portable Inference Runtime of Embodied AI Models on Heterogeneous Robots

cs.RO · 2026-07-02 · unverdicted · novelty 7.0

Embodied.cpp introduces a portable C++ inference runtime with modular layers for deploying VLA and WAM models on heterogeneous robots, reporting 100% and 91% task success on two models plus memory reduction on a WAM benchmark.

LIME: Learning Intent-aware Camera Motion from Egocentric Video

cs.RO · 2026-07-02 · unverdicted · novelty 7.0

LIME formulates language-conditioned camera motion as predicting SE(3) target poses from RGB and intent text, using mined multi-intent supervision from egocentric video and a flow-matching pose head.

EgoSafetyBench: A Diagnostic Egocentric Video Benchmark for Evaluating Embodied VLMs as Runtime Safety Guards

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

EgoSafetyBench shows VLMs reliably spot hazard-containing videos but miss specific contextual hazards and are degraded by misleading in-scene text.

Adapting Generalist Robot Policies with Semantic Reinforcement Learning

cs.RO · 2026-06-30 · unverdicted · novelty 7.0

SARL optimizes language prompt inputs to generalist vision-language-action policies through online RL to solve complex long-horizon tasks by composing existing skills.

Revisiting Parameter Redundancy in Vision-Language-Action Models: Insights from VLM-to-VLA Adaptation

cs.RO · 2026-06-30 · unverdicted · novelty 7.0

VLA models from VLM adaptation can be pruned 12-30% via a multi-module joint scheme based on divergence signals while keeping ~90% performance on LIBERO without post-pruning recovery, unlike standard criteria that collapse.

Labimus: A Simulation and Benchmark for Humanoid Dexterous Manipulation in Chemical Laboratory

cs.RO · 2026-06-30 · unverdicted · novelty 7.0

Labimus is the first benchmark for humanoid dexterous manipulation in organic chemistry laboratories, exposing a gap between task completion and required experimental precision.

Pondering the Way: Spatial-perceiving World Action Model for Embodied Navigation

cs.RO · 2026-06-29 · unverdicted · novelty 7.0

SWAM jointly generates intermediate RGB-D sequences and action trajectories from monocular RGB start/goal observations for embodied navigation.

SurgVLA-Bench: Towards Evaluating Vision-Language-Action Models for Laparoscopic Surgical Robotics

cs.AI · 2026-06-28 · unverdicted · novelty 7.0

SurgVLA-Bench supplies a hierarchical task taxonomy and multi-dimensional evaluation framework for VLA models in laparoscopic robotics simulation, showing autoregressive models excel at semantics while flow-matching models achieve higher precision but all fall short due to endoscopic view constraint

ForesightSafety-VLA: A Unified Diagnostic Safety Benchmark for Vision-Language-Action Models

cs.RO · 2026-06-25 · unverdicted · novelty 7.0

ForesightSafety-VLA creates a diagnostic benchmark for VLA safety with taxonomy across physical, language, and visual risks, showing perception and structure variations cause more safety degradation than language changes in tested models.

LIBERO-Safety: A Comprehensive Benchmark for Physical and Semantic Safety in Vision-Language-Action Models

cs.RO · 2026-06-22 · unverdicted · novelty 7.0

LIBERO-Safety supplies a scalable benchmark, data-generation pipeline, and 19,664-demonstration dataset that exposes a generalization-safety tension in current VLA models where diverse training improves collision avoidance but task success stays limited by trajectory quality and semantic understandi

Cloak: Zero-Shot Cross-Embodiment Manipulation by Masking the End-Effector from the VLA

cs.RO · 2026-06-22 · unverdicted · novelty 7.0

Masking the end-effector from wrist views during training lets a single-gripper VLA transfer zero-shot to other grippers, arms, and five-fingered hands while keeping original performance.

Geometric Entropy: When Trajectory Diversity Helps and Hurts in Imitation Learning

cs.RO · 2026-06-18 · unverdicted · novelty 7.0

Geometric diversity of demonstration trajectories exhibits an inverted-U effect on imitation learning success, with the peak shifting lower as mastery increases via more data, easier tasks, or stronger priors.

HumanScale: Egocentric Human Video Can Outperform Real-Robot Data for Embodied Pretraining

cs.CV · 2026-06-18 · unverdicted · novelty 7.0

Processed egocentric human video outperforms teleoperated real-robot trajectories as pretraining data for embodied foundation models, delivering 24% lower validation loss and 52.5-90% higher task success rates under matched post-training protocols.

Frequency-Aware Flow Matching for Continuous and Consistent Robotic Action Generation

cs.RO · 2026-06-18 · unverdicted · novelty 7.0

FAFM performs flow matching in the frequency domain using DCT on action sequences to produce continuous temporally consistent robotic actions with a Sobolev-style smoothness regularizer.

EquiVLA: A General Framework for Rotationally Equivariant Vision-Language-Action Models

cs.RO · 2026-06-18 · unverdicted · novelty 7.0

EquiVLA is the first general framework for end-to-end SO(2)-equivariant VLA models using EquiPerceptor and EquiActor modules, reporting improved success rates on LIBERO, CALVIN, and real-robot benchmarks.

Start Right, Arrive Right: Asynchronous Execution via Initial Noise Selection

cs.RO · 2026-06-18 · unverdicted · novelty 7.0

PAINT reframes asynchronous flow-based action chunking as an initial noise selection problem solved via backward Euler inversion and a repainting rule.

Mix-QVLA: Task-Evidence-Aware Mixed-Precision Quantization of Vision-Language-Action Models

cs.CV · 2026-06-17 · unverdicted · novelty 7.0

Mix-QVLA is a task-evidence-aware mixed-precision PTQ framework for VLA models that preserves task-relevant evidence via evidence-mass and attribution-distribution metrics to guide bit allocation under memory and BitOps constraints.

Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models

cs.LG · 2026-06-17 · unverdicted · novelty 7.0

Act2Answer protocol reveals VLA models retain simple concepts but show larger gaps on complex semantics than source VLMs, with VQA co-training linked to better retention and knowledge signals peaking in middle layers.

EBench: Elemental Diagnosis of Generalist Mobile Manipulation Policies

cs.RO · 2026-06-16 · unverdicted · novelty 7.0

EBench is a benchmark that evaluates generalist mobile manipulation policies on 26 tasks across 5 capability and 4 generalization dimensions, revealing distinct capability profiles among models with similar success rates.

PearlVLA: Progressive Embodied Action-Plan Refinement in Latent Space

cs.RO · 2026-06-16 · unverdicted · novelty 7.0

PearlVLA achieves SOTA on LIBERO by separating VLM representations into visual grounding and an iterative latent plan branch refined via world model queries and RefineNet with process-reward RL.

HumanoidArena: Benchmarking Egocentric Hierarchical Whole-body Learning

cs.RO · 2026-06-16 · unverdicted · novelty 7.0

HumanoidArena is a new benchmark of 7 leg-critical HOI/HSI tasks that evaluates egocentric hierarchical whole-body policies in humanoids and finds performance is strongly conditioned on the low-level GMT used.

citing papers explorer

Showing 50 of 566 citing papers.

HABIT: Human-Aware Behavior and Interaction Training Dataset for Robot Manipulation cs.RO · 2026-06-30 · unverdicted · none · ref 16 · internal anchor
HABIT is a large-scale robot demonstration dataset for human-present environments that elicits spatiotemporal synchronization, yielding, and gesture grounding behaviors absent from robot-only training data.
Where to Look: Can Foundation Models Reach a Target Viewpoint Through Active Exploration? cs.CV · 2026-05-31 · accept · none · ref 52 · internal anchor
Introduces the TVR active viewpoint-matching task and TVRBench indoor simulation benchmark, where foundation models start at low single-digit success rates but reach 51.4% after visual-action SFT and multi-turn GRPO post-training.
FlowHijack: A Dynamics-Aware Backdoor Attack on Flow-Matching Vision-Language-Action Models cs.CV · 2026-03-30 · unverdicted · none · ref 13 · internal anchor
FlowHijack is the first dynamics-aware backdoor attack on flow-matching VLAs that achieves high success rates with stealthy triggers while preserving benign performance and making malicious actions kinematically indistinguishable from normal ones.
Embodied.cpp: A Portable Inference Runtime of Embodied AI Models on Heterogeneous Robots cs.RO · 2026-07-02 · unverdicted · none · ref 1 · internal anchor
Embodied.cpp introduces a portable C++ inference runtime with modular layers for deploying VLA and WAM models on heterogeneous robots, reporting 100% and 91% task success on two models plus memory reduction on a WAM benchmark.
LIME: Learning Intent-aware Camera Motion from Egocentric Video cs.RO · 2026-07-02 · unverdicted · none · ref 10 · internal anchor
LIME formulates language-conditioned camera motion as predicting SE(3) target poses from RGB and intent text, using mined multi-intent supervision from egocentric video and a flow-matching pose head.
EgoSafetyBench: A Diagnostic Egocentric Video Benchmark for Evaluating Embodied VLMs as Runtime Safety Guards cs.CV · 2026-06-30 · unverdicted · none · ref 16 · internal anchor
EgoSafetyBench shows VLMs reliably spot hazard-containing videos but miss specific contextual hazards and are degraded by misleading in-scene text.
Adapting Generalist Robot Policies with Semantic Reinforcement Learning cs.RO · 2026-06-30 · unverdicted · none · ref 3 · internal anchor
SARL optimizes language prompt inputs to generalist vision-language-action policies through online RL to solve complex long-horizon tasks by composing existing skills.
Revisiting Parameter Redundancy in Vision-Language-Action Models: Insights from VLM-to-VLA Adaptation cs.RO · 2026-06-30 · unverdicted · none · ref 13 · internal anchor
VLA models from VLM adaptation can be pruned 12-30% via a multi-module joint scheme based on divergence signals while keeping ~90% performance on LIBERO without post-pruning recovery, unlike standard criteria that collapse.
Labimus: A Simulation and Benchmark for Humanoid Dexterous Manipulation in Chemical Laboratory cs.RO · 2026-06-30 · unverdicted · none · ref 48 · internal anchor
Labimus is the first benchmark for humanoid dexterous manipulation in organic chemistry laboratories, exposing a gap between task completion and required experimental precision.
Pondering the Way: Spatial-perceiving World Action Model for Embodied Navigation cs.RO · 2026-06-29 · unverdicted · none · ref 19 · internal anchor
SWAM jointly generates intermediate RGB-D sequences and action trajectories from monocular RGB start/goal observations for embodied navigation.
SurgVLA-Bench: Towards Evaluating Vision-Language-Action Models for Laparoscopic Surgical Robotics cs.AI · 2026-06-28 · unverdicted · none · ref 12 · internal anchor
SurgVLA-Bench supplies a hierarchical task taxonomy and multi-dimensional evaluation framework for VLA models in laparoscopic robotics simulation, showing autoregressive models excel at semantics while flow-matching models achieve higher precision but all fall short due to endoscopic view constraint
ForesightSafety-VLA: A Unified Diagnostic Safety Benchmark for Vision-Language-Action Models cs.RO · 2026-06-25 · unverdicted · none · ref 4 · internal anchor
ForesightSafety-VLA creates a diagnostic benchmark for VLA safety with taxonomy across physical, language, and visual risks, showing perception and structure variations cause more safety degradation than language changes in tested models.
LIBERO-Safety: A Comprehensive Benchmark for Physical and Semantic Safety in Vision-Language-Action Models cs.RO · 2026-06-22 · unverdicted · none · ref 23 · internal anchor
LIBERO-Safety supplies a scalable benchmark, data-generation pipeline, and 19,664-demonstration dataset that exposes a generalization-safety tension in current VLA models where diverse training improves collision avoidance but task success stays limited by trajectory quality and semantic understandi
Cloak: Zero-Shot Cross-Embodiment Manipulation by Masking the End-Effector from the VLA cs.RO · 2026-06-22 · unverdicted · none · ref 9 · internal anchor
Masking the end-effector from wrist views during training lets a single-gripper VLA transfer zero-shot to other grippers, arms, and five-fingered hands while keeping original performance.
Geometric Entropy: When Trajectory Diversity Helps and Hurts in Imitation Learning cs.RO · 2026-06-18 · unverdicted · none · ref 6 · internal anchor
Geometric diversity of demonstration trajectories exhibits an inverted-U effect on imitation learning success, with the peak shifting lower as mastery increases via more data, easier tasks, or stronger priors.
HumanScale: Egocentric Human Video Can Outperform Real-Robot Data for Embodied Pretraining cs.CV · 2026-06-18 · unverdicted · none · ref 19 · internal anchor
Processed egocentric human video outperforms teleoperated real-robot trajectories as pretraining data for embodied foundation models, delivering 24% lower validation loss and 52.5-90% higher task success rates under matched post-training protocols.
Frequency-Aware Flow Matching for Continuous and Consistent Robotic Action Generation cs.RO · 2026-06-18 · unverdicted · none · ref 33 · internal anchor
FAFM performs flow matching in the frequency domain using DCT on action sequences to produce continuous temporally consistent robotic actions with a Sobolev-style smoothness regularizer.
EquiVLA: A General Framework for Rotationally Equivariant Vision-Language-Action Models cs.RO · 2026-06-18 · unverdicted · none · ref 3 · internal anchor
EquiVLA is the first general framework for end-to-end SO(2)-equivariant VLA models using EquiPerceptor and EquiActor modules, reporting improved success rates on LIBERO, CALVIN, and real-robot benchmarks.
Start Right, Arrive Right: Asynchronous Execution via Initial Noise Selection cs.RO · 2026-06-18 · unverdicted · none · ref 14 · internal anchor
PAINT reframes asynchronous flow-based action chunking as an initial noise selection problem solved via backward Euler inversion and a repainting rule.
Mix-QVLA: Task-Evidence-Aware Mixed-Precision Quantization of Vision-Language-Action Models cs.CV · 2026-06-17 · unverdicted · none · ref 1 · internal anchor
Mix-QVLA is a task-evidence-aware mixed-precision PTQ framework for VLA models that preserves task-relevant evidence via evidence-mass and attribution-distribution metrics to guide bit allocation under memory and BitOps constraints.
Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models cs.LG · 2026-06-17 · unverdicted · none · ref 57 · internal anchor
Act2Answer protocol reveals VLA models retain simple concepts but show larger gaps on complex semantics than source VLMs, with VQA co-training linked to better retention and knowledge signals peaking in middle layers.
EBench: Elemental Diagnosis of Generalist Mobile Manipulation Policies cs.RO · 2026-06-16 · unverdicted · none · ref 9 · internal anchor
EBench is a benchmark that evaluates generalist mobile manipulation policies on 26 tasks across 5 capability and 4 generalization dimensions, revealing distinct capability profiles among models with similar success rates.
PearlVLA: Progressive Embodied Action-Plan Refinement in Latent Space cs.RO · 2026-06-16 · unverdicted · none · ref 2 · internal anchor
PearlVLA achieves SOTA on LIBERO by separating VLM representations into visual grounding and an iterative latent plan branch refined via world model queries and RefineNet with process-reward RL.
HumanoidArena: Benchmarking Egocentric Hierarchical Whole-body Learning cs.RO · 2026-06-16 · unverdicted · none · ref 6 · internal anchor
HumanoidArena is a new benchmark of 7 leg-critical HOI/HSI tasks that evaluates egocentric hierarchical whole-body policies in humanoids and finds performance is strongly conditioned on the low-level GMT used.
MuseVLA: An Adaptive Multimodal Sensing Vision-Language-Action Model for Robotic Manipulation cs.RO · 2026-06-16 · unverdicted · none · ref 11 · internal anchor
MuseVLA adds on-demand sensor selection via tokens and converts readings into grounded sensor images for multimodal fusion, reporting 80.6% average success on real-robot dexterous tasks that need non-visual sensing.
FTP-1: A Generalist Foundation Tactile Policy Across Tactile Sensors for Contact-Rich Manipulation cs.RO · 2026-06-11 · unverdicted · none · ref 11 · internal anchor
FTP-1 is the first foundation tactile policy pretrained on ~3000 hours of data from 26 sources across 21 sensors that improves performance on seen setups by 17.2% and transfers to unseen sensors with 31% success rate gain.
Trajectory-Level Redirection Attacks on Vision-Language-Action Models cs.RO · 2026-06-11 · unverdicted · none · ref 2 · internal anchor
A prompt-only attack called command-preserving trajectory redirection can steer VLA robot behavior to attacker-chosen physical outcomes while the text still appears to match the intended task.
Ambient Diffusion Policy: Imitation Learning from Suboptimal Data in Robotics cs.RO · 2026-06-10 · unverdicted · none · ref 5 · internal anchor
Ambient Diffusion Policy enables better imitation learning from suboptimal robot data by leveraging spectral properties to restrict data usage to specific diffusion times.
Learning Object Manipulation from Scratch via Contrastive Interaction cs.RO · 2026-06-10 · unverdicted · none · ref 3 · internal anchor
IWR improves CRL sample efficiency and performance in interaction-rich manipulation by interaction-aware resampling that preserves mode boundaries, yielding 19.8% average gains and a real-world air-hockey agent.
LIBERO-Occ: Evaluating and Improving Vision-Language-Action Models under Scene-Induced Occlusion via Viewpoint Imagination cs.CV · 2026-06-09 · unverdicted · none · ref 38 · internal anchor
Introduces LIBERO-Occ benchmark showing VLA performance drop under occlusion and Viewpoint Imagination method that generates complementary views to improve robustness without extra hardware.
ProbeAct: Probe-Guided Training-Free Failure Recovery in Vision-Language-Action Models cs.RO · 2026-06-08 · unverdicted · none · ref 1 · internal anchor
PROBEACT is a plug-and-play intervention framework that combines hidden-state probing, kinematic failure detection, and CBF-based correction to boost success rates of pre-trained VLA models on the LIBERO-plus benchmark from 69.6% to 74.1%.
ReCoVLA: VLM-Guided Reward Compilation for Failure Recovery in Vision-Language-Action Policies cs.RO · 2026-06-08 · unverdicted · none · ref 25 · internal anchor
ReCoVLA improves VLA policy reliability by using a VLM as a semantic reward selector to train residual recovery policies in simulation, raising average success from 36.7% to 66.7% in sim and achieving 61.7% in zero-shot sim-to-real physical tests.
Targeting World Models to Compromise Robot Learning Pipelines cs.RO · 2026-06-08 · unverdicted · none · ref 4 · internal anchor
World models introduce a stealthy poisoning vector into robot learning pipelines where malicious prompts or dynamics in teleoperated data activate only during synthetic trajectory generation, enabling backdoors in downstream policies.
Back to the Familiar Future: Failure Recovery for VLA Policies via Pre-Imagined Milestone Selection cs.RO · 2026-06-08 · unverdicted · none · ref 5 · internal anchor
B2FF pre-generates a milestone bank of familiar future states from the clean initial observation and uses a recoverability-aware selector to guide VLA policies back from deviations, raising average success rate from 56.3% to 74.0% on failure-injected LIBERO.
PhysAgent: Automating Physics-Based 4D Synthesis via Trajectory-Grounded Multi-Agent Feedback cs.RO · 2026-06-07 · unverdicted · none · ref 7 · internal anchor
PhysAgent is a simulator-in-the-loop multi-agent system that automates physically grounded 4D synthesis from multimodal prompts by using trajectory feedback from vision models and LLM reasoning to optimize force fields.
X-Tokenizer: A Multimodal Action Tokenizer for Vision-Language-Action Pretraining cs.CV · 2026-06-07 · unverdicted · none · ref 32 · internal anchor
X-Tokenizer creates semantic action tokens via asymmetric residual quantization and contrastive pretraining on large trajectory data, outperforming prior methods like FAST on robotic tasks.
ActProbe: Action-Space Probe for Early Failure Detection of Generative Robot Policies cs.RO · 2026-06-07 · unverdicted · none · ref 3 · internal anchor
ActProbe is an action-space detector that uses temporal consistency error and action chunk magnitude from policy outputs, mapped via LSTM-MLP, to predict failures earlier than baselines across policies and real-robot tasks.
Q-VGM: Q-Guided Value-Gradient Matching for Flow-Matching VLA Policies cs.RO · 2026-06-06 · unverdicted · none · ref 4 · internal anchor
Q-VGM introduces value-gradient matching via VGG-Flow to improve flow-matching VLA policies with a Cal-QL critic, achieving success rate lifts on LIBERO, RoboTwin, and real-robot tasks.
ChronoPhyBench: Do MLLMs Truly Understand the World or Merely Exploit Language Priors? cs.CV · 2026-06-06 · unverdicted · none · ref 2 · internal anchor
ChronoPhyBench is a new benchmark and dataset for chronological physical dynamics reasoning that combines video-conditioned next-state prediction with VQA to reduce language bias in MLLM evaluation.
VoLo: A Physical Orchestrator for Open-Vocabulary Long-Horizon Manipulation cs.RO · 2026-06-05 · unverdicted · none · ref 32 · internal anchor
VoLoAgent uses a VLM to steer heterogeneous robot capabilities as interruptible tools for long-horizon manipulation and introduces the RoboVoLo benchmark, claiming substantial outperformance over single VLA/VLM or tool-based systems with real-robot validation.
PiL-World: A Chunk-Wise World Model for VLA Policy-in-the-Loop Evaluation cs.RO · 2026-06-04 · unverdicted · none · ref 4 · internal anchor
PiL-World introduces a chunk-wise world model for closed-loop VLA policy evaluation that reduces the gap between simulated and real success rates from 63.2% to 12.0% on three dual-arm manipulation tasks by conditioning on action-derived visual control and latent histories while training on both succ
HapTile: A Haptic-Informed Vision-Tactile-Language-Action Dataset for Contact-Rich Imitation Learning cs.RO · 2026-06-03 · unverdicted · none · ref 9 · internal anchor
HapTile introduces a visuotactile dataset with haptic-informed teleoperation for language-conditioned contact-rich manipulation tasks and provides baseline policy benchmarks.
NextMotionQA: Benchmarking and Judging Human Motion Understanding with Vision-Language Models cs.CV · 2026-06-03 · unverdicted · none · ref 36 · internal anchor
NextMotionQA benchmark reveals VLMs have critical gaps in fine-grained human motion understanding and align with experts on coarse judgment (κ=0.70) but not fine-grained (κ=0.10).
A Dataset for Dynamic Human Preferences for Vision Language Models cs.CV · 2026-06-02 · unverdicted · none · ref 16 · internal anchor
Introduces a benchmark dataset with automated pipeline for evaluating VLMs on dynamic in-context human preferences, distinct from static benchmarks.
Same Weights, Different Robot: A Deployment Safety View of VLA Policies cs.CR · 2026-06-02 · unverdicted · none · ref 13 · internal anchor
The paper identifies a deployment safety gap in VLA policies where identical checkpoints can be executable-inequivalent due to action metadata mismatches, supported by a derived closed-form transform and empirical drift measurements on LIBERO benchmarks.
BOKBO (Best of K Bad Options): Calibrated Abstention for VLA Policies cs.LG · 2026-05-28 · unverdicted · none · ref 9 · internal anchor
BOKBO is the first conformal abstention method for K-sample VLA policies that supplies finite-sample distribution-free guarantees on executed violation rates, with global and Mondrian per-task variants.
PhAIL: A Real-Robot VLA Benchmark and Distributional Methodology cs.RO · 2026-05-28 · unverdicted · none · ref 36 · internal anchor
PhAIL provides an open benchmark and distributional evaluation method for real-robot VLA policies using time-to-success CDF, HRT scoring, and KS significance tests.
Ω-QVLA: Robust Quantization for Vision-Language-Action Models via Composite Rotation and Per-step Scaling cs.CV · 2026-05-27 · unverdicted · none · ref 6 · internal anchor
Omega-QVLA is a post-training quantization framework achieving uniform W4A4 for VLA models' LLM backbone and DiT action head via composite SVD-Hadamard rotation and per-step scaling, matching FP16 success rates on LIBERO.
How VLAs Fail Differently: Black-Box Action Monitoring Reveals Architecture-Specific Failure Signatures cs.RO · 2026-05-27 · unverdicted · none · ref 2 · internal anchor
VLA architectures exhibit architecture-specific failure signatures at the motor-command level, with direction reversal as a universal predictor and velocity monitoring ineffective for continuous models.
Colosseum V2: Benchmarking Generalization for Vision Language Action Models cs.RO · 2026-05-26 · unverdicted · none · ref 31 · internal anchor
Introduces Colosseum V2 benchmark for evaluating VLA model generalization in robotic manipulation with 28 tasks, revealing limitations in current methods and sim-real correlations.
