hub Mixed citations

LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models

Senyu Fei, Siyin Wang, Junhao Shi, Zihao Dai, Jikun Cai, Pengfang Qian · 2025 · cs.RO · arXiv 2510.13626

Mixed citation behavior. Most common role is background (46%).

89 Pith papers citing it

Background 46% of classified citations

open full Pith review browse 89 citing papers arXiv PDF

abstract

Visual-Language-Action (VLA) models report impressive success rates on robotic manipulation benchmarks, yet these results may mask fundamental weaknesses in robustness. We perform a systematic vulnerability analysis by introducing controlled perturbations across seven dimensions: objects layout, camera viewpoints, robot initial states, language instructions, light conditions, background textures and sensor noise. We comprehensively analyzed multiple state-of-the-art models and revealed consistent brittleness beneath apparent competence. Our analysis exposes critical weaknesses: models exhibit extreme sensitivity to perturbation factors, including camera viewpoints and robot initial states, with performance dropping from 95% to below 30% under modest perturbations. Surprisingly, models are largely insensitive to language variations, with further experiments revealing that models tend to ignore language instructions completely. Our findings challenge the assumption that high benchmark scores equate to true competency and highlight the need for evaluation practices that assess reliability under realistic variation.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 15 dataset 7 baseline 3 method 1

citation-polarity summary

background 12 use dataset 7 baseline 3 unclear 3 use method 1

claims ledger

abstract Visual-Language-Action (VLA) models report impressive success rates on robotic manipulation benchmarks, yet these results may mask fundamental weaknesses in robustness. We perform a systematic vulnerability analysis by introducing controlled perturbations across seven dimensions: objects layout, camera viewpoints, robot initial states, language instructions, light conditions, background textures and sensor noise. We comprehensively analyzed multiple state-of-the-art models and revealed consistent brittleness beneath apparent competence. Our analysis exposes critical weaknesses: models exhibit
background However, standard VLA models do not explicitly model world dynamics ithey learn direct observation-to- action mappings without predicting how the environment changes under intervention[ 4]. This absence of predictive physical reasoning limits their generalization, where anticipating future states is essential. Equip- ping embodied policy models with world modeling capabilities thus emerges as a natural direction [ 5]. A growing body of recent work has begun integrating world models into the embo

co-cited works

representative citing papers

TAVIS: A Benchmark for Egocentric Active Vision and Anticipatory Gaze in Imitation Learning

cs.RO · 2026-05-08 · accept · novelty 8.0

TAVIS is a released benchmark showing active vision improves imitation learning in a task-dependent manner, multi-task policies struggle with shifts, and imitation produces human-like anticipatory gaze.

FlowHijack: A Dynamics-Aware Backdoor Attack on Flow-Matching Vision-Language-Action Models

cs.CV · 2026-03-30 · unverdicted · novelty 8.0

FlowHijack is the first dynamics-aware backdoor attack on flow-matching VLAs that achieves high success rates with stealthy triggers while preserving benign performance and making malicious actions kinematically indistinguishable from normal ones.

LIBERO-Safety: A Comprehensive Benchmark for Physical and Semantic Safety in Vision-Language-Action Models

cs.RO · 2026-06-22 · unverdicted · novelty 7.0

LIBERO-Safety supplies a scalable benchmark, data-generation pipeline, and 19,664-demonstration dataset that exposes a generalization-safety tension in current VLA models where diverse training improves collision avoidance but task success stays limited by trajectory quality and semantic understandi

DuoBench: A Reproducible Benchmark for Bimanual Manipulation in Simulation and the Real World

cs.RO · 2026-06-10 · unverdicted · novelty 7.0

DuoBench introduces eleven bimanual manipulation tasks with stage-based evaluation and human datasets to benchmark imitation-learning and vision-language-action policies on dual-arm robots in sim and real settings.

ProbeAct: Probe-Guided Training-Free Failure Recovery in Vision-Language-Action Models

cs.RO · 2026-06-08 · unverdicted · novelty 7.0

PROBEACT is a plug-and-play intervention framework that combines hidden-state probing, kinematic failure detection, and CBF-based correction to boost success rates of pre-trained VLA models on the LIBERO-plus benchmark from 69.6% to 74.1%.

Beyond Binary Success: A Diagnostic Meta-Evaluation Framework for Fine-Grained Manipulation

cs.RO · 2026-05-19 · unverdicted · novelty 7.0

MetaFine reconstructs benchmarks into diagnostic scenarios to evaluate vision-language-action models on fine-grained manipulation, exposing dimension-specific failures and identifying the visual encoder as a key bottleneck.

Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR

cs.LG · 2026-05-19 · conditional · novelty 7.0

Pion modifies Muon's Newton-Schulz iterations into a controllable high-pass filter that anchors dominant singular values at 1 while suppressing noisy tails, outperforming Muon and AdamW in VLA and RLVR regimes.

From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation

cs.RO · 2026-05-12 · unverdicted · novelty 7.0

MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.

See What Matters: Differentiable Grid Sample Pruning for Generalizable Vision-Language-Action Model

cs.RO · 2026-05-12 · conditional · novelty 7.0 · 2 refs

GridS is a plug-and-play differentiable module for geometry-aware visual token resampling in VLA models that achieves under 10% token retention and 76% FLOPs reduction with no success-rate loss.

Beyond World-Frame Action Heads: Motion-Centric Action Frames for Vision-Language-Action Models

cs.AI · 2026-05-12 · unverdicted · novelty 7.0

MCF-Proto adds a motion-centric local action frame and prototype parameterization to VLA models, inducing emergent geometric structure and improved robustness from standard demonstrations alone.

LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

LoopVLA adds recurrent refinement and learned sufficiency estimation to VLA models, cutting parameters 45% and raising throughput 1.7x while matching baseline task success on LIBERO and VLA-Arena.

ECHO: Continuous Hierarchical Memory for Vision-Language-Action Models

cs.RO · 2026-05-09 · unverdicted · novelty 7.0

ECHO organizes VLA experiences into a hierarchical memory tree in hyperbolic space via autoencoder and entailment constraints, delivering a 12.8% success-rate gain on LIBERO-Long over the pi0 baseline.

OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

cs.RO · 2026-05-07 · unverdicted · novelty 7.0

OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.

Embodied Interpretability: Linking Causal Understanding to Generalization in Vision-Language-Action Models

cs.RO · 2026-05-01 · unverdicted · novelty 7.0

Introduces ISS and NMR as interventional metrics to diagnose causal misalignment in VLA policies and link it to generalization performance.

Being-H0.7: A Latent World-Action Model from Egocentric Videos

cs.RO · 2026-04-30 · unverdicted · novelty 7.0

Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

Mini-BEHAVIOR-Gran: Revealing U-Shaped Effects of Instruction Granularity on Language-Guided Embodied Agents

cs.AI · 2026-04-18 · unverdicted · novelty 7.0

Mini-BEHAVIOR-Gran benchmark reveals a U-shaped effect of instruction granularity on embodied agent performance, with planning-width correlating best and coarse instructions linked to vision-dominant shallow policies.

VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models

cs.RO · 2026-03-23 · unverdicted · novelty 7.0

VP-VLA decouples high-level reasoning from low-level control in VLA models by rendering spatial anchors as visual prompts directly in the RGB observation space, outperforming end-to-end baselines.

PlayWorld: Learning Robot World Models from Autonomous Play

cs.RO · 2026-03-09 · unverdicted · novelty 7.0

PlayWorld learns high-fidelity robot world models from unsupervised self-play, producing physically consistent video predictions that outperform models trained on human data and enabling 65% better real-world policy performance via model-based RL.

Learning to Move Before Learning to Do: Task-Agnostic pretraining for VLAs

cs.RO · 2026-07-02 · unverdicted · novelty 6.0

TAP uses two-stage pretraining on unlabeled data to learn physical competence before language grounding, matching 1M-expert models with far less labeled data and showing robustness on real robots.

Human-Centric Transferable Tactile Pre-Training for Dexterous Robotic Manipulation

cs.RO · 2026-07-01 · unverdicted · novelty 6.0

Introduces H-Tac human tactile-action dataset and TTP pre-training that unifies spaces and predicts future tactile signals to improve robotic dexterous manipulation transfer.

ABot-M0.5: Unified Mobility-and-Manipulation World Action Model

cs.CV · 2026-07-01 · unverdicted · novelty 6.0

ABot-M0.5 proposes a unified mobility-and-manipulation world action model using three alignment strategies that achieves state-of-the-art performance on mobile and fine-grained manipulation benchmarks.

3D HAMSTER: Bridging Planning and Control in Hierarchical Vision Language Action Models through 3D Trajectory Guidance

cs.RO · 2026-06-30 · unverdicted · novelty 6.0 · 2 refs

3D HAMSTER adds depth encoding and reconstruction to VLMs to produce 3D waypoint sequences that feed directly into pointcloud policies, claiming better generalization than 2D baselines under shifts.

Sequential Planning via Anchored Robotic Keypoints

cs.RO · 2026-06-29 · unverdicted · novelty 6.0

SPARK reaches 43.7% success on six LIBERO-PRO cells by LLM-generated typed behavior trees plus multi-prompt perception and recovery, more than doubling CaP-Agent0 and VLA baselines.

Direct Action-Head Injection of A Grounded 3D Point Unlocks Spatial and Task Generalization

cs.RO · 2026-06-26 · unverdicted · novelty 6.0

Direct 3D point grounding injected into the action head via a two-layer MLP and adaptive layer norm boosts VLA success rates by 32-46 points on spatial and task perturbations in LIBERO-PRO.

citing papers explorer

Showing 17 of 67 citing papers after filters.

HiMem-WAM: Hierarchical Memory-Gated World Action Models for Robotic Manipulation cs.RO · 2026-06-09 · unverdicted · none · ref 22 · internal anchor
HiMem-WAM integrates hierarchical latent actions and boundary-aware memory gates into world action models to enhance robustness and performance on memory-dependent long-horizon robotic tasks.
MemoryVLA++: Temporal Modeling via Memory and Imagination in Vision-Language-Action Models cs.RO · 2026-06-08 · unverdicted · none · ref 74 · internal anchor
MemoryVLA++ integrates a perceptual-cognitive memory bank and denoising world model into VLA models to enable temporal reasoning, yielding performance gains on manipulation benchmarks and real-robot tasks.
PointACT: Vision-Language-Action Models with Multi-Scale Point-Action Interaction cs.RO · 2026-05-20 · unverdicted · none · ref 16 · internal anchor
PointACT proposes a 3D-aware dual-system VLA policy using multi-scale point-action interaction with bottleneck window self-attention, achieving 10% higher success rates on RLBench-10Tasks over prior pretrained VLAs.
RoVLA: Multi-Consistency Constraints for Robust Vision-Language-Action Models cs.RO · 2026-05-19 · unverdicted · none · ref 20 · internal anchor
RoVLA enforces instructional, evolutionary, and observational consistency to improve robustness of VLA policies on manipulation benchmarks and real robots.
Key-Gram: Extensible World Knowledge for Embodied Manipulation cs.RO · 2026-05-18 · unverdicted · none · ref 19 · internal anchor
Key-Gram uses a memory module with key-grams and hashed lookup to inject static linguistic priors into vision-language-action backbones, yielding reported gains on manipulation benchmarks.
VLAMotor: Test-Guided Enhancement of Vision-Language-Action Models via Agent-BasedData Synthesis cs.RO · 2026-05-16 · unverdicted · none · ref 7 · internal anchor
VLAMotor exposes VLA failures via distance-aware uncertainty testing and synthesizes agent-planned repair data to fine-tune models, reporting 49.25% success rate gains in simulation and 57.5% on hardware.
Learning Action Manifold with Multi-view Latent Priors for Robotic Manipulation cs.RO · 2026-05-12 · unverdicted · none · ref 28 · internal anchor
The method uses multi-view diffusion priors and action manifold learning to resolve depth ambiguity and improve action prediction in VLA robotic manipulation models, reporting higher success rates than baselines on LIBERO, RoboTwin, and real-robot tasks.
CKT-WAM: Parameter-Efficient Context Knowledge Transfer Between World Action Models cs.RO · 2026-05-07 · unverdicted · none · ref 36 · internal anchor
CKT-WAM transfers teacher WAM knowledge to students via compressed text-embedding contexts using LQCA and adapters, reaching 86.1% success on LIBERO-Plus with 1.17% trainable parameters and 83.3% in real-world tasks.
VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts cs.RO · 2026-05-07 · unverdicted · none · ref 9 · 2 links · internal anchor
VLA-GSE uses spectral decomposition of the VLA backbone to create generalized and specialized experts, enabling effective robot task adaptation while updating only 2.51% of parameters and achieving 81.2% zero-shot success on LIBERO-Plus.
PokeVLA: Empowering Pocket-Sized Vision-Language-Action Model with Comprehensive World Knowledge Guidance cs.RO · 2026-04-22 · unverdicted · none · ref 9 · internal anchor
PokeVLA is a lightweight VLA model pre-trained on 2.4M samples for spatial grounding and reasoning, then adapted via multi-view semantics and geometry alignment to achieve state-of-the-art robot manipulation performance.
Bridge-WA: Predicting Where and How the World Changes for Robotic Action cs.RO · 2026-07-02 · unverdicted · none · ref 53 · internal anchor
Bridge-WA introduces a lightweight distillation-based world-action model that uses future-change priors to improve robotic task success and robustness without deployment-time dense rollouts.
Guided Action Flow: Q-Guided Inference for Flow-Matching Vision-Language-Action Policies cs.RO · 2026-07-02 · unverdicted · none · ref 14 · internal anchor
Guided Action Flow applies a rollout-trained critic to steer frozen flow-matching VLA policies at inference time via action gradients, reporting success rate gains on LIBERO manipulation tasks.
Unleashing More Actions via Action Compositional Training for VLA Models cs.RO · 2026-07-01 · unverdicted · none · ref 11 · internal anchor
ACT-VLA synthesizes novel demonstrations from existing VLA tasks via latent representations to reduce overfitting and improve generalization on manipulation tasks in simulation.
World Action Models: The Next Frontier in Embodied AI cs.RO · 2026-05-12 · unverdicted · none · ref 5 · internal anchor
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
RLDX-1 Technical Report cs.RO · 2026-05-05 · unverdicted · none · ref 33 · 2 links · internal anchor
RLDX-1 outperforms frontier VLAs such as π0.5 and GR00T N1.6 on dexterous manipulation benchmarks, reaching 86.8% success on ALLEX humanoid tasks versus around 40% for the baselines.
Vision-Language-Action Models: Experimental Insights from a Real-World UR5 Platform cs.RO · 2026-06-29 · unverdicted · none · ref 13 · internal anchor
Real-robot trials with OpenVLA on a UR5e arm show consistent offline-to-closed-loop gaps driven by action semantics, coordinate conventions, temporal alignment, image preprocessing, and dataset quality rather than model capacity.
Vision-Language-Action in Robotics: A Survey of Datasets, Benchmarks, and Data Engines cs.RO · 2026-04-24 · unverdicted · none · ref 8 · internal anchor
A survey of VLA robotics research identifies data infrastructure as the primary bottleneck and distills four open challenges in representation alignment, multimodal supervision, reasoning assessment, and scalable data generation.

LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models

hub tools

citation-role summary

citation-polarity summary

claims ledger

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer