TAVIS is a released benchmark showing active vision improves imitation learning in a task-dependent manner, multi-task policies struggle with shifts, and imitation produces human-like anticipatory gaze.
hub Mixed citations
LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models
Mixed citation behavior. Most common role is background (46%).
abstract
Visual-Language-Action (VLA) models report impressive success rates on robotic manipulation benchmarks, yet these results may mask fundamental weaknesses in robustness. We perform a systematic vulnerability analysis by introducing controlled perturbations across seven dimensions: objects layout, camera viewpoints, robot initial states, language instructions, light conditions, background textures and sensor noise. We comprehensively analyzed multiple state-of-the-art models and revealed consistent brittleness beneath apparent competence. Our analysis exposes critical weaknesses: models exhibit extreme sensitivity to perturbation factors, including camera viewpoints and robot initial states, with performance dropping from 95% to below 30% under modest perturbations. Surprisingly, models are largely insensitive to language variations, with further experiments revealing that models tend to ignore language instructions completely. Our findings challenge the assumption that high benchmark scores equate to true competency and highlight the need for evaluation practices that assess reliability under realistic variation.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract Visual-Language-Action (VLA) models report impressive success rates on robotic manipulation benchmarks, yet these results may mask fundamental weaknesses in robustness. We perform a systematic vulnerability analysis by introducing controlled perturbations across seven dimensions: objects layout, camera viewpoints, robot initial states, language instructions, light conditions, background textures and sensor noise. We comprehensively analyzed multiple state-of-the-art models and revealed consistent brittleness beneath apparent competence. Our analysis exposes critical weaknesses: models exhibit
- background However, standard VLA models do not explicitly model world dynamics ithey learn direct observation-to- action mappings without predicting how the environment changes under intervention[ 4]. This absence of predictive physical reasoning limits their generalization, where anticipating future states is essential. Equip- ping embodied policy models with world modeling capabilities thus emerges as a natural direction [ 5]. A growing body of recent work has begun integrating world models into the embo
co-cited works
representative citing papers
FlowHijack is the first dynamics-aware backdoor attack on flow-matching VLAs that achieves high success rates with stealthy triggers while preserving benign performance and making malicious actions kinematically indistinguishable from normal ones.
LIBERO-Safety supplies a scalable benchmark, data-generation pipeline, and 19,664-demonstration dataset that exposes a generalization-safety tension in current VLA models where diverse training improves collision avoidance but task success stays limited by trajectory quality and semantic understandi
MetaFine reconstructs benchmarks into diagnostic scenarios to evaluate vision-language-action models on fine-grained manipulation, exposing dimension-specific failures and identifying the visual encoder as a key bottleneck.
Pion modifies Muon's Newton-Schulz iterations into a controllable high-pass filter that anchors dominant singular values at 1 while suppressing noisy tails, outperforming Muon and AdamW in VLA and RLVR regimes.
MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.
GridS is a plug-and-play differentiable module for geometry-aware visual token resampling in VLA models that achieves under 10% token retention and 76% FLOPs reduction with no success-rate loss.
MCF-Proto adds a motion-centric local action frame and prototype parameterization to VLA models, inducing emergent geometric structure and improved robustness from standard demonstrations alone.
LoopVLA adds recurrent refinement and learned sufficiency estimation to VLA models, cutting parameters 45% and raising throughput 1.7x while matching baseline task success on LIBERO and VLA-Arena.
ECHO organizes VLA experiences into a hierarchical memory tree in hyperbolic space via autoencoder and entailment constraints, delivering a 12.8% success-rate gain on LIBERO-Long over the pi0 baseline.
OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
Mini-BEHAVIOR-Gran benchmark reveals a U-shaped effect of instruction granularity on embodied agent performance, with planning-width correlating best and coarse instructions linked to vision-dominant shallow policies.
VP-VLA decouples high-level reasoning from low-level control in VLA models by rendering spatial anchors as visual prompts directly in the RGB observation space, outperforming end-to-end baselines.
PlayWorld learns high-fidelity robot world models from unsupervised self-play, producing physically consistent video predictions that outperform models trained on human data and enabling 65% better real-world policy performance via model-based RL.
Direct 3D point grounding injected into the action head via a two-layer MLP and adaptive layer norm boosts VLA success rates by 32-46 points on spatial and task perturbations in LIBERO-PRO.
ELAN4D introduces plug-and-play 4D keypoint track supervision from forward kinematics to enhance VLA policy generalization in robotic manipulation tasks.
VLAConf is a one-class discriminative method that estimates step-wise task-success confidence for VLA models via anomaly scoring on frozen representations plus step-conditioned modeling, shown to be more efficient than ensemble or probability baselines on LIBERO and real robots.
3DVLA is a plug-and-play framework that enhances pretrained VLAs with pervasive 3D feature encoding using multi-view consistency and Spatially-Conditioned Geometry Aggregation, an instance estimation module, and a masked self-supervised 3D branch, yielding gains on LIBERO-Plus and RoboTwin 2.0.
AVP architecture has VLM emit visual-primitive tokens to condition flow-matching action expert, yielding 27.61% higher success rate than pi_0.5 on real-robot pick-and-place tasks.
A one-shot LfD framework abstracts a single demonstration into environmental-constraint primitives, then uses self-exploration, human corrections, and compliant recovery to produce a policy that generalizes across poses and geometries, achieving over 90% success on seven real-world multi-stage tasks
Feedback world model closes the prediction-observation loop at inference time to correct errors and improve diffusion policy performance under distribution shift in robotics.
VLAs-as-Tools pairs a VLM planner with specialized VLA executors via a new interface and Tool-Aligned Post-Training to raise long-horizon robot success rates on LIBERO-Long and RoboTwin benchmarks.
A retrieve-then-steer method stores successful robot actions in memory and uses them to steer a frozen VLA's flow-matching sampler for better test-time reliability without parameter updates.
citing papers explorer
-
AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention
AVA-VLA reformulates VLA learning as a POMDP using recurrent states and active visual attention to achieve state-of-the-art results on LIBERO, CALVIN, and real dual-arm tasks.