LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models
Mixed citation behavior. Most common role is background (44%).
abstract
Visual-Language-Action (VLA) models report impressive success rates on robotic manipulation benchmarks, yet these results may mask fundamental weaknesses in robustness. We perform a systematic vulnerability analysis by introducing controlled perturbations across seven dimensions: objects layout, camera viewpoints, robot initial states, language instructions, light conditions, background textures and sensor noise. We comprehensively analyzed multiple state-of-the-art models and revealed consistent brittleness beneath apparent competence. Our analysis exposes critical weaknesses: models exhibit extreme sensitivity to perturbation factors, including camera viewpoints and robot initial states, with performance dropping from 95% to below 30% under modest perturbations. Surprisingly, models are largely insensitive to language variations, with further experiments revealing that models tend to ignore language instructions completely. Our findings challenge the assumption that high benchmark scores equate to true competency and highlight the need for evaluation practices that assess reliability under realistic variation.
claims ledger
- background However, standard VLA models do not explicitly model world dynamics: they learn direct observation-to-action mappings without predicting how the environment changes under intervention [4]. This absence of predictive physical reasoning limits their generalization, where anticipating future states is essential. Equipping embodied policy models with world modeling capabilities thus emerges as a natural direction [5]. A growing body of recent work has begun integrating world models into the embo
citing papers explorer
-
TAVIS: A Benchmark for Egocentric Active Vision and Anticipatory Gaze in Imitation Learning
TAVIS is a released benchmark showing active vision improves imitation learning in a task-dependent manner, multi-task policies struggle with shifts, and imitation produces human-like anticipatory gaze.
-
FlowHijack: A Dynamics-Aware Backdoor Attack on Flow-Matching Vision-Language-Action Models
FlowHijack is the first dynamics-aware backdoor attack on flow-matching VLAs that achieves high success rates with stealthy triggers while preserving benign performance and making malicious actions kinematically indistinguishable from normal ones.
-
LIBERO-Safety: A Comprehensive Benchmark for Physical and Semantic Safety in Vision-Language-Action Models
LIBERO-Safety supplies a scalable benchmark, data-generation pipeline, and 19,664-demonstration dataset that exposes a generalization-safety tension in current VLA models where diverse training improves collision avoidance but task success stays limited by trajectory quality and semantic understandi
-
EBench: Elemental Diagnosis of Generalist Mobile Manipulation Policies
EBench is a benchmark that evaluates generalist mobile manipulation policies on 26 tasks across 5 capability and 4 generalization dimensions, revealing distinct capability profiles among models with similar success rates.
-
PearlVLA: Progressive Embodied Action-Plan Refinement in Latent Space
PearlVLA achieves SOTA on LIBERO by separating VLM representations into visual grounding and an iterative latent plan branch refined via world model queries and RefineNet with process-reward RL.
-
DuoBench: A Reproducible Benchmark for Bimanual Manipulation in Simulation and the Real World
DuoBench introduces eleven bimanual manipulation tasks with stage-based evaluation and human datasets to benchmark imitation-learning and vision-language-action policies on dual-arm robots in sim and real settings.
-
ProbeAct: Probe-Guided Training-Free Failure Recovery in Vision-Language-Action Models
PROBEACT is a plug-and-play intervention framework that combines hidden-state probing, kinematic failure detection, and CBF-based correction to boost success rates of pre-trained VLA models on the LIBERO-Plus benchmark from 69.6% to 74.1%.
-
Beyond Binary Success: A Diagnostic Meta-Evaluation Framework for Fine-Grained Manipulation
MetaFine reconstructs benchmarks into diagnostic scenarios to evaluate vision-language-action models on fine-grained manipulation, exposing dimension-specific failures and identifying the visual encoder as a key bottleneck.
-
Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR
Pion modifies Muon's Newton-Schulz iterations into a controllable high-pass filter that anchors dominant singular values at 1 while suppressing noisy tails, outperforming Muon and AdamW in VLA and RLVR regimes.
-
From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation
MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.
-
See What Matters: Differentiable Grid Sample Pruning for Generalizable Vision-Language-Action Model
GridS is a plug-and-play differentiable module for geometry-aware visual token resampling in VLA models that achieves under 10% token retention and 76% FLOPs reduction with no success-rate loss.
-
Beyond World-Frame Action Heads: Motion-Centric Action Frames for Vision-Language-Action Models
MCF-Proto adds a motion-centric local action frame and prototype parameterization to VLA models, inducing emergent geometric structure and improved robustness from standard demonstrations alone.
-
LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models
LoopVLA adds recurrent refinement and learned sufficiency estimation to VLA models, cutting parameters 45% and raising throughput 1.7x while matching baseline task success on LIBERO and VLA-Arena.
-
ECHO: Continuous Hierarchical Memory for Vision-Language-Action Models
ECHO organizes VLA experiences into a hierarchical memory tree in hyperbolic space via autoencoder and entailment constraints, delivering a 12.8% success-rate gain on LIBERO-Long over the pi0 baseline.
-
OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation
OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
-
Embodied Interpretability: Linking Causal Understanding to Generalization in Vision-Language-Action Models
Introduces ISS and NMR as interventional metrics to diagnose causal misalignment in VLA policies and link it to generalization performance.
-
Being-H0.7: A Latent World-Action Model from Egocentric Videos
Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
-
Mini-BEHAVIOR-Gran: Revealing U-Shaped Effects of Instruction Granularity on Language-Guided Embodied Agents
Mini-BEHAVIOR-Gran benchmark reveals a U-shaped effect of instruction granularity on embodied agent performance, with planning-width correlating best and coarse instructions linked to vision-dominant shallow policies.
-
VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models
VP-VLA decouples high-level reasoning from low-level control in VLA models by rendering spatial anchors as visual prompts directly in the RGB observation space, outperforming end-to-end baselines.
-
PlayWorld: Learning Robot World Models from Autonomous Play
PlayWorld learns high-fidelity robot world models from unsupervised self-play, producing physically consistent video predictions that outperform models trained on human data and enabling 65% better real-world policy performance via model-based RL.
-
Learning to Move Before Learning to Do: Task-Agnostic pretraining for VLAs
TAP uses two-stage pretraining on unlabeled data to learn physical competence before language grounding, matching 1M-expert models with far less labeled data and showing robustness on real robots.
-
VLAFlow: A Unified Training Framework for Vision-Language-Action Models via Co-training and Future Latent Alignment
VLAFlow shows that combining language-supervised co-training with future latent alignment produces the most stable transfer performance for vision-language-action models trained on mixed robot data.
-
Human-Centric Transferable Tactile Pre-Training for Dexterous Robotic Manipulation
Introduces H-Tac human tactile-action dataset and TTP pre-training that unifies spaces and predicts future tactile signals to improve robotic dexterous manipulation transfer.
-
ABot-M0.5: Unified Mobility-and-Manipulation World Action Model
ABot-M0.5 proposes a unified mobility-and-manipulation world action model using three alignment strategies that achieves state-of-the-art performance on mobile and fine-grained manipulation benchmarks.
-
3D HAMSTER: Bridging Planning and Control in Hierarchical Vision Language Action Models through 3D Trajectory Guidance
3D HAMSTER adds depth encoding and reconstruction to VLMs to produce 3D waypoint sequences that feed directly into pointcloud policies, claiming better generalization than 2D baselines under shifts.
-
Sequential Planning via Anchored Robotic Keypoints
SPARK reaches 43.7% success on six LIBERO-PRO cells using LLM-generated typed behavior trees combined with multi-prompt perception and recovery, more than doubling the success of CaP-Agent0 and VLA baselines.
-
Direct Action-Head Injection of A Grounded 3D Point Unlocks Spatial and Task Generalization
Direct 3D point grounding injected into the action head via a two-layer MLP and adaptive layer norm boosts VLA success rates by 32-46 points on spatial and task perturbations in LIBERO-PRO.
-
Decoupling the Declarative from the Procedural in Vision-Language-Action Models
w²VLA restructures VLA information flow to decouple declarative semantics from procedural skills, enabling zero-shot transfer to novel objects.
-
VLA-FAIL: Efficient Task Failure Detection for Finetuned Vision-Language-Action Models
VLA-FAIL introduces last-layer Mahalanobis distance and action chunk consistency detectors that together enable early, reliable failure detection in finetuned VLAs without failure data or expensive sampling.
-
Inductive Generalization for Robotic Manipulation
The paper introduces an inductive generalization evaluation protocol for manipulation policies and shows that SOTA vision-language-action models fail on progressively harder task variants.
-
ImageWAM: Do World Action Models Really Need Video Generation, or Just Image Editing?
ImageWAM shows image editing models can replace video generation in world action models, delivering better performance with 6x lower FLOPs and 4x lower latency by using edit-derived KV caches as compact context.
-
Qwen-RobotManip Technical Report: Alignment Unlocks Scale for Robotic Manipulation Foundation Models
Qwen-RobotManip applies unified alignment across representation, motion, and behavior to enable large-scale training on heterogeneous manipulation data, yielding emergent generalization on out-of-distribution robotic benchmarks.
-
Geometric Action Model for Robot Policy Learning
GAM splits a geometric foundation model to enable language-conditioned future geometry prediction and action decoding for robot policies, claiming superior performance on manipulation benchmarks.
-
APT: Action Expert Pretraining Improves Instruction Generalization of Vision-Language-Action Policies
APT pretrains the action expert as a vision-action prior on frozen VLM features then adds language through gated fusion to improve OOD instruction generalization in continuous-action VLA policies.
-
ω-EVA: Envision, Verify, and Act with Latent Interactive World Models
ω-EVA is a three-stage latent world model framework that trains action-conditioned dynamics, a language-conditioned flow policy, and a tri-branch refiner to improve embodied action generation in simulation.
-
GEAR-VLA: Learning Geometry-Aware Action Representations for Generalizable Robotic Manipulation
GEAR-VLA learns geometry-aware action representations via coarse-to-fine pretraining, gradient-decoupled DiT action expert, semantic-aligned 3D integration, and embodiment canonicalization, reporting SOTA results on LIBERO benchmarks and over 80% success on unseen embodiments and 212 unseen objects.
-
vla.cpp: A Unified Inference Runtime for Vision-Language-Action Models
vla.cpp is a unified C++ runtime that serves multiple VLA architectures with flow-matching and diffusion patterns, matching SOTA performance on LIBERO while running on low-memory embedded hardware.
-
3DThinkVLA: Endowing Vision-Language-Action Models with Latent 3D Priors via 3D-Thinking-Guided Co-training
A 3D-thinking-guided co-training method disentangles geometry perception and spatial reasoning to inject latent 3D priors into VLA models via adapters, achieving SOTA on manipulation benchmarks while running on 2D images only.
-
What Are We Actually Benchmarking in Robot Manipulation?
LIBERO and CALVIN fail multiple proposed diagnostics for shortcut solvability, statistical significance, overfitting, and data dependence, while a tiny 0.09B probe reaches near-SOTA on LIBERO.
-
Revisiting Embodied Chain-of-Thought for Generalizable Robot Manipulation
ERVLA trains on a 978k-trajectory embodied CoT corpus using reasoning as supervision with dropout, then predicts actions without CoT at test time, reaching 86.9% on LIBERO-Plus and 53.2% on VLABench.
-
FATE-VLA: Failure-aware test generation for vision-language-action models
FATE-VLA reframes VLA evaluation as active failure discovery and reports uncovering up to 29.7% more failures across four models while revealing diverse failure modes.
-
ELAN4D: Embodiment-Centric 4D Supervision for Vision-Language-Action Models via Plug-and-Play Adaptation
ELAN4D introduces plug-and-play 4D keypoint track supervision from forward kinematics to enhance VLA policy generalization in robotic manipulation tasks.
-
VLAConf: Calibrated Task-Success Confidence for Vision-Language-Action Models
VLAConf is a one-class discriminative method that estimates step-wise task-success confidence for VLA models via anomaly scoring on frozen representations plus step-conditioned modeling, shown to be more efficient than ensemble or probability baselines on LIBERO and real robots.
-
3DVLA: Enhancing Vision-Language-Action Models via 3D Spatial and Instance Understanding
3DVLA is a plug-and-play framework that enhances pretrained VLAs with pervasive 3D feature encoding using multi-view consistency and Spatially-Conditioned Geometry Aggregation, an instance estimation module, and a masked self-supervised 3D branch, yielding gains on LIBERO-Plus and RoboTwin 2.0.
-
Afford-VLA: Action-Aligned Visual Planning via Internalized Affordance
Afford-VLA internalizes task-conditioned affordance as an explicit visual planning interface within VLA models via learnable <AFF> tokens, achieving SOTA on LIBERO and SimplerEnv benchmarks.
-
Action with Visual Primitives
The AVP architecture has the VLM emit visual-primitive tokens that condition a flow-matching action expert, yielding a 27.61% higher success rate than pi_0.5 on real-robot pick-and-place tasks.
-
From a Single Demonstration to a General Policy for Contact-Rich Manipulation
A one-shot LfD framework abstracts a single demonstration into environmental-constraint primitives, then uses self-exploration, human corrections, and compliant recovery to produce a policy that generalizes across poses and geometries, achieving over 90% success on seven real-world multi-stage tasks
-
Feedback World Model Enables Precise Guidance of Diffusion Policy
The feedback world model closes the prediction-observation loop at inference time to correct errors and improve diffusion policy performance under distribution shift in robotics.
-
Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models
VLAs-as-Tools pairs a VLM planner with specialized VLA executors via a new interface and Tool-Aligned Post-Training to raise long-horizon robot success rates on LIBERO-Long and RoboTwin benchmarks.
-
GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization
GuidedVLA improves VLA generalization by supervising individual attention heads with manually defined auxiliary signals for three task-relevant factors.