LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models
Pith reviewed 2026-05-12 10:35 UTC · model grok-4.3
The pith
Vision-language-action models drop from 95% to under 30% success under small camera or starting-position changes, yet are largely insensitive to language perturbations, often ignoring instructions entirely.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
On the LIBERO-Plus benchmark, which adds perturbations across object layout, camera viewpoints, robot initial states, language instructions, lighting, backgrounds, and sensor noise, state-of-the-art VLA models show extreme sensitivity to viewpoint and initial-state changes while remaining largely insensitive to language variations, frequently ignoring instructions entirely.
What carries the argument
A seven-dimension controlled perturbation framework that systematically varies scene elements and measures resulting success-rate changes on multiple VLA models.
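A minimal sketch of that protocol, under stated assumptions: `run_episode` is a hypothetical stand-in for an actual LIBERO-Plus rollout, the dimension names mirror the paper's seven axes, and the outcomes below are placeholders rather than measurements.

```python
# Hedged sketch of a one-dimension-at-a-time perturbation sweep.
# run_episode is a hypothetical stand-in for a real simulator rollout;
# its placeholder outcome is NOT real data.
import random
from statistics import mean

DIMENSIONS = [
    "object_layout", "camera_viewpoint", "robot_initial_state",
    "language_instruction", "lighting", "background_texture", "sensor_noise",
]

def run_episode(dimension: str, seed: int) -> bool:
    """Perturb one dimension, hold the others at nominal values, roll out the policy."""
    rng = random.Random(f"{dimension}/{seed}")
    # A real harness would reset the simulator with the perturbed config here
    # and return whether the manipulation task succeeded.
    return rng.random() < 0.5  # placeholder outcome

def evaluate(trials: int = 100) -> dict:
    """Per-dimension success rate over a fixed number of trials."""
    return {dim: mean(run_episode(dim, s) for s in range(trials))
            for dim in DIMENSIONS}

if __name__ == "__main__":
    for dim, rate in evaluate().items():
        print(f"{dim:22s} success = {rate:.2f}")
```

Varying one factor at a time while holding the rest fixed is what lets a drop in success rate be attributed to a single dimension rather than to an interaction.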
If this is right
- High scores on fixed robotic benchmarks do not indicate that models will perform reliably when camera angle or starting position varies.
- Models appear to rely on visual shortcuts rather than following or even processing language instructions.
- Evaluation of future VLA systems must include robustness tests across these dimensions to reveal actual competence.
- Training procedures should be revised to reduce dependence on fixed viewpoints and initial configurations.
Where Pith is reading between the lines
- Current training data and objectives likely encourage overfitting to the exact conditions present in the original benchmarks.
- The same brittleness may appear in other embodied AI systems that combine vision and action without explicit robustness training.
- One testable extension would be to retrain models on data that already includes the seven perturbation types and measure whether the performance drops disappear.
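A hedged sketch of how that retraining extension might be set up; the perturbation functions and config keys below are illustrative assumptions, not a LIBERO or LIBERO-Plus API.

```python
# Hypothetical sketch: generate training configs that already contain the
# perturbations the benchmark later tests. All names here are assumptions.
import itertools
import random

PERTURBATIONS = {
    "camera_viewpoint": lambda cfg, rng: {**cfg, "cam_yaw_deg": rng.uniform(-15, 15)},
    "robot_initial_state": lambda cfg, rng: {**cfg, "joint_jitter_rad": rng.gauss(0, 0.05)},
    "lighting": lambda cfg, rng: {**cfg, "light_intensity": rng.uniform(0.5, 1.5)},
    # The remaining four dimensions would follow the same pattern.
}

def augmented_configs(base_cfg: dict, per_dim: int, seed: int = 0):
    """Yield (dimension, config) pairs so the training distribution already
    covers the variation the evaluation will probe."""
    rng = random.Random(seed)
    for name, perturb in PERTURBATIONS.items():
        for _ in range(per_dim):
            yield name, perturb(base_cfg, rng)

if __name__ == "__main__":
    base = {"task": "put_bowl_on_plate", "cam_yaw_deg": 0.0}
    for name, cfg in itertools.islice(augmented_configs(base, per_dim=2), 6):
        print(name, cfg)
```

If the performance drops vanish after such retraining, the brittleness is a data-coverage problem; if they persist, it points to something architectural.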
Load-bearing premise
The seven chosen perturbation dimensions together with the specific models tested are representative enough to support broad claims about VLA robustness.
What would settle it
An experiment in which a VLA model maintains success rates above 80 percent across the same modest camera-viewpoint and initial-state perturbations would contradict the reported sensitivity.
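One way to make that criterion precise, sketched below: require the lower Wilson confidence bound on the success rate, not just the point estimate, to clear 80%. The 100-trial budget is an assumption (it matches the figure given in the simulated rebuttal, not a confirmed protocol detail).

```python
# Settling-criterion sketch: pass a perturbation dimension only if the lower
# confidence bound on the measured success rate exceeds 0.80.
import math

def wilson_lower_bound(successes: int, trials: int, z: float = 1.96) -> float:
    """Lower end of the 95% Wilson score interval for a binomial proportion."""
    if trials == 0:
        return 0.0
    p = successes / trials
    denom = 1 + z**2 / trials
    center = p + z**2 / (2 * trials)
    margin = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return (center - margin) / denom

for successes in (95, 85, 28):  # illustrative counts, not reported results
    lb = wilson_lower_bound(successes, trials=100)
    verdict = "passes" if lb > 0.80 else "fails"
    print(f"{successes}/100 -> lower bound {lb:.3f}: {verdict} the 80% criterion")
```

With 100 trials, even 85/100 observed successes fails the criterion (lower bound about 0.767), which is why trial counts and error bars matter for claims near the threshold.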
read the original abstract
Visual-Language-Action (VLA) models report impressive success rates on robotic manipulation benchmarks, yet these results may mask fundamental weaknesses in robustness. We perform a systematic vulnerability analysis by introducing controlled perturbations across seven dimensions: object layout, camera viewpoints, robot initial states, language instructions, light conditions, background textures and sensor noise. We comprehensively analyze multiple state-of-the-art models and reveal consistent brittleness beneath apparent competence. Our analysis exposes critical weaknesses: models exhibit extreme sensitivity to perturbation factors, including camera viewpoints and robot initial states, with performance dropping from 95% to below 30% under modest perturbations. Surprisingly, models are largely insensitive to language variations, with further experiments revealing that models tend to ignore language instructions completely. Our findings challenge the assumption that high benchmark scores equate to true competency and highlight the need for evaluation practices that assess reliability under realistic variation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LIBERO-Plus, an extension of the LIBERO benchmark for systematic robustness analysis of Vision-Language-Action (VLA) models. It applies controlled perturbations across seven dimensions (object layout, camera viewpoints, robot initial states, language instructions, lighting, backgrounds, sensor noise) to multiple state-of-the-art models on robotic manipulation tasks, reporting large performance drops (e.g., from 95% to below 30%) under viewpoint and initial-state perturbations while finding models largely insensitive to language variations and often ignoring instructions entirely. The central claim is that high benchmark scores mask fundamental brittleness and that current evaluation practices are insufficient.
Significance. If the empirical patterns hold after addressing selection and methodological gaps, the work would be significant for robotics and embodied AI: it provides concrete evidence that VLA competence on standard benchmarks does not imply reliability under realistic variation, and it could drive adoption of perturbation-based testing as a standard practice. The systematic multi-dimension design and the language-insensitivity observation are potentially actionable findings for model developers.
major comments (3)
- [Abstract and §3, Perturbation Dimensions] The seven chosen dimensions are introduced without a taxonomy, ablation, or external reference establishing that they are the most load-bearing or representative axes of real-world variation. The broad claim of 'consistent brittleness' and 'extreme sensitivity' therefore rests on an unmotivated testbed; factors such as object dynamics, multi-object interactions, or long-horizon dependencies are not compared, leaving open the possibility that the observed pattern is an artifact of the selected dimensions rather than a general property of VLAs.
- [§4, Model Evaluation and Language Experiments] The finding that 'models tend to ignore language instructions completely' is based on a limited set of models whose architectural diversity, scale, and training regimes are not characterized. Without ablations (e.g., language-only vs. vision-only controls) or mechanistic evidence (attention maps, counterfactuals), the insensitivity result cannot be extrapolated beyond the tested models and risks over-generalization.
- [Methods and Experimental Details] The manuscript reports large quantitative drops (95% to <30%) and statistical patterns but omits full protocol details, variance estimates, number of trials per condition, and data-release statements. These omissions make the central empirical claims difficult to reproduce or stress-test, directly affecting the soundness of the robustness conclusions.
minor comments (2)
- [Figures] Figure captions and axis labels should explicitly state the number of trials and error bars used for each reported success rate.
- [Related Work] The related-work section should include recent robustness studies on VLAs or embodied agents to better situate the contribution.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important areas for improving the motivation, rigor, and reproducibility of our robustness analysis. We address each major comment point by point below, indicating planned revisions where appropriate.
read point-by-point responses
- Referee: [Abstract and §3, Perturbation Dimensions] The seven chosen dimensions are introduced without a taxonomy, ablation, or external reference establishing that they are the most load-bearing or representative axes of real-world variation. The broad claim of 'consistent brittleness' and 'extreme sensitivity' therefore rests on an unmotivated testbed; factors such as object dynamics, multi-object interactions, or long-horizon dependencies are not compared, leaving open the possibility that the observed pattern is an artifact of the selected dimensions rather than a general property of VLAs.
Authors: We agree that an explicit justification and taxonomy for the seven dimensions would strengthen the manuscript. These dimensions were selected as they represent core sources of variation in the LIBERO benchmark and common real-world robotic manipulation challenges (perception, state initialization, and instruction following). In the revision, we will add a new subsection in §3 that provides a taxonomy of perturbation categories, cites relevant prior robustness literature in robotics, and explains the scope (focusing on single-task perturbations rather than dynamics or long-horizon planning, which are orthogonal and left for future work). We will also moderate the language around 'general property' to emphasize that the results demonstrate brittleness along these practically relevant axes. revision: partial
- Referee: [§4, Model Evaluation and Language Experiments] The finding that 'models tend to ignore language instructions completely' is based on a limited set of models whose architectural diversity, scale, and training regimes are not characterized. Without ablations (e.g., language-only vs. vision-only controls) or mechanistic evidence (attention maps, counterfactuals), the insensitivity result cannot be extrapolated beyond the tested models and risks over-generalization.
Authors: We acknowledge the need for greater characterization and supporting evidence. The evaluated models span multiple prominent VLA families with differing scales and training data. In the revised §4, we will add a table summarizing architectural details, parameter counts, and training regimes for each model. We will also include new ablation experiments (language-only and vision-only controls; a sketch of such controls follows these responses) and attention-map visualizations to provide mechanistic support for the observed language insensitivity. These additions will allow readers to better assess the scope of the finding. revision: yes
- Referee: [Methods and Experimental Details] The manuscript reports large quantitative drops (95% to <30%) and statistical patterns but omits full protocol details, variance estimates, number of trials per condition, and data-release statements. These omissions make the central empirical claims difficult to reproduce or stress-test, directly affecting the soundness of the robustness conclusions.
Authors: We apologize for these omissions, which resulted from space limitations. The revised Methods section will specify that each perturbation condition was run for 100 trials across multiple random seeds, include standard deviation and confidence interval reporting, and add statistical significance tests for the reported performance drops. We will also include a clear data-availability statement committing to the release of the full LIBERO-Plus perturbation suite, evaluation code, and raw results upon acceptance. revision: yes
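The language-only and vision-only controls proposed in the second response could take roughly the following shape. This is a sketch under assumptions: `rollout_success_rate` is a hypothetical evaluation hook, and the scrambled/blank/swapped controls are one plausible instantiation rather than the authors' protocol.

```python
# Counterfactual instruction controls: if success is unchanged when the
# instruction is scrambled, blanked, or swapped for another task's, the
# policy is effectively ignoring language. All names are hypothetical.
import random

def scramble(instruction: str, seed: int = 0) -> str:
    words = instruction.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

CONTROLS = {
    "original": lambda instr, other: instr,
    "scrambled": lambda instr, other: scramble(instr),
    "blank": lambda instr, other: "",
    "swapped": lambda instr, other: other,  # instruction from a different task
}

def language_sensitivity(rollout_success_rate, instr: str, other_instr: str) -> dict:
    """Success rate under each language control; flat rates imply language is unused."""
    return {name: rollout_success_rate(ctrl(instr, other_instr))
            for name, ctrl in CONTROLS.items()}

if __name__ == "__main__":
    language_blind_policy = lambda instruction: 0.93  # placeholder, not real data
    print(language_sensitivity(language_blind_policy,
                               "put the bowl on the plate",
                               "open the top drawer"))
```

Statistically indistinguishable success rates across all four conditions would indicate the instruction channel is unused, matching the instruction-ignoring pattern the paper reports.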
Circularity Check
No circularity: purely empirical perturbation study with no derivations or fitted predictions.
full rationale
The paper performs a direct empirical analysis by applying controlled perturbations in seven dimensions to multiple VLA models and reporting measured success rates. No equations, first-principles derivations, parameter fitting, or predictions are present. Central claims (performance drops from 95% to <30%, language insensitivity) follow immediately from the experimental results without reduction to self-defined inputs or self-citation chains. The selection of perturbation dimensions is an experimental design choice, not a circular derivation step.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The seven perturbation dimensions capture the most relevant sources of real-world variation for VLA tasks.
Forward citations
Cited by 42 Pith papers
- TAVIS: A Benchmark for Egocentric Active Vision and Anticipatory Gaze in Imitation Learning
  TAVIS is a released benchmark showing active vision improves imitation learning in a task-dependent manner, multi-task policies struggle with shifts, and imitation produces human-like anticipatory gaze.
- FlowHijack: A Dynamics-Aware Backdoor Attack on Flow-Matching Vision-Language-Action Models
  FlowHijack is the first dynamics-aware backdoor attack on flow-matching VLAs that achieves high success rates with stealthy triggers while preserving benign performance and making malicious actions kinematically indis...
- From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation
  MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.
- Beyond World-Frame Action Heads: Motion-Centric Action Frames for Vision-Language-Action Models
  MCF-Proto adds a motion-centric local action frame and prototype parameterization to VLA models, inducing emergent geometric structure and improved robustness from standard demonstrations alone.
- LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models
  LoopVLA adds recurrent refinement and learned sufficiency estimation to VLA models, cutting parameters 45% and raising throughput 1.7x while matching baseline task success on LIBERO and VLA-Arena.
- ECHO: Continuous Hierarchical Memory for Vision-Language-Action Models
  ECHO organizes VLA experiences into a hierarchical memory tree in hyperbolic space via autoencoder and entailment constraints, delivering a 12.8% success-rate gain on LIBERO-Long over the pi0 baseline.
- OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation
  OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
- VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts
  VLA-GSE improves VLA adaptation by initializing generalized shared experts and specialized routed experts via spectral decomposition of the backbone, outperforming full fine-tuning and other PEFT methods on robotic be...
- Being-H0.7: A Latent World-Action Model from Egocentric Videos
  Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
- Mini-BEHAVIOR-Gran: Revealing U-Shaped Effects of Instruction Granularity on Language-Guided Embodied Agents
  Mini-BEHAVIOR-Gran benchmark reveals a U-shaped effect of instruction granularity on embodied agent performance, with planning-width correlating best and coarse instructions linked to vision-dominant shallow policies.
- VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models
  VP-VLA decouples high-level reasoning from low-level control in VLA models by rendering spatial anchors as visual prompts directly in the RGB observation space, outperforming end-to-end baselines.
- PlayWorld: Learning Robot World Models from Autonomous Play
  PlayWorld learns high-fidelity robot world models from unsupervised self-play, producing physically consistent video predictions that outperform models trained on human data and enabling 65% better real-world policy p...
- Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models
  VLAs-as-Tools pairs a VLM planner with specialized VLA executors via a new interface and Tool-Aligned Post-Training to raise long-horizon robot success rates on LIBERO-Long and RoboTwin benchmarks.
- GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization
  GuidedVLA improves VLA success rates by manually supervising separate attention heads in the action decoder with auxiliary signals for task-relevant factors.
- See What Matters: Differentiable Grid Sample Pruning for Generalizable Vision-Language-Action Model
  GridS reduces visual tokens in VLA models to under 10% of the original count via task-aware differentiable resampling, delivering 76% lower FLOPs with no drop in task success rate on benchmarks and real robots.
- Retrieve-then-Steer: Online Success Memory for Test-Time Adaptation of Generative VLAs
  A retrieve-then-steer method stores successful robot actions in memory and uses them to steer a frozen VLA's flow-matching sampler for better test-time reliability without parameter updates.
- Retrieve-then-Steer: Online Success Memory for Test-Time Adaptation of Generative VLAs
  Retrieve-then-steer stores successful observation-action segments in memory, retrieves relevant chunks, filters them, and uses an elite prior with confidence-adaptive guidance to steer a flow-matching action sampler f...
- Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation
  Anchor-Centric Adaptation escapes the diversity trap by prioritizing repeated demonstrations at core anchors over broad coverage, yielding higher success rates under fixed data budgets in robotic manipulation.
- Embodied Interpretability: Linking Causal Understanding to Generalization in Vision-Language-Action Models
  Interventional attribution via ISS and NMR diagnoses causal misalignment in VLA policies and predicts their generalization performance across manipulation tasks.
- PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations
  PRTS pretrains VLA models with contrastive goal-conditioned RL to embed goal-reachability probabilities from offline data, yielding SOTA results on robotic benchmarks especially for long-horizon and novel instructions.
- Breaking Lock-In: Preserving Steerability under Low-Data VLA Post-Training
  DeLock mitigates lock-in in low-data VLA post-training via visual grounding preservation and test-time contrastive prompt guidance, outperforming baselines across eight evaluations while matching data-heavy generalist...
- CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors
  CorridorVLA improves VLA models by using predicted sparse anchors to impose explicit spatial corridors on action trajectories, yielding 3.4-12.4% success rate gains on LIBERO-Plus with GR00T-Corr reaching 83.21%.
- Test-Time Perturbation Learning with Delayed Feedback for Vision-Language-Action Models
  PDF improves VLA success rates on LIBERO and Atari by applying test-time perturbation learning with delayed feedback to correct trajectory overfitting and overconfidence.
- Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models
  State-of-the-art vision-language-action models catastrophically fail dynamic embodied reasoning due to lexical-kinematic shortcuts, behavioral inertia, and semantic feature collapse caused by architectural bottlenecks...
- OFlow: Injecting Object-Aware Temporal Flow Matching for Robust Robotic Manipulation
  OFlow unifies temporal foresight and object-aware reasoning inside a shared latent space via flow matching to improve VLA robustness in robotic manipulation under distribution shifts.
- ProGAL-VLA: Grounded Alignment through Prospective Reasoning in Vision-Language-Action Models
  ProGAL-VLA uses 3D graphs, symbolic sub-goals, and a Grounding Alignment Contrastive loss to ground actions on verified embeddings, raising robustness from 30.3% to 71.5% and ambiguity AUROC to 0.81 on robotic benchmarks.
- Open-Ended Instruction Realization with LLM-Enabled Multi-Planner Scheduling in Autonomous Vehicles
  LLM-driven multi-planner scheduling framework turns open-ended passenger instructions into safe, traceable control signals for autonomous vehicles while cutting query costs and matching specialized safety levels.
- A1: A Fully Transparent Open-Source, Adaptive and Efficient Truncated Vision-Language-Action Model
  A1 is a transparent VLA framework achieving state-of-the-art robot manipulation success with up to 72% lower latency via adaptive layer truncation and inter-layer flow matching.
- Uncovering Linguistic Fragility in Vision-Language-Action Models via Diversity-Aware Red Teaming
  DAERT generates diverse adversarial instructions via a uniform policy in RL to drop VLA task success rates from 93.33% to 5.85% on benchmarks with models like π0 and OpenVLA.
- E-VLA: Event-Augmented Vision-Language-Action Model for Dark and Blurred Scenes
  E-VLA integrates event streams directly into VLA models via lightweight fusion, raising Pick-Place success from 0% to 60-90% at 20 lux and from 0% to 20-25% under severe motion blur.
- Safety in Embodied AI: A Survey of Risks, Attacks, and Defenses
  The survey organizes over 400 papers on embodied AI safety into a multi-level taxonomy and flags overlooked issues such as fragile multimodal fusion and unstable planning under jailbreaks.
- ROBOGATE: Adaptive Failure Discovery for Safe Robot Policy Deployment via Two-Stage Boundary-Focused Sampling
  ROBOGATE applies adaptive boundary-focused sampling in simulation to discover robot policy failure boundaries, revealing a 97.65 percentage point performance gap for a VLA model between LIBERO and industrial scenarios.
- Learning Action Manifold with Multi-view Latent Priors for Robotic Manipulation
  The method uses multi-view diffusion priors and action manifold learning to resolve depth ambiguity and improve action prediction in VLA robotic manipulation models, reporting higher success rates than baselines on LI...
- The Scaling Law of Evaluation Failure: Why Simple Averaging Collapses Under Data Sparsity and Item Difficulty Gaps, and How Item Response Theory Recovers Ground Truth Across Domains
  Simple averaging of evaluation scores degrades in rank correlation with ground truth under data sparsity and difficulty variation, while a two-parameter logistic Item Response Theory model maintains high correlation a...
- CKT-WAM: Parameter-Efficient Context Knowledge Transfer Between World Action Models
  CKT-WAM transfers teacher WAM knowledge to students via compressed text-embedding contexts using LQCA and adapters, reaching 86.1% success on LIBERO-Plus with 1.17% trainable parameters and 83.3% in real-world tasks.
- VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts
  VLA-GSE uses spectral decomposition of the VLA backbone to create generalized and specialized experts, enabling effective robot task adaptation while updating only 2.51% of parameters and achieving 81.2% zero-shot suc...
- Test-Time Training for Visual Foresight Vision-Language-Action Models
  T³VF applies test-time training with adaptive filtering to reduce OOD failures in VF-VLA models by treating predicted future images and actual next observations as natural training pairs.
- PokeVLA: Empowering Pocket-Sized Vision-Language-Action Model with Comprehensive World Knowledge Guidance
  PokeVLA is a lightweight VLA model pre-trained on 2.4M samples for spatial grounding and reasoning, then adapted via multi-view semantics and geometry alignment to achieve state-of-the-art robot manipulation performance.
- World Action Models: The Next Frontier in Embodied AI
  The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
- RLDX-1 Technical Report
  RLDX-1 outperforms frontier VLAs such as π0.5 and GR00T N1.6 on dexterous manipulation benchmarks, reaching 86.8% success on ALLEX humanoid tasks versus around 40% for the baselines.
- RLDX-1 Technical Report
  RLDX-1 achieves 86.8% success on complex ALLEX humanoid manipulation tasks where prior VLAs reach only around 40%.
- Vision-Language-Action in Robotics: A Survey of Datasets, Benchmarks, and Data Engines
  A survey of VLA robotics research identifies data infrastructure as the primary bottleneck and distills four open challenges in representation alignment, multimodal supervision, reasoning assessment, and scalable data...