hub Mixed citations

LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models

Senyu Fei, Siyin Wang, Junhao Shi, Zihao Dai, Jikun Cai, Pengfang Qian · 2025 · cs.RO · arXiv 2510.13626

Mixed citation behavior. Most common role is background (46%).

82 Pith papers citing it

Background 46% of classified citations

open full Pith review browse 82 citing papers arXiv PDF

abstract

Visual-Language-Action (VLA) models report impressive success rates on robotic manipulation benchmarks, yet these results may mask fundamental weaknesses in robustness. We perform a systematic vulnerability analysis by introducing controlled perturbations across seven dimensions: objects layout, camera viewpoints, robot initial states, language instructions, light conditions, background textures and sensor noise. We comprehensively analyzed multiple state-of-the-art models and revealed consistent brittleness beneath apparent competence. Our analysis exposes critical weaknesses: models exhibit extreme sensitivity to perturbation factors, including camera viewpoints and robot initial states, with performance dropping from 95% to below 30% under modest perturbations. Surprisingly, models are largely insensitive to language variations, with further experiments revealing that models tend to ignore language instructions completely. Our findings challenge the assumption that high benchmark scores equate to true competency and highlight the need for evaluation practices that assess reliability under realistic variation.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 15 dataset 7 baseline 3 method 1

citation-polarity summary

background 12 use dataset 7 baseline 3 unclear 3 use method 1

claims ledger

abstract Visual-Language-Action (VLA) models report impressive success rates on robotic manipulation benchmarks, yet these results may mask fundamental weaknesses in robustness. We perform a systematic vulnerability analysis by introducing controlled perturbations across seven dimensions: objects layout, camera viewpoints, robot initial states, language instructions, light conditions, background textures and sensor noise. We comprehensively analyzed multiple state-of-the-art models and revealed consistent brittleness beneath apparent competence. Our analysis exposes critical weaknesses: models exhibit
background However, standard VLA models do not explicitly model world dynamics ithey learn direct observation-to- action mappings without predicting how the environment changes under intervention[ 4]. This absence of predictive physical reasoning limits their generalization, where anticipating future states is essential. Equip- ping embodied policy models with world modeling capabilities thus emerges as a natural direction [ 5]. A growing body of recent work has begun integrating world models into the embo

co-cited works

representative citing papers

TAVIS: A Benchmark for Egocentric Active Vision and Anticipatory Gaze in Imitation Learning

cs.RO · 2026-05-08 · accept · novelty 8.0

TAVIS is a released benchmark showing active vision improves imitation learning in a task-dependent manner, multi-task policies struggle with shifts, and imitation produces human-like anticipatory gaze.

FlowHijack: A Dynamics-Aware Backdoor Attack on Flow-Matching Vision-Language-Action Models

cs.CV · 2026-03-30 · unverdicted · novelty 8.0

FlowHijack is the first dynamics-aware backdoor attack on flow-matching VLAs that achieves high success rates with stealthy triggers while preserving benign performance and making malicious actions kinematically indistinguishable from normal ones.

LIBERO-Safety: A Comprehensive Benchmark for Physical and Semantic Safety in Vision-Language-Action Models

cs.RO · 2026-06-22 · unverdicted · novelty 7.0

LIBERO-Safety supplies a scalable benchmark, data-generation pipeline, and 19,664-demonstration dataset that exposes a generalization-safety tension in current VLA models where diverse training improves collision avoidance but task success stays limited by trajectory quality and semantic understandi

ProbeAct: Probe-Guided Training-Free Failure Recovery in Vision-Language-Action Models

cs.RO · 2026-06-08 · unverdicted · novelty 7.0

PROBEACT is a plug-and-play intervention framework that combines hidden-state probing, kinematic failure detection, and CBF-based correction to boost success rates of pre-trained VLA models on the LIBERO-plus benchmark from 69.6% to 74.1%.

Beyond Binary Success: A Diagnostic Meta-Evaluation Framework for Fine-Grained Manipulation

cs.RO · 2026-05-19 · unverdicted · novelty 7.0

MetaFine reconstructs benchmarks into diagnostic scenarios to evaluate vision-language-action models on fine-grained manipulation, exposing dimension-specific failures and identifying the visual encoder as a key bottleneck.

Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR

cs.LG · 2026-05-19 · conditional · novelty 7.0

Pion modifies Muon's Newton-Schulz iterations into a controllable high-pass filter that anchors dominant singular values at 1 while suppressing noisy tails, outperforming Muon and AdamW in VLA and RLVR regimes.

From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation

cs.RO · 2026-05-12 · unverdicted · novelty 7.0

MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.

See What Matters: Differentiable Grid Sample Pruning for Generalizable Vision-Language-Action Model

cs.RO · 2026-05-12 · conditional · novelty 7.0 · 2 refs

GridS is a plug-and-play differentiable module for geometry-aware visual token resampling in VLA models that achieves under 10% token retention and 76% FLOPs reduction with no success-rate loss.

Beyond World-Frame Action Heads: Motion-Centric Action Frames for Vision-Language-Action Models

cs.AI · 2026-05-12 · unverdicted · novelty 7.0

MCF-Proto adds a motion-centric local action frame and prototype parameterization to VLA models, inducing emergent geometric structure and improved robustness from standard demonstrations alone.

LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

LoopVLA adds recurrent refinement and learned sufficiency estimation to VLA models, cutting parameters 45% and raising throughput 1.7x while matching baseline task success on LIBERO and VLA-Arena.

ECHO: Continuous Hierarchical Memory for Vision-Language-Action Models

cs.RO · 2026-05-09 · unverdicted · novelty 7.0

ECHO organizes VLA experiences into a hierarchical memory tree in hyperbolic space via autoencoder and entailment constraints, delivering a 12.8% success-rate gain on LIBERO-Long over the pi0 baseline.

OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

cs.RO · 2026-05-07 · unverdicted · novelty 7.0

OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.

Embodied Interpretability: Linking Causal Understanding to Generalization in Vision-Language-Action Models

cs.RO · 2026-05-01 · unverdicted · novelty 7.0

Introduces ISS and NMR as interventional metrics to diagnose causal misalignment in VLA policies and link it to generalization performance.

Being-H0.7: A Latent World-Action Model from Egocentric Videos

cs.RO · 2026-04-30 · unverdicted · novelty 7.0

Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

Mini-BEHAVIOR-Gran: Revealing U-Shaped Effects of Instruction Granularity on Language-Guided Embodied Agents

cs.AI · 2026-04-18 · unverdicted · novelty 7.0

Mini-BEHAVIOR-Gran benchmark reveals a U-shaped effect of instruction granularity on embodied agent performance, with planning-width correlating best and coarse instructions linked to vision-dominant shallow policies.

VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models

cs.RO · 2026-03-23 · unverdicted · novelty 7.0

VP-VLA decouples high-level reasoning from low-level control in VLA models by rendering spatial anchors as visual prompts directly in the RGB observation space, outperforming end-to-end baselines.

PlayWorld: Learning Robot World Models from Autonomous Play

cs.RO · 2026-03-09 · unverdicted · novelty 7.0

PlayWorld learns high-fidelity robot world models from unsupervised self-play, producing physically consistent video predictions that outperform models trained on human data and enabling 65% better real-world policy performance via model-based RL.

Human-Centric Transferable Tactile Pre-Training for Dexterous Robotic Manipulation

cs.RO · 2026-07-01 · unverdicted · novelty 6.0

Introduces H-Tac human tactile-action dataset and TTP pre-training that unifies spaces and predicts future tactile signals to improve robotic dexterous manipulation transfer.

ABot-M0.5: Unified Mobility-and-Manipulation World Action Model

cs.CV · 2026-07-01 · unverdicted · novelty 6.0

ABot-M0.5 proposes a unified mobility-and-manipulation world action model using three alignment strategies that achieves state-of-the-art performance on mobile and fine-grained manipulation benchmarks.

3D HAMSTER: Bridging Planning and Control in Hierarchical Vision Language Action Models through 3D Trajectory Guidance

cs.RO · 2026-06-30 · unverdicted · novelty 6.0 · 2 refs

3D HAMSTER adds depth encoding and reconstruction to VLMs to produce 3D waypoint sequences that feed directly into pointcloud policies, claiming better generalization than 2D baselines under shifts.

Sequential Planning via Anchored Robotic Keypoints

cs.RO · 2026-06-29 · unverdicted · novelty 6.0

SPARK reaches 43.7% success on six LIBERO-PRO cells by LLM-generated typed behavior trees plus multi-prompt perception and recovery, more than doubling CaP-Agent0 and VLA baselines.

Direct Action-Head Injection of A Grounded 3D Point Unlocks Spatial and Task Generalization

cs.RO · 2026-06-26 · unverdicted · novelty 6.0

Direct 3D point grounding injected into the action head via a two-layer MLP and adaptive layer norm boosts VLA success rates by 32-46 points on spatial and task perturbations in LIBERO-PRO.

$\omega$-EVA: Envision, Verify, and Act with Latent Interactive World Models

cs.RO · 2026-06-08 · unverdicted · novelty 6.0

ω-EVA is a three-stage latent world model framework that trains action-conditioned dynamics, a language-conditioned flow policy, and a tri-branch refiner to improve embodied action generation in simulation.

GEAR-VLA: Learning Geometry-Aware Action Representations for Generalizable Robotic Manipulation

cs.RO · 2026-06-07 · unverdicted · novelty 6.0

GEAR-VLA learns geometry-aware action representations via coarse-to-fine pretraining, gradient-decoupled DiT action expert, semantic-aligned 3D integration, and embodiment canonicalization, reporting SOTA results on LIBERO benchmarks and over 80% success on unseen embodiments and 212 unseen objects.

citing papers explorer

Showing 50 of 82 citing papers.

TAVIS: A Benchmark for Egocentric Active Vision and Anticipatory Gaze in Imitation Learning cs.RO · 2026-05-08 · accept · none · ref 57 · internal anchor
TAVIS is a released benchmark showing active vision improves imitation learning in a task-dependent manner, multi-task policies struggle with shifts, and imitation produces human-like anticipatory gaze.
FlowHijack: A Dynamics-Aware Backdoor Attack on Flow-Matching Vision-Language-Action Models cs.CV · 2026-03-30 · unverdicted · none · ref 10 · internal anchor
FlowHijack is the first dynamics-aware backdoor attack on flow-matching VLAs that achieves high success rates with stealthy triggers while preserving benign performance and making malicious actions kinematically indistinguishable from normal ones.
LIBERO-Safety: A Comprehensive Benchmark for Physical and Semantic Safety in Vision-Language-Action Models cs.RO · 2026-06-22 · unverdicted · none · ref 15 · internal anchor
LIBERO-Safety supplies a scalable benchmark, data-generation pipeline, and 19,664-demonstration dataset that exposes a generalization-safety tension in current VLA models where diverse training improves collision avoidance but task success stays limited by trajectory quality and semantic understandi
ProbeAct: Probe-Guided Training-Free Failure Recovery in Vision-Language-Action Models cs.RO · 2026-06-08 · unverdicted · none · ref 5 · internal anchor
PROBEACT is a plug-and-play intervention framework that combines hidden-state probing, kinematic failure detection, and CBF-based correction to boost success rates of pre-trained VLA models on the LIBERO-plus benchmark from 69.6% to 74.1%.
Beyond Binary Success: A Diagnostic Meta-Evaluation Framework for Fine-Grained Manipulation cs.RO · 2026-05-19 · unverdicted · none · ref 4 · internal anchor
MetaFine reconstructs benchmarks into diagnostic scenarios to evaluate vision-language-action models on fine-grained manipulation, exposing dimension-specific failures and identifying the visual encoder as a key bottleneck.
Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR cs.LG · 2026-05-19 · conditional · none · ref 7 · internal anchor
Pion modifies Muon's Newton-Schulz iterations into a controllable high-pass filter that anchors dominant singular values at 1 while suppressing noisy tails, outperforming Muon and AdamW in VLA and RLVR regimes.
From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation cs.RO · 2026-05-12 · unverdicted · none · ref 19 · internal anchor
MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.
See What Matters: Differentiable Grid Sample Pruning for Generalizable Vision-Language-Action Model cs.RO · 2026-05-12 · conditional · none · ref 1 · 2 links · internal anchor
GridS is a plug-and-play differentiable module for geometry-aware visual token resampling in VLA models that achieves under 10% token retention and 76% FLOPs reduction with no success-rate loss.
Beyond World-Frame Action Heads: Motion-Centric Action Frames for Vision-Language-Action Models cs.AI · 2026-05-12 · unverdicted · none · ref 15 · internal anchor
MCF-Proto adds a motion-centric local action frame and prototype parameterization to VLA models, inducing emergent geometric structure and improved robustness from standard demonstrations alone.
LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models cs.AI · 2026-05-11 · unverdicted · none · ref 37 · internal anchor
LoopVLA adds recurrent refinement and learned sufficiency estimation to VLA models, cutting parameters 45% and raising throughput 1.7x while matching baseline task success on LIBERO and VLA-Arena.
ECHO: Continuous Hierarchical Memory for Vision-Language-Action Models cs.RO · 2026-05-09 · unverdicted · none · ref 36 · internal anchor
ECHO organizes VLA experiences into a hierarchical memory tree in hyperbolic space via autoencoder and entailment constraints, delivering a 12.8% success-rate gain on LIBERO-Long over the pi0 baseline.
OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation cs.RO · 2026-05-07 · unverdicted · none · ref 19 · internal anchor
OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
Embodied Interpretability: Linking Causal Understanding to Generalization in Vision-Language-Action Models cs.RO · 2026-05-01 · unverdicted · none · ref 12 · internal anchor
Introduces ISS and NMR as interventional metrics to diagnose causal misalignment in VLA policies and link it to generalization performance.
Being-H0.7: A Latent World-Action Model from Egocentric Videos cs.RO · 2026-04-30 · unverdicted · none · ref 121 · internal anchor
Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
Mini-BEHAVIOR-Gran: Revealing U-Shaped Effects of Instruction Granularity on Language-Guided Embodied Agents cs.AI · 2026-04-18 · unverdicted · none · ref 2 · internal anchor
Mini-BEHAVIOR-Gran benchmark reveals a U-shaped effect of instruction granularity on embodied agent performance, with planning-width correlating best and coarse instructions linked to vision-dominant shallow policies.
VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models cs.RO · 2026-03-23 · unverdicted · none · ref 10 · internal anchor
VP-VLA decouples high-level reasoning from low-level control in VLA models by rendering spatial anchors as visual prompts directly in the RGB observation space, outperforming end-to-end baselines.
PlayWorld: Learning Robot World Models from Autonomous Play cs.RO · 2026-03-09 · unverdicted · none · ref 77 · internal anchor
PlayWorld learns high-fidelity robot world models from unsupervised self-play, producing physically consistent video predictions that outperform models trained on human data and enabling 65% better real-world policy performance via model-based RL.
Human-Centric Transferable Tactile Pre-Training for Dexterous Robotic Manipulation cs.RO · 2026-07-01 · unverdicted · none · ref 15 · internal anchor
Introduces H-Tac human tactile-action dataset and TTP pre-training that unifies spaces and predicts future tactile signals to improve robotic dexterous manipulation transfer.
ABot-M0.5: Unified Mobility-and-Manipulation World Action Model cs.CV · 2026-07-01 · unverdicted · none · ref 16 · internal anchor
ABot-M0.5 proposes a unified mobility-and-manipulation world action model using three alignment strategies that achieves state-of-the-art performance on mobile and fine-grained manipulation benchmarks.
3D HAMSTER: Bridging Planning and Control in Hierarchical Vision Language Action Models through 3D Trajectory Guidance cs.RO · 2026-06-30 · unverdicted · none · ref 8 · 2 links · internal anchor
3D HAMSTER adds depth encoding and reconstruction to VLMs to produce 3D waypoint sequences that feed directly into pointcloud policies, claiming better generalization than 2D baselines under shifts.
Sequential Planning via Anchored Robotic Keypoints cs.RO · 2026-06-29 · unverdicted · none · ref 17 · internal anchor
SPARK reaches 43.7% success on six LIBERO-PRO cells by LLM-generated typed behavior trees plus multi-prompt perception and recovery, more than doubling CaP-Agent0 and VLA baselines.
Direct Action-Head Injection of A Grounded 3D Point Unlocks Spatial and Task Generalization cs.RO · 2026-06-26 · unverdicted · none · ref 8 · internal anchor
Direct 3D point grounding injected into the action head via a two-layer MLP and adaptive layer norm boosts VLA success rates by 32-46 points on spatial and task perturbations in LIBERO-PRO.
$\omega$-EVA: Envision, Verify, and Act with Latent Interactive World Models cs.RO · 2026-06-08 · unverdicted · none · ref 8 · internal anchor
ω-EVA is a three-stage latent world model framework that trains action-conditioned dynamics, a language-conditioned flow policy, and a tri-branch refiner to improve embodied action generation in simulation.
GEAR-VLA: Learning Geometry-Aware Action Representations for Generalizable Robotic Manipulation cs.RO · 2026-06-07 · unverdicted · none · ref 11 · internal anchor
GEAR-VLA learns geometry-aware action representations via coarse-to-fine pretraining, gradient-decoupled DiT action expert, semantic-aligned 3D integration, and embodiment canonicalization, reporting SOTA results on LIBERO benchmarks and over 80% success on unseen embodiments and 212 unseen objects.
vla.cpp: A Unified Inference Runtime for Vision-Language-Action Models cs.RO · 2026-06-06 · conditional · none · ref 31 · internal anchor
vla.cpp is a unified C++ runtime that serves multiple VLA architectures with flow-matching and diffusion patterns, matching SOTA performance on LIBERO while running on low-memory embedded hardware.
3DThinkVLA: Endowing Vision-Language-Action Models with Latent 3D Priors via 3D-Thinking-Guided Co-training cs.CV · 2026-06-03 · unverdicted · none · ref 48 · internal anchor
A 3D-thinking-guided co-training method disentangles geometry perception and spatial reasoning to inject latent 3D priors into VLA models via adapters, achieving SOTA on manipulation benchmarks while running on 2D images only.
What Are We Actually Benchmarking in Robot Manipulation? cs.RO · 2026-06-02 · conditional · none · ref 18 · internal anchor
LIBERO and CALVIN fail multiple proposed diagnostics for shortcut solvability, statistical significance, overfitting, and data dependence, while a tiny 0.09B probe reaches near-SOTA on LIBERO.
Revisiting Embodied Chain-of-Thought for Generalizable Robot Manipulation cs.RO · 2026-06-02 · unverdicted · none · ref 15 · internal anchor
ERVLA trains on a 978k-trajectory embodied CoT corpus using reasoning as supervision with dropout, then predicts actions without CoT at test time, reaching 86.9% on LIBERO-Plus and 53.2% on VLABench.
FATE-VLA:Failue-aware test generation for vision-language-action models cs.RO · 2026-06-01 · unverdicted · none · ref 19 · internal anchor
FATE-VLA reframes VLA evaluation as active failure discovery and reports uncovering up to 29.7% more failures across four models while revealing diverse failure modes.
ELAN4D: Embodiment-Centric 4D Supervision for Vision-Language-Action Models via Plug-and-Play Adaptation cs.RO · 2026-05-28 · unverdicted · none · ref 11 · internal anchor
ELAN4D introduces plug-and-play 4D keypoint track supervision from forward kinematics to enhance VLA policy generalization in robotic manipulation tasks.
VLAConf: Calibrated Task-Success Confidence for Vision-Language-Action Models cs.RO · 2026-05-28 · unverdicted · none · ref 33 · internal anchor
VLAConf is a one-class discriminative method that estimates step-wise task-success confidence for VLA models via anomaly scoring on frozen representations plus step-conditioned modeling, shown to be more efficient than ensemble or probability baselines on LIBERO and real robots.
3DVLA: Enhancing Vision-Language-Action Models via 3D Spatial and Instance Understanding cs.RO · 2026-05-28 · unverdicted · none · ref 9 · internal anchor
3DVLA is a plug-and-play framework that enhances pretrained VLAs with pervasive 3D feature encoding using multi-view consistency and Spatially-Conditioned Geometry Aggregation, an instance estimation module, and a masked self-supervised 3D branch, yielding gains on LIBERO-Plus and RoboTwin 2.0.
Afford-VLA: Action-Aligned Visual Planning via Internalized Affordance cs.RO · 2026-05-22 · unverdicted · none · ref 7 · internal anchor
Afford-VLA internalizes task-conditioned affordance as an explicit visual planning interface within VLA models via learnable <AFF> tokens, achieving SOTA on LIBERO and SimplerEnv benchmarks.
Action with Visual Primitives cs.RO · 2026-05-21 · unverdicted · none · ref 31 · internal anchor
AVP architecture has VLM emit visual-primitive tokens to condition flow-matching action expert, yielding 27.61% higher success rate than pi_0.5 on real-robot pick-and-place tasks.
From a Single Demonstration to a General Policy for Contact-Rich Manipulation cs.RO · 2026-05-17 · unverdicted · none · ref 141 · internal anchor
A one-shot LfD framework abstracts a single demonstration into environmental-constraint primitives, then uses self-exploration, human corrections, and compliant recovery to produce a policy that generalizes across poses and geometries, achieving over 90% success on seven real-world multi-stage tasks
Feedback World Model Enables Precise Guidance of Diffusion Policy cs.RO · 2026-05-15 · unverdicted · none · ref 2 · internal anchor
Feedback world model closes the prediction-observation loop at inference time to correct errors and improve diffusion policy performance under distribution shift in robotics.
Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models cs.RO · 2026-05-13 · unverdicted · none · ref 7 · internal anchor
VLAs-as-Tools pairs a VLM planner with specialized VLA executors via a new interface and Tool-Aligned Post-Training to raise long-horizon robot success rates on LIBERO-Long and RoboTwin benchmarks.
GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization cs.RO · 2026-05-12 · unverdicted · none · ref 28 · 2 links · internal anchor
GuidedVLA improves VLA generalization by supervising individual attention heads with manually defined auxiliary signals for three task-relevant factors.
Retrieve-then-Steer: Online Success Memory for Test-Time Adaptation of Generative VLAs cs.RO · 2026-05-11 · unverdicted · none · ref 7 · 2 links · internal anchor
A retrieve-then-steer method stores successful robot actions in memory and uses them to steer a frozen VLA's flow-matching sampler for better test-time reliability without parameter updates.
Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation cs.RO · 2026-05-08 · unverdicted · none · ref 35 · internal anchor
Anchor-Centric Adaptation escapes the diversity trap by prioritizing repeated demonstrations at core anchors over broad coverage, yielding higher success rates under fixed data budgets in robotic manipulation.
PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations cs.AI · 2026-04-30 · unverdicted · none · ref 12 · internal anchor
PRTS pretrains VLA models with contrastive goal-conditioned RL to embed goal-reachability probabilities from offline data, yielding SOTA results on robotic benchmarks especially for long-horizon and novel instructions.
Breaking Lock-In: Preserving Steerability under Low-Data VLA Post-Training cs.RO · 2026-04-25 · unverdicted · none · ref 18 · internal anchor
DeLock mitigates lock-in in low-data VLA post-training via visual grounding preservation and test-time contrastive prompt guidance, outperforming baselines across eight evaluations while matching data-heavy generalist policies.
CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors cs.RO · 2026-04-23 · unverdicted · none · ref 27 · internal anchor
CorridorVLA improves VLA models by using predicted sparse anchors to impose explicit spatial corridors on action trajectories, yielding 3.4-12.4% success rate gains on LIBERO-Plus with GR00T-Corr reaching 83.21%.
Test-Time Perturbation Learning with Delayed Feedback for Vision-Language-Action Models cs.CV · 2026-04-20 · unverdicted · none · ref 6 · internal anchor
PDF improves VLA success rates on LIBERO and Atari by applying test-time perturbation learning with delayed feedback to correct trajectory overfitting and overconfidence.
Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models cs.RO · 2026-04-20 · unverdicted · none · ref 16 · internal anchor
State-of-the-art vision-language-action models catastrophically fail dynamic embodied reasoning due to lexical-kinematic shortcuts, behavioral inertia, and semantic feature collapse caused by architectural bottlenecks, as shown by the new BeTTER benchmark with real-world validation.
OFlow: Injecting Object-Aware Temporal Flow Matching for Robust Robotic Manipulation cs.RO · 2026-04-20 · unverdicted · none · ref 22 · internal anchor
OFlow unifies temporal foresight and object-aware reasoning inside a shared latent space via flow matching to improve VLA robustness in robotic manipulation under distribution shifts.
ProGAL-VLA: Grounded Alignment through Prospective Reasoning in Vision-Language-Action Models cs.RO · 2026-04-10 · unverdicted · none · ref 9 · internal anchor
ProGAL-VLA uses 3D graphs, symbolic sub-goals, and a Grounding Alignment Contrastive loss to ground actions on verified embeddings, raising robustness from 30.3% to 71.5% and ambiguity AUROC to 0.81 on robotic benchmarks.
Open-Ended Instruction Realization with LLM-Enabled Multi-Planner Scheduling in Autonomous Vehicles cs.RO · 2026-04-09 · unverdicted · none · ref 14 · internal anchor
LLM-driven multi-planner scheduling framework turns open-ended passenger instructions into safe, traceable control signals for autonomous vehicles while cutting query costs and matching specialized safety levels.
A1: A Fully Transparent Open-Source, Adaptive and Efficient Truncated Vision-Language-Action Model cs.RO · 2026-04-07 · unverdicted · none · ref 10 · internal anchor
A1 is a transparent VLA framework achieving state-of-the-art robot manipulation success with up to 72% lower latency via adaptive layer truncation and inter-layer flow matching.
Uncovering Linguistic Fragility in Vision-Language-Action Models via Diversity-Aware Red Teaming cs.RO · 2026-04-07 · unverdicted · none · ref 7 · internal anchor
DAERT generates diverse adversarial instructions via a uniform policy in RL to drop VLA task success rates from 93.33% to 5.85% on benchmarks with models like π0 and OpenVLA.

LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models

hub tools

citation-role summary

citation-polarity summary

claims ledger

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer