STRONG-VLA uses decoupled two-stage training to improve VLA model robustness, yielding up to 16% higher task success rates under seen and unseen perturbations on the LIBERO benchmark.
Eva-VLA: Evaluating vision-language-action models’ robustness under real-world physical variations
11 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
method 1
polarities
use method 1
representative citing papers
Thermally activated clothing with thermochromic dyes and heaters creates dynamic adversarial patterns that evade AI surveillance in visible and infrared modalities while appearing ordinary when inactive.
SPARK reaches 43.7% success on six LIBERO-PRO cells by combining LLM-generated typed behavior trees with multi-prompt perception and recovery, more than doubling CaP-Agent0 and VLA baselines.
FATE-VLA reframes VLA evaluation as active failure discovery and reports uncovering up to 29.7% more failures across four models while revealing diverse failure modes.
RoboStressBench decomposes visual stress into four physically grounded dimensions to benchmark VLM robustness in embodied scenes and proposes a stress-aware solver.
Any VLA policy satisfies I(A*; Aπ) + [I(Aπ; Ãπ) − I(Aπ; δ)] ≤ H(A*) + I(X; X̃) by two applications of the Data Processing Inequality.
VLA benchmark success rates cannot distinguish semantic generalization from physical reasoning due to an identifiability gap in current evaluation protocols.
VLA models exhibit joint-dependent success degradation under realistic physical faults, which J-PARC mitigates via latent regime inference and residual action correction.
VLAMotor exposes VLA failures via distance-aware uncertainty testing and synthesizes agent-planned repair data to fine-tune models, reporting 49.25% success rate gains in simulation and 57.5% on hardware.
AFIL trains dual action generators on success and failure rollouts from a pretrained VLA to steer diffusion policies away from failure modes during inference.
Changes in Chain-of-Causation explanations under sensor perturbations correlate with 5.3× higher trajectory deviation in a driving VLA, and enabling such explanations yields 11.8% better accuracy.
citing papers explorer
- Position: Vision-Language-Action Models Cannot Be Verified to Perform Physical Reasoning
VLA benchmark success rates cannot distinguish semantic generalization from physical reasoning due to an identifiability gap in current evaluation protocols.