Foundation models yield less human-interpretable features than supervised vision transformers, with interpretability tied to activation locality and coarse semantic alignment rather than task performance.
Openvla: An open-source vision-language-action model
7 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 7representative citing papers
DreamAvoid uses a Dream Trigger, Action Proposer, and Dream Evaluator trained on success/failure/boundary data to let VLA policies avoid critical-phase failures via test-time future dreaming.
LoopVLA adds recurrent refinement and learned sufficiency estimation to VLA models, cutting parameters 45% and raising throughput 1.7x while matching baseline task success on LIBERO and VLA-Arena.
HEX introduces a state-centric framework with humanoid-aligned representations and mixture-of-experts proprioceptive prediction for coordinated whole-body control on bipedal humanoids.
PointACT proposes a 3D-aware dual-system VLA policy using multi-scale point-action interaction with bottleneck window self-attention, achieving 10% higher success rates on RLBench-10Tasks over prior pretrained VLAs.
Sword improves world model simulators for VLA policies by disentangling visual style from dynamics and bootstrapping latents for better consistency, outperforming baselines on LIBERO in generalization and RL post-training success.
MoT-HRA learns embodiment-agnostic human-intention priors from a curated 2.2M-episode human video dataset via a three-expert hierarchical vision-language-action model to improve robotic manipulation under distribution shift.
citing papers explorer
-
Capability $\neq$ Interpretability: Human Interpretability of Vision Foundation Models
Foundation models yield less human-interpretable features than supervised vision transformers, with interpretability tied to activation locality and coarse semantic alignment rather than task performance.
-
DreamAvoid: Critical-Phase Test-Time Dreaming to Avoid Failures in VLA Policies
DreamAvoid uses a Dream Trigger, Action Proposer, and Dream Evaluator trained on success/failure/boundary data to let VLA policies avoid critical-phase failures via test-time future dreaming.
-
LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models
LoopVLA adds recurrent refinement and learned sufficiency estimation to VLA models, cutting parameters 45% and raising throughput 1.7x while matching baseline task success on LIBERO and VLA-Arena.
-
HEX: Humanoid-Aligned Experts for Cross-Embodiment Whole-Body Manipulation
HEX introduces a state-centric framework with humanoid-aligned representations and mixture-of-experts proprioceptive prediction for coordinated whole-body control on bipedal humanoids.
-
PointACT: Vision-Language-Action Models with Multi-Scale Point-Action Interaction
PointACT proposes a 3D-aware dual-system VLA policy using multi-scale point-action interaction with bottleneck window self-attention, achieving 10% higher success rates on RLBench-10Tasks over prior pretrained VLAs.
-
Sword: Style-Robust World Models as Simulators via Dynamic Latent Bootstrapping for VLA Policy Post-Training
Sword improves world model simulators for VLA policies by disentangling visual style from dynamics and bootstrapping latents for better consistency, outperforming baselines on LIBERO in generalization and RL post-training success.
-
Learning Human-Intention Priors from Large-Scale Human Demonstrations for Robotic Manipulation
MoT-HRA learns embodiment-agnostic human-intention priors from a curated 2.2M-episode human video dataset via a three-expert hierarchical vision-language-action model to improve robotic manipulation under distribution shift.