MoTVLA: A vision-language-action model with unified fast-slow reasoning
3 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 3
representative citing papers
AsyncVLA adds asynchronous flow matching and a confidence rater to VLA models so they can generate actions on flexible schedules and selectively refine low-confidence tokens before execution.
citing papers explorer
- Mitigating State Aliasing in Vision-Language-Action Models via Inverse Dynamics Learning
  Inverse dynamics prediction is added as an auxiliary task to reduce state aliasing in VLA models by directly supervising the vision encoder on action-relevant visual distinctions using only standard observation-action pairs.
- Seeing Touch from Motion: A Unified Modality-Aware Visuo-Tactile Policy with Tactile Motion Correlation
  A visuo-tactile policy learning method that exploits tactile motion correlation for contact state distinction and Mixture-of-Transformers for cross-modal fusion.