TAP-VLA improves VLA performance in contact-rich manipulation by visually annotating tactile shear fields onto input images, reaching 78% success versus under 50% for vision-only and other tactile methods.
hub
InProceedings of Robotics: Science and Systems, LosAngeles, CA, USA, June 2025
18 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
years
2026 18roles
background 1polarities
background 1representative citing papers
Distillation from frontier VLMs plus E-RLVR regularization produces a 4B local model that achieves 34.5% SR on OVON while cutting inference latency by 82.8%.
CoStream composes semantic, predictive, and reactive behaviors on an SE(3) interface to enable precise, generalizable performance on eight real-world contact-rich manipulation tasks.
Reflective VLA improves VLA generalization on LIBERO-Plus and LIBERO-Plus-Hard by 5.4 and 4.2 percentage points by conditioning on action consequences instead of reactive single-frame inputs.
dVLA-RL models denoising as an MDP to enable RL on dVLAs via trajectory probabilities, reporting 99.7% success on LIBERO and 30.6% gains over SFT on RoboTwin 2.0.
ACE-Ego-0 is a VLA pretraining framework that turns egocentric human videos into robot-format pseudo-actions via a video-to-action pipeline and trains jointly with robot data under a reliability-aware objective.
A VLA policy using view-selective visual routing and interaction-aware action MoE improves average success by 27.7% in simulation and 43.3% in real-world bimanual tasks over monolithic baselines.
DIRECT is a multimodal-context router that allocates test-time compute across chain-of-thought depth, model size, and memory history for VLM embodied planners, improving the success-cost Pareto frontier and matching stronger models at up to 65% lower latency on benchmarks and a physical Franka arm.
Adding recurrent memory tokens to VLA models raises success rates on partially observable manipulation tasks from 0.42 to 0.84 on training and 0.07 to 0.23 on held-out tasks while preserving performance under full observability.
CT-VAM is a 68M-parameter cerebello-thalamic-inspired model that achieves competitive LIBERO success rates with lower inference latency than larger VLA models by using a stream-separated attention decoder called TARS.
EmbodimentSemantic is a spatial scene-graph dataset and benchmark for evaluating relational grounding in vision-language models on embodied manipulation trajectories.
SDP constructs sets of desired action-chunks from human correction pairs and trains diffusion policies to align with those sets, yielding better performance and robustness than standard behavior cloning on robotic tasks.
Continuous Reasoning for VLA introduces a shared Gaussian latent for continuous thoughts, trained with self-verification to improve action prediction on LIBERO-PRO and real robots.
Decouples action-free video world models from embodiment-specific IDMs using Jacobian-based translation to achieve zero-shot cross-embodiment robot policies.
R&B-EnCoRe uses self-supervised importance-weighted variational inference to distill action-predictive reasoning datasets that improve VLA performance on manipulation, navigation, and driving tasks without external verifiers.
PACT is a self-evolving post-training framework that projects diffusion policies onto constraint-feasible regions via reverse-KL distillation and a tightening curriculum, reporting 31% fewer safety violations and 30.7% higher task success on embodied manipulation benchmarks.
Biasing the training time distribution toward high-noise states enables one-step action generation in VLA models that matches or exceeds ten-step decoding on LIBERO benchmarks and real-robot tasks.
RouterVLA reports that a simple probe-success rule from outcome-separated smoke tests raises held-out VLA success by 14.64pp on 34,752 LIBERO-Plus records, with learned scorers adding no further gain.
citing papers explorer
-
TAP-VLA: Tactile Annotation Prompting for Vision Language Action Models
TAP-VLA improves VLA performance in contact-rich manipulation by visually annotating tactile shear fields onto input images, reaching 78% success versus under 50% for vision-only and other tactile methods.
-
LocalNav: Distilling Frontier VLMs and Embodied RL for On-Device Object Goal Navigation
Distillation from frontier VLMs plus E-RLVR regularization produces a 4B local model that achieves 34.5% SR on OVON while cutting inference latency by 82.8%.
-
CoStream: Composing Simple Behaviors for Generalizable Complex Manipulation
CoStream composes semantic, predictive, and reactive behaviors on an SE(3) interface to enable precise, generalizable performance on eight real-world contact-rich manipulation tasks.
-
Reflective VLA: In-Context Action Consequences Make VLAs Generalize
Reflective VLA improves VLA generalization on LIBERO-Plus and LIBERO-Plus-Hard by 5.4 and 4.2 percentage points by conditioning on action consequences instead of reactive single-frame inputs.
-
dVLA-RL: Reinforcement Learning over Denoising Trajectories for Discrete Diffusion Vision-Language-Action Models
dVLA-RL models denoising as an MDP to enable RL on dVLAs via trajectory probabilities, reporting 99.7% success on LIBERO and 30.6% gains over SFT on RoboTwin 2.0.
-
ACE-Ego-0: Unifying Egocentric Human and Robotic Data for VLA Pretraining
ACE-Ego-0 is a VLA pretraining framework that turns egocentric human videos into robot-format pseudo-actions via a video-to-action pipeline and trains jointly with robot data under a reliability-aware objective.
-
See Selectively, Act Adaptively: Dual-Level Structural Decomposition for Bimanual Robot Manipulation
A VLA policy using view-selective visual routing and interaction-aware action MoE improves average success by 27.7% in simulation and 43.3% in real-world bimanual tasks over monolithic baselines.
-
DIRECT: When and Where Should You Allocate Test-Time Compute in Embodied Planners?
DIRECT is a multimodal-context router that allocates test-time compute across chain-of-thought depth, model size, and memory history for VLM embodied planners, improving the success-cost Pareto frontier and matching stronger models at up to 65% lower latency on benchmarks and a physical Franka arm.
-
$\mu$VLA: On Recurrent Memory for Partially Observable Manipulation in VLA Models
Adding recurrent memory tokens to VLA models raises success rates on partially observable manipulation tasks from 0.42 to 0.84 on training and 0.07 to 0.23 on held-out tasks while preserving performance under full observability.
-
CT-VAM: A Cerebello-Thalamic-Inspired Vision-Action Model for Efficient Visuomotor Control
CT-VAM is a 68M-parameter cerebello-thalamic-inspired model that achieves competitive LIBERO success rates with lower inference latency than larger VLA models by using a stream-separated attention decoder called TARS.
-
EmbodimentSemantic: A Spatial Scene-Graph Dataset and Benchmark for Vision-Language Models on Embodied Manipulation Trajectories
EmbodimentSemantic is a spatial scene-graph dataset and benchmark for evaluating relational grounding in vision-language models on embodied manipulation trajectories.
-
Set-Supervised Diffusion Policy: Learning Action-Chunking Diffusion through Corrections
SDP constructs sets of desired action-chunks from human correction pairs and trains diffusion policies to align with those sets, yielding better performance and robustness than standard behavior cloning on robotic tasks.
-
Continuous Reasoning for Vision-Language-Action
Continuous Reasoning for VLA introduces a shared Gaussian latent for continuous thoughts, trained with self-verification to improve action prediction on LIBERO-PRO and real robots.
-
Turning Video Models into Generalist Robot Policies
Decouples action-free video world models from embodiment-specific IDMs using Jacobian-based translation to achieve zero-shot cross-embodiment robot policies.
-
Self-Supervised Bootstrapping of Action-Predictive Embodied Reasoning
R&B-EnCoRe uses self-supervised importance-weighted variational inference to distill action-predictive reasoning datasets that improve VLA performance on manipulation, navigation, and driving tasks without external verifiers.
-
PACT: Self-Evolving Physical Safety Alignment for Diffusion Policies in Embodied Manipulation
PACT is a self-evolving post-training framework that projects diffusion policies onto constraint-feasible regions via reverse-KL distillation and a tightening curriculum, reporting 31% fewer safety violations and 30.7% higher task success on embodied manipulation benchmarks.
-
RouterVLA: Turning Smoke Tests into Supervision for Heterogeneous VLA Selection
RouterVLA reports that a simple probe-success rule from outcome-separated smoke tests raises held-out VLA success by 14.64pp on 34,752 LIBERO-Plus records, with learned scorers adding no further gain.