A multimodal transformer generates and caches interleaved text-image traces to guide closed-loop actions, achieving 92.4% success on LIBERO-Long and 95.5% average on LIBERO.
Eo-1: Interleaved vision-text-action pretraining for general robot control
9 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 4polarities
background 4representative citing papers
MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.
State-of-the-art vision-language-action models catastrophically fail dynamic embodied reasoning due to lexical-kinematic shortcuts, behavioral inertia, and semantic feature collapse caused by architectural bottlenecks, as shown by the new BeTTER benchmark with real-world validation.
MV-VDP jointly predicts multi-view RGB and heatmap videos via diffusion to achieve data-efficient, robust robotic manipulation policies.
Pose-VLA uses a decoupled two-stage pre-training with discrete pose tokens to extract universal 3D spatial priors from 3D datasets and robotic trajectories, achieving 79.5% success on RoboTwin 2.0 and 96.0% on LIBERO.
Legato trains flow-based VLA policies with schedule-shaped action-noise mixtures and randomized conditions to achieve smoother trajectories and ~10% faster task completion than real-time chunking across five real-world manipulation tasks.
AsyncVLA adds asynchronous flow matching and a confidence rater to VLA models so they can generate actions on flexible schedules and selectively refine low-confidence tokens before execution.
PointACT proposes a 3D-aware dual-system VLA policy using multi-scale point-action interaction with bottleneck window self-attention, achieving 10% higher success rates on RLBench-10Tasks over prior pretrained VLAs.
SubFlow restores full mode coverage in one-step flow matching by conditioning on sub-modes from semantic clustering, yielding higher diversity on ImageNet-256 while preserving FID.
citing papers explorer
-
Thinking in Text and Images: Interleaved Vision--Language Reasoning Traces for Long-Horizon Robot Manipulation
A multimodal transformer generates and caches interleaved text-image traces to guide closed-loop actions, achieving 92.4% success on LIBERO-Long and 95.5% average on LIBERO.
-
MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.
-
Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models
State-of-the-art vision-language-action models catastrophically fail dynamic embodied reasoning due to lexical-kinematic shortcuts, behavioral inertia, and semantic feature collapse caused by architectural bottlenecks, as shown by the new BeTTER benchmark with real-world validation.
-
Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model
MV-VDP jointly predicts multi-view RGB and heatmap videos via diffusion to achieve data-efficient, robust robotic manipulation policies.
-
Universal Pose Pretraining for Generalizable Vision-Language-Action Policies
Pose-VLA uses a decoupled two-stage pre-training with discrete pose tokens to extract universal 3D spatial priors from 3D datasets and robotic trajectories, achieving 79.5% success on RoboTwin 2.0 and 96.0% on LIBERO.
-
Learning Native Continuation for Action Chunking Flow Policies
Legato trains flow-based VLA policies with schedule-shaped action-noise mixtures and randomized conditions to achieve smoother trajectories and ~10% faster task completion than real-time chunking across five real-world manipulation tasks.
-
AsyncVLA: Asynchronous Flow Matching for Vision-Language-Action Models
AsyncVLA adds asynchronous flow matching and a confidence rater to VLA models so they can generate actions on flexible schedules and selectively refine low-confidence tokens before execution.
-
PointACT: Vision-Language-Action Models with Multi-Scale Point-Action Interaction
PointACT proposes a 3D-aware dual-system VLA policy using multi-scale point-action interaction with bottleneck window self-attention, achieving 10% higher success rates on RLBench-10Tasks over prior pretrained VLAs.
-
SubFlow: Sub-mode Conditioned Flow Matching for Diverse One-Step Generation
SubFlow restores full mode coverage in one-step flow matching by conditioning on sub-modes from semantic clustering, yielding higher diversity on ImageNet-256 while preserving FID.