Training VLAs to perform embodied chain-of-thought reasoning about plans, sub-tasks, motions, and grounded visual features before acting raises OpenVLA success rates by 28% on challenging generalization tasks without new robot data.
Gemini: A family of highly capable multimodal models
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
representative citing papers
Steerable VLAs trained on rich synthetic commands at subtask, motion, and pixel levels enable VLMs to steer robot behavior more effectively, outperforming prior hierarchical baselines on real-world manipulation and generalization tasks.
Extending language model context length enables LMMs to process over 200K visual tokens from long videos without video training, achieving SOTA on Video-MME via dense frame sampling.
RoboCasa supplies a large-scale kitchen simulator, generative assets, 100 tasks, and automated data pipelines that produce a clear scaling trend in imitation learning for generalist robots.
citing papers explorer
-
Robotic Control via Embodied Chain-of-Thought Reasoning
Training VLAs to perform embodied chain-of-thought reasoning about plans, sub-tasks, motions, and grounded visual features before acting raises OpenVLA success rates by 28% on challenging generalization tasks without new robot data.
-
Steerable Vision-Language-Action Policies for Embodied Reasoning and Hierarchical Control
Steerable VLAs trained on rich synthetic commands at subtask, motion, and pixel levels enable VLMs to steer robot behavior more effectively, outperforming prior hierarchical baselines on real-world manipulation and generalization tasks.
-
Long Context Transfer from Language to Vision
Extending language model context length enables LMMs to process over 200K visual tokens from long videos without video training, achieving SOTA on Video-MME via dense frame sampling.
-
RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots
RoboCasa supplies a large-scale kitchen simulator, generative assets, 100 tasks, and automated data pipelines that produce a clear scaling trend in imitation learning for generalist robots.