Systematic tests of VLM backbones, policy architectures, and cross-embodiment data yield RoboVLMs that set new SOTA on robot manipulation benchmarks while requiring few manual designs.
URL https://arxiv.org/abs/ 2408.14368
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
fields
cs.RO 2verdicts
UNVERDICTED 2representative citing papers
EL3DD extends latent 3D diffusion with language inputs and reference demonstrations to improve success rates on sequential manipulation tasks in the CALVIN dataset.
citing papers explorer
-
What Matters in Building Vision-Language-Action Models for Generalist Robots
Systematic tests of VLM backbones, policy architectures, and cross-embodiment data yield RoboVLMs that set new SOTA on robot manipulation benchmarks while requiring few manual designs.
-
EL3DD: Extended Latent 3D Diffusion for Language Conditioned Multitask Manipulation
EL3DD extends latent 3D diffusion with language inputs and reference demonstrations to improve success rates on sequential manipulation tasks in the CALVIN dataset.