Uni-NaVid unifies diverse embodied navigation tasks into one video-based vision-language-action model trained on 3.6 million samples from four sub-tasks, achieving state-of-the-art performance on benchmarks and real-world tests.
Gnm: A general navigation model to drive any robot
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2024 4roles
background 4polarities
background 4representative citing papers
Octo is an open-source transformer-based generalist robot policy pretrained on 800k trajectories that serves as an effective initialization for finetuning across diverse robotic platforms.
DROID is a new 76k-trajectory in-the-wild robot manipulation dataset spanning 564 scenes and 84 tasks that improves policy performance and generalization when used for training.
NaVid, a video-based VLM trained on 510k navigation and 763k web samples, achieves SOTA VLN performance using only monocular RGB video for next-step action planning in sim and real environments.
citing papers explorer
-
Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks
Uni-NaVid unifies diverse embodied navigation tasks into one video-based vision-language-action model trained on 3.6 million samples from four sub-tasks, achieving state-of-the-art performance on benchmarks and real-world tests.
-
Octo: An Open-Source Generalist Robot Policy
Octo is an open-source transformer-based generalist robot policy pretrained on 800k trajectories that serves as an effective initialization for finetuning across diverse robotic platforms.
-
DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset
DROID is a new 76k-trajectory in-the-wild robot manipulation dataset spanning 564 scenes and 84 tasks that improves policy performance and generalization when used for training.
-
NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation
NaVid, a video-based VLM trained on 510k navigation and 763k web samples, achieves SOTA VLN performance using only monocular RGB video for next-step action planning in sim and real environments.