NaVid, a video-based VLM trained on 510k navigation and 763k web samples, achieves SOTA VLN performance using only monocular RGB video for next-step action planning in sim and real environments.
Vision and language navigation in the real world via online visual language mapping
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 3roles
background 1polarities
background 1representative citing papers
MAGNIFIED applies RL fine-tuning to MLLMs for autonomous driving motion planning, yielding over 10.5% lower overlap rate and 38.9% lower off-road rate than SFT baseline on Waymo Open Motion Dataset.
HSAN integrates hierarchical semantic graphs, optimal transport-based goal selection, and graph-aware RL to claim SOTA results on VLN-CE tasks.
citing papers explorer
-
NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation
NaVid, a video-based VLM trained on 510k navigation and 763k web samples, achieves SOTA VLN performance using only monocular RGB video for next-step action planning in sim and real environments.
-
MAGNIFIED: RL Fine-tuning of Multimodal Large Language Models for Motion Planning
MAGNIFIED applies RL fine-tuning to MLLMs for autonomous driving motion planning, yielding over 10.5% lower overlap rate and 38.9% lower off-road rate than SFT baseline on Waymo Open Motion Dataset.
-
Hierarchical Semantic-Augmented Navigation: Optimal Transport and Graph-Driven Reasoning for Vision-Language Navigation
HSAN integrates hierarchical semantic graphs, optimal transport-based goal selection, and graph-aware RL to claim SOTA results on VLN-CE tasks.