Robust navigation with language pretraining and stochastic sampling
3 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.CV (3)
verdicts: UNVERDICTED (3)
roles: background (2)
polarities: background (2)
citing papers explorer
- NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation
  NaVid, a video-based VLM trained on 510k navigation and 763k web samples, achieves SOTA VLN performance using only monocular RGB video for next-step action planning in sim and real environments.
- The Essence of Balance for Self-Improving Agents in Vision-and-Language Navigation
  SDB balances behavioral diversity and learning stability in VLN self-improvement by expanding decisions into latent hypotheses, performing reliability-aware aggregation, and applying a regularizer, yielding gains such as SPL 33.73 to 35.93 on REVERIE val-unseen.
- MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile Devices
  MobileVLM achieves on-par performance with much larger vision-language models on standard benchmarks while delivering state-of-the-art inference speeds of 21.5 tokens per second on Snapdragon 888 CPU and 65.3 on Jetson Orin GPU.