NavGPT: Explicit reasoning in vision-and-language navigation with large language models

Gengze Zhou, Yicong Hong, Qi Wu · 2023 · arXiv 2305.16986

6 Pith papers cite this work. Polarity classification is still indexing.

6 Pith papers citing it

read on arXiv browse 6 citing papers

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

PhotoFlow: Agentic 3D Virtual Photography Missions

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

PhotoFlow is a closed-loop agent framework that searches for camera parameters in 3D scenes according to language intent and outperforms one-shot, reflection, and random baselines on the new VPhotoBench of 47 scenes and 141 missions.

How Far Are Large Multimodal Models from Human-Level Spatial Action? A Benchmark for Goal-Oriented Embodied Navigation in Urban Airspace

cs.AI · 2026-04-09 · unverdicted · novelty 7.0

Large multimodal models display emerging but limited spatial action capabilities in goal-oriented urban 3D navigation, remaining far from human-level performance with errors diverging rapidly after critical decision points.

Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks

cs.RO · 2024-12-09 · unverdicted · novelty 6.0

Uni-NaVid unifies diverse embodied navigation tasks into one video-based vision-language-action model trained on 3.6 million samples from four sub-tasks, achieving state-of-the-art performance on benchmarks and real-world tests.

NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation

cs.CV · 2024-02-24 · unverdicted · novelty 6.0

NaVid, a video-based VLM trained on 510k navigation and 763k web samples, achieves SOTA VLN performance using only monocular RGB video for next-step action planning in sim and real environments.

Floorplan2Guide: LLM-Guided Floorplan Parsing for BLV Indoor Navigation

cs.AI · 2025-12-13 · unverdicted · novelty 5.0

Floorplan2Guide uses LLMs to parse floor plans into navigable graphs, reporting up to 92% accuracy on short routes with 5-shot prompting and 15% gains from graph structure over direct visual reasoning for BLV indoor navigation.

Agent AI: Surveying the Horizons of Multimodal Interaction

cs.AI · 2024-01-07 · unverdicted · novelty 4.0

The paper defines Agent AI as interactive multimodal systems that perceive grounded data and generate embodied actions, arguing this approach can mitigate hallucinations in foundation models.

citing papers explorer

Showing 1 of 1 citing paper after filters.

Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks cs.RO · 2024-12-09 · unverdicted · none · ref 121
Uni-NaVid unifies diverse embodied navigation tasks into one video-based vision-language-action model trained on 3.6 million samples from four sub-tasks, achieving state-of-the-art performance on benchmarks and real-world tests.

NavGPT: Explicit reasoning in vision-and-language navigation with large language models

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer