NavGPT: Explicit reasoning in vision-and-language navigation with large language models

Zhou et al. · 2023 · arXiv 2305.16986

7 Pith papers cite this work. Polarity classification is still indexing.


citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

PhotoFlow: Agentic 3D Virtual Photography Missions

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

PhotoFlow is a closed-loop agent framework that searches for camera parameters in 3D scenes according to language intent and outperforms one-shot, reflection, and random baselines on the new VPhotoBench of 47 scenes and 141 missions.

How Far Are Large Multimodal Models from Human-Level Spatial Action? A Benchmark for Goal-Oriented Embodied Navigation in Urban Airspace

cs.AI · 2026-04-09 · unverdicted · novelty 7.0

Large multimodal models display emerging but limited spatial action capabilities in goal-oriented urban 3D navigation, remaining far from human-level performance with errors diverging rapidly after critical decision points.

The Sword, Shield, and Achilles' Heel: Characterizing the Linguistic Inductive Bias of Large Language Models for Spatial Reasoning in Navigation Planning

cs.CL · 2026-05-29 · unverdicted · novelty 6.0

Experiments reveal that topological cues robustly support LLM navigation planning while incorrect semantic cues derail it, with linguistic format effects varying by model size and compression.

Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks

cs.RO · 2024-12-09 · unverdicted · novelty 6.0

Uni-NaVid unifies diverse embodied navigation tasks into one video-based vision-language-action model trained on 3.6 million samples from four sub-tasks, achieving state-of-the-art performance on benchmarks and real-world tests.

NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation

cs.CV · 2024-02-24 · unverdicted · novelty 6.0

NaVid, a video-based VLM trained on 510k navigation and 763k web samples, achieves SOTA VLN performance using only monocular RGB video for next-step action planning in sim and real environments.

Floorplan2Guide: LLM-Guided Floorplan Parsing for BLV Indoor Navigation

cs.AI · 2025-12-13 · unverdicted · novelty 5.0

Floorplan2Guide uses LLMs to parse floor plans into navigable graphs, reporting up to 92% accuracy on short routes with 5-shot prompting and 15% gains from graph structure over direct visual reasoning for BLV indoor navigation.

Agent AI: Surveying the Horizons of Multimodal Interaction

cs.AI · 2024-01-07 · unverdicted · novelty 4.0

The paper defines Agent AI as interactive multimodal systems that perceive grounded data and generate embodied actions, arguing this approach can mitigate hallucinations in foundation models.

citing papers explorer

Showing 7 of 7 citing papers.

PhotoFlow: Agentic 3D Virtual Photography Missions
cs.CV · 2026-05-22 · unverdicted · none · ref 35

How Far Are Large Multimodal Models from Human-Level Spatial Action? A Benchmark for Goal-Oriented Embodied Navigation in Urban Airspace
cs.AI · 2026-04-09 · unverdicted · none · ref 67

The Sword, Shield, and Achilles' Heel: Characterizing the Linguistic Inductive Bias of Large Language Models for Spatial Reasoning in Navigation Planning
cs.CL · 2026-05-29 · unverdicted · none · ref 39

Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks
cs.RO · 2024-12-09 · unverdicted · none · ref 121

NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation
cs.CV · 2024-02-24 · unverdicted · none · ref 121

Floorplan2Guide: LLM-Guided Floorplan Parsing for BLV Indoor Navigation
cs.AI · 2025-12-13 · unverdicted · none · ref 4

Agent AI: Surveying the Horizons of Multimodal Interaction
cs.AI · 2024-01-07 · unverdicted · none · ref 112
