Vision-and-Language Navigation for UAVs: Progress, Challenges, and a Research Roadmap

Hanxuan Chen; Hanzhong Guo; Jie Zheng; Ji Pei; Kangli Wang; Ruilong Ren; Shuai Yuan; Siqi Yang; Siwei Feng; Songsheng Cheng; Tianle Zeng; Xiangyue Wang

arXiv: 2604.13654 · v1 · submitted 2026-04-15 · 💻 cs.RO

Pith reviewed 2026-05-10 13:44 UTC · model grok-4.3

classification 💻 cs.RO

keywords UAV · Vision-and-Language Navigation · Embodied AI · Vision Language Models · Foundation Models · Research Roadmap · Simulation to Reality

The pith

UAV vision-and-language navigation has progressed from modular and deep learning methods to agentic systems using large foundation models, with a proposed roadmap for addressing deployment barriers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey organizes the field of UAV vision-and-language navigation around a taxonomy of methods. The taxonomy traces the move from early modular pipelines and deep learning techniques to modern systems driven by large models, such as vision-language models (VLMs) and vision-language-action (VLA) models, along with emerging generative world models. A reader would care because the survey maps the available resources (simulators, datasets, and evaluation metrics) and breaks down the obstacles to practical use, making it easier to see where research should head next for drones that follow natural-language instructions in real 3D settings. If the taxonomy holds, it would help coordinate work on the problems that currently keep UAVs from completing long-horizon tasks autonomously.

Core claim

The paper establishes a methodological taxonomy that charts the technological evolution from early modular and deep learning approaches to contemporary agentic systems driven by large foundation models, including VLMs, VLA models, and the emerging integration of generative world models with VLA architectures for physically-grounded reasoning. It reviews the ecosystem of simulators, datasets, and evaluation metrics, and conducts a critical analysis of the primary challenges impeding real-world deployment: the simulation-to-reality gap, robust perception in dynamic outdoor settings, reasoning with linguistic ambiguity, and efficient deployment of large models. It concludes by proposing a forward-looking research roadmap.

What carries the argument

The methodological taxonomy that organizes the evolution of UAV-VLN approaches from modular and deep learning to agentic foundation model systems.

Load-bearing premise

That the taxonomy comprehensively captures all important developments in the field and that the four challenges are the main barriers to real-world UAV deployment.

What would settle it

A review of the latest literature identifying multiple UAV-VLN approaches that fall outside the proposed taxonomy categories or uncovering additional primary challenges not listed in the survey.

Figures

Figures reproduced from arXiv: 2604.13654 by Hanxuan Chen, Hanzhong Guo, Jie Zheng, Ji Pei, Kangli Wang, Ruilong Ren, Shuai Yuan, Siqi Yang, Siwei Feng, Songsheng Cheng, Tianle Zeng, Xiangyue Wang.

**Figure 1.** An overview of the UAV-based Vision-and-Language Navigation (UAV-VLN) research landscape, illustrating the core ...

**Figure 2.** The UAV-VLN task modeled as a Partially Observable Markov Decision Process (POMDP). The agent receives an ...

**Figure 3.** A comparison of action space formulations for UAV navigation. Low-level continuous control offers fine-grained ...

**Figure 4.** A timeline of the methodological evolution in UAV Vision-and-Language Navigation (UAV-VLN). The figure charts ...

**Figure 5.** This figure illustrates the evolution of early learning approaches in UAV-VLN, categorized into three primary paradigms.

**Figure 6.** Two primary architectural paradigms for long-horizon navigation. Temporal history encoding (left) treats the task as ...

**Figure 7.** Architectural paradigms for foundation model-driven agents. VLMs can act as a deliberative cognitive core to generate ...

**Figure 8.** The evolution from single-engine simulators to multi-engine data generation pipelines. Platforms like OpenFly integrate ...

**Figure 9.** A comparison of evaluation paradigms. Holistic metrics like SPL provide a single score for task completion, while ...

**Figure 10.** A conceptual breakdown of the sim-to-real gap into its three constituent challenges: disparities in visual perception, ...

**Figure 11.** The typical multi-stage pipeline for sim-to-real transfer, progressing from pure simulation to Software-in-the-Loop ...

**Figure 12.** A comparison of three primary strategies for reasoning under uncertainty. Procedural reasoning decomposes ...

**Figure 13.** Three architectural paradigms for safety assurance in autonomous flight. Formal methods use explicit safety filters, ...

**Figure 14.** A comparison of coordination architectures for UAV swarms. (a) Centralized models rely on a single planner, offering ...

**Figure 15.** Illustration of a foundation model acting as a collaborative intelligence layer for an air-ground team. The model fuses ...
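
Figure 2 frames the task as a POMDP. For readers who want that framing spelled out, a generic instruction-conditioned POMDP can be written as follows. The symbols are standard textbook choices and are assumptions made here for illustration; they are not necessarily the survey's own notation.

```latex
% Generic instruction-conditioned POMDP (assumed notation, not taken from the survey).
% S: states (UAV pose plus environment), A: actions, Omega: observations,
% T: transition model, O: observation model, R: reward, gamma: discount factor.
\[
  \mathcal{M} = \langle \mathcal{S}, \mathcal{A}, T, R, \Omega, O, \gamma \rangle
\]
% I is the natural-language instruction. The agent never sees s_t directly;
% its policy conditions only on I and the observation/action history.
\[
  s_{t+1} \sim T(\cdot \mid s_t, a_t), \qquad
  o_t \sim O(\cdot \mid s_t), \qquad
  a_t \sim \pi(\cdot \mid I, o_{0:t}, a_{0:t-1})
\]
% Objective over a finite horizon H.
\[
  \pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{H} \gamma^{t} R(s_t, a_t) \right]
\]
```

The partial observability is what separates this from a plain MDP: the policy must carry history (or a learned memory) because a single onboard camera frame does not determine the UAV's state.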

Abstract

Vision-and-Language Navigation for Unmanned Aerial Vehicles (UAV-VLN) represents a pivotal challenge in embodied artificial intelligence, focused on enabling UAVs to interpret high-level human commands and execute long-horizon tasks in complex 3D environments. This paper provides a comprehensive and structured survey of the field, from its formal task definition to the current state of the art. We establish a methodological taxonomy that charts the technological evolution from early modular and deep learning approaches to contemporary agentic systems driven by large foundation models, including Vision-Language Models (VLMs), Vision-Language-Action (VLA) models, and the emerging integration of generative world models with VLA architectures for physically-grounded reasoning. The survey systematically reviews the ecosystem of essential resources (simulators, datasets, and evaluation metrics) that facilitates standardized research. Furthermore, we conduct a critical analysis of the primary challenges impeding real-world deployment: the simulation-to-reality gap, robust perception in dynamic outdoor settings, reasoning with linguistic ambiguity, and the efficient deployment of large models on resource-constrained hardware. By synthesizing current benchmarks and limitations, this survey concludes by proposing a forward-looking research roadmap to guide future inquiry into key frontiers such as multi-agent swarm coordination and air-ground collaborative robotics.
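
Among the evaluation metrics the abstract refers to, the Figure 9 caption names SPL (Success weighted by Path Length) as a holistic score. The sketch below computes SPL using its commonly cited definition. The `Episode` fields and the example numbers are illustrative assumptions, not values from the survey.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    success: bool         # did the agent stop within the success radius?
    shortest_path: float  # shortest start-to-goal path length, assumed > 0
    agent_path: float     # length of the path the agent actually flew


def spl(episodes: list[Episode]) -> float:
    """Success weighted by Path Length, averaged over all episodes."""
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for ep in episodes:
        if ep.success:
            # Efficiency term is at most 1 and shrinks as the agent detours.
            total += ep.shortest_path / max(ep.agent_path, ep.shortest_path)
        # Failed episodes contribute 0 regardless of path length.
    return total / len(episodes)


# Made-up episodes for illustration only.
episodes = [
    Episode(success=True, shortest_path=100.0, agent_path=100.0),  # optimal
    Episode(success=True, shortest_path=100.0, agent_path=200.0),  # detour
    Episode(success=False, shortest_path=100.0, agent_path=50.0),  # failed
]
score = spl(episodes)
print(f"SPL = {score:.3f}")
```

Because one number mixes success rate and path efficiency, SPL alone cannot tell apart an agent that often fails from one that succeeds through long detours.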

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper surveys Vision-and-Language Navigation for UAVs (UAV-VLN), defining the task and establishing a methodological taxonomy that traces evolution from early modular and deep-learning approaches to contemporary agentic systems based on VLMs, VLA models, and generative world models. It reviews the ecosystem of simulators, datasets, and evaluation metrics, critically analyzes four primary challenges (simulation-to-reality gap, robust perception in dynamic outdoor settings, reasoning with linguistic ambiguity, and efficient deployment of large models), and concludes with a forward-looking research roadmap covering multi-agent swarm coordination and air-ground collaboration.

Significance. If the taxonomy and challenge prioritization hold, the survey would organize a rapidly evolving subfield of embodied AI, synthesize benchmarks, and provide a useful roadmap for UAV navigation research. The structured progression from modular to foundation-model approaches and the explicit focus on real-world deployment barriers could help standardize evaluation and direct effort toward high-impact areas.

major comments (1)

[Taxonomy and Challenges sections] The central claim that the survey establishes a comprehensive methodological taxonomy and identifies the four primary challenges rests on an undocumented literature selection process. No section (including the taxonomy presentation or challenges analysis) specifies search databases, keywords, date ranges, inclusion/exclusion criteria, total papers screened, or quantitative breakdown of coverage per category. This absence makes it impossible to assess whether the reviewed works are representative or whether omitted issues (e.g., safety certification or regulatory constraints) are equally load-bearing.

minor comments (2)

[Abstract] The abstract and introduction could include a brief quantitative overview (e.g., number of papers reviewed or distribution across taxonomy categories) to give readers immediate context on scope.
[Resources and Evaluation Metrics sections] A summary table mapping taxonomy categories to representative papers, simulators, and metrics would improve readability and allow quick cross-referencing.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our survey. We address the major comment below and will update the manuscript accordingly.

Referee: [Taxonomy and Challenges sections] The central claim that the survey establishes a comprehensive methodological taxonomy and identifies the four primary challenges rests on an undocumented literature selection process. No section (including the taxonomy presentation or challenges analysis) specifies search databases, keywords, date ranges, inclusion/exclusion criteria, total papers screened, or quantitative breakdown of coverage per category. This absence makes it impossible to assess whether the reviewed works are representative or whether omitted issues (e.g., safety certification or regulatory constraints) are equally load-bearing.

Authors: We agree that documenting the literature selection process would enhance the transparency and reproducibility of the survey. Our review was comprehensive, drawing from key publications in the field, but it was not conducted as a formal systematic review with predefined protocols. To address this, we will revise the manuscript to include a dedicated 'Literature Review Methodology' subsection. This will specify the primary sources (arXiv, major robotics and AI conferences such as ICRA, IROS, CVPR, NeurIPS), search keywords (e.g., 'UAV VLN', 'drone vision language navigation', 'aerial embodied AI'), date range (papers published from 2015 onwards), and quantitative breakdown of coverage per category. Regarding potential omitted issues such as safety certification and regulatory constraints, we acknowledge their importance for real-world UAV deployment. We will incorporate a brief discussion of these in the challenges section and the research roadmap, noting them as complementary barriers alongside the four primary technical challenges we identified.

revision: yes
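
The selection protocol described in the response could be applied mechanically along the following lines. This is a minimal sketch under stated assumptions: the keywords and the 2015 cutoff come from the response, while the `Record` type, the category labels, and the placeholder entries are hypothetical and do not describe any real screened paper.

```python
from collections import Counter
from dataclasses import dataclass

# Keywords and cutoff year as quoted in the authors' response.
KEYWORDS = ("UAV VLN", "drone vision language navigation", "aerial embodied AI")
FIRST_YEAR = 2015


@dataclass
class Record:
    title: str
    abstract: str
    year: int
    category: str  # hypothetical taxonomy label assigned by a human reviewer


def normalize(text: str) -> str:
    """Lowercase, treat hyphens as spaces, and collapse whitespace."""
    return " ".join(text.lower().replace("-", " ").split())


def is_included(rec: Record) -> bool:
    """Inclusion rule: published in range and matching at least one keyword."""
    if rec.year < FIRST_YEAR:
        return False
    haystack = normalize(rec.title + " " + rec.abstract)
    return any(normalize(keyword) in haystack for keyword in KEYWORDS)


# Placeholder records, invented purely to exercise the filter.
screened = [
    Record("Placeholder A: a UAV-VLN baseline", "", 2023, "deep learning"),
    Record("Placeholder B: drone vision-language navigation with a VLM planner",
           "", 2025, "foundation model"),
    Record("Placeholder C: aerial embodied AI before the cutoff", "", 2014, "modular"),
    Record("Placeholder D: ground robot grasping", "", 2024, "modular"),
]

included = [rec for rec in screened if is_included(rec)]
coverage = Counter(rec.category for rec in included)

print(f"screened: {len(screened)}, included: {len(included)}")
for category, count in sorted(coverage.items()):
    print(f"  {category}: {count}")
```

Reporting the screened total, the included total, and the per-category counts together is what would let a reader judge how representative each taxonomy branch is.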

Circularity Check

0 steps flagged

No circularity: survey synthesizes external literature without self-referential derivations

This paper is a literature survey that defines a taxonomy of UAV-VLN approaches and lists challenges by reviewing prior external work. No equations, fitted parameters, predictions, or derivations appear in the abstract or described structure. The central claims rest on synthesis of cited literature rather than any reduction to the paper's own inputs, self-citations as load-bearing premises, or renamed ansatzes. Absence of documented search methods affects representativeness but does not create circularity under the specified patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a survey paper, the work does not introduce or rely on new free parameters, axioms, or invented entities; it reviews and categorizes content from the existing literature.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CosFly: Plan in the Matrix, Fly in the World
cs.RO · 2026-05 · unverdicted · novelty 6.0

CosFly introduces a box-structured planning and multimodal simulation pipeline for aerial target tracking in CARLA, paired with the public CosFly-Track dataset containing 250 trajectories and approximately 100,000 ren...

Reference graph

Works this paper leans on

299 extracted references · 299 canonical work pages · cited by 1 Pith paper · 16 internal anchors


work page 2026

[13] [13]

Zhang et al

X. Zhang et al. Logisticsvln: Vision-language navigation for low- altitude terminal delivery based on agentic uavs. In 2025 IEEE 28th International Conference on Intelligent Transportation Systems (ITSC), 2025

work page 2025

[14] [14]

D. H. Lee et al. A review on recent deep learning-based seman- tic segmentation for urban greenness measurement. Sensors, 24(7):2245, 2024

work page 2024

[15] [15]

Uavs meet agentic ai: A multidomain survey of autonomous aerial intelligence and agentic uavs,

R. Sapkota et al. Uavs meet agentic ai: A multidomain survey of autonomous aerial intelligence and agentic uavs. arXiv preprint arXiv:2506.08045, 2025

work page arXiv 2025

[16] [16]

Morando and G

L. Morando and G. Loianno. Spatial assisted human-drone col- laborative navigation and interaction through immersive mixed reality. In 2024 IEEE International Conference on Robotics and Automation (ICRA) , 2024

work page 2024

[17] [17]

A survey of large language model-powered spatial intelli- gence across scales: Advances in embodied agents, smart cities, and earth science.arXiv preprint arXiv:2504.09848, 2025

J. Feng et al. A survey of large language model-powered spatial intelligence across scales: Advances in embodied agents, smart cities, and earth science. arXiv preprint arXiv:2504.09848 , 2025

work page arXiv 2025

[18] [18]

Zhang et al

Y. Zhang et al. Vision-and-language navigation today and to- morrow: A survey in the era of foundation models. Transactions on Machine Learning Research, 2024. Survey Certification

work page 2024

[19] [19]

Firoozi et al

R. Firoozi et al. Foundation models in robotics: Applications, challenges, and the future. The International Journal of Robotics Research, 44(5):701–739, 2024

work page 2024

[20] [20]

Chen et al

S. Chen et al. Exploring embodied multimodal large models: Development, datasets, and future directions. Information Fusion, 122:103198, 2025

work page 2025

[21] [21]

Pure vision language action (vla) models: A comprehensive survey.arXiv preprint arXiv:2509.19012,

D. Zhang et al. Pure vision language action (vla) models: A comprehensive survey. arXiv preprint arXiv:2509.19012 , 2025

work page arXiv 2025

[22] [22]

Vision-language- action models: Concepts, progress, applications and chal- lenges.arXiv preprint arXiv:2505.04769,

R. Sapkota et al. Vision-language-action (vla) models: Con- cepts, progress, applications and challenges. arXiv preprint arXiv:2505.04769, 2025

work page arXiv 2025

[23] [23]

Embodied navigation foundation model

J. Zhang et al. Embodied Navigation Foundation Model. arXiv preprint arXiv:2509.12129, 2025

work page arXiv 2025

[24] [24]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

K. Black et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164 , 2024. 31

work page internal anchor Pith review Pith/arXiv arXiv 2024

[25] [25]

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

J. Bjorck et al. GR00T N1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734 , 2025

work page internal anchor Pith review arXiv 2025

[26] [26]

Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning

NVIDIA. Cosmos-Reason1: From physical common sense to embodied reasoning. arXiv preprint arXiv:2503.15558 , 2025

work page internal anchor Pith review arXiv 2025

[27] [27]

Anderson et al

P. Anderson et al. Vision-and-language navigation: Interpret- ing visually-grounded navigation instructions in real environ- ments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3674–3683, 2018

work page 2018

[28] [28]

Wu et al

W. Wu et al. Vision-language navigation: a survey and tax- onomy. Neural Computing and Applications , 36(7):3291–3316, 2024

work page 2024

[29] [29]

Liu et al

S. Liu et al. Aerialvln: Vision-and-language navigation for uavs. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , pp. 15384–15394, 2023

work page 2023

[30] [30]

Gao et al

Y. Gao et al. Openfly: A versatile toolchain and large-scale benchmark for aerial vision-language navigation. In Interna- tional Conference on Learning Representations, 2026

work page 2026

[31] [31]

Wu et al

R. Wu et al. Aeroduo: Aerial duo for uav-based vision and language navigation. In Proceedings of the 33rd ACM Interna- tional Conference on Multimedia, 2025

work page 2025

[32] [32]

Zhang et al

W. Zhang et al. CityNavAgent: Aerial Vision-and-Language Navigation with Hierarchical Semantic Planning and Global Memory. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pa- pers), pp. 31292–31309, Vienna, Austria, jul 2025. Association for Computational Linguistics

work page 2025

[33] [33]

Lee et al

J. Lee et al. Citynav: A large-scale dataset for real-world aerial navigation. In Proceedings of the IEEE/ CVF International Conference on Computer Vision (ICCV) , pp. 5912–5922, Octo- ber 2025

work page 2025

[34] [34]

L. H. K. Wong et al. A survey of robotic navigation and manipulation with physics simulators in the era of embodied ai. arXiv preprint arXiv:2505.01458 , 2025

work page arXiv 2025

[35] [35]

Gao et al

Y. Gao et al. Openfly: A comprehensive platform for aerial vision-language navigation. In International Conference on Learning Representations, 2026

work page 2026

[36] [36]

Liu et al

X. Liu et al. Indooruav: Benchmarking vision-language uav navigation in continuous indoor environments. In Proceedings of the AAAI Conference on Artificial Intelligence , volume 40, pp. 23864–23872, 2026

work page 2026

[37] [37]

Tian et al

Y. Tian et al. UA Vs Meet LLMs: Overviews and Perspectives Toward Agentic Low-Altitude Mobility. Information Fusion , 122:103158, 2025

work page 2025

[38] [38]

Javaid et al

S. Javaid et al. Large language models for uavs: Current state and pathways to the future. arXiv preprint arXiv:2405.01745 , 2024

work page arXiv 2024

[39] [39]

Zhao et al

X. Zhao et al. Agrivln: Vision-and-language navigation for agricultural robots. arXiv preprint arXiv:2508.07406 , 2025

work page arXiv 2025

[40] [40]

Torneiro et al

A. Torneiro et al. Towards general urban monitoring with vision-language models: A review, evaluation, and a research agenda. arXiv preprint arXiv:2510.12400 , 2025

work page arXiv 2025

[41] [41]

Osmani and D

K. Osmani and D. Schulz. Comprehensive investigation of un- manned aerial vehicles (uavs): An in-depth analysis of avionics systems. Sensors, 24(10):3064, 2024

work page 2024

[42] [42]

Xiao et al

A. Xiao et al. Foundation models for remote sensing and earth observation: A survey. IEEE Geoscience and Remote Sensing Magazine, 2024

work page 2024

[43] [43]

Weng et al

X. Weng et al. Vision-language modeling meets remote sens- ing: Models, datasets and perspectives. IEEE Geoscience and Remote Sensing Magazine , 2025

work page 2025

[44] [44]

Bu et al

Y. Bu et al. Advancement challenges in uav swarm formation control: A comprehensive review. Drones, 8(7):320, 2024

work page 2024

[45] [45]

Zhai et al

L. Zhai et al. Intelligent optimization algorithms for multi-uav path planning: A comprehensive review. IEEE Access, 13:1–1, 2025

work page 2025

[46] [46]

W. Y. H. Adoni et al. Investigation of autonomous multi- uav systems for target detection in distributed environment: Current developments and open challenges. Drones, 7(4):263, 2023

work page 2023

[47] [47]

Gong et al

Y. Gong et al. Safe and economical uav trajectory planning in low-altitude airspace: A hybrid drl-llm approach with compli- ance awareness. arXiv preprint arXiv:2506.08532 , 2025

work page arXiv 2025

[48] [48]

Cereda et al

E. Cereda et al. On-device self-supervised learning of visual per- ception tasks aboard hardware-limited nano-quadrotors. arXiv preprint arXiv:2403.04071, 2024

work page arXiv 2024

[49] [49]

Wang et al

J. Wang et al. Vision-based deep reinforcement learning of unmanned aerial vehicle (uav) autonomous navigation using privileged information. Drones, 8(12):782, 2024

work page 2024

[50] [50]

Sartori et al

M. Sartori et al. AI and vision based autonomous navigation of nano-drones in partially-known environments. In 2025 21st International Conference on Distributed Computing in Smart Systems and the Internet of Things (DCOSS-IoT) , 2025

work page 2025

[51] [51]

Doma et al

P. Doma et al. LLM-Enhanced Path Planning: Safe and Eﬀi- cient Autonomous Navigation with Instructional Inputs. arXiv preprint arXiv:2412.02655, 2024

work page arXiv 2024

[52] [52]

Frattolillo et al

F. Frattolillo et al. Scalable and cooperative deep reinforcement learning approaches for multi-uav systems: A systematic review. Drones, 7(4):236, 2023

work page 2023

[53] [53]

K. I. Qureshi et al. Multi-agent drl for air-to-ground com- munication planning in uav-enabled iot networks. Sensors, 24(20):6535, 2024

work page 2024

[54] [54]

Singla et al

A. Singla et al. Memory-based deep reinforcement learning for obstacle avoidance in UA V with limited environment knowl- edge. IEEE Transactions on Intelligent Transportation Sys- tems, 22(1):107–118, 2021

work page 2021

[55] [55]

Fan et al

Y. Fan et al. Aerial vision-and-dialog navigation. In Findings of the Association for Computational Linguistics: ACL 2023 , pp. 3043–3061, Toronto, Canada, July 2023. Association for Computational Linguistics

work page 2023

[56] [56]

Hong et al

H. Hong et al. Why only text: Empowering vision-and-language navigation with multi-modal prompts. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intel- ligence, IJCAI-24, pp. 839–847, 8 2024

work page 2024

[57] [57]

Liu et al

Z. Liu et al. ReasonGrounder: L VLM-Guided Hierarchical Feature Splatting for Open-Vocabulary 3D Visual Grounding and Reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , 2025

work page 2025

[58] [58]

Huang et al

C. Huang et al. Visual language maps for robot navigation. In 2023 IEEE International Conference on Robotics and Automa- tion (ICRA), pp. 9947–9954, 2023

work page 2023

[59] [59]

Liang et al

X. Liang et al. Real-time semantic octree mapping under aerial-ground cooperative system. Intelligent Service Robotics , 18(3):567–578, 2025

work page 2025

[60] [60]

Dagger diffusion navigation: Dagger boosted diffusion policy for vision-language navigation.arXiv preprint arXiv:2508.09444,

H. Shi et al. DAgger Diffusion Navigation: DAgger Boosted Diffusion Policy for Vision-Language Navigation. arXiv preprint arXiv:2508.09444, 2025

work page arXiv 2025

[61] [61]

Wu et al

H. Wu et al. Model-free uav navigation in unknown complex en- vironments using vision-based reinforcement learning. Drones, 9(8):566, 2025

work page 2025

[62] [62]

Wang et al

X. Wang et al. UA V-Flow Colosseo: A Real-World Benchmark for Flying-on-a-Word UA V Imitation Learning. In Thirty- ninth Conference on Neural Information Processing Systems Datasets and Benchmarks Track , 2025

work page 2025

[63] [63]

Song et al

X. Song et al. Towards long-horizon vision-language navigation: Platform, benchmark and method. In Proceedings of the IEEE/ CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12078–12088, 2025

work page 2025

[64] [64]

Saxena et al

P. Saxena et al. UA V-VLN: End-to-end vision language guided navigation for UA Vs. In 2025 European Conference on Mobile Robots (ECMR), pp. 1–6, 2025

work page 2025

[65] [65]

Krantz et al

J. Krantz et al. Waypoint models for instruction-guided naviga- tion in continuous environments. In Proceedings of the IEEE/ CVF International Conference on Computer Vision (ICCV) , pp. 15162–15171, 2021

work page 2021

[66] [66]

Sautenkov et al

O. Sautenkov et al. UA V-VLA: Vision-language-action sys- tem for large scale aerial mission generation. arXiv preprint arXiv:2501.05014, 2025

work page arXiv 2025

[67] [67]

Manjunath et al

T. Manjunath et al. Reprohrl: Towards multi-goal navigation in the real world using hierarchical agents. In AAAI Conference on Artificial Intelligence, RL Ready for Production Workshop , 2023

work page 2023

[68] [68]

Zhao et al

F. Zhao et al. Autonomous localized path planning algorithm for UA Vs based on TD3 strategy. Scientific Reports, 14(1):763, 2024

work page 2024

[69] [69]

Jiang et al

L. Jiang et al. Improving multi-UA V cooperative path-finding through multiagent experience learning. Applied Intelligence , 54:11103–11119, 2024

work page 2024

[70] [70]

Margapuri

V. Margapuri. Prompt informed reinforcement learning for vi- sual coverage path planning. arXiv preprint arXiv:2507.10284 , 2025

work page arXiv 2025

[71] [71]

Sanyal and K

S. Sanyal and K. Roy. Asma: An adaptive safety margin algorithm for vision-language drone navigation via scene-aware 32 control barrier functions. IEEE Robotics and Automation Letters, 10(8):7536–7543, 2025

work page 2025

[72] [72]

Human-Centric Aware UAV Trajectory Planning in Search and Rescue Missions Employing Multi- Objective Reinforcement Learning with AHP and Similarity -Based Experience Replay,

M. Ramezani and J. L. Sanchez-Lopez. Human-Centric Aware UA V Trajectory Planning in Search and Rescue Mis- sions Employing Multi-Objective Reinforcement Learning with AHP and Similarity-Based Experience Replay. arXiv preprint arXiv:2402.18487, 2024

work page arXiv 2024

[73] [73]

Wang et al

C. Wang et al. Uav path planning in multi-task environments with risks through natural language understanding. Drones, 7(3):147, 2023

work page 2023

[74] [74]

GPS denied IBVS-based navigation and collision avoidance of UA V using a low-cost RGB camera,

X. Wang et al. GPS denied IBVS-based navigation and collision avoidance of UA V using a low-cost RGB camera.arXiv preprint arXiv:2509.17435, 2025

work page arXiv 2025

[75] [75]

Xue and T

Z. Xue and T. Gonsalves. Vision based drone obstacle avoid- ance by deep reinforcement learning. AI, 2(3):366–380, 2021

work page 2021

[76] [76]

Chen et al

S. Chen et al. History aware multimodal transformer for vision- and-language navigation. In Advances in Neural Information Processing Systems, 2021

work page 2021

[77] [77]

Xu et al

H. Xu et al. GeoNav: Empowering MLLMs with dual-scale geospatial reasoning for language-goal aerial navigation. Pat- tern Recognition, 177:113365, 2026

work page 2026

[78] [78]

Target-grounded graph- aware transformer for aerial vision-and-dialog navigation,

Y. Su et al. Target-grounded graph-aware transformer for aerial vision-and-dialog navigation. arXiv preprint arXiv:2308.11561, 2023

work page arXiv 2023

[79] [79]

Trivla: a unified triple- system-based unified vision-language-action model for general robot control,

Z. Liu et al. Trivla: A triple-system-based unified vision- language-action model with episodic world modeling for general robot control. arXiv preprint arXiv:2507.01424 , 2025

work page arXiv 2025

[80] [80]

Li et al

T. Li et al. Skyvln: Vision-and-language navigation and nmpc control for uavs in urban environments. In 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 17199–17206, 2025

work page 2025