Recognition: 2 Lean theorem links
Vision-Language Navigation for Aerial Robots: Towards the Era of Large Language Models
Pith reviewed 2026-05-10 18:16 UTC · model grok-4.3
The pith
Aerial VLN methods fall into five architectural categories but face seven specific gaps in language grounding, continuous control, and real-world deployment as the field scales up with LLMs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors formally define Aerial VLN with single-instruction and dialog-based interaction paradigms, place existing methods into five architectural categories (sequence-to-sequence and attention-based, end-to-end LLM/VLM, hierarchical, multi-agent, and dialog-based), analyze design rationales and performance trade-offs on shared benchmarks, document shortcomings in current datasets and metrics, and synthesize seven concrete open problems: long-horizon instruction grounding, viewpoint robustness, scalable spatial representation, continuous 6-DoF action execution, onboard deployment, benchmark standardization, and multi-UAV swarm navigation.
What carries the argument
A five-category taxonomy of Aerial VLN architectures that structures the comparison of discrete versus continuous actions, end-to-end versus hierarchical designs, and simulation-to-reality gaps.
If this is right
- Progress on long-horizon instruction grounding would allow drones to execute multi-step missions from a single command without intermediate human input.
- Solutions for continuous 6-DoF action execution would reduce the simulation-to-reality gap compared with discrete action spaces.
- Standardized benchmarks with greater environmental diversity would enable direct cross-method comparisons that current platforms do not support.
- Onboard deployment research would shift focus from cloud-dependent models to resource-constrained UAV hardware.
- Multi-UAV swarm navigation methods would extend single-agent techniques to coordinated teams following shared language instructions.
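The discrete-versus-continuous trade-off behind the 6-DoF point above can be made concrete with a minimal sketch. The action names, velocity bounds, and mapping below are illustrative assumptions for exposition, not taken from any method the survey reviews.

```python
from dataclasses import dataclass
from typing import Tuple

# A typical discrete action vocabulary used by graph- or grid-based agents
# (illustrative assumption, not the survey's canonical set).
DISCRETE_ACTIONS = ("forward", "backward", "left", "right",
                    "up", "down", "yaw_left", "yaw_right", "stop")

@dataclass
class ContinuousAction:
    """6-DoF velocity command: linear v (m/s) and angular omega (rad/s)."""
    v: Tuple[float, float, float]
    omega: Tuple[float, float, float]

def _clamp(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))

def to_velocity_command(a, v_max: float = 2.0, w_max: float = 1.0) -> ContinuousAction:
    """Map a normalized policy output in [-1, 1]^6 to a bounded 6-DoF command."""
    assert len(a) == 6
    v = tuple(_clamp(x, -1.0, 1.0) * v_max for x in a[:3])
    omega = tuple(_clamp(x, -1.0, 1.0) * w_max for x in a[3:])
    return ContinuousAction(v, omega)

cmd = to_velocity_command([0.5, 0.0, -0.25, 0.0, 0.0, 1.5])
print(cmd.v)      # (1.0, 0.0, -0.5)
print(cmd.omega)  # (0.0, 0.0, 1.0) -- yaw rate saturated at w_max
```

A discrete policy only ever emits one of nine tokens and depends on a low-level controller to realize them, which is where the simulation-to-reality gap the survey highlights tends to open up; the continuous variant hands the controller a bounded velocity target directly.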
Where Pith is reading between the lines
- The survey's emphasis on continuous actions suggests that hierarchical methods may scale better to real flights than purely end-to-end LLM pipelines.
- Benchmark standardization could accelerate progress in the same way shared simulators did for ground-based VLN.
- Viewpoint robustness and scalable spatial representation together point to the need for explicit 3D world models rather than 2D image features alone.
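The last observation, that viewpoint robustness and scalable spatial representation jointly point toward explicit 3D world models, can be illustrated with a minimal semantic voxel map. The class, resolution, and labels below are assumptions for illustration, not a structure taken from the survey.

```python
import math

# Sketch of an explicit 3D semantic map: world-frame observations with
# labels are binned into coarse voxel cells, so the same landmark seen
# from two viewpoints aggregates into one cell rather than remaining two
# unrelated 2D image features.

class SemanticVoxelMap:
    def __init__(self, resolution: float = 1.0):
        self.resolution = resolution
        self.cells = {}  # (i, j, k) -> set of semantic labels

    def _to_cell(self, xyz):
        return tuple(math.floor(c / self.resolution) for c in xyz)

    def insert(self, xyz, label: str) -> None:
        self.cells.setdefault(self._to_cell(xyz), set()).add(label)

    def query(self, label: str):
        return [cell for cell, labels in self.cells.items() if label in labels]

m = SemanticVoxelMap(resolution=2.0)
m.insert((3.1, 0.4, 5.9), "building")  # observation from one viewpoint
m.insert((3.7, 0.2, 5.1), "building")  # same structure, different viewpoint
print(m.query("building"))  # [(1, 0, 2)] -- both observations share a cell
```

The sparse dictionary keeps memory proportional to observed space rather than environment volume, which is one simple answer to the scalability half of the concern.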
Load-bearing premise
The reviewed literature is complete enough that the five-category taxonomy covers the field and the seven listed gaps are the most important ones.
What would settle it
Publication of an Aerial VLN method that cannot be placed in any of the five categories, or release of a benchmark that already solves all seven listed open problems, would show the survey's organization and gap analysis are incomplete.
Figures
Original abstract
Aerial vision-and-language navigation (Aerial VLN) aims to enable unmanned aerial vehicles (UAVs) to interpret natural language instructions and autonomously navigate complex three-dimensional environments by grounding language in visual perception. This survey provides a critical and analytical review of the Aerial VLN field, with particular attention to the recent integration of large language models (LLMs) and vision-language models (VLMs). We first formally introduce the Aerial VLN problem and define two interaction paradigms: single-instruction and dialog-based, as foundational axes. We then organize the body of Aerial VLN methods into a taxonomy of five architectural categories: sequence-to-sequence and attention-based methods, end-to-end LLM/VLM methods, hierarchical methods, multi-agent methods, and dialog-based navigation methods. For each category, we systematically analyze design rationales, technical trade-offs, and reported performance. We critically assess the evaluation infrastructure for Aerial VLN, including datasets, simulation platforms, and metrics, and identify their gaps in scale, environmental diversity, real-world grounding, and metric coverage. We consolidate cross-method comparisons on shared benchmarks and analyze key architectural trade-offs, including discrete versus continuous actions, end-to-end versus hierarchical designs, and the simulation-to-reality gap. Finally, we synthesize seven concrete open problems: long-horizon instruction grounding, viewpoint robustness, scalable spatial representation, continuous 6-DoF action execution, onboard deployment, benchmark standardization, and multi-UAV swarm navigation, with specific research directions grounded in the evidence presented throughout the survey.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript is a critical survey of Aerial Vision-Language Navigation (Aerial VLN) for UAVs. It formally defines the problem and two interaction paradigms (single-instruction and dialog-based), organizes existing methods into a five-category taxonomy (sequence-to-sequence/attention-based, end-to-end LLM/VLM, hierarchical, multi-agent, and dialog-based), analyzes design rationales, trade-offs, and reported performance for each, evaluates datasets/simulators/metrics and their limitations, consolidates cross-method comparisons on shared benchmarks, and synthesizes seven open problems (long-horizon instruction grounding, viewpoint robustness, scalable spatial representation, continuous 6-DoF action execution, onboard deployment, benchmark standardization, and multi-UAV swarm navigation) with grounded research directions.
Significance. If the literature coverage and analysis hold, the survey provides substantial value by structuring an emerging interdisciplinary field at the intersection of VLN, LLMs/VLMs, and aerial robotics. Its principal contribution is the synthesis of actionable open problems derived from cross-category comparisons and evaluation gaps, offering a reference that can guide targeted research on real-world deployment challenges such as continuous control and swarm coordination.
minor comments (3)
- [Taxonomy section] The taxonomy introduction would benefit from an explicit justification or decision tree explaining why the five categories are mutually exclusive and exhaustive, particularly regarding overlap between hierarchical and multi-agent approaches.
- [Evaluation section] In the evaluation infrastructure assessment, the discussion of metric coverage could include a table summarizing which metrics are used across the reviewed papers to make the identified gaps more quantifiable.
- [Cross-method comparisons] The consolidated cross-method comparisons on shared benchmarks would be strengthened by noting the number of papers per benchmark and any statistical significance tests applied to performance differences.
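On metric coverage: two metrics standard in VLN evaluation, success rate (SR) and success weighted by path length (SPL), can be sketched as follows. The episode records are illustrative values, and the 3 m success threshold is an assumption, not a number from the manuscript.

```python
# Each episode records the final distance to goal, the length of the path
# the agent flew, and the shortest feasible path length.

def success_rate(episodes, threshold=3.0):
    """Fraction of episodes ending within `threshold` meters of the goal."""
    return sum(e["nav_error"] <= threshold for e in episodes) / len(episodes)

def spl(episodes, threshold=3.0):
    """Success weighted by normalized inverse path length: a success that
    takes a long detour scores less than one that flies near-optimally."""
    total = 0.0
    for e in episodes:
        if e["nav_error"] <= threshold:
            total += e["shortest_path"] / max(e["path_length"], e["shortest_path"])
    return total / len(episodes)

episodes = [
    {"nav_error": 1.2, "path_length": 120.0, "shortest_path": 100.0},  # success, with detour
    {"nav_error": 8.5, "path_length": 90.0,  "shortest_path": 80.0},   # failure
]
print(success_rate(episodes))  # 0.5
print(spl(episodes))           # ~0.417 (0.8333 / 2)
```

A metric-coverage table of the kind suggested above would simply record which of these (and aerial-specific extensions such as vertical error) each reviewed paper reports.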
Simulated Author's Rebuttal
We thank the referee for the positive and accurate summary of our survey, which correctly identifies the taxonomy, evaluation analysis, and open problems. We appreciate the recommendation for minor revision and will incorporate any editorial or minor clarifications in the next version. No specific major comments were provided in the report.
Circularity Check
No significant circularity in this literature survey
full rationale
This is a survey paper that formally defines the Aerial VLN problem, organizes existing methods into a five-category taxonomy, analyzes design trade-offs and performance on shared benchmarks, critiques datasets/metrics, and synthesizes seven open problems as an analytical summary of gaps identified across the reviewed literature. No mathematical derivations, equations, fitted parameters, or predictions appear; the synthesis of open problems is explicitly grounded in external prior work rather than reducing to self-defined inputs or self-citations by construction. The taxonomy and gaps are presented as critical review, not as a load-bearing uniqueness theorem or ansatz imported from the authors' prior work.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- Files: IndisputableMonolith/Foundation/RealityFromDistinction.lean, Cost/FunctionalEquation.lean, AlexanderDuality.lean
  Theorems: reality_from_one_distinction, washburn_uniqueness_aczel, alexander_duality_circle_linking
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
We organize the body of Aerial VLN methods into a taxonomy of five architectural categories: sequence-to-sequence and attention-based methods, end-to-end LLM/VLM methods, hierarchical methods, multi-agent methods, and dialog-based navigation methods... synthesize seven concrete open problems: long-horizon instruction grounding, viewpoint robustness, scalable spatial representation, continuous 6-DoF action execution...
- Files: IndisputableMonolith/Foundation/ArithmeticFromLogic.lean, Cost.lean
  Theorems: Jcost functional equation uniqueness, 8-tick period forcing
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
The Aerial VLN problem is typically formulated as a language-conditioned sequential decision-making problem in partially observable 3D space... action space A... discrete... or continuous a_t = (v_t, ω_t) ∈ ℝ³ × ℝ³... POMDP (S, A, O, T, Ω, L)
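The quoted formulation can be restated in clean notation. The members of the discrete action set are an illustrative assumption; the tuple and the continuous action definition follow the passage as quoted.

```latex
% Aerial VLN as a language-conditioned POMDP, as quoted from the paper:
(S, A, O, T, \Omega, L)
% Discrete variant: the agent selects from a finite set of motion primitives
% (the listed members are illustrative, not the paper's exact set):
A_{\mathrm{disc}} = \{\mathrm{forward},\ \mathrm{turn\ left},\ \mathrm{turn\ right},\
                     \mathrm{ascend},\ \mathrm{descend},\ \mathrm{stop}\}
% Continuous variant: a 6-DoF command of linear and angular velocity:
a_t = (v_t, \omega_t) \in \mathbb{R}^3 \times \mathbb{R}^3
```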
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] X. Li, J. Tan, A. Liu, P. Vijayakumar, N. Kumar, and M. Alazab, "A novel UAV-enabled data collection scheme for intelligent transportation system through UAV speed control," IEEE Transactions on Intelligent Transportation Systems, vol. 22, no. 4, pp. 2100–2110, 2021.
- [2] F. Betti Sorbelli, "UAV-based delivery systems: A systematic review, current trends, and research challenges," ACM Journal on Autonomous Transportation Systems, vol. 1, no. 3, pp. 1–40, 2024.
- [3] X. Liu, S. W. Chen, G. V. Nardari, C. Qu, F. Cladera, C. J. Taylor, and V. Kumar, "Challenges and opportunities for autonomous micro-UAVs in precision agriculture," IEEE Micro, vol. 42, no. 1, pp. 61–68, 2022.
- [4] H. Huang, H. Zhu, X. Zhu, W. Mei, and B. Deng, "Online path planning for multi-robot multi-source seeking using distributed Gaussian processes," IET Cyber-Systems and Robotics, vol. 7, no. 1, p. e70030, 2025.
- [5] H. Zhu, J. J. Chung, N. R. Lawrance, R. Siegwart, and J. Alonso-Mora, "Online informative path planning for active information gathering of a 3D surface," in IEEE International Conference on Robotics and Automation, pp. 1488–1494, 2021.
- [6] H. Zhu, Q. Chen, X. Zhu, W. Yao, and X. Chen, "Edge computing powers aerial swarms in sensing, communication, and planning," The Innovation, p. 100506, 2023.
- [7] X. Huang, "The small-drone revolution is coming — scientists need to ensure it will be safe," Nature, vol. 637, no. 8044, pp. 29–30, 2025.
- [8] Y. Tian, F. Lin, Y. Li, T. Zhang, Q. Zhang, X. Fu, J. Huang, X. Dai, Y. Wang, C. Tian, B. Li, Y. Lv, L. Kovács, and F.-Y. Wang, "UAVs meet LLMs: Overviews and perspectives towards agentic low-altitude mobility," Information Fusion, vol. 122, p. 103158, 2025.
- [9] S. Nahavandi, R. Alizadehsani, D. Nahavandi, S. Mohamed, N. Mohajer, M. Rokonuzzaman, and I. Hossain, "A comprehensive review on autonomous navigation," ACM Computing Surveys, vol. 57, no. 9, 2025.
- [10] Y. Ren, Y. Cai, H. Li, N. Chen, F. Zhu, L. Yin, F. Kong, R. Li, and F. Zhang, "A survey on LiDAR-based autonomous aerial vehicles," IEEE/ASME Transactions on Mechatronics, pp. 1–17, 2025.
- [11] A. Loquercio, E. Kaufmann, R. Ranftl, M. Müller, V. Koltun, and D. Scaramuzza, "Learning high-speed flight in the wild," Science Robotics, vol. 6, no. 59, p. eabg5810, 2021.
- [12] L. He, N. Aouf, and B. Song, "Explainable deep reinforcement learning for UAV autonomous path planning," Aerospace Science and Technology, vol. 118, p. 107052, 2021.
- [13] S. Liu, H. Zhang, Y. Qi, P. Wang, Y. Zhang, and Q. Wu, "AerialVLN: Vision-and-language navigation for UAVs," in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 15338–15348, 2023.
- [14] Y. Fan, W. Chen, T. Jiang, C. Zhou, Y. Zhang, and X. E. Wang, "Aerial vision-and-dialog navigation," Findings of the Association for Computational Linguistics: ACL 2023, pp. 3043–3061, 2023.
- [15] G. Zheng, Y. Ban, M. Zhang, J. Zheng, and B. Zhou, "OnFly: On-board zero-shot aerial vision-language navigation toward safety and efficiency," arXiv:2603.10682, 2026.
- [16] D. Zhang, P. Chen, X. Xia, X. Su, R. Zhen, J. Xiao, and S. Yang, "Apex: A decoupled memory-based explorer for asynchronous aerial object goal navigation," arXiv:2602.00551, 2026.
- [17] X. Zhang, Y. Tian, F. Lin, Y. Liu, J. Ma, K. S. Szatmáry, and F.-Y. Wang, "LogisticsVLN: Vision-language navigation for low-altitude terminal delivery based on agentic UAVs," in IEEE 28th International Conference on Intelligent Transportation Systems, pp. 4437–4442, 2025.
- [18] Y. Ping, T. Liang, H. Ding, G. Lei, J. Wu, X. Zou, K. Shi, R. Shao, C. Zhang, W. Zhang, W. Yuan, and T. Zhang, "Multimodal large language models-enabled UAV swarm: Towards efficient and intelligent autonomous aerial systems," IEEE Wireless Communications, vol. 33, no. 1, pp. 89–97, 2025.
- [19] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amod..., "Language models are few-shot learners," 2020.
- [20] D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Bi, X. Zhang, X. Yu, Y. Wu, Z. F. Wu, Z. Gou, Z. Shao, Z. Li, Z. Gao, A. Liu, B. Xue, B. Wang, B. Wu, B. Feng, C. Lu, C. Zhao, C. Deng, C. Ruan, D. Dai, D. Chen, D. Ji, E. Li, F. Lin, F. Dai, F. Luo, G. Hao, G. Chen, G. Li, H. Zhang, H. Xu, H. Ding, H. Gao, H. Qu, H. Li, J. G..., "DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning," 2025.
- [21] J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang, B. Hui, L. Ji, M. Li, J. Lin, R. Lin, D. Liu, G. Liu, C. Lu, K. Lu, J. Ma, R. Men, X. Ren, X. Ren, C. Tan, S. Tan, J. Tu, P. Wang, S. Wang, W. Wang, S. Wu, B. Xu, J. Xu, A. Yang, H. Yang, J. Yang, S. Yang, Y. Yao, B. Yu, H. Yuan, Z. Yuan, J. Zhang, X. Zhang, Y. Zhang, ..., arXiv, 2023.
- [22] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, "Learning transferable visual models from natural language supervision," in Proceedings of the 38th International Conference on Machine Learning, pp. 8748–8763, 2021.
- [23] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, P. Dollár, and R. Girshick, "Segment anything," in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3992–4003, 2023.
- [24] S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, Q. Jiang, C. Li, J. Yang, H. Su, J. Zhu, and L. Zhang, "Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection," in Proceedings of the European Conference on Computer Vision, vol. 15105, pp. 38–55, 2025.
- [25] G. Zhao, G. Li, and Y. Yu, "NavGemini: a multi-modal LLM agent for vision-and-language navigation," Visual Intelligence, vol. 4, 2026.
- [26] C. Huang, L. Tang, Z. Zhan, L. Yu, R. Zeng, Z. Liu, Z. Wang, and J. Li, "Unemo: Collaborative visual-language reasoning and navigation via a multimodal world model," arXiv:2511.18845, 2025.
- [27] R. Schumann, W. Zhu, W. Feng, T.-J. Fu, S. Riezler, and W. Y. Wang, "VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View," Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 17, pp. 18924–18933, 2024.
- [28] D. Shah, B. Osiński, B. Ichter, and S. Levine, "LM-Nav: Robotic navigation with large pre-trained models of language, vision, and action," in Proceedings of the 6th Conference on Robot Learning, pp. 492–504, 2023.
- [29] F. Yao, Y. Liu, W. Zhang, Z. Zhu, C. Li, N. Liu, P. Hu, Y. Yue, K. Wei, X. He, X. Zhao, Z. Wei, H. Xu, Z. Wang, G. Shao, L. Yang, D. Zhao, and Y. Yang, "AeroVerse-review: Comprehensive survey on aerial embodied vision-and-language navigation," The Innovation Informatics, vol. 1, no. 1, p. 100015, 2025.
- [30] H. Cai, J. Dong, J. Tan, J. Deng, S. Li, Z. Gao, H. Wang, Z. Su, A. Sumalee, and R. Zhong, "FlightGPT: Towards generalizable and interpretable UAV vision-and-language navigation with vision-language models," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 6659–6676, 2025.
- [31] Y. Liu, F. Yao, Y. Yue, G. Xu, X. Sun, and K. Fu, "NavAgent: Multi-scale urban street view fusion for UAV embodied vision-and-language navigation," arXiv:2411.08579, 2024.
- [32] T. Li, T. Huai, Z. Li, Y. Gao, H. Li, and X. Zheng, "SkyVLN: Vision-and-language navigation and NMPC control for UAVs in urban environments," in IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 17199–17206, 2025.
- [33] Y. Gao, C. Li, Z. You, J. Liu, Z. Li, P. Chen, Q. Chen, Z. Tang, Y. Tang, Y. Tang, S. Liang, S. Zhu, Z. Xiong, Y. Su, X. Ye, J. Li, Y. Ding, D. Wang, Z. Wang, B. Zhao, and X. Li, "OpenFly: A comprehensive platform for aerial vision-language navigation," arXiv:2502.18041, 2025.
- [34] Z. Wang, J. Chen, X. Zheng, Q. Liao, L. Huang, and S. Liu, ""Hi AirStar, guide me to the badminton court."," in ACM International Conference on Multimedia, pp. 13477–13479, 2025.
- [35] O. Sautenkov, Y. Yaqoot, M. A. Mustafa, F. Batool, J. Sam, A. Lykov, C.-Y. Wen, and D. Tsetserukou, "UAV-CodeAgents: Scalable UAV mission planning via multi-agent ReAct and vision-language reasoning," arXiv:2505.07236, 2025.
- [36] Z. Zhang, M. Chen, S. Zhu, T. Han, and Z. Yu, "MMCNav: MLLM-empowered multi-agent collaboration for outdoor visual language navigation," in Proceedings of the International Conference on Multimedia Retrieval, pp. 1767–1776, 2025.
- [37] J. Gu, E. Stefani, Q. Wu, J. Thomason, and X. E. Wang, "Vision-and-language navigation: A survey of tasks, methods, and future directions," in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 7606–7623, 2022.
- [38] W. Wu, T. Chang, and X. Li, "Vision-language navigation: A survey and taxonomy," Neural Computing and Applications, vol. 36, pp. 3291–3316, 2024.
- [39] Y. Zhang, Z. Ma, J. Li, Y. Qiao, Z. Wang, J. Chai, Q. Wu, M. Bansal, and P. Kordjamshidi, "Vision-and-language navigation today and tomorrow: A survey in the era of foundation models," Transactions on Machine Learning Research, 2024.
- [40] D. Misra, A. Bennett, V. Blukis, E. Niklasson, M. Shatkhin, and Y. Artzi, "Mapping instructions to actions in 3D environments with visual goal prediction," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 2667–2678, 2018.
- [41] V. Blukis, D. Misra, R. A. Knepper, and Y. Artzi, "Mapping navigation instructions to continuous control actions with position-visitation prediction," in Proceedings of the 2nd Conference on Robot Learning, vol. 87, pp. 505–518, 2018.
- [42] Q. Chen, N. Gao, S. Huang, J. Low, T. Chen, J. Sun, and M. Schwager, "GRAD-Nav++: Vision-language model enabled visual drone navigation with Gaussian radiance fields and differentiable dynamics," IEEE Robotics and Automation Letters, vol. 11, no. 2, pp. 1418–1425, 2026.
- [43] G. Zhao, G. Li, J. Pan, and Y. Yu, "Aerial vision-and-language navigation with grid-based view selection and map construction," arXiv:2503.11091, 2025.
- [45] X. Wang, D. Yang, Z. Wang, H. Kwan, J. Chen, W. Wu, H. Li, Y. Liao, and S. Liu, "Towards realistic UAV vision-language navigation: Platform, benchmark, and methodology," in The 13th International Conference on Learning Representations, pp. 75433–75451, 2025.
- [46] W. Zhang, C. Gao, S. Yu, R. Peng, B. Zhao, Q. Zhang, J. Cui, X. Chen, and Y. Li, "CityNavAgent: Aerial vision-and-language navigation with hierarchical semantic planning and global memory," in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 31292–31309, 2025.
- [47] J. Lee, T. Miyanishi, S. Kurita, K. Sakamoto, D. Azuma, Y. Matsuo, and N. Inoue, "CityNav: A large-scale dataset for real-world aerial navigation," in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5912–5922, 2025.
- [48] P. Anderson, Q. Wu, D. Teney, J. Bruce, M. Johnson, N. Sünderhauf, I. Reid, S. Gould, and A. Van Den Hengel, "Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3674–3683, 2018.
- [49] J. Krantz, E. Wijmans, A. Majumdar, D. Batra, and S. Lee, "Beyond the nav-graph: Vision-and-language navigation in continuous environments," in Proceedings of the European Conference on Computer Vision, vol. 12373, pp. 104–120, 2020.
- [50] H. Chen, A. Suhr, D. Misra, N. Snavely, and Y. Artzi, "TOUCHDOWN: Natural language navigation and spatial reasoning in visual street environments," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12530–12539, 2019.
- [51] P. Mirowski, A. Banki-Horvath, K. Anderson, D. Teplyashin, K. M. Hermann, M. Malinowski, M. K. Grimes, K. Simonyan, K. Kavukcuoglu, A. Zisserman, and R. Hadsell, "The StreetLearn Environment and Dataset," arXiv:1903.01292, 2019.
- [52] J. Duan, S. Yu, H. L. Tan, H. Zhu, and C. Tan, "A survey of embodied AI: From simulators to research tasks," IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 6, no. 2, pp. 230–244, 2022.
- [53] S. Shuang-Lin, H. Yan, H. Ke-Ji, A. Dong, Y. Hui, and W. Liang, "Recent advances in vision-and-language navigation," Acta Automatica Sinica, vol. 49, no. 1, pp. 1–14, 2023.
- [54] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, "An image is worth 16x16 words: Transformers for image recognition at scale," in The 9th International Conference on Learning Representations, 2021.
- [55] S.-M. Park and Y.-G. Kim, "Visual language navigation: A survey and open challenges," Artificial Intelligence Review, vol. 56, no. 1, pp. 365–427, 2023.
- [56] V. Jain, G. Magalhaes, A. Ku, A. Vaswani, E. Ie, and J. Baldridge, "Stay on the path: Instruction fidelity in vision-and-language navigation," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 1862–1872, 2019.
- [57] Y. Hong, C. Rodriguez, Y. Qi, Q. Wu, and S. Gould, "Language and visual entity relationship graph for agent navigation," in Advances in Neural Information Processing Systems, vol. 33, pp. 7685–7696, 2020.
- [58] K. M. Hermann, M. Malinowski, P. Mirowski, A. Banki-Horvath, K. Anderson, and R. Hadsell, "Learning to follow directions in street view," Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, pp. 11773–11781, 2020.
- [59] S. Aggarwal and N. Kumar, "Path planning techniques for unmanned aerial vehicles: A review, solutions, and challenges," Computer Communications, vol. 149, pp. 270–299, 2020.
- [60] S. A. H. Mohsan, N. Q. H. Othman, Y. Li, M. H. Alsharif, and M. A. Khan, "Unmanned aerial vehicles (UAVs): practical aspects, applications, open challenges, security issues, and future trends," Intelligent Service Robotics, vol. 16, pp. 109–137, 2023.
- [61] S. Li and H. Tang, "Multimodal alignment and fusion: A survey," arXiv:2411.17040, 2024.
- [62] G. Qiao, D. Yi, L. Wu, H. Wu, and J. Wang, "Enhancing visual aligning and grounding for aerial vision-and-dialog navigation," IEEE Signal Processing Letters, vol. 32, pp. 2853–2857, 2025.
- [63] C. Gao, X. Peng, M. Yan, H. Wang, L. Yang, H. Ren, H. Li, and S. Liu, "Adaptive zone-aware hierarchical planner for vision-language navigation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14911–14920, 2023.
- [64] X. Song, W. Chen, Y. Liu, W. Chen, G. Li, and L. Lin, "Towards long-horizon vision-language navigation: Platform, benchmark and method," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12078–12088, 2025.
- [65] L. Zhou, R. Xue, and X. Luo, "Structured instruction parsing and scene alignment for UAV vision-language navigation," in IEEE International Conference on Image Processing, pp. 2600–2605, 2025.
- [66] W. Zhang, Y. Liu, X. Wang, X. Chen, C. Gao, and X. Chen, "Demo abstract: Embodied aerial agent for city-level visual language navigation using large language model," in The 23rd ACM/IEEE International Conference on Information Processing in Sensor Networks, pp. 265–266, 2024.
- [67] G. Chen, X. Yu, N. Ling, and L. Zhong, "TypeFly: Low-latency drone planning with large language models," IEEE Transactions on Mobile Computing, vol. 24, no. 9, pp. 9068–9079, 2025.
- [68] G. S. XU Yueyue, DU Huajun, "Research progress on embodied navigation of low-altitude UAV," Aerospace Control, vol. 43, no. 4, pp. 7–14, 2025.
- [69] V. Blukis, N. Brukhim, A. Bennett, R. Knepper, and Y. Artzi, "Following high-level navigation instructions on a simulated quadcopter with imitation learning," in Robotics: Science and Systems XIV, Robotics: Science and Systems Foundation, 2018.
- [70] Y. Su, D. An, Y. Xu, K. Chen, and Y. Huang, "Target-grounded graph-aware transformer for aerial vision-and-dialog navigation," arXiv:2308.11561, 2023.
- [71] Y. Su, D. An, K. Chen, W. Yu, B. Ning, Y. Ling, Y. Huang, and L. Wang, "Learning fine-grained alignment for aerial vision-dialog navigation," in Proceedings of the AAAI Conference on Artificial Intelligence, no. 7, pp. 7060–7068, 2025.
- [72] D. Fried, R. Hu, V. Cirik, A. Rohrbach, J. Andreas, L.-P. Morency, T. Berg-Kirkpatrick, K. Saenko, D. Klein, and T. Darrell, "Speaker-follower models for vision-and-language navigation," in Advances in Neural Information Processing Systems, vol. 31, 2018.
- [73] A. S. Huang, S. Tellex, A. Bachrach, T. Kollar, D. Roy, and N. Roy, "Natural language command of an autonomous micro-air vehicle," in IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2663–2669, 2010.
- [74] V. Blukis, Y. Terme, E. Niklasson, R. A. Knepper, and Y. Artzi, "Learning to map natural language instructions to physical quadcopter control using simulated flight," in Proceedings of the Conference on Robot Learning, vol. 100, pp. 1415–1438, 2020.
- [75] S. Ross, G. Gordon, and D. Bagnell, "A reduction of imitation learning and structured prediction to no-regret online learning," in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635, JMLR Workshop and Conference Proceedings, 2011.
- [76] X. Ding, J. Gao, C. Pan, W. Wang, and J. Qin, "History-enhanced two-stage transformer for aerial vision-and-language navigation," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 40, pp. 18225–18233, 2026.
- [77] P. Xu, X. Zhu, and D. A. Clifton, "Multimodal learning with transformers: A survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 10, pp. 12113–12132, 2023.
- [78] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
- [79] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
- [80] Z. Wang, "Dual-branch dynamic perception and interaction framework for aerial vision-and-language navigation," in The 4th International Conference on Artificial Intelligence, Internet and Digital Economy, pp. 307–310, 2025.
- [81] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, 2019.