AirNav: A Large-Scale UAV Vision-and-Language Navigation Dataset with Natural and Diverse Instructions
Pith reviewed 2026-05-21 16:35 UTC · model grok-4.3
The pith
AirNav provides 137K real urban aerial samples with natural language instructions to train UAV navigation agents that achieve 52 percent success in unseen settings.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose AirNav, a large-scale benchmark built on real urban aerial data, comprising 137K navigation samples with natural and diverse instructions generated via a human-LLM collaborative pipeline with 10 user personas. We conduct a systematic evaluation of representative approaches on AirNav, ranging from traditional models to multimodal large language models, under unified metrics. We further propose AirVLN-R1, trained via supervised fine-tuning and reinforcement fine-tuning, achieving a 51.82 percent success rate on the test-unseen split, with preliminary real-world experiments on a physical UAV platform showing sim-to-real transferability.
What carries the argument
The human-LLM collaborative pipeline with 10 user personas, which creates the natural and diverse instructions that scale the benchmark and support realistic agent training.
If this is right
- Traditional and multimodal models can be compared directly on the same large real-world aerial benchmark under consistent metrics.
- Supervised and reinforcement fine-tuning on the dataset produces a model that reaches over 50 percent success on previously unseen routes.
- Early physical UAV flights indicate that behaviors learned from the simulated data can transfer to hardware platforms.
- Public release of the dataset and code lets other researchers extend the benchmark or improve the proposed model.
Where Pith is reading between the lines
- The scale and realism of the dataset could support development of language-controlled drones for tasks such as urban monitoring or emergency response.
- The persona-driven instruction generation method might transfer to creating training data for other robot navigation domains like ground vehicles or indoor robots.
- Additional experiments with instructions from broader sets of real users could test and improve how well the benchmark reflects everyday command styles.
Load-bearing premise
The human-LLM collaborative pipeline with 10 user personas produces instructions that are sufficiently natural, diverse, and representative of real user commands to enable effective training and fair evaluation of UAV VLN agents.
What would settle it
Testing whether a model trained on AirNav maintains above 40 percent success when given instructions collected directly from human operators issuing commands during actual UAV flights over new urban areas.
Original abstract
Existing UAV vision-and-language navigation (VLN) benchmarks rarely provide realistic aerial scenes, natural process-level instructions, and sufficient scale simultaneously, making it difficult to systematically train and evaluate UAV VLN agents under realistic settings. To address this, we propose AirNav, a large-scale benchmark built on real urban aerial data, comprising 137K navigation samples with natural and diverse instructions generated via a human-LLM collaborative pipeline with 10 user personas. We conduct a systematic evaluation of representative approaches on AirNav, ranging from traditional models to multimodal large language models (MLLMs), under unified metrics with open-source implementations. We further propose AirVLN-R1, trained via supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT), achieving state-of-the-art performance with a 51.82% success rate on the test-unseen split. Real-world experiments on a physical UAV platform provide preliminary evidence of sim-to-real transferability, and our dataset and code are publicly available.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AirNav, a large-scale UAV vision-and-language navigation benchmark comprising 137K navigation samples derived from real urban aerial data. Instructions are generated through a human-LLM collaborative pipeline involving 10 user personas to produce natural and diverse commands. The authors perform systematic evaluations of traditional models and multimodal LLMs under unified metrics, propose AirVLN-R1 trained with supervised fine-tuning and reinforcement fine-tuning that achieves 51.82% success rate on the test-unseen split, and report preliminary real-world experiments on a physical UAV platform demonstrating sim-to-real transferability, with the dataset and code released publicly.
Significance. If the generated instructions are shown to be natural and representative of real UAV operator commands, this benchmark would meaningfully advance UAV VLN research by simultaneously providing realistic aerial scenes, process-level instructions, and sufficient scale. Explicit strengths include the public release of the dataset and code, the systematic evaluation across model classes, the concrete performance improvement reported for AirVLN-R1, and the inclusion of physical UAV experiments that address sim-to-real gaps.
major comments (1)
- [Section 3.2] The human-LLM collaborative pipeline with 10 user personas is presented as producing natural and diverse instructions, yet the section supplies only qualitative examples and persona descriptions. No quantitative metrics (e.g., instruction-length variance, lexical entropy, semantic coverage) or human preference studies against real UAV operator logs are reported, nor are ablations on persona count or prompting. This omission is load-bearing for both the benchmark's core value proposition and the validity of the 51.82% success rate achieved by AirVLN-R1 on the test-unseen split.
minor comments (1)
- [Abstract] The abstract states that 'representative approaches' were evaluated but does not enumerate the specific models or baselines; adding this detail would improve immediate clarity for readers.
Simulated Author's Rebuttal
We thank the referee for the constructive review and for recognizing the strengths of AirNav, including the public release of the dataset and code, systematic evaluations across model classes, the performance gains with AirVLN-R1, and the preliminary physical UAV experiments. We address the major comment below and will make targeted revisions to strengthen the evidence for instruction naturalness and diversity.
- Referee: [Section 3.2] The human-LLM collaborative pipeline with 10 user personas is presented as producing natural and diverse instructions, yet the section supplies only qualitative examples and persona descriptions. No quantitative metrics (e.g., instruction-length variance, lexical entropy, semantic coverage) or human preference studies against real UAV operator logs are reported, nor are ablations on persona count or prompting. This omission is load-bearing for both the benchmark's core value proposition and the validity of the 51.82% success rate achieved by AirVLN-R1 on the test-unseen split.
Authors: We agree that quantitative metrics would provide stronger support for the claims of naturalness and diversity. In the revised manuscript we will add instruction-length statistics (mean, variance, and histograms), lexical diversity measures (type-token ratio and entropy), and semantic coverage analysis via sentence embeddings. We will also include an ablation on persona count (e.g., 5 vs. 10 personas) and prompting variations to quantify their effect on output diversity. These additions will be placed in Section 3.2 and will directly bolster the benchmark's value proposition. However, human preference studies against real UAV operator logs cannot be performed at this time because no suitable public logs exist; obtaining them would require new data collection from operators, which lies outside the current project scope. We will explicitly note this limitation and position the 10-persona pipeline as a practical proxy for diversity, supported by the added quantitative results.
Revision: partial
- Human preference studies against real UAV operator logs, as no such public logs are available and new collection is beyond current scope.
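The length and lexical-diversity statistics named in the rebuttal (mean and variance of instruction length, type-token ratio, entropy) can be sketched in a few lines. This is only an illustration under stated assumptions: the tokenizer is a simple regular expression, entropy is taken over unigrams, and every function name here is hypothetical rather than taken from the AirNav code release.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split into word-like tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def type_token_ratio(tokens: list[str]) -> float:
    """Distinct tokens divided by total tokens; 0.0 for empty input."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def unigram_entropy(tokens: list[str]) -> float:
    """Shannon entropy, in bits, of the empirical unigram distribution."""
    total = len(tokens)
    if total == 0:
        return 0.0
    counts = Counter(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def length_stats(instructions: list[str]) -> tuple[float, float]:
    """Mean and population variance of instruction length, in tokens."""
    lengths = [len(tokenize(s)) for s in instructions]
    if not lengths:
        return 0.0, 0.0
    mean = sum(lengths) / len(lengths)
    variance = sum((n - mean) ** 2 for n in lengths) / len(lengths)
    return mean, variance
```

Type-token ratio falls as a corpus grows, so any comparison across persona counts would need to be made on samples of equal size.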
Circularity Check
No circularity: new dataset and held-out evaluation are self-contained
The paper introduces AirNav as a new benchmark dataset generated via a human-LLM pipeline and reports model performance (including the 51.82% success rate) on explicitly held-out test-unseen splits. No mathematical derivations, parameter fits, or equations appear in the provided text. Central claims rest on the scale and realism of the collected data plus standard supervised/reinforcement fine-tuning, not on any reduction to self-referential inputs, self-citations, or renamed empirical patterns. Evaluation follows conventional train/test splits on the introduced corpus, which is externally falsifiable and does not presuppose the target result by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: standard vision-and-language navigation assumptions hold, namely that aerial images plus textual instructions suffice to define navigation tasks and success.
Forward citations
Cited by 1 Pith paper
- Vision-Language Navigation for Aerial Robots: Towards the Era of Large Language Models
This survey organizes aerial vision-language navigation methods into five architectural categories, critically reviews evaluation infrastructure, and synthesizes seven open problems for LLM/VLM integration.
Reference graph
Works this paper leans on
- [1] Cityrefer: Geography-aware 3d visual grounding dataset on city-scale point cloud data. Don Norman. 2013. The Design of Everyday Things: Revised and Expanded Edition. The Design of Everyday Things: Revised and Expanded Edition. OpenAI, :, Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Ala...
- [2] Llamafactory: Unified efficient fine-tuning of 100+ language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Bangkok, Thailand. Association for Computational Linguistics. Fengda Zhu, Xiwen Liang, Yi Zhu, Xiaojun Chang, and Xiaodan Liang. 2021. Soon: Scenario orient...
- [3] Naturalness: Whether the instruction sounds like spontaneous human speech rather than a rigid, scripted, or templated command
- [4] Practicality: Whether the instruction provides actionable route guidance through landmarks, relative directions, or intermediate cues, rather than low-level action enumeration
- [5] Human Alignment: Whether the wording and structure align with how a human would naturally phrase a navigation request in everyday use.
  Rating Scale: Rate it on a scale from 1 to 5:
  - 1 = very unnatural (robotic, templated, or action-list-like)
  - 2 = somewhat unnatural (syntactically valid but awkward or artificial)
  - 3 = neutral (reasonable but not strong...
- [6] A given natural language navigation instruction,
- [7] The current state of the UAV, including its position and heading angle,
- [8] The current first-person UAV view image,
- [9] Up to four historical first-person view images from previous time steps (if available),
- [10] The previously executed UAV actions (if available).
  Text Input:
  - Navigation instruction: {Instruction}
  - Current state of the UAV: {Current State}
  - Previously executed actions: {Historical Action sequence} (A list of past actions the UAV has taken, in chronological order.)
  Image Input: UAV (Unmanned Aerial Vehicle) View Sequence
  - Historical views (fro...
- [11] Predict no more than 8 future actions for the UAV to execute
- [12] If the target location is reachable in fewer than 8 actions, output less than 8 actions sequence and end with "STOP". Otherwise, it clearly requires more than 8 actions to approach the target, output exactly 8 future actions
- [13] You must output "STOP" if the UAV has already reached the described target
- [14] Output a JSON list of actions, in the exact order they should be executed
- [15] Do not include any explanations, reasoning, or additional text; only output the JSON list.
  Discrete Action Space:
  - MOVE_FORWARD: move straight 5 meters in the current heading
  - TURN_LEFT: rotate left 30 degrees
  - TURN_RIGHT: rotate right 30 degrees
  - STOP: stop the flight
  Output Format Examples: ["TURN_RIGHT", "TURN_RIGHT", "MOVE_FORWARD", "MOVE_FORWAR...
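The output contract quoted in the prompt excerpt (a JSON list of at most 8 actions drawn from MOVE_FORWARD, TURN_LEFT, TURN_RIGHT, and STOP, with 5-meter steps and 30-degree turns) lends itself to a small validation and dead-reckoning sketch. What follows is an illustration rather than the paper's implementation: the function names are hypothetical, and the heading convention (0 degrees along +y, right turns clockwise) is an assumption that the excerpt does not specify.

```python
import json
import math

VALID_ACTIONS = {"MOVE_FORWARD", "TURN_LEFT", "TURN_RIGHT", "STOP"}
MAX_ACTIONS = 8       # "no more than 8 future actions"
STEP_METERS = 5.0     # MOVE_FORWARD distance from the prompt
TURN_DEGREES = 30.0   # TURN_LEFT / TURN_RIGHT angle from the prompt


def parse_actions(raw: str) -> list[str]:
    """Parse a model reply and check it against the quoted output rules."""
    actions = json.loads(raw)  # malformed JSON raises a ValueError subclass
    if not isinstance(actions, list) or not all(isinstance(a, str) for a in actions):
        raise ValueError("expected a JSON list of action strings")
    if len(actions) > MAX_ACTIONS:
        raise ValueError(f"at most {MAX_ACTIONS} actions are allowed")
    unknown = [a for a in actions if a not in VALID_ACTIONS]
    if unknown:
        raise ValueError(f"unknown actions: {unknown}")
    return actions


def rollout(actions, x=0.0, y=0.0, heading_deg=0.0):
    """Dead-reckon a planar pose; the heading convention is an assumption."""
    for action in actions:
        if action == "STOP":
            break  # nothing after STOP is executed
        if action == "TURN_LEFT":
            heading_deg = (heading_deg - TURN_DEGREES) % 360.0
        elif action == "TURN_RIGHT":
            heading_deg = (heading_deg + TURN_DEGREES) % 360.0
        elif action == "MOVE_FORWARD":
            rad = math.radians(heading_deg)
            x += STEP_METERS * math.sin(rad)
            y += STEP_METERS * math.cos(rad)
    return x, y, heading_deg
```

Validating before rolling out keeps a malformed reply from silently moving the simulated UAV, which matters when replies come from a language model rather than a fixed policy head.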