AirNav: A Large-Scale UAV Vision-and-Language Navigation Dataset with Natural and Diverse Instructions

Changhao Nai; Hengxing Cai; Jingjun Tan; Jinhan Dong; Jue Hou; Ligang Huang; Renxin Zhong; Wenhao Lu; Yijie Rao; Zanyang Zhong

arxiv: 2601.03707 · v2 · pith:MF5SNKCYnew · submitted 2026-01-07 · 💻 cs.CL


Pith reviewed 2026-05-21 16:35 UTC · model grok-4.3

classification 💻 cs.CL

keywords UAV navigation · vision-and-language navigation · aerial dataset · benchmark · language instructions · multimodal models · sim-to-real transfer · urban aerial data


The pith

AirNav provides 137K real urban aerial samples with natural language instructions to train UAV navigation agents that achieve 52 percent success in unseen settings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AirNav as a benchmark dataset built from actual city aerial footage that pairs 137,000 navigation samples with instructions written in everyday language. These instructions come from a process that combines human input with large language models across ten different user types to capture varied ways people might direct a drone. The authors test multiple existing models on this data and introduce their own fine-tuned system, AirVLN-R1, which succeeds on just over half of the routes in the test-unseen split (51.82 percent); they also report early checks on a physical drone. A reader would care because reliable language-based control could open practical uses for drones in crowded environments where operators need to give spoken or typed directions rather than code paths manually.

Core claim

We propose AirNav, a large-scale benchmark built on real urban aerial data, comprising 137K navigation samples with natural and diverse instructions generated via a human-LLM collaborative pipeline with 10 user personas. We conduct a systematic evaluation of representative approaches on AirNav, ranging from traditional models to multimodal large language models, under unified metrics. We further propose AirVLN-R1, trained via supervised fine-tuning and reinforcement fine-tuning, achieving a 51.82 percent success rate on the test-unseen split, with preliminary real-world experiments on a physical UAV platform showing sim-to-real transferability.

What carries the argument

The human-LLM collaborative pipeline with 10 user personas, which creates the natural and diverse instructions that scale the benchmark and support realistic agent training.

If this is right

Traditional and multimodal models can be compared directly on the same large real-world aerial benchmark under consistent metrics.
Supervised and reinforcement fine-tuning on the dataset produces a model that reaches over 50 percent success on previously unseen routes.
Early physical UAV flights indicate that behaviors learned from the simulated data can transfer to hardware platforms.
Public release of the dataset and code lets other researchers extend the benchmark or improve the proposed model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The scale and realism of the dataset could support development of language-controlled drones for tasks such as urban monitoring or emergency response.
The persona-driven instruction generation method might transfer to creating training data for other robot navigation domains like ground vehicles or indoor robots.
Additional experiments with instructions from broader sets of real users could test and improve how well the benchmark reflects everyday command styles.

Load-bearing premise

The human-LLM collaborative pipeline with 10 user personas produces instructions that are sufficiently natural, diverse, and representative of real user commands to enable effective training and fair evaluation of UAV VLN agents.

What would settle it

Testing whether a model trained on AirNav maintains above 40 percent success when given instructions collected directly from human operators issuing commands during actual UAV flights over new urban areas.

Figures

Figures reproduced from arXiv: 2601.03707 by Changhao Nai, Hengxing Cai, Jingjun Tan, Jinhan Dong, Jue Hou, Ligang Huang, Renxin Zhong, Wenhao Lu, Yijie Rao, Zanyang Zhong.

**Figure 2.** Dataset Analysis and Instruction Naturalness of AirNav.

**Figure 3.** Overview of the AirVLN-R1 architecture. The model receives multimodal inputs and predicts an action

**Figure 4.** Real-World UAV VLN Deployment in Indoor and Outdoor Scenes.

Original abstract

Existing UAV vision-and-language navigation (VLN) benchmarks rarely provide realistic aerial scenes, natural process-level instructions, and sufficient scale simultaneously, making it difficult to systematically train and evaluate UAV VLN agents under realistic settings. To address this, we propose AirNav, a large-scale benchmark built on real urban aerial data, comprising 137K navigation samples with natural and diverse instructions generated via a human-LLM collaborative pipeline with 10 user personas. We conduct a systematic evaluation of representative approaches on AirNav, ranging from traditional models to multimodal large language models (MLLMs), under unified metrics with open-source implementations. We further propose AirVLN-R1, trained via supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT), achieving state-of-the-art performance with a 51.82% success rate on the test-unseen split. Real-world experiments on a physical UAV platform provide preliminary evidence of sim-to-real transferability, and our dataset and code are publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

AirNav adds a sizable real-aerial VLN dataset and a fine-tuned model, but the instructions' claimed naturalness rests on thin validation.


The core offering is a 137K-sample UAV navigation benchmark drawn from real urban aerial imagery, with instructions produced through a human-LLM pipeline that uses ten personas. They also train AirVLN-R1 via SFT and RFT, report 51.82% success on the test-unseen split, run a systematic comparison against prior models, and include a few physical UAV trials. The dataset and code are released, which is straightforwardly useful for anyone building aerial agents or testing multimodal navigation methods.

Referee Report

1 major / 1 minor

Summary. The paper introduces AirNav, a large-scale UAV vision-and-language navigation benchmark comprising 137K navigation samples derived from real urban aerial data. Instructions are generated through a human-LLM collaborative pipeline involving 10 user personas to produce natural and diverse commands. The authors systematically evaluate traditional models and multimodal LLMs under unified metrics and propose AirVLN-R1, trained with supervised fine-tuning and reinforcement fine-tuning, which achieves a 51.82% success rate on the test-unseen split. They also report preliminary real-world experiments on a physical UAV platform that offer early evidence of sim-to-real transferability, and they release the dataset and code publicly.

Significance. If the generated instructions are shown to be natural and representative of real UAV operator commands, this benchmark would meaningfully advance UAV VLN research by simultaneously providing realistic aerial scenes, process-level instructions, and sufficient scale. Explicit strengths include the public release of the dataset and code, the systematic evaluation across model classes, the concrete performance improvement reported for AirVLN-R1, and the inclusion of physical UAV experiments that address sim-to-real gaps.

major comments (1)

[Section 3.2] The human-LLM collaborative pipeline with 10 user personas is presented as producing natural and diverse instructions, yet the section supplies only qualitative examples and persona descriptions. No quantitative metrics (e.g., instruction-length variance, lexical entropy, semantic coverage) or human preference studies against real UAV operator logs are reported, nor are ablations on persona count or prompting. This omission is load-bearing for both the benchmark's core value proposition and the validity of the 51.82% success rate achieved by AirVLN-R1 on the test-unseen split.

minor comments (1)

[Abstract] The abstract states that 'representative approaches' were evaluated but does not enumerate the specific models or baselines; adding this detail would improve immediate clarity for readers.

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for the constructive review and for recognizing the strengths of AirNav, including the public release of the dataset and code, systematic evaluations across model classes, the performance gains with AirVLN-R1, and the preliminary physical UAV experiments. We address the major comment below and will make targeted revisions to strengthen the evidence for instruction naturalness and diversity.


Referee: [Section 3.2] The human-LLM collaborative pipeline with 10 user personas is presented as producing natural and diverse instructions, yet the section supplies only qualitative examples and persona descriptions. No quantitative metrics (e.g., instruction-length variance, lexical entropy, semantic coverage) or human preference studies against real UAV operator logs are reported, nor are ablations on persona count or prompting. This omission is load-bearing for both the benchmark's core value proposition and the validity of the 51.82% success rate achieved by AirVLN-R1 on the test-unseen split.

Authors: We agree that quantitative metrics would provide stronger support for the claims of naturalness and diversity. In the revised manuscript we will add instruction-length statistics (mean, variance, and histograms), lexical diversity measures (type-token ratio and entropy), and semantic coverage analysis via sentence embeddings. We will also include an ablation on persona count (e.g., 5 vs. 10 personas) and prompting variations to quantify their effect on output diversity. These additions will be placed in Section 3.2 and will directly bolster the benchmark's value proposition. However, human preference studies against real UAV operator logs cannot be performed at this time because no suitable public logs exist; obtaining them would require new data collection from operators, which lies outside the current project scope. We will explicitly note this limitation and position the 10-persona pipeline as a practical proxy for diversity, supported by the added quantitative results.

Revision: partial
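The length and lexical-diversity statistics named in this exchange can be sketched in a few lines of Python. This is a minimal illustration, not the authors' analysis code: the function names `tokenize` and `diversity_report`, the whitespace tokenizer, and the sample instructions are all assumptions introduced here, and the semantic-coverage analysis via sentence embeddings is left out.

```python
import math
from collections import Counter
from statistics import mean, pvariance


def tokenize(text):
    """Lowercase whitespace tokenization with edge punctuation stripped.

    This tokenizer is a simplifying assumption; a real analysis would
    likely use a proper NLP tokenizer.
    """
    tokens = (tok.strip(".,!?;:\"'()") for tok in text.lower().split())
    return [tok for tok in tokens if tok]


def diversity_report(instructions):
    """Return length statistics, type-token ratio, and unigram entropy (bits).

    Assumes a non-empty list containing at least one non-empty instruction.
    """
    tokenized = [tokenize(s) for s in instructions]
    lengths = [len(toks) for toks in tokenized]
    counts = Counter(tok for toks in tokenized for tok in toks)
    total = sum(counts.values())
    # Shannon entropy of the corpus-level unigram distribution.
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return {
        "mean_length": mean(lengths),
        "length_variance": pvariance(lengths),
        "type_token_ratio": len(counts) / total,
        "unigram_entropy_bits": entropy,
    }


# Hypothetical instructions, invented for illustration; not drawn from AirNav.
sample = [
    "Fly past the red tower and stop at the bridge.",
    "Head north, then turn left at the stadium.",
]
report = diversity_report(sample)
```

Type-token ratio falls as a corpus grows, so comparing a 5-persona and a 10-persona set with this sketch would only be meaningful on samples of equal size.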

Standing simulated objections (not resolved)

Human preference studies against real UAV operator logs remain outstanding: no such public logs are available, and new collection is beyond the current project scope.

Circularity Check

0 steps flagged

No circularity: new dataset and held-out evaluation are self-contained


The paper introduces AirNav as a new benchmark dataset generated via a human-LLM pipeline and reports model performance (including the 51.82% success rate) on explicitly held-out test-unseen splits. No mathematical derivations, parameter fits, or equations appear in the provided text. Central claims rest on the scale and realism of the collected data plus standard supervised/reinforcement fine-tuning, not on any reduction to self-referential inputs, self-citations, or renamed empirical patterns. Evaluation follows conventional train/test splits on the introduced corpus, which is externally falsifiable and does not presuppose the target result by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Dataset paper that relies on standard VLN task definitions and evaluation practices from prior literature; introduces no free parameters, new physical entities, or ad-hoc axioms beyond domain conventions for navigation benchmarks.

axioms (1)

Domain assumption: Standard vision-and-language navigation assumptions that aerial images plus textual instructions suffice to define navigation tasks and success.
Invoked implicitly when framing the benchmark and metrics; no explicit new axioms stated.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Vision-Language Navigation for Aerial Robots: Towards the Era of Large Language Models
cs.RO · 2026-04 · unverdicted · novelty 4.0

This survey organizes aerial vision-language navigation methods into five architectural categories, critically reviews evaluation infrastructure, and synthesizes seven open problems for LLM/VLM integration.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1] GPT-4o System Card

Cityrefer: Geography-aware 3d visual grounding dataset on city-scale point cloud data. Don Norman. 2013. The Design of Everyday Things: Revised and Expanded Edition. The Design of Everyday Things: Revised and Expanded Edition. OpenAI, :, Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Ala...

[2] In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Bangkok, Thailand

Llamafactory: Unified efficient fine-tuning of 100+ language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Bangkok, Thailand. Association for Computational Linguistics. Fengda Zhu, Xiwen Liang, Yi Zhu, Xiaojun Chang, and Xiaodan Liang. 2021. Soon: Scenario orient...

[3] Naturalness: Whether the instruction sounds like spontaneous human speech rather than a rigid, scripted, or templated command

[4] Practicality: Whether the instruction provides actionable route guidance through landmarks, relative directions, or intermediate cues, rather than low-level action enumeration

[5] Human Alignment: Whether the wording and structure align with how a human would naturally phrase a navigation request in everyday use. Rating Scale: Rate it on a scale from 1 to 5: - 1 = very unnatural (robotic, templated, or action-list-like) - 2 = somewhat unnatural (syntactically valid but awkward or artificial) - 3 = neutral (reasonable but not strong...

[6] A given natural language navigation instruction,

[7] The current state of the UAV, including its position and heading angle,

[8] The current first-person UAV view image,

[9] Up to four historical first-person view images from previous time steps (if available),

[10] The previously executed UAV actions (if available). Text Input: - Navigation instruction: {Instruction} - Current state of the UAV: {Current State} - Previously executed actions: {Historical Action sequence} (A list of past actions the UAV has taken, in chronological order.) Image Input: UAV (Unmanned Aerial Vehicle) View Sequence - Historical views (fro...

[11] Predict no more than 8 future actions for the UAV to execute

[12] Otherwise, it clearly requires more than 8 actions to approach the target, output exactly 8 future actions

If the target location is reachable in fewer than 8 actions, output less than 8 actions sequence and end with "STOP". Otherwise, it clearly requires more than 8 actions to approach the target, output exactly 8 future actions

[13] You must output "STOP" if the UAV has already reached the described target

[14] Output a JSON list of actions, in the exact order they should be executed

[15] TURN_RIGHT

Do not include any explanations, reasoning, or additional text; only output the JSON list. Discrete Action Space: - MOVE_FORWARD: move straight 5 meters in the current heading - TURN_LEFT: rotate left 30 degrees - TURN_RIGHT: rotate right 30 degrees - STOP: stop the flight Output Format Examples: ["TURN_RIGHT", "TURN_RIGHT", "MOVE_FORWARD", "MOVE_FORWAR...
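Entries [11] to [15] above are fragments of the paper's action-prediction prompt rather than bibliographic references. The output rules and discrete action space they describe can be illustrated with a small Python sketch. The function names `parse_actions` and `simulate`, the flat 2D pose, and the heading convention (0 degrees along +y, increasing clockwise) are assumptions made here for illustration; the source does not specify them, and this is not the AirVLN-R1 implementation.

```python
import json
import math

ACTIONS = {"MOVE_FORWARD", "TURN_LEFT", "TURN_RIGHT", "STOP"}
MAX_ACTIONS = 8       # "Predict no more than 8 future actions"
STEP_METERS = 5.0     # MOVE_FORWARD: move straight 5 meters
TURN_DEGREES = 30.0   # TURN_LEFT / TURN_RIGHT: rotate 30 degrees


def parse_actions(raw):
    """Parse a model reply and check it against the quoted output rules."""
    actions = json.loads(raw)
    # Requiring a non-empty list is an assumption of this sketch.
    if not isinstance(actions, list) or not actions:
        raise ValueError("expected a non-empty JSON list")
    if len(actions) > MAX_ACTIONS:
        raise ValueError("more than 8 actions")
    for action in actions:
        if not isinstance(action, str) or action not in ACTIONS:
            raise ValueError(f"unknown action: {action!r}")
    # Assumption: STOP is only meaningful as the final action.
    if "STOP" in actions[:-1]:
        raise ValueError("STOP may only be the final action")
    return actions


def simulate(actions, x=0.0, y=0.0, heading_deg=0.0):
    """Apply actions to a 2D pose and return (x, y, heading_deg).

    Assumed convention: heading 0 points along +y and increases
    clockwise, so TURN_RIGHT adds 30 degrees.
    """
    for action in actions:
        if action == "STOP":
            break
        if action == "MOVE_FORWARD":
            rad = math.radians(heading_deg)
            x += STEP_METERS * math.sin(rad)
            y += STEP_METERS * math.cos(rad)
        elif action == "TURN_LEFT":
            heading_deg = (heading_deg - TURN_DEGREES) % 360.0
        elif action == "TURN_RIGHT":
            heading_deg = (heading_deg + TURN_DEGREES) % 360.0
    return x, y, heading_deg


reply = '["TURN_RIGHT", "TURN_RIGHT", "TURN_RIGHT", "MOVE_FORWARD", "STOP"]'
pose = simulate(parse_actions(reply))
```

Rejecting malformed replies before simulation keeps parsing failures separate from navigation failures, which may matter when comparing models under a shared success metric.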
