Can Vision-Language Models Think from the Sky? Unifying UAV Reasoning and Generation

Donglin Di; Gangyi Ding; Hu Zhang; Jintao Sun; Zhedong Zheng

arxiv: 2604.05377 · v2 · submitted 2026-04-07 · 💻 cs.CV

Pith reviewed 2026-05-10 19:49 UTC · model grok-4.3

classification 💻 cs.CV

keywords UAV · Vision-Language Models · Aerial Reasoning · Cross-Modal Generation · Nadir View · VQA · Dataset · Semantic Segmentation

The pith

Joint reasoning and generation training supplies geometry-aware priors that ground vision-language models in UAV scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard vision-language models falter on high-altitude UAV imagery because objects appear tiny and densely packed, textures repeat, and top-down orientations lack familiar cues. The paper creates UAVReason, a dataset that aligns RGB, depth, segmentation, captions, and question-answer pairs across 23.6K frames, with 273K VQA pairs and 188.8K cross-modal generation samples. It then trains a single model, UAVReason-Bagel, to optimize language reasoning and dense visual generation together. This joint objective yields large gains on single-frame and two-frame temporal VQA, heading-aware questions, and conditioned image synthesis. Ablations indicate that the two tasks reinforce each other: generation improves temporal consistency in reasoning, while reasoning regularizes synthesis under sparse conditioning.

Core claim

We introduce UAVReason, a large-scale UAV-native dataset and evaluation suite for studying unified aerial reasoning and generation under this nadir-view domain shift. UAVReason aligns RGB imagery, depth maps, semantic segmentation masks, captions, and question-answer pairs within a consistent aerial domain. We further adapt UAVReason-Bagel as a unified understanding-and-generation baseline that jointly optimizes language reasoning and dense visual generation objectives. Experiments show that general-purpose VLMs and off-the-shelf unified generators struggle with UAV-native grounding, while UAVReason-Bagel substantially improves over its pretrained counterpart. More importantly, our ablations reveal a bidirectional synergy between synthesis and reasoning.

What carries the argument

UAVReason-Bagel, a model that jointly optimizes language reasoning objectives and dense visual generation objectives across RGB, depth, and segmentation modalities.
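
A joint objective of this kind can be sketched as a weighted sum of a language loss and a generation loss. The sketch below is an illustration only: it assumes next-token cross-entropy for the reasoning side and a velocity-prediction mean-squared error for the generation side (the two experts named in the Figure 5 caption). The weight `lambda_gen`, the function names, the array shapes, and the sign convention of the velocity target are all hypothetical; the paper's actual formulation is not given in this summary.

```python
import numpy as np

def next_token_cross_entropy(logits, targets):
    """Mean cross-entropy over positions. logits: (T, V); targets: (T,) ints."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

def velocity_mse(pred_velocity, noise, clean_latent):
    """MSE against a straight-line velocity target (noise - clean_latent).

    Assumption: a rectified-flow-style target; the paper may use another one.
    """
    target = noise - clean_latent
    return float(np.mean((pred_velocity - target) ** 2))

def joint_loss(logits, targets, pred_velocity, noise, clean_latent, lambda_gen=1.0):
    """Weighted sum of the two losses; lambda_gen is a hypothetical weight."""
    reasoning = next_token_cross_entropy(logits, targets)
    generation = velocity_mse(pred_velocity, noise, clean_latent)
    return reasoning + lambda_gen * generation
```

Setting `lambda_gen=0.0` recovers a reasoning-only objective, which is one way to express the kind of control arm the referee report asks for.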

If this is right

Dense generation objectives improve temporal semantic consistency in multi-frame aerial questions.
Language-level reasoning regularizes image synthesis under sparse conditioning such as depth-plus-text.
The unified model raises heading-aware VQA F1 from 0.798 to 0.973 on nadir-view scenes.
Segmentation mIoU for generated outputs increases to 0.143, while KID for depth-segmentation-text-conditioned RGB synthesis drops from 0.078 to 0.048.
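
For readers unfamiliar with the segmentation metric in the last item, here is a minimal sketch of mean intersection-over-union using its standard definition. How the paper handles absent classes or maps generated images back to label maps is not stated here, so the skip rule and the function name `mean_iou` are assumptions.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in the prediction or the target."""
    ious = []
    for c in range(num_classes):
        p = pred == c
        t = target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class absent from both maps: skipped (an assumption)
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious)) if ious else float("nan")

pred = np.array([[0, 1], [1, 1]])
target = np.array([[0, 1], [0, 1]])
score = mean_iou(pred, target, num_classes=3)  # class 0: 1/2, class 1: 2/3
```

Because every class counts equally, many small classes with poor overlap pull the mean down sharply, which is worth keeping in mind when reading a value such as 0.143.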

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar joint training may help vision-language models adapt to other viewpoint shifts such as underwater or microscopic imagery.
The structural priors learned here could support downstream UAV tasks like path planning that require consistent 3D understanding from 2D inputs.
If the synergy holds, future aerial agents might maintain a single shared representation instead of separate perception and planning modules.

Load-bearing premise

The reported gains arise chiefly from the bidirectional interaction between reasoning and generation rather than from simply adding more UAV-specific training data or from tuning on this particular dataset.

What would settle it

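A controlled experiment that trains otherwise identical models on the same UAV imagery and annotations with one objective removed: a reasoning-only arm checked against the VQA-1F and VQA-2F F1 scores of 0.711 and 0.822, and a generation-only arm checked against the segmentation mIoU of 0.143. If the single-objective arms match those numbers, the gains come from domain data exposure rather than from joint training.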
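
One way to keep such a comparison honest is to derive every arm from a single base configuration so that only the active losses differ. The sketch below is a hypothetical layout: the class `ArmConfig`, its fields, and the placeholder values for steps, learning rate, and seed are illustrative and are not taken from the paper.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class ArmConfig:
    name: str
    use_reasoning_loss: bool
    use_generation_loss: bool
    train_steps: int = 10_000     # placeholder value
    learning_rate: float = 1e-5   # placeholder value
    seed: int = 0

def build_arms(base):
    """Return joint, reasoning-only, and generation-only arms from one base."""
    return [
        replace(base, name="joint", use_reasoning_loss=True, use_generation_loss=True),
        replace(base, name="reasoning_only", use_reasoning_loss=True, use_generation_loss=False),
        replace(base, name="generation_only", use_reasoning_loss=False, use_generation_loss=True),
    ]

def same_schedule(a, b):
    """True when two arms share every setting except name and loss switches."""
    shared = ("train_steps", "learning_rate", "seed")
    return all(getattr(a, key) == getattr(b, key) for key in shared)

arms = build_arms(ArmConfig(name="base", use_reasoning_loss=True, use_generation_loss=True))
```

Checking `same_schedule` across all arms before launching runs guards against an accidental hyperparameter difference being mistaken for synergy.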

Figures

Figures reproduced from arXiv: 2604.05377 by Donglin Di, Gangyi Ding, Hu Zhang, Jintao Sun, Zhedong Zheng.

**Figure 1.** UAVReason, a unified benchmark for nadir-view spatio-temporal reasoning and cross-modal generation. 1Frame-VAQ (Left): Single-frame reasoning with orientation cues (e.g., north), requiring object counting, scene captioning, spatial reasoning, and comparison ability. 2Frame-VAQ (Middle): Two-frame temporal reasoning over aligned viewpoints, probing motion direction and temporal changes in relative distanc…

**Figure 2.** Overview of UAVReason benchmark tasks and I/O protocols. Top (Reasoning): Language-centric reasoning on nadir-view UAV imagery, including single-frame spatial queries (e.g., counting/comparison and referring questions with orientation cues), global scene captioning, and two-frame temporal reasoning (motion direction and cross-time relational verification). Middle (Geometry prediction): Dense pixel-level su…

**Figure 4.** Task and Data Distribution in UAVReason.

**Figure 5.** Unified multi-task baseline for UAVReason. We adopt a shared multi-modal transformer backbone with two task-specialized experts: a Reasoning Expert (top) optimized for next-token prediction over language outputs, and a Generation Expert (bottom) optimized for velocity prediction under diffusion-style latent denoising. Text is tokenized and fused with visual features via multi-modal self-attention, while ex…

**Figure 6.** Qualitative comparison on fine-grained spatio-temporal reasoning. We evaluate UAVReason-Bagel (Ours) against leading general-domain VLMs (e.g., Qwen2.5-VL (Li et al., 2023), InternVL (Wang et al., 2023a)) across three challenging aerial tasks. Left (Counting): Our model accurately counts tiny, densely packed instances (traffic barriers), whereas baselines suffer from severe miss-detection or count hallucin…

**Figure 7.** Qualitative comparison on fine-grained scene description. On a nadir-view airstrip, UAVReason-Bagel (Ours) achieves superior semantic granularity, correctly identifying the context (“airstrip”) and resolving specific instances (e.g., “sedans”). In contrast, general-domain VLMs exhibit severe semantic drift, often hallucinating non-existent scenes (e.g., “construction site”) or resorting to coarse abstracti…

**Figure 8.** Qualitative comparison of dense perception and conditional synthesis. We visualize predictions from UAVReason-Bagel versus general-domain baselines (BAGEL, OmniGen2). In perception tasks (Top rows), our model recovers fine-grained object boundaries and smooth depth gradients, whereas baselines exhibit significant noise and semantic errors. In conditional synthesis (Bottom rows), our model generates photore…

**Figure 9.** Qualitative comparison of dense perception and conditional synthesis. We visualize predictions from UAVReason-Bagel versus general-domain baselines (BAGEL, OmniGen2). In perception tasks (Top rows), our model recovers fine-grained object boundaries and smooth depth gradients, whereas baselines exhibit significant noise and semantic errors. In conditional synthesis (Bottom rows), our model generates photore…

**Figure 10.** Prompts for the Image Enrichment phase. Step-1 verifies global layout, while Step-2 fills fine-grained attributes and spatial relations.

**Figure 11.** Prompts for Question Generation. The top block instructs the model to cover diverse cognitive tasks (e.g., counting, spatial reasoning) on single frames. The bottom block enforces cross-frame tracking and dynamic state analysis for temporal understanding.

**Figure 12.** Prompts for the Program-Aided Answer Generation. The pipeline consists of three stages: the Planner parses the natural language question into a structured JSON query; the Writer synthesizes a fluent answer based on ground-truth facts computed by the deterministic executor; and the Validator performs a final sanity check to ensure no numerical hallucinations are included.

Original abstract

Vision-Language Models have achieved strong progress in ground-view visual understanding, yet they remain brittle in high-altitude Unmanned Aerial Vehicle scenes, where objects are tiny and densely packed, textures are repetitive, and top-down orientations are ambiguous. We introduce UAVReason, a large-scale UAV-native dataset and evaluation suite for studying unified aerial reasoning and generation under this nadir-view domain shift. UAVReason aligns RGB imagery, depth maps, semantic segmentation masks, captions, and question-answer pairs within a consistent aerial domain. It contains 23.6K captioned frames, 273K VQA pairs including 68.2K two-frame temporal questions, and 188.8K cross-modal generation samples across RGB, depth, and segmentation modalities. We further adapt UAVReason-Bagel as a unified understanding-and-generation baseline that jointly optimizes language reasoning and dense visual generation objectives. Experiments show that general-purpose VLMs and off-the-shelf unified generators struggle with UAV-native grounding, while UAVReason-Bagel substantially improves over its pretrained counterpart, increasing VQA-1F F1 from 0.394 to 0.711, VQA-2F F1 from 0.427 to 0.822, and heading-aware VQA F1 from 0.798 to 0.973. For generation, it improves segmentation mIoU to 0.143 and reduces KID from 0.078 to 0.048 for depth-segmentation-text-conditioned RGB synthesis. More importantly, our ablations reveal a bidirectional synergy between synthesis and reasoning. Dense generation objectives improve temporal semantic consistency, while language-level reasoning regularizes sparse-condition image synthesis. These results suggest that unified reasoning and generation provide effective geometry-aware structural priors for physically grounded aerial intelligence. All data, code, and evaluation tools will be released.
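
The headline numbers in the abstract are easier to compare as absolute and relative changes. The short calculation below uses only figures quoted in the abstract; the derived values are plain arithmetic, not results reported by the authors. Treating the 273K VQA pairs as consisting only of single-frame and two-frame questions is an assumption.

```python
# (before, after) pairs quoted in the abstract
results = {
    "VQA-1F F1": (0.394, 0.711),
    "VQA-2F F1": (0.427, 0.822),
    "heading-aware VQA F1": (0.798, 0.973),
    "KID (lower is better)": (0.078, 0.048),
}

def gains(before, after):
    """Return (absolute change, relative change in percent)."""
    absolute = after - before
    return round(absolute, 3), round(100 * absolute / before, 1)

summary = {name: gains(*pair) for name, pair in results.items()}

# Dataset counts in thousands, also from the abstract.
# Assumption: the VQA pairs split only into single-frame and two-frame questions.
single_frame_vqa_k = round(273.0 - 68.2, 1)
generation_samples_per_frame = round(188.8 / 23.6, 2)

for name, (absolute, relative) in summary.items():
    print(f"{name}: {absolute:+.3f} ({relative:+.1f}%)")
```

Read this way, the heading-aware score starts from a much higher baseline than the other two VQA metrics, so its smaller relative gain is expected.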

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

New UAV dataset and joint baseline are the real value here, but the bidirectional synergy claim needs tighter controls to hold up.

The paper gives us UAVReason, a dataset of 23.6K aerial frames with aligned depth, segmentation, captions, and 273K VQA pairs plus generation targets. They pair it with UAVReason-Bagel, a model that trains reasoning and dense generation together and reports clear lifts over pretrained VLMs on VQA F1 and segmentation mIoU. That is the concrete new thing: a domain-specific resource for nadir-view UAV tasks where standard models fail on small objects and orientation ambiguity. Releasing the data and code is useful on its own for anyone working on drone perception in agriculture or monitoring.

Referee Report

2 major / 3 minor

Summary. The paper introduces the UAVReason dataset (23.6K captioned frames, 273K VQA pairs including 68.2K two-frame temporal questions, and 188.8K cross-modal generation samples) for nadir-view UAV reasoning and generation. It proposes UAVReason-Bagel, a unified model jointly optimizing language reasoning and dense visual generation (RGB, depth, segmentation) objectives, reporting large gains over pretrained VLMs and off-the-shelf generators (VQA-1F F1 0.394→0.711, VQA-2F F1 0.427→0.822, heading-aware VQA F1 0.798→0.973, segmentation mIoU 0.143, KID 0.078→0.048) and attributing them to bidirectional synergy that supplies geometry-aware structural priors.

Significance. If the performance deltas are shown to arise from joint optimization rather than data volume or tuning alone, the work would provide a useful new benchmark and baseline for UAV-native VLMs, demonstrating value in unifying reasoning and generation for domains with tiny objects, repetitive textures, and orientation ambiguity. The explicit plan to release the full dataset, code, and evaluation tools is a clear strength that supports reproducibility.

major comments (2)

[Ablation studies / Experiments] Ablation studies (as summarized in the abstract and Experiments): the central claim that 'bidirectional synergy' between reasoning and generation supplies geometry-aware priors rests on comparisons of the joint UAVReason-Bagel only against its pretrained checkpoint and off-the-shelf VLMs. No control arms fine-tune reasoning-only and generation-only models on the identical 23.6K frames + 273K VQA + 188.8K generation samples under the same schedule and hyperparameters. Without these, the reported lifts (e.g., VQA-2F F1 0.427→0.822 and improved temporal consistency) cannot be attributed to joint training rather than domain-specific data exposure.
[Dataset construction] Dataset and evaluation section: the abstract states that UAVReason 'aligns RGB imagery, depth maps, semantic segmentation masks, captions, and question-answer pairs within a consistent aerial domain,' yet no details are provided on train/test splits, collection protocol, or checks for overlap with the pretraining corpora of the base VLM. This is load-bearing for interpreting whether the gains reflect genuine UAV-native generalization or leakage.
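
An overlap check of the kind requested in the dataset comment is often done with a perceptual hash. The sketch below implements a simple average hash on grayscale arrays with NumPy. It is a generic technique, not the authors' procedure, and the function names, the 8x8 hash size, and the distance threshold of 5 bits are illustrative assumptions.

```python
import numpy as np

def average_hash(gray, hash_size=8):
    """Boolean average hash of a 2D grayscale array via block-mean downsampling.

    Assumes both image dimensions are at least hash_size.
    """
    h, w = gray.shape
    # Crop so the image divides evenly into hash_size x hash_size blocks.
    gray = gray[: h - h % hash_size, : w - w % hash_size].astype(np.float64)
    bh, bw = gray.shape[0] // hash_size, gray.shape[1] // hash_size
    blocks = gray.reshape(hash_size, bh, hash_size, bw).mean(axis=(1, 3))
    return (blocks > blocks.mean()).flatten()

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return int(np.count_nonzero(a != b))

def near_duplicates(query_hashes, corpus_hashes, max_distance=5):
    """Index pairs (query, corpus) whose hashes differ by at most max_distance bits."""
    return [
        (i, j)
        for i, q in enumerate(query_hashes)
        for j, c in enumerate(corpus_hashes)
        if hamming(q, c) <= max_distance
    ]

image = np.arange(64 * 64).reshape(64, 64)
brighter = image + 10            # uniform brightness shift
inverted = image.max() - image   # very different structure
pairs = near_duplicates(
    [average_hash(image)], [average_hash(brighter), average_hash(inverted)]
)
```

The all-pairs loop is quadratic, so a check against a web-scale corpus would need an index over the hashes; that part is out of scope for this sketch.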

minor comments (3)

[Model] Clarify the precise formulation of the joint training objective (e.g., weighting between language modeling loss and dense generation losses) and the architecture modifications in UAVReason-Bagel.
[Experiments] Define all reported metrics (KID, heading-aware VQA F1) and provide error bars or statistical significance for the metric improvements.
[Ablations] The abstract claims 'dense generation objectives improve temporal semantic consistency'; include the specific ablation table or metric used to quantify this temporal effect.
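
The request for metric definitions and error bars can be made concrete with a short sketch of Exact Match, token-level F1, and a percentile bootstrap interval. The normalization here (lowercasing, stripping punctuation and English articles) follows common QA evaluation practice and is an assumption, since the paper's normalization rules are truncated in the excerpt available here. All function names are hypothetical.

```python
import random
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace (assumed rules)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred, gold):
    return float(normalize(pred) == normalize(gold))

def token_f1(pred, gold):
    """Harmonic mean of token precision and recall after normalization."""
    p, g = normalize(pred).split(), normalize(gold).split()
    if not p or not g:
        return float(p == g)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-example scores."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples))
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high

f1 = token_f1("three red cars", "three cars")  # precision 2/3, recall 1
```

Reporting `bootstrap_ci` over per-question F1 scores for each model would show whether a gap such as 0.427 versus 0.822 is well outside resampling noise.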

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. The comments help strengthen the attribution of our results to joint optimization and improve the transparency of the UAVReason dataset. We address each major comment below and will revise the manuscript accordingly.

Referee: [Ablation studies / Experiments] Ablation studies (as summarized in the abstract and Experiments): the central claim that 'bidirectional synergy' between reasoning and generation supplies geometry-aware priors rests on comparisons of the joint UAVReason-Bagel only against its pretrained checkpoint and off-the-shelf VLMs. No control arms fine-tune reasoning-only and generation-only models on the identical 23.6K frames + 273K VQA + 188.8K generation samples under the same schedule and hyperparameters. Without these, the reported lifts (e.g., VQA-2F F1 0.427→0.822 and improved temporal consistency) cannot be attributed to joint training rather than domain-specific data exposure.

Authors: We agree that the current comparisons, while showing clear gains over pretrained checkpoints and off-the-shelf models, do not fully isolate the contribution of joint optimization from the effects of domain-specific fine-tuning data. To address this, we will add new control experiments in the revised manuscript: (1) a reasoning-only model fine-tuned solely on the VQA and captioning objectives using the identical 23.6K frames and 273K VQA pairs, and (2) a generation-only model fine-tuned on the 188.8K cross-modal generation samples, both under the same training schedule and hyperparameters as UAVReason-Bagel. These results will be reported alongside the joint model, with quantitative analysis of temporal consistency and generation metrics to demonstrate the bidirectional synergy.

Revision: yes

Referee: [Dataset construction] Dataset and evaluation section: the abstract states that UAVReason 'aligns RGB imagery, depth maps, semantic segmentation masks, captions, and question-answer pairs within a consistent aerial domain,' yet no details are provided on train/test splits, collection protocol, or checks for overlap with the pretraining corpora of the base VLM. This is load-bearing for interpreting whether the gains reflect genuine UAV-native generalization or leakage.

Authors: We acknowledge that additional details on dataset construction are necessary for proper interpretation. In the revision, we will expand the Dataset section with: (i) the full collection protocol, including UAV platforms, flight parameters (altitude, speed, camera angles), geographic regions, and annotation pipeline; (ii) explicit train/test split criteria (e.g., by disjoint geographic tiles and temporal windows to avoid leakage); and (iii) overlap verification procedures, including perceptual hash comparisons and caption similarity searches against common pretraining corpora such as LAION-5B, COCO, and Visual Genome. We confirm that UAVReason was collected independently from new UAV flights and contains no direct overlap with these corpora.

Revision: yes
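
A split by disjoint geographic tiles, as mentioned in item (ii), can be sketched by assigning whole tiles to a split instead of individual frames. The example below is a hypothetical illustration: the `tile_id` field, the hashing scheme, and the 20% test fraction are assumptions and do not describe the authors' actual protocol.

```python
import hashlib

def assign_split(tile_id, test_fraction=0.2):
    """Deterministically map a tile id to 'train' or 'test'."""
    digest = hashlib.sha256(tile_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "test" if bucket < test_fraction else "train"

def split_frames(frames, test_fraction=0.2):
    """Group frames by the split of their tile, so no tile spans both splits."""
    splits = {"train": [], "test": []}
    for frame in frames:
        splits[assign_split(frame["tile_id"], test_fraction)].append(frame)
    return splits

# Hypothetical frame records: 50 tiles with 4 frames each.
frames = [
    {"frame_id": f"f{t}_{k}", "tile_id": f"tile_{t}"}
    for t in range(50)
    for k in range(4)
]
splits = split_frames(frames)
train_tiles = {f["tile_id"] for f in splits["train"]}
test_tiles = {f["tile_id"] for f in splits["test"]}
```

Hashing the tile id keeps the assignment stable when new frames are added later, at the cost of the realized test fraction only approximating the requested one.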

Circularity Check

0 steps flagged

No circularity: empirical results on new dataset with external baselines

The paper introduces UAVReason dataset and trains UAVReason-Bagel on it, reporting measured performance gains (e.g., VQA F1 improvements) against pretrained checkpoints and off-the-shelf VLMs. No equations, fitted parameters renamed as predictions, or self-citations are invoked to derive the synergy claim; the bidirectional synergy statement follows directly from the reported ablation comparisons without reducing to input quantities by construction. The work is self-contained as standard empirical ML research with new data and model training.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that a newly collected UAV dataset faithfully captures the domain shift and that joint optimization of language and dense visual objectives yields geometry-aware priors. No invented physical entities are introduced, and the only free parameters are standard deep-learning training hyperparameters.

free parameters (1)

joint training hyperparameters
Standard learning-rate, loss-weight and batch-size choices required to train UAVReason-Bagel; not enumerated in the abstract.

axioms (1)

Domain assumption: joint optimization of reasoning and dense generation objectives produces bidirectional synergy.
Invoked to explain the ablation results and the final claim about geometry-aware priors.

pith-pipeline@v0.9.0 · 5645 in / 1428 out tokens · 65077 ms · 2026-05-10T19:49:25.105997+00:00 · methodology

Reference graph

Works this paper leans on

2 extracted references · 2 canonical work pages

[1]

anything-to-image

Exploring models and data for remote sensing image caption generation. IEEE Transactions on Geoscience and Remote Sensing, 56(4):2183–2195. Ye Lyu, George Vosselman, Gui-Song Xia, Alper Yilmaz, and Michael Ying Yang. 2020. UAVid: A semantic segmentation dataset for UAV imagery. ISPRS Journal of Photogrammetry and Remote Sensing. Ishan Nigam, Chen...

work page arXiv 2020
[2]

approaching

as the judge model and apply fixed prompt templates and rubrics to ensure consistent and reproducible scoring. We denote these semantic scores as LLM-Judge (LLM-J), with task-specific scales used across VQA and captioning. VQA (Single-frame and Two-frame). We report Exact Match (EM) and Token-level F1 after standard answer normalization, including lowercas- ...

work page arXiv 2025
