DynFly: Dynamic-Aware Continuous Trajectory Generation for UAV Vision-Language Navigation in Urban Environments

Bin Xu; Hanfang Liang; Hongwei Duan; Huaping Liu; Jinyuan Liu; Kangyao Huang; Li Wang; Shaoyu Liu; Wang Xu; Wei Fan

arxiv: 2606.31654 · v1 · pith:6KOQZRHOnew · submitted 2026-06-30 · 💻 cs.RO · cs.CV

Wen Jiang, Hanfang Liang, Li Wang, Kangyao Huang, Wang Xu, Wei Fan, Jinyuan Liu, Shaoyu Liu, Hongwei Duan, Bin Xu, Xiangyang Ji, Huaping Liu

Pith reviewed 2026-07-01 05:32 UTC · model grok-4.3

keywords UAV vision-language navigation · continuous trajectory generation · B-spline representation · flow matching · dynamic-aware supervision · urban environments · navigation performance

The pith

DynFly generates continuous UAV trajectories from vision-language navigation commands by representing them as B-spline control points and training a generator with flow matching plus dynamic losses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that existing UAV vision-language navigation methods produce only discrete actions or sparse waypoints, leaving a gap to real continuous motion. DynFly fills this by encoding expert paths in B-spline control-point space and training a Spline-DiT model via flow matching under losses that penalize violations of position, finite-difference velocity, acceleration, heading, and local target alignment. The layer can be added to any existing high-level reasoning pipeline without altering it. A sympathetic reader would care because smoother, more executable trajectories could raise success rates and reduce navigation error in urban UAV tasks. If correct, the work shows that discrete-action baselines are limited by their inability to enforce continuous motion constraints.

Core claim

DynFly bridges high-level navigation intent and continuous UAV motion through a lightweight trajectory generation layer. Specifically, it represents expert trajectories in B-spline control-point space and employs a Spline-DiT generator to learn conditional trajectory generation via flow matching. UAV-oriented dynamic-aware supervision over position, finite-difference velocity, finite-difference acceleration, heading consistency, and local target alignment enables the generated trajectories to better satisfy UAV motion characteristics. The framework integrates with existing UAV-VLN pipelines while preserving their original visual-language reasoning.
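To make the control-point representation concrete, the sketch below evaluates a continuous curve from a handful of control points. It assumes a uniform cubic B-spline, which is an illustrative choice: the abstract does not state the spline degree or knot vector, and the function names here are hypothetical.

```python
def cubic_bspline_point(p0, p1, p2, p3, u):
    """Evaluate one uniform cubic B-spline segment at u in [0, 1]."""
    # Uniform cubic B-spline basis functions; they sum to 1 for every u.
    b0 = (1 - u) ** 3 / 6.0
    b1 = (3 * u**3 - 6 * u**2 + 4) / 6.0
    b2 = (-3 * u**3 + 3 * u**2 + 3 * u + 1) / 6.0
    b3 = u**3 / 6.0
    return tuple(
        b0 * a + b1 * b + b2 * c + b3 * d
        for a, b, c, d in zip(p0, p1, p2, p3)
    )


def sample_bspline(control_points, samples_per_segment=10):
    """Sample a dense path from a list of 3D control points."""
    n = len(control_points)
    if n < 4:
        raise ValueError("a cubic B-spline needs at least 4 control points")
    path = []
    # Each window of 4 consecutive control points defines one segment.
    for i in range(n - 3):
        for k in range(samples_per_segment):
            u = k / samples_per_segment
            path.append(cubic_bspline_point(*control_points[i:i + 4], u))
    # Close the curve with the end of the last segment.
    path.append(cubic_bspline_point(*control_points[n - 4:n], 1.0))
    return path


if __name__ == "__main__":
    # Eight hypothetical control points on a straight line.
    cps = [(float(i), 0.0, 0.0) for i in range(8)]
    path = sample_bspline(cps)
    print(len(path), path[0], path[-1])
```

Under this assumption, eight control points give five cubic segments, so a generator only has to predict 8 × 3 coordinates and the dense path follows from evaluation.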

What carries the argument

Spline-DiT generator trained by flow matching on B-spline control points, supervised by dynamic losses on position, velocity, acceleration, heading, and target alignment
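A minimal sketch of a flow-matching training target on a flattened control-point vector, assuming the common linear interpolation path between a noise sample and the expert control points. The paper's actual probability path, conditioning, and Spline-DiT network are not described in this text, so `velocity_fn` stands in for the learned model and every name below is hypothetical.

```python
def interpolate(noise, target, t):
    """Point on the straight line from noise (t=0) to target (t=1)."""
    return [(1 - t) * n + t * x for n, x in zip(noise, target)]


def target_velocity(noise, target):
    """Regression target for the linear path: constant velocity target - noise."""
    return [x - n for n, x in zip(noise, target)]


def flow_matching_loss(predicted_velocity, noise, target):
    """Mean squared error between predicted and target velocity."""
    v = target_velocity(noise, target)
    return sum((p - q) ** 2 for p, q in zip(predicted_velocity, v)) / len(v)


def euler_sample(velocity_fn, noise, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(noise)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    noise = [0.0] * 24                          # 8 control points x 3 coordinates
    expert = [0.5 * i for i in range(24)]       # hypothetical expert control points
    # An oracle velocity field stands in for the trained generator.
    oracle = lambda x, t: target_velocity(noise, expert)
    print(euler_sample(oracle, noise)[:3])
```

At training time a model would be asked to predict `target_velocity` from `interpolate(noise, target, t)` at a random `t`; at inference time, integrating the predicted velocity from noise yields control points.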

If this is right

The trajectory layer integrates with any existing UAV-VLN framework without changing its visual-language reasoning pipeline.
On the Test Unseen Full split, the method improves on the strongest baseline by 4.69 NDTW, 2.40 SDTW, 2.14 SR, and 4.87 OSR points while reducing NE by 4.51 m.
Generated paths better match UAV motion properties than paths from discrete-action or sparse-waypoint baselines.
Both navigation success and trajectory quality improve on the OpenUAV benchmark.
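For readers unfamiliar with the trajectory metrics, the sketch below computes nDTW and SDTW as they are commonly defined in the vision-language navigation literature: nDTW = exp(-DTW / (|R| · d_th)) for a reference path R, and SDTW equal to nDTW on successful episodes and zero otherwise. The success threshold is a parameter because the benchmark's value is not stated in this text, and benchmark tables often report these scores multiplied by 100.

```python
import math


def dtw(pred, ref):
    """Dynamic time warping cost between two paths of 3D points."""
    n, m = len(pred), len(ref)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(pred[i - 1], ref[j - 1])
            # Extend the cheapest of the three admissible alignments.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def ndtw(pred, ref, success_threshold):
    """Normalized DTW in (0, 1]; 1 means the paths coincide."""
    return math.exp(-dtw(pred, ref) / (len(ref) * success_threshold))


def sdtw(pred, ref, success_threshold):
    """nDTW gated by success: zero if the final point misses the goal."""
    success = math.dist(pred[-1], ref[-1]) <= success_threshold
    return ndtw(pred, ref, success_threshold) if success else 0.0


if __name__ == "__main__":
    ref = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
    pred = [(0.0, 0.0, 0.0), (10.0, 3.0, 0.0), (20.0, 5.0, 0.0)]
    # The 20 m threshold is an illustrative assumption.
    print(ndtw(pred, ref, 20.0), sdtw(pred, ref, 20.0))
```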

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same B-spline-plus-flow-matching approach could be tested on ground robots or other vehicles that require smooth continuous control from language instructions.
Physical flight tests on real UAVs would show whether the simulated metric gains translate to hardware under wind and sensor noise.
The dynamic losses might be reused in other trajectory tasks such as multi-agent coordination or energy-aware path planning.

Load-bearing premise

Expert trajectories encoded in B-spline control-point space, when trained with flow matching and the listed dynamic losses, will produce motions that UAVs can execute more effectively than discrete-action predictions.

What would settle it

Implementing the full DynFly pipeline on the OpenUAV Test Unseen Full split and observing no gains over the strongest baseline in NDTW, SDTW, SR, OSR, or NE.

Figures

Figures reproduced from arXiv: 2606.31654 by Bin Xu, Hanfang Liang, Hongwei Duan, Huaping Liu, Jinyuan Liu, Kangyao Huang, Li Wang, Shaoyu Liu, Wang Xu, Wei Fan, Wen Jiang, Xiangyang Ji.

**Figure 2.** Overall architecture and workflow of DynFly. The Qwen2.5-3B visual-language front-end [...]

**Figure 3.** Pseudo control-point label generation. Discrete expert waypoints are fitted with an open [...]

**Figure 4.** DiT-based control-point flow generation. The condition encoder fuses visual context, the [...]

**Figure 5.** Joint loss design for Spline-DiT trajectory generation. Flow matching provides control [...]

**Figure 6.** Full-split comparison across Unseen Overall, Unseen Object, and Unseen Map settings. [...]

**Figure 7.** Sensitivity to the number of B-spline control points. Eight control points provide the best [...]

**Figure 8.** Qualitative navigation comparison under the same instruction and visual scene. SpatialFly [...]

**Figure 9.** Additional qualitative comparison with multi-view trajectory details. The enlarged views [...]

Original abstract

Recent advances in multimodal large models have significantly improved UAV vision-language navigation (UAV-VLN) by enhancing high-level perception and reasoning. However, existing methods mainly focus on predicting discrete actions, local targets, or sparse waypoints, while the continuous transition from navigation intent to executable UAV motion remains weakly modeled. This motion-interface gap limits the continuity, stability, and executability of generated UAV trajectories. To address this gap, we propose DynFly, a dynamic-aware continuous trajectory generation framework that bridges high-level navigation reasoning and executable UAV motion. DynFly bridges high-level navigation intent and continuous UAV motion through a lightweight trajectory generation layer. Specifically, it represents expert trajectories in B-spline control-point space and employs a Spline-DiT generator to learn conditional trajectory generation via flow matching. Furthermore, we introduce UAV-oriented dynamic-aware supervision over position, finite-difference velocity, finite-difference acceleration, heading consistency, and local target alignment, enabling the generated trajectories to better satisfy UAV motion characteristics. And our trajectory generation framework can also be integrated with an existing UAV-VLN framework while preserving its original visual-language reasoning pipeline. Extensive experiments on the OpenUAV UAV-VLN benchmark show that DynFly improves both navigation performance and trajectory quality. On the Test Unseen Full split, DynFly improves the strongest baseline by 4.69 NDTW, 2.40 SDTW, 2.14 SR points and 4.87 OSR points, while reducing NE by 4.51 m.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DynFly adds a B-spline flow-matching layer with finite-difference dynamic losses to UAV-VLN pipelines and reports benchmark gains, but the physical executability claim rests on soft supervision that may not enforce real UAV limits.

DynFly puts a continuous trajectory generator on top of existing vision-language navigation models for UAVs. It encodes expert paths as B-spline control points, trains a Spline-DiT via flow matching, and adds five losses on position, finite-difference velocity and acceleration, heading, and local target alignment. The module plugs into prior pipelines without touching the visual-language reasoning part.
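The finite-difference terms named above can be sketched as follows. This is an illustration under stated assumptions, not the paper's loss: it assumes uniformly spaced samples with time step `dt`, uses mean squared error for every term, and omits the heading and local target alignment terms; all function names are hypothetical.

```python
def finite_difference(points, dt):
    """First-order forward difference of a sequence of 3D tuples."""
    return [
        tuple((b - a) / dt for a, b in zip(p, q))
        for p, q in zip(points, points[1:])
    ]


def mse(xs, ys):
    """Mean squared error over all coordinates of two equal-length sequences."""
    total, count = 0.0, 0
    for x, y in zip(xs, ys):
        for a, b in zip(x, y):
            total += (a - b) ** 2
            count += 1
    return total / count


def dynamic_losses(pred, expert, dt):
    """Position, velocity, and acceleration losses between two sampled paths."""
    v_pred, v_exp = finite_difference(pred, dt), finite_difference(expert, dt)
    a_pred, a_exp = finite_difference(v_pred, dt), finite_difference(v_exp, dt)
    return {
        "position": mse(pred, expert),
        "velocity": mse(v_pred, v_exp),
        "acceleration": mse(a_pred, a_exp),
    }


if __name__ == "__main__":
    expert = [(float(t), 0.0, 0.0) for t in range(5)]
    # A prediction with the same motion profile but a constant lateral offset.
    pred = [(x, y + 1.0, z) for x, y, z in expert]
    print(dynamic_losses(pred, expert, dt=0.5))
```

In the usage example the prediction is offset sideways by a constant, so only the position term is non-zero; the velocity and acceleration terms respond to differences in how the path is traversed rather than where it lies.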

The combination looks like a genuine technical step for this sub-area. The reported numbers on OpenUAV Test Unseen Full are concrete: +4.69 NDTW, +2.40 SDTW, +2.14 SR, +4.87 OSR, and -4.51 m NE against the strongest baseline.

The soft spots are in the dynamic supervision and the evaluation. Finite-difference terms on velocity and acceleration are only soft penalties; they do not impose hard actuator, thrust, or higher-order dynamic bounds that real UAVs face. The benchmark metrics track task success, not tracking error under a dynamics model or feasibility in simulation. The abstract gives no ablations, no baseline descriptions, and no error bars, so the contribution of the dynamic losses versus the continuous representation is unclear.
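One way to probe the gap between soft penalties and hard bounds is to measure how often a generated path exceeds fixed speed and acceleration limits. The sketch below does this with finite differences; `v_max` and `a_max` are hypothetical placeholders, not values from the paper or from any specific airframe, and passing this check would still not establish feasibility under a full dynamics model.

```python
import math


def finite_difference(points, dt):
    """First-order forward difference of a sequence of 3D tuples."""
    return [
        tuple((b - a) / dt for a, b in zip(p, q))
        for p, q in zip(points, points[1:])
    ]


def norms(vectors):
    """Euclidean norm of each vector."""
    return [math.sqrt(sum(c * c for c in v)) for v in vectors]


def limit_violations(points, dt, v_max, a_max):
    """Report peak speed/acceleration and the fraction of samples over the limits."""
    velocity = finite_difference(points, dt)
    acceleration = finite_difference(velocity, dt)
    speed, accel = norms(velocity), norms(acceleration)
    return {
        "max_speed": max(speed),
        "max_accel": max(accel),
        "speed_violation_rate": sum(s > v_max for s in speed) / len(speed),
        "accel_violation_rate": sum(a > a_max for a in accel) / len(accel),
    }


if __name__ == "__main__":
    # A hypothetical path that keeps speeding up along x.
    path = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0), (6.0, 0.0, 0.0)]
    print(limit_violations(path, dt=1.0, v_max=2.5, a_max=2.0))
```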

The stress-test concern holds up on the given description: the bridging claim to executable motion is not yet strongly supported by evidence of physical realism.

This is for people already working on UAV-VLN who want a drop-in continuous layer. It is worth a serious referee because it has a clear method, integration story, and measurable gains, even though the physical side needs tighter validation.

Referee Report

2 major / 2 minor

Summary. The paper proposes DynFly, a framework for UAV vision-language navigation that generates continuous trajectories by representing expert paths in B-spline control-point space and training a Spline-DiT generator via flow matching. It adds UAV-oriented dynamic-aware supervision consisting of position, finite-difference velocity, finite-difference acceleration, heading consistency, and local target alignment losses. The method is presented as integrable with existing VLN pipelines without altering their visual-language reasoning, and it reports concrete gains on the OpenUAV benchmark (Test Unseen Full split): +4.69 NDTW, +2.40 SDTW, +2.14 SR, +4.87 OSR, and -4.51 m NE relative to the strongest baseline.

Significance. If the central claim holds after verification, the work would be significant for UAV-VLN because it directly targets the motion-interface gap between high-level multimodal reasoning and executable continuous trajectories. The lightweight integration property and use of flow matching on B-splines are practical strengths. However, significance is limited by the absence of evidence that the soft finite-difference losses produce trajectories that respect realistic UAV dynamics or actuator constraints; the reported metrics reflect task success rather than physical executability or tracking performance under a dynamics model.

major comments (2)

[Abstract] The central claim that the dynamic-aware supervision (position, finite-difference velocity/acceleration, heading, local target) produces trajectories that 'better satisfy UAV motion characteristics' and bridge to 'executable UAV motion' rests on soft losses only; no hard constraints on thrust, actuator limits, or higher-order dynamics are described, and no evaluation of physical feasibility or tracking error under a UAV dynamics model is provided. This directly undermines the executability claim.
[Abstract and Experiments section] Headline metric gains (+4.69 NDTW, +2.14 SR, -4.51 m NE) are reported without baseline descriptions, ablation results isolating the contribution of each dynamic loss, error bars, or statistical tests. This makes it impossible to verify that the dynamic supervision, rather than other components of the Spline-DiT or flow-matching setup, drives the improvements.

minor comments (2)

[Abstract] The phrase 'lightweight trajectory generation layer' is used without quantifying parameters or inference cost relative to the baselines.
The integration claim ('can also be integrated with an existing UAV-VLN framework while preserving its original visual-language reasoning pipeline') would benefit from a concrete diagram or pseudocode showing the interface points.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We respond to each major comment below, acknowledging limitations where they exist and outlining targeted revisions.

Referee: [Abstract] The central claim that the dynamic-aware supervision (position, finite-difference velocity/acceleration, heading, local target) produces trajectories that 'better satisfy UAV motion characteristics' and bridge to 'executable UAV motion' rests on soft losses only; no hard constraints on thrust, actuator limits, or higher-order dynamics are described, and no evaluation of physical feasibility or tracking error under a UAV dynamics model is provided. This directly undermines the executability claim.

Authors: We agree that the supervision consists solely of soft losses without hard constraints on thrust or actuator limits, and that no physical feasibility evaluation or tracking error under a UAV dynamics model is reported. The finite-difference terms are intended to promote smoother, more UAV-plausible trajectories in the learned distribution. We will revise the abstract and method description to remove or qualify language implying direct executability and will add an explicit limitations paragraph noting the absence of dynamics-model validation. Revision: yes.

Referee: [Abstract and Experiments section] Headline metric gains (+4.69 NDTW, +2.14 SR, -4.51 m NE) are reported without baseline descriptions, ablation results isolating the contribution of each dynamic loss, error bars, or statistical tests. This makes it impossible to verify that the dynamic supervision, rather than other components of the Spline-DiT or flow-matching setup, drives the improvements.

Authors: The experiments section already describes the baselines and reports aggregate gains relative to the strongest baseline. However, the current ablations do not fully isolate every individual dynamic loss term with error bars and statistical tests. We will expand the ablation table in the revision to include per-loss contributions, add error bars, and report statistical significance where sample sizes permit; the abstract will be updated to reference the key baselines. Revision: partial.
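As one possible way to attach error bars to a per-episode metric such as success rate, the sketch below computes a percentile bootstrap confidence interval. It is a generic statistical illustration, not something the paper or the rebuttal specifies; the episode outcomes are made up and the function name is hypothetical.

```python
import random


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Return (mean, lower, upper) using a percentile bootstrap over episodes."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(values)
    means = sorted(
        sum(rng.choice(values) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[max(int((1 - alpha / 2) * n_resamples) - 1, 0)]
    return sum(values) / n, lower, upper


if __name__ == "__main__":
    # Hypothetical per-episode success indicators: 30 successes in 100 episodes.
    outcomes = [1.0] * 30 + [0.0] * 70
    mean, lower, upper = bootstrap_ci(outcomes)
    print(f"SR = {mean:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```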

Circularity Check

0 steps flagged

No circularity: training losses and benchmark metrics are independent

The paper trains a Spline-DiT generator via flow matching on expert B-spline trajectories, augmented by finite-difference dynamic losses. Navigation metrics (NDTW, SDTW, SR, NE) are computed on the external OpenUAV benchmark and are not fitted quantities or self-referential predictions. No equations, self-citations, or uniqueness claims appear in the provided text that would reduce any claimed result to its inputs by construction. The derivation chain is a standard supervised generative model evaluated externally.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities can be extracted from the provided text.

Reference graph

Works this paper leans on

36 extracted references · 27 canonical work pages · 8 internal anchors

[1] M. Dai, E. Zheng, W. Cheng, J. Chen, Z. Feng, W. Yang, Drl: An efficient heterogeneous spatial feature interaction framework for uav self-localization, Pattern Recognition 177 (2026) 113330.
[2] Y. Gu, W. Chen, D. Peng, Uav-based multimodal object detection via feature enhancement and dynamic gated fusion, Pattern Recognition 172 (2026) 112722.
[3] B. Dewangan, M. Srinivas, Amsf-yolo: An attention-based multi-scale feature extraction model for uav small object detection, Pattern Recognition 177 (2026) 113303.
[4] P. Anderson, Q. Wu, D. Teney, J. Bruce, M. Johnson, N. Sunderhauf, I. Reid, S. Gould, A. Van Den Hengel, Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3674–3683.
[5] D. Fried, R. Hu, V. Cirik, A. Rohrbach, J. Andreas, L.-P. Morency, T. Berg-Kirkpatrick, K. Saenko, D. Klein, T. Darrell, Speaker-follower models for vision-and-language navigation, arXiv preprint arXiv:1806.02724 (2018).
[6] K. He, Y. Jing, Y. Huang, Z. Lu, D. An, L. Wang, Memory-adaptive vision-and-language navigation, Pattern Recognition 153 (2024) 110511.
[7] B. Mohammadi, E. Abbasnejad, Y. Qi, Q. Wu, A. Van Den Hengel, J. Q. Shi, Parameter-efficient action planning with large language models for vision-and-language navigation, Pattern Recognition 172 (2026) 112462.
[8] S. Liu, J. Li, G. Zhao, Y. Zhang, X. Meng, F. R. Yu, X. Ji, M. Li, Eventgpt: Event stream understanding with multimodal large language models (2024). arXiv:2412.00832.
[9] X. Wang, D. Yang, Z. Wang, H. Kwan, J. Chen, W. Wu, H. Li, Y. Liao, S. Liu, Towards realistic uav vision-language navigation: Platform, benchmark, and methodology, arXiv preprint arXiv:2410.07087 (2024).
[10] X. Sun, W. Si, W. Ni, Y. Li, D. Wu, F. Xie, R. Guan, H.-Y. Xu, H. Ding, Y. Wu, et al., Autofly: Vision-language-action model for uav autonomous navigation in the wild, arXiv preprint arXiv:2602.09657 (2026).
[11] W. Jiang, L. Wang, K. Huang, W. Fan, J. Liu, S. Liu, H. Duan, B. Xu, X. Ji, Longfly: Long-horizon uav vision-and-language navigation with spatiotemporal context integration, arXiv preprint arXiv:2512.22010 (2025).
[12] Y. Liu, F. Yao, Y. Yue, G. Xu, X. Sun, K. Fu, Navagent: Multi-scale urban street view fusion for uav embodied vision-and-language navigation (2024). arXiv:2411.08579. URL https://arxiv.org/abs/2411.08579
[13] S. Chen, P.-L. Guhur, C. Schmid, I. Laptev, History aware multimodal transformer for vision-and-language navigation, arXiv preprint arXiv:2110.13309 (2021).
[14] S. Chen, P.-L. Guhur, M. Tapaswi, C. Schmid, I. Laptev, Think global, act local: Dual-scale graph transformer for vision-and-language navigation, arXiv preprint arXiv:2202.11742 (2022).
[15] J. Chen, B. Lin, R. Xu, Z. Chai, X. Liang, K.-Y. K. Wong, Mapgpt: Map-guided prompting with adaptive path planning for vision-and-language navigation, arXiv preprint arXiv:2401.07314 (2024).
[16] L. Zhang, X. Hao, Q. Xu, Q. Zhang, X. Zhang, P. Wang, J. Zhang, Z. Wang, S. Zhang, R. Xu, Mapnav: A novel memory representation via annotated semantic maps for vision-and-language navigation, arXiv preprint arXiv:2502.13451 (2025).
[17] Z. Xin, W. Li, Y. Jiang, Z. Huang, B. Wang, P. Li, J. Zhu, J. Qin, S. Huang, Agentvln: Towards agentic vision-and-language navigation, arXiv preprint arXiv:2603.17670 (2026).
[18] D. Driess, F. Xia, M. S. M. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, et al., Palm-e: An embodied multimodal language model, in: International Conference on Machine Learning, 2023, pp. 8469–8488.
[19] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, A. Dubey, C. Finn, et al., Rt-2: Vision-language-action models transfer web knowledge to robotic control, arXiv preprint arXiv:2307.15818 (2023).
[20] M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, et al., Openvla: An open-source vision-language-action model, arXiv preprint arXiv:2406.09246 (2024).
[21] D. Jing, J. Nie, T. Zhang, J. Liu, H. Yao, Z. Lu, M. Ding, Tempovla: Learning speed-controllable vision-language-action policies (2026). arXiv:2606.06491. URL https://arxiv.org/abs/2606.06491
[22] S. Liu, H. Zhang, Y. Qi, P. Wang, Y. Zhang, Q. Wu, Aerialvln: Vision-and-language navigation for uavs, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 15384–15394.
[23] Y. Gao, C. Li, Z. You, J. Liu, Z. Li, P. Chen, Q. Chen, Z. Tang, L. Wang, P. Yang, et al., Openfly: A comprehensive platform for aerial vision-language navigation, arXiv preprint arXiv:2502.18041 (2025).
[24] H. Cai, J. Dong, J. Tan, J. Deng, S. Li, Z. Gao, H. Wang, Z. Su, A. Sumalee, R. Zhong, Flightgpt: Towards generalizable and interpretable uav vision-and-language navigation with vlms, arXiv preprint arXiv:2505.12835 (2025).
[25] A. Babaei, A. Karimi, Optimal trajectory-planning of uavs via b-splines and disjunctive programming, arXiv preprint arXiv:1807.02931 (2018).
[26] X. Zhou, Z. Wang, H. Ye, C. Xu, F. Gao, Ego-planner: An esdf-free gradient-based local planner for quadrotors, arXiv preprint arXiv:2008.08835 (2020).
[27] D. Burke, A. Chapman, I. Shames, Fast spline trajectory planning: Minimum snap and beyond, arXiv preprint arXiv:2105.01788 (2021).
[28] H. Liang, S. Yuan, F. Liu, Y. Yang, B. Wang, Z. Huang, C. Shi, J. Jin, Label-free long-horizon 3d uav trajectory prediction via motion-aligned rgb and event cues (2025). arXiv:2507.03365. URL https://arxiv.org/abs/2507.03365
[29] J. Qiu, Q. Liu, J. Qin, D. Cheng, Y. Tian, Q. Ma, Pe-planner: A performance-enhanced quadrotor motion planner for autonomous flight in complex and dynamic environments, arXiv preprint arXiv:2403.12865 (2024).
[30] C. Chi, S. Feng, S. Du, Z. Xu, E. Cousineau, B. Burchfiel, S. Song, Diffusion policy: Visuomotor policy learning via action diffusion, arXiv preprint arXiv:2303.04137 (2023).
[31] K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al., π0: A vision-language-action flow model for general robot control, arXiv preprint arXiv:2410.24164 (2024).
[32] K. Nguyen, A. T. Le, T. Pham, M. Huber, J. Peters, M. N. Vu, Flowmp: Learning motion fields for robot planning with conditional flow matching, arXiv preprint arXiv:2503.06135 (2025).
[33] S. Shah, D. Dey, C. Lovett, A. Kapoor, Airsim: High-fidelity visual and physical simulation for autonomous vehicles, in: Field and Service Robotics, 2017.
[34] P. Lin, G. Sun, C. Liu, F. Li, W. Ren, Y. Cong, Openvln: Open-world aerial vision-language navigation, arXiv preprint arXiv:2511.06182 (2025).
[35] J. Zhang, A. Li, Y. Qi, M. Li, J. Liu, S. Wang, H. Liu, G. Zhou, Y. Wu, X. Li, et al., Embodied navigation foundation model, arXiv preprint arXiv:2509.12129 (2025).
[36] W. Jiang, K. Huang, L. Wang, W. Xu, W. Fan, J. Liu, S. Liu, H. Liang, H. Duan, B. Xu, X. Ji, Spatialfly: Geometry-guided representation alignment for uav vision-and-language navigation in urban environments (2026). arXiv:2603.21046.
