CosFly-Track: A Large-Scale Multi-Modal Dataset for UAV Visual Tracking via Multi-Constraint Trajectory Optimization

Hanxuan Chen; Hanzhong Guo; Jie Zheng; Ji Pei; Kangli Wang; Ruilong Ren; Shuai Yuan; Songsheng Cheng; Tianle Zeng; Xiangyue Wang

arxiv: 2605.17776 · v2 · pith:H225SINTnew · submitted 2026-05-18 · 💻 cs.RO

Xiangyue Wang, Hanxuan Chen, Songsheng Cheng, Ruilong Ren, Jie Zheng, Shuai Yuan, Tianle Zeng, Hanzhong Guo, Kangli Wang, Ji Pei

Pith reviewed 2026-05-21 08:56 UTC · model grok-4.3

classification 💻 cs.RO

keywords UAV visual tracking · multi-modal dataset · trajectory optimization · vision-language models · urban environments · drone navigation · target following

The pith

CosFlyTrack supplies roughly 12,000 expert and perturbed UAV trajectories with aligned multi-modal data; after fine-tuning on it, seven vision-language models track moving targets at success rates of 78.3 to 95.6 percent at one meter.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CosFlyTrack, a large-scale dataset built from 6,000 pedestrian paths to produce roughly 12,000 UAV trajectories covering 2.4 million timesteps with seven synchronized data streams: RGB images, metric depth, semantic segmentation, drone pose, target state with visibility, bilingual instructions, and metadata. It introduces MuCO, an optimizer that generates expert trajectories by planning directly in continuous three-dimensional space while enforcing visibility, collision avoidance, smoothness, and kinematic limits through accelerated spatial queries. Fine-tuning experiments across seven vision-language models demonstrate large gains over zero-shot performance, reaching 78.3 to 95.6 percent success rate at one meter. A sympathetic reader would care because existing aerial vision-language datasets focus on reaching static goals, whereas practical UAV operations often require sustained pursuit of moving targets in cluttered urban settings. If the claim holds, the work supplies a practical route to create scalable training data for dynamic following agents without relying on manual labeling.

Core claim

The paper establishes CosFlyTrack as a dataset of approximately 12,000 expert and perturbed UAV trajectories derived from 6,000 pedestrian paths, yielding 2.4 million timesteps with seven aligned channels that include RGB, metric depth, semantic segmentation, six-degree-of-freedom drone pose, target state with visibility flag, bilingual Chinese-English instructions, and trajectory-pair metadata. These trajectories are produced by MuCO, a multi-constraint optimizer that plans in continuous three-dimensional space with BVH-accelerated collision and visibility queries to jointly satisfy target visibility, viewpoint quality, collision avoidance, smoothness, and kinematic feasibility. Fine-tuning seven vision-language models on this data raises success rate at one meter to 78.3 to 95.6 percent, a gain of 53 to 69 percentage points over zero-shot baselines.
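As a rough illustration of what one aligned timestep could look like in memory, here is a minimal Python sketch. Every field name, type, and file-path convention below is hypothetical: the paper lists the seven channels but this page does not describe the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TrackingTimestep:
    """One aligned sample. Field names are illustrative, not the dataset's schema."""
    rgb_path: str                    # channel 1: RGB frame
    depth_path: str                  # channel 2: metric depth map
    segmentation_path: str           # channel 3: semantic segmentation mask
    drone_pose: Tuple[float, float, float, float, float, float]  # channel 4: x, y, z, roll, pitch, yaw
    target_position: Tuple[float, float, float]                  # channel 5: target state ...
    target_visible: bool                                         # ... with its visibility flag
    instruction_zh: str              # channel 6: bilingual instruction (Chinese)
    instruction_en: str              # channel 6: bilingual instruction (English)
    trajectory_id: str               # channel 7: trajectory-pair metadata
    is_expert: bool                  # channel 7: expert vs. perturbed member of the pair


def visible_fraction(steps: List[TrackingTimestep]) -> float:
    """Share of timesteps in which the target is flagged visible."""
    if not steps:
        return 0.0
    return sum(1 for s in steps if s.target_visible) / len(steps)
```

Keeping the visibility flag next to the target state makes per-trajectory statistics such as `visible_fraction` a one-line reduction over the records.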

What carries the argument

MuCO, the multi-constraint optimizer that plans UAV trajectories directly in continuous three-dimensional space while jointly enforcing target visibility, collision avoidance, smoothness, and kinematic feasibility through BVH-accelerated spatial queries.
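To make the idea of a joint multi-constraint objective concrete, the following toy Python sketch scores a candidate waypoint sequence with three penalty terms: occluded line of sight, a smoothness proxy, and obstacle clearance. It is not MuCO. The real objective is described as having nine terms and BVH-accelerated mesh queries; here the sphere obstacles, the weights, the clearance value, and all function names are assumptions chosen for illustration.

```python
import math
from typing import Sequence, Tuple

Vec = Tuple[float, float, float]
Sphere = Tuple[Vec, float]  # (center, radius): toy obstacle standing in for BVH mesh queries


def _sub(a: Vec, b: Vec) -> Vec:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a: Vec, b: Vec) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def segment_hits_sphere(p: Vec, q: Vec, sphere: Sphere) -> bool:
    """True if the segment p->q comes within the sphere's radius of its center."""
    center, radius = sphere
    d = _sub(q, p)
    length_sq = _dot(d, d)
    # Parameter of the closest point on the segment to the center, clamped to [0, 1].
    t = 0.0 if length_sq == 0.0 else max(0.0, min(1.0, _dot(_sub(center, p), d) / length_sq))
    closest = (p[0] + t * d[0], p[1] + t * d[1], p[2] + t * d[2])
    gap = _sub(center, closest)
    return _dot(gap, gap) <= radius * radius


def trajectory_cost(
    waypoints: Sequence[Vec],
    targets: Sequence[Vec],
    obstacles: Sequence[Sphere],
    w_vis: float = 10.0,      # hypothetical weights
    w_smooth: float = 1.0,
    w_safe: float = 5.0,
    clearance: float = 1.0,   # hypothetical safety margin in meters
) -> float:
    """Weighted sum of three toy penalties over a waypoint sequence."""
    # Visibility: count timesteps whose drone-to-target line of sight is blocked.
    occluded = sum(
        1 for w, t in zip(waypoints, targets)
        if any(segment_hits_sphere(w, t, s) for s in obstacles)
    )
    # Smoothness: squared second differences of consecutive waypoints.
    smooth = 0.0
    for a, b, c in zip(waypoints, waypoints[1:], waypoints[2:]):
        acc = (a[0] - 2 * b[0] + c[0], a[1] - 2 * b[1] + c[1], a[2] - 2 * b[2] + c[2])
        smooth += _dot(acc, acc)
    # Safety: quadratic hinge when a waypoint is closer than `clearance` to an obstacle surface.
    safe = 0.0
    for w in waypoints:
        for center, radius in obstacles:
            dist = math.sqrt(_dot(_sub(w, center), _sub(w, center))) - radius
            safe += max(0.0, clearance - dist) ** 2
    return w_vis * occluded + w_smooth * smooth + w_safe * safe
```

An optimizer would adjust the waypoints to lower this cost; the sketch only evaluates it, which is enough to see how a single blocked line of sight or a sharp turn shows up in the total.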

If this is right

Fine-tuning on the dataset raises tracking success rate at one meter from zero-shot baselines to the range of 78.3 to 95.6 percent across seven vision-language models.
The pipeline produces aligned multi-modal data at scale, including depth, segmentation, and bilingual instructions, without discretization artifacts.
Perturbed trajectories complement expert paths to increase data variety for training dynamic target-following agents.
The continuous-space optimizer avoids grid-based artifacts and post-hoc smoothing common in earlier planners.
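The headline metric in the first point above is a success rate at a one-meter threshold. This page does not give the paper's exact definition of SR@1 m, so the sketch below assumes one plausible reading: the fraction of predicted positions that lie within one meter of the corresponding reference position. The function name and the sample coordinates are made up for illustration.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def success_rate_at(
    predicted: Sequence[Point],
    reference: Sequence[Point],
    threshold_m: float = 1.0,
) -> float:
    """Fraction of predictions within `threshold_m` of the reference (assumed definition)."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(1 for p, r in zip(predicted, reference) if math.dist(p, r) <= threshold_m)
    return hits / len(predicted)


# Made-up positions: errors are 0.0, 0.5, 3.0 and 0.5 meters.
predicted = [(0.0, 0.0, 5.0), (1.5, 0.0, 5.0), (2.0, 3.0, 5.0), (3.0, 0.5, 5.0)]
reference = [(0.0, 0.0, 5.0), (1.0, 0.0, 5.0), (2.0, 0.0, 5.0), (3.0, 0.0, 5.0)]
print(success_rate_at(predicted, reference))  # prints 0.75
```

Under this reading, a gain quoted in percentage points is the plain difference between two such rates expressed in percent, not a relative improvement.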

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same generation approach could be extended to produce training data for tracking vehicles or other non-pedestrian moving objects in similar environments.
Models trained this way may require additional domain adaptation when transferred to real UAV hardware with sensor noise and wind disturbances.
The dataset structure suggests a template for creating comparable resources for other continuous-control robotics tasks that combine vision and language.

Load-bearing premise

The generated trajectories and simulated sensor data accurately capture the visibility, collision, and kinematic constraints that matter for real UAV hardware operating in physical urban environments.

What would settle it

Collecting a real-world UAV tracking dataset in comparable urban scenes, running the fine-tuned models on it, and checking whether the reported 53-to-69-point gains over zero-shot baselines persist or shrink substantially.

Figures

Figures reproduced from arXiv: 2605.17776 by Hanxuan Chen, Hanzhong Guo, Jie Zheng, Ji Pei, Kangli Wang, Ruilong Ren, Shuai Yuan, Songsheng Cheng, Tianle Zeng, Xiangyue Wang.

**Figure 1.** CosFly-Track pipeline. From urban scenes to dataset: 3D grid construction → pedestrian path generation → MuCO trajectory optimization (9-term objective, soft/hard constraints) → paired expert/perturbed rendering with 7 aligned data channels. Details in Section 4. Navigation aims to reach a fixed goal; tracking requires continuously adapting to a moving target under visibility, viewpoint, collision, and kin…

**Figure 2.** MuCO vs. A*. Left: A* expands 1.4M voxels on a 315 m path (10 s, visibility 0.79); MuCO produces a smooth trajectory in 311 ms (32× faster, visibility 0.64). Right: four additional scenarios showing 20–32× speedup with comparable tracking quality. Dashed lines: pedestrian paths.

Original abstract

Recent aerial vision-language navigation (VLN) datasets have grown rapidly, but they primarily address goal-oriented navigation to static destinations, leaving UAV visual tracking -- continuously following a moving target while maintaining visibility -- largely without dedicated training data. We introduce CosFlyTrack, a large-scale multi-modal dataset and scalable generation pipeline for UAV visual tracking in urban environments. The dataset provides approximately 12,000 expert and perturbed UAV trajectories generated from 6,000 pedestrian paths, comprising 2.4 million timesteps (approximately 334 hours) with seven aligned data channels: RGB, metric depth, semantic segmentation, six-degree-of-freedom drone pose, target state with visibility flag, bilingual (Chinese-English) instructions, and trajectory-pair metadata. To generate high-quality expert trajectories, we develop MuCO, a multi-constraint optimizer that plans directly in continuous three-dimensional space with BVH-accelerated collision and visibility queries, jointly enforcing target visibility, viewpoint quality, collision avoidance, smoothness, and kinematic feasibility, avoiding the discretization artifacts and post-hoc smoothing of grid-based planners. Fine-tuning experiments on seven vision-language models show that CosFlyTrack improves tracking performance to 78.3 to 95.6 percent SR@1 meter, a 53 to 69 percentage point gain over zero-shot baselines, supporting the dataset as a training resource for dynamic target-following agents. The dataset is publicly available at https://huggingface.co/datasets/AutelRobotics/CosFly; evaluation scripts and pre-trained checkpoints are hosted at https://huggingface.co/AutelRobotics/CosFly-Track.
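The scale figures in the abstract can be cross-checked with a little arithmetic. The sketch below uses only the rounded numbers quoted there, so every derived value is approximate. The implied interval of about half a second per timestep is an inference from those numbers, not something the abstract states.

```python
# Rounded figures as quoted in the abstract.
timesteps = 2_400_000
hours = 334
trajectories = 12_000
pedestrian_paths = 6_000

total_seconds = hours * 3600                              # 1,202,400 s
seconds_per_timestep = total_seconds / timesteps          # about 0.5 s, i.e. roughly 2 Hz
timesteps_per_trajectory = timesteps / trajectories       # 200
seconds_per_trajectory = total_seconds / trajectories     # about 100 s
trajectories_per_path = trajectories / pedestrian_paths   # 2, matching expert plus perturbed

print(f"{seconds_per_timestep:.3f} s per timestep")
print(f"{timesteps_per_trajectory:.0f} timesteps per trajectory")
print(f"{seconds_per_trajectory:.1f} s per trajectory")
print(f"{trajectories_per_path:.0f} trajectories per pedestrian path")
```

The ratio of two trajectories per pedestrian path is consistent with the paired expert/perturbed rendering described in the Figure 1 caption.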

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper mainly adds a large multi-modal dataset for UAV tracking of moving targets plus a continuous optimizer to generate the trajectories, with big reported gains after fine-tuning. The scale stands out: roughly 12,000 trajectories drawn from 6,000 pedestrian paths, 2.4 million timesteps, and seven aligned channels including RGB, metric depth, segmentation, drone pose, target visibility, and bilingual instructions. MuCO plans directly in continuous 3D space and uses BVH queries to enforce visibility, collision avoidance, smoothness, and kinematics together. That choice looks like a practical improvement over grid-based methods that need extra smoothing. The fine-tuning results on seven vision-language models show success-rate jumps of 53 to 69 points at the 1-meter threshold, which suggests the data can serve as useful training material for dynamic following policies.

The main limitation is the simulation-only setup. All trajectories and sensor data come from a virtual urban scene with perfect depth and segmentation; the paper gives no real-flight logs, noise models, or hardware tests to measure how well these signals match actual UAV conditions like wind jitter or motion blur. Without that evidence the transfer story stays untested. The abstract also leaves out details on statistical significance or how the perturbed trajectories were sampled.

This work is aimed at researchers in aerial robotics and vision-language navigation who need data for continuous target following rather than static goals. Readers building trackers or studying sim-to-real gaps would find the released dataset and checkpoints directly usable.

The contribution is concrete enough to deserve peer review. I would send it out, but I would ask referees to focus on whether the generated data is realistic enough for hardware transfer.

Referee Report

2 major / 2 minor

Summary. The paper introduces CosFlyTrack, a large-scale multi-modal dataset for UAV visual tracking in urban environments. It consists of approximately 12,000 expert and perturbed UAV trajectories derived from 6,000 pedestrian paths, yielding 2.4 million timesteps across seven aligned channels (RGB, metric depth, semantic segmentation, 6-DoF drone pose, target state with visibility flag, bilingual instructions, and trajectory metadata). Trajectories are generated via MuCO, a multi-constraint optimizer that plans in continuous 3D space using BVH-accelerated queries to jointly enforce target visibility, viewpoint quality, collision avoidance, smoothness, and kinematic feasibility. Fine-tuning experiments on seven vision-language models report tracking success rates of 78.3–95.6% SR@1 m, representing 53–69 percentage point gains over zero-shot baselines.

Significance. If the simulated trajectories and observations prove transferable, the dataset would fill a clear gap in aerial VLN resources by targeting dynamic target following rather than static goal navigation, and the reported fine-tuning gains indicate that the generated data can improve model performance on the tracking task. The scale, multi-modality, and continuous-space optimization approach are strengths that could support reproducible progress in UAV tracking agents.

major comments (2)

[Abstract] The headline claim of 53–69 percentage point gains in SR@1 m after fine-tuning rests on trajectories and sensor data produced entirely in simulation with perfect depth and segmentation. No real-world flight logs, domain-randomization ablations, or hardware-in-the-loop tests are described to quantify the sim-to-real gap for camera intrinsics, motion blur, wind-induced pose jitter, or pedestrian dynamics; this directly affects whether the reported improvements transfer to physical UAV hardware.
[Fine-tuning experiments] As summarized in the abstract, the manuscript reports numerical gains on seven models but supplies no details on the number of runs, variance, statistical significance tests, or controls for trajectory realism and sensor noise modeling. Without these, the reliability and reproducibility of the 78.3–95.6% SR@1 m figures cannot be evaluated.

minor comments (2)

[Abstract] The abstract states 'approximately 12,000' trajectories and 'approximately 334 hours'; providing exact counts or confidence intervals would improve precision.
[Abstract] The public links to the dataset and checkpoints are given; ensure they remain stable and include a clear license statement in the final version.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments, which highlight important aspects of experimental rigor and the sim-to-real gap. We address each point below and will revise the manuscript to improve clarity and reproducibility while maintaining the focus on the simulated dataset's utility for training UAV tracking agents.

Referee: [Abstract] The headline claim of 53–69 percentage point gains in SR@1 m after fine-tuning rests on trajectories and sensor data produced entirely in simulation with perfect depth and segmentation. No real-world flight logs, domain-randomization ablations, or hardware-in-the-loop tests are described to quantify the sim-to-real gap for camera intrinsics, motion blur, wind-induced pose jitter, or pedestrian dynamics; this directly affects whether the reported improvements transfer to physical UAV hardware.

Authors: We agree that the reported gains are obtained in simulation with idealized sensor data. CosFlyTrack is explicitly positioned as a large-scale synthetic resource to enable training where real-world collection of perfectly aligned multi-modal trajectories at this scale is impractical. In the revised manuscript we will add a dedicated Limitations subsection that explicitly discusses the sim-to-real gap, including the lack of hardware-in-the-loop validation and the potential effects of motion blur, wind-induced jitter, and varying pedestrian dynamics. We will also outline planned future work on domain randomization and real-world fine-tuning. The current results still demonstrate that the generated trajectories and annotations are effective for improving vision-language tracking models in controlled settings, providing a reproducible baseline for subsequent transfer studies.
revision: yes

Referee: [Fine-tuning experiments] As summarized in the abstract, the manuscript reports numerical gains on seven models but supplies no details on the number of runs, variance, statistical significance tests, or controls for trajectory realism and sensor noise modeling. Without these, the reliability and reproducibility of the 78.3–95.6% SR@1 m figures cannot be evaluated.

Authors: We thank the referee for noting this omission. The revised Experiments section will report: (i) three independent fine-tuning runs per model using different random seeds, (ii) mean success rates with standard deviations, (iii) statistical significance via paired Wilcoxon signed-rank tests between fine-tuned and zero-shot conditions, and (iv) explicit sensor-noise modeling consisting of additive Gaussian noise on RGB and depth channels plus random trajectory perturbations to simulate realism. These additions will allow readers to assess both reliability and reproducibility of the reported 78.3–95.6% SR@1 m figures.
revision: yes
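The reporting promised in items (i) to (iii) can be sketched with the Python standard library alone. The code below computes a mean and standard deviation over seeds and an exact two-sided paired Wilcoxon signed-rank p-value by enumerating sign assignments, which is practical for a handful of models. All numbers are invented placeholders, not results from the paper, and `wilcoxon_exact` is an illustrative helper rather than a library function.

```python
from itertools import product
from statistics import mean, stdev
from typing import List, Sequence


def wilcoxon_exact(x: Sequence[float], y: Sequence[float]) -> float:
    """Exact two-sided p-value of the paired Wilcoxon signed-rank test (small n only)."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences are dropped
    n = len(diffs)
    if n == 0:
        return 1.0
    # Rank absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks: List[float] = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    observed = min(w_plus, total - w_plus)
    # Under the null hypothesis every sign pattern is equally likely: enumerate all 2^n.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= observed:
            extreme += 1
    return extreme / 2 ** n


# Placeholder success rates for one model over three seeds (invented values).
seed_runs = [0.80, 0.82, 0.84]
print(f"{mean(seed_runs):.3f} +/- {stdev(seed_runs):.3f}")

# Placeholder per-model means for seven models (invented values).
fine_tuned = [0.80, 0.83, 0.86, 0.88, 0.90, 0.93, 0.95]
zero_shot = [0.25, 0.27, 0.24, 0.26, 0.30, 0.28, 0.29]
print(wilcoxon_exact(fine_tuned, zero_shot))
```

One consequence worth keeping in mind: with seven paired models, the smallest two-sided exact p-value this test can return is 2/2^7 = 0.015625, reached when every model improves.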

standing simulated objections not resolved

We currently lack real-world flight logs or hardware-in-the-loop experiments, so we cannot directly quantify the sim-to-real gap for the listed factors.

Circularity Check

0 steps flagged

No circularity: empirical dataset generation and fine-tuning evaluation are self-contained

full rationale

The paper presents a generation pipeline that uses the MuCO optimizer to produce trajectories enforcing visibility, collision, smoothness and kinematic constraints in simulation, followed by empirical fine-tuning of seven vision-language models that reports measured SR@1 meter gains over zero-shot baselines. No derivation, prediction or first-principles claim reduces to its own inputs by construction; the central performance numbers are obtained from external model training runs on the generated data rather than from any fitted parameter or self-citation chain. The work contains no load-bearing self-citations, uniqueness theorems imported from prior author work, or ansatzes smuggled via citation, making the derivation chain independent of the reported results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the fidelity of the simulation pipeline and the assumption that optimized trajectories constitute expert demonstrations transferable to real UAVs; no explicit free parameters or invented entities are named in the abstract.

axioms (1)

domain assumption: Multi-constraint optimization in continuous 3D space with BVH queries produces higher-quality expert trajectories than grid-based planners.
Invoked to justify MuCO over discretization methods.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Paper passage: MuCO ... plans directly in continuous three-dimensional space with BVH-accelerated collision and visibility queries, jointly enforcing target visibility, viewpoint quality, collision avoidance, smoothness, and kinematic feasibility

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: C(W) = sum 9 cost terms ... visibility, jerk, safety, pitch, altitude

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

OpenFly: A comprehensive platform for aerial vision-language navigation

Yunpeng Gao, Chenhui Li, Zhongrui You, Junli Liu, Zhen Li, Pengan Chen, Qizhi Chen, Zhonghan Tang, Liansheng Wang, Penghui Yang, Yiwen Tang, Yuhang Tang, Shuai Liang, Songyi Zhu, Ziqin Xiong, Yifei Su, Xinyi Ye, Jianan Li, Yan Ding, Dong Wang, Xuelong Li, Zhigang Wang, and Bin Zhao. OpenFly: A comprehensive platform for aerial vision-language navigation. ...

work page arXiv 2026

[2] [2]

AirNav: A large-scale real-world UA V vision-and-language navigation dataset with natural and diverse instructions, 2026

Hengxing Cai, Yijie Rao, Ligang Huang, Zanyang Zhong, Jinhan Dong, Jingjun Tan, Wenhao Lu, and Renxin Zhong. AirNav: A large-scale real-world UA V vision-and-language navigation dataset with natural and diverse instructions, 2026

work page 2026

[3] Jungdae Lee, Taiki Miyanishi, Shuhei Kurita, Koya Sakamoto, Daichi Azuma, Yutaka Matsuo, and Nakamasa Inoue. CityNav: A large-scale dataset for real-world aerial navigation. In IEEE/CVF International Conference on Computer Vision (ICCV), 2025. arXiv:2406.14240

[4] Shubo Liu, Hongsheng Zhang, Yuankai Qi, Peng Wang, Yanning Zhang, and Qi Wu. AerialVLN: Vision-and-language navigation for UAVs. In IEEE/CVF International Conference on Computer Vision (ICCV),

[5] Xu Liu, Yu Liu, Hanshuo Qiu, Qirong Yang, and Zhouhui Lian. IndoorUAV: Benchmarking vision-language UAV navigation in continuous indoor environments. In AAAI Conference on Artificial Intelligence (AAAI), 2026. arXiv:2512.19024

[6] Pengfei Zhu, Longyin Wen, Xiao Bian, Haibin Ling, and Qinghua Hu. Vision meets drones: A challenge, 2018.

[7] Matthias Mueller, Neil Smith, and Bernard Ghanem. A benchmark and simulator for UAV tracking. In European Conference on Computer Vision (ECCV), 2016.

[8] Dawei Du, Yuankai Qi, Hongyang Yu, Yifan Yang, Kaiwen Duan, Guorong Li, Weigang Zhang, Qingming Huang, and Qi Tian. The unmanned aerial vehicle benchmark: Object detection and tracking. In European Conference on Computer Vision (ECCV), 2018. arXiv:1804.00518

[9] Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton van den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018. arXiv:1711.07280

[10] Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Batra, and Stefan Lee. Beyond the nav-graph: Vision-and-language navigation in continuous environments. In European Conference on Computer Vision (ECCV), 2020. arXiv:2004.02857

[11] Shital Shah, Debadeepta Dey, Chris Lovett, and Ashish Kapoor. AirSim: High-fidelity visual and physical simulation for autonomous vehicles. In Field and Service Robotics (FSR), 2018.

[12] Peter E. Hart, Nils J. Nilsson, and Bertram Raphael. A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics, 4(2):100–107, 1968.

[13] Sven Koenig and Maxim Likhachev. D* lite. In AAAI Conference on Artificial Intelligence (AAAI), 2002.

[14] Steven M. LaValle. Rapidly-exploring random trees: A new tool for path planning. Technical Report TR 98-11, Iowa State University, Computer Science Department, 1998.

[15] Nathan Ratliff, Matt Zucker, J. Andrew Bagnell, and Siddhartha Srinivasa. CHOMP: Gradient optimization techniques for efficient motion planning. In IEEE International Conference on Robotics and Automation (ICRA), 2009.

[16] John Schulman, Yan Duan, Jonathan Ho, Alex Lee, Ibrahim Awwal, Henry Bradlow, Jia Pan, Sachin Patil, Ken Goldberg, and Pieter Abbeel. Motion planning with sequential convex optimization and convex collision checking. International Journal of Robotics Research (IJRR), 33(9):1251–1270, 2014.

[17] Mustafa Mukadam, Xinyan Yan, and Byron Boots. Gaussian process motion planning. In IEEE International Conference on Robotics and Automation (ICRA), 2016.

[18] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In Advances in Neural Information Processing Systems (NeurIPS), 2023. arXiv:2304.08485

[19] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023.

[20] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. arXiv:2312.14238

[21] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control, 2023.

[22] Rui Yang, Hanyang Chen, Junyu Zhang, Mark Zhao, Cheng Qian, Kangrui Wang, Qineng Wang, Teja Venkat Koripella, Marziyeh Movahedi, Manling Li, Heng Ji, Huan Zhang, and Tong Zhang. EmbodiedBench: Comprehensive benchmarking multi-modal large language models for vision-driven embodied agents. In International Conference on Machine Learning (ICML), 2025. arXiv...

[23] Yue Fan, Winson Chen, Tongzhou Jiang, Chun Zhou, Yi Zhang, and Xin Eric Wang. Aerial vision-and-dialog navigation. In Findings of the Association for Computational Linguistics (ACL), 2023. arXiv:2205.12219

[24] Xiangyu Wang, Donglin Yang, Ziqin Wang, Hohin Kwan, Jinyu Chen, Wenjun Wu, Hongsheng Li, Yue Liao, and Si Liu. Towards realistic UAV vision-language navigation: Platform, benchmark, and methodology. In International Conference on Learning Representations (ICLR), 2025. arXiv:2410.07087

[25] Hanxuan Chen, Xiangyue Wang, Songsheng Cheng, Ruilong Ren, Jie Zheng, Shuai Yuan, Tianle Zeng, Hanzhong Guo, Binbo Li, Kangli Wang, and Ji Pei. CosFly: Plan in the matrix, fly in the world. arXiv preprint arXiv:2605.19120, 2026.

[26] Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5

[27] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. CARLA: An open urban driving simulator. In Conference on Robot Learning (CoRL), 2017.

[28] Stéphane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.

[29] Mubashara Akhtar, Omar Benjelloun, Costanza Conforti, Pieter Gijsbers, Joan Gonzalez, Michael Kuchnik, Quentin Lhoest, Pierre Marcenac, Manil Maskey, Peter Mattson, et al. Croissant: A metadata format for ML-ready datasets. In Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track, 2024. arXiv:2403.19546

[30] Qwen Team. Qwen3 technical report, 2025.

[31] Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, Shuaiqi Duan, Weihan Wang, Yan Wang, Yean Cheng, Zehai He, Zhe Su, Zhen Yang, Ziyang Pan, Aohan Zeng, et al. GLM-4.5V and GLM-4.1V-Thinking: Towards versatile multimodal reasoning with scalable reinforcement learning, 2025.

[32] Google DeepMind. Gemma 4, April 2026. URL https://blog.google/technology/developers/gemma-4/

[33] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR), 2022.

[34] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. ZeRO: Memory optimizations toward training trillion parameter models. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2020.

[35] Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything V2, 2024.

[36] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. SAM 2: Segment anything in images and videos, 2024.
