CosFly-Track: A Large-Scale Multi-Modal Dataset for UAV Visual Tracking via Multi-Constraint Trajectory Optimization
Pith reviewed 2026-05-21 08:56 UTC · model grok-4.3
The pith
CosFlyTrack supplies 12,000 optimized UAV trajectories and aligned multi-modal data that let vision-language models track moving targets at success rates of 78 to 96 percent after fine-tuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes CosFlyTrack as a dataset of approximately 12,000 expert and perturbed UAV trajectories derived from 6,000 pedestrian paths, yielding 2.4 million timesteps with seven aligned channels: RGB, metric depth, semantic segmentation, six-degree-of-freedom drone pose, target state with visibility flag, bilingual Chinese-English instructions, and trajectory-pair metadata. These trajectories are produced by MuCO, a multi-constraint optimizer that plans in continuous three-dimensional space with BVH-accelerated collision and visibility queries to jointly satisfy target visibility, viewpoint quality, collision avoidance, smoothness, and kinematic feasibility. Fine-tuning seven vision-language models on this data raises tracking success to 78.3 to 95.6 percent SR@1 meter.
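To make the seven-channel layout concrete, here is a minimal sketch of what one aligned timestep might look like as a record. Every field name and type below is hypothetical; the page lists the channels but not the on-disk schema, and the idea that the pair metadata links an expert path to its perturbed counterpart is an assumption.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TrackingTimestep:
    """One aligned timestep. All field names are hypothetical."""
    rgb_path: str                 # channel 1: RGB frame
    depth_path: str               # channel 2: metric depth map
    segmentation_path: str        # channel 3: semantic segmentation mask
    # channel 4: 6-DoF drone pose as (x, y, z, roll, pitch, yaw)
    drone_pose: Tuple[float, float, float, float, float, float]
    # channel 5: target state with visibility flag
    target_position: Tuple[float, float, float]
    target_visible: bool
    # channel 6: bilingual instructions
    instruction_zh: str
    instruction_en: str
    # channel 7: trajectory-pair metadata (assumed to tie expert and perturbed paths)
    trajectory_pair_id: str
    is_expert: bool


def visible_fraction(steps: List[TrackingTimestep]) -> float:
    """Share of timesteps in which the target is flagged visible."""
    if not steps:
        raise ValueError("empty trajectory")
    return sum(step.target_visible for step in steps) / len(steps)
```

A per-trajectory statistic such as `visible_fraction` is the kind of quantity the visibility flag would make cheap to compute.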
What carries the argument
MuCO, the multi-constraint optimizer that plans UAV trajectories directly in continuous three-dimensional space while jointly enforcing target visibility, collision avoidance, smoothness, and kinematic feasibility through BVH-accelerated spatial queries.
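As a rough illustration of what "jointly enforcing" several constraints can mean, the sketch below scores a candidate UAV path with a weighted sum of three penalties: line-of-sight occlusion, deviation from a standoff distance, and a second-difference smoothness term. It is not MuCO's cost function. The weights, the `standoff` parameter, and all function names are assumptions, and a linear scan over axis-aligned boxes stands in for the BVH traversal the paper describes.

```python
import math


def segment_hits_box(p, q, box_min, box_max):
    """Slab test: does the segment p -> q intersect an axis-aligned box?"""
    t_enter, t_exit = 0.0, 1.0
    for axis in range(3):
        delta = q[axis] - p[axis]
        if abs(delta) < 1e-12:
            # Segment is parallel to this slab; reject if it lies outside it.
            if p[axis] < box_min[axis] or p[axis] > box_max[axis]:
                return False
            continue
        t0 = (box_min[axis] - p[axis]) / delta
        t1 = (box_max[axis] - p[axis]) / delta
        if t0 > t1:
            t0, t1 = t1, t0
        t_enter = max(t_enter, t0)
        t_exit = min(t_exit, t1)
        if t_enter > t_exit:
            return False
    return True


def trajectory_cost(drone, target, boxes,
                    w_vis=10.0, w_dist=0.5, w_smooth=1.0, standoff=5.0):
    """Weighted sum of illustrative penalties (hypothetical weights)."""
    cost = 0.0
    for d, t in zip(drone, target):
        # Visibility: penalize steps where any box blocks the line of sight.
        if any(segment_hits_box(d, t, lo, hi) for lo, hi in boxes):
            cost += w_vis
        # Viewpoint: penalize deviation from the preferred standoff distance.
        cost += w_dist * (math.dist(d, t) - standoff) ** 2
    # Smoothness: squared second differences approximate acceleration.
    for i in range(1, len(drone) - 1):
        acc = [drone[i - 1][a] - 2 * drone[i][a] + drone[i + 1][a] for a in range(3)]
        cost += w_smooth * sum(c * c for c in acc)
    return cost
```

An optimizer would adjust the waypoints in `drone` to lower this value. A BVH would replace the `any(...)` scan so that each visibility query touches only the boxes near the ray.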
If this is right
- Fine-tuning on the dataset raises tracking success rate at one meter from zero-shot baselines to the range of 78.3 to 95.6 percent across seven vision-language models.
- The pipeline produces aligned multi-modal data at scale, including depth, segmentation, and bilingual instructions, without discretization artifacts.
- Perturbed trajectories complement expert paths to increase data variety for training dynamic target-following agents.
- The continuous-space optimizer avoids grid-based artifacts and post-hoc smoothing common in earlier planners.
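The page does not define SR@1 meter. One plausible reading, sketched below, is the fraction of timesteps at which the agent's position lies within a distance threshold of a reference position such as the expert trajectory. The function name, the per-step aggregation, and the choice of reference are all assumptions; the paper may instead score whole episodes.

```python
import math


def success_rate(agent_positions, reference_positions, threshold_m=1.0):
    """Fraction of steps whose position error is at most threshold_m metres."""
    if not agent_positions or len(agent_positions) != len(reference_positions):
        raise ValueError("need two non-empty position lists of equal length")
    hits = sum(
        math.dist(agent, ref) <= threshold_m
        for agent, ref in zip(agent_positions, reference_positions)
    )
    return hits / len(agent_positions)
```

Under this reading, a reported 78.3 percent would mean roughly four in five evaluated steps fall inside the one-meter ball.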
Where Pith is reading between the lines
- The same generation approach could be extended to produce training data for tracking vehicles or other non-pedestrian moving objects in similar environments.
- Models trained this way may require additional domain adaptation when transferred to real UAV hardware with sensor noise and wind disturbances.
- The dataset structure suggests a template for creating comparable resources for other continuous-control robotics tasks that combine vision and language.
Load-bearing premise
The generated trajectories and simulated sensor data accurately capture the visibility, collision, and kinematic constraints that matter for real UAV hardware operating in physical urban environments.
What would settle it
Collecting a real-world UAV tracking dataset in comparable urban scenes, running the fine-tuned models on it, and checking whether the reported 53-to-69-point gains over zero-shot baselines persist or shrink substantially.
Original abstract
Recent aerial vision-language navigation (VLN) datasets have grown rapidly, but they primarily address goal-oriented navigation to static destinations, leaving UAV visual tracking -- continuously following a moving target while maintaining visibility -- largely without dedicated training data. We introduce CosFlyTrack, a large-scale multi-modal dataset and scalable generation pipeline for UAV visual tracking in urban environments. The dataset provides approximately 12,000 expert and perturbed UAV trajectories generated from 6,000 pedestrian paths, comprising 2.4 million timesteps (approximately 334 hours) with seven aligned data channels: RGB, metric depth, semantic segmentation, six-degree-of-freedom drone pose, target state with visibility flag, bilingual (Chinese-English) instructions, and trajectory-pair metadata. To generate high-quality expert trajectories, we develop MuCO, a multi-constraint optimizer that plans directly in continuous three-dimensional space with BVH-accelerated collision and visibility queries, jointly enforcing target visibility, viewpoint quality, collision avoidance, smoothness, and kinematic feasibility, avoiding the discretization artifacts and post-hoc smoothing of grid-based planners. Fine-tuning experiments on seven vision-language models show that CosFlyTrack improves tracking performance to 78.3 to 95.6 percent SR@1 meter, a 53 to 69 percentage point gain over zero-shot baselines, supporting the dataset as a training resource for dynamic target-following agents. The dataset is publicly available at https://huggingface.co/datasets/AutelRobotics/CosFly; evaluation scripts and pre-trained checkpoints are hosted at https://huggingface.co/AutelRobotics/CosFly-Track.
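The headline scale figures can be cross-checked against each other with a few lines of arithmetic. The sketch below uses only the rounded numbers quoted in the abstract, so the results are approximate; the implied sampling rate is an inference, not something the abstract states.

```python
TIMESTEPS = 2_400_000      # "2.4 million timesteps"
HOURS = 334                # "approximately 334 hours"
TRAJECTORIES = 12_000      # "approximately 12,000" trajectories

total_seconds = HOURS * 3600                       # 1,202,400 s
rate_hz = TIMESTEPS / total_seconds                # about 2 samples per second
steps_per_trajectory = TIMESTEPS / TRAJECTORIES    # about 200 steps
seconds_per_trajectory = total_seconds / TRAJECTORIES  # about 100 s

print(f"implied sampling rate: {rate_hz:.3f} Hz")
print(f"steps per trajectory: {steps_per_trajectory:.0f}")
print(f"seconds per trajectory: {seconds_per_trajectory:.1f}")
```

The three figures are mutually consistent with trajectories of roughly 100 seconds sampled at about 2 Hz.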
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CosFlyTrack, a large-scale multi-modal dataset for UAV visual tracking in urban environments. It consists of approximately 12,000 expert and perturbed UAV trajectories derived from 6,000 pedestrian paths, yielding 2.4 million timesteps across seven aligned channels (RGB, metric depth, semantic segmentation, 6-DoF drone pose, target state with visibility flag, bilingual instructions, and trajectory metadata). Trajectories are generated via MuCO, a multi-constraint optimizer that plans in continuous 3D space using BVH-accelerated queries to jointly enforce target visibility, viewpoint quality, collision avoidance, smoothness, and kinematic feasibility. Fine-tuning experiments on seven vision-language models report tracking success rates of 78.3–95.6% SR@1 m, representing 53–69 percentage point gains over zero-shot baselines.
Significance. If the simulated trajectories and observations prove transferable, the dataset would fill a clear gap in aerial VLN resources by targeting dynamic target following rather than static goal navigation, and the reported fine-tuning gains indicate that the generated data can improve model performance on the tracking task. The scale, multi-modality, and continuous-space optimization approach are strengths that could support reproducible progress in UAV tracking agents.
major comments (2)
- [Abstract] Abstract: the headline claim of 53–69 percentage point gains in SR@1 m after fine-tuning rests on trajectories and sensor data produced entirely in simulation with perfect depth and segmentation. No real-world flight logs, domain-randomization ablations, or hardware-in-the-loop tests are described to quantify the sim-to-real gap for camera intrinsics, motion blur, wind-induced pose jitter, or pedestrian dynamics; this directly affects whether the reported improvements transfer to physical UAV hardware.
- [Fine-tuning experiments] Fine-tuning experiments (as summarized in the abstract): the manuscript reports numerical gains on seven models but supplies no details on the number of runs, variance, statistical significance tests, or controls for trajectory realism and sensor noise modeling. Without these, the reliability and reproducibility of the 78.3–95.6% SR@1 m figures cannot be evaluated.
minor comments (2)
- [Abstract] The abstract states 'approximately 12,000' trajectories and 'approximately 334 hours'; providing exact counts or confidence intervals would improve precision.
- [Abstract] The public links to the dataset and checkpoints are given; ensure they remain stable and include a clear license statement in the final version.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight important aspects of experimental rigor and the sim-to-real gap. We address each point below and will revise the manuscript to improve clarity and reproducibility while maintaining the focus on the simulated dataset's utility for training UAV tracking agents.
- Referee: [Abstract] Abstract: the headline claim of 53–69 percentage point gains in SR@1 m after fine-tuning rests on trajectories and sensor data produced entirely in simulation with perfect depth and segmentation. No real-world flight logs, domain-randomization ablations, or hardware-in-the-loop tests are described to quantify the sim-to-real gap for camera intrinsics, motion blur, wind-induced pose jitter, or pedestrian dynamics; this directly affects whether the reported improvements transfer to physical UAV hardware.
Authors: We agree that the reported gains are obtained in simulation with idealized sensor data. CosFlyTrack is explicitly positioned as a large-scale synthetic resource to enable training where real-world collection of perfectly aligned multi-modal trajectories at this scale is impractical. In the revised manuscript we will add a dedicated Limitations subsection that explicitly discusses the sim-to-real gap, including the lack of hardware-in-the-loop validation and the potential effects of motion blur, wind-induced jitter, and varying pedestrian dynamics. We will also outline planned future work on domain randomization and real-world fine-tuning. The current results still demonstrate that the generated trajectories and annotations are effective for improving vision-language tracking models in controlled settings, providing a reproducible baseline for subsequent transfer studies. revision: yes
- Referee: [Fine-tuning experiments] Fine-tuning experiments (as summarized in the abstract): the manuscript reports numerical gains on seven models but supplies no details on the number of runs, variance, statistical significance tests, or controls for trajectory realism and sensor noise modeling. Without these, the reliability and reproducibility of the 78.3–95.6% SR@1 m figures cannot be evaluated.
Authors: We thank the referee for noting this omission. The revised Experiments section will report: (i) three independent fine-tuning runs per model using different random seeds, (ii) mean success rates with standard deviations, (iii) statistical significance via paired Wilcoxon signed-rank tests between fine-tuned and zero-shot conditions, and (iv) explicit sensor-noise modeling consisting of additive Gaussian noise on RGB and depth channels plus random trajectory perturbations to simulate realism. These additions will allow readers to assess both reliability and reproducibility of the reported 78.3–95.6% SR@1 m figures. revision: yes
- We currently lack real-world flight logs or hardware-in-the-loop experiments, so we cannot directly quantify the sim-to-real gap for the listed factors.
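The reporting plan in the authors' second response (three seeds per model, mean with standard deviation, paired Wilcoxon signed-rank test) can be sketched with the standard library alone. The scores below are placeholders invented for illustration and are not results from the paper; the model names and the `wilcoxon_exact_p` helper are likewise hypothetical, and the helper assumes no zero or tied differences.

```python
from itertools import product
from statistics import mean, stdev


def wilcoxon_exact_p(diffs):
    """Two-sided exact signed-rank p-value (assumes no zeros, no tied |d|)."""
    ranked = sorted(diffs, key=abs)
    n = len(ranked)
    total = n * (n + 1) // 2
    w_plus = sum(rank for rank, d in enumerate(ranked, start=1) if d > 0)
    observed = min(w_plus, total - w_plus)
    extreme = 0
    # Enumerate all 2^n sign assignments under the null hypothesis.
    for signs in product((0, 1), repeat=n):
        w = sum(rank for rank, s in enumerate(signs, start=1) if s)
        if min(w, total - w) <= observed:
            extreme += 1
    return extreme / 2 ** n


# PLACEHOLDER scores (percent), three seeds per model. Not from the paper.
fine_tuned = {f"model_{i}": [69.0 + 3 * i, 70.0 + 3 * i, 71.0 + 3 * i]
              for i in range(1, 8)}
zero_shot = {f"model_{i}": 20.0 + i for i in range(1, 8)}

summary = {name: (mean(runs), stdev(runs)) for name, runs in fine_tuned.items()}
diffs = [summary[name][0] - zero_shot[name] for name in fine_tuned]
p_value = wilcoxon_exact_p(diffs)

for name, (mu, sd) in summary.items():
    print(f"{name}: {mu:.1f} +/- {sd:.1f}")
print(f"paired Wilcoxon p = {p_value:.4f}")
```

One consequence worth noting: with only seven paired models, the smallest two-sided exact p-value this test can produce is 2/2^7, about 0.0156, however large the gains are.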
Circularity Check
No circularity: empirical dataset generation and fine-tuning evaluation are self-contained
The paper presents a generation pipeline that uses the MuCO optimizer to produce trajectories enforcing visibility, collision, smoothness and kinematic constraints in simulation, followed by empirical fine-tuning of seven vision-language models that reports measured SR@1 meter gains over zero-shot baselines. No derivation, prediction or first-principles claim reduces to its own inputs by construction; the central performance numbers are obtained from external model training runs on the generated data rather than from any fitted parameter or self-citation chain. The work contains no load-bearing self-citations, uniqueness theorems imported from prior author work, or ansatzes smuggled via citation, making the derivation chain independent of the reported results.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Multi-constraint optimization in continuous 3D space with BVH queries produces higher-quality expert trajectories than grid-based planners.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean: alexander_duality_circle_linking
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "MuCO ... plans directly in continuous three-dimensional space with BVH-accelerated collision and visibility queries, jointly enforcing target visibility, viewpoint quality, collision avoidance, smoothness, and kinematic feasibility"
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "C(W) = sum 9 cost terms ... visibility, jerk, safety, pitch, altitude"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.