
arxiv: 2605.19120 · v1 · pith:3MDDZICWnew · submitted 2026-05-18 · 💻 cs.RO

CosFly: Plan in the Matrix, Fly in the World

Pith reviewed 2026-05-20 09:01 UTC · model grok-4.3

classification 💻 cs.RO
keywords aerial tracking, UAV simulation, multi-modal dataset, trajectory planning, dynamic target tracking, drone navigation, sensor data rendering, 6-DOF annotations

The pith

CosFly provides a seven-step pipeline that converts 3D environments into planned UAV trajectories and synchronized multi-modal sensor data for aerial tracking.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CosFly, a box-structured planning and multimodal simulation pipeline that simplifies complex 3D worlds into obstacle grids, generates trajectories for dynamic target tracking, and renders them back as RGB images, depth maps, semantic masks, and natural language instructions with 6-DOF pose annotations. It releases the CosFly-Track dataset containing 250 validated trajectories and approximately 100,000 images collected across urban, highway, rural, forest, and coastal settings. The pipeline supports configurable fixed-FOV camera settings and compares a conventional two-stage planning method against a direct gradient-based optimization approach. A sympathetic reader would care because this combination supplies a scalable source of paired planning and perception data that could accelerate development of UAV navigation and multi-modal perception systems without requiring equivalent volumes of real-world collection.

Core claim

The central claim is that the modular seven-step pipeline, built on the CARLA simulator, converts 3D map data into structured obstacle representations suitable for trajectory planning, then projects the resulting paths into multi-modal sensor outputs that include configurable camera intrinsics, enabling the creation of large-scale datasets that pair navigation instructions with precise pose information for aerial tracking research.
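The "structured obstacle representation" step can be pictured with a minimal sketch. It is an illustration under stated assumptions, not the paper's implementation: the `Box` type, the `boxes_to_grid` function, the 2D slice at a single flight altitude, and the cell size are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Axis-aligned 3D box in metres (hypothetical stand-in for an exported bounding box)."""
    x_min: float
    y_min: float
    z_min: float
    x_max: float
    y_max: float
    z_max: float


def boxes_to_grid(boxes: List[Box], x_size: float, y_size: float,
                  cell: float, altitude: float) -> List[List[int]]:
    """Return a rows x cols grid where 1 marks a cell blocked at the given altitude."""
    cols = math.ceil(x_size / cell)
    rows = math.ceil(y_size / cell)
    grid = [[0] * cols for _ in range(rows)]
    for b in boxes:
        # Boxes that do not span the flight altitude are not obstacles in this slice.
        if not (b.z_min <= altitude <= b.z_max):
            continue
        # Conservative rasterization: cells touched on their boundary are also marked.
        c0 = max(0, math.floor(b.x_min / cell))
        c1 = min(cols - 1, math.floor(b.x_max / cell))
        r0 = max(0, math.floor(b.y_min / cell))
        r1 = min(rows - 1, math.floor(b.y_max / cell))
        for r in range(r0, r1 + 1):
            for c in range(c0, c1 + 1):
                grid[r][c] = 1
    return grid


# Usage: one 2 m x 2 m building footprint, 10 m tall, in a 10 m x 10 m map.
grid = boxes_to_grid([Box(1, 1, 0, 3, 3, 10)], x_size=10, y_size=10, cell=2, altitude=5)
```

A grid like this is what a grid-based planner such as A* would search over; the paper's own simplification (including the merging of vegetation and pole boxes shown in its Figure 4) may differ in detail.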

What carries the argument

The box-structured planning and multimodal simulation pipeline that simplifies 3D worlds into grids for trajectory optimization and renders the results as synchronized multi-modal observations with 6-DOF annotations.

If this is right

  • Enables large-scale training of dynamic target tracking algorithms using paired multi-modal sensor data and natural language instructions.
  • Supports UAV navigation research through complete 6-DOF pose annotations across hundreds of trajectories.
  • Allows direct comparison of two-stage candidate generation versus single-objective gradient-based planning within the same simulated environments.
  • Provides a foundation for multi-modal perception studies that combine RGB, depth, and semantic segmentation outputs.
  • Scales aerial-ground collaborative experiments by covering diverse scene types without repeated real-world data gathering.
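For readers who want to consume the 6-DOF annotations mentioned above, the following sketch turns one pose record into a rotation matrix. It is a sketch under assumptions: the `Pose6DOF` type is hypothetical, and the code uses a right-handed Z-Y-X (yaw, then pitch, then roll) convention in degrees. The dataset's actual axis and sign conventions are not stated here and should be checked before use.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Pose6DOF:
    """Hypothetical pose record: position in metres, orientation in degrees."""
    x: float
    y: float
    z: float
    yaw: float
    pitch: float
    roll: float


def rotation_matrix(yaw_deg: float, pitch_deg: float, roll_deg: float) -> List[List[float]]:
    """Body-to-world rotation R = Rz(yaw) * Ry(pitch) * Rx(roll), assumed convention."""
    y, p, r = (math.radians(a) for a in (yaw_deg, pitch_deg, roll_deg))
    cy, sy = math.cos(y), math.sin(y)
    cp, sp = math.cos(p), math.sin(p)
    cr, sr = math.cos(r), math.sin(r)
    return [
        [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr],
        [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr],
        [-sp, cp * sr, cp * cr],
    ]


# Usage: heading direction (first column of R) of a drone yawed by 90 degrees.
pose = Pose6DOF(x=0.0, y=0.0, z=20.0, yaw=90.0, pitch=0.0, roll=0.0)
R = rotation_matrix(pose.yaw, pose.pitch, pose.roll)
forward = [R[0][0], R[1][0], R[2][0]]
```

Under this assumed convention, the first column of `R` is the direction the body x-axis points in world coordinates.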

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the simulated data transfers well, the approach could lower the barrier to entry for researchers who lack access to physical drone fleets.
  • The fixed-FOV zoom feature could be used to isolate how changes in camera focal length affect tracking robustness across different distances.
  • Extending the pipeline to include modeled sensor noise or wind disturbances would test robustness claims more directly.
  • Integration with other simulators might allow systematic study of how environment variety influences learned tracking policies.
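The link between a fixed FOV setting and an equivalent focal length follows from the standard pinhole camera model. The sketch below assumes a horizontal FOV in degrees and square pixels; the function names are illustrative and are not taken from the paper.

```python
import math
from typing import List


def focal_length_px(image_width_px: int, horizontal_fov_deg: float) -> float:
    """Pinhole model: f = (W / 2) / tan(FOV / 2), with f expressed in pixels."""
    return (image_width_px / 2.0) / math.tan(math.radians(horizontal_fov_deg) / 2.0)


def intrinsics(width: int, height: int, horizontal_fov_deg: float) -> List[List[float]]:
    """3x3 intrinsic matrix, assuming square pixels and a centred principal point."""
    f = focal_length_px(width, horizontal_fov_deg)
    return [
        [f, 0.0, width / 2.0],
        [0.0, f, height / 2.0],
        [0.0, 0.0, 1.0],
    ]


def zoom_factor(wide_fov_deg: float, narrow_fov_deg: float, width: int = 800) -> float:
    """Magnification gained by narrowing the FOV at a fixed image width."""
    return focal_length_px(width, narrow_fov_deg) / focal_length_px(width, wide_fov_deg)


# Usage: intrinsics for an 800 x 600 image at a 90 degree horizontal FOV.
K = intrinsics(800, 600, 90.0)
```

Narrowing the FOV at a fixed resolution raises the focal length in pixels, which is how a change in camera intrinsics can stand in for a longer lens.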

Load-bearing premise

That trajectories planned and sensor data rendered inside the CARLA simulator will prove representative enough of real-world UAV dynamics and perception to train systems that transfer effectively to physical platforms.

What would settle it

A side-by-side test in which a model trained exclusively on the CosFly-Track dataset shows substantially lower tracking accuracy when flown on a physical drone in environments matched to the simulated ones.

Figures

Figures reproduced from arXiv: 2605.19120 by Binbo Li, Hanxuan Chen, Hanzhong Guo, Jie Zheng, Ji Pei, Kangli Wang, Ruilong Ren, Shuai Yuan, Songsheng Cheng, Tianle Zeng, Xiangyue Wang.

Figure 1: Derived from the cultural metaphor of The Matrix, the core view (“We do not transform reality, but transform the Matrix”) summarizes our paradigm: we build editable, controllable virtual worlds to bypass the limitations of physical reality. The left half illustrates trajectory planning within a structured 3D “Matrix”; the right half shows the photorealistic world for UAV flight execution; the bottom row pres…
Figure 2: Overview of the seven-step CosFly construction pipeline. Step 4 produces the current CosFly-Track …
Figure 3: 3D bounding boxes in Town10HD_Opt by semantic category (Vegetation dominates; long tail of …
Figure 4: Grid simplification: (a/c) Vegetation before/after merging; (b/d) traffic lights and poles before/after …
Figure 5: Pedestrian trajectories in Town10HD_Opt: 20 A* paths sampled in the ROI-masked walkable region …
Figure 6: Building-circling failure mode (top-down schematic). When a building blocks line-of-sight for an …
Figure 7: TA*+Smooth (red = raw TA*, blue = post-smoothed) vs. MuCO (orange) on a Town10HD_Opt …
Figure 8: One synchronized sample: (a) RGB, (b) depth (display rendering of float32 array), (c) …
Figure 9: A failure case excluded from the release (six synchronized timestamps, each shown as planning matrix …
Figure 10: Three-stage CoC generation pipeline. A Qwen3.5-397B-A17B-FP8 teacher generates …
Figure 11: Per-stage wall-clock budget on the 20-trajectory pilot (Town10HD_Opt). Stage A …
Figure 12: Measured CARLA rendering at W ∈ {1, 2, 4, 6} workers. Left: throughput in frames-per-second against the contention-free “ideal linear” upper bound; the gap visualises the cost of GPU and shader contention. Middle: mean GPU utilisation and mean GPU memory fraction of the 47.6 GiB on-board budget; the dotted line marks the 85% saturation threshold beyond which we start observing watchdog restart events. Rig…
Figure 13: Gantt-style watchdog timeline. Each horizontal bar represents one child-process attempt; …
Figure 14: Visual comparison of depth map storage and visualization methods. (a) RGB input. (b) …
Figure 15: Visual demonstration of zoom capabilities across four FOV configurations. Each row …
Figure 16: Qualitative comparison between digital zoom and optical zoom at 5 …
Figure 17: Joint perturbation system overview. The figure shows the perturbation parameter space …
Figure 18: Dual-trajectory data synthesis pipeline. (a) A sliding window of size 10 moves along the …
Figure 19: Training sample generation through sliding window sampling. Each trajectory of 170–200 …
Figure 20: Cumulative trajectory production across the distributed cluster over a four-day period.
Figure 21: Per-map trajectory completion status on Machine C. The system successfully generated the …
Figure 22: GPU utilization and memory usage on Machine A (dual-GPU) over the rendering period.
Figure 23: System resource utilization on Machine A. CPU usage remains stable at …
Figure 24: Simulator restart analysis on Machine B. Left: Per-worker restart counts, showing that …
Figure 25: The five-phase UAV CoC data production pipeline. The workflow ensures causal safety …
Figure 26: Sliding window observation sequence. The model receives 5 historical frames ( …
Figure 27: Character length distribution of the generated CoC components. The …
Figure 28: Consistency verdict distribution across 19,066 generated samples. While 21.5% of sam …
Figure 29: Top confusion patterns between ground-truth labels and model predictions. The most …
Figure 30: Comparison of constrained re-generation strategies. The …
Figure 31 shows the corresponding current frame (the fifth and newest observation image in the 5-frame sliding window) for this sample
Figure 32: Distribution of BERTScore F1 values for Qwen3.5-2B and Qwen3.5-4B models on the …
Figure 33: Comparison of average BERTScore Precision, Recall, and F1 metrics. The 4B model …
Figure 34: Per-decision accuracy for the top four flight commands. The 4B model shows superior …
Figure 35: Confusion matrices for flight decision prediction. The 4B model (right) demonstrates …
Figure 36: ROI mask editor view overlaid on a CARLA top-down rendering of Town10HD_Opt.
Figure 37: ROI mask editor user interface. Left-click places polygon vertices to annotate the walkable …
Figure 38: ROI mask annotation on Town07_Opt, illustrating exclusions that the geometric 3D box …
Figure 39: Four-axis radar comparison of all seven primary planners. TA*+Smooth (blue, …
Figure 40: Four-axis grouped bar chart. TA*+Smooth ranks first on the composite mean (0.782); …
Figure 41: Pairwise projections including the safety axis. TA*+Smooth and MuCO occupy the high …
Figure 42: Measured metric deltas between TA* + Smooth and MuCO. Positive deltas indicate a TA* …
Figure 43: 3D measured trajectory reproduction on the four scenarios with the largest TA*+Smooth …
Figure 44: Weather injection pipeline: preset definitions are loaded once per batch; the mode resolver …
Figure 45: Nine representative (weather, ToD) configurations rendered from the same trajectory …
Figure 46: RPSS pipeline. Solid arrows mark the linear data dependency. The two dashed arrows …
Figure 47: Step 1 visual outcome. (a) The UE service exposes only the default ground and sky.
Figure 48: Step 2 interactive viewer rendering of the canonical box map. Buildings are translucent …
Figure 49: Step 5 interactive viewer rendering of all …
Figure 50: Step 6 replay preview (frames sampled at …
Original abstract

We present CosFly, a box-structured planning and multimodal simulation pipeline for aerial tracking, together with CosFly-Track, a large-scale UAV dataset for dynamic target tracking across diverse environments including urban centers, highways, rural landscapes, forests, and coastal towns. In our current implementation on CARLA, CosFly provides a modular 7-step construction pipeline that converts complex 3D worlds into structured obstacle representations for planning, then projects the resulting trajectories back into multi-modal sensor data -- including RGB images, high-precision depth maps, and semantic segmentation masks -- paired with natural language navigation instructions. A key feature is the support for configurable fixed-FOV zoom levels (one FOV setting drawn per trajectory and held constant throughout), enabling simulation of various focal lengths through camera-intrinsic adjustments. The pipeline covers the complete workflow from 3D map export through grid simplification, pedestrian and drone trajectory planning, multi-modal rendering with 6-DOF pose annotations, quality inspection, and teacher-student caption generation. We analyze two trajectory-planning paradigms for aerial target tracking: a conventional two-stage pipeline with front-end candidate generation and backend refinement, and a direct gradient-based formulation that optimizes multiple tracking constraints in a single objective. The public CosFly-Track release contains 250 validated trajectories and approximately 100,000 rendered images with complete 6-DOF drone pose annotations (position x, y, z and orientation yaw, pitch, roll). Together, the pipeline and dataset establish a scalable foundation for aerial-ground collaborative research, supporting dynamic target tracking, UAV navigation, and multi-modal perception across diverse environments.
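The idea of folding several tracking constraints into a single objective and descending its gradient can be shown with a toy sketch. This is not the paper's planner: the two cost terms (a stand-off distance to the target and a path-smoothness penalty), the weights, the 2D setting, and the finite-difference gradient are all assumptions chosen to keep the example short.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def tracking_cost(drone: Sequence[Point], target: Sequence[Point],
                  desired_dist: float = 5.0, w_dist: float = 1.0,
                  w_smooth: float = 1.0) -> float:
    """Single scalar objective: stand-off distance term plus smoothness term."""
    cost = 0.0
    for (px, py), (qx, qy) in zip(drone, target):
        cost += w_dist * (math.hypot(px - qx, py - qy) - desired_dist) ** 2
    for (ax, ay), (bx, by) in zip(drone, drone[1:]):
        cost += w_smooth * ((bx - ax) ** 2 + (by - ay) ** 2)
    return cost


def numeric_gradient(drone: List[List[float]], target: Sequence[Point],
                     eps: float = 1e-5) -> List[List[float]]:
    """Central finite differences; a real planner would use analytic gradients."""
    grad = []
    for i in range(len(drone)):
        row = []
        for k in range(2):
            plus = [list(p) for p in drone]
            minus = [list(p) for p in drone]
            plus[i][k] += eps
            minus[i][k] -= eps
            row.append((tracking_cost(plus, target) - tracking_cost(minus, target)) / (2 * eps))
        grad.append(row)
    return grad


def optimize(drone: Sequence[Point], target: Sequence[Point],
             step: float = 0.05, iters: int = 200) -> List[List[float]]:
    """Plain gradient descent on all waypoints at once."""
    path = [list(p) for p in drone]
    for _ in range(iters):
        g = numeric_gradient(path, target)
        for i in range(len(path)):
            path[i][0] -= step * g[i][0]
            path[i][1] -= step * g[i][1]
    return path


# Usage: a target walking along the x-axis; the drone starts 1 m beside it.
target = [(float(t), 0.0) for t in range(10)]
initial = [(x, y + 1.0) for x, y in target]
optimized = optimize(initial, target)
```

By contrast, a two-stage pipeline would first search for a feasible candidate path (for example on an obstacle grid) and then refine it in a separate smoothing step, instead of trading off all terms in one objective.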

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents CosFly, a modular 7-step CARLA-based pipeline that converts 3D worlds into structured obstacle grids for planning, generates trajectories via either a two-stage or gradient-based optimizer, and renders multi-modal outputs (RGB, depth, segmentation) with 6-DOF pose annotations and natural-language instructions. It releases the CosFly-Track dataset of 250 validated trajectories and ~100k images spanning urban, highway, rural, forest, and coastal environments, with configurable fixed-FOV camera settings.

Significance. If the simulated trajectories and sensor renders prove representative, the pipeline and dataset would supply a useful, scalable resource for training models in dynamic target tracking, UAV navigation, and multi-modal aerial-ground perception. The explicit release of 6-DOF annotations, multi-modal renders, and two distinct planning formulations is a concrete strength that could accelerate reproducible research in the area.

major comments (2)
  1. [Abstract] The central claim that the pipeline and dataset 'establish a scalable foundation for aerial-ground collaborative research, supporting dynamic target tracking, UAV navigation, and multi-modal perception' is not accompanied by any quantitative validation, error analysis, success rates, or baseline comparisons, leaving the effectiveness of both planning paradigms and the sim-to-real utility untested.
  2. [Abstract] The assumption that CARLA-generated obstacle grids, gradient-based planners, and fixed-FOV projections produce statistics representative of physical UAV aerodynamics, wind effects, motion blur, and depth noise is load-bearing for all real-world claims, yet no quantitative comparison to real drone logs or hardware sensor characteristics is reported.
minor comments (1)
  1. [Abstract] Specify the exact criteria and quantitative thresholds used during the 'quality inspection' step of the 7-step pipeline.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment below, clarifying the scope of the contribution as a simulation pipeline and dataset release while committing to revisions that better align the abstract and discussion with the manuscript content.

Point-by-point responses
  1. Referee: [Abstract] The central claim that the pipeline and dataset 'establish a scalable foundation for aerial-ground collaborative research, supporting dynamic target tracking, UAV navigation, and multi-modal perception' is not accompanied by any quantitative validation, error analysis, success rates, or baseline comparisons, leaving the effectiveness of both planning paradigms and the sim-to-real utility untested.

    Authors: We agree that the abstract phrasing overstates the validated scope. The manuscript centers on the modular 7-step CARLA pipeline, the two planning formulations (two-stage and gradient-based), and the release of the CosFly-Track dataset with 250 trajectories and ~100k multi-modal images. No quantitative benchmarks, success rates, or baseline comparisons appear because the work is positioned as a resource for the community rather than an evaluated method. In revision we will rewrite the abstract to describe the pipeline and dataset as a scalable simulation foundation without claiming untested effectiveness for real-world tasks, and we will expand the validation section with qualitative trajectory inspection results and any internal consistency checks performed during dataset construction.
    Revision: yes

  2. Referee: [Abstract] The assumption that CARLA-generated obstacle grids, gradient-based planners, and fixed-FOV projections produce statistics representative of physical UAV aerodynamics, wind effects, motion blur, and depth noise is load-bearing for all real-world claims, yet no quantitative comparison to real drone logs or hardware sensor characteristics is reported.

    Authors: The manuscript does not assert that the CARLA outputs are statistically representative of physical UAV dynamics or sensor noise; all experiments remain inside the simulator. We will revise the abstract to remove any implication of direct real-world transfer and add an explicit limitations paragraph that states the absence of paired real-drone logs, the lack of wind or motion-blur modeling, and the fixed-FOV simplification. This will make the sim-to-real gap transparent to readers.
    Revision: yes

Circularity Check

0 steps flagged

No circularity: pipeline and dataset construction are self-contained descriptions

Full rationale

The paper describes a 7-step CARLA-based construction pipeline that converts 3D worlds into obstacle grids, plans trajectories (two-stage or gradient-based), renders multi-modal sensor data with 6-DOF poses, and generates captions. No equations, fitted parameters, predictions, or first-principles derivations are presented that reduce to the inputs by construction. The central claim is that the released CosFly-Track dataset (250 trajectories, ~100k images) supports UAV research; this is a factual statement about the artifact, not a result forced by self-definition or self-citation. No load-bearing uniqueness theorems or ansatzes appear. The work is a systems contribution whose validity rests on external validation against real UAV data, not internal reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the domain assumption that the CARLA simulator provides a sufficiently faithful model of real environments and dynamics for the generated data to be useful.

axioms (1)
  • Domain assumption: the CARLA simulator accurately models real-world physics, sensor behavior, and diverse environments, including urban, rural, and natural settings.
    Rationale: the entire pipeline is implemented on CARLA and claims to support real-world-like tracking and navigation research.

pith-pipeline@v0.9.0 · 5849 in / 1383 out tokens · 36777 ms · 2026-05-20T09:01:05.570275+00:00 · methodology



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CosFly-Track: A Large-Scale Multi-Modal Dataset for UAV Visual Tracking via Multi-Constraint Trajectory Optimization

    cs.RO 2026-05 conditional novelty 7.0

    CosFlyTrack supplies 2.4 million timesteps of aligned RGB, depth, segmentation, pose, target state, and bilingual instructions from expert UAV trajectories, with experiments showing 53-69 point gains in SR@1m after fi...

  2. CosFly-Track: A Large-Scale Multi-Modal Dataset for UAV Visual Tracking via Multi-Constraint Trajectory Optimization

    cs.RO 2026-05 conditional novelty 7.0

    CosFlyTrack provides 12,000 expert UAV trajectories with aligned RGB, depth, segmentation, pose, target state, and bilingual instructions to train visual tracking agents, yielding 53-69 point gains in success rate aft...


  35. [36]

    Aerialvln: Vision-and-language navigation for uavs, 2023

    Shubo Liu, Hongsheng Zhang, Yuankai Qi, Peng Wang, Y aning Zhang, and Qi Wu. Aerialvln: Vision-and-language navigation for uavs, 2023

  36. [37]

    Tomás Lozano-Pérez and Michael A. Wesley. An algorithm for planning collision-free paths among polyhedral obstacles. Communications of the ACM, 22(10):560–570, 1979

  37. [38]

    Spring: A high-resolution high-detail dataset and benchmark for scene flow, optical flow and stereo

    Lukas Mehl, Jenny Schmalfuss, Azin Jahedi, Y aroslava Nalivayko, and Andrés Bruhn. Spring: A high-resolution high-detail dataset and benchmark for scene flow, optical flow and stereo. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , 2023

  38. [39]

    A benchmark and simulator for UA V track- ing

    Matthias Mueller, Neil Smith, and Bernard Ghanem. A benchmark and simulator for UA V track- ing. In European Conference on Computer Vision (ECCV) , pages 445–461. Springer, 2016

  39. [40]

    BuckTales: A multi-UA V dataset for multi-object tracking and re-identification of wild antelopes

    Hemal Naik, Junran Y ang, Dipin Das, Margaret C Crofoot, Akanksha Rathore, and Vivek Hari Sridhar. BuckTales: A multi-UA V dataset for multi-object tracking and re-identification of wild antelopes. In Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track, 2024

  40. [41]

    GPT-4 Technical Report

    OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  41. [42]

    Heuristic search viewed as path finding in a graph

    Ira Pohl. Heuristic search viewed as path finding in a graph. Artificial Intelligence, 1(3–4):193– 204, 1970

  42. [43]

    Aria synthetic environments dataset

    Project Aria. Aria synthetic environments dataset. https://www.projectaria.com/ datasets/ase/, 2024

  43. [44]

    MUST: The first dataset and unified framework for multispectral UA V single object tracking

    Haolin Qin, Tingfa Xu, Tianhao Li, Zhenxiang Chen, Tao Feng, and Jianan Li. MUST: The first dataset and unified framework for multispectral UA V single object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , 2025

  44. [45]

    Elastic bands: connecting path planning and control

    Sean Quinlan and Oussama Khatib. Elastic bands: connecting path planning and control. In Pro- ceedings IEEE International Conference on Robotics and Automation , pages 802–807. IEEE Computer Society Press, 1993

  45. [46]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agar- wal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Pro- ceedings of the 38th International Conference on Machine Learning (ICML) , volume 139 of P...

  46. [47]

    Andrew Bagnell, and Siddhartha Srinivasa

    Nathan Ratliff, Matt Zucker, J. Andrew Bagnell, and Siddhartha Srinivasa. CHOMP: Gradient optimization techniques for efficient motion planning. In 2009 IEEE International Conference on Robotics and Automation , pages 489–494. IEEE, 2009

  47. [48]

    Susskind

    Mike Roberts, Jason Ramapuram, Anurag Ranjan, Atulit Kumar, Miguel Angel Bautista, Nathan Paczan, Russ Webb, and Joshua M. Susskind. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In International Conference on Computer Vi- sion (ICCV), pages 10912–10922, 2021

  48. [49]

    Habitat: A platform for embodied AI research

    Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, and Dhruv Batra. Habitat: A platform for embodied AI research. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , pages 9339–9347, 2019

  49. [50]

    High-resolution stereo datasets with subpixel-accurate ground truth

    Daniel Scharstein, Heiko Hirschmüller, Y ork Kitajima, Greg Krathwohl, Nera Nešić, Xi Wang, and Porter Westling. High-resolution stereo datasets with subpixel-accurate ground truth. In German Conference on Pattern Recognition (GCPR) , pages 31–42, 2014

  50. [51]

    Motion planning with sequential convex op- timization and convex collision checking

    John Schulman, Y an Duan, Jonathan Ho, Alex Lee, Ibrahim Awwal, Henry Bradlow, Jia Pan, Sachin Patil, Ken Goldberg, and Pieter Abbeel. Motion planning with sequential convex op- timization and convex collision checking. The International Journal of Robotics Research , 33(9):1251–1270, 2014

  51. [52]

    Airsim: High-fidelity visual and physical simulation for autonomous vehicles

    Shital Shah, Debadeepta Dey, Chris Lovett, and Ashish Kapoor. Airsim: High-fidelity visual and physical simulation for autonomous vehicles. In Field and Service Robotics, Springer Pro- ceedings in Advanced Robotics, pages 621–635. Springer, 2018

  52. [53]

    Indoor segmentation and support inference from RGBD images

    Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGBD images. In European Conference on Computer Vision (ECCV) , pages 746–760, 2012

  53. [54]

    Griffin: Aerial-ground cooperative detection and tracking dataset and benchmark, 2025

    Jiahao Wang, Xiangyu Cao, Jiaru Zhong, Yuner Zhang, Zeyu Han, Haibao Yu, Chuang Zhang, Lei He, Shaobing Xu, and Jianqiang Wang. Griffin: Aerial-ground cooperative detection and tracking dataset and benchmark, 2025

  54. [55]

    UA VScenes: A multi-modal dataset for UA Vs

    Sijie Wang, Siqi Li, Y awei Zhang, Shangshu Yu, Shenghai Yuan, Rui She, Quanjiang Guo, JinXuan Zheng, Ong Kang Howe, Leonrich Chandra, et al. UA VScenes: A multi-modal dataset for UA Vs. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

  55. [56]

    TartanAir: A dataset to push the limits of visual SLAM

    Wenshan Wang, Delong Zhu, Xiangwei Wang, Y aoyu Hu, Yuheng Qiu, Chen Wang, Y afei Hu, Ashish Kapoor, and Sebastian Scherer. TartanAir: A dataset to push the limits of visual SLAM. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) , pages 4909– 4916, 2020

  56. [57]

    Towards realistic UA V vision-language navigation: Platform, benchmark, and methodology

    Xiangyu Wang, Donglin Y ang, Ziqin Wang, Hohin Kwan, Jinyu Chen, Wenjun Wu, Hong- sheng Li, Yue Liao, and Si Liu. Towards realistic UA V vision-language navigation: Platform, benchmark, and methodology. In Proceedings of the International Conference on Learning Representations (ICLR), 2025

  57. [58]

    Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail

    Y an Wang, Wenjie Luo, Junjie Bai, Yulong Cao, Tong Che, Ke Chen, Yuxiao Chen, Jenna Dia- mond, Yifan Ding, Wenhao Ding, et al. Alpamayo-r1: Bridging reasoning and action prediction for generalizable autonomous driving in the long tail. arXiv preprint arXiv:2511.00088, 2025

  58. [59]

    Detection, tracking, and counting meets drones in crowds: A benchmark

    Longyin Wen, Dawei Du, Pengfei Zhu, Xiao Bian, Haibin Ling, Qinghua Hu, and Tao Mei. Detection, tracking, and counting meets drones in crowds: A benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages 7780– 7789, 2021

  59. [60]

    Zamir, Zhi- Y ang He, Alexander Sax, Jitendra Malik, and Silvio Savarese

    Fei Xia, Amir R. Zamir, Zhi- Y ang He, Alexander Sax, Jitendra Malik, and Silvio Savarese. Gibson env: Real-world perception for embodied agents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages 9068–9079, 2018. Spotlight Oral

  60. [61]

    RGBD objects in the wild: Scaling real- world 3D object learning from RGB-D videos

    Hongchi Xia, Y ang Fu, Sifei Liu, and Xiaolong Wang. RGBD objects in the wild: Scaling real- world 3D object learning from RGB-D videos. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 27

  61. [62]

    Depth anything V2

    Lihe Y ang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Heng- shuang Zhao. Depth anything V2. In Advances in Neural Information Processing Systems (NeurIPS), 2024

  62. [63]

    ScanNet++: A high- fidelity dataset of 3D indoor scenes

    Chandan Y eshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. ScanNet++: A high- fidelity dataset of 3D indoor scenes. In International Conference on Computer Vision (ICCV) , 2023

  63. [64]

    CARLA-Air: Fly Drones Inside a CARLA World -- A Unified Infrastructure for Air-Ground Embodied Intelligence

    Tianle Zeng, Hanxuan Chen, Y anci Wen, and Hong Zhang. CARLA-Air: Fly drones inside a CARLA world–a unified infrastructure for air-ground embodied intelligence. arXiv preprint arXiv:2603.28032, 2026

  64. [65]

    Y oco: Y ou only calibrate once for accurate extrinsic parameter in lidar-camera systems.Measurement Science and Technology, 36(7):075009, 2025

    Tianle Zeng, Xinrong Gu, Feifan Y an, Meixi He, and Dengke He. Y oco: Y ou only calibrate once for accurate extrinsic parameter in lidar-camera systems.Measurement Science and Technology, 36(7):075009, 2025

  65. [66]

    EZREAL: Enhancing zero-shot outdoor robot navigation toward distant targets under varying visibility

    Tianle Zeng, Jianwei Peng, Hanjing Y e, Guangcheng Chen, Senzi Luo, and Hong Zhang. EZREAL: Enhancing zero-shot outdoor robot navigation toward distant targets under varying visibility. arXiv preprint arXiv:2509.13720, 2025

  66. [67]

    WebUA V-3M: A benchmark for unveiling the power of million-scale deep UA V tracking.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(12):14538– 14556, 2023

    Chunhui Zhang et al. WebUA V-3M: A benchmark for unveiling the power of million-scale deep UA V tracking.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(12):14538– 14556, 2023

  67. [68]

    Uav-track vla: Embodied aerial tracking via vision-language- action models, 2026

    Qiyao Zhang, Shuhua Zheng, Jianli Sun, Chengxiang Li, Xianke Wu, Zihan Song, Zhiyong Cui, Yisheng Lv, and Y onglin Tian. Uav-track vla: Embodied aerial tracking via vision-language- action models, 2026

  68. [69]

    Weinberger, and Y oav Artzi

    Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Y oav Artzi. BERTScore: Evaluating text generation with BERT. In Proceedings of the 8th International Conference on Learning Representations (ICLR), 2020

  69. [70]

    M3OT: A multi-drone multi-modality dataset for multi-object tracking

    Xin Zhang et al. M3OT: A multi-drone multi-modality dataset for multi-object tracking. Sci- entific Data, 12, 2025

  70. [71]

    Anti-UA V challenge 2023: Methods and results

    Jian Zhao et al. Anti-UA V challenge 2023: Methods and results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops , 2023

  71. [72]

    Drone-person tracking in uniform appearance crowd: A new dataset

    Jianwei Zhao et al. Drone-person tracking in uniform appearance crowd: A new dataset. Sci- entific Data, 10, 2023

  72. [73]

    Byrd, Peihuang Lu, and Jorge Nocedal

    Ciyou Zhu, Richard H. Byrd, Peihuang Lu, and Jorge Nocedal. Algorithm 778: L-BFGS- B: Fortran subroutines for large-scale bound-constrained optimization. ACM Transactions on Mathematical Software, 23(4):550–560, 1997

  73. [74]

    Detection and tracking meet drones challenge, 2021

    Pengfei Zhu, Longyin Wen, Dawei Du, Xiao Bian, Heng Fan, Qinghua Hu, and Haibin Ling. Detection and tracking meet drones challenge, 2021

  74. [75]

    Detection and tracking meet drones challenge.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(11):7380–7399, 2021

    Pengfei Zhu, Longyin Wen, Dawei Du, Xiao Bian, Haibin Ling, Qinghua Hu, Jie Nie, Chengjie Chen, Y afei Wang, Xin Zhang, Xinyao Lyu, Jianhua Liu, Guan Zhou, Yue Kang, Heng Liu, Jiayuan Cheng, and Tao Mei. Detection and tracking meet drones challenge.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(11):7380–7399, 2021

A Justification for 32-bit Float Depth Map Storage

This section provides a comprehensive quantitative and qualitative analysi...

  - TA* + Smooth release output: included; primary method
  - MuCO release output: included; primary one-shot global-planning baseline
  - RRT*, PRM, B-spline PRM, elastic band, minimum jerk: included as Python reference context
  - 3D A*, Weighted A*, Theta*, Visibility-A*: included as search-family controls
  - Potential field, CHOMP-lite, L-BFGS-B TrajOpt: included as local-optimization controls
