CosFly: Plan in the Matrix, Fly in the World

Binbo Li; Hanxuan Chen; Hanzhong Guo; Jie Zheng; Ji Pei; Kangli Wang; Ruilong Ren; Shuai Yuan; Songsheng Cheng; Tianle Zeng

arxiv: 2605.19120 · v1 · pith:3MDDZICWnew · submitted 2026-05-18 · 💻 cs.RO

Hanxuan Chen, Xiangyue Wang, Songsheng Cheng, Ruilong Ren, Jie Zheng, Shuai Yuan, Tianle Zeng, Hanzhong Guo, Binbo Li, Kangli Wang, Ji Pei

Pith reviewed 2026-05-20 09:01 UTC · model grok-4.3

classification 💻 cs.RO

keywords aerial tracking · UAV simulation · multi-modal dataset · trajectory planning · dynamic target tracking · drone navigation · sensor data rendering · 6-DOF annotations

The pith

CosFly provides a seven-step pipeline that converts 3D environments into planned UAV trajectories and synchronized multi-modal sensor data for aerial tracking.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CosFly, a box-structured planning and multimodal simulation pipeline that simplifies complex 3D worlds into obstacle grids, generates trajectories for dynamic target tracking, and renders them back as RGB images, depth maps, semantic masks, and natural language instructions with 6-DOF pose annotations. It releases the CosFly-Track dataset containing 250 validated trajectories and approximately 100,000 images collected across urban, highway, rural, forest, and coastal settings. The pipeline supports configurable fixed-FOV camera settings and compares a conventional two-stage planning method against a direct gradient-based optimization approach. A sympathetic reader would care because this combination supplies a scalable source of paired planning and perception data that could accelerate development of UAV navigation and multi-modal perception systems without requiring equivalent volumes of real-world collection.

Core claim

The central claim is that the modular seven-step pipeline, built on the CARLA simulator, converts 3D map data into structured obstacle representations suitable for trajectory planning, then projects the resulting paths into multi-modal sensor outputs that include configurable camera intrinsics, enabling the creation of large-scale datasets that pair navigation instructions with precise pose information for aerial tracking research.

What carries the argument

The box-structured planning and multimodal simulation pipeline that simplifies 3D worlds into grids for trajectory optimization and renders the results as synchronized multi-modal observations with 6-DOF annotations.
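
As a rough illustration of what simplifying a 3D world into an obstacle grid can look like, the sketch below rasterizes axis-aligned 3D boxes into a boolean voxel grid. The box format, the 1 m resolution, and the function name `boxes_to_grid` are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def boxes_to_grid(boxes, origin, shape, resolution=1.0):
    """Mark every voxel overlapped by an axis-aligned box as occupied.

    boxes:  iterable of (min_xyz, max_xyz) pairs in metres (hypothetical format)
    origin: world coordinate of the grid's minimum corner
    shape:  (nx, ny, nz) number of voxels per axis
    """
    grid = np.zeros(shape, dtype=bool)
    origin = np.asarray(origin, dtype=float)
    for lo, hi in boxes:
        # Convert world coordinates to voxel indices, clamped to the grid bounds.
        i0 = np.floor((np.asarray(lo, dtype=float) - origin) / resolution).astype(int)
        i1 = np.ceil((np.asarray(hi, dtype=float) - origin) / resolution).astype(int)
        i0 = np.clip(i0, 0, shape)
        i1 = np.clip(i1, 0, shape)
        grid[i0[0]:i1[0], i0[1]:i1[1], i0[2]:i1[2]] = True
    return grid

# One 2 m x 3 m x 4 m box inside a 10 x 10 x 5 grid.
grid = boxes_to_grid([((1, 1, 0), (3, 4, 4))], origin=(0, 0, 0), shape=(10, 10, 5))
occupied = int(grid.sum())
```

A planner can then treat `grid` as a lookup table for collision checks, which is far cheaper than querying the full 3D scene.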

If this is right

Enables large-scale training of dynamic target tracking algorithms using paired multi-modal sensor data and natural language instructions.
Supports UAV navigation research through complete 6-DOF pose annotations across hundreds of trajectories.
Allows direct comparison of two-stage candidate generation versus single-objective gradient-based planning within the same simulated environments.
Provides a foundation for multi-modal perception studies that combine RGB, depth, and semantic segmentation outputs.
Scales aerial-ground collaborative experiments by covering diverse scene types without repeated real-world data gathering.
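
The 6-DOF annotations pair a position (x, y, z) with an orientation (yaw, pitch, roll). A minimal sketch of turning one such record into a 4×4 homogeneous transform is given here. The angle units, the right-handed frame, and the rotation order are assumptions; CARLA's Unreal-based frame is left-handed, so the dataset's actual convention should be checked before reusing this.

```python
import numpy as np

def pose_to_matrix(x, y, z, yaw, pitch, roll):
    """Build a 4x4 homogeneous transform from a 6-DOF pose.

    Assumptions (not from the paper): angles in degrees, right-handed frame,
    rotation composed as R = Rz(yaw) @ Ry(pitch) @ Rx(roll).
    """
    cy, sy = np.cos(np.radians(yaw)), np.sin(np.radians(yaw))
    cp, sp = np.cos(np.radians(pitch)), np.sin(np.radians(pitch))
    cr, sr = np.cos(np.radians(roll)), np.sin(np.radians(roll))
    rz = np.array([[cy, -sy, 0.0], [sy, cy, 0.0], [0.0, 0.0, 1.0]])
    ry = np.array([[cp, 0.0, sp], [0.0, 1.0, 0.0], [-sp, 0.0, cp]])
    rx = np.array([[1.0, 0.0, 0.0], [0.0, cr, -sr], [0.0, sr, cr]])
    T = np.eye(4)
    T[:3, :3] = rz @ ry @ rx
    T[:3, 3] = [x, y, z]
    return T

# A drone at (10, -5, 30) yawed by 90 degrees: its body x-axis now points along world y.
T = pose_to_matrix(10.0, -5.0, 30.0, yaw=90.0, pitch=0.0, roll=0.0)
forward = T[:3, :3] @ np.array([1.0, 0.0, 0.0])
```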

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the simulated data transfers well, the approach could lower the barrier to entry for researchers who lack access to physical drone fleets.
The fixed-FOV zoom feature could be used to isolate how changes in camera focal length affect tracking robustness across different distances.
Extending the pipeline to include modeled sensor noise or wind disturbances would test robustness claims more directly.
Integration with other simulators might allow systematic study of how environment variety influences learned tracking policies.
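
To make the link between field of view and focal length concrete, the sketch below derives a pinhole intrinsic matrix from a horizontal FOV. The image size and the two FOV values are hypothetical, and square pixels with a centred principal point are assumed; the paper's actual camera settings are not stated on this page.

```python
import math

def intrinsics_from_fov(width, height, fov_deg):
    """Pinhole intrinsic matrix for a given horizontal field of view.

    Assumes square pixels and a principal point at the image centre.
    """
    fx = width / (2.0 * math.tan(math.radians(fov_deg) / 2.0))
    return [[fx, 0.0, width / 2.0],
            [0.0, fx, height / 2.0],
            [0.0, 0.0, 1.0]]

# Narrowing the FOV lengthens the focal length in pixels, i.e. zooms in.
K_wide = intrinsics_from_fov(800, 600, 90.0)
K_tele = intrinsics_from_fov(800, 600, 30.0)
```

Holding one FOV per trajectory, as the abstract describes, keeps this matrix constant across all frames of that trajectory.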

Load-bearing premise

That trajectories planned and sensor data rendered inside the CARLA simulator will prove representative enough of real-world UAV dynamics and perception to train systems that transfer effectively to physical platforms.

What would settle it

A side-by-side test in which a model trained exclusively on the CosFly-Track dataset shows substantially lower tracking accuracy when flown on a physical drone in environments matched to the simulated ones.

Figures

Figures reproduced from arXiv: 2605.19120 by Binbo Li, Hanxuan Chen, Hanzhong Guo, Jie Zheng, Ji Pei, Kangli Wang, Ruilong Ren, Shuai Yuan, Songsheng Cheng, Tianle Zeng, Xiangyue Wang.

**Figure 1.** Derived from the cultural metaphor of The Matrix, the core view, “We do not transform reality, but transform the Matrix”, summarizes our paradigm: we build editable, controllable virtual worlds to bypass the limitations of physical reality. The left half illustrates trajectory planning within a structured 3D “Matrix”; the right half shows the photorealistic world for UAV flight execution; the bottom row pres…

**Figure 2.** Overview of the seven-step CosFly construction pipeline. Step 4 produces the current CosFly-Track…

**Figure 3.** 3D bounding boxes in Town10HD_Opt by semantic category (Vegetation dominates; long tail of…

**Figure 4.** Grid simplification: (a/c) Vegetation before/after merging; (b/d) traffic lights and poles before/after…

**Figure 5.** Pedestrian trajectories in Town10HD_Opt: 20 A* paths sampled in the ROI-masked walkable region…

**Figure 6.** Building-circling failure mode (top-down schematic). When a building blocks line-of-sight for an…

**Figure 7.** TA*+Smooth (red = raw TA*, blue = post-smoothed) vs. MuCO (orange) on a Town10HD_Opt…

**Figure 8.** One synchronized sample: (a) RGB, (b) depth (display rendering of float32 array), (c)…

**Figure 9.** A failure case excluded from the release (six synchronized timestamps, each shown as planning matrix…

**Figure 10.** Three-stage CoC generation pipeline. A Qwen3.5-397B-A17B-FP8 teacher generates…

**Figure 11.** Per-stage wall-clock budget on the 20-trajectory pilot (Town10HD_Opt). Stage A…

**Figure 12.** Measured CARLA rendering at W ∈ {1, 2, 4, 6} workers. Left: throughput in frames-per-second against the contention-free “ideal linear” upper bound; the gap visualises the cost of GPU and shader contention. Middle: mean GPU utilisation and mean GPU memory fraction of the 47.6 GiB on-board budget; the dotted line marks the 85% saturation threshold beyond which we start observing watchdog restart events. Rig…

**Figure 13.** Gantt-style watchdog timeline. Each horizontal bar represents one child-process attempt;…

**Figure 14.** Visual comparison of depth map storage and visualization methods. (a) RGB input. (b)…

**Figure 15.** Visual demonstration of zoom capabilities across four FOV configurations. Each row…

**Figure 16.** Qualitative comparison between digital zoom and optical zoom at 5…

**Figure 17.** Joint perturbation system overview. The figure shows the perturbation parameter space…

**Figure 18.** Dual-trajectory data synthesis pipeline. (a) A sliding window of size 10 moves along the…

**Figure 19.** Training sample generation through sliding window sampling. Each trajectory of 170–200…

**Figure 20.** Cumulative trajectory production across the distributed cluster over a four-day period.

**Figure 21.** Per-map trajectory completion status on Machine C. The system successfully generated the…

**Figure 22.** GPU utilization and memory usage on Machine A (dual-GPU) over the rendering period.

**Figure 23.** System resource utilization on Machine A. CPU usage remains stable at…

**Figure 24.** Simulator restart analysis on Machine B. Left: Per-worker restart counts, showing that…

**Figure 25.** The five-phase UAV CoC data production pipeline. The workflow ensures causal safety…

**Figure 26.** Sliding window observation sequence. The model receives 5 historical frames (…

**Figure 27.** Character length distribution of the generated CoC components. The…

**Figure 28.** Consistency verdict distribution across 19,066 generated samples. While 21.5% of sam…

**Figure 29.** Top confusion patterns between ground-truth labels and model predictions. The most…

**Figure 30.** Comparison of constrained re-generation strategies. The…

**Figure 31.** Shows the corresponding current frame (the fifth and newest observation image in the 5-frame sliding window) for this sample…

**Figure 32.** Distribution of BERTScore F1 values for Qwen3.5-2B and Qwen3.5-4B models on the…

**Figure 33.** Comparison of average BERTScore Precision, Recall, and F1 metrics. The 4B model…

**Figure 34.** Per-decision accuracy for the top four flight commands. The 4B model shows superior…

**Figure 35.** Confusion matrices for flight decision prediction. The 4B model (right) demonstrates…

**Figure 36.** ROI mask editor view overlaid on a CARLA top-down rendering of Town10HD_Opt.

**Figure 37.** ROI mask editor user interface. Left-click places polygon vertices to annotate the walkable…

**Figure 38.** ROI mask annotation on Town07_Opt, illustrating exclusions that the geometric 3D box…

**Figure 39.** Four-axis radar comparison of all seven primary planners. TA*+Smooth (blue,…

**Figure 40.** Four-axis grouped bar chart. TA*+Smooth ranks first on the composite mean (0.782);…

**Figure 41.** Pairwise projections including the safety axis. TA*+Smooth and MuCO occupy the high…

**Figure 42.** Measured metric deltas between TA*+Smooth and MuCO. Positive deltas indicate a TA*…

**Figure 43.** 3D measured trajectory reproduction on the four scenarios with the largest TA*+Smooth…

**Figure 44.** Weather injection pipeline: preset definitions are loaded once per batch; the mode resolver…

**Figure 45.** Nine representative (weather, ToD) configurations rendered from the same trajectory…

**Figure 46.** RPSS pipeline. Solid arrows mark the linear data dependency. The two dashed arrows…

**Figure 47.** Step 1 visual outcome. (a) The UE service exposes only the default ground and sky.…

**Figure 48.** Step 2 interactive viewer rendering of the canonical box map. Buildings are translucent…

**Figure 49.** Step 5 interactive viewer rendering of all…

**Figure 50.** Step 6 replay preview (frames sampled at…

Original abstract

We present CosFly, a box-structured planning and multimodal simulation pipeline for aerial tracking, together with CosFly-Track, a large-scale UAV dataset for dynamic target tracking across diverse environments including urban centers, highways, rural landscapes, forests, and coastal towns. In our current implementation on CARLA, CosFly provides a modular 7-step construction pipeline that converts complex 3D worlds into structured obstacle representations for planning, then projects the resulting trajectories back into multi-modal sensor data -- including RGB images, high-precision depth maps, and semantic segmentation masks -- paired with natural language navigation instructions. A key feature is the support for configurable fixed-FOV zoom levels (one FOV setting drawn per trajectory and held constant throughout), enabling simulation of various focal lengths through camera-intrinsic adjustments. The pipeline covers the complete workflow from 3D map export through grid simplification, pedestrian and drone trajectory planning, multi-modal rendering with 6-DOF pose annotations, quality inspection, and teacher-student caption generation. We analyze two trajectory-planning paradigms for aerial target tracking: a conventional two-stage pipeline with front-end candidate generation and backend refinement, and a direct gradient-based formulation that optimizes multiple tracking constraints in a single objective. The public CosFly-Track release contains 250 validated trajectories and approximately 100,000 rendered images with complete 6-DOF drone pose annotations (position x, y, z and orientation yaw, pitch, roll). Together, the pipeline and dataset establish a scalable foundation for aerial-ground collaborative research, supporting dynamic target tracking, UAV navigation, and multi-modal perception across diverse environments.
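
To give a feel for the "direct gradient-based formulation" mentioned in the abstract, here is a toy sketch that adjusts interior waypoints by finite-difference gradient descent on a single objective. The two cost terms (smoothness and standoff distance), their weights, the step size, and the 2D setting are all hypothetical choices for illustration; this is not the paper's MuCO objective.

```python
import numpy as np

def cost(path, target, standoff=5.0, w_smooth=1.0, w_track=1.0):
    """Toy single objective: path smoothness plus standoff-distance error."""
    accel = path[2:] - 2.0 * path[1:-1] + path[:-2]    # discrete second difference
    dist = np.linalg.norm(path - target, axis=1)        # drone-to-target distance per step
    return w_smooth * np.sum(accel ** 2) + w_track * np.sum((dist - standoff) ** 2)

def optimize(path, target, steps=200, lr=0.01, eps=1e-4):
    """Finite-difference gradient descent over interior waypoints only."""
    path = path.astype(float).copy()
    for _ in range(steps):
        grad = np.zeros_like(path)
        for i in range(1, len(path) - 1):               # endpoints stay fixed
            for d in range(path.shape[1]):
                bumped = path.copy()
                bumped[i, d] += eps
                plus = cost(bumped, target)
                bumped[i, d] -= 2.0 * eps
                minus = cost(bumped, target)
                grad[i, d] = (plus - minus) / (2.0 * eps)  # central difference
        path -= lr * grad
    return path

# Hypothetical target walking along x; the drone path starts 2 m to the side.
target = np.stack([np.linspace(0, 10, 12), np.zeros(12)], axis=1)
initial = target + np.array([0.0, 2.0])
before = cost(initial, target)
result = optimize(initial, target)
after = cost(result, target)
```

Because every term lives in one scalar cost, adding a further constraint (for example a visibility or obstacle penalty) only means adding another weighted term, which is the practical appeal of the single-objective style compared with a two-stage front end and back end.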

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CosFly provides a practical CARLA pipeline and dataset for UAV tracking simulation, though real-world fidelity remains unproven.

CosFly gives a detailed way to generate large synthetic datasets for UAV tracking by planning in simplified grids inside CARLA and then rendering the views back out. The new part is the specific 7-step box-structured pipeline. It takes complex 3D environments, turns them into obstacle grids, plans paths for both the target and the drone using either a two-stage approach or a single gradient-based optimization, and then renders RGB, depth, and segmentation with 6-DOF annotations. The fixed-FOV zoom feature lets you pick one camera setting per trajectory and hold it for the whole flight. They also add teacher-student caption generation for language instructions. The released CosFly-Track has 250 trajectories and roughly 100k images from urban, highway, rural, forest, and coastal settings.

This setup works well for producing consistent multi-modal data at scale. The modular steps make it easier to reproduce or extend, and including both planning paradigms lets users compare them directly. Releasing the data with complete pose info and varied environments is a concrete contribution that others can build on for training tracking or navigation models.

The weaker part is the connection to real-world use. The abstract and description emphasize support for UAV navigation and perception in realistic environments, but CARLA's physics and sensor models are tuned for ground vehicles. Without reported tests of how the generated trajectories or image statistics compare to actual drone flights or hardware sensors, it's not clear how much the sim-to-real gap affects downstream performance. That assumption sits at the center of the claims.

This paper is for researchers in aerial robotics who need synthetic data to develop or test algorithms for dynamic target tracking and multi-modal perception. Someone looking for ready-to-use large datasets with annotations would get immediate value from the release. It deserves a serious referee because the pipeline is reproducible and the dataset adds new resources, even if the authors should add more on validation and limitations. I would send it for peer review and ask the referees to check the empirical support for the real-world applicability.

Referee Report

2 major / 1 minor

Summary. The paper presents CosFly, a modular 7-step CARLA-based pipeline that converts 3D worlds into structured obstacle grids for planning, generates trajectories via either a two-stage or gradient-based optimizer, and renders multi-modal outputs (RGB, depth, segmentation) with 6-DOF pose annotations and natural-language instructions. It releases the CosFly-Track dataset of 250 validated trajectories and ~100k images spanning urban, highway, rural, forest, and coastal environments, with configurable fixed-FOV camera settings.

Significance. If the simulated trajectories and sensor renders prove representative, the pipeline and dataset would supply a useful, scalable resource for training models in dynamic target tracking, UAV navigation, and multi-modal aerial-ground perception. The explicit release of 6-DOF annotations, multi-modal renders, and two distinct planning formulations is a concrete strength that could accelerate reproducible research in the area.

major comments (2)

[Abstract] The central claim that the pipeline and dataset 'establish a scalable foundation for aerial-ground collaborative research, supporting dynamic target tracking, UAV navigation, and multi-modal perception' is not accompanied by any quantitative validation, error analysis, success rates, or baseline comparisons, leaving the effectiveness of both planning paradigms and the sim-to-real utility untested.
[Abstract] The assumption that CARLA-generated obstacle grids, gradient-based planners, and fixed-FOV projections produce statistics representative of physical UAV aerodynamics, wind effects, motion blur, and depth noise is load-bearing for all real-world claims, yet no quantitative comparison to real drone logs or hardware sensor characteristics is reported.

minor comments (1)

[Abstract] Specify the exact criteria and quantitative thresholds used during the 'quality inspection' step of the 7-step pipeline.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment below, clarifying the scope of the contribution as a simulation pipeline and dataset release while committing to revisions that better align the abstract and discussion with the manuscript content.

Referee: [Abstract] The central claim that the pipeline and dataset 'establish a scalable foundation for aerial-ground collaborative research, supporting dynamic target tracking, UAV navigation, and multi-modal perception' is not accompanied by any quantitative validation, error analysis, success rates, or baseline comparisons, leaving the effectiveness of both planning paradigms and the sim-to-real utility untested.

Authors: We agree that the abstract phrasing overstates the validated scope. The manuscript centers on the modular 7-step CARLA pipeline, the two planning formulations (two-stage and gradient-based), and the release of the CosFly-Track dataset with 250 trajectories and ~100k multi-modal images. No quantitative benchmarks, success rates, or baseline comparisons appear because the work is positioned as a resource for the community rather than an evaluated method. In revision we will rewrite the abstract to describe the pipeline and dataset as a scalable simulation foundation without claiming untested effectiveness for real-world tasks, and we will expand the validation section with qualitative trajectory inspection results and any internal consistency checks performed during dataset construction.

Revision: yes

Referee: [Abstract] The assumption that CARLA-generated obstacle grids, gradient-based planners, and fixed-FOV projections produce statistics representative of physical UAV aerodynamics, wind effects, motion blur, and depth noise is load-bearing for all real-world claims, yet no quantitative comparison to real drone logs or hardware sensor characteristics is reported.

Authors: The manuscript does not assert that the CARLA outputs are statistically representative of physical UAV dynamics or sensor noise; all experiments remain inside the simulator. We will revise the abstract to remove any implication of direct real-world transfer and add an explicit limitations paragraph that states the absence of paired real-drone logs, the lack of wind or motion-blur modeling, and the fixed-FOV simplification. This will make the sim-to-real gap transparent to readers.

Revision: yes

Circularity Check

0 steps flagged

No circularity: pipeline and dataset construction are self-contained descriptions

The paper describes a 7-step CARLA-based construction pipeline that converts 3D worlds into obstacle grids, plans trajectories (two-stage or gradient-based), renders multi-modal sensor data with 6-DOF poses, and generates captions. No equations, fitted parameters, predictions, or first-principles derivations are presented that reduce to the inputs by construction. The central claim is that the released CosFly-Track dataset (250 trajectories, ~100k images) supports UAV research; this is a factual statement about the artifact, not a result forced by self-definition or self-citation. No load-bearing uniqueness theorems or ansatzes appear. The work is a systems contribution whose validity rests on external validation against real UAV data, not internal reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the domain assumption that the CARLA simulator provides a sufficiently faithful model of real environments and dynamics for the generated data to be useful.

axioms (1)

Domain assumption: the CARLA simulator accurately models real-world physics, sensor behavior, and diverse environments, including urban, rural, and natural settings.
The entire pipeline is implemented on CARLA and claims to support real-world-like tracking and navigation research.

pith-pipeline@v0.9.0 · 5849 in / 1383 out tokens · 36777 ms · 2026-05-20T09:01:05.570275+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We analyze two trajectory-planning paradigms... direct gradient-based formulation that optimizes multiple tracking constraints in a single objective... MuCO jointly adjusts every interior waypoint by finite-difference gradient descent

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: visibility-aware Track A* frontend on a 4D spatio-temporal voxel grid... five-ray visibility test

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CosFly-Track: A Large-Scale Multi-Modal Dataset for UAV Visual Tracking via Multi-Constraint Trajectory Optimization
cs.RO 2026-05 conditional novelty 7.0

CosFlyTrack supplies 2.4 million timesteps of aligned RGB, depth, segmentation, pose, target state, and bilingual instructions from expert UAV trajectories, with experiments showing 53-69 point gains in SR@1m after fi...

CosFly-Track: A Large-Scale Multi-Modal Dataset for UAV Visual Tracking via Multi-Constraint Trajectory Optimization
cs.RO 2026-05 conditional novelty 7.0

CosFlyTrack provides 12,000 expert UAV trajectories with aligned RGB, depth, segmentation, pose, target state, and bilingual instructions to train visual tracking agents, yielding 53-69 point gains in success rate aft...

Reference graph

Works this paper leans on

105 extracted references · 105 canonical work pages · cited by 1 Pith paper · 7 internal anchors

[1]

Vision-and-language navigation: Inter- preting visually-grounded navigation instructions in real environments

Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton van den Hengel. Vision-and-language navigation: Inter- preting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages 3674– 3683, 2018

work page 2018
[2]

D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A naturalistic open source movie for optical flow evaluation. In European Conference on Computer Vision (ECCV) , Part IV , LNCS 7577, pages 611–625, 2012

work page 2012
[3]

A class of local interpolating splines

Edwin Catmull and Raphael Rom. A class of local interpolating splines. In Robert E. Barn- hill and Richard F. Riesenfeld, editors, Computer Aided Geometric Design , pages 317–326. Academic Press, 1974

work page 1974
[4]

Track A*: Fast Visibility-Aware Trajectory Planning for Active Target Tracking

Hanxuan Chen, Kangli Wang, and Ji Pei. Track a*: Fast visibility-aware trajectory planning for active target tracking. arXiv preprint arXiv:2605.05338, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[5]

Vision-and-Language Navigation for UAVs: Progress, Challenges, and a Research Roadmap

Hanxuan Chen, Jie Zheng, Siqi Y ang, Tianle Zeng, Siwei Feng, Songsheng Cheng, Ruilong Ren, Hanzhong Guo, Shuai Yuan, Xiangyue Wang, and others. Vision-and-language navigation for UA Vs: Progress, challenges, and a research roadmap. arXiv preprint arXiv:2604.13654, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[6]

Video depth anything: Consistent depth estimation for super-long videos

Sili Chen, Lihe Y ang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Video depth anything: Consistent depth estimation for super-long videos. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , 2025

work page 2025
[7]

Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner

Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , pages 2432–2443, 2017

work page 2017
[8]

Daniel, A

K. Daniel, A. Nash, S. Koenig, and A. Felner. Theta*: Any-angle path planning on grids. Journal of Artificial Intelligence Research, 39:533–579, 2010

work page 2010
[9]

Carla: An open urban driving simulator

Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio López, and Vladlen Koltun. Carla: An open urban driving simulator. In Proceedings of the 1st Annual Conference on Robot Learn- ing (CoRL) , volume 78 of Proceedings of Machine Learning Research , pages 1–16. PMLR, 2017. 24

work page 2017
[10]

Douglas and Thomas K

David H. Douglas and Thomas K. Peucker. Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. Cartographica: The International Journal for Geographic Information and Geovisualization , 10(2):112–122, 1973

work page 1973
[11]

The unmanned aerial vehicle benchmark: Object detection and tracking

Dawei Du, Yuankai Qi, Hongyang Yu, Yifan Y ang, Kaiwen Duan, Guorong Li, Weigang Zhang, Qingming Huang, and Qi Tian. The unmanned aerial vehicle benchmark: Object detection and tracking. In Proceedings of the European Conference on Computer Vision (ECCV), pages 370– 386, 2018

work page 2018
[12]

Openfly: A comprehensive platform for aerial vision-language navigation

Yunpeng Gao, Chenhui Li, Zhongrui Y ou, Junli Liu, Zhen Li, Pengan Chen, Qizhi Chen, Zhong- han Tang, Liansheng Wang, Penghui Y ang, et al. Openfly: A comprehensive platform for aerial vision-language navigation. In The Fourteenth International Conference on Learning Represen- tations, 2026

work page 2026
[13]

Openfly: A comprehensive platform for aerial vision-language navigation, 2026

Yunpeng Gao, Chenhui Li, Zhongrui Y ou, Junli Liu, Zhen Li, Pengan Chen, Qizhi Chen, Zhong- han Tang, Liansheng Wang, Penghui Y ang, Yiwen Tang, Yuhang Tang, Shuai Liang, Songyi Zhu, Ziqin Xiong, Yifei Su, Xinyi Y e, Jianan Li, Y an Ding, Dong Wang, Xuelong Li, Zhigang Wang, and Bin Zhao. Openfly: A comprehensive platform for aerial vision-language naviga...

work page 2026
[14]

Are we ready for autonomous driving? The KITTI vision benchmark suite

Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3354–3361, 2012

work page 2012
[15]

Jing Gu, Manolis Savva, and Angel X. Gao. Vision-and-language navigation: A survey of tasks, methods, and future directions. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), 2022

[17]

Peter E. Hart, Nils J. Nilsson, and Bertram Raphael. A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics, 4(2):100–107, 1968

[18]

Distilling the Knowledge in a Neural Network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

[19]

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR), 2022

[20]

DepthCrafter: Generating consistent long depth sequences for open-world videos

Wenbo Hu, Xiangjun Gao, Xiaoyu Li, Sijie Zhao, Xiaodong Cun, Yong Zhang, Long Quan, and Ying Shan. DepthCrafter: Generating consistent long depth sequences for open-world videos. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

[21]

Sampling-based algorithms for optimal motion planning

Sertac Karaman and Emilio Frazzoli. Sampling-based algorithms for optimal motion planning. The International Journal of Robotics Research, 30(7):846–894, 2011

[22]

Lydia E. Kavraki, Petr Svestka, Jean-Claude Latombe, and Mark H. Overmars. Probabilistic roadmaps for path planning in high-dimensional configuration spaces. IEEE Transactions on Robotics and Automation, 12(4):566–580, 1996

[23]

Repurposing diffusion-based image generators for monocular depth estimation

Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, and Konrad Schindler. Repurposing diffusion-based image generators for monocular depth estimation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

[24]

Real-time obstacle avoidance for manipulators and mobile robots

Oussama Khatib. Real-time obstacle avoidance for manipulators and mobile robots. The International Journal of Robotics Research, 5(1):90–98, 1986

[25]

AI2-THOR: An Interactive 3D Environment for Visual AI

Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Matt Deitke, Kiana Ehsani, Roberto Gordon, Chenxia Zhu, Ali Farhadi, Arsalan Mousavian, Ram Vedantam, and Aniruddha Kembhavi. AI2-THOR: An interactive 3d environment for visual AI. arXiv preprint arXiv:1712.05474, 2017

[26]

Model Predictive Control: Classical, Robust and Stochastic

Basil Kouvaritakis and Mark Cannon. Model Predictive Control: Classical, Robust and Stochastic. Advanced Textbooks in Control and Signal Processing. Springer, 2016

[27]

A probabilistic B-spline motion planning algorithm for unmanned helicopters flying in dense 3D environments

Emre Koyuncu and Gokhan Inalhan. A probabilistic B-spline motion planning algorithm for unmanned helicopters flying in dense 3D environments. In 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 815–821. IEEE, 2008

[28]

Konstantinos J. Kyriakopoulos and George N. Saridis. Minimum jerk path generation. In Proceedings. 1988 IEEE International Conference on Robotics and Automation, pages 364–

[29]

IEEE Computer Society Press, 1988

[30]

Steven M. LaValle. Rapidly-exploring random trees: A new tool for path planning. Technical Report TR 98-11, Computer Science Department, Iowa State University, 1998

[31]

CityNav: A large-scale dataset for real-world aerial navigation

Jungdae Lee, Taiki Miyanishi, Shuhei Kurita, Koya Sakamoto, Daichi Azuma, Yutaka Matsuo, and Nakamasa Inoue. CityNav: A large-scale dataset for real-world aerial navigation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

[32]

BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the 40th International Conference on Machine Learning (ICML), volume 202 of Proceedings of Machine Learning Research. PMLR, 2023

[33]

MatrixCity: A large-scale city dataset for city-scale neural rendering and beyond

Yixuan Li, Lihan Jiang, Linning Xu, Yuanbo Xiangli, Zhenzhi Wang, Dahua Lin, and Bo Dai. MatrixCity: A large-scale city dataset for city-scale neural rendering and beyond. In International Conference on Computer Vision (ICCV), 2023

[34]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In Advances in Neural Information Processing Systems (NeurIPS), volume 36, 2023. Oral Presentation

[35]

Aerialvln: Vision-and-language navigation for uavs

Shubo Liu, Hongsheng Zhang, Yuankai Qi, Peng Wang, Yaning Zhang, and Qi Wu. Aerialvln: Vision-and-language navigation for uavs. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15384–15394, 2023

[36]

Aerialvln: Vision-and-language navigation for uavs, 2023

Shubo Liu, Hongsheng Zhang, Yuankai Qi, Peng Wang, Yaning Zhang, and Qi Wu. Aerialvln: Vision-and-language navigation for uavs, 2023

[37]

Tomás Lozano-Pérez and Michael A. Wesley. An algorithm for planning collision-free paths among polyhedral obstacles. Communications of the ACM, 22(10):560–570, 1979

[38]

Spring: A high-resolution high-detail dataset and benchmark for scene flow, optical flow and stereo

Lukas Mehl, Jenny Schmalfuss, Azin Jahedi, Yaroslava Nalivayko, and Andrés Bruhn. Spring: A high-resolution high-detail dataset and benchmark for scene flow, optical flow and stereo. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

[39]

A benchmark and simulator for UAV tracking

Matthias Mueller, Neil Smith, and Bernard Ghanem. A benchmark and simulator for UAV tracking. In European Conference on Computer Vision (ECCV), pages 445–461. Springer, 2016

[40]

BuckTales: A multi-UAV dataset for multi-object tracking and re-identification of wild antelopes

Hemal Naik, Junran Yang, Dipin Das, Margaret C Crofoot, Akanksha Rathore, and Vivek Hari Sridhar. BuckTales: A multi-UAV dataset for multi-object tracking and re-identification of wild antelopes. In Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track, 2024

[41]

GPT-4 Technical Report

OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

[42]

Heuristic search viewed as path finding in a graph

Ira Pohl. Heuristic search viewed as path finding in a graph. Artificial Intelligence, 1(3–4):193–204, 1970

[43]

Aria synthetic environments dataset

Project Aria. Aria synthetic environments dataset. https://www.projectaria.com/datasets/ase/, 2024

[44]

MUST: The first dataset and unified framework for multispectral UAV single object tracking

Haolin Qin, Tingfa Xu, Tianhao Li, Zhenxiang Chen, Tao Feng, and Jianan Li. MUST: The first dataset and unified framework for multispectral UAV single object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

[45]

Elastic bands: connecting path planning and control

Sean Quinlan and Oussama Khatib. Elastic bands: connecting path planning and control. In Proceedings IEEE International Conference on Robotics and Automation, pages 802–807. IEEE Computer Society Press, 1993

[46]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning (ICML), volume 139 of P...

[47]

Nathan Ratliff, Matt Zucker, J. Andrew Bagnell, and Siddhartha Srinivasa. CHOMP: Gradient optimization techniques for efficient motion planning. In 2009 IEEE International Conference on Robotics and Automation, pages 489–494. IEEE, 2009

[48]

Mike Roberts, Jason Ramapuram, Anurag Ranjan, Atulit Kumar, Miguel Angel Bautista, Nathan Paczan, Russ Webb, and Joshua M. Susskind. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In International Conference on Computer Vision (ICCV), pages 10912–10922, 2021

[49]

Habitat: A platform for embodied AI research

Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, and Dhruv Batra. Habitat: A platform for embodied AI research. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9339–9347, 2019

[50]

High-resolution stereo datasets with subpixel-accurate ground truth

Daniel Scharstein, Heiko Hirschmüller, York Kitajima, Greg Krathwohl, Nera Nešić, Xi Wang, and Porter Westling. High-resolution stereo datasets with subpixel-accurate ground truth. In German Conference on Pattern Recognition (GCPR), pages 31–42, 2014

[51]

Motion planning with sequential convex optimization and convex collision checking

John Schulman, Yan Duan, Jonathan Ho, Alex Lee, Ibrahim Awwal, Henry Bradlow, Jia Pan, Sachin Patil, Ken Goldberg, and Pieter Abbeel. Motion planning with sequential convex optimization and convex collision checking. The International Journal of Robotics Research, 33(9):1251–1270, 2014

[52]

Airsim: High-fidelity visual and physical simulation for autonomous vehicles

Shital Shah, Debadeepta Dey, Chris Lovett, and Ashish Kapoor. Airsim: High-fidelity visual and physical simulation for autonomous vehicles. In Field and Service Robotics, Springer Proceedings in Advanced Robotics, pages 621–635. Springer, 2018

[53]

Indoor segmentation and support inference from RGBD images

Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGBD images. In European Conference on Computer Vision (ECCV), pages 746–760, 2012

[54]

Griffin: Aerial-ground cooperative detection and tracking dataset and benchmark, 2025

Jiahao Wang, Xiangyu Cao, Jiaru Zhong, Yuner Zhang, Zeyu Han, Haibao Yu, Chuang Zhang, Lei He, Shaobing Xu, and Jianqiang Wang. Griffin: Aerial-ground cooperative detection and tracking dataset and benchmark, 2025

[55]

UAVScenes: A multi-modal dataset for UAVs

Sijie Wang, Siqi Li, Yawei Zhang, Shangshu Yu, Shenghai Yuan, Rui She, Quanjiang Guo, JinXuan Zheng, Ong Kang Howe, Leonrich Chandra, et al. UAVScenes: A multi-modal dataset for UAVs. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

[56]

TartanAir: A dataset to push the limits of visual SLAM

Wenshan Wang, Delong Zhu, Xiangwei Wang, Yaoyu Hu, Yuheng Qiu, Chen Wang, Yafei Hu, Ashish Kapoor, and Sebastian Scherer. TartanAir: A dataset to push the limits of visual SLAM. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4909–4916, 2020

[57]

Towards realistic UAV vision-language navigation: Platform, benchmark, and methodology

Xiangyu Wang, Donglin Yang, Ziqin Wang, Hohin Kwan, Jinyu Chen, Wenjun Wu, Hongsheng Li, Yue Liao, and Si Liu. Towards realistic UAV vision-language navigation: Platform, benchmark, and methodology. In Proceedings of the International Conference on Learning Representations (ICLR), 2025

[58]

Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail

Yan Wang, Wenjie Luo, Junjie Bai, Yulong Cao, Tong Che, Ke Chen, Yuxiao Chen, Jenna Diamond, Yifan Ding, Wenhao Ding, et al. Alpamayo-r1: Bridging reasoning and action prediction for generalizable autonomous driving in the long tail. arXiv preprint arXiv:2511.00088, 2025

[59]

Detection, tracking, and counting meets drones in crowds: A benchmark

Longyin Wen, Dawei Du, Pengfei Zhu, Xiao Bian, Haibin Ling, Qinghua Hu, and Tao Mei. Detection, tracking, and counting meets drones in crowds: A benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7780–7789, 2021

[60]

Fei Xia, Amir R. Zamir, Zhi-Yang He, Alexander Sax, Jitendra Malik, and Silvio Savarese. Gibson env: Real-world perception for embodied agents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9068–9079, 2018. Spotlight Oral

[61]

RGBD objects in the wild: Scaling real-world 3D object learning from RGB-D videos

Hongchi Xia, Yang Fu, Sifei Liu, and Xiaolong Wang. RGBD objects in the wild: Scaling real-world 3D object learning from RGB-D videos. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

[62]

Depth anything V2

Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything V2. In Advances in Neural Information Processing Systems (NeurIPS), 2024

[63]

ScanNet++: A high-fidelity dataset of 3D indoor scenes

Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. ScanNet++: A high-fidelity dataset of 3D indoor scenes. In International Conference on Computer Vision (ICCV), 2023

[64]

CARLA-Air: Fly Drones Inside a CARLA World -- A Unified Infrastructure for Air-Ground Embodied Intelligence

Tianle Zeng, Hanxuan Chen, Yanci Wen, and Hong Zhang. CARLA-Air: Fly drones inside a CARLA world–a unified infrastructure for air-ground embodied intelligence. arXiv preprint arXiv:2603.28032, 2026

[65]

Yoco: You only calibrate once for accurate extrinsic parameter in lidar-camera systems

Tianle Zeng, Xinrong Gu, Feifan Yan, Meixi He, and Dengke He. Yoco: You only calibrate once for accurate extrinsic parameter in lidar-camera systems. Measurement Science and Technology, 36(7):075009, 2025

[66]

EZREAL: Enhancing zero-shot outdoor robot navigation toward distant targets under varying visibility

Tianle Zeng, Jianwei Peng, Hanjing Ye, Guangcheng Chen, Senzi Luo, and Hong Zhang. EZREAL: Enhancing zero-shot outdoor robot navigation toward distant targets under varying visibility. arXiv preprint arXiv:2509.13720, 2025

[67]

WebUAV-3M: A benchmark for unveiling the power of million-scale deep UAV tracking

Chunhui Zhang et al. WebUAV-3M: A benchmark for unveiling the power of million-scale deep UAV tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(12):14538–14556, 2023

[68]

Uav-track vla: Embodied aerial tracking via vision-language-action models, 2026

Qiyao Zhang, Shuhua Zheng, Jianli Sun, Chengxiang Li, Xianke Wu, Zihan Song, Zhiyong Cui, Yisheng Lv, and Yonglin Tian. Uav-track vla: Embodied aerial tracking via vision-language-action models, 2026

[69]

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. BERTScore: Evaluating text generation with BERT. In Proceedings of the 8th International Conference on Learning Representations (ICLR), 2020

[70]

M3OT: A multi-drone multi-modality dataset for multi-object tracking

Xin Zhang et al. M3OT: A multi-drone multi-modality dataset for multi-object tracking. Scientific Data, 12, 2025

[71]

Anti-UAV challenge 2023: Methods and results

Jian Zhao et al. Anti-UAV challenge 2023: Methods and results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2023

[72]

Drone-person tracking in uniform appearance crowd: A new dataset

Jianwei Zhao et al. Drone-person tracking in uniform appearance crowd: A new dataset. Scientific Data, 10, 2023

[73]

Ciyou Zhu, Richard H. Byrd, Peihuang Lu, and Jorge Nocedal. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Transactions on Mathematical Software, 23(4):550–560, 1997

[74]

Detection and tracking meet drones challenge, 2021

Pengfei Zhu, Longyin Wen, Dawei Du, Xiao Bian, Heng Fan, Qinghua Hu, and Haibin Ling. Detection and tracking meet drones challenge, 2021

[75]

Detection and tracking meet drones challenge

Pengfei Zhu, Longyin Wen, Dawei Du, Xiao Bian, Haibin Ling, Qinghua Hu, Jie Nie, Chengjie Chen, Yafei Wang, Xin Zhang, Xinyao Lyu, Jianhua Liu, Guan Zhou, Yue Kang, Heng Liu, Jiayuan Cheng, and Tao Mei. Detection and tracking meet drones challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(11):7380–7399, 2021

[76]

Longkun Zou, Jiale Wang, Rongqin Liang, Hai Wu, Ke Chen, and Yaowei Wang. UAV-MM3D: A large-scale synthetic benchmark for 3d perception of unmanned aerial vehicles with multi-modal data. arXiv preprint arXiv:2511.22404, 2025

A Justification for 32-bit Float Depth Map Storage

This section provides a comprehensive quantitative and qualitative analysi...

[77]

TA* + Smooth release output: included; primary method

[78]

MuCO release output: included; primary one-shot global-planning baseline

[79]

RRT*, PRM, B-spline PRM, elastic band, minimum jerk: included as Python reference context

[80]

3D A*, Weighted A*, Theta*, Visibility-A*: included as search-family controls

[81]

Potential field, CHOMP-lite, L-BFGS-B TrajOpt: included as local-optimization controls


Showing first 80 references.

[1] [1]

Vision-and-language navigation: Inter- preting visually-grounded navigation instructions in real environments

Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton van den Hengel. Vision-and-language navigation: Inter- preting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages 3674– 3683, 2018

work page 2018

[2] [2]

D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A naturalistic open source movie for optical flow evaluation. In European Conference on Computer Vision (ECCV) , Part IV , LNCS 7577, pages 611–625, 2012

work page 2012

[3] [3]

A class of local interpolating splines

Edwin Catmull and Raphael Rom. A class of local interpolating splines. In Robert E. Barn- hill and Richard F. Riesenfeld, editors, Computer Aided Geometric Design , pages 317–326. Academic Press, 1974

work page 1974

[4] [4]

Track A*: Fast Visibility-Aware Trajectory Planning for Active Target Tracking

Hanxuan Chen, Kangli Wang, and Ji Pei. Track a*: Fast visibility-aware trajectory planning for active target tracking. arXiv preprint arXiv:2605.05338, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[5] [5]

Vision-and-Language Navigation for UAVs: Progress, Challenges, and a Research Roadmap

Hanxuan Chen, Jie Zheng, Siqi Y ang, Tianle Zeng, Siwei Feng, Songsheng Cheng, Ruilong Ren, Hanzhong Guo, Shuai Yuan, Xiangyue Wang, and others. Vision-and-language navigation for UA Vs: Progress, challenges, and a research roadmap. arXiv preprint arXiv:2604.13654, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[6] [6]

Video depth anything: Consistent depth estimation for super-long videos

Sili Chen, Lihe Y ang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Video depth anything: Consistent depth estimation for super-long videos. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , 2025

work page 2025

[7] [7]

Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner

Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , pages 2432–2443, 2017

work page 2017

[8] [8]

Daniel, A

K. Daniel, A. Nash, S. Koenig, and A. Felner. Theta*: Any-angle path planning on grids. Journal of Artificial Intelligence Research, 39:533–579, 2010

work page 2010

[9] [9]

Carla: An open urban driving simulator

Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio López, and Vladlen Koltun. Carla: An open urban driving simulator. In Proceedings of the 1st Annual Conference on Robot Learn- ing (CoRL) , volume 78 of Proceedings of Machine Learning Research , pages 1–16. PMLR, 2017. 24

work page 2017

[10] [10]

Douglas and Thomas K

David H. Douglas and Thomas K. Peucker. Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. Cartographica: The International Journal for Geographic Information and Geovisualization , 10(2):112–122, 1973

work page 1973

[11] [11]

The unmanned aerial vehicle benchmark: Object detection and tracking

Dawei Du, Yuankai Qi, Hongyang Yu, Yifan Y ang, Kaiwen Duan, Guorong Li, Weigang Zhang, Qingming Huang, and Qi Tian. The unmanned aerial vehicle benchmark: Object detection and tracking. In Proceedings of the European Conference on Computer Vision (ECCV), pages 370– 386, 2018

work page 2018

[12] [12]

Openfly: A comprehensive platform for aerial vision-language navigation

Yunpeng Gao, Chenhui Li, Zhongrui Y ou, Junli Liu, Zhen Li, Pengan Chen, Qizhi Chen, Zhong- han Tang, Liansheng Wang, Penghui Y ang, et al. Openfly: A comprehensive platform for aerial vision-language navigation. In The Fourteenth International Conference on Learning Represen- tations, 2026

work page 2026

[13] [13]

Openfly: A comprehensive platform for aerial vision-language navigation, 2026

Yunpeng Gao, Chenhui Li, Zhongrui Y ou, Junli Liu, Zhen Li, Pengan Chen, Qizhi Chen, Zhong- han Tang, Liansheng Wang, Penghui Y ang, Yiwen Tang, Yuhang Tang, Shuai Liang, Songyi Zhu, Ziqin Xiong, Yifei Su, Xinyi Y e, Jianan Li, Y an Ding, Dong Wang, Xuelong Li, Zhigang Wang, and Bin Zhao. Openfly: A comprehensive platform for aerial vision-language naviga...

work page 2026

[14] [14]

Are we ready for autonomous driving? The KITTI vision benchmark suite

Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3354–3361, 2012

work page 2012

[15] [15]

Jing Gu, Manolis Savva, and Angel X. Gao. Vision-and-language navigation: A survey of tasks, methods, and future directions. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), 2022

work page 2022

[16] [17]

Hart, Nils J

Peter E. Hart, Nils J. Nilsson, and Bertram Raphael. A formal basis for the heuristic deter- mination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics , 4(2):100–107, 1968

work page 1968

[17] [18]

Distilling the Knowledge in a Neural Network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[18] [19]

Hu, Y elong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

Edward J. Hu, Y elong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In In- ternational Conference on Learning Representations (ICLR) , 2022

work page 2022

[19] [20]

DepthCrafter: Generating consistent long depth sequences for open-world videos

Wenbo Hu, Xiangjun Gao, Xiaoyu Li, Sijie Zhao, Xiaodong Cun, Y ong Zhang, Long Quan, and Ying Shan. DepthCrafter: Generating consistent long depth sequences for open-world videos. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , 2025

work page 2025

[20] [21]

Sampling-based algorithms for optimal motion planning

Sertac Karaman and Emilio Frazzoli. Sampling-based algorithms for optimal motion planning. The International Journal of Robotics Research , 30(7):846–894, 2011

work page 2011

[21] [22]

Kavraki, Petr Svestka, Jean-Claude Latombe, and Mark H

Lydia E. Kavraki, Petr Svestka, Jean-Claude Latombe, and Mark H. Overmars. Probabilistic roadmaps for path planning in high-dimensional configuration spaces. IEEE Transactions on Robotics and Automation, 12(4):566–580, 1996

work page 1996

[22] [23]

Repurposing diffusion-based image generators for monocular depth estimation

Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, and Kon- rad Schindler. Repurposing diffusion-based image generators for monocular depth estimation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , 2024

work page 2024

[23] [24]

Real-time obstacle avoidance for manipulators and mobile robots

Oussama Khatib. Real-time obstacle avoidance for manipulators and mobile robots. The Inter- national Journal of Robotics Research , 5(1):90–98, 1986

work page 1986

[24] [25]

AI2-THOR: An Interactive 3D Environment for Visual AI

Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Matt Deitke, Kiana Ehsani, Roberto Gordon, Chenxia Zhu, Ali Farhadi, Arsalan Mousavian, Ram Vedantam, and Aniruddha Kembhavi. AI2-THOR: An interactive 3d environment for visual AI. arXiv preprint arXiv:1712.05474, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[25] [26]

Model Predictive Control: Classical, Robust and Stochastic

Basil Kouvaritakis and Mark Cannon. Model Predictive Control: Classical, Robust and Stochastic. Advanced Textbooks in Control and Signal Processing. Springer, 2016. 25

work page 2016

[26] [27]

A probabilistic B-spline motion planning algorithm for unmanned helicopters flying in dense 3D environments

Emre Koyuncu and Gokhan Inalhan. A probabilistic B-spline motion planning algorithm for unmanned helicopters flying in dense 3D environments. In 2008 IEEE/RSJ International Con- ference on Intelligent Robots and Systems, pages 815–821. IEEE, 2008

work page 2008

[27] [28]

Kyriakopoulos and George N

Konstantinos J. Kyriakopoulos and George N. Saridis. Minimum jerk path generation. In Proceedings. 1988 IEEE International Conference on Robotics and Automation , pages 364–

work page 1988

[28] [29]

IEEE Computer Society Press, 1988

work page 1988

[29] [30]

Steven M. LaValle. Rapidly-exploring random trees: A new tool for path planning. Technical Report TR 98-11, Computer Science Department, Iowa State University, 1998

work page 1998

[30] [31]

CityNav: A large-scale dataset for real-world aerial navigation

Jungdae Lee, Taiki Miyanishi, Shuhei Kurita, Koya Sakamoto, Daichi Azuma, Yutaka Mat- suo, and Nakamasa Inoue. CityNav: A large-scale dataset for real-world aerial navigation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , 2025

work page 2025

[31] [32]

BLIP-2: Bootstrapping language- image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language- image pre-training with frozen image encoders and large language models. In Proceedings of the 40th International Conference on Machine Learning (ICML) , volume 202 of Proceedings of Machine Learning Research. PMLR, 2023

work page 2023

[32] [33]

MatrixCity: A large-scale city dataset for city-scale neural rendering and beyond

Yixuan Li, Lihan Jiang, Linning Xu, Yuanbo Xiangli, Zhenzhi Wang, Dahua Lin, and Bo Dai. MatrixCity: A large-scale city dataset for city-scale neural rendering and beyond. In Interna- tional Conference on Computer Vision (ICCV) , 2023

work page 2023

[33] [34]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Y ong Jae Lee. Visual instruction tuning. In Advances in Neural Information Processing Systems (NeurIPS), volume 36, 2023. Oral Presen- tation

work page 2023

[34] [35]

Aerialvln: Vision-and-language navigation for uavs

Shubo Liu, Hongsheng Zhang, Yuankai Qi, Peng Wang, Y aning Zhang, and Qi Wu. Aerialvln: Vision-and-language navigation for uavs. In Proceedings of the IEEE/CVF International Con- ference on Computer Vision, pages 15384–15394, 2023

work page 2023

[35] [36]

Aerialvln: Vision-and-language navigation for uavs, 2023

Shubo Liu, Hongsheng Zhang, Yuankai Qi, Peng Wang, Y aning Zhang, and Qi Wu. Aerialvln: Vision-and-language navigation for uavs, 2023

work page 2023

[36] [37]

Tomás Lozano-Pérez and Michael A. Wesley. An algorithm for planning collision-free paths among polyhedral obstacles. Communications of the ACM, 22(10):560–570, 1979

work page 1979

[37] [38]

Spring: A high-resolution high-detail dataset and benchmark for scene flow, optical flow and stereo

Lukas Mehl, Jenny Schmalfuss, Azin Jahedi, Y aroslava Nalivayko, and Andrés Bruhn. Spring: A high-resolution high-detail dataset and benchmark for scene flow, optical flow and stereo. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , 2023

work page 2023

[38] [39]

A benchmark and simulator for UA V track- ing

Matthias Mueller, Neil Smith, and Bernard Ghanem. A benchmark and simulator for UA V track- ing. In European Conference on Computer Vision (ECCV) , pages 445–461. Springer, 2016

work page 2016

[39] [40]

BuckTales: A multi-UA V dataset for multi-object tracking and re-identification of wild antelopes

Hemal Naik, Junran Y ang, Dipin Das, Margaret C Crofoot, Akanksha Rathore, and Vivek Hari Sridhar. BuckTales: A multi-UA V dataset for multi-object tracking and re-identification of wild antelopes. In Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track, 2024

work page 2024

[40] [41]

GPT-4 Technical Report

OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[41] [42]

Heuristic search viewed as path finding in a graph

Ira Pohl. Heuristic search viewed as path finding in a graph. Artificial Intelligence, 1(3–4):193– 204, 1970

work page 1970

[42] [43]

Aria synthetic environments dataset

Project Aria. Aria synthetic environments dataset. https://www.projectaria.com/ datasets/ase/, 2024

work page 2024

[43] [44]

Haolin Qin, Tingfa Xu, Tianhao Li, Zhenxiang Chen, Tao Feng, and Jianan Li. MUST: The first dataset and unified framework for multispectral UAV single object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

[44] [45]

Sean Quinlan and Oussama Khatib. Elastic bands: connecting path planning and control. In Proceedings IEEE International Conference on Robotics and Automation, pages 802–807. IEEE Computer Society Press, 1993

[45] [46]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning (ICML), volume 139 of P...

[46] [47]

Nathan Ratliff, Matt Zucker, J. Andrew Bagnell, and Siddhartha Srinivasa. CHOMP: Gradient optimization techniques for efficient motion planning. In 2009 IEEE International Conference on Robotics and Automation, pages 489–494. IEEE, 2009

[47] [48]

Mike Roberts, Jason Ramapuram, Anurag Ranjan, Atulit Kumar, Miguel Angel Bautista, Nathan Paczan, Russ Webb, and Joshua M. Susskind. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In International Conference on Computer Vision (ICCV), pages 10912–10922, 2021

[48] [49]

Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, and Dhruv Batra. Habitat: A platform for embodied AI research. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9339–9347, 2019

[49] [50]

Daniel Scharstein, Heiko Hirschmüller, York Kitajima, Greg Krathwohl, Nera Nešić, Xi Wang, and Porter Westling. High-resolution stereo datasets with subpixel-accurate ground truth. In German Conference on Pattern Recognition (GCPR), pages 31–42, 2014

[50] [51]

John Schulman, Yan Duan, Jonathan Ho, Alex Lee, Ibrahim Awwal, Henry Bradlow, Jia Pan, Sachin Patil, Ken Goldberg, and Pieter Abbeel. Motion planning with sequential convex optimization and convex collision checking. The International Journal of Robotics Research, 33(9):1251–1270, 2014

[51] [52]

Shital Shah, Debadeepta Dey, Chris Lovett, and Ashish Kapoor. AirSim: High-fidelity visual and physical simulation for autonomous vehicles. In Field and Service Robotics, Springer Proceedings in Advanced Robotics, pages 621–635. Springer, 2018

[52] [53]

Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGBD images. In European Conference on Computer Vision (ECCV), pages 746–760, 2012

[53] [54]

Jiahao Wang, Xiangyu Cao, Jiaru Zhong, Yuner Zhang, Zeyu Han, Haibao Yu, Chuang Zhang, Lei He, Shaobing Xu, and Jianqiang Wang. Griffin: Aerial-ground cooperative detection and tracking dataset and benchmark, 2025

[54] [55]

Sijie Wang, Siqi Li, Yawei Zhang, Shangshu Yu, Shenghai Yuan, Rui She, Quanjiang Guo, JinXuan Zheng, Ong Kang Howe, Leonrich Chandra, et al. UAVScenes: A multi-modal dataset for UAVs. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

[55] [56]

Wenshan Wang, Delong Zhu, Xiangwei Wang, Yaoyu Hu, Yuheng Qiu, Chen Wang, Yafei Hu, Ashish Kapoor, and Sebastian Scherer. TartanAir: A dataset to push the limits of visual SLAM. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4909–4916, 2020

[56] [57]

Xiangyu Wang, Donglin Yang, Ziqin Wang, Hohin Kwan, Jinyu Chen, Wenjun Wu, Hongsheng Li, Yue Liao, and Si Liu. Towards realistic UAV vision-language navigation: Platform, benchmark, and methodology. In Proceedings of the International Conference on Learning Representations (ICLR), 2025

[57] [58]

Yan Wang, Wenjie Luo, Junjie Bai, Yulong Cao, Tong Che, Ke Chen, Yuxiao Chen, Jenna Diamond, Yifan Ding, Wenhao Ding, et al. Alpamayo-R1: Bridging reasoning and action prediction for generalizable autonomous driving in the long tail. arXiv preprint arXiv:2511.00088, 2025

[58] [59]

Longyin Wen, Dawei Du, Pengfei Zhu, Xiao Bian, Haibin Ling, Qinghua Hu, and Tao Mei. Detection, tracking, and counting meets drones in crowds: A benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7780–7789, 2021

[59] [60]

Fei Xia, Amir R. Zamir, Zhi-Yang He, Alexander Sax, Jitendra Malik, and Silvio Savarese. Gibson env: Real-world perception for embodied agents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9068–9079, 2018. Spotlight Oral

[60] [61]

Hongchi Xia, Yang Fu, Sifei Liu, and Xiaolong Wang. RGBD objects in the wild: Scaling real-world 3D object learning from RGB-D videos. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

[61] [62]

Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything V2. In Advances in Neural Information Processing Systems (NeurIPS), 2024

[62] [63]

Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. ScanNet++: A high-fidelity dataset of 3D indoor scenes. In International Conference on Computer Vision (ICCV), 2023

[63] [64]

Tianle Zeng, Hanxuan Chen, Yanci Wen, and Hong Zhang. CARLA-Air: Fly drones inside a CARLA world -- a unified infrastructure for air-ground embodied intelligence. arXiv preprint arXiv:2603.28032, 2026

[64] [65]

Tianle Zeng, Xinrong Gu, Feifan Yan, Meixi He, and Dengke He. Yoco: You only calibrate once for accurate extrinsic parameter in lidar-camera systems. Measurement Science and Technology, 36(7):075009, 2025

[65] [66]

Tianle Zeng, Jianwei Peng, Hanjing Ye, Guangcheng Chen, Senzi Luo, and Hong Zhang. EZREAL: Enhancing zero-shot outdoor robot navigation toward distant targets under varying visibility. arXiv preprint arXiv:2509.13720, 2025

[66] [67]

Chunhui Zhang et al. WebUAV-3M: A benchmark for unveiling the power of million-scale deep UAV tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(12):14538–14556, 2023

[67] [68]

Qiyao Zhang, Shuhua Zheng, Jianli Sun, Chengxiang Li, Xianke Wu, Zihan Song, Zhiyong Cui, Yisheng Lv, and Yonglin Tian. UAV-Track VLA: Embodied aerial tracking via vision-language-action models, 2026

[68] [69]

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. BERTScore: Evaluating text generation with BERT. In Proceedings of the 8th International Conference on Learning Representations (ICLR), 2020

[69] [70]

Xin Zhang et al. M3OT: A multi-drone multi-modality dataset for multi-object tracking. Scientific Data, 12, 2025

[70] [71]

Jian Zhao et al. Anti-UAV challenge 2023: Methods and results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2023

[71] [72]

Jianwei Zhao et al. Drone-person tracking in uniform appearance crowd: A new dataset. Scientific Data, 10, 2023

[72] [73]

Ciyou Zhu, Richard H. Byrd, Peihuang Lu, and Jorge Nocedal. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Transactions on Mathematical Software, 23(4):550–560, 1997

[73] [74]

Pengfei Zhu, Longyin Wen, Dawei Du, Xiao Bian, Heng Fan, Qinghua Hu, and Haibin Ling. Detection and tracking meet drones challenge, 2021

[74] [75]

Pengfei Zhu, Longyin Wen, Dawei Du, Xiao Bian, Haibin Ling, Qinghua Hu, Jie Nie, Chengjie Chen, Yafei Wang, Xin Zhang, Xinyao Lyu, Jianhua Liu, Guan Zhou, Yue Kang, Heng Liu, Jiayuan Cheng, and Tao Mei. Detection and tracking meet drones challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(11):7380–7399, 2021

[75] [76]

Longkun Zou, Jiale Wang, Rongqin Liang, Hai Wu, Ke Chen, and Yaowei Wang. UAV-MM3D: A large-scale synthetic benchmark for 3D perception of unmanned aerial vehicles with multi-modal data. arXiv preprint arXiv:2511.22404, 2025

A Justification for 32-bit Float Depth Map Storage

This section provides a comprehensive quantitative and qualitative analysi...

[76] [77]

TA* + Smooth release output: included; primary method

[77] [78]

MuCO release output: included; primary one-shot global-planning baseline

[78] [79]

RRT*, PRM, B-spline PRM, elastic band, minimum jerk: included as Python reference context

[79] [80]

3D A*, Weighted A*, Theta*, Visibility-A*: included as search-family controls

[80] [81]

Potential field, CHOMP-lite, L-BFGS-B TrajOpt: included as local-optimization controls
