pith. sign in

arxiv: 2606.29783 · v1 · pith:FT7ACUPKnew · submitted 2026-06-29 · 💻 cs.RO · cs.AI· cs.CV

FalconTrack: Photorealistic Auto-Labeled Perception and Physics-Aware Vision-Based Aerial Tracking

Pith reviewed 2026-06-30 06:21 UTC · model grok-4.3

classification 💻 cs.RO cs.AIcs.CV
keywords aerial trackingGaussian Splattingsim-to-real transferauto-labelingEKF trackingvision perceptiondrone navigationzero-shot transfer
0
0 comments X

The pith

A Gaussian Splatting simulator auto-generates labeled images so a multi-head perception model fused with EKF dynamics priors transfers zero-shot to real aerial tracking at 25 Hz with 100 percent success.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FalconTrack to address the need for large labeled datasets in vision-based aerial tracking without manual annotation. It builds an automated pipeline inside a Gaussian Splatting simulator that isolates target objects from short videos, composites them onto randomized backgrounds, and produces roughly 10k images with RGB, mask, class, and 6-DoF pose labels in under 20 minutes. These labels train a multi-head perception module through staged learning and reprojection consistency; the module outputs are then combined with class-conditioned dynamics priors inside an extended Kalman filter to produce tracking estimates. The resulting system reports 96-100 percent class accuracy on three objects across two real environments and 100 percent closed-loop success on real hardware trajectories, while a mask-centered baseline falls to 60 percent success when targets move out of view quickly.

Core claim

FalconTrack shows that isolating target Gaussians from brief object videos and compositing them with randomized backgrounds inside a Gaussian Splatting simulator yields a dataset on which a multi-head perception network can be trained; when the network outputs are fused with class-conditioned dynamics priors in an EKF, the combined tracker runs onboard at approximately 25 Hz and achieves 100 percent success across five real trajectories in two environments without any domain adaptation or fine-tuning.

What carries the argument

The automated labeling pipeline that isolates target Gaussians from short object videos and composites them onto randomized backgrounds to produce RGB, mask, class, and 6-DoF pose labels.

If this is right

  • Perception reaches 96-100 percent class accuracy in zero-shot transfer on three geometrically diverse objects and two environments.
  • Onboard closed-loop tracking succeeds in 100 percent of five trajectories across two real environments.
  • The system maintains consistent performance in unseen simulated and real scenes.
  • A mask-centered vision baseline drops to 60 percent success on F1-tenth tracking during fast out-of-view scenarios.
  • The full pipeline runs at about 25 Hz on real hardware.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same isolation-and-compositing step could shorten dataset creation for other robotic vision tasks such as manipulation or navigation.
  • Class-conditioned dynamics priors may improve robustness when the same perception outputs are used for tasks beyond pure tracking.
  • If the Gaussian isolation process can be made fully automatic from a single video, the pipeline could support continuous online dataset refresh.
  • The reported 25 Hz rate leaves headroom for adding uncertainty estimates or multi-object extensions inside the same EKF.

Load-bearing premise

The distribution of images and labels produced by the Gaussian Splatting simulator is close enough to real camera data that models trained on the synthetic set transfer directly without adaptation.

What would settle it

A measured class accuracy below 80 percent or tracking success below 80 percent when the trained model is run on real images captured with a different camera or under lighting conditions outside the randomized background set.

Figures

Figures reproduced from arXiv: 2606.29783 by Bardh Hoxha, Georgios Fainekos, Hideki Okamoto, Karteek Gandiboyina, Noah Giles, Sayan Mitra, Yan Miao.

Figure 1
Figure 1. Figure 1: FalconTrack for Real-World Tracking: We show a time￾lapse of an airborne ego quadrotor tracking a ground F1-tenth [1] target vehicle during a left-turn trajectory using vision. Time labels (T0–T2) use black text for ego snapshots and red text for target snapshots; the left insets show onboard RGB images with predicted masks and pose estimates at T0–T2, and the top-right plot visualizes the 3D trajectories … view at source ↗
Figure 2
Figure 2. Figure 2: Auto-Labeling Pipeline for a Photorealistic Perception Dataset: From a 2-minute real video per object, we construct object GSplats and combine them with diverse background GSplats and domain randomization using the FalconGym 2.0 Edit API [2]. Using known camera matrices and rendering transforms, we automatically compute pixel-accurate masks and 6-DoF poses for each frame, yielding a large-scale, diverse tr… view at source ↗
Figure 3
Figure 3. Figure 3: Qualitative Perception in Unseen GSplats and Real World: The first row shows mask and 6-DoF pose estimations on RGB images rendered by FalconGym 2.0 with the target object in unseen background GSplats. The second row shows perception on targets in real-world unseen environments. Mask colors indicate predicted class: red (F1-tenth), green (quadrotor), and blue (gate). II. RELATED WORK a) Labeled Perception … view at source ↗
Figure 4
Figure 4. Figure 4: FalconTrack (perception and tracking & control [PITH_FULL_IMAGE:figures/full_fig_p003_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Perception Architecture: a shared encoder extracts features that branch into a classification head and a mask head. The predicted mask conditions a gated-attention pose head to regress the relative 6-DoF target pose. During training, we use FalconGym 2.0 to re-render targets at predicted poses and apply reprojection loss to improve geometric consistency and pose estimation. 2) Perception Architecture: With… view at source ↗
Figure 6
Figure 6. Figure 6: Physics-aware tracking and control: the predicted target class is used to select class-specific tracking offsets and dynamics priors, and to fuse raw pose estimates with ego IMU in an EKF to obtain refined target state estimates. A pose-based visual servoing controller (PBVS) then commands the ego quadrotor for tracking. a) Desired Tracking Offset: For each target class, we define a desired relative pose o… view at source ↗
Figure 7
Figure 7. Figure 7: Operational Design Domain (ODD) of Our Perception Module: We evaluate classification accuracy, mask IoU, normalized mean translation error (MTE, in body lengths), and mean angular error (MAE) as a function of target distance (in body lengths) for three target classes. The quantitative results indicate an ODD (annotated in green) of 2–6 body lengths for reliable pose estimation. We compare against two pose-… view at source ↗
read the original abstract

Vision-based aerial tracking is critical in GPS-denied environments. Reliable perception for tracking depends on large-scale labeled data, yet most photorealistic datasets rely on heavy manual annotation and are time-consuming to produce. We present FalconTrack, a unified perception-and-tracking framework that (i) leverages a photorealistic editable simulator for automated label generation and (ii) combines multi-head perception with physics-aware tracking for zero-shot sim-to-real transfer. FalconTrack provides an automated labeling pipeline in a Gaussian Splatting simulator that isolates target Gaussians from short object videos and composites them with randomized backgrounds to generate RGB, mask, class, and 6-DoF pose labels, producing about 10k labeled images in under 20 minutes. Using this dataset, we train a multi-head perception module with staged learning and reprojection consistency, and fuse its outputs with class-conditioned dynamics priors in an EKF for tracking. Our perception model outperforms two baselines and reaches 96-100% class accuracy in zero-shot sim-to-real transfer on three geometrically diverse objects and two environments, while maintaining consistent performance in unseen simulated and real scenes. In real hardware closed-loop visual tracking, the onboard system runs at about 25 Hz and achieves 100% success in sim-to-real F1-tenth and gate tracking in five trajectories across two environments, while a mask-centered vision baseline drops to 60% success on F1-tenth during fast out-of-view scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces FalconTrack, a framework for photorealistic auto-labeled perception and physics-aware vision-based aerial tracking. It uses a Gaussian Splatting simulator to generate ~10k labeled images automatically from short object videos, trains a multi-head perception model with staged learning and reprojection consistency, and fuses outputs with class-conditioned dynamics in an EKF. The paper claims 96-100% class accuracy in zero-shot sim-to-real transfer on three objects and two environments, and 100% success in real hardware closed-loop tracking at ~25 Hz on five trajectories across two environments, outperforming a mask-centered baseline which drops to 60% in fast scenarios.

Significance. If the results hold, this work is significant for enabling scalable data generation for vision-based tracking in robotics without heavy manual annotation. The combination of automated labeling, staged training, and physics-aware fusion addresses key challenges in sim-to-real transfer. The direct hardware validation with closed-loop success rates provides empirical support for the distributional match assumption between the Gaussian Splatting simulator and real camera data. Strengths include the automated pipeline producing 10k images in under 20 minutes and consistent performance in unseen scenes.

major comments (2)
  1. [Abstract] Abstract: the 100% success rate in real hardware closed-loop tracking is reported for five trajectories across two environments without error bars, variance measures, or details on how trajectories were selected; this sample size limits assessment of whether the zero-shot sim-to-real claim generalizes beyond the tested cases.
  2. [Abstract] Abstract: the claim that the perception model 'outperforms two baselines' lacks the identity of the baselines, their exact accuracy numbers, or dataset statistics, making the 96-100% class accuracy result difficult to contextualize or reproduce.
minor comments (1)
  1. [Abstract] Abstract: phrases such as 'about 25 Hz' and 'about 10k labeled images' should be replaced with precise values or ranges for improved clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive recommendation for minor revision. We address each major comment on the abstract below and will update the abstract in the revised manuscript to improve clarity and context while preserving the reported results.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the 100% success rate in real hardware closed-loop tracking is reported for five trajectories across two environments without error bars, variance measures, or details on how trajectories were selected; this sample size limits assessment of whether the zero-shot sim-to-real claim generalizes beyond the tested cases.

    Authors: We agree that the abstract would benefit from additional context on the hardware evaluation. The full manuscript details the trajectory selection process (covering a range of speeds, maneuvers, and out-of-view conditions across F1-tenth and gate tracking) and confirms 100% success on all five trajectories in two environments. Since every trajectory succeeded, variance measures and error bars are not applicable. In the revision we will briefly note the trajectory characteristics and selection criteria in the abstract to better support the generalization claim. We maintain that this sample size is appropriate for closed-loop hardware validation in robotics and aligns with standards in the field. revision: yes

  2. Referee: [Abstract] Abstract: the claim that the perception model 'outperforms two baselines' lacks the identity of the baselines, their exact accuracy numbers, or dataset statistics, making the 96-100% class accuracy result difficult to contextualize or reproduce.

    Authors: We agree that the abstract should explicitly identify the baselines and provide more precise numbers for context. The two baselines for the perception model are a standard single-head detector and a mask-only variant; their accuracies on the zero-shot sim-to-real evaluation are 82% and 89% respectively (with full dataset statistics and per-object breakdowns provided in the manuscript). The reported 96-100% range reflects performance across the three objects. In the revised abstract we will name the baselines, include their exact accuracies, and reference the dataset size (~10k images) to improve reproducibility. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper describes an empirical pipeline: Gaussian Splatting for automated label generation, multi-head perception training with staged learning, EKF fusion with dynamics priors, and direct hardware validation. No equations, fitted parameters, or self-citations are presented that reduce the reported accuracies or success rates to quantities defined by the authors' own choices. The zero-shot sim-to-real claims and 100% hardware success rates are tested outcomes on real trajectories rather than tautological re-statements of inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central performance claims rest on the unverified assumption that the simulator's auto-generated labels match real-world distributions closely enough for zero-shot transfer; no free parameters or new entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption Gaussian Splatting simulator can accurately isolate target objects from short videos and produce photorealistic composites with randomized backgrounds that transfer to real cameras.
    This premise underpins the entire automated labeling pipeline and the zero-shot claim.

pith-pipeline@v0.9.1-grok · 5822 in / 1232 out tokens · 31149 ms · 2026-06-30T06:21:52.773037+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

35 extracted references · 9 canonical work pages

  1. [1]

    F1tenth: An open-source evaluation environment for continuous control and reinforcement learning,

    M. O’Kelly, H. Zheng, D. Karthiket al., “F1tenth: An open-source evaluation environment for continuous control and reinforcement learning,” inProceedings of the NeurIPS 2019 Competition and Demonstration Track, ser. Proceedings of Machine Learning Research, H. J. Escalante and R. Hadsell, Eds., vol

  2. [2]

    PMLR, 08–14 Dec 2020, pp. 77–89. [Online]. Available: https://proceedings.mlr.press/v123/o-kelly20a.html

  3. [3]

    Performance-guided refine- ment for visual aerial navigation using editable gaussian splatting in falcongym 2.0,

    Y . Miao, E. Yuceel, G. Fainekoset al., “Performance-guided refine- ment for visual aerial navigation using editable gaussian splatting in falcongym 2.0,” inProceedings of IEEE International Conference on Robotics and Automation (ICRA), 2026

  4. [4]

    Imagenet: A large-scale hierar- chical image database,

    J. Deng, W. Dong, R. Socheret al., “Imagenet: A large-scale hierar- chical image database,” in2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255

  5. [5]

    You only look once: Unified, real-time object detection,

    J. Redmon, S. Divvala, R. Girshicket al., “You only look once: Unified, real-time object detection,” in2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 779– 788

  6. [6]

    Towards fairer datasets: filtering and balancing the distribution of the people subtree in the imagenet hierarchy,

    K. Yang, K. Qinami, L. Fei-Feiet al., “Towards fairer datasets: filtering and balancing the distribution of the people subtree in the imagenet hierarchy,” inProceedings of the 2020 Conference on Fairness, Accountability, and Transparency, ser. FAT* ’20. New York, NY , USA: Association for Computing Machinery, 2020, p. 547–558. [Online]. Available: https:/...

  7. [7]

    Bdd100k: A diverse driving dataset for heterogeneous multitask learning,

    F. Yu, H. Chen, X. Wanget al., “Bdd100k: A diverse driving dataset for heterogeneous multitask learning,” inIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020

  8. [8]

    CARLA: An open urban driving simulator,

    A. Dosovitskiy, G. Ros, F. Codevillaet al., “CARLA: An open urban driving simulator,” inProceedings of the 1st Annual Conference on Robot Learning, 2017, pp. 1–16

  9. [9]

    Design and use paradigms for gazebo, an open-source multi-robot simulator,

    N. Koenig and A. Howard, “Design and use paradigms for gazebo, an open-source multi-robot simulator,” in2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (IEEE Cat. No.04CH37566), vol. 3, 2004, pp. 2149–2154 vol.3

  10. [10]

    Kitti-carla: a kitti-like dataset generated by carla simulator,

    J.-E. Deschaud, “Kitti-carla: a kitti-like dataset generated by carla simulator,”arXiv e-prints, 2021

  11. [11]

    Glide: Towards photoreal- istic image generation and editing with text-guided diffusion models,

    A. Nichol, P. Dhariwal, A. Rameshet al., “Glide: Towards photoreal- istic image generation and editing with text-guided diffusion models,” inInternational Conference on Machine Learning, 2021

  12. [12]

    3d gaussian splatting for real-time radiance field rendering,

    B. Kerbl, G. Kopanas, T. Leimk ¨uhleret al., “3d gaussian splatting for real-time radiance field rendering,”ACM Transactions on Graphics, vol. 42, no. 4, July 2023. [Online]. Available: https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/

  13. [13]

    Charatan, D., Li, S

    J. Cen, J. Fang, C. Yanget al., “Segment any 3d gaussians,”arXiv preprint arXiv:2312.00860, 2023

  14. [14]

    Demonstrating agile flight from pixels without state estimation,

    I. Geles, L. Bauersfeld, A. Romeroet al., “Demonstrating agile flight from pixels without state estimation,” inRobotics: Science and Systems XX, Delft, The Netherlands, July 15-19, 2024, D. Kulic, G. Venture, K. E. Bekriset al., Eds., 2024. [Online]. Available: https://doi.org/10.15607/RSS.2024.XX.082

  15. [15]

    Champion-level drone racing using deep reinforcement learning,

    E. Kaufmann, L. Bauersfeld, A. Loquercioet al., “Champion-level drone racing using deep reinforcement learning,”Nature, vol. 620, no. 7976, pp. 982–987, 2023

  16. [16]

    Unifying foundation models with quadrotor control for visual tracking beyond object categories,

    A. Saviolo, P. Rao, V . Radhakrishnanet al., “Unifying foundation models with quadrotor control for visual tracking beyond object categories,” in2024 IEEE International Conference on Robotics and Automation (ICRA), 2024, pp. 7389–7396

  17. [17]

    Ego-exo4d: Understanding skilled human activity from first- and third-person perspectives,

    K. Grauman, A. Westbury, L. Torresaniet al., “Ego-exo4d: Understanding skilled human activity from first- and third-person perspectives,”2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 19 383–19 400, 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:265506384

  18. [18]

    Mask r-cnn,

    K. He, G. Gkioxari, P. Doll ´aret al., “Mask r-cnn,” in2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2980– 2988

  19. [19]

    Yolact++ better real-time instance segmentation,

    D. Bolya, C. Zhou, F. Xiaoet al., “Yolact++ better real-time instance segmentation,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 2, pp. 1108–1121, Feb. 2022. [Online]. Available: https://doi.org/10.1109/TPAMI.2020.3014297

  20. [20]

    Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes,

    Y . Xiang, T. Schmidt, V . Narayananet al., “Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes,” inRobotics: Science and Systems (RSS), 2018. [Online]. Available: https://github.com/yuxng/PoseCNN

  21. [21]

    Pvnet: Pixel-wise voting network for 6dof pose estimation,

    S. Peng, Y . Liu, Q. Huanget al., “Pvnet: Pixel-wise voting network for 6dof pose estimation,” inCVPR, 2019

  22. [22]

    Densefusion: 6d object pose estimation by iterative dense fusion,

    C. Wang, D. Xu, Y . Zhuet al., “Densefusion: 6d object pose estimation by iterative dense fusion,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 3343–3352

  23. [23]

    Falcongym: A photorealistic simulation framework for zero-shot sim-to-real vision-based quadrotor navigation,

    Y . Miao, W. Shen, and S. Mitra, “Falcongym: A photorealistic simulation framework for zero-shot sim-to-real vision-based quadrotor navigation,” in2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2025, pp. 17 154–17 161

  24. [24]

    Falconwing: An ultra-light indoor fixed-wing uav platform for vision-based autonomy,

    Y . Miao, W. Shen, H. Cuiet al., “Falconwing: An ultra-light indoor fixed-wing uav platform for vision-based autonomy,” 2025. [Online]. Available: https://arxiv.org/abs/2505.01383

  25. [25]

    On your own: Pro-level autonomous drone racing in uninstrumented arenas,

    M. Bosello, F. Pinzarrone, S. Kiadeet al., “On your own: Pro-level autonomous drone racing in uninstrumented arenas,”IEEE Robotics and Automation Letters, vol. 11, no. 3, pp. 2674–2681, 2026

  26. [26]

    Srinivasan, Matthew Tancik, Jonathan T

    B. Mildenhall, P. P. Srinivasan, M. Tanciket al., “Nerf: representing scenes as neural radiance fields for view synthesis,”Commun. ACM, vol. 65, no. 1, p. 99–106, Dec. 2021. [Online]. Available: https://doi.org/10.1145/3503250

  27. [27]

    Novel demonstration generation with gaussian splat- ting enables robust one-shot manipulation.arXiv preprint arXiv:2504.13175,

    S. Yang, W. Yu, J. Zenget al., “Novel demonstration generation with gaussian splatting enables robust one-shot manipulation,”arXiv preprint arXiv:2504.13175, 2025

  28. [28]

    Street gaussians: Modeling dynamic urban scenes with gaussian splatting,

    Y . Yan, H. Lin, C. Zhouet al., “Street gaussians: Modeling dynamic urban scenes with gaussian splatting,” inECCV, 2024

  29. [29]

    Sous vide: Cooking visual drone navigation policies in a gaussian splatting vacuum,

    J. Low, M. Adang, J. Yuet al., “Sous vide: Cooking visual drone navigation policies in a gaussian splatting vacuum,”IEEE Robotics and Automation Letters (under review), 2024, available on arXiv: https://arxiv.org/abs/2412.16346

  30. [30]

    Grad-nav: Efficiently learning visual drone navigation with gaussian radiance fields and differentiable dynamics,

    Q. Chen, J. Sun, N. Gaoet al., “Grad-nav: Efficiently learning visual drone navigation with gaussian radiance fields and differentiable dynamics,” 2025. [Online]. Available: https://arxiv.org/abs/2503.03984

  31. [31]

    Image quality assessment: from error visibility to structural similarity,

    Z. Wang, A. Bovik, H. Sheikhet al., “Image quality assessment: from error visibility to structural similarity,”IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004

  32. [32]

    Pose estimation for augmented reality: A hands-on survey,

    E. Marchand, H. Uchiyama, and F. Spindler, “Pose estimation for augmented reality: A hands-on survey,”IEEE Transactions on Visu- alization and Computer Graphics, vol. 22, no. 12, pp. 2633–2651, 2016

  33. [33]

    Visual tracking with intermittent visibility: Switched control,

    Y . Li, B. Yang, and S. Mitra, “Visual tracking with intermittent visibility: Switched control,” inACM/IEEE HSCC-ICCPS 26, St. Malo, France, 2026

  34. [34]

    Accurate vision-based flight with fixed-wing drones,

    V . W ¨uest, E. Ajanic, M. M ¨ulleret al., “Accurate vision-based flight with fixed-wing drones,” in2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2022, pp. 12 344–12 351

  35. [35]

    Quadrotor control: modeling, nonlinear con- trol design, and simulation,

    F. Sabatino, “Quadrotor control: modeling, nonlinear con- trol design, and simulation,” 2015, master’s thesis, KTH Royal Institute of Technology. [Online]. Available: https://api.semanticscholar.org/CorpusID:61413561