FalconApp: Rapid iPhone Deployment of End-to-End Perception via Automatically Labeled Synthetic Data
Pith reviewed 2026-05-10 01:43 UTC · model grok-4.3
The pith
FalconApp converts a short iPhone video of a rigid object into a deployable mask-detection and 6-DoF pose model using automatically labeled synthetic data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FalconApp produces usable perception models with about 20 minutes of synthetic-data generation and training per object on average, around 30 ms end-to-end on-device latency on iPhone, and better overall pose accuracy than a PnP baseline on 4 / 5 objects in both simulation and real-world evaluation.
What carries the argument
The photorealistic auto-labeling workflow that reconstructs an editable GSplat asset from the user video, composites it with diverse backgrounds, and renders synthetic images carrying ground-truth masks and poses for training.
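The compositing step in this workflow can be illustrated with a minimal sketch. It assumes the renderer returns the object as an RGBA image with straight (non-premultiplied) alpha in [0, 1]. The function name `composite` and the constant `ALPHA_THRESHOLD` are hypothetical and not taken from the paper. Because pasting the object over a new background does not move it relative to the camera, the pose label from the renderer carries over unchanged; only the mask and bounding box are derived here.

```python
# Hypothetical threshold: a pixel counts as "object" above this opacity.
ALPHA_THRESHOLD = 0.5


def composite(fg_rgba, bg_rgb):
    """Alpha-blend an RGBA foreground render over an RGB background.

    fg_rgba: rows of (r, g, b, a) tuples, all channels in [0, 1].
    bg_rgb:  rows of (r, g, b) tuples with the same height and width.
    Returns (image, mask, bbox): blended RGB rows, a binary mask, and the
    tight box (x_min, y_min, x_max, y_max), or None if nothing is visible.
    """
    image, mask = [], []
    xs, ys = [], []
    for y, (fg_row, bg_row) in enumerate(zip(fg_rgba, bg_rgb)):
        img_row, mask_row = [], []
        for x, ((r, g, b, a), bg) in enumerate(zip(fg_row, bg_row)):
            # Standard "over" operator: out = a * fg + (1 - a) * bg.
            img_row.append(
                tuple(a * f + (1.0 - a) * k for f, k in zip((r, g, b), bg))
            )
            covered = 1 if a > ALPHA_THRESHOLD else 0
            mask_row.append(covered)
            if covered:
                xs.append(x)
                ys.append(y)
        image.append(img_row)
        mask.append(mask_row)
    bbox = (min(xs), min(ys), max(xs), max(ys)) if xs else None
    return image, mask, bbox
```

In practice this would be done with array operations on full-resolution images; nested lists are used here only to keep the sketch dependency-free.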
Load-bearing premise
A short handheld capture of a rigid object is sufficient to produce an accurate editable 3D asset whose rendered images train models that generalize to real-world captures.
What would settle it
A real-world evaluation on the five tested objects in which the trained models are less accurate in pose than the PnP baseline on more than one object, or in which end-to-end latency on iPhone is well above the reported ~30 ms.
Original abstract
Reliable perception for robotics depends on large-scale labeled data, yet real-world datasets rely on heavy manual annotation and are time-consuming to produce. We present FalconApp, an iPhone app with an end-to-end frontend-backend pipeline that turns a short handheld capture of a rigid object into a perception module for mask detection and 6-DoF pose estimation. Our core contribution is a rapid mobile deployment pipeline paired with a photorealistic auto-labeling workflow: from a user-captured video of an object, FalconApp reconstructs an editable GSplat asset, composites it with diverse photorealistic backgrounds, renders synthetic images with ground-truth masks and poses, trains the perception module, and deploys it back to the iPhone frontend. Experiments across five rigid objects with diverse geometry and appearance show that FalconApp produces usable perception models with about 20 minutes of synthetic-data generation and training per object on average, around 30 ms end-to-end on-device latency on iPhone, and better overall pose accuracy than a PnP baseline on 4 / 5 objects in both simulation and real-world evaluation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents FalconApp, an iPhone app implementing an end-to-end pipeline that converts a short handheld video capture of a rigid object into a deployable perception module for mask detection and 6-DoF pose estimation. The pipeline reconstructs an editable GSplat asset from the capture, composites it with photorealistic backgrounds to auto-generate synthetic training images with ground-truth labels, trains the model, and deploys it on-device. Experiments on five objects with varied geometry and appearance report an average of 20 minutes for synthetic data generation and training per object, approximately 30 ms end-to-end on-device latency, and superior pose accuracy relative to a PnP baseline on 4 out of 5 objects in both simulation and real-world evaluations.
Significance. If the results hold under more detailed scrutiny, the work offers a practical advance in mobile robotics perception by automating the creation of custom, labeled datasets and models from minimal real-world input. This could substantially reduce the time and expertise needed for deploying object-specific perception on consumer devices, with relevance to AR, robotics, and rapid prototyping scenarios where manual annotation is a bottleneck.
Major comments (2)
- Experiments section: The central claim of better overall pose accuracy than the PnP baseline on 4/5 objects (both sim and real) is load-bearing for the paper's contribution, yet the manuscript provides no specific metrics (e.g., ADD-S, rotation/translation error distributions, or success thresholds), test set sizes, or data split protocols. This omission prevents verification of the reported improvements and generalization from synthetic to real data.
- Method / pipeline description: The assumption that a short handheld capture suffices to produce an accurate, editable GSplat asset whose renderings train models that generalize to real captures is central to the workflow, but the manuscript lacks quantitative reconstruction quality metrics (e.g., PSNR, coverage analysis) or ablation on capture length/variation to support this.
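For reference, the pose metrics named in the first major comment can be sketched as follows. This is an illustrative implementation of the standard definitions, not the authors' evaluation code. It assumes rotations are 3x3 matrices given as nested lists, translations and model points are in meters, and the default thresholds follow the 10°/5 cm example that appears in the rebuttal. All function names are hypothetical.

```python
import math


def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle between two 3x3 rotation matrices, in degrees."""
    # trace(R_pred^T R_gt) equals the sum of elementwise products.
    trace = sum(R_pred[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] to guard against floating-point drift.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def translation_error(t_pred, t_gt):
    """Euclidean distance between predicted and ground-truth translations."""
    return math.dist(t_pred, t_gt)


def transform(R, t, p):
    """Apply the rigid transform (R, t) to a 3D point p."""
    return tuple(sum(R[i][k] * p[k] for k in range(3)) + t[i] for i in range(3))


def add_s(R_pred, t_pred, R_gt, t_gt, model_points):
    """ADD-S: mean distance from each ground-truth-posed model point to its
    closest predicted-posed model point (tolerant of object symmetries)."""
    pred = [transform(R_pred, t_pred, p) for p in model_points]
    gt = [transform(R_gt, t_gt, p) for p in model_points]
    # Brute-force nearest neighbor; fine for a sketch, quadratic in point count.
    return sum(min(math.dist(g, q) for q in pred) for g in gt) / len(gt)


def success_rate(errors, max_rot_deg=10.0, max_trans=0.05):
    """Fraction of (rotation_deg, translation_m) pairs within both thresholds."""
    hits = sum(1 for rot, trans in errors
               if rot <= max_rot_deg and trans <= max_trans)
    return hits / len(errors)
```

Reporting the per-object distribution of these errors, rather than a single aggregate, is what would let a reader check the 4/5 claim directly.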
Minor comments (3)
- Abstract: Include at least one concrete metric (e.g., mean pose error or success rate) alongside the qualitative claim of 'better overall pose accuracy' to strengthen the summary of results.
- Figures: Pipeline overview figures would benefit from clearer annotations distinguishing real capture, GSplat reconstruction, synthetic rendering, and on-device inference stages.
- Related work: Ensure citations cover recent mobile deployment and synthetic data methods for 6-DoF pose estimation to contextualize the contribution fully.
Simulated Authors' Rebuttal
We thank the referee for their constructive and detailed review of our manuscript. We appreciate the emphasis on the need for greater quantitative detail to support the central claims. Below we respond point-by-point to the major comments and indicate the revisions we will make.
Point-by-point responses
- Referee: Experiments section: The central claim of better overall pose accuracy than the PnP baseline on 4/5 objects (both sim and real) is load-bearing for the paper's contribution, yet the manuscript provides no specific metrics (e.g., ADD-S, rotation/translation error distributions, or success thresholds), test set sizes, or data split protocols. This omission prevents verification of the reported improvements and generalization from synthetic to real data.
  Authors: We agree that the absence of these specific metrics limits the ability to verify the reported improvements. In the revised manuscript we will expand the Experiments section to report ADD-S scores, mean and distributional statistics for rotation and translation errors, success rates at standard thresholds (e.g., 10°/5 cm), the exact sizes of the simulation and real-world test sets, and the data-splitting protocols employed. These additions will directly substantiate the claim of superior pose accuracy on 4/5 objects and clarify the degree of sim-to-real generalization.
  Revision: yes
- Referee: Method / pipeline description: The assumption that a short handheld capture suffices to produce an accurate, editable GSplat asset whose renderings train models that generalize to real captures is central to the workflow, but the manuscript lacks quantitative reconstruction quality metrics (e.g., PSNR, coverage analysis) or ablation on capture length/variation to support this.
  Authors: We acknowledge that quantitative evidence for reconstruction quality would strengthen the pipeline description. We will add PSNR values computed on held-out views for each of the five GSplat reconstructions and include a basic coverage analysis (e.g., percentage of surface points observed). Regarding ablation on capture length and variation, we will describe the standardized capture protocol used across objects and, where possible with existing recordings, provide supplementary analysis of how modest changes in capture duration affect downstream perception accuracy. Any ablation that would require new data collection will be noted as a limitation and direction for future work.
  Revision: partial
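The held-out-view PSNR mentioned in the second response follows the usual definition, sketched below under stated assumptions: each image is a flat sequence of pixel values in [0, max_value], and the rendered view and the real frame have already been aligned to the same camera pose and resolution. The function name `psnr` is hypothetical, and this is not the authors' evaluation code.

```python
import math


def psnr(rendered, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized images,
    each given as a flat sequence of pixel values in [0, max_value]."""
    if len(rendered) != len(reference) or not rendered:
        raise ValueError("images must be non-empty and the same size")
    # Mean squared error over all pixel values.
    mse = sum((a - b) ** 2 for a, b in zip(rendered, reference)) / len(rendered)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

A per-object value would typically be the mean of this score over the held-out frames of that object's capture.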
Circularity Check
No significant circularity; pipeline uses independent real-world evaluation
The described pipeline begins with real handheld video captures of rigid objects, reconstructs GSplat assets, composites photorealistic backgrounds, renders synthetic images with automatically generated ground-truth labels, trains perception models for mask detection and pose estimation, and deploys to iPhone. Evaluation occurs on separate real-world tests and simulation benchmarks, with comparisons to a PnP baseline showing improvements on 4/5 objects. No derivation step reduces a claimed prediction or result to a fitted parameter or self-citation by construction; the workflow is externally falsifiable via held-out real captures and does not rely on self-definitional loops or imported uniqueness theorems.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: Rigid objects can be accurately reconstructed from short handheld video using Gaussian Splatting to produce an editable asset.
- Domain assumption: Compositing the reconstructed asset with diverse photorealistic backgrounds produces synthetic images whose ground-truth labels enable models that generalize to real captures.