Real-Time Visual Intelligence on Low-Cost UAVs: A Modular Approach for Tracking, Scanning, and Navigation

Andrei-Marian Ungureanu; Stelian Spînu

arxiv: 2607.02298 · v1 · pith:3RVCTBHInew · submitted 2026-07-02 · 💻 cs.RO · cs.CV

Pith reviewed 2026-07-03 11:14 UTC · model grok-4.3

classification 💻 cs.RO cs.CV

keywords: UAV, drone, facial detection, facial recognition, depth estimation, real-time AI, modular architecture, low-cost hardware

The pith

Lightweight neural models enable real-time tracking, scanning and navigation on low-cost UAVs

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a modular system on the DJI Tello drone that combines facial detection, facial recognition, and monocular depth estimation using lightweight AI models. A Python server handles the inference, and a web interface provides control and the video feed. The authors report practical performance in person tracking, indoor scanning, and line following without specialized hardware. This matters because it shows how advanced visual intelligence can be made accessible and deployable on affordable platforms for applications like surveillance and assistance.

Core claim

The paper claims that an integrated intelligent drone system on the low-cost DJI Tello platform, featuring a modular architecture for facial detection, facial recognition, and depth estimation from monocular vision, demonstrates robust performance in real-world conditions including person tracking, indoor scanning, and autonomous line following using virtual sensors.

What carries the argument

A modular architecture that integrates facial detection, facial recognition, and monocular depth estimation, with inference run by a Python-based server using lightweight neural models optimized for embedded systems.
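To make the idea of a modular architecture concrete, here is a minimal sketch of how a server might register vision modules and run only the enabled ones on each frame. It is not the authors' code: the `Pipeline` class, the module signature, and the three stub functions are hypothetical, and the stubs return made-up values instead of calling real models.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

# A frame is whatever the video decoder yields; any object stands in here.
Frame = Any
# Hypothetical module signature: take a frame, return a dict of results.
Module = Callable[[Frame], Dict[str, Any]]


@dataclass
class Pipeline:
    """Runs the enabled vision modules on a frame and collects their outputs."""
    modules: Dict[str, Module] = field(default_factory=dict)
    enabled: List[str] = field(default_factory=list)

    def register(self, name: str, module: Module, enable: bool = True) -> None:
        self.modules[name] = module
        if enable and name not in self.enabled:
            self.enabled.append(name)

    def disable(self, name: str) -> None:
        if name in self.enabled:
            self.enabled.remove(name)

    def process(self, frame: Frame) -> Dict[str, Dict[str, Any]]:
        # Every enabled module sees the same frame; results are keyed by name.
        return {name: self.modules[name](frame) for name in self.enabled}


# Stubs standing in for the real detector, recognizer, and depth model.
def detect_faces(frame: Frame) -> Dict[str, Any]:
    return {"boxes": [(100, 80, 60, 60)]}  # (x, y, w, h), made-up values


def recognize_faces(frame: Frame) -> Dict[str, Any]:
    return {"identities": ["unknown"]}


def estimate_depth(frame: Frame) -> Dict[str, Any]:
    return {"depth_map": None}


pipeline = Pipeline()
pipeline.register("detection", detect_faces)
pipeline.register("recognition", recognize_faces)
pipeline.register("depth", estimate_depth, enable=False)  # off until needed

result = pipeline.process(frame=object())
print(sorted(result))
```

Keeping each capability behind the same call signature is what lets a task such as indoor scanning switch depth estimation on without touching the face modules.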

If this is right

The system supports real-time person tracking on low-cost UAVs.
Indoor scanning becomes possible through monocular depth estimation.
Autonomous line following works using virtual sensors.
Advanced AI techniques prove feasible on constrained hardware like the DJI Tello.
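As an illustration of what person tracking from a face detector can involve, the sketch below turns one bounding box into velocity setpoints with a simple proportional controller. Everything in it is an assumption rather than the paper's method: the 960x720 frame size, the gains, the target face area, and the sign conventions, which would need checking against the Tello SDK guide before flight. The output string follows the SDK's `rc a b c d` form, with each value limited to -100..100.

```python
def clamp(value, low=-100, high=100):
    """Round to an integer and limit it to the allowed command range."""
    return max(low, min(high, int(round(value))))


def track_command(box, frame_w=960, frame_h=720,
                  target_area=0.05, k_yaw=0.25, k_ud=0.3, k_fb=400.0):
    """Map a face box (x, y, w, h) to (left_right, forward_back, up_down, yaw).

    All gains and the target area fraction are illustrative assumptions.
    """
    x, y, w, h = box
    cx = x + w / 2
    cy = y + h / 2
    err_x = cx - frame_w / 2          # > 0: face is right of centre
    err_y = frame_h / 2 - cy          # > 0: face is above centre
    area_frac = (w * h) / (frame_w * frame_h)
    err_area = target_area - area_frac  # > 0: face looks too small (too far)

    yaw = clamp(k_yaw * err_x)        # turn toward the face
    up_down = clamp(k_ud * err_y)     # climb or descend to centre it
    forward_back = clamp(k_fb * err_area)  # approach or back away
    return 0, forward_back, up_down, yaw


def rc_string(command):
    """Format setpoints as an SDK-style 'rc a b c d' string."""
    left_right, forward_back, up_down, yaw = command
    return f"rc {left_right} {forward_back} {up_down} {yaw}"


centred_far = track_command((450, 330, 60, 60))
off_right = track_command((800, 330, 60, 60))
too_close = track_command((240, 120, 480, 480))
print(rc_string(centred_far))
```

A proportional rule like this is easy to tune by hand, but it will oscillate if the video latency is large, which is one reason measured end-to-end latency matters for the tracking claim.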

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The modular approach could extend to other affordable drone models with similar processing limits.
Open-source elements of the system may speed up development of custom visual tasks for rescue or monitoring uses.
Adding more sensor types might improve reliability when moving from indoor to outdoor conditions.

Load-bearing premise

Lightweight neural models for facial detection, recognition, and monocular depth estimation can be integrated and run in real time on the DJI Tello platform without hardware-specific failures or unacceptable latency.

What would settle it

A demonstration of unacceptable latency during inference or loss of tracking during real-world tests on the DJI Tello platform would show the claimed robust performance does not hold.

Original abstract

Autonomous drones are rapidly transforming modern warfare and civil applications alike. This paper presents the development of an integrated intelligent drone system designed to serve as a personal assistant. Leveraging the DJI Tello drone platform, we implemented a modular architecture that integrates three core artificial intelligence functionalities: facial detection, facial recognition, and depth estimation from monocular vision. A web-based interface enables seamless drone control and real-time video monitoring, while a Python-based server processes visual data and executes inference pipelines using lightweight neural models optimized for embedded systems. Unlike existing commercial solutions, this system emphasizes accessibility, low-cost hardware, and open-source technologies. The system demonstrates robust performance in real-world conditions, including person tracking, indoor scanning, and autonomous line following using virtual sensors. This project validates the applicability of advanced AI techniques in real-time robotic systems and illustrates the feasibility of deploying them on constrained hardware, providing a foundation for future research in autonomous UAVs for military, rescue, and surveillance missions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Standard CV models integrated on Tello drone with no supporting performance data or novel methods.

The paper is basically a report on integrating face detection, recognition, and monocular depth models with the DJI Tello drone via a Python server and web-based control. It brings nothing new in terms of algorithms or theory.

What it does reasonably well is outline a modular, accessible setup using low-cost hardware and open-source tools. That focus could make it a helpful reference for educational settings or small teams trying to get basic vision running on the Tello without high-end equipment.

The main weakness is the missing data. Claims of robust real-world performance for person tracking, indoor scanning, and autonomous line following are not backed by any quantitative results like frame rates, latency, or accuracy metrics. As the stress-test note points out, this leaves the real-time feasibility unverified.

The architecture description is clear enough, but without details on how the models were optimized or tested together, it's difficult to assess if the system meets its goals on the constrained platform.

Readers who might get value are those building similar hobby or teaching projects and looking for an example implementation. It won't provide new insights or reliable benchmarks for the broader robotics community.

I do not think this deserves peer review. It reads more like a project summary than a research contribution that needs referee input.

Referee Report

2 major / 1 minor

Summary. The manuscript describes a modular system on the low-cost DJI Tello UAV that integrates lightweight neural models for facial detection, facial recognition, and monocular depth estimation. A Python server handles inference and a web interface provides control and video monitoring. The work claims to demonstrate robust real-world performance in person tracking, indoor scanning, and autonomous line following using virtual sensors, while emphasizing accessibility and open-source components over commercial alternatives.

Significance. If the performance claims were supported by quantitative evidence, the paper would provide a useful case study on deploying integrated visual AI pipelines on severely constrained embedded hardware. As presented, the contribution is limited to a system-integration description without measurable validation of the central feasibility claim.

major comments (2)

[Abstract] The claim that the system 'demonstrates robust performance in real-world conditions' is unsupported because no quantitative metrics (end-to-end FPS, per-module inference latency, tracking success rate, depth estimation error, or failure modes under the stated conditions) are supplied for the integrated pipeline running on the Tello platform.
[Abstract] The description of the server-based Python pipeline does not report measured frame rates or latency when all three models (detection, recognition, depth) execute concurrently, leaving the 'real-time' and 'lightweight' assertions unverified against the hardware constraints of the Tello.
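The measurements requested in the major comments could be gathered with a small timing harness along the following lines. This is a sketch under stated assumptions: the three modules are stubs that only sleep, the module names are placeholders, and the modules run one after another on a single thread, so it measures sequential rather than truly concurrent execution.

```python
import time
from collections import defaultdict
from statistics import mean


def benchmark(modules, frames):
    """Time each module per frame and report per-module and end-to-end figures."""
    per_module = defaultdict(list)
    end_to_end = []
    for frame in frames:
        t_frame = time.perf_counter()
        for name, fn in modules.items():
            t0 = time.perf_counter()
            fn(frame)
            per_module[name].append(time.perf_counter() - t0)
        end_to_end.append(time.perf_counter() - t_frame)

    report = {
        name: {"mean_ms": 1000 * mean(ts), "max_ms": 1000 * max(ts)}
        for name, ts in per_module.items()
    }
    total = sum(end_to_end)
    report["end_to_end"] = {
        "mean_ms": 1000 * mean(end_to_end),
        "fps": len(end_to_end) / total if total > 0 else float("inf"),
    }
    return report


# Stub modules: the sleeps are placeholders, not measured model latencies.
def stub(seconds):
    def run(frame):
        time.sleep(seconds)
    return run


modules = {
    "detection": stub(0.002),
    "recognition": stub(0.003),
    "depth": stub(0.005),
}
report = benchmark(modules, frames=range(5))
for name, stats in report.items():
    print(name, {k: round(v, 2) for k, v in stats.items()})
```

Reporting the maximum alongside the mean is deliberate: a control loop is limited by its slowest frames, not its average ones.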

minor comments (1)

[Abstract] The abstract refers to 'virtual sensors' for line following without defining how monocular depth or other outputs are converted into control signals.
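Since the abstract leaves 'virtual sensors' undefined, the sketch below shows one plausible reading and should not be taken as the authors' design: the bottom band of a binary line mask is split into zones, each zone acts like a reflectance sensor on a line-following robot, and the active zones give a normalized steering error. The function names, the zone count, the band height, and the threshold are all hypothetical.

```python
from statistics import mean


def virtual_sensors(mask, n_sensors=5, band_rows=4, threshold=0.3):
    """Return one boolean per zone: True if enough line pixels fall in it.

    mask is a list of rows, each a list of 0/1 values (1 = line pixel).
    """
    width = len(mask[0])
    band = mask[len(mask) - band_rows:]   # only the rows nearest the drone
    zone_w = width // n_sensors
    readings = []
    for i in range(n_sensors):
        lo = i * zone_w
        hi = width if i == n_sensors - 1 else lo + zone_w
        hits = sum(sum(row[lo:hi]) for row in band)
        readings.append(hits / (len(band) * (hi - lo)) >= threshold)
    return readings


def steering_error(readings):
    """Normalized offset of the line in [-1, 1]; None if no sensor fires."""
    if len(readings) < 2:
        raise ValueError("need at least two virtual sensors")
    active = [i for i, on in enumerate(readings) if on]
    if not active:
        return None
    centre = (len(readings) - 1) / 2
    return (mean(active) - centre) / centre


def make_mask(line_cols, height=8, width=10):
    """Build a synthetic mask with a vertical line at the given columns."""
    return [[1 if c in line_cols else 0 for c in range(width)]
            for _ in range(height)]


right_of_centre = virtual_sensors(make_mask({6, 7}))
print(right_of_centre, steering_error(right_of_centre))
```

The error could then scale a yaw or lateral velocity command; a `None` result signals that the line has been lost and a search behaviour is needed.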

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for quantitative validation. We agree that the performance claims require supporting metrics and will revise the manuscript to address these points.

Referee: [Abstract] The claim that the system 'demonstrates robust performance in real-world conditions' is unsupported because no quantitative metrics (end-to-end FPS, per-module inference latency, tracking success rate, depth estimation error, or failure modes under the stated conditions) are supplied for the integrated pipeline running on the Tello platform.

Authors: We agree that the abstract claim lacks quantitative backing in the current manuscript, which focuses on system integration and qualitative demonstrations. In revision we will add experimental results including end-to-end FPS, per-module latencies, tracking success rates, depth estimation error, and observed failure modes from real-world Tello tests. Revision: yes.

Referee: [Abstract] The description of the server-based Python pipeline does not report measured frame rates or latency when all three models (detection, recognition, depth) execute concurrently, leaving the 'real-time' and 'lightweight' assertions unverified against the hardware constraints of the Tello.

Authors: We acknowledge the absence of concurrent execution measurements. The revised manuscript will include new benchmark results reporting frame rates and end-to-end latency for the three models running simultaneously on the server hardware used with the Tello. Revision: yes.

Circularity Check

0 steps flagged

No circularity: system integration report with no derivations or fitted predictions

The manuscript is an engineering/systems paper describing a modular UAV pipeline on DJI Tello hardware. It contains no equations, no parameter fitting, no uniqueness theorems, and no derivation chain that could reduce to its inputs by construction. The central claim is an empirical demonstration of integrated models; the absence of reported latency/FPS numbers is an evidence gap, not a circularity issue. No self-citations are load-bearing for any mathematical result. The derivation chain is empty, so the paper is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the work implicitly assumes standard pre-trained CV models and the DJI Tello API function as documented by their creators.
