Decoupling Ego-Motion from Target Dynamics via Dual-Interval Motion Cues for UAV Detection

Feitian Zhang; Liuyang Wang

arxiv: 2605.22605 · v1 · pith:FX53WRTBnew · submitted 2026-05-21 · 💻 cs.RO · cs.CV

Pith reviewed 2026-05-22 05:19 UTC · model grok-4.3

classification 💻 cs.RO cs.CV

keywords: UAV object detection, ego-motion compensation, dual-interval motion, motion-guided attention, video object detection, VisDrone-VID, YOLOv8, global motion compensation

The pith

Dual-interval motion extraction decouples ego-motion from target dynamics to improve UAV object detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a vision-only framework for object detection in UAV video, where ego-motion, camera jitter, and scale changes degrade detectors designed for static images. It first aligns frames with homography-based global motion compensation, then extracts motion cues over both a short and a long interval to separate true target movement from camera-induced disturbance. A lightweight motion-guided attention module uses these cues to refine features in a feature pyramid network. Experiments on the VisDrone-VID dataset show gains over a YOLOv8 baseline under severe ego-motion. Ablations support using two intervals rather than one, and support the attention integration step.

Core claim

By aligning adjacent frames via homography and then computing motion cues across dual time intervals before feeding them into a motion-guided attention module, the detector can focus on target dynamics instead of camera-induced motion, producing consistent accuracy gains for small objects in dynamic UAV scenes compared with direct application of image-based detectors.

What carries the argument

Dual-Interval Motion Extraction strategy that computes both short-term and long-term motion cues after homography alignment, combined with the Motion-Guided Attention module to enhance feature representations inside the detection backbone.
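As a rough illustration of why two intervals can carry more information than one, the following sketch computes absolute frame differences at a short and a long gap on frames that are assumed to be already aligned. It is not the authors' implementation: the function names, the gap lengths `short_gap` and `long_gap`, and the toy frames are all hypothetical, and plain nested lists stand in for image arrays.

```python
def abs_diff(frame_a, frame_b):
    """Per-pixel absolute difference of two equally sized grayscale frames."""
    return [[abs(a - b) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(frame_a, frame_b)]


def dual_interval_cues(aligned_frames, t, short_gap=1, long_gap=5):
    """Return (short_cue, long_cue) for frame t.

    aligned_frames: list of 2-D lists, assumed already warped into the
    coordinate system of frame t (ego-motion compensated).
    short_gap, long_gap: hypothetical interval lengths in frames.
    """
    if not 0 < short_gap < long_gap <= t:
        raise ValueError("need 0 < short_gap < long_gap <= t")
    current = aligned_frames[t]
    return (abs_diff(current, aligned_frames[t - short_gap]),
            abs_diff(current, aligned_frames[t - long_gap]))


def make_frame(fast_col, slow_col, size=8):
    """Toy frame: one bright pixel on row 2 (fast target), one on row 5 (slow target)."""
    frame = [[0] * size for _ in range(size)]
    frame[2][fast_col] = 1
    frame[5][slow_col] = 1
    return frame


# The fast target moves one pixel per frame; the slow target moves once, at frame 3.
frames = [make_frame(fast_col=i, slow_col=0 if i < 3 else 1) for i in range(6)]
short_cue, long_cue = dual_interval_cues(frames, t=5)

print(sum(short_cue[5]))  # slow-target row, short interval → 0
print(sum(long_cue[5]))   # slow-target row, long interval  → 2
```

In this toy case the slow target is invisible to the short-interval difference but shows up in the long-interval one, while the fast target appears in both.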

If this is right

Detection of small fast-moving objects becomes more reliable when camera jitter and ego-motion are present.
The framework avoids the computational cost of optical flow while still using motion information.
Ablation results indicate that both the dual time intervals and the guided attention contribute measurable gains.
The approach integrates directly into existing feature-pyramid detectors without changing the core backbone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same dual-interval idea could transfer to other moving-camera detection tasks such as vehicle-mounted or handheld video.
Adaptive choice of interval lengths based on measured jitter might further reduce sensitivity to scene type.
More robust alignment methods beyond homography could relax the assumption about planar motion and extend applicability to complex 3D environments.

Load-bearing premise

Homography-based global motion compensation accurately aligns adjacent frames without distorting or losing information about small fast-moving targets when depth variation is large or motion is non-planar.
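The standard two-view geometry behind this premise can be written out briefly. The derivation below is textbook material rather than something taken from the paper, and the symbols are assumptions made for illustration: a 3-D point $X$ in the first camera frame maps to $X' = RX + t$ in the second, the dominant scene plane satisfies $n^{\top}X = d$, $K$ is the camera intrinsics matrix, and $\delta$ is a point's offset from that plane.

```latex
\[
\begin{aligned}
X' &= R X + t = \Bigl(R + \tfrac{1}{d}\, t\, n^{\top}\Bigr) X
      && \text{if } n^{\top} X = d, \\
H  &= K \Bigl(R + \tfrac{1}{d}\, t\, n^{\top}\Bigr) K^{-1}, \\
X' &= \Bigl(R + \tfrac{1}{d}\, t\, n^{\top}\Bigr) X - \tfrac{\delta}{d}\, t
      && \text{if } n^{\top} X = d + \delta .
\end{aligned}
\]
```

The last line shows the residual that a single homography cannot absorb: it scales with the camera translation $t$ and with the relative off-plane offset $\delta / d$, and it vanishes for pure rotation or for points on the plane. Under these assumptions, low-altitude flight over tall structures is where the premise is most likely to be strained.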

What would settle it

Evaluation on UAV video sequences recorded over highly non-planar terrain with accurate ground-truth annotations for small targets, where the method produces no detection improvement or a clear drop relative to the unmodified YOLOv8 baseline.

Figures

Figures reproduced from arXiv: 2605.22605 by Feitian Zhang, Liuyang Wang.

**Figure 2.** Overall architecture of the proposed motion-guided object detection framework. A YOLO backbone extracts multi-scale spatial features (…

**Figure 3.** Visualization of the proposed dual-interval motion extraction process. The short-term difference captures fast target motion but is sensitive to ego…

**Figure 4.** Illustration of the proposed asymmetric training-inference paradigm

**Figure 5.** Deployment platform used for real-time edge inference experiments

**Figure 6.** Qualitative comparison between the spatial-only baseline and the proposed framework on the VisDrone-VID validation set. Yellow boxes denote…

**Figure 7.** Comparison of Precision-Recall curves on the VisDrone-VID…

Original abstract

Object detection from Unmanned Aerial Vehicles (UAVs) is challenged by severe ego-motion, camera jitter, and large scale variations. While modern detectors perform well on static images, their direct application to UAV video often fails, particularly for small objects in dynamic scenes. Existing motion-based methods either rely on computationally expensive optical flow or use single-interval differencing, which is sensitive to jitter and limited in capturing diverse motion patterns. We propose a vision-only motion-guided detection framework that decouples target motion from camera-induced disturbances. A homography-based Global Motion Compensation (GMC) first aligns adjacent frames. We then introduce a Dual-Interval Motion Extraction strategy that captures both short-term and long-term motion cues. To integrate these cues, a lightweight Motion-Guided Attention (MGA) module enhances feature representations within a Feature Pyramid Network. Experiments on the VisDrone-VID dataset demonstrate consistent improvements over a strong YOLOv8 baseline under severe ego-motion. Ablation studies further confirm the effectiveness of the dual-interval design and the proposed motion-guided attention mechanism.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Dual-interval motion cues after homography compensation deliver gains over YOLOv8 on VisDrone-VID, but the alignment step risks residual errors in non-planar scenes.

The paper's main move is to run homography-based global motion compensation first, then extract motion differences at two time scales and route them through a lightweight motion-guided attention module inside the feature pyramid. That combination is the fresh part, and it produces consistent detection lifts on the VisDrone-VID set under heavy ego-motion compared with a plain YOLOv8 baseline. Ablations back the dual-interval choice and the attention addition, and the whole pipeline stays light enough for onboard use.

Referee Report

2 major / 2 minor

Summary. The paper proposes a vision-only motion-guided detection framework for UAV videos to handle severe ego-motion. It first applies homography-based Global Motion Compensation (GMC) to align adjacent frames, then uses Dual-Interval Motion Extraction to capture short-term and long-term motion cues, and integrates these via a lightweight Motion-Guided Attention (MGA) module within a Feature Pyramid Network. Experiments on the VisDrone-VID dataset are claimed to show consistent improvements over a YOLOv8 baseline under severe ego-motion, with ablations confirming the dual-interval design and MGA effectiveness.

Significance. If the results hold after verification, the approach offers a practical, computationally lighter alternative to optical flow for decoupling ego-motion in UAV detection, which could aid real-time performance on small and fast-moving targets. The dual-interval cues address limitations of single-interval differencing, and the MGA provides targeted feature enhancement; these are sensible extensions of existing motion-based methods.

major comments (2)

[Method (GMC and motion extraction sections)] The decoupling claim rests on homography-based GMC to remove ego-motion before Dual-Interval Motion Extraction and MGA (as described in the method). However, no quantitative evaluation of alignment accuracy is provided, such as reprojection or endpoint error on feature points, nor any analysis or ablation on non-planar scene subsets with large depth variation (common in VisDrone-VID UAV footage). Residual parallax could distort small-target motion cues or introduce artifacts, directly affecting the central premise.
[Experiments and results] The abstract states that ablations confirm the dual-interval design and reports gains over YOLOv8, but the manuscript lacks detailed quantitative tables, baseline adaptation details for video input, or error analysis stratified by ego-motion severity. This weakens support for the claimed consistent improvements.
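The alignment-accuracy metric requested in the first major comment could be computed along the following lines. This is a minimal sketch under stated assumptions, not the authors' evaluation code: the homography is given as a 3x3 nested list, matched feature points are `(x, y)` tuples, and the names `apply_homography` and `mean_reprojection_error` as well as the toy matches are hypothetical.

```python
import math


def apply_homography(H, point):
    """Map one (x, y) pixel coordinate through a 3x3 homography given as nested lists."""
    x, y = point
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return u / w, v / w


def mean_reprojection_error(H, src_pts, dst_pts):
    """Mean Euclidean distance in pixels between H-mapped src points and their matches."""
    errors = []
    for src, dst in zip(src_pts, dst_pts):
        u, v = apply_homography(H, src)
        errors.append(math.hypot(u - dst[0], v - dst[1]))
    return sum(errors) / len(errors)


# Toy check: a pure translation by (3, -2) explains three matches exactly;
# the fourth match is off by 4 pixels, as residual parallax might cause.
H_shift = [[1.0, 0.0, 3.0], [0.0, 1.0, -2.0], [0.0, 0.0, 1.0]]
src = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
dst = [(3.0, -2.0), (13.0, -2.0), (3.0, 8.0), (13.0, 12.0)]
error_px = mean_reprojection_error(H_shift, src, dst)
print(error_px)  # → 1.0
```

Reporting this number separately for background points and for points on or near annotated targets would show whether misalignment concentrates where it matters for small-object cues.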

minor comments (2)

[Abstract] The abstract would benefit from specifying exact metrics (e.g., mAP@0.5:0.95) and the numerical magnitude of improvements for better clarity.
[Method] Notation for the short and long interval lengths in Dual-Interval Motion Extraction could be formalized with equations to aid reproducibility.
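One possible formalization of the kind the second minor comment asks for is sketched below. The notation is hypothetical and not taken from the paper: $I_t$ is the current frame, $H_{t-\tau \to t}$ the estimated homography from frame $t-\tau$ to frame $t$, $\mathcal{W}$ the warping operator, and $\tau_s$, $\tau_l$ the short and long interval lengths in frames.

```latex
\[
\begin{aligned}
\tilde{I}_{t-\tau} &= \mathcal{W}\bigl(I_{t-\tau};\, H_{t-\tau \to t}\bigr), \\
D_{s}(t) &= \bigl|\, I_t - \tilde{I}_{t-\tau_s} \bigr|, \\
D_{l}(t) &= \bigl|\, I_t - \tilde{I}_{t-\tau_l} \bigr|, \qquad 0 < \tau_s < \tau_l .
\end{aligned}
\]
```

Stating $\tau_s$ and $\tau_l$ explicitly, together with the frame rate of the source video, would be enough for a reader to reproduce the two motion maps.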

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, agreeing where revisions are warranted and outlining specific changes to strengthen the presentation of the GMC evaluation and experimental details.

Referee: [Method (GMC and motion extraction sections)] The decoupling claim rests on homography-based GMC to remove ego-motion before Dual-Interval Motion Extraction and MGA (as described in the method). However, no quantitative evaluation of alignment accuracy is provided, such as reprojection or endpoint error on feature points, nor any analysis or ablation on non-planar scene subsets with large depth variation (common in VisDrone-VID UAV footage). Residual parallax could distort small-target motion cues or introduce artifacts, directly affecting the central premise.

Authors: We agree that a quantitative assessment of GMC alignment accuracy would provide stronger support for the decoupling premise. In the revised manuscript, we will add average reprojection errors computed on matched feature points across VisDrone-VID sequences, along with an analysis of performance on non-planar scene subsets identified by large depth variation (via disparity estimates or scene structure). This will include both metrics and qualitative examples to show that residual parallax does not materially degrade the dual-interval motion cues for small targets.

revision: yes

Referee: [Experiments and results] The abstract states that ablations confirm the dual-interval design and reports gains over YOLOv8, but the manuscript lacks detailed quantitative tables, baseline adaptation details for video input, or error analysis stratified by ego-motion severity. This weakens support for the claimed consistent improvements.

Authors: We acknowledge that additional experimental detail would better substantiate the claims. The revised manuscript will include expanded quantitative tables with per-sequence and per-ego-motion-severity breakdowns, explicit description of the YOLOv8 video baseline (frame-wise inference augmented only by our motion cues at test time), and an error analysis stratified by ego-motion severity (quantified via homography magnitude and average flow magnitude per clip). These additions will directly address the request for more rigorous validation of consistent gains.

revision: yes
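The rebuttal does not say how "homography magnitude" would be measured. One plausible reading, shown here purely as an assumption, is the mean displacement of the four image corners under the estimated homography, binned by cut-offs chosen by the evaluator. The function names and the thresholds `(2.0, 10.0)` are hypothetical.

```python
import math
from bisect import bisect_right


def corner_displacement(H, width, height):
    """Mean distance in pixels that the four image corners move under homography H."""
    corners = [(0, 0), (width - 1, 0), (0, height - 1), (width - 1, height - 1)]
    total = 0.0
    for x, y in corners:
        w = H[2][0] * x + H[2][1] * y + H[2][2]
        u = (H[0][0] * x + H[0][1] * y + H[0][2]) / w
        v = (H[1][0] * x + H[1][1] * y + H[1][2]) / w
        total += math.hypot(u - x, v - y)
    return total / len(corners)


def severity_bin(displacement, thresholds=(2.0, 10.0)):
    """Severity bin index: 0 = mild, 1 = moderate, 2 = severe (hypothetical cut-offs)."""
    return bisect_right(thresholds, displacement)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shift = [[1.0, 0.0, 3.0], [0.0, 1.0, 4.0], [0.0, 0.0, 1.0]]
print(severity_bin(corner_displacement(identity, 640, 360)))  # → 0
print(severity_bin(corner_displacement(shift, 640, 360)))     # → 1
```

Averaging the per-frame displacement over a clip gives one severity score per clip, and detection metrics can then be reported per bin.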

Circularity Check

0 steps flagged

No significant circularity; method uses standard geometric primitives and empirical validation

The paper proposes a pipeline of homography-based GMC followed by dual-interval differencing and a lightweight attention module, then validates via ablation and comparison on the public VisDrone-VID dataset. No equations or claims reduce a claimed prediction to a fitted parameter by construction, nor does any load-bearing step rest solely on self-citation. The central decoupling claim is supported by external dataset testing rather than internal redefinition. Minor self-citation risk is present in any vision paper but is not load-bearing here.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The method relies on the standard computer-vision assumption that inter-frame motion can be approximated by a homography and introduces a small number of design choices for interval selection and attention weighting that are not derived from first principles.

free parameters (2)

short and long interval lengths
Chosen to capture diverse motion patterns; values are not stated in the abstract and are presumably tuned on validation data.
attention fusion weights
Hyperparameters controlling how short-term and long-term cues are combined inside the MGA module.
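To make the role of such fusion weights concrete, here is a deliberately simple spatial gate. The paper's MGA design is not described on this page, so everything below is an assumption: the weights `w_short`, `w_long`, and `bias` are fixed numbers rather than learned parameters, the gate is a per-pixel sigmoid, and nested lists stand in for feature tensors.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def motion_gate(short_cue, long_cue, w_short=1.0, w_long=1.0, bias=0.0):
    """Spatial gate in (0, 1) built from two equally sized 2-D motion cues."""
    return [[sigmoid(w_short * s + w_long * l + bias) for s, l in zip(row_s, row_l)]
            for row_s, row_l in zip(short_cue, long_cue)]


def apply_gate(features, gate):
    """Scale every channel of a (C, H, W) nested-list feature map by (1 + gate)."""
    return [[[value * (1.0 + g) for value, g in zip(row_f, row_g)]
             for row_f, row_g in zip(channel, gate)]
            for channel in features]


# Toy input: one channel, 2x2 map, strong motion only at the bottom-right pixel.
cue_short = [[0.0, 0.0], [0.0, 6.0]]
cue_long = [[0.0, 0.0], [0.0, 6.0]]
features = [[[1.0, 1.0], [1.0, 1.0]]]
refined = apply_gate(features, motion_gate(cue_short, cue_long))
print(refined[0][0][0])  # static pixel → 1.5
```

The residual form `1 + gate` keeps the original features intact where no motion is detected and amplifies them where either cue fires. In a real detector the cues would also need resizing to each pyramid level, which this sketch omits.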

axioms (1)

Domain assumption: a homography transformation suffices to model and compensate global camera motion between adjacent frames.
Invoked in the Global Motion Compensation step to align frames before motion extraction.


[1] [1]

Past, present, and future of aerial robotic manipulators,

A. Ollero, M. Tognon, A. Suarez, D. Lee, and A. Franchi, “Past, present, and future of aerial robotic manipulators,”IEEE Transactions on Robotics, vol. 38, no. 1, pp. 626–645, 2022

work page 2022

[2] [2]

Uavs for industrial applications: Identifying challenges and opportunities from the implementation point of view,

D. Mourtzis, J. Angelopoulos, and N. Panopoulos, “Uavs for industrial applications: Identifying challenges and opportunities from the implementation point of view,”Procedia Manufacturing, vol. 55, pp. 183–190, 2021, fAIM 2021. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2351978921002237

work page 2021

[3] [3]

Asma: An a daptive safety m argin a lgorithm for vision-language drone navigation via scene-aware control barrier functions,

S. Sanyal and K. Roy, “Asma: An a daptive safety m argin a lgorithm for vision-language drone navigation via scene-aware control barrier functions,”IEEE Robotics and Automation Letters, vol. 10, no. 9, pp. 9232–9239, 2025

work page 2025

[4] [4]

Sign: Safety- aware image-goal navigation for autonomous drones via reinforcement learning,

Z. Yan, R. Huang, L. He, S. Guo, and L. Zhao, “Sign: Safety- aware image-goal navigation for autonomous drones via reinforcement learning,”IEEE Robotics and Automation Letters, vol. 11, no. 2, pp. 1962–1969, 2026

work page 1962

[5] [5]

You only look once: Unified, real-time object detection,

J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 779– 788

work page 2016

[6] [6]

Ultralytics yolov8,

G. Jocher, A. Chaurasia, and J. Qiu, “Ultralytics yolov8,” 2023. [Online]. Available: https://github.com/ultralytics/ultralytics

work page 2023

[7] [7]

Au-air: A multi-modal unmanned aerial vehicle dataset for low altitude traffic surveillance,

I. Bozcan and E. Kayacan, “Au-air: A multi-modal unmanned aerial vehicle dataset for low altitude traffic surveillance,” 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 8504–8510, 2020. [Online]. Available: https://api.semanticscholar.org/CorpusID:211004016

work page 2020

[8] [8]

Emergencynet: Efficient aerial image classification for drone-based emergency monitoring using atrous convolutional feature fusion,

C. Kyrkou and T. Theocharides, “Emergencynet: Efficient aerial image classification for drone-based emergency monitoring using atrous convolutional feature fusion,”IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 13, pp. 1687– 1699, 2020

work page 2020

[9] [9]

Visdrone-det2021: The vision meets drone object detec- tion challenge results,

Y . Caoet al., “Visdrone-det2021: The vision meets drone object detec- tion challenge results,” in2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 2021, pp. 2847–2854

work page 2021

[10] [10]

Perception and avoidance of multiple small fast moving objects for quadrotors with only low-cost rgbd camera,

M. Lu, H. Chen, and P. Lu, “Perception and avoidance of multiple small fast moving objects for quadrotors with only low-cost rgbd camera,”IEEE Robotics and Automation Letters, vol. 7, no. 4, pp. 11 657–11 664, 2022

work page 2022

[11] [11]

Flow-guided feature aggregation for video object detection,

X. Zhu, Y . Wang, J. Dai, L. Yuan, and Y . Wei, “Flow-guided feature aggregation for video object detection,” in2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 408–417

work page 2017

[12] [12]

Flownet: Learning optical flow with convolu- tional networks,

A. Dosovitskiyet al., “Flownet: Learning optical flow with convolu- tional networks,” in2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 2758–2766

work page 2015

[13] [13]

A survey on real-time object detection algorithms,

R. R, I. Fatima, and L. A. Prasad, “A survey on real-time object detection algorithms,” in2023 International Conference on Advances in Electronics, Communication, Computing and Intelligent Information Systems (ICAECIS), 2023, pp. 548–553

work page 2023

[14] [14]

Object tracking: A survey,

A. Yilmaz, O. Javed, and M. Shah, “Object tracking: A survey,” ACM Comput. Surv., vol. 38, no. 4, p. 13–es, Dec. 2006. [Online]. Available: https://doi.org/10.1145/1177352.1177355

work page doi:10.1145/1177352.1177355 2006

[15] [15]

Ssd: Single shot multibox detector,

L. Weiet al., “Ssd: Single shot multibox detector,”Springer , Cham, 2016

work page 2016

[16] [16]

Focal loss for dense object detection,

T. Y . Lin, P. Goyal, R. Girshick, K. He, and P. Dollar, “Focal loss for dense object detection,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 2, pp. 318–327, 2020

work page 2020

[17] [17]

End-to-end object detection with transformers,

N. Carionet al., “End-to-end object detection with transformers,” Computer Vision – ECCV 2020, 2020

work page 2020

[18] [18]

The unmanned aerial vehicle benchmark: Object detection, tracking and baseline,

H. Yuet al., “The unmanned aerial vehicle benchmark: Object detection, tracking and baseline,”International Journal of Computer Vision, vol. 128, no. 5, pp. 1141–1159, 2020

work page 2020

[19] [19]

Air-to-air visual detection of micro-uavs: An exper- imental evaluation of deep learning,

Y . Zhenget al., “Air-to-air visual detection of micro-uavs: An exper- imental evaluation of deep learning,”IEEE Robotics and Automation Letters, vol. 6, no. 2, pp. 1020–1027, 2021

work page 2021

[20] [20]

Common corruptions for evaluating and enhancing robustness in air-to-air visual object detection,

A. Arsenoset al., “Common corruptions for evaluating and enhancing robustness in air-to-air visual object detection,”IEEE Robotics and Automation Letters, vol. 9, no. 7, pp. 6688–6695, 2024

work page 2024

[21] [21]

Ad-yolo: A real-time yolo network with swin transformer and attention mechanism for airport scene detection,

W. Zhou, C. Cai, C. Li, H. Xu, and H. Shi, “Ad-yolo: A real-time yolo network with swin transformer and attention mechanism for airport scene detection,”IEEE Transactions on Instrumentation and Measurement, vol. 73, pp. 1–12, 2024

work page 2024

[22] [22]

Active classification of moving targets with learned control policies,

´A. Serra-G ´omez, E. Montijano, W. B ¨ohmer, and J. Alonso-Mora, “Active classification of moving targets with learned control policies,” IEEE Robotics and Automation Letters, vol. 8, no. 6, pp. 3717–3724, 2023

work page 2023

[23] [23]

Fast and robust uav to uav detection and tracking from video,

J. Li, D. H. Ye, M. Kolsch, J. P. Wachs, and C. A. Bouman, “Fast and robust uav to uav detection and tracking from video,”IEEE Transactions on Emerging Topics in Computing, vol. 10, no. 3, pp. 1519–1531, 2022

work page 2022

[24] X. Jiang, T. Liu, L. Liu, Z. Liu, and Y. Liu, “Uevavd: A dataset for developing uav’s eye view active object detection,” IEEE Robotics and Automation Letters, vol. 10, no. 6, pp. 6272–6279, 2025.

[25] X. Yang et al., “Video tiny-object detection guided by the spatial-temporal motion information,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 3054–3063.

[26] J. Liu, L. Plotegher, E. Roura, C. de Souza Junior, and S. He, “Real-time detection for small uavs: Combining yolo and multi-frame motion analysis,” IEEE Transactions on Aerospace and Electronic Systems, 2025.

[27] X. Wang, H.-S. Chen, Z. Zhou, J.-E. Yao, and C.-C. J. Kuo, “Ma-yolo: Video object detection via motion-assisted yolo,” in 2025 IEEE International Conference on Image Processing Workshops (ICIPW). IEEE, 2025, pp. 440–445.

[28] Y. Zheng, Y. Jing, J. Zhao, and G. Cui, “Lam-yolo: Drones-based small object detection on lighting-occlusion attention mechanism yolo,” Computer Vision and Image Understanding, vol. 261, p. 104489, 2025. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1077314225002127

[29] Y. Guo, Y. He, H. Zhang, and J. Ma, “Yolo-cam: A lightweight uav object detector with combined attention mechanism for small targets,” Remote Sensing, vol. 17, no. 21, p. 3575, 2025.

[30] Z. Zhou, Y. Hu, X. Yang, and J. Yang, “Yolo-based marine organism detection using two-terminal attention mechanism and difficult-sample resampling,” Appl. Soft Comput., vol. 153, no. C, Mar. 2024. [Online]. Available: https://doi.org/10.1016/j.asoc.2024.111291

[31] K. Liu, L. Peng, and S. Tang, “Underwater object detection using tc-yolo with attention mechanisms,” Sensors, vol. 23, no. 5, 2023. [Online]. Available: https://www.mdpi.com/1424-8220/23/5/2567

[32] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.

[33] M. A. Fischler and R. C. Bolles, “Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography,” Communications of the ACM, vol. 24, no. 6, pp. 381–395, 1981.

[34] A. Eldesokey, M. Felsberg, and F. S. Khan, “Confidence propagation through cnns for guided sparse depth regression,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 10, pp. 2423–2436, 2020.

[35] J. Hu et al., “Deep depth completion from extremely sparse data: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PP, pp. 1–20, Dec. 2022.

[36] E. Rublee, V. Rabaud, K. Konolige, and G. R. Bradski, “Orb: an efficient alternative to sift or surf,” IEEE, 2011.

[37] J. Tang, L. Ericson, J. Folkesson, and P. Jensfelt, “Gcnv2: Efficient correspondence prediction for real-time slam,” IEEE Robotics and Automation Letters, vol. 4, no. 4, pp. 3505–3512, 2019.