M²E-UAV: A Benchmark and Analysis for Onboard Motion-on-Motion Event-Based Tiny UAV Detection
Pith reviewed 2026-05-15 04:55 UTC · model grok-4.3
The pith
The first onboard event-camera benchmark for tiny UAV detection under mutual motion shows current methods fail amid dense ego-motion noise.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
M²E-UAV supplies the first synchronized event streams and IMU measurements collected from an onboard platform together with event-level UAV labels derived from temporally propagated 10 Hz bounding-box annotations. The processed benchmark contains 87,223 training samples and 21,395 validation samples across sunny building-forest, sunny farm-village, sunset building-forest, and sunset farm-village scenes. Defined train/validation splits and an evaluation protocol allow comparison of existing baselines across event-frame, voxel-grid, and point-set representations with optional IMU input; these baselines prove limited when sparse tiny-target events must be distinguished from dense ego-motion–cau
What carries the argument
The M²E-UAV dataset and its evaluation protocol, which supplies moving-platform event streams, IMU data, and temporally propagated labels to measure detection of sparse target clusters inside dense ego-motion background events.
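The three input representations named in the benchmark (event-frame, voxel-grid, point-set) can be illustrated with a minimal sketch. The event tuple layout `(x, y, t, polarity)`, the function names, and the binning scheme below are assumptions made for illustration; the page does not describe how the benchmark actually builds these inputs.

```python
from typing import List, Tuple

# Assumed event layout: (x pixel, y pixel, timestamp in seconds, polarity +1/-1).
Event = Tuple[int, int, float, int]


def to_event_frame(events: List[Event], width: int, height: int) -> List[List[int]]:
    """Accumulate signed polarity per pixel into one 2D grid (time is discarded)."""
    frame = [[0] * width for _ in range(height)]
    for x, y, _t, p in events:
        frame[y][x] += p
    return frame


def to_voxel_grid(events: List[Event], width: int, height: int,
                  num_bins: int) -> List[List[List[int]]]:
    """Split the time window into num_bins slices and accumulate polarity per slice."""
    grid = [[[0] * width for _ in range(height)] for _ in range(num_bins)]
    if not events:
        return grid
    t0 = min(e[2] for e in events)
    t1 = max(e[2] for e in events)
    span = (t1 - t0) or 1.0  # avoid division by zero for a single timestamp
    for x, y, t, p in events:
        b = min(int((t - t0) / span * num_bins), num_bins - 1)
        grid[b][y][x] += p
    return grid


def to_point_set(events: List[Event]) -> List[Tuple[float, float, float, float]]:
    """Keep every event as a point (x, y, t - t0, polarity); nothing is rasterized."""
    if not events:
        return []
    t0 = min(e[2] for e in events)
    return [(float(x), float(y), t - t0, float(p)) for x, y, t, p in events]


# Toy events, not benchmark data.
events = [(1, 0, 0.000, 1), (1, 0, 0.004, 1), (2, 1, 0.009, -1)]
frame = to_event_frame(events, width=4, height=2)
voxels = to_voxel_grid(events, width=4, height=2, num_bins=2)
points = to_point_set(events)
```

The sketch makes the trade-off visible: the frame and voxel grid sum many background events into each cell, while the point set keeps a sparse target cluster as individual points at the cost of an unordered input.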
If this is right
- Detection algorithms must explicitly separate sparse target clusters from dense background events generated by platform motion.
- Optional IMU input can be used to model and subtract ego-motion, but current baselines do not yet exploit it effectively.
- Standardized splits and metrics enable direct comparison of new representations or architectures on the same motion-on-motion data.
- Performance gaps indicate that existing event-frame and voxel methods lose tiny targets when background activity is high.
- Real-world onboard UAV perception requires robustness to mutual motion rather than the clean-background regime assumed in prior work.
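One way IMU input could be used to model ego-motion is a first-order rotational warp of each event back to a reference time. The sketch below is a generic illustration, not a method from the paper. It assumes the gyroscope rate is already expressed in the camera frame and constant over the window, ignores translation, and uses hypothetical pinhole intrinsics `fx, fy, cx, cy`. Sign conventions depend on the actual sensor axes and extrinsics.

```python
from typing import List, Tuple

# Assumed event layout: (x pixel, y pixel, timestamp in seconds, polarity).
Event = Tuple[float, float, float, int]


def rotational_flow(xn: float, yn: float,
                    omega: Tuple[float, float, float]) -> Tuple[float, float]:
    """Image-plane velocity of a static point under pure camera rotation.

    (xn, yn) are normalized image coordinates; omega is (wx, wy, wz) in rad/s.
    """
    wx, wy, wz = omega
    u = wx * xn * yn - wy * (1.0 + xn * xn) + wz * yn
    v = wx * (1.0 + yn * yn) - wy * xn * yn - wz * xn
    return u, v


def compensate_rotation(events: List[Event], omega: Tuple[float, float, float],
                        t_ref: float, fx: float, fy: float,
                        cx: float, cy: float) -> List[Event]:
    """Warp each event to where it would have appeared at t_ref (first order)."""
    warped = []
    for x, y, t, p in events:
        xn = (x - cx) / fx
        yn = (y - cy) / fy
        u, v = rotational_flow(xn, yn, omega)
        dt = t - t_ref
        # Subtract the displacement accumulated by rotation since t_ref.
        warped.append(((xn - u * dt) * fx + cx, (yn - v * dt) * fy + cy, t_ref, p))
    return warped
```

After such a warp, events from static background edges tend to align, while events from an independently moving target do not, which is the cue a detector could exploit.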
Where Pith is reading between the lines
- The same motion-on-motion challenge appears in other moving-platform settings such as ground vehicles detecting small aerial objects.
- Label propagation from low-rate boxes could be replaced by higher-rate optical-flow or event-warping methods to test sensitivity of reported numbers.
- Successful detectors on this benchmark would directly support collision-avoidance pipelines that run on lightweight event hardware.
- Extending the scenes to include night or adverse weather would reveal whether the observed limitations generalize beyond the four families tested.
Load-bearing premise
Labels obtained by temporally propagating 10 Hz bounding-box annotations accurately represent the true locations of the tiny UAV at the event level amid dense ego-motion events.
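To make the premise concrete, here is a minimal sketch of one possible propagation scheme: linear interpolation of box corners between consecutive 10 Hz keyframes, with every event inside the interpolated box marked as foreground. Linear interpolation is an assumption made for illustration; this page does not state which propagation model the authors use, and the function names and data layout are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]    # (x_min, y_min, x_max, y_max)
Event = Tuple[float, float, float, int]    # (x, y, t_seconds, polarity)


def interpolate_box(box_a: Box, box_b: Box, t_a: float, t_b: float, t: float) -> Box:
    """Linearly interpolate box corners between two annotated timestamps."""
    alpha = (t - t_a) / (t_b - t_a)
    return tuple(a + alpha * (b - a) for a, b in zip(box_a, box_b))


def label_events(events: List[Event], keyframes: List[Tuple[float, Box]]) -> List[int]:
    """Return 1 for events inside the propagated box at their timestamp, else 0.

    keyframes must be sorted by time; events outside the annotated span get 0.
    """
    labels = []
    for x, y, t, _p in events:
        label = 0
        for (t_a, box_a), (t_b, box_b) in zip(keyframes, keyframes[1:]):
            if t_a <= t <= t_b:
                x0, y0, x1, y1 = interpolate_box(box_a, box_b, t_a, t_b, t)
                label = int(x0 <= x <= x1 and y0 <= y <= y1)
                break
        labels.append(label)
    return labels


# Toy keyframes 0.1 s apart (10 Hz), not benchmark data.
keyframes = [(0.0, (10.0, 10.0, 14.0, 14.0)), (0.1, (20.0, 10.0, 24.0, 14.0))]
events = [(12, 12, 0.0, 1), (12, 12, 0.05, 1), (17, 12, 0.05, -1), (22, 12, 0.1, 1)]
labels = label_events(events, keyframes)
```

Two failure modes follow directly from this sketch: a target that does not move linearly between keyframes falls outside the interpolated box, and any background event that happens to lie inside the box is labeled as foreground.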
What would settle it
A subset of events manually labeled at microsecond precision shows large spatial or temporal mismatch with the propagated 10 Hz bounding-box labels.
Original abstract
Tiny UAV detection from an onboard event camera is difficult when the observer and target move at the same time. In this motion-on-motion regime, ego-motion activates background edges across buildings, vegetation, and horizon structures, while the UAV may appear as a sparse event cluster. Unlike static- or ground-observer event-based UAV detection, onboard UAV-view detection breaks the clean-background assumption because sensor ego-motion can activate dense background events over the entire field of view. To explore this practical problem, we present M²E-UAV, to the best of our knowledge, the first onboard UAV-view motion-on-motion event-based dataset and benchmark for tiny UAV detection, where both the sensing platform and the target UAV are moving. M²E-UAV provides synchronized event streams and IMU measurements collected from an onboard sensing platform, together with event-level UAV foreground labels derived from temporally propagated 10 Hz bounding-box annotations. The processed benchmark contains 87,223 training samples and 21,395 validation samples across four scene families: sunny building-forest, sunny farm-village, sunset building-forest, and sunset farm-village. We define a train/validation split and an evaluation protocol for comparing representative existing baselines across event-frame, voxel-grid, and point-set representations, with optional IMU input. The benchmark results show that existing baselines remain limited under sparse tiny-target evidence and dense ego-motion-induced background events. Code and benchmark files will be released at https://github.com/Wickyan/M2E-UAV.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces M²E-UAV as the first onboard UAV-view motion-on-motion event-based dataset and benchmark for tiny UAV detection. It supplies synchronized event streams and IMU data collected from a moving platform, together with event-level foreground labels obtained by temporally propagating 10 Hz bounding-box annotations. The processed benchmark contains 87,223 training and 21,395 validation samples across four scene families (sunny building-forest, sunny farm-village, sunset building-forest, sunset farm-village), defines a train/validation split and evaluation protocol, and reports baseline results for event-frame, voxel-grid, and point-set representations (with optional IMU) showing that existing methods remain limited under sparse target events and dense ego-motion background events.
Significance. If the propagated labels prove reliable, the work supplies a concrete, reproducible benchmark that directly addresses the practical gap between static-observer and onboard motion-on-motion regimes in event-based UAV detection. The explicit sample counts, defined splits, four-scene coverage, and planned public release of code and benchmark files constitute clear strengths that would enable standardized future comparisons.
major comments (1)
- [Dataset construction and label generation] The benchmark's validity rests on the claim that temporally propagated 10 Hz bounding-box annotations produce accurate event-level foreground labels for the tiny UAV. In the motion-on-motion setting the paper emphasizes, ego-motion generates dense background events while the target produces sparse clusters; any interpolation or motion-model error during propagation can therefore shift the assigned label region relative to actual event locations. Because all reported baseline scores (event-frame, voxel-grid, point-set) are computed against these labels, systematic misalignment would render the headline finding—that existing methods remain limited—uninterpretable. The manuscript should supply quantitative validation of label fidelity (e.g., manual audit statistics, comparison against higher-rate ground truth, or event-density overlap metrics) in the dataset-construction section.
minor comments (2)
- [Benchmark statistics] Event density or background-event-rate statistics per scene family are not reported; adding these would help readers assess the claimed difficulty of the motion-on-motion regime.
- [Evaluation protocol] The abstract states that IMU input is optional for baselines, yet no implementation details or ablation results on IMU fusion are provided; clarifying this would strengthen the evaluation protocol description.
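The per-scene statistics requested in the first minor comment could be computed along these lines. This is a sketch under assumptions: the record layout `(scene_family, num_events, num_foreground_events, duration_seconds)` is hypothetical, and the numbers in the example are invented, not taken from the benchmark.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical per-sample record:
# (scene_family, num_events, num_foreground_events, duration_seconds)
Sample = Tuple[str, int, int, float]


def scene_statistics(samples: Iterable[Sample]) -> Dict[str, Dict[str, float]]:
    """Aggregate event rate (events/s) and foreground fraction per scene family."""
    totals = defaultdict(lambda: [0, 0, 0.0])  # events, foreground events, seconds
    for scene, n_events, n_fg, duration in samples:
        totals[scene][0] += n_events
        totals[scene][1] += n_fg
        totals[scene][2] += duration
    stats = {}
    for scene, (n_events, n_fg, seconds) in totals.items():
        stats[scene] = {
            "event_rate_hz": n_events / seconds if seconds > 0 else 0.0,
            "foreground_fraction": n_fg / n_events if n_events > 0 else 0.0,
        }
    return stats


# Invented numbers for illustration only.
samples = [
    ("sunny building-forest", 40000, 120, 0.1),
    ("sunny building-forest", 60000, 80, 0.1),
    ("sunset farm-village", 10000, 50, 0.1),
]
stats = scene_statistics(samples)
```

Reporting the foreground fraction next to the raw event rate would show how sparse the target evidence is relative to the ego-motion background in each scene family.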
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the benchmark's potential value. We address the single major comment below by agreeing to strengthen the label-validation section.
Point-by-point responses
Referee: The benchmark's validity rests on the claim that temporally propagated 10 Hz bounding-box annotations produce accurate event-level foreground labels for the tiny UAV. In the motion-on-motion setting the paper emphasizes, ego-motion generates dense background events while the target produces sparse clusters; any interpolation or motion-model error during propagation can therefore shift the assigned label region relative to actual event locations. Because all reported baseline scores (event-frame, voxel-grid, point-set) are computed against these labels, systematic misalignment would render the headline finding—that existing methods remain limited—uninterpretable. The manuscript should supply quantitative validation of label fidelity (e.g., manual audit statistics, comparison against higher-rate ground truth, or event-density overlap metrics) in the dataset-construction section.
Authors: We agree that quantitative validation of the propagated labels is necessary to support the benchmark's reliability. The original manuscript describes the 10 Hz bounding-box propagation procedure but does not report fidelity metrics. In the revision we will add a dedicated subsection that includes: (1) a manual audit on 500 randomly sampled frames reporting pixel-level agreement between propagated labels and human annotations (with inter-annotator agreement), and (2) event-density overlap statistics (precision/recall of events falling inside versus outside the propagated boxes) stratified by scene family and ego-motion speed. These additions will directly address the concern about potential misalignment under motion-on-motion conditions.
Revision: yes
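The stratified precision/recall audit promised in the rebuttal might look like the following sketch. It treats human-audited event labels as the reference and propagated labels as the prediction. That framing, the record layout, and the stratum names are assumptions; the rebuttal does not specify the exact computation.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def precision_recall(propagated: List[int], audited: List[int]) -> Tuple[float, float]:
    """Event-level precision and recall of propagated labels against audited labels."""
    tp = sum(1 for p, a in zip(propagated, audited) if p and a)
    fp = sum(1 for p, a in zip(propagated, audited) if p and not a)
    fn = sum(1 for p, a in zip(propagated, audited) if not p and a)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def stratified_audit(
    records: Iterable[Tuple[str, List[int], List[int]]]
) -> Dict[str, Tuple[float, float]]:
    """Pool event labels per stratum (e.g. scene family), then score each stratum."""
    pooled = defaultdict(lambda: ([], []))
    for stratum, propagated, audited in records:
        pooled[stratum][0].extend(propagated)
        pooled[stratum][1].extend(audited)
    return {s: precision_recall(p, a) for s, (p, a) in pooled.items()}


# Toy labels, not audit results.
records = [
    ("sunny farm-village", [1, 1, 0], [1, 0, 0]),
    ("sunny farm-village", [0, 1], [1, 1]),
    ("sunset building-forest", [1, 0], [1, 0]),
]
audit = stratified_audit(records)
```

Low precision in a stratum would indicate that propagated boxes sweep in background events; low recall would indicate that true target events fall outside the boxes.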
Circularity Check
No circularity: dataset construction with no derivations or fitted predictions
The paper introduces a new benchmark dataset M²E-UAV consisting of event streams, IMU data, and labels obtained by propagating 10 Hz bounding-box annotations. No mathematical derivations, equations, or parameter-fitting steps are present in the provided text. The label generation is a standard annotation pipeline rather than a claimed 'prediction' or 'first-principles result' that reduces to its own inputs. Baseline comparisons are scored against the released labels, but this does not create circularity because the paper makes no theoretical claim that loops back to the labels by construction. The contribution is empirical and self-contained; external verification is possible via the promised code and data release. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling appear. This matches the default expectation of no significant circularity for a dataset/benchmark paper.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: Event cameras generate sparse asynchronous events corresponding to logarithmic brightness changes above a threshold.
- Domain assumption: IMU measurements can be synchronized with event streams to provide ego-motion information.
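The first assumption is usually written as the standard idealized event-generation model, given here for reference. The symbols ($L$ for log intensity, $C$ for the contrast threshold, $\Delta t_k$ for the time since the last event at the same pixel) are introduced for this note and do not appear on the page itself.

```latex
\begin{align}
  L(\mathbf{x}, t) &= \log I(\mathbf{x}, t), \\
  \Delta L(\mathbf{x}_k, t_k) &= L(\mathbf{x}_k, t_k) - L(\mathbf{x}_k, t_k - \Delta t_k), \\
  \text{an event } e_k = (\mathbf{x}_k, t_k, p_k) &\text{ fires when } \lvert \Delta L(\mathbf{x}_k, t_k) \rvert \ge C,
  \quad p_k = \operatorname{sign}\bigl(\Delta L(\mathbf{x}_k, t_k)\bigr).
\end{align}
```

Under this model a moving camera triggers events at every sufficiently contrasted edge in view, which is why ego-motion produces dense background activity while a tiny target contributes only a few events.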