pith. machine review for the scientific record.

arxiv: 2605.10496 · v2 · submitted 2026-05-11 · 💻 cs.CV

M²E-UAV: A Benchmark and Analysis for Onboard Motion-on-Motion Event-Based Tiny UAV Detection

Pith reviewed 2026-05-15 04:55 UTC · model grok-4.3

classification 💻 cs.CV
keywords event-based vision, UAV detection, tiny object detection, motion-on-motion, onboard camera, benchmark dataset, ego-motion, neuromorphic sensing

The pith

The first onboard event-camera benchmark for tiny UAV detection under mutual motion shows current methods fail amid dense ego-motion noise.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents M²E-UAV as the first dataset capturing event streams from a moving sensing platform while a tiny UAV also moves. In this motion-on-motion setting, the camera's own motion triggers background events across buildings, vegetation, and horizons, leaving the UAV as only a sparse event cluster. The authors supply synchronized event data, IMU readings, and labels created by propagating 10 Hz bounding boxes to event level, spanning four scene families with 87,223 training and 21,395 validation samples. They evaluate representative baselines using event-frame, voxel-grid, and point-set inputs, with and without IMU, and find all remain limited when tiny target evidence must be separated from heavy background clutter. This setup directly tests whether event-based perception can support onboard UAV tasks such as collision avoidance in realistic dynamic flight.

Core claim

M²E-UAV supplies the first synchronized event streams and IMU measurements collected from an onboard platform together with event-level UAV labels derived from temporally propagated 10 Hz bounding-box annotations. The processed benchmark contains 87,223 training samples and 21,395 validation samples across sunny building-forest, sunny farm-village, sunset building-forest, and sunset farm-village scenes. Defined train/validation splits and an evaluation protocol allow comparison of existing baselines across event-frame, voxel-grid, and point-set representations with optional IMU input; these baselines prove limited when sparse tiny-target events must be distinguished from dense ego-motion–cau

What carries the argument

The M²E-UAV dataset and its evaluation protocol, which together supply moving-platform event streams, IMU data, and temporally propagated labels for measuring how well detectors find sparse target clusters inside dense ego-motion background events.

If this is right

  • Detection algorithms must explicitly separate sparse target clusters from dense background events generated by platform motion.
  • Optional IMU input can be used to model and subtract ego-motion, but current baselines do not yet exploit it effectively.
  • Standardized splits and metrics enable direct comparison of new representations or architectures on the same motion-on-motion data.
  • Performance gaps indicate that existing event-frame and voxel methods lose tiny targets when background activity is high.
  • Real-world onboard UAV perception requires robustness to mutual motion rather than the clean-background regime assumed in prior work.
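
The event-frame and voxel-grid inputs named in these bullets can be illustrated with a short sketch. It is a minimal illustration, not the paper's preprocessing: the function names, the `(x, y, t, p)` tuple layout with integer microsecond timestamps, the two-channel polarity image, and the hard time-bin assignment are all assumptions made here for brevity.

```python
from typing import List, Tuple

# Hypothetical event layout: pixel x, pixel y, timestamp in microseconds, polarity in {0, 1}.
Event = Tuple[int, int, int, int]


def event_frame(events: List[Event], width: int, height: int) -> List[List[List[int]]]:
    """Accumulate events into a 2-channel count image indexed [polarity][y][x]."""
    frame = [[[0] * width for _ in range(height)] for _ in range(2)]
    for x, y, _t, p in events:
        frame[p][y][x] += 1
    return frame


def voxel_grid(events: List[Event], width: int, height: int, bins: int) -> List[List[List[int]]]:
    """Count events per (time bin, y, x) cell; polarity is ignored in this sketch."""
    grid = [[[0] * width for _ in range(height)] for _ in range(bins)]
    if not events:
        return grid
    t0 = min(e[2] for e in events)
    t1 = max(e[2] for e in events)
    span = (t1 - t0) or 1  # avoid division by zero when all timestamps coincide
    for x, y, t, _p in events:
        # Hard assignment to a time bin; the last timestamp is clamped into the final bin.
        b = min(int((t - t0) / span * bins), bins - 1)
        grid[b][y][x] += 1
    return grid
```

When a window holds a handful of target events and a much larger number of background events, both representations leave the target as a few low-count cells, which is the regime the bullets above describe.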

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same motion-on-motion challenge appears in other moving-platform settings such as ground vehicles detecting small aerial objects.
  • Label propagation from low-rate boxes could be replaced by higher-rate optical-flow or event-warping methods to test sensitivity of reported numbers.
  • Successful detectors on this benchmark would directly support collision-avoidance pipelines that run on lightweight event hardware.
  • Extending the scenes to include night or adverse weather would reveal whether the observed limitations generalize beyond the four families tested.

Load-bearing premise

Labels obtained by temporally propagating 10 Hz bounding-box annotations accurately represent the true locations of the tiny UAV at the event level amid dense ego-motion events.
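
One way to picture this premise is the sketch below, which blends a box between two 10 Hz keyframes and marks an event as foreground when it falls inside the box interpolated at the event's own timestamp. Linear interpolation is an assumption made here for illustration; the source does not specify the propagation model, and the names `Box`, `interpolate_box`, and `label_events` are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), hypothetical layout


def interpolate_box(t: float, t0: float, box0: Box, t1: float, box1: Box) -> Box:
    """Linearly blend two keyframe boxes at time t, clamped to the interval [t0, t1]."""
    if t1 <= t0:
        return box0
    a = min(max((t - t0) / (t1 - t0), 0.0), 1.0)
    return tuple((1.0 - a) * c0 + a * c1 for c0, c1 in zip(box0, box1))


def label_events(events, t0: float, box0: Box, t1: float, box1: Box) -> List[int]:
    """Return 1 for each (x, y, t, p) event inside the box interpolated at its timestamp, else 0."""
    labels = []
    for x, y, t, _p in events:
        x_min, y_min, x_max, y_max = interpolate_box(t, t0, box0, t1, box1)
        labels.append(int(x_min <= x <= x_max and y_min <= y <= y_max))
    return labels
```

Under this assumed model, any error in the interpolated trajectory appears as a box that leads or lags the true target, so events from the tiny UAV can be labeled background while ego-motion events inside the box are labeled foreground.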

What would settle it

A subset of events manually labeled at microsecond precision shows large spatial or temporal mismatch with the propagated 10 Hz bounding-box labels.

Figures

Figures reproduced from arXiv: 2605.10496 by Cheng Wang, Lixin Chen, Weiqi Yan, Xiangrui Hou, Yangyang Shi, Youbiao Wang, Yu Zang, Zhipeng Cai.

Figure 1: Teaser of the M²E-UAV onboard motion-on-motion setting. Left: schematic and real data-collection scene in which a carrier UAV observes a tiny target UAV. Right: actual onboard event-camera data under sensor ego-motion, where background structures induce dense event clutter around sparse target evidence.
Figure 2: M²E-UAV data collection and benchmark construction. The top row shows the onboard sensing platform, including the carrier UAV, event camera, IMU, vibration-damping mount, and STM32-based trigger synchronization board. The bottom diagram summarizes how synchronized event and IMU streams are organized into matched event-IMU packets and paired with event-level UAV/background labels for benchmark evaluation. …
Figure 3: Scene-family showcase. The montage covers sunny/sunset illumination and building-forest/farm-village backgrounds.
Figure 4: Method-level overview of M2E-Point and its optional IMU-conditioned variant. The event branch samples [x, y, t, p] points, extracts local event features with EdgeConv, and aggregates them into a global packet feature. The optional IMU-conditioned variant modulates the global feature via FiLM. Local features and the packet feature are fused for event-level foreground scoring, and DBSCAN converts predicted f…
Figure 5: Detailed packet-level visualization of point-level prediction and clustering. Foreground-score colors indicate the model-estimated UAV foreground probability for each sampled event, from low-score blue/green points to medium-score yellow points and high-score red/purple points. In the DBSCAN view, light gray points are sampled inference points, dark gray or black points are unclustered foreground candidate…
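
The clustering stage mentioned in the Figure 4 and Figure 5 captions can be sketched in a few lines. This is a minimal pure-Python DBSCAN over pixel coordinates, written for illustration only; the `eps` and `min_pts` parameters, the function names, and the step that turns each cluster into a bounding box are assumptions, not the authors' settings or code.

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]


def region_query(points: List[Point], i: int, eps: float) -> List[int]:
    """Indices of all points within eps of points[i], including i itself."""
    xi, yi = points[i]
    return [j for j, (x, y) in enumerate(points)
            if (x - xi) ** 2 + (y - yi) ** 2 <= eps * eps]


def dbscan(points: List[Point], eps: float, min_pts: int) -> List[int]:
    """Return one cluster id per point; -1 marks noise (unclustered candidates)."""
    labels: List[Optional[int]] = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = -1  # provisional noise; may be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point, so keep expanding
    return labels


def cluster_boxes(points: List[Point], labels: List[int]) -> Dict[int, Tuple[float, float, float, float]]:
    """Axis-aligned box (x_min, y_min, x_max, y_max) for each non-noise cluster id."""
    boxes: Dict[int, Tuple[float, float, float, float]] = {}
    for (x, y), c in zip(points, labels):
        if c < 0:
            continue
        x0, y0, x1, y1 = boxes.get(c, (x, y, x, y))
        boxes[c] = (min(x0, x), min(y0, y), max(x1, x), max(y1, y))
    return boxes
```

Points labeled `-1` play the role of the unclustered foreground candidates described in the Figure 5 caption: they passed the foreground score but have too few close neighbors to form a cluster.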
Original abstract

Tiny UAV detection from an onboard event camera is difficult when the observer and target move at the same time. In this motion-on-motion regime, ego-motion activates background edges across buildings, vegetation, and horizon structures, while the UAV may appear as a sparse event cluster. Unlike static- or ground-observer event-based UAV detection, onboard UAV-view detection breaks the clean-background assumption because sensor ego-motion can activate dense background events over the entire field of view. To explore this practical problem, we present M²E-UAV, to the best of our knowledge, the first onboard UAV-view motion-on-motion event-based dataset and benchmark for tiny UAV detection, where both the sensing platform and the target UAV are moving. M²E-UAV provides synchronized event streams and IMU measurements collected from an onboard sensing platform, together with event-level UAV foreground labels derived from temporally propagated 10 Hz bounding-box annotations. The processed benchmark contains 87,223 training samples and 21,395 validation samples across four scene families: sunny building-forest, sunny farm-village, sunset building-forest, and sunset farm-village. We define a train/validation split and an evaluation protocol for comparing representative existing baselines across event-frame, voxel-grid, and point-set representations, with optional IMU input. The benchmark results show that existing baselines remain limited under sparse tiny-target evidence and dense ego-motion-induced background events. Code and benchmark files will be released at https://github.com/Wickyan/M2E-UAV.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces M²E-UAV as the first onboard UAV-view motion-on-motion event-based dataset and benchmark for tiny UAV detection. It supplies synchronized event streams and IMU data collected from a moving platform, together with event-level foreground labels obtained by temporally propagating 10 Hz bounding-box annotations. The processed benchmark contains 87,223 training and 21,395 validation samples across four scene families (sunny building-forest, sunny farm-village, sunset building-forest, sunset farm-village), defines a train/validation split and evaluation protocol, and reports baseline results for event-frame, voxel-grid, and point-set representations (with optional IMU) showing that existing methods remain limited under sparse target events and dense ego-motion background events.

Significance. If the propagated labels prove reliable, the work supplies a concrete, reproducible benchmark that directly addresses the practical gap between static-observer and onboard motion-on-motion regimes in event-based UAV detection. The explicit sample counts, defined splits, four-scene coverage, and planned public release of code and benchmark files constitute clear strengths that would enable standardized future comparisons.

major comments (1)
  1. [Dataset construction and label generation] The benchmark's validity rests on the claim that temporally propagated 10 Hz bounding-box annotations produce accurate event-level foreground labels for the tiny UAV. In the motion-on-motion setting the paper emphasizes, ego-motion generates dense background events while the target produces sparse clusters; any interpolation or motion-model error during propagation can therefore shift the assigned label region relative to actual event locations. Because all reported baseline scores (event-frame, voxel-grid, point-set) are computed against these labels, systematic misalignment would render the headline finding—that existing methods remain limited—uninterpretable. The manuscript should supply quantitative validation of label fidelity (e.g., manual audit statistics, comparison against higher-rate ground truth, or event-density overlap metrics) in the dataset-construction section.
minor comments (2)
  1. [Benchmark statistics] Event density or background-event-rate statistics per scene family are not reported; adding these would help readers assess the claimed difficulty of the motion-on-motion regime.
  2. [Evaluation protocol] The abstract states that IMU input is optional for baselines, yet no implementation details or ablation results on IMU fusion are provided; clarifying this would strengthen the evaluation protocol description.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the benchmark's potential value. We address the single major comment below by agreeing to strengthen the label-validation section.

Point-by-point responses
  1. Referee: The benchmark's validity rests on the claim that temporally propagated 10 Hz bounding-box annotations produce accurate event-level foreground labels for the tiny UAV. In the motion-on-motion setting the paper emphasizes, ego-motion generates dense background events while the target produces sparse clusters; any interpolation or motion-model error during propagation can therefore shift the assigned label region relative to actual event locations. Because all reported baseline scores (event-frame, voxel-grid, point-set) are computed against these labels, systematic misalignment would render the headline finding—that existing methods remain limited—uninterpretable. The manuscript should supply quantitative validation of label fidelity (e.g., manual audit statistics, comparison against higher-rate ground truth, or event-density overlap metrics) in the dataset-construction section.

    Authors: We agree that quantitative validation of the propagated labels is necessary to support the benchmark's reliability. The original manuscript describes the 10 Hz bounding-box propagation procedure but does not report fidelity metrics. In the revision we will add a dedicated subsection that includes: (1) a manual audit on 500 randomly sampled frames reporting pixel-level agreement between propagated labels and human annotations (with inter-annotator agreement), and (2) event-density overlap statistics (precision/recall of events falling inside versus outside the propagated boxes) stratified by scene family and ego-motion speed. These additions will directly address the concern about potential misalignment under motion-on-motion conditions. revision: yes
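
The precision/recall statistic proposed in this response can be sketched as follows. The reference labels stand in for a manual audit that does not yet exist in the source, and the function name and the 0/1 label encoding are assumptions made for illustration.

```python
from typing import Sequence, Tuple


def precision_recall(propagated: Sequence[int], reference: Sequence[int]) -> Tuple[float, float]:
    """Precision and recall of propagated foreground labels (1) against reference labels."""
    if len(propagated) != len(reference):
        raise ValueError("label sequences must have the same length")
    tp = sum(1 for p, r in zip(propagated, reference) if p == 1 and r == 1)
    fp = sum(1 for p, r in zip(propagated, reference) if p == 1 and r == 0)
    fn = sum(1 for p, r in zip(propagated, reference) if p == 0 and r == 1)
    # Return 0.0 when a denominator is empty instead of raising ZeroDivisionError.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Low precision here would mean the propagated boxes sweep in many ego-motion events, while low recall would mean true target events fall outside the boxes. Computing both per scene family and per ego-motion speed bin, as the response proposes, would show where any misalignment concentrates.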

Circularity Check

0 steps flagged

No circularity: dataset construction with no derivations or fitted predictions

full rationale

The paper introduces a new benchmark dataset M²E-UAV consisting of event streams, IMU data, and labels obtained by propagating 10 Hz bounding-box annotations. No mathematical derivations, equations, or parameter-fitting steps are present in the provided text. The label generation is a standard annotation pipeline rather than a claimed 'prediction' or 'first-principles result' that reduces to its own inputs. Baseline comparisons are scored against the released labels, but this does not create circularity because the paper makes no theoretical claim that loops back to the labels by construction. The contribution is empirical and self-contained; external verification is possible via the promised code and data release. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling appear. This matches the default expectation of no significant circularity for a dataset/benchmark paper.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work rests on standard event-camera and IMU assumptions already established in the literature; no new free parameters, ad-hoc axioms, or invented physical entities are introduced.

axioms (2)
  • domain assumption: Event cameras generate sparse asynchronous events corresponding to logarithmic brightness changes above a threshold.
    Invoked implicitly when treating event streams as the primary input modality.
  • domain assumption: IMU measurements can be synchronized with event streams to provide ego-motion information.
    Used when offering optional IMU input to baselines.


