
arxiv: 2605.16414 · v1 · pith:CUG77OX4new · submitted 2026-05-13 · 💻 cs.CV

NERVE: A Neuromorphic Vision and Radar Ensemble for Multi-Sensor Fusion Research

Pith reviewed 2026-05-20 20:44 UTC · model grok-4.3

classification 💻 cs.CV
keywords multi-sensor fusion · neuromorphic vision · DVS · radar · human detection · distance estimation · dataset · recurrent models

The pith

Combining DVS with 77 GHz radar consistently improves human detection, with recurrent models reaching up to 47.5% mAP and mean absolute radar distance errors staying below 1.8 m against LiDAR ground truth.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the NERVE dataset: 257 minutes of synchronized recordings from two Dynamic Vision Sensors, an RGB-D camera, and two radar units (24 GHz and 77 GHz), captured across twelve measurement days in office environments. The full dataset holds around 914,000 frames and around 9.6 million COCO-formatted annotations covering 16 object categories. From it the authors construct a DVS-plus-radar subset to test multi-modal fusion for human detection and distance estimation. Baseline runs with feed-forward and recurrent detectors show that adding 77 GHz radar to DVS input consistently improves detection, with recurrent models reaching up to 47.5 percent mean average precision. Mean absolute radar distance errors stay below 1.8 meters against LiDAR ground truth. The work therefore supplies a concrete testbed for examining how event-based vision and radar measurements can be combined.

Core claim

The central claim is that fusing Dynamic Vision Sensor streams with 77 GHz radar data in the NERVE dataset consistently raises human-detection performance, with recurrent models attaining up to 47.5 percent mean average precision while radar-derived distance estimates remain below 1.8 m mean absolute error against LiDAR ground truth.
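To make the ranging half of the claim concrete, here is a minimal sketch of mean absolute error over paired radar and LiDAR distance readings. The function name and the sample values are illustrative assumptions, not taken from the paper, and the authors' evaluation code may differ.

```python
from statistics import fmean


def mean_absolute_error(estimates_m, references_m):
    """Mean absolute error between paired distance readings, in meters."""
    if len(estimates_m) != len(references_m):
        raise ValueError("estimate and reference lists must have equal length")
    if not estimates_m:
        raise ValueError("need at least one paired reading")
    return fmean(abs(e - r) for e, r in zip(estimates_m, references_m))


# Hypothetical readings for illustration only; these are not NERVE data.
radar_m = [2.0, 3.5, 5.0, 7.5]
lidar_m = [2.5, 3.0, 6.0, 7.5]

mae = mean_absolute_error(radar_m, lidar_m)
print(f"MAE = {mae:.2f} m")  # prints "MAE = 0.50 m"
```

A mean absolute error weights every frame equally, so a figure such as "below 1.8 m" says nothing by itself about how the error varies with target range.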

What carries the argument

The DVS-plus-77 GHz radar subset processed by feed-forward and recurrent detectors, which isolates the contribution of each modality to detection and ranging.

If this is right

  • Recurrent detectors make better use of the temporal structure in DVS and radar streams than feed-forward detectors.
  • 77 GHz radar supplies a stronger complementary signal for detection than 24 GHz radar when paired with DVS.
  • The full dataset with its 16 object categories supports extension of the same fusion evaluation beyond the human-detection task.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The reported gains may shrink when models trained on office data encounter outdoor motion or lighting changes.
  • Including the RGB-D camera already present in the recordings could further tighten distance estimates or raise detection scores.
  • The scale of the synchronized recordings invites direct comparison of early versus late fusion architectures on the same data.
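The early-versus-late distinction in the last point can be sketched in a few lines. This is an editorial illustration under stated assumptions: the grid inputs, the function names, and the weighted-average rule are hypothetical and do not describe the fusion scheme used in the paper.

```python
def early_fusion(dvs_frame, radar_map):
    """Early fusion: stack two same-sized 2D grids into per-cell channels.

    A single detector would then consume the stacked [dvs, radar] input.
    """
    if len(dvs_frame) != len(radar_map) or any(
        len(d_row) != len(r_row) for d_row, r_row in zip(dvs_frame, radar_map)
    ):
        raise ValueError("both grids must have the same shape")
    return [
        [[d, r] for d, r in zip(d_row, r_row)]
        for d_row, r_row in zip(dvs_frame, radar_map)
    ]


def late_fusion(dvs_score, radar_score, dvs_weight=0.5):
    """Late fusion: blend confidences produced by two separate detectors."""
    if not 0.0 <= dvs_weight <= 1.0:
        raise ValueError("dvs_weight must lie in [0, 1]")
    return dvs_weight * dvs_score + (1.0 - dvs_weight) * radar_score


# Hypothetical toy inputs, not NERVE data.
dvs = [[1, 0], [0, 1]]                # e.g. event counts per cell
radar = [[0.2, 0.4], [0.6, 0.8]]      # e.g. radar intensity per cell

fused_input = early_fusion(dvs, radar)   # one input for one detector
fused_score = late_fusion(0.9, 0.5)      # one score from two detectors
print(fused_input[0][0], round(fused_score, 2))
```

Early fusion lets one model learn cross-modal features but requires spatially aligned inputs; late fusion tolerates misalignment but can only combine decisions.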

Load-bearing premise

Recordings made in office settings with standard COCO annotations supply a representative ground truth for multi-sensor human detection and distance estimation in general conditions.

What would settle it

Running the identical fusion models on recordings from non-office environments or with independent ranging ground truth and observing whether mAP and distance errors remain comparable would test the claim.

Figures

Figures reproduced from arXiv: 2605.16414 by Amirreza Yousefzadeh, Ethan Milon, Guangzhi Tang, Manolis Sifalakis, Omar Mansour, Pietro Martinello, YingFu Xu.

Figure 1: Overview of the NERVE dataset: (a) the multi-sensor acquisition setup; (b) an example fused frame with annotations.
Figure 2: Block diagram of the automatic annotation pipeline:
Figure 3: NERVE dataset distributions: (a) spatial distribution of bounding box centers for class
Original abstract

We present NERVE (Neuromorphic Vision and Radar Ensemble), a multi-sensor dataset comprising 257 minutes of synchronized recordings from five sensors: two Dynamic Vision Sensors (DVS), an RGB-D camera, and two Radar units (24GHz and 77GHz). Captured across 12 measurement days in office environments, NERVE contains around 600GB of uncompressed temporally aligned data with around 914,000 frames and around 9.6 million RGB COCO-formatted annotations covering 16 relevant object categories. To evaluate multi-modal fusion, we construct a DVS+Radar subset for human detection and distance estimation. Baseline experiments using feed-forward and recurrent detectors show that combining DVS with 77GHz Radar consistently improves detection, with recurrent models achieving up to 47.5% mAP and mean absolute Radar distance errors below 1.8m against LiDAR ground truth.
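For readers unfamiliar with the annotation format the abstract names, the sketch below shows a COCO-style record and the intersection-over-union overlap on which mAP scoring is built. The COCO `bbox` convention of `[x, y, width, height]` is standard; the record values and ids are hypothetical and are not drawn from NERVE.

```python
def coco_to_corners(box):
    """Convert a COCO [x, y, width, height] box to (x1, y1, x2, y2)."""
    x, y, w, h = box
    return x, y, x + w, y + h


def iou(box_a, box_b):
    """Intersection over union of two COCO-format boxes."""
    ax1, ay1, ax2, ay2 = coco_to_corners(box_a)
    bx1, by1, bx2, by2 = coco_to_corners(box_b)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


# Hypothetical annotation record; ids and coordinates are made up.
annotation = {"image_id": 1, "category_id": 1, "bbox": [0, 0, 10, 10]}
prediction_bbox = [5, 5, 10, 10]

overlap = iou(annotation["bbox"], prediction_bbox)
print(f"IoU = {overlap:.3f}")  # prints "IoU = 0.143"
```

A detection counts as correct only when its overlap with an annotated box exceeds a chosen threshold, which is why annotation quality bears directly on the reported mAP.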

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces the NERVE multi-sensor dataset featuring synchronized recordings from two Dynamic Vision Sensors (DVS), an RGB-D camera, and two radar units operating at 24GHz and 77GHz. The dataset includes 257 minutes of data from office environments, approximately 914,000 frames, and 9.6 million COCO-formatted annotations for 16 object categories. Baseline experiments on DVS and 77GHz radar fusion for human detection and distance estimation demonstrate improved performance, with recurrent models reaching 47.5% mAP and mean absolute distance errors below 1.8 m using LiDAR as ground truth.

Significance. Should the experimental details be clarified and the ground truth methodology validated, the NERVE dataset could serve as a significant contribution to multi-sensor fusion research in computer vision, particularly for neuromorphic and radar modalities. The large scale and temporal alignment of the data, along with the provision of baseline results, offer a foundation for developing and evaluating fusion algorithms. The work highlights potential benefits of combining event-based vision with radar for detection tasks in indoor settings.

major comments (3)
  1. [Abstract] Abstract: The claim of mean absolute Radar distance errors below 1.8m is made against LiDAR ground truth, but the described sensor suite consists only of two DVS, one RGB-D camera, and two Radar units. The manuscript should detail the acquisition, synchronization, and error characteristics of the LiDAR data used for ranging evaluation, as this reference is central to validating the distance estimation results.
  2. [Experiments] Experiments section: The baseline experiments report improvements from DVS+Radar fusion but omit model architectures, training details, hyperparameters, loss functions, and any ablation studies or statistical tests. These omissions make it difficult to assess the robustness of the 47.5% mAP and sub-1.8m error claims.
  3. [Dataset] Dataset description: While the dataset size and annotation format are specified, additional information on annotation process, inter-annotator agreement, and handling of sensor-specific challenges (e.g., DVS event noise, radar clutter) would strengthen the resource's utility.
minor comments (2)
  1. [Abstract] Abstract: The repeated use of 'around' for quantities (e.g., around 600GB, around 914,000 frames) could be replaced with more precise figures or ranges if exact counts are available.
  2. [Throughout] Throughout: Ensure consistent terminology for sensors, such as specifying '77GHz Radar' clearly in all references to avoid ambiguity with the 24GHz unit.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments and suggestions. We address each of the major comments below and will revise the manuscript to incorporate the requested clarifications and additional details.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The claim of mean absolute Radar distance errors below 1.8m is made against LiDAR ground truth, but the described sensor suite consists only of two DVS, one RGB-D camera, and two Radar units. The manuscript should detail the acquisition, synchronization, and error characteristics of the LiDAR data used for ranging evaluation, as this reference is central to validating the distance estimation results.

    Authors: We thank the referee for pointing this out. The LiDAR was used solely to provide ground truth for the distance estimation evaluation and is not included in the released dataset. In the revised manuscript, we will add a new subsection in the Dataset or Experiments section describing the LiDAR sensor model, its acquisition setup, synchronization with the other sensors using hardware triggers, and error characteristics derived from calibration and manufacturer data. revision: yes

  2. Referee: [Experiments] Experiments section: The baseline experiments report improvements from DVS+Radar fusion but omit model architectures, training details, hyperparameters, loss functions, and any ablation studies or statistical tests. These omissions make it difficult to assess the robustness of the 47.5% mAP and sub-1.8m error claims.

    Authors: We agree that these details are essential for reproducibility and assessing the claims. The revised manuscript will include comprehensive descriptions of the model architectures for both feed-forward and recurrent detectors, the training protocols, specific hyperparameters, the loss functions employed, ablation studies on the fusion components, and statistical tests (e.g., paired t-tests) to validate the significance of the performance improvements. revision: yes

  3. Referee: [Dataset] Dataset description: While the dataset size and annotation format are specified, additional information on annotation process, inter-annotator agreement, and handling of sensor-specific challenges (e.g., DVS event noise, radar clutter) would strengthen the resource's utility.

    Authors: We appreciate this recommendation to enhance the dataset's documentation. We will expand the Dataset section to detail the annotation process, including the software tools used, the number of annotators involved, and the guidelines followed. We will also report inter-annotator agreement using appropriate metrics. Furthermore, we will describe the methods used to handle DVS event noise and radar clutter during the annotation and data preparation stages. revision: yes
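The paired t-test that the second response proposes can be sketched as follows, assuming each configuration is scored on the same set of evaluation splits. The scores are made-up placeholders, not results from the paper, and the function name is hypothetical.

```python
from math import sqrt
from statistics import fmean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired samples a and b."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    spread = stdev(diffs)  # sample standard deviation of the differences
    if spread == 0:
        raise ValueError("differences have zero variance")
    return fmean(diffs) / (spread / sqrt(n)), n - 1


# Made-up per-split scores for illustration only.
fused_scores = [5.0, 7.0, 6.0, 8.0]
single_modality_scores = [4.0, 5.0, 5.0, 6.0]

t_value, dof = paired_t_statistic(fused_scores, single_modality_scores)
print(f"t = {t_value:.3f}, dof = {dof}")  # prints "t = 5.196, dof = 3"
```

The statistic is then compared against a t distribution with the returned degrees of freedom; pairing matters because both configurations see the same splits, so split difficulty cancels out of the differences.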

Circularity Check

0 steps flagged

No circularity: empirical dataset paper with direct measurements only

Full rationale

The manuscript presents a multi-sensor dataset (DVS, RGB-D, Radar) and reports baseline empirical results for human detection and ranging. No mathematical derivations, equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text or abstract. Results such as 47.5% mAP and <1.8 m errors are stated as direct measurements on collected data against an external reference, with no reduction by construction to self-defined quantities or ansatzes. The LiDAR ground-truth reference, while potentially raising separate questions of sensor enumeration, does not create a self-referential loop in any derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The contribution rests on standard data-collection assumptions rather than new theoretical constructs; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption: Temporal synchronization across all five sensors is accurate enough for frame-level fusion.
    Required for the DVS+Radar subset experiments described in the abstract.
  • domain assumption: COCO-formatted annotations constitute reliable ground truth for the 16 object categories.
    Used to compute mAP and distance error metrics.
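One way to probe the synchronization assumption is to pair each frame of one sensor with the nearest-in-time frame of another and reject pairs whose gap exceeds a tolerance. The sketch below is a generic nearest-timestamp matcher; the function name, the microsecond units, and the tolerance are assumptions, and it does not describe how NERVE was actually aligned.

```python
from bisect import bisect_left


def match_nearest(reference_ts, other_ts, max_gap):
    """For each reference timestamp, return the index of the nearest
    timestamp in the sorted list other_ts, or None if the gap exceeds
    max_gap."""
    matches = []
    for t in reference_ts:
        i = bisect_left(other_ts, t)
        # Only the neighbors on either side of the insertion point can be nearest.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(other_ts)]
        best = min(candidates, key=lambda j: abs(other_ts[j] - t), default=None)
        if best is not None and abs(other_ts[best] - t) <= max_gap:
            matches.append(best)
        else:
            matches.append(None)
    return matches


# Hypothetical timestamps in microseconds; not NERVE data.
dvs_ts = [0, 10_000, 20_000, 30_000]
radar_ts = [1_000, 19_500, 45_000]

pairs = match_nearest(dvs_ts, radar_ts, max_gap=2_000)
print(pairs)  # prints [0, None, 1, None]
```

The share of `None` entries at a given tolerance gives a rough measure of how often frame-level fusion would be working with stale data from one modality.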

pith-pipeline@v0.9.0 · 5705 in / 1328 out tokens · 81657 ms · 2026-05-20T20:44:46.389340+00:00 · methodology
