
arXiv: 2501.15151 · v5 · pith:EVAI7JWG (new) · submitted 2025-01-25 · 💻 cs.CV

SpikeDet: Better Firing Patterns for Accurate and Energy-Efficient Object Detection with Spiking Neural Networks

Pith reviewed 2026-05-23 04:58 UTC · model grok-4.3

classification 💻 cs.CV
keywords: spiking neural networks · object detection · local firing saturation · energy efficiency · COCO dataset · MDSNet · SMFM · LFSI

The pith

SpikeDet improves object detection with spiking neural networks by adjusting membrane synaptic inputs and using multi-direction feature fusion to reduce local firing saturation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that local firing saturation, where adjacent neurons concurrently hit their maximum firing rates, especially in object regions, limits both accuracy and energy savings in existing SNN detectors. It introduces MDSNet, a backbone that adjusts the distribution of membrane synaptic inputs at each layer to produce less saturated, more varied firing patterns during feature extraction. A Spiking Multi-direction Fusion Module (SMFM) in the neck then combines these features across scales and directions to preserve their discriminability. The authors also define a Local Firing Saturation Index (LFSI) to quantify the issue. On COCO 2017 this yields 52.2 percent AP at roughly half the energy of prior SNN methods, and the authors report the best results among compared methods on event-based, underwater, low-light, and crowded-scene datasets.

Core claim

By redesigning the spiking backbone to adjust membrane synaptic input distributions layer by layer and adding multi-direction fusion in the neck, SpikeDet produces firing patterns with less local saturation, which directly raises feature quality for detection while lowering total spike rates and therefore energy use.
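The core claim ties lower spike rates directly to lower energy. A minimal sketch of that arithmetic follows, assuming the estimation convention common in the SNN detection literature: each synaptic operation costs one accumulate, SOPs = firing rate × time steps × FLOPs, and the widely used 45 nm figures E_MAC = 4.6 pJ and E_AC = 0.9 pJ apply. The constants, the function names, and treating T×D as the effective number of time steps for integer-valued neurons are assumptions for illustration; this page does not reproduce SpikeDet's exact energy protocol.

```python
# Hedged sketch of a common SNN energy-estimation convention
# (not necessarily the exact protocol used for SpikeDet).
E_MAC_PJ = 4.6  # assumed energy of one 32-bit float multiply-accumulate at 45 nm (pJ)
E_AC_PJ = 0.9   # assumed energy of one 32-bit float accumulate at 45 nm (pJ)
PJ_TO_MJ = 1e-9  # 1 pJ = 1e-12 J = 1e-9 mJ


def snn_layer_energy_mj(flops: float, firing_rate: float, timesteps: int) -> float:
    """Energy (mJ) of one spike-driven layer: E_AC * SOPs, with SOPs = rate * T * FLOPs.

    For integer-valued neurons expanded into D binary steps, `timesteps` is
    assumed to be T * D.
    """
    sops = firing_rate * timesteps * flops
    return sops * E_AC_PJ * PJ_TO_MJ


def ann_layer_energy_mj(flops: float) -> float:
    """Energy (mJ) of the same layer run densely as an ANN: E_MAC * FLOPs."""
    return flops * E_MAC_PJ * PJ_TO_MJ


layer_flops = 1e9  # hypothetical layer cost
for rate in (0.2, 0.1):
    print(f"rate={rate:.1f}: SNN {snn_layer_energy_mj(layer_flops, rate, 4):.2f} mJ, "
          f"ANN {ann_layer_energy_mj(layer_flops):.2f} mJ")
```

Under this convention, energy is linear in firing rate, so halving the average rate at fixed FLOPs and time steps halves the estimate; that is the mechanism behind "lower total spike rates and therefore energy use".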

What carries the argument

MDSNet, which adjusts membrane synaptic input distribution at each layer to improve neuron firing patterns, together with the Spiking Multi-direction Fusion Module that performs multi-direction spiking feature fusion.

If this is right

  • The same firing-pattern changes produce top results on event-based GEN1, underwater URPC 2019, low-light ExDARK, and dense CrowdHuman detection.
  • On COCO 2017, energy use drops by roughly half while AP rises 3.3 points over prior SNN detectors.
  • LFSI provides a numeric way to track and target saturation during training or architecture search.
  • Multi-scale detection improves because high-quality backbone features are preserved rather than lost to uniform high firing rates.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If membrane-input redistribution is the effective lever, the same principle could be tested on spiking versions of other vision backbones or on non-detection tasks.
  • LFSI might become a standard training regularizer or early-stopping signal for any SNN model that shows saturation.
  • Hardware implementations could exploit the lower average spike rates to reduce buffer sizes or clock speeds beyond what the paper measures.
  • The approach leaves open whether similar saturation occurs in deeper or recurrent SNNs and whether the same modules would scale without retraining.

Load-bearing premise

Local firing saturation is the dominant limiter of accuracy and efficiency in current SNN detectors, and the MDSNet and SMFM changes fix it without creating new offsetting problems.

What would settle it

An ablation that removes the membrane-input adjustments from MDSNet or the multi-direction fusion from SMFM and shows detection AP and energy returning to the levels of previous SNN detectors on COCO.

Figures

Figures reproduced from arXiv: 2501.15151 by Changsong Liu, Dongze Liu, Mingyang Li, Wei Zhang, Yanyan Liu, Yimeng Fan, Yuting Su.

Figure 1: Visualization of the local firing saturation problem in an SNN-based object detector on the COCO dataset. Each pixel represents a neuron's firing rate. (a) and (b) show feature maps before the detection head, as these features determine both classification and regression. (a) averages the 4D spike tensor ([T, C, H, W]) across both the time and channel dimensions to show the overall spatial firing distribution, while (b) sele…
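Panel (a) of Figure 1 reduces a [T, C, H, W] spike tensor to a spatial firing map by averaging over time and channels. A dependency-free sketch of that reduction is below; the function name and the `d` normalizer (1 for binary spikes, D for integer-valued spike counts) are assumptions for illustration, not the authors' code.

```python
from typing import List

# Spike tensor indexed as spikes[t][c][h][w], matching the [T, C, H, W] layout in the caption.
Tensor4D = List[List[List[List[float]]]]


def spatial_firing_map(spikes: Tensor4D, d: int = 1) -> List[List[float]]:
    """Average a [T, C, H, W] spike tensor over T and C, giving an H x W firing-rate map.

    `d` is an assumed normalizer: 1 for binary spikes, or D for integer-valued
    spike counts in {0, ..., D}, so every entry of the result lies in [0, 1].
    """
    t_len, c_len = len(spikes), len(spikes[0])
    h_len, w_len = len(spikes[0][0]), len(spikes[0][0][0])
    fmap = [[0.0] * w_len for _ in range(h_len)]
    for frame in spikes:          # loop over time steps
        for channel in frame:     # loop over channels
            for h in range(h_len):
                for w in range(w_len):
                    fmap[h][w] += channel[h][w]
    norm = t_len * c_len * d
    return [[v / norm for v in row] for row in fmap]


# Toy tensor with T = 2, C = 2, H = 1, W = 2.
toy = [[[[1, 0]], [[1, 1]]], [[[1, 0]], [[0, 1]]]]
print(spatial_firing_map(toy))
```

A position whose value stays near 1.0 fires at the maximum rate in almost every channel and time step; clusters of such positions are what the figure shows as local firing saturation.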
Figure 2: Comparisons with other state-of-the-art methods on COCO AP and power consumption. Squares represent ANN-based object detectors, circles represent SNN-based object detectors, and triangles represent our methods. (a) Comparison results on the COCO 2017 dataset. (b) Comparison results on the GEN1 dataset.

…reduced feature discriminability causes multiple anchor points to produce similar confidence scores, prev…
Figure 3: The architecture of SpikeDet. SpikeDet comprises MDSNet, SMFM, and the SpikeYOLO detection head [19]. The model accepts two types of input, event data and static data; the input coding and output are shown in the figure. The core of SpikeDet is MDSNet, which consists of 5 stages. The downsampling factor of each stage relative to the original input is marked in the figure, with MDS-Block1 and MDS-Block2 perfor…
Figure 4: Firing rate distribution of I-LIF neurons for different presynaptic inputs $x^{t,n}$ when T = 1 and D = 4. The input x is sampled from Gaussian distributions with different variances. Increased variance leads to a higher probability of firing saturation.

Proof: The detailed proof is presented in Appendix A-B. With Proposition 2, we can conclude that the aforementioned variance accumulation in synaptic input directly…
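Figure 4's claim, that wider presynaptic input distributions saturate I-LIF neurons more often, can be checked numerically. The sketch assumes an I-LIF neuron at T = 1 emits clip(round(x), 0, D) spikes and that x is zero-mean Gaussian; the mean, the round-half-up convention, and all function names are assumptions. Saturation then means round(x) >= D, i.e. x >= D - 0.5, whose probability has a closed form via the complementary error function.

```python
import math
import random


def ilif_rate(x: float, d: int = 4) -> float:
    """Normalized firing rate of an integer-valued LIF neuron at one time step (T = 1).

    Assumption: the neuron emits clip(round(x), 0, D) spikes. Round-half-up is
    used here because Python's built-in round() rounds half to even.
    """
    count = min(max(math.floor(x + 0.5), 0), d)
    return count / d


def saturation_prob_closed_form(sigma: float, mu: float = 0.0, d: int = 4) -> float:
    """P(rate == 1) = P(x >= D - 0.5) for x ~ N(mu, sigma^2)."""
    z = (d - 0.5 - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * math.erfc(z)


def saturation_prob_mc(sigma: float, mu: float = 0.0, d: int = 4,
                       n: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same probability by sampling presynaptic inputs."""
    rng = random.Random(seed)
    hits = sum(ilif_rate(rng.gauss(mu, sigma), d) == 1.0 for _ in range(n))
    return hits / n


for sigma in (0.5, 1.0, 2.0, 4.0):
    print(f"sigma={sigma}: P(saturate) ~ {saturation_prob_closed_form(sigma):.4f}")
```

With D = 4 the saturation probability climbs from essentially zero at sigma = 0.5 to roughly one in five at sigma = 4, which matches the qualitative trend the caption describes: variance that accumulates in the synaptic input pushes more neurons to the clipping ceiling.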
Figure 5: Influence of multi-direction feature fusion on the firing patterns of an SNN-based detector. We visualize feature maps at the 1/16 downsampling stage, averaging across the T and C dimensions to reveal overall firing patterns. Panels (a) to (e) show neuron firing patterns for feature maps with no fusion, one-way, two-way, three-way, and four-way fusion, respectively.
Figure 6: The architecture of the proposed SF-Block. The difference between 3×3-LDCB and 3×3-LCB is that the former replaces the 3×3 convolution with a 3×3 depthwise convolution.

…Maxpool after the I-LIF layer to avoid amplifying the input to I-LIF. Additionally, to better preserve balanced information during downsampling, we also employ fixed-stride downsampling. After downsampling, the stride and Maxpool resul…
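The excerpt places Maxpool after the I-LIF layer to avoid amplifying the I-LIF input. A small simulation illustrates why, under the assumption of zero-mean Gaussian pre-activations and non-overlapping 2×2 windows: max pooling real-valued inputs shifts their distribution upward (the mean of the maximum of four independent standard normals is about 1.03), whereas fixed-stride sampling leaves it unchanged. The helper names are hypothetical, and how SpikeDet combines the stride and Maxpool branches is truncated above and not modeled here.

```python
import random
import statistics


def maxpool2x2(grid):
    """2x2 max pooling with stride 2 over a list-of-lists grid (even height and width)."""
    return [
        [max(grid[i][j], grid[i][j + 1], grid[i + 1][j], grid[i + 1][j + 1])
         for j in range(0, len(grid[0]), 2)]
        for i in range(0, len(grid), 2)
    ]


def stride2_sample(grid):
    """Fixed-stride downsampling: keep the top-left element of each 2x2 window."""
    return [row[::2] for row in grid[::2]]


def flatten(grid):
    return [v for row in grid for v in row]


rng = random.Random(0)
H = W = 128
# Assumed pre-activation (membrane input) values, i.e. what I-LIF would see.
pre_activation = [[rng.gauss(0.0, 1.0) for _ in range(W)] for _ in range(H)]

mean_in = statistics.fmean(flatten(pre_activation))
mean_max = statistics.fmean(flatten(maxpool2x2(pre_activation)))
mean_stride = statistics.fmean(flatten(stride2_sample(pre_activation)))
print(f"input {mean_in:+.3f}  maxpool {mean_max:+.3f}  stride {mean_stride:+.3f}")
```

Feeding the max-pooled values into an I-LIF neuron would therefore raise its firing rate and its chance of saturating. Pooling binary spikes after the neuron avoids that shift in the membrane input.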
Figure 7: (a) LFSI comparison between SpikeDet and SpikeYOLO across…
Figure 8: Correlation analysis of LFSI across different SNN-based object detectors. (a) Correlation between LFSI and detection performance. (b) Correlation between LFSI and firing rate.

For a network with N spiking layers, the overall LFSI is

$$\mathrm{LFSI} = \frac{1}{N}\sum_{n=1}^{N} \mathrm{LFSI}_n. \tag{21}$$

A higher LFSI indicates more severe local firing saturation, with LFSI ∈ [0, 1]. 2) Analysis of LFSI Under Varying Spatial Neighborhoods: The pa…
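Eq. (21) is a plain average of per-layer indices. The sketch below implements it and, to have per-layer values to feed in, adds a clearly hypothetical neighborhood-based proxy: the fraction of interior positions whose entire k×k neighborhood fires at the maximum rate. That proxy is an assumption loosely inspired by the mention of spatial neighborhoods; the paper's actual definition of LFSI_n is truncated on this page and is not reproduced.

```python
from typing import List, Sequence


def overall_lfsi(per_layer: Sequence[float]) -> float:
    """Eq. (21): average the per-layer indices LFSI_n over N spiking layers."""
    if not per_layer:
        raise ValueError("need at least one spiking layer")
    if any(not 0.0 <= v <= 1.0 for v in per_layer):
        raise ValueError("each LFSI_n is expected to lie in [0, 1]")
    return sum(per_layer) / len(per_layer)


def saturation_proxy(fmap: List[List[float]], k: int = 3, thresh: float = 1.0) -> float:
    """HYPOTHETICAL stand-in for a per-layer index (NOT the paper's LFSI_n).

    Returns the fraction of interior positions whose entire k x k neighborhood
    fires at or above `thresh` (1.0 = maximum normalized firing rate).
    """
    r = k // 2
    h, w = len(fmap), len(fmap[0])
    total = saturated = 0
    for i in range(r, h - r):
        for j in range(r, w - r):
            total += 1
            window = (fmap[a][b]
                      for a in range(i - r, i + r + 1)
                      for b in range(j - r, j + r + 1))
            if all(v >= thresh for v in window):
                saturated += 1
    return saturated / total if total else 0.0


# Hypothetical per-layer firing maps: a 3x3 saturated block in the corner of a 4x4 map.
layer_map = [[1.0 if i < 3 and j < 3 else 0.0 for j in range(4)] for i in range(4)]
print(overall_lfsi([saturation_proxy(layer_map), 0.0]))
```

Because each LFSI_n lies in [0, 1], their mean does too, which is consistent with the stated range of the overall index.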
Figure 9: In contrast to EMS-ResNet, our MDSNet maintains…
Figure 10: Detection results on the COCO 2017 dataset. As the model scale increases, detection accuracy progressively improves.

Table VI: Performance comparison with state-of-the-art models on the GEN1 dataset. The -ReLU suffix indicates that the activation function of SpikeDet-S is changed to ReLU, converting it to an ANN version.

Model | AP | AP50 | Param (M) | T×D | Firing Rate (%) | LFSI (%) | Power (mJ)
RED [61] | 40.0 | - | 24.1 | - | - | - | >24.1
S… (remaining rows truncated)
Figure 11: Detection results on the GEN1 dataset.

…power consumption, underscoring the effectiveness of SNNs for processing event data. 2) Underwater Object Detection: As shown in Table VII, on the URPC 2019 dataset, SpikeDet achieves significant improvements over other methods in both accuracy and power consumption. Specifically, compared to YOLOv9-S-UI, it achieves a 1.4% AP improvement while reducing power c…
Figure 12: Detection results on the URPC 2019 dataset.

…15,000 training images, 4,370 validation images, and 5,000 test images, featuring comprehensive annotations across diverse scenarios. 2) More Implementation Details: The MDSNet structure is detailed in Table X. For SpikeDet-L, we increase the channel dimensions of MDSNet104 by a factor of 1.25 to improve model performance. On the Gen1 dataset, we employ the zoom-…
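The excerpt widens MDSNet104 by 1.25× for SpikeDet-L. As a rough guide, the weight count of a standard convolution grows quadratically with width, while a depthwise convolution (such as the one in the 3×3-LDCB of Figure 6) grows linearly. The channel count below is hypothetical, and the real SpikeDet-L parameter total depends on the layer mix in Table X, which is not reproduced here.

```python
def conv_weights(c_in: int, c_out: int, k: int = 3, depthwise: bool = False) -> int:
    """Weight count of a k x k convolution, bias omitted."""
    if depthwise:
        # Depthwise conv: one k x k filter per channel (assumes c_in == c_out).
        return c_in * k * k
    return c_in * c_out * k * k


def widen(channels: int, factor: float = 1.25) -> int:
    """Scale a channel count by `factor`, rounded to an integer."""
    return round(channels * factor)


base = 256            # hypothetical channel count of one MDSNet104 stage
wide = widen(base)    # -> 320
std_ratio = conv_weights(wide, wide) / conv_weights(base, base)                                # -> 1.5625
dw_ratio = conv_weights(wide, wide, depthwise=True) / conv_weights(base, base, depthwise=True)  # -> 1.25
print(wide, std_ratio, dw_ratio)
```

So a 1.25× width increase costs about 1.56× the weights in standard 3×3 layers but only 1.25× in depthwise layers, which is one reason mixed blocks keep the parameter growth of wider variants moderate.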
Figure 13: Detection results on the ExDARK dataset (panels: SpikeYOLO, SpikeDet, Ground Truth).
Figure 14: Detection results on the CrowdHuman dataset.
Original abstract

Spiking Neural Networks (SNNs) are the third generation of neural networks. They have gained widespread attention in object detection due to their low energy consumption and biological interpretability. However, existing SNN-based object detection methods suffer from local firing saturation, where adjacent neurons concurrently reach maximum firing rates, especially in object-centric regions. This abnormal neuron firing pattern reduces the feature discrimination capability and detection accuracy, while also increasing the firing rates that prevent SNNs from achieving their potential energy efficiency. To address this problem, we propose SpikeDet, a novel spiking object detector that optimizes firing patterns for accurate and energy-efficient detection. Specifically, we design a spiking backbone network, MDSNet, which effectively adjusts the membrane synaptic input distribution at each layer, achieving better neuron firing patterns during spiking feature extraction. For the neck, to better utilize and preserve these high-quality backbone features, we introduce the Spiking Multi-direction Fusion Module (SMFM), which realizes multi-direction fusion of spiking features, enhancing the multi-scale detection capability of the model. Furthermore, we propose the Local Firing Saturation Index (LFSI) to quantitatively measure local firing saturation. Experimental results validate the effectiveness of our method. On the COCO 2017 dataset, it achieves 52.2% AP, outperforming previous SNN-based methods by 3.3% AP while requiring only half the energy consumption. On object detection sub-tasks, including event-based GEN1, underwater URPC 2019, low-light ExDARK, and dense scene CrowdHuman datasets, SpikeDet also achieves the best performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes SpikeDet, an SNN-based object detector addressing local firing saturation in existing methods. It introduces MDSNet as a spiking backbone that adjusts membrane synaptic input distributions per layer for improved firing patterns, SMFM for multi-directional spiking feature fusion in the neck to enhance multi-scale detection, and LFSI as a metric to quantify local firing saturation. On COCO 2017 it reports 52.2% AP (3.3% above prior SNN detectors) at half the energy; it also claims state-of-the-art results on GEN1, URPC 2019, ExDARK, and CrowdHuman.

Significance. If the experimental claims hold after verification, the work would be a meaningful advance for energy-efficient object detection with SNNs. The LFSI metric supplies a concrete, quantitative handle on firing-pattern quality that could be adopted more broadly, and the reported accuracy-energy trade-off on COCO plus generalization to event-based, low-light, and dense scenes would strengthen the case for SNNs in practical vision pipelines.

major comments (1)
  1. The central attribution—that MDSNet and SMFM directly mitigate local firing saturation and thereby produce the 3.3% AP gain and halved energy—rests on the experimental results. The abstract states that experiments validate this link, yet the provided text supplies no table or section reference to the required ablation (e.g., SpikeDet minus MDSNet, minus SMFM) or to the exact prior SNN baselines and energy-measurement protocol. Without those controls the causal claim cannot be assessed from the manuscript alone.
minor comments (2)
  1. Notation for membrane potential and synaptic input in the MDSNet description should be defined once with consistent symbols before any equations are used.
  2. Figure captions for the firing-pattern visualizations should explicitly state the layer and input image used so readers can reproduce the LFSI calculation.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and constructive suggestion. The major comment correctly identifies that the causal link between our proposed components and the reported gains requires explicit experimental support. We address this below and will revise the manuscript to strengthen the presentation.

Point-by-point responses
  1. Referee: [—] The central attribution—that MDSNet and SMFM directly mitigate local firing saturation and thereby produce the 3.3% AP gain and halved energy—rests on the experimental results. The abstract states that experiments validate this link, yet the provided text supplies no table or section reference to the required ablation (e.g., SpikeDet minus MDSNet, minus SMFM) or to the exact prior SNN baselines and energy-measurement protocol. Without those controls the causal claim cannot be assessed from the manuscript alone.

    Authors: We agree that the manuscript must supply explicit ablations and protocol details to allow readers to evaluate the contribution of each component. The current version contains ablation experiments (Section 4.3) and energy-measurement details (Section 3.4 and Appendix B), but these are not cross-referenced from the abstract or introduction. In the revision we will (1) add a new Table 3 that reports AP and energy for SpikeDet, SpikeDet w/o MDSNet, and SpikeDet w/o SMFM on COCO 2017, (2) expand the baseline comparison paragraph in Section 4.2 to list the exact prior SNN detectors and their reported metrics, and (3) move the energy protocol description into the main text with a clear reference from the abstract. These changes will make the causal attribution verifiable without altering any experimental results. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper introduces architectural components (MDSNet backbone, SMFM neck module) and a diagnostic metric (LFSI) for spiking object detection, then reports empirical gains on COCO and other datasets. No derivation chain, first-principles predictions, or fitted parameters appear; performance claims rest on experimental outcomes rather than any quantity being redefined or forced by construction from its own inputs. No self-citation load-bearing steps or ansatz smuggling are present in the supplied text.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 3 invented entities

The central claim depends on the effectiveness of three newly introduced components whose performance is asserted via experiments whose details are not provided.

axioms (1)
  • domain assumption Standard SNN training assumptions such as surrogate gradient methods for backpropagation through spikes hold.
    Implicit background for any modern SNN paper; not stated explicitly.
invented entities (3)
  • MDSNet no independent evidence
    purpose: Spiking backbone that adjusts membrane synaptic input distribution per layer
    Newly proposed architecture component.
  • SMFM no independent evidence
    purpose: Spiking Multi-direction Fusion Module for multi-scale feature combination
    Newly proposed neck module.
  • LFSI no independent evidence
    purpose: Local Firing Saturation Index to quantify the identified firing problem
    Newly proposed quantitative metric.

pith-pipeline@v0.9.0 · 5843 in / 1288 out tokens · 66119 ms · 2026-05-23T04:58:24.087839+00:00 · methodology


Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages · 4 internal anchors

  1. [1]

    Converting high-performance and low-latency SNNs through explicit modeling of residual error in ANNs

    Z. Huang, J. Ding, Z. Pan, H. Li, Y. Fang, Z. Yu, and J. K. Liu, “Converting high-performance and low-latency SNNs through explicit modeling of residual error in ANNs,” IEEE Trans. Neural Netw. Learn. Syst., vol. 36, no. 9, pp. 16788–16802, 2025

  2. [2]

    Networks of spiking neurons: The third generation of neural network models

    W. Maass, “Networks of spiking neurons: The third generation of neural network models,” Neural Netw., vol. 10, no. 9, pp. 1659–1671, 1997

  3. [3]

    Accurate and efficient event-based semantic segmentation using adaptive spiking encoder–decoder network

    R. Zhang, L. Leng, K. Che, H. Zhang, J. Cheng, Q. Guo, J. Liao, and R. Cheng, “Accurate and efficient event-based semantic segmentation using adaptive spiking encoder–decoder network,” IEEE Trans. Neural Netw. Learn. Syst., vol. 36, no. 5, pp. 9326–9340, 2025

  4. [4]

    Learning to time-decode in spiking neural networks through the information bottleneck

    N. Skatchkovsky, O. Simeone, and H. Jang, “Learning to time-decode in spiking neural networks through the information bottleneck,” in Proc. Adv. Neural Inf. Process. Syst., Dec. 2021, pp. 17049–17059

  5. [5]

    Event-driven video restoration with spiking-convolutional architecture

    C. Cao, X. Fu, Y. Zhu, Z. Sun, and Z.-J. Zha, “Event-driven video restoration with spiking-convolutional architecture,” IEEE Trans. Neural Netw. Learn. Syst., vol. 36, no. 1, pp. 866–880, 2025

  6. [6]

    Object detection with spiking neural networks on automotive event data

    L. Cordone, B. Miramond, and P. Thierion, “Object detection with spiking neural networks on automotive event data,” in Proc. Int. Joint Conf. Neural Netw. (IJCNN), 2022, pp. 1–8

  7. [7]

    Spiking neural networks for frame-based and event-based single object localization

    S. Barchid, J. Mennesson, J. Eshraghian, C. Djéraba, and M. Bennamoun, “Spiking neural networks for frame-based and event-based single object localization,” Neurocomputing, vol. 559, Nov. 2023, art. no. 126805

  8. [8]

    Improving the accuracy of spiking neural networks for radar gesture recognition through preprocessing

    A. Safa, F. Corradi, L. Keuninckx, I. Ocket, A. Bourdoux, F. Catthoor, and G. G. Gielen, “Improving the accuracy of spiking neural networks for radar gesture recognition through preprocessing,” IEEE Trans. Neural Netw. Learn. Syst., vol. 34, no. 6, pp. 2869–2881, 2021

  9. [9]

    Spike-driven transformer v2: Meta spiking neural network architecture inspiring the design of next-generation neuromorphic chips

    M. Yao, J. Hu, T. Hu, Y. Xu, Z. Zhou, Y. Tian, B. Xu, and G. Li, “Spike-driven transformer v2: Meta spiking neural network architecture inspiring the design of next-generation neuromorphic chips,” in Proc. Int. Conf. Learn. Represent., May 2024, pp. 1–23

  10. [10]

    Scaling spike-driven transformer with efficient spike firing approximation training

    M. Yao, X. Qiu, T. Hu, J. Hu, Y. Chou, K. Tian, J. Liao, L. Leng, B. Xu, and G. Li, “Scaling spike-driven transformer with efficient spike firing approximation training,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 47, no. 4, pp. 2973–2990, Jan. 2025

  11. [11]

    Pay attention to them: Deep reinforcement learning-based cascade object detection

    S. Liu, D. Huang, and Y. Wang, “Pay attention to them: Deep reinforcement learning-based cascade object detection,” IEEE Trans. Neural Netw. Learn. Syst., vol. 31, no. 7, pp. 2544–2556, 2019

  12. [12]

    DPNet: Dual-path network for real-time object detection with lightweight attention

    Q. Zhou, H. Shi, W. Xiang, B. Kang, and L. J. Latecki, “DPNet: Dual-path network for real-time object detection with lightweight attention,” IEEE Trans. Neural Netw. Learn. Syst., vol. 36, no. 3, pp. 4504–4518, 2024

  13. [13]

    Spiking-YOLO: Spiking neural network for energy-efficient object detection

    S. Kim, S. Park, B. Na, and S. Yoon, “Spiking-YOLO: Spiking neural network for energy-efficient object detection,” in Proc. AAAI Conf. Artif. Intell., vol. 34, Apr. 2020, pp. 11270–11277

  14. [14]

    Deep directly-trained spiking neural networks for object detection

    Q. Su, Y. Chou, Y. Hu, J. Li, S. Mei, Z. Zhang, and G. Li, “Deep directly-trained spiking neural networks for object detection,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2023, pp. 6555–6565

  15. [15]

    SpikingViT: A multiscale spiking vision transformer model for event-based object detection

    L. Yu, H. Chen, Z. Wang, S. Zhan, J. Shao, Q. Liu, and S. Xu, “SpikingViT: A multiscale spiking vision transformer model for event-based object detection,” IEEE Trans. Cogn. Develop. Syst., vol. 17, no. 1, pp. 130–146, Jul. 2024

  16. [16]

    SFOD: Spiking fusion object detector

    Y. Fan, W. Zhang, C. Liu, M. Li, and W. Lu, “SFOD: Spiking fusion object detector,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2024, pp. 17191–17200

  17. [17]

    Spiking neural network for ultralow-latency and high-accurate object detection

    J. Qu, Z. Gao, T. Zhang, Y. Lu, H. Tang, and H. Qiao, “Spiking neural network for ultralow-latency and high-accurate object detection,” IEEE Trans. Neural Netw. Learn. Syst., vol. 36, no. 3, pp. 4934–4946, Mar. 2024

  18. [18]

    EAS-SNN: End-to-end adaptive sampling and representation for event-based detection with recurrent spiking neural networks

    Z. Wang, Z. Wang, H. Li, L. Qin, R. Jiang, D. Ma, and H. Tang, “EAS-SNN: End-to-end adaptive sampling and representation for event-based detection with recurrent spiking neural networks,” in Proc. Eur. Conf. Comput. Vis. (ECCV), Sep. 2024, pp. 310–328

  19. [19]

    Integer-valued training and spike-driven inference spiking neural network for high-performance and energy-efficient object detection

    X. Luo, M. Yao, Y. Chou, B. Xu, and G. Li, “Integer-valued training and spike-driven inference spiking neural network for high-performance and energy-efficient object detection,” in Proc. Eur. Conf. Comput. Vis. (ECCV), Sep. 2024, pp. 253–272

  20. [20]

    FCOS: Fully convolutional one-stage object detection

    Z. Tian, C. Shen, H. Chen, and T. He, “FCOS: Fully convolutional one-stage object detection,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 9627–9636

  21. [21]

    Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection

    S. Zhang, C. Chi, Y. Yao, Z. Lei, and S. Z. Li, “Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 9759–9768

  22. [22]

    Advancing spiking neural networks toward deep residual learning

    Y. Hu, L. Deng, Y. Wu, M. Yao, and G. Li, “Advancing spiking neural networks toward deep residual learning,” IEEE Trans. Neural Netw. Learn. Syst., vol. 36, no. 2, pp. 2353–2367, Feb. 2024

  23. [23]

    Microsoft COCO: Common objects in context

    T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in Proc. Eur. Conf. Comput. Vis. (ECCV), vol. 8693, 2014, pp. 740–755

  24. [24]

    Going deeper with directly-trained larger spiking neural networks

    H. Zheng, Y. Wu, L. Deng, Y. Hu, and G. Li, “Going deeper with directly-trained larger spiking neural networks,” in Proc. AAAI Conf. Artif. Intell., vol. 35, Feb. 2021, pp. 11062–11070

  25. [25]

    Deep residual learning in spiking neural networks

    W. Fang, Z. Yu, Y. Chen, T. Huang, T. Masquelier, and Y. Tian, “Deep residual learning in spiking neural networks,” Proc. Adv. Neural Inf. Process. Syst., vol. 34, pp. 21056–21069, Dec. 2021

  26. [26]

    Rapid object detection using a boosted cascade of simple features

    P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), vol. 1, Dec. 2001, pp. I–I

  27. [27]

    Semi-supervised semantic segmentation with multi-constraint consistency learning

    J. Yin, T. Chen, G. Pei, H. Liu, Y. Yao, L. Nie, and X. Hua, “Semi-supervised semantic segmentation with multi-constraint consistency learning,” IEEE Trans. Multimedia, vol. 27, pp. 6449–6461, 2025

  28. [28]

    Uncertainty-participation context consistency learning for semi-supervised semantic segmentation

    J. Yin, Y. Chen, Z. Zheng, J. Zhou, and Y. Gu, “Uncertainty-participation context consistency learning for semi-supervised semantic segmentation,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP). IEEE, Apr. 2025, pp. 1–5

  29. [29]

    Proposal distribution calibration for few-shot object detection

    B. Li, C. Liu, M. Shi, X. Chen, X. Ji, and Q. Ye, “Proposal distribution calibration for few-shot object detection,” IEEE Trans. Neural Netw. Learn. Syst., vol. 36, no. 1, pp. 1911–1918, 2025

  30. [30]

    Ultralytics YOLOv8

    G. Jocher, A. Chaurasia, and J. Qiu, “Ultralytics YOLOv8,” 2023. [Online]. Available: https://github.com/ultralytics/ultralytics

  31. [31]

    Gold-YOLO: Efficient object detector via gather-and-distribute mechanism

    C. Wang, W. He, Y. Nie, J. Guo, C. Liu, Y. Wang, and K. Han, “Gold-YOLO: Efficient object detector via gather-and-distribute mechanism,” Proc. Adv. Neural Inf. Process. Syst., vol. 36, pp. 51094–51112, Dec. 2023

  32. [32]

    Ultralytics YOLO11

    G. Jocher and J. Qiu, “Ultralytics YOLO11,” 2024. [Online]. Available: https://github.com/ultralytics/ultralytics

  33. [33]

    End-to-end object detection with transformers

    N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in Proc. Eur. Conf. Comput. Vis. (ECCV), Aug. 2020, vol. 12346, pp. 213–229

  34. [34]

    Towards fast and accurate object detection in bio-inspired spiking neural networks through Bayesian optimization

    S. Kim, S. Park, B. Na, J. Kim, and S. Yoon, “Towards fast and accurate object detection in bio-inspired spiking neural networks through Bayesian optimization,” IEEE Access, vol. 9, pp. 2633–2643, Nov. 2020

  35. [35]

    A quantitative description of membrane current and its application to conduction and excitation in nerve

    A. L. Hodgkin and A. F. Huxley, “A quantitative description of membrane current and its application to conduction and excitation in nerve,” J. Physiol., vol. 117, no. 4, p. 500, 1952

  36. [36]

    Neuronal dynamics: From single neurons to networks and models of cognition

    W. Gerstner, W. M. Kistler, R. Naud, and L. Paninski, Neuronal dynamics: From single neurons to networks and models of cognition. Cambridge University Press, 2014

  37. [37]

    Lapicque’s introduction of the integrate-and-fire model neuron (1907)

    L. F. Abbott, “Lapicque’s introduction of the integrate-and-fire model neuron (1907),” Brain Res. Bull., vol. 50, no. 5-6, pp. 303–304, 1999

  38. [38]

    DETRs beat YOLOs on real-time object detection

    Y. Zhao, W. Lv, S. Xu, J. Wei, G. Wang, Q. Dang, Y. Liu, and J. Chen, “DETRs beat YOLOs on real-time object detection,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2024, pp. 16965–16974

  39. [39]

    YOLO-MS: Rethinking multi-scale representation learning for real-time object detection

    Y. Chen, X. Yuan, J. Wang, R. Wu, X. Li, Q. Hou, and M.-M. Cheng, “YOLO-MS: Rethinking multi-scale representation learning for real-time object detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 47, no. 6, pp. 4240–4252, Feb. 2025

  40. [40]

    Rate coding or direct coding: Which one is better for accurate, robust, and energy-efficient spiking neural networks?

    Y. Kim, H. Park, A. Moitra, A. Bhattacharjee, Y. Venkatesha, and P. Panda, “Rate coding or direct coding: Which one is better for accurate, robust, and energy-efficient spiking neural networks?” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Process., May 2022, pp. 71–75

  41. [41]

    Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification

    K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1026–1034

  42. [42]

    A comprehensive and modularized statistical framework for gradient norm equality in deep neural networks

    Z. Chen, L. Deng, B. Wang, G. Li, and Y. Xie, “A comprehensive and modularized statistical framework for gradient norm equality in deep neural networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 1, pp. 13–31, Jul. 2020

  43. [43]

    Exponential expressivity in deep neural networks through transient chaos

    B. Poole, S. Lahiri, M. Raghu, J. Sohl-Dickstein, and S. Ganguli, “Exponential expressivity in deep neural networks through transient chaos,” Proc. Adv. Neural Inf. Process. Syst., vol. 29, Dec. 2016

  44. [44]

    Distance-IoU loss: Faster and better learning for bounding box regression

    Z. Zheng, P. Wang, W. Liu, J. Li, R. Ye, and D. Ren, “Distance-IoU loss: Faster and better learning for bounding box regression,” in Proc. AAAI Conf. Artif. Intell., vol. 34, Apr. 2020, pp. 12993–13000

  45. [45]

    Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection

    X. Li, W. Wang, L. Wu, S. Chen, X. Hu, J. Li, J. Tang, and J. Yang, “Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection,” in Proc. Adv. Neural Inf. Process. Syst., vol. 33, Dec. 2020, pp. 21002–21012

  46. [46]

    Focal loss for dense object detection

    T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2980–2988

  47. [47]

    CenterNet++ for object detection

    K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang, and Q. Tian, “CenterNet++ for object detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 46, no. 5, pp. 3509–3521, Dec. 2023

  48. [48]

    RTMDet: An empirical study of designing real-time object detectors

    C. Lyu, W. Zhang, H. Huang, Y. Zhou, Y. Wang, Y. Liu, S. Zhang, and K. Chen, “RTMDet: An empirical study of designing real-time object detectors,” 2022, arXiv:2212.07784

  49. [49]

    mixup: Beyond Empirical Risk Minimization

    H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “Mixup: Beyond empirical risk minimization,” 2018, arXiv:1710.09412

  50. [50]

    Feature pyramid networks for object detection

    T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2117–2125

  51. [51]

    Path aggregation network for instance segmentation

    S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, “Path aggregation network for instance segmentation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 8759–8768

  52. [52]

    Efficientdet: Scalable and efficient ob- ject detection,

    M. Tan, R. Pang, and Q. V. Le, “EfficientDet: Scalable and efficient object detection,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 10781–10790

  53. [53]

    TOOD: Task-aligned one-stage object detection,

    C. Feng, Y. Zhong, Y. Gao, M. R. Scott, and W. Huang, “TOOD: Task-aligned one-stage object detection,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 3490–3499

  54. [54]

    Generalized concordant vision transformer with masked image tokens for object detection,

    Y. Quan, D. Zhang, and J. Tang, “Generalized concordant vision transformer with masked image tokens for object detection,” IEEE Trans. Circuits Syst. Video Technol., vol. 35, no. 11, pp. 10616–10631, 2025

  55. [55]

    YOLOv12: Attention-Centric Real-Time Object Detectors

    Y. Tian, Q. Ye, and D. Doermann, “YOLOv12: Attention-centric real-time object detectors,” 2025, arXiv:2502.12524

  56. [56]

    A multisynaptic spiking neuron for simultaneously encoding spatiotemporal dynamics,

    L. Fan, H. Shen, X. Lian, Y. Li, M. Yao, G. Li, and D. Hu, “A multisynaptic spiking neuron for simultaneously encoding spatiotemporal dynamics,” Nature Commun., vol. 16, no. 1, 2025, Art. no. 7155

  57. [57]

    A large scale event-based detection dataset for automotive,

    P. de Tournemire, D. Nitti, E. Perot, D. Migliore, and A. Sironi, “A large scale event-based detection dataset for automotive,” 2020, arXiv:2001.08499

  58. [58]

    ULO: An underwater light-weight object detector for edge computing,

    L. Wang, X. Ye, S. Wang, and P. Li, “ULO: An underwater light-weight object detector for edge computing,” Machines, vol. 10, no. 8, Jul. 2022, Art. no. 629

  59. [59]

    Getting to know low-light images with the exclusively dark dataset,

    Y. P. Loh and C. S. Chan, “Getting to know low-light images with the exclusively dark dataset,” Comput. Vis. Image Underst., vol. 178, pp. 30–42, 2019

  60. [60]

    CrowdHuman: A Benchmark for Detecting Human in a Crowd

    S. Shao, Z. Zhao, B. Li, T. Xiao, G. Yu, X. Zhang, and J. Sun, “CrowdHuman: A benchmark for detecting human in a crowd,” 2018, arXiv:1805.00123

  61. [61]

    Learning to detect objects with a 1 megapixel event camera,

    E. Perot, P. De Tournemire, D. Nitti, J. Masci, and A. Sironi, “Learning to detect objects with a 1 megapixel event camera,” in Proc. Adv. Neural Inf. Process. Syst., vol. 33, pp. 16639–16652, Dec. 2020

  62. [62]

    Recurrent vision transformers for object detection with event cameras,

    M. Gehrig and D. Scaramuzza, “Recurrent vision transformers for object detection with event cameras,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 13884–13893

  63. [63]

    State space models for event cameras,

    N. Zubic, M. Gehrig, and D. Scaramuzza, “State space models for event cameras,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2024, pp. 5819–5828

  64. [64]

    Optimization and application of improved YOLOv9s-UI for underwater object detection,

    W. Pan, J. Chen, B. Lv, and L. Peng, “Optimization and application of improved YOLOv9s-UI for underwater object detection,” Appl. Sci., vol. 14, no. 16, Aug. 2024, Art. no. 7162

  65. [65]

    SU-YOLO: Spiking neural network for efficient underwater object detection,

    C. Li, W. Liu, G. Gong, X. Ding, and X. Zhong, “SU-YOLO: Spiking neural network for efficient underwater object detection,” Neurocomputing, vol. 644, Sep. 2025, Art. no. 130310

  66. [66]

    Understanding the difficulty of training deep feedforward neural networks,

    X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proc. Int. Conf. Artif. Intell. Stat. (AISTATS), JMLR Workshop and Conference Proceedings, 2010, pp. 249–256

  67. [67]

    Deep Residual Networks and Weight Initialization

    M. Taki, “Deep residual networks and weight initialization,” 2017, arXiv:1709.02956

    APPENDIX A
    PROOF OF THE PROPOSITIONS

    A. Proof of Proposition 1

    Proposition 1. For $d$ stacked LCB layers ($d \geq 1$) with arbitrary kernel sizes, under zero-mean convolution weight and zero bias initialization, the output at the $(n+d)$-th layer, $y_{t,n+d}$, is uncorrelated with the input ...

  68. [68]

    It comprises car footage spanning over 39 hours, captured by the GEN1 device with a spatial resolution of 304×240

    Datasets Introduction: The GEN1 dataset [57] is the first large-scale dataset for object detection with event cameras. It comprises over 39 hours of car footage captured by the GEN1 device at a spatial resolution of 304×240. The dataset provides bounding box annotations for vehicles and pedestrians at rates of 1 to 4 Hz. T...

  69. [69]

    For SpikeDet-L, we increase the channel dimensions of MDSNet104 by a factor of 1.25 to improve model performance

    More Implementation Details: The MDSNet structure is detailed in Table X. For SpikeDet-L, we widen MDSNet104 by scaling its channel dimensions by a factor of 1.25 to improve model performance. On the GEN1 dataset, we employ the zoom-in and zoom-out augmentation strategies from [62]. The model is trained for 100 epochs with a batch size of
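    A 1.25 width factor does not raise cost by only 1.25: in any convolution whose input and output channels are both widened, the weight count (and the MACs at a fixed feature-map resolution) grows by 1.25² ≈ 1.56. The sketch below illustrates this with hypothetical stage widths; the actual MDSNet104 widths are listed in Table X and are not reproduced here, and rounding each scaled width to the nearest integer is an assumption.

```python
def scale_width(channels, factor=1.25):
    """Multiply each stage width by `factor`, rounding to the nearest integer.

    The rounding rule is an assumption; only the 1.25 factor comes from the text.
    """
    return [int(round(c * factor)) for c in channels]


def conv_weights(c_in, c_out, k=3):
    """Number of weights in a bias-free k x k convolution."""
    return c_in * c_out * k * k


# Hypothetical stage widths for illustration only (not MDSNet104's Table X values).
base = [64, 128, 256, 512]
wide = scale_width(base)

# A conv mapping stage-3 width to stage-4 width: both C_in and C_out grow by 1.25,
# so its weight count grows by 1.25 ** 2.
ratio = conv_weights(wide[2], wide[3]) / conv_weights(base[2], base[3])
print(wide, ratio)  # [80, 160, 320, 640] 1.5625
```

    Layers with a fixed channel count on one side (such as the stem reading the input) and per-channel normalization parameters grow only linearly, so the whole backbone grows by at most about 1.56×.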

  70. [70]

    Specifically, following [58], [64], [65], we resize images to 320×320 for URPC 2019, while using 640×640 for the other datasets

    For the URPC 2019, ExDARK, and CrowdHuman datasets, we employ mosaic augmentation. Following [58], [64], [65], we resize images to 320×320 for URPC 2019 and to 640×640 for the other datasets. All models are trained for 300 epochs with a batch size of 64.

    APPENDIX C
    MORE VISUALIZATION

    In this section, we present the visualization results of o...
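    The mosaic augmentation used above for URPC 2019, ExDARK, and CrowdHuman stitches four training images into one canvas, so each sample shows objects at varied scales and in varied context. The following is a minimal, simplified sketch and not the authors' pipeline: each image is nearest-neighbour resized into one quadrant around a random centre (YOLO-style implementations instead crop images around a centre on a larger canvas), the 0.25 to 0.75 centre range is an assumption, and bounding-box remapping is omitted. The hypothetical `out_size` parameter stands for the 320×320 and 640×640 input sizes.

```python
import numpy as np


def mosaic4(images, out_size=640, rng=None):
    """Combine four HxWx3 uint8 images into one out_size x out_size mosaic.

    Simplified sketch: each image is nearest-neighbour resized to fill one
    quadrant around a random centre. Box-label remapping is omitted.
    """
    if len(images) != 4:
        raise ValueError("mosaic4 expects exactly four images")
    rng = np.random.default_rng() if rng is None else rng
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)

    # Random mosaic centre, kept away from the borders (range is an assumption).
    cx = int(rng.uniform(0.25, 0.75) * out_size)
    cy = int(rng.uniform(0.25, 0.75) * out_size)

    # Quadrants as (x0, y0, x1, y1): top-left, top-right, bottom-left, bottom-right.
    quads = [(0, 0, cx, cy), (cx, 0, out_size, cy),
             (0, cy, cx, out_size), (cx, cy, out_size, out_size)]

    for img, (x0, y0, x1, y1) in zip(images, quads):
        h, w = y1 - y0, x1 - x0
        # Nearest-neighbour resize by integer index sampling (no extra dependencies).
        rows = np.arange(h) * img.shape[0] // h
        cols = np.arange(w) * img.shape[1] // w
        canvas[y0:y1, x0:x1] = img[rows][:, cols]
    return canvas


# Usage: 320x320 mosaics for URPC 2019, 640x640 for ExDARK and CrowdHuman.
rng = np.random.default_rng(0)
imgs = [np.full((100, 150, 3), v, dtype=np.uint8) for v in (10, 20, 30, 40)]
mosaic = mosaic4(imgs, out_size=320, rng=rng)
```

    Because the four quadrants tile the whole canvas, every output pixel comes from exactly one source image; a full implementation would also shift and clip each image's boxes into its quadrant.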