
arxiv: 2606.24248 · v1 · pith:FEF3QEMNnew · submitted 2026-06-23 · 💻 cs.CV

M²C-EvDet: Multi-Domain Multi-Order Cross-Modal Knowledge Distillation for Event-based Object Detection

Pith reviewed 2026-06-26 00:30 UTC · model grok-4.3

classification 💻 cs.CV
keywords event-based object detection · knowledge distillation · cross-modal learning · frequency learning · hypergraph computation · multi-order relations · adaptive feature distillation · event cameras

The pith

M^2C-EvDet improves event-based object detection by distilling frequency features and multi-order relations from frame-based models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that prior knowledge distillation methods for event-based detection fall short because they address only spatial semantics or pairwise relations, leaving a performance gap with frame-based systems in complex scenes. It proposes integrating two new modules into a single framework: one that adaptively decouples and distills frequency-domain information, and another that models multi-order relations through hypergraph computation. A sympathetic reader would care because event cameras provide high temporal resolution and wide dynamic range that frame cameras lack, yet their sparse data makes accurate detection harder without better cross-modal transfer of visual meaning. If the approach works, event-based detectors would close more of the accuracy gap while retaining their speed and robustness advantages.

Core claim

M^2C-EvDet is a Multi-domain and Multi-order Cross-modal knowledge distillation framework for Event-based object Detection. Built upon frequency learning and hypergraph computation, M^2C-EvDet integrates two specialized modules: Adaptive Frequency-Decoupled Feature Distillation (AF^2D^2) and Multi-Order Relational Distillation (MORD) to address limitations of prior distillation methods that only focus on spatial semantics or pair-wise relations.
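To make the idea of frequency-decoupled feature distillation concrete, the following is a minimal NumPy sketch, not the paper's AF^2D^2 module. It assumes a fixed radial low-pass mask in the 2-D Fourier domain, whereas the paper describes adaptive low-pass and high-pass filters whose form is not given on this page. The function names, the `cutoff` value, and the loss weights `w_low` and `w_high` are all hypothetical.

```python
import numpy as np

def radial_low_pass_mask(height, width, cutoff):
    """Binary mask on the 2-D FFT grid: 1 where the normalized frequency radius <= cutoff."""
    fy = np.fft.fftfreq(height)[:, None]  # cycles per pixel, in [-0.5, 0.5)
    fx = np.fft.fftfreq(width)[None, :]
    radius = np.sqrt(fy ** 2 + fx ** 2)
    return (radius <= cutoff).astype(np.float64)

def split_frequency_bands(feature, cutoff=0.1):
    """Split a (C, H, W) feature map into low- and high-frequency parts.

    Assumption: a fixed cutoff stands in for the paper's adaptive filters.
    """
    _, height, width = feature.shape
    mask = radial_low_pass_mask(height, width, cutoff)
    spectrum = np.fft.fft2(feature, axes=(-2, -1))
    low = np.fft.ifft2(spectrum * mask, axes=(-2, -1)).real
    high = feature - low  # the residual carries the high-frequency content
    return low, high

def frequency_decoupled_loss(student, teacher, cutoff=0.1, w_low=1.0, w_high=1.0):
    """Separate MSE terms for the low and high bands of student and teacher features."""
    s_low, s_high = split_frequency_bands(student, cutoff)
    t_low, t_high = split_frequency_bands(teacher, cutoff)
    loss_low = np.mean((s_low - t_low) ** 2)
    loss_high = np.mean((s_high - t_high) ** 2)
    return w_low * loss_low + w_high * loss_high, loss_low, loss_high

# Usage with random stand-ins for event (student) and RGB (teacher) features.
rng = np.random.default_rng(0)
student_feat = rng.normal(size=(4, 16, 16))
teacher_feat = rng.normal(size=(4, 16, 16))
total, low_term, high_term = frequency_decoupled_loss(student_feat, teacher_feat)
```

Because the mask is binary, the two bands are orthogonal and the two terms sum to the ordinary feature MSE when both weights are 1. Weighting them differently is what lets a decoupled loss emphasize one band over the other.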

What carries the argument

The M^2C-EvDet framework, with AF^2D^2 for adaptive frequency-decoupled feature distillation and MORD for multi-order relational distillation via hypergraph computation.
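As a rough illustration of what relational distillation over a hypergraph can look like, here is a small NumPy sketch. It is not the paper's MORD module: it swaps in fixed k-nearest-neighbour hyperedges and plain mean aggregation, whereas the paper refers to self-attention, hyper-attention, and adaptive hyperedges. All function names and the choice of `k` are hypothetical.

```python
import numpy as np

def knn_incidence(features, k=3):
    """Incidence matrix H (N nodes x N hyperedges).

    Hyperedge j joins node j with its k nearest neighbours, so each hyperedge
    relates k + 1 nodes at once instead of a single pair.
    """
    sq_norms = np.sum(features ** 2, axis=1)
    dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * features @ features.T
    np.fill_diagonal(dist, -np.inf)  # guarantee each node sorts first in its own row
    neighbours = np.argsort(dist, axis=1)[:, : k + 1]
    n = features.shape[0]
    incidence = np.zeros((n, n))
    for edge, members in enumerate(neighbours):
        incidence[members, edge] = 1.0
    return incidence

def hypergraph_propagate(features, incidence):
    """One node -> hyperedge -> node averaging step: D_v^-1 H D_e^-1 H^T X."""
    edge_degree = incidence.sum(axis=0)  # nodes per hyperedge
    node_degree = incidence.sum(axis=1)  # hyperedges per node
    edge_features = (incidence.T @ features) / edge_degree[:, None]
    return (incidence @ edge_features) / node_degree[:, None]

def relational_distillation_loss(student, teacher, k=3):
    """MSE between student and teacher features after propagation.

    Assumption: hyperedges are built from the teacher and shared with the student.
    """
    incidence = knn_incidence(teacher, k)
    s_rel = hypergraph_propagate(student, incidence)
    t_rel = hypergraph_propagate(teacher, incidence)
    return np.mean((s_rel - t_rel) ** 2)

# Usage with random stand-ins: 8 object or region features of dimension 5.
rng = np.random.default_rng(0)
teacher_nodes = rng.normal(size=(8, 5))
student_nodes = rng.normal(size=(8, 5))
loss = relational_distillation_loss(student_nodes, teacher_nodes, k=3)
```

Setting `k=1` makes every hyperedge connect exactly two nodes, which reduces the structure to an ordinary pairwise graph. Larger `k` is what introduces the higher-order groupings that pairwise relational distillation cannot express.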

If this is right

  • Event-based detectors achieve higher accuracy in complex scenarios than methods limited to spatial or pairwise distillation.
  • Visual semantics missing from sparse event data are transferred more completely across modalities.
  • The performance disparity between EvDet and frame-based detection is reduced through combined frequency and relational knowledge transfer.
  • Distillation now operates effectively on both frequency domains and higher-order scene relations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same frequency-plus-hypergraph approach could be tested on other event-based tasks such as tracking or segmentation.
  • Combining the framework with existing large frame-based pretrained models might lower the data requirements for training event detectors.
  • Hypergraph modeling of multi-order relations may prove useful in other sparse sensing domains beyond vision.

Load-bearing premise

That the frequency-decoupled and multi-order relational modules will successfully capture and transfer the missing visual semantics from frame-based models to event data in complex scenarios beyond what prior spatial or pairwise methods achieved.

What would settle it

Experiments on standard event detection benchmarks showing no meaningful accuracy gain over prior distillation baselines specifically in complex or high-dynamic-range sequences.

Figures

Figures reproduced from arXiv: 2606.24248 by Shouan Pan, Siqi Li, Wei Bao, Yi Xie, Yue Gao.

Figure 1: Comparison of our proposed M^2C-EvDet with existing methods, including (a) whole spatial domain distillation and (b) low-order relational distillation. Our method (c) transfers low-frequency and high-frequency decoupled object semantics, and low-order and high-order object relations, through frequency learning and hypergraph computation. […] semantics. It can be observed that existing cross-modal knowledge distillat…
Figure 2: Overview of the proposed M^2C-EvDet framework. During the training phase, our method is built on a teacher-student architecture and takes event streams and RGB frames as inputs. The M2CFD module is employed to implement low-order and high-order relational distillation, and low-frequency and high-frequency decoupled distillation. During the inference phase, our method solely takes event streams as inputs to perform o…
Figure 3: Illustration of the AF^2D^2 module, which leverages adaptive low-pass and high-pass filters to achieve the decoupling of low-frequency and high-frequency feature distillation. […] low-pass and high-pass filtering modules to perform decoupled distillation of low-frequency and high-frequency components. During the inference stage, our method solely takes event streams as inputs and leverages the enhanced the multi…
Figure 4: Illustration of the MORD module, which leverages self-attention and hyper-attention to transfer low-order and high-order relations. […] and $T_l - S_l$ to represent differential modality. The final high-frequency and low-frequency components of the RGB teacher features and event student features can be obtained as: $S_l^{lf} = \sum_{t=1}^{T} A_l^{lf(t)} \cdot S_l^{lf(t)}$, $S_l^{hf} = \sum_{t=1}^{T} A_l^{hf(t)} \cdot S_l^{hf(t)}$, $T_l^{lf} = \sum_{t=1}^{T} \cdots$ …
Figure 5: Experimental results of sensitivity analysis of distillation loss weight
Figure 7: Visual comparison of detection results between the baseline and …
Figure 8: Visualization of adaptive hyperedges for the MORD module on the DSEC-Detection dataset [10].
Figure 9: Visual results of a failure case for M^2C-EvDet on the DSEC-Detection dataset [10]. […] mitigates the cross-modal gap but cannot completely eliminate modal discrepancies, due to the low-quality RGB images and the inadequate model optimization process. As illustrated in the low-light scenarios of …
Original abstract

Event-based object Detection (EvDet), as a biologically inspired visual perception paradigm, demonstrates superior performance in scenarios demanding high temporal resolution and a wide dynamic range. Nevertheless, the inherent sparse representations and inadequate visual semantics of event data result in a considerable performance disparity between EvDet and frame-based object detection. Previous works attempt to alleviate this cross-modal discrepancy through knowledge distillation, yet they only focus on spatial visual semantics or pair-wise relational information, thus limiting performance in more complex scenarios. To address this challenge, this paper proposes M^2C-EvDet, a Multi-domain and Multi-order Cross-modal knowledge distillation framework for EvDet. Built upon frequency learning and hypergraph computation, M^2C-EvDet integrates two specialized modules: Adaptive Frequency-Decoupled Feature Distillation (AF^2D^2) and Multi-Order Relational Distillation (MORD).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes M^2C-EvDet, a Multi-domain Multi-Order Cross-Modal knowledge distillation framework for Event-based Object Detection (EvDet). It introduces two modules—Adaptive Frequency-Decoupled Feature Distillation (AF^2D^2) and Multi-Order Relational Distillation (MORD)—built on frequency learning and hypergraph computation to transfer richer visual semantics from frame-based models to sparse event data, addressing limitations of prior distillation approaches that focus only on spatial semantics or pairwise relations.

Significance. If the modules prove effective at capturing and transferring multi-domain and multi-order information, the work could meaningfully narrow the performance gap between EvDet and frame-based detection in complex, high-dynamic-range scenarios, contributing a new direction for cross-modal distillation in event-based vision.

major comments (2)
  1. [Abstract] Abstract: The central claim of improved performance in complex scenarios rests on the success of AF^2D^2 and MORD, yet the manuscript provides no equations, algorithmic details, or pseudocode for how frequency decoupling is performed adaptively or how hypergraph-based multi-order relations are computed and distilled.
  2. [Abstract] Abstract: No experimental results, ablation studies, or quantitative comparisons are presented, so the assertion that the proposed modules outperform prior spatial or pairwise distillation methods cannot be assessed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed comments on the abstract. The full manuscript contains the technical details and experimental results referenced in the body; we address each point below and indicate where revisions to the abstract may be appropriate.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim of improved performance in complex scenarios rests on the success of AF^2D^2 and MORD, yet the manuscript provides no equations, algorithmic details, or pseudocode for how frequency decoupling is performed adaptively or how hypergraph-based multi-order relations are computed and distilled.

    Authors: The abstract is a high-level summary. The full manuscript provides the requested equations for adaptive frequency decoupling in AF^2D^2 (Section 3.2, including the frequency-domain formulation and adaptive weighting) and the hypergraph construction plus multi-order distillation loss for MORD (Section 3.3, with explicit hyperedge definitions and message-passing steps). Algorithmic details and a pseudocode outline appear in the supplementary material. We can add one sentence to the abstract that points to these sections if the editor prefers. revision: partial

  2. Referee: [Abstract] Abstract: No experimental results, ablation studies, or quantitative comparisons are presented, so the assertion that the proposed modules outperform prior spatial or pairwise distillation methods cannot be assessed.

    Authors: The full manuscript reports comprehensive experiments in Section 4, including quantitative mAP comparisons on multiple event-based datasets, ablation studies isolating AF^2D^2 and MORD, and direct comparisons against prior spatial and pairwise distillation baselines. The abstract summarizes the outcome of these experiments without specific numbers due to length limits. No change to the abstract is required on this point. revision: no

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The paper proposes a new framework M^2C-EvDet integrating AF^2D^2 and MORD modules built on frequency learning and hypergraph computation for cross-modal distillation. No equations, derivations, or fitted parameters are described in the provided text that reduce by construction to the inputs. The central claim is an architectural proposal whose validity rests on empirical performance rather than any self-referential definition or self-citation chain. No load-bearing steps match the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no details on free parameters, axioms, or invented entities; full text required to audit.

pith-pipeline@v0.9.1-grok · 5691 in / 1086 out tokens · 20776 ms · 2026-06-26T00:30:17.692908+00:00 · methodology


Reference graph

Works this paper leans on

56 extracted references · 9 canonical work pages

  1. [1]

    Event-Based Vision: A Survey,

    G. Gallego, T. Delbrück, G. Orchard, C. Bartolozzi, B. Taba, A. Censi, S. Leutenegger, A. J. Davison, J. Conradt, K. Daniilidis, and D. Scaramuzza, “Event-Based Vision: A Survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020

  2. [2]

    EVCSLR: Event-guided Continuous Sign Language Recognition and Benchmark,

    Y. Jiang, Y. Wang, S. Li, Y. Zhang, Q. Guo, Q. Chu, and Y. Gao, “EVCSLR: Event-guided Continuous Sign Language Recognition and Benchmark,” IEEE Transactions on Multimedia, 2024

  3. [3]

    RGB-D Visual Perception for Occluded Scenes via Event Camera,

    S. Li, Z. Wu, Y. Li, Z. Xue, Y.-S. Liu, and Y. Gao, “RGB-D Visual Perception for Occluded Scenes via Event Camera,” International Journal of Computer Vision, pp. 1–22, 2025

  4. [4]

    Event-based Low-illumination Image Enhancement,

    Y. Jiang, Y. Wang, S. Li, Y. Zhang, M. Zhao, and Y. Gao, “Event-based Low-illumination Image Enhancement,” IEEE Transactions on Multimedia, vol. 26, pp. 1920–1931, 2023

  5. [5]

    3D Feature Tracking via Event Camera,

    S. Li, Z. Zhou, Z. Xue, Y. Li, S. Du, and Y. Gao, “3D Feature Tracking via Event Camera,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024, pp. 18974–18983

  6. [6]

    AEGNN: Asynchronous Event-based Graph Neural Networks,

    S. Schaefer, D. Gehrig, and D. Scaramuzza, “AEGNN: Asynchronous Event-based Graph Neural Networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2022, pp. 12371–12381

  7. [7]

    Spiking Transformers for Event-based Single Object Tracking,

    J. Zhang, B. Dong, H. Zhang, J. Ding, F. Heide, B. Yin, and X. Yang, “Spiking Transformers for Event-based Single Object Tracking,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2022, pp. 8801–8810

  8. [8]

    Recurrent Vision Transformers for Object Detection with Event Cameras,

    M. Gehrig and D. Scaramuzza, “Recurrent Vision Transformers for Object Detection with Event Cameras,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2023, pp. 13884–13893

  9. [9]

    EvRT-DETR: Latent Space Adaptation of Image Detectors for Event-based Vision,

    D. Torbunov, Y. Ren, A. Ghose, O. Dim, and Y. Cui, “EvRT-DETR: Latent Space Adaptation of Image Detectors for Event-based Vision,” in Proceedings of the IEEE International Conference on Computer Vision, 2025, pp. 9812–9821

  10. [10]

    Low-Latency Automotive Vision with Event Cameras,

    D. Gehrig and D. Scaramuzza, “Low-Latency Automotive Vision with Event Cameras,” Nature, vol. 629, no. 8014, pp. 1034–1040, 2024

  11. [11]

    FlexEvent: Towards Flexible Event-Frame Object Detection at Varying Operational Frequencies,

    D. Lu, L. Kong, G. Lee, C. Chane, and W. Ooi, “FlexEvent: Towards Flexible Event-Frame Object Detection at Varying Operational Frequencies,” arXiv preprint arXiv:2412.06708, 2025

  12. [12]

    Frequency-Adaptive Low-Latency Object Detection Using Events and Frames,

    H. Zhang, X. Wang, C. Xu, X. Wang, F. Xu, H. Yu, L. Yu, and W. Yang, “Frequency-Adaptive Low-Latency Object Detection Using Events and Frames,” arXiv preprint arXiv:2412.04149, 2024

  13. [13]

    From Global to Local: Multi-Scale Out-of-Distribution Detection,

    J. Zhang, L. Gao, B. Hao, H. Huang, J. Song, and H. Shen, “From Global to Local: Multi-Scale Out-of-Distribution Detection,” IEEE Transactions on Image Processing, vol. 32, pp. 6115–6128, 2023

  14. [14]

    From Channel Bias to Feature Redundancy: Uncovering the “Less is More” Principle in Few-Shot Learning,

    J. Zhang, X. Luo, L. Gao, D. Zou, H. Shen, and J. Song, “From Channel Bias to Feature Redundancy: Uncovering the ‘Less is More’ Principle in Few-Shot Learning,” arXiv e-prints, pp. arXiv–2310, 2023

  15. [15]

    A Closer Look at Conditional Prompt Tuning for Vision-Language Models,

    J. Zhang, S. Wu, L. Gao, J. Song, N. Sebe, and H. T. Shen, “A Closer Look at Conditional Prompt Tuning for Vision-Language Models,” arXiv preprint arXiv:2506.23856, 2025

  16. [16]

    Reliable Few-shot Learning under Dual Noises,

    J. Zhang, J. Song, L. Gao, N. Sebe, and H. T. Shen, “Reliable Few-shot Learning under Dual Noises,” 2025

  17. [17]

    Object-Centric Cross-Modal Feature Distillation for Event-based Object Detection,

    L. Li, A. Linger, M. Millhaeusler, V. Tsiminaki, Y. Li, and D. Dai, “Object-Centric Cross-Modal Feature Distillation for Event-based Object Detection,” in Proceedings of the IEEE International Conference on Robotics and Automation. IEEE, 2024, pp. 15440–15447

  18. [18]

    Event-Aware Distilled DETR for Object Detection in an Automotive Context,

    D. Rossi, P. Vasseur, F. Morbidi, C. Demonceaux, and F. Rameau, “Event-Aware Distilled DETR for Object Detection in an Automotive Context,” in IEEE Intelligent Vehicles Symposium, 2025

  19. [19]

    Wavelet Integrated CNNs for Noise-robust Image Classification,

    Q. Li, L. Shen, S. Guo, and Z. Lai, “Wavelet Integrated CNNs for Noise-robust Image Classification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 7245–7254

  20. [20]

    HGNN+: General Hypergraph Neural Networks,

    Y. Gao, Y. Feng, S. Ji, and R. Ji, “HGNN+: General Hypergraph Neural Networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 3, pp. 3181–3199, 2023

  21. [21]

    SuperEvent: Cross-Modal Learning of Event-based Keypoint Detection,

    Y. Burkhardt, S. Schaefer, and S. Leutenegger, “SuperEvent: Cross-Modal Learning of Event-based Keypoint Detection,” arXiv preprint arXiv:2504.00139, 2025

  22. [22]

    Depth AnyEvent: A Cross-Modal Distillation Paradigm for Event-Based Monocular Depth Estimation,

    L. Bartolomei, E. Mannocci, F. Tosi, M. Poggi, and S. Mattoccia, “Depth AnyEvent: A Cross-Modal Distillation Paradigm for Event-Based Monocular Depth Estimation,” in Proceedings of the IEEE International Conference on Computer Vision, 2025, pp. 19669–19678

  23. [23]

    I2EKD: Efficient and Versatile Image-to-Event Knowledge Distillation,

    H. Liu, G. Yu, H. Cao, S. Qu, F. Lu, Y. Zhong, Z. Lu, L. Leng, and G. Chen, “I2EKD: Efficient and Versatile Image-to-Event Knowledge Distillation,” IEEE Transactions on Circuits and Systems for Video Technology, 2025.

  24. [24]

    When Object Detection Meets Knowledge Distillation: A Survey,

    Z. Li, P. Xu, X. Chang, L. Yang, Y. Zhang, L. Yao, and X. Chen, “When Object Detection Meets Knowledge Distillation: A Survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 8, pp. 10555–10579, 2023

  25. [25]

    Context Matters: Distilling Knowledge Graph for Enhanced Object Detection,

    A. Yang, S. Lin, C.-H. Yeh, M. Shu, Y. Yang, and X. Chang, “Context Matters: Distilling Knowledge Graph for Enhanced Object Detection,” IEEE Transactions on Multimedia, vol. 26, pp. 487–500, 2023

  26. [26]

    A Hierarchical Semantic Distillation Framework for Open-Vocabulary Object Detection,

    S. Fu, J. Yan, Q. Yang, X. Wei, X. Xie, and W.-S. Zheng, “A Hierarchical Semantic Distillation Framework for Open-Vocabulary Object Detection,” IEEE Transactions on Multimedia, 2025

  27. [27]

    Improve Object Detection with Feature-Based Knowledge Distillation: Towards Accurate and Efficient Detectors,

    L. Zhang and K. Ma, “Improve Object Detection with Feature-Based Knowledge Distillation: Towards Accurate and Efficient Detectors,” in Proceedings of the International Conference on Learning Representations, 2020

  28. [28]

    Focal and Global Knowledge Distillation for Detectors,

    Z. Yang, Z. Li, X. Jiang, Y. Gong, Z. Yuan, D. Zhao, and C. Yuan, “Focal and Global Knowledge Distillation for Detectors,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2022, pp. 4643–4652

  29. [29]

    Masked Generative Distillation,

    Z. Yang, Z. Li, M. Shao, D. Shi, Z. Yuan, and C. Yuan, “Masked Generative Distillation,” in Proceedings of the European Conference on Computer Vision. Springer, 2022, pp. 53–69

  30. [30]

    D^3ETR: Decoder Distillation for Detection Transformer,

    X. Chen, J. Chen, Y. Liu, and G. Zeng, “D^3ETR: Decoder Distillation for Detection Transformer,” arXiv preprint arXiv:2211.09768, 2022

  31. [31]

    DETRDistill: A Universal Knowledge Distillation Framework for DETR-families,

    J. Chang, S. Wang, H.-M. Xu, Z. Chen, C. Yang, and F. Zhao, “DETRDistill: A Universal Knowledge Distillation Framework for DETR-families,” in Proceedings of the IEEE International Conference on Computer Vision, 2023, pp. 6898–6908

  32. [32]

    KD-DETR: Knowledge Distillation for Detection Transformer with Consistent Distillation Points Sampling,

    Y. Wang, X. Li, S. Weng, G. Zhang, H. Yue, H. Feng, J. Han, and E. Ding, “KD-DETR: Knowledge Distillation for Detection Transformer with Consistent Distillation Points Sampling,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024, pp. 16016–16025

  33. [33]

    CLoCKDistill: Consistent Location-and-Context-aware Knowledge Distillation for DETRs,

    Q. Lan and Q. Tian, “CLoCKDistill: Consistent Location-and-Context-aware Knowledge Distillation for DETRs,” arXiv preprint arXiv:2502.10683, 2025

  34. [34]

    Frequency-aware Feature Fusion for Dense Image Prediction,

    L. Chen, Y. Fu, L. Gu, C. Yan, T. Harada, and G. Huang, “Frequency-aware Feature Fusion for Dense Image Prediction,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

  35. [35]

    WD-DETR: Wavelet Denoising-Enhanced Real-Time Object Detection Transformer for Robot Perception with Event Cameras,

    Y. Cui, B. Gao, Y. Zhang, X. Dong, J. Xiang, D. Li, and Z. Tu, “WD-DETR: Wavelet Denoising-Enhanced Real-Time Object Detection Transformer for Robot Perception with Event Cameras,” arXiv preprint arXiv:2506.09098, 2025

  36. [36]

    Frequency Attention for Knowledge Distillation,

    C. Pham, V.-A. Nguyen, T. Le, D. Phung, G. Carneiro, and T.-T. Do, “Frequency Attention for Knowledge Distillation,” in Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 2024, pp. 2277–2286

  37. [37]

    FreeKD: Knowledge Distillation via Semantic Frequency Prompt,

    Y. Zhang, T. Huang, J. Liu, T. Jiang, K. Cheng, and S. Zhang, “FreeKD: Knowledge Distillation via Semantic Frequency Prompt,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024, pp. 15931–15940

  38. [38]

    Learning Representation on Optimized High-order Manifold for Visual classification,

    X. Ma, W. Liu, Q. Tian, and Y. Gao, “Learning Representation on Optimized High-order Manifold for Visual classification,” IEEE Transactions on Multimedia, vol. 24, pp. 3989–4001, 2021

  39. [39]

    CrossHypergraph: Consistent High-order Semantic Network for Few-shot Image Classification,

    Y. Zhang, H. Wang, S. Zhang, and B. Leng, “CrossHypergraph: Consistent High-order Semantic Network for Few-shot Image Classification,” IEEE Transactions on Multimedia, pp. 1–12, 2025

  40. [40]

    Hypergraph-Based Remaining Prototype Alignment for Open-Set Cross-Domain Image Retrieval,

    Y. Xu, Y. Feng, X. Zhong, Y. Gao, and Z. Wu, “Hypergraph-Based Remaining Prototype Alignment for Open-Set Cross-Domain Image Retrieval,” IEEE Transactions on Multimedia, 2025

  41. [41]

    Residual Fuzzy Alignment on Hypergraph for Open-Set 3D Cross-Modal Retrieval,

    Y. Xu, Y. Feng, X. Zhuang, J. Wang, Z. Wu, and Y. Gao, “Residual Fuzzy Alignment on Hypergraph for Open-Set 3D Cross-Modal Retrieval,” IEEE Transactions on Multimedia, 2025

  42. [42]

    Hyper-YOLO: When Visual Object Detection Meets Hypergraph Computation,

    Y. Feng, J. Huang, S. Du, S. Ying, J.-H. Yong, Y. Li, G. Ding, R. Ji, and Y. Gao, “Hyper-YOLO: When Visual Object Detection Meets Hypergraph Computation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

  43. [43]

    YOLOv13: Real-Time Object Detection with Hypergraph-Enhanced Adaptive Visual Perception,

    M. Lei, S. Li, Y. Wu, H. Hu, Y. Zhou, X. Zheng, G. Ding, S. Du, Z. Wu, and Y. Gao, “YOLOv13: Real-Time Object Detection with Hypergraph-Enhanced Adaptive Visual Perception,” arXiv preprint arXiv:2506.17733, 2025

  44. [44]

    Feature Pyramid Networks for Object Detection,

    T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature Pyramid Networks for Object Detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2117–2125

  45. [45]

    Path Aggregation Network for Instance Segmentation,

    S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, “Path Aggregation Network for Instance Segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8759–8768

  46. [46]

    Pyramid Vision Transformer: A versatile Backbone for Dense Prediction without Convolutions,

    W. Wang, E. Xie, X. Li, D.-P. Fan, K. Song, D. Liang, T. Lu, P. Luo, and L. Shao, “Pyramid Vision Transformer: A versatile Backbone for Dense Prediction without Convolutions,” in Proceedings of the IEEE International Conference on Computer Vision, 2021, pp. 568–578

  47. [47]

    Attention is All You Need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is All You Need,” in Proceedings of the Advances in Neural Information Processing Systems, vol. 30, 2017

  48. [48]

    Vision HGNN: An Image is More than a Graph of Nodes,

    Y. Han, P. Wang, S. Kundu, Y. Ding, and Z. Wang, “Vision HGNN: An Image is More than a Graph of Nodes,” in Proceedings of the IEEE International Conference on Computer Vision, 2023, pp. 19878–19888

  49. [49]

    Motion Robust High-Speed Light-Weighted Object Detection with Event Camera,

    B. Liu, C. Xu, W. Yang, H. Yu, and L. Yu, “Motion Robust High-Speed Light-Weighted Object Detection with Event Camera,” IEEE Transactions on Instrumentation and Measurement, vol. 72, pp. 1–13, 2023

  50. [50]

    State Space Models for Event Cameras,

    N. Zubic, M. Gehrig, and D. Scaramuzza, “State Space Models for Event Cameras,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024, pp. 5819–5828

  51. [51]

    D-FINE: Redefine Regression Task in DETRs as Fine-grained Distribution Refinement,

    Y. Peng, H. Li, P. Wu, Y. Zhang, X. Sun, and F. Wu, “D-FINE: Redefine Regression Task in DETRs as Fine-grained Distribution Refinement,” arXiv preprint arXiv:2410.13842, 2024

  52. [52]

    Fusing event-based and rgb camera for robust object detection in adverse conditions,

    A. Tomy, A. Paigwar, K. S. Mann, A. Renzaglia, and C. Laugier, “Fusing event-based and rgb camera for robust object detection in adverse conditions,” in IEEE International Conference on Robotics and Automation. IEEE, 2022, pp. 933–939

  53. [53]

    SODFormer: Streaming Object Detection with Transformer using Events and Frames,

    D. Li, Y. Tian, and J. Li, “SODFormer: Streaming Object Detection with Transformer using Events and Frames,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 11, pp. 14020–14037, 2023

  54. [54]

    Microsoft COCO: Common Objects in Context,

    T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common Objects in Context,” in Proceedings of the European Conference on Computer Vision. Springer, 2014, pp. 740–755

  55. [55]

    Ultralytics YOLO,

    G. Jocher, A. Chaurasia, and J. Qiu, “Ultralytics YOLO,” Jan. 2023. [Online]. Available: https://github.com/ultralytics/ultralytics

  56. [56]

    Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection,

    X. Li, W. Wang, L. Wu, S. Chen, X. Hu, J. Li, J. Tang, and J. Yang, “Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection,” Proceedings of the Advances in Neural Information Processing Systems, vol. 33, pp. 21002–21012, 2020