arxiv: 2605.08825 · v3 · submitted 2026-05-09 · 💻 cs.CV

Rethinking Event-Based Object Detection through Representation-Level Temporal Aggregation and Model-Level Hypergraph Reasoning

Pith reviewed 2026-05-15 05:16 UTC · model grok-4.3

classification 💻 cs.CV
keywords event-based vision · object detection · temporal aggregation · hypergraph reasoning · event cameras · feature fusion · sparse data processing

The pith

Ev-DTAD improves event-based object detection by pairing a compact three-channel temporal representation with hypergraph feature reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Event cameras record brightness changes at microsecond scale, yet prior detection pipelines bury that timing information inside bulky or indirect encodings and then fail to stitch scattered events into stable object shapes. The paper introduces Ev-DTAD, which first builds a Hierarchical Temporal Aggregation representation that packs intra- and inter-window timing directly into a three-channel pseudo-image. It then feeds the resulting multi-scale features into a Frequency-aware Hypergraph Temporal Fusion module that models high-order temporal relations across scales. Experiments on three public event datasets show modest accuracy lifts alongside 1.6–2.0 times faster inference. A sympathetic reader would conclude that explicit low-level temporal encoding plus relational reasoning at the model level can make event-based perception both more accurate and more practical for real-time use.

Core claim

The paper claims that a unified detector combining Hierarchical Temporal Aggregation at the representation level with Frequency-aware Hypergraph Temporal Fusion at the model level produces a favorable accuracy-speed trade-off for event-based object detection. The representation explicitly embeds temporal structure from sparse events into a compact three-channel format, while the fusion module performs temporal evolution modeling and high-order relational reasoning to recover coherent object features from fragmented inputs.
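
To make the representation-level idea concrete, the sketch below builds a three-channel frame from one window of events. It is an illustration only: the abstract does not specify what each HTA channel holds, so the channel assignment (squashed event count, latest normalized timestamp, decayed carry-over from the previous window), the function name `three_channel_temporal_frame`, and the `decay` parameter are all assumptions rather than the authors' formulation.

```python
import numpy as np


def three_channel_temporal_frame(events, height, width, t_start, t_end,
                                 prev_surface=None, decay=0.5):
    """Hypothetical three-channel temporal frame (NOT the paper's exact HTA).

    events: array-like of shape (N, 4) with columns x, y, t, p.
    Polarity p is ignored in this simplified sketch.
    """
    count = np.zeros((height, width), dtype=np.float64)
    latest = np.zeros((height, width), dtype=np.float64)

    events = np.asarray(events, dtype=np.float64).reshape(-1, 4)
    if events.shape[0] > 0:
        x = events[:, 0].astype(np.int64)
        y = events[:, 1].astype(np.int64)
        # Timestamp normalized to [0, 1] inside the current window.
        tau = np.clip((events[:, 2] - t_start) / max(t_end - t_start, 1e-9),
                      0.0, 1.0)
        # Unbuffered scatter ops so repeated pixels accumulate correctly.
        np.add.at(count, (y, x), 1.0)        # how many events hit each pixel
        np.maximum.at(latest, (y, x), tau)   # most recent event time per pixel

    if prev_surface is None:
        prev_surface = np.zeros((height, width), dtype=np.float64)
    # Inter-window channel: fade the previous window, refresh with new timing.
    carried = np.maximum(decay * prev_surface, latest)

    # Channel 0 squashes counts into [0, 1) so all channels share a range.
    return np.stack([count / (1.0 + count), latest, carried])
```

Feeding each call's third channel back in as `prev_surface` for the next window is one simple way to carry timing across windows; the paper's actual inter-window scheme may differ.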

What carries the argument

Hierarchical Temporal Aggregation (HTA) representation paired with Frequency-aware Hypergraph Temporal Fusion (FHTF) module inside the Ev-DTAD detector, which encodes timing directly into a pseudo-RGB input and then reasons high-order temporal relations among multi-scale event features.
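
For readers unfamiliar with the "high-order relational reasoning" half of the argument, the sketch below shows a generic hypergraph convolution in the style of Hypergraph Neural Networks (Feng et al., 2019). It is not the FHTF module: the frequency-aware and temporal parts are omitted, and how the paper builds its hyperedges is not stated in the abstract, so the incidence matrix `H` here is supplied by hand as an assumption.

```python
import numpy as np


def hypergraph_conv(X, H, Theta, edge_weights=None):
    """Generic hypergraph convolution (illustrative, not the paper's FHTF).

    Computes Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2} X Theta, where
    X:     (n_nodes, in_dim) node features
    H:     (n_nodes, n_edges) incidence matrix, H[v, e] = 1 if node v is in edge e
    Theta: (in_dim, out_dim) learnable projection
    """
    X = np.asarray(X, dtype=np.float64)
    H = np.asarray(H, dtype=np.float64)
    n_edges = H.shape[1]
    w = (np.ones(n_edges) if edge_weights is None
         else np.asarray(edge_weights, dtype=np.float64))

    dv = H @ w           # weighted node degrees
    de = H.sum(axis=0)   # number of nodes in each hyperedge
    # Guard against isolated nodes and empty hyperedges.
    dv_inv_sqrt = np.where(dv > 0, 1.0 / np.sqrt(np.maximum(dv, 1e-12)), 0.0)
    de_inv = np.where(de > 0, 1.0 / np.maximum(de, 1e-12), 0.0)

    # Step 1: gather node features into each hyperedge (nodes -> edges).
    edge_feat = de_inv[:, None] * (H.T @ (dv_inv_sqrt[:, None] * X))
    # Step 2: scatter hyperedge features back to member nodes (edges -> nodes).
    node_feat = dv_inv_sqrt[:, None] * (H @ (w[:, None] * edge_feat))
    return node_feat @ np.asarray(Theta, dtype=np.float64)
```

Because one hyperedge can join any number of nodes, a single layer lets every member of a group exchange information at once, which is the property the paper relies on to reconnect fragmented event responses.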

If this is right

  • On the Gen1 dataset the detector reports +0.8 mAP while running 1.7 times faster than prior methods.
  • On the 1Mpx/Gen4 dataset it reports +0.5 mAP at 1.6 times the speed.
  • On the eTraM dataset it reports +3.0 mAP at twice the speed.
  • The gains are attributed to the complementarity of compact temporal encoding and high-order relational reasoning rather than to either component alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same HTA representation could be reused as a drop-in temporal front-end for event-based tracking or segmentation without retraining the full detector.
  • Hypergraph reasoning may prove especially useful when event noise increases, such as in outdoor scenes with flickering lights or fast camera motion.
  • If the three-channel format preserves timing information reliably, longer time horizons could be handled by stacking multiple HTA windows rather than by adding recurrent layers.
  • The approach suggests that future event detectors might benefit from testing alternative relational structures beyond hypergraphs once the compact representation is fixed.

Load-bearing premise

The Hierarchical Temporal Aggregation and Frequency-aware Hypergraph Temporal Fusion steps can combine sparse, fragmented events into coherent high-order object features without critical loss of information or introduction of artifacts that would degrade detection.

What would settle it

Running the method on a new event sequence containing extremely sparse or rapidly changing objects where standard voxel-grid or event-volume representations still produce usable detections but Ev-DTAD accuracy drops below baseline.

Figures

Figures reproduced from arXiv: 2605.08825 by Chengjie Wang, Hao Deng, Ma Yuanxiao, Meisen Wang, Shaoyi Du, Siqi Li, Wei Bao, Zhiqiang Tian.

Figure 1: Illustrative comparison between 2D Histogram (2D-HG) and Hierarchical Temporal Aggregation (HTA) representation. HTA is one of our key contributions, encoding temporal event information into a compact pseudo-RGB format. Compared with 2D-HG, HTA produces cleaner object structures with reduced background noise. Yellow boxes highlight representative differences.
Figure 2: Latency–accuracy comparison on Gen1. Bubble area is proportional to the number of parameters. Our models achieve state-of-the-art accuracy while maintaining competitive inference speed.
Figure 3: Framework overview. Ev-DTAD consists of two core components: offline HTA representation generation and the FHTF module. It first converts asynchronous events into compact HTA frames offline. Consecutive frames are grouped into clips/videos and fed into the network, where multi-scale features are extracted and refined by FHTF through temporal evolution and frequency-aware hypergraph reasoning. The refined …
Figure 4: Qualitative comparisons on Gen1. Ev-DTAD achieves more accurate and robust detections than the baseline, reducing missed detections and improving localization in challenging scenes.
Figure 5: Hyperedge sensitivity of FHTF.
    Adjacent text: … brings a +0.4 mAP improvement, while inter-window aggregation brings a +1.9 mAP improvement. Combining both achieves the best performance of 53.5% mAP and improves the no-aggregation variant by +2.2 mAP. These results show that intra-window aggregation captures local temporal evidence, while inter-window aggregation preserves temporal continuity across adjacent windows.
Figure 6: Additional test results on 1Mpx/Gen4. We report the detection performance of Ev-DTAD on the high-resolution driving benchmark. Panel labels: Scene #1 to Scene #5, GT, Ev-DTAD.
Figure 7: Additional test results on eTraM. We report the detection performance of Ev-DTAD on the static traffic monitoring benchmark.
    Adjacent text (B.2 Class-wise Results): To further analyze the behavior of Ev-DTAD across object categories, we report class-wise AP on Gen1, 1Mpx/Gen4, and eTraM. Gen1 evaluates car and pedestrian; 1Mpx/Gen4 evaluates pedestrian, two-wheeler, and car; eTraM follows the official grouped protocol with…
Figure 8: Qualitative visualization of representative low-mAP Gen4 videos. Red boxes indicate objects that are visually detected by our model but missing in the ground-truth annotations. Zoom in for details.
    Adjacent text: This observation suggests that video-level evaluation can reveal dataset-specific challenges that are not fully reflected by average mAP. For fair comparison with prior methods, our quantitative evaluation stric…
Original abstract

Event cameras provide microsecond-level temporal resolution, low latency, and high dynamic range, offering potential for perception under fast motion and challenging illumination conditions. However, existing Event-based Object Detection (EOD) methods face limitations at both the representation and model levels: prior event representations usually encode temporal information indirectly through redundant structures, while detection models struggle to explicitly aggregate fragmented event responses into coherent high-order object features. To address these limitations, we present Event Dual Temporal-Relational Aggregation Detector (Ev-DTAD), a unified EOD framework that integrates representation-level temporal encoding with model-level temporal-hypergraph reasoning. Specifically, we introduce Hierarchical Temporal Aggregation (HTA), a compact three-channel pseudo-RGB representation that explicitly embeds temporal information across intra- and inter-window events. To further enhance detection under sparse and fragmented event responses, we propose Frequency-aware Hypergraph Temporal Fusion (FHTF), which refines multi-scale event features through temporal evolution modeling and high-order relational reasoning. Extensive experiments on Gen1 (+0.8 mAP and 1.7× faster), 1Mpx/Gen4 (+0.5 mAP and 1.6× faster), and eTraM (+3.0 mAP and 2.0× faster) demonstrate that Ev-DTAD achieves a competitive accuracy-efficiency trade-off, validating the complementarity between compact temporal representation and temporal-hypergraph feature reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces Ev-DTAD, an event-based object detection framework that integrates Hierarchical Temporal Aggregation (HTA)—a compact three-channel pseudo-RGB representation embedding intra- and inter-window temporal information—with Frequency-aware Hypergraph Temporal Fusion (FHTF) for multi-scale feature refinement via temporal evolution modeling and high-order relational reasoning. It reports mAP gains of +0.8 on Gen1, +0.5 on 1Mpx/Gen4, and +3.0 on eTraM, together with 1.6–2.0× speedups, claiming these results validate the complementarity of compact temporal representation and temporal-hypergraph reasoning.

Significance. If the reported gains are attributable to HTA and FHTF rather than confounding factors, the work would offer a practical advance in event-based vision by addressing sparse, fragmented event streams through explicit temporal encoding and high-order modeling, potentially improving accuracy-efficiency trade-offs in fast-motion or low-light scenarios.

major comments (1)
  1. Abstract and experimental results: the central claim that the observed mAP and speed improvements validate the complementarity of HTA and FHTF rests on the untested assumption that these modules, rather than backbone choice, training schedule, or preprocessing, drive the gains. No ablations are described that replace HTA with voxel-grid or event-stacking baselines or FHTF with standard convolutions/attention, leaving the attribution of improvements unsupported.
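
One of the baselines the referee asks for, the voxel grid, can be sketched as follows. This is a common formulation with linear interpolation along the time axis, written here as an assumption about what such an ablation would use; the function name `voxel_grid` and the (x, y, t, p) event layout are illustrative and not taken from the paper.

```python
import numpy as np


def voxel_grid(events, num_bins, height, width):
    """Illustrative voxel-grid event representation (baseline sketch).

    events: array-like of shape (N, 4) with columns x, y, t, p.
    Each event spreads its signed polarity over the two nearest time bins.
    """
    grid = np.zeros((num_bins, height, width), dtype=np.float64)
    events = np.asarray(events, dtype=np.float64).reshape(-1, 4)
    if events.shape[0] == 0:
        return grid

    x = events[:, 0].astype(np.int64)
    y = events[:, 1].astype(np.int64)
    t = events[:, 2]
    # Map polarity to {-1, +1}; accepts both {0, 1} and {-1, 1} encodings.
    p = np.where(events[:, 3] > 0, 1.0, -1.0)

    # Rescale timestamps to the continuous bin axis [0, num_bins - 1].
    span = max(t.max() - t.min(), 1e-9)
    t_star = (num_bins - 1) * (t - t.min()) / span

    lower = np.floor(t_star).astype(np.int64)
    upper = np.minimum(lower + 1, num_bins - 1)
    frac = t_star - lower  # distance from the lower bin, in [0, 1)

    # Linear interpolation in time; np.add.at accumulates repeated indices.
    np.add.at(grid, (lower, y, x), p * (1.0 - frac))
    np.add.at(grid, (upper, y, x), p * frac)
    return grid
```

With B bins this produces a B-channel input rather than the three channels of HTA, so a controlled swap would presumably also change the width of the network's first layer.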

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We agree that stronger experimental evidence is needed to attribute the reported gains specifically to HTA and FHTF rather than other design choices. In the revised manuscript we will add the requested ablation studies to address this concern directly.

Point-by-point responses
  1. Referee: [—] Abstract and experimental results: the central claim that the observed mAP and speed improvements validate the complementarity of HTA and FHTF rests on the untested assumption that these modules, rather than backbone choice, training schedule, or preprocessing, drive the gains. No ablations are described that replace HTA with voxel-grid or event-stacking baselines or FHTF with standard convolutions/attention, leaving the attribution of improvements unsupported.

    Authors: We acknowledge that the current manuscript does not contain the specific ablation experiments suggested by the referee. The paper reports overall performance gains and comparisons against prior EOD methods, but does not isolate HTA by direct replacement with voxel-grid or event-stacking representations, nor does it replace FHTF with standard convolutions or attention modules. We agree this leaves the attribution of improvements less conclusive than desired. We will therefore run the additional ablations (HTA vs. voxel-grid and event-stacking; FHTF vs. convolutional and attention baselines) under controlled training schedules and preprocessing, and include the results together with updated analysis in the revised manuscript. These additions will strengthen the claim that the observed accuracy-efficiency trade-offs arise from the complementarity of the two proposed components. revision: yes

Circularity Check

0 steps flagged

No significant circularity; novel components validated on external benchmarks

full rationale

The paper defines HTA as a three-channel pseudo-RGB temporal embedding and FHTF as frequency-aware hypergraph fusion from explicit design choices at representation and model levels, without any reduction to fitted parameters renamed as predictions or self-referential equations. Central claims rest on empirical gains measured against independent datasets (Gen1, 1Mpx/Gen4, eTraM) rather than internal consistency alone, satisfying the criterion for self-contained external validation. No load-bearing self-citations, uniqueness theorems, or ansatz smuggling appear in the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; contributions are presented as algorithmic modules without stated fitting procedures or new postulated constructs.
