
arxiv: 2606.17334 · v1 · pith:E2MHQ67Enew · submitted 2026-06-15 · 💻 cs.CV

FATE: Pillar Encoding and Frequency-Aware Training for Event-Based Object Detection

Pith reviewed 2026-06-27 03:13 UTC · model grok-4.3

classification 💻 cs.CV
keywords event-based vision, object detection, pillar encoding, frequency-aware training, event cameras, orthogonal polynomials, mean-teacher learning

The pith

Pillar Encoding projects event streams onto orthogonal polynomials to form dense pseudo-images that preserve temporal structure for object detection at up to 200 Hz.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FATE to handle the sparse asynchronous data from event cameras in deep learning pipelines for object detection. It replaces fixed temporal sub-binning with Pillar Encoding, which groups events into spatial pillars and projects their time evolution inside each macro-window onto a continuous-time orthogonal polynomial basis. This produces an L2-optimal dense pseudo-image that keeps fine temporal dynamics without internal discretization. Frequency-Aware Training then uses a soft mean-teacher curriculum to supply temporally dense pseudo-labels, closing the gap between low-frequency ground truth and high-frequency inference. The result lets standard detectors run at high temporal resolutions with only small added cost in parameters and latency.
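The mean-teacher side of Frequency-Aware Training can be pictured with a minimal Python sketch. It assumes the usual mean-teacher recipe, in which teacher weights track an exponential moving average of student weights, plus a simple confidence-based soft weighting of pseudo-labels. The function names, the `momentum` value, and the `floor` threshold are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def ema_update(teacher, student, momentum=0.999):
    """Move each teacher weight toward the matching student weight (in place)."""
    for name, w in student.items():
        teacher[name] = momentum * teacher[name] + (1.0 - momentum) * w
    return teacher


def soft_pseudo_label_weights(scores, floor=0.1):
    """Hypothetical soft weighting: keep teacher boxes, weighted by confidence.

    Boxes scoring below `floor` get weight 0; the rest are weighted by their
    score instead of being hard-thresholded to 0 or 1.
    """
    scores = np.asarray(scores, dtype=float)
    return np.where(scores >= floor, scores, 0.0)
```

In a training loop, `ema_update` would run after each student optimizer step, and the weights from `soft_pseudo_label_weights` would scale the loss on teacher-generated boxes at timestamps that have no ground-truth label.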

Core claim

FATE enables robust object detection at high temporal resolutions up to 200 Hz by organizing events into spatial pillars whose intra-window evolution is approximated via projection onto a continuous-time orthogonal polynomial basis, yielding an L2-optimal dense pseudo-image, paired with Frequency-Aware Training that generates temporally dense pseudo-labels through a soft mean-teacher curriculum.

What carries the argument

Pillar Encoding, which organizes events into spatial pillars and projects their intra-window time evolution onto a continuous-time orthogonal polynomial basis to produce dense pseudo-images without internal temporal sub-binning.
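One way to make this concrete is the sketch below, which is an interpretation and not the authors' implementation. It assumes a Legendre basis, treats each pillar's signal as its polarity-weighted event train, and stores one channel per polynomial degree. The function name `pillar_encode`, the `cell` size, and the `degree` default are all hypothetical.

```python
import numpy as np
from numpy.polynomial import legendre


def pillar_encode(x, y, t, p, t0, t1, height, width, cell=1, degree=3):
    """Per-pillar Legendre coefficients of the polarity train (assumed form)."""
    # Map timestamps in [t0, t1] onto the Legendre domain [-1, 1].
    tau = 2.0 * (t - t0) / (t1 - t0) - 1.0
    # Basis values P_0..P_degree at each event time: (n_events, degree + 1).
    basis = legendre.legvander(tau, degree)
    # (2k + 1) / 2 normalizes the orthogonal (not orthonormal) Legendre basis.
    scale = (2.0 * np.arange(degree + 1) + 1.0) / 2.0
    contrib = p[:, None] * basis * scale
    rows, cols = y // cell, x // cell
    out = np.zeros((degree + 1, -(-height // cell), -(-width // cell)))
    for k in range(degree + 1):
        # Unbuffered scatter-add so repeated pixels accumulate correctly.
        np.add.at(out[k], (rows, cols), contrib[:, k])
    return out
```

The output has `degree + 1` channels and no internal time bins, so a convolutional detector would only need its first layer's input channel count adjusted to consume it.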

If this is right

  • Standard convolutional detectors can be applied directly to event data at frequencies far above the rate of available ground-truth labels.
  • The same encoding works across multiple detector architectures with only minor changes to input channels.
  • Parameter count and inference latency remain close to those of the underlying detector.
  • Performance gains hold across different event-camera datasets and scene types.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pillar-plus-polynomial representation could be tested on event-based tasks such as optical flow or semantic segmentation without retraining the encoder.
  • Because the basis is continuous, the method might allow variable accumulation windows chosen on the fly rather than fixed at training time.
  • If the polynomial order can be chosen per pillar according to local event density, the representation could adapt automatically to regions with very different motion speeds.
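The last extension can be sketched as a simple rule: cap the polynomial degree in each pillar at one less than its event count, so that the number of coefficients never exceeds the number of events. This rule, the function name, and `max_degree` are assumptions made for illustration; the paper does not describe per-pillar order selection.

```python
import numpy as np


def per_pillar_degree(x, y, height, width, max_degree=5):
    """Return (degree map, event-count map) for one accumulation window."""
    counts = np.zeros((height, width), dtype=int)
    np.add.at(counts, (y, x), 1)
    # A degree-d fit has d + 1 coefficients, so cap d at count - 1.
    # Empty pillars are clipped to degree 0.
    degree = np.clip(counts - 1, 0, max_degree)
    return degree, counts
```

A detector would still need a fixed channel count, so unused higher-order channels in low-degree pillars would presumably be left at zero.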

Load-bearing premise

Projecting events onto a continuous-time orthogonal polynomial basis inside each accumulation window produces an L2-optimal dense pseudo-image that keeps rich temporal dynamics even when events are sparse.

What would settle it

A controlled test at 200 Hz on a dataset with known sparse regions, in which the polynomial projection is replaced by uniform binning. The claim predicts that accuracy then falls below the FATE baseline by more than the reported margin; if it does not, the encoding is not what carries the result.
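The uniform-binning side of such a test could look like the sketch below. It is a generic polarity-sum voxel grid, not a reproduction of any specific baseline from the paper; the function name and `n_bins` are assumptions.

```python
import numpy as np


def uniform_bin_encode(x, y, t, p, t0, t1, height, width, n_bins=4):
    """Sum event polarities into n_bins equal-width temporal bins per pixel."""
    # Bin index of each event; clip so that t == t1 lands in the last bin.
    b = ((t - t0) / (t1 - t0) * n_bins).astype(int)
    b = np.clip(b, 0, n_bins - 1)
    out = np.zeros((n_bins, height, width))
    np.add.at(out, (b, y, x), p)
    return out
```

For a like-for-like comparison, `n_bins` would be set equal to the number of polynomial coefficients per pillar, so that both encodings hand the detector the same number of input channels.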

Figures

Figures reproduced from arXiv: 2606.17334 by Kyoung-Don Kang, Md Tawheedul Islam Bhuian.

Figure 1: Pillar Encoding of FATE. Each event is augmented with local spatial and temporal offsets to provide a stronger inductive bias. Offsets from the pillar-wise event mean capture locally occupied centroids and temporal skew, while offsets from the pillar center define a consistent local coordinate. Together, these features encode both the spatial distribution and temporal layout of events within each pillar. Fo…
Figure 2: Comparison of mAP across different operating frequencies on Gen1.
Figure 3: Precision and recall of pseudo-labels for cars and pedestrians on 10K samples randomly…
Figure 4: Qualitative comparison of bounding boxes (Gen1, 20 Hz).
Figure 5: Visualization of the dense label supervision in FATE-E (Gen1, 20 Hz).
Figure 6: Visualization of the dense label supervision in FATE-S (Gen1, 20 Hz).
Original abstract

Event cameras are bio-inspired sensors that asynchronously capture logarithmic intensity changes, offering inherent advantages in high-speed and high-dynamic-range scenarios. However, the sparse and asynchronous nature of event streams poses a fundamental challenge for modern deep learning architectures. To enable compatibility with standard models, most existing approaches partition the accumulation window into fixed temporal sub-bins. While effective for spatial processing, this internal discretization discards fine-grained temporal structure and constrains inference to the low temporal frequencies imposed by training supervision. To address this limitation, we propose FATE, a unified framework built upon a novel Pillar Encoding (PE). While operating over discrete macro-accumulation windows dictated by the target frequency, PE avoids internal temporal sub-binning. It organizes events into spatial pillars and approximates their intra-window evolution via projection onto a continuous-time orthogonal polynomial basis. This formulation yields an L2-optimal representation that retains rich temporal dynamics in a dense pseudo-image, mitigating information loss under sparse event conditions. To fully leverage this representation, we introduce Frequency-Aware Training (FAT), a soft mean-teacher curriculum that generates temporally dense pseudo-labels, effectively bridging the mismatch between low-frequency supervision and high-frequency inference. Extensive experiments demonstrate that FATE generalizes across architectural paradigms and consistently outperforms strong baselines. It enables robust object detection at high temporal resolutions up to 200 Hz, while incurring minimal overhead in parameter count and inference latency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes FATE, a framework for event-based object detection consisting of Pillar Encoding (PE) that organizes events into spatial pillars and projects their intra-window temporal evolution onto a continuous-time orthogonal polynomial basis to produce an L2-optimal dense pseudo-image, together with Frequency-Aware Training (FAT) that uses a soft mean-teacher curriculum to generate dense pseudo-labels and bridge low-frequency supervision with high-frequency (up to 200 Hz) inference. The method is claimed to generalize across architectures while adding negligible parameter and latency overhead.

Significance. If the central claims hold, the work would enable practical high-temporal-resolution object detection from event cameras without the information loss of fixed sub-binning, addressing a key limitation in current event-vision pipelines.

major comments (1)
  1. §3.2 (Pillar Encoding): the assertion that the orthogonal-polynomial projection yields an L2-optimal representation that retains rich temporal dynamics under sparse event conditions is not supported when the number of events per pillar falls below the basis dimension (degree + 1); at 200 Hz the shortened accumulation windows make this regime likely, yet no per-pillar event-count statistics, condition-number analysis, or basis-order ablation is reported to demonstrate that the coefficients remain informative rather than degenerate or noise-dominated.
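The per-pillar statistic the referee asks for is cheap to compute on a sample of events. The sketch below is an assumption-laden illustration: it takes integer microsecond timestamps, treats each pixel as a pillar, and reports the share of occupied pillar-windows holding fewer events than the basis dimension. The function name and defaults are hypothetical.

```python
import numpy as np


def sparse_pillar_fraction(x, y, t_us, height, width, freq_hz, degree=3):
    """Share of occupied pillar-windows with fewer than degree + 1 events."""
    window_us = 1e6 / freq_hz
    # Index of the accumulation window each event falls into.
    w = np.floor((t_us - t_us.min()) / window_us).astype(int)
    # Dense count grid: fine for a short sample, memory-hungry for long ones.
    counts = np.zeros((w.max() + 1, height, width), dtype=int)
    np.add.at(counts, (w, y, x), 1)
    occupied = counts > 0
    sparse = occupied & (counts < degree + 1)
    return sparse.sum() / occupied.sum()
```

Running it at several values of `freq_hz` on the same recording shows how quickly the sparse regime grows as the window shrinks.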

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback. We address the major comment on Pillar Encoding below and will strengthen the manuscript accordingly.

Point-by-point responses
  1. Referee: §3.2 (Pillar Encoding): the assertion that the orthogonal-polynomial projection yields an L2-optimal representation that retains rich temporal dynamics under sparse event conditions is not supported when the number of events per pillar falls below the basis dimension (degree + 1); at 200 Hz the shortened accumulation windows make this regime likely, yet no per-pillar event-count statistics, condition-number analysis, or basis-order ablation is reported to demonstrate that the coefficients remain informative rather than degenerate or noise-dominated.

    Authors: We thank the referee for this observation. The L2-optimality claim refers strictly to the fact that the coefficients minimize the squared residual of the polynomial fit to the observed event times within each pillar (i.e., the normal-equations solution). We acknowledge, however, that when the number of events falls below the basis dimension, the system is underdetermined and the pseudoinverse solution may become sensitive to noise or fail to capture meaningful dynamics. Because the manuscript currently provides no supporting statistics or ablations for the 200 Hz regime, we will revise §3.2 and the experimental section to include (i) per-pillar event-count histograms at 50 Hz, 100 Hz, and 200 Hz, (ii) condition-number statistics of the design matrix across pillars, and (iii) a basis-order ablation measuring both detection mAP and coefficient stability. These additions will directly address the concern.
    Revision: yes
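The underdetermined regime discussed in this exchange can be checked numerically. The sketch below assumes a Legendre design matrix built from normalized event times in [-1, 1]; `values` is a placeholder for whatever per-event quantity is being fitted, which the text here does not specify. Both function names are hypothetical.

```python
import numpy as np
from numpy.polynomial import legendre


def design_matrix_stats(tau, degree):
    """Rank and 2-norm condition number of the Legendre design matrix."""
    tau = np.asarray(tau, dtype=float)
    V = legendre.legvander(tau, degree)  # shape (n_events, degree + 1)
    rank = int(np.linalg.matrix_rank(V))
    return {
        "n_events": tau.size,
        "rank": rank,
        # Fewer independent rows than coefficients: no unique solution.
        "underdetermined": rank < degree + 1,
        "cond": float(np.linalg.cond(V)),
    }


def min_norm_coeffs(tau, values, degree):
    """Minimum-norm least-squares coefficients via the pseudoinverse."""
    V = legendre.legvander(np.asarray(tau, dtype=float), degree)
    return np.linalg.pinv(V) @ np.asarray(values, dtype=float)
```

Note that the condition number of a wide matrix can look harmless even when the fit is underdetermined, so the rank check is the more direct diagnostic for the case the referee describes.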

Circularity Check

0 steps flagged

No circularity; derivation is self-contained

Full rationale

The abstract and method description define Pillar Encoding explicitly as projection onto an orthogonal polynomial basis (a standard least-squares operation whose L2-optimality follows directly from the definition of the basis and does not presuppose the downstream detection performance or the FAT curriculum). Frequency-Aware Training is introduced as an independent soft mean-teacher procedure without any equations or claims that reduce the claimed high-frequency detection gains to a fitted parameter or self-citation. No load-bearing self-citations, ansatzes smuggled via prior work, or renaming of known results appear in the provided text; the central claims rest on the architectural choices and external experimental validation rather than internal redefinition.
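The L2-optimality invoked here is the standard best-approximation property of orthogonal projection. A short derivation follows, in notation assumed for this note and not taken from the paper: $f$ is a square-integrable signal on the accumulation window, $\varphi_0, \dots, \varphi_K$ is an orthonormal basis, and $a_k$ are arbitrary candidate coefficients.

```latex
\begin{aligned}
c_k &= \langle f, \varphi_k \rangle, \qquad
\langle \varphi_j, \varphi_k \rangle = \delta_{jk}, \\
\Bigl\| f - \sum_{k=0}^{K} a_k \varphi_k \Bigr\|^2
  &= \|f\|^2 - 2 \sum_{k=0}^{K} a_k c_k + \sum_{k=0}^{K} a_k^2 \\
  &= \|f\|^2 - \sum_{k=0}^{K} c_k^2 + \sum_{k=0}^{K} (a_k - c_k)^2 .
\end{aligned}
```

Only the last sum depends on the candidate coefficients, so the error is minimized exactly when $a_k = c_k$. Whether a sparse event train is well modelled as such an $f$ is a separate question, and it is the one the referee raises.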

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The abstract provides limited technical detail; the framework rests on the assumption that polynomial projection is L2-optimal for temporal dynamics and that mean-teacher pseudo-labels effectively bridge frequency gaps, with no free parameters or invented physical entities explicitly quantified.

axioms (1)
  • domain assumption: Projection onto a continuous-time orthogonal polynomial basis yields an L2-optimal representation that retains rich temporal dynamics.
    Invoked in the description of Pillar Encoding to justify avoiding sub-binning.
invented entities (2)
  • Pillar Encoding (PE): no independent evidence
    purpose: Organize events into spatial pillars and approximate intra-window evolution via polynomial projection to create a dense pseudo-image.
    New encoding method introduced to mitigate information loss.
  • Frequency-Aware Training (FAT): no independent evidence
    purpose: Soft mean-teacher curriculum generating temporally dense pseudo-labels to bridge low-frequency supervision and high-frequency inference.
    New training procedure proposed to leverage the encoding.

pith-pipeline@v0.9.1-grok · 5780 in / 1312 out tokens · 37695 ms · 2026-06-27T03:13:33.667729+00:00 · methodology

