arxiv: 2411.18328 · v2 · submitted 2024-11-27 · 💻 cs.CV

EventCrab: Harnessing Frame and Point Synergy for Event-based Action Recognition and Beyond

Pith reviewed 2026-05-23 16:30 UTC · model grok-4.3

classification 💻 cs.CV
keywords event-based action recognition · frame-point synergy · event frames · event points · spiking-like context learner · event point encoder · joint representation space · Hilbert scan

The pith

EventCrab combines lighter frame networks for dense event data with heavier point networks for sparse points to balance accuracy and efficiency in action recognition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets a core mismatch in event-based action recognition: existing methods either convert streams to dense frames handled by heavy networks or process sparse points with light networks, and both miss the data's mixed dense-temporal and sparse-spatial nature. It introduces EventCrab, a framework that pairs the two network types and adds a shared representation space linking frames, text, and points. Two new modules handle the point side: one extracts contextualized points from raw streams and the other encodes long-range spatiotemporal features along a space-filling (Hilbert) curve. The paper reports gains on four datasets, including 5.17 percent on SeAct and 7.01 percent on HARDVS. A reader would care because event cameras produce asynchronous streams that standard pipelines waste or distort.

Core claim

EventCrab is a synergy-aware framework that integrates lighter frame-specific networks for dense event frames with heavier point-specific networks for sparse event points while establishing a joint frame-text-point representation space. It adds a Spiking-like Context Learner to pull contextualized points from raw streams and an Event Point Encoder that processes long spatiotemporal features through Hilbert scanning.
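
The Hilbert scanning named in the core claim can be illustrated with a small sketch. It maps each pixel coordinate to its position along a 2D Hilbert curve and sorts events by that position, so spatially nearby events tend to stay close in the resulting sequence. This only illustrates the general technique: the function name `hilbert_index`, the 2D (x, y) scan, the power-of-two grid size, and the timestamp tie-break are assumptions, and the paper's Event Point Encoder may order points differently.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along a Hilbert curve over an n-by-n grid.

    n must be a power of two. Hypothetical helper for illustration only.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate or flip the quadrant so the sub-curve has the right orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


# Toy events as (x, y, t) tuples on a 4x4 sensor (assumed layout).
events = [(3, 0, 0.2), (0, 0, 0.5), (1, 1, 0.1), (0, 1, 0.3)]

# Order by curve position first, then by timestamp as a tie-break (assumption).
ordered = sorted(events, key=lambda e: (hilbert_index(4, e[0], e[1]), e[2]))
print(ordered)
```

A space-filling order like this turns a sparse 2D point set into a 1D sequence that a sequence model can consume while keeping much of the spatial locality.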

What carries the argument

The synergy-aware framework that pairs frame-specific and point-specific networks, realized through the Spiking-like Context Learner, Event Point Encoder, and joint frame-text-point representation space.
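
To make the term "spiking-like" concrete, here is a minimal sketch of a generic leaky integrate-and-fire (LIF) unit, the textbook building block of spiking networks. It is not the paper's Spiking-like Context Learner, whose design is not described in the material reviewed here; the function name `lif_spikes`, the `decay` and `threshold` values, and the hard reset are all assumptions chosen for illustration.

```python
def lif_spikes(inputs, decay=0.5, threshold=1.0):
    """Run one leaky integrate-and-fire unit over a sequence of input values.

    Returns a list of 0/1 spikes, one per input step. Illustrative only.
    """
    v = 0.0  # membrane potential
    spikes = []
    for current in inputs:
        v = decay * v + current  # leak the old potential, then integrate the input
        if v >= threshold:
            spikes.append(1)
            v = 0.0  # hard reset after a spike (assumed reset rule)
        else:
            spikes.append(0)
    return spikes


# Inputs could be, for example, normalized event counts per time bin (assumption).
print(lif_spikes([0.6, 0.8, 0.0, 0.3, 0.9]))
```

The unit fires only when recent activity accumulates past the threshold, which is one simple way a module could keep temporally dense events and suppress isolated ones.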

If this is right

  • The joint frame-text-point space allows direct transfer between dense and sparse event representations.
  • The Spiking-like Context Learner and Hilbert-scan encoder together capture both local context and long-range structure in event points.
  • Reported accuracy lifts of 5.17 percent on SeAct and 7.01 percent on HARDVS follow directly from the balanced integration.
  • The same architecture applies to additional event-based tasks beyond action recognition.
  • Efficiency gains arise because lighter frame networks offset the cost of heavier point networks.
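
One common way to build a shared space across modalities is a CLIP-style contrastive objective, sketched below with NumPy. Whether EventCrab uses this exact loss is not stated in the material reviewed here, so the pairing of losses, the temperature of 0.07, and the names `info_nce` and `joint_loss` are assumptions. The sketch only shows how frame, text, and point embeddings of the same sample could be pulled together.

```python
import numpy as np


def l2_normalize(z):
    """Scale each row to unit length so dot products are cosine similarities."""
    return z / np.linalg.norm(z, axis=1, keepdims=True)


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss; row i of `a` and row i of `b` form a positive pair."""
    logits = l2_normalize(a) @ l2_normalize(b).T / temperature

    def cross_entropy_on_diagonal(l):
        l = l - l.max(axis=1, keepdims=True)  # subtract row max for numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(logits.T))


def joint_loss(frame, text, point):
    """Sum of pairwise alignment losses over the three modalities (assumed design)."""
    return info_nce(frame, text) + info_nce(point, text) + info_nce(frame, point)


rng = np.random.default_rng(0)
z = rng.normal(size=(4, 8))           # stand-in embeddings for 4 samples
aligned = joint_loss(z, z, z)         # all three modalities agree
shuffled = joint_loss(z, z[::-1], z)  # text rows mismatched with frames and points
print(aligned < shuffled)
```

The loss is lower when matching rows across modalities are close, which is the property a joint frame-text-point space needs.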

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same frame-point pairing could be tested on event-based object detection or tracking without retraining the core modules from scratch.
  • If the joint representation space generalizes, it might allow text prompts to guide point-feature selection during inference.
  • A follow-up experiment could measure whether the Hilbert-scan ordering still helps when event density varies across scenes.
  • The approach implicitly suggests that other asynchronous sensor streams, such as LiDAR points, might benefit from analogous dense-sparse pairing.

Load-bearing premise

The dense temporal and sparse spatial traits of asynchronous event streams can be handled by merging frame and point networks without creating new training or inference conflicts.

What would settle it

A controlled comparison on SeAct or HARDVS in which the combined EventCrab model shows no accuracy gain or efficiency improvement over the best frame-only or point-only baseline using the same backbone networks.
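
A comparison of this kind could be scored with a paired bootstrap over the shared test set, sketched below. The function name `paired_bootstrap_gain`, the 95 percent interval, and the toy labels are assumptions; the sketch shows one reasonable way to check whether an accuracy gain is distinguishable from zero, not the evaluation protocol of the paper.

```python
import numpy as np


def paired_bootstrap_gain(y_true, pred_base, pred_new, n_boot=2000, seed=0):
    """Accuracy gain of pred_new over pred_base with a 95% bootstrap interval.

    Both prediction lists must refer to the same test samples, in the same order.
    """
    y_true, pred_base, pred_new = map(np.asarray, (y_true, pred_base, pred_new))
    # Per-sample difference in correctness: +1, 0, or -1.
    diff = (pred_new == y_true).astype(float) - (pred_base == y_true).astype(float)
    rng = np.random.default_rng(seed)
    # Resample test items with replacement, keeping the pairing intact.
    idx = rng.integers(0, len(diff), size=(n_boot, len(diff)))
    boot = diff[idx].mean(axis=1)
    low, high = np.percentile(boot, [2.5, 97.5])
    return diff.mean(), low, high


# Toy labels and predictions (made up for illustration).
y_true = [0, 1, 2, 1, 0, 2, 1, 0]
baseline = [0, 1, 0, 1, 0, 0, 1, 1]
combined = [0, 1, 2, 1, 0, 2, 1, 1]
gain, low, high = paired_bootstrap_gain(y_true, baseline, combined)
print(gain, low, high)
```

If the interval includes zero, the test set does not separate the combined model from the baseline.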

Figures

Figures reproduced from arXiv: 2411.18328 by Jiachao Zhang, Jinhui Tang, Meiqi Cao, Rui Yan, Xiangbo Shu, Zechao Li.

Figure 1: Insight of our work. Previous methods are limited to [...]
Figure 2: Framework of the proposed EventCrab. For the event-point embedding, the Spiking-like Context Learner (SCL) and the Event [...]
Figure 4: Visualization of events before/after processed by SCL [...]
Figure 6: Visualization of the Top-3 predicted results on the SeAct [...]
Figure 7: Computational effectiveness analysis between ours [...]
Original abstract

Event-based Action Recognition (EAR) possesses the advantages of high-temporal resolution capturing and privacy preservation compared with traditional action recognition. Current leading EAR solutions typically follow two regimes: project unconstructed event streams into dense constructed event frames and adopt powerful frame-specific networks, or employ lightweight point-specific networks to handle sparse unconstructed event points directly. However, such two regimes are blind to a fundamental issue: failing to accommodate the unique dense temporal and sparse spatial properties of asynchronous event data. In this article, we present a synergy-aware framework, i.e., EventCrab, that adeptly integrates the "lighter" frame-specific networks for dense event frames with the "heavier" point-specific networks for sparse event points, balancing accuracy and efficiency. Furthermore, we establish a joint frame-text-point representation space to bridge distinct event frames and points. In specific, to better exploit the unique spatiotemporal relationships inherent in asynchronous event points, we devise two strategies for the "heavier" point-specific embedding: i) a Spiking-like Context Learner (SCL) that extracts contextualized event points from raw event streams. ii) an Event Point Encoder (EPE) that further explores event-point long spatiotemporal features in a Hilbert-scan way. Experiments on four datasets demonstrate the significant performance of our proposed EventCrab, particularly gaining improvements of 5.17% on SeAct and 7.01% on HARDVS.
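
The first regime in the abstract, projecting an event stream into dense frames, can be sketched as a simple accumulation of events into time bins. This is a generic count-based representation, assuming events arrive as (x, y, t, p) rows with polarity p in {0, 1}; the function name `events_to_frames` and the binning scheme are assumptions, and the frame representation used in the paper may differ.

```python
import numpy as np


def events_to_frames(events, height, width, num_bins):
    """Accumulate (x, y, t, p) events into per-bin, per-polarity count frames.

    events: array of shape (N, 4) with columns x, y, t, p and p in {0, 1}.
    Returns an array of shape (num_bins, 2, height, width).
    """
    frames = np.zeros((num_bins, 2, height, width), dtype=np.float32)
    if len(events) == 0:
        return frames
    x = events[:, 0].astype(np.int64)
    y = events[:, 1].astype(np.int64)
    t = events[:, 2]
    p = events[:, 3].astype(np.int64)
    # Map timestamps onto bin indices 0 .. num_bins - 1.
    span = max(t.max() - t.min(), 1e-9)
    bins = np.minimum(((t - t.min()) / span * num_bins).astype(np.int64), num_bins - 1)
    # np.add.at accumulates correctly even when the same cell is hit repeatedly.
    np.add.at(frames, (bins, p, y, x), 1.0)
    return frames


# Four toy events on a 3-wide, 2-high sensor (made up for illustration).
events = np.array([
    [0, 0, 0.00, 1],
    [1, 0, 0.10, 0],
    [1, 0, 0.15, 0],
    [2, 1, 0.90, 1],
])
frames = events_to_frames(events, height=2, width=3, num_bins=2)
print(frames.shape)
```

The resulting tensor is dense in time and mostly zero in space, which is the property the abstract points to when contrasting frames with raw points.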

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that existing event-based action recognition (EAR) methods fail to accommodate the dense temporal and sparse spatial properties of asynchronous event data, and proposes EventCrab as a synergy-aware framework that integrates lighter frame-specific networks for dense event frames with heavier point-specific networks for sparse event points. It introduces a Spiking-like Context Learner (SCL) and Hilbert-scan Event Point Encoder (EPE) for point embedding, establishes a joint frame-text-point representation space, and reports empirical gains including 5.17% on SeAct and 7.01% on HARDVS across four datasets.

Significance. If the results hold under rigorous validation, the work is significant for offering a practical hybrid approach to EAR that balances accuracy and efficiency while introducing a multimodal joint representation space. The empirical gains on multiple datasets constitute a concrete strength for an applied framework paper; the design of SCL and EPE as targeted modules for event-point properties is a clear contribution if the synergy is shown to function as claimed.

major comments (2)
  1. [Abstract / Method] Abstract and method description: the central claim that the proposed synergy (frame + point networks via SCL/EPE plus joint space) resolves the failure mode of prior regimes by accommodating dense-temporal/sparse-spatial properties without training conflicts is load-bearing but unsupported; no explicit analysis, constraint, or diagnostic is provided showing that joint optimization preserves distinct signals rather than allowing branch dominance or alignment artifacts.
  2. [Experiments] Experiments section: the reported improvements (5.17% on SeAct, 7.01% on HARDVS) are presented as evidence of the framework's effectiveness, yet without ablations isolating the contribution of the joint frame-text-point space, SCL, or EPE versus baseline combinations, it remains unclear whether the gains derive from the claimed synergy or from other implementation choices.
minor comments (2)
  1. [Abstract] The abstract refers to results on four datasets but names only SeAct and HARDVS; the full list of datasets and per-dataset breakdowns should be explicitly stated in the experiments section for completeness.
  2. [Method] The notation and architectural details for SCL and EPE would benefit from accompanying equations or pseudocode in the method section to clarify the spiking-like context extraction and Hilbert-scan encoding mechanisms.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree that strengthening the evidence for the claimed synergy mechanism and providing more targeted ablations will improve the manuscript. We will revise accordingly by adding the requested analyses and experiments.

Point-by-point responses
  1. Referee: [Abstract / Method] Abstract and method description: the central claim that the proposed synergy (frame + point networks via SCL/EPE plus joint space) resolves the failure mode of prior regimes by accommodating dense-temporal/sparse-spatial properties without training conflicts is load-bearing but unsupported; no explicit analysis, constraint, or diagnostic is provided showing that joint optimization preserves distinct signals rather than allowing branch dominance or alignment artifacts.

    Authors: We acknowledge that explicit diagnostics would better substantiate the claim that joint optimization preserves distinct frame and point signals. While the performance gains across datasets and the design of SCL and EPE are intended to address the dense-temporal/sparse-spatial properties, we will add a dedicated analysis subsection in the revision. This will include t-SNE visualizations of the joint representation space, per-branch performance breakdowns, and training dynamics (e.g., gradient norms) to demonstrate that neither branch dominates and that no alignment artifacts arise. revision: yes

  2. Referee: [Experiments] Experiments section: the reported improvements (5.17% on SeAct, 7.01% on HARDVS) are presented as evidence of the framework's effectiveness, yet without ablations isolating the contribution of the joint frame-text-point space, SCL, or EPE versus baseline combinations, it remains unclear whether the gains derive from the claimed synergy or from other implementation choices.

    Authors: We agree that isolating the individual contributions of the joint frame-text-point space, SCL, and EPE is necessary to attribute gains specifically to the synergy. The manuscript already reports overall results and some module comparisons, but we will expand the experiments with new ablation tables that systematically remove or replace each component (joint space, SCL, EPE) while keeping other factors fixed. These will be added to clarify the source of the reported improvements. revision: yes
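
The ablation grid described in this response can be enumerated mechanically. The sketch below lists every on/off combination of the three named components so that each can be trained with other factors held fixed. The identifiers `joint_space`, `scl`, and `epe` and the helper `ablation_configs` are hypothetical labels for illustration, not names from the paper's code.

```python
from itertools import product

# Hypothetical component labels taken from the module names in the review.
COMPONENTS = ("joint_space", "scl", "epe")


def ablation_configs(components=COMPONENTS):
    """Return one dict per on/off combination of the given components."""
    return [
        dict(zip(components, flags))
        for flags in product((True, False), repeat=len(components))
    ]


configs = ablation_configs()
for cfg in configs:
    # Name each run by its enabled components; nothing enabled is the baseline.
    name = "+".join(k for k, on in cfg.items() if on) or "baseline"
    print(name)
```

With three components this yields eight runs, from the full model down to a baseline with none of the components enabled.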

Circularity Check

0 steps flagged

No circularity; empirical architecture proposal validated on external datasets

Full rationale

The paper proposes EventCrab as an empirical framework that combines frame-specific and point-specific networks via SCL, EPE, and a joint frame-text-point space, with performance shown via experiments on four datasets (e.g., +5.17% on SeAct). No equations, parameter fits, or self-citations are presented that reduce any claimed result to its own inputs by construction. The design choices address stated limitations of prior regimes through new components whose efficacy is measured externally rather than assumed or renamed from prior fits.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities beyond the named modules can be identified or verified.

invented entities (2)
  • Spiking-like Context Learner (SCL) · no independent evidence
    purpose: extracts contextualized event points from raw event streams
    New component introduced to handle point context
  • Event Point Encoder (EPE) · no independent evidence
    purpose: explores event-point long spatiotemporal features in a Hilbert-scan way
    New component introduced for long-range point features


