
arxiv: 2605.20680 · v1 · pith:CSONL6EZnew · submitted 2026-05-20 · 💻 cs.CV

DarkShake-DVS: Event-based Human Action Recognition under Low-light andShaking Camera Conditions

Pith reviewed 2026-05-21 05:34 UTC · model grok-4.3

classification 💻 cs.CV
keywords event-based vision · human action recognition · low-light conditions · camera motion compensation · IMU integration · DarkShake-DVS dataset · shaking camera · motion blur reduction

The pith

Event cameras paired with IMU motion compensation enable reliable human action recognition in low-light and shaking conditions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to make human action recognition work when cameras operate in darkness and undergo rapid 6-DoF motion that would normally destroy image quality and temporal order. It does this by feeding event-camera data through an IMU-driven warping step that removes motion blur before a four-stage network extracts action features. A new real-world dataset of more than 18,000 clips supplies the training and test material that existing benchmarks lack. Experiments across three datasets show the combined pipeline beats prior methods. If the approach holds, practical vision systems could function in night-time or handheld scenarios where conventional cameras fail.

Core claim

The paper claims that an Event-IMU Stabilized HAR (EIS-HAR) system, built around a non-linear warping function derived from synchronized IMU measurements to produce motion-compensated event frames and a four-stage hybrid network to extract spatiotemporal features, achieves consistent gains over state-of-the-art methods on the newly introduced DarkShake-DVS benchmark and two other datasets.

What carries the argument

The EIS module, which derives a non-linear warping function from IMU data to reconstruct motion-compensated event frames for input to the downstream HAR network.
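To make the warping idea concrete, here is a minimal sketch; it is not the authors' implementation. It assumes a pinhole camera with focal length `f` (in pixels) and principal point `cx`, a constant gyroscope rate `omega_y` about the y-axis over the window, and pure rotation. The function names, the event tuple layout `(x, y, t, p)`, and the sign convention are all illustrative assumptions. The tangent makes the displacement depend on where the event sits on the sensor, which is one way such a warp ends up non-linear.

```python
import math

def compensate_x(x, t, t_ref, omega_y, f, cx):
    """Horizontally compensated pixel coordinate of one event (assumed model)."""
    theta = omega_y * (t - t_ref)   # rotation accumulated since t_ref, in radians
    alpha = math.atan2(x - cx, f)   # incident angle of the ray through pixel x
    beta = alpha - theta            # incident angle with the rotation undone (sign is an assumption)
    return cx + f * math.tan(beta)

def warp_events(events, t_ref, omega_y, f, cx):
    """Warp every (x, y, t, p) event back to the reference time t_ref.

    Only the x coordinate is corrected; y, t and p pass through unchanged.
    """
    return [(compensate_x(x, t, t_ref, omega_y, f, cx), y, t, p)
            for x, y, t, p in events]
```

Under these assumptions an event at the principal point shifts by roughly `f * theta` pixels, while events nearer the sensor border shift by more for the same rotation.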

If this is right

  • Action recognition systems can now be deployed in low-light environments with unconstrained handheld or vehicle-mounted cameras.
  • The DarkShake-DVS dataset becomes a standard testbed for evaluating event-based methods under combined darkness and 6-DoF motion.
  • The four-stage hybrid architecture provides an efficient way to process the high-temporal-resolution data produced by compensated event streams.
  • Synchronized IMU data becomes a standard auxiliary input for any event-camera pipeline that must handle camera ego-motion.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same warping technique could be tested on other event-based tasks such as gesture spotting or object tracking in moving cameras.
  • Combining the compensated events with conventional RGB frames might further improve performance in mixed lighting without requiring full sensor replacement.
  • If the non-linear model proves robust, similar compensation could be applied to event data in robotics or autonomous vehicles where IMU readings are already available.

Load-bearing premise

The non-linear warping function derived from IMU measurements produces motion-compensated event frames whose spatiotemporal statistics stay close enough to the original events that the four-stage network can still extract reliable action features.
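One way to probe this premise is to compare simple event statistics before and after compensation. The sketch below is an illustrative check, not something the paper reports. It assumes events are `(x, y, t, p)` tuples with `t` in seconds and `p` in {+1, -1}, and that warped events falling outside a `width` by `height` sensor are discarded; the helper names are hypothetical. Because a purely spatial warp leaves timestamps and polarities untouched, any shift in these statistics under that assumption comes from events pushed off the sensor.

```python
from collections import Counter

def crop_to_sensor(events, width, height):
    """Keep only events whose (possibly warped) coordinates lie on the sensor."""
    return [e for e in events if 0 <= e[0] < width and 0 <= e[1] < height]

def event_rate_histogram(events, bin_width):
    """Count events per time bin of length bin_width (seconds, t >= 0)."""
    counts = Counter(int(t // bin_width) for _, _, t, _ in events)
    n_bins = max(counts) + 1 if counts else 0
    return [counts.get(i, 0) for i in range(n_bins)]

def positive_fraction(events):
    """Share of events with positive polarity; 0.0 for an empty stream."""
    if not events:
        return 0.0
    return sum(1 for _, _, _, p in events if p > 0) / len(events)

def retention(raw, compensated):
    """Fraction of raw events that survive warping and cropping."""
    return len(compensated) / len(raw) if raw else 0.0
```

Comparing `event_rate_histogram` and `positive_fraction` on the raw stream and on the cropped, compensated stream gives a first indication of how much the warp alters the input the network sees.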

What would settle it

Running the four-stage network on raw unwarped event frames from the DarkShake-DVS dataset and obtaining equal or higher accuracy than the full EIS-HAR pipeline would falsify the claim that the IMU-based compensation step is what drives the reported gains.

Figures

Figures reproduced from arXiv: 2605.20680 by Jiaqi Chen, Liyuan Pan, Qinfu Xu.

Figure 1: RGB limitations and the effectiveness of IMU-based …
Figure 2: Examples of our DarkShake-DVS dataset. It contains 18K pairs of RGB frames and event streams, covering both indoor and outdoor scenes. The examples show data captured under different scenarios, as well as samples with varying degrees of camera motion. [31, 49] for motion scenarios. Furthermore, IMU data synchronized with event streams provide angular velocity and linear acceleration, supplying motion cues…
Figure 3: Illustration of motion compensation. (a) gives the y-axis compensation displacement. P is the event coordinate triggered by A. After the camera rotates θ around the y-axis, P moves to P′. The incident angles before and after the movement are α and β, respectively. The compensation displacement should be the red ∆l in the pixel plane. (b) illustrates the event frame representation. In our method, the …
Figure 4: Overview of our framework. We represent event data as three-channel event images, which are processed by an Iterative Greedy Sampling (IGS) module that uses a dynamic suppression strategy to select a compact, informative set of keyframes. These keyframes are fed into the HSTS module, which has a four-stage hybrid architecture that jointly captures long-range structure and local spatiotemporal cues. Finally…
Figure 5: Motion compensation comparison with IMU-based and …
Figure 7: Visualization of the feature distribution for 15 action …
Original abstract

Human Action Recognition (HAR) is a fundamental computer vision task with diverse real-world applications. Practical deployments often involve low-light environments and unconstrained 6-DoF camera motion, conditions that degrade visual quality, disrupt temporal coherence, and compromise reliability of existing methods. Event cameras, with high low-light sensitivity and microsecond-level temporal resolution, paired with an inertial measurement unit (IMU), present a promising solution. However, current research faces two key challenges: absence of a benchmark integrating low-light conditions, 6-DoF motion, and synchronized IMU data; and lack of effective motion compensation techniques. To address these, we propose Event-IMU Stabilized HAR (EIS-HAR), with two modules. The first is an EIS module that reduces motion blur via a non-linear warping function to reconstruct a motion-compensated input. The second is a HAR module with a four-stage hybrid architecture to efficiently extract spatiotemporal features for accurate action recognition. To alleviate data scarcity, we introduce DarkShake-DVS, the first large-scale event-based HAR benchmark that includes 18,041 real-world clips captured in low light and intense 6-DoF motion, supplemented by synchronized IMU data. Extensive experiments on three datasets demonstrate consistent superiority of EIS-HAR over state-of-the-art methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces DarkShake-DVS, a new large-scale event-based HAR benchmark with 18,041 real-world clips under low-light and 6-DoF shaking conditions with synchronized IMU data. It proposes EIS-HAR consisting of an EIS module that applies a non-linear warping function derived from IMU measurements to produce motion-compensated event frames, followed by a four-stage hybrid HAR network for spatiotemporal feature extraction. The central claim is that extensive experiments on three datasets demonstrate consistent superiority of EIS-HAR over state-of-the-art methods.

Significance. If the superiority claim holds after addressing validation gaps, the work would be significant for practical event-based vision: it supplies the first benchmark integrating low-light, intense 6-DoF motion, and IMU synchronization, and demonstrates a concrete pipeline that combines IMU-driven compensation with a hybrid network. The empirical nature (no free parameters or closed-form derivations) is offset by the introduction of a reproducible dataset and the potential for real-world deployment in robotics or surveillance.

major comments (2)
  1. [EIS module] EIS module description: the superiority claim requires that the non-linear IMU warping produces compensated frames whose polarity, timing, and spatial density remain sufficiently close to clean low-light events for the downstream four-stage HAR network to extract reliable features. No quantitative check (e.g., event-rate histograms, polarity distribution statistics, or optical-flow consistency before/after warping) is reported to confirm this invariance; without it the performance gains could arise from network exploitation of warping-induced artifacts rather than true motion compensation.
  2. [Experimental results] Experimental results (abstract and § on datasets): the abstract states consistent outperformance on three datasets but supplies no error bars, statistical significance tests, or ablation isolating the warping function from the network architecture. This omission leaves the central claim plausible yet incompletely supported, as the contribution of each module cannot be quantified.
minor comments (2)
  1. [Title] The title contains a missing space: 'Low-light andShaking'.
  2. [HAR module] Clarify the exact four-stage architecture of the HAR module (e.g., which layers are convolutional vs. recurrent) and provide a diagram or pseudocode for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and indicate the revisions we will incorporate to strengthen the manuscript.

Point-by-point responses
  1. Referee: [EIS module] EIS module description: the superiority claim requires that the non-linear IMU warping produces compensated frames whose polarity, timing, and spatial density remain sufficiently close to clean low-light events for the downstream four-stage HAR network to extract reliable features. No quantitative check (e.g., event-rate histograms, polarity distribution statistics, or optical-flow consistency before/after warping) is reported to confirm this invariance; without it the performance gains could arise from network exploitation of warping-induced artifacts rather than true motion compensation.

    Authors: We agree that explicit quantitative validation of the warping step would more directly support the claim that performance gains derive from motion compensation. In the revised manuscript we will add event-rate histograms, polarity distribution statistics, and optical-flow consistency metrics computed before and after the non-linear IMU warping on representative sequences from DarkShake-DVS. These analyses will be placed in the EIS-module subsection to demonstrate that the compensated frames retain the essential statistical properties of the original low-light events. revision: yes

  2. Referee: [Experimental results] Experimental results (abstract and § on datasets): the abstract states consistent outperformance on three datasets but supplies no error bars, statistical significance tests, or ablation isolating the warping function from the network architecture. This omission leaves the central claim plausible yet incompletely supported, as the contribution of each module cannot be quantified.

    Authors: We acknowledge that the current presentation would benefit from greater statistical rigor and component-wise quantification. In the revision we will (i) report mean accuracy together with standard deviation across multiple random seeds in all tables, (ii) add paired statistical significance tests (e.g., t-tests) against the strongest baselines, and (iii) include a dedicated ablation that compares the full EIS-HAR pipeline against an identical four-stage network trained on unwarped event frames. The abstract will be updated to reference these additional controls. revision: yes
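The seed-level reporting and paired test described in this response can be sketched with the standard library alone. The accuracy lists below are invented placeholders, not results from the paper, and `mean_std` and `paired_t_statistic` are hypothetical helpers. Turning the statistic into a p-value would additionally need a t-distribution table or a statistics package.

```python
import math
import statistics

def mean_std(values):
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(values), statistics.stdev(values)

def paired_t_statistic(a, b):
    """Paired t statistic and degrees of freedom for matched runs a[i], b[i]."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long lists with at least two runs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1

# Placeholder accuracies for five seeds (NOT taken from the paper).
full_pipeline = [0.80, 0.82, 0.81, 0.83, 0.79]
unwarped_baseline = [0.76, 0.79, 0.77, 0.80, 0.76]

t_stat, dof = paired_t_statistic(full_pipeline, unwarped_baseline)
```

Pairing by seed matters here: both variants should share the same data splits and initialization seeds so that the difference isolates the warping step.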

Circularity Check

0 steps flagged

Empirical pipeline validated on external benchmarks with no derivation reducing to self-inputs

full rationale

The paper describes an empirical method: an EIS module applying non-linear IMU-based warping for motion compensation, followed by a four-stage hybrid HAR network for feature extraction. A new dataset DarkShake-DVS is introduced, and performance is measured via experiments on three datasets against prior SOTA baselines. No equations, fitted parameters, or self-citations are shown to reduce the reported accuracy gains to quantities defined inside the paper by construction. The central claims rest on external dataset results rather than internal redefinitions or self-referential premises.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The pipeline rests on standard assumptions from event-camera literature (brightness-change events remain informative after warping) and deep-learning practice (hybrid conv-recurrent stages can extract action features from stabilized frames). No new physical entities or ad-hoc constants are introduced beyond typical network hyperparameters.

axioms (1)
  • domain assumption: IMU measurements provide sufficiently accurate 6-DoF pose to drive a non-linear warping that reduces motion blur in event data
    Invoked in the description of the EIS module as the basis for reconstructing motion-compensated input.

pith-pipeline@v0.9.0 · 5763 in / 1441 out tokens · 29286 ms · 2026-05-21T05:34:14.303741+00:00 · methodology


Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages

  1. [1]

    A low power, fully event-based gesture recognition system

    Arnon Amir, Brian Taba, David Berg, Timothy Melano, Jeffrey McKinstry, Carmelo Di Nolfo, Tapan Nayak, Alexander Andreopoulos, Guillaume Garreau, Marcela Mendoza, et al. A low power, fully event-based gesture recognition system. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7243–7252, 2017. 2, 3

  2. [2]

    Is space-time attention all you need for video understanding?

    Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding? In ICML, page 4, 2021. 7, 8

  3. [3]

    A 240 × 180 130 dB 3 µs latency global shutter spatiotemporal vision sensor

    Christian Brandli, Raphael Berner, Minhao Yang, Shih-Chii Liu, and Tobi Delbruck. A 240 × 180 130 dB 3 µs latency global shutter spatiotemporal vision sensor. IEEE Journal of Solid-State Circuits, 49(10):2333–2341, 2014. 1

  4. [4]

    Optimal ann-snn conversion for high-accuracy and ultra-low-latency spiking neural networks

    Tong Bu, Wei Fang, Jianhao Ding, PengLin Dai, Zhaofei Yu, and Tiejun Huang. Optimal ann-snn conversion for high-accuracy and ultra-low-latency spiking neural networks. arXiv preprint arXiv:2303.04347, 2023. 3

  5. [5]

    Spiking deep convolutional neural networks for energy-efficient object recognition

    Yongqiang Cao, Yang Chen, and Deepak Khosla. Spiking deep convolutional neural networks for energy-efficient object recognition. International Journal of Computer Vision, 113:54–66, 2015. 3

  6. [6]

    Spikmamba: When snn meets mamba in event-based human action recognition

    Jiaqi Chen, Yan Yang, Shizhuo Deng, Da Teng, and Liyuan Pan. Spikmamba: When snn meets mamba in event-based human action recognition. In Proceedings of the 6th ACM International Conference on Multimedia in Asia, pages 1–8,

  7. [7]

    Integration of dynamic vision sensor with inertial measurement unit for electronically stabilized event-based vision

    Tobi Delbruck, Vicente Villanueva, and Luca Longinotti. Integration of dynamic vision sensor with inertial measurement unit for electronically stabilized event-based vision. In 2014 IEEE International Symposium on Circuits and Systems (ISCAS), pages 2636–2639. IEEE, 2014. 2

  8. [8]

    Dynamic obstacle avoidance for quadrotors with event cameras

    Davide Falanga, Kevin Kleber, and Davide Scaramuzza. Dynamic obstacle avoidance for quadrotors with event cameras. Science Robotics, 5(40):eaaz9712, 2020. 2

  9. [9]

    Slowfast networks for video recognition

    Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. Slowfast networks for video recognition. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6202–6211, 2019. 7, 8

  10. [10]

    Accurate angular velocity estimation with an event camera

    Guillermo Gallego and Davide Scaramuzza. Accurate angular velocity estimation with an event camera. IEEE Robotics and Automation Letters, 2(2):632–639, 2017. 2

  11. [11]

    A unifying contrast maximization framework for event cameras, with applications to motion, depth, and optical flow estimation

    Guillermo Gallego, Henri Rebecq, and Davide Scaramuzza. A unifying contrast maximization framework for event cameras, with applications to motion, depth, and optical flow estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3867–3876,

  12. [12]

    Action recognition and benchmark using event cameras

    Yue Gao, Jiaxuan Lu, Siqi Li, Nan Ma, Shaoyi Du, Yipeng Li, and Qionghai Dai. Action recognition and benchmark using event cameras. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023. 2, 3

  13. [13]

    End-to-end learning of representations for asynchronous event-based data

    Daniel Gehrig, Antonio Loquercio, Konstantinos G Derpanis, and Davide Scaramuzza. End-to-end learning of representations for asynchronous event-based data. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5633–5643, 2019. 7

  14. [14]

    A reservoir-based convolutional spiking neural network for gesture recognition from dvs input

    Arun M George, Dighanchal Banerjee, Sounak Dey, Arijit Mukherjee, and P Balamurali. A reservoir-based convolutional spiking neural network for gesture recognition from dvs input. In 2020 International Joint Conference on Neural Networks (IJCNN), pages 1–9. IEEE, 2020. 3

  15. [15]

    Ternary spike: Learning ternary spikes for spiking neural networks

    Yufei Guo, Yuanpei Chen, Xiaode Liu, Weihang Peng, Yuhan Zhang, Xuhui Huang, and Zhe Ma. Ternary spike: Learning ternary spikes for spiking neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 12244–12252, 2024. 3

  16. [16]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. 7, 8

  17. [17]

    3d convolutional neural networks for human action recognition

    Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 3d convolutional neural networks for human action recognition. IEEE transactions on pattern analysis and machine intelligence, 35(1):221–231, 2012. 7

  18. [18]

    Training deep spiking convolutional neural networks with stdp-based unsupervised pre-training followed by supervised fine-tuning

    Chankyu Lee, Priyadarshini Panda, Gopalakrishnan Srinivasan, and Kaushik Roy. Training deep spiking convolutional neural networks with stdp-based unsupervised pre-training followed by supervised fine-tuning. Frontiers in neuroscience, 12:435, 2018. 3

  19. [19]

    Videomamba: State space model for efficient video understanding

    Kunchang Li, Xinhao Li, Yi Wang, Yinan He, Yali Wang, Limin Wang, and Yu Qiao. Videomamba: State space model for efficient video understanding. In European conference on computer vision, pages 237–255. Springer, 2024. 2, 8

  20. [20]

    A 128×128 120 dB 15 µs latency asynchronous temporal contrast vision sensor

    Patrick Lichtsteiner, Christoph Posch, and Tobi Delbruck. A 128×128 120 dB 15 µs latency asynchronous temporal contrast vision sensor. IEEE Journal of Solid-State Circuits, 43(2):566–576, 2008. 1

  21. [21]

    Tsm: Temporal shift module for efficient video understanding

    Ji Lin, Chuang Gan, and Song Han. Tsm: Temporal shift module for efficient video understanding. In Proceedings of the IEEE/CVF international conference on computer vision, pages 7083–7093, 2019. 7, 8

  22. [22]

    Storyboard-guided alignment for fine-grained video action recognition

    Enqi Liu, Liyuan Pan, Yan Yang, Yiran Zhong, Zhijing Wu, Xinxiao Wu, and Liu Liu. Storyboard-guided alignment for fine-grained video action recognition. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. 1

  23. [23]

    Event-based action recognition using motion information and spiking neural networks

    Qianhui Liu, Dong Xing, Huajin Tang, De Ma, and Gang Pan. Event-based action recognition using motion information and spiking neural networks. In IJCAI, pages 1743–1749, 2021. 3

  24. [24]

    Vmamba: Visual state space model

    Yue Liu, Yunjie Tian, Yuzhong Zhao, Hongtian Yu, Lingxi Xie, Yaowei Wang, Qixiang Ye, Jianbin Jiao, and Yunfan Liu. Vmamba: Visual state space model. Advances in neural information processing systems, 37:103031–103063, 2024. 8

  25. [25]

    Tam: Temporal adaptive module for video recognition

    Zhaoyang Liu, Limin Wang, Wayne Wu, Chen Qian, and Tong Lu. Tam: Temporal adaptive module for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 13708–13718, 2021. 7, 8

  26. [26]

    Video swin transformer

    Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. Video swin transformer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3202–3211, 2022. 2, 6, 7, 8

  27. [27]

    Sgdr: Stochastic gradient descent with warm restarts

    Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations, 2017. 6

  28. [28]

    Qualitative action recognition by wireless radio signals in human–machine systems

    Shaohe Lv, Yong Lu, Mianxiong Dong, Xiaodong Wang, Yong Dou, and Weihua Zhuang. Qualitative action recognition by wireless radio signals in human–machine systems. IEEE Transactions on Human-Machine Systems, 47(6):789–800, 2017. 1

  29. [29]

    Event-based moving object detection and tracking

    Anton Mitrokhin, Cornelia Fermüller, Chethan Parameshwara, and Yiannis Aloimonos. Event-based moving object detection and tracking. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1–9. IEEE, 2018. 2, 7, 8

  30. [30]

    Converting static image datasets to spiking neuromorphic datasets using saccades

    Garrick Orchard, Ajinkya Jayawant, Gregory K Cohen, and Nitish Thakor. Converting static image datasets to spiking neuromorphic datasets using saccades. Frontiers in neuroscience, 9:437, 2015. 2

  31. [31]

    Bringing a blurry frame alive at high frame-rate with an event camera

    Liyuan Pan, Cedric Scheerlinck, Xin Yu, Richard Hartley, Miaomiao Liu, and Yuchao Dai. Bringing a blurry frame alive at high frame-rate with an event camera. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 2

  32. [32]

    High frame rate video reconstruction based on an event camera

    Liyuan Pan, Richard Hartley, Cedric Scheerlinck, Miaomiao Liu, Xin Yu, and Yuchao Dai. High frame rate video reconstruction based on an event camera. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(5):2519–2533, 2020. 1

  33. [33]

    Single image optical flow estimation with an event camera

    Liyuan Pan, Miaomiao Liu, and Richard Hartley. Single image optical flow estimation with an event camera. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1669–1678. IEEE, 2020. 1

  34. [34]

    0-mms: Zero-shot multi-motion segmentation with a monocular event camera

    Chethan M Parameshwara, Nitin J Sanket, Chahat Deep Singh, Cornelia Fermüller, and Yiannis Aloimonos. 0-mms: Zero-shot multi-motion segmentation with a monocular event camera. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 9594–9600. IEEE,

  35. [35]

    Get: Group event transformer for event-based vision

    Yansong Peng, Yueyi Zhang, Zhiwei Xiong, Xiaoyan Sun, and Feng Wu. Get: Group event transformer for event-based vision. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6038–6048, 2023. 7

  36. [36]

    Spiking pointnet: Spiking neural networks for point clouds

    Dayong Ren, Zhe Ma, Yuanpei Chen, Weihang Peng, Xiaode Liu, Yuhan Zhang, and Yufei Guo. Spiking pointnet: Spiking neural networks for point clouds. Advances in Neural Information Processing Systems, 36, 2024. 3

  37. [37]

    Event transformer

    Alberto Sabater, Luis Montesano, and Ana C Murillo. Event transformer. A sparse-aware solution for efficient event data processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2677–2686, 2022. 3

  38. [38]

    Deep liquid state machines with neural plasticity for video activity recognition

    Nicholas Soures and Dhireesha Kudithipudi. Deep liquid state machines with neural plasticity for video activity recognition. Frontiers in neuroscience, 13:686, 2019. 3

  39. [39]

    Event-based motion segmentation by motion compensation

    Timo Stoffregen, Guillermo Gallego, Tom Drummond, Lindsay Kleeman, and Davide Scaramuzza. Event-based motion segmentation by motion compensation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7244–7253, 2019. 2

  40. [40]

    A closer look at spatiotemporal convolutions for action recognition

    Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 6450–6459, 2018. 7

  41. [41]

    Dailydvs-200: A comprehensive benchmark dataset for event-based action recognition

    Qi Wang, Zhou Xu, Yuming Lin, Jingtao Ye, Hongsheng Li, Guangming Zhu, Syed Afaq Ali Shah, Mohammed Bennamoun, and Liang Zhang. Dailydvs-200: A comprehensive benchmark dataset for event-based action recognition. In European Conference on Computer Vision, pages 55–72. Springer, 2024. 1, 6

  42. [42]

    Event stream based human action recognition: a high-definition benchmark dataset and algorithms

    Xiao Wang, Shiao Wang, Pengpeng Shao, Bo Jiang, Lin Zhu, and Yonghong Tian. Event stream based human action recognition: a high-definition benchmark dataset and algorithms. arXiv preprint arXiv:2408.09764, 2024. 2, 3, 7

  43. [43]

    Hardvs: Revisiting human activity recognition with dynamic vision sensors

    Xiao Wang, Zongzhen Wu, Bo Jiang, Zhimin Bao, Lin Zhu, Guoqi Li, Yaowei Wang, and Yonghong Tian. Hardvs: Revisiting human activity recognition with dynamic vision sensors. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5615–5623, 2024. 2, 6, 7, 8

  44. [44]

    Action-net: Multipath excitation for action recognition

    Zhengwei Wang, Qi She, and Aljosa Smolic. Action-net: Multipath excitation for action recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13214–13223, 2021. 7

  45. [45]

    Event voxel set transformer for spatiotemporal representation learning on event streams

    Bochen Xie, Yongjian Deng, Zhanpeng Shao, Hai Liu, Qingsong Xu, and Youfu Li. Event voxel set transformer for spatiotemporal representation learning on event streams. arXiv preprint arXiv:2303.03856, 2023. 3

  46. [46]

    Jianyang Xie, Yitian Zhao, Yanda Meng, He Zhao, Anh Nguyen, and Yalin Zheng. Are spatial-temporal graph convolution networks for human action recognition over-parameterized? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 24309–24319, 2025. 1

  47. [47]

    Long-Hao Yang, Fei-Fei Ye, Chris Nugent, Jun Liu, and Ying-Ming Wang. Belief-rule-based system with self-organizing and multi-temporal modeling for sensor-based human activity recognition. IEEE Journal of Biomedical and Health Informatics, 29(2):1062–1073, 2025. 1

  48. [48]

    Event camera data pre-training

    Yan Yang, Liyuan Pan, and Liu Liu. Event camera data pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 10699–10709, 2023. 1

  49. [49]

    Ezsr: Event-based zero-shot recognition

    Yan Yang, Liyuan Pan, Dongxu Li, and Liu Liu. Ezsr: Event-based zero-shot recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4628–4638, 2025. 2

  50. [50]

    Event camera data dense pre-training

    Yan Yang, Liyuan Pan, and Liu Liu. Event camera data dense pre-training. In Computer Vision – ECCV 2024, pages 292–310, Cham, 2025. Springer Nature Switzerland. 1

  51. [51]

    Event-based few-shot fine-grained human action recognition

    Zonglin Yang, Yan Yang, Yuheng Shi, Hao Yang, Ruikun Zhang, Liu Liu, Xinxiao Wu, and Liyuan Pan. Event-based few-shot fine-grained human action recognition. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 519–526. IEEE, 2024. 1

  52. [52]

    Spike-driven transformer

    Man Yao, Jiakui Hu, Zhaokun Zhou, Li Yuan, Yonghong Tian, Bo Xu, and Guoqi Li. Spike-driven transformer. Advances in neural information processing systems, 36:64043–64058, 2023. 7

  53. [53]

    Xugao Yu and Mohammed A. A. Al-qaness. Human activity recognition using deep residual convolutional network based on wearable sensors. IEEE Journal of Biomedical and Health Informatics, 29(3):1950–1958, 2025. 1

  54. [54]

    Renjie Zhang, Di Lin, Xin Wang, George Baciu, C. L. Philip Chen, and Ping Li. Accurate-pgnet: Learning to assemble perceptual body parts for accurate human skeleton establishment. IEEE Transactions on Multimedia, 27:1706–1721,

  55. [55]

    Event-based real-time moving object detection based on imu ego-motion compensation

    Chunhui Zhao, Yakun Li, and Yang Lyu. Event-based real-time moving object detection based on imu ego-motion compensation. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 690–696. IEEE,

  56. [56]

    Jstr: Joint spatio-temporal reasoning for event-based moving object detection

    Hanyu Zhou, Zhiwei Shi, Hao Dong, Shihan Peng, Yi Chang, and Luxin Yan. Jstr: Joint spatio-temporal reasoning for event-based moving object detection. arXiv preprint arXiv:2403.07436, 2024. 2

  57. [57]

    Exact: Language-guided conceptual reasoning and uncertainty estimation for event-based action recognition and more

    Jiazhou Zhou, Xu Zheng, Yuanhuiyi Lyu, and Lin Wang. Exact: Language-guided conceptual reasoning and uncertainty estimation for event-based action recognition and more. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18633–18643,

  58. [58]

    Event-based motion segmentation with spatio-temporal graph cuts

    Yi Zhou, Guillermo Gallego, Xiuyuan Lu, Siqi Liu, and Shaojie Shen. Event-based motion segmentation with spatio-temporal graph cuts. IEEE transactions on neural networks and learning systems, 34(8):4868–4880, 2021. 2, 5

  59. [59]

    Spikformer: When spiking neural network meets transformer

    Zhaokun Zhou, Yuesheng Zhu, Chao He, Yaowei Wang, Shuicheng Yan, Yonghong Tian, and Li Yuan. Spikformer: When spiking neural network meets transformer. In The Eleventh International Conference on Learning Representations. 2, 7, 8

  60. [60]

    Semantic-guided cross-modal prompt learning for skeleton-based zero-shot action recognition

    Anqi Zhu, Jingmin Zhu, James Bailey, Mingming Gong, and Qiuhong Ke. Semantic-guided cross-modal prompt learning for skeleton-based zero-shot action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13876–13885, 2025. 1

  61. [61]

    Vision mamba: efficient visual representation learning with bidirectional state space model

    Lianghui Zhu, Bencheng Liao, Qian Zhang, Xinlong Wang, Wenyu Liu, and Xinggang Wang. Vision mamba: efficient visual representation learning with bidirectional state space model. In Proceedings of the 41st International Conference on Machine Learning, pages 62429–62442, 2024. 8