DarkShake-DVS: Event-based Human Action Recognition under Low-light andShaking Camera Conditions

Jiaqi Chen; Liyuan Pan; Qinfu Xu

arxiv: 2605.20680 · v1 · pith:CSONL6EZnew · submitted 2026-05-20 · 💻 cs.CV

Pith reviewed 2026-05-21 05:34 UTC · model grok-4.3

classification 💻 cs.CV

keywords: event-based vision, human action recognition, low-light conditions, camera motion compensation, IMU integration, DarkShake-DVS dataset, shaking camera, motion blur reduction

The pith

Event cameras paired with IMU motion compensation enable reliable human action recognition in low-light and shaking conditions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to make human action recognition work when cameras operate in darkness and undergo rapid 6-DoF motion, conditions that would normally degrade image quality and disrupt temporal coherence. It feeds event-camera data through an IMU-driven warping step that reduces motion blur, then passes the compensated frames to a four-stage network that extracts action features. A new real-world dataset of 18,041 clips supplies the training and test material that existing benchmarks lack. The authors report that the combined pipeline outperforms prior methods on three datasets. If the approach holds, practical vision systems could function in night-time or handheld scenarios where conventional cameras fail.

Core claim

The paper claims that an Event-IMU Stabilized HAR (EIS-HAR) system, built around a non-linear warping function derived from synchronized IMU measurements to produce motion-compensated event frames and a four-stage hybrid network to extract spatiotemporal features, achieves consistent gains over state-of-the-art methods on the newly introduced DarkShake-DVS benchmark and two other datasets.

What carries the argument

The EIS module, which derives a non-linear warping function from IMU data to reconstruct motion-compensated event frames for input to the downstream HAR network.
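
To make the idea of IMU-driven warping concrete, here is a minimal sketch of rotation-only compensation for a single axis. It is not the paper's warping function, which the page does not spell out. The function names `warp_event` and `compensate`, the pinhole intrinsics (`focal_px`, `cx`, `cy`), the constant angular velocity `omega_y`, the sign convention, and the neglect of translation are all assumptions made for illustration, loosely following the incident-angle geometry described in the Figure 3 caption.

```python
import math


def warp_event(x, y, t, t_ref, omega_y, focal_px, cx, cy):
    """Map one event at pixel (x, y), time t, to where it would lie at t_ref.

    Assumptions (not from the paper): pinhole camera, rotation about the
    y-axis only, constant angular velocity omega_y in rad/s, no translation.
    """
    theta = omega_y * (t - t_ref)           # rotation accumulated since t_ref
    ray_x, ray_y, ray_z = x - cx, y - cy, focal_px
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    # Rotate the viewing ray by -theta about the y-axis to undo the camera motion.
    rot_x = cos_t * ray_x - sin_t * ray_z
    rot_z = sin_t * ray_x + cos_t * ray_z
    if rot_z <= 0:
        raise ValueError("event rotates behind the image plane")
    return cx + focal_px * rot_x / rot_z, cy + focal_px * ray_y / rot_z


def compensate(events, t_ref, omega_y, focal_px, cx, cy):
    """Warp a list of (x, y, t, polarity) events; time and polarity are untouched."""
    return [warp_event(x, y, t, t_ref, omega_y, focal_px, cx, cy) + (t, p)
            for x, y, t, p in events]
```

Under these assumptions an event at the principal point shifts by `focal_px * tan(theta)`, while an event near the image edge shifts by more for the same rotation, which is one way a warp derived from angular velocity ends up non-linear in pixel position.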

If this is right

Action recognition systems can now be deployed in low-light environments with unconstrained handheld or vehicle-mounted cameras.
The DarkShake-DVS dataset becomes a standard testbed for evaluating event-based methods under combined darkness and 6-DoF motion.
The four-stage hybrid architecture provides an efficient way to process the high-temporal-resolution data produced by compensated event streams.
Synchronized IMU data becomes a standard auxiliary input for any event-camera pipeline that must handle camera ego-motion.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same warping technique could be tested on other event-based tasks such as gesture spotting or object tracking in moving cameras.
Combining the compensated events with conventional RGB frames might further improve performance in mixed lighting without requiring full sensor replacement.
If the non-linear model proves robust, similar compensation could be applied to event data in robotics or autonomous vehicles where IMU readings are already available.

Load-bearing premise

The non-linear warping function derived from IMU measurements produces motion-compensated event frames whose spatiotemporal statistics stay close enough to the original events that the four-stage network can still extract reliable action features.

What would settle it

Running the four-stage network on raw unwarped event frames from the DarkShake-DVS dataset and obtaining equal or higher accuracy than the full EIS-HAR pipeline would falsify the claim that the IMU-based compensation step is what drives the reported gains.

Figures

Figures reproduced from arXiv: 2605.20680 by Jiaqi Chen, Liyuan Pan, Qinfu Xu.

**Figure 2.** Examples of our DarkShake-DVS dataset. It contains 18K pairs of RGB frames and event streams, covering both indoor and outdoor scenes. The examples show data captured under different scenarios, as well as samples with varying degrees of camera motion. … [31, 49] for motion scenarios. Furthermore, IMU data synchronized with event streams provide angular velocity and linear acceleration, supplying motion cues…

**Figure 3.** Illustration of motion compensation. (a) gives the y-axis compensation displacement. P is the event coordinate triggered by A. After the camera rotates by θ around the y-axis, P moves to P′. The incident angles before and after the movement are α and β, respectively, and the compensation displacement is the red Δl in the pixel plane. (b) illustrates the event frame representation. In our method, the …

**Figure 4.** Overview of our framework. We represent event data as three-channel event images, which are processed by an Iterative Greedy Sampling (IGS) module that uses a dynamic suppression strategy to select a compact, informative set of keyframes. These keyframes are fed into the HSTS module, which has a four-stage hybrid architecture that jointly captures long-range structure and local spatiotemporal cues. Finally…

**Figure 5.** Motion compensation comparison with IMU-based and …

**Figure 7.** Visualization of the feature distribution for 15 action …
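
The Figure 4 caption mentions an Iterative Greedy Sampling module with a suppression strategy but gives no algorithm. The sketch below shows one plausible reading, in the spirit of non-maximum suppression over time: repeatedly pick the highest-scoring frame, then damp the scores of its temporal neighbours. The function name `greedy_keyframes`, the use of non-negative per-frame scores such as event counts, and the fixed `decay` factor (standing in for whatever dynamic rule the paper uses) are all assumptions.

```python
def greedy_keyframes(scores, k, radius, decay=0.5):
    """Pick up to k frame indices, damping neighbours' scores after each pick.

    scores: non-negative informativeness per frame (an assumption, e.g. event counts)
    radius: how many frames on each side of a pick get suppressed
    decay:  multiplicative damping applied to suppressed neighbours
    """
    work = list(scores)
    chosen = []
    for _ in range(min(k, len(work))):
        best = max(range(len(work)), key=lambda i: work[i])
        chosen.append(best)
        work[best] = float("-inf")            # never pick the same frame twice
        lo, hi = max(0, best - radius), min(len(work), best + radius + 1)
        for j in range(lo, hi):
            if work[j] != float("-inf"):
                work[j] *= decay              # suppress temporally close frames
    return sorted(chosen)
```

With `radius=0` this reduces to plain top-k selection; a larger radius spreads the selected keyframes across the clip, which is presumably the point of suppressing near-duplicates.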

Original abstract

Human Action Recognition (HAR) is a fundamental computer vision task with diverse real-world applications. Practical deployments often involve low-light environments and unconstrained 6-DoF camera motion, conditions that degrade visual quality, disrupt temporal coherence, and compromise reliability of existing methods. Event cameras, with high low-light sensitivity and microsecond-level temporal resolution, paired with an inertial measurement unit (IMU), present a promising solution. However, current research faces two key challenges: absence of a benchmark integrating low-light conditions, 6-DoF motion, and synchronized IMU data; and lack of effective motion compensation techniques. To address these, we propose Event-IMU Stabilized HAR (EIS-HAR), with two modules. The first is an EIS module that reduces motion blur via a non-linear warping function to reconstruct a motion-compensated input. The second is a HAR module with a four-stage hybrid architecture to efficiently extract spatiotemporal features for accurate action recognition. To alleviate data scarcity, we introduce DarkShake-DVS, the first large-scale event-based HAR benchmark that includes 18,041 real-world clips captured in low light and intense 6-DoF motion, supplemented by synchronized IMU data. Extensive experiments on three datasets demonstrate consistent superiority of EIS-HAR over state-of-the-art methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The new dataset for event-based HAR under low light and camera shake is the real contribution here, but the warping method needs stronger validation to support the performance claims.

The main contribution is a new dataset for event-based human action recognition that covers low-light scenes with intense camera shake, paired with an IMU-driven warping step that compensates for the motion. DarkShake-DVS provides 18,041 clips with synchronized IMU data, and the authors present it as the first benchmark to combine all of these elements. The EIS module applies a non-linear warping function derived from the IMU measurements to reconstruct motion-compensated event frames, and a four-stage hybrid network then extracts features for classification. The reported gains over prior methods on three datasets suggest the pipeline works in conditions where standard approaches fall short.

The experiments would be more convincing with additional detail. There is no mention of statistical significance, error bars, or ablations that test the warping function separately from the network architecture. This leaves open the possibility that the gains come from something other than motion compensation that preserves clean event statistics. The warping could distort polarity or density in ways the model picks up as cues instead of true action features; that is a fair concern, and direct comparisons of event properties before and after the step would address it.

The paper targets researchers developing deployable event vision systems for robotics and similar fields that must operate without controlled lighting or stable cameras. The dataset will likely see adoption for testing robustness. The paper deserves a serious referee because the benchmark is novel and the application is practical, even if the current evidence for the method's specific contribution could be tighter.

Referee Report

2 major / 2 minor

Summary. The paper introduces DarkShake-DVS, a new large-scale event-based HAR benchmark with 18,041 real-world clips under low-light and 6-DoF shaking conditions with synchronized IMU data. It proposes EIS-HAR consisting of an EIS module that applies a non-linear warping function derived from IMU measurements to produce motion-compensated event frames, followed by a four-stage hybrid HAR network for spatiotemporal feature extraction. The central claim is that extensive experiments on three datasets demonstrate consistent superiority of EIS-HAR over state-of-the-art methods.

Significance. If the superiority claim holds after addressing validation gaps, the work would be significant for practical event-based vision: it supplies the first benchmark integrating low-light, intense 6-DoF motion, and IMU synchronization, and demonstrates a concrete pipeline that combines IMU-driven compensation with a hybrid network. The empirical nature (no free parameters or closed-form derivations) is offset by the introduction of a reproducible dataset and the potential for real-world deployment in robotics or surveillance.

major comments (2)

[EIS module] EIS module description: the superiority claim requires that the non-linear IMU warping produces compensated frames whose polarity, timing, and spatial density remain sufficiently close to clean low-light events for the downstream four-stage HAR network to extract reliable features. No quantitative check (e.g., event-rate histograms, polarity distribution statistics, or optical-flow consistency before/after warping) is reported to confirm this invariance; without it the performance gains could arise from network exploitation of warping-induced artifacts rather than true motion compensation.
[Experimental results] Experimental results (abstract and § on datasets): the abstract states consistent outperformance on three datasets but supplies no error bars, statistical significance tests, or ablation isolating the warping function from the network architecture. This omission leaves the central claim plausible yet incompletely supported, as the contribution of each module cannot be quantified.
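
The checks named in the first major comment can be sketched with a few summary statistics computed on the same clip before and after warping. This is an illustrative sketch only: the event layout `(x, y, t, p)` and the helper names `polarity_ratio`, `event_rate_histogram`, `in_frame_fraction`, and `normalized_l1` are assumptions, and the paper does not report any such metrics.

```python
def polarity_ratio(events):
    """Fraction of events with positive polarity; events are (x, y, t, p) tuples."""
    if not events:
        return 0.0
    return sum(1 for e in events if e[3] > 0) / len(events)


def event_rate_histogram(events, t_start, bin_width, n_bins):
    """Count events per time bin of width bin_width, starting at t_start."""
    counts = [0] * n_bins
    for _, _, t, _ in events:
        b = int((t - t_start) // bin_width)
        if 0 <= b < n_bins:
            counts[b] += 1
    return counts


def in_frame_fraction(events, width, height):
    """Fraction of events whose coordinates fall inside the sensor area."""
    if not events:
        return 0.0
    inside = sum(1 for x, y, _, _ in events if 0 <= x < width and 0 <= y < height)
    return inside / len(events)


def normalized_l1(hist_a, hist_b):
    """L1 distance between two histograms after normalising each to sum to 1."""
    sa, sb = sum(hist_a), sum(hist_b)
    if sa == 0 or sb == 0:
        raise ValueError("histograms must be non-empty")
    return sum(abs(a / sa - b / sb) for a, b in zip(hist_a, hist_b))
```

A warp that only moves pixel coordinates leaves timestamps and polarity unchanged, so the informative comparisons are likely the in-frame fraction and the histogram distance computed after out-of-frame events are discarded.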

minor comments (2)

[Title] The title is missing a space: 'Low-light andShaking'.
[HAR module] Clarify the exact four-stage architecture of the HAR module (e.g., which layers are convolutional vs. recurrent) and provide a diagram or pseudocode for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and indicate the revisions we will incorporate to strengthen the manuscript.

Referee: [EIS module] EIS module description: the superiority claim requires that the non-linear IMU warping produces compensated frames whose polarity, timing, and spatial density remain sufficiently close to clean low-light events for the downstream four-stage HAR network to extract reliable features. No quantitative check (e.g., event-rate histograms, polarity distribution statistics, or optical-flow consistency before/after warping) is reported to confirm this invariance; without it the performance gains could arise from network exploitation of warping-induced artifacts rather than true motion compensation.

Authors: We agree that explicit quantitative validation of the warping step would more directly support the claim that performance gains derive from motion compensation. In the revised manuscript we will add event-rate histograms, polarity distribution statistics, and optical-flow consistency metrics computed before and after the non-linear IMU warping on representative sequences from DarkShake-DVS. These analyses will be placed in the EIS-module subsection to demonstrate that the compensated frames retain the essential statistical properties of the original low-light events. revision: yes

Referee: [Experimental results] Experimental results (abstract and § on datasets): the abstract states consistent outperformance on three datasets but supplies no error bars, statistical significance tests, or ablation isolating the warping function from the network architecture. This omission leaves the central claim plausible yet incompletely supported, as the contribution of each module cannot be quantified.

Authors: We acknowledge that the current presentation would benefit from greater statistical rigor and component-wise quantification. In the revision we will (i) report mean accuracy together with standard deviation across multiple random seeds in all tables, (ii) add paired statistical significance tests (e.g., t-tests) against the strongest baselines, and (iii) include a dedicated ablation that compares the full EIS-HAR pipeline against an identical four-stage network trained on unwarped event frames. The abstract will be updated to reference these additional controls. revision: yes
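
The statistical controls promised in this response amount to standard bookkeeping, sketched below with the Python standard library. The helper names `summarize` and `paired_t` are hypothetical, the inputs are assumed to be per-seed accuracies for two systems run with matching seeds, and no numbers from the paper are used.

```python
import math
import statistics


def summarize(accuracies):
    """Return (mean, sample standard deviation) over per-seed accuracies."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)


def paired_t(full_pipeline, baseline):
    """Paired t statistic and degrees of freedom for matched per-seed runs.

    Both lists must be ordered by the same seeds, e.g. the full pipeline
    versus the same network trained on unwarped event frames.
    """
    if len(full_pipeline) != len(baseline) or len(baseline) < 2:
        raise ValueError("need two equally long lists with at least two runs")
    diffs = [a - b for a, b in zip(full_pipeline, baseline)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("differences are constant; t statistic is undefined")
    n = len(diffs)
    return statistics.mean(diffs) / (sd / math.sqrt(n)), n - 1
```

The returned statistic can be compared against a t table for the given degrees of freedom; where SciPy is available, `scipy.stats.ttest_rel` computes the same statistic together with a p-value.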

Circularity Check

0 steps flagged

Empirical pipeline validated on external benchmarks with no derivation reducing to self-inputs

The paper describes an empirical method: an EIS module applying non-linear IMU-based warping for motion compensation, followed by a four-stage hybrid HAR network for feature extraction. A new dataset DarkShake-DVS is introduced, and performance is measured via experiments on three datasets against prior SOTA baselines. No equations, fitted parameters, or self-citations are shown to reduce the reported accuracy gains to quantities defined inside the paper by construction. The central claims rest on external dataset results rather than internal redefinitions or self-referential premises.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The pipeline rests on standard assumptions from event-camera literature (brightness-change events remain informative after warping) and deep-learning practice (hybrid conv-recurrent stages can extract action features from stabilized frames). No new physical entities or ad-hoc constants are introduced beyond typical network hyperparameters.

axioms (1)

Domain assumption: IMU measurements provide sufficiently accurate 6-DoF pose to drive a non-linear warping that reduces motion blur in event data.
Invoked in the description of the EIS module as the basis for reconstructing motion-compensated input.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages

[1] Arnon Amir, Brian Taba, David Berg, Timothy Melano, Jeffrey McKinstry, Carmelo Di Nolfo, Tapan Nayak, Alexander Andreopoulos, Guillaume Garreau, Marcela Mendoza, et al. A low power, fully event-based gesture recognition system. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7243–7252, 2017.

[2] Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding? In ICML, page 4, 2021.

[3] Christian Brandli, Raphael Berner, Minhao Yang, Shih-Chii Liu, and Tobi Delbruck. A 240 × 180 130 db 3 µs latency global shutter spatiotemporal vision sensor. IEEE Journal of Solid-State Circuits, 49(10):2333–2341, 2014.

[4] Tong Bu, Wei Fang, Jianhao Ding, PengLin Dai, Zhaofei Yu, and Tiejun Huang. Optimal ann-snn conversion for high-accuracy and ultra-low-latency spiking neural networks. arXiv preprint arXiv:2303.04347, 2023.

[5] Yongqiang Cao, Yang Chen, and Deepak Khosla. Spiking deep convolutional neural networks for energy-efficient object recognition. International Journal of Computer Vision, 113:54–66, 2015.

[6] Jiaqi Chen, Yan Yang, Shizhuo Deng, Da Teng, and Liyuan Pan. Spikmamba: When snn meets mamba in event-based human action recognition. In Proceedings of the 6th ACM International Conference on Multimedia in Asia, pages 1–8,

[7] Tobi Delbruck, Vicente Villanueva, and Luca Longinotti. Integration of dynamic vision sensor with inertial measurement unit for electronically stabilized event-based vision. In 2014 IEEE International Symposium on Circuits and Systems (ISCAS), pages 2636–2639. IEEE, 2014.

[8] Davide Falanga, Kevin Kleber, and Davide Scaramuzza. Dynamic obstacle avoidance for quadrotors with event cameras. Science Robotics, 5(40):eaaz9712, 2020.
[9] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. Slowfast networks for video recognition. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6202–6211, 2019.

[10] Guillermo Gallego and Davide Scaramuzza. Accurate angular velocity estimation with an event camera. IEEE Robotics and Automation Letters, 2(2):632–639, 2017.

[11] Guillermo Gallego, Henri Rebecq, and Davide Scaramuzza. A unifying contrast maximization framework for event cameras, with applications to motion, depth, and optical flow estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3867–3876,

[12] Yue Gao, Jiaxuan Lu, Siqi Li, Nan Ma, Shaoyi Du, Yipeng Li, and Qionghai Dai. Action recognition and benchmark using event cameras. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.

[13] Daniel Gehrig, Antonio Loquercio, Konstantinos G Derpanis, and Davide Scaramuzza. End-to-end learning of representations for asynchronous event-based data. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5633–5643, 2019.

[14] Arun M George, Dighanchal Banerjee, Sounak Dey, Arijit Mukherjee, and P Balamurali. A reservoir-based convolutional spiking neural network for gesture recognition from dvs input. In 2020 International Joint Conference on Neural Networks (IJCNN), pages 1–9. IEEE, 2020.

[15] Yufei Guo, Yuanpei Chen, Xiaode Liu, Weihang Peng, Yuhan Zhang, Xuhui Huang, and Zhe Ma. Ternary spike: Learning ternary spikes for spiking neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 12244–12252, 2024.

[16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[17] Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 3d convolutional neural networks for human action recognition. IEEE transactions on pattern analysis and machine intelligence, 35(1):221–231, 2012.

[18] Chankyu Lee, Priyadarshini Panda, Gopalakrishnan Srinivasan, and Kaushik Roy. Training deep spiking convolutional neural networks with stdp-based unsupervised pre-training followed by supervised fine-tuning. Frontiers in neuroscience, 12:435, 2018.

[19] Kunchang Li, Xinhao Li, Yi Wang, Yinan He, Yali Wang, Limin Wang, and Yu Qiao. Videomamba: State space model for efficient video understanding. In European conference on computer vision, pages 237–255. Springer, 2024.

[20] Patrick Lichtsteiner, Christoph Posch, and Tobi Delbruck. A 128×128 120 db 15µs latency asynchronous temporal contrast vision sensor. IEEE Journal of Solid-State Circuits, 43(2):566–576, 2008.

[21] Ji Lin, Chuang Gan, and Song Han. Tsm: Temporal shift module for efficient video understanding. In Proceedings of the IEEE/CVF international conference on computer vision, pages 7083–7093, 2019.

[22] Enqi Liu, Liyuan Pan, Yan Yang, Yiran Zhong, Zhijing Wu, Xinxiao Wu, and Liu Liu. Storyboard-guided alignment for fine-grained video action recognition. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.

[23] Qianhui Liu, Dong Xing, Huajin Tang, De Ma, and Gang Pan. Event-based action recognition using motion information and spiking neural networks. In IJCAI, pages 1743–1749, 2021.

[24] Yue Liu, Yunjie Tian, Yuzhong Zhao, Hongtian Yu, Lingxi Xie, Yaowei Wang, Qixiang Ye, Jianbin Jiao, and Yunfan Liu. Vmamba: Visual state space model. Advances in neural information processing systems, 37:103031–103063, 2024.
[25] Zhaoyang Liu, Limin Wang, Wayne Wu, Chen Qian, and Tong Lu. Tam: Temporal adaptive module for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 13708–13718, 2021.

[26] Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. Video swin transformer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3202–3211, 2022.

[27] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations, 2017.

[28] Shaohe Lv, Yong Lu, Mianxiong Dong, Xiaodong Wang, Yong Dou, and Weihua Zhuang. Qualitative action recognition by wireless radio signals in human–machine systems. IEEE Transactions on Human-Machine Systems, 47(6):789–800, 2017.

[29] Anton Mitrokhin, Cornelia Fermüller, Chethan Parameshwara, and Yiannis Aloimonos. Event-based moving object detection and tracking. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1–9. IEEE, 2018.

[30] Garrick Orchard, Ajinkya Jayawant, Gregory K Cohen, and Nitish Thakor. Converting static image datasets to spiking neuromorphic datasets using saccades. Frontiers in neuroscience, 9:437, 2015.

[31] Liyuan Pan, Cedric Scheerlinck, Xin Yu, Richard Hartley, Miaomiao Liu, and Yuchao Dai. Bringing a blurry frame alive at high frame-rate with an event camera. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[32] Liyuan Pan, Richard Hartley, Cedric Scheerlinck, Miaomiao Liu, Xin Yu, and Yuchao Dai. High frame rate video reconstruction based on an event camera. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(5):2519–2533, 2020.
[33] Liyuan Pan, Miaomiao Liu, and Richard Hartley. Single image optical flow estimation with an event camera. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1669–1678. IEEE, 2020.

[34] Chethan M Parameshwara, Nitin J Sanket, Chahat Deep Singh, Cornelia Fermüller, and Yiannis Aloimonos. 0-mms: Zero-shot multi-motion segmentation with a monocular event camera. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 9594–9600. IEEE,

[35] Yansong Peng, Yueyi Zhang, Zhiwei Xiong, Xiaoyan Sun, and Feng Wu. Get: Group event transformer for event-based vision. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6038–6048, 2023.

[36] Dayong Ren, Zhe Ma, Yuanpei Chen, Weihang Peng, Xiaode Liu, Yuhan Zhang, and Yufei Guo. Spiking pointnet: Spiking neural networks for point clouds. Advances in Neural Information Processing Systems, 36, 2024.

[37] Alberto Sabater, Luis Montesano, and Ana C Murillo. Event transformer. a sparse-aware solution for efficient event data processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2677–2686, 2022.

[38] Nicholas Soures and Dhireesha Kudithipudi. Deep liquid state machines with neural plasticity for video activity recognition. Frontiers in neuroscience, 13:686, 2019.

[39] Timo Stoffregen, Guillermo Gallego, Tom Drummond, Lindsay Kleeman, and Davide Scaramuzza. Event-based motion segmentation by motion compensation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7244–7253, 2019.

[40] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 6450–6459, 2018.
[41] Qi Wang, Zhou Xu, Yuming Lin, Jingtao Ye, Hongsheng Li, Guangming Zhu, Syed Afaq Ali Shah, Mohammed Bennamoun, and Liang Zhang. Dailydvs-200: A comprehensive benchmark dataset for event-based action recognition. In European Conference on Computer Vision, pages 55–72. Springer, 2024.

[42] Xiao Wang, Shiao Wang, Pengpeng Shao, Bo Jiang, Lin Zhu, and Yonghong Tian. Event stream based human action recognition: a high-definition benchmark dataset and algorithms. arXiv preprint arXiv:2408.09764, 2024.

[43] Xiao Wang, Zongzhen Wu, Bo Jiang, Zhimin Bao, Lin Zhu, Guoqi Li, Yaowei Wang, and Yonghong Tian. Hardvs: Revisiting human activity recognition with dynamic vision sensors. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5615–5623, 2024.

[44] Zhengwei Wang, Qi She, and Aljosa Smolic. Action-net: Multipath excitation for action recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13214–13223, 2021.

[45] Bochen Xie, Yongjian Deng, Zhanpeng Shao, Hai Liu, Qingsong Xu, and Youfu Li. Event voxel set transformer for spatiotemporal representation learning on event streams. arXiv preprint arXiv:2303.03856, 2023.

[46] Jianyang Xie, Yitian Zhao, Yanda Meng, He Zhao, Anh Nguyen, and Yalin Zheng. Are spatial-temporal graph convolution networks for human action recognition over-parameterized? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 24309–24319, 2025.

[47] Long-Hao Yang, Fei-Fei Ye, Chris Nugent, Jun Liu, and Ying-Ming Wang. Belief-rule-based system with self-organizing and multi-temporal modeling for sensor-based human activity recognition. IEEE Journal of Biomedical and Health Informatics, 29(2):1062–1073, 2025.

[48] Yan Yang, Liyuan Pan, and Liu Liu. Event camera data pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 10699–10709, 2023.
[49]

Ezsr: Event-based zero-shot recognition

Yan Yang, Liyuan Pan, Dongxu Li, and Liu Liu. Ezsr: Event-based zero-shot recognition. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4628–4638, 2025. 2

[50] Yan Yang, Liyuan Pan, and Liu Liu. Event camera data dense pre-training. In Computer Vision – ECCV 2024, pages 292–310, Cham, 2025. Springer Nature Switzerland.

[51] Zonglin Yang, Yan Yang, Yuheng Shi, Hao Yang, Ruikun Zhang, Liu Liu, Xinxiao Wu, and Liyuan Pan. Event-based few-shot fine-grained human action recognition. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 519–526. IEEE, 2024.

[52] Man Yao, Jiakui Hu, Zhaokun Zhou, Li Yuan, Yonghong Tian, Bo Xu, and Guoqi Li. Spike-driven transformer. Advances in neural information processing systems, 36:64043–64058, 2023.

[53] Xugao Yu and Mohammed A. A. Al-qaness. Human activity recognition using deep residual convolutional network based on wearable sensors. IEEE Journal of Biomedical and Health Informatics, 29(3):1950–1958, 2025.

[54] Renjie Zhang, Di Lin, Xin Wang, George Baciu, C. L. Philip Chen, and Ping Li. Accurate-pgnet: Learning to assemble perceptual body parts for accurate human skeleton establishment. IEEE Transactions on Multimedia, 27:1706–1721,

[55] Chunhui Zhao, Yakun Li, and Yang Lyu. Event-based real-time moving object detection based on imu ego-motion compensation. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 690–696. IEEE,

[56] Hanyu Zhou, Zhiwei Shi, Hao Dong, Shihan Peng, Yi Chang, and Luxin Yan. Jstr: Joint spatio-temporal reasoning for event-based moving object detection. arXiv preprint arXiv:2403.07436, 2024.

[57] Jiazhou Zhou, Xu Zheng, Yuanhuiyi Lyu, and Lin Wang. Exact: Language-guided conceptual reasoning and uncertainty estimation for event-based action recognition and more. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18633–18643,

[58] Yi Zhou, Guillermo Gallego, Xiuyuan Lu, Siqi Liu, and Shaojie Shen. Event-based motion segmentation with spatio-temporal graph cuts. IEEE transactions on neural networks and learning systems, 34(8):4868–4880, 2021.

[59] Zhaokun Zhou, Yuesheng Zhu, Chao He, Yaowei Wang, Shuicheng Yan, Yonghong Tian, and Li Yuan. Spikformer: When spiking neural network meets transformer. In The Eleventh International Conference on Learning Representations.

[60] Anqi Zhu, Jingmin Zhu, James Bailey, Mingming Gong, and Qiuhong Ke. Semantic-guided cross-modal prompt learning for skeleton-based zero-shot action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13876–13885, 2025.

[61] Lianghui Zhu, Bencheng Liao, Qian Zhang, Xinlong Wang, Wenyu Liu, and Xinggang Wang. Vision mamba: efficient visual representation learning with bidirectional state space model. In Proceedings of the 41st International Conference on Machine Learning, pages 62429–62442, 2024.

