
arxiv: 2501.06786 · v2 · submitted 2025-01-12 · 💻 cs.CV

Temporal-Aware Spiking Transformer Hashing Based on 3D-DWT

Pith reviewed 2026-05-23 05:15 UTC · model grok-4.3

classification 💻 cs.CV
keywords spiking neural networks · hashing · dynamic vision sensor · 3D discrete wavelet transform · spiking transformer · energy efficiency · supervised hashing · membrane potential loss

The pith

Spikinghash uses a 3D wavelet mixer and membrane-potential loss in spiking transformers to produce efficient hash codes for dynamic vision sensor data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Spikinghash as a supervised hashing method built on spiking neural networks to handle growing volumes of dynamic vision sensor data with lower energy use. It places a Spiking WaveMixer that applies multilevel 3D discrete wavelet transform for decoupling and fusing spatiotemporal features in early layers, then uses Spiking Self-Attention for global context in deeper layers. A hash layer accumulates spikes over time steps to output binary codes, while a dynamic soft similarity loss derived from membrane potentials supplies learnable soft labels that capture class similarities and offset information loss typical in spiking networks. Experiments across datasets show the resulting codes match or exceed prior retrieval accuracy while requiring fewer parameters and less energy. A reader would care because the work targets practical, low-power retrieval systems for event-based sensors rather than conventional frame-based video.

Core claim

Spikinghash is a hierarchical supervised hashing architecture for spiking neural networks. In shallow layers the Spiking WaveMixer applies multilevel 3D-DWT to separate spatiotemporal features into low- and high-frequency components and performs spectral fusion to capture temporal dependencies and local spatial structure. Deeper layers employ Spiking Self-Attention to extract global spatiotemporal information. The final hash layer integrates membrane activity across multiple time steps to produce binary hash codes. A dynamic soft similarity loss constructs a learnable similarity matrix from membrane potentials to serve as soft labels, thereby compensating for information loss in SNNs and improving retrieval.
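As a rough illustration of the decoupling step described above, the sketch below applies a single-level and then a multilevel 3D Haar transform to a (T, H, W) array with NumPy. The Haar filter, the function names, and the choice to recurse only on the low-frequency (LLL) band are assumptions made for illustration; this page does not state which wavelet or recursion scheme the paper uses, and the spiking neurons and spectral fusion of the Spiking WaveMixer are not modeled.

```python
import numpy as np


def haar_step(x, axis):
    """One orthonormal Haar analysis step along `axis`.

    Returns the (low, high) frequency halves; the axis length must be even.
    """
    x = np.moveaxis(x, axis, 0)
    even, odd = x[0::2], x[1::2]
    low = (even + odd) / np.sqrt(2.0)
    high = (even - odd) / np.sqrt(2.0)
    return np.moveaxis(low, 0, axis), np.moveaxis(high, 0, axis)


def haar_dwt3d(x):
    """Single-level 3D Haar DWT of a (T, H, W) array.

    Returns a dict with the 8 subbands 'LLL' ... 'HHH'; the letters give the
    band (Low/High) along T, H, W in that order.
    """
    bands = {"": x}
    for axis in range(3):
        split = {}
        for key, arr in bands.items():
            low, high = haar_step(arr, axis)
            split[key + "L"] = low
            split[key + "H"] = high
        bands = split
    return bands


def multilevel_dwt3d(x, levels):
    """Hypothetical multilevel scheme: re-decompose only the LLL band.

    Returns the coarsest approximation and a list with the 7 detail
    subbands of each level (finest level first).
    """
    approx, details = x, []
    for _ in range(levels):
        bands = haar_dwt3d(approx)
        approx = bands.pop("LLL")
        details.append(bands)
    return approx, details


if __name__ == "__main__":
    clip = np.random.default_rng(0).standard_normal((4, 8, 8))  # toy (T, H, W) input
    approx, details = multilevel_dwt3d(clip, levels=2)
    print(approx.shape, [sorted(d) for d in details])
```

With the orthonormal Haar filter, each level yields eight subbands at half resolution along every axis and preserves total signal energy, so a constant input lands entirely in the LLL band.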

What carries the argument

Spiking WaveMixer (SWM) that performs multilevel 3D-DWT feature decoupling and spectral fusion, placed hierarchically with Spiking Self-Attention and a membrane-potential-based dynamic soft similarity loss.

If this is right

  • Hash codes preserve the distance relationships present in the original DVS data.
  • Energy consumption and parameter count remain lower than conventional deep-learning hashing methods.
  • State-of-the-art retrieval accuracy is reached on multiple DVS datasets.
  • The binary nature of spikes directly supplies the final hash codes without additional binarization steps.
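The last point can be made concrete with a toy hash layer. The sketch below accumulates per-bit input over T time steps into a membrane potential and emits one spike per bit when the potential reaches a firing threshold, so the spike pattern itself is the hash code. The function names, the plain summation, and the threshold value are illustrative assumptions, not the paper's hash layer.

```python
def hash_layer(inputs, v_th=1.0):
    """Toy hash layer (assumed form, not the paper's definition).

    inputs: list of T time steps, each a list of K input currents.
    Each bit's membrane potential is the sum of its inputs over time;
    the bit fires (1) if the potential reaches v_th, else stays silent (0).
    """
    n_bits = len(inputs[0])
    potentials = [sum(step[k] for step in inputs) for k in range(n_bits)]
    return [1 if v >= v_th else 0 for v in potentials]


def hamming(a, b):
    """Number of positions at which two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    sample_a = [[0.4, 0.1, 0.6], [0.5, 0.2, 0.5], [0.3, 0.1, 0.2]]
    sample_b = [[0.4, 0.5, 0.6], [0.5, 0.4, 0.5], [0.3, 0.3, 0.2]]
    code_a, code_b = hash_layer(sample_a), hash_layer(sample_b)
    print(code_a, code_b, hamming(code_a, code_b))
```

Retrieval then ranks database items by Hamming distance to the query code, which is why preserving distance relationships in the codes matters.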

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The 3D-DWT decoupling step could be tested in other spiking architectures that process event streams.
  • The membrane-potential loss formulation might transfer to spiking models for tasks beyond hashing, such as classification or clustering.
  • Lower energy and parameter counts suggest possible deployment on resource-constrained edge hardware that receives DVS input directly.

Load-bearing premise

The specific combination of 3D-DWT decoupling, spectral fusion, and membrane-potential soft similarity loss will reliably offset information loss inside spiking networks and produce higher-quality hash codes.

What would settle it

An ablation that disables the 3D-DWT component or the membrane-potential loss and measures whether mean average precision on the evaluated DVS datasets falls below the best non-spiking hashing baseline.
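For reference, the metric such an ablation would compare can be computed as in the sketch below: rank the database by Hamming distance to each query code, then average precision over the top k results. Normalizing by the number of relevant items found in the top k is one common convention and is an assumption here; the paper's exact mAP@k definition is not given on this page, and the function names are illustrative.

```python
def hamming(a, b):
    """Number of positions at which two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))


def average_precision_at_k(query_code, query_label, db_codes, db_labels, k):
    """AP@k for one query under Hamming ranking (ties keep database order)."""
    order = sorted(range(len(db_codes)),
                   key=lambda i: hamming(query_code, db_codes[i]))
    hits, precision_sum = 0, 0.0
    for rank, idx in enumerate(order[:k], start=1):
        if db_labels[idx] == query_label:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(query_codes, query_labels, db_codes, db_labels, k):
    """mAP@k: mean of AP@k over all queries."""
    scores = [average_precision_at_k(q, y, db_codes, db_labels, k)
              for q, y in zip(query_codes, query_labels)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    db_codes = [[0, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 1], [1, 1, 1, 1]]
    db_labels = ["a", "b", "a", "b"]
    print(mean_average_precision([[0, 0, 0, 0]], ["a"], db_codes, db_labels, k=3))
```

Running the same function on codes from the full model and from each ablated variant, with identical database and query splits, gives the comparison described above.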

Figures

Figures reproduced from arXiv: 2501.06786 by Bolin Zhang, Chong Wang, Guoqi Li, Jiangbo Qian, Jianhao Li, Lijun Guo, Zihao Mei.

Figure 1: In HMDB51-DVS data, the actions “sit” and “stand” contain …
Figure 2: The overall architecture of Spikinghash. This hierarchical SNN-Transformer architecture includes a downsample layer before each stage. The first …
Figure 3: The iterative multilevel decomposition process of 3D-DWT.
Figure 4: Comparison of the top-1 accuracies on CIFAR100 between Spik…
Figure 5: Visualization of similarity matrices S_soft based on membrane potentials and S_hash based on Hamming distances.

Table XIII: Ablation study results of dynamic soft similarity loss on UCF101-DVS.

Loss           mAP@100                      ACG@100                      NDCG@100
               64-bits  128-bits  256-bits  64-bits  128-bits  256-bits  64-bits  128-bits  256-bits
L_h & L_cls    0.672    0.679     0.719     0.597    0.610     0.646     0.653    0.666     0.703
L_s & L_cls    0.682    0.714     0.734     0.615    0.650     0.669     0…
Original abstract

With the rapid growth of dynamic vision sensor (DVS) data, constructing a low-energy, efficient data retrieval system has become an urgent task. Hash learning is one of the most important retrieval technologies, which can keep the distance between hash codes consistent with the distance between DVS data. As spiking neural networks (SNNs) can encode information through spikes, they demonstrate great potential in promoting energy efficiency. Based on the binary characteristics of SNNs, we first propose a novel supervised hashing method named Spikinghash with a hierarchical lightweight structure. Spiking WaveMixer (SWM) is deployed in shallow layers, utilizing a multilevel 3D discrete wavelet transform (3D-DWT) to decouple spatiotemporal features into various low-frequency and high-frequency components, and then employing efficient spectral feature fusion. SWM can effectively capture the temporal dependencies and local spatial features. Spiking Self-Attention (SSA) is deployed in deeper layers to further extract global spatiotemporal information. We also design a hash layer utilizing the binary characteristics of SNNs, which integrates information over multiple time steps to generate final hash codes. Furthermore, we propose a new dynamic soft similarity loss for SNNs, which utilizes membrane potentials to construct a learnable similarity matrix as soft labels to fully capture the similarity differences between classes and compensate for information loss in SNNs, thereby improving retrieval performance. Experiments on multiple datasets demonstrate that Spikinghash can achieve state-of-the-art results with low energy consumption and fewer parameters.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript proposes Spikinghash, a supervised hashing method for dynamic vision sensor (DVS) data based on spiking neural networks. It employs a hierarchical architecture with Spiking WaveMixer (SWM) modules in shallow layers that apply multilevel 3D discrete wavelet transform (3D-DWT) for spatiotemporal feature decoupling into low- and high-frequency components followed by spectral fusion, Spiking Self-Attention (SSA) in deeper layers for global information, a membrane-potential hash layer that integrates spikes over time steps to produce binary codes, and a dynamic soft similarity loss that constructs learnable similarity matrices from membrane potentials to serve as soft labels. The central claim is that this combination achieves state-of-the-art retrieval performance on multiple datasets while maintaining low energy consumption and fewer parameters compared to existing methods.

Significance. If the experimental results hold, the work offers a meaningful contribution to energy-efficient content-based retrieval for event-based vision by integrating wavelet-based feature processing, spiking attention, and a membrane-potential loss within an SNN framework. The introduction of SWM and the dynamic soft similarity loss represent concrete attempts to address information loss typical in spiking networks for hashing tasks. The focus on practical metrics such as energy and parameter count aligns with deployment needs for DVS data. The manuscript supplies sufficient architectural and implementation detail to support reproducibility in principle.

minor comments (3)
  1. [Abstract] The claim of state-of-the-art results would be strengthened by briefly naming the datasets and reporting the magnitude of improvements (e.g., mAP gains) rather than leaving the assertion unquantified in the abstract.
  2. [§4, Experiments] Confirm that all reported results include standard deviations across multiple runs and explicit baseline implementations with matching training protocols to ensure fair comparison.
  3. [§3.4, Notation] Define the precise mathematical form of the dynamic soft similarity loss (including how membrane potentials are mapped to the similarity matrix) in a dedicated equation to aid reproducibility.
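To show the kind of definition the third comment asks for, the sketch below builds two similarity matrices in the spirit of the S_soft and S_hash named in the Figure 5 caption and measures the gap between them. The cosine-similarity mapping from membrane potentials, the rescaled Hamming similarity, and the mean-squared gap are all assumptions chosen for illustration; the paper's actual loss is not reproduced on this page and may differ.

```python
import numpy as np


def soft_similarity(potentials):
    """Assumed S_soft: cosine similarity between membrane-potential vectors.

    potentials: (N, K) float array, one row per sample. Returns (N, N).
    """
    norms = np.linalg.norm(potentials, axis=1, keepdims=True)
    unit = potentials / np.clip(norms, 1e-12, None)  # avoid division by zero
    return unit @ unit.T


def hash_similarity(codes):
    """Assumed S_hash: 1 - 2 * hamming / K, computed from {0, 1} codes.

    Mapping bits to {-1, +1} makes the dot product equal K - 2 * hamming.
    """
    signed = 2.0 * codes - 1.0
    return signed @ signed.T / codes.shape[1]


def similarity_gap(potentials, codes):
    """Mean squared difference between the two matrices (illustrative only)."""
    diff = soft_similarity(potentials) - hash_similarity(codes)
    return float(np.mean(diff ** 2))


if __name__ == "__main__":
    codes = np.array([[1, 0, 1, 1], [1, 0, 0, 1], [0, 1, 0, 0]])
    potentials = np.random.default_rng(0).standard_normal(codes.shape)
    print(hash_similarity(codes))
    print(similarity_gap(potentials, codes))
```

A dedicated equation in the manuscript would replace each assumed mapping here with the authors' own definition, which is what a reader needs in order to reimplement the loss.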

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive evaluation of our manuscript and the recommendation for minor revision. The recognition of the contributions of Spikinghash, including the Spiking WaveMixer, Spiking Self-Attention, and dynamic soft similarity loss for energy-efficient DVS retrieval, is appreciated. No specific major comments were listed in the report.

Circularity Check

0 steps flagged

No significant circularity; the method is an empirical construction without self-referential derivations.

full rationale

The paper presents an empirical architecture (SWM with 3D-DWT, SSA, membrane-potential hash layer, dynamic soft similarity loss) and reports experimental SOTA results on retrieval metrics, energy, and parameters. No equations, derivations, or first-principles claims appear in the provided text that reduce performance to quantities defined by the method's own fitted parameters or self-citations. The central claim rests on reproducible implementation details and external dataset benchmarks rather than any internal reduction by construction. No self-definitional, fitted-input, or uniqueness-imported steps are identifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claim rests on domain assumptions about SNN binary behavior and the effectiveness of the newly introduced modules; no explicit free parameters or invented physical entities are stated in the abstract.

axioms (1)
  • domain assumption: Spiking neural networks encode information through spikes and possess binary characteristics suitable for generating hash codes.
    Invoked as the foundation for the hash layer and overall approach.
invented entities (2)
  • Spiking WaveMixer (SWM) (no independent evidence)
    purpose: Decouple spatiotemporal features via multilevel 3D-DWT and perform spectral feature fusion.
    New module introduced in shallow layers.
  • dynamic soft similarity loss (no independent evidence)
    purpose: Construct a learnable similarity matrix from membrane potentials to serve as soft labels.
    New loss function proposed to improve retrieval.
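The domain assumption in this ledger, that spiking neurons emit binary outputs, can be illustrated with a minimal leaky integrate-and-fire (LIF) neuron. The update rule, the time constant, the threshold, and the hard reset below are common textbook choices used here as assumptions; this page does not state which neuron model or constants the paper uses.

```python
def lif_neuron(inputs, tau=2.0, v_th=1.0, v_reset=0.0):
    """Simulate one LIF neuron over a sequence of input currents.

    At each step the membrane potential moves toward the input with time
    constant tau; when it reaches v_th the neuron emits a spike (1) and
    the potential is hard-reset to v_reset. Otherwise the output is 0.
    """
    v = v_reset
    spikes = []
    for x in inputs:
        v = v + (x - (v - v_reset)) / tau  # leaky integration
        if v >= v_th:
            spikes.append(1)
            v = v_reset  # hard reset after firing
        else:
            spikes.append(0)
    return spikes


if __name__ == "__main__":
    print(lif_neuron([1.5, 1.5, 1.5, 1.5]))  # fires on every second step
```

Whatever the real-valued input, the output at each step is 0 or 1, and the information lost in that thresholding is what the membrane-potential-based loss is meant to recover.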



