Temporal-Aware Spiking Transformer Hashing Based on 3D-DWT

Bolin Zhang; Chong Wang; Guoqi Li; Jiangbo Qian; Jianhao Li; Lijun Guo; Zihao Mei

arxiv: 2501.06786 · v2 · submitted 2025-01-12 · 💻 cs.CV

Zihao Mei, Jianhao Li, Bolin Zhang, Chong Wang, Lijun Guo, Guoqi Li, Jiangbo Qian

Pith reviewed 2026-05-23 05:15 UTC · model grok-4.3

classification 💻 cs.CV

keywords: spiking neural networks · hashing · dynamic vision sensor · 3D discrete wavelet transform · spiking transformer · energy efficiency · supervised hashing · membrane potential loss

The pith

Spikinghash uses a 3D wavelet mixer and membrane-potential loss in spiking transformers to produce efficient hash codes for dynamic vision sensor data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Spikinghash as a supervised hashing method built on spiking neural networks to handle growing volumes of dynamic vision sensor data with lower energy use. It places a Spiking WaveMixer that applies multilevel 3D discrete wavelet transform for decoupling and fusing spatiotemporal features in early layers, then uses Spiking Self-Attention for global context in deeper layers. A hash layer accumulates spikes over time steps to output binary codes, while a dynamic soft similarity loss derived from membrane potentials supplies learnable soft labels that capture class similarities and offset information loss typical in spiking networks. Experiments across datasets show the resulting codes match or exceed prior retrieval accuracy while requiring fewer parameters and less energy. A reader would care because the work targets practical, low-power retrieval systems for event-based sensors rather than conventional frame-based video.

Core claim

Spikinghash is a hierarchical supervised hashing architecture for spiking neural networks. In shallow layers the Spiking WaveMixer applies multilevel 3D-DWT to separate spatiotemporal features into low- and high-frequency components and performs spectral fusion to capture temporal dependencies and local spatial structure. Deeper layers employ Spiking Self-Attention to extract global spatiotemporal information. The final hash layer integrates membrane activity across multiple time steps to produce binary hash codes. A dynamic soft similarity loss constructs a learnable similarity matrix from membrane potentials to serve as soft labels, thereby compensating for information loss in SNNs and improving retrieval.
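
The hash-layer step can be illustrated with a minimal sketch. It assumes the layer averages each output neuron's spikes over the T time steps and thresholds the resulting firing rate; the function name `hash_from_spikes` and the 0.5 threshold are illustrative assumptions, and the paper's own layer may integrate membrane activity differently.

```python
def hash_from_spikes(spike_train, threshold=0.5):
    """Turn a spike train into a binary hash code.

    spike_train[t][k] is 0 or 1: whether output neuron k fired at time step t.
    Assumption (not from the paper): a bit is 1 when the neuron's firing rate
    over all time steps reaches `threshold`.
    """
    time_steps = len(spike_train)
    code_length = len(spike_train[0])
    # Firing rate of each output neuron, accumulated over time.
    rates = [
        sum(step[k] for step in spike_train) / time_steps
        for k in range(code_length)
    ]
    return [1 if rate >= threshold else 0 for rate in rates]


# Four time steps, four output neurons.
spikes = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
]
code = hash_from_spikes(spikes)
```

Because the spikes are already binary, the only continuous quantity in this sketch is the firing rate, which is what the thresholding step discretizes.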

What carries the argument

Spiking WaveMixer (SWM) that performs multilevel 3D-DWT feature decoupling and spectral fusion, placed hierarchically with Spiking Self-Attention and a membrane-potential-based dynamic soft similarity loss.
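
As a rough illustration of the decoupling step, the sketch below runs a 3D discrete wavelet transform on a small volume indexed as `vol[t][h][w]`. It assumes the Haar wavelet with orthonormal scaling and even dimensions; the wavelet family, the helper names, and the nested-list layout are assumptions made for the example, not details taken from the paper.

```python
import math


def haar_1d(seq):
    """One Haar level on a 1D sequence of even length: (low, high) halves."""
    s = 1.0 / math.sqrt(2.0)
    pairs = list(zip(seq[0::2], seq[1::2]))
    low = [(a + b) * s for a, b in pairs]
    high = [(a - b) * s for a, b in pairs]
    return low, high


def split_last_axis(vol):
    """Apply haar_1d along the last axis of a 3D nested list."""
    low, high = [], []
    for plane in vol:
        lo_plane, hi_plane = [], []
        for row in plane:
            lo, hi = haar_1d(row)
            lo_plane.append(lo)
            hi_plane.append(hi)
        low.append(lo_plane)
        high.append(hi_plane)
    return low, high


def rotate_axes(vol):
    """Cyclically permute axes: layout (a, b, c) becomes (c, a, b)."""
    A, B, C = len(vol), len(vol[0]), len(vol[0][0])
    return [[[vol[a][b][c] for b in range(B)] for a in range(A)]
            for c in range(C)]


def dwt3d_haar(vol):
    """Single-level 3D Haar DWT. Returns 8 subbands keyed by three letters
    (L = low-pass, H = high-pass) in (T, H, W) axis order."""
    bands = {"": vol}
    for _ in range(3):  # filters W first, then H, then T
        nxt = {}
        for name, v in bands.items():
            lo, hi = split_last_axis(v)
            nxt["L" + name] = rotate_axes(lo)
            nxt["H" + name] = rotate_axes(hi)
        bands = nxt
    return bands  # three rotations restore the (t, h, w) layout


def dwt3d_multilevel(vol, levels):
    """Multilevel transform: re-decompose the LLL band at each level."""
    details, approx = [], vol
    for _ in range(levels):
        bands = dwt3d_haar(approx)
        approx = bands.pop("LLL")
        details.append(bands)
    return approx, details
```

Here `LLL` is the low-frequency approximation and the other seven subbands hold high-frequency detail along at least one of the temporal and spatial axes. How the paper fuses these subbands afterwards is not reproduced in this sketch.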

If this is right

Hash codes preserve the distance relationships present in the original DVS data.
Energy consumption and parameter count remain lower than conventional deep-learning hashing methods.
State-of-the-art retrieval accuracy is reached on multiple DVS datasets.
The binary nature of spikes directly supplies the final hash codes without additional binarization steps.
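
The retrieval step implied by these points can be sketched as follows: binary codes are compared by Hamming distance and database items are ranked by that distance. The function names are illustrative and the codes are plain lists of bits; this is a generic sketch of hash-based retrieval, not the paper's evaluation code.

```python
def hamming_distance(a, b):
    """Number of bit positions where two equal-length codes differ."""
    if len(a) != len(b):
        raise ValueError("hash codes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def rank_by_hamming(query, database):
    """Database indices sorted from nearest to farthest from the query.

    sorted() is stable, so ties keep their original database order.
    """
    return sorted(
        range(len(database)),
        key=lambda i: hamming_distance(query, database[i]),
    )


query = [1, 0, 1, 0]
database = [
    [0, 1, 0, 1],  # differs in all four bits
    [1, 0, 1, 1],  # differs in one bit
    [1, 0, 1, 0],  # identical to the query
]
ranking = rank_by_hamming(query, database)
```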

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The 3D-DWT decoupling step could be tested in other spiking architectures that process event streams.
The membrane-potential loss formulation might transfer to spiking models for tasks beyond hashing, such as classification or clustering.
Lower energy and parameter counts suggest possible deployment on resource-constrained edge hardware that receives DVS input directly.
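
For context on how energy figures of this kind are often estimated in the SNN literature, here is a back-of-the-envelope sketch. It assumes the commonly quoted 45 nm costs of about 4.6 pJ per multiply-accumulate (MAC) and 0.9 pJ per accumulate (AC), and it models SNN energy as operations times firing rate times time steps. Whether the paper uses exactly this accounting is not stated in the text above, so treat every constant and function name here as an assumption.

```python
E_MAC_PJ = 4.6  # assumed energy per multiply-accumulate, picojoules
E_AC_PJ = 0.9   # assumed energy per accumulate, picojoules
PJ_TO_MJ = 1e-9  # 1 pJ = 1e-12 J = 1e-9 mJ


def ann_energy_mj(flops):
    """Energy of a dense (non-spiking) layer: every operation is a MAC."""
    return flops * E_MAC_PJ * PJ_TO_MJ


def snn_energy_mj(flops, firing_rate, time_steps):
    """Energy of a spiking layer: only spikes trigger accumulates, and the
    layer runs once per time step."""
    return flops * firing_rate * time_steps * E_AC_PJ * PJ_TO_MJ


dense = ann_energy_mj(1e9)
spiking = snn_energy_mj(1e9, firing_rate=0.2, time_steps=4)
```

Under these assumptions the spiking estimate falls below the dense one whenever `firing_rate * time_steps` is smaller than the MAC-to-AC cost ratio, which is why sparse firing matters for the edge-deployment argument.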

Load-bearing premise

The specific combination of 3D-DWT decoupling, spectral fusion, and membrane-potential soft similarity loss will reliably offset information loss inside spiking networks and produce higher-quality hash codes.

What would settle it

An ablation that disables the 3D-DWT component or the membrane-potential loss and measures whether mean average precision on the evaluated DVS datasets falls below the best non-spiking hashing baseline.
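
The metric in that test can be computed as sketched below. Definitions of average precision at a cutoff vary between papers; this version normalizes by the number of relevant items found in the top k, which is an assumption rather than the paper's stated formula.

```python
def average_precision_at_k(relevance, k):
    """Average precision over the top-k ranked results.

    relevance[i] is 1 if the i-th retrieved item shares the query's label,
    else 0. Returns 0.0 when no relevant item appears in the top k.
    """
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(relevance[:k], start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision_at_k(relevance_lists, k):
    """Mean of average_precision_at_k over all queries."""
    scores = [average_precision_at_k(r, k) for r in relevance_lists]
    return sum(scores) / len(scores)


ap = average_precision_at_k([1, 0, 1, 0], k=4)
m_ap = mean_average_precision_at_k([[1, 0, 1, 0], [0, 0, 0, 0]], k=4)
```

An ablation would compare this number with and without the component in question, holding the code length and the cutoff fixed.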

Figures

Figures reproduced from arXiv: 2501.06786 by Bolin Zhang, Chong Wang, Guoqi Li, Jiangbo Qian, Jianhao Li, Lijun Guo, Zihao Mei.

**Figure 2.** The overall architecture of Spikinghash. This hierarchical SNN-Transformer architecture includes a downsample layer before each stage. The first…

**Figure 3.** The iterative multilevel decomposition process of 3D-DWT.

**Figure 4.** Comparison of the top-1 accuracies on CIFAR100 between Spik…

**Figure 5.** Visualization of similarity matrices S_soft (based on membrane potentials) and S_hash (based on Hamming distances).

**Table XIII.** Ablation study results of dynamic soft similarity loss on UCF101-DVS.

| Loss | mAP@100, 64 bits | mAP@100, 128 bits | mAP@100, 256 bits | ACG@100, 64 bits | ACG@100, 128 bits | ACG@100, 256 bits | NDCG@100, 64 bits | NDCG@100, 128 bits | NDCG@100, 256 bits |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| L_h & L_cls | 0.672 | 0.679 | 0.719 | 0.597 | 0.610 | 0.646 | 0.653 | 0.666 | 0.703 |
| L_s & L_cls | 0.682 | 0.714 | 0.734 | 0.615 | 0.650 | 0.669 | 0… | | |

Original abstract

With the rapid growth of dynamic vision sensor (DVS) data, constructing a low-energy, efficient data retrieval system has become an urgent task. Hash learning is one of the most important retrieval technologies, which can keep the distance between hash codes consistent with the distance between DVS data. As spiking neural networks (SNNs) can encode information through spikes, they demonstrate great potential in promoting energy efficiency. Based on the binary characteristics of SNNs, we first propose a novel supervised hashing method named Spikinghash with a hierarchical lightweight structure. Spiking WaveMixer (SWM) is deployed in shallow layers, utilizing a multilevel 3D discrete wavelet transform (3D-DWT) to decouple spatiotemporal features into various low-frequency and high-frequency components, and then employing efficient spectral feature fusion. SWM can effectively capture the temporal dependencies and local spatial features. Spiking Self-Attention (SSA) is deployed in deeper layers to further extract global spatiotemporal information. We also design a hash layer utilizing the binary characteristic of SNNs, which integrates information over multiple time steps to generate final hash codes. Furthermore, we propose a new dynamic soft similarity loss for SNNs, which utilizes membrane potentials to construct a learnable similarity matrix as soft labels to fully capture the similarity differences between classes and compensate for information loss in SNNs, thereby improving retrieval performance. Experiments on multiple datasets demonstrate that Spikinghash can achieve state-of-the-art results with low energy consumption and fewer parameters.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

Spikinghash adds a 3D-DWT WaveMixer and membrane-potential soft similarity loss to SNN hashing for DVS streams, with enough architectural detail to reproduce but performance claims that still need the tables checked.

The paper introduces Spikinghash, a spiking transformer for hashing dynamic vision sensor data. The core new elements are the Spiking WaveMixer that applies multilevel 3D discrete wavelet transform to split and fuse spatiotemporal features, plus a dynamic soft similarity loss that builds learnable soft labels directly from membrane potentials. These sit alongside spiking self-attention in deeper layers and a time-integrated hash layer that exploits the binary nature of spikes. The descriptions are concrete enough that the modules could be reimplemented without major guesswork, which is useful in this subfield.

The work targets a practical constraint (energy use and parameter count on event-based streams) and tries to address information loss in SNNs through the loss design rather than just scaling the network. The stress-test note confirms no internal contradictions in the construction, so the claimed compensation mechanism is at least plausible under the stated training setup.

The main soft spot remains the experimental side. The abstract alone gave no numbers, baselines, or ablations, so the state-of-the-art claim rests on whatever tables and controls appear in the full manuscript. If those show consistent gains attributable to the new modules rather than the overall setup, the contribution is a usable method for neuromorphic retrieval. If the margins are small or the controls loose, the novelty of the components does not automatically deliver a clear advance.

This is aimed at researchers working on spiking networks for vision and retrieval tasks. It is the kind of applied method paper that deserves a serious referee rather than a desk reject, because it ships a full pipeline with empirical claims that can be directly tested. I would bring it to a reading group focused on neuromorphic or hashing work. I would not cite it in my own papers unless I needed exactly this combination for DVS data. Send it to peer review so the experimental details can be examined.

Referee Report

0 major / 3 minor

Summary. The manuscript proposes Spikinghash, a supervised hashing method for dynamic vision sensor (DVS) data based on spiking neural networks. It employs a hierarchical architecture with Spiking WaveMixer (SWM) modules in shallow layers that apply multilevel 3D discrete wavelet transform (3D-DWT) for spatiotemporal feature decoupling into low- and high-frequency components followed by spectral fusion, Spiking Self-Attention (SSA) in deeper layers for global information, a membrane-potential hash layer that integrates spikes over time steps to produce binary codes, and a dynamic soft similarity loss that constructs learnable similarity matrices from membrane potentials to serve as soft labels. The central claim is that this combination achieves state-of-the-art retrieval performance on multiple datasets while maintaining low energy consumption and fewer parameters compared to existing methods.

Significance. If the experimental results hold, the work offers a meaningful contribution to energy-efficient content-based retrieval for event-based vision by integrating wavelet-based feature processing, spiking attention, and a membrane-potential loss within an SNN framework. The introduction of SWM and the dynamic soft similarity loss represent concrete attempts to address information loss typical in spiking networks for hashing tasks. The focus on practical metrics such as energy and parameter count aligns with deployment needs for DVS data. The manuscript supplies sufficient architectural and implementation detail to support reproducibility in principle.

minor comments (3)

[Abstract] The claim of state-of-the-art results would be strengthened by briefly naming the datasets and reporting the magnitude of improvements (e.g., mAP gains) rather than leaving the assertion unsupported in the abstract alone.
[§4, Experiments] Confirm that all reported results include standard deviations across multiple runs and explicit baseline implementations with matching training protocols to ensure fair comparison.
[§3.4, Notation] Define the precise mathematical form of the dynamic soft similarity loss (including how membrane potentials are mapped to the similarity matrix) in a dedicated equation to aid reproducibility.
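
To show the kind of mapping the third comment asks the authors to spell out, here is a purely hypothetical sketch. It assumes the soft similarity between two samples is the cosine similarity of their membrane-potential vectors rescaled to [0, 1], and that the hash similarity is one minus the normalized Hamming distance. Neither formula is taken from the paper, the matrix here is computed rather than learned, and all function names are invented for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def soft_similarity(potentials):
    """Hypothetical S_soft: pairwise cosine of membrane potentials in [0, 1]."""
    return [[0.5 * (1.0 + cosine(u, v)) for v in potentials]
            for u in potentials]


def hash_similarity(codes):
    """Hypothetical S_hash: 1 minus normalized Hamming distance."""
    k = len(codes[0])
    return [[1.0 - sum(x != y for x, y in zip(a, b)) / k for b in codes]
            for a in codes]


def similarity_gap(s_soft, s_hash):
    """Mean squared difference between the two similarity matrices."""
    n = len(s_soft)
    total = sum((s_soft[i][j] - s_hash[i][j]) ** 2
                for i in range(n) for j in range(n))
    return total / (n * n)


potentials = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]
codes = [[1, 0], [1, 0], [0, 1]]
gap = similarity_gap(soft_similarity(potentials), hash_similarity(codes))
```

A dedicated equation in the manuscript would replace these guesses with the authors' actual definition, which is the point of the referee's request.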

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive evaluation of our manuscript and the recommendation for minor revision. The recognition of the contributions of Spikinghash, including the Spiking WaveMixer, Spiking Self-Attention, and dynamic soft similarity loss for energy-efficient DVS retrieval, is appreciated. No specific major comments were listed in the report.

Circularity Check

0 steps flagged

No significant circularity; method is empirical construction without self-referential derivations

The paper presents an empirical architecture (SWM with 3D-DWT, SSA, membrane-potential hash layer, dynamic soft similarity loss) and reports experimental SOTA results on retrieval metrics, energy, and parameters. No equations, derivations, or first-principles claims appear in the provided text that reduce performance to quantities defined by the method's own fitted parameters or self-citations. The central claim rests on reproducible implementation details and external dataset benchmarks rather than any internal reduction by construction. No self-definitional, fitted-input, or uniqueness-imported steps are identifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on domain assumptions about SNN binary behavior and the effectiveness of the newly introduced modules; no explicit free parameters or invented physical entities are stated in the abstract.

axioms (1)

Domain assumption: Spiking neural networks encode information through spikes and possess binary characteristics suitable for generating hash codes.
Invoked as the foundation for the hash layer and overall approach.
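
The binary behaviour this assumption relies on can be seen in a minimal leaky integrate-and-fire (LIF) neuron. The sketch below uses a common discrete-time form with a hard reset; the time constant, threshold, and reset value are illustrative defaults, not settings taken from the paper.

```python
def lif_run(inputs, tau=2.0, v_th=1.0, v_reset=0.0):
    """Simulate one LIF neuron over a sequence of input currents.

    Returns (spikes, potentials): the 0/1 output at each time step and the
    membrane potential just before the firing decision.
    """
    v = v_reset
    spikes, potentials = [], []
    for x in inputs:
        # Leaky integration: move toward the input, decay toward v_reset.
        v = v + (x - (v - v_reset)) / tau
        potentials.append(v)
        if v >= v_th:
            spikes.append(1)
            v = v_reset  # hard reset after a spike
        else:
            spikes.append(0)
    return spikes, potentials


spikes, potentials = lif_run([1.5, 1.5, 1.5, 1.5])
```

The spike output is binary while the membrane potential stays continuous, which is why a loss built on membrane potentials can carry information that the spikes alone discard.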

invented entities (2)

Spiking WaveMixer (SWM): no independent evidence
Purpose: Decouple spatiotemporal features via multilevel 3D-DWT and perform spectral feature fusion.
New module introduced in shallow layers.

Dynamic soft similarity loss: no independent evidence
Purpose: Construct a learnable similarity matrix from membrane potentials to serve as soft labels.
New loss function proposed to improve retrieval.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Each tag describes the relation between the paper passage and the cited Recognition theorem.

IndisputableMonolith/Foundation/DimensionForcing.lean · reality_from_one_distinction (8-tick period) · unclear
Paper passage: "We uniformly use the LIF model... time step of the spiking neuron is 16... timestep is set to 4... time steps to 8"

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking (D=3) · unclear
Paper passage: "Spiking WaveMixer (SWM) ... multilevel 3D discrete wavelet transform (3D-DWT) to decouple spatiotemporal features"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (J-cost) · unclear
Paper passage: "dynamic soft similarity loss ... membrane potentials to construct a learnable similarity matrix"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] G. Gallego, T. Delbrück, G. Orchard, C. Bartolozzi, B. Taba, A. Censi, S. Leutenegger, A. J. Davison, J. Conradt, K. Daniilidis et al., “Event-based vision: A survey,” IEEE transactions on pattern analysis and machine intelligence, vol. 44, no. 1, pp. 154–180, 2020.

[2] A. Amir, B. Taba, D. Berg, T. Melano, J. McKinstry, C. Di Nolfo, T. Nayak, A. Andreopoulos, G. Garreau, M. Mendoza et al., “A low power, fully event-based gesture recognition system,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 7243–7252.

[3] H. Rebecq, R. Ranftl, V. Koltun, and D. Scaramuzza, “High speed and high dynamic range video with an event camera,” IEEE transactions on pattern analysis and machine intelligence, vol. 43, no. 6, pp. 1964–1980, 2019.

[4] W. Maass, “Networks of spiking neurons: the third generation of neural network models,” Neural networks, vol. 10, no. 9, pp. 1659–1671, 1997.

[5] J. Pei, L. Deng, S. Song, M. Zhao, Y. Zhang, S. Wu, G. Wang, Z. Zou, Z. Wu, W. He et al., “Towards artificial general intelligence with hybrid tianjic chip architecture,” Nature, vol. 572, no. 7767, pp. 106–111, 2019.

[6] Z. Zhou, Y. Zhu, C. He, Y. Wang, S. Yan, Y. Tian, and L. Yuan, “Spikformer: When spiking neural network meets transformer,” arXiv preprint arXiv:2209.15425, 2022.

[7] M. Yao, J. Hu, Z. Zhou, L. Yuan, Y. Tian, B. Xu, and G. Li, “Spike-driven transformer,” Advances in Neural Information Processing Systems, vol. 36, 2024.

[8] C. Zhou, L. Yu, Z. Zhou, Z. Ma, H. Zhang, H. Zhou, and Y. Tian, “Spikingformer: Spike-driven residual learning for transformer-based spiking neural network,” arXiv preprint arXiv:2304.11954, 2023.

[9] M. Yao, J. Hu, T. Hu, Y. Xu, Z. Zhou, Y. Tian, B. Xu, and G. Li, “Spike-driven transformer v2: Meta spiking neural network architecture inspiring the design of next-generation neuromorphic chips,” arXiv preprint arXiv:2404.03663, 2024.

[10] Y. Bi, A. Chadha, A. Abbas, E. Bourtsoulatze, and Y. Andreopoulos, “Graph-based spatio-temporal feature learning for neuromorphic vision sensing,” IEEE Transactions on Image Processing, vol. 29, pp. 9084–9098, 2020.

[11] Y. Wang, K. Shi, C. Lu, Y. Liu, M. Zhang, and H. Qu, “Spatial-temporal self-attention for asynchronous spiking neural networks,” in Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI-23, vol. 8, 2023, pp. 3085–3093.

[12] M. Yao, G. Zhao, H. Zhang, Y. Hu, L. Deng, Y. Tian, B. Xu, and G. Li, “Attention spiking neural networks,” IEEE transactions on pattern analysis and machine intelligence, 2023.

[13] M. Yao, H. Gao, G. Zhao, D. Wang, Y. Lin, Z. Yang, and G. Li, “Temporal-wise attention spiking neural networks for event streams classification,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10221–10230.

[14] R.-J. Zhu, M. Zhang, Q. Zhao, H. Deng, Y. Duan, and L.-J. Deng, “Tcja-snn: Temporal-channel joint attention for spiking neural networks,” IEEE Transactions on Neural Networks and Learning Systems, 2024.

[15] X. Qiu, R.-J. Zhu, Y. Chou, Z. Wang, L.-J. Deng, and G. Li, “Gated attention coding for training high-performance and efficient spiking neural networks,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 1, 2024, pp. 601–610.

[16] X. Shi, Z. Hao, and Z. Yu, “Spikingresformer: Bridging resnet and vision transformer in spiking neural networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 5610–5619.

[17] M. Weeks and M. A. Bayoumi, “Three-dimensional discrete wavelet transform architectures,” IEEE Transactions on Signal Processing, vol. 50, no. 8, pp. 2050–2063, 2002.

[18] J. Guibas, M. Mardani, Z. Li, A. Tao, A. Anandkumar, and B. Catanzaro, “Efficient token mixing for transformers via adaptive fourier neural operators,” in International Conference on Learning Representations, 2021.

[19] B. Patro and V. Agneeswaran, “Scattering vision transformer: Spectral mixing matters,” Advances in Neural Information Processing Systems, vol. 36, 2024.

[20] Y. Tang, K. Han, J. Guo, C. Xu, Y. Li, C. Xu, and Y. Wang, “An image patch is a wave: Phase-aware vision mlp,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10935–10944.

[21] T. Yao, Y. Pan, Y. Li, C.-W. Ngo, and T. Mei, “Wave-vit: Unifying wavelet and transformers for visual representation learning,” in European Conference on Computer Vision. Springer, 2022, pp. 328–345.

[22] Z. Cao, M. Long, J. Wang, and P. S. Yu, “Hashnet: Deep learning to hash by continuation,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 5608–5617.

[23] L. Fan, K. W. Ng, C. Ju, T. Zhang, and C. S. Chan, “Deep polarized network for supervised learning of accurate binary hashing codes,” in IJCAI, 2020, pp. 825–831.

[24] Y. Chen, S. Zhang, F. Liu, Z. Chang, M. Ye, and Z. Qi, “Transhash: Transformer-based hamming hashing for efficient image retrieval,” in Proceedings of the 2022 international conference on multimedia retrieval, 2022, pp. 127–136.

[25] T. Li, Z. Zhang, L. Pei, and Y. Gan, “Hashformer: Vision transformer based deep hashing for image retrieval,” IEEE Signal Processing Letters, vol. 29, pp. 827–831, 2022.

[26] S. Li, X. Li, and J. Lu, “Structure-adaptive neighborhood preserving hashing for scalable video search,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 4, pp. 2441–2454, 2021.

[27] S. Li, X. Li, J. Lu, and J. Zhou, “Self-supervised video hashing via bidirectional transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 13549–13558.

[28] Y. Wang, J. Wang, B. Chen, Z. Zeng, and S.-T. Xia, “Contrastive masked autoencoders for self-supervised video hashing,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 3, 2023, pp. 2733–2741.

[29] C. Teeter, R. Iyer, V. Menon, N. Gouwens, D. Feng, J. Berg, A. Szafer, N. Cain, H. Zeng, M. Hawrylycz et al., “Generalized leaky integrate-and-fire models classify multiple neuron types,” Nature communications, vol. 9, no. 1, p. 709, 2018.

[30] A. L. Hodgkin and A. F. Huxley, “A quantitative description of membrane current and its application to conduction and excitation in nerve,” The Journal of physiology, vol. 117, no. 4, p. 500, 1952.

[31] E. M. Izhikevich, “Simple model of spiking neurons,” IEEE Transactions on neural networks, vol. 14, no. 6, pp. 1569–1572, 2003.

[32] Y. Wu, L. Deng, G. Li, and L. Shi, “Spatio-temporal backpropagation for training high-performance spiking neural networks,” Frontiers in neuroscience, vol. 12, p. 323875, 2018.

[33] X. Wang, Z. Wu, B. Jiang, Z. Bao, L. Zhu, G. Li, Y. Wang, and Y. Tian, “Hardvs: Revisiting human activity recognition with dynamic vision sensors,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 6, 2024, pp. 5615–5623.

[34] H. Li, H. Liu, X. Ji, G. Li, and L. Shi, “Cifar10-dvs: an event-stream dataset for object classification,” Frontiers in neuroscience, vol. 11, p. 244131, 2017.

[35] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition. IEEE, 2009, pp. 248–255.

[36] A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features from tiny images,” 2009.

[37] W. Fang, Y. Chen, J. Ding, Z. Yu, T. Masquelier, D. Chen, L. Huang, H. Zhou, G. Li, and Y. Tian, “Spikingjelly: An open-source machine learning infrastructure platform for spike-based intelligence,” Science Advances, vol. 9, no. 40, p. eadi1480, 2023.

[38] C. Zhou, H. Zhang, Z. Zhou, L. Yu, Z. Ma, H. Zhou, X. Fan, and Y. Tian, “Enhancing the performance of transformer-based spiking neural networks by snn-optimized downsampling with precise gradient backpropagation,” arXiv preprint arXiv:2305.05954, 2023.

[39] A. Dosovitskiy, “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.

[40] M. Horowitz, “1.1 computing’s energy problem (and what we can do about it),” in 2014 IEEE international solid-state circuits conference digest of technical papers (ISSCC). IEEE, 2014, pp. 10–14.

[41] G. Bertasius, H. Wang, and L. Torresani, “Is space-time attention all you need for video understanding?” in ICML, vol. 2, no. 3, 2021, p. 4.

[42] A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Lučić, and C. Schmid, “Vivit: A video vision transformer,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 6836–6846.

[43] Y. Pei, Z. Wang, N. Li, H. Chen, B. Huang, and W. Tu, “Deep hashing network with hybrid attention and adaptive weighting for image retrieval,” IEEE Transactions on Multimedia, 2023.

[44] C. Feichtenhofer, H. Fan, J. Malik, and K. He, “Slowfast networks for video recognition,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 6202–6211.

[45] Z. Wang, Q. She, and A. Smolic, “Action-net: Multipath excitation for action recognition,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 13214–13223.

[46] J. Lin, C. Gan, and S. Han, “Tsm: Temporal shift module for efficient video understanding,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 7083–7093.

[47] H. Zheng, Y. Wu, L. Deng, Y. Hu, and G. Li, “Going deeper with directly-trained larger spiking neural networks,” in Proceedings of the AAAI conference on artificial intelligence, vol. 35, no. 12, 2021, pp. 11062–11070.

[48] W. Fang, Z. Yu, Y. Chen, T. Masquelier, T. Huang, and Y. Tian, “Incorporating learnable membrane time constant to enhance learning of spiking neural networks,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 2661–2671.

[49] C. Zhou, H. Zhang, Z. Zhou, L. Yu, L. Huang, X. Fan, L. Yuan, Z. Ma, H. Zhou, and Y. Tian, “Qkformer: Hierarchical spiking transformer using qk attention,” arXiv preprint arXiv:2403.16552, 2024.

[50] H. Ren, Y. Zhou, Y. Huang, H. Fu, X. Lin, J. Song, and B. Cheng, “Spikepoint: An efficient point-based spiking neural network for event cameras action recognition,” arXiv preprint arXiv:2310.07189, 2023.

[51] W. Fang, Z. Yu, Y. Chen, T. Huang, T. Masquelier, and Y. Tian, “Deep residual learning in spiking neural networks,” Advances in Neural Information Processing Systems, vol. 34, pp. 21056–21069, 2021.

[52] M. Yao, H. Zhang, G. Zhao, X. Zhang, D. Wang, G. Cao, and G. Li, “Sparser spiking activity can be better: Feature refine-and-mask spiking neural network for event-based visual recognition,” Neural Networks, vol. 166, pp. 410–423, 2023.

[53] Y. Li, Y. Guo, S. Zhang, S. Deng, Y. Hai, and S. Gu, “Differentiable spike: Rethinking gradient-descent for training spiking neural networks,” Advances in Neural Information Processing Systems, vol. 34, pp. 23426–23439, 2021.

[54] T. Bu, W. Fang, J. Ding, P. Dai, Z. Yu, and T. Huang, “Optimal ann-snn conversion for high-accuracy and ultra-low-latency spiking neural networks,” arXiv preprint arXiv:2303.04347, 2023.

[55] Q. Yang, J. Wu, M. Zhang, Y. Chua, X. Wang, and H. Li, “Training spiking neural networks with local tandem learning,” Advances in Neural Information Processing Systems, vol. 35, pp. 12662–12676, 2022.

[56] Z. Wang, R. Jiang, S. Lian, R. Yan, and H. Tang, “Adaptive smoothing gradient learning for spiking neural networks,” in International Conference on Machine Learning. PMLR, 2023, pp. 35798–35816.

[57] Z. Tong, Y. Song, J. Wang, and L. Wang, “Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training,” Advances in neural information processing systems, vol. 35, pp. 10078–10093, 2022.