pith. machine review for the scientific record.

arxiv: 2604.18610 · v1 · submitted 2026-04-13 · 💻 cs.NE · cs.AI

Recognition: unknown

SpikeMLLM: Spike-based Multimodal Large Language Models via Modality-Specific Temporal Scales and Temporal Compression

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:25 UTC · model grok-4.3

classification 💻 cs.NE cs.AI
keywords spiking neural networks · multimodal large language models · temporal compression · energy efficiency · neuromorphic hardware · modality-specific scales

The pith

SpikeMLLM turns multimodal LLMs into spiking networks that keep near-original accuracy while slashing timesteps and energy use.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SpikeMLLM, the first spike-based framework for multimodal large language models. It addresses two barriers: heterogeneous inputs such as images and text need different spike rates, and high-resolution images inflate the number of simulation timesteps. The solution unifies standard ANN quantization inside the spiking representation, then applies Modality-Specific Temporal Scales, which track how each modality evolves across layers, plus a temporally compressed leaky integrate-and-fire neuron that shortens each forward pass. On four existing MLLMs the converted models stay within roughly one percent of FP16 accuracy across many benchmarks, and a custom hardware design runs far faster and more power-efficiently than a GPU baseline.
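
Both barriers have a simple arithmetic flavor. A minimal sketch of the timestep-unfolding cost, with illustrative numbers (patch size, resolution, prompt length, and quantization levels are stand-ins, not figures from the paper):

```python
# Why high-resolution inputs hurt: spiking inference replays every token
# for T timesteps, so sequential work scales as tokens * timesteps.
# All numbers below are illustrative stand-ins, not from the paper.

def unfolded_ops(num_tokens: int, timesteps: int, hidden: int) -> int:
    return num_tokens * timesteps * hidden

# A 1344x1344 image tiled into 14x14 patches yields 96*96 = 9216 visual
# tokens, dwarfing a ~100-token text prompt.
visual_tokens, text_tokens, hidden = 9216, 100, 4096

T_uniform = 15  # e.g. T = L-1 for L = 16 quantization levels
print(unfolded_ops(visual_tokens, T_uniform, hidden))  # vision dominates
print(unfolded_ops(text_tokens, T_uniform, hidden))
```

Shrinking T for the dominant stream, which is exactly what modality-specific scales plus temporal compression do, attacks the largest term directly.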

Core claim

SpikeMLLM unifies existing ANN quantization methods inside the spiking representation space and adds Modality-Specific Temporal Scales guided by Modality Evolution Discrepancy together with Temporally Compressed LIF (TC-LIF) neurons; this combination reduces the timestep count from T = L-1 to T = log2(L)-1 while the resulting models retain near-lossless performance on multimodal tasks and run efficiently on neuromorphic-style accelerators.
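
The timestep arithmetic in the claim can be checked with a toy encoding. A minimal sketch, assuming the baseline integer-to-spike scheme emits one spike per quantization unit (T = L-1) and that temporal compression behaves like a base-2 spike code; the paper reports T = log2(L)-1, one step fewer than the naive bit count below, a detail of the actual TC-LIF dynamics not reproduced on this page:

```python
import math

def unary_spikes(level: int, L: int) -> list[int]:
    # Integer-to-Spike style: one spike per unit of the quantized level,
    # so covering the range [0, L-1] needs T = L-1 timesteps.
    return [1 if t < level else 0 for t in range(L - 1)]

def binary_spikes(level: int, L: int) -> list[int]:
    # Compressed reading: each timestep carries a power-of-two weight,
    # so log2(L) timesteps span the same range.
    return [(level >> t) & 1 for t in range(int(math.log2(L)))]

L = 16  # quantization levels
for level in range(L):
    assert sum(unary_spikes(level, L)) == level
    assert sum(b << t for t, b in enumerate(binary_spikes(level, L))) == level

print(len(unary_spikes(5, L)), len(binary_spikes(5, L)))  # 15 vs 4
```

Either way the decoded integer is identical; only the number of timesteps, and hence the sequential simulation cost, changes.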

What carries the argument

Modality-Specific Temporal Scales (MSTS) guided by Modality Evolution Discrepancy (MED) paired with Temporally Compressed LIF (TC-LIF) neurons; these assign modality-appropriate spike rates and collapse the number of simulation steps required for each forward pass.
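
Neither MED nor MSTS is defined on this page beyond its name. A minimal sketch of one plausible reading, assuming MED resembles a cosine-distance measure of how much each modality's hidden states change across layers (Figure 1 plots a cross-layer discrepancy) and that this discrepancy then sets the timestep budget; the tensors are random stand-ins and the final mapping to Tv/Tt is illustrative, not the paper's rule:

```python
import torch

def evolution_discrepancy(states: list[torch.Tensor]) -> float:
    """Mean cosine distance between consecutive layers' hidden states for
    one modality's tokens: a proxy for how fast that modality 'evolves'."""
    dists = [
        1 - torch.nn.functional.cosine_similarity(a, b, dim=-1).mean()
        for a, b in zip(states, states[1:])
    ]
    return torch.stack(dists).mean().item()

layers, d = 8, 64
vision = [torch.randn(196, d) for _ in range(layers)]  # stand-in activations
text = [torch.randn(32, d) for _ in range(layers)]

med_v, med_t = evolution_discrepancy(vision), evolution_discrepancy(text)
# Illustrative budget rule: scale the visual timesteps by the discrepancy
# ratio. How the paper actually derives its Tv/Tt = 3/4 is not shown here.
Tt = 4
Tv = max(1, round(Tt * med_v / med_t))
print(f"MED(v)={med_v:.3f} MED(t)={med_t:.3f} -> Tv/Tt = {Tv}/{Tt}")
```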

If this is right

  • Average performance gaps are only 0.72 percent on InternVL2-8B and 1.19 percent on Qwen2VL-72B, even under aggressive timestep compression at a visual-to-text ratio of Tv/Tt = 3/4.
  • A dedicated RTL accelerator built for the spike datapath delivers 9.06 times higher throughput and 25.8 times better power efficiency than an FP16 GPU baseline; a sketch of why spikes simplify that datapath follows this list.
  • The approach opens deployment of large multimodal models on power-limited edge or neuromorphic hardware without retraining from scratch.
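
The hardware numbers trace back to a simple property of binary spikes: the dot product inside a spike-driven processing element (Figure 3) reduces to conditional accumulation, with no multipliers. A minimal sketch of that datapath in plain Python; the actual RTL design is not reproduced here:

```python
def spike_dot(spikes: list[int], weights: list[float]) -> float:
    # Binary activations turn multiply-accumulate into accumulate-only:
    # each 1-spike adds its weight, each 0-spike is skipped entirely,
    # which is where event-driven sparsity saves both time and energy.
    return sum(w for s, w in zip(spikes, weights) if s)

spikes = [1, 0, 0, 1, 1, 0, 0, 0]  # sparse binary spike vector
weights = [0.5, -1.2, 0.3, 2.0, -0.7, 0.1, 0.9, -0.4]
print(spike_dot(spikes, weights))  # 0.5 + 2.0 - 0.7 = 1.8
```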

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same modality-aware timing rule could be applied to other spiking pipelines that mix continuous and discrete signals.
  • Combining the temporal compression with additional event sparsity techniques would likely push energy savings further on real neuromorphic chips.
  • Hardware-algorithm co-design of this form may become standard for making any large multimodal system run on low-power substrates.

Load-bearing premise

The benchmarks used are sensitive enough to reveal any loss of cross-modal reasoning that the spiking conversion and compression might introduce.

What would settle it

Test the same converted models on a new multimodal task that demands fine-grained temporal alignment across modalities, such as counting events in synchronized video-audio clips, and check whether the accuracy gap widens beyond the one-percent range reported.
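
A minimal harness for that test, with `eval_accuracy` as a hypothetical stand-in for a real benchmark runner and the task name purely illustrative; the threshold mirrors the largest average gap reported above:

```python
def gap_widens(eval_accuracy, fp16_model, spiking_model,
               task="synchronized-av-event-counting",
               reported_gap=1.19) -> bool:
    """True if the FP16-to-spiking accuracy gap (percentage points) on a
    temporal-alignment task exceeds the gap seen on standard benchmarks."""
    gap = eval_accuracy(fp16_model, task) - eval_accuracy(spiking_model, task)
    return gap > reported_gap

# Usage with real models and a real runner (names hypothetical):
#   if gap_widens(run_benchmark, internvl2_fp16, internvl2_spike):
#       print("temporal alignment loss not captured by standard suites")
```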

Figures

Figures reproduced from arXiv: 2604.18610 by Bo Lei, Bo Xu, Di Shang, Guoqi Li, Han Xu, Jiahong Zhang, Tiejun Huang, Xuerui Qiu, Zhiyong Qin.

Figure 1: Cross-layer MED discrepancy between visual and …
Figure 2: Comparison between Integer-to-Spike LIF and TC…
Figure 3: Microarchitecture of Spike-driven Dot-product PE
Figure 4: Ablation study of MSTS: Normalized perfor…
Figure 5: Ablation study of MSTS: Normalized perfor…
Figure 6: Ablation study of TC-LIF: Normalized performance…
Figure 7: Firing rate distribution across layers for SpikeM…
Figure 8: Overall Architecture of Co-design Accelerator
Figure 9: Per-layer firing rate distribution of different methods on Qwen2VL-7B.
Figure 10: Per-layer firing rate distribution of different methods on InternVL2-8B.
Figure 11: Example responses generated by SpikeMLLM (QuaRot+MSTS+TC-LIF, …
Original abstract

Multimodal Large Language Models (MLLMs) have achieved remarkable progress but incur substantial computational overhead and energy consumption during inference, limiting deployment in resource-constrained environments. Spiking Neural Networks (SNNs), with their sparse event-driven computation, offer inherent energy efficiency advantages on neuromorphic hardware, yet extending them to MLLMs faces two key challenges: heterogeneous modalities make uniform spike encoding insufficient, and high-resolution image inputs amplify timestep unfolding overhead. We propose SpikeMLLM, the first spike-based framework for MLLMs, which unifies existing ANN quantization methods in the spiking representation space and incorporates Modality-Specific Temporal Scales (MSTS) guided by Modality Evolution Discrepancy (MED) and Temporally Compressed LIF (TC-LIF) for timestep compression from T=L-1 to T=log2(L)-1. Experiments on four representative MLLMs across diverse multimodal benchmarks show that SpikeMLLM maintains near-lossless performance under aggressive timestep compression (Tv/Tt=3/4), with average gaps of only 0.72% and 1.19% relative to the FP16 baseline on InternVL2-8B and Qwen2VL-72B. We further develop a dedicated RTL accelerator tailored to the spike-driven datapath, observing 9.06x higher throughput and 25.8x better power efficiency relative to an FP16 GPU baseline under a deployment-oriented co-design setting, suggesting the promise of algorithm-hardware co-design for efficient multimodal intelligence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces SpikeMLLM, the first spike-based framework for multimodal large language models. It unifies ANN quantization methods into the spiking domain and adds two components: Modality-Specific Temporal Scales (MSTS) guided by Modality Evolution Discrepancy (MED) and Temporally Compressed LIF (TC-LIF) neurons that reduce timesteps from T = L-1 to T = log2(L)-1. Experiments on four MLLMs (including InternVL2-8B and Qwen2VL-72B) across multimodal benchmarks report near-lossless accuracy under Tv/Tt = 3/4 compression (average gaps 0.72 % and 1.19 % vs. FP16), together with a custom RTL accelerator delivering 9.06× throughput and 25.8× power-efficiency gains over an FP16 GPU baseline.

Significance. If the empirical claims hold after the requested clarifications, the work would be significant for energy-efficient deployment of large multimodal models on neuromorphic hardware. The algorithm-hardware co-design, including the dedicated RTL datapath, is a concrete strength that goes beyond pure algorithmic conversion and provides measurable efficiency numbers.

major comments (3)
  1. [Abstract / Experiments] The reported 0.72% and 1.19% average gaps are aggregate figures only. No per-task variance, no breakdown by cross-modal reasoning subtasks, and no spike-rate alignment metrics are supplied, leaving open the possibility that modality desynchronization under MSTS is masked by the averaging.
  2. [Methods: MSTS and TC-LIF definitions] The claim that MSTS guided by MED plus TC-LIF compression preserves cross-modal temporal correspondence rests on the assumption that vision and text spikes remain aligned for attention; the manuscript provides no explicit analysis or ablation of spike-timing statistics between modalities at Tv/Tt = 3/4.
  3. [Experiments] The paper does not report an ablation comparing MSTS against a uniform temporal scale across modalities. Such a control is load-bearing for the central claim that modality-specific scales are required to achieve the near-lossless result.
minor comments (2)
  1. [Abstract] The acronym MED is introduced in the abstract without expansion.
  2. [Hardware section] The hardware results would be strengthened by reporting the exact FPGA or ASIC technology node and the area overhead of the RTL accelerator.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We appreciate the referee's thorough review and constructive feedback on our work introducing SpikeMLLM. We address each major comment in detail below, committing to revisions that enhance the clarity and robustness of our claims regarding performance preservation and the role of MSTS.

point-by-point responses
  1. Referee: [Abstract / Experiments] The reported 0.72% and 1.19% average gaps are aggregate figures only. No per-task variance, no breakdown by cross-modal reasoning subtasks, and no spike-rate alignment metrics are supplied, leaving open the possibility that modality desynchronization under MSTS is masked by the averaging.

    Authors: We thank the referee for highlighting this. While the primary results focus on aggregate performance to demonstrate overall near-lossless behavior, we recognize the value of detailed breakdowns. In the revised manuscript, we will include per-task accuracy tables with variance, breakdowns for cross-modal reasoning subtasks, and spike-rate alignment metrics to explicitly address potential modality desynchronization concerns. revision: yes

  2. Referee: [Methods: MSTS and TC-LIF definitions] The claim that MSTS guided by MED plus TC-LIF compression preserves cross-modal temporal correspondence rests on the assumption that vision and text spikes remain aligned for attention; the manuscript provides no explicit analysis or ablation of spike-timing statistics between modalities at Tv/Tt = 3/4.

    Authors: The near-lossless performance on multimodal benchmarks serves as indirect evidence for preserved temporal correspondence. Nevertheless, to directly validate the assumption, we will incorporate an explicit analysis of spike-timing statistics and inter-modality alignment at the compressed timestep ratio in the revised Methods and Experiments sections. revision: yes

  3. Referee: [Experiments] The paper does not report an ablation comparing MSTS against a uniform temporal scale across modalities. Such a control is load-bearing for the central claim that modality-specific scales are required to achieve the near-lossless result.

    Authors: We agree that an ablation study comparing MSTS to a uniform temporal scale is essential to substantiate the necessity of modality-specific scales. We will add this ablation experiment, including quantitative comparisons on the same benchmarks, to the revised Experiments section. revision: yes
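
One concrete shape the spike-timing analysis promised in responses 1 and 2 could take: compare the per-timestep population firing rates of the two modalities and report their correlation. A sketch with random stand-in spike tensors, not the authors' actual metric:

```python
import torch

def modality_alignment(vis_spikes: torch.Tensor, txt_spikes: torch.Tensor) -> float:
    """Correlation between per-timestep population firing rates of two
    modalities; values near 1 suggest their activity stays in phase.
    Shapes: (T, tokens, hidden) binary tensors; T may differ per modality."""
    rv = vis_spikes.float().mean(dim=(1, 2))  # (Tv,) firing rate per step
    rt = txt_spikes.float().mean(dim=(1, 2))  # (Tt,)
    T = min(len(rv), len(rt))                 # compare overlapping steps only
    return torch.corrcoef(torch.stack([rv[:T], rt[:T]]))[0, 1].item()

Tv, Tt = 3, 4                                 # one reading of Tv/Tt = 3/4
vis = (torch.rand(Tv, 196, 64) < 0.2).int()   # stand-in spike trains
txt = (torch.rand(Tt, 32, 64) < 0.2).int()
print(modality_alignment(vis, txt))
```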

Circularity Check

0 steps flagged

No circularity; empirical framework validated on external benchmarks

full rationale

The paper introduces SpikeMLLM as an algorithmic framework that unifies ANN quantization methods into spiking representations, adds Modality-Specific Temporal Scales guided by Modality Evolution Discrepancy, and applies Temporally Compressed LIF for timestep reduction. All load-bearing claims rest on direct experimental measurements across four MLLMs and diverse multimodal benchmarks, reporting concrete performance gaps (0.72% and 1.19%) and hardware metrics relative to FP16 baselines. No equations, parameters, or uniqueness theorems are defined in terms of the target outputs, no self-citations serve as load-bearing premises, and no fitted quantities are relabeled as predictions. The derivation chain is therefore self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The central claim rests on the assumption that SNNs can faithfully emulate quantized ANN computations across modalities and that the proposed temporal mechanisms do not degrade multimodal fusion. Several new components are introduced without independent evidence beyond the reported experiments.

free parameters (1)
  • Timestep compression ratio Tv/Tt = 3/4
    Set to 3/4 for aggressive compression; value chosen to balance performance and efficiency.
axioms (1)
  • domain assumption Spiking neural networks can unify ANN quantization methods in the spiking representation space without loss of representational power
    Invoked as the foundation for converting existing MLLMs to spike-based versions.
invented entities (2)
  • Modality-Specific Temporal Scales (MSTS) no independent evidence
    purpose: To address heterogeneous modalities by assigning different temporal resolutions
    Newly proposed to overcome uniform spike encoding limitations.
  • Temporally Compressed LIF (TC-LIF) no independent evidence
    purpose: To reduce timesteps from T=L-1 to T=log2(L)-1
    New neuron variant introduced for compression.

pith-pipeline@v0.9.0 · 5603 in / 1549 out tokens · 72570 ms · 2026-05-10T16:25:32.958089+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SpikingBrain2.0: Brain-Inspired Foundation Models for Efficient Long-Context and Cross-Platform Inference

    cs.LG 2026-04 unverdicted novelty 5.0

    SpikingBrain2.0 is a 5B hybrid spiking-Transformer that recovers most base model performance while delivering 10x TTFT speedup at 4M context and supporting over 10M tokens on limited GPUs via dual sparse attention and...

Reference graph

Works this paper leans on

73 extracted references · 15 canonical work pages · cited by 1 Pith paper · 5 internal anchors

  1. [1]

    Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L. Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman. 2024. QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs. In Advances in Neural Information Processing Systems, Vol. 37.

  2. [2]

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. 2023. Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities. arXiv preprint arXiv:2308.12966 (2023).

  3. [3]

    Malyaban Bal and Abhronil Sengupta. 2024. SpikingBERT: Distilling BERT to Train Spiking Language Models Using Implicit Differentiation. In AAAI Conference on Artificial Intelligence.

  4. [4]

    Yongqiang Cao, Yang Chen, and Deepak Khosla. 2015. Spiking Deep Convolutional Neural Networks for Energy-Efficient Object Recognition. International Journal of Computer Vision 113, 1 (2015), 54–66.

  5. [5]

    Guoming Chen, Zhuoxian Qian, Dong Zhang, Shuang Qiu, and Ruqi Zhou. 2025. Enhancing Robustness Against Adversarial Attacks in Multimodal Emotion Recognition with Spiking Transformers. IEEE Access (2025).

  6. [6]

    Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. 2024. How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites. arXiv preprint arXiv:2404.16821.

  7. [7]

    Mike Davies et al. 2018. Loihi: A neuromorphic manycore processor with on-chip learning. IEEE Micro 38, 1 (2018), 82–99.

  8. [8]

    Jason K. Eshraghian, Max Ward, Emre O. Neftci, Xinxin Wang, Gregor Lenz, Girish Dwivedi, Mohammed Bennamoun, Doo Seok Jeong, and Wei D. Lu. 2023. Training spiking neural networks using lessons from deep learning. Proc. IEEE 111, 9 (2023), 1016–1054.

  9. [9]

    Wei Fang, Yanqi Chen, Jianhao Ding, Zhaofei Yu, Timothée Masquelier, Ding Chen, Liwei Huang, Huihui Zhou, Guoqi Li, and Yonghong Tian. 2023. SpikingJelly: An open-source machine learning infrastructure platform for spike-based intelligence. Science Advances 9, 40 (2023), eadi1480.

  10. [10]

    Wei Fang, Zhaofei Yu, Yanqi Chen, Timothée Masquelier, Tiejun Huang, and Yonghong Tian. 2021. Deep residual learning in spiking neural networks. Advances in Neural Information Processing Systems 34 (2021), 21056–21069.

  11. [11]

    Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2023. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. In International Conference on Learning Representations.

  12. [12]

    Charlotte Frenkel, David Bol, and Giacomo Indiveri. 2023. Bottom-up and top-down approaches for the design of neuromorphic processing systems: Tradeoffs and synergies between natural and artificial intelligence. Proc. IEEE 111, 9 (2023), 1490–1514. doi:10.1109/JPROC.2023.3272498

  13. [13]

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, and Rongrong Ji. 2023. MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models. arXiv preprint arXiv:2306.13394 (2023).

  14. [14]

    Lingyue Guo, Zeyu Gao, Jinye Qu, Suiwu Zheng, Runhao Jiang, Yanfeng Lu, and Hong Qiao. 2023. Transformer-based spiking neural networks for multimodal audiovisual classification. IEEE Transactions on Cognitive and Developmental Systems 16, 3 (2023), 1077–1086.

  15. [15]

    Ming Guo, Wenrui Li, Chao Wang, Yuxin Ge, and Chongjun Wang. 2024. SMILE: Spiking multi-modal interactive label-guided enhancement network for emotion recognition. In IEEE International Conference on Multimedia and Expo (ICME). IEEE, 1–6.

  16. [16]

    Yufei Guo, Yuanpei Chen, Xiaode Liu, Weihang Peng, Yuhan Zhang, Xuhui Huang, and Zhe Ma. 2024. Ternary Spike: Learning Ternary Spikes for Spiking Neural Networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 12244–12252.

  17. [17]

    Shwai He, Guoheng Sun, Zhenyu Shen, and Ang Li. 2024. What Matters in Transformers? Not All Attention is Needed. arXiv preprint arXiv:2406.15786 (2024).

  18. [18]

    Sebastian Höppner, Yexin Yan, Andreas Dixius, et al. 2021. The SpiNNaker 2 processing element architecture for hybrid digital neuromorphic computing. arXiv preprint arXiv:2103.08392 (2021).

  19. [19]

    Yifan Hu, Huajin Tang, and Gang Pan. 2023. Spiking deep residual networks. IEEE Transactions on Neural Networks and Learning Systems 34, 8 (2023), 5200–5205.

  20. [20]

    Yang-Zhi Hu, Qian Zheng, Xudong Jiang, and Gang Pan. 2023. Fast-SNN: Fast Spiking Neural Network by Converting Quantized ANN. IEEE Transactions on Pattern Analysis and Machine Intelligence 45 (2023), 14546–14562.

  21. [21–22]

    Intel Labs. 2021. Taking Neuromorphic Computing to the Next Level with Loihi 2. https://download.intel.com/newsroom/2021/neuromorphic-computing-loihi-2-brief.pdf. Accessed: 2026-01-13.

  23. [23]

    Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. 2018. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2704–2713.

  24. [24]

    Dhireesha Kudithipudi, Catherine Schuman, Craig M. Vineyard, Tej Pandit, Cory Merkel, Rajkumar Kubendran, James B. Aimone, Garrick Orchard, Christian Mayr, Ryad Benosman, Joe Hays, Cliff Young, Chiara Bartolozzi, Amitava Majumdar, Suma George Cardwell, Melika Payvand, Sonia Buckley, Shruti Kulkarni, Hector A. Gonzalez, Gert Cauwenberghs, Chetan Singh Th...

  25. [25–26]

    Zhenxin Lei, Man Yao, Jiakui Hu, Xinhao Luo, Yanye Lu, Bo Xu, and Guoqi Li. 2025. Spike2Former: Efficient Spiking Transformer for High-performance Image Segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 1364–1372.

  27. [27]

    Guoqi Li, Lei Deng, Huajin Tang, Gang Pan, Yonghong Tian, Kaushik Roy, and Wolfgang Maass. 2023. Brain inspired computing: A systematic survey and future trends. Authorea Preprints (2023).

  28. [28–29]

    Shiyao Li, Yingchun Hu, Xuefei Ning, Xihui Liu, Ke Hong, Xiaotao Jia, Xiuhong Li, Yaqi Yan, Pei Ran, Guohao Dai, Shengen Yan, Huazhong Yang, and Yu Wang. 2025. MBQ: Modality-Balanced Quantization for Large Vision-Language Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

  30. [30]

    Qianhui Liu, Dong Xing, Lang Feng, Huajin Tang, and Gang Pan. 2022. Event-based multimodal spiking neural network with attention mechanism. In ICASSP. IEEE, 8922–8926.

  31. [31]

    Xu Liu, Na Xia, Jinxing Zhou, Zhangbin Li, and Dan Guo. 2025. Towards Energy-efficient Audio-visual Classification via Multimodal Interactive Spiking Neural Network. ACM Transactions on Multimedia Computing, Communications, and Applications (2025). doi:10.1145/3721981

  32. [32]

    Yuliang Liu, Zhang Li, Mingxin Huang, Biao Yang, Wenwen Yu, Chunyuan Li, Xu-Cheng Yin, Cheng-Lin Liu, Lianwen Jin, and Xiang Bai. 2024. OCRBench: On the hidden mystery of OCR in large multimodal models. Science China Information Sciences 67, 12 (2024). doi:10.1007/s11432-024-4235-6

  33. [33]

    Zechun Liu, Wenhan Luo, Baoyuan Wu, Xiaoyan Yang, Wei Liu, and Kwang-Ting Cheng. 2020. Bi-Real Net: Binarizing Deep Network Towards Real-Network Performance. International Journal of Computer Vision 128 (2020), 202–219.

  34. [34]

    Zechun Liu, Barlas Oguz, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad, Yangyang Shi, Raghuraman Krishnamoorthi, and Vikas Chandra. 2024. LLM-QAT: Data-Free Quantization Aware Training for Large Language Models. In Findings of the Association for Computational Linguistics: ACL 2024. 467–484.

  35. [35]

    Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. 2022. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems 35 (2022), 2507–2521.

  36. [36]

    Xinhao Luo, Man Yao, Yuhong Chou, Bo Xu, and Guoqi Li. 2024. Integer-Valued Training and Spike-Driven Inference Spiking Neural Network for High-Performance and Energy-Efficient Object Detection. In European Conference on Computer Vision. Springer, 253–272.

  37. [37]

    Shixian Ma, Jing Pei, Wenxin Zhang, et al. 2022. Neuromorphic computing chip with spatiotemporal elasticity for multi-intelligent-tasking robots. Science Robotics 7 (2022), eabk2948.

  38. [38]

    Wolfgang Maass. 1997. Networks of spiking neurons: The third generation of neural network models. Neural Networks 10, 9 (1997), 1659–1671.

  39. [39]

    Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. 2021. DocVQA: A dataset for VQA on document images. In WACV. 2200–2209.

  40. [40–41]

    Paul A. Merolla, John V. Arthur, Rodrigo Alvarez-Icaza, Andrew S. Cassidy, et al. 2014. A million spiking-neuron integrated circuit with a scalable communication network and interface. Science 345 (2014), 668–673.

  42. [42]

    Emre O. Neftci, Hesham Mostafa, and Friedemann Zenke. 2019. Surrogate Gradient Learning in Spiking Neural Networks: Bringing the Power of Gradient-Based Optimization to Spiking Neural Networks. IEEE Signal Processing Magazine 36, 6 (2019), 51–63. doi:10.1109/MSP.2019.2931595

  43. [43]

    Jing Pei et al. 2019. Towards artificial general intelligence with hybrid Tianjic chip architecture. Nature 572 (2019), 106–111.

  44. [44]

    Michael Pfeiffer and Thomas Pfeil. 2018. Deep Learning With Spiking Neurons: Opportunities and Challenges. Frontiers in Neuroscience 12 (2018), 774.

  45. [45–46]

    Nitin Rathi, Gopalakrishnan Srinivasan, Priyadarshini Panda, and Kaushik Roy. 2020. Enabling deep spiking neural networks with hybrid conversion and spike timing dependent backpropagation. In International Conference on Learning Representations.

  47. [47]

    Kaushik Roy, Akhilesh Jaiswal, and Priyadarshini Panda. 2019. Towards spike-based machine intelligence with neuromorphic computing. Nature 575, 7784 (2019), 607–617.

  48. [48]

    Catherine D. Schuman, Shruti R. Kulkarni, Maryam Parsa, J. Parker Mitchell, Bill Kay, et al. 2022. Opportunities for neuromorphic computing algorithms and applications. Nature Computational Science 2, 1 (2022), 10–19. doi:10.1038/s43588-021-00174-0

  49. [49]

    Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. 2019. Towards VQA models that can read. In CVPR. 8317–8326.

  50. [50]

    Hongze Sun, Rui Liu, Wuque Cai, Jun Wang, Yue Wang, Huajin Tang, Yan Cui, Dezhong Yao, and Daqing Guo. 2024. Reliable object tracking by multimodal hybrid feature extraction and transformer-based fusion. Neural Networks 178 (2024), 106493.

  51. [51]

    Clarence Tan, Gerardo Ceballos, Nikola Kasabov, and Narayan Puthanmadam Subramaniyam. 2020. FusionSense: Emotion classification using feature fusion of multimodal data and deep learning in a brain-inspired spiking neural network. Sensors 20, 18 (2020), 5328.

  52. [52]

    Amirhossein Tavanaei, Masoud Ghodrati, Saeed Reza Kheradpisheh, Timothée Masquelier, and Anthony Maida. 2019. Deep learning in spiking neural networks. Neural Networks 111 (2019), 47–63. doi:10.1016/j.neunet.2018.12.002

  53. [53]

    Sadia Anjum Tumpa, Anusha Devulapally, Matthew Brehove, Espoir Kyubwa, and Vijaykrishnan Narayanan. 2024. SNN-ANN Hybrid Networks for Embedded Multimodal Monocular Depth Estimation. In 2024 IEEE Computer Society Annual Symposium on VLSI (ISVLSI). IEEE, 198–203.

  54. [54]

    Huizheng Wang, Zichuan Wang, Zhiheng Yue, Yousheng Long, Taiquan Wei, Jianxun Yang, Yang Wang, Chao Li, Shaojun Wei, Yang Hu, and Shouyi Yin. 2025. MCBP: A Memory-Compute Efficient LLM Inference Accelerator Leveraging Bit-Slice-enabled Sparsity and Repetitiveness. In Proceedings of the 58th IEEE/ACM International Symposium on Microarchitecture. Associatio...

  55. [55]

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. 2024. Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution. arXiv preprint arXiv:2409.12191 (2024).

  56. [56–57]

    Zhanpeng Wang, Wei Fang, Jianguo Cao, Qi Zhang, Zhenzhi Wang, and Rui Xu. 2023. Masked spiking transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 1761–1771.

  58. [58]

    Jibin Wu, Chenglin Xu, Xiao Han, Daquan Zhou, Malu Zhang, Haizhou Li, and Kay Chen Tan. 2021. Progressive tandem learning for pattern recognition with deep spiking neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 44, 11 (2021), 7824–7840.

  59. [59]

    Yujie Wu, Lei Deng, Guoqi Li, Jun Zhu, and Luping Shi. 2018. Spatio-temporal backpropagation for training high-performance spiking neural networks. Frontiers in Neuroscience 12 (2018), 331.

  60. [60]

    Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. 2023. SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models. In International Conference on Machine Learning.

  61. [61]

    Xingrun Xing, Boyan Gao, Zheng Liu, David Clifton, Shitao Xiao, Wanpeng Zhang, Li Du, Zheng Zhang, Guoqi Li, and Jiajun Zhang. 2025. SpikeLLM: Scaling up Spiking Neural Network to Large Language Models via Saliency-based Spiking. In International Conference on Learning Representations.

  62. [62]

    Xingrun Xing, Zheng Zhang, Ziyi Ni, Shitao Xiao, Yiming Ju, Siqi Fan, Yequan Wang, Jiajun Zhang, and Guoqi Li. 2024. SpikeLM: Towards General Spike-Driven Language Modeling via Elastic Bi-Spiking Mechanisms. In International Conference on Machine Learning.

  63. [63]

    Han Xu, Xuerui Qiu, Yunhui Xu, Mohammed E. Elbitty, Peng Zhou, Yang Tian, Rui-Jie Zhu, Jiahong Zhang, Shaowei Gu, Yuqi Pan, Yuhong Chou, Qinghao Wen, Man Yao, Jiangbo Qian, Yonghong Tian, Lei Ma, Tiejun Huang, Jason K. Eshraghian, Bo Xu, and Guoqi Li. 2025. Neuromorphic Spike-Based Large Language Model. National Science Review (2025), nwaf551. doi:10.1093...

  64. [64]

    Sheng Xu, Yuhang Li, Ming Lin, Peng Gao, Guodong Guo, Jianhua Lü, and Baoyuan Zhang. 2023. Q-DETR: An Efficient Low-bit Quantized Detection Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 3842–3851.

  65. [65]

    Man Yao, Jiakui Hu, Tianxiang Hu, Yifan Xu, Zhaokun Zhou, Yonghong Tian, Bo Xu, and Guoqi Li. 2024. Spike-driven Transformer V2: Meta Spiking Neural Network Architecture Inspiring the Design of Next-generation Neuromorphic Chips. In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=1SIBN5Xyw7

  66. [66]

    Man Yao, Xuerui Qiu, Tianxiang Hu, Jiakui Hu, Yuhong Chou, Keyu Tian, Jianxing Liao, Luziwei Leng, Bo Xu, and Guoqi Li. 2025. Scaling Spike-driven Transformer with Efficient Spike Firing Approximation Training. IEEE Transactions on Pattern Analysis and Machine Intelligence (2025).

  67. [67]

    Man Yao, Ole Richter, Guangshe Zhao, Ning Qiao, Yannan Xing, Dingheng Wang, et al. 2024. Spike-based dynamic computing with asynchronous sensing-computing neuromorphic chip. Nature Communications 15 (2024), 4464.

  68. [68]

    Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. 2024. MiniCPM-V: A GPT-4V Level MLLM on Your Phone. arXiv preprint arXiv:2408.01800 (2024).

  69. [69]

    Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. 2024. A survey on multimodal large language models. National Science Review 11, 12 (2024), nwae403.

  70. [70]

    Hanle Zheng, Yujie Wu, Lei Deng, Yifan Hu, and Guoqi Li. 2021. Going deeper with directly-trained larger spiking neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 11062–11070.

  71. [71]

    Shibo Zhou, Bo Yang, Mengwen Yuan, Runhao Jiang, Rui Yan, Gang Pan, and Huajin Tang. 2024. Enhancing SNN-based spatio-temporal learning: A benchmark dataset and cross-modality attention model. Neural Networks 180 (2024), 106677.

  72. [72]

    Rui-Jie Zhu, Yu Zhang, Steven Abreu, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Sumit Bam Shrestha, Peng Zhou, and Jason K. Eshraghian. 2024. Scalable MatMul-free Language Modeling. Transactions on Machine Learning Research (2024).

  73. [73]

    Rui-Jie Zhu, Qihang Zhao, Guoqi Li, and Jason K. Eshraghian. 2024. SpikeGPT: Generative Pre-trained Language Model with Spiking Neural Networks. Transactions on Machine Learning Research (2024).