arxiv: 2510.04595 · v2 · submitted 2025-10-06 · 💻 cs.NE

SpikingMamba: Towards Energy-Efficient Large Language Models via Knowledge Distillation from Mamba

Yulong Huang, Jianxiong Tang, Chao Wang, Ziyi Wang, Jianguo Zhang, Zhichao Lu, Bojun Cheng, Luziwei Leng

Pith reviewed 2026-05-18 09:41 UTC · model grok-4.3

classification 💻 cs.NE

keywords: spiking neural networks, large language models, knowledge distillation, energy efficiency, Mamba, reinforcement learning, zero-shot evaluation, edge deployment

The pith

SpikingMamba distills Mamba into a spiking neural network that runs large language models at 4.76 times lower energy with only a small accuracy gap.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates a way to make large language models far more energy efficient by converting them into spiking neural networks through knowledge distillation rather than training from scratch. It replaces dense matrix multiplications with sparse spike accumulations while using a signed-integer spiking neuron and a training-only compensation path to limit accuracy loss. A single-stage distillation step transfers the zero-shot abilities of a pretrained Mamba model, and reinforcement learning then recovers additional performance. The approach targets practical deployment on edge devices where power is limited.

Core claim

SpikingMamba integrates the SI-LIF signed-integer spiking neuron and a training-exclusive Smoothed Gradient Compensation path to enable single-stage distillation of zero-shot capabilities from a pretrained Mamba model into an SNN-based LLM. The resulting 1.3B model delivers a 4.76 times energy benefit with a 4.78 percent zero-shot accuracy gap that narrows to 2.23 percent after reinforcement learning.

What carries the argument

The SI-LIF neuron, which encodes semantic polarity through signed multi-level spikes, paired with the training-exclusive Smoothed Gradient Compensation path that offsets quantization loss while preserving fully spike-driven inference.
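
As a rough illustration of what "signed multi-level spikes" can look like, the sketch below rounds a real-valued input to the nearest integer and clips it to [-D, D]. This follows the clip-and-round form quoted for the TI-LIF variant in the Lean-theorem excerpt further down this page; whether SI-LIF uses exactly this rule is an assumption, the names `signed_integer_spikes` and `max_level` are hypothetical, and leak and reset dynamics are left out.

```python
import numpy as np

def signed_integer_spikes(x, max_level):
    """Quantize real-valued inputs to signed integer spike levels in [-max_level, max_level].

    Assumed rule (not confirmed for SI-LIF): s = Clip(Round(x), -D, D).
    """
    return np.clip(np.rint(x), -max_level, max_level).astype(np.int64)

# Hypothetical inputs, chosen only to show clipping, sign preservation, and silence.
x = np.array([-3.7, -0.4, 0.2, 0.6, 1.5, 5.0])
spikes = signed_integer_spikes(x, max_level=2)

# Fraction of positions that emit no spike; zeros need no accumulation downstream.
sparsity = float(np.mean(spikes == 0))
```

The sign of each level carries the polarity of the input, and positions that round to zero stay silent, which is where the sparsity claim comes from.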

If this is right

LLM inference on power-limited edge devices becomes feasible because sparse spike activity replaces dense matrix operations.
Zero-shot task performance stays within a few percentage points of the original dense model after distillation and reinforcement learning.
The cost of developing spiking LLMs drops sharply because full pretraining from random weights is no longer required.
Reinforcement learning provides a practical post-distillation step to close most of the remaining accuracy gap.
Sparse computation opens the door to running capable language models on battery-powered hardware without major redesign.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same distillation pattern could be tested on other efficient sequence models to produce their spiking versions without starting over.
Hardware-specific optimizations for the SI-LIF neuron might increase the realized energy savings beyond simulation results.
Applying the method to models larger than 1.3B parameters would test whether the accuracy gap stays roughly constant or widens.
Combining the spiking approach with existing quantization or pruning methods could produce still greater efficiency gains.

Load-bearing premise

The single-stage distillation from a pretrained Mamba model using the signed spiking neuron and smoothed gradient path can transfer zero-shot capabilities to the spiking model without unrecoverable accuracy loss once the compensation path is removed at inference.

What would settle it

A hardware measurement on neuromorphic chips showing that actual energy use of the deployed SpikingMamba model falls short of the reported 4.76 times improvement, or benchmark results where zero-shot accuracy remains more than 4 percent below the dense Mamba baseline even after the reinforcement learning step.

Figures

Figures reproduced from arXiv: 2510.04595 by Bojun Cheng, Chao Wang, Jianguo Zhang, Jianxiong Tang, Luziwei Leng, Yulong Huang, Zhichao Lu, Ziyi Wang.

**Figure 2.** Energy efficiency ratio (EA/ES) of SpikingMamba under various configurations (EA: energy of Mamba2, ES: energy of SpikingMamba). Colors denote model size, striped bars indicate SGC path usage, and marker shapes indicate neuron types. Detailed data and fire rate are provided in Appendix D.

**Figure 3.** Activation distributions in Mamba2: (a) input projection

**Figure 4.** Activation statistics across channels and tokens in Mamba2: (a) input projection

**Figure 5.** Activation distribution comparison for LIF-based models.

**Figure 6.** Activation distribution comparison for I-LIF (...

**Figure 7.** Activation distribution comparison for TI-LIF (...
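
The EA/ES ratio in Figure 2 can be illustrated with a back-of-the-envelope model in which a dense layer pays one multiply-accumulate per synaptic operation and a spiking layer pays one accumulate only when a spike fires. The sketch below is not the paper's energy model, which this page does not reproduce: the per-operation energies are commonly quoted 45 nm estimates used here as assumptions, and the fire rate and operation count are hypothetical.

```python
E_MAC_PJ = 4.6  # assumed energy of one 32-bit multiply-accumulate, in picojoules
E_AC_PJ = 0.9   # assumed energy of one 32-bit accumulate, in picojoules

def energy_ratio(num_ops, fire_rate):
    """Return EA / ES for a layer with num_ops synaptic operations.

    EA: dense energy, every operation is a multiply-accumulate.
    ES: spiking energy, only the fraction fire_rate of operations accumulates.
    """
    if not 0.0 < fire_rate <= 1.0:
        raise ValueError("fire_rate must be in (0, 1]")
    dense_energy = num_ops * E_MAC_PJ
    spiking_energy = num_ops * fire_rate * E_AC_PJ
    return dense_energy / spiking_energy

# Hypothetical layer: one million operations, half of the inputs spike.
ratio = energy_ratio(num_ops=1_000_000, fire_rate=0.5)
```

Under this simplified model the ratio depends only on the fire rate and the two energy constants, so a lower fire rate translates directly into a higher EA/ES.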

Original abstract

Large Language Models (LLMs) have achieved remarkable performance across tasks but remain energy-intensive due to dense matrix operations. Spiking neural networks (SNNs) improve energy efficiency by replacing dense matrix multiplications with sparse accumulations. Their sparse spike activity enables efficient LLM deployment on edge devices. However, prior SNN-based LLMs often sacrifice performance for efficiency, and recovering accuracy typically requires full pretraining, which is costly and impractical. To address this, we propose SpikingMamba, an energy-efficient SNN-based LLM distilled from Mamba that improves energy efficiency with minimal accuracy sacrifice. SpikingMamba integrates two key components: (a) SI-LIF, a signed-integer spiking neuron that preserves semantic polarity through signed multi-level spike representations; and (b) a training-exclusive Smoothed Gradient Compensation (SGC) path that mitigates quantization loss while preserving spike-driven efficiency. We employ a single-stage distillation strategy to transfer the zero-shot ability of pretrained Mamba and further enhance it via reinforcement learning (RL). Experiments show that SpikingMamba-1.3B achieves a 4.76× energy benefit, with only a 4.78% zero-shot accuracy gap compared to the original Mamba. The model achieves a further 2.55% accuracy improvement after RL, narrowing the performance gap from 4.78% to 2.23%. Code is available at: https://github.com/HuuYuLong/SpikingMamba.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

SpikingMamba shows a workable distillation path from Mamba to a 1.3B spiking model with 4.76× energy savings and small accuracy gaps, but the thin experimental details leave the inference-time fidelity open to question.

The main thing here is that they distilled a pretrained Mamba into a spiking version using a signed-integer LIF neuron and a training-only smoothed gradient path, landing at 4.76 times lower energy with a 4.78 percent zero-shot gap that RL shrinks to 2.23 percent on the 1.3B scale. The single-stage approach avoids full SNN pretraining, which is the practical win they are after. What is actually new is the SI-LIF neuron that keeps semantic polarity with multi-level signed spikes and the SGC path that only runs during training to ease quantization. These are incremental but targeted extensions of prior SNN conversion work to the Mamba state-space backbone.

The paper does well by giving specific numbers on energy and accuracy, releasing code, and showing that RL can recover some of the lost performance without extra pretraining cost.

The soft spots sit in the experimental reporting. The abstract and available details give no clear baselines, error bars, or exact method for measuring energy across hardware, so it is hard to judge whether the savings hold when the metric scope matches the accuracy tasks. The stress-test concern about residual mismatch is worth checking: because SGC is off at inference, any gap between the teacher's continuous states and the discrete spike representations has to be fully absorbed by the distillation loss. If that loss does not explicitly target the sequence lengths and dynamics Mamba actually uses, the reported accuracy numbers could reflect partial recovery rather than complete transfer. The results are empirical and do not rest on circular derivations from fitted parameters, which keeps them falsifiable.

This paper is for researchers working on neuromorphic hardware and edge deployment of LLMs. A reader focused on SNN conversions for state-space models would pick up usable ideas from the neuron design and the distillation-plus-RL recipe.

It has enough concrete claims and public code to deserve a serious referee, even if the methods section needs expansion and more ablations. I would send it out for peer review rather than desk reject.

Referee Report

2 major / 2 minor

Summary. The paper proposes SpikingMamba, a spiking neural network adaptation of the Mamba architecture for large language models. It introduces a signed-integer leaky integrate-and-fire (SI-LIF) neuron to preserve semantic polarity via multi-level spikes and a training-exclusive Smoothed Gradient Compensation (SGC) path to mitigate quantization effects. A single-stage knowledge distillation from a pretrained Mamba teacher is used to transfer zero-shot capabilities, followed by reinforcement learning for further improvement. The central empirical claim is that the resulting 1.3B model delivers a 4.76× energy benefit while incurring only a 4.78% zero-shot accuracy gap relative to the original Mamba, which narrows to 2.23% after RL.

Significance. If the reported energy-accuracy tradeoff holds under rigorous verification, the work would demonstrate a practical route to energy-efficient LLMs on edge hardware by leveraging SNN sparsity within an SSM backbone, avoiding the prohibitive cost of full SNN pretraining. The open-source code release supports reproducibility and could accelerate follow-on research on hybrid continuous-discrete state-space models.

major comments (2)

[Methods (SI-LIF and SGC description)] The manuscript states that SGC is disabled at inference and that the single-stage distillation objective transfers zero-shot capabilities into the SI-LIF student. However, no explicit term in the distillation loss (or ablation) is shown to penalize the representational mismatch between the teacher's continuous hidden states and the student's discrete multi-level spike encodings on the exact sequence lengths and state-update dynamics used in Mamba's zero-shot evaluation. This assumption is load-bearing for interpreting the 4.78% (then 2.23%) gap as faithful transfer rather than partial recovery.
[Experiments and Results] Table reporting the 4.76× energy benefit and accuracy numbers: the energy metric scope (e.g., average spike rate, hardware model, or accumulation count) is not aligned in the text with the accuracy evaluation scope, and no error bars or statistical significance across multiple runs are provided. This weakens the claim that the observed gap is reliably small.

minor comments (2)

[Abstract] The abstract claims 'only a 4.78% zero-shot accuracy gap' without naming the specific benchmarks or tasks; this should be stated explicitly in the abstract for immediate clarity.
[Methods] Notation for the SI-LIF neuron parameters (threshold, leak, etc.) is introduced without a consolidated table; a single reference table would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed review. We address each major comment below, providing clarifications and committing to revisions that strengthen the manuscript without overstating our current results.

Referee: [Methods (SI-LIF and SGC description)] The manuscript states that SGC is disabled at inference and that the single-stage distillation objective transfers zero-shot capabilities into the SI-LIF student. However, no explicit term in the distillation loss (or ablation) is shown to penalize the representational mismatch between the teacher's continuous hidden states and the student's discrete multi-level spike encodings on the exact sequence lengths and state-update dynamics used in Mamba's zero-shot evaluation. This assumption is load-bearing for interpreting the 4.78% (then 2.23%) gap as faithful transfer rather than partial recovery.

Authors: We appreciate this observation on the distillation mechanism. The single-stage objective aligns the student's output logits with the teacher's on the zero-shot evaluation tasks, while the SI-LIF neuron and training-only SGC path are intended to reduce quantization mismatch in the state updates. We acknowledge that an explicit hidden-state alignment penalty on the precise sequence lengths and dynamics is not present in the reported loss. In the revision we will expand the Methods section with a clearer derivation of how the output-level distillation combined with SGC implicitly constrains representational fidelity, and we will add a targeted ablation that varies sequence length to quantify any residual mismatch effect.

revision: partial

Referee: [Experiments and Results] Table reporting the 4.76× energy benefit and accuracy numbers: the energy metric scope (e.g., average spike rate, hardware model, or accumulation count) is not aligned in the text with the accuracy evaluation scope, and no error bars or statistical significance across multiple runs are provided. This weakens the claim that the observed gap is reliably small.

Authors: We thank the referee for noting the need for explicit alignment and statistical support. The reported energy factor is obtained from the same forward passes used for accuracy measurement, using average spike rate and accumulation counts under a standard neuromorphic hardware model. To improve rigor we will revise the Experiments section to state this shared evaluation scope explicitly and include error bars together with results from at least three independent runs with different random seeds, allowing assessment of statistical significance of the accuracy gaps.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical distillation results stand on measured performance

The paper proposes SI-LIF neurons and a training-only SGC path, then reports empirical zero-shot accuracy and energy measurements after single-stage distillation from a pretrained Mamba model followed by RL fine-tuning. No equations, uniqueness theorems, or first-principles derivations are presented whose outputs reduce by construction to fitted parameters, self-citations, or renamed inputs. The central claims are direct experimental outcomes (4.76× energy benefit, 4.78 % then 2.23 % accuracy gap) that do not tautologically follow from the method's own definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

The paper introduces two new components (SI-LIF and SGC) whose effectiveness is demonstrated only through the reported experiments; no independent verification of these components outside the distillation pipeline is provided in the abstract.

invented entities (2)

SI-LIF signed-integer spiking neuron · no independent evidence
purpose: Preserves semantic polarity through signed multi-level spike representations
Introduced as a core architectural change to handle both positive and negative values in spiking form.

Smoothed Gradient Compensation (SGC) path · no independent evidence
purpose: Mitigates quantization loss during training while preserving spike-driven efficiency at inference
Training-exclusive mechanism whose benefit is claimed only within the distillation setup.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

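TI-LIF neuron ... $s_t = \mathrm{Clip}(\mathrm{Round}(x_t), -D, D)$ ... $f_m(x_t) = D \times \tanh(x_t)$ ... $\mathcal{L}_{\mathrm{Hidden}} = \frac{1}{2T} \sum \lVert \mathrm{softmax}(y_t) - \mathrm{softmax}(y'_t) \rVert_2^2$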
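
The last expression in the excerpt above reads as a squared distance between softmax-normalized teacher and student outputs, averaged over time. The sketch below is one way to evaluate it, assuming that `1/2T` means 1/(2T), that the sum runs over the T time steps, and that `y` and `y'` are teacher and student outputs of shape (T, dim); the function names are hypothetical and the paper's full training objective is not shown on this page.

```python
import numpy as np

def softmax(y):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    shifted = y - y.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def hidden_loss(y_teacher, y_student):
    """Compute (1 / 2T) * sum_t || softmax(y_t) - softmax(y'_t) ||_2^2.

    Both inputs are arrays of shape (T, dim); T is the number of time steps.
    """
    num_steps = y_teacher.shape[0]
    diff = softmax(y_teacher) - softmax(y_student)
    return float(np.sum(diff ** 2) / (2 * num_steps))

# Hypothetical teacher and student outputs for T = 4 steps and 5 channels.
rng = np.random.default_rng(0)
y_teacher = rng.standard_normal((4, 5))
y_student = y_teacher + 0.1 * rng.standard_normal((4, 5))
loss = hidden_loss(y_teacher, y_student)
```

Because both sides are passed through a softmax first, the loss compares normalized distributions and is insensitive to a constant offset added to either output.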

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 12 internal anchors

[1] Llamba: Scaling distilled recurrent models for efficient language processing
Aviv Bick, Tobias Katsch, Nimit Sohoni, Arjun Desai, and Albert Gu. Llamba: Scaling distilled recurrent models for efficient language processing. arXiv preprint arXiv:2502.14458.

[2] Step-level value preference optimization for mathematical reasoning
Guoxin Chen, Minpeng Liao, Chengxi Li, and Kai Fan. Step-level value preference optimization for mathematical reasoning. arXiv preprint arXiv:2406.10858, 2024a. Jiaqi Chen, Yan Yang, Shizhuo Deng, Da Teng, and Liyuan Pan. Spikmamba: When snn meets mamba in event-based human action recognition. In Proceedings of the 6th ACM International Conference on Mult...

[3] Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457.

[4] UltraFeedback: Boosting Language Models with Scaled AI Feedback
Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. Ultrafeedback: Boosting language models with high-quality feedback. arXiv preprint arXiv:2310.01377.

[5] Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality. arXiv preprint arXiv:2405.21060.

[6] KTO: Model Alignment as Prospect Theoretic Optimization
Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Kto: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306.

[7] GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323.

[8] Version v0.4.0
URL https://zenodo.org/records/10256836. Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein. Scaling up test-time compute with latent reasoning: A recurrent depth approach. arXiv preprint arXiv:2502.05171.

[9] Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752.

[10] Billm: Pushing the limit of post-training quantization for llms
Wei Huang, Yangdong Liu, Haotong Qin, Ying Li, Shiming Zhang, Xianglong Liu, Michele Magno, and Xiaojuan Qi. Billm: Pushing the limit of post-training quantization for llms. arXiv preprint arXiv:2402.04291.

[11] Mistral 7B
URL https://arxiv.org/abs/2310.06825. Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770.

[12] MiniMax-01: Scaling Foundation Models with Lightning Attention
Aonian Li, Bangwei Gong, Bo Yang, Boji Shan, Chang Liu, Cheng Zhu, Chunhao Zhang, Congchao Guo, Da Chen, Dong Li, et al. Minimax-01: Scaling foundation models with lightning attention. arXiv preprint arXiv:2501.08313.

[13] Spikemba: Multi-modal spiking saliency mamba for temporal video grounding
Wenrui Li, Xiaopeng Hong, Ruiqin Xiong, and Xiaopeng Fan. Spikemba: Multi-modal spiking saliency mamba for temporal video grounding. arXiv preprint arXiv:2404.01174.

[14] DeepSeek-V3 Technical Report
Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437.

[15] Spikebert: A language spikformer learned from bert with knowledge distillation
Changze Lv, Tianlong Li, Jianhan Xu, Chenxi Gu, Zixuan Ling, Cenyuan Zhang, Xiaoqing Zheng, and Xuanjing Huang. Spikebert: A language spikformer learned from bert with knowledge distillation. arXiv preprint arXiv:2308.15122.

[16] Pointer Sentinel Mixture Models
Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.

[17] Spike-temporal latent representation for energy-efficient event-to-video reconstruction
Jianxiong Tang, Jian-Huang Lai, Lingxiao Yang, and Xiaohua Xie. Spike-temporal latent representation for energy-efficient event-to-video reconstruction. In European Conference on Computer Vision, pp. 163–179. Springer, 2024a. Shengkun Tang, Liqun Ma, Haonan Li, Mingjie Sun, and Zhiqiang Shen. Bi-mamba: Towards accurate 1-bit state space models. arXiv prepri...

[18] LLaMA: Open and Efficient Foundation Language Models
URL https://huggingface.co/datasets/teknium/OpenHermes-2.5. Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

[19] The mamba in the llama: Distilling and accelerating hybrid models
Junxiong Wang, Daniele Paliotta, Avner May, Alexander Rush, and Tri Dao. The mamba in the llama: Distilling and accelerating hybrid models. Advances in Neural Information Processing Systems, 37:62432–62457, 2024a. Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. Openhand...

[20] Spikellm: Scaling up spiking neural network to large language models via saliency-based spiking
Xingrun Xing, Boyan Gao, Zheng Zhang, David A Clifton, Shitao Xiao, Li Du, Guoqi Li, and Jiajun Zhang. Spikellm: Scaling up spiking neural network to large language models via saliency-based spiking. arXiv preprint arXiv:2407.04752, 2024a. Xingrun Xing, Zheng Zhang, Ziyi Ni, Shitao Xiao, Yiming Ju, Siqi Fan, Yequan Wang, Jiajun Zhang, and Guoqi Li. Spik...

[21] Gated linear attention transformers with hardware-efficient training
URL https://arxiv.org/abs/2502.06663. Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, and Yoon Kim. Gated linear attention transformers with hardware-efficient training. arXiv preprint arXiv:2312.06635.

[22] HellaSwag: Can a Machine Really Finish Your Sentence?
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830.

[23] Lolcats: On low-rank linearizing of large language models
Michael Zhang, Simran Arora, Rahul Chalamala, Alan Wu, Benjamin Spector, Aaryan Singhal, Krithik Ramesh, and Christopher Ré. Lolcats: On low-rank linearizing of large language models. arXiv preprint arXiv:2410.10254.

[24] Spike-ssm: A sparse, precise, and efficient spiking state space model for long sequences learning
Yan Zhong, Ruoyu Zhao, Chao Wang, Qinghai Guo, Jianguo Zhang, Zhichao Lu, and Luziwei Leng. Spike-ssm: A sparse, precise, and efficient spiking state space model for long sequences learning. arXiv preprint arXiv:2410.17268.

[25] Spikegpt: Generative pre-trained language model with spiking neural networks
Rui-Jie Zhu, Qihang Zhao, Guoqi Li, and Jason K Eshraghian. Spikegpt: Generative pre-trained language model with spiking neural networks. arXiv preprint arXiv:2302.13939.

[26]

A Mamba2 Block

The Mamba2 (Dao & Gu, 2024) architecture consists of $L$ stacked layers. At each layer, given the input $u_t \in \mathbb{R}^{D}$ at time step $t$, the processing begins with a unified input projection:

$$u'_t = u_t W_{\mathrm{in}} \in \mathbb{R}^{2H \cdot P + 2N + H}, \tag{16}$$

$$z_t, x'_t, B'_t, C'_t, \Delta'_t = \mathrm{Split}(u'_t), \tag{17}$$

where $W_{\mathrm{in}} \in \mathbb{R}^{D \times (2H \cdot P + 2N + H)}$ is the input linear projection, $D$ is the model dimension, and $H, P, N$ are th...
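
The projection-and-split step in equations (16) and (17) can be sketched with NumPy. The excerpt only gives the total width 2H·P + 2N + H; the per-output sizes used below (H·P for `z`, H·P for `x'`, N each for `B'` and `C'`, and H for `Δ'`) are an assumption consistent with that total, and the toy dimensions and the name `split_projection` are hypothetical.

```python
import numpy as np

def split_projection(u_proj, H, P, N):
    """Split a projected vector of width 2*H*P + 2*N + H into (z, x', B', C', Delta')."""
    sizes = [H * P, H * P, N, N, H]  # assumed widths of z, x', B', C', Delta'
    if u_proj.shape[-1] != sum(sizes):
        raise ValueError("projection width does not match 2*H*P + 2*N + H")
    boundaries = np.cumsum(sizes)[:-1]  # split points between consecutive pieces
    z, x, B, C, delta = np.split(u_proj, boundaries, axis=-1)
    return z, x, B, C, delta

# Toy dimensions, far smaller than any real model.
D, H, P, N = 8, 2, 4, 3
rng = np.random.default_rng(0)
u_t = rng.standard_normal(D)                             # input at one time step
W_in = rng.standard_normal((D, 2 * H * P + 2 * N + H))   # input linear projection
z, x, B, C, delta = split_projection(u_t @ W_in, H, P, N)
```

A single fused projection followed by a split keeps the per-step work to one matrix product, which is presumably why the block is described as a "unified" input projection.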
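[27]

channels. We reshape the activation to obtain the input $x_t^{\prime (d)} \in \mathbb{R}^{P}$ and $\Delta_t^{\prime (d)} \in \mathbb{R}$ for each head $d = 1, \ldots, H$. The input-dependent variables for each head are computed as:

$$\alpha_t^{(d)} = \exp\left(-\Delta_t^{(d)} \exp(A^{(d)})\right) \in \mathbb{R},$$

$$C_t^{(d)} = \sigma(\mathrm{Conv1d}(C'_t)) \in \mathbb{R}^{N},$$

$$B_t^{(d)} = \sigma(\mathrm{Conv1d}(B'_t)) \in \mathbb{R}^{N},$$

$$x_t^{(d)} = \sigma(\mathrm{Conv1d}(x_t^{\prime (d)})) \in \mathbb{R}^{P},$$

where $\Delta_t^{(d)} = \mathrm{Softplus}(\Delta_t^{\prime (d)} + \Delta_{\mathrm{bias}}^{(d)}) \in \mathbb{R}$, and $\sigma(\cdot)$ is the SiLU ac...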
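
The per-head decay factor defined above can be checked numerically. The sketch below evaluates only the two formulas for Δ and α; the input values are hypothetical, and the convolution and SiLU steps for `B`, `C`, and `x` are left out.

```python
import numpy as np

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return np.logaddexp(0.0, x)

def head_decay(delta_raw, delta_bias, A):
    """Return alpha = exp(-Delta * exp(A)) with Delta = Softplus(delta_raw + delta_bias).

    All arguments hold one scalar per head.
    """
    delta = softplus(delta_raw + delta_bias)  # step size, strictly positive
    return np.exp(-delta * np.exp(A))         # decay factor per head

# Hypothetical values for three heads.
alpha = head_decay(
    delta_raw=np.array([-2.0, 0.0, 3.0]),
    delta_bias=np.array([0.1, 0.1, 0.1]),
    A=np.array([0.0, -1.0, 0.5]),
)
```

Since Softplus is strictly positive and exp(A) is positive, the exponent is always negative, so each α lies strictly between 0 and 1 and acts as a forgetting factor on the recurrent state.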
[28]

C Experiments Setup

Implement Details. In the distillation stage, we perform supervised fine-tuning on the GenQA Chen et al

| Model | HellaSwag | PiQA | Arc-E | Arc-C | BoolQ | WinoGrande | Avg. (%) | Diff. (%) |
|---|---|---|---|---|---|---|---|---|
| Mamba2-130m | 35.22 | 64.25 | 47.31 | 24.06 | 54.62 | 52.25 | 46.29 | - |
| ymax = 0 | 27.93 | 53.16 | 31.44 | 24.15 | 40.18 | 51.54 | 38.07 | -8.22 |
| ymax = 1 | 25.84 | 53.37 | 27.48 | 23.72 | 37.83 | 52.80 | 36.84 | -9.45 |
| umax = 0 | 26.18 | 50.60 | 26.22 | 25.34 | 40.24 | 50.83 | 36.57 | -9.72 |
| umax = 1 | 24.50 | 49.56 | 25.93 | 27.47 | 49.27 | 49.80 | 37.76 | -8.53 |

These results high...
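
The Avg. and Diff. columns of the ablation rows above can be recomputed from the six per-task scores. The sketch assumes that Avg. is the unweighted mean of the six tasks rounded half-up to two decimals and that Diff. is each row's average minus the Mamba2-130m average; both conventions are inferred from the numbers rather than stated in the excerpt.

```python
from decimal import Decimal, ROUND_HALF_UP

# Per-task scores copied from the rows above, in the order
# HellaSwag, PiQA, Arc-E, Arc-C, BoolQ, WinoGrande.
TASK_SCORES = {
    "Mamba2-130m": ["35.22", "64.25", "47.31", "24.06", "54.62", "52.25"],
    "ymax = 0": ["27.93", "53.16", "31.44", "24.15", "40.18", "51.54"],
    "ymax = 1": ["25.84", "53.37", "27.48", "23.72", "37.83", "52.80"],
    "umax = 0": ["26.18", "50.60", "26.22", "25.34", "40.24", "50.83"],
    "umax = 1": ["24.50", "49.56", "25.93", "27.47", "49.27", "49.80"],
}

def average(scores):
    """Unweighted mean, rounded half-up to two decimals (exact decimal arithmetic)."""
    total = sum(Decimal(s) for s in scores)
    return (total / len(scores)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

averages = {name: average(scores) for name, scores in TASK_SCORES.items()}
baseline = averages["Mamba2-130m"]
diffs = {name: avg - baseline for name, avg in averages.items() if name != "Mamba2-130m"}

for name, avg in averages.items():
    print(name, avg, diffs.get(name, "-"))
```

Decimal arithmetic is used instead of floats because two of the averages fall exactly on a rounding boundary, where binary floating point could round either way.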
[29]

We apply a linear warm-up for the first 1% of steps, followed by cosine annealing. The sequence length is fixed at 2048 tokens, and the embedding layer remains frozen throughout training. All experiments are conducted using 8 NVIDIA A100 GPUs with BF16 precision. For the 1.3B model, distillation takes around 42 hours, and RL takes around 1 hour. Distillat...
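
The schedule described above, a linear warm-up over the first 1% of steps followed by cosine annealing, can be written as a small function. The peak and minimum learning rates and the total step count below are hypothetical, since the excerpt does not give them, and the exact warm-up start value used by the authors is also an assumption.

```python
import math

def learning_rate(step, total_steps, peak_lr, min_lr=0.0, warmup_fraction=0.01):
    """Linear warm-up for the first warmup_fraction of steps, then cosine annealing."""
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    if step < warmup_steps:
        # Ramp linearly so that the last warm-up step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the annealing phase completed, from 0 to 1.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Hypothetical run: 1000 steps with a peak learning rate of 1e-3.
schedule = [learning_rate(s, total_steps=1000, peak_lr=1e-3) for s in range(1000)]
```

With 1000 steps the warm-up covers the first 10 steps, after which the rate decays smoothly toward `min_lr`.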