pith. machine review for the scientific record.

arxiv: 2510.04595 · v2 · submitted 2025-10-06 · 💻 cs.NE

SpikingMamba: Towards Energy-Efficient Large Language Models via Knowledge Distillation from Mamba

Pith reviewed 2026-05-18 09:41 UTC · model grok-4.3

classification 💻 cs.NE
keywords spiking neural networks · large language models · knowledge distillation · energy efficiency · Mamba · reinforcement learning · zero-shot evaluation · edge deployment

The pith

SpikingMamba distills Mamba into a spiking neural network that runs large language models at 4.76 times lower energy with only a small accuracy gap.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates a way to make large language models far more energy efficient by converting them into spiking neural networks through knowledge distillation rather than training from scratch. It replaces dense matrix multiplications with sparse spike accumulations while using a signed-integer spiking neuron and a training-only compensation path to limit accuracy loss. A single-stage distillation step transfers the zero-shot abilities of a pretrained Mamba model, and reinforcement learning then recovers additional performance. The approach targets practical deployment on edge devices where power is limited.

Core claim

SpikingMamba integrates the SI-LIF signed-integer spiking neuron and a training-exclusive Smoothed Gradient Compensation path to enable single-stage distillation of zero-shot capabilities from a pretrained Mamba model into an SNN-based LLM. The resulting 1.3B model delivers a 4.76 times energy benefit with a 4.78 percent zero-shot accuracy gap that narrows to 2.23 percent after reinforcement learning.

What carries the argument

The SI-LIF neuron, which encodes semantic polarity through signed multi-level spikes, paired with the training-exclusive Smoothed Gradient Compensation path that offsets quantization loss while preserving fully spike-driven inference.
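
As a rough illustration of what "signed multi-level spikes" can mean, the sketch below rounds a membrane value to a signed integer, clips it to a fixed range, and then shows how a matrix-vector product with such spikes reduces to accumulating weight columns for the non-zero entries only. This is a minimal sketch under assumptions, not the paper's neuron: the function names, the `max_level` parameter, and the round-and-clip rule are hypothetical, and leak, threshold, and reset dynamics are omitted because this page does not specify them.

```python
import numpy as np


def signed_integer_spikes(membrane, max_level=4):
    """Hypothetical quantizer: map membrane values to integers in [-max_level, max_level]."""
    # Round to the nearest integer, then clip so the sign (polarity) survives.
    return np.clip(np.rint(membrane), -max_level, max_level).astype(np.int64)


def spike_driven_matvec(weights, spikes):
    """Compute weights @ spikes by visiting only the non-zero spike positions."""
    out = np.zeros(weights.shape[0], dtype=weights.dtype)
    for j in np.flatnonzero(spikes):
        # A spike of magnitude k stands for k signed accumulations of column j;
        # the multiplication here is shorthand for that repeated add or subtract.
        out += spikes[j] * weights[:, j]
    return out


if __name__ == "__main__":
    membrane = np.array([-2.6, -0.2, 0.0, 0.4, 1.7, 9.0])
    spikes = signed_integer_spikes(membrane)
    weights = np.arange(12, dtype=np.float64).reshape(2, 6)
    print(spikes)
    print(spike_driven_matvec(weights, spikes))
```

In this toy setting three of the six inputs quantize to zero, so only half of the weight columns are touched, which is the kind of sparsity the energy argument relies on.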

If this is right

  • LLM inference on power-limited edge devices becomes feasible because sparse spike activity replaces dense matrix operations.
  • Zero-shot task performance stays within a few percentage points of the original dense model after distillation and reinforcement learning.
  • The cost of developing spiking LLMs drops sharply because full pretraining from random weights is no longer required.
  • Reinforcement learning provides a practical post-distillation step to close most of the remaining accuracy gap.
  • Sparse computation opens the door to running capable language models on battery-powered hardware without major redesign.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same distillation pattern could be tested on other efficient sequence models to produce their spiking versions without starting over.
  • Hardware-specific optimizations for the SI-LIF neuron might increase the realized energy savings beyond simulation results.
  • Applying the method to models larger than 1.3B parameters would test whether the accuracy gap stays roughly constant or widens.
  • Combining the spiking approach with existing quantization or pruning methods could produce still greater efficiency gains.

Load-bearing premise

The single-stage distillation from a pretrained Mamba model using the signed spiking neuron and smoothed gradient path can transfer zero-shot capabilities to the spiking model without unrecoverable accuracy loss once the compensation path is removed at inference.

What would settle it

A hardware measurement on neuromorphic chips showing that actual energy use of the deployed SpikingMamba model falls short of the reported 4.76 times improvement, or benchmark results where zero-shot accuracy remains more than 4 percent below the dense Mamba baseline even after the reinforcement learning step.

Figures

Figures reproduced from arXiv: 2510.04595 by Bojun Cheng, Chao Wang, Jianguo Zhang, Jianxiong Tang, Luziwei Leng, Yulong Huang, Zhichao Lu, Ziyi Wang.

Figure 1: (a) Overview of the training architecture. (b) Illustration of the SpikingMamba block. (c)
Figure 2: Energy efficiency ratio (EA/ES) of SpikingMamba under various configurations (EA: energy of Mamba2, ES: energy of SpikingMamba). Colors denote model size, striped bars indicate SGC path usage, and marker shapes indicate neuron types. Detailed data and fire rate are provided in Appendix D.
Figure 3: Activation distributions in Mamba2: (a) input projection
Figure 4: Activation statistics across channels and tokens in Mamba2: (a) input projection
Figure 5: Activation distribution comparison for LIF-based models.
Figure 6: Activation distribution comparison for I-LIF (
Figure 7: Activation distribution comparison for TI-LIF (
Original abstract

Large Language Models (LLMs) have achieved remarkable performance across tasks but remain energy-intensive due to dense matrix operations. Spiking neural networks (SNNs) improve energy efficiency by replacing dense matrix multiplications with sparse accumulations. Their sparse spike activity enables efficient LLM deployment on edge devices. However, prior SNN-based LLMs often sacrifice performance for efficiency, and recovering accuracy typically requires full pretraining, which is costly and impractical. To address this, we propose SpikingMamba, an energy-efficient SNN-based LLM distilled from Mamba that improves energy efficiency with minimal accuracy sacrifice. SpikingMamba integrates two key components: (a) SI-LIF, a signed-integer spiking neuron that preserves semantic polarity through signed multi-level spike representations; and (b) a training-exclusive Smoothed Gradient Compensation (SGC) path that mitigates quantization loss while preserving spike-driven efficiency. We employ a single-stage distillation strategy to transfer the zero-shot ability of pretrained Mamba and further enhance it via reinforcement learning (RL). Experiments show that SpikingMamba-1.3B achieves a 4.76× energy benefit, with only a 4.78% zero-shot accuracy gap compared to the original Mamba. The model achieves a further 2.55% accuracy improvement after RL, narrowing the performance gap from 4.78% to 2.23%. Code is available at: https://github.com/HuuYuLong/SpikingMamba.
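
The headline numbers in the abstract can be cross-checked with simple arithmetic. The sketch below recomputes the post-RL gap and converts the energy benefit into a fraction of baseline energy. It assumes the 4.76× figure is the ratio EA/ES defined in the Figure 2 caption; the variable names are illustrative only.

```python
# Figures quoted in the abstract.
gap_after_distillation = 4.78  # zero-shot accuracy gap versus Mamba, in percent
rl_improvement = 2.55          # accuracy recovered by reinforcement learning
energy_benefit = 4.76          # assumed to be EA / ES (Mamba energy over SpikingMamba energy)

# Remaining gap after RL: 4.78 - 2.55.
gap_after_rl = round(gap_after_distillation - rl_improvement, 2)

# Under the EA / ES reading, SpikingMamba uses 1 / 4.76 of the baseline energy.
energy_fraction = 1.0 / energy_benefit
energy_saving_percent = round(100.0 * (1.0 - energy_fraction), 1)

print(gap_after_rl)           # 2.23
print(energy_saving_percent)  # 79.0
```

Under that reading, the reported benefit corresponds to roughly a 79% reduction in estimated energy, and the 2.23% figure follows directly from the two accuracy numbers.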

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes SpikingMamba, a spiking neural network adaptation of the Mamba architecture for large language models. It introduces a signed-integer leaky integrate-and-fire (SI-LIF) neuron to preserve semantic polarity via multi-level spikes and a training-exclusive Smoothed Gradient Compensation (SGC) path to mitigate quantization effects. A single-stage knowledge distillation from a pretrained Mamba teacher is used to transfer zero-shot capabilities, followed by reinforcement learning for further improvement. The central empirical claim is that the resulting 1.3B model delivers a 4.76× energy benefit while incurring only a 4.78% zero-shot accuracy gap relative to the original Mamba, which narrows to 2.23% after RL.
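
To make the distillation step more concrete, here is a minimal sketch of an output-level objective: the KL divergence between the teacher's and the student's next-token distributions, averaged over tokens. It assumes logit-level alignment, as the simulated rebuttal on this page describes; the function names and the `temperature` parameter are hypothetical, and the paper's actual loss may include other terms.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def distillation_kl(teacher_logits, student_logits, temperature=1.0):
    """Mean KL(teacher || student) over tokens; inputs have shape (tokens, vocab)."""
    t_logp = log_softmax(teacher_logits / temperature)
    s_logp = log_softmax(student_logits / temperature)
    # KL per token: sum over the vocabulary of p_teacher * (log p_teacher - log p_student).
    kl_per_token = (np.exp(t_logp) * (t_logp - s_logp)).sum(axis=-1)
    return float(kl_per_token.mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    teacher = rng.normal(size=(4, 10))  # stand-in for Mamba logits
    student = rng.normal(size=(4, 10))  # stand-in for SpikingMamba logits
    print(distillation_kl(teacher, student))
    print(distillation_kl(teacher, teacher))
```

A loss of this form constrains only the output distribution, which is why the referee's first major comment asks whether hidden-state mismatch is penalized anywhere.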

Significance. If the reported energy-accuracy tradeoff holds under rigorous verification, the work would demonstrate a practical route to energy-efficient LLMs on edge hardware by leveraging SNN sparsity within an SSM backbone, avoiding the prohibitive cost of full SNN pretraining. The open-source code release supports reproducibility and could accelerate follow-on research on hybrid continuous-discrete state-space models.

major comments (2)
  1. [Methods (SI-LIF and SGC description)] The manuscript states that SGC is disabled at inference and that the single-stage distillation objective transfers zero-shot capabilities into the SI-LIF student. However, no explicit term in the distillation loss (or ablation) is shown to penalize the representational mismatch between the teacher's continuous hidden states and the student's discrete multi-level spike encodings on the exact sequence lengths and state-update dynamics used in Mamba's zero-shot evaluation. This assumption is load-bearing for interpreting the 4.78% (then 2.23%) gap as faithful transfer rather than partial recovery.
  2. [Experiments and Results] Table reporting the 4.76× energy benefit and accuracy numbers: the energy metric scope (e.g., average spike rate, hardware model, or accumulation count) is not aligned in the text with the accuracy evaluation scope, and no error bars or statistical significance across multiple runs are provided. This weakens the claim that the observed gap is reliably small.
minor comments (2)
  1. [Abstract] The abstract claims 'only a 4.78% zero-shot accuracy gap' without naming the specific benchmarks or tasks; this should be stated explicitly in the abstract for immediate clarity.
  2. [Methods] Notation for the SI-LIF neuron parameters (threshold, leak, etc.) is introduced without a consolidated table; a single reference table would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed review. We address each major comment below, providing clarifications and committing to revisions that strengthen the manuscript without overstating our current results.

Point-by-point responses
  1. Referee: [Methods (SI-LIF and SGC description)] The manuscript states that SGC is disabled at inference and that the single-stage distillation objective transfers zero-shot capabilities into the SI-LIF student. However, no explicit term in the distillation loss (or ablation) is shown to penalize the representational mismatch between the teacher's continuous hidden states and the student's discrete multi-level spike encodings on the exact sequence lengths and state-update dynamics used in Mamba's zero-shot evaluation. This assumption is load-bearing for interpreting the 4.78% (then 2.23%) gap as faithful transfer rather than partial recovery.

    Authors: We appreciate this observation on the distillation mechanism. The single-stage objective aligns the student's output logits with the teacher's on the zero-shot evaluation tasks, while the SI-LIF neuron and training-only SGC path are intended to reduce quantization mismatch in the state updates. We acknowledge that an explicit hidden-state alignment penalty on the precise sequence lengths and dynamics is not present in the reported loss. In the revision we will expand the Methods section with a clearer derivation of how the output-level distillation combined with SGC implicitly constrains representational fidelity, and we will add a targeted ablation that varies sequence length to quantify any residual mismatch effect. revision: partial

  2. Referee: [Experiments and Results] Table reporting the 4.76× energy benefit and accuracy numbers: the energy metric scope (e.g., average spike rate, hardware model, or accumulation count) is not aligned in the text with the accuracy evaluation scope, and no error bars or statistical significance across multiple runs are provided. This weakens the claim that the observed gap is reliably small.

    Authors: We thank the referee for noting the need for explicit alignment and statistical support. The reported energy factor is obtained from the same forward passes used for accuracy measurement, using average spike rate and accumulation counts under a standard neuromorphic hardware model. To improve rigor we will revise the Experiments section to state this shared evaluation scope explicitly and include error bars together with results from at least three independent runs with different random seeds, allowing assessment of statistical significance of the accuracy gaps. revision: yes
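
The kind of estimate described in the second response can be sketched as an operation-count model: dense layers pay for multiply-accumulates, while spiking layers pay for accumulations scaled by the fire rate. This is a minimal sketch under stated assumptions, not the paper's energy model. The per-operation energies (4.6 pJ per MAC, 0.9 pJ per AC) are figures often quoted in the SNN literature for 45 nm hardware and are not taken from this page; the function names and the treatment of fire rate are hypothetical.

```python
# Assumed per-operation energies in picojoules (not from this page).
E_MAC_PJ = 4.6  # one multiply-accumulate in a dense layer
E_AC_PJ = 0.9   # one accumulate in a spike-driven layer


def dense_energy_pj(num_ops):
    """Energy of a dense layer that performs num_ops multiply-accumulates."""
    return num_ops * E_MAC_PJ


def spiking_energy_pj(num_ops, fire_rate):
    """Energy of the same layer when only a fire_rate fraction of ops is triggered."""
    return num_ops * fire_rate * E_AC_PJ


def energy_ratio(num_ops, fire_rate):
    """EA / ES for one layer under this simplified model."""
    if fire_rate <= 0:
        raise ValueError("fire_rate must be positive")
    return dense_energy_pj(num_ops) / spiking_energy_pj(num_ops, fire_rate)


if __name__ == "__main__":
    for rate in (1.0, 0.5, 0.25):
        print(rate, energy_ratio(1_000_000, rate))
```

The sketch ignores any layers that remain non-spiking and any memory-access cost, so a whole-model ratio would generally differ from the per-layer value.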

Circularity Check

0 steps flagged

No circularity: empirical distillation results stand on measured performance

The paper proposes SI-LIF neurons and a training-only SGC path, then reports empirical zero-shot accuracy and energy measurements after single-stage distillation from a pretrained Mamba model followed by RL fine-tuning. No equations, uniqueness theorems, or first-principles derivations are presented whose outputs reduce by construction to fitted parameters, self-citations, or renamed inputs. The central claims are direct experimental outcomes (4.76× energy benefit, 4.78 % then 2.23 % accuracy gap) that do not tautologically follow from the method's own definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

The paper introduces two new components (SI-LIF and SGC) whose effectiveness is demonstrated only through the reported experiments; no independent verification of these components outside the distillation pipeline is provided in the abstract.

invented entities (2)
  • SI-LIF signed-integer spiking neuron (no independent evidence)
    purpose: Preserves semantic polarity through signed multi-level spike representations
    Introduced as a core architectural change to handle both positive and negative values in spiking form.
  • Smoothed Gradient Compensation (SGC) path (no independent evidence)
    purpose: Mitigates quantization loss during training while preserving spike-driven efficiency at inference
    Training-exclusive mechanism whose benefit is claimed only within the distillation setup.


Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 12 internal anchors

  1. [1]

    Llamba: Scaling distilled recurrent models for efficient language processing. arXiv preprint arXiv:2502.14458,

    Aviv Bick, Tobias Katsch, Nimit Sohoni, Arjun Desai, and Albert Gu. Llamba: Scaling distilled recurrent models for efficient language processing. arXiv preprint arXiv:2502.14458,

  2. [2]

    Step-level value preference optimization for mathematical reasoning. arXiv preprint arXiv:2406.10858, 2024a

    Guoxin Chen, Minpeng Liao, Chengxi Li, and Kai Fan. Step-level value preference optimization for mathematical reasoning. arXiv preprint arXiv:2406.10858, 2024a. Jiaqi Chen, Yan Yang, Shizhuo Deng, Da Teng, and Liyuan Pan. Spikmamba: When snn meets mamba in event-based human action recognition. In Proceedings of the 6th ACM International Conference on Mult...

  3. [3]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457,

  4. [4]

    UltraFeedback: Boosting Language Models with Scaled AI Feedback

    Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. Ultrafeedback: Boosting language models with high-quality feedback. arXiv preprint arXiv:2310.01377,

  5. [5]

    Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

    Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality. arXiv preprint arXiv:2405.21060,

  6. [6]

    KTO: Model Alignment as Prospect Theoretic Optimization

    Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Kto: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306,

  7. [7]

    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

    Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323,

  8. [8]

    Version v0.4.0

    URL https://zenodo.org/records/10256836. Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein. Scaling up test-time compute with latent reasoning: A recurrent depth approach. arXiv preprint arXiv:2502.05171,

  9. [9]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752,

  10. [10]

    Billm: Pushing the limit of post-training quantization for llms. arXiv preprint arXiv:2402.04291,

    Wei Huang, Yangdong Liu, Haotong Qin, Ying Li, Shiming Zhang, Xianglong Liu, Michele Magno, and Xiaojuan Qi. Billm: Pushing the limit of post-training quantization for llms. arXiv preprint arXiv:2402.04291,

  11. [11]

    Mistral 7B

    URL https://arxiv.org/abs/2310.06825. Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770,

  12. [12]

    MiniMax-01: Scaling Foundation Models with Lightning Attention

    Aonian Li, Bangwei Gong, Bo Yang, Boji Shan, Chang Liu, Cheng Zhu, Chunhao Zhang, Congchao Guo, Da Chen, Dong Li, et al. Minimax-01: Scaling foundation models with lightning attention. arXiv preprint arXiv:2501.08313,

  13. [13]

    Spikemba: Multi-modal spiking saliency mamba for temporal video grounding. arXiv preprint arXiv:2404.01174,

    Wenrui Li, Xiaopeng Hong, Ruiqin Xiong, and Xiaopeng Fan. Spikemba: Multi-modal spiking saliency mamba for temporal video grounding. arXiv preprint arXiv:2404.01174,

  14. [14]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437,

  15. [15]

    Spikebert: A language spikformer learned from bert with knowledge distillation. arXiv preprint arXiv:2308.15122,

    Changze Lv, Tianlong Li, Jianhan Xu, Chenxi Gu, Zixuan Ling, Cenyuan Zhang, Xiaoqing Zheng, and Xuanjing Huang. Spikebert: A language spikformer learned from bert with knowledge distillation. arXiv preprint arXiv:2308.15122,

  16. [16]

    Pointer Sentinel Mixture Models

    Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843,

  17. [17]

    Spike-temporal latent representation for energy-efficient event-to-video reconstruction

    Jianxiong Tang, Jian-Huang Lai, Lingxiao Yang, and Xiaohua Xie. Spike-temporal latent representation for energy-efficient event-to-video reconstruction. In European Conference on Computer Vision, pp. 163–179. Springer, 2024a. Shengkun Tang, Liqun Ma, Haonan Li, Mingjie Sun, and Zhiqiang Shen. Bi-mamba: Towards accurate 1-bit state space models. arXiv prepri...

  18. [18]

    LLaMA: Open and Efficient Foundation Language Models

    URL https://huggingface.co/datasets/teknium/OpenHermes-2.5. Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971,

  19. [19]

    The mamba in the llama: Distilling and accelerating hybrid models. Advances in Neural Information Processing Systems, 37:62432–62457, 2024a

    Junxiong Wang, Daniele Paliotta, Avner May, Alexander Rush, and Tri Dao. The mamba in the llama: Distilling and accelerating hybrid models. Advances in Neural Information Processing Systems, 37:62432–62457, 2024a. Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. Openhand...

  20. [20]

    Spikellm: Scaling up spiking neural network to large language models via saliency-based spiking. arXiv preprint arXiv:2407.04752, 2024a

    Xingrun Xing, Boyan Gao, Zheng Zhang, David A Clifton, Shitao Xiao, Li Du, Guoqi Li, and Jiajun Zhang. Spikellm: Scaling up spiking neural network to large language models via saliency-based spiking. arXiv preprint arXiv:2407.04752, 2024a. Xingrun Xing, Zheng Zhang, Ziyi Ni, Shitao Xiao, Yiming Ju, Siqi Fan, Yequan Wang, Jiajun Zhang, and Guoqi Li. Spik...

  21. [21]

    Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, and Yoon Kim

    URL https://arxiv.org/abs/2502.06663. Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, and Yoon Kim. Gated linear attention transformers with hardware-efficient training. arXiv preprint arXiv:2312.06635,

  22. [22]

    HellaSwag: Can a Machine Really Finish Your Sentence?

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830,

  23. [23]

    Lolcats: On low-rank linearizing of large language models. arXiv preprint arXiv:2410.10254,

    Michael Zhang, Simran Arora, Rahul Chalamala, Alan Wu, Benjamin Spector, Aaryan Singhal, Krithik Ramesh, and Christopher Ré. Lolcats: On low-rank linearizing of large language models. arXiv preprint arXiv:2410.10254,

  24. [24]

    Spike-ssm: A sparse, precise, and efficient spiking state space model for long sequences learning. arXiv preprint arXiv:2410.17268,

    Yan Zhong, Ruoyu Zhao, Chao Wang, Qinghai Guo, Jianguo Zhang, Zhichao Lu, and Luziwei Leng. Spike-ssm: A sparse, precise, and efficient spiking state space model for long sequences learning. arXiv preprint arXiv:2410.17268,

  25. [25]

    Spikegpt: Generative pre-trained language model with spiking neural networks. arXiv preprint arXiv:2302.13939,

    Rui-Jie Zhu, Qihang Zhao, Guoqi Li, and Jason K Eshraghian. Spikegpt: Generative pre-trained language model with spiking neural networks. arXiv preprint arXiv:2302.13939,

  26. [26]

    A Mamba2 Block. The Mamba2 (Dao & Gu, 2024) architecture consists of $L$ stacked layers. At each layer, given the input $u_t \in \mathbb{R}^{D}$ at time step $t$, the processing begins with a unified input projection:
    $$u'_t = u_t W_{\text{in}} \in \mathbb{R}^{(2H \cdot P + 2N + H)}, \tag{16}$$
    $$z_t, x'_t, B'_t, C'_t, \Delta'_t = \mathrm{Split}(u'_t), \tag{17}$$
    where $W_{\text{in}} \in \mathbb{R}^{D \times (2H \cdot P + 2N + H)}$ is the input linear projection, $D$ is the model dimension, and $H$, $P$, $N$ are th...

  27. [27]

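    channels. We reshape the activation to obtain the input $x'^{(d)}_t \in \mathbb{R}^{P}$ and $\Delta'^{(d)}_t \in \mathbb{R}$ for each head $d = 1, \dots, H$. The input-dependent variables for each head are computed as:
    $$\alpha^{(d)}_t = \exp\!\left(-\Delta^{(d)}_t \exp(A^{(d)})\right) \in \mathbb{R},$$
    $$C^{(d)}_t = \sigma(\mathrm{Conv1d}(C'_t)) \in \mathbb{R}^{N},$$
    $$B^{(d)}_t = \sigma(\mathrm{Conv1d}(B'_t)) \in \mathbb{R}^{N},$$
    $$x^{(d)}_t = \sigma(\mathrm{Conv1d}(x'^{(d)}_t)) \in \mathbb{R}^{P},$$
    where $\Delta^{(d)}_t = \mathrm{Softplus}(\Delta'^{(d)}_t + \Delta^{(d)}_{\text{bias}}) \in \mathbb{R}$, and $\sigma(\cdot)$ is the SiLU ac...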

  28. [28]

    C Experiments Setup. Implementation Details. In the distillation stage, we perform supervised fine-tuning on the GenQA Chen et al

    Setting      HellaSwag  PiQA   Arc-E  Arc-C  BoolQ  WinoGrande  Avg.(%)  Diff.(%)
    Mamba2-130m  35.22      64.25  47.31  24.06  54.62  52.25       46.29    -
    y_max = 0    27.93      53.16  31.44  24.15  40.18  51.54       38.07    -8.22
    y_max = 1    25.84      53.37  27.48  23.72  37.83  52.80       36.84    -9.45
    u_max = 0    26.18      50.60  26.22  25.34  40.24  50.83       36.57    -9.72
    u_max = 1    24.50      49.56  25.93  27.47  49.27  49.80       37.76    -8.53

    These results high...

  29. [29]

    The sequence length is fixed at 2048 tokens, and the embedding layer remains frozen throughout training

    We apply a linear warm-up for the first 1% of steps, followed by cosine annealing. The sequence length is fixed at 2048 tokens, and the embedding layer remains frozen throughout training. All experiments are conducted using 8 NVIDIA A100 GPUs with BF16 precision. For the 1.3B model, distillation takes around 42 hours, and RL takes around 1 hour. Distillat...
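
The schedule described in the excerpt above (linear warm-up over the first 1% of steps, then cosine annealing) can be sketched as a plain function of the step index. This is a minimal sketch under assumptions: `peak_lr`, `min_lr`, and the step counts are hypothetical placeholders, since the excerpt does not state them, and the exact warm-up start value is a design choice made here for illustration.

```python
import math


def learning_rate(step, total_steps, peak_lr, warmup_fraction=0.01, min_lr=0.0):
    """Linear warm-up for the first warmup_fraction of steps, then cosine annealing."""
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    if step < warmup_steps:
        # Ramp linearly so the last warm-up step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the post-warm-up phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 1000  # hypothetical number of optimizer steps
    for s in (0, 9, 10, 505, 1000):
        print(s, learning_rate(s, total, peak_lr=1.0))
```

With 1000 steps the warm-up lasts 10 steps, after which the rate decays smoothly from `peak_lr` toward `min_lr`.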