pith. the verified trust layer for science.

arxiv: 2601.00679 · v2 · submitted 2026-01-02 · 💻 cs.NE · cs.AI · cs.LG

QSLM: A Performance- and Memory-aware Quantization Framework with Tiered Search Strategy for Spike-driven Language Models

Pith reviewed 2026-05-16 18:10 UTC · model grok-4.3

keywords quantization framework · spike-driven language models · SLMs · memory compression · tiered search · embedded AI · performance trade-off · neural network compression

The pith

QSLM automates quantization for spike-driven language models to reduce memory by up to 86.5% with minimal performance loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes QSLM as a framework to automatically quantize pre-trained spike-driven language models. It identifies the network's architectural hierarchy and layer sensitivities to quantization. Then it applies a tiered strategy at global, block, and module levels guided by a multi-objective trade-off function for performance and memory. This results in significant reductions in memory footprint and power consumption while keeping accuracy close to the original model on tasks like sentiment classification and text generation. The approach aims to make SLMs suitable for embedded devices without manual tuning for each model.

Core claim

QSLM identifies the hierarchy of the given network architecture and the sensitivity of network layers under quantization, then employs a tiered quantization strategy at global, block, and module levels while leveraging a multi-objective performance-and-memory trade-off function to select the final quantization setting, thereby reducing memory footprint by up to 86.5% and power consumption by up to 20% with high performance maintained across tasks.
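The sensitivity step can be illustrated with the definition given in the simulated rebuttal further down this page: the relative accuracy degradation when a single layer is quantized to 4 bits while all others stay at full precision. The Python sketch below is a minimal reading of that probe, not the authors' implementation. The function names, the 32-bit full-precision default, the layer names, and the toy evaluator with its penalty values are all hypothetical.

```python
from typing import Callable, Dict, List

Config = Dict[str, int]  # layer name -> bit-width


def layer_sensitivity(
    layers: List[str],
    evaluate: Callable[[Config], float],
    full_bits: int = 32,   # assumed full-precision width
    probe_bits: int = 4,   # probe width named in the rebuttal
) -> Dict[str, float]:
    """Relative accuracy drop when exactly one layer is quantized."""
    baseline_cfg = {name: full_bits for name in layers}
    baseline_acc = evaluate(baseline_cfg)
    scores = {}
    for name in layers:
        cfg = dict(baseline_cfg)
        cfg[name] = probe_bits  # quantize only this layer
        scores[name] = (baseline_acc - evaluate(cfg)) / baseline_acc
    return scores


# Hypothetical stand-in for a real model evaluation: each quantized
# layer subtracts a fixed, made-up accuracy penalty.
TOY_PENALTY = {
    "embedding": 0.01,
    "block0.attn": 0.08,
    "block0.ffn": 0.02,
    "head": 0.04,
}


def toy_evaluate(cfg: Config) -> float:
    acc = 0.85
    for name, bits in cfg.items():
        if bits < 32:
            acc -= TOY_PENALTY[name]
    return acc


scores = layer_sensitivity(list(TOY_PENALTY), toy_evaluate)
# Most sensitive layer first.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In a real setting `evaluate` would run the quantized model on a validation split, so the probe costs one evaluation per layer plus one for the baseline.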

What carries the argument

The tiered quantization strategy combined with the multi-objective performance-and-memory trade-off function that selects quantization settings at global, block, and module levels after assessing layer sensitivities.
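To make the three granularities concrete, the sketch below enumerates candidate settings at the global, block, and module levels for a toy two-block network. It only shows what each tier varies. The paper's actual search order, pruning rules, and bit-width choices are not given on this page, so the function name, the 16-bit and 4-bit defaults, and the module names are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Config = Dict[str, int]  # module name -> bit-width


def tiered_candidates(
    blocks: Dict[str, List[str]],
    base_bits: int = 16,  # hypothetical starting precision
    low_bits: int = 4,    # hypothetical reduced precision
) -> List[Tuple[str, Config]]:
    """Enumerate candidate settings at three granularities."""
    modules = [m for mods in blocks.values() for m in mods]
    out: List[Tuple[str, Config]] = []

    # Tier 1 (global): one bit-width for the whole network.
    out.append(("global", {m: low_bits for m in modules}))

    # Tier 2 (block): lower the precision of one block at a time.
    for block, mods in blocks.items():
        cfg = {m: base_bits for m in modules}
        for m in mods:
            cfg[m] = low_bits
        out.append((f"block:{block}", cfg))

    # Tier 3 (module): lower the precision of one module at a time.
    for m in modules:
        cfg = {x: base_bits for x in modules}
        cfg[m] = low_bits
        out.append((f"module:{m}", cfg))

    return out


toy_blocks = {
    "block0": ["block0.attn", "block0.ffn"],
    "block1": ["block1.attn", "block1.ffn"],
}
candidates = tiered_candidates(toy_blocks)
for label, cfg in candidates:
    print(label, cfg)
```

The coarse tier needs a single evaluation, while the finer tiers grow with the number of blocks and modules, which is the usual motivation for trying coarse settings first.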

If this is right

  • SLMs become feasible for deployment on low-cost embedded devices due to reduced memory needs.
  • Power consumption decreases, lowering energy requirements for inference.
  • Task performance stays close to the non-quantized model: 84.4% accuracy on SST-2 and a perplexity of 23.2 on WikiText-2.
  • The automated method scales across different SLM networks without per-model manual design effort.
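The memory figures above follow from simple arithmetic on per-layer bit-widths. The sketch below computes the weight-memory reduction of a mixed-precision setting relative to a baseline. The 32-bit baseline, the three-way parameter split (which merely sums to 216M), and the chosen bit-widths are assumptions for illustration, since this page does not state the baseline precision or the per-layer assignments.

```python
from typing import Dict


def memory_bits(param_counts: Dict[str, int], bit_widths: Dict[str, int]) -> int:
    """Total weight storage in bits for a per-layer bit-width assignment."""
    return sum(param_counts[name] * bit_widths[name] for name in param_counts)


def memory_reduction(
    param_counts: Dict[str, int],
    bit_widths: Dict[str, int],
    baseline_bits: int = 32,  # assumed full-precision width
) -> float:
    """Fraction of weight memory saved relative to the baseline."""
    baseline = sum(param_counts.values()) * baseline_bits
    return 1 - memory_bits(param_counts, bit_widths) / baseline


# Hypothetical parameter split; not taken from the paper.
toy_params = {
    "embedding": 50_000_000,
    "blocks": 150_000_000,
    "head": 16_000_000,
}

mixed = {"embedding": 8, "blocks": 4, "head": 8}
uniform4 = {name: 4 for name in toy_params}

mixed_reduction = memory_reduction(toy_params, mixed)
uniform4_reduction = memory_reduction(toy_params, uniform4)
print(f"mixed 8/4/8: {mixed_reduction:.1%}")
print(f"uniform 4-bit: {uniform4_reduction:.1%}")
```

Under the 32-bit baseline assumption, a uniform 4-bit setting saves 1 − 4/32 = 87.5% of weight memory, so a reduction near the reported 86.5% would imply that most weights sit at about 4 bits. A different baseline precision would change that reading.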

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The framework may be adaptable to quantizing other types of spiking neural networks for efficiency gains.
  • Integration with spiking hardware could amplify the power savings beyond the reported 20%.
  • Testing on a wider range of tasks could reveal the limits of the tiered search's generalization.

Load-bearing premise

The tiered search combined with the multi-objective trade-off function will reliably locate quantization settings that generalize across SLM architectures and tasks without requiring per-network manual intervention or exhaustive exploration.

What would settle it

Running QSLM on a previously unseen SLM architecture and observing that no quantization setting meets both the specified performance threshold and memory budget would indicate the search strategy fails to generalize.

Figures

Figures reproduced from arXiv: 2601.00679 by Muhammad Shafique, Pasindu Wickramasinghe, Rachmad Vidya Wicaksana Putra.

Figure 1: Current trends of performance (i.e., accuracy), number of weight param…
Figure 2: Performance profiles of the pre-trained SpikeGPT-216M after uni…
Figure 3: Overview of our novel contributions.
Figure 4: Overview of the SpikeGPT architecture. B is the number of attention blocks. For instance, the pre-trained SpikeGPT-216M has B=18 blocks [18].
Figure 5: Our QSLM framework showing its key steps: network model analysis, tiered search strategy for quantization, and quantization setting selection.
Figure 6: Proportion of the memory footprint for (a) the SpikeGPT-216M model…
Figure 7: Results of block-wise quantization in SpikeGPT-216M across different…
Figure 8: Experimental setup for the evaluation.
Figure 9: Experimental results of (a) sentiment classification task on the SST-2 for different sets of constraints (a1-a3) and diverse α (a4); and (b) text generation task on the WikiText-2 for different sets of constraints (b1-b3) and diverse α (b4).
Figure 10: Experimental results of power consumption incurred by the baseline model and our QSLM model candidates that meet both constraints for…
Original abstract

Large Language Models (LLMs) have been emerging as prominent AI models for solving many natural language tasks due to their high performance (e.g., accuracy) and capabilities in generating high-quality responses to the given inputs. However, their large computational cost, huge memory footprints, and high processing power/energy make it challenging for their embedded deployments. Amid several tinyLLMs, recent works have proposed spike-driven language models (SLMs) for significantly reducing the processing power/energy of LLMs. However, their memory footprints still remain too large for low-cost and resource-constrained embedded devices. Manual quantization approach may effectively compress SLM memory footprints, but it requires a huge design time and compute power to find the quantization setting for each network, hence making this approach not-scalable for handling different networks, performance requirements, and memory budgets. To bridge this gap, we propose QSLM, a novel framework that performs automated quantization for compressing pre-trained SLMs, while meeting the performance and memory constraints. To achieve this, QSLM first identifies the hierarchy of the given network architecture and the sensitivity of network layers under quantization, then employs a tiered quantization strategy (e.g., global-, block-, and module-level quantization) while leveraging a multi-objective performance-and-memory trade-off function to select the final quantization setting. Experimental results indicate that our QSLM reduces memory footprint by up to 86.5%, reduces power consumption by up to 20%, maintains high performance across different tasks (i.e., by up to 84.4% accuracy of sentiment classification on the SST-2 dataset and perplexity score of 23.2 for text generation on the WikiText-2 dataset) close to the original non-quantized model while meeting the performance and memory constraints.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces QSLM, a framework for automated quantization of pre-trained spike-driven language models (SLMs). It first identifies the network hierarchy and per-layer quantization sensitivity, then applies a tiered search strategy operating at global, block, and module levels together with a multi-objective performance-and-memory trade-off function to select bit-width configurations. The central empirical claim is that this procedure yields up to 86.5% memory-footprint reduction and 20% power reduction while retaining 84.4% accuracy on SST-2 sentiment classification and 23.2 perplexity on WikiText-2 text generation, close to the original non-quantized model and satisfying explicit performance/memory constraints.

Significance. If the reported gains are shown to be robust across SLM architectures, reproducible, and superior to standard automated quantizers, the work would be significant for embedded deployment of spike-driven models. It directly targets the scalability bottleneck of manual per-network quantization by offering a hierarchy-aware automated alternative, which could shorten design cycles for resource-constrained devices.

major comments (3)
  1. [Abstract] Abstract: the headline metrics (86.5% memory reduction, 20% power reduction, 84.4% SST-2 accuracy, 23.2 WikiText-2 perplexity) are stated without the corresponding baseline values for the non-quantized model, the exact per-layer bit-widths chosen, the number of experimental runs, or error bars, rendering the central claim that performance is “maintained close to the original model” impossible to evaluate.
  2. [Methodology] Methodology section: the multi-objective trade-off function is described only at a high level; its explicit mathematical form, the procedure for setting or calibrating its weights, and the sensitivity metric used to rank layers are not supplied, so the automation and generalization claims cannot be reproduced or stress-tested.
  3. [Experimental results] Experimental results: no comparison is provided against exhaustive per-layer search, random search, or prior automated quantizers (e.g., HAQ or DNAS) on the same SLM backbones; without these controls it is unclear whether the tiered strategy reliably locates generalizable settings or merely reflects favorable model-task pairs.
minor comments (2)
  1. [Abstract] Abstract: the phrasing “by up to 84.4% accuracy” is ambiguous; it should state whether this is absolute accuracy or retention relative to the baseline.
  2. [Experimental results] The manuscript should include a table listing the SLM architectures evaluated, their original parameter counts, and the final bit-width assignments produced by QSLM for each constraint setting.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We have revised the manuscript to improve clarity in the abstract, provide explicit methodological details, and strengthen the experimental evaluation. Point-by-point responses follow.

  1. Referee: [Abstract] Abstract: the headline metrics (86.5% memory reduction, 20% power reduction, 84.4% SST-2 accuracy, 23.2 WikiText-2 perplexity) are stated without the corresponding baseline values for the non-quantized model, the exact per-layer bit-widths chosen, the number of experimental runs, or error bars, rendering the central claim that performance is “maintained close to the original model” impossible to evaluate.

    Authors: We agree that the abstract would benefit from explicit baseline values and additional experimental details. In the revised manuscript, the abstract now states the full-precision baselines (85.2% SST-2 accuracy and 22.9 WikiText-2 perplexity) and notes that all reported results are averaged over 5 runs with standard deviations provided in the main experimental tables. The selected per-layer bit-width configurations are summarized in the new Table 2.

    revision: yes

  2. Referee: [Methodology] Methodology section: the multi-objective trade-off function is described only at a high level; its explicit mathematical form, the procedure for setting or calibrating its weights, and the sensitivity metric used to rank layers are not supplied, so the automation and generalization claims cannot be reproduced or stress-tested.

    Authors: We acknowledge the need for greater mathematical precision. The revised Section 3.2 now includes the explicit objective function F = α·(1−Acc_norm) + β·Mem_norm, where α and β are calibrated via grid search on a validation split to satisfy user-specified accuracy and memory constraints. The layer sensitivity metric is defined in Equation (2) as the relative accuracy degradation when a single layer is quantized to 4 bits while all others remain at full precision. These additions make the procedure fully reproducible.

    revision: yes

  3. Referee: [Experimental results] Experimental results: no comparison is provided against exhaustive per-layer search, random search, or prior automated quantizers (e.g., HAQ or DNAS) on the same SLM backbones; without these controls it is unclear whether the tiered strategy reliably locates generalizable settings or merely reflects favorable model-task pairs.

    Authors: We agree that direct comparisons strengthen the claims. The revised experimental section now includes results against random search and the HAQ framework on the same SLM backbones, demonstrating that the tiered strategy yields superior accuracy-memory trade-offs. Exhaustive per-layer search remains computationally prohibitive for the evaluated models; we have added a brief complexity analysis in the discussion to justify this omission. We also clarify why DNAS is not directly applicable to spike-driven architectures.

    revision: partial
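The objective quoted in the second response, F = α·(1−Acc_norm) + β·Mem_norm, can be sketched as a scoring and selection step. This is a hedged illustration rather than the paper's code: normalizing accuracy and memory by the non-quantized baseline is an assumption, lower F is taken to be better, and the candidate names, accuracies, memory sizes, constraints, and weights below are invented toy values.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    name: str
    accuracy: float   # task accuracy in [0, 1]
    memory_mb: float  # weight memory footprint


def tradeoff_score(c: Candidate, base_acc: float, base_mem: float,
                   alpha: float, beta: float) -> float:
    """F = alpha * (1 - Acc_norm) + beta * Mem_norm; lower is better."""
    acc_norm = c.accuracy / base_acc   # assumed normalization
    mem_norm = c.memory_mb / base_mem  # assumed normalization
    return alpha * (1 - acc_norm) + beta * mem_norm


def select(cands: List[Candidate], base_acc: float, base_mem: float,
           min_acc: float, max_mem: float,
           alpha: float = 0.5, beta: float = 0.5) -> Optional[Candidate]:
    """Keep candidates meeting both constraints, then minimize F."""
    feasible = [c for c in cands
                if c.accuracy >= min_acc and c.memory_mb <= max_mem]
    if not feasible:
        return None  # no setting satisfies the constraints
    return min(feasible,
               key=lambda c: tradeoff_score(c, base_acc, base_mem, alpha, beta))


# Toy candidates; all numbers are hypothetical.
toy = [
    Candidate("A", accuracy=0.850, memory_mb=400.0),
    Candidate("B", accuracy=0.844, memory_mb=120.0),
    Candidate("C", accuracy=0.700, memory_mb=60.0),
    Candidate("D", accuracy=0.830, memory_mb=150.0),
]

best = select(toy, base_acc=0.852, base_mem=864.0, min_acc=0.80, max_mem=200.0)
none_fit = select(toy, base_acc=0.852, base_mem=864.0, min_acc=0.80, max_mem=10.0)
print(best, none_fit)
```

Returning `None` when no candidate is feasible mirrors the falsification test described under "What would settle it": a search that finds no setting within both the performance threshold and the memory budget.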

Circularity Check

0 steps flagged

No circularity detected; framework description and empirical results are independent of fitted inputs


The paper presents QSLM as an automated quantization framework that first identifies network hierarchy and layer sensitivity, then applies a tiered (global/block/module) search strategy guided by a multi-objective performance-and-memory trade-off function. The reported gains (up to 86.5% memory reduction, 20% power reduction, 84.4% SST-2 accuracy, 23.2 WikiText-2 perplexity) are stated as direct experimental outcomes of running this procedure on the target SLMs. No equations, derivations, or self-referential definitions appear in the provided text that would reduce these quantities to parameters fitted inside the same paper. No self-citations are invoked to establish uniqueness theorems or to smuggle in ansatzes. The search strategy and trade-off function are described as external to the final measured numbers, making the derivation chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard quantization assumptions plus two practical free parameters: the weights inside the multi-objective trade-off function and the precise bit-width choices returned by the search. No new physical entities are postulated.

free parameters (2)
  • trade-off function weights
    Weights balancing accuracy versus memory size in the multi-objective function; chosen to meet stated constraints.
  • per-layer bit-widths
    Discrete precision levels selected by the tiered search; not fixed in advance.
axioms (2)
  • domain assumption: Layer sensitivity to quantization can be reliably estimated from a small number of forward passes or gradient statistics.
    Invoked when the framework first identifies sensitive layers before the tiered search.
  • domain assumption: Uniform or standard post-training quantization preserves enough accuracy when applied hierarchically.
    Underlying assumption of the global-block-module quantization stages.
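The second axiom refers to uniform post-training quantization. As a point of reference, the sketch below shows one common form, symmetric uniform quantization with a single per-tensor scale, applied to a small list of weights and immediately dequantized so the rounding error can be inspected. This page does not say which quantizer QSLM uses, so the scheme, the function name, and the sample weights are assumptions.

```python
from typing import List


def quantize_uniform(weights: List[float], bits: int) -> List[float]:
    """Symmetric uniform quantization followed by dequantization."""
    if bits < 2:
        raise ValueError("need at least 2 bits for a signed grid")
    qmax = 2 ** (bits - 1) - 1            # e.g. 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return [0.0] * len(weights)
    scale = max_abs / qmax                 # step size of the grid
    out = []
    for w in weights:
        q = round(w / scale)               # nearest integer level
        q = max(-qmax, min(qmax, q))       # clamp to the signed range
        out.append(q * scale)              # map back to real values
    return out


def max_error(a: List[float], b: List[float]) -> float:
    return max(abs(x - y) for x, y in zip(a, b))


toy_weights = [-1.0, -0.5, 0.0, 0.26, 1.0]  # hypothetical values
w4 = quantize_uniform(toy_weights, bits=4)
w8 = quantize_uniform(toy_weights, bits=8)
err4 = max_error(toy_weights, w4)
err8 = max_error(toy_weights, w8)
print(f"4-bit max error: {err4:.4f}")
print(f"8-bit max error: {err8:.4f}")
```

The rounding error of each weight is bounded by half the step size, which is why layers that tolerate a coarse grid can be pushed to lower bit-widths while sensitive layers keep more bits.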



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Focus Session: Hardware and Software Techniques for Accelerating Multimodal Foundation Models

    cs.LG 2026-04 unverdicted novelty 2.0

    The paper compiles hardware-software co-design techniques including mixed-precision quantization, structural pruning, speculative decoding, and transformer accelerators to speed up multimodal foundation models, with e...

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · cited by 1 Pith paper · 4 internal anchors

  [1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems (NIPS), vol. 30, no. 1, pp. 261–272, 2017.

  [2] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong et al., “A survey of large language models,” arXiv preprint arXiv:2303.18223, 2023.

  [3] S. Minaee, T. Mikolov, N. Nikzad, M. Chenaghlu, R. Socher, X. Amatriain, and J. Gao, “Large language models: A survey,” arXiv preprint arXiv:2402.06196, 2024.

  [4] Y. Chang, X. Wang, J. Wang, Y. Wu, L. Yang, K. Zhu, H. Chen, X. Yi, C. Wang, Y. Wang, W. Ye, Y. Zhang, Y. Chang, P. S. Yu, Q. Yang, and X. Xie, “A survey on evaluation of large language models,” ACM Transactions on Intelligent Systems and Technology, vol. 15, no. 3, Mar. 2024.

  [5] K. Han, Y. Wang, H. Chen, X. Chen, J. Guo, Z. Liu, Y. Tang, A. Xiao, C. Xu, Y. Xu et al., “A survey on vision transformer,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 45, no. 1, pp. 87–110, 2022.

  [6] S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, and M. Shah, “Transformers in vision: A survey,” ACM Computing Surveys (CSUR), vol. 54, no. 10s, pp. 1–41, 2022.

  [7] R. V. W. Putra, S. Iftikhar, and M. Shafique, “Qsvit: A methodology for quantizing spiking vision transformers,” in 2025 International Joint Conference on Neural Networks (IJCNN), 2025, pp. 1–8.

  [8] A. El Mir, L. T. Luoga, B. Chen, M. A. Hanif, and M. Shafique, “Democratizing mllms in healthcare: Tinyllava-med for efficient healthcare diagnostics in resource-constrained settings,” in 2024 IEEE International Conference on Image Processing Challenges and Workshops (ICIPCW). IEEE, 2024, pp. 4164–4170.

  [9] C. Bartolozzi, G. Indiveri, and E. Donati, “Embodied neuromorphic intelligence,” Nature communications, vol. 13, no. 1, p. 1024, 2022.

  [10] R. V. W. Putra, P. Wickramasinghe, and M. Shafique, “Enabling efficient processing of spiking neural networks with on-chip learning on commodity neuromorphic processors for edge ai systems,” in 2025 International Joint Conference on Neural Networks (IJCNN), 2025, pp. 1–8.

  [11] R. V. W. Putra and M. Shafique, “Fspinn: An optimization framework for memory-efficient and energy-efficient spiking neural networks,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), vol. 39, no. 11, pp. 3601–3613, 2020.

  [12] N. Rathi, P. Panda, and K. Roy, “Stdp-based pruning of connections and weight quantization in spiking neural networks for energy-efficient recognition,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), vol. 38, no. 4, pp. 668–677, April 2019.

  [13] R. V. W. Putra and M. Shafique, “Q-spinn: A framework for quantizing spiking neural networks,” in International Joint Conference on Neural Networks (IJCNN), 2021, pp. 1–8.

  [14] X. Xing, Z. Zhang, Z. Ni, S. Xiao, Y. Ju, S. Fan, Y. Wang, J. Zhang, and G. Li, “Spikelm: Towards general spike-driven language modeling via elastic bi-spiking mechanisms,” in International Conference on Machine Learning (ICML). PMLR, 2024, pp. 54698–54714.

  [15] M. Bal and A. Sengupta, “Spikingbert: Distilling bert to train spiking language models using implicit differentiation,” in AAAI Conference on Artificial Intelligence (AAAI), vol. 38, no. 10, 2024, pp. 10998–11006.

  [16] Q. Su, S. Mei, X. Xing, M. Yao, J. Zhang, B. Xu, and G. Li, “Snn-bert: Training-efficient spiking neural networks for energy-efficient bert,” Neural Networks, vol. 180, p. 106630, 2024.

  [17] X. Xing, B. Gao, Z. Liu, D. A. Clifton, S. Xiao, W. Zhang, L. Du, Z. Zhang, G. Li, and J. Zhang, “Spikellm: Scaling up spiking neural network to large language models via saliency-based spiking,” in The 13th International Conference on Learning Representations (ICLR), 2024.

  [18] R.-J. Zhu, Q. Zhao, G. Li, and J. Eshraghian, “SpikeGPT: Generative pre-trained language model with spiking neural networks,” Transactions on Machine Learning Research (TMLR), 2024.

  [19] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, “GLUE: A multi-task benchmark and analysis platform for natural language understanding,” in International Conference on Learning Representations (ICLR), 2019.

  [20] C. Lv, T. Li, J. Xu, C. Gu, Z. Ling, C. Zhang, X. Zheng, and X. Huang, “Spikebert: A language spikformer learned from bert with knowledge distillation,” arXiv preprint arXiv:2308.15122, 2024.

  [21] S. Merity, C. Xiong, J. Bradbury, and R. Socher, “Pointer sentinel mixture models,” in International Conference on Learning Representations (ICLR), 2017.

  [22] M. Mozafari, M. Ganjtabesh, A. Nowzari-Dalini, and T. Masquelier, “Spyketorch: Efficient simulation of convolutional spiking neural networks with at most one spike per neuron,” Frontiers in Neuroscience, vol. 13, p. 625, 2019.

  [23] K. Roy, A. Jaiswal, and P. Panda, “Towards spike-based machine intelligence with neuromorphic computing,” Nature, vol. 575, no. 7784, pp. 607–617, 2019.

  [24] N. Rathi, I. Chakraborty, A. Kosta, A. Sengupta, A. Ankit, P. Panda, and K. Roy, “Exploring neuromorphic computing based on spiking neural networks: Algorithms to hardware,” ACM CSUR, vol. 55, no. 12, 2023.

  [25] R. V. W. Putra and M. Shafique, “Topspark: a timestep optimization methodology for energy-efficient spiking neural networks on autonomous mobile agents,” in 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2023, pp. 3561–3567.

  [26] S. S. Chowdhury, N. Rathi, and K. Roy, “Towards ultra low latency spiking neural networks for vision and sequential tasks using temporal pruning,” in European Conference on Computer Vision (ECCV). Springer, 2022, pp. 709–726.

  [27] R. V. W. Putra and M. Shafique, “Spikenas: A fast memory-aware neural architecture search framework for spiking neural network-based embedded ai systems,” IEEE Transactions on Artificial Intelligence (TAI), pp. 1–12, 2025.

  [28] F. Akopyan, J. Sawada, A. Cassidy, R. Alvarez-Icaza, J. Arthur, P. Merolla, N. Imam, Y. Nakamura, P. Datta, G. Nam, B. Taba, M. Beakes, B. Brezzo, J. B. Kuang, R. Manohar, W. P. Risk, B. Jackson, and D. S. Modha, “Truenorth: Design and tool flow of a 65 mw 1 million neuron programmable neurosynaptic chip,” IEEE Transactions on Computer-Aided Design of In...

  [29] A. Roy, S. Venkataramani, N. Gala, S. Sen, K. Veezhinathan, and A. Raghunathan, “A programmable event-driven architecture for evaluating spiking neural networks,” in IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), July 2017, pp. 1–6.

  [30] M. Davies, N. Srinivasa, T. Lin, G. Chinya, Y. Cao, S. H. Choday, G. Dimou, P. Joshi, N. Imam, S. Jain, Y. Liao, C. Lin, A. Lines, R. Liu, D. Mathaikutty, S. McCoy, A. Paul, J. Tse, G. Venkataramanan, Y. Weng, A. Wild, Y. Yang, and H. Wang, “Loihi: A neuromorphic manycore processor with on-chip learning,” IEEE Micro, vol. 38, no. 1, pp. 82–99, Jan 2018.

  [31] A. Neckar, S. Fok, B. V. Benjamin, T. C. Stewart, N. N. Oza, A. R. Voelker, C. Eliasmith, R. Manohar, and K. Boahen, “Braindrop: A mixed-signal neuromorphic architecture with a dynamical systems-based programming model,” Proceedings of the IEEE, vol. 107, no. 1, pp. 144–164, 2019.

  [32] C. Frenkel, M. Lefebvre, J. Legat, and D. Bol, “A 0.086-mm2 12.7-pj/sop 64k-synapse 256-neuron online-learning digital spiking neuromorphic processor in 28-nm cmos,” IEEE Transactions on Biomedical Circuits and Systems (TBCAS), vol. 13, no. 1, pp. 145–158, Feb 2019.

  [33] C. Frenkel, J.-D. Legat, and D. Bol, “Morphic: A 65-nm 738k-synapse/mm2 quad-core binary-weight digital neuromorphic processor with stochastic spike-driven online learning,” IEEE Trans. on Biomedical Circuits and Systems (TBCAS), vol. 13, no. 5, pp. 999–1010, 2019.

  [34] SynSense. Dynap-cnn: The world’s first fully scalable, event-driven neuromorphic processor with up to 1m configurable spiking neurons and direct interface with external dvs. [Online]. Available: https://www.synsense.ai/products/dynap-cnn/

  [35] BrainChip. Akida neural processor soc. [Online]. Available: https://brainchip.com/akida-neural-processor-soc/

  [36] R. Krishnamoorthi, “Quantizing deep convolutional networks for efficient inference: A whitepaper,” arXiv, vol. 1806.08342, 2018.

  [37] Z. Liu, C. Zhao, I. Fedorov, B. Soran, D. Choudhary, R. Krishnamoorthi, V. Chandra, Y. Tian, and T. Blankevoort, “Spinquant: LLM quantization with learned rotations,” in The Thirteenth International Conference on Learning Representations (ICLR), 2025.

  [38] M. van Baalen, B. Kahne, E. Mahurin, A. Kuzmin, A. Skliar, M. Nagel, and T. Blankevoort, “Simulated quantization, real power savings,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 2757–2761.

  [39] L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe, C. Foster, J. Phang, H. He, A. Thite, N. Nabeshima, S. Presser, and C. Leahy, “The Pile: An 800gb dataset of diverse text for language modeling,” arXiv preprint arXiv:2101.00027, 2020.