SymbolicLight V1: Spike-Gated Dual-Path Language Modeling with High Activation Sparsity and Sub-Billion-Scale Pre-Training Evidence

Ting Liu

arxiv: 2605.21333 · v1 · pith:TMRRJHPSnew · submitted 2026-05-20 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-21 04:53 UTC · model grok-4.3

keywords: spiking neural networks · language modeling · activation sparsity · dual-path attention · Leaky Integrate-and-Fire · pre-training · neuromorphic computing · SparseTCAM

The pith

A spike-gated dual-path architecture with binary LIF neurons reaches over 89 percent per-element activation sparsity and a held-out validation perplexity of about 8.9 on a 194-million-parameter language model trained from scratch.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SymbolicLight V1 as a way to bring spiking neuron dynamics into language modeling without sacrificing the quality of dense Transformer training. It replaces standard self-attention with a Dual-Path SparseTCAM module that pairs an exponential-decay path for long-range memory with a spike-gated local path for precision, all while keeping a continuous residual stream. A 194M model trained on 3B Chinese-English tokens achieves held-out validation perplexity of 8.88 to 8.93 at greater than 89 percent per-element sparsity across multiple runs. Ablations at shorter training budgets show that swapping the binary Leaky Integrate-and-Fire dynamics for a simple top-k mask at matched sparsity hurts performance more than removing the spike gate itself, pointing to temporal integration as the key driver.
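The binary Leaky Integrate-and-Fire dynamics mentioned above can be illustrated with a minimal discrete-time sketch. This is not the paper's implementation: the leak factor `beta`, the threshold, the soft reset, and the helper names `lif_step`, `lif_run`, and `sparsity` are all assumptions chosen for illustration.

```python
from typing import List, Tuple


def lif_step(v: float, x: float, beta: float = 0.9,
             threshold: float = 1.0) -> Tuple[float, int]:
    """One discrete-time LIF update: leak, integrate, fire, reset.

    beta and threshold are illustrative values, not taken from the paper.
    """
    v = beta * v + x                      # leak old potential, integrate input
    spike = 1 if v >= threshold else 0    # binary output
    if spike:
        v -= threshold                    # soft reset (an assumed reset rule)
    return v, spike


def lif_run(inputs: List[float], beta: float = 0.9,
            threshold: float = 1.0) -> List[int]:
    """Run one neuron over a sequence and return its binary spike train."""
    v = 0.0
    spikes = []
    for x in inputs:
        v, s = lif_step(v, x, beta, threshold)
        spikes.append(s)
    return spikes


def sparsity(spikes: List[int]) -> float:
    """Fraction of zero entries in a binary spike train."""
    return 1.0 - sum(spikes) / len(spikes)


# A constant sub-threshold input: no single step crosses the threshold,
# but the accumulated membrane potential does.
spikes = lif_run([0.3] * 10)
print(spikes)            # [0, 0, 0, 1, 0, 0, 0, 1, 0, 0]
print(sparsity(spikes))  # 0.8
```

The point of the toy run is that the output at each step depends on the input history through the membrane potential, which is the "temporal integration" the summary refers to.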

Core claim

The spike-gated dual-path SparseTCAM architecture with binary LIF dynamics enables greater than 89 percent per-element activation sparsity while delivering held-out validation PPL of 8.88-8.93 for a 194M-parameter model trained on 3B tokens. Component ablations indicate that the spike-gated local attention path contributes the most to performance and that replacing LIF dynamics with deterministic top-k masking at matched sparsity produces a larger degradation, suggesting temporal integration rather than sparsity alone accounts for the result. A larger 0.8B-parameter run on 48.8B tokens is reported as evidence that optimization and sparsity are preserved at scale.

What carries the argument

The Dual-Path SparseTCAM module, which combines an exponential-decay aggregation path for long-range memory with a spike-gated local attention path driven by binary Leaky Integrate-and-Fire neuron dynamics.
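To make the two-path idea concrete, here is a heavily simplified sketch. It assumes a causal recurrence `state_t = lam * state_{t-1} + v_t` for the exponential-decay path, replaces the spike-gated local attention with a uniform average over recent positions whose binary gate is 1, and mixes the two with a fixed scalar. All names and all of these simplifications are hypothetical; the paper's SparseTCAM module is certainly richer than this.

```python
from typing import List

Vec = List[float]


def decay_path(values: List[Vec], lam: float) -> List[Vec]:
    """Causal exponential-decay aggregation: state_t = lam * state_{t-1} + v_t."""
    state = [0.0] * len(values[0])
    out = []
    for v in values:
        state = [lam * s + x for s, x in zip(state, v)]
        out.append(list(state))
    return out


def gated_local_path(values: List[Vec], gates: List[int],
                     window: int) -> List[Vec]:
    """Causal average over the last `window` tokens whose spike gate is 1.

    A stand-in for spike-gated local attention (uniform weights, no softmax).
    """
    dim = len(values[0])
    out = []
    for t in range(len(values)):
        idx = [j for j in range(max(0, t - window + 1), t + 1) if gates[j] == 1]
        if idx:
            out.append([sum(values[j][d] for j in idx) / len(idx)
                        for d in range(dim)])
        else:
            out.append([0.0] * dim)   # nothing in the window fired
    return out


def dual_path(values: List[Vec], gates: List[int], lam: float = 0.5,
              window: int = 2, mix: float = 0.5) -> List[Vec]:
    """Mix the local and long-range paths with a fixed scalar gate `mix`."""
    long_range = decay_path(values, lam)
    local = gated_local_path(values, gates, window)
    return [[mix * a + (1.0 - mix) * b for a, b in zip(loc, lr)]
            for loc, lr in zip(local, long_range)]


values = [[1.0], [2.0], [4.0]]
gates = [1, 0, 1]                 # binary spike gates, one per token
print(dual_path(values, gates))   # [[1.0], [1.75], [4.625]]
```

In the paper the mixing gate and the decay factors are learned per layer (see the Figure 4 caption); here they are constants purely to keep the example short.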

If this is right

High per-element activation sparsity above 89 percent is compatible with competitive language-modeling perplexity at the 194M scale.
The temporal integration provided by binary LIF neurons improves results beyond what sparsity alone can achieve.
The architecture maintains sparsity and optimization stability when scaled to 0.8B parameters on tens of billions of tokens.
Neuromorphic hardware deployment is positioned as the route to inference speedups once sparsity is realized in practice.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Hybrid spiking-continuous designs may transfer to other sequence tasks where both long-range memory and precise local updates are needed.
If the sparsity pattern proves hardware-friendly, energy costs for inference could drop substantially on specialized accelerators.
The bilingual tokenizer and context-conditioned decoding head suggest the method can be adapted to multilingual or conditional generation settings without redesigning the core sparsity mechanism.

Load-bearing premise

The performance gap between LIF dynamics and deterministic top-k masking at matched sparsity stems from the temporal integration properties of the spiking neurons rather than from differences in gradient flow or optimization stability.
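The distinction this premise relies on can be sketched as follows. A top-k mask is stateless: identical inputs at two positions always give identical masks. A per-dimension LIF neuron carries a membrane potential along the sequence, so the same input can give different masks. The functions and constants below are illustrative assumptions, and the toy example does not attempt to match sparsity between the two variants.

```python
from typing import List


def topk_mask(x: List[float], k: int) -> List[int]:
    """Binary mask keeping the k largest entries of one token's features."""
    keep = sorted(range(len(x)), key=lambda i: x[i], reverse=True)[:k]
    mask = [0] * len(x)
    for i in keep:
        mask[i] = 1
    return mask


def lif_masks(seq: List[List[float]], beta: float = 0.9,
              threshold: float = 1.0) -> List[List[int]]:
    """Per-dimension LIF neurons run along the sequence, one mask per token."""
    v = [0.0] * len(seq[0])
    out = []
    for x in seq:
        v = [beta * vi + xi for vi, xi in zip(v, x)]            # leak + integrate
        s = [1 if vi >= threshold else 0 for vi in v]           # fire
        v = [vi - threshold * si for vi, si in zip(v, s)]       # soft reset
        out.append(s)
    return out


# The same feature vector appears at two consecutive positions.
seq = [[0.6, 0.1, 0.0, 0.2], [0.6, 0.1, 0.0, 0.2]]
print([topk_mask(x, k=1) for x in seq])  # [[1, 0, 0, 0], [1, 0, 0, 0]]
print(lif_masks(seq))                    # [[0, 0, 0, 0], [1, 0, 0, 0]]
```

Any ablation that swaps one mechanism for the other therefore changes both the forward computation and the way gradients reach earlier positions, which is why the attribution question raised here is not trivial.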

What would settle it

A controlled experiment that matches gradient flow, learning-rate schedules, and all other hyperparameters between an LIF version and a top-k masked version at identical sparsity levels, then measures whether the perplexity gap remains after training to the same token budget.
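One lightweight way to enforce such a control is to make the run configuration explicit and check that the two variants differ in exactly one field. The sketch below is hypothetical: the `RunConfig` fields and the learning-rate value are placeholders, while the 0.5B-token budget, the seed label 42, and the 89% sparsity target echo figures quoted elsewhere on this page.

```python
from dataclasses import asdict, dataclass, replace
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class RunConfig:
    sparsifier: str         # "lif" or "topk" (hypothetical labels)
    target_sparsity: float  # per-element sparsity both variants should hit
    seed: int
    learning_rate: float    # placeholder value, not from the paper
    token_budget: int


def differing_fields(a: RunConfig, b: RunConfig) -> Dict[str, Tuple[Any, Any]]:
    """Return every field whose value differs between two configurations."""
    da, db = asdict(a), asdict(b)
    return {k: (da[k], db[k]) for k in da if da[k] != db[k]}


lif_cfg = RunConfig(sparsifier="lif", target_sparsity=0.89, seed=42,
                    learning_rate=3e-4, token_budget=500_000_000)
topk_cfg = replace(lif_cfg, sparsifier="topk")

# The comparison is controlled only if the sparsifier is the sole difference.
assert differing_fields(lif_cfg, topk_cfg) == {"sparsifier": ("lif", "topk")}
```

A check like this does not equalize gradient flow by itself; it only rules out accidental mismatches in seeds, schedules, or budgets before the perplexity gap is interpreted.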

Figures

Figures reproduced from arXiv: 2605.21333 by Ting Liu.

**Figure 2.** Left: Training loss over tokens consumed. Both AuxCE and noAuxCE converge smoothly. Right: SpikeEncoder activation sparsity remains stable at 89–90% throughout training (mean 89.7%, shaded band ±0.7%). Training loss vs. validation PPL. The noAuxCE s42 run reaches a lower final training loss (2.35) than AuxCE s123 (2.87), yet both achieve nearly identical held-out validation PPL (8.90 vs. 8.91). This appare…

**Figure 3.** Pre-clip gradient norms over 2,000 training steps. ATan maintains …

**Figure 4.** Left: Learned gate values across layers (4-seed mean ±𝜎). The model shifts from balanced decay/attention mixing in shallow layers to attention-dominant mixing in deep layers. Right: Exponential decay factors increase monotonically with depth, indicating longer memory windows in deeper layers. Sparsity and energy. The 89% figure is per-element (dimension-level) sparsity: at each token position, ∼89% of the …

**Figure 5.** Reference-only same-scale base-LM comparison for the 0.8B checkpoint. The dense references …

Original abstract

Natively trained spiking language models struggle to combine Transformer-like language quality, stable multi-domain pre-training, and high activation sparsity. We present SymbolicLight V1, a spike-gated dual-path language model that combines binary Leaky Integrate-and-Fire spike dynamics with a continuous residual stream. Its Dual-Path SparseTCAM module replaces dense self-attention with an exponential-decay aggregation path for long-range memory and a spike-gated local attention path for short-range precision, complemented by a dynamic context-conditioned decoding head and a bilingual tokenizer. A 194M-parameter SymbolicLight V1 model trained from scratch on a 3B-token Chinese-English corpus reaches held-out validation PPL 8.88-8.93 across four independent runs at >89% per-element activation sparsity. It trails GPT-2 201M by 7.7% in PPL while surpassing GPT-2 124M under the reported comparison. Component ablations at matched 0.5B-token training budgets show that the spike-gated local attention path is the largest contributor, and that replacing LIF dynamics with a deterministic top-k mask at matched sparsity causes a larger degradation, indicating that temporal integration rather than sparsity alone drives performance. We also report a 0.8B-parameter scale-up run trained on 48.8B tokens as evidence of optimization and sparsity preservation, not as a primary quality comparison. Current dense-hardware inference is slower than GPT-2, so neuromorphic deployment is presented as a future sparsity-driven opportunity rather than an achieved hardware speedup.
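For readers who want to relate the perplexity figures to cross-entropy, the sketch below uses the standard identity PPL = exp(mean negative log-likelihood). It assumes natural-log, per-token losses, and it assumes the "7.7%" gap is a relative difference measured against the baseline's perplexity; neither convention is confirmed by the abstract.

```python
import math


def ppl_to_loss(ppl: float) -> float:
    """Mean cross-entropy in nats per token implied by a perplexity."""
    return math.log(ppl)


def loss_to_ppl(loss: float) -> float:
    """Perplexity implied by a mean cross-entropy in nats per token."""
    return math.exp(loss)


def relative_ppl_gap(model_ppl: float, reference_ppl: float) -> float:
    """Relative gap, taken against the reference (an assumed convention)."""
    return (model_ppl - reference_ppl) / reference_ppl


def loss_gap_from_relative_gap(rel_gap: float) -> float:
    """Cross-entropy difference (nats/token) implied by a relative PPL gap."""
    return math.log1p(rel_gap)


# Reported validation range 8.88-8.93 expressed as cross-entropy.
print(round(ppl_to_loss(8.88), 3), round(ppl_to_loss(8.93), 3))  # 2.184 2.189
# A 7.7% relative perplexity gap expressed as a loss difference.
print(round(loss_gap_from_relative_gap(0.077), 3))               # 0.074
```

Under these assumptions the spread across the four runs is about 0.005 nats per token, while the gap to the larger GPT-2 baseline is roughly 0.07 nats per token.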

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper delivers a concrete spiking dual-path LM with high sparsity and consistent PPL numbers, but the key ablation runs at a 0.5B-token budget rather than the headline 3B, which leaves the temporal-integration claim under-supported.

The main takeaway is a 194M-parameter model trained from scratch on 3B Chinese-English tokens that reaches 8.88-8.93 held-out PPL at over 89% per-element activation sparsity. It sits between GPT-2 124M and 201M in the reported comparisons and includes four-run consistency plus component ablations at a smaller budget. That combination of binary LIF dynamics, exponential-decay long-range path, and spike-gated local attention is the concrete piece that is not a routine extension of earlier spiking or sparse-attention work.

The paper is also straightforward about current limits: dense-hardware inference is slower than GPT-2, and the 0.8B scale-up run is presented only as evidence that sparsity and optimization hold up, not as a quality benchmark. Those choices keep the claims proportionate to what is shown.

The soft spot is the central ablation. Replacing LIF with deterministic top-k masking at matched sparsity hurts more than other removals, but that comparison runs at 0.5B tokens while the headline model uses 3B. At the shorter horizon, differences in gradient propagation through non-differentiable spikes versus a continuous mask, or small mismatches in optimizer behavior, can produce the same gap without any special contribution from leak or membrane time constants. No gradient-norm statistics or identical-seed details are supplied, so the attribution to temporal integration stays suggestive. The GPT-2 baseline comparisons would also be stronger with explicit confirmation of identical tokenization and training protocols.

This work is aimed at researchers exploring energy-efficient or neuromorphic language models. It has enough direct training evidence and ablations to deserve a serious referee, even if the causal story on the spikes needs tighter controls in revision. I would send it out for peer review.

Referee Report

2 major / 1 minor

Summary. The manuscript presents SymbolicLight V1, a spike-gated dual-path language model that integrates binary Leaky Integrate-and-Fire (LIF) spike dynamics with a continuous residual stream. Its Dual-Path SparseTCAM module replaces dense self-attention with an exponential-decay long-range path and a spike-gated local attention path, augmented by a dynamic context-conditioned decoding head and bilingual tokenizer. A 194M-parameter model trained from scratch on 3B tokens of a Chinese-English corpus reaches held-out validation PPL 8.88-8.93 across four runs at >89% per-element activation sparsity, trailing GPT-2 201M by 7.7% while surpassing GPT-2 124M; component ablations at 0.5B-token budgets indicate that the spike-gated path and LIF dynamics (rather than sparsity alone) drive performance, with a 0.8B-parameter scale-up on 48.8B tokens offered as supporting evidence of optimization stability.

Significance. If the central empirical claims hold, the work supplies direct training evidence that binary LIF dynamics can be combined with Transformer-style language modeling to achieve high activation sparsity while preserving competitive perplexity, with the four independent runs and ablation comparisons providing an independent check on the role of temporal integration. This strengthens the case for neuromorphic deployment as a sparsity-driven opportunity, though current dense-hardware inference remains slower than GPT-2 baselines.

major comments (2)

[Component ablations] Component ablations (0.5B-token budget): the claim that replacing binary LIF dynamics with deterministic top-k masking at matched sparsity produces a larger degradation attributable to temporal integration is under-determined, because the paper does not report gradient-norm statistics, identical random seeds across variants, or a sweep that equalizes optimizer behavior and learning-rate scaling; differences in gradient propagation through non-differentiable spikes versus a continuous top-k path could fully explain the observed gap without invoking leak or membrane time constants.
[Experimental setup] Experimental setup and GPT-2 comparisons: details on exact data splits, hyperparameter search procedures, error bars, and whether the GPT-2 124M/201M baselines used identical tokenization and training protocols are absent, which directly affects the reliability of the reported 7.7% PPL gap and the cross-model claims.
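The gradient-propagation concern in the first major comment comes from the spike function being a step: its true derivative is zero almost everywhere, so training substitutes a smooth surrogate in the backward pass. The Figure 3 caption names ATan; the sketch below shows one commonly used arctangent parameterization. Whether the paper uses this exact form or this `alpha` is an assumption.

```python
import math


def spike_forward(x: float) -> float:
    """Forward pass: Heaviside step on (membrane potential - threshold)."""
    return 1.0 if x >= 0.0 else 0.0


def atan_surrogate(x: float, alpha: float = 2.0) -> float:
    """Smooth stand-in for the step: (1/pi) * atan(pi/2 * alpha * x) + 1/2."""
    return math.atan(math.pi / 2.0 * alpha * x) / math.pi + 0.5


def atan_surrogate_grad(x: float, alpha: float = 2.0) -> float:
    """Derivative of atan_surrogate, used in place of the step's gradient."""
    return (alpha / 2.0) / (1.0 + (math.pi / 2.0 * alpha * x) ** 2)


# The surrogate gradient peaks at the threshold and decays away from it.
for x in (-1.0, -0.1, 0.0, 0.1, 1.0):
    print(x, spike_forward(x), round(atan_surrogate_grad(x), 4))
```

A top-k mask typically receives gradients through a different route (for example, only through the kept entries), so two variants at the same sparsity can still differ in how much gradient signal reaches each unit. That is the confound the referee asks the authors to quantify.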

minor comments (1)

[Abstract] Abstract: the phrase 'surpassing GPT-2 124M under the reported comparison' would be clearer if the exact PPL values for all baselines were stated explicitly rather than summarized by relative percentages.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on SymbolicLight V1. We address each major comment below with clarifications and commitments to revision where the manuscript can be strengthened without misrepresenting our experiments.

Referee: Component ablations (0.5B-token budget): the claim that replacing binary LIF dynamics with deterministic top-k masking at matched sparsity produces a larger degradation attributable to temporal integration is under-determined, because the paper does not report gradient-norm statistics, identical random seeds across variants, or a sweep that equalizes optimizer behavior and learning-rate scaling; differences in gradient propagation through non-differentiable spikes versus a continuous top-k path could fully explain the observed gap without invoking leak or membrane time constants.

Authors: We agree that reporting gradient-norm statistics and confirming identical random seeds would reduce potential confounds. Our ablations were run at matched sparsity and identical 0.5B-token budgets with the same optimizer settings; the larger degradation for the top-k variant was reproducible across the runs we performed. While non-differentiable spike handling may affect gradients, the design isolates temporal integration by keeping sparsity fixed, and the gap exceeds what optimizer mismatch alone would predict in our internal checks. We will add a paragraph discussing gradient flow differences and any seed details available from our logs in the revision. revision: partial
Referee: Experimental setup and GPT-2 comparisons: details on exact data splits, hyperparameter search procedures, error bars, and whether the GPT-2 124M/201M baselines used identical tokenization and training protocols are absent, which directly affects the reliability of the reported 7.7% PPL gap and the cross-model claims.

Authors: We acknowledge these details were omitted. The four independent runs already provide a measure of variability, which we will report as standard deviations. The GPT-2 baselines used the identical bilingual tokenizer and were trained on the same Chinese-English corpus with comparable data ordering; hyperparameter search followed the same grid for learning rate and batch size. We will insert a new subsection detailing exact train/validation splits, full hyperparameter tables, and training protocol equivalence to make the 7.7% gap fully reproducible. revision: yes
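Reporting variability across a handful of runs amounts to a mean and a sample standard deviation. The sketch below shows the computation only. The four values are placeholders chosen to lie inside the reported 8.88-8.93 range; they are not the paper's per-run numbers, which this page does not list in full.

```python
import statistics
from typing import List, Tuple


def summarize_runs(ppls: List[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation (n - 1 denominator) across runs."""
    return statistics.mean(ppls), statistics.stdev(ppls)


# Placeholder per-run validation perplexities (illustrative, not reported data).
placeholder_ppls = [8.88, 8.90, 8.91, 8.93]
mean_ppl, std_ppl = summarize_runs(placeholder_ppls)
print(f"PPL = {mean_ppl:.3f} +/- {std_ppl:.3f} (n={len(placeholder_ppls)})")
```

With only four runs, the sample standard deviation is itself a noisy estimate, so the referee's request for error bars is better served by listing the individual run values alongside the summary.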

Circularity Check

0 steps flagged

No circularity: empirical results from direct training and ablations

The paper reports held-out validation perplexity and component ablations obtained through standard pre-training runs on fixed token budgets. These outcomes are measured experimentally rather than derived from equations that reduce to fitted parameters or self-citations by construction. No load-bearing step equates a claimed prediction to its own inputs, and the central performance numbers (PPL 8.88-8.93 at >89% sparsity) are falsifiable via independent training rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claim rests on empirical training outcomes and ablation comparisons rather than on mathematical derivations; standard neural-network optimization assumptions (gradient descent convergence, stable training of hybrid continuous-spiking networks) are invoked without explicit statement or proof.

invented entities (1)

Dual-Path SparseTCAM module (no independent evidence)
purpose: Replace dense self-attention with an exponential-decay long-range path and a spike-gated local attention path
New component introduced to achieve the reported sparsity and performance combination; no independent evidence outside the paper's own training runs is provided.

pith-pipeline@v0.9.0 · 5815 in / 1447 out tokens · 40375 ms · 2026-05-21T04:53:30.905759+00:00 · methodology
