pith. machine review for the scientific record.

arxiv: 2604.15944 · v2 · submitted 2026-04-17 · 💻 cs.AR

Recognition: unknown

CIMple: Standard-cell SRAM-based CIM with LUT-based split softmax for attention acceleration

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 07:53 UTC · model grok-4.3

classification 💻 cs.AR
keywords compute-in-memory · SRAM-based CIM · self-attention accelerator · LUT-based softmax · transformer models · edge AI · INT8 precision · 28nm implementation

The pith

A dual-banked, SRAM-based CIM accelerator with a LUT-based split softmax reaches 26.1 TOPS/W for transformer self-attention in 28 nm.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces CIMple, a compute-in-memory accelerator built to run the self-attention blocks inside transformer models on edge devices that lack the power and bandwidth for full-scale LLMs. It integrates computation directly into standard-cell SRAM arrays so data stays put during the many multiply-accumulate steps, while a separate LUT-based block handles the nonlinear softmax operation in fixed-point arithmetic. The architecture is fully digital and uses a dual-banked layout with parallel 8-bit weight feeding, giving it the flexibility to support different transformer variants without hard-wired analog circuits. Measured results from a 32 kb test chip in 28 nm show 26.1 TOPS/W at 0.85 V and 2.31 TOPS/mm² at 1.2 V under INT8 precision. A sympathetic reader cares because these numbers suggest a practical route to running billion-parameter models locally without cloud offload or massive battery drain.
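
As a concrete illustration of the nonlinear half of that story, the sketch below shows one way a LUT-based, fixed-point "split" softmax can work: exp(x) is rewritten as a power of two, the integer part becomes a shift, and the fractional part is served from a small lookup table. This is an editorial reconstruction under assumptions; the bit widths, the INT8 dequantization scale, and the exact partitioning are illustrative, not the parameters CIMple publishes.

```python
# Hedged sketch of a LUT-based, fixed-point softmax in the spirit of "split"
# softmax hardware: exp(x) = 2**(x*log2(e)) is split into an integer power of
# two (a shift) and a fractional power of two served by a small LUT.
# All bit widths and scales below are assumptions, not the paper's.
import numpy as np

FRAC_BITS = 5                                   # assumed LUT address width (32 entries)
LUT = np.round(2.0 ** (np.arange(2 ** FRAC_BITS) / 2 ** FRAC_BITS) * 256) / 256  # 8-bit fixed-point entries

def lut_softmax(scores_int8, scale=1.0 / 16):   # 'scale' is an assumed INT8 dequantization step
    x = scores_int8.astype(np.float64) * scale
    x = x - x.max()                             # max subtraction keeps exponents non-positive
    t = x * np.log2(np.e)                       # exp(x) == 2**t
    q_int = np.floor(t).astype(int)             # integer part -> barrel shift in hardware
    frac_idx = np.floor((t - q_int) * 2 ** FRAC_BITS).astype(int)  # fractional part -> LUT index
    e = LUT[frac_idx] * np.exp2(q_int)          # shift-and-lookup reconstruction of exp(x)
    return e / e.sum()

scores = np.random.default_rng(0).integers(-64, 64, size=16).astype(np.int8)
ref = np.exp(scores * (1.0 / 16) - (scores * (1.0 / 16)).max())
ref = ref / ref.sum()
print("max |error| vs float softmax:", np.abs(lut_softmax(scores) - ref).max())
```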

Core claim

CIMple is a fully digital standard-cell SRAM-based CIM architecture for self-attention that overcomes the static-MAC limitation of prior CIM designs. It employs a novel dual-banked structure with 8-bit parallel weight feeding together with a LUT-based fixed-point implementation of split softmax. The 32 kb accelerator fabricated in 28 nm achieves 26.1 TOPS/W at 0.85 V and 2.31 TOPS/mm² at 1.2 V while preserving accuracy across various transformer models.

What carries the argument

Dual-banked fully digital CIM architecture using 8-bit parallel weight feeding and LUT-based fixed-point split softmax to handle both linear and nonlinear attention operations inside the same SRAM array.
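
To make that division of labor concrete, the toy sketch below runs one attention head as two INT8 matrix-multiply stages with wide integer accumulation (the kind of work a digital SRAM-CIM bank performs as multiply-accumulates), leaving softmax as the single step handed to a LUT block. Dimensions, requantization shifts, and scales are illustrative assumptions, not the paper's.

```python
# Editorial sketch, not CIMple's dataflow: INT8 linear algebra with int32
# accumulation for the weight-projection and activation-to-activation stages,
# with softmax isolated as the lone nonlinear step.
import numpy as np

def int8_matmul(a, b):
    return a.astype(np.int32) @ b.astype(np.int32)           # wide accumulation, as in a digital adder tree

def requant(acc, shift):
    return np.clip(acc >> shift, -128, 127).astype(np.int8)  # crude rescale back to INT8 (assumed shift)

rng = np.random.default_rng(0)
seq, d = 8, 16                                               # toy sizes, not the paper's
x = rng.integers(-64, 64, (seq, d), dtype=np.int8)           # INT8 activations
wq, wk, wv = (rng.integers(-64, 64, (d, d), dtype=np.int8) for _ in range(3))

q = requant(int8_matmul(x, wq), 8)                           # weight-projection stage (static weights in a bank)
k = requant(int8_matmul(x, wk), 8)
v = requant(int8_matmul(x, wv), 8)
s = int8_matmul(q, k.T)                                      # activation-to-activation stage (QK^T)
p = np.exp((s - s.max(axis=-1, keepdims=True)) / 64.0)       # softmax: the step the LUT block replaces
p = p / p.sum(axis=-1, keepdims=True)
out = p @ v.astype(np.float64)                               # score-times-value stage
print(out.shape)                                             # (8, 16)
```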

If this is right

  • Self-attention can be executed with far less off-array data movement than conventional digital accelerators.
  • Nonlinear functions such as softmax become practical inside CIM arrays when realized as small fixed-point LUTs.
  • The same hardware template can be reused across multiple transformer variants without analog redesign.
  • INT8 precision at these efficiency levels becomes viable for battery-powered edge inference of large models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same dual-banking and LUT approach could be extended to other attention variants such as multi-query or grouped-query attention.
  • Full end-to-end transformer layers might be built by tiling multiple CIMple blocks for feed-forward and embedding stages.
  • The fixed-point LUT technique for nonlinearities could be reused for activation functions in other neural-network accelerators.

Load-bearing premise

The LUT-based fixed-point implementation reduces latency with minimal accuracy degradation while supporting various transformer models through the dual-banked fully digital architecture.

What would settle it

Silicon measurements on the 28 nm chip showing either energy efficiency below 20 TOPS/W at 0.85 V or more than 2 percent accuracy loss on standard transformer benchmarks such as GLUE.
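
Silicon and benchmark runs are the only measurements that would actually settle this, but a cheap software proxy exists for the softmax half of the risk: sweep sequence lengths and measure how far a LUT-style fixed-point softmax drifts from a full-precision reference. The sketch below is that proxy under assumed bit widths and an assumed score distribution; tight agreement would make a greater-than-2-percent benchmark loss from the softmax alone look unlikely, while large drift would flag risk early.

```python
# Editorial pre-silicon proxy, with assumed LUT parameters: mean-squared error
# of a LUT-based split softmax against a float64 reference softmax, swept over
# sequence length. Scores are drawn from a dequantized-INT8-like distribution,
# which is itself an assumption.
import numpy as np

FRAC_BITS = 5
LUT = np.round(2.0 ** (np.arange(2 ** FRAC_BITS) / 2 ** FRAC_BITS) * 256) / 256  # 8-bit LUT entries

def lut_softmax(x):
    t = (x - x.max(axis=-1, keepdims=True)) * np.log2(np.e)   # exp(x) == 2**t
    q = np.floor(t)                                            # integer part -> shift
    e = LUT[np.floor((t - q) * 2 ** FRAC_BITS).astype(int)] * np.exp2(q)
    return e / e.sum(axis=-1, keepdims=True)

def ref_softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
for seq_len in (128, 256, 512, 1024, 2048):
    # 64 sample query rows per length, scores in a dequantized INT8 range
    scores = rng.integers(-64, 64, (64, seq_len)).astype(np.float64) / 16.0
    mse = np.mean((lut_softmax(scores) - ref_softmax(scores)) ** 2)
    print(f"seq_len={seq_len:4d}  LUT-vs-float softmax MSE: {mse:.3e}")
```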

Figures

Figures reproduced from arXiv: 2604.15944 by Bas Ahn, Henk Corporaal, Manil Dev Gomony, Marc Geilen, Xingjian Tao.

Figure 1. High-level view of the proposed accelerator for self…
Figure 2. High-level view of different transformer types. With…
Figure 3. Computation flow of multi-head attention, showing the weight projection stage, activation-to-activation stage and the…
Figure 4. Architecture of CIMple showing 32kb CIM based self-attention accelerator (yellow) including an intermediate buffer…
Figure 6. 16b SRAM block structure with two banks, every two…
Figure 7. The LUT-based split softmax computation flow consists…
Figure 8. Measurements result TOPS/W for different levels of…
Figure 9. Pie chart illustrating the power (a) breakdown of the…
Figure 10. Layout of proposed self-attention accelerator showing…
Figure 11. Accuracy comparison for running the lm-evaluation…
read the original abstract

Large Language Models (LLMs) such as LLaMA and DeepSeek, are built on transformer architectures, which have become a standard model for achieving state-of-the-art performance in natural language processing tasks. Recently, there has been growing interest in deploying LLMs on edge devices. Although smaller LLM models are being proposed, they often still contain billions of parameters. Since edge devices are limited in their resources this poses a significant challenge for edge deployment. Compute-in-memory (CIM) is a promising architecture that addresses this by reducing data movement through the integration of computational logic directly into memory. However, existing CIM architectures support only static Multiply-Accumulate (MAC) operations which limit their configurability in supporting nonlinear operations and various types of transformer models. This paper presents a fully digital standard-cell SRAM-based CIM architecture accelerator for self-attention, called CIMple, designed to overcome these limitations, inside transformer models. The key contributions of CIMple are: 1) A novel dual-banked CIM-based fully digital self-attention accelerator using 8-bit parallel weight feeding. 2) A look-up-table (LUT) based fixed-point implementation reducing latency with minimal accuracy degradation. 3) A performance evaluation of a 32kb CIM-based self-attention accelerator implemented in 28nm, which achieves 26.1 TOPS/W at 0.85V and 2.31 TOPS/mm$^2$ at 1.2V, both with INT8 precision.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents CIMple, a fully digital standard-cell SRAM-based compute-in-memory (CIM) accelerator for self-attention in transformers. Key features include a dual-banked architecture with 8-bit parallel weight feeding and an LUT-based fixed-point split softmax to enable nonlinear operations while reducing latency. A 32 kb prototype fabricated in 28 nm CMOS is evaluated, reporting 26.1 TOPS/W at 0.85 V and 2.31 TOPS/mm² at 1.2 V, both at INT8 precision, with claims of supporting multiple transformer models and minimal accuracy degradation.

Significance. If the hardware measurements and accuracy claims hold under detailed scrutiny, the work would demonstrate a practical, configurable CIM solution for attention acceleration that overcomes the static-MAC limitation of prior CIM designs. The concrete 28 nm efficiency numbers and fully digital approach could inform edge LLM deployment, provided the LUT softmax and dual-banked design generalize across models.

major comments (3)
  1. [Abstract and Results] The abstract and results section report concrete TOPS/W and TOPS/mm² figures from a 28 nm implementation, yet the manuscript provides neither full circuit schematics nor an error analysis (e.g., bit-error rates or voltage scaling effects) that would substantiate these metrics as load-bearing evidence for the claimed efficiency advantage.
  2. [Architecture and LUT Softmax] The LUT-based fixed-point split softmax is asserted to reduce latency with only minimal accuracy degradation and to support various transformer models, but no quantitative accuracy tables (e.g., perplexity or attention-score MSE versus FP32 baselines) or ablation across sequence lengths/models appear; this directly underpins the weakest assumption and the generality claim.
  3. [Architecture Description] The dual-banked fully digital architecture with 8-bit parallel weight feeding is presented as novel, but the manuscript lacks a direct comparison table against prior SRAM-CIM or digital attention accelerators on the same 28 nm node or equivalent metrics, making it difficult to assess whether the reported numbers represent a genuine advance.
minor comments (2)
  1. [Abstract] The phrase 'inside transformer models' in the abstract is unclear; rephrase to indicate that the accelerator targets the self-attention computation within transformer blocks.
  2. [LUT-based Softmax] Notation for the LUT split-softmax (e.g., bit-width partitioning and fixed-point scaling factors) should be defined explicitly in the first use, preferably with a small equation or diagram.
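
On the second minor comment, one plausible way to write the split down (an editorial guess at the notation, not the paper's published definition; q_f denotes an assumed number of fractional address bits, and LUT[m] stores a fixed-point approximation of 2^(m / 2^{q_f})) is:

```latex
% Assumed notation for a LUT-based split softmax: the stabilized exponent is
% converted to base 2 and split into an integer shift and a q_f-bit fractional
% LUT lookup.
t_i = (x_i - \max_j x_j)\,\log_2 e, \qquad f_i = t_i - \lfloor t_i \rfloor \in [0,1),
\qquad
\mathrm{softmax}(x)_i \approx
\frac{2^{\lfloor t_i \rfloor}\,\mathrm{LUT}\!\left[\lfloor f_i\,2^{q_f} \rfloor\right]}
     {\sum_k 2^{\lfloor t_k \rfloor}\,\mathrm{LUT}\!\left[\lfloor f_k\,2^{q_f} \rfloor\right]}.
```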

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed review. We address each major comment below and will revise the manuscript to incorporate the suggested improvements where they strengthen the presentation of our results and claims.

read point-by-point responses
  1. Referee: [Abstract and Results] The abstract and results section report concrete TOPS/W and TOPS/mm² figures from a 28 nm implementation, yet the manuscript provides neither full circuit schematics nor an error analysis (e.g., bit-error rates or voltage scaling effects) that would substantiate these metrics as load-bearing evidence for the claimed efficiency advantage.

    Authors: We agree that additional substantiation would improve the manuscript. In the revision we will add a dedicated error-analysis subsection in the results that reports measured bit-error rates across operating voltages and discusses voltage-scaling effects on both efficiency and functional correctness. The current manuscript already contains block-level diagrams of the dual-banked SRAM array and LUT softmax; we will expand these with more detailed sub-block schematics and will include the complete transistor-level schematics in the supplementary material. revision: partial

  2. Referee: [Architecture and LUT Softmax] The LUT-based fixed-point split softmax is asserted to reduce latency with only minimal accuracy degradation and to support various transformer models, but no quantitative accuracy tables (e.g., perplexity or attention-score MSE versus FP32 baselines) or ablation across sequence lengths/models appear; this directly underpins the weakest assumption and the generality claim.

    Authors: The referee correctly notes the absence of quantitative accuracy data. We will add a new evaluation subsection containing tables that report perplexity and attention-score MSE relative to FP32 baselines for representative transformer models (BERT, GPT-2, LLaMA-7B). We will also include ablation results across sequence lengths (128–2048) and multiple model scales to quantify the latency–accuracy trade-off of the LUT-based split softmax and to support the generality claim. revision: yes

  3. Referee: [Architecture Description] The dual-banked fully digital architecture with 8-bit parallel weight feeding is presented as novel, but the manuscript lacks a direct comparison table against prior SRAM-CIM or digital attention accelerators on the same 28 nm node or equivalent metrics, making it difficult to assess whether the reported numbers represent a genuine advance.

    Authors: We concur that a side-by-side comparison is necessary. The revised manuscript will contain a new comparison table that places CIMple against recent SRAM-CIM and digital attention accelerators. Where possible, metrics will be normalized to 28 nm; the table will list TOPS/W, TOPS/mm², precision, supported operations, and architectural features so that the advantages of the dual-banked 8-bit parallel design and LUT softmax can be directly evaluated. revision: yes
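
As a sketch of what "normalized to 28 nm" could mean in practice, the snippet below applies a common first-order heuristic in which dynamic energy per operation scales roughly with feature size and the square of supply voltage. The scaling law, the comparison entries, and their voltages are illustrative assumptions, not values from the paper or from specific prior work, so treat the output as a rough sanity check rather than a substitute for the promised table.

```python
# Hedged first-order normalization of reported energy efficiency (TOPS/W) to a
# 28 nm / 0.85 V reference point, assuming dynamic energy per op scales roughly
# with (feature size) * Vdd^2. Real designs deviate from this heuristic.
def normalize_tops_per_w(tops_per_w, node_nm, vdd, ref_node_nm=28.0, ref_vdd=0.85):
    energy_scale = (ref_node_nm / node_nm) * (ref_vdd / vdd) ** 2  # energy grows when mapped to 28 nm
    return tops_per_w / energy_scale

# Placeholder comparison entries (name, node in nm, Vdd, reported TOPS/W);
# only the first row comes from this paper, the others are hypothetical.
designs = [
    ("this work (28 nm, 0.85 V)", 28, 0.85, 26.1),
    ("hypothetical 7 nm macro",    7, 0.80, 300.0),
    ("hypothetical 22 nm macro",  22, 0.75, 40.0),
]
for name, node, vdd, tw in designs:
    norm = normalize_tops_per_w(tw, node, vdd)
    print(f"{name:28s} reported {tw:6.1f} TOPS/W -> ~{norm:6.1f} TOPS/W at 28 nm, 0.85 V")
```

Any such normalization is blunt; precision, macro size, and whether the nonlinearity is handled on-chip matter at least as much as the node, which is why the side-by-side table should list them explicitly, as the response above proposes.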

Circularity Check

0 steps flagged

No circularity: hardware metrics from physical implementation

full rationale

The paper presents an architectural design for a CIM-based self-attention accelerator and reports measured performance (TOPS/W, TOPS/mm²) from a 28nm physical implementation. No equations, derivations, fitted parameters, or self-citation chains are used to 'predict' results; claims rest on direct hardware measurements and standard design descriptions. The evaluation is self-contained, and at no point do the reported outputs reduce to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The design rests on standard assumptions of digital CMOS fabrication and fixed-point arithmetic accuracy trade-offs; no new physical entities or ad-hoc fitted constants are introduced beyond the reported voltages and bit widths.

axioms (2)
  • domain assumption Standard-cell SRAM cells can be reliably used for both storage and in-memory computation without custom analog circuits.
    Invoked in the description of the fully digital standard-cell SRAM-based CIM architecture.
  • domain assumption LUT approximation of softmax introduces only minimal accuracy loss for transformer attention.
    Stated as part of contribution 2 in the abstract.

pith-pipeline@v0.9.0 · 5581 in / 1374 out tokens · 49739 ms · 2026-05-10T07:53:42.552592+00:00 · methodology

discussion (0)

