pith. machine review for the scientific record.

arxiv: 2604.15944 · v2 · submitted 2026-04-17 · 💻 cs.AR

Recognition: unknown

CIMple: Standard-cell SRAM-based CIM with LUT-based split softmax for attention acceleration

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 07:53 UTC · model grok-4.3

classification 💻 cs.AR
keywords compute-in-memory · SRAM-based CIM · self-attention accelerator · LUT-based softmax · transformer models · edge AI · INT8 precision · 28nm implementation

The pith

A dual-banked, SRAM-based CIM accelerator with a LUT-based split softmax reaches 26.1 TOPS/W for transformer self-attention in 28 nm.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces CIMple, a compute-in-memory accelerator built to run the self-attention blocks inside transformer models on edge devices that lack the power and bandwidth for full-scale LLMs. It integrates computation directly into standard-cell SRAM arrays so data stays put during the many multiply-accumulate steps, while a separate LUT-based block handles the nonlinear softmax operation in fixed-point arithmetic. The architecture is fully digital and uses a dual-banked layout with parallel 8-bit weight feeding, giving it the flexibility to support different transformer variants without hard-wired analog circuits. Measured results from a 32 kb test chip in 28 nm show 26.1 TOPS/W at 0.85 V and 2.31 TOPS/mm² at 1.2 V under INT8 precision. A sympathetic reader cares because these numbers suggest a practical route to running billion-parameter models locally without cloud offload or massive battery drain.
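
As a concrete illustration of the nonlinear half of that story, the sketch below shows one way a LUT-based, fixed-point "split" softmax can work: exp(x) is rewritten as a power of two, the integer part becomes a shift, and the fractional part is served from a small lookup table. This is an editorial reconstruction under assumptions; the bit widths, the INT8 dequantization scale, and the exact partitioning are illustrative, not the parameters CIMple publishes.

```python
# Hedged sketch of a LUT-based, fixed-point softmax in the spirit of "split"
# softmax hardware: exp(x) = 2**(x*log2(e)) is split into an integer power of
# two (a shift) and a fractional power of two served by a small LUT.
# All bit widths and scales below are assumptions, not the paper's.
import numpy as np

FRAC_BITS = 5                                   # assumed LUT address width (32 entries)
LUT = np.round(2.0 ** (np.arange(2 ** FRAC_BITS) / 2 ** FRAC_BITS) * 256) / 256  # 8-bit fixed-point entries

def lut_softmax(scores_int8, scale=1.0 / 16):   # 'scale' is an assumed INT8 dequantization step
    x = scores_int8.astype(np.float64) * scale
    x = x - x.max()                             # max subtraction keeps exponents non-positive
    t = x * np.log2(np.e)                       # exp(x) == 2**t
    q_int = np.floor(t).astype(int)             # integer part -> barrel shift in hardware
    frac_idx = np.floor((t - q_int) * 2 ** FRAC_BITS).astype(int)  # fractional part -> LUT index
    e = LUT[frac_idx] * np.exp2(q_int)          # shift-and-lookup reconstruction of exp(x)
    return e / e.sum()

scores = np.random.default_rng(0).integers(-64, 64, size=16).astype(np.int8)
ref = np.exp(scores * (1.0 / 16) - (scores * (1.0 / 16)).max())
ref = ref / ref.sum()
print("max |error| vs float softmax:", np.abs(lut_softmax(scores) - ref).max())
```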

Core claim

CIMple is a fully digital standard-cell SRAM-based CIM architecture for self-attention that overcomes the static-MAC limitation of prior CIM designs. It employs a novel dual-banked structure with 8-bit parallel weight feeding together with a LUT-based fixed-point implementation of split softmax. The 32 kb accelerator fabricated in 28 nm achieves 26.1 TOPS/W at 0.85 V and 2.31 TOPS/mm² at 1.2 V while preserving accuracy across various transformer models.

What carries the argument

Dual-banked fully digital CIM architecture using 8-bit parallel weight feeding and LUT-based fixed-point split softmax to handle both linear and nonlinear attention operations inside the same SRAM array.
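
To make that division of labor concrete, the toy sketch below runs one attention head as two INT8 matrix-multiply stages with wide integer accumulation (the kind of work a digital SRAM-CIM bank performs as multiply-accumulates), leaving softmax as the single step handed to a LUT block. Dimensions, requantization shifts, and scales are illustrative assumptions, not the paper's.

```python
# Editorial sketch, not CIMple's dataflow: INT8 linear algebra with int32
# accumulation for the weight-projection and activation-to-activation stages,
# with softmax isolated as the lone nonlinear step.
import numpy as np

def int8_matmul(a, b):
    return a.astype(np.int32) @ b.astype(np.int32)           # wide accumulation, as in a digital adder tree

def requant(acc, shift):
    return np.clip(acc >> shift, -128, 127).astype(np.int8)  # crude rescale back to INT8 (assumed shift)

rng = np.random.default_rng(0)
seq, d = 8, 16                                               # toy sizes, not the paper's
x = rng.integers(-64, 64, (seq, d), dtype=np.int8)           # INT8 activations
wq, wk, wv = (rng.integers(-64, 64, (d, d), dtype=np.int8) for _ in range(3))

q = requant(int8_matmul(x, wq), 8)                           # weight-projection stage (static weights in a bank)
k = requant(int8_matmul(x, wk), 8)
v = requant(int8_matmul(x, wv), 8)
s = int8_matmul(q, k.T)                                      # activation-to-activation stage (QK^T)
p = np.exp((s - s.max(axis=-1, keepdims=True)) / 64.0)       # softmax: the step the LUT block replaces
p = p / p.sum(axis=-1, keepdims=True)
out = p @ v.astype(np.float64)                               # score-times-value stage
print(out.shape)                                             # (8, 16)
```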

If this is right

  • Self-attention can be executed with far less off-array data movement than conventional digital accelerators.
  • Nonlinear functions such as softmax become practical inside CIM arrays when realized as small fixed-point LUTs.
  • The same hardware template can be reused across multiple transformer variants without analog redesign.
  • INT8 precision at these efficiency levels becomes viable for battery-powered edge inference of large models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same dual-banking and LUT approach could be extended to other attention variants such as multi-query or grouped-query attention.
  • Full end-to-end transformer layers might be built by tiling multiple CIMple blocks for feed-forward and embedding stages.
  • The fixed-point LUT technique for nonlinearities could be reused for activation functions in other neural-network accelerators.

Load-bearing premise

The LUT-based fixed-point implementation reduces latency with minimal accuracy degradation while supporting various transformer models through the dual-banked fully digital architecture.

What would settle it

Silicon measurements on the 28 nm chip showing either energy efficiency below 20 TOPS/W at 0.85 V or more than 2 percent accuracy loss on standard transformer benchmarks such as GLUE.
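
Silicon and benchmark runs are the only measurements that would actually settle this, but a cheap software proxy exists for the softmax half of the risk: sweep sequence lengths and measure how far a LUT-style fixed-point softmax drifts from a full-precision reference. The sketch below is that proxy under assumed bit widths and an assumed score distribution; tight agreement would make a greater-than-2-percent benchmark loss from the softmax alone look unlikely, while large drift would flag risk early.

```python
# Editorial pre-silicon proxy, with assumed LUT parameters: mean-squared error
# of a LUT-based split softmax against a float64 reference softmax, swept over
# sequence length. Scores are drawn from a dequantized-INT8-like distribution,
# which is itself an assumption.
import numpy as np

FRAC_BITS = 5
LUT = np.round(2.0 ** (np.arange(2 ** FRAC_BITS) / 2 ** FRAC_BITS) * 256) / 256  # 8-bit LUT entries

def lut_softmax(x):
    t = (x - x.max(axis=-1, keepdims=True)) * np.log2(np.e)   # exp(x) == 2**t
    q = np.floor(t)                                            # integer part -> shift
    e = LUT[np.floor((t - q) * 2 ** FRAC_BITS).astype(int)] * np.exp2(q)
    return e / e.sum(axis=-1, keepdims=True)

def ref_softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
for seq_len in (128, 256, 512, 1024, 2048):
    # 64 sample query rows per length, scores in a dequantized INT8 range
    scores = rng.integers(-64, 64, (64, seq_len)).astype(np.float64) / 16.0
    mse = np.mean((lut_softmax(scores) - ref_softmax(scores)) ** 2)
    print(f"seq_len={seq_len:4d}  LUT-vs-float softmax MSE: {mse:.3e}")
```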

Figures

Figures reproduced from arXiv: 2604.15944 by Bas Ahn, Henk Corporaal, Manil Dev Gomony, Marc Geilen, Xingjian Tao.

Figure 1. High-level view of the proposed accelerator for self…
Figure 2. High-level view of different transformer types. With…
Figure 3. Computation flow of multi-head attention, showing the weight projection stage, activation-to-activation stage and the…
Figure 4. Architecture of CIMple showing 32kb CIM based self-attention accelerator (yellow) including an intermediate buffer…
Figure 6. 16b SRAM block structure with two banks, every two…
Figure 7. The LUT-based split softmax computation flow consists…
Figure 8. Measurements result TOPS/W for different levels of…
Figure 9. Pie chart illustrating the power (a) breakdown of the…
Figure 10. Layout of proposed self-attention accelerator showing…
Figure 11. Accuracy comparison for running the lm-evaluation…
read the original abstract

Large Language Models (LLMs) such as LLaMA and DeepSeek, are built on transformer architectures, which have become a standard model for achieving state-of-the-art performance in natural language processing tasks. Recently, there has been growing interest in deploying LLMs on edge devices. Although smaller LLM models are being proposed, they often still contain billions of parameters. Since edge devices are limited in their resources this poses a significant challenge for edge deployment. Compute-in-memory (CIM) is a promising architecture that addresses this by reducing data movement through the integration of computational logic directly into memory. However, existing CIM architectures support only static Multiply-Accumulate (MAC) operations which limit their configurability in supporting nonlinear operations and various types of transformer models. This paper presents a fully digital standard-cell SRAM-based CIM architecture accelerator for self-attention, called CIMple, designed to overcome these limitations, inside transformer models. The key contributions of CIMple are: 1) A novel dual-banked CIM-based fully digital self-attention accelerator using 8-bit parallel weight feeding. 2) A look-up-table (LUT) based fixed-point implementation reducing latency with minimal accuracy degradation. 3) A performance evaluation of a 32kb CIM-based self-attention accelerator implemented in 28nm, which achieves 26.1 TOPS/W at 0.85V and 2.31 TOPS/mm$^2$ at 1.2V, both with INT8 precision.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents CIMple, a fully digital standard-cell SRAM-based compute-in-memory (CIM) accelerator for self-attention in transformers. Key features include a dual-banked architecture with 8-bit parallel weight feeding and an LUT-based fixed-point split softmax to enable nonlinear operations while reducing latency. A 32 kb prototype fabricated in 28 nm CMOS is evaluated, reporting 26.1 TOPS/W at 0.85 V and 2.31 TOPS/mm² at 1.2 V, both at INT8 precision, with claims of supporting multiple transformer models and minimal accuracy degradation.

Significance. If the hardware measurements and accuracy claims hold under detailed scrutiny, the work would demonstrate a practical, configurable CIM solution for attention acceleration that overcomes the static-MAC limitation of prior CIM designs. The concrete 28 nm efficiency numbers and fully digital approach could inform edge LLM deployment, provided the LUT softmax and dual-banked design generalize across models.

major comments (3)
  1. [Abstract and Results] The abstract and results section report concrete TOPS/W and TOPS/mm² figures from a 28 nm implementation, yet the manuscript provides neither full circuit schematics nor an error analysis (e.g., bit-error rates or voltage scaling effects) that would substantiate these metrics as load-bearing evidence for the claimed efficiency advantage.
  2. [Architecture and LUT Softmax] The LUT-based fixed-point split softmax is asserted to reduce latency with only minimal accuracy degradation and to support various transformer models, but no quantitative accuracy tables (e.g., perplexity or attention-score MSE versus FP32 baselines) or ablation across sequence lengths/models appear; this directly underpins the weakest assumption and the generality claim.
  3. [Architecture Description] The dual-banked fully digital architecture with 8-bit parallel weight feeding is presented as novel, but the manuscript lacks a direct comparison table against prior SRAM-CIM or digital attention accelerators on the same 28 nm node or equivalent metrics, making it difficult to assess whether the reported numbers represent a genuine advance.
minor comments (2)
  1. [Abstract] The phrase 'inside transformer models' in the abstract is unclear; rephrase to indicate that the accelerator targets the self-attention computation within transformer blocks.
  2. [LUT-based Softmax] Notation for the LUT split-softmax (e.g., bit-width partitioning and fixed-point scaling factors) should be defined explicitly in the first use, preferably with a small equation or diagram.
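
On the second minor comment, one plausible way to write the split down (an editorial guess at the notation, not the paper's published definition; q_f denotes an assumed number of fractional address bits, and LUT[m] stores a fixed-point approximation of 2^(m / 2^{q_f})) is:

```latex
% Assumed notation for a LUT-based split softmax: the stabilized exponent is
% converted to base 2 and split into an integer shift and a q_f-bit fractional
% LUT lookup.
t_i = (x_i - \max_j x_j)\,\log_2 e, \qquad f_i = t_i - \lfloor t_i \rfloor \in [0,1),
\qquad
\mathrm{softmax}(x)_i \approx
\frac{2^{\lfloor t_i \rfloor}\,\mathrm{LUT}\!\left[\lfloor f_i\,2^{q_f} \rfloor\right]}
     {\sum_k 2^{\lfloor t_k \rfloor}\,\mathrm{LUT}\!\left[\lfloor f_k\,2^{q_f} \rfloor\right]}.
```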

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed review. We address each major comment below and will revise the manuscript to incorporate the suggested improvements where they strengthen the presentation of our results and claims.

read point-by-point responses
  1. Referee: [Abstract and Results] The abstract and results section report concrete TOPS/W and TOPS/mm² figures from a 28 nm implementation, yet the manuscript provides neither full circuit schematics nor an error analysis (e.g., bit-error rates or voltage scaling effects) that would substantiate these metrics as load-bearing evidence for the claimed efficiency advantage.

    Authors: We agree that additional substantiation would improve the manuscript. In the revision we will add a dedicated error-analysis subsection in the results that reports measured bit-error rates across operating voltages and discusses voltage-scaling effects on both efficiency and functional correctness. The current manuscript already contains block-level diagrams of the dual-banked SRAM array and LUT softmax; we will expand these with more detailed sub-block schematics and will include the complete transistor-level schematics in the supplementary material. revision: partial

  2. Referee: [Architecture and LUT Softmax] The LUT-based fixed-point split softmax is asserted to reduce latency with only minimal accuracy degradation and to support various transformer models, but no quantitative accuracy tables (e.g., perplexity or attention-score MSE versus FP32 baselines) or ablation across sequence lengths/models appear; this directly underpins the weakest assumption and the generality claim.

    Authors: The referee correctly notes the absence of quantitative accuracy data. We will add a new evaluation subsection containing tables that report perplexity and attention-score MSE relative to FP32 baselines for representative transformer models (BERT, GPT-2, LLaMA-7B). We will also include ablation results across sequence lengths (128–2048) and multiple model scales to quantify the latency–accuracy trade-off of the LUT-based split softmax and to support the generality claim. revision: yes

  3. Referee: [Architecture Description] The dual-banked fully digital architecture with 8-bit parallel weight feeding is presented as novel, but the manuscript lacks a direct comparison table against prior SRAM-CIM or digital attention accelerators on the same 28 nm node or equivalent metrics, making it difficult to assess whether the reported numbers represent a genuine advance.

    Authors: We concur that a side-by-side comparison is necessary. The revised manuscript will contain a new comparison table that places CIMple against recent SRAM-CIM and digital attention accelerators. Where possible, metrics will be normalized to 28 nm; the table will list TOPS/W, TOPS/mm², precision, supported operations, and architectural features so that the advantages of the dual-banked 8-bit parallel design and LUT softmax can be directly evaluated. revision: yes
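
As a sketch of what "normalized to 28 nm" could mean in practice, the snippet below applies a common first-order heuristic in which dynamic energy per operation scales roughly with feature size and the square of supply voltage. The scaling law, the comparison entries, and their voltages are illustrative assumptions, not values from the paper or from specific prior work, so treat the output as a rough sanity check rather than a substitute for the promised table.

```python
# Hedged first-order normalization of reported energy efficiency (TOPS/W) to a
# 28 nm / 0.85 V reference point, assuming dynamic energy per op scales roughly
# with (feature size) * Vdd^2. Real designs deviate from this heuristic.
def normalize_tops_per_w(tops_per_w, node_nm, vdd, ref_node_nm=28.0, ref_vdd=0.85):
    energy_scale = (ref_node_nm / node_nm) * (ref_vdd / vdd) ** 2  # energy grows when mapped to 28 nm
    return tops_per_w / energy_scale

# Placeholder comparison entries (name, node in nm, Vdd, reported TOPS/W);
# only the first row comes from this paper, the others are hypothetical.
designs = [
    ("this work (28 nm, 0.85 V)", 28, 0.85, 26.1),
    ("hypothetical 7 nm macro",    7, 0.80, 300.0),
    ("hypothetical 22 nm macro",  22, 0.75, 40.0),
]
for name, node, vdd, tw in designs:
    norm = normalize_tops_per_w(tw, node, vdd)
    print(f"{name:28s} reported {tw:6.1f} TOPS/W -> ~{norm:6.1f} TOPS/W at 28 nm, 0.85 V")
```

Any such normalization is blunt; precision, macro size, and whether the nonlinearity is handled on-chip matter at least as much as the node, which is why the side-by-side table should list them explicitly, as the response above proposes.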

Circularity Check

0 steps flagged

No circularity: hardware metrics from physical implementation

full rationale

The paper presents an architectural design for a CIM-based self-attention accelerator and reports measured performance (TOPS/W, TOPS/mm²) from a 28nm physical implementation. No equations, derivations, fitted parameters, or self-citation chains are used to 'predict' results; claims rest on direct hardware measurements and standard design descriptions. The evaluation is self-contained, and at no point do the reported outputs reduce to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The design rests on standard assumptions of digital CMOS fabrication and fixed-point arithmetic accuracy trade-offs; no new physical entities or ad-hoc fitted constants are introduced beyond the reported voltages and bit widths.

axioms (2)
  • domain assumption Standard-cell SRAM cells can be reliably used for both storage and in-memory computation without custom analog circuits.
    Invoked in the description of the fully digital standard-cell SRAM-based CIM architecture.
  • domain assumption LUT approximation of softmax introduces only minimal accuracy loss for transformer attention.
    Stated as part of contribution 2 in the abstract.

pith-pipeline@v0.9.0 · 5581 in / 1374 out tokens · 49739 ms · 2026-05-10T07:53:42.552592+00:00 · methodology

discussion (0)

