pith. machine review for the scientific record.

arxiv: 2602.06252 · v2 · submitted 2026-02-05 · 💻 cs.AR

D-Legion: A Scalable Many-Core Architecture for Accelerating Matrix Multiplication in Quantized LLMs

Pith reviewed 2026-05-16 06:30 UTC · model grok-4.3

classification 💻 cs.AR
keywords quantized LLMs · matrix multiplication accelerator · many-core architecture · systolic arrays · block-structured sparsity · BitNet models · partial sum reduction · scalable accelerator

The pith

D-Legion uses groups of adaptive-precision systolic arrays in a scalable many-core layout to accelerate matrix multiplication for quantized LLMs by exploiting block-structured sparsity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces D-Legion as a many-core architecture built from Legions, where each Legion contains multiple adaptive-precision systolic array cores. These cores handle quantized matrix multiplies in fully sparse, partially sparse, or dense modes while parallel accumulators cut partial-sum memory traffic and multicasting improves data reuse across tiles. On attention workloads from two BitNet models the design reports up to 8.2 times lower latency and 3.8 times higher memory savings than prior accelerators, with a 64-core configuration reaching 135.68 TOPS at 1 GHz. A larger 32-Legion version also shows lower latency and higher throughput than a Google TPUv4i baseline. If the mapping from real LLM sparsity patterns to the window modes holds, the architecture would let quantized models run faster on less memory without changing the model itself.
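The headline throughput figure can be unpacked with a little arithmetic. The sketch below converts 135.68 TOPS at 1 GHz into operations per clock cycle per core and per Legion. It assumes the peak is spread evenly over the 64 cores and that the eight Legions hold eight cores each; the function name and the even-split assumption are illustrative and not taken from the paper.

```python
from fractions import Fraction

def ops_per_cycle(peak_tops: str, freq_ghz: str, units: int) -> Fraction:
    """Peak operations per clock cycle for each of `units` equal shares.

    TOPS / GHz = 1e12 / 1e9 = 1000 operations per cycle.
    Exact rational arithmetic avoids floating-point rounding noise.
    """
    return Fraction(peak_tops) * 1000 / Fraction(freq_ghz) / units

# Figures quoted in the abstract: 135.68 TOPS, 1 GHz, 8 Legions, 64 cores.
per_core = ops_per_cycle("135.68", "1", 64)
per_legion = ops_per_cycle("135.68", "1", 8)  # ASSUMPTION: 8 cores per Legion

print(f"ops/cycle per core:   {per_core}")
print(f"ops/cycle per Legion: {per_legion}")
```

How those per-core operations map onto processing elements depends on the operand precision and on how the paper counts an operation, neither of which the abstract states.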

Core claim

D-Legion is a scalable many-core architecture composed of Legions, each containing adaptive-precision systolic array cores, that accelerates matrix multiplication in quantized LLMs. It supports fully-sparse and partially-sparse window modes to exploit block-structured sparsity, uses parallel accumulators to reduce partial-sum memory accesses, and applies optimized scheduling with multicasting to maximize data reuse across Legions. Evaluation on attention workloads from two BitNet models shows up to 8.2 times lower latency, 3.8 times higher memory savings, and 3 times higher partial-sum memory savings versus prior work; an eight-Legion, 64-core instance reaches 135.68 TOPS at 1 GHz, and a 32-Legion version is compared with Google TPUv4i, showing up to 2.5 times lower total latency, 2.3 times higher total throughput, and 2.7 times higher total memory savings.

What carries the argument

The Legion, a group of adaptive-precision systolic arrays that switch between fully-sparse, partially-sparse, and dense modes while using parallel accumulators to cut partial-sum traffic and multicast tiles for reuse.
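To make the partial-sum argument concrete, here is a toy model of a matrix multiply whose reduction (K) dimension is split into slices. It is a sketch under stated assumptions, not the paper's accumulator design: `psum_matmul`, the `cores` parameter, and the accounting (one read or write per whole output tile) are all hypothetical. With `cores=1`, every slice result is merged through psum memory; with several cores, slice results are summed first and only the merged tile touches memory.

```python
def matmul_tile(a, b):
    """Dense product of two small matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def add_inplace(acc, tile):
    """Element-wise acc += tile."""
    for acc_row, tile_row in zip(acc, tile):
        for j, value in enumerate(tile_row):
            acc_row[j] += value

def psum_matmul(a, b, k_tile, cores):
    """Multiply a (M x K) by b (K x N), splitting K into k_tile-wide slices.

    ASSUMPTION: `cores` slices are computed side by side and summed by a
    parallel accumulator before psum memory is touched.
    Returns (result, psum_reads, psum_writes), counted per output tile.
    """
    m, k, n = len(a), len(b), len(b[0])
    psum = [[0] * n for _ in range(m)]
    reads = writes = 0
    slices = [(s, min(s + k_tile, k)) for s in range(0, k, k_tile)]
    for g in range(0, len(slices), cores):
        merged = [[0] * n for _ in range(m)]
        for lo, hi in slices[g:g + cores]:
            a_slice = [row[lo:hi] for row in a]
            add_inplace(merged, matmul_tile(a_slice, b[lo:hi]))
        if g > 0:
            reads += 1   # fetch the running partial sum from memory
        add_inplace(psum, merged)
        writes += 1      # store the updated partial sum
    return psum, reads, writes

a = [[1, 0, -1, 1, 0, 1, -1, 0],
     [0, 1, 1, -1, 1, 0, 0, -1]]
b = [[1, 2], [0, 1], [-1, 0], [2, 1], [1, 1], [0, -1], [1, 0], [2, 2]]

serial = psum_matmul(a, b, k_tile=2, cores=1)
spatial = psum_matmul(a, b, k_tile=2, cores=4)
print("serial  reads/writes:", serial[1], serial[2])
print("spatial reads/writes:", spatial[1], spatial[2])
```

Both runs produce the same product; only the number of psum memory accesses differs, which is the effect the review attributes to the parallel accumulators.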

If this is right

  • Attention layers in quantized models such as BitNet can be executed with substantially lower latency and on-chip memory.
  • Adding more Legions scales throughput while preserving the reported memory reductions, as suggested by the 32-Legion comparison to TPUv4i.
  • Partial-sum memory traffic drops by up to 3 times, lowering overall bandwidth pressure in large-matrix workloads.
  • The same cores can switch between sparse and dense modes, allowing one accelerator to serve both sparse attention and dense feed-forward layers.
  • Optimized tile multicasting across Legions increases effective data reuse, reducing off-chip accesses for the same matrix sizes.
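The data-reuse point in the last bullet can be illustrated with a simple counting model. The sketch assumes each Legion loads its own private tiles of one operand while all Legions need the same tiles of the other; this partitioning, the function name, and the tile counts are made up for illustration and are not the paper's schedule.

```python
def offchip_tile_loads(num_legions: int, private_tiles_per_legion: int,
                       shared_tiles: int, multicast: bool) -> int:
    """Count off-chip tile loads when every Legion needs the same shared tiles.

    With multicast, a shared tile is fetched once and delivered to all
    Legions; without it, each Legion fetches its own copy.
    """
    private = num_legions * private_tiles_per_legion
    shared = shared_tiles if multicast else num_legions * shared_tiles
    return private + shared

# Hypothetical workload: 8 Legions, 4 private tiles each, 16 shared tiles.
without_multicast = offchip_tile_loads(8, 4, 16, multicast=False)
with_multicast = offchip_tile_loads(8, 4, 16, multicast=True)
print(without_multicast, with_multicast)
```

In this model the saving grows with the number of Legions that share a tile, which is why multicast matters more as the design scales out.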

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar window-based sparsity handling could be applied to other sparse linear algebra kernels outside language models, such as graph neural networks.
  • If control overhead stays low at larger scales, the design points toward energy-efficient accelerators for edge inference of quantized models.
  • The block-window approach might be combined with software sparsity pruning techniques that enforce the same regularity to close the gap between measured and peak performance.

Load-bearing premise

Real quantized LLM workloads contain enough regular block-structured sparsity that maps cleanly onto the sparse window modes without large extra control or routing costs in hardware.
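One way to probe this premise on a given weight matrix is to check how its zeros fall relative to a block grid. The sketch below labels each block as all-zero, all-nonzero, or mixed. The labels and the block size are hypothetical; the abstract does not specify how the paper's fully-sparse and partially-sparse windows are defined, so this is not a model of those modes.

```python
def block_sparsity_profile(matrix, block):
    """Classify each block x block tile as 'zero', 'mixed', or 'dense'."""
    rows, cols = len(matrix), len(matrix[0])
    counts = {"zero": 0, "mixed": 0, "dense": 0}
    for r in range(0, rows, block):
        for c in range(0, cols, block):
            tile = [v for row in matrix[r:r + block] for v in row[c:c + block]]
            nonzeros = sum(1 for v in tile if v != 0)
            if nonzeros == 0:
                counts["zero"] += 1      # whole tile can be skipped
            elif nonzeros == len(tile):
                counts["dense"] += 1     # no sparsity to exploit
            else:
                counts["mixed"] += 1     # zeros present but not block-aligned
    return counts

# Two ternary matrices with the same number of zeros (8 of 16 entries).
aligned = [[0, 0, 1, -1],
           [0, 0, 1, 1],
           [-1, 1, 0, 0],
           [1, 1, 0, 0]]
scattered = [[0, 1, 0, 1],
             [1, 0, 1, 0],
             [0, 1, 0, 1],
             [1, 0, 1, 0]]

print(block_sparsity_profile(aligned, 2))
print(block_sparsity_profile(scattered, 2))
```

The two matrices have identical overall sparsity, yet only the first offers blocks that can be skipped outright; the second is the kind of pattern that would undercut the premise.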

What would settle it

A measured run on a representative BitNet attention workload that shows high control overhead or low core utilization because the sparsity patterns do not align with the block windows.

Figures

Figures reproduced from arXiv: 2602.06252 by Ahmed J. Abdelmaksoud, Cristian Sestito, Shiwei Wang, Themis Prodromakis.

Figure 1: Different attention layer types, including standard multi-head attention
Figure 2: A comprehensive analysis of single large core versus many smaller cores with the same number of PEs. (a) Input bandwidths of core(s), accumulation,
Figure 3: Granularity analysis of the cores per Legion. (a) Input bandwidths of the Legion, accumulators, and psum memories versus core topology. (b) TFU per
Figure 4: Per-Legion configuration rate index (CRI), evaluating each Legion
Figure 5: (a) D-Legion architecture block diagram, consisting of
Figure 6: Attention workloads distribution for the evaluated models: BitNet
Figure 7: Latency comparison across four hardware architectures (WS, DiP, ADiP, and D-Legion) for BitNet-1.58B and BitNet-1.58B-KV models. (a) Per-stage
Figure 8: Throughput comparison across four hardware architectures (WS, DiP, ADiP, and D-Legion) for BitNet-1.58B and BitNet-1.58B-KV models at
Figure 9: Memory access comparison across four hardware architectures (WS, DiP, ADiP, and D-Legion) for BitNet-1.58B and BitNet-1.58B-KV models. (a)
Figure 10: A comparison of psum memory access across four hardware architectures (WS, DiP, ADiP, and D-Legion) for BitNet-1.58B and BitNet-1.58B-KV
Figure 11: A comparison between modeled Google TPUv4i and D-Legion V2 with the same number of PEs using attention workloads from BitNet-1.58B and
Original abstract

The performance gains obtained by large language models (LLMs) are closely linked to their substantial computational and memory requirements. Quantized LLMs offer significant advantages with extremely quantized models, motivating the development of specialized architectures to accelerate their workloads. This paper proposes D-Legion, a novel scalable many-core architecture, designed using many adaptive-precision systolic array cores, to accelerate matrix multiplication in quantized LLMs. The proposed architecture consists of a set of Legions where each Legion has a group of adaptive-precision systolic arrays. D-Legion supports multiple computation modes, including quantized sparse and dense matrix multiplications. The block structured sparsity is exploited within a fully-sparse, or partially-sparse windows. In addition, memory accesses of partial summations (psums) are spatially reduced through parallel accumulators. Furthermore, data reuse is maximized through optimized scheduling techniques by multicasting matrix tiles across the Legions. A comprehensive design space exploration is performed in terms of Legion/core granularity to determine the optimal Legion configuration. Moreover, D-Legion is evaluated on attention workloads from two BitNet models, delivering up to 8.2$\times$ lower latency, up to 3.8$\times$ higher memory savings, and up to 3$\times$ higher psum memory savings compared to state-of-the-art work. D-Legion, with eight Legions and 64 total cores, achieves a peak throughput of 135.68 TOPS at a frequency of 1 GHz. A scaled version of D-Legion, with 32 Legions, is compared to Google TPUv4i, achieving up to 2.5$\times$ lower total latency, up to 2.3$\times$ higher total throughput, and up to 2.7$\times$ higher total memory savings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes D-Legion, a scalable many-core architecture with adaptive-precision systolic array cores grouped into Legions to accelerate matrix multiplication for quantized LLMs. It supports quantized sparse and dense modes by exploiting block-structured sparsity in fully-sparse or partially-sparse windows, reduces partial sum (psum) memory accesses via parallel accumulators, and maximizes data reuse through multicast scheduling across Legions. A design-space exploration determines optimal Legion/core granularity. On attention workloads from two BitNet models, it reports up to 8.2× lower latency, 3.8× memory savings, and 3× psum savings versus prior work; an 8-Legion/64-core configuration reaches 135.68 TOPS at 1 GHz, and a 32-Legion scale-up outperforms TPUv4i by up to 2.5× latency, 2.3× throughput, and 2.7× memory savings.

Significance. If the sparsity-exploitation claims hold with quantified low overhead, the work would be significant for specialized accelerators targeting quantized LLM inference, demonstrating concrete gains in latency, memory, and throughput over both academic baselines and a commercial TPU. The Legion granularity exploration and parallel-accumulator psum reduction are concrete, reusable ideas.

major comments (2)
  1. [Evaluation] Abstract and Evaluation: The headline quantitative results (135.68 TOPS, 8.2× latency, 3.8× memory savings) are obtained only when fully-sparse and partially-sparse window modes are applied to the BitNet attention matrices. No cycle-accurate breakdown of control-logic cost, multicast-routing contention, or mode-switch overhead is supplied, leaving the central assumption—that block-structured sparsity maps to these modes with negligible hardware cost—unverified and load-bearing for all reported speedups.
  2. [Abstract] Abstract: No simulation tools, synthesis flow, workload characterization details, baseline implementations, or area/power measurement methodology are described, making it impossible to assess whether the reported TOPS, latency, and memory figures are supported by the underlying design.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.

point-by-point responses
  1. Referee: [Evaluation] Abstract and Evaluation: The headline quantitative results (135.68 TOPS, 8.2× latency, 3.8× memory savings) are obtained only when fully-sparse and partially-sparse window modes are applied to the BitNet attention matrices. No cycle-accurate breakdown of control-logic cost, multicast-routing contention, or mode-switch overhead is supplied, leaving the central assumption—that block-structured sparsity maps to these modes with negligible hardware cost—unverified and load-bearing for all reported speedups.

    Authors: We agree that the headline gains are realized when the sparse modes are active on the BitNet attention matrices, which exhibit the block-structured sparsity our design targets. The manuscript currently relies on the architectural description and aggregate results to imply low overhead, without an explicit cycle-accurate breakdown of control logic, routing contention, or mode-switch costs. In the revised version we will add a dedicated evaluation subsection that reports these overheads from our RTL-level cycle-accurate simulator, including their contribution to total latency and power as percentages across the evaluated configurations. This will directly verify the negligible-cost assumption for the reported speedups. revision: yes

  2. Referee: [Abstract] Abstract: No simulation tools, synthesis flow, workload characterization details, baseline implementations, or area/power measurement methodology are described, making it impossible to assess whether the reported TOPS, latency, and memory figures are supported by the underlying design.

    Authors: The abstract is intentionally concise and therefore omits methodological details. The full manuscript contains the evaluation setup, but we acknowledge that the description of simulation tools, synthesis flow, workload characterization, baseline implementations, and area/power methodology is not sufficiently prominent or complete. We will revise the abstract to include a brief methodology sentence and expand the evaluation section with a dedicated subsection that explicitly states the tools, flow, workload details for the two BitNet models, baseline designs, and measurement methodology used to obtain the TOPS, latency, and memory figures. revision: yes

Circularity Check

0 steps flagged

No circularity: architecture description and empirical results contain no self-referential derivations or fitted predictions

full rationale

The manuscript describes a hardware architecture (Legions of adaptive-precision systolic arrays, fully-sparse and partially-sparse window modes, parallel accumulators, multicast scheduling) and reports measured throughput (135.68 TOPS) and speedups on BitNet attention workloads. No equations, parameter-fitting steps, or derivation chains appear that reduce a claimed result to its own inputs by construction. Performance numbers are presented as outcomes of design-space exploration and evaluation rather than as outputs forced by internal definitions or self-citations. The load-bearing claims therefore remain independent of any circular reduction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The design rests on standard assumptions about systolic-array efficiency and LLM sparsity patterns but introduces new named components without external validation data.

free parameters (1)
  • Legion/core granularity
    Optimal configuration selected after design-space exploration; specific values (8 Legions, 64 cores) are presented as chosen for peak results.
axioms (1)
  • domain assumption: Block-structured sparsity patterns exist in quantized LLM attention workloads and can be exploited by fully-sparse or partially-sparse windows without prohibitive overhead.
    Invoked to justify support for sparse and dense modes and memory-access reductions.
invented entities (1)
  • Legion (no independent evidence)
    purpose: Basic scalable unit consisting of a group of adaptive-precision systolic arrays.
    New architectural abstraction introduced to organize the many-core design.

pith-pipeline@v0.9.0 · 5653 in / 1602 out tokens · 47243 ms · 2026-05-16T06:30:16.148833+00:00 · methodology

