arxiv: 2602.06252 · v2 · submitted 2026-02-05 · 💻 cs.AR

D-Legion: A Scalable Many-Core Architecture for Accelerating Matrix Multiplication in Quantized LLMs

Ahmed J. Abdelmaksoud, Cristian Sestito, Shiwei Wang, Themis Prodromakis

Pith reviewed 2026-05-16 06:30 UTC · model grok-4.3

classification 💻 cs.AR

keywords quantized LLMs · matrix multiplication accelerator · many-core architecture · systolic arrays · block-structured sparsity · BitNet models · partial sum reduction · scalable accelerator

The pith

D-Legion uses groups of adaptive-precision systolic arrays in a scalable many-core layout to accelerate matrix multiplication for quantized LLMs by exploiting block-structured sparsity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces D-Legion as a many-core architecture built from Legions, where each Legion contains multiple adaptive-precision systolic array cores. These cores handle quantized matrix multiplies in fully sparse, partially sparse, or dense modes while parallel accumulators cut partial-sum memory traffic and multicasting improves data reuse across tiles. On attention workloads from two BitNet models the design reports up to 8.2 times lower latency and 3.8 times higher memory savings than prior accelerators, with a 64-core configuration reaching 135.68 TOPS at 1 GHz. A larger 32-Legion version also shows lower latency and higher throughput than a Google TPUv4i baseline. If the mapping from real LLM sparsity patterns to the window modes holds, the architecture would let quantized models run faster on less memory without changing the model itself.
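
As a quick sanity check on the headline figure, the peak throughput can be converted into operations per clock cycle. The sketch below uses only the numbers quoted above (135.68 TOPS, 1 GHz, 64 cores, eight Legions). The helper name `ops_per_cycle` is hypothetical, and how the paper defines one "operation" (and at which precision) is not stated here, so the per-core value is an average rather than a count of processing elements.

```python
# Back-of-the-envelope check of the quoted peak throughput.
# Inputs are the figures quoted in the summary; what counts as one
# "operation" is not specified here, so treat the results as averages.

def ops_per_cycle(peak_tops: float, freq_ghz: float) -> float:
    """Operations completed per clock cycle at peak (hypothetical helper)."""
    return (peak_tops * 1e12) / (freq_ghz * 1e9)

PEAK_TOPS = 135.68
FREQ_GHZ = 1.0
CORES = 64
LEGIONS = 8

total = ops_per_cycle(PEAK_TOPS, FREQ_GHZ)
per_core = total / CORES
per_legion = total / LEGIONS

print(f"ops/cycle total:      {total:,.0f}")
print(f"ops/cycle per core:   {per_core:,.0f}")
print(f"ops/cycle per Legion: {per_legion:,.0f}")
```

Under these inputs the peak corresponds to about 135,680 operations per cycle, or roughly 2,120 per core.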

Core claim

D-Legion is a scalable many-core architecture composed of Legions, each containing adaptive-precision systolic array cores, that accelerates matrix multiplication in quantized LLMs. It supports fully-sparse and partially-sparse window modes to exploit block-structured sparsity, uses parallel accumulators to reduce partial-sum memory accesses, and applies optimized scheduling with multicasting to maximize data reuse across Legions. Evaluation on attention workloads from two BitNet models shows up to 8.2 times lower latency, 3.8 times higher memory savings, and 3 times higher partial-sum memory savings versus prior work; an eight-Legion, 64-core instance reaches 135.68 TOPS at 1 GHz, and a 32-

What carries the argument

The Legion, a group of adaptive-precision systolic arrays that switch between fully-sparse, partially-sparse, and dense modes while using parallel accumulators to cut partial-sum traffic and multicast tiles for reuse.
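
To make the mode switching concrete, here is a minimal sketch of block-level zero skipping on a ternary weight matrix. It rests on an assumption that is not taken from the paper: a "window" is treated as one row of `b x b` blocks, and its mode is chosen from how many of those blocks are entirely zero. The names `block_is_zero`, `window_mode`, and `blocked_matmul` are illustrative, and the real hardware control logic may work differently.

```python
# Toy model of window-mode selection and zero-block skipping.
# ASSUMPTION: a "window" is one row of b x b blocks of the weight matrix and
# its mode depends on how many of those blocks are all-zero. The paper's
# actual window definition and control logic may differ.
from typing import List, Tuple

Matrix = List[List[int]]

def block_is_zero(a: Matrix, r0: int, c0: int, b: int) -> bool:
    """True if the b x b block whose top-left corner is (r0, c0) is all zeros."""
    return all(a[r][c] == 0 for r in range(r0, r0 + b) for c in range(c0, c0 + b))

def window_mode(a: Matrix, r0: int, b: int) -> str:
    """Classify the block-row starting at row r0 (hypothetical rule)."""
    n_blocks = len(a[0]) // b
    zero_blocks = sum(block_is_zero(a, r0, j * b, b) for j in range(n_blocks))
    if zero_blocks == n_blocks:
        return "fully-sparse"
    if zero_blocks > 0:
        return "partially-sparse"
    return "dense"

def blocked_matmul(a: Matrix, x: Matrix, b: int) -> Tuple[Matrix, int, int]:
    """Compute a @ x block by block, skipping all-zero blocks of a.

    Dimensions of a are assumed to be multiples of b.
    Returns (result, blocks_computed, blocks_skipped).
    """
    m, k, n = len(a), len(a[0]), len(x[0])
    out = [[0] * n for _ in range(m)]
    computed = skipped = 0
    for r0 in range(0, m, b):
        for k0 in range(0, k, b):
            if block_is_zero(a, r0, k0, b):
                skipped += 1  # no multiply-accumulates issued for this block
                continue
            computed += 1
            for r in range(r0, r0 + b):
                for kk in range(k0, k0 + b):
                    for c in range(n):
                        out[r][c] += a[r][kk] * x[kk][c]
    return out, computed, skipped

# Ternary weights with one window of each kind (block size 2).
A = [
    [1, -1, 0,  0],  # window 0: one non-zero block, one zero block
    [0,  1, 0,  0],
    [0,  0, 0,  0],  # window 1: every block is zero
    [0,  0, 0,  0],
    [1,  0, 0, -1],  # window 2: no zero blocks
    [0,  1, 1,  0],
]
X = [[1, 2], [3, 4], [5, 6], [7, 8]]

modes = [window_mode(A, r0, 2) for r0 in range(0, len(A), 2)]
result, computed, skipped = blocked_matmul(A, X, 2)
print(modes)
print(result, computed, skipped)
```

The skipped-block count is the quantity that translates into saved cycles in this toy model. Whether real attention matrices yield enough all-zero blocks is exactly the premise discussed further down.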

If this is right

Attention layers in quantized models such as BitNet can be executed with substantially lower latency and on-chip memory.
Adding more Legions scales throughput linearly while preserving the reported memory reductions, as shown by the 32-Legion comparison to TPUv4i.
Partial-sum memory traffic drops by up to 3 times, lowering overall bandwidth pressure in large-matrix workloads.
The same cores can switch between sparse and dense modes, allowing one accelerator to serve both sparse attention and dense feed-forward layers.
Optimized tile multicasting across Legions increases effective data reuse, reducing off-chip accesses for the same matrix sizes.
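
The multicasting point can be illustrated with a simple counting model. The sketch assumes a tiled product C = A × B in which each Legion works on a different column tile of B while sharing the same tile of A; with multicast, that shared A tile is fetched once per step instead of once per Legion. The function `tile_fetches` and the tile counts are illustrative assumptions, the model ignores on-chip buffering, and its output is not a result from the paper.

```python
# Toy count of off-chip tile fetches with and without multicasting.
# ASSUMPTION: Legions share an A tile and each owns a distinct B tile per step;
# on-chip buffering and the paper's real schedule are not modeled.
import math

def tile_fetches(m_tiles: int, k_tiles: int, n_tiles: int,
                 legions: int, multicast: bool) -> int:
    """Total A-tile plus B-tile fetches for a tiled matrix product."""
    # Every (i, k, j) tile triple needs its own B tile in both schemes.
    b_fetches = m_tiles * k_tiles * n_tiles
    if multicast:
        # One A fetch serves all Legions active in the same step.
        steps_per_a_tile = math.ceil(n_tiles / legions)
        a_fetches = m_tiles * k_tiles * steps_per_a_tile
    else:
        # Each Legion fetches its own copy of the A tile.
        a_fetches = m_tiles * k_tiles * n_tiles
    return a_fetches + b_fetches

unicast = tile_fetches(m_tiles=4, k_tiles=4, n_tiles=8, legions=8, multicast=False)
shared = tile_fetches(m_tiles=4, k_tiles=4, n_tiles=8, legions=8, multicast=True)
print(unicast, shared)
```

In this model the saving grows with the number of Legions that can consume the same tile, which is why the schedule matters as much as the multicast hardware.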

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar window-based sparsity handling could be applied to other sparse linear algebra kernels outside language models, such as graph neural networks.
If control overhead stays low at larger scales, the design points toward energy-efficient accelerators for edge inference of quantized models.
The block-window approach might be combined with software sparsity pruning techniques that enforce the same regularity to close the gap between measured and peak performance.

Load-bearing premise

Real quantized LLM workloads contain enough regular block-structured sparsity that maps cleanly onto the sparse window modes without large extra control or routing costs in hardware.

What would settle it

A measured run on a representative BitNet attention workload that shows high control overhead or low core utilization because the sparsity patterns do not align with the block windows.
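
A first-order version of that test can be run in software before any hardware measurement: compare element-level sparsity with block-level sparsity for the same matrix. The sketch below is a minimal illustration with synthetic matrices, not data from BitNet; the functions `element_sparsity` and `block_sparsity` and the block size of 2 are assumptions chosen for the example.

```python
# Element-level versus block-level sparsity on two synthetic ternary matrices.
# ASSUMPTION: both matrices are synthetic; no BitNet weights are used here.
from typing import List

Matrix = List[List[int]]

def element_sparsity(w: Matrix) -> float:
    """Fraction of individual entries that are zero."""
    total = len(w) * len(w[0])
    zeros = sum(v == 0 for row in w for v in row)
    return zeros / total

def block_sparsity(w: Matrix, b: int) -> float:
    """Fraction of b x b blocks in which every entry is zero."""
    rows, cols = len(w) // b, len(w[0]) // b
    zero_blocks = sum(
        all(w[r][c] == 0
            for r in range(bi * b, (bi + 1) * b)
            for c in range(bj * b, (bj + 1) * b))
        for bi in range(rows) for bj in range(cols)
    )
    return zero_blocks / (rows * cols)

N, B = 12, 2

def nonzero(r: int, c: int) -> int:
    """Alternate +1 and -1 so the non-zero entries are ternary-like."""
    return 1 if (r + c) % 2 == 0 else -1

# Zeros scattered along diagonals: one third of the entries, never a full block.
scattered = [[0 if (r + c) % 3 == 0 else nonzero(r, c) for c in range(N)]
             for r in range(N)]
# Zeros grouped into whole blocks: one third of the blocks are empty.
blocked = [[0 if (r // B + c // B) % 3 == 0 else nonzero(r, c) for c in range(N)]
           for r in range(N)]

for name, w in (("scattered", scattered), ("blocked", blocked)):
    print(name, round(element_sparsity(w), 3), round(block_sparsity(w, B), 3))
```

Both matrices have the same share of zero entries, yet only the second offers any block that a window mode could skip. A real check would run the same two measurements on the attention matrices of the evaluated models.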

Figures

Figures reproduced from arXiv: 2602.06252 by Ahmed J. Abdelmaksoud, Cristian Sestito, Shiwei Wang, Themis Prodromakis.

**Figure 2.** A comprehensive analysis of single large core versus many smaller cores with the same number of PEs. (a) Input bandwidths of core(s), accumulation, …

**Figure 3.** Granularity analysis of the cores per Legion. (a) Input bandwidths of the Legion, accumulators, and psum memories versus core topology. (b) TFU per …

**Figure 4.** Per-Legion configuration rate index (CRI), evaluating each Legion …

**Figure 5.** (a) D-Legion architecture block diagram, consisting of …

**Figure 6.** Attention workloads distribution for the evaluated models: BitNet …

**Figure 7.** Latency comparison across four hardware architectures (WS, DiP, ADiP, and D-Legion) for BitNet-1.58B and BitNet-1.58B-KV models. (a) Per-stage …

**Figure 8.** Throughput comparison across four hardware architectures (WS, DiP, ADiP, and D-Legion) for BitNet-1.58B and BitNet-1.58B-KV models at …

**Figure 9.** Memory access comparison across four hardware architectures (WS, DiP, ADiP, and D-Legion) for BitNet-1.58B and BitNet-1.58B-KV models. (a) …

**Figure 10.** A comparison of psum memory access across four hardware architectures (WS, DiP, ADiP, and D-Legion) for BitNet-1.58B and BitNet-1.58B-KV …

**Figure 11.** A comparison between modeled Google TPUv4i and D-Legion V2 with the same number of PEs using attention workloads from BitNet-1.58B and …

Original abstract

The performance gains obtained by large language models (LLMs) are closely linked to their substantial computational and memory requirements. Quantized LLMs offer significant advantages with extremely quantized models, motivating the development of specialized architectures to accelerate their workloads. This paper proposes D-Legion, a novel scalable many-core architecture, designed using many adaptive-precision systolic array cores, to accelerate matrix multiplication in quantized LLMs. The proposed architecture consists of a set of Legions where each Legion has a group of adaptive-precision systolic arrays. D-Legion supports multiple computation modes, including quantized sparse and dense matrix multiplications. The block structured sparsity is exploited within a fully-sparse, or partially-sparse windows. In addition, memory accesses of partial summations (psums) are spatially reduced through parallel accumulators. Furthermore, data reuse is maximized through optimized scheduling techniques by multicasting matrix tiles across the Legions. A comprehensive design space exploration is performed in terms of Legion/core granularity to determine the optimal Legion configuration. Moreover, D-Legion is evaluated on attention workloads from two BitNet models, delivering up to 8.2$\times$ lower latency, up to 3.8$\times$ higher memory savings, and up to 3$\times$ higher psum memory savings compared to state-of-the-art work. D-Legion, with eight Legions and 64 total cores, achieves a peak throughput of 135.68 TOPS at a frequency of 1 GHz. A scaled version of D-Legion, with 32 Legions, is compared to Google TPUv4i, achieving up to 2.5$\times$ lower total latency, up to 2.3$\times$ higher total throughput, and up to 2.7$\times$ higher total memory savings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

D-Legion groups adaptive systolic arrays into Legions with block-sparsity window modes and parallel psums, delivering concrete TOPS and speedup numbers on BitNet attention, but the gains depend on sparsity mapping with low overhead.

D-Legion groups adaptive-precision systolic arrays into Legions and adds fully-sparse and partially-sparse window modes to skip zeros in quantized attention matrices. It also uses parallel partial-sum accumulators and multicast scheduling across Legions to cut memory traffic. The paper runs a design-space sweep over Legion granularity and reports a peak of 135.68 TOPS at 1 GHz with 64 cores, plus up to 8.2× lower latency and up to 3.8× higher memory savings versus prior accelerators on two BitNet models. A 32-Legion version is compared to a modeled TPUv4i and shows up to 2.5× lower latency, up to 2.3× higher throughput, and up to 2.7× higher memory savings. Those are the solid, checkable claims.

The architecture itself is a straightforward extension of systolic arrays with explicit sparsity support and adaptive precision, and the granularity study gives useful trade-off data. The soft spot is exactly the one the stress-test note flags: the big speedups only appear when the block-structured sparsity in those attention matrices fits the window modes without heavy control logic, dynamic indexing, or frequent reconfiguration costs. The paper describes the scheduler but does not supply cycle-accurate breakdowns of routing contention or mode-switch overhead under realistic sparsity distributions. If the zeros turn out to be finer-grained than the windows, the reported gains would shrink.

This is aimed at hardware architects who build accelerators for quantized LLM inference. Readers who already work on systolic arrays or sparsity exploitation will get the most from the Legion organization and the TPU comparison. I would send it to peer review. The proposal is clear enough that referees can request the missing implementation details and a sensitivity study on sparsity patterns, and the quantitative claims are specific enough to make that review worthwhile.

Referee Report

2 major / 0 minor

Summary. The paper proposes D-Legion, a scalable many-core architecture with adaptive-precision systolic array cores grouped into Legions to accelerate matrix multiplication for quantized LLMs. It supports quantized sparse and dense modes by exploiting block-structured sparsity in fully-sparse or partially-sparse windows, reduces partial sum (psum) memory accesses via parallel accumulators, and maximizes data reuse through multicast scheduling across Legions. A design-space exploration determines optimal Legion/core granularity. On attention workloads from two BitNet models, it reports up to 8.2× lower latency, 3.8× memory savings, and 3× psum savings versus prior work; an 8-Legion/64-core configuration reaches 135.68 TOPS at 1 GHz, and a 32-Legion scale-up outperforms TPUv4i by up to 2.5× latency, 2.3× throughput, and 2.7× memory savings.

Significance. If the sparsity-exploitation claims hold with quantified low overhead, the work would be significant for specialized accelerators targeting quantized LLM inference, demonstrating concrete gains in latency, memory, and throughput over both academic baselines and a commercial TPU. The Legion granularity exploration and parallel-accumulator psum reduction are concrete, reusable ideas.
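
The parallel-accumulator idea can be made concrete with a small counting model. It assumes that, without spatial reduction, every tile along the reduction dimension updates the stored partial sum in memory, whereas with parallel accumulators the partial results of several cores are added together before memory is touched. The function `psum_accesses` and the tile counts are illustrative assumptions; the up-to-3× figure in the paper comes from its own evaluation, not from this model.

```python
# Toy count of psum-memory accesses per output tile.
# ASSUMPTION: one read-modify-write per stored update; the paper's actual
# accumulator organization and memory hierarchy are not modeled.
import math

def psum_accesses(k_tiles: int, cores: int, spatial_reduction: bool) -> int:
    """Psum-memory reads plus writes needed to finish one output tile.

    k_tiles: number of tiles along the reduction (K) dimension.
    cores:   cores that process different K tiles in parallel.
    """
    if spatial_reduction:
        # Partial results from up to `cores` K tiles are summed before storing.
        updates = math.ceil(k_tiles / cores)
    else:
        # Every K tile performs its own update of the stored partial sum.
        updates = k_tiles
    writes = updates
    reads = updates - 1  # the first update has nothing to read back
    return reads + writes

baseline = psum_accesses(k_tiles=16, cores=8, spatial_reduction=False)
reduced = psum_accesses(k_tiles=16, cores=8, spatial_reduction=True)
print(baseline, reduced)
```

The model also shows where the benefit saturates: once the reduction dimension fits in a single pass across the cores, adding more cores no longer lowers the access count.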

major comments (2)

[Evaluation] Abstract and Evaluation: The headline quantitative results (135.68 TOPS, 8.2× latency, 3.8× memory savings) are obtained only when fully-sparse and partially-sparse window modes are applied to the BitNet attention matrices. No cycle-accurate breakdown of control-logic cost, multicast-routing contention, or mode-switch overhead is supplied, leaving the central assumption—that block-structured sparsity maps to these modes with negligible hardware cost—unverified and load-bearing for all reported speedups.
[Abstract] Abstract: No simulation tools, synthesis flow, workload characterization details, baseline implementations, or area/power measurement methodology are described, making it impossible to assess whether the reported TOPS, latency, and memory figures are supported by the underlying design.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.

Referee: [Evaluation] Abstract and Evaluation: The headline quantitative results (135.68 TOPS, 8.2× latency, 3.8× memory savings) are obtained only when fully-sparse and partially-sparse window modes are applied to the BitNet attention matrices. No cycle-accurate breakdown of control-logic cost, multicast-routing contention, or mode-switch overhead is supplied, leaving the central assumption—that block-structured sparsity maps to these modes with negligible hardware cost—unverified and load-bearing for all reported speedups.

Authors: We agree that the headline gains are realized when the sparse modes are active on the BitNet attention matrices, which exhibit the block-structured sparsity our design targets. The manuscript currently relies on the architectural description and aggregate results to imply low overhead, without an explicit cycle-accurate breakdown of control logic, routing contention, or mode-switch costs. In the revised version we will add a dedicated evaluation subsection that reports these overheads from our RTL-level cycle-accurate simulator, including their contribution to total latency and power as percentages across the evaluated configurations. This will directly verify the negligible-cost assumption for the reported speedups.

revision: yes

Referee: [Abstract] Abstract: No simulation tools, synthesis flow, workload characterization details, baseline implementations, or area/power measurement methodology are described, making it impossible to assess whether the reported TOPS, latency, and memory figures are supported by the underlying design.

Authors: The abstract is intentionally concise and therefore omits methodological details. The full manuscript contains the evaluation setup, but we acknowledge that the description of simulation tools, synthesis flow, workload characterization, baseline implementations, and area/power methodology is not sufficiently prominent or complete. We will revise the abstract to include a brief methodology sentence and expand the evaluation section with a dedicated subsection that explicitly states the tools, flow, workload details for the two BitNet models, baseline designs, and measurement methodology used to obtain the TOPS, latency, and memory figures.

revision: yes

Circularity Check

0 steps flagged

No circularity: architecture description and empirical results contain no self-referential derivations or fitted predictions

The manuscript describes a hardware architecture (Legions of adaptive-precision systolic arrays, fully-sparse and partially-sparse window modes, parallel accumulators, multicast scheduling) and reports measured throughput (135.68 TOPS) and speedups on BitNet attention workloads. No equations, parameter-fitting steps, or derivation chains appear that reduce a claimed result to its own inputs by construction. Performance numbers are presented as outcomes of design-space exploration and evaluation rather than as outputs forced by internal definitions or self-citations. The load-bearing claims therefore remain independent of any circular reduction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The design rests on standard assumptions about systolic-array efficiency and LLM sparsity patterns but introduces new named components without external validation data.

free parameters (1)

Legion/core granularity
Optimal configuration selected after design-space exploration; specific values (8 Legions, 64 cores) are presented as chosen for peak results.
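
For orientation, the candidate space behind this parameter can be enumerated in a few lines. The sketch assumes a fixed budget of 64 cores divided evenly among Legions. It lists only the possible splits and does not reproduce the paper's exploration metrics. The function name `granularity_options` is hypothetical.

```python
# Enumerate even splits of a fixed core budget into Legions.
# ASSUMPTION: every Legion holds the same number of cores; the paper's
# bandwidth and utilization metrics are not modeled here.
from typing import List, Tuple

def granularity_options(total_cores: int) -> List[Tuple[int, int]]:
    """Return (legions, cores_per_legion) pairs that use all cores."""
    return [
        (legions, total_cores // legions)
        for legions in range(1, total_cores + 1)
        if total_cores % legions == 0
    ]

options = granularity_options(64)
for legions, cores_per_legion in options:
    print(f"{legions:2d} Legions x {cores_per_legion:2d} cores")
```

The reported eight-Legion, 64-core instance corresponds to the (8, 8) entry in this list.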

axioms (1)

domain assumption: Block-structured sparsity patterns exist in quantized LLM attention workloads and can be exploited by fully-sparse or partially-sparse windows without prohibitive overhead.
Invoked to justify support for sparse and dense modes and memory-access reductions.

invented entities (1)

Legion (no independent evidence)
purpose: Basic scalable unit consisting of a group of adaptive-precision systolic arrays.
New architectural abstraction introduced to organize the many-core design.
