D-Legion: A Scalable Many-Core Architecture for Accelerating Matrix Multiplication in Quantized LLMs
Pith reviewed 2026-05-16 06:30 UTC · model grok-4.3
The pith
D-Legion uses groups of adaptive-precision systolic arrays in a scalable many-core layout to accelerate matrix multiplication for quantized LLMs by exploiting block-structured sparsity.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
D-Legion is a scalable many-core architecture composed of Legions, each containing adaptive-precision systolic array cores, that accelerates matrix multiplication in quantized LLMs. It supports fully-sparse and partially-sparse window modes to exploit block-structured sparsity, uses parallel accumulators to reduce partial-sum memory accesses, and applies optimized scheduling with multicasting to maximize data reuse across Legions. Evaluation on attention workloads from two BitNet models shows up to 8.2 times lower latency, 3.8 times higher memory savings, and 3 times higher partial-sum memory savings versus prior work; an eight-Legion, 64-core instance reaches a peak of 135.68 TOPS at 1 GHz, and a 32-Legion version achieves up to 2.5 times lower latency, 2.3 times higher throughput, and 2.7 times higher memory savings than Google TPUv4i.
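To put the peak figure in per-core terms, the arithmetic can be sketched as follows. The TOPS, clock, Legion, and core counts come from the abstract; counting one multiply-accumulate as two operations is an assumption made here, and the paper may count operations differently or quote the peak at a specific precision.

```python
# Back-of-envelope breakdown of the quoted peak throughput.
PEAK_TOPS = 135.68   # from the abstract
CLOCK_HZ = 1e9       # 1 GHz, from the abstract
LEGIONS = 8          # from the abstract
CORES = 64           # total cores, from the abstract
OPS_PER_MAC = 2      # ASSUMPTION: one multiply-accumulate counts as two ops


def ops_per_cycle_per_core(peak_tops, clock_hz, cores):
    """Operations each core must complete per clock cycle to reach the peak."""
    return peak_tops * 1e12 / clock_hz / cores


per_core = ops_per_cycle_per_core(PEAK_TOPS, CLOCK_HZ, CORES)
print(f"cores per Legion:        {CORES // LEGIONS}")
print(f"ops per cycle per core:  {per_core:.0f}")
print(f"MACs per cycle per core: {per_core / OPS_PER_MAC:.0f}")
```

Under these assumptions each core would need to sustain roughly 2120 operations per cycle, which is a property of the quoted peak only and says nothing about achieved utilization.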
What carries the argument
The Legion, a group of adaptive-precision systolic arrays that switch between fully-sparse, partially-sparse, and dense modes while using parallel accumulators to cut partial-sum traffic and multicast tiles for reuse.
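The idea of exploiting block-structured sparsity can be illustrated in software with a minimal sketch. This is not the paper's dataflow or its window modes: the block size `bs`, the function names, and the choice to look for zero blocks in the left operand are all hypothetical, chosen only to show how whole blocks of work disappear when the zeros are block-aligned.

```python
def block_is_zero(a, r0, c0, bs):
    """True if the bs x bs block of `a` starting at (r0, c0) is all zeros."""
    return all(a[r][c] == 0
               for r in range(r0, r0 + bs)
               for c in range(c0, c0 + bs))


def block_sparse_matmul(a, b, bs):
    """Compute a @ b, skipping every all-zero bs x bs block of `a`.

    Returns (product, blocks_computed, blocks_skipped).
    ASSUMPTION: all dimensions are multiples of bs.
    """
    m, k, n = len(a), len(b), len(b[0])
    out = [[0] * n for _ in range(m)]
    computed = skipped = 0
    for r0 in range(0, m, bs):
        for k0 in range(0, k, bs):
            if block_is_zero(a, r0, k0, bs):
                skipped += 1      # no multiplies and no partial-sum update
                continue
            computed += 1
            for r in range(r0, r0 + bs):
                for kk in range(k0, k0 + bs):
                    w = a[r][kk]
                    for c in range(n):
                        out[r][c] += w * b[kk][c]
    return out, computed, skipped


# Ternary 4x4 matrix whose two off-diagonal 2x2 blocks are all zero.
A = [[1, -1, 0, 0],
     [0, 1, 0, 0],
     [0, 0, 1, 0],
     [0, 0, -1, 1]]
B = [[1, 2], [3, 4], [5, 6], [7, 8]]
C, computed, skipped = block_sparse_matmul(A, B, bs=2)
print(C, computed, skipped)
```

In this toy case half of the blocks are skipped. Whether real hardware sees a comparable benefit depends on the control and routing cost of the skip decision, which the sketch does not model.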
If this is right
- Attention layers in quantized models such as BitNet can be executed with substantially lower latency and on-chip memory.
- Adding more Legions is reported to raise throughput while preserving the memory reductions: the 32-Legion version achieves up to 2.3 times higher throughput and 2.7 times higher memory savings than TPUv4i.
- Partial-sum memory traffic drops by up to 3 times, lowering overall bandwidth pressure in large-matrix workloads.
- The same cores can switch between sparse and dense modes, allowing one accelerator to serve both sparse attention and dense feed-forward layers.
- Optimized tile multicasting across Legions increases effective data reuse, reducing off-chip accesses for the same matrix sizes.
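The partial-sum point in the list above can be made concrete with a toy counting model. It is an assumption of this sketch, not a description of the paper's accumulators: `psum_accesses`, `k_tiles`, and `parallel_accumulators` are hypothetical names, and the model simply supposes that a group of K-tiles is reduced on-chip before the running partial sum in memory is touched.

```python
import math


def psum_accesses(k_tiles, parallel_accumulators=1):
    """Partial-sum memory reads and writes per output tile in a toy model.

    Each group of `parallel_accumulators` K-tiles is reduced on-chip and
    then merged with the running partial sum in memory: one write per
    group, plus one read for every group after the first.
    """
    groups = math.ceil(k_tiles / parallel_accumulators)
    return {"reads": groups - 1, "writes": groups}


def total(accesses):
    """Total partial-sum memory accesses."""
    return accesses["reads"] + accesses["writes"]


base = psum_accesses(k_tiles=8)
wide = psum_accesses(k_tiles=8, parallel_accumulators=4)
print(base, wide, total(base) / total(wide))
```

The ratio printed by this model is not comparable to the paper's reported figure of up to 3 times, which comes from the authors' own evaluation; the sketch only shows why combining partial sums spatially reduces memory traffic.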
Where Pith is reading between the lines
- Similar window-based sparsity handling could be applied to other sparse linear algebra kernels outside language models, such as graph neural networks.
- If control overhead stays low at larger scales, the design points toward energy-efficient accelerators for edge inference of quantized models.
- The block-window approach might be combined with software sparsity pruning techniques that enforce the same regularity to close the gap between measured and peak performance.
Load-bearing premise
Real quantized LLM workloads contain enough regular block-structured sparsity that maps cleanly onto the sparse window modes without large extra control or routing costs in hardware.
What would settle it
A measured run on a representative BitNet attention workload that shows high control overhead or low core utilization because the sparsity patterns do not align with the block windows.
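A first software-side check of this premise could compare element-level sparsity with the fraction of blocks that are entirely zero. The sketch below is a hypothetical measurement, not the paper's methodology: the function name `sparsity_profile` and the block size are assumptions, and real window shapes may differ.

```python
def sparsity_profile(mat, bs):
    """Return (element_sparsity, zero_block_fraction) for bs x bs blocks.

    ASSUMPTION: matrix dimensions are multiples of bs.
    """
    rows, cols = len(mat), len(mat[0])
    zeros = sum(v == 0 for row in mat for v in row)
    blocks = zero_blocks = 0
    for r0 in range(0, rows, bs):
        for c0 in range(0, cols, bs):
            blocks += 1
            if all(mat[r][c] == 0
                   for r in range(r0, r0 + bs)
                   for c in range(c0, c0 + bs)):
                zero_blocks += 1
    return zeros / (rows * cols), zero_blocks / blocks


# Same element sparsity (50%), very different block alignment.
aligned = [[1, 1, 0, 0],
           [1, 1, 0, 0],
           [0, 0, 1, 1],
           [0, 0, 1, 1]]
scattered = [[1, 0, 1, 0],
             [0, 1, 0, 1],
             [1, 0, 1, 0],
             [0, 1, 0, 1]]
print(sparsity_profile(aligned, 2))
print(sparsity_profile(scattered, 2))
```

A large gap between the two numbers on real BitNet attention matrices would suggest that the zeros do not align with block windows, which is the failure case described above.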
Original abstract
The performance gains obtained by large language models (LLMs) are closely linked to their substantial computational and memory requirements. Quantized LLMs offer significant advantages with extremely quantized models, motivating the development of specialized architectures to accelerate their workloads. This paper proposes D-Legion, a novel scalable many-core architecture, designed using many adaptive-precision systolic array cores, to accelerate matrix multiplication in quantized LLMs. The proposed architecture consists of a set of Legions where each Legion has a group of adaptive-precision systolic arrays. D-Legion supports multiple computation modes, including quantized sparse and dense matrix multiplications. The block-structured sparsity is exploited within fully-sparse or partially-sparse windows. In addition, memory accesses of partial summations (psums) are spatially reduced through parallel accumulators. Furthermore, data reuse is maximized through optimized scheduling techniques by multicasting matrix tiles across the Legions. A comprehensive design space exploration is performed in terms of Legion/core granularity to determine the optimal Legion configuration. Moreover, D-Legion is evaluated on attention workloads from two BitNet models, delivering up to 8.2× lower latency, up to 3.8× higher memory savings, and up to 3× higher psum memory savings compared to state-of-the-art work. D-Legion, with eight Legions and 64 total cores, achieves a peak throughput of 135.68 TOPS at a frequency of 1 GHz. A scaled version of D-Legion, with 32 Legions, is compared to Google TPUv4i, achieving up to 2.5× lower total latency, up to 2.3× higher total throughput, and up to 2.7× higher total memory savings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes D-Legion, a scalable many-core architecture with adaptive-precision systolic array cores grouped into Legions to accelerate matrix multiplication for quantized LLMs. It supports quantized sparse and dense modes by exploiting block-structured sparsity in fully-sparse or partially-sparse windows, reduces partial sum (psum) memory accesses via parallel accumulators, and maximizes data reuse through multicast scheduling across Legions. A design-space exploration determines optimal Legion/core granularity. On attention workloads from two BitNet models, it reports up to 8.2× lower latency, 3.8× memory savings, and 3× psum savings versus prior work; an 8-Legion/64-core configuration reaches 135.68 TOPS at 1 GHz, and a 32-Legion scale-up outperforms TPUv4i by up to 2.5× latency, 2.3× throughput, and 2.7× memory savings.
Significance. If the sparsity-exploitation claims hold with quantified low overhead, the work would be significant for specialized accelerators targeting quantized LLM inference, demonstrating concrete gains in latency, memory, and throughput over both academic baselines and a commercial TPU. The Legion granularity exploration and parallel-accumulator psum reduction are concrete, reusable ideas.
major comments (2)
- [Evaluation] Abstract and Evaluation: The headline quantitative results (135.68 TOPS, 8.2× latency, 3.8× memory savings) are obtained only when fully-sparse and partially-sparse window modes are applied to the BitNet attention matrices. No cycle-accurate breakdown of control-logic cost, multicast-routing contention, or mode-switch overhead is supplied, leaving the central assumption—that block-structured sparsity maps to these modes with negligible hardware cost—unverified and load-bearing for all reported speedups.
- [Abstract] Abstract: No simulation tools, synthesis flow, workload characterization details, baseline implementations, or area/power measurement methodology are described, making it impossible to assess whether the reported TOPS, latency, and memory figures are supported by the underlying design.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.
- Referee: [Evaluation] Abstract and Evaluation: The headline quantitative results (135.68 TOPS, 8.2× latency, 3.8× memory savings) are obtained only when fully-sparse and partially-sparse window modes are applied to the BitNet attention matrices. No cycle-accurate breakdown of control-logic cost, multicast-routing contention, or mode-switch overhead is supplied, leaving the central assumption (that block-structured sparsity maps to these modes with negligible hardware cost) unverified and load-bearing for all reported speedups.
  Authors: We agree that the headline gains are realized when the sparse modes are active on the BitNet attention matrices, which exhibit the block-structured sparsity our design targets. The manuscript currently relies on the architectural description and aggregate results to imply low overhead, without an explicit cycle-accurate breakdown of control logic, routing contention, or mode-switch costs. In the revised version we will add a dedicated evaluation subsection that reports these overheads from our RTL-level cycle-accurate simulator, including their contribution to total latency and power as percentages across the evaluated configurations. This will directly verify the negligible-cost assumption for the reported speedups.
  revision: yes
- Referee: [Abstract] Abstract: No simulation tools, synthesis flow, workload characterization details, baseline implementations, or area/power measurement methodology are described, making it impossible to assess whether the reported TOPS, latency, and memory figures are supported by the underlying design.
  Authors: The abstract is intentionally concise and therefore omits methodological details. The full manuscript contains the evaluation setup, but we acknowledge that the description of simulation tools, synthesis flow, workload characterization, baseline implementations, and area/power methodology is not sufficiently prominent or complete. We will revise the abstract to include a brief methodology sentence and expand the evaluation section with a dedicated subsection that explicitly states the tools, flow, workload details for the two BitNet models, baseline designs, and measurement methodology used to obtain the TOPS, latency, and memory figures.
  revision: yes
Circularity Check
No circularity: architecture description and empirical results contain no self-referential derivations or fitted predictions
The manuscript describes a hardware architecture (Legions of adaptive-precision systolic arrays, fully-sparse and partially-sparse window modes, parallel accumulators, multicast scheduling) and reports measured throughput (135.68 TOPS) and speedups on BitNet attention workloads. No equations, parameter-fitting steps, or derivation chains appear that reduce a claimed result to its own inputs by construction. Performance numbers are presented as outcomes of design-space exploration and evaluation rather than as outputs forced by internal definitions or self-citations. The load-bearing claims therefore remain independent of any circular reduction.
Axiom & Free-Parameter Ledger
free parameters (1)
- Legion/core granularity
axioms (1)
- domain assumption: Block-structured sparsity patterns exist in quantized LLM attention workloads and can be exploited by fully-sparse or partially-sparse windows without prohibitive overhead.
invented entities (1)
- Legion: no independent evidence
Ahmed J. Abdelmaksoud is currently pursuing his PhD with the Centre for Electronics Frontiers (CEF) at the University of Edinburgh, UK. He received his BSc and MSc in Electronics Engineering from Cairo University, Egypt in 2018 and 2022, respectively. Since 2018, he has bee...