
arxiv: 2605.24144 · v1 · pith:LJ4JVHNEnew · submitted 2026-05-22 · 💻 cs.AR · cs.LG

EVA: Accelerating LLM Decoding via an Efficient Vector Quantization Architecture

Pith reviewed 2026-06-30 14:31 UTC · model grok-4.3

classification 💻 cs.AR cs.LG
keywords vector quantization · LLM decoding · GEMV acceleration · hardware architecture · memory bank conflicts · energy efficiency · autoregressive inference

The pith

EVA accelerates LLM decoding by up to 11.17× over state-of-the-art lookup-based designs by computing dot products directly between inputs and the weight codebook and using structured lookups to avoid memory bank conflicts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that vector quantization can be restructured to overcome the memory-bound limits of autoregressive LLM decoding. Instead of reconstructing weights from indices and then running small GEMV operations, EVA computes dot products directly between input vectors and the shared codebook, then retrieves the results through structured lookups from an intermediate buffer. This turns the work into more efficient GEMM-style computation while removing bank conflicts. A sympathetic reader would care because decoding dominates inference time and energy use, and the method claims large gains without losing arithmetic precision beyond what vector quantization itself costs.

Core claim

EVA builds on vector quantization by performing direct dot products between input vectors and the weight codebook, which transforms LLM decoding from GEMV-like to GEMM computation, then executes structured lookups from an intermediate output buffer to eliminate memory bank conflicts. The architecture remains compatible with standard prefill execution and delivers up to 11.17× speedup and 7.17× higher energy efficiency over state-of-the-art lookup-based designs while preserving arithmetic precision.
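The reordering described above can be sketched in a few lines. This is a minimal pure-Python illustration, not the authors' implementation: the function names, the single shared codebook, and the small integer test data are all assumptions made for the example. It shows that dotting each input sub-vector with every centroid once, and then gathering by index, gives the same outputs as rebuilding the weights first.

```python
import random

def dequant_gemv(x, codebook, indices, d):
    """Baseline: rebuild each weight sub-vector from its index, then dot it with x."""
    V, N = len(indices), len(indices[0])
    y = [0] * N
    for j in range(N):
        for v in range(V):
            centroid = codebook[indices[v][j]]  # the d reconstructed weights
            for t in range(d):
                y[j] += x[v * d + t] * centroid[t]
    return y

def codebook_first_gemv(x, codebook, indices, d):
    """Reordered: dot every input sub-vector with every centroid once, then gather."""
    V, N = len(indices), len(indices[0])
    # (V x d) times (d x 2^n): a small dense matrix-matrix product.
    buf = [[sum(x[v * d + t] * c[t] for t in range(d)) for c in codebook]
           for v in range(V)]
    # Epilogue: each output channel picks one buffered partial sum per sub-vector.
    return [sum(buf[v][indices[v][j]] for v in range(V)) for j in range(N)]

# Hypothetical toy sizes: d-wide sub-vectors, 2^n centroids, K = V * d inputs.
random.seed(0)
d, n_bits, V, N = 4, 3, 6, 5
codebook = [[random.randint(-3, 3) for _ in range(d)] for _ in range(2 ** n_bits)]
indices = [[random.randrange(2 ** n_bits) for _ in range(N)] for _ in range(V)]
x = [random.randint(-4, 4) for _ in range(V * d)]

assert dequant_gemv(x, codebook, indices, d) == codebook_first_gemv(x, codebook, indices, d)
```

Integer data is used so the two orderings agree exactly. With floating-point inputs the summation order differs between the two functions, so this sketch does not test the paper's precision claim.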

What carries the argument

Direct input-codebook dot product computation followed by structured lookups from an intermediate output buffer.
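A rough operation count suggests why this ordering could pay off. The sketch below is an accounting under stated assumptions, not a figure from the paper: it assumes one codebook of 2^n centroids of dimension d shared by all N output channels, counts only multiply-accumulates and epilogue accumulations, and ignores tiling, control, and memory traffic. The layer shape and the `op_counts` helper are hypothetical.

```python
def op_counts(K, N, d, n_bits):
    """Count arithmetic operations for one K x N layer at batch size 1."""
    V = K // d  # number of d-wide sub-vectors along the input dimension
    return {
        # Dense GEMV: one multiply-accumulate per weight element.
        "dense_gemv_macs": K * N,
        # Codebook-first: every sub-vector is dotted with every centroid.
        "codebook_macs": V * (2 ** n_bits) * d,
        # Epilogue: one looked-up partial sum accumulated per index.
        "epilogue_adds": V * N,
    }

counts = op_counts(K=4096, N=4096, d=4, n_bits=8)
print(counts)
```

Under these assumptions the multiply-accumulate count scales with K · 2^n rather than K · N, so the arithmetic saving shrinks as the codebook size approaches the number of output channels that share it.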

Load-bearing premise

Direct input-codebook dot products combined with structured lookups from an intermediate output buffer will eliminate memory bank conflicts and maintain precision without introducing new hardware bottlenecks or overheads.
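To make the bank-conflict concern concrete, here is a toy model. It assumes low-order address interleaving (bank = address mod number of banks) and one read per bank per cycle. The `access_cycles` helper, the address lists, and the per-lane layout are hypothetical and are not taken from the paper's epilogue unit design.

```python
from collections import Counter

def access_cycles(addresses, num_banks):
    """Cycles to serve one read per lane when each bank serves one read per cycle."""
    per_bank = Counter(addr % num_banks for addr in addresses)
    return max(per_bank.values())

num_banks = 8

# Unstructured: eight lanes look up arbitrary entries in one shared table.
# Addresses 3, 11, 19 and 27 all map to bank 3 and must be serialized.
shared_table_reads = [3, 11, 19, 27, 5, 13, 6, 7]

# Structured (hypothetical layout): lane i only ever reads from bank i,
# choosing a row inside that bank, so no two lanes collide.
rows = [2, 0, 5, 1, 7, 3, 3, 4]
per_lane_reads = [row * num_banks + lane for lane, row in enumerate(rows)]

print(access_cycles(shared_table_reads, num_banks))  # 4
print(access_cycles(per_lane_reads, num_banks))      # 1
```

The premise flagged above is whether a real design can keep lookups in the second regime without paying for it in buffer capacity or address-generation logic, which this model does not capture.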

What would settle it

A cycle-accurate hardware simulation or FPGA prototype run on real LLM decoding workloads that shows whether measured speedup falls below 5× or arithmetic precision deviates from the original model when memory conflicts or new overheads appear.

Figures

Figures reproduced from arXiv: 2605.24144 by Bowen Duan, Changchun Zhou, Chiyue Wei, Cong Guo, Hai Li, Haoxuan Shan, Xinhua Chen, Yifan Xu, Yiran Chen, Yuzhe Fu, Ziyue Zhang.

Figure 1: Motivation of this work. (a) Conventional GEMV suffers from poor …
Figure 2: Comparison of quantization schemes: (a) uniform quantization …
Figure 3: Overview of the EVA computation flow and architecture.
… vector that is replaced by an index referencing a shared weight codebook (WC) B. The codebook contains 2^n representative d-dimensional centroids (where n is the index bit-width), each learned from the distribution of weights using k-means clustering [15], [22]. All K × N weight elements are therefore represented as V × N indices, where V = K/d. The wei…
Figure 4: Tiling strategy of the VQ-GEMM operation in …
Figure 6: Epilogue unit (EU) for conflict-free lookup in …
Figure 7: Execution scheduling of EVA. (a) Runtime estimation showing that the GEMM stage is not the bottleneck. (b) EU scaling pipeline demonstrating GEMM–Epilogue overlap. (c) Multi-batch reuse, where multiple requests share the same weight tiles to reduce bandwidth cost.
… achieves the same functionality with lower computation and hardware cost, without sacrificing arithmetic precision. B. EU Scaling …
Figure 8: Design space exploration of EVA. (a) Number of Epilogue Units with latency and energy. (b) Number of Epilogue Units with area.
… VQ parameters. Tbl. III shows how EVA’s latency varies across different VQ configurations. Here, N denotes the minimum number of output channels sharing the same codebook; for AQLM, this corresponds to the linear-layer output dimension (N ≥ 4096). As shown in the table, when 2^n <…
Figure 9: EVA area and power breakdown.
… does not rely on any particular method and can benefit from future improvements, such as fine-tuning [42] or other emerging optimizations [35], [58], [74]. D. Area and Power Comparison. As shown in Tbl. VIII, we compare the area and power of EVA and the baseline architectures. For a fair comparison, all designs are configured with the same minimum number of PEs. For EVA, compar…
Figure 10: Latency and energy consumption of the EVA and baseline accelerators on the fully connected layer with batch size = 1 during the decoding phase of the LLaMA models. Method-AnWm denotes n-bit activation and m-bit weight.
… only one lane is effectively active when batch size = 1. The utilization rates of ANT and FIGNA are further reduced due to the increased pipeline fill and drain overhead. As a LUT-based metho…
Figure 11: (caption not captured) Panels: (a) latency (s) versus batch size; (b) energy (J) versus batch size, for batch sizes 1, 2, 4, 8, 16, 32, and 64. Series: SA-A8W8, ANT-A8W8, FIGNA-A16W4, FIGLUT-A16W4, EVA-A16W4, EVA-A16W3, EVA-A16W2, EVA-A8W8. Annotated values: 0.101, 0.076, 0.564.
… Figure 11(b) shows how energy consumption varies with batch …
Figure 13: End-to-end latency evaluation of MoE models on the (a) Arxiv …
Figure 14: Evaluation of spurious computations in the proposed method. (a) …
Original abstract

Large Language Models (LLMs) have achieved impressive performance across diverse domains but remain inefficient during the autoregressive decoding phase. Unlike the prefill stage, which employs compute-bound GEMM operations, decoding executes a sequence of small GEMV-like computations that are memory-bound and underutilize modern accelerators. Weight-only vector quantization (VQ) has emerged as an effective compression technique that clusters model weights into a shared codebook and replaces the original weight matrix with low-precision indices, enabling 2-bit-level weight compression. While this approach substantially reduces model size and memory bandwidth, it still suffers from two critical inefficiencies: the low utilization of GEMV computation and frequent memory conflicts during codebook lookups. This paper presents EVA, an efficient vector-quantization-based architecture that addresses both computational and memory bottlenecks in LLM decoding. EVA builds on a simple yet effective insight that combines input-codebook computation with conflict-free memory access. Instead of reconstructing quantized weights from indices, EVA directly performs dot products between input vectors and the weight codebook, transforming LLM decoding from GEMV to GEMM computation. It then performs structured lookups from an intermediate output buffer, eliminating memory bank conflicts. We further design a hardware-software co-optimized architecture specialized for LLM decoding while remaining compatible with conventional prefill execution. Evaluations show that EVA achieves up to 11.17$\times$ speedup and 7.17$\times$ higher energy efficiency compared with the SOTA lookup-based architecture, while preserving arithmetic precision after vector quantization. Our code is available at https://github.com/dbw6/Eva.git.
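As a rough check on the 2-bit-level figure, the sketch below counts storage for one vector-quantized layer. It uses the definitions quoted in the Figure 3 text (2^n centroids of dimension d, and V = K/d index rows). The layer shape, d = 4, n = 8, 16-bit centroids, and a single codebook per layer are assumed values chosen for illustration, not the paper's configuration.

```python
def vq_storage_bits(K, N, d, n_bits, centroid_bits=16):
    """Bits needed to store a K x N weight matrix as VQ indices plus one codebook."""
    V = K // d                                    # index rows along the input dimension
    index_bits = V * N * n_bits                   # one n-bit index per d weights
    codebook_bits = (2 ** n_bits) * d * centroid_bits
    return index_bits + codebook_bits

K = N = 4096
total_bits = vq_storage_bits(K, N, d=4, n_bits=8)
bits_per_weight = total_bits / (K * N)
compression_vs_fp16 = (16 * K * N) / total_bits

print(bits_per_weight)       # 2.0009765625
print(compression_vs_fp16)   # just under 8x
```

With these assumed values the index stream costs n/d = 2 bits per weight and the codebook adds about a thousandth of a bit on top, which is why the codebook overhead is usually negligible for large layers.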

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes the EVA architecture for accelerating autoregressive LLM decoding under weight-only vector quantization. It replaces weight reconstruction with direct input-codebook dot products (converting memory-bound GEMV to GEMM) followed by structured index-driven lookups from an intermediate output buffer to eliminate memory bank conflicts. The design is claimed to remain compatible with conventional prefill execution. Evaluations are reported to deliver up to 11.17× speedup and 7.17× energy-efficiency gains versus the SOTA lookup-based accelerator while preserving arithmetic precision; open-source code is provided.

Significance. If the reported speedups and energy gains are substantiated by detailed hardware measurements that confirm the absence of new bottlenecks, the work would offer a practical contribution to specialized accelerators for quantized LLM inference. The open code release supports reproducibility.

major comments (2)
  1. [Abstract and §4 (Evaluation)] The headline claims of 11.17× speedup and 7.17× energy efficiency rest on the assertion that input-codebook GEMM plus structured buffer lookups remove bank conflicts without material control or buffering overhead. No cycle-level breakdown, stall analysis, or sensitivity study on intermediate-buffer size/address-generation logic is supplied, leaving the central hardware assumption unverified.
  2. [§3 (Architecture)] The transformation from GEMV to GEMM via direct codebook dot products is described at a high level, but the manuscript does not quantify the additional GEMM tiling or prefill-compatibility overheads for small decoding batches, which directly affects whether net gains over prior lookup accelerators are preserved.
minor comments (2)
  1. The abstract states that arithmetic precision is preserved after vector quantization, but the manuscript should explicitly state the bit-widths, codebook sizes, and any rounding or accumulation-precision details used in the GEMM step.
  2. Figure captions and table headers in the evaluation section would benefit from clearer labeling of the exact models, sequence lengths, and hardware parameters corresponding to the reported speedups.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our work. Below we address each major comment point by point. We will make revisions to strengthen the hardware analysis as suggested.

Point-by-point responses
  1. Referee: [Abstract and §4 (Evaluation)] The headline claims of 11.17× speedup and 7.17× energy efficiency rest on the assertion that input-codebook GEMM plus structured buffer lookups remove bank conflicts without material control or buffering overhead. No cycle-level breakdown, stall analysis, or sensitivity study on intermediate-buffer size/address-generation logic is supplied, leaving the central hardware assumption unverified.

    Authors: We agree with the referee that providing a cycle-level breakdown would better substantiate our claims. Although our reported speedups are derived from a detailed cycle-accurate simulation model that includes control logic and buffering, we did not include an explicit breakdown or sensitivity study in the manuscript. We will revise §4 to include stall analysis, cycle breakdowns, and sensitivity studies on the intermediate-buffer size and address-generation logic.
    revision: yes

  2. Referee: [§3 (Architecture)] The transformation from GEMV to GEMM via direct codebook dot products is described at a high level, but the manuscript does not quantify the additional GEMM tiling or prefill-compatibility overheads for small decoding batches, which directly affects whether net gains over prior lookup accelerators are preserved.

    Authors: The comment is valid; the current §3 focuses on the core insight without detailed overhead quantification for edge cases like small batches. In the revision, we will add analysis and measurements quantifying the GEMM tiling overheads and prefill compatibility costs for small decoding batches, showing that the net performance gains remain significant.
    revision: yes

Circularity Check

0 steps flagged

No circularity: engineering architecture proposal with no derivation chain

full rationale

The paper is an engineering architecture proposal for LLM decoding hardware. It contains no mathematical derivation, no equations that reduce to inputs by construction, no fitted parameters renamed as predictions, and no load-bearing self-citations or uniqueness theorems. Performance claims (speedup, energy efficiency) are presented as outcomes of the proposed design and simulation/evaluation, not as tautological results of prior steps within the paper. The work is therefore self-contained with no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

This is a hardware architecture paper; the central claim rests on the proposed design choices for computation and memory access rather than mathematical axioms, fitted parameters, or new physical entities.

invented entities (1)
  • EVA architecture (no independent evidence)
    purpose: Hardware design that performs direct codebook dot products and conflict-free lookups for LLM decoding
    New system proposed in the paper to address VQ inefficiencies.

pith-pipeline@v0.9.1-grok · 5847 in / 1127 out tokens · 42697 ms · 2026-06-30T14:31:02.021336+00:00 · methodology


Reference graph

Works this paper leans on

83 extracted references · 34 canonical work pages · 18 internal anchors

  1. [1]

    SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills

    A. Agrawal, A. Panwar, J. Mohan, N. Kwatra, B. S. Gulavani, and R. Ramjee, “Sarathi: Efficient llm inference by piggybacking decodes with chunked prefills,” arXiv preprint arXiv:2308.16369, 2023

  2. [2]

    Qwen Technical Report

    J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang et al., “Qwen technical report,” arXiv preprint arXiv:2309.16609, 2023

  3. [3]

    Cacti 7: New tools for interconnect exploration in innovative off-chip memories

    R. Balasubramonian, A. B. Kahng, N. Muralimanohar, A. Shafiee, and V. Srinivas, “Cacti 7: New tools for interconnect exploration in innovative off-chip memories,” ACM Transactions on Architecture and Code Optimization (TACO), vol. 14, no. 2, pp. 1–25, 2017

  4. [4]

    Piqa: Reasoning about physical commonsense in natural language

    Y. Bisk, R. Zellers, J. Gao, Y. Choi et al., “Piqa: Reasoning about physical commonsense in natural language,” in Proceedings of the AAAI conference on artificial intelligence, vol. 34, no. 05, 2020, pp. 7432–7439

  5. [5]

    Multiplying matrices without multiplying

    D. Blalock and J. Guttag, “Multiplying matrices without multiplying,” in International Conference on Machine Learning. PMLR, 2021, pp. 992–1004

  6. [6]

    Ecco: Improving memory bandwidth and capacity for llms via entropy-aware cache compression

    F. Cheng, C. Guo, C. Wei, J. Zhang, C. Zhou, E. Hanson, J. Zhang, X. Liu, H. Li, and Y. Chen, “Ecco: Improving memory bandwidth and capacity for llms via entropy-aware cache compression,” in Proceedings of the 52nd Annual International Symposium on Computer Architecture, 2025, pp. 793–807

  7. [7]

    Nvidia a100 tensor core gpu: Performance and innovation

    J. Choquette, W. Gandhi, O. Giroux, N. Stam, and R. Krashinsky, “Nvidia a100 tensor core gpu: Performance and innovation,” IEEE Micro, vol. 41, no. 2, pp. 29–35, 2021

  8. [8]

    Boolq: Exploring the surprising difficulty of natural yes/no questions

    C. Clark, K. Lee, M.-W. Chang, T. Kwiatkowski, M. Collins, and K. Toutanova, “Boolq: Exploring the surprising difficulty of natural yes/no questions,” in Proceedings of the 2019 conference of the north American chapter of the association for computational linguistics: Human language technologies, volume 1 (long and short papers), 2019, pp. 2924–2936

  9. [9]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord, “Think you have solved question answering? try arc, the ai2 reasoning challenge,” arXiv preprint arXiv:1803.05457, 2018

  10. [10]

    Training Verifiers to Solve Math Word Problems

    K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano et al., “Training verifiers to solve math word problems,” arXiv preprint arXiv:2110.14168, 2021

  11. [11]

    A discourse-aware attention model for abstractive summarization of long documents

    A. Cohan, F. Dernoncourt, D. S. Kim, T. Bui, S. Kim, W. Chang, and N. Goharian, “A discourse-aware attention model for abstractive summarization of long documents,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), 2018, pp. 615–621

  12. [12]

    Free dolly: Introducing the world’s first truly open instruction-tuned llm

    M. Conover, M. Hayes, A. Mathur, J. Xie, J. Wan, S. Shah, A. Ghodsi, P. Wendell, M. Zaharia, and R. Xin. (2023) Free dolly: Introducing the world’s first truly open instruction-tuned llm. [Online]. Available: https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm

  13. [13]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-AI, “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,” CoRR, vol. abs/2501.12948, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2501.12948

  14. [14]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: pre-training of deep bidirectional transformers for language understanding,” CoRR, vol. abs/1810.04805, 2018. [Online]. Available: http://arxiv.org/abs/1810.04805

  15. [15]

    Extreme compression of large language models via additive quantization

    V. Egiazarian, A. Panferov, D. Kuznedelev, E. Frantar, A. Babenko, and D. Alistarh, “Extreme compression of large language models via additive quantization,” 2024

  16. [16]

    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

    E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh, “GPTQ: accurate post-training quantization for generative pre-trained transformers,” CoRR, vol. abs/2210.17323, 2022. [Online]. Available: https://doi.org/10.48550/arXiv.2210.17323

  17. [17]

    Github copilot

    GitHub, “Github copilot,” https://github.com/features/copilot, 2025, accessed: 2025-11-17

  18. [18]

    Carvq: Corrective adaptor with group residual vector quantization for llm embedding compression

    D. Gou, S. Byun, N. Malpeddi, G. De Micheli, P. Vaste, J. Song, and W. S. Chung, “Carvq: Corrective adaptor with group residual vector quantization for llm embedding compression,” in Findings of the Association for Computational Linguistics: EMNLP 2025, 2025, pp. 18 594–18 604

  19. [19]

    Olive: Accelerating large language models via hardware-friendly outlier-victim pair quantization

    C. Guo, J. Tang, W. Hu, J. Leng, C. Zhang, F. Yang, Y. Liu, M. Guo, and Y. Zhu, “Olive: Accelerating large language models via hardware-friendly outlier-victim pair quantization,” in Proceedings of the 50th Annual International Symposium on Computer Architecture, 2023, pp. 1–15

  20. [20]

    Transitive array: An efficient gemm accelerator with result reuse

    C. Guo, C. Wei, J. Tang, B. Duan, S. Han, H. Li, and Y. Chen, “Transitive array: An efficient gemm accelerator with result reuse,” in Proceedings of the 52nd Annual International Symposium on Computer Architecture, 2025, pp. 990–1004

  21. [21]

    Ant: Exploiting adaptive numerical data type for low-bit deep neural network quantization

    C. Guo, C. Zhang, J. Leng, Z. Liu, F. Yang, Y. Liu, M. Guo, and Y. Zhu, “Ant: Exploiting adaptive numerical data type for low-bit deep neural network quantization,” in 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2022, pp. 1414–1433

  22. [22]

    Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding

    S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” arXiv preprint arXiv:1510.00149, 2015

  23. [23]

    M-ant: Efficient low-bit group quantization for llms via mathematically adaptive numerical type

    W. Hu, H. Zhang, C. Guo, Y. Feng, R. Guan, Z. Hua, Z. Liu, Y. Guan, M. Guo, and J. Leng, “M-ant: Efficient low-bit group quantization for llms via mathematically adaptive numerical type,” in 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2025, pp. 1112–1126

  24. [24]

    I-llm: Efficient integer-only inference for fully-quantized low-bit large language models

    X. Hu, Y. Cheng, D. Yang, Z. Yuan, J. Yu, C. Xu, and S. Zhou, “I-llm: Efficient integer-only inference for fully-quantized low-bit large language models,” arXiv preprint arXiv:2405.17849, 2024

  25. [25]

    Residual quantization with implicit neural codebooks

    I. A. Huijben, M. Douze, M. Muckley, R. J. Van Sloun, and J. Verbeek, “Residual quantization with implicit neural codebooks,” arXiv preprint arXiv:2401.14732, 2024

  26. [26]

    Quantization and training of neural networks for efficient integer-arithmetic-only inference

    B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko, “Quantization and training of neural networks for efficient integer-arithmetic-only inference,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 2704–2713

  27. [27]

    Figna: Integer unit-based accelerator design for fp-int gemm preserving numerical accuracy

    J. Jang, Y. Kim, J. Lee, and J.-J. Kim, “Figna: Integer unit-based accelerator design for fp-int gemm preserving numerical accuracy,” in 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2024, pp. 760–773

  28. [28]

    Mistral 7B

    A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de Las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed, “Mistral 7b,” CoRR, vol. abs/2310.06825, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2310.06825

  30. [30]

    Mixtral of Experts

    A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. d. l. Casas, E. B. Hanna, F. Bressand et al., “Mixtral of experts,” arXiv preprint arXiv:2401.04088, 2024

  31. [31]

    In-datacenter performance analysis of a tensor processing unit

    N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers et al., “In-datacenter performance analysis of a tensor processing unit,” in Proceedings of the 44th annual international symposium on computer architecture, 2017, pp. 1–12

  32. [32]

    I-bert: Integer-only bert quantization

    S. Kim, A. Gholami, Z. Yao, M. W. Mahoney, and K. Keutzer, “I-bert: Integer-only bert quantization,” in International conference on machine learning. PMLR, 2021, pp. 5506–5518

  33. [33]

    Squeezellm: dense-and-sparse quantization

    S. Kim, C. Hooper, A. Gholami, Z. Dong, X. Li, S. Shen, M. W. Mahoney, and K. Keutzer, “Squeezellm: dense-and-sparse quantization,” in Proceedings of the 41st International Conference on Machine Learning, ser. ICML’24. JMLR.org, 2024

  34. [34]

    Amq: Enabling automl for mixed-precision weight-only quantization of large language models

    S. Lee, S.-t. Woo, J.-g. Jin, C. Lee, and E. Park, “Amq: Enabling automl for mixed-precision weight-only quantization of large language models,” in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025, pp. 35 520–35 538

  35. [35]

    Lut-dla: Lookup table as efficient extreme low-bit deep learning accelerator

    G. Li, S. Ye, C. Chen, Y. Wang, F. Yang, T. Cao, C. Liu, M. M. S. Aly, and M. Yang, “Lut-dla: Lookup table as efficient extreme low-bit deep learning accelerator,” in 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2025, pp. 671–684

  36. [36]

    Commvq: Commutative vector quantization for kv cache compression

    J. Li, Y. Zhang, M. Y. Hassan, T. Chafekar, T. Cai, Z. Ren, P. Guo, F. Karimzadeh, C. Wang, and C. Gan, “Commvq: Commutative vector quantization for kv cache compression,” arXiv preprint arXiv:2506.18879, 2025

  37. [37]

    Dramsim3: A cycle-accurate, thermal-capable DRAM simulator

    S. Li, Z. Yang, D. Reddy, A. Srivastava, and B. L. Jacob, “Dramsim3: A cycle-accurate, thermal-capable DRAM simulator,” IEEE Comput. Archit. Lett., vol. 19, no. 2, pp. 110–113, 2020. [Online]. Available: https://doi.org/10.1109/LCA.2020.2973991

  38. [38]

    Awq: Activation-aware weight quantization for on-device llm compression and acceleration

    J. Lin, J. Tang, H. Tang, S. Yang, W.-M. Chen, W.-C. Wang, G. Xiao, X. Dang, C. Gan, and S. Han, “Awq: Activation-aware weight quantization for on-device llm compression and acceleration,” Proceedings of machine learning and systems, vol. 6, pp. 87–100, 2024

  39. [39]

    Speechprune: Context-aware token pruning for speech information retrieval

    Y. Lin, Y. Fu, J. Zhang, Y. Liu, J. Zhang, J. Sun, H. Li, Y. Chen et al., “Speechprune: Context-aware token pruning for speech information retrieval,” arXiv preprint arXiv:2412.12009, 2024

  40. [40]

    Qserve: W4a8kv4 quantization and system co-design for efficient llm serving

    Y. Lin, H. Tang, S. Yang, Z. Zhang, G. Xiao, C. Gan, and S. Han, “Qserve: W4a8kv4 quantization and system co-design for efficient llm serving,” arXiv preprint arXiv:2405.04532, 2024

  41. [41]

    Vptq: Extreme low-bit vector post-training quantization for large language models

    Y. Liu, J. Wen, Y. Wang, S. Ye, L. L. Zhang, T. Cao, C. Li, and M. Yang, “Vptq: Extreme low-bit vector post-training quantization for large language models,” arXiv preprint arXiv:2409.17066, 2024

  42. [42]

    Vq-llm: High-performance code generation for vector quantization augmented llm inference

    Z. Liu, X. Luo, J. Guo, W. Ni, Y. Zhou, Y. Guan, C. Guo, W. Cui, Y. Feng, M. Guo et al., “Vq-llm: High-performance code generation for vector quantization augmented llm inference,” in 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2025, pp. 1496–1509

  43. [43]

    Pv-tuning: Beyond straight-through estimation for extreme llm compression

    V. Malinovskii, D. Mazur, I. Ilin, D. Kuznedelev, K. Burlachenko, K. Yi, D. Alistarh, and P. Richtarik, “Pv-tuning: Beyond straight-through estimation for extreme llm compression,” 2024

  44. [44]

    Pointer Sentinel Mixture Models

    S. Merity, C. Xiong, J. Bradbury, and R. Socher, “Pointer sentinel mixture models,” arXiv preprint arXiv:1609.07843, 2016

  45. [45]

    LUT tensor core: Lookup table enables efficient low-bit LLM inference acceleration

    Z. Mo, L. Wang, J. Wei, Z. Zeng, S. Cao, L. Ma, N. Jing, T. Cao, J. Xue, F. Yang, and M. Yang, “LUT tensor core: Lookup table enables efficient low-bit LLM inference acceleration,” CoRR, vol. abs/2408.06003, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2408.06003

  46. [46]

    Genai at the edge: Comprehensive survey on empowering edge devices

    M. Navardi, R. Aalishah, Y. Fu, Y. Lin, H. Li, Y. Chen, and T. Mohsenin, “Genai at the edge: Comprehensive survey on empowering edge devices,” in Proceedings of the AAAI Symposium Series, vol. 5, no. 1, 2025, pp. 180–187

  47. [47]

    Gpt-4 technical report

    OpenAI, “Gpt-4 technical report,” https://cdn.openai.com/papers/gpt-4.pdf, 2023, accessed: 2025-11-16

  48. [48]

    Codegemm: A codebook-centric approach to efficient gemm in quantized llms

    G. Park, J. Bae, B. Kim, J. Ryu, H. Kim, S. J. Kwon, D. Lee et al., “Codegemm: A codebook-centric approach to efficient gemm in quantized llms,” arXiv preprint arXiv:2512.17970, 2025

  49. [49]

    Figlut: An energy-efficient accelerator design for fp-int gemm using look-up tables

    G. Park, H. Kwon, J. Kim, J. Bae, B. Park, D. Lee, and Y. Lee, “Figlut: An energy-efficient accelerator design for fp-int gemm using look-up tables,” in 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2025, pp. 1098–1111

  50. [50]

    Splitwise: Efficient generative LLM inference using phase splitting

    P. Patel, E. Choukse, C. Zhang, A. Shah, Í. Goiri, S. Maleki, and R. Bianchini, “Splitwise: Efficient generative LLM inference using phase splitting,” in 51st ACM/IEEE Annual International Symposium on Computer Architecture, ISCA 2024, Buenos Aires, Argentina, June 29 - July 3, 2024. IEEE, 2024, pp. 118–132. [Online]. Available: https://doi.org/10.1109/IS...

  51. [51]

    Language models are unsupervised multitask learners

    A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsupervised multitask learners,” 2019

  52. [52]

    Choice of plausible alternatives: An evaluation of commonsense causal reasoning

    M. Roemmele, C. A. Bejan, and A. S. Gordon, “Choice of plausible alternatives: An evaluation of commonsense causal reasoning,” in AAAI spring symposium: logical formalizations of commonsense reasoning, 2011, pp. 90–95

  53. [53]

    Winogrande: An adversarial winograd schema challenge at scale

    K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi, “Winogrande: An adversarial winograd schema challenge at scale,” Communications of the ACM, vol. 64, no. 9, pp. 99–106, 2021

  54. [54]

    Platinum: Path-adaptable lut-based accelerator tailored for low-bit weight matrix multiplication

    H. Shan, C. Guo, C. Wei, F. Cheng, J. Zhang, H. H. Li, and Y. Chen, “Platinum: Path-adaptable lut-based accelerator tailored for low-bit weight matrix multiplication,” in 2026 31st Asia and South Pacific Design Automation Conference (ASP-DAC). IEEE, 2026, pp. 1449–1455

  55. [55]

    The Llama 3 Herd of Models

    L. Team, “The llama 3 herd of models,” CoRR, vol. abs/2407.21783, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2407.21783

  57. [57]

    Thakur, B

    S. Thakur, B. Ahmad, H. Pearce, B. Tan, B. Dolan-Gavitt, R. Karri, and S. Garg, “Verigen: A large language model for verilog code generation,”ACM Trans. Des. Autom. Electron. Syst., vol. 29, no. 3, Apr. 2024. [Online]. Available: https://doi.org/10.1145/3643681

  58. [58]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    H. Touvron and et al., “Llama 2: Open foundation and fine-tuned chat models,”CoRR, vol. abs/2307.09288, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2307.09288

  59. [59]

    LLaMA: Open and Efficient Foundation Language Models

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozi `ere, N. Goyal, E. Hambro, F. Azharet al., “Llama: Open and efficient foundation language models,”arXiv preprint arXiv:2302.13971, 2023

  60. [60]

    Quip#: Even better llm quantization with hadamard incoherence and lattice codebooks

    A. Tseng, J. Chee, Q. Sun, V. Kuleshov, and C. De Sa, “Quip#: Even better llm quantization with hadamard incoherence and lattice codebooks,” arXiv preprint arXiv:2402.04396, 2024

  61. [61]

    Gptvq: The blessing of dimensionality for llm quantization

    M. Van Baalen, A. Kuzmin, I. Koryakovskiy, M. Nagel, P. Couperus, C. Bastoul, E. Mahurin, T. Blankevoort, and P. Whatmough, “Gptvq: The blessing of dimensionality for llm quantization,” arXiv preprint arXiv:2402.15319, 2024

  62. [62]

    Attention is all you need

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S...

  63. [63]

    BitNet: Scaling 1-bit Transformers for Large Language Models

    H. Wang, S. Ma, L. Dong, S. Huang, H. Wang, L. Ma, F. Yang, R. Wang, Y. Wu, and F. Wei, “Bitnet: Scaling 1-bit transformers for large language models,” arXiv preprint arXiv:2310.11453, 2023

  64. [64]

    Dobi-svd: Differentiable svd for llm compression and some new perspectives

    Q. Wang*, J. Ke*, M. Tomizuka, K. Keutzer, and C. Xu, “Dobi-svd: Differentiable svd for llm compression and some new perspectives,” in The Thirteenth International Conference on Learning Representations, 2025

  65. [65]

    Angles don’t lie: Unlocking training-efficient rl through the model’s own signals

    Q. Wang, J. Ke, H. Ye, Y. Lin, Y. Fu, J. Zhang, K. Keutzer, C. Xu, and Y. Chen, “Angles don’t lie: Unlocking training-efficient rl through the model’s own signals,” arXiv preprint arXiv:2506.02281, 2025

  66. [66]

    Phi: Leveraging pattern-based hierarchical sparsity for high-efficiency spiking neural networks

    C. Wei, B. Duan, C. Guo, J. Zhang, Q. Song, H. Li, and Y. Chen, “Phi: Leveraging pattern-based hierarchical sparsity for high-efficiency spiking neural networks,” in Proceedings of the 52nd Annual International Symposium on Computer Architecture, 2025, pp. 930–943

  67. [67]

    Prosperity: Accelerating spiking neural networks via product sparsity

    C. Wei, C. Guo, F. Cheng, S. Li, H. F. Yang, H. H. Li, and Y. Chen, “Prosperity: Accelerating spiking neural networks via product sparsity,” in 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2025, pp. 806–820

  68. [68]

    Focus: A streaming concentration architecture for efficient vision-language models

    C. Wei, C. Guo, J. Zhang, H. Shan, Y. Xu, Z. Zhang, Y. Liu, Q. Wang, C. Zhou, H. H. Li et al., “Focus: A streaming concentration architecture for efficient vision-language models,” in 2026 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2026, pp. 1–18

  69. [69]

    Smoothquant: Accurate and efficient post-training quantization for large language models

    G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, and S. Han, “Smoothquant: Accurate and efficient post-training quantization for large language models,” in International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, ser. Proceedings of Machine Learning Research, A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, a... PMLR, 2023, pp. 38 087–38 099. [Online]. Available: https://proceedings.mlr.press/v202/xiao23c.html

  71. [71]

    LLM.265: Video codecs are secretly tensor codecs

    C. Xu, Y. Wu, X. Yang, B. Chen, M. Lentz, D. Zhuo, and L. W. Wills, “LLM.265: Video codecs are secretly tensor codecs,” in Proceedings of the 58th IEEE/ACM International Symposium on Microarchitecture, 2025, pp. 445–460

  72. [72]

    Qwen3 Technical Report

    A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv et al., “Qwen3 technical report,” arXiv preprint arXiv:2505.09388, 2025

  73. [73]

    Zeroquant: Efficient and affordable post-training quantization for large-scale transformers

    Z. Yao, R. Yazdani Aminabadi, M. Zhang, X. Wu, C. Li, and Y. He, “Zeroquant: Efficient and affordable post-training quantization for large-scale transformers,” Advances in neural information processing systems, vol. 35, pp. 27 168–27 183, 2022

  74. [74]

    Shiftaddllm: Accelerating pretrained llms via post-training multiplication-less reparameterization

    H. You, Y. Guo, Y. Fu, W. Zhou, H. Shi, X. Zhang, S. Kundu, A. Yazdanbakhsh, and Y. C. Lin, “Shiftaddllm: Accelerating pretrained llms via post-training multiplication-less reparameterization,” Advances in Neural Information Processing Systems, vol. 37, pp. 24 822–24 848, 2024

  75. [75]

    Orca: A distributed serving system for transformer-based generative models

    G. Yu, J. S. Jeong, G. Kim, S. Kim, and B. Chun, “Orca: A distributed serving system for transformer-based generative models,” in 16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2022, Carlsbad, CA, USA, July 11-13, 2022, M. K. Aguilera and H. Weatherspoon, Eds. USENIX Association, 2022, pp. 521–. [Online]. Available: https://www.usenix.org/conference/osdi22/presentation/yu

  77. [77]

    Gobo: Quantizing attention-based nlp models for low latency and energy efficient inference

    A. H. Zadeh, I. Edo, O. M. Awad, and A. Moshovos, “Gobo: Quantizing attention-based nlp models for low latency and energy efficient inference,” in 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2020, pp. 811–824

  78. [78]

    KV cache is 1 bit per channel: Efficient large language model inference with coupled quantization

    T. Zhang, J. Yi, Z. Xu, and A. Shrivastava, “Kv cache is 1 bit per channel: Efficient large language model inference with coupled quantization,” 2024. [Online]. Available: https://arxiv.org/abs/2405.03917

    APPENDIX

    A. Abstract

    Our artifact contains (1) a hardware simulator that reproduces all hardware evaluation results (Figures 8–14 and Tables III, VIII–IX) from the EV...

    How to access: The source code is publicly available at https://github.com/dbw6/Eva.git. The artifact sources are also archived at Zenodo: https://doi.org/10.5281/zenodo.19433707. Pretrained weights for all evaluated models (LLaMA-2-7B, LLaMA-2-13B, Mixtral-8x7B, and Qwen3-30B-A3B) and datasets are available at https://huggingface.co/collections/dbw6/eva

    Hardware dependencies: Hardware Simulator: Any x86-64 machine with at least 16 GB of RAM and 10 GB of free disk space. No GPU is required; all simulations run on the CPU. Internet access is required for the first run to download Hugging Face models and datasets. Algorithm Evaluation: An NVIDIA GPU with at least 24 GB VRAM (A100-80GB recommended), CUDA 12.x, ...
