EVA: Accelerating LLM Decoding via an Efficient Vector Quantization Architecture

Bowen Duan, Cong Guo, Chiyue Wei, Haoxuan Shan, Yuzhe Fu, Xinhua Chen, Yifan Xu, Ziyue Zhang, Changchun Zhou, Hai Li, Yiran Chen

arxiv: 2605.24144 · v1 · pith:LJ4JVHNEnew · submitted 2026-05-22 · 💻 cs.AR · cs.LG

Pith reviewed 2026-06-30 14:31 UTC · model grok-4.3

classification 💻 cs.AR cs.LG

keywords vector quantization · LLM decoding · GEMV acceleration · hardware architecture · memory bank conflicts · energy efficiency · autoregressive inference

The pith

EVA accelerates LLM decoding by up to 11.17× over state-of-the-art lookup-based designs by computing dot products directly between the input and the weight codebook and by using structured lookups to avoid memory bank conflicts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that vector quantization can be restructured to overcome the memory-bound limits of autoregressive LLM decoding. Instead of reconstructing weights from indices for small GEMV operations, EVA computes dot products straight between input vectors and the shared codebook, then pulls results via structured lookups from an intermediate buffer. This turns the work into more efficient GEMM-style computation while removing bank conflicts. A sympathetic reader would care because decoding dominates inference time and energy use, and the method claims large gains without losing precision.

Core claim

EVA builds on vector quantization by performing direct dot products between input vectors and the weight codebook, which transforms LLM decoding from GEMV-like to GEMM computation, then executes structured lookups from an intermediate output buffer to eliminate memory bank conflicts. The architecture remains compatible with standard prefill execution and delivers up to 11.17× speedup and 7.17× higher energy efficiency over state-of-the-art lookup-based designs while preserving arithmetic precision.
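The equivalence behind this claim can be checked numerically. The sketch below is a minimal NumPy illustration, not the authors' implementation. It assumes a single codebook shared by every sub-vector and a weight matrix split along its input dimension into length-`d` sub-vectors; the function names `reconstruct_then_gemv` and `codebook_gemm_then_lookup` are hypothetical. It compares reconstructing the weights before a GEMV against computing the input-codebook products once and gathering them by index.

```python
import numpy as np


def reconstruct_then_gemv(x, codebook, indices):
    """Baseline: rebuild the K x N weight matrix from indices, then compute x @ W."""
    d = codebook.shape[1]
    V, N = indices.shape
    # codebook[indices] has shape (V, N, d); rearrange so that
    # W[v*d:(v+1)*d, j] == codebook[indices[v, j]].
    W = codebook[indices].transpose(0, 2, 1).reshape(V * d, N)
    return x @ W


def codebook_gemm_then_lookup(x, codebook, indices):
    """Codebook-first: one small GEMM, then index-driven gathers and a sum."""
    d = codebook.shape[1]
    V, N = indices.shape
    # Intermediate buffer (assumed layout): partial[v, c] = <x_v, codebook[c]>.
    partial = x.reshape(V, d) @ codebook.T           # shape (V, 2**n)
    # For each output column j, pick partial[v, indices[v, j]] for every v.
    gathered = np.take_along_axis(partial, indices, axis=1)  # shape (V, N)
    return gathered.sum(axis=0)
```

For random inputs the two functions agree to floating-point tolerance. The summation order differs between them, so this sketch does not guarantee bit-identical results.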

What carries the argument

Direct input-codebook dot product computation followed by structured lookups from an intermediate output buffer.

Load-bearing premise

Direct input-codebook dot products combined with structured lookups from an intermediate output buffer will eliminate memory bank conflicts and maintain precision without introducing new hardware bottlenecks or overheads.
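To make the bank-conflict part of this premise concrete, the toy model below counts how many cycles a group of parallel reads needs when each memory bank serves one read per cycle. Both layouts are assumptions made for illustration: this page does not describe EVA's actual buffer organization, and the function names are hypothetical. The first layout places codebook entry `c` in bank `c % num_banks`, so data-dependent indices can collide. The second gives each lane its own bank holding its row of the intermediate buffer, so reads never collide regardless of the index values.

```python
from collections import Counter


def cycles_for_parallel_reads(bank_ids):
    """Cycles needed when each bank can serve one read per cycle."""
    return max(Counter(bank_ids).values())


def codebook_lookup_banks(indices, num_banks):
    """Assumed layout: codebook entry c is stored in bank c % num_banks."""
    return [c % num_banks for c in indices]


def row_per_lane_banks(indices):
    """Assumed layout: lane v reads only from bank v, whatever its index is."""
    return list(range(len(indices)))
```

Under these assumptions, four lanes requesting indices `[1, 5, 2, 9]` from a four-bank codebook need three cycles, because three of the requests map to bank 1, while the row-per-lane layout always needs one.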

What would settle it

A cycle-accurate hardware simulation or FPGA prototype run on real LLM decoding workloads that shows whether measured speedup falls below 5× or arithmetic precision deviates from the original model when memory conflicts or new overheads appear.

Figures

Figures reproduced from arXiv: 2605.24144 by Bowen Duan, Changchun Zhou, Chiyue Wei, Cong Guo, Hai Li, Haoxuan Shan, Xinhua Chen, Yifan Xu, Yiran Chen, Yuzhe Fu, Ziyue Zhang.

**Figure 2.** Comparison of quantization schemes: (a) uniform quantization

**Figure 3.** Overview of the EVA computation flow and architecture. […] vector that is replaced by an index referencing a shared weight codebook (WC) B. The codebook contains $2^n$ representative $d$-dimensional centroids (where $n$ is the index bit-width), each learned from the distribution of weights using k-means clustering [15], [22]. All $K \times N$ weight elements are therefore represented as $V \times N$ indices, where $V = K/d$. The wei…

**Figure 4.** Tiling strategy of the VQ-GEMM operation in

**Figure 6.** Epilogue unit (EU) for conflict-free lookup in

**Figure 7.** Execution scheduling of EVA. (a) Runtime estimation showing that the GEMM stage is not the bottleneck. (b) EU scaling pipeline demonstrating GEMM–Epilogue overlap. (c) Multi-batch reuse, where multiple requests share the same weight tiles to reduce bandwidth cost. […] achieves the same functionality with lower computation and hardware cost, without sacrificing arithmetic precision. B. EU Scaling

**Figure 8.** Design space exploration of EVA. (a) Number of Epilogue Units with latency and energy. (b) Number of Epilogue Units with area. […] VQ parameters. Tbl. III shows how EVA’s latency varies across different VQ configurations. Here, $N$ denotes the minimum number of output channels sharing the same codebook; for AQLM, this corresponds to the linear-layer output dimension ($N \geq 4096$). As shown in the table, when $2^n$ <…

**Figure 9.** EVA area and power breakdown. […] does not rely on any particular method and can benefit from future improvements, such as fine-tuning [42] or other emerging optimizations [35], [58], [74]. D. Area and Power Comparison. As shown in Tbl. VIII, we compare the area and power of EVA and the baseline architectures. For a fair comparison, all designs are configured with the same minimum number of PEs. For EVA, compar…

**Figure 10.** Latency and energy consumption of the EVA and baseline accelerators on the fully connected layer with batch size = 1 during the decoding phase of the LLaMA models. Method-AnWm denotes n-bit activation and m-bit weight. […] only one lane is effectively active when batch size = 1. The utilization rates of ANT and FIGNA are further reduced due to the increased pipeline fill and drain overhead. As a LUT-based metho…

**Figure 11.** (a) Latency (s) and (b) energy (J) versus batch size (1, 2, 4, 8, 16, 32, 64) for SA-A8W8, ANT-A8W8, FIGNA-A16W4, FIGLUT-A16W4, EVA-A16W4, EVA-A16W3, EVA-A16W2, and EVA-A8W8. Panel (b) shows how energy consumption varies with batch size.

**Figure 13.** End-to-end latency evaluation of MoE models on the (a) Arxiv

**Figure 14.** Evaluation of spurious computations in the proposed method. (a)
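The Figure 3 excerpt above fixes the storage layout: $2^n$ centroids of dimension $d$, and $V \times N$ indices with $V = K/d$ in place of $K \times N$ weights. The helper below is a small sketch of the resulting storage arithmetic. It assumes one codebook per weight matrix and 16-bit codebook elements, neither of which is stated on this page; the function names and the example configurations are hypothetical.

```python
def vq_storage_bits(K, N, d, n, codebook_elem_bits=16):
    """Return (index_bits, codebook_bits) for a K x N matrix under VQ.

    d: sub-vector length, n: index bit-width.
    codebook_elem_bits is an assumed precision for stored centroids.
    """
    if K % d != 0:
        raise ValueError("K must be divisible by d")
    V = K // d                                  # sub-vectors per output column
    index_bits = V * N * n                      # one n-bit index per sub-vector
    codebook_bits = (2 ** n) * d * codebook_elem_bits
    return index_bits, codebook_bits


def bits_per_weight(K, N, d, n, codebook_elem_bits=16):
    """Average storage per original weight, including the codebook."""
    index_bits, codebook_bits = vq_storage_bits(K, N, d, n, codebook_elem_bits)
    return (index_bits + codebook_bits) / (K * N)
```

The index cost alone is `n / d` bits per weight, so `d = 4, n = 8` and `d = 8, n = 16` both give 2 bits per weight before the codebook is counted. In this model the codebook adds little for the first configuration and about half a bit per weight for the second on a 4096 × 4096 matrix.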

Original abstract

Large Language Models (LLMs) have achieved impressive performance across diverse domains but remain inefficient during the autoregressive decoding phase. Unlike the prefill stage, which employs compute-bound GEMM operations, decoding executes a sequence of small GEMV-like computations that are memory-bound and underutilize modern accelerators. Weight-only vector quantization (VQ) has emerged as an effective compression technique that clusters model weights into a shared codebook and replaces the original weight matrix with low-precision indices, enabling 2-bit-level weight compression. While this approach substantially reduces model size and memory bandwidth, it still suffers from two critical inefficiencies: the low utilization of GEMV computation and frequent memory conflicts during codebook lookups. This paper presents EVA, an efficient vector-quantization-based architecture that addresses both computational and memory bottlenecks in LLM decoding. EVA builds on a simple yet effective insight that combines input-codebook computation with conflict-free memory access. Instead of reconstructing quantized weights from indices, EVA directly performs dot products between input vectors and the weight codebook, transforming LLM decoding from GEMV to GEMM computation. It then performs structured lookups from an intermediate output buffer, eliminating memory bank conflicts. We further design a hardware-software co-optimized architecture specialized for LLM decoding while remaining compatible with conventional prefill execution. Evaluations show that EVA achieves up to 11.17$\times$ speedup and 7.17$\times$ higher energy efficiency compared with the SOTA lookup-based architecture, while preserving arithmetic precision after vector quantization. Our code is available at https://github.com/dbw6/Eva.git.
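A rough operation count shows why computing against the codebook can pay off. The tally below is a back-of-the-envelope sketch, not a result from the paper. It assumes batch size 1, a single codebook shared by all $V = K/d$ sub-vectors, and the symbols $K$, $N$, $d$, $n$ as defined in the Figure 3 excerpt.

```latex
\begin{align*}
\text{reconstruct, then GEMV:}\quad & K N \ \text{multiply-accumulates} \\
\text{input-codebook GEMM:}\quad & V \cdot 2^{n} \cdot d = K \cdot 2^{n} \ \text{multiply-accumulates} \\
\text{lookup and accumulate:}\quad & V N = \frac{K N}{d} \ \text{reads and additions}
\end{align*}
```

For a hypothetical layer with $K = N = 4096$, $d = 8$, and $n = 8$, the baseline needs 16,777,216 multiply-accumulates, while the codebook-first path needs 1,048,576 multiply-accumulates plus 2,097,152 lookup-and-add steps. Multiplications fall by a factor of $N / 2^{n} = 16$ in this example. The count says nothing about memory traffic, stalls, or the measured speedups reported in the abstract.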

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

EVA's claimed speedups depend on the intermediate buffer adding no real overhead or stalls, but the abstract supplies no cycle data or sensitivity checks to confirm that.

EVA describes a hardware architecture for vector-quantized LLM decoding that skips weight reconstruction by computing input-codebook dot products directly, turning the usual memory-bound GEMV into GEMM, then pulls results via structured lookups from an intermediate buffer to avoid bank conflicts. The design stays compatible with standard prefill execution.

This combination of direct codebook computation and conflict-free buffering is the concrete new piece. The paper correctly flags the two remaining problems in existing lookup-based VQ accelerators and shows a practical way to address both at once. Releasing the code is useful for anyone who wants to test the claims.

The main soft spot is the evaluation. The abstract reports up to 11.17× speedup and 7.17× better energy efficiency over the SOTA lookup design while keeping arithmetic precision, yet it gives no cycle-level breakdown, no buffer-size sensitivity, no tiling details for small batches, and no comparison against a wider set of baselines. The overhead concern stands: without evidence that address generation and the intermediate buffer do not reintroduce stalls or energy cost, the net gain could be smaller once the full pipeline is built. Precision preservation is asserted but not quantified.

The work is aimed at hardware designers building accelerators for quantized inference. A reader already working on memory-bound decoding or VQ mappings would get concrete ideas to compare against their own designs.

It is worth sending to peer review. The architecture is specific, the code is public, and the central assumption is testable, so referees can check whether the reported gains survive a full implementation.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes the EVA architecture for accelerating autoregressive LLM decoding under weight-only vector quantization. It replaces weight reconstruction with direct input-codebook dot products (converting memory-bound GEMV to GEMM) followed by structured index-driven lookups from an intermediate output buffer to eliminate memory bank conflicts. The design is claimed to remain compatible with conventional prefill execution. Evaluations are reported to deliver up to 11.17× speedup and 7.17× energy-efficiency gains versus the SOTA lookup-based accelerator while preserving arithmetic precision; open-source code is provided.

Significance. If the reported speedups and energy gains are substantiated by detailed hardware measurements that confirm the absence of new bottlenecks, the work would offer a practical contribution to specialized accelerators for quantized LLM inference. The open code release supports reproducibility.

major comments (2)

[Abstract and §4 (Evaluation)] The headline claims of 11.17× speedup and 7.17× energy efficiency rest on the assertion that input-codebook GEMM plus structured buffer lookups remove bank conflicts without material control or buffering overhead. No cycle-level breakdown, stall analysis, or sensitivity study on intermediate-buffer size/address-generation logic is supplied, leaving the central hardware assumption unverified.
[§3 (Architecture)] The transformation from GEMV to GEMM via direct codebook dot products is described at a high level, but the manuscript does not quantify the additional GEMM tiling or prefill-compatibility overheads for small decoding batches, which directly affects whether net gains over prior lookup accelerators are preserved.

minor comments (2)

The abstract states that arithmetic precision is preserved after vector quantization, but the manuscript should explicitly state the bit-widths, codebook sizes, and any rounding or accumulation-precision details used in the GEMM step.
Figure captions and table headers in the evaluation section would benefit from clearer labeling of the exact models, sequence lengths, and hardware parameters corresponding to the reported speedups.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our work. Below we address each major comment point by point. We will make revisions to strengthen the hardware analysis as suggested.

Referee: [Abstract and §4 (Evaluation)] The headline claims of 11.17× speedup and 7.17× energy efficiency rest on the assertion that input-codebook GEMM plus structured buffer lookups remove bank conflicts without material control or buffering overhead. No cycle-level breakdown, stall analysis, or sensitivity study on intermediate-buffer size/address-generation logic is supplied, leaving the central hardware assumption unverified.

Authors: We agree with the referee that providing a cycle-level breakdown would better substantiate our claims. Although our reported speedups are derived from a detailed cycle-accurate simulation model that includes control logic and buffering, we did not include an explicit breakdown or sensitivity study in the manuscript. We will revise §4 to include stall analysis, cycle breakdowns, and sensitivity studies on the intermediate-buffer size and address-generation logic. Revision: yes.

Referee: [§3 (Architecture)] The transformation from GEMV to GEMM via direct codebook dot products is described at a high level, but the manuscript does not quantify the additional GEMM tiling or prefill-compatibility overheads for small decoding batches, which directly affects whether net gains over prior lookup accelerators are preserved.

Authors: The comment is valid; the current §3 focuses on the core insight without detailed overhead quantification for edge cases like small batches. In the revision, we will add analysis and measurements quantifying the GEMM tiling overheads and prefill compatibility costs for small decoding batches, showing that the net performance gains remain significant. Revision: yes.

Circularity Check

0 steps flagged

No circularity: engineering architecture proposal with no derivation chain

The paper is an engineering architecture proposal for LLM decoding hardware. It contains no mathematical derivation, no equations that reduce to inputs by construction, no fitted parameters renamed as predictions, and no load-bearing self-citations or uniqueness theorems. Performance claims (speedup, energy efficiency) are presented as outcomes of the proposed design and simulation/evaluation, not as tautological results of prior steps within the paper. The work is therefore self-contained with no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

This is a hardware architecture paper; the central claim rests on the proposed design choices for computation and memory access rather than mathematical axioms, fitted parameters, or new physical entities.

invented entities (1)

EVA architecture · no independent evidence
Purpose: hardware design that performs direct codebook dot products and conflict-free lookups for LLM decoding.
New system proposed in the paper to address VQ inefficiencies.

Reference graph

Works this paper leans on

83 extracted references · 34 canonical work pages · 18 internal anchors

[1]

SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills

A. Agrawal, A. Panwar, J. Mohan, N. Kwatra, B. S. Gulavani, and R. Ramjee, “Sarathi: Efficient llm inference by piggybacking decodes with chunked prefills,” arXiv preprint arXiv:2308.16369, 2023
[2]

Qwen Technical Report

J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang et al., “Qwen technical report,” arXiv preprint arXiv:2309.16609, 2023
[3]

Cacti 7: New tools for interconnect exploration in innovative off-chip memories

R. Balasubramonian, A. B. Kahng, N. Muralimanohar, A. Shafiee, and V. Srinivas, “Cacti 7: New tools for interconnect exploration in innovative off-chip memories,” ACM Transactions on Architecture and Code Optimization (TACO), vol. 14, no. 2, pp. 1–25, 2017
[4]

Piqa: Reasoning about physical commonsense in natural language

Y. Bisk, R. Zellers, J. Gao, Y. Choi et al., “Piqa: Reasoning about physical commonsense in natural language,” in Proceedings of the AAAI conference on artificial intelligence, vol. 34, no. 05, 2020, pp. 7432–7439
[5]

Multiplying matrices without multiplying

D. Blalock and J. Guttag, “Multiplying matrices without multiplying,” in International Conference on Machine Learning. PMLR, 2021, pp. 992–1004
[6]

Ecco: Improving memory bandwidth and capacity for llms via entropy-aware cache compression

F. Cheng, C. Guo, C. Wei, J. Zhang, C. Zhou, E. Hanson, J. Zhang, X. Liu, H. Li, and Y. Chen, “Ecco: Improving memory bandwidth and capacity for llms via entropy-aware cache compression,” in Proceedings of the 52nd Annual International Symposium on Computer Architecture, 2025, pp. 793–807
[7]

Nvidia a100 tensor core gpu: Performance and innovation

J. Choquette, W. Gandhi, O. Giroux, N. Stam, and R. Krashinsky, “Nvidia a100 tensor core gpu: Performance and innovation,” IEEE Micro, vol. 41, no. 2, pp. 29–35, 2021
[8]

Boolq: Exploring the surprising difficulty of natural yes/no questions

C. Clark, K. Lee, M.-W. Chang, T. Kwiatkowski, M. Collins, and K. Toutanova, “Boolq: Exploring the surprising difficulty of natural yes/no questions,” in Proceedings of the 2019 conference of the north American chapter of the association for computational linguistics: Human language technologies, volume 1 (long and short papers), 2019, pp. 2924–2936
[9]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord, “Think you have solved question answering? try arc, the ai2 reasoning challenge,” arXiv preprint arXiv:1803.05457, 2018
[10]

Training Verifiers to Solve Math Word Problems

K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano et al., “Training verifiers to solve math word problems,” arXiv preprint arXiv:2110.14168, 2021
[11]

A discourse-aware attention model for abstractive summarization of long documents

A. Cohan, F. Dernoncourt, D. S. Kim, T. Bui, S. Kim, W. Chang, and N. Goharian, “A discourse-aware attention model for abstractive summarization of long documents,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), 2018, pp. 615–621
[12]

Free dolly: Introducing the world’s first truly open instruction-tuned llm

M. Conover, M. Hayes, A. Mathur, J. Xie, J. Wan, S. Shah, A. Ghodsi, P. Wendell, M. Zaharia, and R. Xin. (2023) Free dolly: Introducing the world’s first truly open instruction-tuned llm. [Online]. Available: https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm
[13]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI, “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,” CoRR, vol. abs/2501.12948, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2501.12948
[14]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: pre-training of deep bidirectional transformers for language understanding,” CoRR, vol. abs/1810.04805, 2018. [Online]. Available: http://arxiv.org/abs/1810.04805
[15]

Extreme compression of large language models via additive quantization

V. Egiazarian, A. Panferov, D. Kuznedelev, E. Frantar, A. Babenko, and D. Alistarh, “Extreme compression of large language models via additive quantization,” 2024
[16]

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh, “GPTQ: accurate post-training quantization for generative pre-trained transformers,” CoRR, vol. abs/2210.17323, 2022. [Online]. Available: https://doi.org/10.48550/arXiv.2210.17323
[17]

Github copilot

GitHub, “Github copilot,” https://github.com/features/copilot, 2025, accessed: 2025-11-17
[18]

Carvq: Corrective adaptor with group residual vector quantization for llm embedding compression

D. Gou, S. Byun, N. Malpeddi, G. De Micheli, P. Vaste, J. Song, and W. S. Chung, “Carvq: Corrective adaptor with group residual vector quantization for llm embedding compression,” in Findings of the Association for Computational Linguistics: EMNLP 2025, 2025, pp. 18 594–18 604
[19]

Olive: Accelerating large language models via hardware-friendly outlier-victim pair quantization

C. Guo, J. Tang, W. Hu, J. Leng, C. Zhang, F. Yang, Y. Liu, M. Guo, and Y. Zhu, “Olive: Accelerating large language models via hardware-friendly outlier-victim pair quantization,” in Proceedings of the 50th Annual International Symposium on Computer Architecture, 2023, pp. 1–15
[20]

Transitive array: An efficient gemm accelerator with result reuse

C. Guo, C. Wei, J. Tang, B. Duan, S. Han, H. Li, and Y. Chen, “Transitive array: An efficient gemm accelerator with result reuse,” in Proceedings of the 52nd Annual International Symposium on Computer Architecture, 2025, pp. 990–1004
[21]

Ant: Exploiting adaptive numerical data type for low-bit deep neural network quantization

C. Guo, C. Zhang, J. Leng, Z. Liu, F. Yang, Y. Liu, M. Guo, and Y. Zhu, “Ant: Exploiting adaptive numerical data type for low-bit deep neural network quantization,” in 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2022, pp. 1414–1433
[22]

Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding

S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” arXiv preprint arXiv:1510.00149, 2015
[23]

M-ant: Efficient low-bit group quantization for llms via mathematically adaptive numerical type

W. Hu, H. Zhang, C. Guo, Y. Feng, R. Guan, Z. Hua, Z. Liu, Y. Guan, M. Guo, and J. Leng, “M-ant: Efficient low-bit group quantization for llms via mathematically adaptive numerical type,” in 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2025, pp. 1112–1126
[24]

I-llm: Efficient integer-only inference for fully-quantized low-bit large language models

X. Hu, Y. Cheng, D. Yang, Z. Yuan, J. Yu, C. Xu, and S. Zhou, “I-llm: Efficient integer-only inference for fully-quantized low-bit large language models,” arXiv preprint arXiv:2405.17849, 2024
[25]

Residual quantization with implicit neural codebooks

I. A. Huijben, M. Douze, M. Muckley, R. J. Van Sloun, and J. Verbeek, “Residual quantization with implicit neural codebooks,” arXiv preprint arXiv:2401.14732, 2024
[26]

Quantization and training of neural networks for efficient integer-arithmetic-only inference

B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko, “Quantization and training of neural networks for efficient integer-arithmetic-only inference,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 2704–2713
[27]

Figna: Integer unit-based accelerator design for fp-int gemm preserving numerical accuracy

J. Jang, Y. Kim, J. Lee, and J.-J. Kim, “Figna: Integer unit-based accelerator design for fp-int gemm preserving numerical accuracy,” in 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2024, pp. 760–773
[28]

Mistral 7B

A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de Las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed, “Mistral 7b,” CoRR, vol. abs/2310.06825, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2310.06825
[30]

Mixtral of Experts

A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. d. l. Casas, E. B. Hanna, F. Bressand et al., “Mixtral of experts,” arXiv preprint arXiv:2401.04088, 2024
[31]

In-datacenter performance analysis of a tensor processing unit

N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers et al., “In-datacenter performance analysis of a tensor processing unit,” in Proceedings of the 44th annual international symposium on computer architecture, 2017, pp. 1–12
[32]

I-bert: Integer-only bert quantization

S. Kim, A. Gholami, Z. Yao, M. W. Mahoney, and K. Keutzer, “I-bert: Integer-only bert quantization,” in International conference on machine learning. PMLR, 2021, pp. 5506–5518
[33]

Squeezellm: dense-and-sparse quantization

S. Kim, C. Hooper, A. Gholami, Z. Dong, X. Li, S. Shen, M. W. Mahoney, and K. Keutzer, “Squeezellm: dense-and-sparse quantization,” in Proceedings of the 41st International Conference on Machine Learning, ser. ICML’24. JMLR.org, 2024
[34]

Amq: Enabling automl for mixed-precision weight-only quantization of large language models

S. Lee, S.-t. Woo, J.-g. Jin, C. Lee, and E. Park, “Amq: Enabling automl for mixed-precision weight-only quantization of large language models,” in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025, pp. 35 520–35 538
[35]

Lut-dla: Lookup table as efficient extreme low-bit deep learning accelerator

G. Li, S. Ye, C. Chen, Y. Wang, F. Yang, T. Cao, C. Liu, M. M. S. Aly, and M. Yang, “Lut-dla: Lookup table as efficient extreme low-bit deep learning accelerator,” in 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2025, pp. 671–684
[36]

Commvq: Commutative vector quantization for kv cache compression

J. Li, Y. Zhang, M. Y. Hassan, T. Chafekar, T. Cai, Z. Ren, P. Guo, F. Karimzadeh, C. Wang, and C. Gan, “Commvq: Commutative vector quantization for kv cache compression,” arXiv preprint arXiv:2506.18879, 2025
[37]

Dramsim3: A cycle-accurate, thermal-capable DRAM simulator

S. Li, Z. Yang, D. Reddy, A. Srivastava, and B. L. Jacob, “Dramsim3: A cycle-accurate, thermal-capable DRAM simulator,” IEEE Comput. Archit. Lett., vol. 19, no. 2, pp. 110–113, 2020. [Online]. Available: https://doi.org/10.1109/LCA.2020.2973991
[38]

Awq: Activation-aware weight quantization for on-device llm compression and acceleration

J. Lin, J. Tang, H. Tang, S. Yang, W.-M. Chen, W.-C. Wang, G. Xiao, X. Dang, C. Gan, and S. Han, “Awq: Activation-aware weight quantization for on-device llm compression and acceleration,” Proceedings of machine learning and systems, vol. 6, pp. 87–100, 2024
[39]

Speechprune: Context-aware token pruning for speech information retrieval

Y. Lin, Y. Fu, J. Zhang, Y. Liu, J. Zhang, J. Sun, H. Li, Y. Chen et al., “Speechprune: Context-aware token pruning for speech information retrieval,” arXiv preprint arXiv:2412.12009, 2024
[40]

Qserve: W4a8kv4 quantization and system co-design for efficient llm serving

Y. Lin, H. Tang, S. Yang, Z. Zhang, G. Xiao, C. Gan, and S. Han, “Qserve: W4a8kv4 quantization and system co-design for efficient llm serving,” arXiv preprint arXiv:2405.04532, 2024
[41]

Vptq: Extreme low-bit vector post-training quantization for large language models

Y. Liu, J. Wen, Y. Wang, S. Ye, L. L. Zhang, T. Cao, C. Li, and M. Yang, “Vptq: Extreme low-bit vector post-training quantization for large language models,” arXiv preprint arXiv:2409.17066, 2024
[42]

Vq-llm: High-performance code generation for vector quantization augmented llm inference

Z. Liu, X. Luo, J. Guo, W. Ni, Y. Zhou, Y. Guan, C. Guo, W. Cui, Y. Feng, M. Guo et al., “Vq-llm: High-performance code generation for vector quantization augmented llm inference,” in 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2025, pp. 1496–1509
[43]

Pv-tuning: Beyond straight-through estimation for extreme llm compression

V. Malinovskii, D. Mazur, I. Ilin, D. Kuznedelev, K. Burlachenko, K. Yi, D. Alistarh, and P. Richtarik, “Pv-tuning: Beyond straight-through estimation for extreme llm compression,” 2024
[44]

Pointer Sentinel Mixture Models

S. Merity, C. Xiong, J. Bradbury, and R. Socher, “Pointer sentinel mixture models,” arXiv preprint arXiv:1609.07843, 2016
[45]

LUT tensor core: Lookup table enables efficient low-bit LLM inference acceleration

Z. Mo, L. Wang, J. Wei, Z. Zeng, S. Cao, L. Ma, N. Jing, T. Cao, J. Xue, F. Yang, and M. Yang, “LUT tensor core: Lookup table enables efficient low-bit LLM inference acceleration,” CoRR, vol. abs/2408.06003, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2408.06003
[46]

Genai at the edge: Comprehensive survey on empowering edge devices

M. Navardi, R. Aalishah, Y. Fu, Y. Lin, H. Li, Y. Chen, and T. Mohsenin, “Genai at the edge: Comprehensive survey on empowering edge devices,” in Proceedings of the AAAI Symposium Series, vol. 5, no. 1, 2025, pp. 180–187
[47]

Gpt-4 technical report

OpenAI, “Gpt-4 technical report,” https://cdn.openai.com/papers/gpt-4.pdf, 2023, accessed: 2025-11-16
[48]

Codegemm: A codebook-centric approach to efficient gemm in quantized llms

G. Park, J. Bae, B. Kim, J. Ryu, H. Kim, S. J. Kwon, D. Lee et al., “Codegemm: A codebook-centric approach to efficient gemm in quantized llms,” arXiv preprint arXiv:2512.17970, 2025
[49]

Figlut: An energy-efficient accelerator design for fp-int gemm using look-up tables

G. Park, H. Kwon, J. Kim, J. Bae, B. Park, D. Lee, and Y. Lee, “Figlut: An energy-efficient accelerator design for fp-int gemm using look-up tables,” in 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2025, pp. 1098–1111
[50]

Splitwise: Efficient generative LLM inference using phase splitting

P. Patel, E. Choukse, C. Zhang, A. Shah, Í. Goiri, S. Maleki, and R. Bianchini, “Splitwise: Efficient generative LLM inference using phase splitting,” in 51st ACM/IEEE Annual International Symposium on Computer Architecture, ISCA 2024, Buenos Aires, Argentina, June 29 - July 3, 2024. IEEE, 2024, pp. 118–132. [Online]. Available: https://doi.org/10.1109/IS...

doi:10.1109/isca59077.2024.00019
[51] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsupervised multitask learners,” 2019
[52] M. Roemmele, C. A. Bejan, and A. S. Gordon, “Choice of plausible alternatives: An evaluation of commonsense causal reasoning,” in AAAI spring symposium: logical formalizations of commonsense reasoning, 2011, pp. 90–95
[53] K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi, “Winogrande: An adversarial winograd schema challenge at scale,” Communications of the ACM, vol. 64, no. 9, pp. 99–106, 2021
[54] H. Shan, C. Guo, C. Wei, F. Cheng, J. Zhang, H. H. Li, and Y. Chen, “Platinum: Path-adaptable lut-based accelerator tailored for low-bit weight matrix multiplication,” in 2026 31st Asia and South Pacific Design Automation Conference (ASP-DAC). IEEE, 2026, pp. 1449–1455
[55] L. Team, “The llama 3 herd of models,” CoRR, vol. abs/2407.21783, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2407.21783
[57] S. Thakur, B. Ahmad, H. Pearce, B. Tan, B. Dolan-Gavitt, R. Karri, and S. Garg, “Verigen: A large language model for verilog code generation,” ACM Trans. Des. Autom. Electron. Syst., vol. 29, no. 3, Apr. 2024. [Online]. Available: https://doi.org/10.1145/3643681
[58] H. Touvron et al., “Llama 2: Open foundation and fine-tuned chat models,” CoRR, vol. abs/2307.09288, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2307.09288
[59] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023
[60] A. Tseng, J. Chee, Q. Sun, V. Kuleshov, and C. De Sa, “Quip#: Even better llm quantization with hadamard incoherence and lattice codebooks,” arXiv preprint arXiv:2402.04396, 2024
[61] M. Van Baalen, A. Kuzmin, I. Koryakovskiy, M. Nagel, P. Couperus, C. Bastoul, E. Mahurin, T. Blankevoort, and P. Whatmough, “Gptvq: The blessing of dimensionality for llm quantization,” arXiv preprint arXiv:2402.15319, 2024
[62] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S...
[63] H. Wang, S. Ma, L. Dong, S. Huang, H. Wang, L. Ma, F. Yang, R. Wang, Y. Wu, and F. Wei, “Bitnet: Scaling 1-bit transformers for large language models,” arXiv preprint arXiv:2310.11453, 2023
[64] Q. Wang*, J. Ke*, M. Tomizuka, K. Keutzer, and C. Xu, “Dobi-svd: Differentiable svd for llm compression and some new perspectives,” in The Thirteenth International Conference on Learning Representations, 2025
[65] Q. Wang, J. Ke, H. Ye, Y. Lin, Y. Fu, J. Zhang, K. Keutzer, C. Xu, and Y. Chen, “Angles don’t lie: Unlocking training-efficient rl through the model’s own signals,” arXiv preprint arXiv:2506.02281, 2025
[66] C. Wei, B. Duan, C. Guo, J. Zhang, Q. Song, H. Li, and Y. Chen, “Phi: Leveraging pattern-based hierarchical sparsity for high-efficiency spiking neural networks,” in Proceedings of the 52nd Annual International Symposium on Computer Architecture, 2025, pp. 930–943
[67] C. Wei, C. Guo, F. Cheng, S. Li, H. F. Yang, H. H. Li, and Y. Chen, “Prosperity: Accelerating spiking neural networks via product sparsity,” in 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2025, pp. 806–820
[68] C. Wei, C. Guo, J. Zhang, H. Shan, Y. Xu, Z. Zhang, Y. Liu, Q. Wang, C. Zhou, H. H. Li et al., “Focus: A streaming concentration architecture for efficient vision-language models,” in 2026 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2026, pp. 1–18
[69] G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, and S. Han, “Smoothquant: Accurate and efficient post-training quantization for large language models,” in International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, ser. Proceedings of Machine Learning Research, A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, a... PMLR, 2023, pp. 38 087–38 099. [Online]. Available: https://proceedings.mlr.press/v202/xiao23c.html
[71] C. Xu, Y. Wu, X. Yang, B. Chen, M. Lentz, D. Zhuo, and L. W. Wills, “Llm.265: Video codecs are secretly tensor codecs,” in Proceedings of the 58th IEEE/ACM International Symposium on Microarchitecture®, 2025, pp. 445–460
[72] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv et al., “Qwen3 technical report,” arXiv preprint arXiv:2505.09388, 2025
[73] Z. Yao, R. Yazdani Aminabadi, M. Zhang, X. Wu, C. Li, and Y. He, “Zeroquant: Efficient and affordable post-training quantization for large-scale transformers,” Advances in neural information processing systems, vol. 35, pp. 27 168–27 183, 2022
[74] H. You, Y. Guo, Y. Fu, W. Zhou, H. Shi, X. Zhang, S. Kundu, A. Yazdanbakhsh, and Y. C. Lin, “Shiftaddllm: Accelerating pretrained llms via post-training multiplication-less reparameterization,” Advances in Neural Information Processing Systems, vol. 37, pp. 24 822–24 848, 2024
[75] G. Yu, J. S. Jeong, G. Kim, S. Kim, and B. Chun, “Orca: A distributed serving system for transformer-based generative models,” in 16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2022, Carlsbad, CA, USA, July 11-13, 2022, M. K. Aguilera and H. Weatherspoon, Eds. USENIX Association, 2022, pp. 521–. [Online]. Available: https://www.usenix.org/conference/osdi22/presentation/yu
[77] A. H. Zadeh, I. Edo, O. M. Awad, and A. Moshovos, “Gobo: Quantizing attention-based nlp models for low latency and energy efficient inference,” in 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2020, pp. 811–824
[78] T. Zhang, J. Yi, Z. Xu, and A. Shrivastava, “Kv cache is 1 bit per channel: Efficient large language model inference with coupled quantization,” 2024. URL https://arxiv.org/abs/2405.03917

APPENDIX

A. Abstract

Our artifact contains (1) a hardware simulator that reproduces all hardware evaluation results (Figures 8–14 and Tables III, VIII–IX) from the EV...

How to access: The source code is publicly available at: https://github.com/dbw6/Eva.git. The artifact sources are also archived at Zenodo: https://doi.org/10.5281/zenodo.19433707. Pretrained weights for all evaluated models (LLaMA-2-7B, LLaMA-2-13B, Mixtral-8x7B, and Qwen3-30B-A3B) and datasets are available at: https://huggingface.co/collections/dbw6/eva

Hardware dependencies: Hardware Simulator: Any x86-64 machine with at least 16 GB of RAM and 10 GB of free disk space. No GPU is required; all simulations run on the CPU. Internet access is required for the first run to download Hugging Face models and datasets. Algorithm Evaluation: An NVIDIA GPU with at least 24 GB VRAM (A100-80GB recommended), CUDA 12.x, ...



[45] [45]

LUT tensor core: Lookup table enables efficient low-bit LLM inference acceleration,

Z. Mo, L. Wang, J. Wei, Z. Zeng, S. Cao, L. Ma, N. Jing, T. Cao, J. Xue, F. Yang, and M. Yang, “LUT tensor core: Lookup table enables efficient low-bit LLM inference acceleration,”CoRR, vol. abs/2408.06003, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2408.06003

work page doi:10.48550/arxiv.2408.06003 2024

[46] [46]

Genai at the edge: Comprehensive survey on empowering edge devices,

M. Navardi, R. Aalishah, Y . Fu, Y . Lin, H. Li, Y . Chen, and T. Mohs- enin, “Genai at the edge: Comprehensive survey on empowering edge devices,” inProceedings of the AAAI Symposium Series, vol. 5, no. 1, 2025, pp. 180–187

2025

[47] [47]

Gpt-4 technical report,

OpenAI, “Gpt-4 technical report,” https://cdn.openai.com/papers/gpt-4. pdf, 2023, accessed: 2025-11-16

2023

[48] [48]

Codegemm: A codebook-centric approach to efficient gemm in quantized llms,

G. Park, J. Bae, B. Kim, J. Ryu, H. Kim, S. J. Kwon, D. Lee et al., “Codegemm: A codebook-centric approach to efficient gemm in quantized llms,”arXiv preprint arXiv:2512.17970, 2025

work page arXiv 2025

[49] [49]

Figlut: An energy-efficient accelerator design for fp-int gemm using look-up tables,

G. Park, H. Kwon, J. Kim, J. Bae, B. Park, D. Lee, and Y . Lee, “Figlut: An energy-efficient accelerator design for fp-int gemm using look-up tables,” in2025 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2025, pp. 1098–1111

2025

[50] [50]

Splitwise: Efficient generative LLM inference using phase splitting,

P. Patel, E. Choukse, C. Zhang, A. Shah, ´I. Goiri, S. Maleki, and R. Bianchini, “Splitwise: Efficient generative LLM inference using phase splitting,” in51st ACM/IEEE Annual International Symposium on Computer Architecture, ISCA 2024, Buenos Aires, Argentina, June 29 - July 3, 2024. IEEE, 2024, pp. 118–132. [Online]. Available: https://doi.org/10.1109/IS...

work page doi:10.1109/isca59077.2024.00019 2024

[51] [51]

Language models are unsupervised multitask learners,

A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsupervised multitask learners,” 2019

2019

[52] [52]

Choice of plausible alternatives: An evaluation of commonsense causal reasoning

M. Roemmele, C. A. Bejan, and A. S. Gordon, “Choice of plausible alternatives: An evaluation of commonsense causal reasoning.” inAAAI spring symposium: logical formalizations of commonsense reasoning, 2011, pp. 90–95

2011

[53] [53]

Winogrande: An adversarial winograd schema challenge at scale,

K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y . Choi, “Winogrande: An adversarial winograd schema challenge at scale,”Communications of the ACM, vol. 64, no. 9, pp. 99–106, 2021

2021

[54] [54]

Platinum: Path-adaptable lut-based accelerator tailored for low-bit weight matrix multiplication,

H. Shan, C. Guo, C. Wei, F. Cheng, J. Zhang, H. H. Li, and Y . Chen, “Platinum: Path-adaptable lut-based accelerator tailored for low-bit weight matrix multiplication,” in2026 31st Asia and South Pacific Design Automation Conference (ASP-DAC). IEEE, 2026, pp. 1449– 1455

2026

[55] [55]

The Llama 3 Herd of Models

L. Team, “The llama 3 herd of models,”CoRR, vol. abs/2407.21783,

work page internal anchor Pith review Pith/arXiv arXiv

[56] [56]

The Llama 3 Herd of Models

[Online]. Available: https://doi.org/10.48550/arXiv.2407.21783

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2407.21783

[57] [57]

Thakur, B

S. Thakur, B. Ahmad, H. Pearce, B. Tan, B. Dolan-Gavitt, R. Karri, and S. Garg, “Verigen: A large language model for verilog code generation,”ACM Trans. Des. Autom. Electron. Syst., vol. 29, no. 3, Apr. 2024. [Online]. Available: https://doi.org/10.1145/3643681

work page doi:10.1145/3643681 2024

[58] [58]

Llama 2: Open Foundation and Fine-Tuned Chat Models

H. Touvron and et al., “Llama 2: Open foundation and fine-tuned chat models,”CoRR, vol. abs/2307.09288, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2307.09288

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2307.09288 2023

[59] [59]

LLaMA: Open and Efficient Foundation Language Models

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozi `ere, N. Goyal, E. Hambro, F. Azharet al., “Llama: Open and efficient foundation language models,”arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[60] [60]

Quip#: Even better llm quantization with hadamard incoherence and lattice codebooks,

A. Tseng, J. Chee, Q. Sun, V . Kuleshov, and C. De Sa, “Quip#: Even better llm quantization with hadamard incoherence and lattice codebooks,”arXiv preprint arXiv:2402.04396, 2024

work page arXiv 2024

[61] [61]

Gptvq: The blessing of dimensionality for llm quantization,

M. Van Baalen, A. Kuzmin, I. Koryakovskiy, M. Nagel, P. Couperus, C. Bastoul, E. Mahurin, T. Blankevoort, and P. Whatmough, “Gptvq: The blessing of dimensionality for llm quantization,”arXiv preprint arXiv:2402.15319, 2024

work page arXiv 2024

[62] [62]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” inAdvances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S...

2017

[63] [63]

BitNet: Scaling 1-bit Transformers for Large Language Models

H. Wang, S. Ma, L. Dong, S. Huang, H. Wang, L. Ma, F. Yang, R. Wang, Y . Wu, and F. Wei, “Bitnet: Scaling 1-bit transformers for large language models,”arXiv preprint arXiv:2310.11453, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[64] [64]

Dobi-svd: Differentiable svd for llm compression and some new perspectives,

Q. Wang*, J. Ke*, M. Tomizuka, K. Keutzer, and C. Xu, “Dobi-svd: Differentiable svd for llm compression and some new perspectives,” in The Thirteenth International Conference on Learning Representations, 2025

2025

[65] [65]

Angles don’t lie: Unlocking training-efficient rl through the model’s own signals,

Q. Wang, J. Ke, H. Ye, Y . Lin, Y . Fu, J. Zhang, K. Keutzer, C. Xu, and Y . Chen, “Angles don’t lie: Unlocking training-efficient rl through the model’s own signals,”arXiv preprint arXiv:2506.02281, 2025

work page arXiv 2025

[66] [66]

Phi: Leveraging pattern-based hierarchical sparsity for high-efficiency spik- ing neural networks,

C. Wei, B. Duan, C. Guo, J. Zhang, Q. Song, H. Li, and Y . Chen, “Phi: Leveraging pattern-based hierarchical sparsity for high-efficiency spik- ing neural networks,” inProceedings of the 52nd Annual International Symposium on Computer Architecture, 2025, pp. 930–943

2025

[67] [67]

Prosperity: Accelerating spiking neural networks via product sparsity,

C. Wei, C. Guo, F. Cheng, S. Li, H. F. Yang, H. H. Li, and Y . Chen, “Prosperity: Accelerating spiking neural networks via product sparsity,” in2025 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2025, pp. 806–820

2025

[68] [68]

Focus: A streaming concentration architecture for efficient vision-language models,

C. Wei, C. Guo, J. Zhang, H. Shan, Y . Xu, Z. Zhang, Y . Liu, Q. Wang, C. Zhou, H. H. Liet al., “Focus: A streaming concentration architecture for efficient vision-language models,” in2026 IEEE International Sym- posium on High Performance Computer Architecture (HPCA). IEEE, 2026, pp. 1–18

2026

[69] G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, and S. Han, “Smoothquant: Accurate and efficient post-training quantization for large language models,” in International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, ser. Proceedings of Machine Learning Research, A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, a... PMLR, 2023, pp. 38 087–38 099. [Online]. Available: https://proceedings.mlr.press/v202/xiao23c.html

[71] C. Xu, Y. Wu, X. Yang, B. Chen, M. Lentz, D. Zhuo, and L. W. Wills, “Llm.265: Video codecs are secretly tensor codecs,” in Proceedings of the 58th IEEE/ACM International Symposium on Microarchitecture®, 2025, pp. 445–460.

[72] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv et al., “Qwen3 technical report,” arXiv preprint arXiv:2505.09388, 2025.

[73] Z. Yao, R. Yazdani Aminabadi, M. Zhang, X. Wu, C. Li, and Y. He, “Zeroquant: Efficient and affordable post-training quantization for large-scale transformers,” Advances in Neural Information Processing Systems, vol. 35, pp. 27 168–27 183, 2022.

[74] H. You, Y. Guo, Y. Fu, W. Zhou, H. Shi, X. Zhang, S. Kundu, A. Yazdanbakhsh, and Y. C. Lin, “Shiftaddllm: Accelerating pretrained llms via post-training multiplication-less reparameterization,” Advances in Neural Information Processing Systems, vol. 37, pp. 24 822–24 848, 2024.

[75] G. Yu, J. S. Jeong, G. Kim, S. Kim, and B. Chun, “Orca: A distributed serving system for transformer-based generative models,” in 16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2022, Carlsbad, CA, USA, July 11-13, 2022, M. K. Aguilera and H. Weatherspoon, Eds. USENIX Association, 2022, pp. 521– [Online]. Available: https://www.usenix.org/conference/osdi22/presentation/yu

[77] A. H. Zadeh, I. Edo, O. M. Awad, and A. Moshovos, “Gobo: Quantizing attention-based nlp models for low latency and energy efficient inference,” in 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2020, pp. 811–824.

[78] T. Zhang, J. Yi, Z. Xu, and A. Shrivastava, “Kv cache is 1 bit per channel: Efficient large language model inference with coupled quantization, 2024b,” URL https://arxiv.org/abs/2405.03917.

APPENDIX

A. Abstract

Our artifact contains (1) a hardware simulator that reproduces all hardware evaluation results (Figures 8–14 and Tables III, VIII–IX) from the EV...

How to access: The source code is publicly available at: https://github.com/dbw6/Eva.git. The artifact sources are also archived at Zenodo: https://doi.org/10.5281/zenodo.19433707. Pretrained weights for all evaluated models (LLaMA-2-7B, LLaMA-2-13B, Mixtral-8x7B, and Qwen3-30B-A3B) and datasets are available at: https://huggingface.co/collections/dbw6/eva

Hardware dependencies: Hardware Simulator: Any x86-64 machine with at least 16 GB of RAM and 10 GB of free disk space. No GPU is required; all simulations run on the CPU. Internet access is required for the first run to download Hugging Face models and datasets. Algorithm Evaluation: An NVIDIA GPU with at least 24 GB VRAM (A100-80GB recommended), CUDA 12.x, ...