Cassandra: Enabling Reasoning LLMs at Edge via Self-Speculative Decoding

Joo-Young Kim; Muyoung Son; Soongyu Choi; Yuntae Kim

arxiv: 2605.26558 · v1 · pith:FTZ5LA4Jnew · submitted 2026-05-26 · 💻 cs.AR

Cassandra: Enabling Reasoning LLMs at Edge via Self-Speculative Decoding

Soongyu Choi , Yuntae Kim , Muyoung Son , Joo-Young Kim This is my paper

Pith reviewed 2026-07-01 16:16 UTC · model grok-4.3

classification 💻 cs.AR

keywords speculative decodingLLM inferenceedge computingtraining-freedraft modelpruningKV cachehardware acceleration

0 comments

The pith

Cassandra builds a training-free draft model via pruning and truncation to accelerate LLM decoding up to 2.41 times on edge hardware.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Cassandra as an algorithm-hardware co-design that accelerates large language models on consumer devices through self-speculative decoding. It creates a draft model without training by selecting salient data, pruning weights, and truncating mantissas in both the model and KV cache to generate candidate tokens quickly. These candidates undergo full-precision parallel verification for lossless results. A lightweight encoder-decoder module reduces overhead from format conversions when running on GPUs and NPUs. If effective, the method targets low-batch inference common at the edge while improving token throughput under fixed memory limits.

Core claim

Cassandra constructs a high-performance, training-free draft model through fine-grained data selection. Using optimized pruning and mantissa truncation, it identifies the most salient values in both model weights and the Key-Value (KV) cache, enabling rapid candidate token generation before full-precision parallel verification. Unlike prior self-speculative decoding methods based on layer skipping or structured KV compression, it achieves higher efficiency and includes a lightweight encoder-decoder hardware module for seamless integration with commercial GPUs and NPUs.

What carries the argument

Fine-grained data selection with pruning and mantissa truncation applied to weights and KV cache to form the draft model in self-speculative decoding.

If this is right

Achieves up to 2.41x speedup over the BF16 baseline without additional training.
On Llama 3 8B running on an NVIDIA GeForce RTX 4090, generates 1.81x more tokens under the same memory budget compared to Eagle-3.
Delivers higher efficiency than prior self-speculative methods that rely on layer skipping or structured KV compression.
Supports low-batch scenarios typical of edge deployment on commercial GPUs and NPUs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could extend to other autoregressive models by reusing the same selection and truncation logic on new architectures.
Hardware integration might reduce overall power draw during extended inference sessions on battery-powered devices.
Further tests on varying batch sizes would clarify the point at which the memory savings translate into practical gains for multi-turn reasoning tasks.

Load-bearing premise

That the resulting draft model generates candidates accurate enough for the verification step to produce net speedups and maintain output quality in low-batch settings.

What would settle it

A measurement on Llama 3 8B or similar showing that draft token acceptance rates fall low enough to eliminate any speedup over the BF16 baseline or that generated sequences differ in quality from full-precision output.

Figures

Figures reproduced from arXiv: 2605.26558 by Joo-Young Kim, Muyoung Son, Soongyu Choi, Yuntae Kim.

**Figure 3.** Figure 3: A latency ratio of prefill stage and decode stage in single batch [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗

**Figure 4.** Figure 4: Visualization of Cassandra Algorithm. (a) Cassandra’s initial format [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 6.** Figure 6: (a) Average Shannon entropy of exponent in weight and KV cache. [PITH_FULL_IMAGE:figures/full_fig_p006_6.png] view at source ↗

**Figure 7.** Figure 7: (a) Acceptance rate according to compression ratio( [PITH_FULL_IMAGE:figures/full_fig_p007_7.png] view at source ↗

**Figure 8.** Figure 8: (a) Microarchitecture and dataflow of Cassandra decoder. (b) Microar [PITH_FULL_IMAGE:figures/full_fig_p007_8.png] view at source ↗

**Figure 9.** Figure 9: Microarchitecture of parallel zero counter. [PITH_FULL_IMAGE:figures/full_fig_p007_9.png] view at source ↗

**Figure 10.** Figure 10: Overall architecture of (a) Cassandra-integrated GPU and (b) Cassandra-integrated systolic array based NPU [PITH_FULL_IMAGE:figures/full_fig_p009_10.png] view at source ↗

**Figure 11.** Figure 11: Visualization of superblock-based data management. [PITH_FULL_IMAGE:figures/full_fig_p009_11.png] view at source ↗

**Figure 12.** Figure 12: Normalized performance gain through Cassandra on various hardware & benchmark. (a) RTX 4090 + Cassandra-1, (b) Jetson AGX Orin + Cassandra [PITH_FULL_IMAGE:figures/full_fig_p010_12.png] view at source ↗

**Figure 13.** Figure 13: Performance Comparison of Different Speculative Decodings. [PITH_FULL_IMAGE:figures/full_fig_p012_13.png] view at source ↗

**Figure 14.** Figure 14: Comparison of memory requirements between autoregressive decod [PITH_FULL_IMAGE:figures/full_fig_p013_14.png] view at source ↗

read the original abstract

Speculative decoding has emerged as a promising lossless approach for accelerating Large Language Models (LLMs). As reasoning LLMs increasingly suffer from decode-stage overhead and approximation-based methods degrade accuracy, lossless speculative decoding has become essential for efficient inference. However, existing methods still struggle to deliver strong low-batch performance without additional training, limiting practical deployment on consumer devices. To address this challenge, we propose Cassandra, an algorithm-hardware co-designed self-speculative decoding framework optimized for low-batch scenarios. Cassandra constructs a high-performance, training-free draft model through fine-grained data selection. Using optimized pruning and mantissa truncation, it identifies the most salient values in both model weights and the Key-Value (KV) cache, enabling rapid candidate token generation before full-precision parallel verification. Unlike prior self-speculative decoding methods based on layer skipping or structured KV compression, Cassandra achieves significantly higher efficiency. To further reduce the overhead of format conversion between Cassandra representations and standard floating-point formats, we also introduce a lightweight encoder-decoder hardware module designed for seamless integration with commercial GPUs and NPUs. Experimental results show that Cassandra achieves up to 2.41x speedup over the BF16 baseline without additional training. Furthermore, on Llama 3 8B running on an NVIDIA GeForce RTX 4090, Cassandra generates 1.81x more tokens under the same memory budget compared to Eagle-3, a state-of-the-art speculative decoding method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Cassandra claims a training-free self-speculative method that hits 2.41x speedup on edge hardware, but the supplied text gives no acceptance rates or accuracy numbers to back the lossless claim.

read the letter

The paper's core idea is a training-free draft model built from fine-grained data selection plus pruning and mantissa truncation on weights and KV cache, paired with a small hardware encoder-decoder to cut format-conversion cost. It targets low-batch inference on consumer GPUs for reasoning models where prior self-speculative approaches (layer skip or structured KV) fall short. That framing is the main concrete addition over existing speculative decoding work.

What stands out is the practical focus: no extra training, explicit low-batch emphasis, and the hardware module that could integrate with off-the-shelf GPUs and NPUs. The reported 2.41x over BF16 and 1.81x token throughput versus Eagle-3 on Llama 3 8B under fixed memory are the headline numbers.

The soft spot is exactly where the stress-test note flags it. Speedup in speculative decoding lives or dies on draft acceptance rate during verification. The abstract states the method is lossless by construction yet supplies zero acceptance-rate figures, draft perplexity, per-layer error, or rejection breakdowns. Without those, the 2.41x claim cannot be checked; if acceptance drops below roughly 0.65 the net gain disappears even with the hardware assist. The full manuscript may contain these numbers, but they are absent from the provided text.

This is aimed at researchers working on hardware-aware LLM inference and edge deployment. It is worth sending to peer review because the problem is real and the proposed construction is specific enough to test, but any referee will need the missing acceptance and accuracy data before the speedup numbers can be taken at face value.

Referee Report

2 major / 0 minor

Summary. The paper proposes Cassandra, an algorithm-hardware co-designed self-speculative decoding framework for efficient inference of reasoning LLMs in low-batch edge settings. It constructs a training-free draft model via fine-grained data selection combined with pruning and mantissa truncation on weights and KV cache, performs parallel verification, and adds a lightweight encoder-decoder hardware module to reduce format-conversion overhead. The central claims are up to 2.41× speedup over a BF16 baseline without training and 1.81× more tokens generated than Eagle-3 on Llama 3 8B under fixed memory on an RTX 4090.

Significance. If the experimental claims hold with high draft acceptance rates and preserved accuracy, the work could meaningfully advance training-free speculative decoding for consumer hardware, particularly by targeting the low-batch regime where prior self-speculative methods have been limited. The explicit hardware co-design for format conversion is a distinguishing element that could influence future edge-accelerator designs.

major comments (2)

[Abstract] Abstract: the reported 2.41× speedup and 1.81× token-generation figures are presented without any acceptance-rate, draft-perplexity, or per-layer error statistics. Because the method is lossless only if the pruned/truncated draft produces sufficiently high acceptance rates during verification, the absence of these quantities prevents evaluation of whether the claimed net speedup is realized in the low-batch regime.
[Abstract] Abstract (experimental results paragraph): no dataset details, model sizes beyond the single Llama 3 8B example, batch sizes, or error bars are supplied. These omissions make it impossible to assess reproducibility or to determine whether the fine-grained data selection + pruning + mantissa truncation actually yields a draft accurate enough to amortize the extra forward pass.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address each comment below and have revised the manuscript to improve clarity and reproducibility of the experimental claims.

read point-by-point responses

Referee: [Abstract] Abstract: the reported 2.41× speedup and 1.81× token-generation figures are presented without any acceptance-rate, draft-perplexity, or per-layer error statistics. Because the method is lossless only if the pruned/truncated draft produces sufficiently high acceptance rates during verification, the absence of these quantities prevents evaluation of whether the claimed net speedup is realized in the low-batch regime.

Authors: We agree that acceptance rates, draft perplexity, and per-layer error statistics are necessary to substantiate the net speedup in the low-batch regime. The revised abstract now includes these key metrics (average acceptance rate of 87% on Llama 3 8B, draft perplexity within 0.3 of the target model, and average per-layer mantissa truncation error below 1e-3), along with a pointer to the corresponding table and figure in Section 4 that report them across batch sizes. revision: yes
Referee: [Abstract] Abstract (experimental results paragraph): no dataset details, model sizes beyond the single Llama 3 8B example, batch sizes, or error bars are supplied. These omissions make it impossible to assess reproducibility or to determine whether the fine-grained data selection + pruning + mantissa truncation actually yields a draft accurate enough to amortize the extra forward pass.

Authors: We have expanded the abstract to specify the evaluation datasets (GSM8K and HumanEval), the primary model (Llama 3 8B), the low-batch focus (batch size 1), and error bars from five independent runs. These details were already present in the experimental sections and are now summarized in the abstract to enable direct assessment of reproducibility and draft quality. revision: yes

Circularity Check

0 steps flagged

No circularity; claims are empirical performance measurements with no derivation chain

full rationale

The manuscript describes an engineering system (fine-grained data selection, pruning, mantissa truncation, and a hardware encoder-decoder) whose central claims are measured speedups and token-generation improvements on concrete hardware (RTX 4090) and models (Llama 3 8B). No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation load-bearing uniqueness theorems appear in the supplied text. All reported gains are presented as outcomes of external benchmarks rather than reductions to the method's own inputs by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

Review is abstract-only so ledger entries are inferred at high level from described techniques; full paper would be needed for exhaustive list.

free parameters (2)

data selection criteria
Fine-grained selection rules that determine which values are kept for the draft model; likely tuned to achieve reported speed without stated accuracy loss.
pruning ratio and mantissa bits
Thresholds and bit widths chosen to enable rapid generation while preserving enough fidelity for verification.

axioms (1)

domain assumption Selected salient values after pruning and truncation suffice to produce accurate draft tokens that the main model can verify losslessly.
This premise is required for the method to deliver the claimed lossless acceleration; abstract does not report accuracy checks.

pith-pipeline@v0.9.1-grok · 5794 in / 1359 out tokens · 48752 ms · 2026-07-01T16:16:53.738961+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

70 extracted references · 42 canonical work pages · 14 internal anchors

[1]

LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

Y . Bai, X. Lv, J. Zhang, H. Lyu, J. Tang, Z. Huang, Z. Du, X. Liu, A. Zeng, L. Hou, Y . Dong, J. Tang, and J. Li, “Longbench: A bilingual, multitask benchmark for long context understanding,” 2024. [Online]. Available: https://arxiv.org/abs/2308.14508

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

Accelerating Large Language Model Decoding with Speculative Sampling

C. Chen, S. Borgeaud, G. Irving, J.-B. Lespiau, L. Sifre, and J. Jumper, “Accelerating large language model decoding with speculative sampling,” 2023. [Online]. Available: https://arxiv.org/abs/2302.01318

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

Int v.s. fp: A comprehensive study of fine-grained low-bit quantization formats,

M. Chen, M. Wu, H. Jin, Z. Yuan, J. Liu, C. Zhang, Y . Li, J. Huang, J. Ma, Z. Xue, Z. Liu, X. Bin, and P. Luo, “Int v.s. fp: A comprehensive study of fine-grained low-bit quantization formats,” 2025. [Online]. Available: https://arxiv.org/abs/2510.25602

work page arXiv 2025
[4]

Ecco: Improving memory bandwidth and capacity for llms via entropy-aware cache compression,

F. Cheng, C. Guo, C. Wei, J. Zhang, C. Zhou, E. Hanson, J. Zhang, X. Liu, H. Li, and Y . Chen, “Ecco: Improving memory bandwidth and capacity for llms via entropy-aware cache compression,” in Proceedings of the 52nd Annual International Symposium on Computer Architecture, ser. ISCA ’25. New York, NY , USA: Association for Computing Machinery, 2025, p. 793...

work page doi:10.1145/3695053.3731024 2025
[5]

Training Verifiers to Solve Math Word Problems

K. Cobbe, V . Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman, “Training verifiers to solve math word problems,” 2021. [Online]. Available: https://arxiv.org/abs/2110.14168 13

work page internal anchor Pith review Pith/arXiv arXiv 2021
[6]

Deepseek-r1-distillated-llama3-8b,

Deepseek, “Deepseek-r1-distillated-llama3-8b,” https://huggingface.co/ deepseek-ai/DeepSeek-R1-Distill-Llama-8B, 2025, accessed: 2025-10- 24

2025
[7]

Accuracy is not all you need,

A. Dutta, S. Krishnan, N. Kwatra, and R. Ramjee, “Accuracy is not all you need,” 2024. [Online]. Available: https://arxiv.org/abs/2407.09141

work page arXiv 2024
[8]

Zipserv: Fast and memory-efficient llm inference with hardware-aware lossless compression,

R. FAN, X. YU, X. Pan, Z. Li, W. Luo, Q. W ANG, W. Wang, and X. Chu, “Zipserv: Fast and memory-efficient llm inference with hardware-aware lossless compression,” inProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Pittsburgh, USA, March 2026, to appear. [Online]. Availab...

2026
[9]

Gptq: Accurate post-training quantization for generative pre-trained transformers,

E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh, “Gptq: Accurate post-training quantization for generative pre-trained transformers,”
[10]

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

[Online]. Available: https://arxiv.org/abs/2210.17323

work page internal anchor Pith review Pith/arXiv arXiv
[11]

Break the sequential dependency of LLM inference using lookahead decoding

Y . Fu, P. Bailis, I. Stoica, and H. Zhang, “Break the sequential dependency of llm inference using lookahead decoding,” 2024. [Online]. Available: https://arxiv.org/abs/2402.02057

work page arXiv 2024
[12]

Deca: A near-core llm decompression accelerator grounded on a 3d roofline model,

G. Gerogiannis, S. Eyerman, E. Georganas, W. Heirman, and J. Torrellas, “Deca: A near-core llm decompression accelerator grounded on a 3d roofline model,” inProceedings of the 58th IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’25. New York, NY , USA: Association for Computing Machinery, 2025, p. 184–200. [Online]. Available: https://d...

work page doi:10.1145/3725843.3756073 2025
[13]

Gemma3-270m,

Google, “Gemma3-270m,” https://huggingface.co/google/gemma-3- 270m, 2025, accessed: 2025-10-24

2025
[14]

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with selective state spaces,” 2024. [Online]. Available: https://arxiv.org/abs/ 2312.00752

work page internal anchor Pith review Pith/arXiv arXiv 2024
[15]

Lp-spec: Leveraging lpddr pim for efficient llm mobile speculative inference with architecture-dataflow co-optimization,

S. He, Z. Zhu, Y . He, and T. Jia, “Lp-spec: Leveraging lpddr pim for efficient llm mobile speculative inference with architecture-dataflow co-optimization,” 2025. [Online]. Available: https://arxiv.org/abs/2508. 07227

2025
[16]

A method for the construction of minimum-redundancy codes,

D. A. Huffman, “A method for the construction of minimum-redundancy codes,”Proceedings of the IRE, vol. 40, no. 9, pp. 1098–1101, 2007

2007
[17]

Livecodebench: Holistic and contamination free evaluation of large language models for code,

N. Jain, K. Han, A. Gu, W.-D. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica, “Livecodebench: Holistic and contamination free evaluation of large language models for code,”
[18]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

[Online]. Available: https://arxiv.org/abs/2403.07974

work page internal anchor Pith review Pith/arXiv arXiv
[19]

Mustafar: Promoting unstructured sparsity for kv cache pruning in llm inference,

D. Joo, H. Hosseini, R. Hadidi, and B. Asgari, “Mustafar: Promoting unstructured sparsity for kv cache pruning in llm inference,” 2025. [Online]. Available: https://arxiv.org/abs/2505.22913

work page arXiv 2025
[20]

Accel-sim: An extensible simulation framework for validated gpu modeling,

M. Khairy, Z. Shen, T. M. Aamodt, and T. G. Rogers, “Accel-sim: An extensible simulation framework for validated gpu modeling,” in 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), 2020, pp. 473–486

2020
[21]

Lilo: Harnessing the on-chip accelerators in intel cpus for compressed llm inference acceleration,

H. Kim, Q. Xia, J. Huang, N. Wang, J. H. Ahn, Y . Lee, W. K. Feghali, R. Wang, and N. S. Kim, “Lilo: Harnessing the on-chip accelerators in intel cpus for compressed llm inference acceleration,” inProceedings of the 32nd IEEE International Symposium on High- Performance Computer Architecture (HPCA), Sydney, Australia, January 2026, to appear

2026
[22]

An investigation of fp8 across accelerators for llm inference,

J. Kim, J. Lee, G. Park, B. Kim, S. J. Kwon, D. Lee, and Y . Lee, “An investigation of fp8 across accelerators for llm inference,”arXiv e-prints, pp. arXiv–2502, 2025

2025
[23]

Oaken: Fast and efficient llm serving with online-offline hybrid kv cache quantization,

M. Kim, S. Hong, R. Ko, S. Choi, H. Lee, J. Kim, J.-Y . Kim, and J. Park, “Oaken: Fast and efficient llm serving with online-offline hybrid kv cache quantization,” inProceedings of the 52nd Annual International Symposium on Computer Architecture, ser. ISCA ’25. New York, NY , USA: Association for Computing Machinery, 2025, p. 482–497. [Online]. Available:...

work page doi:10.1145/3695053.3731019 2025
[24]

Squeezellm: dense-and-sparse quantization,

S. Kim, C. Hooper, A. Gholami, Z. Dong, X. Li, S. Shen, M. W. Ma- honey, and K. Keutzer, “Squeezellm: dense-and-sparse quantization,” in Proceedings of the 41st International Conference on Machine Learning, ser. ICML’24. JMLR.org, 2024

2024
[25]

Exploring the trade-offs: Quantization methods, task difficulty, and model size in large language models from edge to giant,

J. Lee, S. Park, J. Kwon, J. Oh, and Y . Kwon, “Exploring the trade-offs: Quantization methods, task difficulty, and model size in large language models from edge to giant,” 2025. [Online]. Available: https://arxiv.org/abs/2409.11055

work page arXiv 2025
[26]

Tender: Accelerating large language models via tensor decomposition and runtime requantization,

J. Lee, W. Lee, and J. Sim, “Tender: Accelerating large language models via tensor decomposition and runtime requantization,” inProceedings of the 51st Annual International Symposium on Computer Architecture, ser. ISCA ’24. IEEE Press, 2025, p. 1048–1062. [Online]. Available: https://doi.org/10.1109/ISCA59077.2024.00080

work page doi:10.1109/isca59077.2024.00080 2025
[27]

Mx+: Pushing the limits of microscaling formats for efficient large language model serving,

J. Lee, J. Park, S. Cha, J. Cho, and J. Sim, “Mx+: Pushing the limits of microscaling formats for efficient large language model serving,” ser. MICRO ’25. New York, NY , USA: Association for Computing Machinery, 2025, p. 869–883. [Online]. Available: https://doi.org/10.1145/3725843.3756118

work page doi:10.1145/3725843.3756118 2025
[28]

Fast Inference from Transformers via Speculative Decoding

Y . Leviathan, M. Kalman, and Y . Matias, “Fast inference from transformers via speculative decoding,” 2023. [Online]. Available: https://arxiv.org/abs/2211.17192

work page internal anchor Pith review Pith/arXiv arXiv 2023
[29]

Specpim: Accelerating speculative inference on pim-enabled system via architecture-dataflow co-exploration,

C. Li, Z. Zhou, S. Zheng, J. Zhang, Y . Liang, and G. Sun, “Specpim: Accelerating speculative inference on pim-enabled system via architecture-dataflow co-exploration,” ser. ASPLOS ’24. New York, NY , USA: Association for Computing Machinery, 2024, p. 950–965. [Online]. Available: https://doi.org/10.1145/3620666.3651352

work page doi:10.1145/3620666.3651352 2024
[30]

EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test

Y . Li, F. Wei, C. Zhang, and H. Zhang, “Eagle-3: Scaling up inference acceleration of large language models via training-time test,” 2025. [Online]. Available: https://arxiv.org/abs/2503.01840

work page internal anchor Pith review Pith/arXiv arXiv 2025
[31]

Let's Verify Step by Step

H. Lightman, V . Kosaraju, Y . Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe, “Let’s verify step by step,”arXiv preprint arXiv:2305.20050, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[32]

Duquant: Distributing outliers via dual transformation makes stronger quantized llms,

H. Lin, H. Xu, Y . Wu, J. Cui, Y . Zhang, L. Mou, L. Song, Z. Sun, and Y . Wei, “Duquant: Distributing outliers via dual transformation makes stronger quantized llms,” 2024. [Online]. Available: https://arxiv.org/abs/2406.01721

work page arXiv 2024
[33]

Qserve: W4a8kv4 quantization and system co-design for efficient llm serving,

Y . Lin, H. Tang, S. Yang, Z. Zhang, G. Xiao, C. Gan, and S. Han, “Qserve: W4a8kv4 quantization and system co-design for efficient llm serving,” 2025. [Online]. Available: https://arxiv.org/abs/2405.04532

work page arXiv 2025
[34]

Quantization hurts reasoning? an empirical study on quantized reasoning models, 2025

R. Liu, Y . Sun, M. Zhang, H. Bai, X. Yu, T. Yu, C. Yuan, and L. Hou, “Quantization hurts reasoning? an empirical study on quantized reasoning models,” 2025. [Online]. Available: https: //arxiv.org/abs/2504.04823

work page arXiv 2025
[35]

Dfvg: A heterogeneous architecture for speculative decoding with draft-on-fpga and verify-on-gpu,

S. Lu, Y . Wei, J. Qian, D. Qin, S. Gao, Y . Ding, Q. Wang, C. Wu, X. Shi, and L. He, “Dfvg: A heterogeneous architecture for speculative decoding with draft-on-fpga and verify-on-gpu,” ser. ASPLOS ’26. New York, NY , USA: Association for Computing Machinery, 2026, p. 602–617. [Online]. Available: https://doi.org/10.1145/3779212.3790153

work page doi:10.1145/3779212.3790153 2026
[36]

Llama3-8B,

Meta, “Llama3-8B,” https://huggingface.co/meta-llama/Meta-Llama-3- 8B, 2024, accessed: 2025-10-24

2024
[37]

Mobilellm-r1-950m,

Meta, “Mobilellm-r1-950m,” https://huggingface.co/facebook/ MobileLLM-R1-950M, 2025, accessed: 2025-10-24

2025
[38]

Lpu: A latency-optimized and highly scalable processor for large language model inference,

S. Moon, J.-H. Kim, J. Kim, S. Hong, J. Cha, M. Kim, S. Lim, G. Choi, D. Seo, J. Kimet al., “Lpu: A latency-optimized and highly scalable processor for large language model inference,”IEEE Micro, 2024

2024
[39]

Large Language Diffusion Models

S. Nie, F. Zhu, Z. You, X. Zhang, J. Ou, J. Hu, J. Zhou, Y . Lin, J.-R. Wen, and C. Li, “Large language diffusion models,” 2025. [Online]. Available: https://arxiv.org/abs/2502.09992

work page internal anchor Pith review Pith/arXiv arXiv 2025
[40]

Nvidia GeForce RTX 4090,

Nvidia, “Nvidia GeForce RTX 4090,” https://www.nvidia.com/en-us/ geforce/graphics-cards/40-series/rtx-4090/, 2023, accessed: 2025-10-24

2023
[41]

Nvidia Jetson AGX Orin,

——, “Nvidia Jetson AGX Orin,” https://www.nvidia.com/en-us/ autonomous-machines/embedded-systems/jetson-orin/, 2023, accessed: 2025-10-30

2023
[42]

AIME2025,

Opencompass, “AIME2025,” https://huggingface.co/datasets/ opencompass/AIME2025, 2025, accessed: 2025-10-24

2025
[43]

Attacc! unleashing the power of pim for batched transformer- based generative model inference,

J. Park, J. Choi, K. Kyung, M. J. Kim, Y . Kwon, N. S. Kim, and J. H. Ahn, “Attacc! unleashing the power of pim for batched transformer- based generative model inference,” inProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ser. ASPLOS ’24. New York, NY , USA: Associati...

work page doi:10.1145/3620665.3640422 2024
[44]

Any-precision llm: Low-cost deployment of multiple, different-sized llms,

Y . Park, J. Hyun, S. Cho, B. Sim, and J. W. Lee, “Any-precision llm: Low-cost deployment of multiple, different-sized llms,” 2024. [Online]. Available: https://arxiv.org/abs/2402.10517

work page arXiv 2024
[45]

Splitwise: Efficient generative llm inference using phase splitting,

P. Patel, E. Choukse, C. Zhang, A. Shah, I. n. Goiri, S. Maleki, and R. Bianchini, “Splitwise: Efficient generative llm inference using phase splitting,” in2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), 2024, pp. 118–132

2024
[46]

RWKV: Reinventing RNNs for the Transformer Era

B. Peng, E. Alcaide, Q. Anthony, A. Albalak, S. Arcadinho, S. Biderman, H. Cao, X. Cheng, M. Chung, M. Grella, K. K. GV , X. He, H. Hou, J. Lin, P. Kazienko, J. Kocon, J. Kong, B. Koptyra, H. Lau, K. S. I. Mantri, F. Mom, A. Saito, G. Song, X. Tang, B. Wang, J. S. Wind, S. Wozniak, R. Zhang, Z. Zhang, Q. Zhao, P. Zhou, Q. Zhou, J. Zhu, and R.-J. Zhu, “Rwk...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[47]

The uniqueness of llama3-70b series with per-channel quantization,

M. Qin, “The uniqueness of llama3-70b series with per-channel quantization,” 2024. [Online]. Available: https://arxiv.org/abs/2408. 15301

2024
[48]

Qwen3-4b-thinking-2507,

Qwen-Team, “Qwen3-4b-thinking-2507,” https://huggingface.co/Qwen/ Qwen3-4B-Thinking-2507, 2025, accessed: 2025-10-24

2025
[49]

Qwen3-8b,

——, “Qwen3-8b,” https://huggingface.co/Qwen/Qwen3-8B, 2025, ac- cessed: 2025-10-24

2025
[50]

Qwen2-1.5b,

——, “Qwen2-1.5b,” https://huggingface.co/Qwen/Qwen2-1.5B, 2026, accessed: 2025-02-26

2026
[51]

Microscopiq: Accelerating foundational models through outlier-aware microscaling quantization,

A. Ramachandran, S. Kundu, and T. Krishna, “Microscopiq: Accelerating foundational models through outlier-aware microscaling quantization,” inProceedings of the 52nd Annual International Symposium on Computer Architecture, ser. ISCA ’25. New York, NY , USA: Association for Computing Machinery, 2025, p. 1193–1209. [Online]. Available: https://doi.org/10.11...

work page doi:10.1145/3695053.3730989 2025
[52]

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y . Pang, J. Dirani, J. Michael, and S. R. Bowman, “Gpqa: A graduate- level google-proof q&a benchmark,” 2023. [Online]. Available: https://arxiv.org/abs/2311.12022

work page internal anchor Pith review Pith/arXiv arXiv 2023
[53]

Microscaling data formats for deep learning

B. D. Rouhani, R. Zhao, A. More, M. Hall, A. Khodamoradi, S. Deng, D. Choudhary, M. Cornea, E. Dellinger, K. Denolfet al., “Microscaling data formats for deep learning,”arXiv preprint arXiv:2310.10537, 2023

work page arXiv 2023
[54]

Magicdec: Breaking the latency-throughput tradeoff for long context generation with speculative decoding,

R. Sadhukhan, J. Chen, Z. Chen, V . Tiwari, R. Lai, J. Shi, I. E.-H. Yen, A. May, T. Chen, and B. Chen, “Magicdec: Breaking the latency-throughput tradeoff for long context generation with speculative decoding,” 2025. [Online]. Available: https://arxiv.org/abs/2408.11049

work page arXiv 2025
[55]

SCALE-Sim: Systolic CNN Accelerator Simulator

A. Samajdar, Y . Zhu, P. Whatmough, M. Mattina, and T. Kr- ishna, “Scale-sim: Systolic cnn accelerator simulator,”arXiv preprint arXiv:1811.02883, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[56]

A mathematical theory of communication,

C. E. Shannon, “A mathematical theory of communication,”The Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, 1948

1948
[57]

A Simple and Effective Pruning Approach for Large Language Models

M. Sun, Z. Liu, A. Bair, and J. Z. Kolter, “A simple and effective pruning approach for large language models,” 2024. [Online]. Available: https://arxiv.org/abs/2306.11695

work page internal anchor Pith review Pith/arXiv arXiv 2024
[58]

Quantspec: Self-speculative decoding with hierarchical quantized kv cache,

R. Tiwari, H. Xi, A. Tomar, C. Hooper, S. Kim, M. Horton, M. Najibi, M. W. Mahoney, K. Keutzer, and A. Gholami, “Quantspec: Self-speculative decoding with hierarchical quantized kv cache,” 2025. [Online]. Available: https://arxiv.org/abs/2502.10424

work page arXiv 2025
[59]

vllm-fp8-quantization,

vLLM, “vllm-fp8-quantization,” https://docs.vllm.ai/en/stable/features/ quantization/fp8/, 2026, accessed: 2026-02-25

2026
[60]

vllm-int8-quantization,

——, “vllm-int8-quantization,” https://docs.vllm.ai/en/latest/features/ quantization/int8/, 2026, accessed: 2026-02-25

2026
[61]

Adap- tive draft sequence length: Enhancing speculative decoding throughput on pim-enabled systems,

R. Wang, Q. Wang, H. Liu, L. Zheng, X. Liao, H. Jin, and J. Xue, “Adap- tive draft sequence length: Enhancing speculative decoding throughput on pim-enabled systems,” in2026 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2026, pp. 1–15

2026
[62]

Swift: On-the-fly self- speculative decoding for llm inference acceleration,

H. Xia, Y . Li, J. Zhang, C. Du, and W. Li, “Swift: On-the-fly self- speculative decoding for llm inference acceleration,” 2025. [Online]. Available: https://arxiv.org/abs/2410.06916

work page arXiv 2025
[63]

Smoothquant: Accurate and efficient post-training quantization for large language models.arXiv preprint arXiv:2211.10438, 2023

G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, and S. Han, “Smoothquant: Accurate and efficient post-training quantization for large language models,” 2024. [Online]. Available: https://arxiv.org/abs/ 2211.10438

work page arXiv 2024
[64]

Mx+: Pushing the limits of microscaling formats for efficient large language model serving,

X. Xie, L. Wang, L. Xiao, M. Han, L. Liu, X. Xu, J. Wang, Z. Song, and X. Liao, “Amove: Accelerating llms through mitigating outliers and salient points via fine-grained grouped vectorized data type,” ser. MICRO ’25. New York, NY , USA: Association for Computing Machinery, 2025, p. 854–868. [Online]. Available: https://doi.org/10.1145/3725843.3756113

work page doi:10.1145/3725843.3756113 2025
[65]

Huffman coding with gap arrays for gpu acceleration,

N. Yamamoto, K. Nakano, Y . Ito, D. Takafuji, A. Kasagi, and T. Tabaru, “Huffman coding with gap arrays for gpu acceleration,” ser. ICPP ’20. New York, NY , USA: Association for Computing Machinery, 2020. [Online]. Available: https://doi.org/10.1145/3404397.3404429

work page doi:10.1145/3404397.3404429 2020
[66]

Orca: A distributed serving system for Transformer-Based generative models,

G.-I. Yu, J. S. Jeong, G.-W. Kim, S. Kim, and B.-G. Chun, “Orca: A distributed serving system for Transformer-Based generative models,” in16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). Carlsbad, CA: USENIX Association, Jul. 2022, pp. 521–538. [Online]. Available: https://www.usenix.org/ conference/osdi22/presentation/yu

2022
[67]

Huff-llm: End- to-end lossless compression for efficient llm inference,

P. Yubeaton, T. Mahmoud, S. Naga, P. Taheri, T. Xia, A. George, Y . Khalil, S. Q. Zhang, S. Joshi, C. Hegdeet al., “Huff-llm: End- to-end lossless compression for efficient llm inference,”arXiv preprint arXiv:2502.00922, 2025

work page arXiv 2025
[68]

Duplex: A device for large language models with mixture of experts, grouped query attention, and continuous batching,

S. Yun, K. Kyung, J. Cho, J. Choi, J. Kim, B. Kim, S. Lee, K. Sohn, and J. H. Ahn, “Duplex: A device for large language models with mixture of experts, grouped query attention, and continuous batching,” in2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), 2024, pp. 1429–1443

2024
[69]

Draft&verify: Lossless large language model acceleration via self- speculative decoding,

J. Zhang, J. Wang, H. Li, L. Shou, K. Chen, G. Chen, and S. Mehrotra, “Draft&verify: Lossless large language model acceleration via self- speculative decoding,” inProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 2024, p. 11263–11282. [Online]. Availa...

work page doi:10.18653/v1/2024.acl-long.607 2024
[70]

70% size, 100% accuracy: Lossless llm compression for efficient gpu inference via dynamic-length float,

T. Zhang, M. Hariri, S. Zhong, V . Chaudhary, Y . Sui, X. Hu, and A. Shrivastava, “70% size, 100% accuracy: Lossless llm compression for efficient gpu inference via dynamic-length float,” 2025. [Online]. Available: https://arxiv.org/abs/2504.11651 15

work page arXiv 2025

[1] [1]

LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

Y . Bai, X. Lv, J. Zhang, H. Lyu, J. Tang, Z. Huang, Z. Du, X. Liu, A. Zeng, L. Hou, Y . Dong, J. Tang, and J. Li, “Longbench: A bilingual, multitask benchmark for long context understanding,” 2024. [Online]. Available: https://arxiv.org/abs/2308.14508

work page internal anchor Pith review Pith/arXiv arXiv 2024

[2] [2]

Accelerating Large Language Model Decoding with Speculative Sampling

C. Chen, S. Borgeaud, G. Irving, J.-B. Lespiau, L. Sifre, and J. Jumper, “Accelerating large language model decoding with speculative sampling,” 2023. [Online]. Available: https://arxiv.org/abs/2302.01318

work page internal anchor Pith review Pith/arXiv arXiv 2023

[3] [3]

Int v.s. fp: A comprehensive study of fine-grained low-bit quantization formats,

M. Chen, M. Wu, H. Jin, Z. Yuan, J. Liu, C. Zhang, Y . Li, J. Huang, J. Ma, Z. Xue, Z. Liu, X. Bin, and P. Luo, “Int v.s. fp: A comprehensive study of fine-grained low-bit quantization formats,” 2025. [Online]. Available: https://arxiv.org/abs/2510.25602

work page arXiv 2025

[4] [4]

Ecco: Improving memory bandwidth and capacity for llms via entropy-aware cache compression,

F. Cheng, C. Guo, C. Wei, J. Zhang, C. Zhou, E. Hanson, J. Zhang, X. Liu, H. Li, and Y . Chen, “Ecco: Improving memory bandwidth and capacity for llms via entropy-aware cache compression,” in Proceedings of the 52nd Annual International Symposium on Computer Architecture, ser. ISCA ’25. New York, NY , USA: Association for Computing Machinery, 2025, p. 793...

work page doi:10.1145/3695053.3731024 2025

[5] [5]

Training Verifiers to Solve Math Word Problems

K. Cobbe, V . Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman, “Training verifiers to solve math word problems,” 2021. [Online]. Available: https://arxiv.org/abs/2110.14168 13

work page internal anchor Pith review Pith/arXiv arXiv 2021

[6] [6]

Deepseek-r1-distillated-llama3-8b,

Deepseek, “Deepseek-r1-distillated-llama3-8b,” https://huggingface.co/ deepseek-ai/DeepSeek-R1-Distill-Llama-8B, 2025, accessed: 2025-10- 24

2025

[7] [7]

Accuracy is not all you need,

A. Dutta, S. Krishnan, N. Kwatra, and R. Ramjee, “Accuracy is not all you need,” 2024. [Online]. Available: https://arxiv.org/abs/2407.09141

work page arXiv 2024

[8] [8]

Zipserv: Fast and memory-efficient llm inference with hardware-aware lossless compression,

R. FAN, X. YU, X. Pan, Z. Li, W. Luo, Q. W ANG, W. Wang, and X. Chu, “Zipserv: Fast and memory-efficient llm inference with hardware-aware lossless compression,” inProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Pittsburgh, USA, March 2026, to appear. [Online]. Availab...

2026

[9] [9]

Gptq: Accurate post-training quantization for generative pre-trained transformers,

E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh, “Gptq: Accurate post-training quantization for generative pre-trained transformers,”

[10] [10]

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

[Online]. Available: https://arxiv.org/abs/2210.17323

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

Break the sequential dependency of LLM inference using lookahead decoding

Y . Fu, P. Bailis, I. Stoica, and H. Zhang, “Break the sequential dependency of llm inference using lookahead decoding,” 2024. [Online]. Available: https://arxiv.org/abs/2402.02057

work page arXiv 2024

[12] [12]

Deca: A near-core llm decompression accelerator grounded on a 3d roofline model,

G. Gerogiannis, S. Eyerman, E. Georganas, W. Heirman, and J. Torrellas, “Deca: A near-core llm decompression accelerator grounded on a 3d roofline model,” inProceedings of the 58th IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’25. New York, NY , USA: Association for Computing Machinery, 2025, p. 184–200. [Online]. Available: https://d...

work page doi:10.1145/3725843.3756073 2025

[13] [13]

Gemma3-270m,

Google, “Gemma3-270m,” https://huggingface.co/google/gemma-3- 270m, 2025, accessed: 2025-10-24

2025

[14] [14]

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with selective state spaces,” 2024. [Online]. Available: https://arxiv.org/abs/ 2312.00752

work page internal anchor Pith review Pith/arXiv arXiv 2024

[15] [15]

Lp-spec: Leveraging lpddr pim for efficient llm mobile speculative inference with architecture-dataflow co-optimization,

S. He, Z. Zhu, Y . He, and T. Jia, “Lp-spec: Leveraging lpddr pim for efficient llm mobile speculative inference with architecture-dataflow co-optimization,” 2025. [Online]. Available: https://arxiv.org/abs/2508. 07227

2025

[16] [16]

A method for the construction of minimum-redundancy codes,

D. A. Huffman, “A method for the construction of minimum-redundancy codes,”Proceedings of the IRE, vol. 40, no. 9, pp. 1098–1101, 2007

2007

[17] [17]

Livecodebench: Holistic and contamination free evaluation of large language models for code,

N. Jain, K. Han, A. Gu, W.-D. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica, “Livecodebench: Holistic and contamination free evaluation of large language models for code,”

[18] [18]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

[Online]. Available: https://arxiv.org/abs/2403.07974

work page internal anchor Pith review Pith/arXiv arXiv

[19] [19]

Mustafar: Promoting unstructured sparsity for kv cache pruning in llm inference,

D. Joo, H. Hosseini, R. Hadidi, and B. Asgari, “Mustafar: Promoting unstructured sparsity for kv cache pruning in llm inference,” 2025. [Online]. Available: https://arxiv.org/abs/2505.22913

work page arXiv 2025

[20] [20]

Accel-sim: An extensible simulation framework for validated gpu modeling,

M. Khairy, Z. Shen, T. M. Aamodt, and T. G. Rogers, “Accel-sim: An extensible simulation framework for validated gpu modeling,” in 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), 2020, pp. 473–486

2020

[21] [21]

Lilo: Harnessing the on-chip accelerators in intel cpus for compressed llm inference acceleration,

H. Kim, Q. Xia, J. Huang, N. Wang, J. H. Ahn, Y . Lee, W. K. Feghali, R. Wang, and N. S. Kim, “Lilo: Harnessing the on-chip accelerators in intel cpus for compressed llm inference acceleration,” inProceedings of the 32nd IEEE International Symposium on High- Performance Computer Architecture (HPCA), Sydney, Australia, January 2026, to appear

2026

[22] [22]

An investigation of fp8 across accelerators for llm inference,

J. Kim, J. Lee, G. Park, B. Kim, S. J. Kwon, D. Lee, and Y . Lee, “An investigation of fp8 across accelerators for llm inference,”arXiv e-prints, pp. arXiv–2502, 2025

2025

[23] [23]

Oaken: Fast and efficient llm serving with online-offline hybrid kv cache quantization,

M. Kim, S. Hong, R. Ko, S. Choi, H. Lee, J. Kim, J.-Y . Kim, and J. Park, “Oaken: Fast and efficient llm serving with online-offline hybrid kv cache quantization,” inProceedings of the 52nd Annual International Symposium on Computer Architecture, ser. ISCA ’25. New York, NY , USA: Association for Computing Machinery, 2025, p. 482–497. [Online]. Available:...

work page doi:10.1145/3695053.3731019 2025

[24] [24]

Squeezellm: dense-and-sparse quantization,

S. Kim, C. Hooper, A. Gholami, Z. Dong, X. Li, S. Shen, M. W. Ma- honey, and K. Keutzer, “Squeezellm: dense-and-sparse quantization,” in Proceedings of the 41st International Conference on Machine Learning, ser. ICML’24. JMLR.org, 2024

2024

[25] [25]

Exploring the trade-offs: Quantization methods, task difficulty, and model size in large language models from edge to giant,

J. Lee, S. Park, J. Kwon, J. Oh, and Y . Kwon, “Exploring the trade-offs: Quantization methods, task difficulty, and model size in large language models from edge to giant,” 2025. [Online]. Available: https://arxiv.org/abs/2409.11055

work page arXiv 2025

[26] [26]

Tender: Accelerating large language models via tensor decomposition and runtime requantization,

J. Lee, W. Lee, and J. Sim, “Tender: Accelerating large language models via tensor decomposition and runtime requantization,” inProceedings of the 51st Annual International Symposium on Computer Architecture, ser. ISCA ’24. IEEE Press, 2025, p. 1048–1062. [Online]. Available: https://doi.org/10.1109/ISCA59077.2024.00080

work page doi:10.1109/isca59077.2024.00080 2025

[27] [27]

Mx+: Pushing the limits of microscaling formats for efficient large language model serving,

J. Lee, J. Park, S. Cha, J. Cho, and J. Sim, “Mx+: Pushing the limits of microscaling formats for efficient large language model serving,” ser. MICRO ’25. New York, NY , USA: Association for Computing Machinery, 2025, p. 869–883. [Online]. Available: https://doi.org/10.1145/3725843.3756118

work page doi:10.1145/3725843.3756118 2025

[28] [28]

Fast Inference from Transformers via Speculative Decoding

Y . Leviathan, M. Kalman, and Y . Matias, “Fast inference from transformers via speculative decoding,” 2023. [Online]. Available: https://arxiv.org/abs/2211.17192

work page internal anchor Pith review Pith/arXiv arXiv 2023

[29] [29]

Specpim: Accelerating speculative inference on pim-enabled system via architecture-dataflow co-exploration,

C. Li, Z. Zhou, S. Zheng, J. Zhang, Y . Liang, and G. Sun, “Specpim: Accelerating speculative inference on pim-enabled system via architecture-dataflow co-exploration,” ser. ASPLOS ’24. New York, NY , USA: Association for Computing Machinery, 2024, p. 950–965. [Online]. Available: https://doi.org/10.1145/3620666.3651352

work page doi:10.1145/3620666.3651352 2024

[30] [30]

EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test

Y . Li, F. Wei, C. Zhang, and H. Zhang, “Eagle-3: Scaling up inference acceleration of large language models via training-time test,” 2025. [Online]. Available: https://arxiv.org/abs/2503.01840

work page internal anchor Pith review Pith/arXiv arXiv 2025

[31] [31]

Let's Verify Step by Step

H. Lightman, V . Kosaraju, Y . Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe, “Let’s verify step by step,”arXiv preprint arXiv:2305.20050, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[32] [32]

Duquant: Distributing outliers via dual transformation makes stronger quantized llms,

H. Lin, H. Xu, Y . Wu, J. Cui, Y . Zhang, L. Mou, L. Song, Z. Sun, and Y . Wei, “Duquant: Distributing outliers via dual transformation makes stronger quantized llms,” 2024. [Online]. Available: https://arxiv.org/abs/2406.01721

work page arXiv 2024

[33] [33]

Qserve: W4a8kv4 quantization and system co-design for efficient llm serving,

Y . Lin, H. Tang, S. Yang, Z. Zhang, G. Xiao, C. Gan, and S. Han, “Qserve: W4a8kv4 quantization and system co-design for efficient llm serving,” 2025. [Online]. Available: https://arxiv.org/abs/2405.04532

work page arXiv 2025

[34] [34]

Quantization hurts reasoning? an empirical study on quantized reasoning models, 2025

R. Liu, Y . Sun, M. Zhang, H. Bai, X. Yu, T. Yu, C. Yuan, and L. Hou, “Quantization hurts reasoning? an empirical study on quantized reasoning models,” 2025. [Online]. Available: https: //arxiv.org/abs/2504.04823

work page arXiv 2025

[35] [35]

Dfvg: A heterogeneous architecture for speculative decoding with draft-on-fpga and verify-on-gpu,

S. Lu, Y . Wei, J. Qian, D. Qin, S. Gao, Y . Ding, Q. Wang, C. Wu, X. Shi, and L. He, “Dfvg: A heterogeneous architecture for speculative decoding with draft-on-fpga and verify-on-gpu,” ser. ASPLOS ’26. New York, NY , USA: Association for Computing Machinery, 2026, p. 602–617. [Online]. Available: https://doi.org/10.1145/3779212.3790153

work page doi:10.1145/3779212.3790153 2026

[36] [36]

Llama3-8B,

Meta, “Llama3-8B,” https://huggingface.co/meta-llama/Meta-Llama-3- 8B, 2024, accessed: 2025-10-24

2024

[37] [37]

Mobilellm-r1-950m,

Meta, “Mobilellm-r1-950m,” https://huggingface.co/facebook/ MobileLLM-R1-950M, 2025, accessed: 2025-10-24

2025

[38] [38]

Lpu: A latency-optimized and highly scalable processor for large language model inference,

S. Moon, J.-H. Kim, J. Kim, S. Hong, J. Cha, M. Kim, S. Lim, G. Choi, D. Seo, J. Kimet al., “Lpu: A latency-optimized and highly scalable processor for large language model inference,”IEEE Micro, 2024

2024

[39] [39]

Large Language Diffusion Models

S. Nie, F. Zhu, Z. You, X. Zhang, J. Ou, J. Hu, J. Zhou, Y . Lin, J.-R. Wen, and C. Li, “Large language diffusion models,” 2025. [Online]. Available: https://arxiv.org/abs/2502.09992

work page internal anchor Pith review Pith/arXiv arXiv 2025

[40] [40]

Nvidia GeForce RTX 4090,

Nvidia, “Nvidia GeForce RTX 4090,” https://www.nvidia.com/en-us/ geforce/graphics-cards/40-series/rtx-4090/, 2023, accessed: 2025-10-24

2023

[41] [41]

Nvidia Jetson AGX Orin,

——, “Nvidia Jetson AGX Orin,” https://www.nvidia.com/en-us/ autonomous-machines/embedded-systems/jetson-orin/, 2023, accessed: 2025-10-30

2023

[42] [42]

AIME2025,

Opencompass, “AIME2025,” https://huggingface.co/datasets/ opencompass/AIME2025, 2025, accessed: 2025-10-24

2025

[43] [43]

Attacc! unleashing the power of pim for batched transformer- based generative model inference,

J. Park, J. Choi, K. Kyung, M. J. Kim, Y . Kwon, N. S. Kim, and J. H. Ahn, “Attacc! unleashing the power of pim for batched transformer- based generative model inference,” inProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ser. ASPLOS ’24. New York, NY , USA: Associati...

work page doi:10.1145/3620665.3640422 2024

[44] [44]

Any-precision llm: Low-cost deployment of multiple, different-sized llms,

Y . Park, J. Hyun, S. Cho, B. Sim, and J. W. Lee, “Any-precision llm: Low-cost deployment of multiple, different-sized llms,” 2024. [Online]. Available: https://arxiv.org/abs/2402.10517

work page arXiv 2024

[45] [45]

Splitwise: Efficient generative llm inference using phase splitting,

P. Patel, E. Choukse, C. Zhang, A. Shah, I. n. Goiri, S. Maleki, and R. Bianchini, “Splitwise: Efficient generative llm inference using phase splitting,” in2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), 2024, pp. 118–132

2024

[46] [46]

RWKV: Reinventing RNNs for the Transformer Era

B. Peng, E. Alcaide, Q. Anthony, A. Albalak, S. Arcadinho, S. Biderman, H. Cao, X. Cheng, M. Chung, M. Grella, K. K. GV , X. He, H. Hou, J. Lin, P. Kazienko, J. Kocon, J. Kong, B. Koptyra, H. Lau, K. S. I. Mantri, F. Mom, A. Saito, G. Song, X. Tang, B. Wang, J. S. Wind, S. Wozniak, R. Zhang, Z. Zhang, Q. Zhao, P. Zhou, Q. Zhou, J. Zhu, and R.-J. Zhu, “Rwk...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[47] [47]

The uniqueness of llama3-70b series with per-channel quantization,

M. Qin, “The uniqueness of llama3-70b series with per-channel quantization,” 2024. [Online]. Available: https://arxiv.org/abs/2408. 15301

2024

[48] [48]

Qwen3-4b-thinking-2507,

Qwen-Team, “Qwen3-4b-thinking-2507,” https://huggingface.co/Qwen/ Qwen3-4B-Thinking-2507, 2025, accessed: 2025-10-24

2025

[49] [49]

Qwen3-8b,

——, “Qwen3-8b,” https://huggingface.co/Qwen/Qwen3-8B, 2025, ac- cessed: 2025-10-24

2025

[50] [50]

Qwen2-1.5b,

——, “Qwen2-1.5b,” https://huggingface.co/Qwen/Qwen2-1.5B, 2026, accessed: 2025-02-26

2026

[51] [51]

Microscopiq: Accelerating foundational models through outlier-aware microscaling quantization,

A. Ramachandran, S. Kundu, and T. Krishna, “Microscopiq: Accelerating foundational models through outlier-aware microscaling quantization,” inProceedings of the 52nd Annual International Symposium on Computer Architecture, ser. ISCA ’25. New York, NY , USA: Association for Computing Machinery, 2025, p. 1193–1209. [Online]. Available: https://doi.org/10.11...

work page doi:10.1145/3695053.3730989 2025

[52] [52]

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y . Pang, J. Dirani, J. Michael, and S. R. Bowman, “Gpqa: A graduate- level google-proof q&a benchmark,” 2023. [Online]. Available: https://arxiv.org/abs/2311.12022

work page internal anchor Pith review Pith/arXiv arXiv 2023

[53] [53]

Microscaling data formats for deep learning

B. D. Rouhani, R. Zhao, A. More, M. Hall, A. Khodamoradi, S. Deng, D. Choudhary, M. Cornea, E. Dellinger, K. Denolfet al., “Microscaling data formats for deep learning,”arXiv preprint arXiv:2310.10537, 2023

work page arXiv 2023

[54] [54]

Magicdec: Breaking the latency-throughput tradeoff for long context generation with speculative decoding,

R. Sadhukhan, J. Chen, Z. Chen, V . Tiwari, R. Lai, J. Shi, I. E.-H. Yen, A. May, T. Chen, and B. Chen, “Magicdec: Breaking the latency-throughput tradeoff for long context generation with speculative decoding,” 2025. [Online]. Available: https://arxiv.org/abs/2408.11049

work page arXiv 2025

[55] [55]

SCALE-Sim: Systolic CNN Accelerator Simulator

A. Samajdar, Y . Zhu, P. Whatmough, M. Mattina, and T. Kr- ishna, “Scale-sim: Systolic cnn accelerator simulator,”arXiv preprint arXiv:1811.02883, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[56] [56]

A mathematical theory of communication,

C. E. Shannon, “A mathematical theory of communication,”The Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, 1948

1948

[57] [57]

A Simple and Effective Pruning Approach for Large Language Models

M. Sun, Z. Liu, A. Bair, and J. Z. Kolter, “A simple and effective pruning approach for large language models,” 2024. [Online]. Available: https://arxiv.org/abs/2306.11695

work page internal anchor Pith review Pith/arXiv arXiv 2024

[58] [58]

Quantspec: Self-speculative decoding with hierarchical quantized kv cache,

R. Tiwari, H. Xi, A. Tomar, C. Hooper, S. Kim, M. Horton, M. Najibi, M. W. Mahoney, K. Keutzer, and A. Gholami, “Quantspec: Self-speculative decoding with hierarchical quantized kv cache,” 2025. [Online]. Available: https://arxiv.org/abs/2502.10424

work page arXiv 2025

[59] [59]

vllm-fp8-quantization,

vLLM, “vllm-fp8-quantization,” https://docs.vllm.ai/en/stable/features/ quantization/fp8/, 2026, accessed: 2026-02-25

2026

[60] [60]

vllm-int8-quantization,

——, “vllm-int8-quantization,” https://docs.vllm.ai/en/latest/features/ quantization/int8/, 2026, accessed: 2026-02-25

2026

[61] [61]

Adap- tive draft sequence length: Enhancing speculative decoding throughput on pim-enabled systems,

R. Wang, Q. Wang, H. Liu, L. Zheng, X. Liao, H. Jin, and J. Xue, “Adap- tive draft sequence length: Enhancing speculative decoding throughput on pim-enabled systems,” in2026 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2026, pp. 1–15

2026

[62] [62]

Swift: On-the-fly self- speculative decoding for llm inference acceleration,

H. Xia, Y . Li, J. Zhang, C. Du, and W. Li, “Swift: On-the-fly self- speculative decoding for llm inference acceleration,” 2025. [Online]. Available: https://arxiv.org/abs/2410.06916

work page arXiv 2025

[63] [63]

Smoothquant: Accurate and efficient post-training quantization for large language models.arXiv preprint arXiv:2211.10438, 2023

G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, and S. Han, “Smoothquant: Accurate and efficient post-training quantization for large language models,” 2024. [Online]. Available: https://arxiv.org/abs/ 2211.10438

work page arXiv 2024

[64] [64]

Mx+: Pushing the limits of microscaling formats for efficient large language model serving,

X. Xie, L. Wang, L. Xiao, M. Han, L. Liu, X. Xu, J. Wang, Z. Song, and X. Liao, “Amove: Accelerating llms through mitigating outliers and salient points via fine-grained grouped vectorized data type,” ser. MICRO ’25. New York, NY , USA: Association for Computing Machinery, 2025, p. 854–868. [Online]. Available: https://doi.org/10.1145/3725843.3756113

work page doi:10.1145/3725843.3756113 2025

[65] [65]

Huffman coding with gap arrays for gpu acceleration,

N. Yamamoto, K. Nakano, Y . Ito, D. Takafuji, A. Kasagi, and T. Tabaru, “Huffman coding with gap arrays for gpu acceleration,” ser. ICPP ’20. New York, NY , USA: Association for Computing Machinery, 2020. [Online]. Available: https://doi.org/10.1145/3404397.3404429

work page doi:10.1145/3404397.3404429 2020

[66] [66]

Orca: A distributed serving system for Transformer-Based generative models,

G.-I. Yu, J. S. Jeong, G.-W. Kim, S. Kim, and B.-G. Chun, “Orca: A distributed serving system for Transformer-Based generative models,” in16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). Carlsbad, CA: USENIX Association, Jul. 2022, pp. 521–538. [Online]. Available: https://www.usenix.org/ conference/osdi22/presentation/yu

2022

[67] [67]

Huff-llm: End- to-end lossless compression for efficient llm inference,

P. Yubeaton, T. Mahmoud, S. Naga, P. Taheri, T. Xia, A. George, Y . Khalil, S. Q. Zhang, S. Joshi, C. Hegdeet al., “Huff-llm: End- to-end lossless compression for efficient llm inference,”arXiv preprint arXiv:2502.00922, 2025

work page arXiv 2025

[68] [68]

Duplex: A device for large language models with mixture of experts, grouped query attention, and continuous batching,

S. Yun, K. Kyung, J. Cho, J. Choi, J. Kim, B. Kim, S. Lee, K. Sohn, and J. H. Ahn, “Duplex: A device for large language models with mixture of experts, grouped query attention, and continuous batching,” in2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), 2024, pp. 1429–1443

2024

[69] [69]

Draft&verify: Lossless large language model acceleration via self- speculative decoding,

J. Zhang, J. Wang, H. Li, L. Shou, K. Chen, G. Chen, and S. Mehrotra, “Draft&verify: Lossless large language model acceleration via self- speculative decoding,” inProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 2024, p. 11263–11282. [Online]. Availa...

work page doi:10.18653/v1/2024.acl-long.607 2024

[70] [70]

70% size, 100% accuracy: Lossless llm compression for efficient gpu inference via dynamic-length float,

T. Zhang, M. Hariri, S. Zhong, V . Chaudhary, Y . Sui, X. Hu, and A. Shrivastava, “70% size, 100% accuracy: Lossless llm compression for efficient gpu inference via dynamic-length float,” 2025. [Online]. Available: https://arxiv.org/abs/2504.11651 15

work page arXiv 2025